What’s new in 2.0.0 (April 3, 2023)#

These are the changes in pandas 2.0.0. See Release notes for a full changelog including other versions of pandas.

Enhancements#

Installing optional dependencies with pip extras#

When installing pandas using pip, sets of optional dependencies can also be installed by specifying extras.

pip install "pandas[performance, aws]>=2.0.0"

The available extras, found in the installation guide, are [all, performance, computation, fss, aws, gcp, excel, parquet, feather, hdf5, spss, postgresql, mysql, sql-other, html, xml, plot, output_formatting, clipboard, compression, test] (GH 39164).

Index can now hold numpy numeric dtypes#

It is now possible to use any numpy numeric dtype in an Index (GH 42717).

Previously it was only possible to use int64, uint64 & float64 dtypes:

In [1]: pd.Index([1, 2, 3], dtype=np.int8)
Out[1]: Int64Index([1, 2, 3], dtype="int64")
In [2]: pd.Index([1, 2, 3], dtype=np.uint16)
Out[2]: UInt64Index([1, 2, 3], dtype="uint64")
In [3]: pd.Index([1, 2, 3], dtype=np.float32)
Out[3]: Float64Index([1.0, 2.0, 3.0], dtype="float64")

Int64Index, UInt64Index & Float64Index were deprecated in pandas version 1.4 and have now been removed. Instead, Index should be used directly, and it can now take all numpy numeric dtypes, i.e. int8/int16/int32/int64/uint8/uint16/uint32/uint64/float32/float64 dtypes:

In [1]: pd.Index([1, 2, 3], dtype=np.int8)
Out[1]: Index([1, 2, 3], dtype='int8')

In [2]: pd.Index([1, 2, 3], dtype=np.uint16)
Out[2]: Index([1, 2, 3], dtype='uint16')

In [3]: pd.Index([1, 2, 3], dtype=np.float32)
Out[3]: Index([1.0, 2.0, 3.0], dtype='float32')
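
As a minimal sketch of the behavior described above (the variable names are illustrative, not part of pandas), the following checks that a requested or inherited numpy dtype is kept, while a plain list of integers still defaults to int64:

```python
import numpy as np
import pandas as pd

# The requested numpy dtype is kept instead of being widened to 64 bits.
small = pd.Index([1, 2, 3], dtype=np.int8)

# An Index built from a typed numpy array follows the array's dtype.
from_array = pd.Index(np.array([1, 2, 3], dtype=np.uint16))

# A plain list of Python ints still defaults to int64.
from_list = pd.Index([1, 2, 3])

print(small.dtype, from_array.dtype, from_list.dtype)
```

All three objects are plain Index instances; only the dtype differs.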

The ability for Index to hold the numpy numeric dtypes has meant some changes in pandas functionality. In particular, operations that were previously forced to create 64-bit indexes can now create indexes with lower bit sizes, e.g. 32-bit indexes.

Below is a possibly non-exhaustive list of changes:

  1. Instantiating using a numpy numeric array now follows the dtype of the numpy array. Previously, all indexes created from numpy numeric arrays were forced to 64-bit. Now, for example, Index(np.array([1, 2, 3])) will be int32 on 32-bit systems, where it previously would have been int64 even on 32-bit systems. Instantiating Index using a list of numbers will still return 64-bit dtypes, e.g. Index([1, 2, 3]) will have an int64 dtype, which is the same as previously.

  2. The various numeric datetime attributes of DatetimeIndex (day, month, year etc.) were previously of dtype int64, while they were int32 for arrays.DatetimeArray. They are now int32 on DatetimeIndex also:

    In [4]: idx = pd.date_range(start='1/1/2018', periods=3, freq='ME')
    
    In [5]: idx.array.year
    Out[5]: array([2018, 2018, 2018], dtype=int32)
    
    In [6]: idx.year
    Out[6]: Index([2018, 2018, 2018], dtype='int32')
    
  3. Level dtypes on Indexes from Series.sparse.from_coo() are now of dtype int32, the same as they are on the rows/cols on a scipy sparse matrix. Previously they were of dtype int64.

    In [7]: from scipy import sparse
    
    In [8]: A = sparse.coo_matrix(
       ...:     ([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4)
       ...: )
       ...: 
    
    In [9]: ser = pd.Series.sparse.from_coo(A)
    
    In [10]: ser.index.dtypes
    Out[10]: 
    level_0    int32
    level_1    int32
    dtype: object
    
  4. Index cannot be instantiated using a float16 dtype. Previously instantiating an Index using dtype float16 resulted in a Float64Index with a float64 dtype. It now raises a NotImplementedError:

    In [11]: pd.Index([1, 2, 3], dtype=np.float16)
    ---------------------------------------------------------------------------
    NotImplementedError                       Traceback (most recent call last)
    Cell In[11], line 1
    ----> 1 pd.Index([1, 2, 3], dtype=np.float16)
    
    File ~/work/pandas/pandas/pandas/core/indexes/base.py:579, in Index.__new__(cls, data, dtype, copy, name, tupleize_cols)
        575 arr = ensure_wrapped_if_datetimelike(arr)
        577 klass = cls._dtype_to_subclass(arr.dtype)
    --> 579 arr = klass._ensure_array(arr, arr.dtype, copy=False)
        580 return klass._simple_new(arr, name, refs=refs)
    
    File ~/work/pandas/pandas/pandas/core/indexes/base.py:592, in Index._ensure_array(cls, data, dtype, copy)
        589     raise ValueError("Index data must be 1-dimensional")
        590 elif dtype == np.float16:
        591     # float16 not supported (no indexing engine)
    --> 592     raise NotImplementedError("float16 indexes are not supported")
        594 if copy:
        595     # asarray_tuplesafe does not always copy underlying data,
        596     #  so need to make sure that this happens
        597     data = data.copy()
    
    NotImplementedError: float16 indexes are not supported
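
Code that may receive float16 data can guard against the error in item 4. The sketch below is only one possible approach: make_float_index is a hypothetical helper, not a pandas function, and widening to float32 is an assumption about what the caller wants.

```python
import numpy as np
import pandas as pd


def make_float_index(values, dtype):
    """Build an Index, widening to float32 if the dtype is rejected (hypothetical helper)."""
    try:
        return pd.Index(values, dtype=dtype)
    except NotImplementedError:
        # float16 has no indexing engine, so fall back to the next wider float.
        return pd.Index(values, dtype=np.float32)


idx = make_float_index([1, 2, 3], np.float16)
print(idx.dtype)
```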
    

Argument dtype_backend, to return pyarrow-backed or numpy-backed nullable dtypes#

The following functions gained a new keyword dtype_backend (GH 36712)

When this option is set to "numpy_nullable" it will return a DataFrame that is backed by nullable dtypes.

When this keyword is set to "pyarrow", then these functions will return pyarrow-backed nullable ArrowDtype DataFrames (GH 48957, GH 49997):

In [12]: import io

In [13]: data = io.StringIO("""a,b,c,d,e,f,g,h,i
   ....:     1,2.5,True,a,,,,,
   ....:     3,4.5,False,b,6,7.5,True,a,
   ....: """)
   ....: 

In [14]: df = pd.read_csv(data, dtype_backend="pyarrow")

In [15]: df.dtypes
Out[15]: 
a     int64[pyarrow]
b    double[pyarrow]
c      bool[pyarrow]
d    string[pyarrow]
e     int64[pyarrow]
f    double[pyarrow]
g      bool[pyarrow]
h    string[pyarrow]
i      null[pyarrow]
dtype: object

In [16]: data.seek(0)
Out[16]: 0

In [17]: df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow", engine="pyarrow")

In [18]: df_pyarrow.dtypes
Out[18]: 
a     int64[pyarrow]
b    double[pyarrow]
c      bool[pyarrow]
d    string[pyarrow]
e     int64[pyarrow]
f    double[pyarrow]
g      bool[pyarrow]
h    string[pyarrow]
i      null[pyarrow]
dtype: object
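
For comparison, a short sketch of the "numpy_nullable" backend using read_csv(), which needs no pyarrow installation. The CSV content is made up for illustration; missing cells become pd.NA and the columns keep integer and boolean dtypes instead of falling back to float or object:

```python
import io

import pandas as pd

# Second row has missing values in columns "a" and "c".
csv_text = "a,b,c\n1,2.5,True\n,4.5,\n"

df = pd.read_csv(io.StringIO(csv_text), dtype_backend="numpy_nullable")

print(df.dtypes)
print(df["a"].isna().tolist())
```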

Copy-on-Write improvements#

  • A new lazy copy mechanism that defers the copy until the object in question is modified was added to the methods listed in Copy-on-Write optimizations. These methods return views when Copy-on-Write is enabled, which provides a significant performance improvement compared to the regular execution (GH 49473).

  • Accessing a single column of a DataFrame as a Series (e.g. df["col"]) now returns a new object on every access when Copy-on-Write is enabled, rather than returning the same cached Series object repeatedly. This ensures that those Series objects correctly follow the Copy-on-Write rules (GH 49450)

  • The Series constructor will now create a lazy copy (deferring the copy until a modification to the data happens) when constructing a Series from an existing Series with the default of copy=False (GH 50471)

  • The DataFrame constructor will now create a lazy copy (deferring the copy until a modification to the data happens) when constructing from an existing DataFrame with the default of copy=False (GH 51239)

  • The DataFrame constructor, when constructing a DataFrame from a dictionary of Series objects and specifying copy=False, will now use a lazy copy of those Series objects for the columns of the DataFrame (GH 50777)

  • The DataFrame constructor, when constructing a DataFrame from a Series or Index and specifying copy=False, will now respect Copy-on-Write.

  • The DataFrame and Series constructors, when constructing from a NumPy array, will now copy the array by default to avoid mutating the DataFrame / Series when mutating the array. Specify copy=False to get the old behavior. When setting copy=False pandas does not guarantee correct Copy-on-Write behavior when the NumPy array is modified after creation of the DataFrame / Series.

  • The DataFrame.from_records() will now respect Copy-on-Write when called with a DataFrame.

  • Trying to set values using chained assignment (for example, df["a"][1:3] = 0) will now always raise a warning when Copy-on-Write is enabled. In this mode, chained assignment can never work because we are always setting into a temporary object that is the result of an indexing operation (getitem), which under Copy-on-Write always behaves as a copy. Thus, assigning through a chain can never update the original Series or DataFrame. Therefore, an informative warning is raised to the user to avoid silently doing nothing (GH 49467)

  • DataFrame.replace() will now respect the Copy-on-Write mechanism when inplace=True.

  • DataFrame.transpose() will now respect the Copy-on-Write mechanism.

  • Arithmetic operations that can be inplace, e.g. ser *= 2 will now respect the Copy-on-Write mechanism.

  • DataFrame.__getitem__() will now respect the Copy-on-Write mechanism when the DataFrame has MultiIndex columns.

  • Series.__getitem__() will now respect the Copy-on-Write mechanism when the Series has a MultiIndex.

  • Series.view() will now respect the Copy-on-Write mechanism.

Copy-on-Write can be enabled through one of

pd.set_option("mode.copy_on_write", True)
pd.options.mode.copy_on_write = True

Alternatively, Copy-on-Write can be enabled locally through:

with pd.option_context("mode.copy_on_write", True):
    ...
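
A minimal sketch of what these rules mean in practice, with Copy-on-Write enabled locally (the data and variable names are illustrative). Modifying a Series obtained from a DataFrame does not change the DataFrame, and a Series built from a NumPy array is not affected by later changes to that array:

```python
import numpy as np
import pandas as pd

with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

    # Selecting a column returns a new Series that shares data lazily.
    col = df["a"]
    # Writing to it triggers the deferred copy; df is left untouched.
    col.iloc[0] = 100

    # A Series built from a NumPy array copies the array by default.
    arr = np.array([1, 2, 3])
    ser = pd.Series(arr)
    arr[0] = 99

    print(df["a"].tolist(), col.tolist(), ser.tolist())
```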

Other enhancements#

Notable bug fixes#

These are bug fixes that might have notable behavior changes.

DataFrameGroupBy.cumsum() and DataFrameGroupBy.cumprod() overflow instead of lossy casting to float#

In previous versions we cast to float when applying cumsum and cumprod, which led to incorrect results even if the result could be held by int64 dtype. Additionally, the aggregation now overflows, consistent with numpy and the regular DataFrame.cumprod() and DataFrame.cumsum() methods, when the limit of int64 is reached (GH 37493).

Old Behavior

In [1]: df = pd.DataFrame({"key": ["b"] * 7, "value": 625})
In [2]: df.groupby("key")["value"].cumprod()[5]
Out[2]: 5.960464477539062e+16

We return incorrect results with the 6th value.

New Behavior

In [19]: df = pd.DataFrame({"key": ["b"] * 7, "value": 625})

In [20]: df.groupby("key")["value"].cumprod()
Out[20]: 
0                   625
1                390625
2             244140625
3          152587890625
4        95367431640625
5     59604644775390625
6    359414837200037393
Name: value, dtype: int64

We overflow with the 7th value, but the 6th value is still correct.
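
The reason the old float result was wrong can be checked with exact Python integers. This sketch compares the 6th cumulative product against 625 ** 6 and shows that the same number does not survive a round trip through float64:

```python
import pandas as pd

df = pd.DataFrame({"key": ["b"] * 7, "value": 625})
result = df.groupby("key")["value"].cumprod()

# Python integers are arbitrary precision, so this is the exact product.
exact_sixth = 625 ** 6

# float64 has a 53-bit significand and cannot hold this odd 56-bit integer.
survives_float = int(float(exact_sixth)) == exact_sixth

print(int(result[5]) == exact_sixth, survives_float)
```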

DataFrameGroupBy.nth() and SeriesGroupBy.nth() now behave as filtrations#

In previous versions of pandas, DataFrameGroupBy.nth() and SeriesGroupBy.nth() acted as if they were aggregations. However, for most inputs n, they may return either zero or multiple rows per group. This means that they are filtrations, similar to e.g. DataFrameGroupBy.head(). pandas now treats them as filtrations (GH 13666).

In [21]: df = pd.DataFrame({"a": [1, 1, 2, 1, 2], "b": [np.nan, 2.0, 3.0, 4.0, 5.0]})

In [22]: gb = df.groupby("a")

Old Behavior

In [5]: gb.nth(n=1)
Out[5]:
   A    B
1  1  2.0
4  2  5.0

New Behavior

In [23]: gb.nth(n=1)
Out[23]: 
   a    b
1  1  2.0
4  2  5.0

In particular, the index of the result is derived from the input by selecting the appropriate rows. Also, when n is larger than the group, no rows are returned instead of NaN.

Old Behavior

In [5]: gb.nth(n=3, dropna="any")
Out[5]:
    B
A
1 NaN
2 NaN

New Behavior

In [24]: gb.nth(n=3, dropna="any")
Out[24]: 
Empty DataFrame
Columns: [a, b]
Index: []
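
The filtration behavior can be verified programmatically. In this sketch, which reuses the frame from the example above, the result keeps the original row labels and all columns, and asking for a position beyond every group yields an empty frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 1, 2], "b": [np.nan, 2.0, 3.0, 4.0, 5.0]})
gb = df.groupby("a")

# Second row of each group, selected from the original frame.
second = gb.nth(1)

# No group has a fourth row, so nothing is selected.
missing = gb.nth(3)

print(second.index.tolist(), second.columns.tolist(), len(missing))
```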

Backwards incompatible API changes#

Construction with datetime64 or timedelta64 dtype with unsupported resolution#

In past versions, when constructing a Series or DataFrame and passing a “datetime64” or “timedelta64” dtype with unsupported resolution (i.e. anything other than “ns”), pandas would silently replace the given dtype with its nanosecond analogue:

Previous behavior:

In [5]: pd.Series(["2016-01-01"], dtype="datetime64[s]")
Out[5]:
0   2016-01-01
dtype: datetime64[ns]

In [6]: pd.Series(["2016-01-01"], dtype="datetime64[D]")
Out[6]:
0   2016-01-01
dtype: datetime64[ns]

In pandas 2.0 we support resolutions “s”, “ms”, “us”, and “ns”. When passing a supported dtype (e.g. “datetime64[s]”), the result now has exactly the requested dtype:

New behavior:

In [25]: pd.Series(["2016-01-01"], dtype="datetime64[s]")
Out[25]: 
0   2016-01-01
dtype: datetime64[s]

With an un-supported dtype, pandas now raises instead of silently swapping in a supported dtype:

New behavior:

In [26]: pd.Series(["2016-01-01"], dtype="datetime64[D]")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[26], line 1
----> 1 pd.Series(["2016-01-01"], dtype="datetime64[D]")

File ~/work/pandas/pandas/pandas/core/series.py:507, in Series.__init__(self, data, index, dtype, name, copy)
    505         data = data.copy()
    506 else:
--> 507     data = sanitize_array(data, index, dtype, copy)
    508     data = SingleBlockManager.from_array(data, index, refs=refs)
    510 NDFrame.__init__(self, data)

File ~/work/pandas/pandas/pandas/core/construction.py:650, in sanitize_array(data, index, dtype, copy, allow_2d)
    647     subarr = np.array([], dtype=np.float64)
    649 elif dtype is not None:
--> 650     subarr = _try_cast(data, dtype, copy)
    652 else:
    653     subarr = maybe_convert_platform(data)

File ~/work/pandas/pandas/pandas/core/construction.py:816, in _try_cast(arr, dtype, copy)
    812         if arr.ndim == 2 and arr.shape[1] == 1:
    813             # GH#60081: DataFrame Constructor converts 1D data to array of
    814             # shape (N, 1), but maybe_cast_to_datetime assumes 1D input
    815             return maybe_cast_to_datetime(arr[:, 0], dtype).reshape(arr.shape)
--> 816     return maybe_cast_to_datetime(arr, dtype)
    818 # GH#15832: Check if we are requesting a numeric dtype and
    819 # that we can convert the data to the requested dtype.
    820 elif dtype.kind in "iu":
    821     # this will raise if we have e.g. floats

File ~/work/pandas/pandas/pandas/core/dtypes/cast.py:1227, in maybe_cast_to_datetime(value, dtype)
   1223     raise TypeError("value must be listlike")
   1225 # TODO: _from_sequence would raise ValueError in cases where
   1226 #  _ensure_nanosecond_dtype raises TypeError
-> 1227 _ensure_nanosecond_dtype(dtype)
   1229 if lib.is_np_dtype(dtype, "m"):
   1230     res = TimedeltaArray._from_sequence(value, dtype=dtype)

File ~/work/pandas/pandas/pandas/core/dtypes/cast.py:1284, in _ensure_nanosecond_dtype(dtype)
   1281     raise ValueError(msg)
   1282 # TODO: ValueError or TypeError? existing test
   1283 #  test_constructor_generic_timestamp_bad_frequency expects TypeError
-> 1284 raise TypeError(
   1285     f"dtype={dtype} is not supported. Supported resolutions are 's', "
   1286     "'ms', 'us', and 'ns'"
   1287 )

TypeError: dtype=datetime64[D] is not supported. Supported resolutions are 's', 'ms', 'us', and 'ns'
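
Code that accepts a user-supplied unit may want to handle the new error. The following is a sketch only: to_datetime_series is a hypothetical helper, and falling back to second resolution is an assumption about the desired behavior, not something pandas does on its own.

```python
import pandas as pd


def to_datetime_series(values, unit):
    """Build a datetime64 Series, falling back to seconds (hypothetical helper)."""
    try:
        return pd.Series(values, dtype=f"datetime64[{unit}]")
    except TypeError:
        # Units such as "D" are rejected; "s" is the coarsest supported one.
        return pd.Series(values, dtype="datetime64[s]")


supported = to_datetime_series(["2016-01-01"], "ms")
fallback = to_datetime_series(["2016-01-01"], "D")
print(supported.dtype, fallback.dtype)
```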

Value counts sets the resulting name to count#

In past versions, when running Series.value_counts(), the result would inherit the original object’s name, and the result index would be nameless. This would cause confusion when resetting the index, and the column names would not correspond with the column values. Now, the result name will be 'count' (or 'proportion' if normalize=True was passed), and the index will be named after the original object (GH 49497).

Previous behavior:

In [8]: pd.Series(['quetzal', 'quetzal', 'elk'], name='animal').value_counts()
Out[8]:
quetzal    2
elk        1
Name: animal, dtype: int64

New behavior:

In [27]: pd.Series(['quetzal', 'quetzal', 'elk'], name='animal').value_counts()
Out[27]: 
animal
quetzal    2
elk        1
Name: count, dtype: int64

Likewise for other value_counts methods (for example, DataFrame.value_counts()).
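
The practical effect shows up when resetting the index. A brief sketch using the same data as above:

```python
import pandas as pd

animals = pd.Series(["quetzal", "quetzal", "elk"], name="animal")

counts = animals.value_counts()
shares = animals.value_counts(normalize=True)

# The index name and the result name become the two column labels.
table = counts.reset_index()
print(table.columns.tolist(), shares.name)
```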

Disallow astype conversion to non-supported datetime64/timedelta64 dtypes#

In previous versions, converting a Series or DataFrame from datetime64[ns] to a different datetime64[X] dtype would return with datetime64[ns] dtype instead of the requested dtype. In pandas 2.0, support is added for “datetime64[s]”, “datetime64[ms]”, and “datetime64[us]” dtypes, so converting to those dtypes gives exactly the requested dtype:


In [28]: idx = pd.date_range("2016-01-01", periods=3)

In [29]: ser = pd.Series(idx)

Previous behavior:

In [4]: ser.astype("datetime64[s]")
Out[4]:
0   2016-01-01
1   2016-01-02
2   2016-01-03
dtype: datetime64[ns]

With the new behavior, we get exactly the requested dtype:

New behavior:

In [30]: ser.astype("datetime64[s]")
Out[30]: 
0   2016-01-01
1   2016-01-02
2   2016-01-03
dtype: datetime64[s]

For non-supported resolutions e.g. “datetime64[D]”, we raise instead of silently ignoring the requested dtype:

New behavior:

In [31]: ser.astype("datetime64[D]")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[31], line 1
----> 1 ser.astype("datetime64[D]")

File ~/work/pandas/pandas/pandas/core/generic.py:6429, in NDFrame.astype(self, dtype, copy, errors)
   6425     results = [ser.astype(dtype, errors=errors) for _, ser in self.items()]
   6427 else:
   6428     # else, only a single dtype is given
-> 6429     new_data = self._mgr.astype(dtype=dtype, errors=errors)
   6430     res = self._constructor_from_mgr(new_data, axes=new_data.axes)
   6431     return res.__finalize__(self, method="astype")

File ~/work/pandas/pandas/pandas/core/internals/managers.py:588, in BaseBlockManager.astype(self, dtype, errors)
    587 def astype(self, dtype, errors: str = "raise") -> Self:
--> 588     return self.apply("astype", dtype=dtype, errors=errors)

File ~/work/pandas/pandas/pandas/core/internals/managers.py:438, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
    436         applied = b.apply(f, **kwargs)
    437     else:
--> 438         applied = getattr(b, f)(**kwargs)
    439     result_blocks = extend_blocks(applied, result_blocks)
    441 out = type(self).from_blocks(result_blocks, self.axes)

File ~/work/pandas/pandas/pandas/core/internals/blocks.py:610, in Block.astype(self, dtype, errors, squeeze)
    607         raise ValueError("Can not squeeze with more than one column.")
    608     values = values[0, :]  # type: ignore[call-overload]
--> 610 new_values = astype_array_safe(values, dtype, errors=errors)
    612 new_values = maybe_coerce_values(new_values)
    614 refs = None

File ~/work/pandas/pandas/pandas/core/dtypes/astype.py:234, in astype_array_safe(values, dtype, copy, errors)
    231     dtype = dtype.numpy_dtype
    233 try:
--> 234     new_values = astype_array(values, dtype, copy=copy)
    235 except (ValueError, TypeError):
    236     # e.g. _astype_nansafe can fail on object-dtype of strings
    237     #  trying to convert to float
    238     if errors == "ignore":

File ~/work/pandas/pandas/pandas/core/dtypes/astype.py:176, in astype_array(values, dtype, copy)
    172     return values
    174 if not isinstance(values, np.ndarray):
    175     # i.e. ExtensionArray
--> 176     values = values.astype(dtype, copy=copy)
    178 else:
    179     values = _astype_nansafe(values, dtype, copy=copy)

File ~/work/pandas/pandas/pandas/core/arrays/datetimes.py:762, in DatetimeArray.astype(self, dtype, copy)
    760 elif isinstance(dtype, PeriodDtype):
    761     return self.to_period(freq=dtype.freq)
--> 762 return dtl.DatetimeLikeArrayMixin.astype(self, dtype, copy)

File ~/work/pandas/pandas/pandas/core/arrays/datetimelike.py:508, in DatetimeLikeArrayMixin.astype(self, dtype, copy)
    504 elif (dtype.kind in "mM" and self.dtype != dtype) or dtype.kind == "f":
    505     # disallow conversion between datetime/timedelta,
    506     # and conversions for any datetimelike to float
    507     msg = f"Cannot cast {type(self).__name__} to dtype {dtype}"
--> 508     raise TypeError(msg)
    509 else:
    510     return np.asarray(self, dtype=dtype)

TypeError: Cannot cast DatetimeArray to dtype datetime64[D]

For conversion from timedelta64[ns] dtypes, the old behavior converted to a floating point format.


In [32]: idx = pd.timedelta_range("1 Day", periods=3)

In [33]: ser = pd.Series(idx)

Previous behavior:

In [7]: ser.astype("timedelta64[s]")
Out[7]:
0     86400.0
1    172800.0
2    259200.0
dtype: float64

In [8]: ser.astype("timedelta64[D]")
Out[8]:
0    1.0
1    2.0
2    3.0
dtype: float64

The new behavior, as for datetime64, either gives exactly the requested dtype or raises:

New behavior:

In [34]: ser.astype("timedelta64[s]")
Out[34]: 
0   1 days
1   2 days
2   3 days
dtype: timedelta64[s]

In [35]: ser.astype("timedelta64[D]")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[35], line 1
----> 1 ser.astype("timedelta64[D]")

File ~/work/pandas/pandas/pandas/core/generic.py:6429, in NDFrame.astype(self, dtype, copy, errors)
   6425     results = [ser.astype(dtype, errors=errors) for _, ser in self.items()]
   6427 else:
   6428     # else, only a single dtype is given
-> 6429     new_data = self._mgr.astype(dtype=dtype, errors=errors)
   6430     res = self._constructor_from_mgr(new_data, axes=new_data.axes)
   6431     return res.__finalize__(self, method="astype")

File ~/work/pandas/pandas/pandas/core/internals/managers.py:588, in BaseBlockManager.astype(self, dtype, errors)
    587 def astype(self, dtype, errors: str = "raise") -> Self:
--> 588     return self.apply("astype", dtype=dtype, errors=errors)

File ~/work/pandas/pandas/pandas/core/internals/managers.py:438, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
    436         applied = b.apply(f, **kwargs)
    437     else:
--> 438         applied = getattr(b, f)(**kwargs)
    439     result_blocks = extend_blocks(applied, result_blocks)
    441 out = type(self).from_blocks(result_blocks, self.axes)

File ~/work/pandas/pandas/pandas/core/internals/blocks.py:610, in Block.astype(self, dtype, errors, squeeze)
    607         raise ValueError("Can not squeeze with more than one column.")
    608     values = values[0, :]  # type: ignore[call-overload]
--> 610 new_values = astype_array_safe(values, dtype, errors=errors)
    612 new_values = maybe_coerce_values(new_values)
    614 refs = None

File ~/work/pandas/pandas/pandas/core/dtypes/astype.py:234, in astype_array_safe(values, dtype, copy, errors)
    231     dtype = dtype.numpy_dtype
    233 try:
--> 234     new_values = astype_array(values, dtype, copy=copy)
    235 except (ValueError, TypeError):
    236     # e.g. _astype_nansafe can fail on object-dtype of strings
    237     #  trying to convert to float
    238     if errors == "ignore":

File ~/work/pandas/pandas/pandas/core/dtypes/astype.py:176, in astype_array(values, dtype, copy)
    172     return values
    174 if not isinstance(values, np.ndarray):
    175     # i.e. ExtensionArray
--> 176     values = values.astype(dtype, copy=copy)
    178 else:
    179     values = _astype_nansafe(values, dtype, copy=copy)

File ~/work/pandas/pandas/pandas/core/arrays/timedeltas.py:356, in TimedeltaArray.astype(self, dtype, copy)
    352         return type(self)._simple_new(
    353             res_values, dtype=res_values.dtype, freq=self.freq
    354         )
    355     else:
--> 356         raise ValueError(
    357             f"Cannot convert from {self.dtype} to {dtype}. "
    358             "Supported resolutions are 's', 'ms', 'us', 'ns'"
    359         )
    361 return dtl.DatetimeLikeArrayMixin.astype(self, dtype, copy=copy)

ValueError: Cannot convert from timedelta64[ns] to timedelta64[D]. Supported resolutions are 's', 'ms', 'us', 'ns'
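
Code that relied on the old float results can compute them explicitly. This is one possible migration sketch, not a recommendation from the release notes: Series.dt.total_seconds() gives seconds as floats, and dividing by a Timedelta gives a float count of that unit.

```python
import pandas as pd

ser = pd.Series(pd.timedelta_range("1 Day", periods=3))

# Float seconds, matching the old astype("timedelta64[s]") output.
seconds = ser.dt.total_seconds()

# Float days, matching the old astype("timedelta64[D]") output.
days = ser / pd.Timedelta(days=1)

# The cast itself now keeps a timedelta dtype at second resolution.
as_seconds = ser.astype("timedelta64[s]")

print(seconds.tolist(), days.tolist(), as_seconds.dtype)
```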

UTC and fixed-offset timezones default to standard-library tzinfo objects#

In previous versions, the default tzinfo object used to represent UTC was pytz.UTC. In pandas 2.0, we default to datetime.timezone.utc instead. Similarly, for timezones that represent fixed UTC offsets, we use datetime.timezone objects instead of pytz.FixedOffset objects (GH 34916).

Previous behavior:

In [2]: ts = pd.Timestamp("2016-01-01", tz="UTC")
In [3]: type(ts.tzinfo)
Out[3]: pytz.UTC

In [4]: ts2 = pd.Timestamp("2016-01-01 04:05:06-07:00")
In [5]: type(ts2.tzinfo)
Out[5]: pytz._FixedOffset

New behavior:

In [36]: ts = pd.Timestamp("2016-01-01", tz="UTC")

In [37]: type(ts.tzinfo)
Out[37]: datetime.timezone

In [38]: ts2 = pd.Timestamp("2016-01-01 04:05:06-07:00")

In [39]: type(ts2.tzinfo)
Out[39]: datetime.timezone

For timezones that are neither UTC nor fixed offsets, e.g. “US/Pacific”, we continue to default to pytz objects.
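
Code that previously compared against pytz objects can use the standard library instead. A small sketch of such checks:

```python
import datetime

import pandas as pd

ts = pd.Timestamp("2016-01-01", tz="UTC")
ts2 = pd.Timestamp("2016-01-01 04:05:06-07:00")

# UTC is represented by the standard-library singleton.
is_stdlib_utc = ts.tzinfo == datetime.timezone.utc

# Fixed offsets are datetime.timezone objects carrying the parsed offset.
offset = ts2.utcoffset()

print(is_stdlib_utc, offset)
```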

Empty DataFrames/Series will now default to have a RangeIndex#

Before, constructing an empty (where data is None or an empty list-like argument) Series or DataFrame without specifying the axes (index=None, columns=None) would return the axes as empty Index with object dtype.

Now, the axes return an empty RangeIndex (GH 49572).

Previous behavior:

In [8]: pd.Series().index
Out[8]:
Index([], dtype='object')

In [9]: pd.DataFrame().axes
Out[9]:
[Index([], dtype='object'), Index([], dtype='object')]

New behavior:

In [40]: pd.Series().index
Out[40]: RangeIndex(start=0, stop=0, step=1)

In [41]: pd.DataFrame().axes
Out[41]: [RangeIndex(start=0, stop=0, step=1), RangeIndex(start=0, stop=0, step=1)]
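
Tests that asserted on the old object-dtype axes can check for RangeIndex instead. A minimal sketch (the explicit dtype on the empty Series is only there to make the example's intent unambiguous):

```python
import pandas as pd

empty_ser = pd.Series(dtype="float64")
empty_df = pd.DataFrame()

# Both axes of an empty DataFrame, and the index of an empty Series,
# are zero-length RangeIndex objects.
axes = [empty_ser.index, empty_df.index, empty_df.columns]
print([type(ax).__name__ for ax in axes])
```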

DataFrame to LaTeX has a new render engine#

The existing DataFrame.to_latex() has been restructured to utilise the extended implementation previously available under Styler.to_latex(). The arguments signature is similar, albeit col_space has been removed since it is ignored by LaTeX engines. This render engine also requires jinja2 as a dependency which needs to be installed, since rendering is based upon jinja2 templates.

The pandas latex options below are no longer used and have been removed. The generic max rows and columns arguments remain but for this functionality should be replaced by the Styler equivalents. The alternative options giving similar functionality are indicated below:

  • display.latex.escape: replaced with styler.format.escape,

  • display.latex.longtable: replaced with styler.latex.environment,

  • display.latex.multicolumn, display.latex.multicolumn_format and display.latex.multirow: replaced with styler.sparse.rows, styler.sparse.columns, styler.latex.multirow_align and styler.latex.multicol_align,

  • display.latex.repr: replaced with styler.render.repr,

  • display.max_rows and display.max_columns: replaced with styler.render.max_rows, styler.render.max_columns and styler.render.max_elements.

Note that due to this change some defaults have also changed:

  • multirow now defaults to True.

  • multirow_align defaults to “r” instead of “l”.

  • multicol_align defaults to “r” instead of “l”.

  • escape now defaults to False.

Note that the behaviour of _repr_latex_ has also changed. Previously, setting display.latex.repr would generate LaTeX only when using nbconvert for a Jupyter Notebook, and not when the user is running the notebook. Now the styler.render.repr option allows control of the specific output within Jupyter Notebooks for operations (not just on nbconvert). See GH 39911.

Increased minimum versions for dependencies#

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package             Minimum Version  Required  Changed
mypy (dev)          1.0                        X
pytest (dev)        7.0.0                      X
pytest-xdist (dev)  2.2.0                      X
hypothesis (dev)    6.34.2                     X
python-dateutil     2.8.2            X         X
tzdata              2022.1           X         X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package      Minimum Version  Changed
pyarrow      7.0.0            X
matplotlib   3.6.1            X
fastparquet  0.6.3            X
xarray       0.21.0           X

See Dependencies and Optional dependencies for more.

Datetimes are now parsed with a consistent format#

In the past, to_datetime() guessed the format for each element independently. This was appropriate for some cases where elements had mixed date formats - however, it would regularly cause problems when users expected a consistent format but the function would switch formats between elements. As of version 2.0.0, parsing will use a consistent format, determined by the first non-NA value (unless the user specifies a format, in which case that is used).

Old behavior:

In [1]: ser = pd.Series(['13-01-2000', '12-01-2000'])
In [2]: pd.to_datetime(ser)
Out[2]:
0   2000-01-13
1   2000-12-01
dtype: datetime64[ns]

New behavior:

In [42]: ser = pd.Series(['13-01-2000', '12-01-2000'])

In [43]: pd.to_datetime(ser)
Out[43]: 
0   2000-01-13
1   2000-01-12
dtype: datetime64[s]

Note that this affects read_csv() as well.

If you still need to parse dates with inconsistent formats, you can use format='mixed' (possibly alongside dayfirst):

ser = pd.Series(['13-01-2000', '12 January 2000'])
pd.to_datetime(ser, format='mixed', dayfirst=True)

or, if your formats are all ISO8601 (but possibly not identically-formatted)

ser = pd.Series(['2020-01-01', '2020-01-01 03:00'])
pd.to_datetime(ser, format='ISO8601')
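
The three options can be compared side by side. In this sketch the explicit format string "%d-%m-%Y" is an assumption about the intended layout of the first series; the other two calls mirror the snippets above:

```python
import pandas as pd

# One explicit format applied to every element.
same_format = pd.to_datetime(
    pd.Series(["13-01-2000", "12-01-2000"]), format="%d-%m-%Y"
)

# Each element parsed individually, reading the day first where ambiguous.
mixed = pd.to_datetime(
    pd.Series(["13-01-2000", "12 January 2000"]), format="mixed", dayfirst=True
)

# ISO 8601 strings that differ in how much detail they include.
iso = pd.to_datetime(
    pd.Series(["2020-01-01", "2020-01-01 03:00"]), format="ISO8601"
)

print(same_format.dt.day.tolist(), mixed.dt.day.tolist(), iso.dt.hour.tolist())
```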

Other API changes#

  • The tz, nanosecond, and unit keywords in the Timestamp constructor are now keyword-only (GH 45307, GH 32526)

  • Passing nanoseconds greater than 999 or less than 0 in Timestamp now raises a ValueError (GH 48538, GH 48255)

  • read_csv(): specifying an incorrect number of columns with index_col now raises ParserError instead of IndexError when using the c parser.

  • Default value of dtype in get_dummies() is changed to bool from uint8 (GH 45848)

  • DataFrame.astype(), Series.astype(), and DatetimeIndex.astype() casting datetime64 data to any of “datetime64[s]”, “datetime64[ms]”, “datetime64[us]” will return an object with the given resolution instead of coercing back to “datetime64[ns]” (GH 48928)

  • DataFrame.astype(), Series.astype(), and DatetimeIndex.astype() casting timedelta64 data to any of “timedelta64[s]”, “timedelta64[ms]”, “timedelta64[us]” will return an object with the given resolution instead of coercing to “float64” dtype (GH 48963)

  • DatetimeIndex.astype(), TimedeltaIndex.astype(), PeriodIndex.astype(), Series.astype(), and DataFrame.astype() with datetime64, timedelta64 or PeriodDtype dtypes no longer allow converting to integer dtypes other than “int64”; use obj.astype('int64', copy=False).astype(dtype) instead (GH 49715)

  • Index.astype() now allows casting from float64 dtype to datetime-like dtypes, matching Series behavior (GH 49660)

  • Passing data with dtype of “timedelta64[s]”, “timedelta64[ms]”, or “timedelta64[us]” to TimedeltaIndex, Series, or DataFrame constructors will now retain that dtype instead of casting to “timedelta64[ns]”; timedelta64 data with lower resolution will be cast to the lowest supported resolution “timedelta64[s]” (GH 49014)

  • Passing dtype of “timedelta64[s]”, “timedelta64[ms]”, or “timedelta64[us]” to TimedeltaIndex, Series, or DataFrame constructors will now retain that dtype instead of casting to “timedelta64[ns]”; passing a dtype with lower resolution for Series or DataFrame will be cast to the lowest supported resolution “timedelta64[s]” (GH 49014)

  • Passing a np.datetime64 object with non-nanosecond resolution to Timestamp will retain the input resolution if it is “s”, “ms”, “us”, or “ns”; otherwise it will be cast to the closest supported resolution (GH 49008)

  • Passing datetime64 values with resolution other than nanosecond to to_datetime() will retain the input resolution if it is “s”, “ms”, “us”, or “ns”; otherwise it will be cast to the closest supported resolution (GH 50369)

  • Passing integer values and a non-nanosecond datetime64 dtype (e.g. “datetime64[s]”) to DataFrame, Series, or Index will treat the values as multiples of the dtype’s unit, matching the behavior of e.g. Series(np.array(values, dtype="M8[s]")) (GH 51092)

  • Passing a string in ISO-8601 format to Timestamp will retain the resolution of the parsed input if it is “s”, “ms”, “us”, or “ns”; otherwise it will be cast to the closest supported resolution (GH 49737)

  • The other argument in DataFrame.mask() and Series.mask() now defaults to no_default instead of np.nan, consistent with DataFrame.where() and Series.where(). Entries will be filled with the corresponding NULL value (np.nan for numpy dtypes, pd.NA for extension dtypes). (GH 49111)

  • Changed behavior of Series.quantile() and DataFrame.quantile() with SparseDtype to retain sparse dtype (GH 49583)

  • When creating a Series with an object-dtype Index of datetime objects, pandas no longer silently converts the index to a DatetimeIndex (GH 39307, GH 23598)

  • pandas.testing.assert_index_equal() with parameter exact="equiv" now considers two indexes equal when both are either a RangeIndex or Index with an int64 dtype. Previously it meant either a RangeIndex or an Int64Index (GH 51098)

  • Series.unique() with dtype “timedelta64[ns]” or “datetime64[ns]” now returns TimedeltaArray or DatetimeArray instead of numpy.ndarray (GH 49176)

  • to_datetime() and DatetimeIndex now allow sequences containing both datetime objects and numeric entries, matching Series behavior (GH 49037, GH 50453)

  • pandas.api.types.is_string_dtype() now only returns True for array-likes with dtype=object when the elements are inferred to be strings (GH 15585)

  • Passing a sequence containing datetime objects and date objects to Series constructor will return with object dtype instead of datetime64[ns] dtype, consistent with Index behavior (GH 49341)

  • Passing strings that cannot be parsed as datetimes to Series or DataFrame with dtype="datetime64[ns]" will raise instead of silently ignoring the keyword and returning object dtype (GH 24435)

  • Passing a sequence containing a type that cannot be converted to Timedelta to to_timedelta() or to the Series or DataFrame constructor with dtype="timedelta64[ns]" or to TimedeltaIndex now raises TypeError instead of ValueError (GH 49525)

  • Changed behavior of Index constructor with sequence containing at least one NaT and everything else either None or NaN to infer datetime64[ns] dtype instead of object, matching Series behavior (GH 49340)

  • read_stata() with parameter index_col set to None (the default) will now set the index on the returned DataFrame to a RangeIndex instead of an Int64Index (GH 49745)

  • Changed behavior of Index, Series, and DataFrame arithmetic methods when working with object-dtypes; the results no longer do type inference on the result of the array operations. Use result.infer_objects(copy=False) to do type inference on the result (GH 49999, GH 49714)

  • Changed behavior of Index constructor with an object-dtype numpy.ndarray containing all-bool values or all-complex values; this will now retain object dtype, consistent with the Series behavior (GH 49594)

  • Changed behavior of Series.astype() from object-dtype containing bytes objects to string dtypes; this now does val.decode() on bytes objects instead of str(val), matching Index.astype() behavior (GH 45326)

  • Added "None" to default na_values in read_csv() (GH 50286)

  • Changed behavior of Series and DataFrame constructors when given an integer dtype and floating-point data that is not round numbers; this now raises ValueError instead of silently retaining the float dtype. Do Series(data) or DataFrame(data) to get the old behavior, and Series(data).astype(dtype) or DataFrame(data).astype(dtype) to get the specified dtype (GH 49599)

  • Changed behavior of DataFrame.shift() with axis=1, an integer fill_value, and homogeneous datetime-like dtype; this now fills new columns with integer dtypes instead of casting to datetimelike (GH 49842)

  • Files are now closed when encountering an exception in read_json() (GH 49921)

  • Changed behavior of read_csv(), read_json() & read_fwf(), where the index will now always be a RangeIndex when no index is specified. Previously the index would be an Index with dtype object if the new DataFrame/Series has length 0 (GH 49572)

  • DataFrame.values, DataFrame.to_numpy(), DataFrame.xs(), DataFrame.reindex(), DataFrame.fillna(), and DataFrame.replace() no longer silently consolidate the underlying arrays; do df = df.copy() to ensure consolidation (GH 49356)

  • Creating a new DataFrame using a full slice on both axes with loc or iloc (thus, df.loc[:, :] or df.iloc[:, :]) now returns a new DataFrame (shallow copy) instead of the original DataFrame, consistent with other methods to get a full slice (for example df.loc[:] or df[:]) (GH 49469)

  • The Series and DataFrame constructors will now return a shallow copy (i.e. share data, but not attributes) when passed a Series and DataFrame, respectively, and with the default of copy=False (and if no other keyword triggers a copy). Previously, the new Series or DataFrame would share the index attribute (e.g. df.index = ... would also update the index of the parent or child) (GH 49523)

  • Disallow computing cumprod for Timedelta object; previously this returned incorrect values (GH 50246)

  • DataFrame objects read from an HDFStore file without an index now have a RangeIndex instead of an int64 index (GH 51076)

  • Instantiating an Index with a numeric numpy dtype with data containing NA and/or NaT now raises a ValueError. Previously a TypeError was raised (GH 51050)

  • Loading a JSON file with duplicate columns using read_json(orient='split') renames columns to avoid duplicates, as read_csv() and the other readers do (GH 50370)

  • The levels of the index of the Series returned from Series.sparse.from_coo now always have dtype int32. Previously they had dtype int64 (GH 50926)

  • to_datetime() with unit of either “Y” or “M” will now raise if a sequence contains a non-round float value, matching the Timestamp behavior (GH 50301)

  • The methods Series.round(), DataFrame.__invert__(), Series.__invert__(), DataFrame.swapaxes(), DataFrame.first(), DataFrame.last(), Series.first(), Series.last() and DataFrame.align() will now always return new objects (GH 51032)

  • DataFrame and DataFrameGroupBy aggregations (e.g. “sum”) with object-dtype columns no longer infer non-object dtypes for their results; explicitly call result.infer_objects(copy=False) on the result to obtain the old behavior (GH 51205, GH 49603)

  • Division by zero with ArrowDtype dtypes returns -inf, nan, or inf depending on the numerator, instead of raising (GH 51541)

  • Added pandas.api.types.is_any_real_numeric_dtype() to check for real numeric dtypes (GH 51152)

  • value_counts() now returns data with ArrowDtype with pyarrow.int64 type instead of "Int64" type (GH 51462)

  • factorize() and unique() preserve the original dtype when passed numpy timedelta64 or datetime64 with non-nanosecond resolution (GH 48670)
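
Several of the behavior changes listed above can be checked interactively. The following is a minimal sketch, assuming pandas 2.0 or later with numpy installed; the variable names are illustrative only and the sample values are made up for the demonstration.

```python
import io

import numpy as np
import pandas as pd

# Second-resolution timedelta64 data keeps its unit in the constructor (GH 49014).
td = pd.Series(np.array([1, 2], dtype="timedelta64[s]"))
td_dtype = str(td.dtype)

# Casting timedelta64 data to a supported unit keeps that unit (GH 48963).
td_ms_dtype = str(td.astype("timedelta64[ms]").dtype)

# A non-nanosecond np.datetime64 keeps its resolution in Timestamp (GH 49008).
ts_unit = pd.Timestamp(np.datetime64("2023-04-03", "s")).unit

# An integer dtype combined with non-round float data now raises (GH 49599).
try:
    pd.Series([1.5, 2.5], dtype="int64")
    float_to_int_raised = False
except ValueError:
    float_to_int_raised = True

# The literal string "None" is treated as a missing value by read_csv (GH 50286).
csv_frame = pd.read_csv(io.StringIO("a\nNone\nx"))
none_is_na = bool(csv_frame["a"].isna().iloc[0])

# A full slice on both axes returns a new object, not the original (GH 49469).
df = pd.DataFrame({"a": [1, 2]})
full_slice_is_new = df.loc[:, :] is not df

# Object-dtype data counts as string only if all elements are strings (GH 15585).
mixed_is_string = pd.api.types.is_string_dtype(pd.Series([1, "a"], dtype=object))
```

Code that relied on the old coercions (for example, expecting “timedelta64[ns]” everywhere) may need an explicit astype() call after upgrading.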

Note

A current PDEP proposes the deprecation and removal of the keywords inplace and copy for all but a small subset of methods from the pandas API. The current discussion takes place here. The keywords won’t be necessary anymore in the context of Copy-on-Write. If this proposal is accepted, both keywords would be deprecated in the next release of pandas and removed in pandas 3.0.

Deprecations#

Removal of prior version deprecations/changes#

Performance improvements#

Bug fixes#

Categorical#

Datetimelike#

Timedelta#

  • Bug in to_timedelta() raising error when input has nullable dtype Float64 (GH 48796)

  • Bug in Timedelta constructor incorrectly raising instead of returning NaT when given a np.timedelta64("nat") (GH 48898)

  • Bug in Timedelta constructor failing to raise when passed both a Timedelta object and keywords (e.g. days, seconds) (GH 48898)

  • Bug in Timedelta comparisons with very large datetime.timedelta objects incorrectly raising OutOfBoundsTimedelta (GH 49021)
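
As a rough illustration of three of the Timedelta fixes above, the sketch below assumes pandas 2.0 or later; the input values are arbitrary and chosen only for the demonstration.

```python
import datetime

import numpy as np
import pandas as pd

# Nullable Float64 input is accepted by to_timedelta (GH 48796).
seconds = pd.Series([1.0, 2.0], dtype="Float64")
converted = pd.to_timedelta(seconds, unit="s")

# A NaT-valued np.timedelta64 gives pd.NaT instead of raising (GH 48898).
nat_result = pd.Timedelta(np.timedelta64("nat"))

# Comparing against a very large datetime.timedelta no longer raises (GH 49021).
is_smaller = pd.Timedelta(days=1) < datetime.timedelta.max
```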

Timezones#

  • Bug in Series.astype() and DataFrame.astype() incorrectly raising when casting object-dtype data containing multiple timezone-aware datetime objects with heterogeneous timezones to a DatetimeTZDtype (GH 32581)

  • Bug in to_datetime() failing to parse date strings with timezone name when format was specified with %Z (GH 49748)

  • Better error message when passing invalid values to ambiguous parameter in Timestamp.tz_localize() (GH 49565)

  • Bug in string parsing incorrectly allowing a Timestamp to be constructed with an invalid timezone, which would raise when trying to print (GH 50668)

  • Corrected TypeError message in objects_to_datetime64ns() to inform that DatetimeIndex has mixed timezones (GH 50974)
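
Two of the timezone entries above can be sketched as follows, assuming pandas 2.0 or later. The date string and the invalid ambiguous value are made up for the example, and the exception type caught here (ValueError) is an assumption, since the entry above only mentions an improved error message.

```python
import datetime

import pandas as pd

# A timezone name matched by %Z is parsed into a timezone-aware result (GH 49748).
parsed = pd.to_datetime(
    "2023-04-03 12:00:00 UTC", format="%Y-%m-%d %H:%M:%S %Z"
)
offset = parsed.utcoffset()

# An unsupported value for `ambiguous` is rejected (GH 49565).
# Assumption: the error is raised as a ValueError.
try:
    pd.Timestamp("2023-04-03").tz_localize("UTC", ambiguous="not-an-option")
    ambiguous_rejected = False
except ValueError:
    ambiguous_rejected = True
```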

Numeric#

Conversion#

Strings#

Interval#

Indexing#

Missing#

MultiIndex#

I/O#

Period#

  • Bug in Period.strftime() and PeriodIndex.strftime(), raising UnicodeDecodeError when a locale-specific directive was passed (GH 46319)

  • Bug in adding a Period object to an array of DateOffset objects incorrectly raising TypeError (GH 50162)

  • Bug in Period where passing a string with finer resolution than nanosecond would result in a KeyError instead of dropping the extra precision (GH 50417)

  • Bug in parsing strings representing Week-periods e.g. “2017-01-23/2017-01-29” as minute-frequency instead of week-frequency (GH 50803)

  • Bug in DataFrameGroupBy.sum(), DataFrameGroupBy.cumsum(), DataFrameGroupBy.prod(), DataFrameGroupBy.cumprod() with PeriodDtype failing to raise TypeError (GH 51040)

  • Bug in parsing empty string with Period incorrectly raising ValueError instead of returning NaT (GH 51349)
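
The string-parsing fixes for Period can be illustrated with a short sketch, assuming pandas 2.0 or later. The week string is the one quoted in the entry above; the variable names are illustrative.

```python
import pandas as pd

# A "start/end" string spanning one week is parsed as a weekly period (GH 50803).
week = pd.Period("2017-01-23/2017-01-29")
week_start = week.start_time
week_freq = week.freqstr

# An empty string gives NaT instead of raising ValueError (GH 51349).
empty = pd.Period("")
```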

Plotting#

  • Bug in DataFrame.plot.hist(), not dropping elements of weights corresponding to NaN values in data (GH 48884)

  • ax.set_xlim was sometimes raising UserWarning which users couldn’t address due to set_xlim not accepting parsing arguments; the converter now uses Timestamp() instead (GH 49148)

Groupby/resample/rolling#

Reshaping#

Sparse#

ExtensionArray#

Styler#

Metadata#

Other#

Contributors#

A total of 260 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.

  • 5j9 +

  • ABCPAN-rank +

  • Aarni Koskela +

  • Aashish KC +

  • Abubeker Mohammed +

  • Adam Mróz +

  • Adam Ormondroyd +

  • Aditya Anulekh +

  • Ahmed Ibrahim

  • Akshay Babbar +

  • Aleksa Radojicic +

  • Alex +

  • Alex Buzenet +

  • Alex Kirko

  • Allison Kwan +

  • Amay Patel +

  • Ambuj Pawar +

  • Amotz +

  • Andreas Schwab +

  • Andrew Chen +

  • Anton Shevtsov

  • Antonio Ossa Guerra +

  • Antonio Ossa-Guerra +

  • Anushka Bishnoi +

  • Arda Kosar

  • Armin Berres

  • Asadullah Naeem +

  • Asish Mahapatra

  • Bailey Lissington +

  • BarkotBeyene

  • Ben Beasley

  • Bhavesh Rajendra Patil +

  • Bibek Jha +

  • Bill +

  • Bishwas +

  • CarlosGDCJ +

  • Carlotta Fabian +

  • Chris Roth +

  • Chuck Cadman +

  • Corralien +

  • DG +

  • Dan Hendry +

  • Daniel Isaac

  • David Kleindienst +

  • David Poznik +

  • David Rudel +

  • DavidKleindienst +

  • Dea María Léon +

  • Deepak Sirohiwal +

  • Dennis Chukwunta

  • Douglas Lohmann +

  • Dries Schaumont

  • Dustin K +

  • Edoardo Abati +

  • Eduardo Chaves +

  • Ege Özgüroğlu +

  • Ekaterina Borovikova +

  • Eli Schwartz +

  • Elvis Lim +

  • Emily Taylor +

  • Emma Carballal Haire +

  • Erik Welch +

  • Fangchen Li

  • Florian Hofstetter +

  • Flynn Owen +

  • Fredrik Erlandsson +

  • Gaurav Sheni

  • Georeth Chow +

  • George Munyoro +

  • Guilherme Beltramini

  • Gulnur Baimukhambetova +

  • H L +

  • Hans

  • Hatim Zahid +

  • HighYoda +

  • Hiki +

  • Himanshu Wagh +

  • Hugo van Kemenade +

  • Idil Ismiguzel +

  • Irv Lustig

  • Isaac Chung

  • Isaac Virshup

  • JHM Darbyshire

  • JHM Darbyshire (iMac)

  • JMBurley

  • Jaime Di Cristina

  • Jan Koch

  • JanVHII +

  • Janosh Riebesell

  • JasmandeepKaur +

  • Jeremy Tuloup

  • Jessica M +

  • Jonas Haag

  • Joris Van den Bossche

  • João Meirelles +

  • Julia Aoun +

  • Justus Magin +

  • Kang Su Min +

  • Kevin Sheppard

  • Khor Chean Wei

  • Kian Eliasi

  • Kostya Farber +

  • KotlinIsland +

  • Lakmal Pinnaduwage +

  • Lakshya A Agrawal +

  • Lawrence Mitchell +

  • Levi Ob +

  • Loic Diridollou

  • Lorenzo Vainigli +

  • Luca Pizzini +

  • Lucas Damo +

  • Luke Manley

  • Madhuri Patil +

  • Marc Garcia

  • Marco Edward Gorelli

  • Marco Gorelli

  • MarcoGorelli

  • Maren Westermann +

  • Maria Stazherova +

  • Marie K +

  • Marielle +

  • Mark Harfouche +

  • Marko Pacak +

  • Martin +

  • Matheus Cerqueira +

  • Matheus Pedroni +

  • Matteo Raso +

  • Matthew Roeschke

  • MeeseeksMachine +

  • Mehdi Mohammadi +

  • Michael Harris +

  • Michael Mior +

  • Natalia Mokeeva +

  • Neal Muppidi +

  • Nick Crews

  • Nishu Choudhary +

  • Noa Tamir

  • Noritada Kobayashi

  • Omkar Yadav +

  • P. Talley +

  • Pablo +

  • Pandas Development Team

  • Parfait Gasana

  • Patrick Hoefler

  • Pedro Nacht +

  • Philip +

  • Pietro Battiston

  • Pooja Subramaniam +

  • Pranav Saibhushan Ravuri +

  • Pranav. P. A +

  • Ralf Gommers +

  • RaphSku +

  • Richard Shadrach

  • Robsdedude +

  • Roger

  • Roger Thomas

  • RogerThomas +

  • SFuller4 +

  • Salahuddin +

  • Sam Rao

  • Sean Patrick Malloy +

  • Sebastian Roll +

  • Shantanu

  • Shashwat +

  • Shashwat Agrawal +

  • Shiko Wamwea +

  • Shoham Debnath

  • Shubhankar Lohani +

  • Siddhartha Gandhi +

  • Simon Hawkins

  • Soumik Dutta +

  • Sowrov Talukder +

  • Stefanie Molin

  • Stefanie Senger +

  • Stepfen Shawn +

  • Steven Rotondo

  • Stijn Van Hoey

  • Sudhansu +

  • Sven

  • Sylvain MARIE

  • Sylvain Marié

  • Tabea Kossen +

  • Taylor Packard

  • Terji Petersen

  • Thierry Moisan

  • Thomas H +

  • Thomas Li

  • Torsten Wörtwein

  • Tsvika S +

  • Tsvika Shapira +

  • Vamsi Verma +

  • Vinicius Akira +

  • William Andrea

  • William Ayd

  • William Blum +

  • Wilson Xing +

  • Xiao Yuan +

  • Xnot +

  • Yasin Tatar +

  • Yuanhao Geng

  • Yvan Cywan +

  • Zachary Moon +

  • Zhengbo Wang +

  • abonte +

  • adrienpacifico +

  • alm

  • amotzop +

  • andyjessen +

  • anonmouse1 +

  • bang128 +

  • bishwas jha +

  • calhockemeyer +

  • carla-alves-24 +

  • carlotta +

  • casadipietra +

  • catmar22 +

  • cfabian +

  • codamuse +

  • dataxerik

  • davidleon123 +

  • dependabot[bot] +

  • fdrocha +

  • github-actions[bot]

  • himanshu_wagh +

  • iofall +

  • jakirkham +

  • jbrockmendel

  • jnclt +

  • joelchen +

  • joelsonoda +

  • joshuabello2550

  • joycewamwea +

  • kathleenhang +

  • krasch +

  • ltoniazzi +

  • luke396 +

  • milosz-martynow +

  • minat-hub +

  • mliu08 +

  • monosans +

  • nealxm

  • nikitaved +

  • paradox-lab +

  • partev

  • raisadz +

  • ram vikram singh +

  • rebecca-palmer

  • sarvaSanjay +

  • seljaks +

  • silviaovo +

  • smij720 +

  • soumilbaldota +

  • stellalin7 +

  • strawberry beach sandals +

  • tmoschou +

  • uzzell +

  • yqyqyq-W +

  • yun +

  • Ádám Lippai

  • 김동현 (Daniel Donghyun Kim) +