What’s new in 3.0.0 (Month XX, 2024)#

These are the changes in pandas 3.0.0. See Release notes for a full changelog including other versions of pandas.

Enhancements#

Enhancement1#

Enhancement2#

Other enhancements#

Notable bug fixes#

These are bug fixes that might have notable behavior changes.

Improved behavior in groupby for observed=False#

A number of bugs have been fixed due to improved handling of unobserved groups (GH 55738). All remarks in this section equally impact SeriesGroupBy.

In previous versions of pandas, a single grouping with DataFrameGroupBy.apply() or DataFrameGroupBy.agg() would pass the unobserved groups to the provided function, resulting in 0 below.

In [1]: df = pd.DataFrame(
   ...:     {
   ...:         "key1": pd.Categorical(list("aabb"), categories=list("abc")),
   ...:         "key2": [1, 1, 1, 2],
   ...:         "values": [1, 2, 3, 4],
   ...:     }
   ...: )
   ...: 

In [2]: df
Out[2]: 
  key1  key2  values
0    a     1       1
1    a     1       2
2    b     1       3
3    b     2       4

In [3]: gb = df.groupby("key1", observed=False)

In [4]: gb[["values"]].apply(lambda x: x.sum())
Out[4]: 
      values
key1        
a          3
b          7
c          0

However, this was not the case when using multiple groupings, resulting in NaN below.

In [1]: gb = df.groupby(["key1", "key2"], observed=False)
In [2]: gb[["values"]].apply(lambda x: x.sum())
Out[2]:
           values
key1 key2
a    1        3.0
     2        NaN
b    1        3.0
     2        4.0
c    1        NaN
     2        NaN

Now using multiple groupings will also pass the unobserved groups to the provided function.

In [5]: gb = df.groupby(["key1", "key2"], observed=False)

In [6]: gb[["values"]].apply(lambda x: x.sum())
Out[6]: 
           values
key1 key2        
a    1          3
     2          0
b    1          3
     2          4
c    1          0
     2          0
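For code that should only ever see groups present in the data, passing observed=True remains an option. The following is a minimal sketch, not part of the original notes, that reuses the example frame above and compares the number of groups produced by each setting; sum() is used here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key1": pd.Categorical(list("aabb"), categories=list("abc")),
        "key2": [1, 1, 1, 2],
        "values": [1, 2, 3, 4],
    }
)

# observed=False keeps every (key1, key2) combination, including unobserved ones.
all_groups = df.groupby(["key1", "key2"], observed=False)["values"].sum()

# observed=True restricts the result to combinations that occur in the data.
observed_only = df.groupby(["key1", "key2"], observed=True)["values"].sum()

print(len(all_groups), len(observed_only))
```

The first result has one row per combination of the three categories and the two key2 values; the second has one row per combination that actually appears in df.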

Similarly:

These improvements also fixed certain bugs in groupby:

notable_bug_fix2#

Backwards incompatible API changes#

Datetime resolution inference#

Converting a sequence of strings, datetime objects, or np.datetime64 objects to a datetime64 dtype now performs inference on the appropriate resolution (AKA unit) for the output dtype. This affects Series, DataFrame, Index, DatetimeIndex, and to_datetime().

Previously, these would always give nanosecond resolution:

In [1]: dt = pd.Timestamp("2024-03-22 11:36").to_pydatetime()
In [2]: pd.to_datetime([dt]).dtype
Out[2]: dtype('<M8[ns]')
In [3]: pd.Index([dt]).dtype
Out[3]: dtype('<M8[ns]')
In [4]: pd.DatetimeIndex([dt]).dtype
Out[4]: dtype('<M8[ns]')
In [5]: pd.Series([dt]).dtype
Out[5]: dtype('<M8[ns]')

This now infers the microsecond unit “us” from the pydatetime object, matching the scalar Timestamp behavior.

In [7]: dt = pd.Timestamp("2024-03-22 11:36").to_pydatetime()

In [8]: pd.to_datetime([dt]).dtype
Out[8]: dtype('<M8[us]')

In [9]: pd.Index([dt]).dtype
Out[9]: dtype('<M8[us]')

In [10]: pd.DatetimeIndex([dt]).dtype
Out[10]: dtype('<M8[us]')

In [11]: pd.Series([dt]).dtype
Out[11]: dtype('<M8[us]')

Similarly, when passed a sequence of np.datetime64 objects, the resolution of the passed objects will be retained (or, for lower-than-second resolution, second resolution will be used).

When passing strings, the resolution will depend on the precision of the string, again matching the Timestamp behavior. Previously:

In [2]: pd.to_datetime(["2024-03-22 11:43:01"]).dtype
Out[2]: dtype('<M8[ns]')
In [3]: pd.to_datetime(["2024-03-22 11:43:01.002"]).dtype
Out[3]: dtype('<M8[ns]')
In [4]: pd.to_datetime(["2024-03-22 11:43:01.002003"]).dtype
Out[4]: dtype('<M8[ns]')
In [5]: pd.to_datetime(["2024-03-22 11:43:01.002003004"]).dtype
Out[5]: dtype('<M8[ns]')

The inferred resolution now matches that of the input strings:

In [12]: pd.to_datetime(["2024-03-22 11:43:01"]).dtype
Out[12]: dtype('<M8[s]')

In [13]: pd.to_datetime(["2024-03-22 11:43:01.002"]).dtype
Out[13]: dtype('<M8[ms]')

In [14]: pd.to_datetime(["2024-03-22 11:43:01.002003"]).dtype
Out[14]: dtype('<M8[us]')

In [15]: pd.to_datetime(["2024-03-22 11:43:01.002003004"]).dtype
Out[15]: dtype('<M8[ns]')

In cases with mixed-resolution inputs, the highest resolution is used:

In [2]: pd.to_datetime([pd.Timestamp("2024-03-22 11:43:01"), "2024-03-22 11:43:01.002"]).dtype
Out[2]: dtype('<M8[ns]')
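Code that relies on nanosecond resolution can request it explicitly instead of depending on inference. The sketch below is an illustration rather than part of the original notes; it assumes the existing as_unit() method on DatetimeIndex and a plain astype() on Series are acceptable ways to pin the unit.

```python
import pandas as pd

# The string has second precision, so the inferred unit may be coarser than "ns".
idx = pd.to_datetime(["2024-03-22 11:43:01"])

# as_unit converts a DatetimeIndex to the requested resolution.
idx_ns = idx.as_unit("ns")

# For a Series, an explicit astype pins the dtype in the same way.
ser_ns = pd.Series(idx).astype("datetime64[ns]")

print(idx_ns.dtype, ser_ns.dtype)
```

Pinning the unit this way keeps downstream code independent of which resolution the constructor happens to infer.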

Changed behavior in DataFrame.value_counts() and DataFrameGroupBy.value_counts() when sort=False#

In previous versions of pandas, DataFrame.value_counts() with sort=False would sort the result by row labels (as was documented). This was nonintuitive and inconsistent with Series.value_counts() which would maintain the order of the input. Now DataFrame.value_counts() will maintain the order of the input.

In [16]: df = pd.DataFrame(
   ....:     {
   ....:         "a": [2, 2, 2, 2, 1, 1, 1, 1],
   ....:         "b": [2, 1, 3, 1, 2, 3, 1, 1],
   ....:     }
   ....: )
   ....: 

In [17]: df
Out[17]: 
   a  b
0  2  2
1  2  1
2  2  3
3  2  1
4  1  2
5  1  3
6  1  1
7  1  1

Old behavior

In [3]: df.value_counts(sort=False)
Out[3]:
a  b
1  1    2
   2    1
   3    1
2  1    2
   2    1
   3    1
Name: count, dtype: int64

New behavior

In [18]: df.value_counts(sort=False)
Out[18]: 
a  b
2  2    1
   1    2
   3    1
1  2    1
   3    1
   1    2
Name: count, dtype: int64

This change also applies to DataFrameGroupBy.value_counts(). Here, there are two options for sorting: one sort passed to DataFrame.groupby() and one passed directly to DataFrameGroupBy.value_counts(). The former will determine whether to sort the groups, the latter whether to sort the counts. All non-grouping columns will maintain the order of the input within groups.

Old behavior

In [5]: df.groupby("a", sort=True).value_counts(sort=False)
Out[5]:
a  b
1  1    2
   2    1
   3    1
2  1    2
   2    1
   3    1
dtype: int64

New behavior

In [19]: df.groupby("a", sort=True).value_counts(sort=False)
Out[19]: 
a  b
1  2    1
   3    1
   1    2
2  2    1
   1    2
   3    1
Name: count, dtype: int64
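If the label-sorted ordering is still wanted, one possible approach (a sketch, not an officially recommended migration path) is to sort the result by its index explicitly. The frame below repeats the example data from this section.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [2, 2, 2, 2, 1, 1, 1, 1],
        "b": [2, 1, 3, 1, 2, 3, 1, 1],
    }
)

# Count unique rows without sorting by count.
counts = df.value_counts(sort=False)

# Sorting by the (a, b) row labels gives an ordering independent of input order.
by_label = counts.sort_index()

print(by_label)
```

Because sort_index() orders the (a, b) labels lexicographically, by_label has the same ordering as the old behavior shown above.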

Increased minimum version for Python#

pandas 3.0.0 supports Python 3.10 and higher.

Increased minimum versions for dependencies#

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package    Minimum Version    Required    Changed
numpy      1.23.5             X           X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package                   New Minimum Version
pytz                      2023.4
fastparquet               2023.10.0
adbc-driver-postgresql    0.10.0
mypy (dev)                1.9.0

See Dependencies and Optional dependencies for more.

pytz now an optional dependency#

pandas now uses zoneinfo from the standard library as the default timezone implementation when passing a timezone string to various methods. (GH 34916)

Old behavior:

In [1]: ts = pd.Timestamp(2024, 1, 1).tz_localize("US/Pacific")
In [2]: ts.tz
Out[2]: <DstTzInfo 'US/Pacific' LMT-1 day, 16:07:00 STD>

New behavior:

In [20]: ts = pd.Timestamp(2024, 1, 1).tz_localize("US/Pacific")

In [21]: ts.tz
Out[21]: zoneinfo.ZoneInfo(key='US/Pacific')

pytz timezone objects are still supported when passed directly, but they will no longer be returned by default from string inputs. Moreover, pytz is no longer a required dependency of pandas, but can be installed with the pip extra: pip install pandas[timezone].

Additionally, pandas no longer throws pytz exceptions for timezone operations leading to ambiguous or nonexistent times. These cases will now raise a ValueError.
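Timezone objects can also be passed directly rather than as strings. The following minimal sketch is not taken from the original notes; it uses the standard-library ZoneInfo with the key "America/Los_Angeles" (the canonical name for the zone that "US/Pacific" aliases) and assumes the existing ambiguous argument of Timestamp.tz_localize() is a suitable way to avoid an error on an ambiguous wall time.

```python
from datetime import timedelta
from zoneinfo import ZoneInfo

import pandas as pd

# Passing a timezone object explicitly avoids relying on string-to-tz defaults.
tz = ZoneInfo("America/Los_Angeles")
ts = pd.Timestamp(2024, 1, 1).tz_localize(tz)

# 01:30 on 2024-11-03 occurs twice in this zone (end of DST).
# ambiguous=True selects the daylight-saving reading instead of raising.
fall_back = pd.Timestamp("2024-11-03 01:30").tz_localize(tz, ambiguous=True)

print(ts.utcoffset(), fall_back.utcoffset())
```

Resolving ambiguity up front means the calling code does not need to depend on which exception type a given pandas version raises.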

Other API changes#

  • 3rd party py.path objects are no longer explicitly supported in IO methods. Use pathlib.Path objects instead (GH 57091)

  • read_table()’s parse_dates argument defaults to None to improve consistency with read_csv() (GH 57476)

  • All classes inheriting from builtin tuple (including types created with collections.namedtuple()) are now hashed and compared as builtin tuple during indexing operations (GH 57922)

  • Made dtype a required argument in ExtensionArray._from_sequence_of_strings() (GH 56519)

  • Passing a Series input to json_normalize() will now retain the Series Index, previously output had a new RangeIndex (GH 51452)

  • Removed Index.sort() which always raised a TypeError. This attribute is no longer defined and will now raise an AttributeError (GH 59283)

  • Updated DataFrame.to_excel() so that the output spreadsheet has no styling. Custom styling can still be done using Styler.to_excel() (GH 54154)

  • pickle and HDF (.h5) files created with Python 2 are no longer explicitly supported (GH 57387)

  • Pickled objects from pandas versions older than 1.0.0 are no longer supported (GH 57155)

  • When comparing the indexes in testing.assert_series_equal(), check_exact now defaults to True if an Index is of integer dtype (GH 57386)
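As an illustration of the first item in the list above, the sketch below round-trips a small frame through a CSV file using a pathlib.Path. The file name example.csv and the temporary directory are assumptions made only for this example.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

with tempfile.TemporaryDirectory() as tmp:
    # pathlib.Path objects are accepted directly by the IO methods.
    csv_path = Path(tmp) / "example.csv"
    df.to_csv(csv_path, index=False)
    round_tripped = pd.read_csv(csv_path)

print(round_tripped)
```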

Deprecations#

Copy keyword#

The copy keyword argument in the following methods is deprecated and will be removed in a future version:

Copy-on-Write utilizes a lazy copy mechanism that defers copying the data until necessary. Use .copy to trigger an eager copy. The copy keyword has no effect starting with 3.0, so it can be safely removed from your code.
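A minimal sketch of the suggested migration follows. It assumes DataFrame.rename() is one of the affected methods, which is an assumption for illustration since the list is not reproduced here. The copy keyword is simply dropped, and .copy() is called only where an eager copy is really needed.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Hypothetical migration: rename(..., copy=True) becomes rename(...).copy().
renamed = df.rename(columns={"a": "b"}).copy()

# Writing to the copy leaves the original frame untouched.
renamed.loc[0, "b"] = 100

print(df["a"].tolist(), renamed["b"].tolist())
```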

Other Deprecations#

Removal of prior version deprecations/changes#

Enforced deprecation of aliases M, Q, Y, etc. in favour of ME, QE, YE, etc. for offsets#

Renamed the following offset aliases (GH 57986):

offset                    removed alias    new alias
MonthEnd                  M                ME
BusinessMonthEnd          BM               BME
SemiMonthEnd              SM               SME
CustomBusinessMonthEnd    CBM              CBME
QuarterEnd                Q                QE
BQuarterEnd               BQ               BQE
YearEnd                   Y                YE
BYearEnd                  BY               BYE

Other Removals#

Performance improvements#

Bug fixes#

Categorical#

Datetimelike#

Timedelta#

Timezones#

Numeric#

  • Bug in DataFrame.quantile() where the column type was not preserved when numeric_only=True with a list-like q produced an empty result (GH 59035)

  • Bug in np.matmul with Index inputs raising a TypeError (GH 57079)

Conversion#

Strings#

Interval#

Indexing#

  • Bug in DataFrame.__getitem__() returning modified columns when called with slice in Python 3.12 (GH 57500)

  • Bug in DataFrame.from_records() throwing a ValueError when passed an empty list in index (GH 58594)

  • Bug in MultiIndex.insert() when a new value inserted to a datetime-like level gets cast to NaT and fails indexing (GH 60388)

  • Bug in printing Index.names and MultiIndex.levels where single quotes were not escaped (GH 60190)

Missing#

MultiIndex#

I/O#

Period#

Plotting#

Groupby/resample/rolling#

  • Bug in DataFrameGroupBy.__len__() and SeriesGroupBy.__len__() that would raise when the grouping contained NA values and dropna=False (GH 58644)

  • Bug in DataFrameGroupBy.any() that returned True for groups where all Timedelta values are NaT. (GH 59712)

  • Bug in DataFrameGroupBy.groups() and SeriesGroupBy.groups() that would not respect the groupby argument dropna (GH 55919)

  • Bug in DataFrameGroupBy.median() where NaT values gave an incorrect result. (GH 57926)

  • Bug in DataFrameGroupBy.quantile() when interpolation="nearest" is inconsistent with DataFrame.quantile() (GH 47942)

  • Bug in Resampler.interpolate() on a DataFrame with non-uniform sampling and/or indices not aligning with the resulting resampled index would result in wrong interpolation (GH 21351)

  • Bug in DataFrame.ewm() and Series.ewm() when passed times and aggregation functions other than mean (GH 51695)

  • Bug in DataFrameGroupBy.agg() that raises AttributeError when there is dictionary input and duplicated columns, instead of returning a DataFrame with the aggregation of all duplicate columns. (GH 55041)

  • Bug in DataFrameGroupBy.apply() and SeriesGroupBy.apply() for empty data frame with group_keys=False still creating output index using group keys. (GH 60471)

  • Bug in DataFrameGroupBy.apply() that was returning a completely empty DataFrame when all return values of func were None instead of returning an empty DataFrame with the original columns and dtypes. (GH 57775)

  • Bug in DataFrameGroupBy.apply() with as_index=False that was returning MultiIndex instead of returning Index. (GH 58291)

  • Bug in DataFrameGroupBy.cumsum() and DataFrameGroupBy.cumprod() where numeric_only parameter was passed indirectly through kwargs instead of passing directly. (GH 58811)

  • Bug in DataFrameGroupBy.cumsum() where it did not return the correct dtype when the label contained None. (GH 58811)

  • Bug in DataFrameGroupBy.transform() and SeriesGroupBy.transform() with a reducer and observed=False that coerced the dtype to float when there were unobserved categories. (GH 55326)

  • Bug in Rolling.apply() for method="table" where column order was not being respected due to the columns getting sorted by default. (GH 59666)

  • Bug in Rolling.apply() where the applied function could be called on fewer than min_periods periods if method="table". (GH 58868)

  • Bug in Series.resample() that could raise when the date range ended shortly before a non-existent time. (GH 58380)

Reshaping#

Sparse#

ExtensionArray#

  • Bug in arrays.ArrowExtensionArray.__setitem__() which caused wrong behavior when using an integer array with repeated values as a key (GH 58530)

  • Bug in api.types.is_datetime64_any_dtype() where a custom ExtensionDtype would return False for array-likes (GH 57055)

  • Bug in comparison between object with ArrowDtype and incompatible-dtyped (e.g. string vs bool) incorrectly raising instead of returning all-False (for ==) or all-True (for !=) (GH 59505)

  • Bug in constructing pandas data structures when passing into dtype a string of the type followed by [pyarrow] while PyArrow is not installed would raise NameError rather than ImportError (GH 57928)

  • Bug in various DataFrame reductions for pyarrow temporal dtypes returning incorrect dtype when result was null (GH 59234)

Styler#

  • Fixed a bug in Styler.to_latex() that affected the styling of column headers when combined with a hidden index or hidden index levels.

Other#

Contributors#