What’s new in 3.0.0 (Month XX, 2024)#

These are the changes in pandas 3.0.0. See Release notes for a full changelog including other versions of pandas.

Enhancements#

enhancement1#

enhancement2#

Other enhancements#

Notable bug fixes#

These are bug fixes that might have notable behavior changes.

Improved behavior in groupby for observed=False#

A number of bugs have been fixed due to improved handling of unobserved groups (GH 55738). All remarks in this section equally impact SeriesGroupBy.

In previous versions of pandas, a single grouping with DataFrameGroupBy.apply() or DataFrameGroupBy.agg() would pass the unobserved groups to the provided function, resulting in 0 below.

In [1]: df = pd.DataFrame(
   ...:     {
   ...:         "key1": pd.Categorical(list("aabb"), categories=list("abc")),
   ...:         "key2": [1, 1, 1, 2],
   ...:         "values": [1, 2, 3, 4],
   ...:     }
   ...: )
   ...: 

In [2]: df
Out[2]: 
  key1  key2  values
0    a     1       1
1    a     1       2
2    b     1       3
3    b     2       4

In [3]: gb = df.groupby("key1", observed=False)

In [4]: gb[["values"]].apply(lambda x: x.sum())
Out[4]: 
      values
key1        
a          3
b          7
c          0

However, this was not the case when using multiple groupings, resulting in NaN below.

In [1]: gb = df.groupby(["key1", "key2"], observed=False)
In [2]: gb[["values"]].apply(lambda x: x.sum())
Out[2]:
           values
key1 key2
a    1        3.0
     2        NaN
b    1        3.0
     2        4.0
c    1        NaN
     2        NaN

Now using multiple groupings will also pass the unobserved groups to the provided function.

In [5]: gb = df.groupby(["key1", "key2"], observed=False)

In [6]: gb[["values"]].apply(lambda x: x.sum())
Out[6]: 
           values
key1 key2        
a    1          3
     2          0
b    1          3
     2          4
c    1          0
     2          0

Similarly, these improvements also fixed certain bugs in groupby.

notable_bug_fix2#

Backwards incompatible API changes#

Increased minimum versions for dependencies#

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package    Minimum Version    Required    Changed
numpy      1.23.5             X           X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package                   New Minimum Version
fastparquet               2023.10.0
adbc-driver-postgresql    0.10.0
mypy (dev)                1.9.0

See Dependencies and Optional dependencies for more.

Other API changes#

  • 3rd party py.path objects are no longer explicitly supported in IO methods. Use pathlib.Path objects instead, as shown in the sketch after this list (GH 57091)

  • read_table()’s parse_dates argument defaults to None to improve consistency with read_csv() (GH 57476)

  • Made dtype a required argument in ExtensionArray._from_sequence_of_strings() (GH 56519)

  • Updated DataFrame.to_excel() so that the output spreadsheet has no styling. Custom styling can still be done using Styler.to_excel(), as shown in the sketch after this list (GH 54154)

  • pickle and HDF (.h5) files created with Python 2 are no longer explicitly supported (GH 57387)

  • pickled objects from pandas versions earlier than 1.0.0 are no longer supported (GH 57155)

  • When comparing the indexes in testing.assert_series_equal(), check_exact now defaults to True if the Index is of an integer dtype (GH 57386)

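A minimal sketch of the py.path and to_excel() changes above (the file names data.csv, plain.xlsx, and styled.xlsx are hypothetical, highlight_max() is just one illustrative Styler method, and writing Excel files requires the optional openpyxl and Jinja2 dependencies):

from pathlib import Path

import pandas as pd

# IO methods now expect standard-library pathlib.Path objects rather than
# third-party py.path objects.
df = pd.read_csv(Path("data.csv"))

# DataFrame.to_excel() now writes an unstyled sheet; styling is opted back
# into through the Styler interface.
df.to_excel("plain.xlsx")
df.style.highlight_max().to_excel("styled.xlsx")
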
Deprecations#

Copy keyword#

The copy keyword argument in the following methods is deprecated and will be removed in a future version:

Copy-on-Write utilizes a lazy copy mechanism that defers copying the data until necessary. Use .copy() to trigger an eager copy. The copy keyword has no effect starting with pandas 3.0, so it can be safely removed from your code.
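
A minimal sketch of the migration, assuming a method such as DataFrame.rename() that previously accepted the keyword (the column names here are hypothetical):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Previously: copy=False was passed to avoid an eager copy of the data.
# renamed = df.rename(columns={"a": "b"}, copy=False)

# Under Copy-on-Write the keyword has no effect, so it can simply be dropped;
# the data is only copied lazily if one of the objects is later modified.
renamed = df.rename(columns={"a": "b"})

# If an eager copy is genuinely needed, request it explicitly.
renamed_copy = df.rename(columns={"a": "b"}).copy()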

Other Deprecations#

Removal of prior version deprecations/changes#

Performance improvements#

Bug fixes#

Categorical#

Datetimelike#

  • Bug in Timestamp constructor failing to raise when tz=None is explicitly specified in conjunction with timezone-aware tzinfo or data (GH 48688); see the sketch after this list

  • Bug in date_range() where the last valid timestamp would sometimes not be produced (GH 56134)

  • Bug in date_range() where using a negative frequency value would not include all points between the start and end values (GH 56382)

  • Bug in tseries.api.guess_datetime_format() failing to infer the time format when "%Y" == "%H%M" (GH 57452)

  • Bug in setting scalar values with mismatched resolution into arrays with non-nanosecond datetime64, timedelta64 or DatetimeTZDtype incorrectly truncating those scalars (GH 56410)

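The Timestamp change above can be sketched as follows; the exact exception type raised for the contradictory arguments is an assumption here, so the example catches both TypeError and ValueError:

import datetime

import pandas as pd

# A timezone-aware input combined with an explicit tz=None is contradictory;
# the constructor is now expected to raise instead of silently ignoring tz=None.
aware = datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc)
try:
    pd.Timestamp(aware, tz=None)
except (TypeError, ValueError) as err:
    print(f"raised as expected: {err}")
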
Timedelta#

Timezones#

Numeric#

  • Bug in np.matmul with Index inputs raising a TypeError (GH 57079); a sketch follows below

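A minimal sketch of the np.matmul fix, assuming the operation now behaves like the equivalent NumPy arrays:

import numpy as np
import pandas as pd

idx = pd.Index([1, 2, 3])

# Previously this raised a TypeError; the Index now participates like an
# ordinary 1-D operand, so the result is the inner product.
print(np.matmul(idx, idx))
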
Conversion#

Strings#

Interval#

Indexing#

  • Bug in DataFrame.__getitem__() returning modified columns when called with slice in Python 3.12 (GH 57500)

Missing#

MultiIndex#

I/O#

Period#

Plotting#

Groupby/resample/rolling#

Reshaping#

Sparse#

ExtensionArray#

Styler#

Other#

Contributors#