What’s new in 2.2.0 (Month XX, 2024)#

These are the changes in pandas 2.2.0. See Release notes for a full changelog including other versions of pandas.

Enhancements#

ADBC Driver support in to_sql and read_sql#

read_sql() and to_sql() now work with Apache Arrow ADBC drivers. Compared to traditional drivers used via SQLAlchemy, ADBC drivers should provide significant performance improvements, better type support and cleaner nullability handling.

import adbc_driver_postgresql.dbapi as pg_dbapi

df = pd.DataFrame(
    [
        [1, 2, 3],
        [4, 5, 6],
    ],
    columns=['a', 'b', 'c']
)
uri = "postgresql://postgres:postgres@localhost/postgres"
with pg_dbapi.connect(uri) as conn:
    df.to_sql("pandas_table", conn, index=False)

# for roundtripping
with pg_dbapi.connect(uri) as conn:
    df2 = pd.read_sql("pandas_table", conn)

The Arrow type system offers a wider array of types that can more closely match what databases like PostgreSQL can offer. To illustrate, note this (non-exhaustive) listing of types available in different databases and pandas backends:

numpy/pandas

arrow

postgres

sqlite

int16/Int16

int16

SMALLINT

INTEGER

int32/Int32

int32

INTEGER

INTEGER

int64/Int64

int64

BIGINT

INTEGER

float32

float32

REAL

REAL

float64

float64

DOUBLE PRECISION

REAL

object

string

TEXT

TEXT

bool

bool_

BOOLEAN

datetime64[ns]

timestamp(us)

TIMESTAMP

datetime64[ns,tz]

timestamp(us,tz)

TIMESTAMPTZ

date32

DATE

month_day_nano_interval

INTERVAL

binary

BINARY

BLOB

decimal128

DECIMAL [1]

list

ARRAY [1]

struct

COMPOSITE TYPE

[1]

Footnotes

If you are interested in preserving database types as best as possible throughout the lifecycle of your DataFrame, users are encouraged to leverage the dtype_backend="pyarrow" argument of read_sql()

# for roundtripping
with pg_dbapi.connect(uri) as conn:
    df2 = pd.read_sql("pandas_table", conn, dtype_backend="pyarrow")

This will prevent your data from being converted to the traditional pandas/NumPy type system, which often converts SQL types in ways that make them impossible to round-trip.

For a full list of ADBC drivers and their development status, see the ADBC Driver Implementation Status documentation.

ExtensionArray.to_numpy converts to suitable NumPy dtype#

ExtensionArray.to_numpy() will now convert to a suitable NumPy dtype instead of object dtype for nullable extension dtypes.

Old behavior:

In [1]: ser = pd.Series([1, 2, 3], dtype="Int64")
In [2]: ser.to_numpy()
Out[2]: array([1, 2, 3], dtype=object)

New behavior:

In [1]: ser = pd.Series([1, 2, 3], dtype="Int64")

In [2]: ser.to_numpy()
Out[2]: array([1, 2, 3])

The default NumPy dtype (without any arguments) is determined as follows:

  • float dtypes are cast to NumPy floats

  • integer dtypes without missing values are cast to NumPy integer dtypes

  • integer dtypes with missing values are cast to NumPy float dtypes and NaN is used as missing value indicator

  • boolean dtypes without missing values are cast to NumPy bool dtype

  • boolean dtypes with missing values keep object dtype

Series.struct accessor for PyArrow structured data#

The Series.struct accessor provides attributes and methods for processing data with struct[pyarrow] dtype Series. For example, Series.struct.explode() converts PyArrow structured data to a pandas DataFrame. (GH 54938)

In [3]: import pyarrow as pa

In [4]: series = pd.Series(
   ...:     [
   ...:         {"project": "pandas", "version": "2.2.0"},
   ...:         {"project": "numpy", "version": "1.25.2"},
   ...:         {"project": "pyarrow", "version": "13.0.0"},
   ...:     ],
   ...:     dtype=pd.ArrowDtype(
   ...:         pa.struct([
   ...:             ("project", pa.string()),
   ...:             ("version", pa.string()),
   ...:         ])
   ...:     ),
   ...: )
   ...: 

In [5]: series.struct.explode()
Out[5]: 
   project version
0   pandas   2.2.0
1    numpy  1.25.2
2  pyarrow  13.0.0

Series.list accessor for PyArrow list data#

The Series.list accessor provides attributes and methods for processing data with list[pyarrow] dtype Series. For example, Series.list.__getitem__() allows indexing pyarrow lists in a Series. (GH 55323)

In [6]: import pyarrow as pa

In [7]: series = pd.Series(
   ...:     [
   ...:         [1, 2, 3],
   ...:         [4, 5],
   ...:         [6],
   ...:     ],
   ...:     dtype=pd.ArrowDtype(
   ...:         pa.list_(pa.int64())
   ...:     ),
   ...: )
   ...: 

In [8]: series.list[0]
Out[8]: 
0    1
1    4
2    6
dtype: int64[pyarrow]

Calamine engine for read_excel()#

The calamine engine was added to read_excel(). It uses python-calamine, which provides Python bindings for the Rust library calamine. This engine supports Excel files (.xlsx, .xlsm, .xls, .xlsb) and OpenDocument spreadsheets (.ods) (GH 50395).

There are two advantages of this engine:

  1. Calamine is often faster than other engines, some benchmarks show results up to 5x faster than ‘openpyxl’, 20x - ‘odf’, 4x - ‘pyxlsb’, and 1.5x - ‘xlrd’. But, ‘openpyxl’ and ‘pyxlsb’ are faster in reading a few rows from large files because of lazy iteration over rows.

  2. Calamine supports the recognition of datetime in .xlsb files, unlike ‘pyxlsb’ which is the only other engine in pandas that can read .xlsb files.

pd.read_excel("path_to_file.xlsb", engine="calamine")

For more, see Calamine (Excel and ODS files) in the user guide on IO tools.

Other enhancements#

Notable bug fixes#

These are bug fixes that might have notable behavior changes.

merge() and DataFrame.join() now consistently follow documented sort behavior#

In previous versions of pandas, merge() and DataFrame.join() did not always return a result that followed the documented sort behavior. pandas now follows the documented sort behavior in merge and join operations (GH 54611).

As documented, sort=True sorts the join keys lexicographically in the resulting DataFrame. With sort=False, the order of the join keys depends on the join type (how keyword):

  • how="left": preserve the order of the left keys

  • how="right": preserve the order of the right keys

  • how="inner": preserve the order of the left keys

  • how="outer": sort keys lexicographically

One example with changing behavior is inner joins with non-unique left join keys and sort=False:

In [9]: left = pd.DataFrame({"a": [1, 2, 1]})

In [10]: right = pd.DataFrame({"a": [1, 2]})

In [11]: result = pd.merge(left, right, how="inner", on="a", sort=False)

Old Behavior

In [5]: result
Out[5]:
   a
0  1
1  1
2  2

New Behavior

In [12]: result
Out[12]: 
   a
0  1
1  2
2  1

merge() and DataFrame.join() no longer reorder levels when levels differ#

In previous versions of pandas, merge() and DataFrame.join() would reorder index levels when joining on two indexes with different levels (GH 34133).

In [13]: left = pd.DataFrame({"left": 1}, index=pd.MultiIndex.from_tuples([("x", 1), ("x", 2)], names=["A", "B"]))

In [14]: right = pd.DataFrame({"right": 2}, index=pd.MultiIndex.from_tuples([(1, 1), (2, 2)], names=["B", "C"]))

In [15]: result = left.join(right)

Old Behavior

In [5]: result
Out[5]:
       left  right
B A C
1 x 1     1      2
2 x 2     1      2

New Behavior

In [16]: result
Out[16]: 
       left  right
A B C             
x 1 1     1      2
  2 2     1      2

Backwards incompatible API changes#

Increased minimum versions for dependencies#

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package

Minimum Version

Required

Changed

X

X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package

Minimum Version

Changed

X

See Dependencies and Optional dependencies for more.

Other API changes#

Deprecations#

Deprecate aliases M, Q, Y, etc. in favour of ME, QE, YE, etc. for offsets#

Deprecated the following frequency aliases (GH 9586):

offsets

deprecated aliases

new aliases

MonthEnd

M

ME

BusinessMonthEnd

BM

BME

SemiMonthEnd

SM

SME

CustomBusinessMonthEnd

CBM

CBME

QuarterEnd

Q

QE

BQuarterEnd

BQ

BQE

YearEnd

Y

YE

BYearEnd

BY

BYE

For example:

Previous behavior:

In [8]: pd.date_range('2020-01-01', periods=3, freq='Q-NOV')
Out[8]:
DatetimeIndex(['2020-02-29', '2020-05-31', '2020-08-31'],
              dtype='datetime64[ns]', freq='Q-NOV')

Future behavior:

In [17]: pd.date_range('2020-01-01', periods=3, freq='QE-NOV')
Out[17]: DatetimeIndex(['2020-02-29', '2020-05-31', '2020-08-31'], dtype='datetime64[ns]', freq='QE-NOV')

Deprecated automatic downcasting#

Deprecated the automatic downcasting of object dtype results in a number of methods. These would silently change the dtype in a hard to predict manner since the behavior was value dependent. Additionally, pandas is moving away from silent dtype changes (GH 54710, GH 54261).

These methods are:

Explicitly call DataFrame.infer_objects() to replicate the current behavior in the future.

result = result.infer_objects(copy=False)

Set the following option to opt into the future behavior:

In [9]: pd.set_option("future.no_silent_downcasting", True)

Other Deprecations#

Performance improvements#

Bug fixes#

Categorical#

  • Categorical.isin() raising InvalidIndexError for categorical containing overlapping Interval values (GH 34974)

  • Bug in CategoricalDtype.__eq__() returning false for unordered categorical data with mixed types (GH 55468)

Datetimelike#

  • Bug in DatetimeIndex construction when passing both a tz and either dayfirst or yearfirst ignoring dayfirst/yearfirst (GH 55813)

  • Bug in DatetimeIndex when passing an object-dtype ndarray of float objects and a tz incorrectly localizing the result (GH 55780)

  • Bug in concat() raising AttributeError when concatenating all-NA DataFrame with DatetimeTZDtype dtype DataFrame. (GH 52093)

  • Bug in testing.assert_extension_array_equal() that could use the wrong unit when comparing resolutions (GH 55730)

  • Bug in to_datetime() and DatetimeIndex when passing a list of mixed-string-and-numeric types incorrectly raising (GH 55780)

  • Bug in to_datetime() and DatetimeIndex when passing mixed-type objects with a mix of timezones or mix of timezone-awareness failing to raise ValueError (GH 55693)

  • Bug in DatetimeIndex.shift() with non-nanosecond resolution incorrectly returning with nanosecond resolution (GH 56117)

  • Bug in DatetimeIndex.union() returning object dtype for tz-aware indexes with the same timezone but different units (GH 55238)

  • Bug in Index.is_monotonic_increasing() and Index.is_monotonic_decreasing() always caching Index.is_unique() as True when first value in index is NaT (GH 55755)

  • Bug in Index.view() to a datetime64 dtype with non-supported resolution incorrectly raising (GH 55710)

  • Bug in Series.dt.round() with non-nanosecond resolution and NaT entries incorrectly raising OverflowError (GH 56158)

  • Bug in Tick.delta() with very large ticks raising OverflowError instead of OutOfBoundsTimedelta (GH 55503)

  • Bug in Timestamp.unit() being inferred incorrectly from an ISO8601 format string with minute or hour resolution and a timezone offset (GH 56208)

  • Bug in .astype converting from a higher-resolution datetime64 dtype to a lower-resolution datetime64 dtype (e.g. datetime64[us]->datetim64[ms]) silently overflowing with values near the lower implementation bound (GH 55979)

  • Bug in adding or subtracting a Week offset to a datetime64 Series, Index, or DataFrame column with non-nanosecond resolution returning incorrect results (GH 55583)

  • Bug in addition or subtraction of BusinessDay offset with offset attribute to non-nanosecond Index, Series, or DataFrame column giving incorrect results (GH 55608)

  • Bug in addition or subtraction of DateOffset objects with microsecond components to datetime64 Index, Series, or DataFrame columns with non-nanosecond resolution (GH 55595)

  • Bug in addition or subtraction of very large Tick objects with Timestamp or Timedelta objects raising OverflowError instead of OutOfBoundsTimedelta (GH 55503)

  • Bug in creating a Index, Series, or DataFrame with a non-nanosecond DatetimeTZDtype and inputs that would be out of bounds with nanosecond resolution incorrectly raising OutOfBoundsDatetime (GH 54620)

  • Bug in creating a Index, Series, or DataFrame with a non-nanosecond datetime64 (or DatetimeTZDtype) from mixed-numeric inputs treating those as nanoseconds instead of as multiples of the dtype’s unit (which would happen with non-mixed numeric inputs) (GH 56004)

  • Bug in creating a Index, Series, or DataFrame with a non-nanosecond datetime64 dtype and inputs that would be out of bounds for a datetime64[ns] incorrectly raising OutOfBoundsDatetime (GH 55756)

  • Bug in parsing datetime strings with nanosecond resolution with non-ISO8601 formats incorrectly truncating sub-microsecond components (GH 56051)

  • Bug in parsing datetime strings with sub-second resolution and trailing zeros incorrectly inferring second or millisecond resolution (GH 55737)

  • Bug in the results of to_datetime() with an floating-dtype argument with unit not matching the pointwise results of Timestamp (GH 56037)

Timedelta#

  • Bug in Timedelta construction raising OverflowError instead of OutOfBoundsTimedelta (GH 55503)

  • Bug in rendering (__repr__) of TimedeltaIndex and Series with timedelta64 values with non-nanosecond resolution entries that are all multiples of 24 hours failing to use the compact representation used in the nanosecond cases (GH 55405)

Timezones#

  • Bug in AbstractHolidayCalendar where timezone data was not propagated when computing holiday observances (GH 54580)

  • Bug in Timestamp construction with an ambiguous value and a pytz timezone failing to raise pytz.AmbiguousTimeError (GH 55657)

  • Bug in Timestamp.tz_localize() with nonexistent="shift_forward around UTC+0 during DST (GH 51501)

Numeric#

Conversion#

Strings#

Interval#

Indexing#

Missing#

MultiIndex#

I/O#

Period#

  • Bug in PeriodIndex construction when more than one of data, ordinal and **fields are passed failing to raise ValueError (GH 55961)

  • Bug in Period addition silently wrapping around instead of raising OverflowError (GH 55503)

  • Bug in casting from PeriodDtype with astype to datetime64 or DatetimeTZDtype with non-nanosecond unit incorrectly returning with nanosecond unit (GH 55958)

Plotting#

Groupby/resample/rolling#

Reshaping#

Sparse#

  • Bug in SparseArray.take() when using a different fill value than the array’s fill value (GH 55181)

ExtensionArray#

Styler#

Other#

Contributors#