What’s new in 1.5.0 (??)

These are the changes in pandas 1.5.0. See Release notes for a full changelog including other versions of pandas.


DataFrame exchange protocol implementation

pandas now implements the DataFrame exchange API spec. See the full details on the API at https://data-apis.org/dataframe-protocol/latest/index.html

The protocol consists of two parts:

  • New method DataFrame.__dataframe__(), which produces the exchange object. It effectively “exports” the pandas DataFrame as an exchange object, so any other library that implements the protocol can “import” that DataFrame without knowing anything about the producer except that it makes an exchange object.

  • New function pandas.api.exchange.from_dataframe(), which can take an arbitrary exchange object from any conformant library and construct a pandas DataFrame out of it.
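The two halves of the protocol can be exercised in a short round trip. This is a minimal sketch, not a definitive recipe: the sample frame is made up, and the fallback import is an assumption covering released pandas versions, which expose the same function as pandas.api.interchange.from_dataframe rather than under the pandas.api.exchange name used above.

```python
import pandas as pd

try:
    # Module name as written in these notes.
    from pandas.api.exchange import from_dataframe
except ImportError:
    # Assumption: released pandas versions ship the function here instead.
    from pandas.api.interchange import from_dataframe

# Hypothetical sample data for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

# "Export": the producer hands out an exchange object.
exchange_obj = df.__dataframe__()
n_rows = exchange_obj.num_rows()
names = list(exchange_obj.column_names())

# "Import": the consumer rebuilds a pandas DataFrame from the exchange object.
roundtripped = from_dataframe(exchange_obj)
```

The consumer only relies on the exchange object's own methods (such as num_rows() and column_names()), which is what allows it to work with any conformant producer.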


Styler

The most notable development is the new method Styler.concat(), which allows adding customised footer rows to visualise additional calculations on the data, e.g. totals and counts (GH43875, GH46186).

Additionally there is an alternative output method Styler.to_string(), which allows using the Styler’s formatting methods to create, for example, CSVs (GH44502).

Minor feature improvements are:

  • Adding the ability to render border and border-{side} CSS properties in Excel (GH42276)

  • Making keyword arguments consistent: Styler.highlight_null() now accepts color and deprecates null_color, although this remains backwards compatible (GH45907)

Control of index with group_keys in DataFrame.resample()

The argument group_keys has been added to the method DataFrame.resample(). As with DataFrame.groupby(), this argument controls whether each group is added to the index in the resample when Resampler.apply() is used.


Not specifying the group_keys argument will retain the previous behavior and emit a warning if the result will change by specifying group_keys=False. In a future version of pandas, not specifying group_keys will default to the same behavior as group_keys=False.

In [1]: df = pd.DataFrame(
   ...:     {'a': range(6)},
   ...:     index=pd.date_range("2021-01-01", periods=6, freq="8H")
   ...: )

In [2]: df.resample("D", group_keys=True).apply(lambda x: x)
2021-01-01 2021-01-01 00:00:00  0
           2021-01-01 08:00:00  1
           2021-01-01 16:00:00  2
2021-01-02 2021-01-02 00:00:00  3
           2021-01-02 08:00:00  4
           2021-01-02 16:00:00  5

[6 rows x 1 columns]

In [3]: df.resample("D", group_keys=False).apply(lambda x: x)
2021-01-01 00:00:00  0
2021-01-01 08:00:00  1
2021-01-01 16:00:00  2
2021-01-02 00:00:00  3
2021-01-02 08:00:00  4
2021-01-02 16:00:00  5

[6 rows x 1 columns]

Previously, the resulting index would depend upon the values returned by apply, as seen in the following example.

In [1]: # pandas 1.3
In [2]: df.resample("D").apply(lambda x: x)
2021-01-01 00:00:00  0
2021-01-01 08:00:00  1
2021-01-01 16:00:00  2
2021-01-02 00:00:00  3
2021-01-02 08:00:00  4
2021-01-02 16:00:00  5

In [3]: df.resample("D").apply(lambda x: x.reset_index())
                           index  a
2021-01-01 0 2021-01-01 00:00:00  0
           1 2021-01-01 08:00:00  1
           2 2021-01-01 16:00:00  2
2021-01-02 0 2021-01-02 00:00:00  3
           1 2021-01-02 08:00:00  4
           2 2021-01-02 16:00:00  5

Reading directly from TAR archives

I/O methods like read_csv() or DataFrame.to_json() now allow reading and writing directly on TAR archives (GH44787).

df = pd.read_csv("./movement.tar.gz")
# ...

This supports .tar, .tar.gz, .tar.bz2 and .tar.xz archives. The compression method is inferred from the filename. If it cannot be inferred, use the compression argument:

df = pd.read_csv(some_file_obj, compression={"method": "tar", "mode": "r:gz"})

(mode being one of tarfile.open’s modes: https://docs.python.org/3/library/tarfile.html#tarfile.open)
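As a rough illustration of both paths, the sketch below writes a small frame to a .tar.gz archive in a temporary directory and reads it back, once relying on filename inference and once passing the compression argument explicitly. The file name and sample data are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movement.tar.gz")

    # Compression is inferred from the ".tar.gz" suffix when writing...
    df.to_csv(path, index=False)

    # ...and when reading.
    inferred = pd.read_csv(path)

    # The same read with the compression method spelled out.
    explicit = pd.read_csv(
        path, compression={"method": "tar", "mode": "r:gz"}
    )
```

Spelling out the method is mainly useful for file objects or paths whose names carry no recognizable suffix.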

Other enhancements

Notable bug fixes

These are bug fixes that might have notable behavior changes.

Using dropna=True with groupby transforms

A transform is an operation whose result has the same size as its input. When the result is a DataFrame or Series, it is also required that the index of the result matches that of the input. In pandas 1.4, using DataFrameGroupBy.transform() or SeriesGroupBy.transform() with null values in the groups and dropna=True gave incorrect results. As the examples below demonstrate, the incorrect results either contained incorrect values or did not have the same index as the input.

In [4]: df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [2, 3, 4]})

Old behavior:

In [3]: # Value in the last row should be np.nan
        df.groupby('a', dropna=True).transform('sum')
0  5
1  5
2  5

In [3]: # Should have one additional row with the value np.nan
        df.groupby('a', dropna=True).transform(lambda x: x.sum())
0  5
1  5

In [3]: # The value in the last row is np.nan interpreted as an integer
        df.groupby('a', dropna=True).transform('ffill')
0                    2
1                    3
2 -9223372036854775808

In [3]: # Should have one additional row with the value np.nan
        df.groupby('a', dropna=True).transform(lambda x: x)
0  2
1  3

New behavior:

In [5]: df.groupby('a', dropna=True).transform('sum')
0  5.0
1  5.0
2  NaN

[3 rows x 1 columns]

In [6]: df.groupby('a', dropna=True).transform(lambda x: x.sum())
0  5.0
1  5.0
2  NaN

[3 rows x 1 columns]

In [7]: df.groupby('a', dropna=True).transform('ffill')
0  2.0
1  3.0
2  NaN

[3 rows x 1 columns]

In [8]: df.groupby('a', dropna=True).transform(lambda x: x)
0  2.0
1  3.0
2  NaN

[3 rows x 1 columns]

Serializing tz-naive Timestamps with to_json() with date_format="iso"

DataFrame.to_json(), Series.to_json(), and Index.to_json() would incorrectly localize DatetimeArrays/DatetimeIndexes with tz-naive Timestamps to UTC. (GH38760)

Note that this patch does not fix the localization of tz-aware Timestamps to UTC upon serialization. (Related issue GH12997)

Old Behavior

In [9]: index = pd.date_range(
   ...:     start='2020-12-28 00:00:00',
   ...:     end='2020-12-28 02:00:00',
   ...:     freq='1H',
   ...: )

In [10]: a = pd.Series(
   ....:     data=range(3),
   ....:     index=index,
   ....: )
In [4]: a.to_json(date_format='iso')
Out[4]: '{"2020-12-28T00:00:00.000Z":0,"2020-12-28T01:00:00.000Z":1,"2020-12-28T02:00:00.000Z":2}'

In [5]: pd.read_json(a.to_json(date_format='iso'), typ="series").index == a.index
Out[5]: array([False, False, False])

New Behavior

In [11]: a.to_json(date_format='iso')
Out[11]: '{"2020-12-28T00:00:00.000":0,"2020-12-28T01:00:00.000":1,"2020-12-28T02:00:00.000":2}'

# Roundtripping now works
In [12]: pd.read_json(a.to_json(date_format='iso'), typ="series").index == a.index
Out[12]: array([ True,  True,  True])

Backwards incompatible API changes

read_xml now supports dtype, converters, and parse_dates

Similar to other IO methods, pandas.read_xml() now supports assigning specific dtypes to columns, applying converter methods, and parsing dates (GH43567).

In [13]: xml_dates = """<?xml version='1.0' encoding='utf-8'?>
   ....: <data>
   ....:   <row>
   ....:     <shape>square</shape>
   ....:     <degrees>00360</degrees>
   ....:     <sides>4.0</sides>
   ....:     <date>2020-01-01</date>
   ....:    </row>
   ....:   <row>
   ....:     <shape>circle</shape>
   ....:     <degrees>00360</degrees>
   ....:     <sides/>
   ....:     <date>2021-01-01</date>
   ....:   </row>
   ....:   <row>
   ....:     <shape>triangle</shape>
   ....:     <degrees>00180</degrees>
   ....:     <sides>3.0</sides>
   ....:     <date>2022-01-01</date>
   ....:   </row>
   ....: </data>"""

In [14]: df = pd.read_xml(
   ....:     xml_dates,
   ....:     dtype={'sides': 'Int64'},
   ....:     converters={'degrees': str},
   ....:     parse_dates=['date']
   ....: )

In [15]: df
      shape degrees  sides       date
0    square   00360      4 2020-01-01
1    circle   00360   <NA> 2021-01-01
2  triangle   00180      3 2022-01-01

[3 rows x 4 columns]

In [16]: df.dtypes
shape              object
degrees            object
sides               Int64
date       datetime64[ns]
Length: 4, dtype: object

read_xml now supports large XML using iterparse

For very large XML files that can range from hundreds of megabytes to gigabytes, pandas.read_xml() now supports parsing such sizeable files using lxml’s iterparse and etree’s iterparse, which are memory-efficient methods to iterate through XML trees and extract specific elements and attributes without holding the entire tree in memory (GH45442).

In [1]: df = pd.read_xml(
...      "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
...      iterparse = {"page": ["title", "ns", "id"]}
...  )

In [2]: df
                                                     title   ns        id
0                                       Gettysburg Address    0     21450
1                                                Main Page    0     42950
2                            Declaration by United Nations    0      8435
3             Constitution of the United States of America    0      8435
4                     Declaration of Independence (Israel)    0     17858
...                                                    ...  ...       ...
3578760               Page:Black cat 1897 07 v2 n10.pdf/17  104    219649
3578761               Page:Black cat 1897 07 v2 n10.pdf/43  104    219649
3578762               Page:Black cat 1897 07 v2 n10.pdf/44  104    219649
3578763      The History of Tom Jones, a Foundling/Book IX    0  12084291
3578764  Page:Shakespeare of Stratford (1926) Yale.djvu/91  104     21450

[3578765 rows x 3 columns]
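The same call can be tried on a tiny file without downloading a large dump. The sketch below is illustrative only: the XML content and file name are made up, and parser="etree" is chosen here so that only the standard library parser is needed. iterparse expects a path to a file on disk, so the sample is written to a temporary directory first.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of a <page>-structured XML dump.
xml = """<?xml version='1.0' encoding='utf-8'?>
<mediawiki>
  <page><title>Gettysburg Address</title><ns>0</ns><id>21450</id></page>
  <page><title>Main Page</title><ns>0</ns><id>42950</id></page>
</mediawiki>"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pages.xml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(xml)

    # Key: the repeating row element; value: the child elements or
    # attributes to extract for each row.
    pages = pd.read_xml(
        path,
        iterparse={"page": ["title", "ns", "id"]},
        parser="etree",
    )
```

Each `<page>` element becomes one row, and only the listed children are retained.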


Increased minimum versions for dependencies

Some minimum supported versions of dependencies were updated. If installed, we now require:


Minimum Version

mypy (dev)

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.


Minimum Version



See Dependencies and Optional dependencies for more.

Other API changes


In a future version, integer slicing on a Series with an Int64Index or RangeIndex will be treated as label-based, not positional. This will make the behavior consistent with other Series.__getitem__() and Series.__setitem__() behaviors (GH45162).

For example:

In [17]: ser = pd.Series([1, 2, 3, 4, 5], index=[2, 3, 5, 7, 11])

In the old behavior, ser[2:4] treats the slice as positional:

Old behavior:

In [3]: ser[2:4]
5    3
7    4
dtype: int64

In a future version, this will be treated as label-based:

Future behavior:

In [4]: ser.loc[2:4]
2    1
3    2
dtype: int64

To retain the old behavior, use series.iloc[i:j]. To get the future behavior, use series.loc[i:j].

Slicing on a DataFrame will not be affected.
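Writing the intent explicitly avoids depending on how ser[i:j] is interpreted. A minimal sketch using the same Series as above; the variable names are illustrative:

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5], index=[2, 3, 5, 7, 11])

# Positional: the elements at positions 2 and 3 (labels 5 and 7).
positional = ser.iloc[2:4]

# Label-based: every label between 2 and 4 inclusive (labels 2 and 3).
label_based = ser.loc[2:4]
```

Both spellings are unambiguous regardless of the index type, so code that uses them does not need to change when the deprecation is enforced.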

ExcelWriter attributes

All attributes of ExcelWriter were previously documented as not public. However some third party Excel engines documented accessing ExcelWriter.book or ExcelWriter.sheets, and users were utilizing these and possibly other attributes. Previously these attributes were not safe to use; e.g. modifications to ExcelWriter.book would not update ExcelWriter.sheets and conversely. In order to support this, pandas has made some attributes public and improved their implementations so that they may now be safely used. (GH45572)

The following attributes are now public and considered safe to access.

  • book

  • check_extension

  • close

  • date_format

  • datetime_format

  • engine

  • if_sheet_exists

  • sheets

  • supported_extensions

The following attributes have been deprecated. They now raise a FutureWarning when accessed and will be removed in a future version. Users should be aware that their usage is considered unsafe, and can lead to unexpected results.

  • cur_sheet

  • handles

  • path

  • save

  • write_cells

See the documentation of ExcelWriter for further details.

Using group_keys with transformers in GroupBy.apply()

In previous versions of pandas, if it was inferred that the function passed to GroupBy.apply() was a transformer (i.e. the resulting index was equal to the input index), the group_keys argument of DataFrame.groupby() and Series.groupby() was ignored and the group keys would never be added to the index of the result. In the future, the group keys will be added to the index when the user specifies group_keys=True.

As group_keys=True is the default value of DataFrame.groupby() and Series.groupby(), not specifying group_keys with a transformer will raise a FutureWarning. This can be silenced and the previous behavior retained by specifying group_keys=False.
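The difference is easiest to see by passing group_keys explicitly in both forms. This is a sketch under the assumption that the installed pandas already honors an explicit group_keys=True for transformers; the frame and column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

# A transformer: the result of each group keeps that group's index.
def double(group):
    return group * 2

# Explicit group_keys=True: the group key becomes an extra index level.
with_keys = df.groupby("key", group_keys=True)[["val"]].apply(double)

# Explicit group_keys=False: the original index is kept as-is.
without_keys = df.groupby("key", group_keys=False)[["val"]].apply(double)
```

Specifying the argument either way makes the resulting index independent of the default.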

Try operating inplace when setting values with loc and iloc

Most of the time setting values with frame.iloc attempts to set values in-place, only falling back to inserting a new array if necessary. In the past, setting entire columns has been an exception to this rule:

In [18]: values = np.arange(4).reshape(2, 2)

In [19]: df = pd.DataFrame(values)

In [20]: ser = df[0]

Old behavior:

In [3]: df.iloc[:, 0] = np.array([10, 11])
In [4]: ser
0    0
1    2
Name: 0, dtype: int64

This behavior is deprecated. In a future version, setting an entire column with iloc will attempt to operate inplace.

Future behavior:

In [3]: df.iloc[:, 0] = np.array([10, 11])
In [4]: ser
0    10
1    11
Name: 0, dtype: int64

To get the old behavior, use DataFrame.__setitem__() directly:

Future behavior:

In [5]: df[0] = np.array([21, 31])
In [4]: ser
0    10
1    11
Name: 0, dtype: int64

In the case where df.columns is not unique, use DataFrame.isetitem():

Future behavior:

In [5]: df.columns = ["A", "A"]
In [5]: df.isetitem(0, np.array([21, 31]))
In [4]: ser
0    10
1    11
Name: 0, dtype: int64

numeric_only default value

Across the DataFrame and DataFrameGroupBy operations such as min, sum, and idxmax, the default value of the numeric_only argument, if it exists at all, was inconsistent. Furthermore, operations with the default value None can lead to surprising results. (GH46560)

In [1]: df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

In [2]: # Reading the next line without knowing the contents of df, one would
        # expect the result to contain the products for both columns a and b.
        df[["a", "b"]].prod()
a    2
dtype: int64

To avoid this behavior, specifying numeric_only=None has been deprecated, and will be removed in a future version of pandas. In the future, all operations with a numeric_only argument will default to False. Users should either call the operation only with columns that can be operated on, or specify numeric_only=True to operate only on Boolean, integer, and float columns.
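Both recommended spellings can be sketched with the frame from the example above; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Option 1: select only the columns the operation applies to.
by_selection = df[["a"]].prod()

# Option 2: ask explicitly for numeric columns only.
by_flag = df.prod(numeric_only=True)
```

Either form states which columns take part, so the result does not depend on the default value of numeric_only.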

In order to support the transition to the new behavior, the following methods have gained the numeric_only argument.

Other Deprecations

Performance improvements

Bug fixes


  • Bug in Categorical.view() not accepting integer dtypes (GH25464)

  • Bug in CategoricalIndex.union() when the index’s categories are integer-dtype and the index contains NaN values incorrectly raising instead of casting to float64 (GH45362)


  • Bug in DataFrame.quantile() with datetime-like dtypes and no rows incorrectly returning float64 dtype instead of retaining datetime-like dtype (GH41544)

  • Bug in to_datetime() with sequences of np.str_ objects incorrectly raising (GH32264)

  • Bug in Timestamp construction when passing datetime components as positional arguments and tzinfo as a keyword argument incorrectly raising (GH31929)

  • Bug in Index.astype() when casting from object dtype to timedelta64[ns] dtype incorrectly casting np.datetime64("NaT") values to np.timedelta64("NaT") instead of raising (GH45722)

  • Bug in SeriesGroupBy.value_counts() index when passing categorical column (GH44324)

  • Bug in DatetimeIndex.tz_localize() localizing to UTC failing to make a copy of the underlying data (GH46460)

  • Bug in DatetimeIndex.resolution() incorrectly returning “day” instead of “nanosecond” for nanosecond-resolution indexes (GH46903)


  • Bug in astype_nansafe() where astype("timedelta64[ns]") fails when np.nan is included (GH45798)

  • Bug in constructing a Timedelta with a np.timedelta64 object and a unit sometimes silently overflowing and returning incorrect results instead of raising OutOfBoundsTimedelta (GH46827)

Time Zones

  • Bug in Timestamp constructor raising when passed a ZoneInfo tzinfo object (GH46425)


  • Bug in operations with array-likes with dtype="boolean" and NA incorrectly altering the array in-place (GH45421)

  • Bug in division, pow and mod operations on array-likes with dtype="boolean" not being like their np.bool_ counterparts (GH46063)

  • Bug in multiplying a Series with IntegerDtype or FloatingDtype by an array-like with timedelta64[ns] dtype incorrectly raising (GH45622)


  • Bug in DataFrame.astype() not preserving subclasses (GH40810)

  • Bug in constructing a Series from a float-containing list or a floating-dtype ndarray-like (e.g. dask.Array) and an integer dtype raising instead of casting like we would with an np.ndarray (GH40110)

  • Bug in Float64Index.astype() to unsigned integer dtype incorrectly casting to np.int64 dtype (GH45309)

  • Bug in Series.astype() and DataFrame.astype() from floating dtype to unsigned integer dtype failing to raise in the presence of negative values (GH45151)

  • Bug in array() with FloatingDtype and values containing float-castable strings incorrectly raising (GH45424)

  • Bug when comparing string and datetime64[ns] objects raising OverflowError (GH45506)

  • Bug in metaclass of generic abstract dtypes causing DataFrame.apply() and Series.apply() to raise for the built-in function type (GH46684)

  • Bug in DataFrame.to_dict() for orient="list" or orient="index" not returning native types (GH46751)



  • Bug in IntervalArray.__setitem__() when setting np.nan into an integer-backed array raising ValueError instead of TypeError (GH45484)


  • Bug in loc.__getitem__() with a list of keys causing an internal inconsistency that could lead to a disconnect between frame.at[x, y] vs frame[y].loc[x] (GH22372)

  • Bug in DataFrame.iloc() where indexing a single row on a DataFrame with a single ExtensionDtype column gave a copy instead of a view on the underlying data (GH45241)

  • Bug in Series.align() not creating a MultiIndex with the union of levels when the intersections of both MultiIndexes are identical (GH45224)

  • Bug in setting a NA value (None or np.nan) into a Series with int-based IntervalDtype incorrectly casting to object dtype instead of a float-based IntervalDtype (GH45568)

  • Bug in indexing setting values into an ExtensionDtype column with df.iloc[:, i] = values with values having the same dtype as df.iloc[:, i] incorrectly inserting a new array instead of setting in-place (GH33457)

  • Bug in Series.__setitem__() with a non-integer Index when using an integer key to set a value that cannot be set inplace where a ValueError was raised instead of casting to a common dtype (GH45070)

  • Bug in Series.__setitem__() when setting incompatible values into a PeriodDtype or IntervalDtype Series raising when indexing with a boolean mask but coercing when indexing with otherwise-equivalent indexers; these now consistently coerce, along with Series.mask() and Series.where() (GH45768)

  • Bug in DataFrame.where() with multiple columns with datetime-like dtypes failing to downcast results consistent with other dtypes (GH45837)

  • Bug in Series.loc.__setitem__() and Series.loc.__getitem__() not raising when using multiple keys without using a MultiIndex (GH13831)

  • Bug in Index.reindex() raising AssertionError when level was specified but no MultiIndex was given; level is ignored now (GH35132)

  • Bug when setting a value too large for a Series dtype failing to coerce to a common type (GH26049, GH32878)

  • Bug in loc.__setitem__() treating range keys as positional instead of label-based (GH45479)

  • Bug in Series.__setitem__() when setting boolean dtype values containing NA incorrectly raising instead of casting to boolean dtype (GH45462)

  • Bug in Series.__setitem__() where setting NA into a numeric-dtype Series would incorrectly upcast to object-dtype rather than treating the value as np.nan (GH44199)

  • Bug in Series.__setitem__() with datetime64[ns] dtype, an all-False boolean mask, and an incompatible value incorrectly casting to object instead of retaining datetime64[ns] dtype (GH45967)

  • Bug in Index.__getitem__() raising ValueError when indexer is from boolean dtype with NA (GH45806)

  • Bug in Series.mask() with inplace=True or setting values with a boolean mask with small integer dtypes incorrectly raising (GH45750)

  • Bug in DataFrame.mask() with inplace=True and ExtensionDtype columns incorrectly raising (GH45577)

  • Bug in getting a column from a DataFrame with an object-dtype row index with datetime-like values: the resulting Series now preserves the exact object-dtype Index from the parent DataFrame (GH42950)

  • Bug in DataFrame.__getattribute__() raising AttributeError if columns have "string" dtype (GH46185)

  • Bug in indexing on a DatetimeIndex with a np.str_ key incorrectly raising (GH45580)

  • Bug in CategoricalIndex.get_indexer() when index contains NaN values, resulting in elements that are in target but not present in the index to be mapped to the index of the NaN element, instead of -1 (GH45361)

  • Bug in setting large integer values into Series with float32 or float16 dtype incorrectly altering these values instead of coercing to float64 dtype (GH45844)

  • Bug in Series.asof() and DataFrame.asof() incorrectly casting bool-dtype results to float64 dtype (GH16063)

  • Bug in NDFrame.xs(), DataFrame.iterrows(), DataFrame.loc() and DataFrame.iloc() not always propagating metadata (GH28283)







  • Bug in DataFrame.resample() ignoring closed="right" on TimedeltaIndex (GH45414)

  • Bug in DataFrameGroupBy.transform() failing when func="size" and the input DataFrame has multiple columns (GH27469)

  • Bug in DataFrameGroupBy.size() and DataFrameGroupBy.transform() with func="size" produced incorrect results when axis=1 (GH45715)

  • Bug in ExponentialMovingWindow.mean() with axis=1 and engine='numba' when the DataFrame has more columns than rows (GH46086)

  • Bug when using engine="numba" would return the same jitted function when modifying engine_kwargs (GH46086)

  • Bug in DataFrameGroupBy.transform() failing when axis=1 and func is "first" or "last" (GH45986)

  • Bug in DataFrameGroupBy.cumsum() with skipna=False giving incorrect results (GH46216)

  • Bug in GroupBy.cumsum() with timedelta64[ns] dtype failing to recognize NaT as a null value (GH46216)

  • Bug in GroupBy.cummin() and GroupBy.cummax() with nullable dtypes incorrectly altering the original data in place (GH46220)

  • Bug in GroupBy.cummax() with int64 dtype with leading value being the smallest possible int64 (GH46382)

  • Bug in GroupBy.max() with empty groups and uint64 dtype incorrectly raising RuntimeError (GH46408)

  • Bug in GroupBy.apply() would fail when func was a string and args or kwargs were supplied (GH46479)

  • Bug in SeriesGroupBy.apply() would incorrectly name its result when there was a unique group (GH46369)

  • Bug in Rolling.sum() and Rolling.mean() would give incorrect result with window of same values (GH42064, GH46431)

  • Bug in Rolling.var() and Rolling.std() would give non-zero result with window of same values (GH42064)

  • Bug in Rolling.skew() and Rolling.kurt() would give NaN with window of same values (GH30993)

  • Bug in Rolling.var() would segfault calculating weighted variance when window size was larger than data size (GH46760)

  • Bug in Grouper.__repr__() where dropna was not included. Now it is (GH46754)

  • Bug in DataFrame.rolling() raising ValueError when center=True, axis=1 and win_type is specified (GH46135)

  • Bug in DataFrameGroupBy.describe() and SeriesGroupBy.describe() producing inconsistent results for empty datasets (GH41575)




  • Bug in IntegerArray.searchsorted() and FloatingArray.searchsorted() returning inconsistent results when acting on np.nan (GH45255)


  • Bug when attempting to apply styling functions to an empty DataFrame subset (GH45313)

  • Bug in CSSToExcelConverter leading to TypeError when border color provided without border style for xlsxwriter engine (GH42276)