What’s new in 1.5.0 (September 19, 2022)#

These are the changes in pandas 1.5.0. See Release notes for a full changelog including other versions of pandas.

Enhancements#

pandas-stubs#

The pandas-stubs library is now supported by the pandas development team, providing type stubs for the pandas API. Please visit https://github.com/pandas-dev/pandas-stubs for more information.

We thank VirtusLab and Microsoft for their initial, significant contributions to pandas-stubs.

Native PyArrow-backed ExtensionArray#

With PyArrow installed, users can now create pandas objects that are backed by a pyarrow.ChunkedArray and a pyarrow.DataType.

The dtype argument can accept a string of a pyarrow data type with pyarrow in brackets, e.g. "int64[pyarrow]", or, for pyarrow data types that take parameters, an ArrowDtype initialized with a pyarrow.DataType.

In [1]: import pyarrow as pa

In [2]: ser_float = pd.Series([1.0, 2.0, None], dtype="float32[pyarrow]")

In [3]: ser_float
Out[3]: 
0     1.0
1     2.0
2    <NA>
dtype: float[pyarrow]

In [4]: list_of_int_type = pd.ArrowDtype(pa.list_(pa.int64()))

In [5]: ser_list = pd.Series([[1, 2], [3, None]], dtype=list_of_int_type)

In [6]: ser_list
Out[6]: 
0      [1. 2.]
1    [ 3. nan]
dtype: list<item: int64>[pyarrow]

In [7]: ser_list.take([1, 0])
Out[7]: 
1    [ 3. nan]
0      [1. 2.]
dtype: list<item: int64>[pyarrow]

In [8]: ser_float * 5
Out[8]: 
0     5.0
1    10.0
2    <NA>
dtype: float[pyarrow]

In [9]: ser_float.mean()
Out[9]: 1.5

In [10]: ser_float.dropna()
Out[10]: 
0    1.0
1    2.0
dtype: float[pyarrow]

Most operations are supported and have been implemented using pyarrow compute functions. We recommend installing the latest version of PyArrow to access the most recently implemented compute functions.

Warning

This feature is experimental, and the API can change in a future release without warning.

DataFrame interchange protocol implementation#

pandas now implements the DataFrame interchange API spec. See the full details on the API at https://data-apis.org/dataframe-protocol/latest/index.html

The protocol consists of two parts:

  • New method DataFrame.__dataframe__() which produces the interchange object. It effectively “exports” the pandas dataframe as an interchange object so any other library which has the protocol implemented can “import” that dataframe without knowing anything about the producer except that it makes an interchange object.

  • New function pandas.api.interchange.from_dataframe() which can take an arbitrary interchange object from any conformant library and construct a pandas DataFrame out of it.
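As a rough illustration of the two parts above, the following sketch exports a small frame through DataFrame.__dataframe__() and rebuilds it with from_dataframe(). The frame contents are made up for the example. pandas acts as both producer and consumer here only to keep the snippet self-contained; in practice the interchange object would usually come from another library.

```python
import pandas as pd
from pandas.api.interchange import from_dataframe

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"x": [1, 2, 3], "y": [0.5, 1.5, 2.5]})

# Producer side: export the frame as an interchange object.
interchange_obj = df.__dataframe__()
n_cols = interchange_obj.num_columns()
col_names = list(interchange_obj.column_names())

# Consumer side: rebuild a pandas DataFrame from the interchange object.
roundtripped = from_dataframe(interchange_obj)

print(n_cols, col_names)
print(roundtripped)
```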

Styler#

The most notable development is the new method Styler.concat(), which allows adding customised footer rows to visualise additional calculations on the data, e.g. totals and counts (GH43875, GH46186).

Additionally there is an alternative output method Styler.to_string(), which allows using the Styler’s formatting methods to create, for example, CSVs (GH44502).

A new feature Styler.relabel_index() is also made available to provide full customisation of the display of index or column headers (GH47864).

Minor feature improvements are:

  • Adding the ability to render border and border-{side} CSS properties in Excel (GH42276)

  • Making keyword arguments consistent: Styler.highlight_null() now accepts color and deprecates null_color, although this remains backwards compatible (GH45907)

Control of index with group_keys in DataFrame.resample()#

The argument group_keys has been added to the method DataFrame.resample(). As with DataFrame.groupby(), this argument controls whether each group is added to the index in the resample when Resampler.apply() is used.

Warning

Not specifying the group_keys argument will retain the previous behavior and emit a warning if the result will change by specifying group_keys=False. In a future version of pandas, not specifying group_keys will default to the same behavior as group_keys=False.

In [11]: df = pd.DataFrame(
   ....:     {'a': range(6)},
   ....:     index=pd.date_range("2021-01-01", periods=6, freq="8H")
   ....: )
   ....: 

In [12]: df.resample("D", group_keys=True).apply(lambda x: x)
Out[12]: 
                                a
2021-01-01 2021-01-01 00:00:00  0
           2021-01-01 08:00:00  1
           2021-01-01 16:00:00  2
2021-01-02 2021-01-02 00:00:00  3
           2021-01-02 08:00:00  4
           2021-01-02 16:00:00  5

In [13]: df.resample("D", group_keys=False).apply(lambda x: x)
Out[13]: 
                     a
2021-01-01 00:00:00  0
2021-01-01 08:00:00  1
2021-01-01 16:00:00  2
2021-01-02 00:00:00  3
2021-01-02 08:00:00  4
2021-01-02 16:00:00  5

Previously, the resulting index would depend upon the values returned by apply, as seen in the following example.

In [1]: # pandas 1.3
In [2]: df.resample("D").apply(lambda x: x)
Out[2]:
                     a
2021-01-01 00:00:00  0
2021-01-01 08:00:00  1
2021-01-01 16:00:00  2
2021-01-02 00:00:00  3
2021-01-02 08:00:00  4
2021-01-02 16:00:00  5

In [3]: df.resample("D").apply(lambda x: x.reset_index())
Out[3]:
                           index  a
2021-01-01 0 2021-01-01 00:00:00  0
           1 2021-01-01 08:00:00  1
           2 2021-01-01 16:00:00  2
2021-01-02 0 2021-01-02 00:00:00  3
           1 2021-01-02 08:00:00  4
           2 2021-01-02 16:00:00  5

from_dummies#

Added new function from_dummies() to convert a dummy coded DataFrame into a categorical DataFrame.

In [14]: import pandas as pd

In [15]: df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0],
   ....:                    "col2_a": [0, 1, 0], "col2_b": [1, 0, 0],
   ....:                    "col2_c": [0, 0, 1]})
   ....: 

In [16]: pd.from_dummies(df, sep="_")
Out[16]: 
  col1 col2
0    a    b
1    b    a
2    a    c
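Since from_dummies() reverses dummy coding, a round trip through get_dummies() is a quick way to see the relationship. The sketch below assumes a single made-up "color" column.

```python
import pandas as pd

# Hypothetical categorical data, used only for illustration.
original = pd.DataFrame({"color": ["red", "green", "red"]})

# Encode into dummy columns named "color_green" and "color_red".
dummies = pd.get_dummies(original, prefix_sep="_")

# Decode them back into a single "color" column.
restored = pd.from_dummies(dummies, sep="_")
print(restored)
```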

Writing to ORC files#

The new method DataFrame.to_orc() allows writing to ORC files (GH43864).

This functionality depends on the pyarrow library. For more details, see the IO docs on ORC.

Warning

  • It is highly recommended to install pyarrow using conda due to some issues caused by pyarrow.

  • to_orc() requires pyarrow>=7.0.0.

  • to_orc() is not supported on Windows yet; you can find valid environments in install optional dependencies.

  • For supported dtypes please refer to supported ORC features in Arrow.

  • Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files.

df = pd.DataFrame(data={"col1": [1, 2], "col2": [3, 4]})
df.to_orc("./out.orc")

Reading directly from TAR archives#

I/O methods like read_csv() or DataFrame.to_json() now allow reading and writing directly on TAR archives (GH44787).

df = pd.read_csv("./movement.tar.gz")
# ...
df.to_csv("./out.tar.gz")

This supports .tar, .tar.gz, .tar.bz2 and .tar.xz archives. The compression method is inferred from the filename. If the compression method cannot be inferred, use the compression argument:

df = pd.read_csv(some_file_obj, compression={"method": "tar", "mode": "r:gz"}) # noqa F821

(mode being one of tarfile.open’s modes: https://docs.python.org/3/library/tarfile.html#tarfile.open)
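A small round trip can illustrate the filename-based inference. This is a sketch that writes to a temporary directory; the file name and the data are placeholders chosen for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "out.tar.gz")
    # The ".tar.gz" suffix lets pandas infer that a TAR archive is wanted.
    df.to_csv(path, index=False)
    # Reading infers the archive type from the same suffix.
    restored = pd.read_csv(path)

print(restored)
```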

read_xml now supports dtype, converters, and parse_dates#

Similar to other IO methods, pandas.read_xml() now supports assigning specific dtypes to columns, applying converter methods, and parsing dates (GH43567).

In [17]: xml_dates = """<?xml version='1.0' encoding='utf-8'?>
   ....: <data>
   ....:   <row>
   ....:     <shape>square</shape>
   ....:     <degrees>00360</degrees>
   ....:     <sides>4.0</sides>
   ....:     <date>2020-01-01</date>
   ....:    </row>
   ....:   <row>
   ....:     <shape>circle</shape>
   ....:     <degrees>00360</degrees>
   ....:     <sides/>
   ....:     <date>2021-01-01</date>
   ....:   </row>
   ....:   <row>
   ....:     <shape>triangle</shape>
   ....:     <degrees>00180</degrees>
   ....:     <sides>3.0</sides>
   ....:     <date>2022-01-01</date>
   ....:   </row>
   ....: </data>"""
   ....: 

In [18]: df = pd.read_xml(
   ....:     xml_dates,
   ....:     dtype={'sides': 'Int64'},
   ....:     converters={'degrees': str},
   ....:     parse_dates=['date']
   ....: )
   ....: 

In [19]: df
Out[19]: 
      shape degrees  sides       date
0    square   00360      4 2020-01-01
1    circle   00360   <NA> 2021-01-01
2  triangle   00180      3 2022-01-01

In [20]: df.dtypes
Out[20]: 
shape              object
degrees            object
sides               Int64
date       datetime64[ns]
dtype: object

read_xml now supports large XML using iterparse#

For very large XML files that can range from hundreds of megabytes to gigabytes, pandas.read_xml() now supports parsing such sizeable files using lxml’s iterparse and etree’s iterparse, which are memory-efficient methods to iterate through XML trees and extract specific elements and attributes without holding the entire tree in memory (GH45442).

In [1]: df = pd.read_xml(
...      "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
...      iterparse={"page": ["title", "ns", "id"]}
...  )

In [2]: df
Out[2]:
                                                     title   ns        id
0                                       Gettysburg Address    0     21450
1                                                Main Page    0     42950
2                            Declaration by United Nations    0      8435
3             Constitution of the United States of America    0      8435
4                     Declaration of Independence (Israel)    0     17858
...                                                    ...  ...       ...
3578760               Page:Black cat 1897 07 v2 n10.pdf/17  104    219649
3578761               Page:Black cat 1897 07 v2 n10.pdf/43  104    219649
3578762               Page:Black cat 1897 07 v2 n10.pdf/44  104    219649
3578763      The History of Tom Jones, a Foundling/Book IX    0  12084291
3578764  Page:Shakespeare of Stratford (1926) Yale.djvu/91  104     21450

[3578765 rows x 3 columns]
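The same call can be tried on a tiny file. The sketch below writes a made-up two-row XML document to a temporary directory, because iterparse expects a file on local disk. It passes parser="etree" so that only the standard library parser is needed instead of the default lxml.

```python
import os
import tempfile

import pandas as pd

# Hypothetical XML content, used only for illustration.
xml_text = """<?xml version='1.0' encoding='utf-8'?>
<data>
  <row><shape>square</shape><sides>4</sides></row>
  <row><shape>triangle</shape><sides>3</sides></row>
</data>"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "shapes.xml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(xml_text)
    # The dict key is the repeating element; the list names the
    # descendant elements or attributes to extract from each one.
    shapes = pd.read_xml(
        path,
        iterparse={"row": ["shape", "sides"]},
        parser="etree",
    )

print(shapes)
```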

Copy on Write#

A new feature copy_on_write was added (GH46958). Copy on write ensures that any DataFrame or Series derived from another in any way always behaves as a copy. Copy on write prevents an operation from updating any object other than the one it was applied to.

Copy on write can be enabled through:

pd.set_option("mode.copy_on_write", True)
pd.options.mode.copy_on_write = True

Alternatively, copy on write can be enabled locally through:

with pd.option_context("mode.copy_on_write", True):
    ...

Without copy on write, the parent DataFrame is updated when updating a child DataFrame that was derived from this DataFrame.

In [21]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": 1})

In [22]: view = df["foo"]

In [23]: view.iloc[0] = 10

In [24]: df
Out[24]: 
   foo  bar
0   10    1
1    2    1
2    3    1

With copy on write enabled, df won’t be updated anymore:

In [25]: with pd.option_context("mode.copy_on_write", True):
   ....:     df = pd.DataFrame({"foo": [1, 2, 3], "bar": 1})
   ....:     view = df["foo"]
   ....:     view.iloc[0] = 10
   ....:     df
   ....: 

A more detailed explanation can be found here.

Other enhancements#

Notable bug fixes#

These are bug fixes that might have notable behavior changes.

Using dropna=True with groupby transforms#

A transform is an operation whose result has the same size as its input. When the result is a DataFrame or Series, it is also required that the index of the result matches that of the input. In pandas 1.4, using DataFrameGroupBy.transform() or SeriesGroupBy.transform() with null values in the groups and dropna=True gave incorrect results. As demonstrated by the examples below, the incorrect results either contained incorrect values, or the result did not have the same index as the input.

In [26]: df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [2, 3, 4]})

Old behavior:

In [3]: # Value in the last row should be np.nan
        df.groupby('a', dropna=True).transform('sum')
Out[3]:
   b
0  5
1  5
2  5

In [3]: # Should have one additional row with the value np.nan
        df.groupby('a', dropna=True).transform(lambda x: x.sum())
Out[3]:
   b
0  5
1  5

In [3]: # The value in the last row is np.nan interpreted as an integer
        df.groupby('a', dropna=True).transform('ffill')
Out[3]:
                     b
0                    2
1                    3
2 -9223372036854775808

In [3]: # Should have one additional row with the value np.nan
        df.groupby('a', dropna=True).transform(lambda x: x)
Out[3]:
   b
0  2
1  3

New behavior:

In [27]: df.groupby('a', dropna=True).transform('sum')
Out[27]: 
     b
0  5.0
1  5.0
2  NaN

In [28]: df.groupby('a', dropna=True).transform(lambda x: x.sum())
Out[28]: 
     b
0  5.0
1  5.0
2  NaN

In [29]: df.groupby('a', dropna=True).transform('ffill')
Out[29]: 
     b
0  2.0
1  3.0
2  NaN

In [30]: df.groupby('a', dropna=True).transform(lambda x: x)
Out[30]: 
     b
0  2.0
1  3.0
2  NaN

Serializing tz-naive Timestamps with to_json() with iso_dates=True#

DataFrame.to_json(), Series.to_json(), and Index.to_json() would incorrectly localize DatetimeArrays/DatetimeIndexes with tz-naive Timestamps to UTC. (GH38760)

Note that this patch does not fix the localization of tz-aware Timestamps to UTC upon serialization. (Related issue GH12997)

Old Behavior

In [31]: index = pd.date_range(
   ....:     start='2020-12-28 00:00:00',
   ....:     end='2020-12-28 02:00:00',
   ....:     freq='1H',
   ....: )
   ....: 

In [32]: a = pd.Series(
   ....:     data=range(3),
   ....:     index=index,
   ....: )
   ....: 
In [4]: a.to_json(date_format='iso')
Out[4]: '{"2020-12-28T00:00:00.000Z":0,"2020-12-28T01:00:00.000Z":1,"2020-12-28T02:00:00.000Z":2}'

In [5]: pd.read_json(a.to_json(date_format='iso'), typ="series").index == a.index
Out[5]: array([False, False, False])

New Behavior

In [33]: a.to_json(date_format='iso')
Out[33]: '{"2020-12-28T00:00:00.000":0,"2020-12-28T01:00:00.000":1,"2020-12-28T02:00:00.000":2}'

# Roundtripping now works
In [34]: pd.read_json(a.to_json(date_format='iso'), typ="series").index == a.index
Out[34]: array([ True,  True,  True])

DataFrameGroupBy.value_counts with non-grouping categorical columns and observed=True#

Calling DataFrameGroupBy.value_counts() with observed=True would incorrectly drop non-observed categories of non-grouping columns (GH46357).

In [6]: df = pd.DataFrame(["a", "b", "c"], dtype="category").iloc[0:2]
In [7]: df
Out[7]:
   0
0  a
1  b

Old Behavior

In [8]: df.groupby(level=0, observed=True).value_counts()
Out[8]:
0  a    1
1  b    1
dtype: int64

New Behavior

In [9]: df.groupby(level=0, observed=True).value_counts()
Out[9]:
0  a    1
1  a    0
   b    1
0  b    0
   c    0
1  c    0
dtype: int64

Backwards incompatible API changes#

Increased minimum versions for dependencies#

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package          Minimum Version   Required   Changed
numpy            1.20.3            X          X
mypy (dev)       0.971                        X
beautifulsoup4   4.9.3                        X
blosc            1.21.0                       X
bottleneck       1.3.2                        X
fsspec           2021.07.0                    X
hypothesis       6.13.0                       X
gcsfs            2021.07.0                    X
jinja2           3.0.0                        X
lxml             4.6.3                        X
numba            0.53.1                       X
numexpr          2.7.3                        X
openpyxl         3.0.7                        X
pandas-gbq       0.15.0                       X
psycopg2         2.8.6                        X
pymysql          1.0.2                        X
pyreadstat       1.1.2                        X
pyxlsb           1.0.8                        X
s3fs             2021.08.0                    X
scipy            1.7.1                        X
sqlalchemy       1.4.16                       X
tabulate         0.8.9                        X
xarray           0.19.0                       X
xlsxwriter       1.4.3                        X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package          Minimum Version   Changed
beautifulsoup4   4.9.3             X
blosc            1.21.0            X
bottleneck       1.3.2             X
brotlipy         0.7.0
fastparquet      0.4.0
fsspec           2021.08.0         X
html5lib         1.1
hypothesis       6.13.0            X
gcsfs            2021.08.0         X
jinja2           3.0.0             X
lxml             4.6.3             X
matplotlib       3.3.2
numba            0.53.1            X
numexpr          2.7.3             X
odfpy            1.4.1
openpyxl         3.0.7             X
pandas-gbq       0.15.0            X
psycopg2         2.8.6             X
pyarrow          1.0.1
pymysql          1.0.2             X
pyreadstat       1.1.2             X
pytables         3.6.1
python-snappy    0.6.0
pyxlsb           1.0.8             X
s3fs             2021.08.0         X
scipy            1.7.1             X
sqlalchemy       1.4.16            X
tabulate         0.8.9             X
tzdata           2022a
xarray           0.19.0            X
xlrd             2.0.1
xlsxwriter       1.4.3             X
xlwt             1.3.0
zstandard        0.15.2

See Dependencies and Optional dependencies for more.

Other API changes#

  • BigQuery I/O methods read_gbq() and DataFrame.to_gbq() default to auth_local_webserver = True. Google has deprecated the auth_local_webserver = False “out of band” (copy-paste) flow. The auth_local_webserver = False option is planned to stop working in October 2022. (GH46312)

  • read_json() now raises FileNotFoundError (previously ValueError) when input is a string ending in .json, .json.gz, .json.bz2, etc. but no such file exists. (GH29102)

  • Operations with Timestamp or Timedelta that would previously raise OverflowError instead raise OutOfBoundsDatetime or OutOfBoundsTimedelta where appropriate (GH47268)

  • When read_sas() previously returned None, it now returns an empty DataFrame (GH47410)

  • DataFrame constructor raises if index or columns arguments are sets (GH47215)
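Two of the changes above are easy to check directly. The sketch below assumes a file name that does not exist on disk. It catches ValueError for the set case, which is an assumption about the exception type, since the note only says the constructor raises.

```python
import pandas as pd

# A path ending in ".json" that does not exist (hypothetical file name).
try:
    pd.read_json("no_such_file_for_this_example.json")
    missing_file_error = None
except FileNotFoundError as err:
    missing_file_error = err

# Passing a set for columns is rejected by the DataFrame constructor.
try:
    pd.DataFrame([[1, 2]], columns={"a", "b"})
    set_columns_error = None
except ValueError as err:
    set_columns_error = err

print(type(missing_file_error).__name__)
print(type(set_columns_error).__name__)
```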

Deprecations#

Warning

In the next major version release, 2.0, several larger API changes are being considered without a formal deprecation, such as making the standard library zoneinfo the default timezone implementation instead of pytz, having the Index support all data types instead of having multiple subclasses (CategoricalIndex, Int64Index, etc.), and more. The changes under consideration are logged in this GitHub issue, and any feedback or concerns are welcome.

Label-based integer slicing on a Series with an Int64Index or RangeIndex#

In a future version, integer slicing on a Series with an Int64Index or RangeIndex will be treated as label-based, not positional. This will make the behavior consistent with other Series.__getitem__() and Series.__setitem__() behaviors (GH45162).

For example:

In [35]: ser = pd.Series([1, 2, 3, 4, 5], index=[2, 3, 5, 7, 11])

In the old behavior, ser[2:4] treats the slice as positional:

Old behavior:

In [3]: ser[2:4]
Out[3]:
5    3
7    4
dtype: int64

In a future version, this will be treated as label-based:

Future behavior:

In [4]: ser.loc[2:4]
Out[4]:
2    1
3    2
dtype: int64

To retain the old behavior, use series.iloc[i:j]. To get the future behavior, use series.loc[i:j].

Slicing on a DataFrame will not be affected.
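Being explicit with .iloc or .loc avoids depending on either default. A short sketch using the same Series as above:

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5], index=[2, 3, 5, 7, 11])

# Positional: rows at positions 2 and 3 (labels 5 and 7).
positional = ser.iloc[2:4]

# Label-based: labels from 2 through 4 inclusive (labels 2 and 3).
label_based = ser.loc[2:4]

print(positional.tolist())   # [3, 4]
print(label_based.tolist())  # [1, 2]
```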

ExcelWriter attributes#

All attributes of ExcelWriter were previously documented as not public. However, some third party Excel engines documented accessing ExcelWriter.book or ExcelWriter.sheets, and users were utilizing these and possibly other attributes. Previously these attributes were not safe to use; e.g. modifications to ExcelWriter.book would not update ExcelWriter.sheets and vice versa. In order to support this, pandas has made some attributes public and improved their implementations so that they may now be safely used. (GH45572)

The following attributes are now public and considered safe to access.

  • book

  • check_extension

  • close

  • date_format

  • datetime_format

  • engine

  • if_sheet_exists

  • sheets

  • supported_extensions

The following attributes have been deprecated. They now raise a FutureWarning when accessed and will be removed in a future version. Users should be aware that their usage is considered unsafe, and can lead to unexpected results.

  • cur_sheet

  • handles

  • path

  • save

  • write_cells

See the documentation of ExcelWriter for further details.

Using group_keys with transformers in DataFrameGroupBy.apply() and SeriesGroupBy.apply()#

In previous versions of pandas, if it was inferred that the function passed to DataFrameGroupBy.apply() or SeriesGroupBy.apply() was a transformer (i.e. the resulting index was equal to the input index), the group_keys argument of DataFrame.groupby() and Series.groupby() was ignored and the group keys would never be added to the index of the result. In the future, the group keys will be added to the index when the user specifies group_keys=True.

As group_keys=True is the default value of DataFrame.groupby() and Series.groupby(), not specifying group_keys with a transformer will raise a FutureWarning. This can be silenced and the previous behavior retained by specifying group_keys=False.
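Passing group_keys explicitly makes the intent clear and avoids the warning. The sketch below uses SeriesGroupBy.apply() with a made-up transformer named demean. It assumes pandas 1.5 or later, where an explicit group_keys=True is expected to add the keys to the result index.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
ser = pd.Series([1.0, 3.0, 10.0])
keys = pd.Series(["a", "a", "b"])


def demean(group):
    # A transformer: the result has the same index as the input group.
    return group - group.mean()


# Group keys become an extra outer index level.
with_keys = ser.groupby(keys, group_keys=True).apply(demean)

# The original index is kept as-is.
without_keys = ser.groupby(keys, group_keys=False).apply(demean)

print(with_keys)
print(without_keys)
```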

Inplace operation when setting values with loc and iloc#

Most of the time setting values with DataFrame.iloc() attempts to set values inplace, only falling back to inserting a new array if necessary. There are some cases where this rule is not followed, for example when setting an entire column from an array with different dtype:

In [36]: df = pd.DataFrame({'price': [11.1, 12.2]}, index=['book1', 'book2'])

In [37]: original_prices = df['price']

In [38]: new_prices = np.array([98, 99])

Old behavior:

In [3]: df.iloc[:, 0] = new_prices
In [4]: df.iloc[:, 0]
Out[4]:
book1    98
book2    99
Name: price, dtype: int64
In [5]: original_prices
Out[5]:
book1    11.1
book2    12.2
Name: price, dtype: float64

This behavior is deprecated. In a future version, setting an entire column with iloc will attempt to operate inplace.

Future behavior:

In [3]: df.iloc[:, 0] = new_prices
In [4]: df.iloc[:, 0]
Out[4]:
book1    98.0
book2    99.0
Name: price, dtype: float64
In [5]: original_prices
Out[5]:
book1    98.0
book2    99.0
Name: price, dtype: float64

To get the old behavior, use DataFrame.__setitem__() directly:

In [3]: df[df.columns[0]] = new_prices
In [4]: df.iloc[:, 0]
Out[4]:
book1    98
book2    99
Name: price, dtype: int64
In [5]: original_prices
Out[5]:
book1    11.1
book2    12.2
Name: price, dtype: float64

To get the old behavior when df.columns is not unique and you want to change a single column by index, you can use DataFrame.isetitem(), which has been added in pandas 1.5:

In [3]: df_with_duplicated_cols = pd.concat([df, df], axis='columns')
In [3]: df_with_duplicated_cols.isetitem(0, new_prices)
In [4]: df_with_duplicated_cols.iloc[:, 0]
Out[4]:
book1    98
book2    99
Name: price, dtype: int64
In [5]: original_prices
Out[5]:
book1    11.1
book2    12.2
Name: price, dtype: float64

numeric_only default value#

Across the DataFrame, DataFrameGroupBy, and Resampler operations such as min, sum, and idxmax, the default value of the numeric_only argument, if it exists at all, was inconsistent. Furthermore, operations with the default value None can lead to surprising results. (GH46560)

In [1]: df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

In [2]: # Reading the next line without knowing the contents of df, one would
        # expect the result to contain the products for both columns a and b.
        df[["a", "b"]].prod()
Out[2]:
a    2
dtype: int64

To avoid this behavior, specifying the value numeric_only=None has been deprecated, and will be removed in a future version of pandas. In the future, all operations with a numeric_only argument will default to False. Users should either call the operation only with columns that can be operated on, or specify numeric_only=True to operate only on Boolean, integer, and float columns.
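Both recommended forms can be sketched with the frame from the example above; each states explicitly which columns take part in the reduction.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Option 1: select only the columns that support the operation.
selected = df[["a"]].prod()

# Option 2: ask explicitly for numeric columns only.
numeric = df.prod(numeric_only=True)

print(selected.to_dict())
print(numeric.to_dict())
```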

In order to support the transition to the new behavior, the following methods have gained the numeric_only argument.

Other Deprecations#

  • Deprecated the keyword line_terminator in DataFrame.to_csv() and Series.to_csv(), use lineterminator instead; this is for consistency with read_csv() and the standard library ‘csv’ module (GH9568)

  • Deprecated behavior of SparseArray.astype(), Series.astype(), and DataFrame.astype() with SparseDtype when passing a non-sparse dtype. In a future version, this will cast to that non-sparse dtype instead of wrapping it in a SparseDtype (GH34457)

  • Deprecated behavior of DatetimeIndex.intersection() and DatetimeIndex.symmetric_difference() (union behavior was already deprecated in version 1.3.0) with mixed time zones; in a future version both will be cast to UTC instead of object dtype (GH39328, GH45357)

  • Deprecated DataFrame.iteritems(), Series.iteritems(), HDFStore.iteritems() in favor of DataFrame.items(), Series.items(), HDFStore.items() (GH45321)

  • Deprecated Series.is_monotonic() and Index.is_monotonic() in favor of Series.is_monotonic_increasing() and Index.is_monotonic_increasing() (GH45422, GH21335)

  • Deprecated behavior of DatetimeIndex.astype(), TimedeltaIndex.astype(), PeriodIndex.astype() when converting to an integer dtype other than int64. In a future version, these will convert to exactly the specified dtype (instead of always int64) and will raise if the conversion overflows (GH45034)

  • Deprecated the __array_wrap__ method of DataFrame and Series, rely on standard numpy ufuncs instead (GH45451)

  • Deprecated treating float-dtype data as wall-times when passed with a timezone to Series or DatetimeIndex (GH45573)

  • Deprecated the behavior of Series.fillna() and DataFrame.fillna() with timedelta64[ns] dtype and incompatible fill value; in a future version this will cast to a common dtype (usually object) instead of raising, matching the behavior of other dtypes (GH45746)

  • Deprecated the warn parameter in infer_freq() (GH45947)

  • Deprecated allowing non-keyword arguments in ExtensionArray.argsort() (GH46134)

  • Deprecated treating all-bool object-dtype columns as bool-like in DataFrame.any() and DataFrame.all() with bool_only=True, explicitly cast to bool instead (GH46188)

  • Deprecated the current default of numeric_only in DataFrame.quantile(); in a future version numeric_only will default to False, so datetime/timedelta columns will be included in the result (GH7308)

  • Deprecated Timedelta.freq and Timedelta.is_populated (GH46430)

  • Deprecated Timedelta.delta (GH46476)

  • Deprecated passing arguments as positional in DataFrame.any() and Series.any() (GH44802)

  • Deprecated passing positional arguments to DataFrame.pivot() and pivot() except data (GH30228)

  • Deprecated the methods DataFrame.mad(), Series.mad(), and the corresponding groupby methods (GH11787)

  • Deprecated positional arguments to Index.join() except for other, use keyword-only arguments instead of positional arguments (GH46518)

  • Deprecated positional arguments to StringMethods.rsplit() and StringMethods.split() except for pat, use keyword-only arguments instead of positional arguments (GH47423)

  • Deprecated indexing on a timezone-naive DatetimeIndex using a string representing a timezone-aware datetime (GH46903, GH36148)

  • Deprecated allowing unit="M" or unit="Y" in Timestamp constructor with a non-round float value (GH47267)

  • Deprecated the display.column_space global configuration option (GH7576)

  • Deprecated the argument na_sentinel in factorize(), Index.factorize(), and ExtensionArray.factorize(); pass use_na_sentinel=True instead to use the sentinel -1 for NaN values and use_na_sentinel=False instead of na_sentinel=None to encode NaN values (GH46910)

  • Deprecated DataFrameGroupBy.transform() not aligning the result when the UDF returned DataFrame (GH45648)

  • Clarified warning from to_datetime() when delimited dates can’t be parsed in accordance with the specified dayfirst argument (GH46210)

  • Emit a warning from to_datetime() when delimited dates can’t be parsed in accordance with the specified dayfirst argument even for dates where the leading zero is omitted (e.g. 31/1/2001) (GH47880)

  • Deprecated Series and Resampler reducers (e.g. min, max, sum, mean) raising a NotImplementedError when the dtype is non-numeric and numeric_only=True is provided; this will raise a TypeError in a future version (GH47500)

  • Deprecated Series.rank() returning an empty result when the dtype is non-numeric and numeric_only=True is provided; this will raise a TypeError in a future version (GH47500)

  • Deprecated argument errors for Series.mask(), Series.where(), DataFrame.mask(), and DataFrame.where() as errors had no effect on these methods (GH47728)

  • Deprecated arguments *args and **kwargs in Rolling, Expanding, and ExponentialMovingWindow ops. (GH47836)

  • Deprecated the inplace keyword in Categorical.set_ordered(), Categorical.as_ordered(), and Categorical.as_unordered() (GH37643)

  • Deprecated setting a categorical’s categories with cat.categories = ['a', 'b', 'c'], use Categorical.rename_categories() instead (GH37643)

  • Deprecated unused arguments encoding and verbose in Series.to_excel() and DataFrame.to_excel() (GH47912)

  • Deprecated the inplace keyword in DataFrame.set_axis() and Series.set_axis(), use obj = obj.set_axis(..., copy=False) instead (GH48130)

  • Deprecated producing a single element when iterating over a DataFrameGroupBy or a SeriesGroupBy that has been grouped by a list of length 1; A tuple of length one will be returned instead (GH42795)

  • Fixed up warning message of deprecation of MultiIndex.lexsort_depth() as public method, as the message previously referred to MultiIndex.is_lexsorted() instead (GH38701)

  • Deprecated the sort_columns argument in DataFrame.plot() and Series.plot() (GH47563).

  • Deprecated positional arguments for all but the first argument of DataFrame.to_stata() and read_stata(), use keyword arguments instead (GH48128).

  • Deprecated the mangle_dupe_cols argument in read_csv(), read_fwf(), read_table() and read_excel(). The argument was never implemented, and a new argument where the renaming pattern can be specified will be added instead (GH47718)

  • Deprecated allowing dtype='datetime64' or dtype=np.datetime64 in Series.astype(), use “datetime64[ns]” instead (GH47844)
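A few of the replacements named in the list above, gathered into one sketch. The data is made up, and only a handful of the deprecations are covered.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"a": [1, 2]})

# DataFrame.items() instead of the deprecated DataFrame.iteritems().
column_names = [name for name, _ in df.items()]

# is_monotonic_increasing instead of the deprecated is_monotonic.
increasing = df["a"].is_monotonic_increasing

# lineterminator instead of the deprecated line_terminator keyword.
csv_text = df.to_csv(index=False, lineterminator="\r\n")

# use_na_sentinel=True instead of na_sentinel: missing values get code -1.
codes, uniques = pd.factorize(
    pd.Series(["x", None, "y"]), use_na_sentinel=True
)

print(column_names, increasing)
print(repr(csv_text))
print(codes.tolist(), list(uniques))
```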

Performance improvements#

Bug fixes#

Categorical#

  • Bug in Categorical.view() not accepting integer dtypes (GH25464)

  • Bug in CategoricalIndex.union() when the index’s categories are integer-dtype and the index contains NaN values incorrectly raising instead of casting to float64 (GH45362)

  • Bug in concat() when concatenating two (or more) unordered CategoricalIndex variables, whose categories are permutations, yields incorrect index values (GH24845)

Datetimelike#

  • Bug in DataFrame.quantile() with datetime-like dtypes and no rows incorrectly returning float64 dtype instead of retaining datetime-like dtype (GH41544)

  • Bug in to_datetime() with sequences of np.str_ objects incorrectly raising (GH32264)

  • Bug in Timestamp construction when passing datetime components as positional arguments and tzinfo as a keyword argument incorrectly raising (GH31929)

  • Bug in Index.astype() when casting from object dtype to timedelta64[ns] dtype incorrectly casting np.datetime64("NaT") values to np.timedelta64("NaT") instead of raising (GH45722)

  • Bug in SeriesGroupBy.value_counts() index when passing categorical column (GH44324)

  • Bug in DatetimeIndex.tz_localize() localizing to UTC failing to make a copy of the underlying data (GH46460)

  • Bug in DatetimeIndex.resolution() incorrectly returning “day” instead of “nanosecond” for nanosecond-resolution indexes (GH46903)

  • Bug in Timestamp with an integer or float value and unit="Y" or unit="M" giving slightly-wrong results (GH47266)

  • Bug in DatetimeArray construction when passed another DatetimeArray and freq=None incorrectly inferring the freq from the given array (GH47296)

  • Bug in to_datetime() where OutOfBoundsDatetime would be thrown even if errors=coerce if there were more than 50 rows (GH45319)

  • Bug when adding a DateOffset to a Series would not add the nanoseconds field (GH47856)

Timedelta#

  • Bug in astype_nansafe() where astype(“timedelta64[ns]”) fails when np.nan is included (GH45798)

  • Bug in constructing a Timedelta with a np.timedelta64 object and a unit sometimes silently overflowing and returning incorrect results instead of raising OutOfBoundsTimedelta (GH46827)

  • Bug in constructing a Timedelta from a large integer or float with unit="W" silently overflowing and returning incorrect results instead of raising OutOfBoundsTimedelta (GH47268)

Time Zones#

  • Bug in Timestamp constructor raising when passed a ZoneInfo tzinfo object (GH46425)

Numeric#

  • Bug in operations with array-likes with dtype="boolean" and NA incorrectly altering the array in-place (GH45421)

  • Bug in arithmetic operations with nullable types without NA values not matching the same operation with non-nullable types (GH48223)

  • Bug in floordiv where dividing by IntegerDtype 0 would return 0 instead of inf (GH48223)

  • Bug in division, pow and mod operations on array-likes with dtype="boolean" not being like their np.bool_ counterparts (GH46063)

  • Bug in multiplying a Series with IntegerDtype or FloatingDtype by an array-like with timedelta64[ns] dtype incorrectly raising (GH45622)

  • Bug in mean() where the optional dependency bottleneck caused precision loss linear in the length of the array. bottleneck has been disabled for mean(), improving the loss to log-linear, but this may result in a performance decrease (GH42878)
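
To illustrate the first fix in this list (GH45421), the sketch below checks that logical operations involving NA leave a "boolean" array untouched. It assumes pandas 1.5.0 or later; `snapshot` and `unchanged` are illustrative names.

```python
import pandas as pd

arr = pd.array([True, False, pd.NA], dtype="boolean")
snapshot = arr.copy()

# Each operation should return a new array rather than modify `arr`.
results = [arr | pd.NA, arr & pd.NA, arr ^ pd.NA]

# ExtensionArray.equals treats NA in matching positions as equal.
unchanged = arr.equals(snapshot)
print(unchanged)
```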

Conversion#

  • Bug in DataFrame.astype() not preserving subclasses (GH40810)

  • Bug in constructing a Series from a float-containing list or a floating-dtype ndarray-like (e.g. dask.Array) and an integer dtype raising instead of casting like we would with an np.ndarray (GH40110)

  • Bug in Float64Index.astype() to unsigned integer dtype incorrectly casting to np.int64 dtype (GH45309)

  • Bug in Series.astype() and DataFrame.astype() from floating dtype to unsigned integer dtype failing to raise in the presence of negative values (GH45151)

  • Bug in array() with FloatingDtype and values containing float-castable strings incorrectly raising (GH45424)

  • Bug when comparing string and datetime64[ns] objects causing an OverflowError exception (GH45506)

  • Bug in metaclass of generic abstract dtypes causing DataFrame.apply() and Series.apply() to raise for the built-in function type (GH46684)

  • Bug in DataFrame.to_records() returning inconsistent numpy types if the index was a MultiIndex (GH47263)

  • Bug in DataFrame.to_dict() where orient="list" or orient="index" did not return native types (GH46751)

  • Bug in DataFrame.apply() returning a DataFrame instead of a Series when applied to an empty DataFrame with axis=1 (GH39111)

  • Bug where inferring the dtype from an iterable that is not a NumPy ndarray and consists entirely of NumPy unsigned integer scalars did not result in an unsigned integer dtype (GH47294)

  • Bug in DataFrame.eval() when pandas objects (e.g. 'Timestamp') were column names (GH44603)
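
As a small sketch of the DataFrame.to_dict() fix above (GH46751), the following checks that both orientations yield built-in Python scalars. It assumes pandas 1.5.0 or later; the frame contents are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

as_list = df.to_dict(orient="list")
as_index = df.to_dict(orient="index")

# Values are expected to be built-in int and float, not NumPy scalars.
print(type(as_list["a"][0]), type(as_index[0]["b"]))
```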

Strings#

Interval#

  • Bug in IntervalArray.__setitem__() when setting np.nan into an integer-backed array raising ValueError instead of TypeError (GH45484)

  • Bug in IntervalDtype when using datetime64[ns, tz] as a dtype string (GH46999)
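
The IntervalArray.__setitem__() change above (GH45484) can be sketched like this, assuming pandas 1.5.0 or later. The `raised_type_error` flag is an illustrative helper, not part of pandas.

```python
import numpy as np
import pandas as pd

# Integer breaks produce an integer-backed IntervalArray.
arr = pd.arrays.IntervalArray.from_breaks([0, 1, 2])

raised_type_error = False
try:
    arr[0] = np.nan  # NaN cannot be stored in integer-backed intervals
except TypeError as err:
    raised_type_error = True
    print(f"TypeError: {err}")
```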

Indexing#

  • Bug in DataFrame.iloc() where indexing a single row on a DataFrame with a single ExtensionDtype column gave a copy instead of a view on the underlying data (GH45241)

  • Bug in DataFrame.__getitem__() returning copy when DataFrame has duplicated columns even if a unique column is selected (GH45316, GH41062)

  • Bug in Series.align() not creating a MultiIndex with the union of levels when both MultiIndexes' intersections are identical (GH45224)

  • Bug in setting a NA value (None or np.nan) into a Series with int-based IntervalDtype incorrectly casting to object dtype instead of a float-based IntervalDtype (GH45568)

  • Bug in indexing setting values into an ExtensionDtype column with df.iloc[:, i] = values with values having the same dtype as df.iloc[:, i] incorrectly inserting a new array instead of setting in-place (GH33457)

  • Bug in Series.__setitem__() with a non-integer Index when using an integer key to set a value that cannot be set inplace where a ValueError was raised instead of casting to a common dtype (GH45070)

  • Bug in DataFrame.loc() not casting None to NA when setting value as a list into DataFrame (GH47987)

  • Bug in Series.__setitem__() when setting incompatible values into a PeriodDtype or IntervalDtype Series raising when indexing with a boolean mask but coercing when indexing with otherwise-equivalent indexers; these now consistently coerce, along with Series.mask() and Series.where() (GH45768)

  • Bug in DataFrame.where() with multiple columns with datetime-like dtypes failing to downcast results consistent with other dtypes (GH45837)

  • Bug in isin() upcasting to float64 with unsigned integer dtype and list-like argument without a dtype (GH46485)

  • Bug in Series.loc.__setitem__() and Series.loc.__getitem__() not raising when using multiple keys without using a MultiIndex (GH13831)

  • Bug in Index.reindex() raising AssertionError when level was specified but no MultiIndex was given; level is ignored now (GH35132)

  • Bug when setting a value too large for a Series dtype failing to coerce to a common type (GH26049, GH32878)

  • Bug in loc.__setitem__() treating range keys as positional instead of label-based (GH45479)

  • Bug in DataFrame.__setitem__() casting extension array dtypes to object when setting with a scalar key and DataFrame as value (GH46896)

  • Bug in Series.__setitem__() when setting a scalar to a nullable pandas dtype would not raise a TypeError if the scalar could not be cast (losslessly) to the nullable type (GH45404)

  • Bug in Series.__setitem__() when setting boolean dtype values containing NA incorrectly raising instead of casting to boolean dtype (GH45462)

  • Bug in Series.loc() raising with boolean indexer containing NA when Index did not match (GH46551)

  • Bug in Series.__setitem__() where setting NA into a numeric-dtype Series would incorrectly upcast to object-dtype rather than treating the value as np.nan (GH44199)

  • Bug in DataFrame.loc() when setting values to a column and the right-hand side is a dictionary (GH47216)

  • Bug in Series.__setitem__() with datetime64[ns] dtype, an all-False boolean mask, and an incompatible value incorrectly casting to object instead of retaining datetime64[ns] dtype (GH45967)

  • Bug in Index.__getitem__() raising ValueError when indexer is from boolean dtype with NA (GH45806)

  • Bug in Series.__setitem__() losing precision when enlarging Series with scalar (GH32346)

  • Bug in Series.mask() with inplace=True or setting values with a boolean mask with small integer dtypes incorrectly raising (GH45750)

  • Bug in DataFrame.mask() with inplace=True and ExtensionDtype columns incorrectly raising (GH45577)

  • Bug in getting a column from a DataFrame with an object-dtype row index with datetime-like values: the resulting Series now preserves the exact object-dtype Index from the parent DataFrame (GH42950)

  • Bug in DataFrame.__getattribute__() raising AttributeError if columns have "string" dtype (GH46185)

  • Bug in DataFrame.compare() returning all NaN column when comparing extension array dtype and numpy dtype (GH44014)

  • Bug in DataFrame.where() setting wrong values with "boolean" mask for numpy dtype (GH44014)

  • Bug in indexing on a DatetimeIndex with a np.str_ key incorrectly raising (GH45580)

  • Bug in CategoricalIndex.get_indexer() when the index contains NaN values, resulting in elements that are in target but not present in the index being mapped to the index of the NaN element instead of -1 (GH45361)

  • Bug in setting large integer values into Series with float32 or float16 dtype incorrectly altering these values instead of coercing to float64 dtype (GH45844)

  • Bug in Series.asof() and DataFrame.asof() incorrectly casting bool-dtype results to float64 dtype (GH16063)

  • Bug in NDFrame.xs(), DataFrame.iterrows(), DataFrame.loc() and DataFrame.iloc() not always propagating metadata (GH28283)

  • Bug in DataFrame.sum() where min_count changed the dtype if the input contained NaNs (GH46947)

  • Bug in IntervalTree that led to an infinite recursion (GH46658)

  • Bug in PeriodIndex raising AttributeError when indexing on NA, rather than putting NaT in its place. (GH46673)

  • Bug in DataFrame.at() allowing the modification of multiple columns (GH48296)
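
One of the indexing fixes above, loc.__setitem__() with range keys (GH45479), lends itself to a short sketch. It assumes pandas 1.5.0 or later; the index is deliberately reversed so that label-based and positional assignment would give different results.

```python
import pandas as pd

# Labels run 2, 1, 0 so that label 0 sits at position 2.
ser = pd.Series([10, 20, 30], index=[2, 1, 0])

# range(2) is treated as the labels 0 and 1, not as positions 0 and 1.
ser.loc[range(2)] = 0
print(ser.tolist())
```

With label-based handling, the entry labelled 2 keeps its original value while the entries labelled 1 and 0 are overwritten.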

Missing#

MultiIndex#

I/O#

Period#

  • Bug in subtraction of Period from PeriodArray returning wrong results (GH45999)

  • Bug in Period.strftime() and PeriodIndex.strftime() where the directives %l and %u gave wrong results (GH46252)

  • Bug in inferring an incorrect freq when passing a string to Period with microseconds that are a multiple of 1000 (GH46811)

  • Bug in constructing a Period from a Timestamp or np.datetime64 object with non-zero nanoseconds and freq="ns" incorrectly truncating the nanoseconds (GH46811)

  • Bug in adding np.timedelta64("NaT", "ns") to a Period with a timedelta-like freq incorrectly raising IncompatibleFrequency instead of returning NaT (GH47196)

  • Bug in adding an array of integers to an array with PeriodDtype giving incorrect results when dtype.freq.n > 1 (GH47209)

  • Bug in subtracting a Period from an array with PeriodDtype returning incorrect results instead of raising OverflowError when the operation overflows (GH47538)
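
For the Period.strftime() fix above (GH46252), a minimal sketch follows. It assumes pandas 1.5.0 or later, where %l is the pandas-specific directive for milliseconds and %u for microseconds; the timestamp string is an arbitrary example.

```python
import pandas as pd

per = pd.Period("2022-09-19 12:00:00.123456", freq="us")

millis = per.strftime("%l")  # three-digit milliseconds
micros = per.strftime("%u")  # six-digit microseconds
print(millis, micros)
```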

Plotting#

Groupby/resample/rolling#

Reshaping#

Sparse#

ExtensionArray#

  • Bug in IntegerArray.searchsorted() and FloatingArray.searchsorted() returning inconsistent results when acting on np.nan (GH45255)

Styler#

  • Bug when attempting to apply styling functions to an empty DataFrame subset (GH45313)

  • Bug in CSSToExcelConverter leading to TypeError when border color provided without border style for xlsxwriter engine (GH42276)

  • Bug in Styler.set_sticky() leading to white text on white background in dark mode (GH46984)

  • Bug in Styler.to_latex() causing UnboundLocalError when clines="all;data" and the DataFrame has no rows. (GH47203)

  • Bug in Styler.to_excel() when using vertical-align: middle; with xlsxwriter engine (GH30107)

  • Bug when applying styles to a DataFrame with boolean column labels (GH47838)

Metadata#

Other#

Contributors#

A total of 271 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.

  • Aadharsh Acharya +

  • Aadharsh-Acharya +

  • Aadhi Manivannan +

  • Adam Bowden

  • Aditya Agarwal +

  • Ahmed Ibrahim +

  • Alastair Porter +

  • Alex Povel +

  • Alex-Blade

  • Alexandra Sciocchetti +

  • AlonMenczer +

  • Andras Deak +

  • Andrew Hawyrluk

  • Andy Grigg +

  • Aneta Kahleová +

  • Anthony Givans +

  • Anton Shevtsov +

  • B. J. Potter +

  • BarkotBeyene +

  • Ben Beasley +

  • Ben Wozniak +

  • Bernhard Wagner +

  • Boris Rumyantsev

  • Brian Gollop +

  • CCXXXI +

  • Chandrasekaran Anirudh Bhardwaj +

  • Charles Blackmon-Luca +

  • Chris Moradi +

  • ChrisAlbertsen +

  • Compro Prasad +

  • DaPy15

  • Damian Barabonkov +

  • Daniel I +

  • Daniel Isaac +

  • Daniel Schmidt

  • Danil Iashchenko +

  • Dare Adewumi

  • Dennis Chukwunta +

  • Dennis J. Gray +

  • Derek Sharp +

  • Dhruv Samdani +

  • Dimitra Karadima +

  • Dmitry Savostyanov +

  • Dmytro Litvinov +

  • Do Young Kim +

  • Dries Schaumont +

  • Edward Huang +

  • Eirik +

  • Ekaterina +

  • Eli Dourado +

  • Ezra Brauner +

  • Fabian Gabel +

  • FactorizeD +

  • Fangchen Li

  • Francesco Romandini +

  • Greg Gandenberger +

  • Guo Ci +

  • Hiroaki Ogasawara

  • Hood Chatham +

  • Ian Alexander Joiner +

  • Irv Lustig

  • Ivan Ng +

  • JHM Darbyshire

  • JHM Darbyshire (MBP)

  • JHM Darbyshire (iMac)

  • JMBurley

  • Jack Goldsmith +

  • James Freeman +

  • James Lamb

  • James Moro +

  • Janosh Riebesell

  • Jarrod Millman

  • Jason Jia +

  • Jeff Reback

  • Jeremy Tuloup +

  • Johannes Mueller

  • John Bencina +

  • John Mantios +

  • John Zangwill

  • Jon Bramley +

  • Jonas Haag

  • Jordan Hicks

  • Joris Van den Bossche

  • Jose Ortiz +

  • JosephParampathu +

  • José Duarte

  • Julian Steger +

  • Kai Priester +

  • Kapil E. Iyer +

  • Karthik Velayutham +

  • Kashif Khan

  • Kazuki Igeta +

  • Kevin Jan Anker +

  • Kevin Sheppard

  • Khor Chean Wei

  • Kian Eliasi

  • Kian S +

  • Kim, KwonHyun +

  • Kinza-Raza +

  • Konjeti Maruthi +

  • Leonardus Chen

  • Linxiao Francis Cong +

  • Loïc Estève

  • LucasG0 +

  • Lucy Jiménez +

  • Luis Pinto

  • Luke Manley

  • Marc Garcia

  • Marco Edward Gorelli

  • Marco Gorelli

  • MarcoGorelli

  • Margarete Dippel +

  • Mariam-ke +

  • Martin Fleischmann

  • Marvin John Walter +

  • Marvin Walter +

  • Mateusz

  • Matilda M +

  • Matthew Roeschke

  • Matthias Bussonnier

  • MeeseeksMachine

  • Mehgarg +

  • Melissa Weber Mendonça +

  • Michael Milton +

  • Michael Wang

  • Mike McCarty +

  • Miloni Atal +

  • Mitlasóczki Bence +

  • Moritz Schreiber +

  • Morten Canth Hels +

  • Nick Crews +

  • NickFillot +

  • Nicolas Hug +

  • Nima Sarang

  • Noa Tamir +

  • Pandas Development Team

  • Parfait Gasana

  • Parthi +

  • Partho +

  • Patrick Hoefler

  • Peter

  • Peter Hawkins +

  • Philipp A

  • Philipp Schaefer +

  • Pierrot +

  • Pratik Patel +

  • Prithvijit

  • Purna Chandra Mansingh +

  • Radoslaw Lemiec +

  • RaphSku +

  • Reinert Huseby Karlsen +

  • Richard Shadrach

  • Richard Shadrach +

  • Robbie Palmer

  • Robert de Vries

  • Roger +

  • Roger Murray +

  • Ruizhe Deng +

  • SELEE +

  • Sachin Yadav +

  • Saiwing Yeung +

  • Sam Rao +

  • Sandro Casagrande +

  • Sebastiaan Vermeulen +

  • Shaghayegh +

  • Shantanu +

  • Shashank Shet +

  • Shawn Zhong +

  • Shuangchi He +

  • Simon Hawkins

  • Simon Knott +

  • Solomon Song +

  • Somtochi Umeh +

  • Stefan Krawczyk +

  • Stefanie Molin

  • Steffen Rehberg

  • Steven Bamford +

  • Steven Rotondo +

  • Steven Schaerer

  • Sylvain MARIE +

  • Sylvain Marié

  • Tarun Raghunandan Kaushik +

  • Taylor Packard +

  • Terji Petersen

  • Thierry Moisan

  • Thomas Grainger

  • Thomas Hunter +

  • Thomas Li

  • Tim McFarland +

  • Tim Swast

  • Tim Yang +

  • Tobias Pitters

  • Tom Aarsen +

  • Tom Augspurger

  • Torsten Wörtwein

  • TraverseTowner +

  • Tyler Reddy

  • Valentin Iovene

  • Varun Sharma +

  • Vasily Litvinov

  • Venaturum

  • Vinicius Akira Imaizumi +

  • Vladimir Fokow +

  • Wenjun Si

  • Will Lachance +

  • William Andrea

  • Wolfgang F. Riedl +

  • Xingrong Chen

  • Yago González

  • Yikun Jiang +

  • Yuanhao Geng

  • Yuval +

  • Zero

  • Zhengfei Wang +

  • abmyii

  • alexondor +

  • alm

  • andjhall +

  • anilbey +

  • arnaudlegout +

  • asv-bot +

  • ateki +

  • auderson +

  • bherwerth +

  • bicarlsen +

  • carbonleakage +

  • charles +

  • charlogazzo +

  • code-review-doctor +

  • dataxerik +

  • deponovo

  • dimitra-karadima +

  • dospix +

  • ehallam +

  • ehsan shirvanian +

  • ember91 +

  • eshirvana

  • fractionalhare +

  • gaotian98 +

  • gesoos

  • github-actions[bot]

  • gunghub +

  • hasan-yaman

  • iansheng +

  • iasoon +

  • jbrockmendel

  • joshuabello2550 +

  • jyuv +

  • kouya takahashi +

  • mariana-LJ +

  • matt +

  • mattB1989 +

  • nealxm +

  • partev

  • poloso +

  • realead

  • roib20 +

  • rtpsw

  • ryangilmour +

  • shourya5 +

  • srotondo +

  • stanleycai95 +

  • staticdev +

  • tehunter +

  • theidexisted +

  • tobias.pitters +

  • uncjackg +

  • vernetya

  • wany-oh +

  • wfr +

  • z3c0 +