What’s new in 1.5.0 (September 19, 2022)#
These are the changes in pandas 1.5.0. See Release notes for a full changelog including other versions of pandas.
Enhancements#
pandas-stubs#
The pandas-stubs library is now supported by the pandas development team, providing type stubs for the pandas API. Please visit
pandas-dev/pandas-stubs for more information.
We thank VirtusLab and Microsoft for their initial, significant contributions to pandas-stubs.
Native PyArrow-backed ExtensionArray#
With PyArrow installed, users can now create pandas objects
that are backed by a pyarrow.ChunkedArray and pyarrow.DataType.
The dtype argument can accept a string of a pyarrow data type
with pyarrow in brackets, e.g. "int64[pyarrow]", or, for pyarrow data types that take parameters, an ArrowDtype
initialized with a pyarrow.DataType.
In [1]: import pyarrow as pa
In [2]: ser_float = pd.Series([1.0, 2.0, None], dtype="float32[pyarrow]")
In [3]: ser_float
Out[3]: 
0     1.0
1     2.0
2    <NA>
dtype: float[pyarrow]
In [4]: list_of_int_type = pd.ArrowDtype(pa.list_(pa.int64()))
In [5]: ser_list = pd.Series([[1, 2], [3, None]], dtype=list_of_int_type)
In [6]: ser_list
Out[6]: 
0      [1. 2.]
1    [ 3. nan]
dtype: list<item: int64>[pyarrow]
In [7]: ser_list.take([1, 0])
Out[7]: 
1    [ 3. nan]
0      [1. 2.]
dtype: list<item: int64>[pyarrow]
In [8]: ser_float * 5
Out[8]: 
0     5.0
1    10.0
2    <NA>
dtype: float[pyarrow]
In [9]: ser_float.mean()
Out[9]: 1.5
In [10]: ser_float.dropna()
Out[10]: 
0    1.0
1    2.0
dtype: float[pyarrow]
Most operations are supported and have been implemented using pyarrow compute functions. We recommend installing the latest version of PyArrow to access the most recently implemented compute functions.
Warning
This feature is experimental, and the API can change in a future release without warning.
DataFrame interchange protocol implementation#
pandas now implements the DataFrame interchange API spec. See the full details on the API at https://data-apis.org/dataframe-protocol/latest/index.html
The protocol consists of two parts:
- New method - DataFrame.__dataframe__() which produces the interchange object. It effectively “exports” the pandas dataframe as an interchange object so any other library which has the protocol implemented can “import” that dataframe without knowing anything about the producer except that it makes an interchange object.
- New function - pandas.api.interchange.from_dataframe() which can take an arbitrary interchange object from any conformant library and construct a pandas DataFrame out of it.
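The two halves of the protocol can be seen together in a minimal sketch. The `ForeignFrame` class below is a hypothetical stand-in for a dataframe type from another library; to keep the example short it simply delegates to pandas' own `__dataframe__()`, whereas a real producer would build its own interchange object.

```python
import pandas as pd
from pandas.api.interchange import from_dataframe


class ForeignFrame:
    """Hypothetical stand-in for a dataframe type from another library."""

    def __init__(self, df):
        self._df = df

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        # Producer side: hand out an interchange object.
        # Here we reuse pandas' implementation instead of writing our own.
        return self._df.__dataframe__(allow_copy=allow_copy)


source = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
foreign = ForeignFrame(source)

# Consumer side: pandas only relies on the __dataframe__() method.
roundtripped = from_dataframe(foreign)
print(roundtripped)
```

The consumer never inspects the type of `foreign`; it only calls `__dataframe__()` on it, which is the point of the protocol.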
Styler#
The most notable development is the new method Styler.concat() which
allows adding customised footer rows to visualise additional calculations on the data,
e.g. totals and counts etc. (GH 43875, GH 46186)
Additionally there is an alternative output method Styler.to_string(),
which allows using the Styler’s formatting methods to create, for example, CSVs (GH 44502).
A new feature Styler.relabel_index() is also made available to provide full customisation of the display of
index or column headers (GH 47864)
Minor feature improvements are:
- Adding the ability to render border and border-{side} CSS properties in Excel (GH 42276)
- Making keyword arguments consistent: Styler.highlight_null() now accepts color and deprecates null_color, although this remains backwards compatible (GH 45907)
Control of index with group_keys in DataFrame.resample()#
The argument group_keys has been added to the method DataFrame.resample().
As with DataFrame.groupby(), this argument controls whether each group is added
to the index in the resample when Resampler.apply() is used.
Warning
Not specifying the group_keys argument will retain the
previous behavior and emit a warning if the result will change
by specifying group_keys=False. In a future version
of pandas, not specifying group_keys will default to
the same behavior as group_keys=False.
In [11]: df = pd.DataFrame(
   ....:     {'a': range(6)},
   ....:     index=pd.date_range("2021-01-01", periods=6, freq="8H")
   ....: )
   ....:
In [12]: df.resample("D", group_keys=True).apply(lambda x: x)
Out[12]:
                                a
2021-01-01 2021-01-01 00:00:00  0
           2021-01-01 08:00:00  1
           2021-01-01 16:00:00  2
2021-01-02 2021-01-02 00:00:00  3
           2021-01-02 08:00:00  4
           2021-01-02 16:00:00  5
In [13]: df.resample("D", group_keys=False).apply(lambda x: x)
Out[13]:
                     a
2021-01-01 00:00:00  0
2021-01-01 08:00:00  1
2021-01-01 16:00:00  2
2021-01-02 00:00:00  3
2021-01-02 08:00:00  4
2021-01-02 16:00:00  5
Previously, the resulting index would depend upon the values returned by apply,
as seen in the following example.
In [1]: # pandas 1.3
In [2]: df.resample("D").apply(lambda x: x)
Out[2]:
                     a
2021-01-01 00:00:00  0
2021-01-01 08:00:00  1
2021-01-01 16:00:00  2
2021-01-02 00:00:00  3
2021-01-02 08:00:00  4
2021-01-02 16:00:00  5
In [3]: df.resample("D").apply(lambda x: x.reset_index())
Out[3]:
                           index  a
2021-01-01 0 2021-01-01 00:00:00  0
           1 2021-01-01 08:00:00  1
           2 2021-01-01 16:00:00  2
2021-01-02 0 2021-01-02 00:00:00  3
           1 2021-01-02 08:00:00  4
           2 2021-01-02 16:00:00  5
from_dummies#
Added new function from_dummies() to convert a dummy coded DataFrame into a categorical DataFrame.
In [11]: import pandas as pd
In [12]: df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0],
   ....:                    "col2_a": [0, 1, 0], "col2_b": [1, 0, 0],
   ....:                    "col2_c": [0, 0, 1]})
   ....: 
In [13]: pd.from_dummies(df, sep="_")
Out[13]: 
  col1 col2
0    a    b
1    b    a
2    a    c
Writing to ORC files#
The new method DataFrame.to_orc() allows writing to ORC files (GH 43864).
This functionality depends on the pyarrow library. For more details, see the IO docs on ORC.
Warning
- It is highly recommended to install pyarrow using conda due to some issues caused by pyarrow.
- to_orc() requires pyarrow>=7.0.0.
- to_orc() is not supported on Windows yet; you can find valid environments on install optional dependencies.
- For supported dtypes please refer to supported ORC features in Arrow.
- Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files.
df = pd.DataFrame(data={"col1": [1, 2], "col2": [3, 4]})
df.to_orc("./out.orc")
Reading directly from TAR archives#
I/O methods like read_csv() or DataFrame.to_json() now allow reading and writing
directly on TAR archives (GH 44787).
df = pd.read_csv("./movement.tar.gz")
# ...
df.to_csv("./out.tar.gz")
This supports .tar, .tar.gz, .tar.bz2 and .tar.xz archives.
The compression method used is inferred from the filename.
If the compression method cannot be inferred, use the compression argument:
df = pd.read_csv(some_file_obj, compression={"method": "tar", "mode": "r:gz"}) # noqa F821
(mode being one of tarfile.open’s modes: https://docs.python.org/3/library/tarfile.html#tarfile.open)
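As a self-contained sketch of the explicit compression argument, the snippet below builds a small gzip-compressed TAR archive in memory with the standard library and reads its single CSV member. The member name data.csv and the sample values are made up for illustration.

```python
import io
import tarfile

import pandas as pd

csv_bytes = b"x,y\n1,2\n3,4\n"

# Build a gzip-compressed TAR archive in memory holding a single CSV member.
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w:gz") as tar:
    member = tarfile.TarInfo(name="data.csv")  # hypothetical member name
    member.size = len(csv_bytes)
    tar.addfile(member, io.BytesIO(csv_bytes))
archive.seek(0)

# A file object has no filename to infer from, so the method and the
# tarfile mode are passed explicitly.
tar_df = pd.read_csv(archive, compression={"method": "tar", "mode": "r:gz"})
print(tar_df)
```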
read_xml now supports dtype, converters, and parse_dates#
Similar to other IO methods, pandas.read_xml() now supports assigning specific dtypes to columns,
applying converter methods, and parsing dates (GH 43567).
In [14]: from io import StringIO
In [15]: xml_dates = """<?xml version='1.0' encoding='utf-8'?>
   ....: <data>
   ....:   <row>
   ....:     <shape>square</shape>
   ....:     <degrees>00360</degrees>
   ....:     <sides>4.0</sides>
   ....:     <date>2020-01-01</date>
   ....:    </row>
   ....:   <row>
   ....:     <shape>circle</shape>
   ....:     <degrees>00360</degrees>
   ....:     <sides/>
   ....:     <date>2021-01-01</date>
   ....:   </row>
   ....:   <row>
   ....:     <shape>triangle</shape>
   ....:     <degrees>00180</degrees>
   ....:     <sides>3.0</sides>
   ....:     <date>2022-01-01</date>
   ....:   </row>
   ....: </data>"""
   ....: 
In [16]: df = pd.read_xml(
   ....:     StringIO(xml_dates),
   ....:     dtype={'sides': 'Int64'},
   ....:     converters={'degrees': str},
   ....:     parse_dates=['date']
   ....: )
   ....: 
In [17]: df
Out[17]: 
      shape degrees  sides       date
0    square   00360      4 2020-01-01
1    circle   00360   <NA> 2021-01-01
2  triangle   00180      3 2022-01-01
In [18]: df.dtypes
Out[18]: 
shape              object
degrees            object
sides               Int64
date       datetime64[ns]
dtype: object
read_xml now supports large XML using iterparse#
For very large XML files that can range from hundreds of megabytes to gigabytes, pandas.read_xml()
now supports parsing such sizeable files using lxml’s iterparse and etree’s iterparse,
which are memory-efficient methods to iterate through XML trees and extract specific elements
and attributes without holding the entire tree in memory (GH 45442).
In [1]: df = pd.read_xml(
...      "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
...      iterparse = {"page": ["title", "ns", "id"]}
...  )
In [2]: df
Out[2]:
                                                     title   ns        id
0                                       Gettysburg Address    0     21450
1                                                Main Page    0     42950
2                            Declaration by United Nations    0      8435
3             Constitution of the United States of America    0      8435
4                     Declaration of Independence (Israel)    0     17858
...                                                    ...  ...       ...
3578760               Page:Black cat 1897 07 v2 n10.pdf/17  104    219649
3578761               Page:Black cat 1897 07 v2 n10.pdf/43  104    219649
3578762               Page:Black cat 1897 07 v2 n10.pdf/44  104    219649
3578763      The History of Tom Jones, a Foundling/Book IX    0  12084291
3578764  Page:Shakespeare of Stratford (1926) Yale.djvu/91  104     21450
[3578765 rows x 3 columns]
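The same call can be tried on a small file. The sketch below writes a tiny, made-up XML document to a temporary file (iterparse expects a file on local disk) and uses the standard library etree parser so that lxml is not needed; the file name shapes.xml and its contents are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

xml_text = """<?xml version='1.0' encoding='utf-8'?>
<data>
  <row><shape>square</shape><degrees>360</degrees><sides>4</sides></row>
  <row><shape>triangle</shape><degrees>180</degrees><sides>3</sides></row>
  <row><shape>pentagon</shape><degrees>540</degrees><sides>5</sides></row>
</data>"""

with tempfile.TemporaryDirectory() as tmp_dir:
    xml_path = os.path.join(tmp_dir, "shapes.xml")  # hypothetical file name
    with open(xml_path, "w", encoding="utf-8") as handle:
        handle.write(xml_text)

    # The dict key is the repeating element; the list names the
    # descendant elements (or attributes) to extract for each one.
    shapes = pd.read_xml(
        xml_path,
        iterparse={"row": ["shape", "degrees", "sides"]},
        parser="etree",
    )

print(shapes)
```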
Copy on Write#
A new feature copy_on_write was added (GH 46958). Copy on write ensures that
any DataFrame or Series derived from another in any way always behaves as a copy.
Copy on write disallows updating any other object than the object the method
was applied to.
Copy on write can be enabled through:
pd.set_option("mode.copy_on_write", True)
pd.options.mode.copy_on_write = True
Alternatively, copy on write can be enabled locally through:
with pd.option_context("mode.copy_on_write", True):
    ...
Without copy on write, the parent DataFrame is updated when updating a child
DataFrame that was derived from this DataFrame.
In [19]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": 1})
In [20]: view = df["foo"]
In [21]: view.iloc[0] = 10
In [22]: df
Out[22]: 
   foo  bar
0   10    1
1    2    1
2    3    1
With copy on write enabled, df won’t be updated anymore:
In [23]: with pd.option_context("mode.copy_on_write", True):
   ....:     df = pd.DataFrame({"foo": [1, 2, 3], "bar": 1})
   ....:     view = df["foo"]
   ....:     view.iloc[0] = 10
   ....:     df
   ....: 
A more detailed explanation can be found here.
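The effect can be checked with a short script. This is a minimal sketch that assumes the mode.copy_on_write option is available in the installed pandas version; the names parent and child are illustrative.

```python
import pandas as pd

with pd.option_context("mode.copy_on_write", True):
    parent = pd.DataFrame({"foo": [1, 2, 3], "bar": 1})
    child = parent["foo"]

    # Under copy on write this assignment only affects `child`.
    child.iloc[0] = 10

    parent_foo = parent["foo"].tolist()
    child_foo = child.tolist()

print(parent_foo)
print(child_foo)
```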
Other enhancements#
- Series.map() now raises when arg is dict but na_action is not either None or 'ignore' (GH 46588)
- MultiIndex.to_frame() now supports the argument allow_duplicates and raises on duplicate labels if it is missing or False (GH 45245)
- StringArray now accepts array-likes containing nan-likes (None, np.nan) for the values parameter in its constructor in addition to strings and pandas.NA. (GH 40839)
- Improved the rendering of categories in CategoricalIndex (GH 45218)
- DataFrame.plot() will now allow the subplots parameter to be a list of iterables specifying column groups, so that columns may be grouped together in the same subplot (GH 29688).
- to_numeric() now preserves float64 arrays when downcasting would generate values not representable in float32 (GH 43693)
- Series.reset_index() and DataFrame.reset_index() now support the argument allow_duplicates (GH 44410)
- DataFrameGroupBy.min(), SeriesGroupBy.min(), DataFrameGroupBy.max(), and SeriesGroupBy.max() now support Numba execution with the engine keyword (GH 45428)
- read_csv() now supports defaultdict as a dtype parameter (GH 41574)
- DataFrame.rolling() and Series.rolling() now support a step parameter with fixed-length windows (GH 15354)
- Implemented a bool-dtype Index; passing a bool-dtype array-like to pd.Index will now retain bool dtype instead of casting to object (GH 45061)
- Implemented a complex-dtype Index; passing a complex-dtype array-like to pd.Index will now retain complex dtype instead of casting to object (GH 45845)
- Series and DataFrame with IntegerDtype now support bitwise operations (GH 34463)
- Add milliseconds field support for DateOffset (GH 43371)
- DataFrame.where() tries to maintain dtype of DataFrame if fill value can be cast without loss of precision (GH 45582)
- DataFrame.reset_index() now accepts a names argument which renames the index names (GH 6878)
- concat() now raises when levels is given but keys is None (GH 46653)
- concat() now raises when levels contains duplicate values (GH 46653)
- Added numeric_only argument to DataFrame.corr(), DataFrame.corrwith(), DataFrame.cov(), DataFrame.idxmin(), DataFrame.idxmax(), DataFrameGroupBy.idxmin(), DataFrameGroupBy.idxmax(), DataFrameGroupBy.var(), SeriesGroupBy.var(), DataFrameGroupBy.std(), SeriesGroupBy.std(), DataFrameGroupBy.sem(), SeriesGroupBy.sem(), and DataFrameGroupBy.quantile() (GH 46560)
- An errors.PerformanceWarning is now thrown when using string[pyarrow] dtype with methods that don’t dispatch to pyarrow.compute methods (GH 42613, GH 46725)
- Added validate argument to DataFrame.join() (GH 46622)
- Added numeric_only argument to Resampler.sum(), Resampler.prod(), Resampler.min(), Resampler.max(), Resampler.first(), and Resampler.last() (GH 46442)
- times argument in ExponentialMovingWindow now accepts np.timedelta64 (GH 47003)
- DataError, SpecificationError, SettingWithCopyError, SettingWithCopyWarning, NumExprClobberingError, UndefinedVariableError, IndexingError, PyperclipException, PyperclipWindowsException, CSSWarning, PossibleDataLossError, ClosedFileError, IncompatibilityWarning, AttributeConflictWarning, DatabaseError, PossiblePrecisionLoss, ValueLabelTypeMismatch, InvalidColumnName, and CategoricalConversionWarning are now exposed in pandas.errors (GH 27656)
- Added check_like argument to testing.assert_series_equal() (GH 47247)
- Add support for DataFrameGroupBy.ohlc() and SeriesGroupBy.ohlc() for extension array dtypes (GH 37493)
- Allow reading compressed SAS files with read_sas() (e.g., .sas7bdat.gz files)
- pandas.read_html() now supports extracting links from table cells (GH 13141)
- DatetimeIndex.astype() now supports casting timezone-naive indexes to datetime64[s], datetime64[ms], and datetime64[us], and timezone-aware indexes to the corresponding datetime64[unit, tzname] dtypes (GH 47579)
- Series reducers (e.g. min, max, sum, mean) will now successfully operate when the dtype is numeric and numeric_only=True is provided; previously this would raise a NotImplementedError (GH 47500)
- RangeIndex.union() now can return a RangeIndex instead of an Int64Index if the resulting values are equally spaced (GH 47557, GH 43885)
- DataFrame.compare() now accepts an argument result_names to allow the user to specify the result’s names of both left and right DataFrame which are being compared. This is by default 'self' and 'other' (GH 44354)
- DataFrame.quantile() gained a method argument that can accept table to evaluate multi-column quantiles (GH 43881)
- Interval now supports checking whether one interval is contained by another interval (GH 46613)
- Added copy keyword to Series.set_axis() and DataFrame.set_axis() to allow user to set axis on a new object without necessarily copying the underlying data (GH 47932)
- The method ExtensionArray.factorize() accepts use_na_sentinel=False for determining how null values are to be treated (GH 46601)
- The Dockerfile now installs a dedicated pandas-dev virtual environment for pandas development instead of using the base environment (GH 48427)
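A few of the enhancements listed above can be tried in isolation. The following is a small sketch with made-up data, covering the step parameter of rolling(), the names argument of reset_index(), the result_names argument of compare(), and Interval containment.

```python
import pandas as pd

# rolling(step=...): evaluate the window only at every `step`-th row.
ser = pd.Series(range(6))
stepped = ser.rolling(window=2, step=2).sum()

# reset_index(names=...): choose the name of the column created from the index.
df = pd.DataFrame({"value": [1.0, 2.0]}, index=[10, 20])
flat = df.reset_index(names="id")

# compare(result_names=...): label the two sides of the comparison.
other = pd.DataFrame({"value": [1.0, 5.0]}, index=[10, 20])
diff = df.compare(other, result_names=("left", "right"))

# Interval containment: check whether one interval lies inside another.
inner_is_contained = pd.Interval(1, 2) in pd.Interval(0, 3)

print(stepped)
print(flat)
print(diff)
print(inner_is_contained)
```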
Notable bug fixes#
These are bug fixes that might have notable behavior changes.
Using dropna=True with groupby transforms#
A transform is an operation whose result has the same size as its input. When the
result is a DataFrame or Series, it is also required that the
index of the result matches that of the input. In pandas 1.4, using
DataFrameGroupBy.transform() or SeriesGroupBy.transform() with null
values in the groups and dropna=True gave incorrect results. As the examples below
demonstrate, the incorrect results either contained incorrect values or
did not have the same index as the input.
In [24]: df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [2, 3, 4]})
Old behavior:
In [3]: # Value in the last row should be np.nan
        df.groupby('a', dropna=True).transform('sum')
Out[3]:
   b
0  5
1  5
2  5
In [3]: # Should have one additional row with the value np.nan
        df.groupby('a', dropna=True).transform(lambda x: x.sum())
Out[3]:
   b
0  5
1  5
In [3]: # The value in the last row is np.nan interpreted as an integer
        df.groupby('a', dropna=True).transform('ffill')
Out[3]:
                     b
0                    2
1                    3
2 -9223372036854775808
In [3]: # Should have one additional row with the value np.nan
        df.groupby('a', dropna=True).transform(lambda x: x)
Out[3]:
   b
0  2
1  3
New behavior:
In [25]: df.groupby('a', dropna=True).transform('sum')
Out[25]: 
     b
0  5.0
1  5.0
2  NaN
In [26]: df.groupby('a', dropna=True).transform(lambda x: x.sum())
Out[26]: 
     b
0  5.0
1  5.0
2  NaN
In [27]: df.groupby('a', dropna=True).transform('ffill')
Out[27]: 
     b
0  2.0
1  3.0
2  NaN
In [28]: df.groupby('a', dropna=True).transform(lambda x: x)
Out[28]: 
     b
0  2.0
1  3.0
2  NaN
Serializing tz-naive Timestamps with to_json() with iso_dates=True#
DataFrame.to_json(), Series.to_json(), and Index.to_json()
would incorrectly localize DatetimeArrays/DatetimeIndexes with tz-naive Timestamps
to UTC. (GH 38760)
Note that this patch does not fix the localization of tz-aware Timestamps to UTC upon serialization. (Related issue GH 12997)
Old Behavior
In [32]: index = pd.date_range(
   ....:     start='2020-12-28 00:00:00',
   ....:     end='2020-12-28 02:00:00',
   ....:     freq='1H',
   ....: )
   ....:
In [33]: a = pd.Series(
   ....:     data=range(3),
   ....:     index=index,
   ....: )
   ....:
In [4]: from io import StringIO
In [5]: a.to_json(date_format='iso')
Out[5]: '{"2020-12-28T00:00:00.000Z":0,"2020-12-28T01:00:00.000Z":1,"2020-12-28T02:00:00.000Z":2}'
In [6]: pd.read_json(StringIO(a.to_json(date_format='iso')), typ="series").index == a.index
Out[6]: array([False, False, False])
New Behavior
In [34]: from io import StringIO
In [35]: a.to_json(date_format='iso')
Out[35]: '{"2020-12-28T00:00:00.000":0,"2020-12-28T01:00:00.000":1,"2020-12-28T02:00:00.000":2}'
# Roundtripping now works
In [36]: pd.read_json(StringIO(a.to_json(date_format='iso')), typ="series").index == a.index
Out[36]: array([ True,  True,  True])
DataFrameGroupBy.value_counts with non-grouping categorical columns and observed=True#
Calling DataFrameGroupBy.value_counts() with observed=True would incorrectly drop non-observed categories of non-grouping columns (GH 46357).
In [6]: df = pd.DataFrame(["a", "b", "c"], dtype="category").iloc[0:2]
In [7]: df
Out[7]:
   0
0  a
1  b
Old Behavior
In [8]: df.groupby(level=0, observed=True).value_counts()
Out[8]:
0  a    1
1  b    1
dtype: int64
New Behavior
In [9]: df.groupby(level=0, observed=True).value_counts()
Out[9]:
0  a    1
1  a    0
   b    1
0  b    0
   c    0
1  c    0
dtype: int64
Backwards incompatible API changes#
Increased minimum versions for dependencies#
Some minimum supported versions of dependencies were updated. If installed, we now require:
| Package | Minimum Version | Required | Changed |
|---|---|---|---|
| numpy | 1.20.3 | X | X |
| mypy (dev) | 0.971 | | X |
| beautifulsoup4 | 4.9.3 | | X |
| blosc | 1.21.0 | | X |
| bottleneck | 1.3.2 | | X |
| fsspec | 2021.07.0 | | X |
| hypothesis | 6.13.0 | | X |
| gcsfs | 2021.07.0 | | X |
| jinja2 | 3.0.0 | | X |
| lxml | 4.6.3 | | X |
| numba | 0.53.1 | | X |
| numexpr | 2.7.3 | | X |
| openpyxl | 3.0.7 | | X |
| pandas-gbq | 0.15.0 | | X |
| psycopg2 | 2.8.6 | | X |
| pymysql | 1.0.2 | | X |
| pyreadstat | 1.1.2 | | X |
| pyxlsb | 1.0.8 | | X |
| s3fs | 2021.08.0 | | X |
| scipy | 1.7.1 | | X |
| sqlalchemy | 1.4.16 | | X |
| tabulate | 0.8.9 | | X |
| xarray | 0.19.0 | | X |
| xlsxwriter | 1.4.3 | | X |
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
| Package | Minimum Version | Changed | 
|---|---|---|
| beautifulsoup4 | 4.9.3 | X | 
| blosc | 1.21.0 | X | 
| bottleneck | 1.3.2 | X | 
| brotlipy | 0.7.0 | |
| fastparquet | 0.4.0 | |
| fsspec | 2021.07.0 | X | 
| html5lib | 1.1 | |
| hypothesis | 6.13.0 | X | 
| gcsfs | 2021.07.0 | X | 
| jinja2 | 3.0.0 | X | 
| lxml | 4.6.3 | X | 
| matplotlib | 3.3.2 | |
| numba | 0.53.1 | X | 
| numexpr | 2.7.3 | X | 
| odfpy | 1.4.1 | |
| openpyxl | 3.0.7 | X | 
| pandas-gbq | 0.15.0 | X | 
| psycopg2 | 2.8.6 | X | 
| pyarrow | 1.0.1 | |
| pymysql | 1.0.2 | X | 
| pyreadstat | 1.1.2 | X | 
| pytables | 3.6.1 | |
| python-snappy | 0.6.0 | |
| pyxlsb | 1.0.8 | X | 
| s3fs | 2021.08.0 | X | 
| scipy | 1.7.1 | X | 
| sqlalchemy | 1.4.16 | X | 
| tabulate | 0.8.9 | X | 
| tzdata | 2022a | |
| xarray | 0.19.0 | X | 
| xlrd | 2.0.1 | |
| xlsxwriter | 1.4.3 | X | 
| xlwt | 1.3.0 | |
| zstandard | 0.15.2 | |
See Dependencies and Optional dependencies for more.
Other API changes#
- BigQuery I/O methods read_gbq() and DataFrame.to_gbq() default to auth_local_webserver = True. Google has deprecated the auth_local_webserver = False “out of band” (copy-paste) flow. The auth_local_webserver = False option is planned to stop working in October 2022. (GH 46312)
- read_json() now raises FileNotFoundError (previously ValueError) when input is a string ending in .json, .json.gz, .json.bz2, etc. but no such file exists. (GH 29102)
- Operations with Timestamp or Timedelta that would previously raise OverflowError instead raise OutOfBoundsDatetime or OutOfBoundsTimedelta where appropriate (GH 47268)
- When read_sas() previously returned None, it now returns an empty DataFrame (GH 47410)
- DataFrame constructor raises if index or columns arguments are sets (GH 47215)
Deprecations#
Warning
In the next major version release, 2.0, several larger API changes are being considered without a formal deprecation such as
making the standard library zoneinfo the default timezone implementation instead of pytz,
having the Index support all data types instead of having multiple subclasses (CategoricalIndex, Int64Index, etc.), and more.
The changes under consideration are logged in this GitHub issue, and any
feedback or concerns are welcome.
Label-based integer slicing on a Series with an Int64Index or RangeIndex#
In a future version, integer slicing on a Series with an Int64Index or RangeIndex will be treated as label-based, not positional. This will make the behavior consistent with other Series.__getitem__() and Series.__setitem__() behaviors (GH 45162).
For example:
In [29]: ser = pd.Series([1, 2, 3, 4, 5], index=[2, 3, 5, 7, 11])
In the old behavior, ser[2:4] treats the slice as positional:
Old behavior:
In [3]: ser[2:4]
Out[3]:
5    3
7    4
dtype: int64
In a future version, this will be treated as label-based:
Future behavior:
In [4]: ser.loc[2:4]
Out[4]:
2    1
3    2
dtype: int64
To retain the old behavior, use series.iloc[i:j]. To get the future behavior,
use series.loc[i:j].
Slicing on a DataFrame will not be affected.
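Spelling out the intent with .iloc or .loc removes the ambiguity regardless of the pandas version in use. A short sketch using the same Series as above:

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5], index=[2, 3, 5, 7, 11])

# Positional: rows at positions 2 and 3 (the end position is excluded).
positional = ser.iloc[2:4]

# Label-based: labels from 2 through 4 (both endpoints are included).
by_label = ser.loc[2:4]

print(positional)
print(by_label)
```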
ExcelWriter attributes#
All attributes of ExcelWriter were previously documented as not
public. However some third party Excel engines documented accessing
ExcelWriter.book or ExcelWriter.sheets, and users were utilizing these
and possibly other attributes. Previously these attributes were not safe to use;
e.g. modifications to ExcelWriter.book would not update ExcelWriter.sheets
and conversely. In order to support this, pandas has made some attributes public
and improved their implementations so that they may now be safely used. (GH 45572)
The following attributes are now public and considered safe to access.
book
check_extension
close
date_format
datetime_format
engine
if_sheet_exists
sheets
supported_extensions
The following attributes have been deprecated. They now raise a FutureWarning
when accessed and will be removed in a future version. Users should be aware
that their usage is considered unsafe, and can lead to unexpected results.
cur_sheet
handles
path
save
write_cells
See the documentation of ExcelWriter for further details.
Using group_keys with transformers in DataFrameGroupBy.apply() and SeriesGroupBy.apply()#
In previous versions of pandas, if it was inferred that the function passed to
DataFrameGroupBy.apply() or SeriesGroupBy.apply() was a transformer (i.e. the resulting index was equal to
the input index), the group_keys argument of DataFrame.groupby() and
Series.groupby() was ignored and the group keys would never be added to
the index of the result. In the future, the group keys will be added to the index
when the user specifies group_keys=True.
As group_keys=True is the default value of DataFrame.groupby() and
Series.groupby(), not specifying group_keys with a transformer will
raise a FutureWarning. This can be silenced and the previous behavior
retained by specifying group_keys=False.
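Passing group_keys explicitly avoids both the warning and any ambiguity. The sketch below uses a made-up Series and a simple transformer; the group_keys=True result shows the behavior described above as the future one.

```python
import pandas as pd

ser = pd.Series([1, 2, 3], index=["x", "y", "z"])
keys = ["a", "a", "b"]


def double(group):
    # A transformer: the result has the same index as its input.
    return group * 2


# group_keys=False keeps the original index.
without_keys = ser.groupby(keys, group_keys=False).apply(double)

# group_keys=True adds the group key as an outer index level.
with_keys = ser.groupby(keys, group_keys=True).apply(double)

print(without_keys)
print(with_keys)
```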
Inplace operation when setting values with loc and iloc#
Most of the time setting values with DataFrame.iloc() attempts to set values
inplace, only falling back to inserting a new array if necessary. There are
some cases where this rule is not followed, for example when setting an entire
column from an array with different dtype:
In [30]: df = pd.DataFrame({'price': [11.1, 12.2]}, index=['book1', 'book2'])
In [31]: original_prices = df['price']
In [32]: new_prices = np.array([98, 99])
Old behavior:
In [3]: df.iloc[:, 0] = new_prices
In [4]: df.iloc[:, 0]
Out[4]:
book1    98
book2    99
Name: price, dtype: int64
In [5]: original_prices
Out[5]:
book1    11.1
book2    12.2
Name: price, dtype: float64
This behavior is deprecated. In a future version, setting an entire column with iloc will attempt to operate inplace.
Future behavior:
In [3]: df.iloc[:, 0] = new_prices
In [4]: df.iloc[:, 0]
Out[4]:
book1    98.0
book2    99.0
Name: price, dtype: float64
In [5]: original_prices
Out[5]:
book1    98.0
book2    99.0
Name: price, dtype: float64
To get the old behavior, use DataFrame.__setitem__() directly:
In [3]: df[df.columns[0]] = new_prices
In [4]: df.iloc[:, 0]
Out[4]:
book1    98
book2    99
Name: price, dtype: int64
In [5]: original_prices
Out[5]:
book1    11.1
book2    12.2
Name: price, dtype: float64
To get the old behaviour when df.columns is not unique and you want to
change a single column by index, you can use DataFrame.isetitem(), which
has been added in pandas 1.5:
In [3]: df_with_duplicated_cols = pd.concat([df, df], axis='columns')
In [3]: df_with_duplicated_cols.isetitem(0, new_prices)
In [4]: df_with_duplicated_cols.iloc[:, 0]
Out[4]:
book1    98
book2    99
Name: price, dtype: int64
In [5]: original_prices
Out[5]:
book1    11.1
book2    12.2
Name: price, dtype: float64
numeric_only default value#
Across the DataFrame, DataFrameGroupBy, and Resampler operations such as
min, sum, and idxmax, the default
value of the numeric_only argument, if it exists at all, was inconsistent.
Furthermore, operations with the default value None can lead to surprising
results. (GH 46560)
In [1]: df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
In [2]: # Reading the next line without knowing the contents of df, one would
        # expect the result to contain the products for both columns a and b.
        df[["a", "b"]].prod()
Out[2]:
a    2
dtype: int64
To avoid this behavior, specifying the value numeric_only=None has been
deprecated, and will be removed in a future version of pandas. In the future,
all operations with a numeric_only argument will default to False. Users
should either call the operation only with columns that can be operated on, or
specify numeric_only=True to operate only on Boolean, integer, and float columns.
In order to support the transition to the new behavior, the following methods have
gained the numeric_only argument.
- DataFrame.rolling() operations
- DataFrame.expanding() operations
- DataFrame.ewm() operations
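Both recommended approaches can be sketched on the small frame used above: either pass numeric_only=True explicitly or select the columns to operate on first. The rolling call illustrates the newly added argument.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Option 1: state explicitly that only numeric columns should be used.
explicit = df.prod(numeric_only=True)

# Option 2: select the columns that support the operation.
selected = df[["a"]].prod()

# Rolling operations accept numeric_only as well.
rolled = df.rolling(2).sum(numeric_only=True)

print(explicit)
print(selected)
print(rolled)
```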
Other Deprecations#
- Deprecated the keyword - line_terminatorin- DataFrame.to_csv()and- Series.to_csv(), use- lineterminatorinstead; this is for consistency with- read_csv()and the standard library ‘csv’ module (GH 9568)
- Deprecated behavior of - SparseArray.astype(),- Series.astype(), and- DataFrame.astype()with- SparseDtypewhen passing a non-sparse- dtype. In a future version, this will cast to that non-sparse dtype instead of wrapping it in a- SparseDtype(GH 34457)
- Deprecated behavior of - DatetimeIndex.intersection()and- DatetimeIndex.symmetric_difference()(- unionbehavior was already deprecated in version 1.3.0) with mixed time zones; in a future version both will be cast to UTC instead of object dtype (GH 39328, GH 45357)
- Deprecated - DataFrame.iteritems(),- Series.iteritems(),- HDFStore.iteritems()in favor of- DataFrame.items(),- Series.items(),- HDFStore.items()(GH 45321)
- Deprecated - Series.is_monotonic()and- Index.is_monotonic()in favor of- Series.is_monotonic_increasing()and- Index.is_monotonic_increasing()(GH 45422, GH 21335)
- Deprecated behavior of - DatetimeIndex.astype(),- TimedeltaIndex.astype(),- PeriodIndex.astype()when converting to an integer dtype other than- int64. In a future version, these will convert to exactly the specified dtype (instead of always- int64) and will raise if the conversion overflows (GH 45034)
- Deprecated the - __array_wrap__method of DataFrame and Series, rely on standard numpy ufuncs instead (GH 45451)
- Deprecated treating float-dtype data as wall-times when passed with a timezone to - Seriesor- DatetimeIndex(GH 45573)
- Deprecated the behavior of - Series.fillna()and- DataFrame.fillna()with- timedelta64[ns]dtype and incompatible fill value; in a future version this will cast to a common dtype (usually object) instead of raising, matching the behavior of other dtypes (GH 45746)
- Deprecated the - warnparameter in- infer_freq()(GH 45947)
- Deprecated allowing non-keyword arguments in - ExtensionArray.argsort()(GH 46134)
- Deprecated treating all-bool - object-dtype columns as bool-like in- DataFrame.any()and- DataFrame.all()with- bool_only=True, explicitly cast to bool instead (GH 46188)
- Deprecated behavior of method - DataFrame.quantile(), attribute- numeric_onlywill default False. Including datetime/timedelta columns in the result (GH 7308).
- Deprecated - Timedelta.freqand- Timedelta.is_populated(GH 46430)
- Deprecated - Timedelta.delta(GH 46476)
- Deprecated passing arguments as positional in - DataFrame.any()and- Series.any()(GH 44802)
- Deprecated passing positional arguments to - DataFrame.pivot()and- pivot()except- data(GH 30228)
- Deprecated the methods - DataFrame.mad(),- Series.mad(), and the corresponding groupby methods (GH 11787)
- Deprecated positional arguments to - Index.join()except for- other, use keyword-only arguments instead of positional arguments (GH 46518)
- Deprecated positional arguments to - StringMethods.rsplit()and- StringMethods.split()except for- pat, use keyword-only arguments instead of positional arguments (GH 47423)
- Deprecated indexing on a timezone-naive - DatetimeIndexusing a string representing a timezone-aware datetime (GH 46903, GH 36148)
- Deprecated allowing - unit="M"or- unit="Y"in- Timestampconstructor with a non-round float value (GH 47267)
- Deprecated the - display.column_spaceglobal configuration option (GH 7576)
- Deprecated the argument na_sentinel in factorize(), Index.factorize(), and ExtensionArray.factorize(); pass use_na_sentinel=True instead to use the sentinel -1 for NaN values, and use_na_sentinel=False instead of na_sentinel=None to encode NaN values (GH 46910)
- Deprecated - DataFrameGroupBy.transform()not aligning the result when the UDF returned DataFrame (GH 45648)
- Clarified warning from - to_datetime()when delimited dates can’t be parsed in accordance to specified- dayfirstargument (GH 46210)
- Emit warning from - to_datetime()when delimited dates can’t be parsed in accordance to specified- dayfirstargument even for dates where leading zero is omitted (e.g.- 31/1/2001) (GH 47880)
- Deprecated Series and Resampler reducers (e.g. min, max, sum, mean) raising a NotImplementedError when the dtype is non-numeric and numeric_only=True is provided; this will raise a TypeError in a future version (GH 47500)
- Deprecated - Series.rank()returning an empty result when the dtype is non-numeric and- numeric_only=Trueis provided; this will raise a- TypeErrorin a future version (GH 47500)
- Deprecated argument errors for Series.mask(), Series.where(), DataFrame.mask(), and DataFrame.where(), as errors had no effect on these methods (GH 47728)
- Deprecated arguments - *argsand- **kwargsin- Rolling,- Expanding, and- ExponentialMovingWindowops. (GH 47836)
- Deprecated the - inplacekeyword in- Categorical.set_ordered(),- Categorical.as_ordered(), and- Categorical.as_unordered()(GH 37643)
- Deprecated setting a categorical’s categories with - cat.categories = ['a', 'b', 'c'], use- Categorical.rename_categories()instead (GH 37643)
- Deprecated unused arguments - encodingand- verbosein- Series.to_excel()and- DataFrame.to_excel()(GH 47912)
- Deprecated the - inplacekeyword in- DataFrame.set_axis()and- Series.set_axis(), use- obj = obj.set_axis(..., copy=False)instead (GH 48130)
- Deprecated producing a single element when iterating over a DataFrameGroupBy or a SeriesGroupBy that has been grouped by a list of length 1; a tuple of length one will be returned instead (GH 42795)
- Fixed up warning message of deprecation of MultiIndex.lexsort_depth() as public method, as the message previously referred to MultiIndex.is_lexsorted() instead (GH 38701)
- Deprecated the - sort_columnsargument in- DataFrame.plot()and- Series.plot()(GH 47563).
- Deprecated positional arguments for all but the first argument of - DataFrame.to_stata()and- read_stata(), use keyword arguments instead (GH 48128).
- Deprecated the - mangle_dupe_colsargument in- read_csv(),- read_fwf(),- read_table()and- read_excel(). The argument was never implemented, and a new argument where the renaming pattern can be specified will be added instead (GH 47718)
- Deprecated allowing - dtype='datetime64'or- dtype=np.datetime64in- Series.astype(), use “datetime64[ns]” instead (GH 47844)
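Several of the deprecations above name a replacement spelling. The following is a minimal sketch of three of them (use_na_sentinel in factorize(), Categorical.rename_categories(), and reassigning the result of set_axis()); the sample values and variable names are illustrative assumptions, not taken from the release notes.

```python
import numpy as np
import pandas as pd

# factorize(): use_na_sentinel replaces the deprecated na_sentinel argument.
values = np.array(["a", None, "b", "a"], dtype=object)
# use_na_sentinel=True marks missing values with the sentinel code -1.
codes, uniques = pd.factorize(values, use_na_sentinel=True)
# use_na_sentinel=False gives missing values a regular (non-negative) code.
codes_na, uniques_na = pd.factorize(values, use_na_sentinel=False)

# Categorical: rename through the method instead of assigning to .categories.
cat = pd.Categorical(["a", "b", "a"])
renamed = cat.rename_categories(["x", "y"])

# set_axis(): reassign the returned object instead of passing inplace=True.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df = df.set_axis(["left", "right"], axis=1)
```

The position of the missing-value code returned with use_na_sentinel=False may differ between pandas versions, so the sketch does not rely on it.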
Performance improvements#
- Performance improvement in - DataFrame.corrwith()for column-wise (axis=0) Pearson and Spearman correlation when other is a- Series(GH 46174)
- Performance improvement in - DataFrameGroupBy.transform()and- SeriesGroupBy.transform()for some user-defined DataFrame -> Series functions (GH 45387)
- Performance improvement in - DataFrame.duplicated()when subset consists of only one column (GH 45236)
- Performance improvement in - DataFrameGroupBy.diff()and- SeriesGroupBy.diff()(GH 16706)
- Performance improvement in - DataFrameGroupBy.transform()and- SeriesGroupBy.transform()when broadcasting values for user-defined functions (GH 45708)
- Performance improvement in - DataFrameGroupBy.transform()and- SeriesGroupBy.transform()for user-defined functions when only a single group exists (GH 44977)
- Performance improvement in - DataFrameGroupBy.apply()and- SeriesGroupBy.apply()when grouping on a non-unique unsorted index (GH 46527)
- Performance improvement in - DataFrame.loc()and- Series.loc()for tuple-based indexing of a- MultiIndex(GH 45681, GH 46040, GH 46330)
- Performance improvement in - DataFrameGroupBy.var()and- SeriesGroupBy.var()with- ddofother than one (GH 48152)
- Performance improvement in - DataFrame.to_records()when the index is a- MultiIndex(GH 47263)
- Performance improvement in - MultiIndex.valueswhen the MultiIndex contains levels of type DatetimeIndex, TimedeltaIndex or ExtensionDtypes (GH 46288)
- Performance improvement in - merge()when left and/or right are empty (GH 45838)
- Performance improvement in - DataFrame.join()when left and/or right are empty (GH 46015)
- Performance improvement in - DataFrame.reindex()and- Series.reindex()when target is a- MultiIndex(GH 46235)
- Performance improvement when setting values in a pyarrow backed string array (GH 46400) 
- Performance improvement in - factorize()(GH 46109)
- Performance improvement in - DataFrameand- Seriesconstructors for extension dtype scalars (GH 45854)
- Performance improvement in - read_excel()when- nrowsargument provided (GH 32727)
- Performance improvement in - Styler.to_excel()when applying repeated CSS formats (GH 47371)
- Performance improvement in - MultiIndex.is_monotonic_increasing()(GH 47458)
- Performance improvement in - BusinessHour- strand- repr(GH 44764)
- Performance improvement in datetime arrays string formatting when one of the default strftime formats - "%Y-%m-%d %H:%M:%S"or- "%Y-%m-%d %H:%M:%S.%f"is used. (GH 44764)
- Performance improvement in - Series.to_sql()and- DataFrame.to_sql()(- SQLiteTable) when processing time arrays. (GH 44764)
- Performance improvement to - read_sas()(GH 47404)
- Performance improvement in - argmaxand- argminfor- arrays.SparseArray(GH 34197)
Bug fixes#
Categorical#
- Bug in - Categorical.view()not accepting integer dtypes (GH 25464)
- Bug in - CategoricalIndex.union()when the index’s categories are integer-dtype and the index contains- NaNvalues incorrectly raising instead of casting to- float64(GH 45362)
- Bug in - concat()when concatenating two (or more) unordered- CategoricalIndexvariables, whose categories are permutations, yields incorrect index values (GH 24845)
Datetimelike#
- Bug in - DataFrame.quantile()with datetime-like dtypes and no rows incorrectly returning- float64dtype instead of retaining datetime-like dtype (GH 41544)
- Bug in - to_datetime()with sequences of- np.str_objects incorrectly raising (GH 32264)
- Bug in - Timestampconstruction when passing datetime components as positional arguments and- tzinfoas a keyword argument incorrectly raising (GH 31929)
- Bug in - Index.astype()when casting from object dtype to- timedelta64[ns]dtype incorrectly casting- np.datetime64("NaT")values to- np.timedelta64("NaT")instead of raising (GH 45722)
- Bug in - SeriesGroupBy.value_counts()index when passing categorical column (GH 44324)
- Bug in - DatetimeIndex.tz_localize()localizing to UTC failing to make a copy of the underlying data (GH 46460)
- Bug in - DatetimeIndex.resolution()incorrectly returning “day” instead of “nanosecond” for nanosecond-resolution indexes (GH 46903)
- Bug in - Timestampwith an integer or float value and- unit="Y"or- unit="M"giving slightly-wrong results (GH 47266)
- Bug in - DatetimeArrayconstruction when passed another- DatetimeArrayand- freq=Noneincorrectly inferring the freq from the given array (GH 47296)
- Bug in to_datetime() where OutOfBoundsDatetime would be thrown even if errors="coerce" if there were more than 50 rows (GH 45319)
- Bug when adding a - DateOffsetto a- Serieswould not add the- nanosecondsfield (GH 47856)
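As an illustration of the Timestamp construction fix listed above (GH 31929), here is a small sketch that passes the datetime components positionally and tzinfo as a keyword. The specific date and the use of datetime.timezone.utc are assumptions chosen for the example.

```python
from datetime import timedelta, timezone

import pandas as pd

# Components are positional (year, month, day, hour, minute);
# tzinfo is supplied as a keyword argument.
ts = pd.Timestamp(2022, 9, 19, 12, 30, tzinfo=timezone.utc)

# The resulting Timestamp is timezone-aware with a zero UTC offset.
offset = ts.utcoffset()
```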
Timedelta#
- Bug in astype_nansafe() where astype("timedelta64[ns]") failed when np.nan was included (GH 45798)
- Bug in constructing a - Timedeltawith a- np.timedelta64object and a- unitsometimes silently overflowing and returning incorrect results instead of raising- OutOfBoundsTimedelta(GH 46827)
- Bug in constructing a - Timedeltafrom a large integer or float with- unit="W"silently overflowing and returning incorrect results instead of raising- OutOfBoundsTimedelta(GH 47268)
Time Zones#
Numeric#
- Bug in operations with array-likes with - dtype="boolean"and- NAincorrectly altering the array in-place (GH 45421)
- Bug in arithmetic operations with nullable types without - NAvalues not matching the same operation with non-nullable types (GH 48223)
- Bug in - floordivwhen dividing by- IntegerDtype- 0would return- 0instead of- inf(GH 48223)
- Bug in division, - powand- modoperations on array-likes with- dtype="boolean"not being like their- np.bool_counterparts (GH 46063)
- Bug in multiplying a - Serieswith- IntegerDtypeor- FloatingDtypeby an array-like with- timedelta64[ns]dtype incorrectly raising (GH 45622)
- Bug in mean() where the optional dependency bottleneck causes precision loss linear in the length of the array. bottleneck has been disabled for mean(), improving the loss to log-linear, but this may result in a performance decrease (GH 42878)
Conversion#
- Bug in - DataFrame.astype()not preserving subclasses (GH 40810)
- Bug in constructing a - Seriesfrom a float-containing list or a floating-dtype ndarray-like (e.g.- dask.Array) and an integer dtype raising instead of casting like we would with an- np.ndarray(GH 40110)
- Bug in - Float64Index.astype()to unsigned integer dtype incorrectly casting to- np.int64dtype (GH 45309)
- Bug in - Series.astype()and- DataFrame.astype()from floating dtype to unsigned integer dtype failing to raise in the presence of negative values (GH 45151)
- Bug in - array()with- FloatingDtypeand values containing float-castable strings incorrectly raising (GH 45424)
- Bug when comparing string and datetime64ns objects causing - OverflowErrorexception. (GH 45506)
- Bug in metaclass of generic abstract dtypes causing - DataFrame.apply()and- Series.apply()to raise for the built-in function- type(GH 46684)
- Bug in - DataFrame.to_records()returning inconsistent numpy types if the index was a- MultiIndex(GH 47263)
- Bug in - DataFrame.to_dict()for- orient="list"or- orient="index"was not returning native types (GH 46751)
- Bug in - DataFrame.apply()that returns a- DataFrameinstead of a- Serieswhen applied to an empty- DataFrameand- axis=1(GH 39111)
- Bug when inferring the dtype from an iterable that is not a NumPy - ndarrayconsisting of all NumPy unsigned integer scalars did not result in an unsigned integer dtype (GH 47294)
- Bug in - DataFrame.eval()when pandas objects (e.g.- 'Timestamp') were column names (GH 44603)
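The DataFrame.to_dict() fix above (GH 46751) concerns the Python types of the returned values. The sketch below, using a made-up two-column frame, shows how one might check that orient="list" and orient="index" return native int and float objects rather than NumPy scalars.

```python
import pandas as pd

# Hypothetical frame with one integer and one float column.
df = pd.DataFrame({"count": [1, 2], "ratio": [0.5, 1.5]})

as_list = df.to_dict(orient="list")    # {column -> list of values}
as_index = df.to_dict(orient="index")  # {row label -> {column -> value}}

# Collect the concrete Python types of a few returned values.
list_type = type(as_list["count"][0])
index_types = (type(as_index[0]["count"]), type(as_index[0]["ratio"]))
```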
Strings#
- Bug in str.startswith() and str.endswith() when passing another Series as the pat parameter; this now raises TypeError (GH 3485)
- Bug in Series.str.zfill() when strings contain leading signs, padding '0' before the sign character rather than after it, as str.zfill from the standard library does (GH 20868)
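The Series.str.zfill() fix can be checked against the standard library directly. This is a minimal sketch with illustrative sample strings; the object dtype is set explicitly only to keep the example independent of the default string dtype.

```python
import pandas as pd

raw = ["-5", "+7", "12"]
ser = pd.Series(raw, dtype=object)

# Zeros are inserted after a leading sign, not before it.
padded = ser.str.zfill(4).tolist()

# Reference result from the built-in str.zfill.
expected = [value.zfill(4) for value in raw]
```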
Interval#
- Bug in - IntervalArray.__setitem__()when setting- np.naninto an integer-backed array raising- ValueErrorinstead of- TypeError(GH 45484)
- Bug in - IntervalDtypewhen using datetime64[ns, tz] as a dtype string (GH 46999)
Indexing#
- Bug in - DataFrame.iloc()where indexing a single row on a- DataFramewith a single ExtensionDtype column gave a copy instead of a view on the underlying data (GH 45241)
- Bug in - DataFrame.__getitem__()returning copy when- DataFramehas duplicated columns even if a unique column is selected (GH 45316, GH 41062)
- Bug in Series.align() not creating a MultiIndex with the union of levels when the intersections of both MultiIndexes are identical (GH 45224)
- Bug in setting a NA value ( - Noneor- np.nan) into a- Serieswith int-based- IntervalDtypeincorrectly casting to object dtype instead of a float-based- IntervalDtype(GH 45568)
- Bug in indexing setting values into an - ExtensionDtypecolumn with- df.iloc[:, i] = valueswith- valueshaving the same dtype as- df.iloc[:, i]incorrectly inserting a new array instead of setting in-place (GH 33457)
- Bug in - Series.__setitem__()with a non-integer- Indexwhen using an integer key to set a value that cannot be set inplace where a- ValueErrorwas raised instead of casting to a common dtype (GH 45070)
- Bug in - DataFrame.loc()not casting- Noneto- NAwhen setting value as a list into- DataFrame(GH 47987)
- Bug in - Series.__setitem__()when setting incompatible values into a- PeriodDtypeor- IntervalDtype- Seriesraising when indexing with a boolean mask but coercing when indexing with otherwise-equivalent indexers; these now consistently coerce, along with- Series.mask()and- Series.where()(GH 45768)
- Bug in - DataFrame.where()with multiple columns with datetime-like dtypes failing to downcast results consistent with other dtypes (GH 45837)
- Bug in - isin()upcasting to- float64with unsigned integer dtype and list-like argument without a dtype (GH 46485)
- Bug in - Series.loc.__setitem__()and- Series.loc.__getitem__()not raising when using multiple keys without using a- MultiIndex(GH 13831)
- Bug in - Index.reindex()raising- AssertionErrorwhen- levelwas specified but no- MultiIndexwas given; level is ignored now (GH 35132)
- Bug when setting a value too large for a - Seriesdtype failing to coerce to a common type (GH 26049, GH 32878)
- Bug in - loc.__setitem__()treating- rangekeys as positional instead of label-based (GH 45479)
- Bug in - DataFrame.__setitem__()casting extension array dtypes to object when setting with a scalar key and- DataFrameas value (GH 46896)
- Bug in - Series.__setitem__()when setting a scalar to a nullable pandas dtype would not raise a- TypeErrorif the scalar could not be cast (losslessly) to the nullable type (GH 45404)
- Bug in - Series.__setitem__()when setting- booleandtype values containing- NAincorrectly raising instead of casting to- booleandtype (GH 45462)
- Bug in - Series.loc()raising with boolean indexer containing- NAwhen- Indexdid not match (GH 46551)
- Bug in - Series.__setitem__()where setting- NAinto a numeric-dtype- Serieswould incorrectly upcast to object-dtype rather than treating the value as- np.nan(GH 44199)
- Bug in - DataFrame.loc()when setting values to a column and right hand side is a dictionary (GH 47216)
- Bug in - Series.__setitem__()with- datetime64[ns]dtype, an all-- Falseboolean mask, and an incompatible value incorrectly casting to- objectinstead of retaining- datetime64[ns]dtype (GH 45967)
- Bug in - Index.__getitem__()raising- ValueErrorwhen indexer is from boolean dtype with- NA(GH 45806)
- Bug in - Series.__setitem__()losing precision when enlarging- Serieswith scalar (GH 32346)
- Bug in - Series.mask()with- inplace=Trueor setting values with a boolean mask with small integer dtypes incorrectly raising (GH 45750)
- Bug in - DataFrame.mask()with- inplace=Trueand- ExtensionDtypecolumns incorrectly raising (GH 45577)
- Bug in getting a column from a DataFrame with an object-dtype row index with datetime-like values: the resulting Series now preserves the exact object-dtype Index from the parent DataFrame (GH 42950) 
- Bug in - DataFrame.__getattribute__()raising- AttributeErrorif columns have- "string"dtype (GH 46185)
- Bug in - DataFrame.compare()returning all- NaNcolumn when comparing extension array dtype and numpy dtype (GH 44014)
- Bug in - DataFrame.where()setting wrong values with- "boolean"mask for numpy dtype (GH 44014)
- Bug in indexing on a - DatetimeIndexwith a- np.str_key incorrectly raising (GH 45580)
- Bug in - CategoricalIndex.get_indexer()when index contains- NaNvalues, resulting in elements that are in target but not present in the index to be mapped to the index of the NaN element, instead of -1 (GH 45361)
- Bug in setting large integer values into - Serieswith- float32or- float16dtype incorrectly altering these values instead of coercing to- float64dtype (GH 45844)
- Bug in - Series.asof()and- DataFrame.asof()incorrectly casting bool-dtype results to- float64dtype (GH 16063)
- Bug in - NDFrame.xs(),- DataFrame.iterrows(),- DataFrame.loc()and- DataFrame.iloc()not always propagating metadata (GH 28283)
- Bug in DataFrame.sum() where min_count changed the dtype if the input contained NaNs (GH 46947)
- Bug in IntervalTree that led to an infinite recursion (GH 46658)
- Bug in - PeriodIndexraising- AttributeErrorwhen indexing on- NA, rather than putting- NaTin its place. (GH 46673)
- Bug in - DataFrame.at()would allow the modification of multiple columns (GH 48296)
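One of the indexing fixes above (GH 44199) concerns setting NA into a numeric-dtype Series. The following sketch, with an assumed float64 Series, shows the behaviour described: the value is treated as np.nan and the dtype is retained instead of being upcast to object.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0])

# Setting pd.NA into a float64 Series stores a missing value (NaN).
ser.iloc[1] = pd.NA

dtype_after = ser.dtype
is_missing = bool(np.isnan(ser.iloc[1]))
```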
Missing#
- Bug in - Series.fillna()and- DataFrame.fillna()with- downcastkeyword not being respected in some cases where there are no NA values present (GH 45423)
- Bug in - Series.fillna()and- DataFrame.fillna()with- IntervalDtypeand incompatible value raising instead of casting to a common (usually object) dtype (GH 45796)
- Bug in - Series.map()not respecting- na_actionargument if mapper is a- dictor- Series(GH 47527)
- Bug in - DataFrame.interpolate()with object-dtype column not returning a copy with- inplace=False(GH 45791)
- Bug in DataFrame.dropna() allowing both of the incompatible arguments how and thresh to be set (GH 46575)
- Bug in - DataFrame.fillna()ignored- axiswhen- DataFrameis single block (GH 47713)
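For the DataFrame.dropna() entry above (GH 46575), the sketch below assumes that passing both how and thresh now raises a TypeError, and shows the two arguments used separately. The sample frame is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})

# how and thresh are mutually exclusive; passing both is rejected.
try:
    df.dropna(how="any", thresh=1)
    raised = False
except TypeError:
    raised = True

# Use one or the other instead.
by_how = df.dropna(how="all")     # drop rows where every value is missing
by_thresh = df.dropna(thresh=1)   # keep rows with at least one non-missing value
```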
MultiIndex#
- Bug in - DataFrame.loc()returning empty result when slicing a- MultiIndexwith a negative step size and non-null start/stop values (GH 46156)
- Bug in - DataFrame.loc()raising when slicing a- MultiIndexwith a negative step size other than -1 (GH 46156)
- Bug in - DataFrame.loc()raising when slicing a- MultiIndexwith a negative step size and slicing a non-int labeled index level (GH 46156)
- Bug in - Series.to_numpy()where multiindexed Series could not be converted to numpy arrays when an- na_valuewas supplied (GH 45774)
- Bug in MultiIndex.equals not being commutative when only one side has extension array dtype (GH 46026)
- Bug in MultiIndex.from_tuples() failing to construct an Index of empty tuples (GH 45608)
I/O#
- Bug in - DataFrame.to_stata()where no error is raised if the- DataFramecontains- -np.inf(GH 45350)
- Bug in - read_excel()results in an infinite loop with certain- skiprowscallables (GH 45585)
- Bug in - DataFrame.info()where a new line at the end of the output is omitted when called on an empty- DataFrame(GH 45494)
- Bug in - read_csv()not recognizing line break for- on_bad_lines="warn"for- engine="c"(GH 41710)
- Bug in - DataFrame.to_csv()not respecting- float_formatfor- Float64dtype (GH 45991)
- Bug in - read_csv()not respecting a specified converter to index columns in all cases (GH 40589)
- Bug in - read_csv()interpreting second row as- Indexnames even when- index_col=False(GH 46569)
- Bug in - read_parquet()when- engine="pyarrow"which caused partial write to disk when column of unsupported datatype was passed (GH 44914)
- Bug in - DataFrame.to_excel()and- ExcelWriterwould raise when writing an empty DataFrame to a- .odsfile (GH 45793)
- Bug in - read_csv()ignoring non-existing header row for- engine="python"(GH 47400)
- Bug in - read_excel()raising uncontrolled- IndexErrorwhen- headerreferences non-existing rows (GH 43143)
- Bug in - read_html()where elements surrounding- <br>were joined without a space between them (GH 29528)
- Bug in - read_csv()when data is longer than header leading to issues with callables in- usecolsexpecting strings (GH 46997)
- Bug in Parquet roundtrip for Interval dtype with - datetime64[ns]subtype (GH 45881)
- Bug in - read_excel()when reading a- .odsfile with newlines between xml elements (GH 45598)
- Bug in - read_parquet()when- engine="fastparquet"where the file was not closed on error (GH 46555)
- DataFrame.to_html()now excludes the- borderattribute from- <table>elements when- borderkeyword is set to- False.
- Bug in - read_sas()with certain types of compressed SAS7BDAT files (GH 35545)
- Bug in - read_excel()not forward filling- MultiIndexwhen no names were given (GH 47487)
- Bug in - read_sas()returned- Nonerather than an empty DataFrame for SAS7BDAT files with zero rows (GH 18198)
- Bug in - DataFrame.to_string()using wrong missing value with extension arrays in- MultiIndex(GH 47986)
- Bug in - StataWriterwhere value labels were always written with default encoding (GH 46750)
- Bug in - StataWriterUTF8where some valid characters were removed from variable names (GH 47276)
- Bug in - DataFrame.to_excel()when writing an empty dataframe with- MultiIndex(GH 19543)
- Bug in - read_sas()with RLE-compressed SAS7BDAT files that contain 0x40 control bytes (GH 31243)
- Bug in - read_sas()that scrambled column names (GH 31243)
- Bug in - read_sas()with RLE-compressed SAS7BDAT files that contain 0x00 control bytes (GH 47099)
- Bug in - read_parquet()with- use_nullable_dtypes=Truewhere- float64dtype was returned instead of nullable- Float64dtype (GH 45694)
- Bug in - DataFrame.to_json()where- PeriodDtypewould not make the serialization roundtrip when read back with- read_json()(GH 44720)
- Bug in read_xml() raising XMLSyntaxError when reading XML files with Chinese character tags (GH 47902)
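The DataFrame.to_html() entry above says the border attribute is omitted from the table element when border=False. A minimal sketch of that difference, assuming the default display.html.border option value of 1 and an illustrative one-column frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# Default rendering carries a border attribute on the <table> element.
html_default = df.to_html()

# With border=False the attribute is left out entirely.
html_no_border = df.to_html(border=False)
```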
Period#
- Bug in subtraction of - Periodfrom- PeriodArrayreturning wrong results (GH 45999)
- Bug in - Period.strftime()and- PeriodIndex.strftime(), directives- %land- %uwere giving wrong results (GH 46252)
- Bug in inferring an incorrect freq when passing a string to Period with microseconds that are a multiple of 1000 (GH 46811)
- Bug in constructing a - Periodfrom a- Timestampor- np.datetime64object with non-zero nanoseconds and- freq="ns"incorrectly truncating the nanoseconds (GH 46811)
- Bug in adding - np.timedelta64("NaT", "ns")to a- Periodwith a timedelta-like freq incorrectly raising- IncompatibleFrequencyinstead of returning- NaT(GH 47196)
- Bug in adding an array of integers to an array with - PeriodDtypegiving incorrect results when- dtype.freq.n > 1(GH 47209)
- Bug in subtracting a - Periodfrom an array with- PeriodDtypereturning incorrect results instead of raising- OverflowErrorwhen the operation overflows (GH 47538)
Plotting#
- Bug in - DataFrame.plot.barh()that prevented labeling the x-axis and- xlabelupdating the y-axis label (GH 45144)
- Bug in - DataFrame.plot.box()that prevented labeling the x-axis (GH 45463)
- Bug in - DataFrame.boxplot()that prevented passing in- xlabeland- ylabel(GH 45463)
- Bug in - DataFrame.boxplot()that prevented specifying- vert=False(GH 36918)
- Bug in - DataFrame.plot.scatter()that prevented specifying- norm(GH 45809)
- Fix showing “None” as ylabel in - Series.plot()when not setting ylabel (GH 46129)
- Bug in - DataFrame.plot()that led to xticks and vertical grids being improperly placed when plotting a quarterly series (GH 47602)
- Bug in - DataFrame.plot()that prevented setting y-axis label, limits and ticks for a secondary y-axis (GH 47753)
Groupby/resample/rolling#
- Bug in - DataFrame.resample()ignoring- closed="right"on- TimedeltaIndex(GH 45414)
- Bug in DataFrameGroupBy.transform() failing when func="size" and the input DataFrame has multiple columns (GH 27469)
- Bug in - DataFrameGroupBy.size()and- DataFrameGroupBy.transform()with- func="size"produced incorrect results when- axis=1(GH 45715)
- Bug in - ExponentialMovingWindow.mean()with- axis=1and- engine='numba'when the- DataFramehas more columns than rows (GH 46086)
- Bug when using - engine="numba"would return the same jitted function when modifying- engine_kwargs(GH 46086)
- Bug in DataFrameGroupBy.transform() failing when axis=1 and func is "first" or "last" (GH 45986)
- Bug in - DataFrameGroupBy.cumsum()with- skipna=Falsegiving incorrect results (GH 46216)
- Bug in DataFrameGroupBy.sum(), SeriesGroupBy.sum(), DataFrameGroupBy.prod(), SeriesGroupBy.prod(), DataFrameGroupBy.cumsum(), and SeriesGroupBy.cumsum() with integer dtypes losing precision (GH 37493)
- Bug in - DataFrameGroupBy.cumsum()and- SeriesGroupBy.cumsum()with- timedelta64[ns]dtype failing to recognize- NaTas a null value (GH 46216)
- Bug in - DataFrameGroupBy.cumsum()and- SeriesGroupBy.cumsum()with integer dtypes causing overflows when sum was bigger than maximum of dtype (GH 37493)
- Bug in - DataFrameGroupBy.cummin(),- SeriesGroupBy.cummin(),- DataFrameGroupBy.cummax()and- SeriesGroupBy.cummax()with nullable dtypes incorrectly altering the original data in place (GH 46220)
- Bug in - DataFrame.groupby()raising error when- Noneis in first level of- MultiIndex(GH 47348)
- Bug in - DataFrameGroupBy.cummax()and- SeriesGroupBy.cummax()with- int64dtype with leading value being the smallest possible int64 (GH 46382)
- Bug in - DataFrameGroupBy.cumprod()and- SeriesGroupBy.cumprod()- NaNinfluences calculation in different columns with- skipna=False(GH 48064)
- Bug in - DataFrameGroupBy.max()and- SeriesGroupBy.max()with empty groups and- uint64dtype incorrectly raising- RuntimeError(GH 46408)
- Bug in - DataFrameGroupBy.apply()and- SeriesGroupBy.apply()would fail when- funcwas a string and args or kwargs were supplied (GH 46479)
- Bug in - SeriesGroupBy.apply()would incorrectly name its result when there was a unique group (GH 46369)
- Bug in - Rolling.sum()and- Rolling.mean()would give incorrect result with window of same values (GH 42064, GH 46431)
- Bug in - Rolling.var()and- Rolling.std()would give non-zero result with window of same values (GH 42064)
- Bug in - Rolling.skew()and- Rolling.kurt()would give NaN with window of same values (GH 30993)
- Bug in - Rolling.var()would segfault calculating weighted variance when window size was larger than data size (GH 46760)
- Bug in - Grouper.__repr__()where- dropnawas not included. Now it is (GH 46754)
- Bug in DataFrame.rolling() raising ValueError when center=True, axis=1 and win_type is specified (GH 46135)
- Bug in - DataFrameGroupBy.describe()and- SeriesGroupBy.describe()produces inconsistent results for empty datasets (GH 41575)
- Bug in - DataFrame.resample()reduction methods when used with- onwould attempt to aggregate the provided column (GH 47079)
- Bug in DataFrame.groupby() and Series.groupby() would not respect dropna=False when the input DataFrame/Series had NaN values in a MultiIndex (GH 46783)
- Bug in - DataFrameGroupBy.resample()raises- KeyErrorwhen getting the result from a key list which misses the resample key (GH 47362)
- Bug in - DataFrame.groupby()would lose index columns when the DataFrame is empty for transforms, like fillna (GH 47787)
- Bug in DataFrame.groupby() and Series.groupby() with dropna=False and sort=False would put any null groups at the end instead of in the order that they are encountered (GH 46584)
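The last groupby entry above (GH 46584) is about group ordering with dropna=False and sort=False. The sketch below uses an assumed frame whose null key appears between two other keys, so the order of the result shows where the null group lands.

```python
import pandas as pd

# Keys are encountered in the order: "b", missing, "a".
df = pd.DataFrame(
    {"key": ["b", None, "a", "b", None], "val": [1, 2, 3, 4, 5]}
)

# dropna=False keeps the null group; sort=False keeps encounter order.
totals = df.groupby("key", dropna=False, sort=False)["val"].sum()

null_positions = totals.index.isna().tolist()
```

With the fix, the null group stays in the second position rather than being moved to the end.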
Reshaping#
- Bug in - concat()between a- Serieswith integer dtype and another with- CategoricalDtypewith integer categories and containing- NaNvalues casting to object dtype instead of- float64(GH 45359)
- Bug in - get_dummies()that selected object and categorical dtypes but not string (GH 44965)
- Bug in - DataFrame.align()when aligning a- MultiIndexto a- Serieswith another- MultiIndex(GH 46001)
- Bug in concatenation with - IntegerDtype, or- FloatingDtypearrays where the resulting dtype did not mirror the behavior of the non-nullable dtypes (GH 46379)
- Bug in - concat()losing dtype of columns when- join="outer"and- sort=True(GH 47329)
- Bug in - concat()not sorting the column names when- Noneis included (GH 47331)
- Bug in - concat()with identical key leads to error when indexing- MultiIndex(GH 46519)
- Bug in - pivot_table()raising- TypeErrorwhen- dropna=Trueand aggregation column has extension array dtype (GH 47477)
- Bug in - merge()raising error for- how="cross"when using- FIPSmode in ssl library (GH 48024)
- Bug in - DataFrame.join()with a list when using suffixes to join DataFrames with duplicate column names (GH 46396)
- Bug in - DataFrame.pivot_table()with- sort=Falseresults in sorted index (GH 17041)
- Bug in - concat()when- axis=1and- sort=Falsewhere the resulting Index was a- Int64Indexinstead of a- RangeIndex(GH 46675)
- Bug in - wide_to_long()raises when- stubnamesis missing in columns and- icontains string dtype column (GH 46044)
- Bug in - DataFrame.join()with categorical index results in unexpected reordering (GH 47812)
Sparse#
- Bug in - Series.where()and- DataFrame.where()with- SparseDtypefailing to retain the array’s- fill_value(GH 45691)
- Bug in - SparseArray.unique()fails to keep original elements order (GH 47809)
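The SparseArray.unique() fix above (GH 47809) is about preserving the order in which values first appear, including the fill value. A small sketch with illustrative integer data, where 0 is the default fill value:

```python
import numpy as np
import pandas as pd

# 0 is the fill value for an integer SparseArray by default.
arr = pd.arrays.SparseArray([3, 0, 1, 0, 3])

# Unique values are returned in order of first appearance.
unique_values = np.asarray(arr.unique()).tolist()
```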
ExtensionArray#
- Bug in - IntegerArray.searchsorted()and- FloatingArray.searchsorted()returning inconsistent results when acting on- np.nan(GH 45255)
Styler#
- Bug when attempting to apply styling functions to an empty DataFrame subset (GH 45313) 
- Bug in - CSSToExcelConverterleading to- TypeErrorwhen border color provided without border style for- xlsxwriterengine (GH 42276)
- Bug in - Styler.set_sticky()leading to white text on white background in dark mode (GH 46984)
- Bug in - Styler.to_latex()causing- UnboundLocalErrorwhen- clines="all;data"and the- DataFramehas no rows. (GH 47203)
- Bug in - Styler.to_excel()when using- vertical-align: middle;with- xlsxwriterengine (GH 30107)
- Bug when applying styles to a DataFrame with boolean column labels (GH 47838) 
Metadata#
- Fixed metadata propagation in - DataFrame.melt()(GH 28283)
- Fixed metadata propagation in - DataFrame.explode()(GH 28283)
Other#
- Bug in - assert_index_equal()with- names=Trueand- check_order=Falsenot checking names (GH 47328)
Contributors#
A total of 271 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
- Aadharsh Acharya + 
- Aadharsh-Acharya + 
- Aadhi Manivannan + 
- Adam Bowden 
- Aditya Agarwal + 
- Ahmed Ibrahim + 
- Alastair Porter + 
- Alex Povel + 
- Alex-Blade 
- Alexandra Sciocchetti + 
- AlonMenczer + 
- Andras Deak + 
- Andrew Hawyrluk 
- Andy Grigg + 
- Aneta Kahleová + 
- Anthony Givans + 
- Anton Shevtsov + 
- B. J. Potter + 
- BarkotBeyene + 
- Ben Beasley + 
- Ben Wozniak + 
- Bernhard Wagner + 
- Boris Rumyantsev 
- Brian Gollop + 
- CCXXXI + 
- Chandrasekaran Anirudh Bhardwaj + 
- Charles Blackmon-Luca + 
- Chris Moradi + 
- ChrisAlbertsen + 
- Compro Prasad + 
- DaPy15 
- Damian Barabonkov + 
- Daniel I + 
- Daniel Isaac + 
- Daniel Schmidt 
- Danil Iashchenko + 
- Dare Adewumi 
- Dennis Chukwunta + 
- Dennis J. Gray + 
- Derek Sharp + 
- Dhruv Samdani + 
- Dimitra Karadima + 
- Dmitry Savostyanov + 
- Dmytro Litvinov + 
- Do Young Kim + 
- Dries Schaumont + 
- Edward Huang + 
- Eirik + 
- Ekaterina + 
- Eli Dourado + 
- Ezra Brauner + 
- Fabian Gabel + 
- FactorizeD + 
- Fangchen Li 
- Francesco Romandini + 
- Greg Gandenberger + 
- Guo Ci + 
- Hiroaki Ogasawara 
- Hood Chatham + 
- Ian Alexander Joiner + 
- Irv Lustig 
- Ivan Ng + 
- JHM Darbyshire 
- JHM Darbyshire (MBP) 
- JHM Darbyshire (iMac) 
- JMBurley 
- Jack Goldsmith + 
- James Freeman + 
- James Lamb 
- James Moro + 
- Janosh Riebesell 
- Jarrod Millman 
- Jason Jia + 
- Jeff Reback 
- Jeremy Tuloup + 
- Johannes Mueller 
- John Bencina + 
- John Mantios + 
- John Zangwill 
- Jon Bramley + 
- Jonas Haag 
- Jordan Hicks 
- Joris Van den Bossche 
- Jose Ortiz + 
- JosephParampathu + 
- José Duarte 
- Julian Steger + 
- Kai Priester + 
- Kapil E. Iyer + 
- Karthik Velayutham + 
- Kashif Khan 
- Kazuki Igeta + 
- Kevin Jan Anker + 
- Kevin Sheppard 
- Khor Chean Wei 
- Kian Eliasi 
- Kian S + 
- Kim, KwonHyun + 
- Kinza-Raza + 
- Konjeti Maruthi + 
- Leonardus Chen 
- Linxiao Francis Cong + 
- Loïc Estève 
- LucasG0 + 
- Lucy Jiménez + 
- Luis Pinto 
- Luke Manley 
- Marc Garcia 
- Marco Edward Gorelli 
- Marco Gorelli 
- MarcoGorelli 
- Margarete Dippel + 
- Mariam-ke + 
- Martin Fleischmann 
- Marvin John Walter + 
- Marvin Walter + 
- Mateusz 
- Matilda M + 
- Matthew Roeschke 
- Matthias Bussonnier 
- MeeseeksMachine 
- Mehgarg + 
- Melissa Weber Mendonça + 
- Michael Milton + 
- Michael Wang 
- Mike McCarty + 
- Miloni Atal + 
- Mitlasóczki Bence + 
- Moritz Schreiber + 
- Morten Canth Hels + 
- Nick Crews + 
- NickFillot + 
- Nicolas Hug + 
- Nima Sarang 
- Noa Tamir + 
- Pandas Development Team 
- Parfait Gasana 
- Parthi + 
- Partho + 
- Patrick Hoefler 
- Peter 
- Peter Hawkins + 
- Philipp A 
- Philipp Schaefer + 
- Pierrot + 
- Pratik Patel + 
- Prithvijit 
- Purna Chandra Mansingh + 
- Radoslaw Lemiec + 
- RaphSku + 
- Reinert Huseby Karlsen + 
- Richard Shadrach 
- Richard Shadrach + 
- Robbie Palmer 
- Robert de Vries 
- Roger + 
- Roger Murray + 
- Ruizhe Deng + 
- SELEE + 
- Sachin Yadav + 
- Saiwing Yeung + 
- Sam Rao + 
- Sandro Casagrande + 
- Sebastiaan Vermeulen + 
- Shaghayegh + 
- Shantanu + 
- Shashank Shet + 
- Shawn Zhong + 
- Shuangchi He + 
- Simon Hawkins 
- Simon Knott + 
- Solomon Song + 
- Somtochi Umeh + 
- Stefan Krawczyk + 
- Stefanie Molin 
- Steffen Rehberg 
- Steven Bamford + 
- Steven Rotondo + 
- Steven Schaerer 
- Sylvain MARIE + 
- Sylvain Marié 
- Tarun Raghunandan Kaushik + 
- Taylor Packard + 
- Terji Petersen 
- Thierry Moisan 
- Thomas Grainger 
- Thomas Hunter + 
- Thomas Li 
- Tim McFarland + 
- Tim Swast 
- Tim Yang + 
- Tobias Pitters 
- Tom Aarsen + 
- Tom Augspurger 
- Torsten Wörtwein 
- TraverseTowner + 
- Tyler Reddy 
- Valentin Iovene 
- Varun Sharma + 
- Vasily Litvinov 
- Venaturum 
- Vinicius Akira Imaizumi + 
- Vladimir Fokow + 
- Wenjun Si 
- Will Lachance + 
- William Andrea 
- Wolfgang F. Riedl + 
- Xingrong Chen 
- Yago González 
- Yikun Jiang + 
- Yuanhao Geng 
- Yuval + 
- Zero 
- Zhengfei Wang + 
- abmyii 
- alexondor + 
- alm 
- andjhall + 
- anilbey + 
- arnaudlegout + 
- asv-bot + 
- ateki + 
- auderson + 
- bherwerth + 
- bicarlsen + 
- carbonleakage + 
- charles + 
- charlogazzo + 
- code-review-doctor + 
- dataxerik + 
- deponovo 
- dimitra-karadima + 
- dospix + 
- ehallam + 
- ehsan shirvanian + 
- ember91 + 
- eshirvana 
- fractionalhare + 
- gaotian98 + 
- gesoos 
- github-actions[bot] 
- gunghub + 
- hasan-yaman 
- iansheng + 
- iasoon + 
- jbrockmendel 
- joshuabello2550 + 
- jyuv + 
- kouya takahashi + 
- mariana-LJ + 
- matt + 
- mattB1989 + 
- nealxm + 
- partev 
- poloso + 
- realead 
- roib20 + 
- rtpsw 
- ryangilmour + 
- shourya5 + 
- srotondo + 
- stanleycai95 + 
- staticdev + 
- tehunter + 
- theidexisted + 
- tobias.pitters + 
- uncjackg + 
- vernetya 
- wany-oh + 
- wfr + 
- z3c0 +