What’s new in 2.2.0 (January 19, 2024)#
These are the changes in pandas 2.2.0. See Release notes for a full changelog including other versions of pandas.
Upcoming changes in pandas 3.0#
pandas 3.0 will bring two bigger changes to the default behavior of pandas.
Copy-on-Write#
The currently optional mode Copy-on-Write will be enabled by default in pandas 3.0. There won’t be an option to keep the current behavior enabled. The new behavioral semantics are explained in the user guide about Copy-on-Write.
The new behavior can be enabled since pandas 2.0 with the following option:
pd.options.mode.copy_on_write = True
This change alters how pandas behaves with respect to copies and views. Some of these changes allow a clear deprecation, like the changes in chained assignment. Other changes are more subtle, so their warnings are hidden behind an option that can be enabled in pandas 2.2.
pd.options.mode.copy_on_write = "warn"
This mode will warn in many different scenarios that aren’t actually relevant to most queries. We recommend exploring this mode, but it is not necessary to get rid of all of these warnings. The migration guide explains the upgrade process in more detail.
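As a rough illustration of the semantics described above, the following sketch enables Copy-on-Write for a single block using pd.option_context (rather than setting the option globally) and modifies a column selected from a DataFrame. The frame and variable names are made up for the example, and the exact warnings emitted may differ between pandas versions.

```python
import pandas as pd

# Enable Copy-on-Write only inside this block (the option exists since pandas 2.0).
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    col = df["a"]      # shares data with df until one of them is modified
    col.iloc[0] = 100  # the write triggers a copy, so df is left untouched

    parent_values = df["a"].tolist()
    child_values = col.tolist()

print(parent_values)
print(child_values)
```

Under Copy-on-Write, only the object that was written to changes; the parent DataFrame keeps its original values.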
Dedicated string data type (backed by Arrow) by default#
Historically, pandas represented string columns with NumPy object data type. This
representation has numerous problems, including slow performance and a large memory
footprint. This will change in pandas 3.0. pandas will start inferring string columns
as a new string data type, backed by Arrow, which stores strings contiguously in memory. This brings
a huge performance and memory improvement.
Old behavior:
In [1]: pd.Series(["a", "b"])
Out[1]:
0 a
1 b
dtype: object
New behavior:
In [1]: pd.Series(["a", "b"])
Out[1]:
0 a
1 b
dtype: string
The string data type that is used in these scenarios will mostly behave as NumPy object would, including missing value semantics and general operations on these columns.
This change includes a few additional changes across the API:
Currently, specifying dtype="string" creates a dtype that is backed by Python strings which are stored in a NumPy array. This will change in pandas 3.0: this dtype will create an Arrow-backed string column.
The column names and the Index will also be backed by Arrow strings.
PyArrow will become a required dependency with pandas 3.0 to accommodate this change.
This future dtype inference logic can be enabled with:
pd.options.future.infer_string = True
Enhancements#
ADBC Driver support in to_sql and read_sql#
read_sql() and to_sql() now work with Apache Arrow ADBC drivers. Compared to
traditional drivers used via SQLAlchemy, ADBC drivers should provide
significant performance improvements, better type support and cleaner
nullability handling.
import pandas as pd
import adbc_driver_postgresql.dbapi as pg_dbapi

df = pd.DataFrame(
    [
        [1, 2, 3],
        [4, 5, 6],
    ],
    columns=['a', 'b', 'c']
)
uri = "postgresql://postgres:postgres@localhost/postgres"
with pg_dbapi.connect(uri) as conn:
    df.to_sql("pandas_table", conn, index=False)

# for round-tripping
with pg_dbapi.connect(uri) as conn:
    df2 = pd.read_sql("pandas_table", conn)
The Arrow type system offers a wider array of types that can more closely match what databases like PostgreSQL can offer. To illustrate, note this (non-exhaustive) listing of types available in different databases and pandas backends:
| numpy/pandas | arrow | postgres | sqlite |
|---|---|---|---|
| int16/Int16 | int16 | SMALLINT | INTEGER |
| int32/Int32 | int32 | INTEGER | INTEGER |
| int64/Int64 | int64 | BIGINT | INTEGER |
| float32 | float32 | REAL | REAL |
| float64 | float64 | DOUBLE PRECISION | REAL |
| object | string | TEXT | TEXT |
| bool | | BOOLEAN | |
| datetime64[ns] | timestamp(us) | TIMESTAMP | |
| datetime64[ns,tz] | timestamp(us,tz) | TIMESTAMPTZ | |
| | date32 | DATE | |
| | month_day_nano_interval | INTERVAL | |
| | binary | BINARY | BLOB |
| | decimal128 | DECIMAL [1] | |
| | list | ARRAY [1] | |
| | struct | | |
Footnotes
If you want to preserve database types as faithfully as possible
throughout the lifecycle of your DataFrame, we encourage you to
use the dtype_backend="pyarrow" argument of read_sql():
# for round-tripping
with pg_dbapi.connect(uri) as conn:
    df2 = pd.read_sql("pandas_table", conn, dtype_backend="pyarrow")
This will prevent your data from being converted to the traditional pandas/NumPy type system, which often converts SQL types in ways that make them impossible to round-trip.
For a full list of ADBC drivers and their development status, see the ADBC Driver Implementation Status documentation.
Create a pandas Series based on one or more conditions#
The Series.case_when() function has been added to create a Series object based on one or more conditions. (GH 39154)
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]))
In [3]: default=pd.Series('default', index=df.index)
In [4]: default.case_when(
...: caselist=[
...: (df.a == 1, 'first'), # condition, replacement
...: (df.a.gt(1) & df.b.eq(5), 'second'), # condition, replacement
...: ],
...: )
...:
Out[4]:
0 first
1 second
2 default
dtype: object
to_numpy for NumPy nullable and Arrow types converts to suitable NumPy dtype#
to_numpy() now converts nullable and PyArrow-backed extension dtypes to a
suitable NumPy dtype instead of object dtype.
Old behavior:
In [1]: ser = pd.Series([1, 2, 3], dtype="Int64")
In [2]: ser.to_numpy()
Out[2]: array([1, 2, 3], dtype=object)
New behavior:
In [5]: ser = pd.Series([1, 2, 3], dtype="Int64")
In [6]: ser.to_numpy()
Out[6]: array([1, 2, 3])
In [7]: ser = pd.Series([1, 2, 3], dtype="timestamp[ns][pyarrow]")
In [8]: ser.to_numpy()
Out[8]:
array(['1970-01-01T00:00:00.000000001', '1970-01-01T00:00:00.000000002',
'1970-01-01T00:00:00.000000003'], dtype='datetime64[ns]')
The default NumPy dtype (without any arguments) is determined as follows:
float dtypes are cast to NumPy floats
integer dtypes without missing values are cast to NumPy integer dtypes
integer dtypes with missing values are cast to NumPy float dtypes and NaN is used as missing value indicator
boolean dtypes without missing values are cast to NumPy bool dtype
boolean dtypes with missing values keep object dtype
datetime and timedelta types are cast to NumPy datetime64 and timedelta64 types respectively and NaT is used as missing value indicator
Series.struct accessor for PyArrow structured data#
The Series.struct accessor provides attributes and methods for processing
data with struct[pyarrow] dtype Series. For example,
Series.struct.explode() converts PyArrow structured data to a pandas
DataFrame. (GH 54938)
In [9]: import pyarrow as pa
In [10]: series = pd.Series(
....: [
....: {"project": "pandas", "version": "2.2.0"},
....: {"project": "numpy", "version": "1.25.2"},
....: {"project": "pyarrow", "version": "13.0.0"},
....: ],
....: dtype=pd.ArrowDtype(
....: pa.struct([
....: ("project", pa.string()),
....: ("version", pa.string()),
....: ])
....: ),
....: )
....:
In [11]: series.struct.explode()
Out[11]:
project version
0 pandas 2.2.0
1 numpy 1.25.2
2 pyarrow 13.0.0
Use Series.struct.field() to index into a (possibly nested)
struct field.
In [12]: series.struct.field("project")
Out[12]:
0 pandas
1 numpy
2 pyarrow
Name: project, dtype: string[pyarrow]
Series.list accessor for PyArrow list data#
The Series.list accessor provides attributes and methods for processing
data with list[pyarrow] dtype Series. For example,
Series.list.__getitem__() allows indexing pyarrow lists in
a Series. (GH 55323)
In [13]: import pyarrow as pa
In [14]: series = pd.Series(
....: [
....: [1, 2, 3],
....: [4, 5],
....: [6],
....: ],
....: dtype=pd.ArrowDtype(
....: pa.list_(pa.int64())
....: ),
....: )
....:
In [15]: series.list[0]
Out[15]:
0 1
1 4
2 6
dtype: int64[pyarrow]
Calamine engine for read_excel()#
The calamine engine was added to read_excel().
It uses python-calamine, which provides Python bindings for the Rust library calamine.
This engine supports Excel files (.xlsx, .xlsm, .xls, .xlsb) and OpenDocument spreadsheets (.ods) (GH 50395).
There are two advantages of this engine:
Calamine is often faster than other engines. Some benchmarks show results up to 5x faster than ‘openpyxl’, 20x faster than ‘odf’, 4x faster than ‘pyxlsb’, and 1.5x faster than ‘xlrd’. However, ‘openpyxl’ and ‘pyxlsb’ are faster at reading a few rows from large files because they iterate over rows lazily.
Calamine supports the recognition of datetime in .xlsb files, unlike ‘pyxlsb’, which is the only other engine in pandas that can read .xlsb files.
pd.read_excel("path_to_file.xlsb", engine="calamine")
For more, see Calamine (Excel and ODS files) in the user guide on IO tools.
Other enhancements#
to_sql() with method parameter set to multi works with Oracle on the backend
Series.attrs / DataFrame.attrs now uses a deepcopy for propagating attrs (GH 54134).
get_dummies() now returning extension dtypes boolean or bool[pyarrow] that are compatible with the input dtype (GH 56273)
read_csv() now supports on_bad_lines parameter with engine="pyarrow" (GH 54480)
read_sas() returns datetime64 dtypes with resolutions better matching those stored natively in SAS, and avoids returning object-dtype in cases that cannot be stored with datetime64[ns] dtype (GH 56127)
read_spss() now returns a DataFrame that stores the metadata in DataFrame.attrs (GH 54264)
tseries.api.guess_datetime_format() is now part of the public API (GH 54727)
DataFrame.apply() now allows the usage of numba (via engine="numba") to JIT compile the passed function, allowing for potential speedups (GH 54666)
ExtensionArray._explode() interface method added to allow extension type implementations of the explode method (GH 54833)
ExtensionArray.duplicated() added to allow extension type implementations of the duplicated method (GH 55255)
Series.ffill(), Series.bfill(), DataFrame.ffill(), and DataFrame.bfill() have gained the argument limit_area; 3rd party ExtensionArray authors need to add this argument to the method _pad_or_backfill (GH 56492)
Allow passing read_only, data_only and keep_links arguments to openpyxl using engine_kwargs of read_excel() (GH 55027)
Implement Series.interpolate() and DataFrame.interpolate() for ArrowDtype and masked dtypes (GH 56267)
Implement masked algorithms for Series.value_counts() (GH 54984)
Implemented Series.dt() methods and attributes for ArrowDtype with pyarrow.duration type (GH 52284)
Implemented Series.str.extract() for ArrowDtype (GH 56268)
Improved error message that appears in DatetimeIndex.to_period() with frequencies which are not supported as period frequencies, such as "BMS" (GH 56243)
Improved error message when constructing Period with invalid offsets such as "QS" (GH 55785)
The dtypes string[pyarrow] and string[pyarrow_numpy] now both utilize the large_string type from PyArrow to avoid overflow for long columns (GH 56259)
Notable bug fixes#
These are bug fixes that might have notable behavior changes.
merge() and DataFrame.join() now consistently follow documented sort behavior#
In previous versions of pandas, merge() and DataFrame.join() did not
always return a result that followed the documented sort behavior. pandas now
follows the documented sort behavior in merge and join operations (GH 54611, GH 56426, GH 56443).
As documented, sort=True sorts the join keys lexicographically in the resulting
DataFrame. With sort=False, the order of the join keys depends on the
join type (how keyword):
how="left": preserve the order of the left keys
how="right": preserve the order of the right keys
how="inner": preserve the order of the left keys
how="outer": sort keys lexicographically
One example with changing behavior is inner joins with non-unique left join keys
and sort=False:
In [16]: left = pd.DataFrame({"a": [1, 2, 1]})
In [17]: right = pd.DataFrame({"a": [1, 2]})
In [18]: result = pd.merge(left, right, how="inner", on="a", sort=False)
Old Behavior
In [5]: result
Out[5]:
a
0 1
1 1
2 2
New Behavior
In [19]: result
Out[19]:
a
0 1
1 2
2 1
merge() and DataFrame.join() no longer reorder levels when levels differ#
In previous versions of pandas, merge() and DataFrame.join() would reorder
index levels when joining on two indexes with different levels (GH 34133).
In [20]: left = pd.DataFrame({"left": 1}, index=pd.MultiIndex.from_tuples([("x", 1), ("x", 2)], names=["A", "B"]))
In [21]: right = pd.DataFrame({"right": 2}, index=pd.MultiIndex.from_tuples([(1, 1), (2, 2)], names=["B", "C"]))
In [22]: left
Out[22]:
left
A B
x 1 1
2 1
In [23]: right
Out[23]:
right
B C
1 1 2
2 2 2
In [24]: result = left.join(right)
Old Behavior
In [5]: result
Out[5]:
left right
B A C
1 x 1 1 2
2 x 2 1 2
New Behavior
In [25]: result
Out[25]:
left right
A B C
x 1 1 1 2
2 2 1 2
Increased minimum versions for dependencies#
For optional dependencies the general recommendation is to use the latest version. Optional dependencies below the lowest tested version may still work but are not considered supported. The following table lists the optional dependencies that have had their minimum tested version increased.
| Package | New Minimum Version |
|---|---|
| beautifulsoup4 | 4.11.2 |
| blosc | 1.21.3 |
| bottleneck | 1.3.6 |
| fastparquet | 2022.12.0 |
| fsspec | 2022.11.0 |
| gcsfs | 2022.11.0 |
| lxml | 4.9.2 |
| matplotlib | 3.6.3 |
| numba | 0.56.4 |
| numexpr | 2.8.4 |
| qtpy | 2.3.0 |
| openpyxl | 3.1.0 |
| psycopg2 | 2.9.6 |
| pyreadstat | 1.2.0 |
| pytables | 3.8.0 |
| pyxlsb | 1.0.10 |
| s3fs | 2022.11.0 |
| scipy | 1.10.0 |
| sqlalchemy | 2.0.0 |
| tabulate | 0.9.0 |
| xarray | 2022.12.0 |
| xlsxwriter | 3.0.5 |
| zstandard | 0.19.0 |
| pyqt5 | 5.15.8 |
| tzdata | 2022.7 |
See Dependencies and Optional dependencies for more.
Other API changes#
The hash values of nullable extension dtypes changed to improve the performance of the hashing operation (GH 56507)
check_exact now only takes effect for floating-point dtypes in testing.assert_frame_equal() and testing.assert_series_equal(). In particular, integer dtypes are always checked exactly (GH 55882)
Deprecations#
Chained assignment#
In preparation for larger upcoming changes to the copy / view behaviour in pandas 3.0 (Copy-on-Write (CoW), PDEP-7), we started deprecating chained assignment.
Chained assignment occurs when you try to update a pandas DataFrame or Series through two subsequent indexing operations. Depending on the type and order of those operations this currently does or does not work.
A typical example is as follows:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
# first selecting rows with a mask, then assigning values to a column
# -> this has never worked and raises a SettingWithCopyWarning
df[df["bar"] > 5]["foo"] = 100
# first selecting the column, and then assigning to a subset of that column
# -> this currently works
df["foo"][df["bar"] > 5] = 100
This second example of chained assignment currently works to update the original df.
This will no longer work in pandas 3.0, and therefore we started deprecating this:
>>> df["foo"][df["bar"] > 5] = 100
FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:
df["col"][row_indexer] = value
Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
You can fix this warning and ensure your code is ready for pandas 3.0 by removing
the usage of chained assignment. Typically, this can be done by doing the assignment
in a single step using for example .loc. For the example above, we can do:
df.loc[df["bar"] > 5, "foo"] = 100
The same deprecation applies to inplace methods that are done in a chained manner, such as:
>>> df["foo"].fillna(0, inplace=True)
FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
When the goal is to update the column in the DataFrame df, the alternative here is
to call the method on df itself, such as df.fillna({"foo": 0}, inplace=True).
See more details in the migration guide.
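The two recommended replacements can be combined in one short sketch. The DataFrame below is a variation of the example above with a missing value added so that fillna has something to do; it is illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"foo": [1.0, np.nan, 3.0], "bar": [4, 5, 6]})

# Instead of df["foo"][df["bar"] > 5] = 100: assign in a single .loc step.
df.loc[df["bar"] > 5, "foo"] = 100.0

# Instead of df["foo"].fillna(0, inplace=True): assign the result back.
df["foo"] = df["foo"].fillna(0)

print(df["foo"].tolist())
```

Both statements operate on df directly, so they update the original object with or without Copy-on-Write.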
Deprecate aliases M, Q, Y, etc. in favour of ME, QE, YE, etc. for offsets#
Deprecated the following frequency aliases (GH 9586):
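| offsets | deprecated aliases | new aliases |
|---|---|---|
| MonthEnd | M | ME |
| BusinessMonthEnd | BM | BME |
| SemiMonthEnd | SM | SME |
| CustomBusinessMonthEnd | CBM | CBME |
| QuarterEnd | Q | QE |
| BQuarterEnd | BQ | BQE |
| YearEnd | Y | YE |
| BYearEnd | BY | BYE |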
For example:
Previous behavior:
In [8]: pd.date_range('2020-01-01', periods=3, freq='Q-NOV')
Out[8]:
DatetimeIndex(['2020-02-29', '2020-05-31', '2020-08-31'],
dtype='datetime64[ns]', freq='Q-NOV')
Future behavior:
In [26]: pd.date_range('2020-01-01', periods=3, freq='QE-NOV')
Out[26]: DatetimeIndex(['2020-02-29', '2020-05-31', '2020-08-31'], dtype='datetime64[ns]', freq='QE-NOV')
Deprecated automatic downcasting#
Deprecated the automatic downcasting of object dtype results in a number of methods. These would silently change the dtype in a hard to predict manner since the behavior was value dependent. Additionally, pandas is moving away from silent dtype changes (GH 54710, GH 54261).
These methods are:
Explicitly call DataFrame.infer_objects() to replicate the current behavior in the future.
result = result.infer_objects(copy=False)
Or explicitly cast all-round floats to ints using astype.
Set the following option to opt into the future behavior:
In [9]: pd.set_option("future.no_silent_downcasting", True)
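The explicit infer_objects() pattern can be sketched as follows. The example intentionally does not set the option, so it should give the same final result whether or not silent downcasting is active; on pandas 2.2 the fillna call itself may emit the deprecation warning. The sample data is made up.

```python
import pandas as pd

# An object-dtype column holding booleans and a missing value.
ser = pd.Series([True, None, False], dtype=object)

# After filling, the dtype may or may not have been downcast automatically,
# depending on the pandas version and the option described above.
filled = ser.fillna(False)

# Requesting inference explicitly makes the final dtype predictable.
result = filled.infer_objects()

print(result.dtype)
```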
Other Deprecations#
Changed Timedelta.resolution_string() to return h, min, s, ms, us, and ns instead of H, T, S, L, U, and N, for compatibility with respective deprecations in frequency aliases (GH 52536)
Deprecated offsets.Day.delta, offsets.Hour.delta, offsets.Minute.delta, offsets.Second.delta, offsets.Milli.delta, offsets.Micro.delta, offsets.Nano.delta, use pd.Timedelta(obj) instead (GH 55498)
Deprecated pandas.api.types.is_interval() and pandas.api.types.is_period(), use isinstance(obj, pd.Interval) and isinstance(obj, pd.Period) instead (GH 55264)
Deprecated read_gbq() and DataFrame.to_gbq(). Use pandas_gbq.read_gbq and pandas_gbq.to_gbq instead https://pandas-gbq.readthedocs.io/en/latest/api.html (GH 55525)
Deprecated DataFrameGroupBy.fillna() and SeriesGroupBy.fillna(); use DataFrameGroupBy.ffill(), DataFrameGroupBy.bfill() for forward and backward filling or DataFrame.fillna() to fill with a single value (or the Series equivalents) (GH 55718)
Deprecated DateOffset.is_anchored(), use obj.n == 1 for non-Tick subclasses (for Tick this was always False) (GH 55388)
Deprecated DatetimeArray.__init__() and TimedeltaArray.__init__(), use array() instead (GH 55623)
Deprecated Index.format(), use index.astype(str) or index.map(formatter) instead (GH 55413)
Deprecated Series.ravel(), the underlying array is already 1D, so ravel is not necessary (GH 52511)
Deprecated Series.resample() and DataFrame.resample() with a PeriodIndex (and the ‘convention’ keyword), convert to DatetimeIndex (with .to_timestamp()) before resampling instead (GH 53481). Note: this deprecation was later undone in pandas 2.3.3 (GH 57033)
Deprecated Series.view(), use Series.astype() instead to change the dtype (GH 20251)
Deprecated offsets.Tick.is_anchored(), use False instead (GH 55388)
Deprecated core.internals members Block, ExtensionBlock, and DatetimeTZBlock, use public APIs instead (GH 55139)
Deprecated year, month, quarter, day, hour, minute, and second keywords in the PeriodIndex constructor, use PeriodIndex.from_fields() instead (GH 55960)
Deprecated accepting a type as an argument in Index.view(), call without any arguments instead (GH 55709)
Deprecated allowing non-integer periods argument in date_range(), timedelta_range(), period_range(), and interval_range() (GH 56036)
Deprecated allowing non-keyword arguments in DataFrame.to_clipboard() (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_csv() except path_or_buf (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_dict() (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_excel() except excel_writer (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_gbq() except destination_table (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_hdf() except path_or_buf (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_html() except buf (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_json() except path_or_buf (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_latex() except buf (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_markdown() except buf (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_parquet() except path (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_pickle() except path (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_string() except buf (GH 54229)
Deprecated allowing non-keyword arguments in DataFrame.to_xml() except path_or_buffer (GH 54229)
Deprecated allowing passing BlockManager objects to DataFrame or SingleBlockManager objects to Series (GH 52419)
Deprecated behavior of Index.insert() with an object-dtype index silently performing type inference on the result, explicitly call result.infer_objects(copy=False) for the old behavior instead (GH 51363)
Deprecated casting non-datetimelike values (mainly strings) in Series.isin() and Index.isin() with datetime64, timedelta64, and PeriodDtype dtypes (GH 53111)
Deprecated dtype inference in Index, Series and DataFrame constructors when giving a pandas input, call .infer_objects on the input to keep the current behavior (GH 56012)
Deprecated dtype inference when setting an Index into a DataFrame, cast explicitly instead (GH 56102)
Deprecated including the groups in computations when using DataFrameGroupBy.apply() and DataFrameGroupBy.resample(); pass include_groups=False to exclude the groups (GH 7155)
Deprecated indexing an Index with a boolean indexer of length zero (GH 55820)
Deprecated not passing a tuple to DataFrameGroupBy.get_group or SeriesGroupBy.get_group when grouping by a length-1 list-like (GH 25971)
Deprecated string AS denoting frequency in YearBegin and strings AS-DEC, AS-JAN, etc. denoting annual frequencies with various fiscal year starts (GH 54275)
Deprecated string A denoting frequency in YearEnd and strings A-DEC, A-JAN, etc. denoting annual frequencies with various fiscal year ends (GH 54275)
Deprecated string BAS denoting frequency in BYearBegin and strings BAS-DEC, BAS-JAN, etc. denoting annual frequencies with various fiscal year starts (GH 54275)
Deprecated string BA denoting frequency in BYearEnd and strings BA-DEC, BA-JAN, etc. denoting annual frequencies with various fiscal year ends (GH 54275)
Deprecated strings H, BH, and CBH denoting frequencies in Hour, BusinessHour, CustomBusinessHour (GH 52536)
Deprecated strings H, S, U, and N denoting units in to_timedelta() (GH 52536)
Deprecated strings H, T, S, L, U, and N denoting units in Timedelta (GH 52536)
Deprecated strings T, S, L, U, and N denoting frequencies in Minute, Second, Milli, Micro, Nano (GH 52536)
Deprecated support for combining parsed datetime columns in read_csv() along with the keep_date_col keyword (GH 55569)
Deprecated the DataFrameGroupBy.grouper and SeriesGroupBy.grouper; these attributes will be removed in a future version of pandas (GH 56521)
Deprecated the Grouping attributes group_index, result_index, and group_arraylike; these will be removed in a future version of pandas (GH 56148)
Deprecated the delim_whitespace keyword in read_csv() and read_table(), use sep="\\s+" instead (GH 55569)
Deprecated the errors="ignore" option in to_datetime(), to_timedelta(), and to_numeric(); explicitly catch exceptions instead (GH 54467)
Deprecated the fastpath keyword in the Series constructor (GH 20110)
Deprecated the kind keyword in Series.resample() and DataFrame.resample(), explicitly cast the object’s index instead (GH 55895)
Deprecated the ordinal keyword in PeriodIndex, use PeriodIndex.from_ordinals() instead (GH 55960)
Deprecated the unit keyword in TimedeltaIndex construction, use to_timedelta() instead (GH 55499)
Deprecated the verbose keyword in read_csv() and read_table() (GH 55569)
Deprecated the behavior of DataFrame.replace() and Series.replace() with CategoricalDtype; in a future version replace will change the values while preserving the categories. To change the categories, use ser.cat.rename_categories instead (GH 55147)
Deprecated the behavior of Series.value_counts() and Index.value_counts() with object dtype; in a future version these will not perform dtype inference on the resulting Index, do result.index = result.index.infer_objects() to retain the old behavior (GH 56161)
Deprecated the default of observed=False in DataFrame.pivot_table(); will be True in a future version (GH 56236)
Deprecated the extension test classes BaseNoReduceTests, BaseBooleanReduceTests, and BaseNumericReduceTests, use BaseReduceTests instead (GH 54663)
Deprecated the option mode.data_manager and the ArrayManager; only the BlockManager will be available in future versions (GH 55043)
Deprecated the previous implementation of DataFrame.stack; specify future_stack=True to adopt the future version (GH 53515)
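Two of the suggested replacements above can be sketched briefly: using sep with a whitespace regex in place of delim_whitespace, and catching exceptions explicitly in place of errors="ignore". The helper to_numeric_or_original is hypothetical and only mimics the old "return the input unchanged on failure" behaviour; the sample data is made up.

```python
import io

import pandas as pd

# Replacement for delim_whitespace=True: a regular-expression separator.
text = "a   b\n1   2\n3 4\n"
df = pd.read_csv(io.StringIO(text), sep=r"\s+")


def to_numeric_or_original(values):
    """Hypothetical helper: return `values` unchanged when parsing fails."""
    try:
        return pd.to_numeric(values)
    except (ValueError, TypeError):
        return values


clean = to_numeric_or_original(pd.Series(["1", "2"]))
dirty = to_numeric_or_original(pd.Series(["1", "x"]))

print(df)
print(clean.tolist(), dirty.tolist())
```

Catching the exception at the call site makes it explicit which inputs were left unparsed, which errors="ignore" used to hide.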
Performance improvements#
Performance improvement in testing.assert_frame_equal() and testing.assert_series_equal() (GH 55949, GH 55971)
Performance improvement in concat() with axis=1 and objects with unaligned indexes (GH 55084)
Performance improvement in get_dummies() (GH 56089)
Performance improvement in merge() and merge_ordered() when joining on sorted ascending keys (GH 56115)
Performance improvement in merge_asof() when by is not None (GH 55580, GH 55678)
Performance improvement in read_stata() for files with many variables (GH 55515)
Performance improvement in DataFrame.groupby() when aggregating pyarrow timestamp and duration dtypes (GH 55031)
Performance improvement in DataFrame.join() when joining on unordered categorical indexes (GH 56345)
Performance improvement in DataFrame.loc() and Series.loc() when indexing with a MultiIndex (GH 56062)
Performance improvement in DataFrame.sort_index() and Series.sort_index() when indexed by a MultiIndex (GH 54835)
Performance improvement in DataFrame.to_dict() on converting DataFrame to dictionary (GH 50990)
Performance improvement in Index.difference() (GH 55108)
Performance improvement in Index.sort_values() when index is already sorted (GH 56128)
Performance improvement in MultiIndex.get_indexer() when method is not None (GH 55839)
Performance improvement in Series.duplicated() for pyarrow dtypes (GH 55255)
Performance improvement in Series.str.get_dummies() when dtype is "string[pyarrow]" or "string[pyarrow_numpy]" (GH 56110)
Performance improvement in Series.str() methods (GH 55736)
Performance improvement in Series.value_counts() and Series.mode() for masked dtypes (GH 54984, GH 55340)
Performance improvement in DataFrameGroupBy.nunique() and SeriesGroupBy.nunique() (GH 55972)
Performance improvement in SeriesGroupBy.idxmax(), SeriesGroupBy.idxmin(), DataFrameGroupBy.idxmax(), DataFrameGroupBy.idxmin() (GH 54234)
Performance improvement when hashing a nullable extension array (GH 56507)
Performance improvement when indexing into a non-unique index (GH 55816)
Performance improvement when indexing with more than 4 keys (GH 54550)
Performance improvement when localizing time to UTC (GH 55241)
Bug fixes#
Categorical#
Categorical.isin() raising InvalidIndexError for categorical containing overlapping Interval values (GH 34974)
Bug in CategoricalDtype.__eq__() returning False for unordered categorical data with mixed types (GH 55468)
Bug when casting pa.dictionary to CategoricalDtype using a pa.DictionaryArray as categories (GH 56672)
Datetimelike#
Bug in DatetimeIndex construction when passing both a tz and either dayfirst or yearfirst ignoring dayfirst/yearfirst (GH 55813)
Bug in DatetimeIndex when passing an object-dtype ndarray of float objects and a tz incorrectly localizing the result (GH 55780)
Bug in Series.isin() with DatetimeTZDtype dtype and comparison values that are all NaT incorrectly returning all-False even if the series contains NaT entries (GH 56427)
Bug in concat() raising AttributeError when concatenating all-NA DataFrame with DatetimeTZDtype dtype DataFrame (GH 52093)
Bug in testing.assert_extension_array_equal() that could use the wrong unit when comparing resolutions (GH 55730)
Bug in to_datetime() and DatetimeIndex when passing a list of mixed-string-and-numeric types incorrectly raising (GH 55780)
Bug in to_datetime() and DatetimeIndex when passing mixed-type objects with a mix of timezones or mix of timezone-awareness failing to raise ValueError (GH 55693)
Bug in Tick.delta() with very large ticks raising OverflowError instead of OutOfBoundsTimedelta (GH 55503)
Bug in DatetimeIndex.shift() with non-nanosecond resolution incorrectly returning with nanosecond resolution (GH 56117)
Bug in DatetimeIndex.union() returning object dtype for tz-aware indexes with the same timezone but different units (GH 55238)
Bug in Index.is_monotonic_increasing() and Index.is_monotonic_decreasing() always caching Index.is_unique() as True when first value in index is NaT (GH 55755)
Bug in Index.view() to a datetime64 dtype with non-supported resolution incorrectly raising (GH 55710)
Bug in Series.dt.round() with non-nanosecond resolution and NaT entries incorrectly raising OverflowError (GH 56158)
Bug in Series.fillna() with non-nanosecond resolution dtypes and higher-resolution vector values returning incorrect (internally-corrupted) results (GH 56410)
Bug in Timestamp.unit() being inferred incorrectly from an ISO8601 format string with minute or hour resolution and a timezone offset (GH 56208)
Bug in .astype converting from a higher-resolution datetime64 dtype to a lower-resolution datetime64 dtype (e.g. datetime64[us] -> datetime64[ms]) silently overflowing with values near the lower implementation bound (GH 55979)
Bug in adding or subtracting a Week offset to a datetime64 Series, Index, or DataFrame column with non-nanosecond resolution returning incorrect results (GH 55583)
Bug in addition or subtraction of BusinessDay offset with offset attribute to non-nanosecond Index, Series, or DataFrame column giving incorrect results (GH 55608)
Bug in addition or subtraction of DateOffset objects with microsecond components to datetime64 Index, Series, or DataFrame columns with non-nanosecond resolution (GH 55595)
Bug in addition or subtraction of very large Tick objects with Timestamp or Timedelta objects raising OverflowError instead of OutOfBoundsTimedelta (GH 55503)
Bug in creating an Index, Series, or DataFrame with a non-nanosecond DatetimeTZDtype and inputs that would be out of bounds with nanosecond resolution incorrectly raising OutOfBoundsDatetime (GH 54620)
Bug in creating an Index, Series, or DataFrame with a non-nanosecond datetime64 (or DatetimeTZDtype) from mixed-numeric inputs treating those as nanoseconds instead of as multiples of the dtype’s unit (which would happen with non-mixed numeric inputs) (GH 56004)
Bug in creating an Index, Series, or DataFrame with a non-nanosecond datetime64 dtype and inputs that would be out of bounds for a datetime64[ns] incorrectly raising OutOfBoundsDatetime (GH 55756)
Bug in parsing datetime strings with nanosecond resolution with non-ISO8601 formats incorrectly truncating sub-microsecond components (GH 56051)
Bug in parsing datetime strings with sub-second resolution and trailing zeros incorrectly inferring second or millisecond resolution (GH 55737)
Bug in the results of to_datetime() with a floating-dtype argument with unit not matching the pointwise results of Timestamp (GH 56037)
Fixed regression where concat() would raise an error when concatenating datetime64 columns with differing resolutions (GH 53641)
Timedelta#
Bug in Timedelta construction raising OverflowError instead of OutOfBoundsTimedelta (GH 55503)
Bug in rendering (__repr__) of TimedeltaIndex and Series with timedelta64 values with non-nanosecond resolution entries that are all multiples of 24 hours failing to use the compact representation used in the nanosecond cases (GH 55405)
Timezones#
Bug in AbstractHolidayCalendar where timezone data was not propagated when computing holiday observances (GH 54580)
Bug in Timestamp construction with an ambiguous value and a pytz timezone failing to raise pytz.AmbiguousTimeError (GH 55657)
Bug in Timestamp.tz_localize() with nonexistent="shift_forward" around UTC+0 during DST (GH 51501)
Numeric#
Bug in read_csv() with engine="pyarrow" causing rounding errors for large integers (GH 52505)
Bug in Series.__floordiv__() and Series.__truediv__() for ArrowDtype with integral dtypes raising for large divisors (GH 56706)
Bug in Series.__floordiv__() for ArrowDtype with integral dtypes raising for large values (GH 56645)
Bug in Series.pow() not filling missing values correctly (GH 55512)
Bug in Series.replace() and DataFrame.replace() matching float 0.0 with False and vice versa (GH 55398)
Bug in Series.round() raising for nullable boolean dtype (GH 55936)
Conversion#
Bug in DataFrame.astype() when called with str on an unpickled array, where the array might change in-place (GH 54654)
Bug in DataFrame.astype() where errors="ignore" had no effect for extension types (GH 54654)
Bug in Series.convert_dtypes() not converting all NA column to null[pyarrow] (GH 55346)
Bug in DataFrame.loc not throwing “incompatible dtype warning” (see PDEP6) when assigning a Series with a different dtype using a full column setter (e.g. df.loc[:, 'a'] = incompatible_value) (GH 39584)
Strings#
Bug in pandas.api.types.is_string_dtype() when checking whether an object array with no elements is of the string dtype (GH 54661)
Bug in DataFrame.apply() failing when engine="numba" and columns or index have StringDtype (GH 56189)
Bug in DataFrame.reindex() not matching Index with string[pyarrow_numpy] dtype (GH 56106)
Bug in Index.str.cat() always casting result to object dtype (GH 56157)
Bug in Series.__mul__() for ArrowDtype with pyarrow.string dtype and string[pyarrow] for the pyarrow backend (GH 51970)
Bug in Series.str.find() when start < 0 for ArrowDtype with pyarrow.string (GH 56411)
Bug in Series.str.fullmatch() when dtype=pandas.ArrowDtype(pyarrow.string()) allowing partial matches when regex ends in literal //$ (GH 56652)
Bug in Series.str.replace() when n < 0 for ArrowDtype with pyarrow.string (GH 56404)
Bug in Series.str.startswith() and Series.str.endswith() with arguments of type tuple[str, ...] for ArrowDtype with pyarrow.string dtype (GH 56579)
Bug in Series.str.startswith() and Series.str.endswith() with arguments of type tuple[str, ...] for string[pyarrow] (GH 54942)
Bug in comparison operations for dtype="string[pyarrow_numpy]" raising if dtypes can’t be compared (GH 56008)
Interval#
Bug in Interval __repr__ not displaying UTC offsets for Timestamp bounds. Additionally the hour, minute and second components will now be shown (GH 55015)
Bug in IntervalIndex.factorize() and Series.factorize() with IntervalDtype with datetime64 or timedelta64 intervals not preserving non-nanosecond units (GH 56099)
Bug in IntervalIndex.from_arrays() when passed datetime64 or timedelta64 arrays with mismatched resolutions constructing an invalid IntervalArray object (GH 55714)
Bug in IntervalIndex.from_tuples() raising if subtype is a nullable extension dtype (GH 56765)
Bug in IntervalIndex.get_indexer() with datetime or timedelta intervals incorrectly matching on integer targets (GH 47772)
Bug in IntervalIndex.get_indexer() with timezone-aware datetime intervals incorrectly matching on a sequence of timezone-naive targets (GH 47772)
Bug in setting values on a Series with an IntervalIndex using a slice incorrectly raising (GH 54722)
Indexing#
Bug in DataFrame.loc() mutating a boolean indexer when DataFrame has a MultiIndex (GH 56635)
Bug in DataFrame.loc() when setting Series with extension dtype into NumPy dtype (GH 55604)
Bug in Index.difference() not returning a unique set of values when other is empty or other is considered non-comparable (GH 55113)
Bug in setting Categorical values into a DataFrame with numpy dtypes raising RecursionError (GH 52927)
Fixed bug when creating new column with missing values when setting a single string value (GH 56204)
Missing#
Bug in DataFrame.update() not updating in-place for tz-aware datetime64 dtypes (GH 56227)
MultiIndex#
Bug in MultiIndex.get_indexer() not raising ValueError when method is provided and index is non-monotonic (GH 53452)
I/O#
Bug in read_csv() where engine="python" did not respect chunksize arg when skiprows was specified (GH 56323)
Bug in read_csv() where engine="python" was causing a TypeError when a callable skiprows and a chunk size were specified (GH 55677)
Bug in read_csv() where on_bad_lines="warn" would write to stderr instead of raising a Python warning; this now yields an errors.ParserWarning (GH 54296)
Bug in read_csv() with engine="pyarrow" where quotechar was ignored (GH 52266)
Bug in read_csv() with engine="pyarrow" where usecols wasn’t working with a CSV with no headers (GH 54459)
Bug in read_excel(), with engine="xlrd" (xls files) erroring when the file contains NaN or Inf (GH 54564)
Bug in read_json() not handling dtype conversion properly if infer_string is set (GH 56195)
Bug in DataFrame.to_excel(), with OdsWriter (ods files) writing Boolean/string value (GH 54994)
Bug in DataFrame.to_hdf() and read_hdf() with datetime64 dtypes with non-nanosecond resolution failing to round-trip correctly (GH 55622)
Bug in DataFrame.to_stata() raising for extension dtypes (GH 54671)
Bug in read_excel() with engine="odf" (ods files) when a string cell contains an annotation (GH 55200)
Bug in read_excel() with an ODS file without cached formatted cell for float values (GH 55219)
Bug where DataFrame.to_json() would raise an OverflowError instead of a TypeError with unsupported NumPy types (GH 55403)
Period#
Bug in PeriodIndex construction when more than one of data, ordinal and **fields are passed failing to raise ValueError (GH 55961)
Bug in Period addition silently wrapping around instead of raising OverflowError (GH 55503)
Bug in casting from PeriodDtype with astype to datetime64 or DatetimeTZDtype with non-nanosecond unit incorrectly returning with nanosecond unit (GH 55958)
Plotting#
Bug in DataFrame.plot.box() with vert=False and a Matplotlib Axes created with sharey=True (GH 54941)
Bug in DataFrame.plot.scatter() discarding string columns (GH 56142)
Bug in Series.plot() when reusing an ax object failing to raise when a how keyword is passed (GH 55953)
Groupby/resample/rolling#
Bug in DataFrameGroupBy.idxmin(), DataFrameGroupBy.idxmax(), SeriesGroupBy.idxmin(), and SeriesGroupBy.idxmax() not retaining Categorical dtype when the index was a CategoricalIndex that contained NA values (GH 54234)
Bug in DataFrameGroupBy.transform() and SeriesGroupBy.transform() when observed=False and f="idxmin" or f="idxmax" incorrectly raising on unobserved categories (GH 54234)
Bug in DataFrameGroupBy.value_counts() and SeriesGroupBy.value_counts() that could result in incorrect sorting if the columns of the DataFrame or name of the Series are integers (GH 55951)
Bug in DataFrameGroupBy.value_counts() and SeriesGroupBy.value_counts() not respecting sort=False in DataFrame.groupby() and Series.groupby() (GH 55951)
Bug in DataFrameGroupBy.value_counts() and SeriesGroupBy.value_counts() sorting by proportions rather than frequencies when sort=True and normalize=True (GH 55951)
Bug in DataFrame.asfreq() and Series.asfreq() with a DatetimeIndex with non-nanosecond resolution incorrectly converting to nanosecond resolution (GH 55958)
Bug in DataFrame.ewm() when passed times with non-nanosecond datetime64 or DatetimeTZDtype dtype (GH 56262)
Bug in DataFrame.groupby() and Series.groupby() where grouping by a combination of Decimal and NA values would fail when sort=True (GH 54847)
Bug in DataFrame.groupby() for DataFrame subclasses when selecting a subset of columns to apply the function to (GH 56761)
Bug in DataFrame.resample() not respecting closed and label arguments for BusinessDay (GH 55282)
Bug in DataFrame.resample() when resampling on an ArrowDtype of pyarrow.timestamp or pyarrow.duration type (GH 55989)
Bug in DataFrame.resample() where bin edges were not correct for BusinessDay (GH 55281)
Bug in DataFrame.resample() where bin edges were not correct for MonthBegin (GH 55271)
Bug in DataFrame.rolling() and Series.rolling() where duplicate datetimelike indexes are treated as consecutive rather than equal with closed='left' and closed='neither' (GH 20712)
Bug in DataFrame.rolling() and Series.rolling() where either the index or on column was ArrowDtype with pyarrow.timestamp type (GH 55849)
Reshaping#
Bug in concat() ignoring sort parameter when passed DatetimeIndex indexes (GH 54769)
Bug in concat() renaming Series when ignore_index=False (GH 15047)
Bug in merge_asof() raising TypeError when by dtype is not object, int64, or uint64 (GH 22794)
Bug in merge_asof() raising incorrect error for string dtype (GH 56444)
Bug in merge_asof() when using a Timedelta tolerance on an ArrowDtype column (GH 56486)
Bug in merge() not raising when merging datetime columns with timedelta columns (GH 56455)
Bug in merge() not raising when merging string columns with numeric columns (GH 56441)
Bug in merge() returning columns in incorrect order when left and/or right is empty (GH 51929)
Bug in DataFrame.melt() where an exception was raised if var_name was not a string (GH 55948)
Bug in DataFrame.melt() where it would not preserve the datetime (GH 55254)
Bug in DataFrame.pivot_table() where the row margin is incorrect when the columns have numeric names (GH 26568)
Bug in DataFrame.pivot() with numeric columns and extension dtype for data (GH 56528)
Bug in DataFrame.stack() with future_stack=True not preserving NA values in the index (GH 56573)
Sparse#
Bug in arrays.SparseArray.take() when using a different fill value than the array’s fill value (GH 55181)
Other#
DataFrame.__dataframe__() did not support pyarrow large strings (GH 56702)
Bug in DataFrame.describe() where formatting percentiles rounded the resulting percentile 99.999% to 100% (GH 55765)
Bug in api.interchange.from_dataframe() where it raised NotImplementedError when handling empty string columns (GH 56703)
Bug in cut() and qcut() with datetime64 dtype values with non-nanosecond units incorrectly returning nanosecond-unit bins (GH 56101)
Bug in cut() incorrectly allowing cutting of timezone-aware datetimes with timezone-naive bins (GH 54964)
Bug in infer_freq() and DatetimeIndex.inferred_freq() with weekly frequencies and non-nanosecond resolutions (GH 55609)
Bug in DataFrame.apply() where passing raw=True ignored args passed to the applied function (GH 55009)
Bug in DataFrame.from_dict() which would always sort the rows of the created DataFrame (GH 55683)
Bug in DataFrame.sort_index() when passing axis="columns" and ignore_index=True raising a ValueError (GH 56478)
Bug in rendering inf values inside a DataFrame with the use_inf_as_na option enabled (GH 55483)
Bug in rendering a Series with a MultiIndex when one of the index level’s names is 0 not having that name displayed (GH 55415)
Bug in the error message when assigning an empty DataFrame to a column (GH 55956)
Bug when time-like strings were being cast to ArrowDtype with pyarrow.time64 type (GH 56463)
Fixed a spurious deprecation warning from numba >= 0.58.0 when passing a numpy ufunc in core.window.Rolling.apply with engine="numba" (GH 55247)
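To illustrate the DataFrame.apply() fix for raw=True (GH 55009), here is a minimal sketch, assuming pandas 2.2 or later. The function row_sum_plus and its offset parameter are hypothetical names used only for this example.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 3], "y": [2, 4]})


def row_sum_plus(row, offset):
    # With raw=True, `row` is a NumPy array rather than a Series.
    return row.sum() + offset


# The extra positional argument in `args` is forwarded to the function.
result = df.apply(row_sum_plus, axis=1, raw=True, args=(10,))

print(result)
```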
Contributors#
A total of 162 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
AG
Aaron Rahman +
Abdullah Ihsan Secer +
Abhijit Deo +
Adrian D’Alessandro
Ahmad Mustafa Anis +
Amanda Bizzinotto
Amith KK +
Aniket Patil +
Antonio Fonseca +
Artur Barseghyan
Ben Greiner
Bill Blum +
Boyd Kane
Damian Kula
Dan King +
Daniel Weindl +
Daniele Nicolodi
David Poznik
David Toneian +
Dea María Léon
Deepak George +
Dmitriy +
Dominique Garmier +
Donald Thevalingam +
Doug Davis +
Dukastlik +
Elahe Sharifi +
Eric Han +
Fangchen Li
Francisco Alfaro +
Gadea Autric +
Guillaume Lemaitre
Hadi Abdi Khojasteh
Hedeer El Showk +
Huanghz2001 +
Isaac Virshup
Issam +
Itay Azolay +
Itayazolay +
Jaca +
Jack McIvor +
JackCollins91 +
James Spencer +
Jay
Jessica Greene
Jirka Borovec +
JohannaTrost +
John C +
Joris Van den Bossche
José Lucas Mayer +
José Lucas Silva Mayer +
João Andrade +
Kai Mühlbauer
Katharina Tielking, MD +
Kazuto Haruguchi +
Kevin
Lawrence Mitchell
Linus +
Linus Sommer +
Louis-Émile Robitaille +
Luke Manley
Lumberbot (aka Jack)
Maggie Liu +
MainHanzo +
Marc Garcia
Marco Edward Gorelli
MarcoGorelli
Martin Šícho +
Mateusz Sokół
Matheus Felipe +
Matthew Roeschke
Matthias Bussonnier
Maxwell Bileschi +
Michael Tiemann
Michał Górny
Molly Bowers +
Moritz Schubert +
NNLNR +
Natalia Mokeeva
Nils Müller-Wendt +
Omar Elbaz
Pandas Development Team
Paras Gupta +
Parthi
Patrick Hoefler
Paul Pellissier +
Paul Uhlenbruck +
Philip Meier
Philippe THOMY +
Quang Nguyễn
Raghav
Rajat Subhra Mukherjee
Ralf Gommers
Randolf Scholz +
Richard Shadrach
Rob +
Rohan Jain +
Ryan Gibson +
Sai-Suraj-27 +
Samuel Oranyeli +
Sara Bonati +
Sebastian Berg
Sergey Zakharov +
Shyamala Venkatakrishnan +
StEmGeo +
Stefanie Molin
Stijn de Gooijer +
Thiago Gariani +
Thomas A Caswell
Thomas Baumann +
Thomas Guillet +
Thomas Lazarus +
Thomas Li
Tim Hoffmann
Tim Swast
Tom Augspurger
Toro +
Torsten Wörtwein
Ville Aikas +
Vinita Parasrampuria +
Vyas Ramasubramani +
William Andrea
William Ayd
Willian Wang +
Xiao Yuan
Yao Xiao
Yves Delley
Zemux1613 +
Ziad Kermadi +
aaron-robeson-8451 +
aram-cinnamon +
caneff +
ccccjone +
chris-caballero +
cobalt
color455nm +
denisrei +
dependabot[bot]
jbrockmendel
jfadia +
johanna.trost +
kgmuzungu +
mecopur +
mhb143 +
morotti +
mvirts +
omar-elbaz
paulreece
pre-commit-ci[bot]
raj-thapa
rebecca-palmer
rmhowe425
rohanjain101
shiersansi +
smij720
srkds +
taytzehao
torext
vboxuser +
xzmeng +
yashb +