What’s new in 1.0.0 (January 29, 2020)#
These are the changes in pandas 1.0.0. See Release notes for a full changelog including other versions of pandas.
Note
The pandas 1.0 release removed a lot of functionality that was deprecated in previous releases (see below for an overview). It is recommended to first upgrade to pandas 0.25 and to ensure your code is working without warnings, before upgrading to pandas 1.0.
New deprecation policy#
Starting with pandas 1.0.0, pandas will adopt a variant of SemVer to version releases. Briefly,
Deprecations will be introduced in minor releases (e.g. 1.1.0, 1.2.0, 2.1.0, …)
Deprecations will be enforced in major releases (e.g. 1.0.0, 2.0.0, 3.0.0, …)
API-breaking changes will be made only in major releases (except for experimental features)
See Version policy for more.
Enhancements#
Using Numba in rolling.apply and expanding.apply#
We’ve added an engine keyword to Rolling.apply() and Expanding.apply() that allows the user to execute the routine using Numba instead of Cython.
Using the Numba engine can yield significant performance gains if the apply function can operate on NumPy arrays and the data set is large (1 million rows or greater). For more details, see the rolling apply documentation (GH 28987, GH 30936)
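As a rough illustration of the engine keyword, the sketch below applies a window mean with engine="numba" and falls back to the default Cython engine when the optional numba package is not installed. The function name window_mean and the sample data are made up for this example; raw=True is passed so that the function receives NumPy arrays.

```python
import numpy as np
import pandas as pd


def window_mean(values):
    # With raw=True, each window arrives as a plain NumPy array.
    return np.mean(values)


s = pd.Series(np.arange(10, dtype="float64"))

try:
    # Requires the optional numba dependency.
    result = s.rolling(3).apply(window_mean, raw=True, engine="numba")
except ImportError:
    # Fall back to the default Cython engine when numba is unavailable.
    result = s.rolling(3).apply(window_mean, raw=True, engine="cython")

print(result.head())
```

On a series this small the Numba compilation cost dominates, so any speed-up would only be expected on much larger inputs.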
Defining custom windows for rolling operations#
We’ve added a pandas.api.indexers.BaseIndexer() class that allows users to define how window bounds are created during rolling operations. Users can define their own get_window_bounds method on a pandas.api.indexers.BaseIndexer() subclass that will generate the start and end indices used for each window during the rolling aggregation. For more details and example usage, see the custom window rolling documentation
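A minimal sketch of such a subclass follows. The class name ForwardWindow is hypothetical, and the get_window_bounds signature shown (including the step parameter) is an assumption that matches recent pandas releases; pandas checks the signature against BaseIndexer, so older versions may expect a different parameter list.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer


class ForwardWindow(BaseIndexer):
    """Hypothetical indexer: each window covers the current row and the rows after it."""

    def get_window_bounds(
        self, num_values=0, min_periods=None, center=None, closed=None, step=None
    ):
        # Window i starts at row i ...
        start = np.arange(num_values, dtype=np.int64)
        # ... and ends window_size rows later, clipped at the end of the data.
        end = np.clip(start + self.window_size, 0, num_values).astype(np.int64)
        return start, end


df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
out = df.rolling(ForwardWindow(window_size=2), min_periods=1).sum()
print(out)
```

Here window_size is stored on the instance by the BaseIndexer constructor, and the returned arrays are half-open [start, end) positions for each row.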
Converting to markdown#
We’ve added DataFrame.to_markdown() for creating a markdown table (GH 11052)
In [1]: df = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=['a', 'a', 'b'])
In [2]: print(df.to_markdown())
| | A | B |
|:---|----:|----:|
| a | 1 | 1 |
| a | 2 | 2 |
| b | 3 | 3 |
Experimental new features#
Experimental NA scalar to denote missing values#
A new pd.NA value (singleton) is introduced to represent scalar missing values. Up to now, pandas used several values to represent missing data: np.nan is used for this for float data, np.nan or None for object-dtype data and pd.NaT for datetime-like data. The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types. pd.NA is currently used by the nullable integer and boolean data types and the new string data type (GH 28095).
Warning
Experimental: the behaviour of pd.NA can still change without warning.
For example, creating a Series using the nullable integer dtype:
In [3]: s = pd.Series([1, 2, None], dtype="Int64")
In [4]: s
Out[4]:
0 1
1 2
2 <NA>
Length: 3, dtype: Int64
In [5]: s[2]
Out[5]: <NA>
Compared to np.nan, pd.NA behaves differently in certain operations.
In addition to arithmetic operations, pd.NA also propagates as “missing” or “unknown” in comparison operations:
In [6]: np.nan > 1
Out[6]: False
In [7]: pd.NA > 1
Out[7]: <NA>
For logical operations, pd.NA follows the rules of the three-valued logic (or Kleene logic). For example:
In [8]: pd.NA | True
Out[8]: True
For more, see NA section in the user guide on missing data.
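To make the three-valued logic more concrete, the short sketch below evaluates the four basic combinations. The result is only known when the other operand already decides it; otherwise pd.NA is returned. The variable names are illustrative only.

```python
import pandas as pd

# The outcome is determined regardless of the missing value.
known_or = pd.NA | True     # True
known_and = pd.NA & False   # False

# The outcome depends on the missing value, so it stays missing.
unknown_or = pd.NA | False  # <NA>
unknown_and = pd.NA & True  # <NA>

# pd.NA has no truth value, so using it in a boolean context raises.
try:
    bool(pd.NA)
    raised = False
except TypeError:
    raised = True

print(known_or, known_and, unknown_or, unknown_and, raised)
```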
Dedicated string data type#
We’ve added StringDtype, an extension type dedicated to string data.
Previously, strings were typically stored in object-dtype NumPy arrays. (GH 29975)
Warning
StringDtype is currently considered experimental. The implementation and parts of the API may change without warning.
The 'string' extension type solves several issues with object-dtype NumPy arrays:
You can accidentally store a mixture of strings and non-strings in an object dtype array. A StringArray can only store strings.
object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text, but still object-dtype columns.
When reading code, the contents of an object dtype array is less clear than string.
In [9]: pd.Series(['abc', None, 'def'], dtype=pd.StringDtype())
Out[9]:
0 abc
1 <NA>
2 def
Length: 3, dtype: string
You can use the alias "string" as well.
In [10]: s = pd.Series(['abc', None, 'def'], dtype="string")
In [11]: s
Out[11]:
0 abc
1 <NA>
2 def
Length: 3, dtype: string
The usual string accessor methods work. Where appropriate, the return type of the Series or columns of a DataFrame will also have string dtype.
In [12]: s.str.upper()
Out[12]:
0 ABC
1 <NA>
2 DEF
Length: 3, dtype: string
In [13]: s.str.split('b', expand=True).dtypes
Out[13]:
0 string[python]
1 string[python]
Length: 2, dtype: object
String accessor methods returning integers will return a value with Int64Dtype
In [14]: s.str.count("a")
Out[14]:
0 1
1 <NA>
2 0
Length: 3, dtype: Int64
We recommend explicitly using the string data type when working with strings.
See Text data types for more.
Boolean data type with missing values support#
We’ve added BooleanDtype / BooleanArray, an extension type dedicated to boolean data that can hold missing values. The default bool data type is based on a bool-dtype NumPy array, so the column can only hold True or False, and not missing values. This new BooleanArray can store missing values as well by keeping track of this in a separate mask. (GH 29555, GH 30095, GH 31131)
In [15]: pd.Series([True, False, None], dtype=pd.BooleanDtype())
Out[15]:
0 True
1 False
2 <NA>
Length: 3, dtype: boolean
You can use the alias "boolean" as well.
In [16]: s = pd.Series([True, False, None], dtype="boolean")
In [17]: s
Out[17]:
0 True
1 False
2 <NA>
Length: 3, dtype: boolean
Method convert_dtypes to ease use of supported extension dtypes#
In order to encourage use of the extension dtypes StringDtype, BooleanDtype, Int64Dtype, Int32Dtype, etc., that support pd.NA, the methods DataFrame.convert_dtypes() and Series.convert_dtypes() have been introduced. (GH 29752) (GH 30929)
Example:
In [18]: df = pd.DataFrame({'x': ['abc', None, 'def'],
....: 'y': [1, 2, np.nan],
....: 'z': [True, False, True]})
....:
In [19]: df
Out[19]:
x y z
0 abc 1.0 True
1 None 2.0 False
2 def NaN True
[3 rows x 3 columns]
In [20]: df.dtypes
Out[20]:
x object
y float64
z bool
Length: 3, dtype: object
In [21]: converted = df.convert_dtypes()
In [22]: converted
Out[22]:
x y z
0 abc 1 True
1 <NA> 2 False
2 def <NA> True
[3 rows x 3 columns]
In [23]: converted.dtypes
Out[23]:
x string[python]
y Int64
z boolean
Length: 3, dtype: object
This is especially useful after reading in data using readers such as read_csv() and read_excel().
See here for a description.
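The reader use case can be sketched as follows, using an in-memory CSV whose contents are invented for this example. read_csv() falls back to float64 for the integer column with a missing entry, and convert_dtypes() then moves the columns to the nullable extension dtypes.

```python
import io

import pandas as pd

# Made-up CSV data with one missing value in the "name" and "count" columns.
raw = io.StringIO("name,count,flag\nabc,1,True\n,2,False\ndef,,True\n")

df = pd.read_csv(raw)
converted = df.convert_dtypes()

print(df.dtypes)
print(converted.dtypes)
```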
Other enhancements#
DataFrame.to_string() added the max_colwidth parameter to control when wide columns are truncated (GH 9784)
Added the na_value argument to Series.to_numpy(), Index.to_numpy() and DataFrame.to_numpy() to control the value used for missing data (GH 30322)
MultiIndex.from_product() infers level names from inputs if not explicitly provided (GH 27292)
DataFrame.to_latex() now accepts caption and label arguments (GH 25436)
DataFrames with nullable integer, the new string dtype and period data type can now be converted to pyarrow (>=0.15.0), which means that it is supported in writing to the Parquet file format when using the pyarrow engine (GH 28368). Full roundtrip to parquet (writing and reading back in with to_parquet() / read_parquet()) is supported starting with pyarrow >= 0.16 (GH 20612).
to_parquet() now appropriately handles the schema argument for user defined schemas in the pyarrow engine. (GH 30270)
DataFrame.to_json() now accepts an indent integer argument to enable pretty printing of JSON output (GH 12004)
read_stata() can read Stata 119 dta files. (GH 28250)
Implemented pandas.core.window.Window.var() and pandas.core.window.Window.std() functions (GH 26597)
Added encoding argument to DataFrame.to_string() for non-ascii text (GH 28766)
Added encoding argument to DataFrame.to_html() for non-ascii text (GH 28663)
Styler.background_gradient() now accepts vmin and vmax arguments (GH 12145)
Styler.format() added the na_rep parameter to help format the missing values (GH 21527, GH 28358)
read_excel() now can read binary Excel (.xlsb) files by passing engine='pyxlsb'. For more details and example usage, see the Binary Excel files documentation. Closes GH 8540.
The partition_cols argument in DataFrame.to_parquet() now accepts a string (GH 27117)
pandas.read_json() now parses NaN, Infinity and -Infinity (GH 12213)
The DataFrame constructor preserves ExtensionArray dtype when passed an ExtensionArray (GH 11363)
DataFrame.sort_values() and Series.sort_values() have gained an ignore_index keyword to be able to reset index after sorting (GH 30114)
DataFrame.sort_index() and Series.sort_index() have gained an ignore_index keyword to reset index (GH 30114)
DataFrame.drop_duplicates() has gained an ignore_index keyword to reset index (GH 30114)
Added new writer for exporting Stata dta files in versions 118 and 119, StataWriterUTF8. These file formats support exporting strings containing Unicode characters. Format 119 supports data sets with more than 32,767 variables (GH 23573, GH 30959)
Series.map() now accepts collections.abc.Mapping subclasses as a mapper (GH 29733)
Added an experimental attrs for storing global metadata about a dataset (GH 29062)
Timestamp.fromisocalendar() is now compatible with python 3.8 and above (GH 28115)
DataFrame.to_pickle() and read_pickle() now accept URL (GH 30163)
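A few of the keywords listed above can be tried together in a short sketch. The frame and its column names are invented for illustration; only ignore_index and na_value come from the list.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["b", "a", "b"], "val": [2, 1, 2]})

# ignore_index=True relabels the result 0..n-1 instead of keeping the old labels.
sorted_df = df.sort_values("key", ignore_index=True)
deduped = df.drop_duplicates(ignore_index=True)

# na_value picks the value written in place of missing entries.
s = pd.Series([1, None], dtype="Int64")
arr = s.to_numpy(dtype="float64", na_value=np.nan)

print(sorted_df)
print(deduped)
print(arr)
```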
Backwards incompatible API changes#
Avoid using names from MultiIndex.levels#
As part of a larger refactor to MultiIndex the level names are now stored separately from the levels (GH 27242). We recommend using MultiIndex.names to access the names, and Index.set_names() to update the names.
For backwards compatibility, you can still access the names via the levels.
In [24]: mi = pd.MultiIndex.from_product([[1, 2], ['a', 'b']], names=['x', 'y'])
In [25]: mi.levels[0].name
Out[25]: 'x'
However, it is no longer possible to update the names of the MultiIndex via the level.
In [26]: mi.levels[0].name = "new name"
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[26], line 1
----> 1 mi.levels[0].name = "new name"
File ~/work/pandas/pandas/pandas/core/indexes/base.py:1675, in Index.name(self, value)
1671 @name.setter
1672 def name(self, value: Hashable) -> None:
1673 if self._no_setting_name:
1674 # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1675 raise RuntimeError(
1676 "Cannot set name on a level of a MultiIndex. Use "
1677 "'MultiIndex.set_names' instead."
1678 )
1679 maybe_extract_name(value, None, type(self))
1680 self._name = value
RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead.
In [27]: mi.names
Out[27]: FrozenList(['x', 'y'])
To update, use MultiIndex.set_names, which returns a new MultiIndex.
In [28]: mi2 = mi.set_names("new name", level=0)
In [29]: mi2.names
Out[29]: FrozenList(['new name', 'y'])
New repr for IntervalArray#
pandas.arrays.IntervalArray adopts a new __repr__ in accordance with other array classes (GH 25022)
pandas 0.25.x
In [1]: pd.arrays.IntervalArray.from_tuples([(0, 1), (2, 3)])
Out[1]:
IntervalArray([(0, 1], (2, 3]],
closed='right',
dtype='interval[int64]')
pandas 1.0.0
In [30]: pd.arrays.IntervalArray.from_tuples([(0, 1), (2, 3)])
Out[30]:
<IntervalArray>
[(0, 1], (2, 3]]
Length: 2, dtype: interval[int64, right]
DataFrame.rename now only accepts one positional argument#
DataFrame.rename() would previously accept positional arguments that would lead to ambiguous or undefined behavior. From pandas 1.0, only the very first argument, which maps labels to their new names along the default axis, is allowed to be passed by position (GH 29136).
pandas 0.25.x
In [1]: df = pd.DataFrame([[1]])
In [2]: df.rename({0: 1}, {0: 2})
Out[2]:
FutureWarning: ...Use named arguments to resolve ambiguity...
2
1 1
pandas 1.0.0
In [3]: df.rename({0: 1}, {0: 2})
Traceback (most recent call last):
...
TypeError: rename() takes from 1 to 2 positional arguments but 3 were given
Note that errors will now be raised when conflicting or potentially ambiguous arguments are provided.
pandas 0.25.x
In [4]: df.rename({0: 1}, index={0: 2})
Out[4]:
0
1 1
In [5]: df.rename(mapper={0: 1}, index={0: 2})
Out[5]:
0
2 1
pandas 1.0.0
In [6]: df.rename({0: 1}, index={0: 2})
Traceback (most recent call last):
...
TypeError: Cannot specify both 'mapper' and any of 'index' or 'columns'
In [7]: df.rename(mapper={0: 1}, index={0: 2})
Traceback (most recent call last):
...
TypeError: Cannot specify both 'mapper' and any of 'index' or 'columns'
You can still change the axis along which the first positional argument is applied by supplying the axis keyword argument.
In [31]: df.rename({0: 1})
Out[31]:
0
1 1
[1 rows x 1 columns]
In [32]: df.rename({0: 1}, axis=1)
Out[32]:
1
0 1
[1 rows x 1 columns]
If you would like to update both the index and column labels, be sure to use the respective keywords.
In [33]: df.rename(index={0: 1}, columns={0: 2})
Out[33]:
2
1 1
[1 rows x 1 columns]
Extended verbose info output for DataFrame#
DataFrame.info() now shows line numbers for the columns summary (GH 17304)
pandas 0.25.x
In [1]: df = pd.DataFrame({"int_col": [1, 2, 3],
... "text_col": ["a", "b", "c"],
... "float_col": [0.0, 0.1, 0.2]})
In [2]: df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
int_col 3 non-null int64
text_col 3 non-null object
float_col 3 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 152.0+ bytes
pandas 1.0.0
In [34]: df = pd.DataFrame({"int_col": [1, 2, 3],
....: "text_col": ["a", "b", "c"],
....: "float_col": [0.0, 0.1, 0.2]})
....:
In [35]: df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 int_col 3 non-null int64
1 text_col 3 non-null object
2 float_col 3 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes
pandas.array() inference changes#
pandas.array() now infers pandas’ new extension types in several cases (GH 29791):
String data (including missing values) now returns an arrays.StringArray.
Integer data (including missing values) now returns an arrays.IntegerArray.
Boolean data (including missing values) now returns the new arrays.BooleanArray.
pandas 0.25.x
In [1]: pd.array(["a", None])
Out[1]:
<PandasArray>
['a', None]
Length: 2, dtype: object
In [2]: pd.array([1, None])
Out[2]:
<PandasArray>
[1, None]
Length: 2, dtype: object
pandas 1.0.0
In [36]: pd.array(["a", None])
Out[36]:
<StringArray>
['a', <NA>]
Length: 2, dtype: string
In [37]: pd.array([1, None])
Out[37]:
<IntegerArray>
[1, <NA>]
Length: 2, dtype: Int64
As a reminder, you can specify the dtype to disable all inference.
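For instance, the following sketch contrasts the inferred result with one where the dtype is given explicitly. The choice of "int64" is just an example; any supported dtype could be passed.

```python
import pandas as pd

# Without a dtype, integer data is inferred as the nullable Int64 extension type.
inferred = pd.array([1, 2])

# With an explicit NumPy dtype, no inference takes place.
explicit = pd.array([1, 2], dtype="int64")

print(inferred.dtype, explicit.dtype)
```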
arrays.IntegerArray now uses pandas.NA#
arrays.IntegerArray now uses pandas.NA rather than numpy.nan as its missing value marker (GH 29964).
pandas 0.25.x
In [1]: a = pd.array([1, 2, None], dtype="Int64")
In [2]: a
Out[2]:
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64
In [3]: a[2]
Out[3]:
nan
pandas 1.0.0
In [38]: a = pd.array([1, 2, None], dtype="Int64")
In [39]: a
Out[39]:
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
In [40]: a[2]
Out[40]: <NA>
This has a few API-breaking consequences.
Converting to a NumPy ndarray
When converting to a NumPy array, missing values will be pd.NA, which cannot be converted to a float. So calling np.asarray(integer_array, dtype="float") will now raise.
pandas 0.25.x
In [1]: np.asarray(a, dtype="float")
Out[1]:
array([ 1., 2., nan])
pandas 1.0.0
In [41]: np.asarray(a, dtype="float")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[41], line 1
----> 1 np.asarray(a, dtype="float")
File ~/work/pandas/pandas/pandas/core/arrays/masked.py:575, in BaseMaskedArray.__array__(self, dtype)
570 def __array__(self, dtype: NpDtype | None = None) -> np.ndarray:
571 """
572 the array interface, return my values
573 We return an object array here to preserve our scalar values
574 """
--> 575 return self.to_numpy(dtype=dtype)
File ~/work/pandas/pandas/pandas/core/arrays/masked.py:487, in BaseMaskedArray.to_numpy(self, dtype, copy, na_value)
481 if self._hasna:
482 if (
483 dtype != object
484 and not is_string_dtype(dtype)
485 and na_value is libmissing.NA
486 ):
--> 487 raise ValueError(
488 f"cannot convert to '{dtype}'-dtype NumPy array "
489 "with missing values. Specify an appropriate 'na_value' "
490 "for this dtype."
491 )
492 # don't pass copy to astype -> always need a copy since we are mutating
493 with warnings.catch_warnings():
ValueError: cannot convert to 'float64'-dtype NumPy array with missing values. Specify an appropriate 'na_value' for this dtype.
Use arrays.IntegerArray.to_numpy() with an explicit na_value instead.
In [42]: a.to_numpy(dtype="float", na_value=np.nan)
Out[42]: array([ 1., 2., nan])
Reductions can return pd.NA
When performing a reduction such as a sum with skipna=False, the result will now be pd.NA instead of np.nan in the presence of missing values (GH 30958).
pandas 0.25.x
In [1]: pd.Series(a).sum(skipna=False)
Out[1]:
nan
pandas 1.0.0
In [43]: pd.Series(a).sum(skipna=False)
Out[43]: <NA>
value_counts returns a nullable integer dtype
Series.value_counts() with a nullable integer dtype now returns a nullable integer dtype for the values.
pandas 0.25.x
In [1]: pd.Series([2, 1, 1, None], dtype="Int64").value_counts().dtype
Out[1]:
dtype('int64')
pandas 1.0.0
In [44]: pd.Series([2, 1, 1, None], dtype="Int64").value_counts().dtype
Out[44]: Int64Dtype()
See Experimental NA scalar to denote missing values for more on the differences between pandas.NA and numpy.nan.
arrays.IntegerArray comparisons return arrays.BooleanArray#
Comparison operations on an arrays.IntegerArray now return an arrays.BooleanArray rather than a NumPy array (GH 29964).
pandas 0.25.x
In [1]: a = pd.array([1, 2, None], dtype="Int64")
In [2]: a
Out[2]:
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64
In [3]: a > 1
Out[3]:
array([False, True, False])
pandas 1.0.0
In [45]: a = pd.array([1, 2, None], dtype="Int64")
In [46]: a > 1
Out[46]:
<BooleanArray>
[False, True, <NA>]
Length: 3, dtype: boolean
Note that missing values now propagate, rather than always comparing unequal like numpy.nan. See Experimental NA scalar to denote missing values for more.
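When the comparison result is meant to be used as a filter, one possible approach (a sketch, not the only option) is to decide explicitly how missing entries should be treated, here by filling them with False before indexing.

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# The comparison keeps the missing entry as <NA> in a "boolean" dtype result.
mask = s > 1

# Treat unknown comparisons as False before using the mask for selection.
filled = mask.fillna(False)
selected = s[filled]

print(mask)
print(selected)
```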
By default Categorical.min() now returns the minimum instead of np.nan#
When Categorical contains np.nan, Categorical.min() no longer returns np.nan by default (skipna=True) (GH 25303)
pandas 0.25.x
In [1]: pd.Categorical([1, 2, np.nan], ordered=True).min()
Out[1]: nan
pandas 1.0.0
In [47]: pd.Categorical([1, 2, np.nan], ordered=True).min()
Out[47]: 1
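The previous result is still available by opting out of skipping missing values. The sketch below assumes the skipna keyword that replaces numeric_only for Categorical.min().

```python
import numpy as np
import pandas as pd

cat = pd.Categorical([1, 2, np.nan], ordered=True)

# Default behaviour: missing values are skipped.
default_min = cat.min()

# Opting out of skipping returns the missing value again.
strict_min = cat.min(skipna=False)

print(default_min, strict_min)
```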
Default dtype of empty pandas.Series#
Initialising an empty pandas.Series without specifying a dtype will raise a DeprecationWarning now (GH 17261). The default dtype will change from float64 to object in future releases so that it is consistent with the behaviour of DataFrame and Index.
pandas 1.0.0
In [1]: pd.Series()
Out[1]:
DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
Series([], dtype: float64)
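As the warning message suggests, passing the dtype explicitly avoids the warning and keeps the result independent of the future default. A small sketch, with warnings turned into errors to show that none is emitted:

```python
import warnings

import pandas as pd

with warnings.catch_warnings():
    # Any warning raised inside this block would become an exception.
    warnings.simplefilter("error")
    empty = pd.Series(dtype="float64")

print(empty.dtype)
```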
Result dtype inference changes for resample operations#
The rules for the result dtype in DataFrame.resample()
aggregations have changed for extension types (GH 31359).
Previously, pandas would attempt to convert the result back to the original dtype, falling back to the usual
inference rules if that was not possible. Now, pandas will only return a result of the original dtype if the
scalar values in the result are instances of the extension dtype’s scalar type.
In [48]: df = pd.DataFrame({"A": ['a', 'b']}, dtype='category',
....: index=pd.date_range('2000', periods=2))
....:
In [49]: df
Out[49]:
A
2000-01-01 a
2000-01-02 b
[2 rows x 1 columns]
pandas 0.25.x
In [1]: df.resample("2D").agg(lambda x: 'a').A.dtype
Out[1]:
CategoricalDtype(categories=['a', 'b'], ordered=False)
pandas 1.0.0
In [50]: df.resample("2D").agg(lambda x: 'a').A.dtype
Out[50]: dtype('O')
This fixes an inconsistency between resample and groupby.
This also fixes a potential bug, where the values of the result might change
depending on how the results are cast back to the original dtype.
pandas 0.25.x
In [1]: df.resample("2D").agg(lambda x: 'c')
Out[1]:
A
0 NaN
pandas 1.0.0
In [51]: df.resample("2D").agg(lambda x: 'c')
Out[51]:
A
2000-01-01 c
[1 rows x 1 columns]
Increased minimum version for Python#
pandas 1.0.0 supports Python 3.6.1 and higher (GH 29212).
Increased minimum versions for dependencies#
Some minimum supported versions of dependencies were updated (GH 29766, GH 29723). If installed, we now require:
| Package | Minimum Version | Required | Changed |
|---|---|---|---|
| numpy | 1.13.3 | X | |
| pytz | 2015.4 | X | |
| python-dateutil | 2.6.1 | X | |
| bottleneck | 1.2.1 | | |
| numexpr | 2.6.2 | | |
| pytest (dev) | 4.0.2 | | |
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
| Package | Minimum Version | Changed |
|---|---|---|
| beautifulsoup4 | 4.6.0 | |
| fastparquet | 0.3.2 | X |
| gcsfs | 0.2.2 | |
| lxml | 3.8.0 | |
| matplotlib | 2.2.2 | |
| numba | 0.46.0 | X |
| openpyxl | 2.5.7 | X |
| pyarrow | 0.13.0 | X |
| pymysql | 0.7.1 | |
| pytables | 3.4.2 | |
| s3fs | 0.3.0 | X |
| scipy | 0.19.0 | |
| sqlalchemy | 1.1.4 | |
| xarray | 0.8.2 | |
| xlrd | 1.1.0 | |
| xlsxwriter | 0.9.8 | |
| xlwt | 1.2.0 | |
See Dependencies and Optional dependencies for more.
Build changes#
pandas has added a pyproject.toml file and will no longer include
cythonized files in the source distribution uploaded to PyPI (GH 28341, GH 20775). If you’re installing
a built distribution (wheel) or via conda, this shouldn’t have any effect on you. If you’re building pandas from
source, you should no longer need to install Cython into your build environment before calling pip install pandas.
Other API changes#
DataFrameGroupBy.transform() and SeriesGroupBy.transform() now raise on invalid operation names (GH 27489)
pandas.api.types.infer_dtype() will now return “integer-na” for integer and np.nan mix (GH 27283)
MultiIndex.from_arrays() will no longer infer names from arrays if names=None is explicitly provided (GH 27292)
In order to improve tab-completion, pandas does not include most deprecated attributes when introspecting a pandas object using dir (e.g. dir(df)). To see which attributes are excluded, see an object’s _deprecations attribute, for example pd.DataFrame._deprecations (GH 28805).
The returned dtype of unique() now matches the input dtype. (GH 27874)
Changed the default configuration value for options.plotting.matplotlib.register_converters from True to "auto" (GH 18720). Now, pandas custom formatters will only be applied to plots created by pandas, through plot(). Previously, pandas’ formatters would be applied to all plots created after a plot(). See units registration for more.
Series.dropna() has dropped its **kwargs argument in favor of a single how parameter. Supplying anything other than how to **kwargs raised a TypeError previously (GH 29388)
When testing pandas, the new minimum required version of pytest is 5.0.1 (GH 29664)
Series.str.__iter__() was deprecated and will be removed in future releases (GH 28277).
Added <NA> to the list of default NA values for read_csv() (GH 30821)
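The last item can be checked with a small in-memory file; the column name and values below are invented for this sketch. The literal text <NA> is read back as a missing value without passing na_values.

```python
import io

import pandas as pd

# "<NA>" appears as plain text in the second data row.
csv = io.StringIO("value\n1\n<NA>\n3\n")
df = pd.read_csv(csv)

print(df)
```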
Documentation improvements#
Added new section on Scaling to large datasets (GH 28315).
Added sub-section on Query MultiIndex for HDF5 datasets (GH 28791).
Deprecations#
Series.item() and Index.item() have been undeprecated (GH 29250)
Index.set_value has been deprecated. For a given index idx, array arr, value in idx of idx_val and a new value of val, idx.set_value(arr, idx_val, val) is equivalent to arr[idx.get_loc(idx_val)] = val, which should be used instead (GH 28621).
is_extension_type() is deprecated, is_extension_array_dtype() should be used instead (GH 29457)
eval() keyword argument “truediv” is deprecated and will be removed in a future version (GH 29812)
DateOffset.isAnchored() and DateOffset.onOffset() are deprecated and will be removed in a future version, use DateOffset.is_anchored() and DateOffset.is_on_offset() instead (GH 30340)
pandas.tseries.frequencies.get_offset is deprecated and will be removed in a future version, use pandas.tseries.frequencies.to_offset instead (GH 4205)
Categorical.take_nd() and CategoricalIndex.take_nd() are deprecated, use Categorical.take() and CategoricalIndex.take() instead (GH 27745)
The parameter numeric_only of Categorical.min() and Categorical.max() is deprecated and replaced with skipna (GH 25303)
The parameter label in lreshape() has been deprecated and will be removed in a future version (GH 29742)
pandas.core.index has been deprecated and will be removed in a future version, the public classes are available in the top-level namespace (GH 19711)
pandas.json_normalize() is now exposed in the top-level namespace. Usage of json_normalize as pandas.io.json.json_normalize is now deprecated and it is recommended to use json_normalize as pandas.json_normalize() instead (GH 27586).
The numpy argument of pandas.read_json() is deprecated (GH 28512).
DataFrame.to_stata(), DataFrame.to_feather(), and DataFrame.to_parquet() argument “fname” is deprecated, use “path” instead (GH 23574)
The deprecated internal attributes _start, _stop and _step of RangeIndex now raise a FutureWarning instead of a DeprecationWarning (GH 26581)
The pandas.util.testing module has been deprecated. Use the public API in pandas.testing documented at Assertion functions (GH 16232).
pandas.SparseArray has been deprecated. Use pandas.arrays.SparseArray (arrays.SparseArray) instead. (GH 30642)
The parameter is_copy of Series.take() and DataFrame.take() has been deprecated and will be removed in a future version. (GH 27357)
Support for multi-dimensional indexing (e.g. index[:, None]) on an Index is deprecated and will be removed in a future version, convert to a numpy array before indexing instead (GH 30588)
The pandas.np submodule is now deprecated. Import numpy directly instead (GH 30296)
The pandas.datetime class is now deprecated. Import from datetime instead (GH 30610)
diff will raise a TypeError rather than implicitly losing the dtype of extension types in the future. Convert to the correct dtype before calling diff instead (GH 31025)
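Several of the replacements named in this list can be exercised in a few lines. The sketch below uses made-up records and only the recommended spellings: pandas.json_normalize(), pandas.testing, and pandas.tseries.frequencies.to_offset.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

records = [
    {"id": 1, "info": {"name": "a"}},
    {"id": 2, "info": {"name": "b"}},
]

# Top-level json_normalize instead of pandas.io.json.json_normalize.
flat = pd.json_normalize(records)

# Public assertion helpers instead of pandas.util.testing.
pd.testing.assert_frame_equal(flat, flat.copy())

# to_offset instead of get_offset.
offset = to_offset("2D")

print(flat.columns.tolist(), offset)
```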
Selecting Columns from a Grouped DataFrame
When selecting columns from a DataFrameGroupBy object, passing individual keys (or a tuple of keys) inside single brackets is deprecated; a list of items should be used instead (GH 23566). For example:
df = pd.DataFrame({
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": np.random.randn(8),
"C": np.random.randn(8),
})
g = df.groupby('A')
# single key, returns SeriesGroupBy
g['B']
# tuple of single key, returns SeriesGroupBy
g[('B',)]
# tuple of multiple keys, returns DataFrameGroupBy, raises FutureWarning
g[('B', 'C')]
# multiple keys passed directly, returns DataFrameGroupBy, raises FutureWarning
# (implicitly converts the passed strings into a single tuple)
g['B', 'C']
# proper way, returns DataFrameGroupBy
g[['B', 'C']]
Removal of prior version deprecations/changes#
Removed SparseSeries and SparseDataFrame
SparseSeries, SparseDataFrame and the DataFrame.to_sparse method have been removed (GH 28425). We recommend using a Series or DataFrame with sparse values instead.
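A Series with sparse values can be sketched as below, using pandas.arrays.SparseArray and the sparse accessor. The data is arbitrary and chosen so that most entries equal the fill value.

```python
import pandas as pd

# Integer data is stored sparsely relative to a fill value of 0.
s = pd.Series(pd.arrays.SparseArray([0, 0, 1, 0]))

# Fraction of entries that are actually stored (not equal to the fill value).
density = s.sparse.density

# Convert back to an ordinary dense Series when needed.
dense = s.sparse.to_dense()

print(s.dtype, density)
```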
Matplotlib unit registration
Previously, pandas would register converters with matplotlib as a side effect of importing pandas (GH 18720).
This changed the output of plots made via matplotlib after pandas was imported, even if you were using matplotlib directly rather than plot().
To use pandas formatters with a matplotlib plot, specify
In [1]: import pandas as pd
In [2]: pd.options.plotting.matplotlib.register_converters = True
Note that plots created by DataFrame.plot() and Series.plot() do register the converters automatically. The only behavior change is when plotting a date-like object via matplotlib.pyplot.plot or matplotlib.Axes.plot. See Custom formatters for timeseries plots for more.
Other removals
Removed the previously deprecated keyword “index” from read_stata(), StataReader, and StataReader.read(), use “index_col” instead (GH 17328)
Removed StataReader.data method, use StataReader.read() instead (GH 9493)
Removed pandas.plotting._matplotlib.tsplot, use Series.plot() instead (GH 19980)
pandas.tseries.converter.register has been moved to pandas.plotting.register_matplotlib_converters() (GH 18307)
Series.plot() no longer accepts positional arguments, pass keyword arguments instead (GH 30003)
DataFrame.hist() and Series.hist() no longer allow figsize="default", specify figure size by passing a tuple instead (GH 30003)
Floordiv of integer-dtyped array by Timedelta now raises TypeError (GH 21036)
TimedeltaIndex and DatetimeIndex no longer accept non-nanosecond dtype strings like “timedelta64” or “datetime64”, use “timedelta64[ns]” and “datetime64[ns]” instead (GH 24806)
Changed the default “skipna” argument in pandas.api.types.infer_dtype() from False to True (GH 24050)
Removed Series.ix and DataFrame.ix (GH 26438)
Removed Index.summary (GH 18217)
Removed the previously deprecated keyword “fastpath” from the Index constructor (GH 23110)
Removed Series.get_value, Series.set_value, DataFrame.get_value, DataFrame.set_value (GH 17739)
Removed Series.compound and DataFrame.compound (GH 26405)
Changed the default “inplace” argument in DataFrame.set_index() and Series.set_axis() from None to False (GH 27600)
Removed Series.cat.categorical, Series.cat.index, Series.cat.name (GH 24751)
Removed the previously deprecated keyword “box” from to_datetime() and to_timedelta(); in addition these now always return DatetimeIndex, TimedeltaIndex, Index, Series, or DataFrame (GH 24486)
to_timedelta(), Timedelta, and TimedeltaIndex no longer allow “M”, “y”, or “Y” for the “unit” argument (GH 23264)
Removed the previously deprecated keyword “time_rule” from (non-public) offsets.generate_range, which has been moved to core.arrays._ranges.generate_range() (GH 24157)
DataFrame.loc() or Series.loc() with listlike indexers and missing labels will no longer reindex (GH 17295)
DataFrame.to_excel() and Series.to_excel() with non-existent columns will no longer reindex (GH 17295)
Removed the previously deprecated keyword “join_axes” from concat(); use reindex_like on the result instead (GH 22318)
Removed the previously deprecated keyword “by” from DataFrame.sort_index(), use DataFrame.sort_values() instead (GH 10726)
Removed support for nested renaming in DataFrame.aggregate(), Series.aggregate(), core.groupby.DataFrameGroupBy.aggregate(), core.groupby.SeriesGroupBy.aggregate(), core.window.rolling.Rolling.aggregate() (GH 18529)
Passing datetime64 data to TimedeltaIndex or timedelta64 data to DatetimeIndex now raises TypeError (GH 23539, GH 23937)
Passing int64 values to DatetimeIndex and a timezone now interprets the values as nanosecond timestamps in UTC, not wall times in the given timezone (GH 24559)
A tuple passed to DataFrame.groupby() is now exclusively treated as a single key (GH 18314)
Removed Index.contains, use key in index instead (GH 30103)
Addition and subtraction of int or integer-arrays is no longer allowed in Timestamp, DatetimeIndex, TimedeltaIndex, use obj + n * obj.freq instead of obj + n (GH 22535)
Removed Series.ptp (GH 21614)
Removed Series.from_array (GH 18258)
Removed DataFrame.from_items (GH 18458)
Removed DataFrame.as_matrix, Series.as_matrix (GH 18458)
Removed Series.asobject (GH 18477)
Removed DataFrame.as_blocks, Series.as_blocks, DataFrame.blocks, Series.blocks (GH 17656)
pandas.Series.str.cat() now defaults to aligning others, using join='left' (GH 27611)
pandas.Series.str.cat() does not accept list-likes within list-likes anymore (GH 27611)
Series.where() with Categorical dtype (or DataFrame.where() with Categorical column) no longer allows setting new categories (GH 24114)
Removed the previously deprecated keywords “start”, “end”, and “periods” from the DatetimeIndex, TimedeltaIndex, and PeriodIndex constructors; use date_range(), timedelta_range(), and period_range() instead (GH 23919)
Removed the previously deprecated keyword “verify_integrity” from the DatetimeIndex and TimedeltaIndex constructors (GH 23919)
Removed the previously deprecated keyword “fastpath” from pandas.core.internals.blocks.make_block (GH 19265)
Removed the previously deprecated keyword “dtype” from Block.make_block_same_class() (GH 19434)
Removed ExtensionArray._formatting_values. Use ExtensionArray._formatter instead. (GH 23601)
Removed MultiIndex.to_hierarchical (GH 21613)
Removed MultiIndex.labels, use MultiIndex.codes instead (GH 23752)
Removed the previously deprecated keyword “labels” from the MultiIndex constructor, use “codes” instead (GH 23752)
Removed MultiIndex.set_labels, use MultiIndex.set_codes() instead (GH 23752)
Removed the previously deprecated keyword “labels” from MultiIndex.set_codes(), MultiIndex.copy(), MultiIndex.drop(), use “codes” instead (GH 23752)
Removed support for legacy HDF5 formats (GH 29787)
Passing a dtype alias (e.g. ‘datetime64[ns, UTC]’) to DatetimeTZDtype is no longer allowed, use DatetimeTZDtype.construct_from_string() instead (GH 23990)
Removed the previously deprecated keyword “skip_footer” from read_excel(); use “skipfooter” instead (GH 18836)
read_excel() no longer allows an integer value for the parameter usecols, instead pass a list of integers from 0 to usecols inclusive (GH 23635)
Removed the previously deprecated keyword “convert_datetime64” from DataFrame.to_records() (GH 18902)
Removed IntervalIndex.from_intervals in favor of the IntervalIndex constructor (GH 19263)
Changed the default “keep_tz” argument in DatetimeIndex.to_series() from None to True (GH 23739)
Removed api.types.is_period and api.types.is_datetimetz (GH 23917)
Ability to read pickles containing Categorical instances created with pre-0.16 version of pandas has been removed (GH 27538)
Removed pandas.tseries.plotting.tsplot (GH 18627)
Removed the previously deprecated keywords “reduce” and “broadcast” from DataFrame.apply() (GH 18577)
Removed the previously deprecated assert_raises_regex function in pandas._testing (GH 29174)
Removed the previously deprecated FrozenNDArray class in pandas.core.indexes.frozen (GH 29335)
Removed the previously deprecated keyword “nthreads” from read_feather(), use “use_threads” instead (GH 23053)
Removed Index.is_lexsorted_for_tuple (GH 29305)
Removed support for nested renaming in DataFrame.aggregate(), Series.aggregate(), core.groupby.DataFrameGroupBy.aggregate(), core.groupby.SeriesGroupBy.aggregate(), core.window.rolling.Rolling.aggregate() (GH 29608)
Removed Series.valid; use Series.dropna() instead (GH 18800)
Removed DataFrame.is_copy, Series.is_copy (GH 18812)
Removed DataFrame.get_ftype_counts, Series.get_ftype_counts (GH 18243)
Removed DataFrame.ftypes, Series.ftypes, Series.ftype (GH 26744)
Removed Index.get_duplicates, use idx[idx.duplicated()].unique() instead (GH 20239)
Removed Series.clip_upper, Series.clip_lower, DataFrame.clip_upper, DataFrame.clip_lower (GH 24203)
Removed the ability to alter DatetimeIndex.freq, TimedeltaIndex.freq, or PeriodIndex.freq (GH 20772)
Removed DatetimeIndex.offset (GH 20730)
Removed DatetimeIndex.asobject, TimedeltaIndex.asobject, PeriodIndex.asobject, use astype(object) instead (GH 29801)
Removed the previously deprecated keyword “order” from factorize() (GH 19751)
Removed the previously deprecated keyword “encoding” from read_stata() and DataFrame.to_stata() (GH 21400)
Changed the default “sort” argument in concat() from None to False (GH 20613)
Removed the previously deprecated keyword “raise_conflict” from DataFrame.update(), use “errors” instead (GH 23585)
Removed the previously deprecated keyword “n” from DatetimeIndex.shift(), TimedeltaIndex.shift(), PeriodIndex.shift(), use “periods” instead (GH 22458)
Removed the previously deprecated keywords “how”, “fill_method”, and “limit” from DataFrame.resample() (GH 30139)
Passing an integer to Series.fillna() or DataFrame.fillna() with timedelta64[ns] dtype now raises TypeError (GH 24694)
Passing multiple axes to DataFrame.dropna() is no longer supported (GH 20995)
Removed Series.nonzero, use to_numpy().nonzero() instead (GH 24048)
Passing floating dtype codes to Categorical.from_codes() is no longer supported, pass codes.astype(np.int64) instead (GH 21775)
Removed the previously deprecated keyword “pat” from Series.str.partition() and Series.str.rpartition(), use “sep” instead (GH 23767)
Removed Series.put (GH 27106)
Removed Series.real, Series.imag (GH 27106)
Removed Series.to_dense, DataFrame.to_dense (GH 26684)
Removed Index.dtype_str, use str(index.dtype) instead (GH 27106)
Categorical.ravel() returns a Categorical instead of a ndarray (GH 27199)
The ‘outer’ method on Numpy ufuncs, e.g. np.subtract.outer operating on Series objects is no longer supported, and will raise NotImplementedError (GH 27198)
Removed Series.get_dtype_counts and DataFrame.get_dtype_counts (GH 27145)
Changed the default “fill_value” argument in Categorical.take() from True to False (GH 20841)
Changed the default value for the raw argument in Series.rolling().apply(), DataFrame.rolling().apply(), Series.expanding().apply(), and DataFrame.expanding().apply() from None to False (GH 20584)
Removed deprecated behavior of Series.argmin() and Series.argmax(), use Series.idxmin() and Series.idxmax() for the old behavior (GH 16955)
Passing a tz-aware datetime.datetime or Timestamp into the Timestamp constructor with the tz argument now raises a ValueError (GH 23621)
Removed Series.base, Index.base, Categorical.base, Series.flags, Index.flags, PeriodArray.flags, Series.strides, Index.strides, Series.itemsize, Index.itemsize, Series.data, Index.data (GH 20721)
Changed Timedelta.resolution() to match the behavior of the standard library datetime.timedelta.resolution, for the old behavior, use Timedelta.resolution_string() (GH 26839)
Removed Timestamp.weekday_name, DatetimeIndex.weekday_name, and Series.dt.weekday_name (GH 18164)
Removed the previously deprecated keyword “errors” in Timestamp.tz_localize(), DatetimeIndex.tz_localize(), and Series.tz_localize() (GH 22644)
Changed the default “ordered” argument in CategoricalDtype from None to False (GH 26336)
Series.set_axis() and DataFrame.set_axis() now require “labels” as the first argument and “axis” as an optional named parameter (GH 30089)
Removed to_msgpack, read_msgpack, DataFrame.to_msgpack, Series.to_msgpack (GH 27103)
Removed Series.compress (GH 21930)
Removed the previously deprecated keyword “fill_value” from Categorical.fillna(), use “value” instead (GH 19269)
Removed the previously deprecated keyword “data” from andrews_curves(), use “frame” instead (GH 6956)
Removed the previously deprecated keyword “data” from parallel_coordinates(), use “frame” instead (GH 6956)
Removed the previously deprecated keyword “colors” from parallel_coordinates(), use “color” instead (GH 6956)
Removed the previously deprecated keywords “verbose” and “private_key” from read_gbq() (GH 30200)
Calling np.array and np.asarray on tz-aware Series and DatetimeIndex will now return an object array of tz-aware Timestamp (GH 24596)
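Several of the removals above name a direct replacement. The following is a minimal migration sketch, not an exhaustive guide; the variable names and sample data are illustrative assumptions, and only the replacement calls named in the list are used.

```python
import numpy as np
import pandas as pd

idx = pd.Index(["a", "b", "b", "c"])
ser = pd.Series([0.0, 1.5, np.nan, 2.0])

# Index.contains(key) was removed: use the `in` operator.
has_b = "b" in idx

# Index.get_duplicates() was removed: mask the duplicated labels, then deduplicate.
duplicates = idx[idx.duplicated()].unique()

# Series.valid() was removed: use Series.dropna().
valid = ser.dropna()

# Series.nonzero() was removed: go through NumPy, which returns a tuple of position arrays.
nonzero_positions = pd.Series([0, 3, 0, 4]).to_numpy().nonzero()[0]

# Index.dtype_str was removed: use str(index.dtype).
dtype_name = str(pd.Index([1, 2, 3]).dtype)

# DatetimeIndex(start=..., periods=...) was removed: use date_range().
dates = pd.date_range(start="2020-01-01", periods=3, freq="D")
```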
Performance improvements#
Performance improvement in DataFrame arithmetic and comparison operations with scalars (GH 24990, GH 29853)
Performance improvement in indexing with a non-unique IntervalIndex (GH 27489)
Performance improvement in MultiIndex.is_monotonic (GH 27495)
Performance improvement in cut() when bins is an IntervalIndex (GH 27668)
Performance improvement when initializing a DataFrame using a range (GH 30171)
Performance improvement in DataFrame.corr() when method is "spearman" (GH 28139)
Performance improvement in DataFrame.replace() when provided a list of values to replace (GH 28099)
Performance improvement in DataFrame.select_dtypes() by using vectorization instead of iterating over a loop (GH 28317)
Performance improvement in Categorical.searchsorted() and CategoricalIndex.searchsorted() (GH 28795)
Performance improvement when comparing a Categorical with a scalar and the scalar is not found in the categories (GH 29750)
Performance improvement when checking whether values in a Categorical are equal to, greater than or equal to, or greater than a given scalar. The improvement is not present when checking whether the Categorical is less than, or less than or equal to, the scalar (GH 29820)
Performance improvement in Index.equals() and MultiIndex.equals() (GH 29134)
Performance improvement in infer_dtype() when skipna is True (GH 28814)
Bug fixes#
Categorical#
Added a test asserting that fillna() raises the correct ValueError message when the value isn’t a value from categories (GH 13628)
Bug in Categorical.astype() where NaN values were handled incorrectly when casting to int (GH 28406)
DataFrame.reindex() with a CategoricalIndex would fail when the targets contained duplicates, and wouldn’t fail if the source contained duplicates (GH 28107)
Bug in Categorical.astype() not allowing for casting to extension dtypes (GH 28668)
Bug where merge() was unable to join on categorical and extension dtype columns (GH 28668)
Categorical.searchsorted() and CategoricalIndex.searchsorted() now work on unordered categoricals also (GH 21667)
Added a test asserting that roundtripping to parquet with DataFrame.to_parquet() or read_parquet() preserves Categorical dtypes for string types (GH 27955)
Changed the error message in Categorical.remove_categories() to always show the invalid removals as a set (GH 28669)
Using date accessors on a categorical dtyped Series of datetimes was not returning an object of the same type as if one used the str.() / dt.() on a Series of that type. E.g. when accessing Series.dt.tz_localize() on a Categorical with duplicate entries, the accessor was skipping duplicates (GH 27952)
Bug in DataFrame.replace() and Series.replace() that would give incorrect results on categorical data (GH 26988)
Bug where calling Categorical.min() or Categorical.max() on an empty Categorical would raise a numpy exception (GH 30227)
The following methods now also correctly output values for unobserved categories when called through groupby(..., observed=False) (GH 17605): core.groupby.SeriesGroupBy.count(), core.groupby.SeriesGroupBy.size(), core.groupby.SeriesGroupBy.nunique(), and core.groupby.SeriesGroupBy.nth()
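As an illustration of the last item, the sketch below groups by a categorical column that has one unobserved category. The frame and its column names are made up for the example; the point is only that the unobserved category "c" appears in the result with a count and size of zero.

```python
import pandas as pd

# "c" is a declared category that never occurs in the data.
df = pd.DataFrame({
    "key": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
    "value": [1, 2, 3],
})

grouped = df.groupby("key", observed=False)["value"]
counts = grouped.count()  # one entry per category, including "c"
sizes = grouped.size()
```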
Datetimelike#
Bug in Series.__setitem__() incorrectly casting np.timedelta64("NaT") to np.datetime64("NaT") when inserting into a Series with datetime64 dtype (GH 27311)
Bug in Series.dt() property lookups when the underlying data is read-only (GH 27529)
Bug in HDFStore.__getitem__ incorrectly reading tz attribute created in Python 2 (GH 26443)
Bug in to_datetime() where passing arrays of malformed str with errors="coerce" could incorrectly lead to raising ValueError (GH 28299)
Bug in core.groupby.SeriesGroupBy.nunique() where NaT values were interfering with the count of unique values (GH 27951)
Bug in Timestamp subtraction when subtracting a Timestamp from a np.datetime64 object incorrectly raising TypeError (GH 28286)
Addition and subtraction of integer or integer-dtype arrays with Timestamp will now raise NullFrequencyError instead of ValueError (GH 28268)
Bug in Series and DataFrame with integer dtype failing to raise TypeError when adding or subtracting a np.datetime64 object (GH 28080)
Bug in Series.astype(), Index.astype(), and DataFrame.astype() failing to handle NaT when casting to an integer dtype (GH 28492)
Bug in Week with weekday incorrectly raising AttributeError instead of TypeError when adding or subtracting an invalid type (GH 28530)
Bug in DataFrame arithmetic operations when operating with a Series with dtype 'timedelta64[ns]' (GH 28049)
Bug in core.groupby.generic.SeriesGroupBy.apply() raising ValueError when a column in the original DataFrame is a datetime and the column labels are not standard integers (GH 28247)
Bug in pandas._config.localization.get_locales() where locale -a encodes the locales list as windows-1252 (GH 23638, GH 24760, GH 27368)
Bug in Series.var() failing to raise TypeError when called with timedelta64[ns] dtype (GH 28289)
Bug in DatetimeIndex.strftime() and Series.dt.strftime() where NaT was converted to the string 'NaT' instead of np.nan (GH 29578)
Bug in masking datetime-like arrays with a boolean mask of an incorrect length not raising an IndexError (GH 30308)
Bug in Timestamp.resolution being a property instead of a class attribute (GH 29910)
Bug in pandas.to_datetime() when called with None raising TypeError instead of returning NaT (GH 30011)
Bug in pandas.to_datetime() failing for deques when using cache=True (the default) (GH 29403)
Bug in Series.item() with datetime64 or timedelta64 dtype, DatetimeIndex.item(), and TimedeltaIndex.item() returning an integer instead of a Timestamp or Timedelta (GH 30175)
Bug in DatetimeIndex addition when adding a non-optimized DateOffset incorrectly dropping timezone information (GH 30336)
Bug in DataFrame.drop() where attempting to drop non-existent values from a DatetimeIndex would yield a confusing error message (GH 30399)
Bug in DataFrame.append() that would remove the timezone-awareness of new data (GH 30238)
Bug in Series.cummin() and Series.cummax() with timezone-aware dtype incorrectly dropping its timezone (GH 15553)
Bug in DatetimeArray, TimedeltaArray, and PeriodArray where inplace addition and subtraction did not actually operate inplace (GH 24115)
Bug in pandas.to_datetime() when called with Series storing IntegerArray raising TypeError instead of returning Series (GH 30050)
Bug in date_range() with custom business hours as freq and given number of periods (GH 30593)
Bug in PeriodIndex comparisons incorrectly casting integers to Period objects, inconsistent with the Period comparison behavior (GH 30722)
Bug in DatetimeIndex.insert() raising a ValueError instead of a TypeError when trying to insert a timezone-aware Timestamp into a timezone-naive DatetimeIndex, or vice-versa (GH 30806)
Timedelta#
Bug in subtracting a TimedeltaIndex or TimedeltaArray from a np.datetime64 object (GH 29558)
Timezones#
Numeric#
Bug in DataFrame.quantile() with zero-column DataFrame incorrectly raising (GH 23925)
DataFrame flex inequality comparison methods (DataFrame.lt(), DataFrame.le(), DataFrame.gt(), DataFrame.ge()) with object-dtype and complex entries were failing to raise TypeError like their Series counterparts (GH 28079)
Bug in DataFrame logical operations (&, |, ^) not matching Series behavior by filling NA values (GH 28741)
Bug in DataFrame.interpolate() where specifying axis by name references variable before it is assigned (GH 29142)
Bug in Series.var() computing the wrong value for a nullable integer dtype Series because the ddof argument was not passed through (GH 29128)
Improved error message when using frac > 1 and replace = False (GH 27451)
Bug in numeric indexes resulted in it being possible to instantiate an Int64Index, UInt64Index, or Float64Index with an invalid dtype (e.g. datetime-like) (GH 29539)
Bug in UInt64Index precision loss while constructing from a list with values in the np.uint64 range (GH 29526)
Bug in NumericIndex construction that caused indexing to fail when integers in the np.uint64 range were used (GH 28023)
Bug in NumericIndex construction that caused UInt64Index to be cast to Float64Index when integers in the np.uint64 range were used to index a DataFrame (GH 28279)
Bug in Series.interpolate() where using method="index" with an unsorted index would return incorrect results (GH 21037)
Bug in DataFrame.round() where a DataFrame with a CategoricalIndex of IntervalIndex columns would incorrectly raise a TypeError (GH 30063)
Bug in Series.pct_change() and DataFrame.pct_change() when there are duplicated indices (GH 30463)
Bug in DataFrame cumulative operations (e.g. cumsum, cummax) incorrectly casting to object-dtype (GH 19296)
Bug in DataFrame.diff raising an IndexError when one of the columns was a nullable integer dtype (GH 30967)
Conversion#
Strings#
Calling Series.str.isalnum() (and other “ismethods”) on an empty Series would return an object dtype instead of bool (GH 29624)
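A quick way to observe the fixed behavior, assuming an empty object-dtype Series as the input:

```python
import pandas as pd

empty = pd.Series([], dtype=object)

# The result is an empty boolean Series rather than an object-dtype one.
result = empty.str.isalnum()
```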
Interval#
Bug in IntervalIndex.get_indexer() where a Categorical or CategoricalIndex target would incorrectly raise a TypeError (GH 30063)
Bug in pandas.core.dtypes.cast.infer_dtype_from_scalar where passing pandas_dtype=True did not infer IntervalDtype (GH 30337)
Bug in Series constructor where constructing a Series from a list of Interval objects resulted in object dtype instead of IntervalDtype (GH 23563)
Bug in IntervalDtype where the kind attribute was incorrectly set as None instead of "O" (GH 30568)
Bug in IntervalIndex, IntervalArray, and Series with interval data where equality comparisons were incorrect (GH 24112)
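The constructor and kind fixes can be seen together in a small sketch; the intervals below are arbitrary.

```python
import pandas as pd

intervals = [pd.Interval(0, 1), pd.Interval(1, 2)]

# A list of Interval objects is inferred as interval dtype, not object.
ser = pd.Series(intervals)
dtype = ser.dtype
kind = dtype.kind
```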
Indexing#
Bug in assignment using a reverse slicer (GH 26939)
Bug where DataFrame.explode() would duplicate the frame in the presence of duplicates in the index (GH 28010)
Bug in reindexing a PeriodIndex() with another type of index that contained a Period (GH 28323, GH 28337)
Fix assignment of column via .loc with numpy non-ns datetime type (GH 27395)
Bug in Float64Index.astype() where np.inf was not handled properly when casting to an integer dtype (GH 28475)
Index.union() could fail when the left contained duplicates (GH 28257)
Bug where indexing with .loc did not work when the index was a CategoricalIndex with non-string categories (GH 17569, GH 30225)
Index.get_indexer_non_unique() could fail with TypeError in some cases, such as when searching for ints in a string index (GH 28257)
Bug in Float64Index.get_loc() incorrectly raising TypeError instead of KeyError (GH 29189)
Bug in DataFrame.loc() with incorrect dtype when setting Categorical value in 1-row DataFrame (GH 25495)
Bug where MultiIndex.get_loc() could not find missing values when the input included missing values (GH 19132)
Bug in Series.__setitem__() incorrectly assigning values with boolean indexer when the length of new data matches the number of True values and new data is not a Series or an np.array (GH 30567)
Bug in indexing with a PeriodIndex incorrectly accepting integers representing years, use e.g. ser.loc["2007"] instead of ser.loc[2007] (GH 30763)
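To illustrate the string form recommended in the last item, here is a sketch with a monthly PeriodIndex; the index range and values are assumptions chosen for the example. Selecting with the string "2007" returns every period that falls in that year.

```python
import pandas as pd

# Three years of monthly periods, starting January 2006.
index = pd.period_range("2006-01", periods=36, freq="M")
ser = pd.Series(range(36), index=index)

# Use a string label for the year, not the integer 2007.
year_2007 = ser.loc["2007"]
```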
Missing#
MultiIndex#
Constructor for MultiIndex verifies that the given sortorder is compatible with the actual lexsort_depth if verify_integrity parameter is True (the default) (GH 28735)
Series.drop and MultiIndex.drop with a MultiIndex now raise an exception if the labels are not in the given level (GH 8594)
IO#
read_csv() now accepts binary mode file buffers when using the Python csv engine (GH 23779)
Bug in DataFrame.to_json() where using a Tuple as a column or index value and using orient="columns" or orient="index" would produce invalid JSON (GH 20500)
Improve infinity parsing. read_csv() now interprets Infinity, +Infinity, -Infinity as floating point values (GH 10065)
Bug in DataFrame.to_csv() where values were truncated when the length of na_rep was shorter than the text input data. (GH 25099)
Bug in DataFrame.to_string() where values were truncated using display options instead of outputting the full content (GH 9784)
Bug in DataFrame.to_json() where a datetime column label would not be written out in ISO format with orient="table" (GH 28130)
Bug in DataFrame.to_parquet() where writing to GCS would fail with engine='fastparquet' if the file did not already exist (GH 28326)
Bug in read_hdf() closing stores that it didn’t open when Exceptions are raised (GH 28699)
Bug in read_json() where using orient="index" would not maintain the order (GH 28557)
Bug in DataFrame.to_html() where the length of the formatters argument was not verified (GH 28469)
Bug in read_excel() with engine='ods' when sheet_name argument references a non-existent sheet (GH 27676)
Bug in pandas.io.formats.style.Styler() formatting for floating values not displaying decimals correctly (GH 13257)
Bug in DataFrame.to_html() when using formatters=<list> and max_cols together. (GH 25955)
Bug in Styler.background_gradient() not able to work with dtype Int64 (GH 28869)
Bug in DataFrame.to_clipboard() which did not work reliably in ipython (GH 22707)
Bug in read_json() where default encoding was not set to utf-8 (GH 29565)
Bug in PythonParser where str and bytes were being mixed when dealing with the decimal field (GH 29650)
read_gbq() now accepts progress_bar_type to display progress bar while the data downloads. (GH 29857)
Bug in pandas.io.json.json_normalize() where a missing value in the location specified by record_path would raise a TypeError (GH 30148)
read_excel() now accepts binary data (GH 15914)
Bug in read_csv() in which encoding handling was limited to just the string utf-16 for the C engine (GH 24130)
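The infinity-parsing change is straightforward to try with an in-memory buffer. The CSV text below is invented for the example; each of the three spellings should come back as a floating point infinity.

```python
import io

import numpy as np
import pandas as pd

csv_text = "x\nInfinity\n-Infinity\n+Infinity\n1.5\n"

# All three spellings are parsed as floats, so the column is float64.
df = pd.read_csv(io.StringIO(csv_text))
values = df["x"]
```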
Plotting#
Bug in Series.plot() not able to plot boolean values (GH 23719)
Bug in DataFrame.plot() not able to plot when no rows (GH 27758)
Bug in DataFrame.plot() producing incorrect legend markers when plotting multiple series on the same axis (GH 18222)
Bug in DataFrame.plot() when kind='box' and data contains datetime or timedelta data. These types are now automatically dropped (GH 22799)
Bug in DataFrame.plot.line() and DataFrame.plot.area() producing the wrong xlim on the x-axis (GH 27686, GH 25160, GH 24784)
Bug where DataFrame.boxplot() would not accept a color parameter like DataFrame.plot.box() (GH 26214)
Bug in the xticks argument being ignored for DataFrame.plot.bar() (GH 14119)
set_option() now validates that the plot backend provided to 'plotting.backend' implements the backend when the option is set, rather than when a plot is created (GH 28163)
DataFrame.plot() now accepts a backend keyword argument, allowing switching between backends in one session (GH 28619)
Bug in color validation incorrectly raising for non-color styles (GH 29122)
DataFrame.plot.scatter() can now plot objects and datetime type data (GH 18755, GH 30391)
Bug in DataFrame.hist() where xrot=0 did not work with by and subplots (GH 30288)
GroupBy/resample/rolling#
Bug in core.groupby.DataFrameGroupBy.apply() only showing output from a single group when function returns an Index (GH 28652)
Bug in DataFrame.groupby() with multiple groups where an IndexError would be raised if any group contained all NA values (GH 20519)
Bug in pandas.core.resample.Resampler.size() and pandas.core.resample.Resampler.count() returning wrong dtype when used with an empty Series or DataFrame (GH 28427)
Bug in DataFrame.rolling() not allowing for rolling over datetimes when axis=1 (GH 28192)
Bug in DataFrame.rolling() not allowing rolling over multi-index levels (GH 15584)
Bug in DataFrame.rolling() not allowing rolling on monotonic decreasing time indexes (GH 19248)
Bug in DataFrame.groupby() not offering selection by column name when axis=1 (GH 27614)
Bug in core.groupby.DataFrameGroupBy.agg() not able to use lambda function with named aggregation (GH 27519)
Bug in DataFrame.groupby() losing column name information when grouping by a categorical column (GH 28787)
Removed the error raised due to duplicated input functions in named aggregation in DataFrame.groupby() and Series.groupby(). Previously, an error was raised if the same function was applied to the same column; this is now allowed if the newly assigned names are different (GH 28426)
core.groupby.SeriesGroupBy.value_counts() can now handle the case where the Grouper makes empty groups (GH 28479)
Bug in core.window.rolling.Rolling.quantile() ignoring interpolation keyword argument when used within a groupby (GH 28779)
Bug in DataFrame.groupby() where any, all, nunique and transform functions would incorrectly handle duplicate column labels (GH 21668)
Bug in core.groupby.DataFrameGroupBy.agg() with timezone-aware datetime64 column incorrectly casting results to the original dtype (GH 29641)
Bug in DataFrame.groupby() when using axis=1 and having a single level columns index (GH 30208)
Bug in DataFrame.groupby() when using nunique on axis=1 (GH 30253)
Bug in DataFrameGroupBy.quantile() and SeriesGroupBy.quantile() with multiple list-like q value and integer column names (GH 30289)
Bug in DataFrameGroupBy.pct_change() and SeriesGroupBy.pct_change() causing a TypeError when fill_method is None (GH 30463)
Bug in Rolling.count() and Expanding.count() where the min_periods argument was ignored (GH 26996)
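Two of the named-aggregation items above (lambda functions, and the same function applied twice to one column under different output names) can be exercised in a single call. The frame and the output names low, low_again, and spread are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 4, 10]})

result = df.groupby("key").agg(
    low=("value", "min"),
    low_again=("value", "min"),  # same function and column, different name
    spread=("value", lambda s: s.max() - s.min()),  # lambda in named aggregation
)
```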
Reshaping#
Bug in DataFrame.apply() that caused incorrect output with empty DataFrame (GH 28202, GH 21959)
Bug in DataFrame.stack() not handling non-unique indexes correctly when creating MultiIndex (GH 28301)
Bug in pivot_table() not returning correct type float when margins=True and aggfunc='mean' (GH 24893)
Bug where merge_asof() could not use datetime.timedelta for the tolerance kwarg (GH 28098)
Bug in merge() where suffixes were not appended correctly with a MultiIndex (GH 28518)
Fix to ensure all int dtypes can be used in merge_asof() when using a tolerance value. Previously every non-int64 type would raise an erroneous MergeError (GH 28870)
Better error message in get_dummies() when columns isn’t a list-like value (GH 28383)
Bug in Index.join() that caused infinite recursion error for mismatched MultiIndex name orders. (GH 25760, GH 28956)
Bug in Series.pct_change() where supplying an anchored frequency would throw a ValueError (GH 28664)
Bug where DataFrame.equals() returned True incorrectly in some cases when two DataFrames had the same columns in different orders (GH 28839)
Bug in DataFrame.replace() where the dtype of a non-numeric replacer was not respected (GH 26632)
Bug in melt() where supplying mixed strings and numeric values for id_vars or value_vars would incorrectly raise a ValueError (GH 29718)
Dtypes are now preserved when transposing a DataFrame where each column is the same extension dtype (GH 30091)
Bug in merge_asof() merging on a tz-aware left_index and right_on a tz-aware column (GH 29864)
Improved error message and docstring in cut() and qcut() when labels=True (GH 13318)
Bug in missing fill_na parameter to DataFrame.unstack() with list of levels (GH 30740)
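As a sketch of the merge_asof() tolerance fix, the example below passes a standard-library datetime.timedelta. The two frames and their column names are made up; with the default backward search, a right-hand row matches only if it is at most two seconds older than the left-hand row.

```python
import datetime

import pandas as pd

left = pd.DataFrame({
    "t": pd.to_datetime(["2020-01-01 00:00:01", "2020-01-01 00:00:05"]),
    "l": [1, 2],
})
right = pd.DataFrame({
    "t": pd.to_datetime(["2020-01-01 00:00:00", "2020-01-01 00:00:02"]),
    "r": [10, 20],
})

# First row: nearest earlier key is 1 second away, so it matches.
# Second row: nearest earlier key is 3 seconds away, so it is left missing.
merged = pd.merge_asof(left, right, on="t", tolerance=datetime.timedelta(seconds=2))
```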
Sparse#
Bug in SparseDataFrame arithmetic operations incorrectly casting inputs to float (GH 28107)
Bug in DataFrame.sparse returning a Series when there was a column named sparse rather than the accessor (GH 30758)
Fixed operator.xor() with a boolean-dtype SparseArray. Now returns a sparse result, rather than object dtype (GH 31025)
ExtensionArray#
Other#
Trying to set the display.precision, display.max_rows or display.max_columns using set_option() to anything but a None or a positive int will raise a ValueError (GH 23348)
Using DataFrame.replace() with overlapping keys in a nested dictionary will no longer raise, now matching the behavior of a flat dictionary (GH 27660)
DataFrame.to_csv() and Series.to_csv() now support dicts as compression argument with key 'method' being the compression method and others as additional compression options when the compression method is 'zip'. (GH 26023)
Bug in Series.diff() where a boolean series would incorrectly raise a TypeError (GH 17294)
Series.append() will no longer raise a TypeError when passed a tuple of Series (GH 28410)
Fix corrupted error message when calling pandas.libs._json.encode() on a 0d array (GH 18878)
Backtick quoting in DataFrame.query() and DataFrame.eval() can now also be used to use invalid identifiers like names that start with a digit, are python keywords, or are using single character operators. (GH 27017)
Bug in pd.core.util.hashing.hash_pandas_object where arrays containing tuples were incorrectly treated as non-hashable (GH 28969)
Bug in DataFrame.append() that raised IndexError when appending with empty list (GH 28769)
Fix AbstractHolidayCalendar to return correct results for years after 2030 (now goes up to 2200) (GH 27790)
Fixed IntegerArray returning inf rather than NaN for operations dividing by 0 (GH 27398)
Fixed pow operations for IntegerArray when the other value is 0 or 1 (GH 29997)
Bug where Series.count() raised if use_inf_as_na was enabled (GH 29478)
Bug in Index where a non-hashable name could be set without raising TypeError (GH 29069)
Bug in DataFrame constructor when passing a 2D ndarray and an extension dtype (GH 12513)
Bug in DataFrame.to_csv() where, when supplied a Series with dtype="string" and a na_rep, the na_rep was being truncated to 2 characters. (GH 29975)
Bug where DataFrame.itertuples() would incorrectly determine whether or not namedtuples could be used for dataframes of 255 columns (GH 28282)
Handle nested NumPy object arrays in testing.assert_series_equal() for ExtensionArray implementations (GH 30841)
Bug in Index constructor incorrectly allowing 2-dimensional input arrays (GH 13601, GH 27125)
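The extended backtick quoting in DataFrame.query() can be tried with column names that are not valid Python identifiers. In this sketch the column names "1st" (starts with a digit) and "class" (a Python keyword) are chosen purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({"1st": [1, 2, 3], "class": [10, 20, 30]})

# Backticks let query() refer to both columns by name.
filtered = df.query("`1st` > 1 and `class` < 30")
```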
Contributors#
A total of 308 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
Aaditya Panikath +
Abdullah İhsan Seçer
Abhijeet Krishnan +
Adam J. Stewart
Adam Klaum +
Addison Lynch
Aivengoe +
Alastair James +
Albert Villanova del Moral
Alex Kirko +
Alfredo Granja +
Allen Downey
Alp Arıbal +
Andreas Buhr +
Andrew Munch +
Andy
Angela Ambroz +
Aniruddha Bhattacharjee +
Ankit Dhankhar +
Antonio Andraues Jr +
Arda Kosar +
Asish Mahapatra +
Austin Hackett +
Avi Kelman +
AyowoleT +
Bas Nijholt +
Ben Thayer
Bharat Raghunathan
Bhavani Ravi
Bhuvana KA +
Big Head
Blake Hawkins +
Bobae Kim +
Brett Naul
Brian Wignall
Bruno P. Kinoshita +
Bryant Moscon +
Cesar H +
Chris Stadler
Chris Zimmerman +
Christopher Whelan
Clemens Brunner
Clemens Tolboom +
Connor Charles +
Daniel Hähnke +
Daniel Saxton
Darin Plutchok +
Dave Hughes
David Stansby
DavidRosen +
Dean +
Deepan Das +
Deepyaman Datta
DorAmram +
Dorothy Kabarozi +
Drew Heenan +
Eliza Mae Saret +
Elle +
Endre Mark Borza +
Eric Brassell +
Eric Wong +
Eunseop Jeong +
Eyden Villanueva +
Felix Divo
ForTimeBeing +
Francesco Truzzi +
Gabriel Corona +
Gabriel Monteiro +
Galuh Sahid +
Georgi Baychev +
Gina
GiuPassarelli +
Grigorios Giannakopoulos +
Guilherme Leite +
Guilherme Salomé +
Gyeongjae Choi +
Harshavardhan Bachina +
Harutaka Kawamura +
Hassan Kibirige
Hielke Walinga
Hubert
Hugh Kelley +
Ian Eaves +
Ignacio Santolin +
Igor Filippov +
Irv Lustig
Isaac Virshup +
Ivan Bessarabov +
JMBurley +
Jack Bicknell +
Jacob Buckheit +
Jan Koch
Jan Pipek +
Jan Škoda +
Jan-Philip Gehrcke
Jasper J.F. van den Bosch +
Javad +
Jeff Reback
Jeremy Schendel
Jeroen Kant +
Jesse Pardue +
Jethro Cao +
Jiang Yue
Jiaxiang +
Jihyung Moon +
Jimmy Callin
Jinyang Zhou +
Joao Victor Martinelli +
Joaq Almirante +
John G Evans +
John Ward +
Jonathan Larkin +
Joris Van den Bossche
Josh Dimarsky +
Joshua Smith +
Josiah Baker +
Julia Signell +
Jung Dong Ho +
Justin Cole +
Justin Zheng
Kaiqi Dong
Karthigeyan +
Katherine Younglove +
Katrin Leinweber
Kee Chong Tan +
Keith Kraus +
Kevin Nguyen +
Kevin Sheppard
Kisekka David +
Koushik +
Kyle Boone +
Kyle McCahill +
Laura Collard, PhD +
LiuSeeker +
Louis Huynh +
Lucas Scarlato Astur +
Luiz Gustavo +
Luke +
Luke Shepard +
MKhalusova +
Mabel Villalba
Maciej J +
Mak Sze Chun
Manu NALEPA +
Marc
Marc Garcia
Marco Gorelli +
Marco Neumann +
Martin Winkel +
Martina G. Vilas +
Mateusz +
Matthew Roeschke
Matthew Tan +
Max Bolingbroke
Max Chen +
MeeseeksMachine
Miguel +
MinGyo Jung +
Mohamed Amine ZGHAL +
Mohit Anand +
MomIsBestFriend +
Naomi Bonnin +
Nathan Abel +
Nico Cernek +
Nigel Markey +
Noritada Kobayashi +
Oktay Sabak +
Oliver Hofkens +
Oluokun Adedayo +
Osman +
Oğuzhan Öğreden +
Pandas Development Team +
Patrik Hlobil +
Paul Lee +
Paul Siegel +
Petr Baev +
Pietro Battiston
Prakhar Pandey +
Puneeth K +
Raghav +
Rajat +
Rajhans Jadhao +
Rajiv Bharadwaj +
Rik-de-Kort +
Roei.r
Rohit Sanjay +
Ronan Lamy +
Roshni +
Roymprog +
Rushabh Vasani +
Ryan Grout +
Ryan Nazareth
Samesh Lakhotia +
Samuel Sinayoko
Samyak Jain +
Sarah Donehower +
Sarah Masud +
Saul Shanabrook +
Scott Cole +
SdgJlbl +
Seb +
Sergei Ivko +
Shadi Akiki
Shorokhov Sergey
Siddhesh Poyarekar +
Sidharthan Nair +
Simon Gibbons
Simon Hawkins
Simon-Martin Schröder +
Sofiane Mahiou +
Sourav kumar +
Souvik Mandal +
Soyoun Kim +
Sparkle Russell-Puleri +
Srinivas Reddy Thatiparthy (శ్రీనివాస్ రెడ్డి తాటిపర్తి)
Stuart Berg +
Sumanau Sareen
Szymon Bednarek +
Tambe Tabitha Achere +
Tan Tran
Tang Heyi +
Tanmay Daripa +
Tanya Jain
Terji Petersen
Thomas Li +
Tirth Jain +
Tola A +
Tom Augspurger
Tommy Lynch +
Tomoyuki Suzuki +
Tony Lorenzo
Unprocessable +
Uwe L. Korn
Vaibhav Vishal
Victoria Zdanovskaya +
Vijayant +
Vishwak Srinivasan +
WANG Aiyong
Wenhuan
Wes McKinney
Will Ayd
Will Holmgren
William Ayd
William Blan +
Wouter Overmeire
Wuraola Oyewusi +
YaOzI +
Yash Shukla +
Yu Wang +
Yusei Tahara +
alexander135 +
alimcmaster1
avelineg +
bganglia +
bolkedebruin
bravech +
chinhwee +
cruzzoe +
dalgarno +
daniellebrown +
danielplawrence
est271 +
francisco souza +
ganevgv +
garanews +
gfyoung
h-vetinari
hasnain2808 +
ianzur +
jalbritt +
jbrockmendel
jeschwar +
jlamborn324 +
joy-rosie +
kernc
killerontherun1
krey +
lexy-lixinyu +
lucyleeow +
lukasbk +
maheshbapatu +
mck619 +
nathalier
naveenkaushik2504 +
nlepleux +
nrebena
ohad83 +
pilkibun
pqzx +
proost +
pv8493013j +
qudade +
rhstanton +
rmunjal29 +
sangarshanan +
sardonick +
saskakarsi +
shaido987 +
ssikdar1
steveayers124 +
tadashigaki +
timcera +
tlaytongoogle +
tobycheese
tonywu1999 +
tsvikas +
yogendrasoni +
zys5945 +