What’s new in 1.0.0 (January 29, 2020)#

These are the changes in pandas 1.0.0. See Release notes for a full changelog including other versions of pandas.

Note

The pandas 1.0 release removed a lot of functionality that was deprecated in previous releases (see below for an overview). It is recommended to first upgrade to pandas 0.25 and to ensure your code is working without warnings, before upgrading to pandas 1.0.
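One way to check that code runs "without warnings" is to turn deprecation-style warnings into errors while exercising it. The sketch below is only an illustration: first_deprecation, legacy_call and clean_call are hypothetical names, and legacy_call merely stands in for code that touches a deprecated pandas API.

```python
import warnings


def first_deprecation(func):
    """Run func with deprecation-style warnings raised as errors.

    Returns the warning instance that was raised, or None if func ran cleanly.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("error", category=FutureWarning)
        warnings.simplefilter("error", category=DeprecationWarning)
        try:
            func()
        except (FutureWarning, DeprecationWarning) as warning:
            return warning
    return None


def legacy_call():
    # Hypothetical stand-in for code that hits a deprecated API.
    warnings.warn("this usage is deprecated", FutureWarning)


def clean_call():
    # Hypothetical stand-in for code that needs no changes.
    return sum([1, 2, 3])


print(first_deprecation(legacy_call))
print(first_deprecation(clean_call))
```

Running a test suite under the same filters (for example via the interpreter's -W option) gives a similar signal for a whole project.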

New deprecation policy#

Starting with pandas 1.0.0, pandas will adopt a variant of SemVer to version releases. Briefly,

  • Deprecations will be introduced in minor releases (e.g. 1.1.0, 1.2.0, 2.1.0, …)

  • Deprecations will be enforced in major releases (e.g. 1.0.0, 2.0.0, 3.0.0, …)

  • API-breaking changes will be made only in major releases (except for experimental features)

See Version policy for more.

Enhancements#

Using Numba in rolling.apply and expanding.apply#

We’ve added an engine keyword to rolling.apply() and expanding.apply() that allows the user to execute the routine using Numba instead of Cython. Using the Numba engine can yield significant performance gains if the apply function can operate on numpy arrays and the data set is larger (1 million rows or greater). For more details, see rolling apply documentation (GH28987, GH30936).
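A minimal sketch of the engine keyword follows. It assumes a small illustrative Series and a hypothetical value_range function, and it selects engine="cython" so that it runs without Numba; switching ENGINE to "numba" is the new option and requires Numba to be installed and raw=True to be passed.

```python
import numpy as np
import pandas as pd


def value_range(window):
    # With raw=True, `window` is a plain NumPy array rather than a Series.
    return window.max() - window.min()


s = pd.Series(np.arange(10, dtype="float64"))

# "cython" is the existing engine; "numba" is the alternative added in 1.0.
ENGINE = "cython"

rolled = s.rolling(3).apply(value_range, raw=True, engine=ENGINE)
expanded = s.expanding().apply(value_range, raw=True, engine=ENGINE)

print(rolled)
print(expanded)
```

The first Numba call includes compilation time, so any benefit is expected mainly on large inputs, as noted above.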

Defining custom windows for rolling operations#

We’ve added a pandas.api.indexers.BaseIndexer() class that allows users to define how window bounds are created during rolling operations. Users can define their own get_window_bounds method on a pandas.api.indexers.BaseIndexer() subclass that will generate the start and end indices used for each window during the rolling aggregation. For more details and example usage, see the custom window rolling documentation
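As a rough illustration of get_window_bounds, the sketch below defines a hypothetical ForwardWindowIndexer whose windows look forward from each row, then applies the returned bounds by hand with NumPy. The step parameter in the signature is an assumption made for compatibility with newer pandas releases; it was not part of the 1.0 signature.

```python
import numpy as np
from pandas.api.indexers import BaseIndexer


class ForwardWindowIndexer(BaseIndexer):
    """Hypothetical indexer: each window covers the current row and the rows after it."""

    def get_window_bounds(
        self, num_values=0, min_periods=None, center=None, closed=None, step=None
    ):
        # Window i spans positions [start[i], end[i]).
        start = np.arange(num_values, dtype="int64")
        end = np.minimum(start + self.window_size, num_values)
        return start, end


indexer = ForwardWindowIndexer(window_size=2)
start, end = indexer.get_window_bounds(num_values=5)

# Apply the bounds manually to show which rows each window would aggregate.
values = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
forward_sums = [float(values[s:e].sum()) for s, e in zip(start, end)]

print(start, end)
print(forward_sums)
```

An instance of such a subclass is what rolling() accepts as its window argument, in which case pandas computes the bounds and performs the aggregation itself.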

Converting to markdown#

We’ve added to_markdown() for creating a markdown table (GH11052)

In [1]: df = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=['a', 'a', 'b'])

In [2]: print(df.to_markdown())
|    |   A |   B |
|:---|----:|----:|
| a  |   1 |   1 |
| a  |   2 |   2 |
| b  |   3 |   3 |

Experimental new features#

Experimental NA scalar to denote missing values#

A new pd.NA value (singleton) is introduced to represent scalar missing values. Up to now, pandas used several values to represent missing data: np.nan for float data, np.nan or None for object-dtype data and pd.NaT for datetime-like data. The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types. pd.NA is currently used by the nullable integer and boolean data types and the new string data type (GH28095).

Warning

Experimental: the behaviour of pd.NA can still change without warning.

For example, creating a Series using the nullable integer dtype:

In [3]: s = pd.Series([1, 2, None], dtype="Int64")

In [4]: s
Out[4]: 
0       1
1       2
2    <NA>
Length: 3, dtype: Int64

In [5]: s[2]
Out[5]: <NA>

Compared to np.nan, pd.NA behaves differently in certain operations. In addition to arithmetic operations, pd.NA also propagates as “missing” or “unknown” in comparison operations:

In [6]: np.nan > 1
Out[6]: False

In [7]: pd.NA > 1
Out[7]: <NA>

For logical operations, pd.NA follows the rules of the three-valued logic (or Kleene logic). For example:

In [8]: pd.NA | True
Out[8]: True

For more, see NA section in the user guide on missing data.
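The three-valued logic can be summarised by combining pd.NA with each boolean in turn. This is a small sketch with illustrative variable names: the result is a definite boolean only when the other operand decides the outcome on its own.

```python
import pandas as pd

# NA | True is True regardless of the unknown value; NA | False stays unknown.
kleene_or = {value: pd.NA | value for value in (True, False)}

# NA & False is False regardless of the unknown value; NA & True stays unknown.
kleene_and = {value: pd.NA & value for value in (True, False)}

# pd.NA has no truth value of its own, so using it in a boolean context raises.
try:
    bool(pd.NA)
    na_is_ambiguous = False
except TypeError:
    na_is_ambiguous = True

print(kleene_or)
print(kleene_and)
print(na_is_ambiguous)
```

Because bool(pd.NA) raises, conditions such as `if value > 1:` need an explicit missing-value check (for example with pd.isna) when value may be pd.NA.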

Dedicated string data type#

We’ve added StringDtype, an extension type dedicated to string data. Previously, strings were typically stored in object-dtype NumPy arrays. (GH29975)

Warning

StringDtype is currently considered experimental. The implementation and parts of the API may change without warning.

The 'string' extension type solves several issues with object-dtype NumPy arrays:

  1. You can accidentally store a mixture of strings and non-strings in an object dtype array. A StringArray can only store strings.

  2. object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text, but still object-dtype columns.

  3. When reading code, the contents of an object dtype array are less clear than string.

In [9]: pd.Series(['abc', None, 'def'], dtype=pd.StringDtype())
Out[9]: 
0     abc
1    <NA>
2     def
Length: 3, dtype: string

You can use the alias "string" as well.

In [10]: s = pd.Series(['abc', None, 'def'], dtype="string")

In [11]: s
Out[11]: 
0     abc
1    <NA>
2     def
Length: 3, dtype: string

The usual string accessor methods work. Where appropriate, the return type of the Series or columns of a DataFrame will also have string dtype.

In [12]: s.str.upper()
Out[12]: 
0     ABC
1    <NA>
2     DEF
Length: 3, dtype: string

In [13]: s.str.split('b', expand=True).dtypes
Out[13]: 
0    string
1    string
Length: 2, dtype: object

String accessor methods that return integers will return a value with Int64Dtype.

In [14]: s.str.count("a")
Out[14]: 
0       1
1    <NA>
2       0
Length: 3, dtype: Int64

We recommend explicitly using the string data type when working with strings. See Text data types for more.

Boolean data type with missing values support#

We’ve added BooleanDtype / BooleanArray, an extension type dedicated to boolean data that can hold missing values. A column with the default bool data type, which is based on a bool-dtype NumPy array, can only hold True or False, and not missing values. The new BooleanArray can also store missing values by keeping track of them in a separate mask. (GH29555, GH30095, GH31131)

In [15]: pd.Series([True, False, None], dtype=pd.BooleanDtype())
Out[15]: 
0     True
1    False
2     <NA>
Length: 3, dtype: boolean

You can use the alias "boolean" as well.

In [16]: s = pd.Series([True, False, None], dtype="boolean")

In [17]: s
Out[17]: 
0     True
1    False
2     <NA>
Length: 3, dtype: boolean

Method convert_dtypes to ease use of supported extension dtypes#

In order to encourage use of the extension dtypes StringDtype, BooleanDtype, Int64Dtype, Int32Dtype, etc., that support pd.NA, the methods DataFrame.convert_dtypes() and Series.convert_dtypes() have been introduced. (GH29752) (GH30929)

Example:

In [18]: df = pd.DataFrame({'x': ['abc', None, 'def'],
   ....:                    'y': [1, 2, np.nan],
   ....:                    'z': [True, False, True]})
   ....: 

In [19]: df
Out[19]: 
      x    y      z
0   abc  1.0   True
1  None  2.0  False
2   def  NaN   True

[3 rows x 3 columns]

In [20]: df.dtypes
Out[20]: 
x     object
y    float64
z       bool
Length: 3, dtype: object

In [21]: converted = df.convert_dtypes()

In [22]: converted
Out[22]: 
      x     y      z
0   abc     1   True
1  <NA>     2  False
2   def  <NA>   True

[3 rows x 3 columns]

In [23]: converted.dtypes
Out[23]: 
x     string
y      Int64
z    boolean
Length: 3, dtype: object

This is especially useful after reading in data using readers such as read_csv() and read_excel(). See here for a description.
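The reader workflow might look like the following sketch, which assumes a small in-memory CSV (csv_text is made up for illustration) in place of a real file. The reader produces the usual NumPy-backed dtypes, and convert_dtypes() then switches the columns to the extension dtypes that support pd.NA.

```python
import io

import pandas as pd

# Hypothetical CSV content with a missing value in every column.
csv_text = "x,y,z\nabc,1,True\n,2,False\ndef,,\n"

raw = pd.read_csv(io.StringIO(csv_text))
converted = raw.convert_dtypes()

print(raw.dtypes)
print(converted.dtypes)
```

In the raw frame the integer column is read as float64 because of the missing entry; after conversion it becomes Int64, and the True/False column becomes boolean.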

Other enhancements#

Backwards incompatible API changes#

Avoid using names from MultiIndex.levels#

As part of a larger refactor to MultiIndex the level names are now stored separately from the levels (GH27242). We recommend using MultiIndex.names to access the names, and Index.set_names() to update the names.

For backwards compatibility, you can still access the names via the levels.

In [24]: mi = pd.MultiIndex.from_product([[1, 2], ['a', 'b']], names=['x', 'y'])

In [25]: mi.levels[0].name
Out[25]: 'x'

However, it is no longer possible to update the names of the MultiIndex via the level.

In [26]: mi.levels[0].name = "new name"
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In [26], line 1
----> 1 mi.levels[0].name = "new name"

File ~/work/pandas/pandas/pandas/core/indexes/base.py:1747, in Index.name(self, value)
   1743 @name.setter
   1744 def name(self, value: Hashable) -> None:
   1745     if self._no_setting_name:
   1746         # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1747         raise RuntimeError(
   1748             "Cannot set name on a level of a MultiIndex. Use "
   1749             "'MultiIndex.set_names' instead."
   1750         )
   1751     maybe_extract_name(value, None, type(self))
   1752     self._name = value

RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead.

In [27]: mi.names
Out[27]: FrozenList(['x', 'y'])

To update, use MultiIndex.set_names, which returns a new MultiIndex.

In [28]: mi2 = mi.set_names("new name", level=0)

In [29]: mi2.names
Out[29]: FrozenList(['new name', 'y'])

New repr for IntervalArray#

pandas.arrays.IntervalArray adopts a new __repr__ in accordance with other array classes (GH25022)

pandas 0.25.x

In [1]: pd.arrays.IntervalArray.from_tuples([(0, 1), (2, 3)])
Out[1]:
IntervalArray([(0, 1], (2, 3]],
              closed='right',
              dtype='interval[int64]')

pandas 1.0.0

In [30]: pd.arrays.IntervalArray.from_tuples([(0, 1), (2, 3)])
Out[30]: 
<IntervalArray>
[(0, 1], (2, 3]]
Length: 2, dtype: interval[int64, right]

DataFrame.rename now only accepts one positional argument#

DataFrame.rename() would previously accept positional arguments that would lead to ambiguous or undefined behavior. From pandas 1.0, only the very first argument, which maps labels to their new names along the default axis, is allowed to be passed by position (GH29136).

pandas 0.25.x

In [1]: df = pd.DataFrame([[1]])
In [2]: df.rename({0: 1}, {0: 2})
Out[2]:
FutureWarning: ...Use named arguments to resolve ambiguity...
   2
1  1

pandas 1.0.0

In [3]: df.rename({0: 1}, {0: 2})
Traceback (most recent call last):
...
TypeError: rename() takes from 1 to 2 positional arguments but 3 were given

Note that errors will now be raised when conflicting or potentially ambiguous arguments are provided.

pandas 0.25.x

In [4]: df.rename({0: 1}, index={0: 2})
Out[4]:
   0
1  1

In [5]: df.rename(mapper={0: 1}, index={0: 2})
Out[5]:
   0
2  1

pandas 1.0.0

In [6]: df.rename({0: 1}, index={0: 2})
Traceback (most recent call last):
...
TypeError: Cannot specify both 'mapper' and any of 'index' or 'columns'

In [7]: df.rename(mapper={0: 1}, index={0: 2})
Traceback (most recent call last):
...
TypeError: Cannot specify both 'mapper' and any of 'index' or 'columns'

You can still change the axis along which the first positional argument is applied by supplying the axis keyword argument.

In [31]: df.rename({0: 1})
Out[31]: 
   0
1  1

[1 rows x 1 columns]

In [32]: df.rename({0: 1}, axis=1)
Out[32]: 
   1
0  1

[1 rows x 1 columns]

If you would like to update both the index and column labels, be sure to use the respective keywords.

In [33]: df.rename(index={0: 1}, columns={0: 2})
Out[33]: 
   2
1  1

[1 rows x 1 columns]

Extended verbose info output for DataFrame#

DataFrame.info() now shows line numbers for the columns summary (GH17304)

pandas 0.25.x

In [1]: df = pd.DataFrame({"int_col": [1, 2, 3],
...                    "text_col": ["a", "b", "c"],
...                    "float_col": [0.0, 0.1, 0.2]})
In [2]: df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
int_col      3 non-null int64
text_col     3 non-null object
float_col    3 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 152.0+ bytes

pandas 1.0.0

In [34]: df = pd.DataFrame({"int_col": [1, 2, 3],
   ....:                    "text_col": ["a", "b", "c"],
   ....:                    "float_col": [0.0, 0.1, 0.2]})
   ....: 

In [35]: df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   int_col    3 non-null      int64  
 1   text_col   3 non-null      object 
 2   float_col  3 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes

pandas.array() inference changes#

pandas.array() now infers pandas’ new extension types in several cases (GH29791):

  1. String data (including missing values) now returns an arrays.StringArray.

  2. Integer data (including missing values) now returns an arrays.IntegerArray.

  3. Boolean data (including missing values) now returns the new arrays.BooleanArray.

pandas 0.25.x

In [1]: pd.array(["a", None])
Out[1]:
<PandasArray>
['a', None]
Length: 2, dtype: object

In [2]: pd.array([1, None])
Out[2]:
<PandasArray>
[1, None]
Length: 2, dtype: object

pandas 1.0.0

In [36]: pd.array(["a", None])
Out[36]: 
<StringArray>
['a', <NA>]
Length: 2, dtype: string

In [37]: pd.array([1, None])
Out[37]: 
<IntegerArray>
[1, <NA>]
Length: 2, dtype: Int64

As a reminder, you can specify the dtype to disable all inference.
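For instance, passing dtype explicitly keeps pandas.array() from picking one of the new extension types. This is a short sketch with illustrative variable names:

```python
import pandas as pd

# Without a dtype, integer data with a missing value is inferred as Int64.
inferred = pd.array([1, None])

# With an explicit dtype, no inference takes place.
as_object = pd.array([1, None], dtype=object)
as_float = pd.array([1, None], dtype="float64")

print(inferred.dtype, as_object.dtype, as_float.dtype)
```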

arrays.IntegerArray now uses pandas.NA#

arrays.IntegerArray now uses pandas.NA rather than numpy.nan as its missing value marker (GH29964).

pandas 0.25.x

In [1]: a = pd.array([1, 2, None], dtype="Int64")
In [2]: a
Out[2]:
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64

In [3]: a[2]
Out[3]:
nan

pandas 1.0.0

In [38]: a = pd.array([1, 2, None], dtype="Int64")

In [39]: a
Out[39]: 
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64

In [40]: a[2]
Out[40]: <NA>

This has a few API-breaking consequences.

Converting to a NumPy ndarray

When converting to a NumPy array missing values will be pd.NA, which cannot be converted to a float. So calling np.asarray(integer_array, dtype="float") will now raise.

pandas 0.25.x

In [1]: np.asarray(a, dtype="float")
Out[1]:
array([ 1.,  2., nan])

pandas 1.0.0

In [41]: np.asarray(a, dtype="float")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [41], line 1
----> 1 np.asarray(a, dtype="float")

File ~/work/pandas/pandas/pandas/core/arrays/masked.py:490, in BaseMaskedArray.__array__(self, dtype)
    485 def __array__(self, dtype: NpDtype | None = None) -> np.ndarray:
    486     """
    487     the array interface, return my values
    488     We return an object array here to preserve our scalar values
    489     """
--> 490     return self.to_numpy(dtype=dtype)

File ~/work/pandas/pandas/pandas/core/arrays/masked.py:412, in BaseMaskedArray.to_numpy(self, dtype, copy, na_value)
    406 if self._hasna:
    407     if (
    408         not is_object_dtype(dtype)
    409         and not is_string_dtype(dtype)
    410         and na_value is libmissing.NA
    411     ):
--> 412         raise ValueError(
    413             f"cannot convert to '{dtype}'-dtype NumPy array "
    414             "with missing values. Specify an appropriate 'na_value' "
    415             "for this dtype."
    416         )
    417     # don't pass copy to astype -> always need a copy since we are mutating
    418     data = self._data.astype(dtype)

ValueError: cannot convert to 'float64'-dtype NumPy array with missing values. Specify an appropriate 'na_value' for this dtype.

Use arrays.IntegerArray.to_numpy() with an explicit na_value instead.

In [42]: a.to_numpy(dtype="float", na_value=np.nan)
Out[42]: array([ 1.,  2., nan])

Reductions can return pd.NA

When performing a reduction such as a sum with skipna=False, the result will now be pd.NA instead of np.nan in presence of missing values (GH30958).

pandas 0.25.x

In [1]: pd.Series(a).sum(skipna=False)
Out[1]:
nan

pandas 1.0.0

In [43]: pd.Series(a).sum(skipna=False)
Out[43]: <NA>

value_counts returns a nullable integer dtype

Series.value_counts() with a nullable integer dtype now returns a nullable integer dtype for the values.

pandas 0.25.x

In [1]: pd.Series([2, 1, 1, None], dtype="Int64").value_counts().dtype
Out[1]:
dtype('int64')

pandas 1.0.0

In [44]: pd.Series([2, 1, 1, None], dtype="Int64").value_counts().dtype
Out[44]: Int64Dtype()

See Experimental NA scalar to denote missing values for more on the differences between pandas.NA and numpy.nan.

arrays.IntegerArray comparisons return arrays.BooleanArray#

Comparison operations on an arrays.IntegerArray now return an arrays.BooleanArray rather than a NumPy array (GH29964).

pandas 0.25.x

In [1]: a = pd.array([1, 2, None], dtype="Int64")
In [2]: a
Out[2]:
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64

In [3]: a > 1
Out[3]:
array([False,  True, False])

pandas 1.0.0

In [45]: a = pd.array([1, 2, None], dtype="Int64")

In [46]: a > 1
Out[46]: 
<BooleanArray>
[False, True, <NA>]
Length: 3, dtype: boolean

Note that missing values now propagate, rather than always comparing unequal like numpy.nan. See Experimental NA scalar to denote missing values for more.
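Code that relied on the old NumPy boolean result can decide explicitly how missing comparisons should count. The sketch below treats them as False, which is an assumption about the desired behaviour and mirrors the 0.25.x output; plain_mask and selected are illustrative names.

```python
import pandas as pd

a = pd.array([1, 2, None], dtype="Int64")
mask = a > 1  # BooleanArray: [False, True, <NA>]

# Convert to a plain NumPy boolean array, counting missing comparisons as False.
plain_mask = mask.to_numpy(dtype="bool", na_value=False)
selected = a[plain_mask]

print(plain_mask)
print(selected)
```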

By default Categorical.min() now returns the minimum instead of np.nan#

When a Categorical contains np.nan, Categorical.min() no longer returns np.nan by default (skipna=True) (GH25303).

pandas 0.25.x

In [1]: pd.Categorical([1, 2, np.nan], ordered=True).min()
Out[1]: nan

pandas 1.0.0

In [47]: pd.Categorical([1, 2, np.nan], ordered=True).min()
Out[47]: 1

Default dtype of empty pandas.Series#

Initialising an empty pandas.Series without specifying a dtype will raise a DeprecationWarning now (GH17261). The default dtype will change from float64 to object in future releases so that it is consistent with the behaviour of DataFrame and Index.

pandas 1.0.0

In [1]: pd.Series()
Out[1]:
DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
Series([], dtype: float64)
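Passing a dtype when creating an empty Series avoids the warning and makes the result independent of the default. A brief sketch:

```python
import pandas as pd

# State the intended dtype explicitly instead of relying on the default.
empty_float = pd.Series(dtype="float64")
empty_object = pd.Series(dtype="object")

print(empty_float.dtype, empty_object.dtype)
```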

Result dtype inference changes for resample operations#

The rules for the result dtype in DataFrame.resample() aggregations have changed for extension types (GH31359). Previously, pandas would attempt to convert the result back to the original dtype, falling back to the usual inference rules if that was not possible. Now, pandas will only return a result of the original dtype if the scalar values in the result are instances of the extension dtype’s scalar type.

In [48]: df = pd.DataFrame({"A": ['a', 'b']}, dtype='category',
   ....:                   index=pd.date_range('2000', periods=2))
   ....: 

In [49]: df
Out[49]: 
            A
2000-01-01  a
2000-01-02  b

[2 rows x 1 columns]

pandas 0.25.x

In [1]: df.resample("2D").agg(lambda x: 'a').A.dtype
Out[1]:
CategoricalDtype(categories=['a', 'b'], ordered=False)

pandas 1.0.0

In [50]: df.resample("2D").agg(lambda x: 'a').A.dtype
Out[50]: dtype('O')

This fixes an inconsistency between resample and groupby. This also fixes a potential bug, where the values of the result might change depending on how the results are cast back to the original dtype.

pandas 0.25.x

In [1]: df.resample("2D").agg(lambda x: 'c')
Out[1]:

     A
0  NaN

pandas 1.0.0

In [51]: df.resample("2D").agg(lambda x: 'c')
Out[51]: 
            A
2000-01-01  c

[1 rows x 1 columns]

Increased minimum version for Python#

pandas 1.0.0 supports Python 3.6.1 and higher (GH29212).

Increased minimum versions for dependencies#

Some minimum supported versions of dependencies were updated (GH29766, GH29723). If installed, we now require:

Package          Minimum Version  Required  Changed
---------------  ---------------  --------  -------
numpy            1.13.3           X
pytz             2015.4           X
python-dateutil  2.6.1            X
bottleneck       1.2.1
numexpr          2.6.2
pytest (dev)     4.0.2

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package         Minimum Version  Changed
--------------  ---------------  -------
beautifulsoup4  4.6.0
fastparquet     0.3.2            X
gcsfs           0.2.2
lxml            3.8.0
matplotlib      2.2.2
numba           0.46.0           X
openpyxl        2.5.7            X
pyarrow         0.13.0           X
pymysql         0.7.1
pytables        3.4.2
s3fs            0.3.0            X
scipy           0.19.0
sqlalchemy      1.1.4
xarray          0.8.2
xlrd            1.1.0
xlsxwriter      0.9.8
xlwt            1.2.0

See Dependencies and Optional dependencies for more.

Build changes#

pandas has added a pyproject.toml file and will no longer include cythonized files in the source distribution uploaded to PyPI (GH28341, GH20775). If you’re installing a built distribution (wheel) or via conda, this shouldn’t have any effect on you. If you’re building pandas from source, you should no longer need to install Cython into your build environment before calling pip install pandas.

Other API changes#

  • core.groupby.GroupBy.transform now raises on invalid operation names (GH27489)

  • pandas.api.types.infer_dtype() will now return “integer-na” for integer and np.nan mix (GH27283)

  • MultiIndex.from_arrays() will no longer infer names from arrays if names=None is explicitly provided (GH27292)

  • In order to improve tab-completion, pandas does not include most deprecated attributes when introspecting a pandas object using dir (e.g. dir(df)). To see which attributes are excluded, see an object’s _deprecations attribute, for example pd.DataFrame._deprecations (GH28805).

  • The returned dtype of unique() now matches the input dtype. (GH27874)

  • Changed the default configuration value for options.plotting.matplotlib.register_converters from True to "auto" (GH18720). Now, pandas custom formatters will only be applied to plots created by pandas, through plot(). Previously, pandas’ formatters would be applied to all plots created after a plot(). See units registration for more.

  • Series.dropna() has dropped its **kwargs argument in favor of a single how parameter. Supplying anything other than how to **kwargs raised a TypeError previously (GH29388)

  • When testing pandas, the new minimum required version of pytest is 5.0.1 (GH29664)

  • Series.str.__iter__() was deprecated and will be removed in future releases (GH28277).

  • Added <NA> to the list of default NA values for read_csv() (GH30821)
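The last item above can be seen with a small in-memory CSV. This is a sketch only: the CSV content is made up, and keep_default_na=False is shown as one way to keep the literal text.

```python
import io

import pandas as pd

csv_text = "a,b\n1,x\n<NA>,<NA>\n3,z\n"

# With the default NA values, the literal text <NA> is parsed as missing.
frame = pd.read_csv(io.StringIO(csv_text))

# Disabling the default NA values keeps <NA> as an ordinary string.
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(frame)
print(literal)
```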

Documentation improvements#

Deprecations#

  • Series.item() and Index.item() have been _undeprecated_ (GH29250)

  • Index.set_value has been deprecated. For a given index idx, array arr, value in idx of idx_val and a new value of val, idx.set_value(arr, idx_val, val) is equivalent to arr[idx.get_loc(idx_val)] = val, which should be used instead (GH28621).

  • is_extension_type() is deprecated; is_extension_array_dtype() should be used instead (GH29457)

  • eval() keyword argument “truediv” is deprecated and will be removed in a future version (GH29812)

  • DateOffset.isAnchored() and DateOffset.onOffset() are deprecated and will be removed in a future version, use DateOffset.is_anchored() and DateOffset.is_on_offset() instead (GH30340)

  • pandas.tseries.frequencies.get_offset is deprecated and will be removed in a future version, use pandas.tseries.frequencies.to_offset instead (GH4205)

  • Categorical.take_nd() and CategoricalIndex.take_nd() are deprecated, use Categorical.take() and CategoricalIndex.take() instead (GH27745)

  • The parameter numeric_only of Categorical.min() and Categorical.max() is deprecated and replaced with skipna (GH25303)

  • The parameter label in lreshape() has been deprecated and will be removed in a future version (GH29742)

  • pandas.core.index has been deprecated and will be removed in a future version, the public classes are available in the top-level namespace (GH19711)

  • pandas.json_normalize() is now exposed in the top-level namespace. Usage of json_normalize as pandas.io.json.json_normalize is now deprecated and it is recommended to use json_normalize as pandas.json_normalize() instead (GH27586).

  • The numpy argument of pandas.read_json() is deprecated (GH28512).

  • DataFrame.to_stata(), DataFrame.to_feather(), and DataFrame.to_parquet() argument “fname” is deprecated, use “path” instead (GH23574)

  • The deprecated internal attributes _start, _stop and _step of RangeIndex now raise a FutureWarning instead of a DeprecationWarning (GH26581)

  • The pandas.util.testing module has been deprecated. Use the public API in pandas.testing documented at Assertion functions (GH16232).

  • pandas.SparseArray has been deprecated. Use pandas.arrays.SparseArray (arrays.SparseArray) instead. (GH30642)

  • The parameter is_copy of Series.take() and DataFrame.take() has been deprecated and will be removed in a future version. (GH27357)

  • Support for multi-dimensional indexing (e.g. index[:, None]) on an Index is deprecated and will be removed in a future version. Convert to a NumPy array before indexing instead (GH30588)

  • The pandas.np submodule is now deprecated. Import numpy directly instead (GH30296)

  • The pandas.datetime class is now deprecated. Import from datetime instead (GH30610)

  • diff will raise a TypeError rather than implicitly losing the dtype of extension types in the future. Convert to the correct dtype before calling diff instead (GH31025)
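Several of the replacements named in the list above can be sketched side by side. The data and variable names below are made up for illustration; only the recommended calls are used, so nothing here relies on a deprecated API.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_extension_array_dtype
from pandas.tseries.frequencies import to_offset

# Instead of idx.set_value(arr, idx_val, val): look up the position and assign.
idx = pd.Index(["a", "b", "c"])
arr = np.array([10, 20, 30])
arr[idx.get_loc("b")] = 99

# Instead of is_extension_type(...): is_extension_array_dtype(...).
uses_extension_dtype = is_extension_array_dtype(pd.Series([1, None], dtype="Int64"))

# Instead of pandas.tseries.frequencies.get_offset("D"): to_offset("D").
one_day = to_offset("D")

# Instead of pandas.io.json.json_normalize: the top-level pandas.json_normalize.
flat = pd.json_normalize([{"id": 1, "info": {"name": "x"}}])

# Instead of pandas.util.testing: the public helpers in pandas.testing.
pd.testing.assert_frame_equal(flat, flat.copy())

print(arr, uses_extension_dtype, one_day, list(flat.columns))
```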

Selecting Columns from a Grouped DataFrame

When selecting columns from a DataFrameGroupBy object, passing individual keys (or a tuple of keys) inside single brackets is deprecated; a list of items should be used instead (GH23566). For example:

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": np.random.randn(8),
    "C": np.random.randn(8),
})
g = df.groupby('A')

# single key, returns SeriesGroupBy
g['B']

# tuple of single key, returns SeriesGroupBy
g[('B',)]

# tuple of multiple keys, returns DataFrameGroupBy, raises FutureWarning
g[('B', 'C')]

# multiple keys passed directly, returns DataFrameGroupBy, raises FutureWarning
# (implicitly converts the passed strings into a single tuple)
g['B', 'C']

# proper way, returns DataFrameGroupBy
g[['B', 'C']]

Removal of prior version deprecations/changes#

Removed SparseSeries and SparseDataFrame

SparseSeries, SparseDataFrame and the DataFrame.to_sparse method have been removed (GH28425). We recommend using a Series or DataFrame with sparse values instead. See Migrating for help with migrating existing code.
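A regular Series holding sparse values might be built as in the sketch below. The data is illustrative, and fill_value=0 is an assumption about which value should be left unstored.

```python
import pandas as pd

dense = pd.Series([0, 0, 1, 0])

# Store the values sparsely inside an ordinary Series.
sparse = dense.astype(pd.SparseDtype("int64", fill_value=0))

# The .sparse accessor exposes sparse-specific information and conversions.
density = sparse.sparse.density
round_trip = sparse.sparse.to_dense()

print(sparse.dtype, density)
print(round_trip)
```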

Matplotlib unit registration

Previously, pandas would register converters with matplotlib as a side effect of importing pandas (GH18720). This changed the output of plots made via matplotlib after pandas was imported, even if you were using matplotlib directly rather than plot().

To use pandas formatters with a matplotlib plot, specify

In [1]: import pandas as pd
In [2]: pd.options.plotting.matplotlib.register_converters = True

Note that plots created by DataFrame.plot() and Series.plot() do register the converters automatically. The only behavior change is when plotting a date-like object via matplotlib.pyplot.plot or matplotlib.Axes.plot. See Custom formatters for timeseries plots for more.

Other removals

Performance improvements#

Bug fixes#

Categorical#

  • Added test to assert the fillna() raises the correct ValueError message when the value isn’t a value from categories (GH13628)

  • Bug in Categorical.astype() where NaN values were handled incorrectly when casting to int (GH28406)

  • DataFrame.reindex() with a CategoricalIndex would fail when the targets contained duplicates, and wouldn’t fail if the source contained duplicates (GH28107)

  • Bug in Categorical.astype() not allowing for casting to extension dtypes (GH28668)

  • Bug where merge() was unable to join on categorical and extension dtype columns (GH28668)

  • Categorical.searchsorted() and CategoricalIndex.searchsorted() now work on unordered categoricals also (GH21667)

  • Added test to assert roundtripping to parquet with DataFrame.to_parquet() or read_parquet() will preserve Categorical dtypes for string types (GH27955)

  • Changed the error message in Categorical.remove_categories() to always show the invalid removals as a set (GH28669)

  • Using date accessors on a categorical dtyped Series of datetimes was not returning an object of the same type as if one used the .str / .dt accessors on a Series of that type. E.g. when accessing Series.dt.tz_localize() on a Categorical with duplicate entries, the accessor was skipping duplicates (GH27952)

  • Bug in DataFrame.replace() and Series.replace() that would give incorrect results on categorical data (GH26988)

  • Bug where calling Categorical.min() or Categorical.max() on an empty Categorical would raise a numpy exception (GH30227)

  • The following methods now also correctly output values for unobserved categories when called through groupby(..., observed=False) (GH17605):

      • core.groupby.SeriesGroupBy.count()

      • core.groupby.SeriesGroupBy.size()

      • core.groupby.SeriesGroupBy.nunique()

      • core.groupby.SeriesGroupBy.nth()
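The unobserved-category behaviour described in the last item can be illustrated with a category that never occurs in the data. This is a sketch with made-up values, in which category "c" is declared but unobserved.

```python
import pandas as pd

# "c" is a declared category that does not appear in the data.
cat = pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"])
s = pd.Series([1, 2, 3])

# With observed=False, the unobserved category still gets a row in the result.
counts = s.groupby(cat, observed=False).count()

print(counts)
```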

Datetimelike#

  • Bug in Series.__setitem__() incorrectly casting np.timedelta64("NaT") to np.datetime64("NaT") when inserting into a Series with datetime64 dtype (GH27311)

  • Bug in Series.dt() property lookups when the underlying data is read-only (GH27529)

  • Bug in HDFStore.__getitem__ incorrectly reading tz attribute created in Python 2 (GH26443)

  • Bug in to_datetime() where passing arrays of malformed str with errors="coerce" could incorrectly lead to raising ValueError (GH28299)

  • Bug in core.groupby.SeriesGroupBy.nunique() where NaT values were interfering with the count of unique values (GH27951)

  • Bug in Timestamp subtraction when subtracting a Timestamp from a np.datetime64 object incorrectly raising TypeError (GH28286)

  • Addition and subtraction of integer or integer-dtype arrays with Timestamp will now raise NullFrequencyError instead of ValueError (GH28268)

  • Bug in Series and DataFrame with integer dtype failing to raise TypeError when adding or subtracting a np.datetime64 object (GH28080)

  • Bug in Series.astype(), Index.astype(), and DataFrame.astype() failing to handle NaT when casting to an integer dtype (GH28492)

  • Bug in Week with weekday incorrectly raising AttributeError instead of TypeError when adding or subtracting an invalid type (GH28530)

  • Bug in DataFrame arithmetic operations when operating with a Series with dtype 'timedelta64[ns]' (GH28049)

  • Bug in core.groupby.generic.SeriesGroupBy.apply() raising ValueError when a column in the original DataFrame is a datetime and the column labels are not standard integers (GH28247)

  • Bug in pandas._config.localization.get_locales() where locale -a encodes the locales list as windows-1252 (GH23638, GH24760, GH27368)

  • Bug in Series.var() failing to raise TypeError when called with timedelta64[ns] dtype (GH28289)

  • Bug in DatetimeIndex.strftime() and Series.dt.strftime() where NaT was converted to the string 'NaT' instead of np.nan (GH29578)

  • Bug in masking datetime-like arrays with a boolean mask of an incorrect length not raising an IndexError (GH30308)

  • Bug in Timestamp.resolution being a property instead of a class attribute (GH29910)

  • Bug in pandas.to_datetime() when called with None raising TypeError instead of returning NaT (GH30011)

  • Bug in pandas.to_datetime() failing for deques when using cache=True (the default) (GH29403)

  • Bug in Series.item() with datetime64 or timedelta64 dtype, DatetimeIndex.item(), and TimedeltaIndex.item() returning an integer instead of a Timestamp or Timedelta (GH30175)

  • Bug in DatetimeIndex addition when adding a non-optimized DateOffset incorrectly dropping timezone information (GH30336)

  • Bug in DataFrame.drop() where attempting to drop non-existent values from a DatetimeIndex would yield a confusing error message (GH30399)

  • Bug in DataFrame.append() would remove the timezone-awareness of new data (GH30238)

  • Bug in Series.cummin() and Series.cummax() with timezone-aware dtype incorrectly dropping its timezone (GH15553)

  • Bug in DatetimeArray, TimedeltaArray, and PeriodArray where inplace addition and subtraction did not actually operate inplace (GH24115)

  • Bug in pandas.to_datetime() when called with Series storing IntegerArray raising TypeError instead of returning Series (GH30050)

  • Bug in date_range() with custom business hours as freq and given number of periods (GH30593)

  • Bug in PeriodIndex comparisons with incorrectly casting integers to Period objects, inconsistent with the Period comparison behavior (GH30722)

  • Bug in DatetimeIndex.insert() raising a ValueError instead of a TypeError when trying to insert a timezone-aware Timestamp into a timezone-naive DatetimeIndex, or vice-versa (GH30806)
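A few of the datetime-like fixes above can be checked with a short script. This is a minimal sketch, assuming a pandas version of 1.0 or later; the variable names are illustrative and not part of pandas.

```python
from collections import deque

import pandas as pd

# A deque is accepted like any other list-like, with the default cache=True.
from_deque = pd.to_datetime(deque(["2020-01-29", "2020-01-30"]))

# .item() on a datetime64 Series returns a Timestamp, not an integer.
item = pd.Series(pd.to_datetime(["2020-01-29"])).item()

# Missing values stay missing after strftime instead of becoming the string "NaT".
formatted = pd.Series(pd.to_datetime(["2020-01-29", None])).dt.strftime("%Y-%m-%d")

# cummax() on timezone-aware data keeps the timezone.
aware = pd.Series(pd.date_range("2020-01-01", periods=3, tz="UTC"))
running_max = aware.cummax()

print(type(from_deque).__name__, type(item).__name__)
print(formatted.isna().tolist(), running_max.dt.tz)
```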

Timedelta#

Timezones#

Numeric#

Conversion#

Strings#

Interval#

Indexing#

  • Bug in assignment using a reverse slicer (GH26939)

  • Bug in DataFrame.explode() that would duplicate the frame in the presence of duplicates in the index (GH28010)

  • Bug in reindexing a PeriodIndex() with another type of index that contained a Period (GH28323, GH28337)

  • Fix assignment of column via .loc with numpy non-ns datetime type (GH27395)

  • Bug in Float64Index.astype() where np.inf was not handled properly when casting to an integer dtype (GH28475)

  • Index.union() could fail when the left contained duplicates (GH28257)

  • Bug where indexing with .loc did not work when the index was a CategoricalIndex with non-string categories (GH17569, GH30225)

  • Index.get_indexer_non_unique() could fail with TypeError in some cases, such as when searching for ints in a string index (GH28257)

  • Bug in Float64Index.get_loc() incorrectly raising TypeError instead of KeyError (GH29189)

  • Bug in DataFrame.loc() with incorrect dtype when setting Categorical value in 1-row DataFrame (GH25495)

  • Bug in MultiIndex.get_loc() failing to find missing values when the input includes missing values (GH19132)

  • Bug in Series.__setitem__() incorrectly assigning values with boolean indexer when the length of new data matches the number of True values and new data is not a Series or an np.array (GH30567)

  • Bug in indexing with a PeriodIndex incorrectly accepting integers representing years, use e.g. ser.loc["2007"] instead of ser.loc[2007] (GH30763)
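Two of the indexing fixes above lend themselves to a quick check. The sketch below assumes pandas 1.0 or later; the Series and DataFrame contents are made up for illustration.

```python
import pandas as pd

# Annual PeriodIndex: select a year with a string label, not an integer.
ser = pd.Series([10, 20, 30], index=pd.period_range("2006", periods=3, freq="Y"))
by_label = ser.loc["2007"]

# An integer key is no longer interpreted as a year and raises KeyError.
try:
    ser.loc[2007]
    integer_key_error = None
except KeyError as err:
    integer_key_error = err

# explode() with a duplicated index yields one row per list element,
# without duplicating the frame.
df = pd.DataFrame({"A": [[1, 2], [3]]}, index=["x", "x"])
exploded = df.explode("A")

print(by_label, type(integer_key_error).__name__)
print(exploded)
```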

Missing#

MultiIndex#

  • Constructor for MultiIndex verifies that the given sortorder is compatible with the actual lexsort_depth if verify_integrity parameter is True (the default) (GH28735)

  • Series.drop() and MultiIndex.drop() with a MultiIndex now raise an exception if the labels are not in the given level (GH8594)
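Both MultiIndex changes surface as exceptions, so they are easy to observe. A minimal sketch, assuming pandas 1.0 or later; the exact exception messages may vary between versions, so only the exception types are relied on here.

```python
import pandas as pd

# The first level's codes are not sorted, so sortorder=2 is inconsistent
# with the actual lexsort depth and is rejected (verify_integrity defaults to True).
try:
    pd.MultiIndex(
        levels=[[0, 1], [0, 1]],
        codes=[[0, 0, 1, 0], [1, 0, 1, 0]],
        sortorder=2,
    )
    sortorder_error = None
except ValueError as err:
    sortorder_error = err

# Dropping a label that does not exist in the requested level raises.
mi = pd.MultiIndex.from_tuples([("a", 1), ("b", 2)])
s = pd.Series([10, 20], index=mi)
try:
    s.drop("z", level=0)
    drop_error = None
except KeyError as err:
    drop_error = err

print(type(sortorder_error).__name__, type(drop_error).__name__)
```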

IO#

  • read_csv() now accepts binary mode file buffers when using the Python csv engine (GH23779)

  • Bug in DataFrame.to_json() where using a Tuple as a column or index value and using orient="columns" or orient="index" would produce invalid JSON (GH20500)

  • Improve infinity parsing. read_csv() now interprets Infinity, +Infinity, -Infinity as floating point values (GH10065)

  • Bug in DataFrame.to_csv() where values were truncated when the length of na_rep was shorter than the text input data. (GH25099)

  • Bug in DataFrame.to_string() where values were truncated using display options instead of outputting the full content (GH9784)

  • Bug in DataFrame.to_json() where a datetime column label would not be written out in ISO format with orient="table" (GH28130)

  • Bug in DataFrame.to_parquet() where writing to GCS would fail with engine='fastparquet' if the file did not already exist (GH28326)

  • Bug in read_hdf() closing stores that it didn’t open when Exceptions are raised (GH28699)

  • Bug in read_json() where using orient="index" would not maintain the order (GH28557)

  • Bug in DataFrame.to_html() where the length of the formatters argument was not verified (GH28469)

  • Bug in read_excel() with engine='ods' when sheet_name argument references a non-existent sheet (GH27676)

  • Bug in pandas.io.formats.style.Styler() formatting for floating values not displaying decimals correctly (GH13257)

  • Bug in DataFrame.to_html() when using formatters=<list> and max_cols together. (GH25955)

  • Bug in Styler.background_gradient() not able to work with dtype Int64 (GH28869)

  • Bug in DataFrame.to_clipboard() which did not work reliably in ipython (GH22707)

  • Bug in read_json() where default encoding was not set to utf-8 (GH29565)

  • Bug in PythonParser where str and bytes were being mixed when dealing with the decimal field (GH29650)

  • read_gbq() now accepts progress_bar_type to display progress bar while the data downloads. (GH29857)

  • Bug in pandas.io.json.json_normalize() where a missing value in the location specified by record_path would raise a TypeError (GH30148)

  • read_excel() now accepts binary data (GH15914)

  • Bug in read_csv() in which encoding handling was limited to just the string utf-16 for the C engine (GH24130)
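The read_csv() changes listed above (infinity parsing and binary buffers with the Python engine) can be tried on in-memory data. This is a small sketch assuming pandas 1.0 or later; the CSV contents are invented for the example.

```python
import io

import numpy as np
import pandas as pd

# "Infinity", "+Infinity" and "-Infinity" are parsed as floating point values.
csv_text = "x\nInfinity\n+Infinity\n-Infinity\n1.5\n"
parsed = pd.read_csv(io.StringIO(csv_text))

# A binary buffer can be passed directly when using the Python engine.
binary = pd.read_csv(io.BytesIO(b"a,b\n1,2\n3,4\n"), engine="python")

print(parsed["x"].tolist())
print(binary)
```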

Plotting#

GroupBy/resample/rolling#

Reshaping#

Sparse#

  • Bug in SparseDataFrame arithmetic operations incorrectly casting inputs to float (GH28107)

  • Bug in DataFrame.sparse returning a Series when there was a column named sparse rather than the accessor (GH30758)

  • Fixed operator.xor() with a boolean-dtype SparseArray. Now returns a sparse result, rather than object dtype (GH31025)
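The operator.xor() fix can be illustrated with two small boolean arrays. A minimal sketch, assuming pandas 1.0 or later; the input values are arbitrary.

```python
import operator

import numpy as np
import pandas as pd

left = pd.arrays.SparseArray([True, False, True, False])
right = pd.arrays.SparseArray([True, True, False, False])

# The result stays sparse instead of falling back to object dtype.
result = operator.xor(left, right)
dense = np.asarray(result)

print(type(result).__name__, result.dtype)
print(dense)
```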

ExtensionArray#

Other#

  • Trying to set the display.precision, display.max_rows or display.max_columns using set_option() to anything but a None or a positive int will raise a ValueError (GH23348)

  • Using DataFrame.replace() with overlapping keys in a nested dictionary will no longer raise, now matching the behavior of a flat dictionary (GH27660)

  • DataFrame.to_csv() and Series.to_csv() now support dicts as compression argument with key 'method' being the compression method and others as additional compression options when the compression method is 'zip'. (GH26023)

  • Bug in Series.diff() where a boolean series would incorrectly raise a TypeError (GH17294)

  • Series.append() will no longer raise a TypeError when passed a tuple of Series (GH28410)

  • Fix corrupted error message when calling pandas._libs.json.encode() on a 0d array (GH18878)

  • Backtick quoting in DataFrame.query() and DataFrame.eval() can now also be used to refer to invalid identifiers, such as names that start with a digit, are Python keywords, or contain single-character operators. (GH27017)

  • Bug in pd.core.util.hashing.hash_pandas_object where arrays containing tuples were incorrectly treated as non-hashable (GH28969)

  • Bug in DataFrame.append() that raised IndexError when appending with empty list (GH28769)

  • Fix AbstractHolidayCalendar to return correct results for years after 2030 (now goes up to 2200) (GH27790)

  • Fixed IntegerArray returning inf rather than NaN for operations dividing by 0 (GH27398)

  • Fixed pow operations for IntegerArray when the other value is 0 or 1 (GH29997)

  • Bug in Series.count() raising if use_inf_as_na is enabled (GH29478)

  • Bug in Index where a non-hashable name could be set without raising TypeError (GH29069)

  • Bug in DataFrame constructor when passing a 2D ndarray and an extension dtype (GH12513)

  • Bug in DataFrame.to_csv() where the na_rep was truncated to 2 characters when supplied a series with dtype="string" and a na_rep (GH29975)

  • Bug where DataFrame.itertuples() would incorrectly determine whether or not namedtuples could be used for dataframes of 255 columns (GH28282)

  • Handle nested NumPy object arrays in testing.assert_series_equal() for ExtensionArray implementations (GH30841)

  • Bug in Index constructor incorrectly allowing 2-dimensional input arrays (GH13601, GH27125)
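Several of the items above describe user-facing behavior that can be exercised directly: backtick quoting in query() and eval(), dict-valued compression in to_csv(), and option validation in set_option(). The sketch below assumes pandas 1.0 or later; the column names, file names, and the archive_name option value are illustrative choices.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Column names that are not valid Python identifiers: one starts with a
# digit, the other is a keyword. Backticks make them usable in expressions.
df = pd.DataFrame({"1st": [1, 2, 3], "class": [4, 5, 6]})
filtered = df.query("`1st` > 1 and `class` < 6")
summed = df.eval("`1st` + `class`")

# compression may be a dict: 'method' selects the compression method and,
# for 'zip', the remaining keys are passed on as additional options.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.zip")
    df.to_csv(
        path,
        index=False,
        compression={"method": "zip", "archive_name": "data.csv"},
    )
    with zipfile.ZipFile(path) as zf:
        archive_members = zf.namelist()

# Display options only accept None or a suitable int; a negative value raises.
try:
    pd.set_option("display.max_rows", -5)
    option_error = None
except ValueError as err:
    option_error = err

print(filtered.index.tolist(), summed.tolist())
print(archive_members, type(option_error).__name__)
```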

Contributors#

A total of 308 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.

  • Aaditya Panikath +

  • Abdullah İhsan Seçer

  • Abhijeet Krishnan +

  • Adam J. Stewart

  • Adam Klaum +

  • Addison Lynch

  • Aivengoe +

  • Alastair James +

  • Albert Villanova del Moral

  • Alex Kirko +

  • Alfredo Granja +

  • Allen Downey

  • Alp Arıbal +

  • Andreas Buhr +

  • Andrew Munch +

  • Andy

  • Angela Ambroz +

  • Aniruddha Bhattacharjee +

  • Ankit Dhankhar +

  • Antonio Andraues Jr +

  • Arda Kosar +

  • Asish Mahapatra +

  • Austin Hackett +

  • Avi Kelman +

  • AyowoleT +

  • Bas Nijholt +

  • Ben Thayer

  • Bharat Raghunathan

  • Bhavani Ravi

  • Bhuvana KA +

  • Big Head

  • Blake Hawkins +

  • Bobae Kim +

  • Brett Naul

  • Brian Wignall

  • Bruno P. Kinoshita +

  • Bryant Moscon +

  • Cesar H +

  • Chris Stadler

  • Chris Zimmerman +

  • Christopher Whelan

  • Clemens Brunner

  • Clemens Tolboom +

  • Connor Charles +

  • Daniel Hähnke +

  • Daniel Saxton

  • Darin Plutchok +

  • Dave Hughes

  • David Stansby

  • DavidRosen +

  • Dean +

  • Deepan Das +

  • Deepyaman Datta

  • DorAmram +

  • Dorothy Kabarozi +

  • Drew Heenan +

  • Eliza Mae Saret +

  • Elle +

  • Endre Mark Borza +

  • Eric Brassell +

  • Eric Wong +

  • Eunseop Jeong +

  • Eyden Villanueva +

  • Felix Divo

  • ForTimeBeing +

  • Francesco Truzzi +

  • Gabriel Corona +

  • Gabriel Monteiro +

  • Galuh Sahid +

  • Georgi Baychev +

  • Gina

  • GiuPassarelli +

  • Grigorios Giannakopoulos +

  • Guilherme Leite +

  • Guilherme Salomé +

  • Gyeongjae Choi +

  • Harshavardhan Bachina +

  • Harutaka Kawamura +

  • Hassan Kibirige

  • Hielke Walinga

  • Hubert

  • Hugh Kelley +

  • Ian Eaves +

  • Ignacio Santolin +

  • Igor Filippov +

  • Irv Lustig

  • Isaac Virshup +

  • Ivan Bessarabov +

  • JMBurley +

  • Jack Bicknell +

  • Jacob Buckheit +

  • Jan Koch

  • Jan Pipek +

  • Jan Škoda +

  • Jan-Philip Gehrcke

  • Jasper J.F. van den Bosch +

  • Javad +

  • Jeff Reback

  • Jeremy Schendel

  • Jeroen Kant +

  • Jesse Pardue +

  • Jethro Cao +

  • Jiang Yue

  • Jiaxiang +

  • Jihyung Moon +

  • Jimmy Callin

  • Jinyang Zhou +

  • Joao Victor Martinelli +

  • Joaq Almirante +

  • John G Evans +

  • John Ward +

  • Jonathan Larkin +

  • Joris Van den Bossche

  • Josh Dimarsky +

  • Joshua Smith +

  • Josiah Baker +

  • Julia Signell +

  • Jung Dong Ho +

  • Justin Cole +

  • Justin Zheng

  • Kaiqi Dong

  • Karthigeyan +

  • Katherine Younglove +

  • Katrin Leinweber

  • Kee Chong Tan +

  • Keith Kraus +

  • Kevin Nguyen +

  • Kevin Sheppard

  • Kisekka David +

  • Koushik +

  • Kyle Boone +

  • Kyle McCahill +

  • Laura Collard, PhD +

  • LiuSeeker +

  • Louis Huynh +

  • Lucas Scarlato Astur +

  • Luiz Gustavo +

  • Luke +

  • Luke Shepard +

  • MKhalusova +

  • Mabel Villalba

  • Maciej J +

  • Mak Sze Chun

  • Manu NALEPA +

  • Marc

  • Marc Garcia

  • Marco Gorelli +

  • Marco Neumann +

  • Martin Winkel +

  • Martina G. Vilas +

  • Mateusz +

  • Matthew Roeschke

  • Matthew Tan +

  • Max Bolingbroke

  • Max Chen +

  • MeeseeksMachine

  • Miguel +

  • MinGyo Jung +

  • Mohamed Amine ZGHAL +

  • Mohit Anand +

  • MomIsBestFriend +

  • Naomi Bonnin +

  • Nathan Abel +

  • Nico Cernek +

  • Nigel Markey +

  • Noritada Kobayashi +

  • Oktay Sabak +

  • Oliver Hofkens +

  • Oluokun Adedayo +

  • Osman +

  • Oğuzhan Öğreden +

  • Pandas Development Team +

  • Patrik Hlobil +

  • Paul Lee +

  • Paul Siegel +

  • Petr Baev +

  • Pietro Battiston

  • Prakhar Pandey +

  • Puneeth K +

  • Raghav +

  • Rajat +

  • Rajhans Jadhao +

  • Rajiv Bharadwaj +

  • Rik-de-Kort +

  • Roei.r

  • Rohit Sanjay +

  • Ronan Lamy +

  • Roshni +

  • Roymprog +

  • Rushabh Vasani +

  • Ryan Grout +

  • Ryan Nazareth

  • Samesh Lakhotia +

  • Samuel Sinayoko

  • Samyak Jain +

  • Sarah Donehower +

  • Sarah Masud +

  • Saul Shanabrook +

  • Scott Cole +

  • SdgJlbl +

  • Seb +

  • Sergei Ivko +

  • Shadi Akiki

  • Shorokhov Sergey

  • Siddhesh Poyarekar +

  • Sidharthan Nair +

  • Simon Gibbons

  • Simon Hawkins

  • Simon-Martin Schröder +

  • Sofiane Mahiou +

  • Sourav kumar +

  • Souvik Mandal +

  • Soyoun Kim +

  • Sparkle Russell-Puleri +

  • Srinivas Reddy Thatiparthy (శ్రీనివాస్ రెడ్డి తాటిపర్తి)

  • Stuart Berg +

  • Sumanau Sareen

  • Szymon Bednarek +

  • Tambe Tabitha Achere +

  • Tan Tran

  • Tang Heyi +

  • Tanmay Daripa +

  • Tanya Jain

  • Terji Petersen

  • Thomas Li +

  • Tirth Jain +

  • Tola A +

  • Tom Augspurger

  • Tommy Lynch +

  • Tomoyuki Suzuki +

  • Tony Lorenzo

  • Unprocessable +

  • Uwe L. Korn

  • Vaibhav Vishal

  • Victoria Zdanovskaya +

  • Vijayant +

  • Vishwak Srinivasan +

  • WANG Aiyong

  • Wenhuan

  • Wes McKinney

  • Will Ayd

  • Will Holmgren

  • William Ayd

  • William Blan +

  • Wouter Overmeire

  • Wuraola Oyewusi +

  • YaOzI +

  • Yash Shukla +

  • Yu Wang +

  • Yusei Tahara +

  • alexander135 +

  • alimcmaster1

  • avelineg +

  • bganglia +

  • bolkedebruin

  • bravech +

  • chinhwee +

  • cruzzoe +

  • dalgarno +

  • daniellebrown +

  • danielplawrence

  • est271 +

  • francisco souza +

  • ganevgv +

  • garanews +

  • gfyoung

  • h-vetinari

  • hasnain2808 +

  • ianzur +

  • jalbritt +

  • jbrockmendel

  • jeschwar +

  • jlamborn324 +

  • joy-rosie +

  • kernc

  • killerontherun1

  • krey +

  • lexy-lixinyu +

  • lucyleeow +

  • lukasbk +

  • maheshbapatu +

  • mck619 +

  • nathalier

  • naveenkaushik2504 +

  • nlepleux +

  • nrebena

  • ohad83 +

  • pilkibun

  • pqzx +

  • proost +

  • pv8493013j +

  • qudade +

  • rhstanton +

  • rmunjal29 +

  • sangarshanan +

  • sardonick +

  • saskakarsi +

  • shaido987 +

  • ssikdar1

  • steveayers124 +

  • tadashigaki +

  • timcera +

  • tlaytongoogle +

  • tobycheese

  • tonywu1999 +

  • tsvikas +

  • yogendrasoni +

  • zys5945 +