What’s new in 0.25.0 (July 18, 2019)#
Warning
Starting with the 0.25.x series of releases, pandas only supports Python 3.5.3 and higher. See Dropping Python 2.7 for more details.
Warning
The minimum supported Python version will be bumped to 3.6 in a future release.
Warning
Panel has been fully removed. For N-D labeled data structures, please
use xarray instead.
Warning
read_pickle() and read_msgpack() are only guaranteed backwards compatible back to
pandas version 0.20.3 (GH 27082)
These are the changes in pandas 0.25.0. See Release notes for a full changelog including other versions of pandas.
Enhancements#
GroupBy aggregation with relabeling#
pandas has added special groupby behavior, known as “named aggregation”, for naming the output columns when applying multiple aggregation functions to specific columns (GH 18366, GH 26512).
In [1]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
   ...:                         'height': [9.1, 6.0, 9.5, 34.0],
   ...:                         'weight': [7.9, 7.5, 9.9, 198.0]})
   ...: 
In [2]: animals
Out[2]: 
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0
[4 rows x 3 columns]
In [3]: animals.groupby("kind").agg(
   ...:     min_height=pd.NamedAgg(column='height', aggfunc='min'),
   ...:     max_height=pd.NamedAgg(column='height', aggfunc='max'),
   ...:     average_weight=pd.NamedAgg(column='weight', aggfunc="mean"),
   ...: )
   ...: 
Out[3]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
[2 rows x 3 columns]
Pass the desired column names as the **kwargs to .agg. The values of **kwargs
should be tuples where the first element is the column selection, and the second element is the
aggregation function to apply. pandas provides the pandas.NamedAgg namedtuple to make it clearer
what the arguments to the function are, but plain tuples are accepted as well.
In [4]: animals.groupby("kind").agg(
   ...:     min_height=('height', 'min'),
   ...:     max_height=('height', 'max'),
   ...:     average_weight=('weight', 'mean'),
   ...: )
   ...: 
Out[4]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
[2 rows x 3 columns]
Named aggregation is the recommended replacement for the deprecated “dict-of-dicts” approach to naming the output of column-specific aggregations (Deprecate groupby.agg() with a dictionary when renaming).
A similar approach is now available for Series groupby objects as well. Because there’s no need for column selection, the values can just be the functions to apply:
In [5]: animals.groupby("kind").height.agg(
   ...:     min_height="min",
   ...:     max_height="max",
   ...: )
   ...: 
Out[5]: 
      min_height  max_height
kind                        
cat          9.1         9.5
dog          6.0        34.0
[2 rows x 2 columns]
This type of aggregation is the recommended alternative to the deprecated behavior when passing a dict to a Series groupby aggregation (Deprecate groupby.agg() with a dictionary when renaming).
See Named aggregation for more.
GroupBy aggregation with multiple lambdas#
You can now provide multiple lambda functions to a list-like aggregation in
GroupBy.agg (GH 26430).
In [6]: animals.groupby('kind').height.agg([
   ...:     lambda x: x.iloc[0], lambda x: x.iloc[-1]
   ...: ])
   ...: 
Out[6]: 
      <lambda_0>  <lambda_1>
kind                        
cat          9.1         9.5
dog          6.0        34.0
[2 rows x 2 columns]
In [7]: animals.groupby('kind').agg([
   ...:     lambda x: x.iloc[0] - x.iloc[1],
   ...:     lambda x: x.iloc[0] + x.iloc[1]
   ...: ])
   ...: 
Out[7]: 
         height                weight           
     <lambda_0> <lambda_1> <lambda_0> <lambda_1>
kind                                            
cat        -0.4       18.6       -2.0       17.8
dog       -28.0       40.0     -190.5      205.5
[2 rows x 4 columns]
Previously, these raised a SpecificationError.
Better repr for MultiIndex#
Printing of MultiIndex instances now shows tuples of each row and ensures
that the tuple items are vertically aligned, so it’s now easier to understand
the structure of the MultiIndex. (GH 13480):
The repr now looks like this:
In [8]: pd.MultiIndex.from_product([['a', 'abc'], range(500)])
Out[8]: 
MultiIndex([(  'a',   0),
            (  'a',   1),
            (  'a',   2),
            (  'a',   3),
            (  'a',   4),
            (  'a',   5),
            (  'a',   6),
            (  'a',   7),
            (  'a',   8),
            (  'a',   9),
            ...
            ('abc', 490),
            ('abc', 491),
            ('abc', 492),
            ('abc', 493),
            ('abc', 494),
            ('abc', 495),
            ('abc', 496),
            ('abc', 497),
            ('abc', 498),
            ('abc', 499)],
           length=1000)
Previously, outputting a MultiIndex printed all the levels and
codes of the MultiIndex, which was visually unappealing and made
the output more difficult to navigate. For example (limiting the range to 4):
In [1]: pd.MultiIndex.from_product([['a', 'abc'], range(4)])
Out[1]: MultiIndex(levels=[['a', 'abc'], [0, 1, 2, 3]],
   ...:            codes=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3]])
In the new repr, all values will be shown, if the number of rows is smaller
than options.display.max_seq_items (default: 100 items). Horizontally,
the output will truncate, if it’s wider than options.display.width
(default: 80 characters).
Shorter truncated repr for Series and DataFrame#
Currently, the default display options of pandas ensure that when a Series
or DataFrame has more than 60 rows, its repr gets truncated to this maximum
of 60 rows (the display.max_rows option). However, this still gives
a repr that takes up a large part of the vertical screen estate. Therefore,
a new option display.min_rows is introduced with a default of 10 which
determines the number of rows showed in the truncated repr:
- For a small Series or DataFrame, up to max_rows rows are shown (default: 60).
- For a larger Series or DataFrame with a length above max_rows, only min_rows rows are shown (default: 10, i.e. the first and last 5 rows).
This dual option makes it possible to still see the full content of relatively small
objects (e.g. df.head(20) shows all 20 rows), while giving a brief repr
for large objects.
To restore the previous behaviour of a single threshold, set
pd.options.display.min_rows = None.
JSON normalize with max_level param support#
json_normalize() normalizes the provided input dict to all
nested levels. The new max_level parameter provides more control over
which level to end normalization (GH 23843):
For example:
from pandas.io.json import json_normalize
data = [{
    'CreatedBy': {'Name': 'User001'},
    'Lookup': {'TextField': 'Some text',
               'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
    'Image': {'a': 'b'}
}]
json_normalize(data, max_level=1)
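A sketch of the result for the data above; note that in later pandas versions the function is exposed at the top level as pd.json_normalize, which is used here:

```python
import pandas as pd

data = [{
    'CreatedBy': {'Name': 'User001'},
    'Lookup': {'TextField': 'Some text',
               'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
    'Image': {'a': 'b'}
}]

# max_level=1 flattens one level of nesting; deeper dicts are left intact
flat = pd.json_normalize(data, max_level=1)
assert sorted(flat.columns) == ['CreatedBy.Name', 'Image.a',
                                'Lookup.TextField', 'Lookup.UserField']

# The level-2 dict survives as a Python object in its column
assert flat.loc[0, 'Lookup.UserField'] == {'Id': 'ID001', 'Name': 'Name001'}
```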
Series.explode to split list-like values to rows#
Series and DataFrame have gained the Series.explode() and DataFrame.explode() methods to transform list-likes to individual rows. See the section on Exploding a list-like column in the docs for more information (GH 16538, GH 10511).
Here is a typical use case: you have comma-separated strings in a column.
In [9]: df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
   ...:                    {'var1': 'd,e,f', 'var2': 2}])
   ...: 
In [10]: df
Out[10]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2
[2 rows x 2 columns]
Creating a long-form DataFrame is now straightforward using chained operations:
In [11]: df.assign(var1=df.var1.str.split(',')).explode('var1')
Out[11]: 
  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2
[6 rows x 2 columns]
Other enhancements#
- DataFrame.plot() keywords logy, logx and loglog can now accept the value 'sym' for symlog scaling (GH 24867)
- Added support for ISO week year format (‘%G-%V-%u’) when parsing datetimes using to_datetime() (GH 16607)
- Indexing of DataFrame and Series now accepts zero-dimensional np.ndarray (GH 24919)
- Timestamp.replace() now supports the fold argument to disambiguate DST transition times (GH 25017)
- DataFrame.at_time() and Series.at_time() now support datetime.time objects with timezones (GH 24043)
- DataFrame.pivot_table() now accepts an observed parameter which is passed to underlying calls to DataFrame.groupby() to speed up grouping categorical data (GH 24923)
- Series.str has gained a Series.str.casefold() method that removes all case distinctions present in a string (GH 25405)
- DataFrame.set_index() now works for instances of abc.Iterator, provided their output is of the same length as the calling frame (GH 22484, GH 24984)
- DatetimeIndex.union() now supports the sort argument. The behavior of the sort parameter matches that of Index.union() (GH 24994)
- RangeIndex.union() now supports the sort argument. If sort=False, an unsorted Int64Index is always returned. sort=None is the default and returns a monotonically increasing RangeIndex if possible, or a sorted Int64Index if not (GH 24471)
- TimedeltaIndex.intersection() now also supports the sort keyword (GH 24471)
- DataFrame.rename() now supports the errors argument to raise errors when attempting to rename nonexistent keys (GH 13473)
- Added a Sparse accessor for working with a DataFrame whose values are sparse (GH 25681)
- RangeIndex has gained start, stop, and step attributes (GH 25710)
- datetime.timezone objects are now supported as arguments to timezone methods and constructors (GH 25065)
- DataFrame.query() and DataFrame.eval() now support quoting column names with backticks to refer to names with spaces (GH 6508)
- merge_asof() now gives a clearer error message when merge keys are categoricals that are not equal (GH 26136)
- Rolling() supports an exponential (or Poisson) window type (GH 21303)
- The error message for missing required imports now includes the original import error’s text (GH 23868)
- DatetimeIndex and TimedeltaIndex now have a mean method (GH 24757)
- DataFrame.describe() now formats integer percentiles without a decimal point (GH 26660)
- Added support for reading SPSS .sav files using read_spss() (GH 26537)
- Added a new option plotting.backend to select a plotting backend different than the existing matplotlib one. Use pandas.set_option('plotting.backend', '<backend-module>') where <backend-module> is a library implementing the pandas plotting API (GH 14130)
- pandas.offsets.BusinessHour supports multiple opening hours intervals (GH 15481)
- read_excel() can now use openpyxl to read Excel files via the engine='openpyxl' argument. This will become the default in a future release (GH 11499)
- pandas.io.excel.read_excel() supports reading OpenDocument tables. Specify engine='odf' to enable it. Consult the IO User Guide for more details (GH 9070)
- Interval, IntervalIndex, and IntervalArray have gained an is_empty attribute denoting whether the given interval(s) are empty (GH 27219)
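For instance, the new backtick quoting in DataFrame.query() lets you refer to columns whose names contain spaces (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({"column name": [1, 2, 3], "b": [4, 5, 6]})

# Backticks quote a column name that is not a valid Python identifier
result = df.query("`column name` > 1")
assert list(result["column name"]) == [2, 3]
```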
Backwards incompatible API changes#
Indexing with date strings with UTC offsets#
Indexing a DataFrame or Series with a DatetimeIndex with a
date string with a UTC offset would previously ignore the UTC offset. Now, the UTC offset
is respected in indexing. (GH 24076, GH 16785)
In [12]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))
In [13]: df
Out[13]: 
                           0
2019-01-01 00:00:00-08:00  0
[1 rows x 1 columns]
Previous behavior:
In [3]: df['2019-01-01 00:00:00+04:00':'2019-01-01 01:00:00+04:00']
Out[3]:
                           0
2019-01-01 00:00:00-08:00  0
New behavior:
In [14]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
Out[14]: 
                           0
2019-01-01 00:00:00-08:00  0
[1 rows x 1 columns]
MultiIndex constructed from levels and codes#
Constructing a MultiIndex with NaN levels or codes value < -1 was allowed previously.
Now, construction with codes value < -1 is not allowed and NaN levels’ corresponding codes
would be reassigned as -1. (GH 19387)
Previous behavior:
In [1]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
   ...:               codes=[[0, -1, 1, 2, 3, 4]])
   ...:
Out[1]: MultiIndex(levels=[[nan, None, NaT, 128, 2]],
                   codes=[[0, -1, 1, 2, 3, 4]])
In [2]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
Out[2]: MultiIndex(levels=[[1, 2]],
                   codes=[[0, -2]])
New behavior:
In [15]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
   ....:               codes=[[0, -1, 1, 2, 3, 4]])
   ....: 
Out[15]: 
MultiIndex([(nan,),
            (nan,),
            (nan,),
            (nan,),
            (128,),
            (  2,)],
           )
In [16]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[16], line 1
----> 1 pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
File ~/work/pandas/pandas/pandas/core/indexes/multi.py:365, in MultiIndex.__new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity)
    362     result.sortorder = sortorder
    364 if verify_integrity:
--> 365     new_codes = result._verify_integrity()
    366     result._codes = new_codes
    368 result._reset_identity()
File ~/work/pandas/pandas/pandas/core/indexes/multi.py:452, in MultiIndex._verify_integrity(self, codes, levels, levels_to_verify)
    446     raise ValueError(
    447         f"On level {i}, code max ({level_codes.max()}) >= length of "
    448         f"level ({len(level)}). NOTE: this index is in an "
    449         "inconsistent state"
    450     )
    451 if len(level_codes) and level_codes.min() < -1:
--> 452     raise ValueError(f"On level {i}, code value ({level_codes.min()}) < -1")
    453 if not level.is_unique:
    454     raise ValueError(
    455         f"Level values must be unique: {list(level)} on level {i}"
    456     )
ValueError: On level 0, code value (-2) < -1
GroupBy.apply on DataFrame evaluates first group only once#
The implementation of DataFrameGroupBy.apply()
previously evaluated the supplied function consistently twice on the first group
to infer if it is safe to use a fast code path. Particularly for functions with
side effects, this was an undesired behavior and may have led to surprises. (GH 2936, GH 2656, GH 7739, GH 10519, GH 12155, GH 20084, GH 21417)
Now every group is evaluated only a single time.
In [17]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
In [18]: df
Out[18]: 
   a  b
0  x  1
1  y  2
[2 rows x 2 columns]
In [19]: def func(group):
   ....:     print(group.name)
   ....:     return group
   ....: 
Previous behavior:
In [3]: df.groupby('a').apply(func)
x
x
y
Out[3]:
   a  b
0  x  1
1  y  2
New behavior:
In [3]: df.groupby('a').apply(func)
x
y
Out[3]:
   a  b
0  x  1
1  y  2
Concatenating sparse values#
When passed DataFrames whose values are sparse, concat() will now return a
Series or DataFrame with sparse values, rather than a SparseDataFrame (GH 25702).
In [20]: df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 1])})
Previous behavior:
In [2]: type(pd.concat([df, df]))
pandas.core.sparse.frame.SparseDataFrame
New behavior:
In [21]: type(pd.concat([df, df]))
Out[21]: pandas.core.frame.DataFrame
This now matches the existing behavior of concat on Series with sparse values.
concat() will continue to return a SparseDataFrame when all the values
are instances of SparseDataFrame.
This change also affects routines using concat() internally, like get_dummies(),
which now returns a DataFrame in all cases (previously a SparseDataFrame was
returned if all the columns were dummy encoded, and a DataFrame otherwise).
Providing any SparseSeries or SparseDataFrame to concat() will
cause a SparseSeries or SparseDataFrame to be returned, as before.
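The get_dummies() change can be seen directly (a small sketch):

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a"]})

# With sparse=True every column is dummy encoded; the result is now a
# plain DataFrame holding sparse values, not a SparseDataFrame
dummies = pd.get_dummies(df, sparse=True)
assert type(dummies) is pd.DataFrame
assert all(isinstance(dtype, pd.SparseDtype) for dtype in dummies.dtypes)
```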
The .str-accessor performs stricter type checks#
Due to the lack of more fine-grained dtypes, Series.str so far only checked whether the data was
of object dtype. Series.str will now infer the dtype of the data within the Series; in particular,
'bytes'-only data will raise an exception (except for Series.str.decode(), Series.str.get(),
Series.str.len(), Series.str.slice()), see GH 23163, GH 23011, GH 23551.
Previous behavior:
In [1]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)
In [2]: s
Out[2]:
0      b'a'
1     b'ba'
2    b'cba'
dtype: object
In [3]: s.str.startswith(b'a')
Out[3]:
0     True
1    False
2    False
dtype: bool
New behavior:
In [22]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)
In [23]: s
Out[23]: 
0      b'a'
1     b'ba'
2    b'cba'
Length: 3, dtype: object
In [24]: s.str.startswith(b'a')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[24], line 1
----> 1 s.str.startswith(b'a')
File ~/work/pandas/pandas/pandas/core/strings/accessor.py:139, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
    134 if self._inferred_dtype not in allowed_types:
    135     msg = (
    136         f"Cannot use .str.{func_name} with values of "
    137         f"inferred dtype '{self._inferred_dtype}'."
    138     )
--> 139     raise TypeError(msg)
    140 return func(self, *args, **kwargs)
TypeError: Cannot use .str.startswith with values of inferred dtype 'bytes'.
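Since Series.str.decode() is one of the methods still allowed on bytes data, decoding first is a straightforward way to keep using the string methods (a sketch):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

# .str.decode is exempt from the stricter checks, so decode the bytes to
# str first, then apply the usual string methods
decoded = s.str.decode('ascii')
assert list(decoded.str.startswith('a')) == [True, False, False]
```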
Categorical dtypes are preserved during GroupBy#
Previously, columns that were categorical, but not the groupby key(s) would be converted to object dtype during groupby operations. pandas now will preserve these dtypes. (GH 18502)
In [25]: cat = pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)
In [26]: df = pd.DataFrame({'payload': [-1, -2, -1, -2], 'col': cat})
In [27]: df
Out[27]: 
   payload  col
0       -1  foo
1       -2  bar
2       -1  bar
3       -2  qux
[4 rows x 2 columns]
In [28]: df.dtypes
Out[28]: 
payload       int64
col        category
Length: 2, dtype: object
Previous Behavior:
In [5]: df.groupby('payload').first().col.dtype
Out[5]: dtype('O')
New Behavior:
In [29]: df.groupby('payload').first().col.dtype
Out[29]: CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True, categories_dtype=object)
Incompatible Index type unions#
When performing Index.union() operations between objects of incompatible dtypes,
the result will be a base Index of dtype object. This behavior holds true for
unions between Index objects that previously would have been prohibited. The dtype
of empty Index objects will now be evaluated before performing union operations
rather than simply returning the other Index object. Index.union() can now be
considered commutative, such that A.union(B) == B.union(A) (GH 23525).
Previous behavior:
In [1]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
...
ValueError: can only call with other PeriodIndex-ed objects
In [2]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[2]: Int64Index([1, 2, 3], dtype='int64')
New behavior:
In [3]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
Out[3]: Index([1991-09-05, 1991-09-06, 1, 2, 3], dtype='object')
In [4]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[4]: Index([1, 2, 3], dtype='object')
Note that integer- and floating-dtype indexes are considered “compatible”. The integer values are coerced to floating point, which may result in loss of precision. See Set operations on Index objects for more.
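The integer/floating coercion mentioned above, as a quick sketch:

```python
import pandas as pd

# Integer and floating indexes are "compatible": the union coerces the
# integer values to floating point rather than falling back to object dtype
result = pd.Index([1, 2, 3]).union(pd.Index([0.5, 1.5]))
assert result.dtype == 'float64'
assert list(result) == [0.5, 1.0, 1.5, 2.0, 3.0]
```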
DataFrame GroupBy ffill/bfill no longer return group labels#
The methods ffill, bfill, pad and backfill of
DataFrameGroupBy
previously included the group labels in the return value, which was
inconsistent with other groupby transforms. Now only the filled values
are returned. (GH 21521)
In [30]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
In [31]: df
Out[31]: 
   a  b
0  x  1
1  y  2
[2 rows x 2 columns]
Previous behavior:
In [3]: df.groupby("a").ffill()
Out[3]:
   a  b
0  x  1
1  y  2
New behavior:
In [32]: df.groupby("a").ffill()
Out[32]: 
   b
0  1
1  2
[2 rows x 1 columns]
DataFrame describe on an empty Categorical / object column will return top and freq#
When calling DataFrame.describe() with an empty categorical / object
column, the ‘top’ and ‘freq’ columns were previously omitted, which was inconsistent with
the output for non-empty columns. Now the ‘top’ and ‘freq’ columns will always be included,
with numpy.nan in the case of an empty DataFrame (GH 26397)
In [33]: df = pd.DataFrame({"empty_col": pd.Categorical([])})
In [34]: df
Out[34]: 
Empty DataFrame
Columns: [empty_col]
Index: []
[0 rows x 1 columns]
Previous behavior:
In [3]: df.describe()
Out[3]:
        empty_col
count           0
unique          0
New behavior:
In [35]: df.describe()
Out[35]: 
       empty_col
count          0
unique         0
top          NaN
freq         NaN
[4 rows x 1 columns]
__str__ methods now call __repr__ rather than vice versa#
pandas has until now mostly defined string representations in a pandas object’s
__str__/__unicode__/__bytes__ methods, and called __str__ from the __repr__
method, if a specific __repr__ method is not found. This is not needed for Python 3.
In pandas 0.25, the string representations of pandas objects are now generally
defined in __repr__, and calls to __str__ in general now pass the call on to
__repr__, if a specific __str__ method doesn’t exist, as is standard for Python.
This change is backward compatible for direct usage of pandas, but if you subclass
pandas objects and give your subclasses specific __str__/__repr__ methods,
you may have to adjust your __str__/__repr__ methods (GH 26495).
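For subclasses this means a single __repr__ is now enough; str() reaches it via the normal Python fallback. An illustrative sketch (MySeries is a hypothetical name):

```python
import pandas as pd

class MySeries(pd.Series):
    # Define only __repr__: str() falls back to it by the standard
    # Python rules, so no separate __str__ is needed
    def __repr__(self):
        return f"MySeries of length {len(self)}"

s = MySeries([1, 2, 3])
assert repr(s) == "MySeries of length 3"
assert str(s) == "MySeries of length 3"
```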
Indexing an IntervalIndex with Interval objects#
Indexing methods for IntervalIndex have been modified to require exact matches only for Interval queries.
IntervalIndex methods previously matched on any overlapping Interval.  Behavior with scalar points, e.g. querying
with an integer, is unchanged (GH 16316).
In [36]: ii = pd.IntervalIndex.from_tuples([(0, 4), (1, 5), (5, 8)])
In [37]: ii
Out[37]: IntervalIndex([(0, 4], (1, 5], (5, 8]], dtype='interval[int64, right]')
The in operator (__contains__) now only returns True for exact matches to Intervals in the IntervalIndex, whereas
this would previously return True for any Interval overlapping an Interval in the IntervalIndex.
Previous behavior:
In [4]: pd.Interval(1, 2, closed='neither') in ii
Out[4]: True
In [5]: pd.Interval(-10, 10, closed='both') in ii
Out[5]: True
New behavior:
In [38]: pd.Interval(1, 2, closed='neither') in ii
Out[38]: False
In [39]: pd.Interval(-10, 10, closed='both') in ii
Out[39]: False
The get_loc() method now only returns locations for exact matches to Interval queries, as opposed to the previous behavior of
returning locations for overlapping matches.  A KeyError will be raised if an exact match is not found.
Previous behavior:
In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: array([0, 1])
In [7]: ii.get_loc(pd.Interval(2, 6))
Out[7]: array([0, 1, 2])
New behavior:
In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: 1
In [7]: ii.get_loc(pd.Interval(2, 6))
---------------------------------------------------------------------------
KeyError: Interval(2, 6, closed='right')
Likewise, get_indexer() and get_indexer_non_unique() will also only return locations for exact matches
to Interval queries, with -1 denoting that an exact match was not found.
These indexing changes extend to querying a Series or DataFrame with an IntervalIndex index.
In [40]: s = pd.Series(list('abc'), index=ii)
In [41]: s
Out[41]: 
(0, 4]    a
(1, 5]    b
(5, 8]    c
Length: 3, dtype: object
Selecting from a Series or DataFrame using [] (__getitem__) or loc now only returns exact matches for Interval queries.
Previous behavior:
In [8]: s[pd.Interval(1, 5)]
Out[8]:
(0, 4]    a
(1, 5]    b
dtype: object
In [9]: s.loc[pd.Interval(1, 5)]
Out[9]:
(0, 4]    a
(1, 5]    b
dtype: object
New behavior:
In [42]: s[pd.Interval(1, 5)]
Out[42]: 'b'
In [43]: s.loc[pd.Interval(1, 5)]
Out[43]: 'b'
Similarly, a KeyError will be raised for non-exact matches instead of returning overlapping matches.
Previous behavior:
In [9]: s[pd.Interval(2, 3)]
Out[9]:
(0, 4]    a
(1, 5]    b
dtype: object
In [10]: s.loc[pd.Interval(2, 3)]
Out[10]:
(0, 4]    a
(1, 5]    b
dtype: object
New behavior:
In [6]: s[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')
In [7]: s.loc[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')
The overlaps() method can be used to create a boolean indexer that replicates the
previous behavior of returning overlapping matches.
New behavior:
In [44]: idxr = s.index.overlaps(pd.Interval(2, 3))
In [45]: idxr
Out[45]: array([ True,  True, False])
In [46]: s[idxr]
Out[46]: 
(0, 4]    a
(1, 5]    b
Length: 2, dtype: object
In [47]: s.loc[idxr]
Out[47]: 
(0, 4]    a
(1, 5]    b
Length: 2, dtype: object
Binary ufuncs on Series now align#
Applying a binary ufunc like numpy.power() now aligns the inputs
when both are Series (GH 23293).
In [48]: s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
In [49]: s2 = pd.Series([3, 4, 5], index=['d', 'c', 'b'])
In [50]: s1
Out[50]: 
a    1
b    2
c    3
Length: 3, dtype: int64
In [51]: s2
Out[51]: 
d    3
c    4
b    5
Length: 3, dtype: int64
Previous behavior
In [5]: np.power(s1, s2)
Out[5]:
a      1
b     16
c    243
dtype: int64
New behavior
In [52]: np.power(s1, s2)
Out[52]: 
a     1.0
b    32.0
c    81.0
d     NaN
Length: 4, dtype: float64
This matches the behavior of other binary operations in pandas, like Series.add().
To retain the previous behavior, convert the other Series to an array before
applying the ufunc.
In [53]: np.power(s1, s2.array)
Out[53]: 
a      1
b     16
c    243
Length: 3, dtype: int64
Categorical.argsort now places missing values at the end#
Categorical.argsort() now places missing values at the end of the array, making it
consistent with NumPy and the rest of pandas (GH 21801).
In [54]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)
Previous behavior
In [2]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)
In [3]: cat.argsort()
Out[3]: array([1, 2, 0])
In [4]: cat[cat.argsort()]
Out[4]:
[NaN, a, b]
categories (2, object): [a < b]
New behavior
In [55]: cat.argsort()
Out[55]: array([2, 0, 1])
In [56]: cat[cat.argsort()]
Out[56]: 
['a', 'b', NaN]
Categories (2, object): ['a' < 'b']
Column order is preserved when passing a list of dicts to DataFrame#
Starting with Python 3.7 the key-order of dict is guaranteed. In practice, this has been true since
Python 3.6. The DataFrame constructor now treats a list of dicts in the same way as
it does a list of OrderedDict, i.e. preserving the order of the dicts.
This change applies only when pandas is running on Python>=3.6 (GH 27309).
In [57]: data = [
   ....:     {'name': 'Joe', 'state': 'NY', 'age': 18},
   ....:     {'name': 'Jane', 'state': 'KY', 'age': 19, 'hobby': 'Minecraft'},
   ....:     {'name': 'Jean', 'state': 'OK', 'age': 20, 'finances': 'good'}
   ....: ]
   ....: 
Previous Behavior:
The columns were lexicographically sorted previously:
In [1]: pd.DataFrame(data)
Out[1]:
   age finances      hobby  name state
0   18      NaN        NaN   Joe    NY
1   19      NaN  Minecraft  Jane    KY
2   20     good        NaN  Jean    OK
New Behavior:
The column order now matches the insertion-order of the keys in the dict,
considering all the records from top to bottom. As a consequence, the column
order of the resulting DataFrame has changed compared to previous pandas versions.
In [58]: pd.DataFrame(data)
Out[58]: 
   name state  age      hobby finances
0   Joe    NY   18        NaN      NaN
1  Jane    KY   19  Minecraft      NaN
2  Jean    OK   20        NaN     good
[3 rows x 5 columns]
Increased minimum versions for dependencies#
Due to dropping support for Python 2.7, a number of optional dependencies have updated minimum versions (GH 25725, GH 24942, GH 25752). Independently, some minimum supported versions of dependencies were updated (GH 23519, GH 25554). If installed, we now require:
| Package | Minimum Version | Required | 
|---|---|---|
| numpy | 1.13.3 | X | 
| pytz | 2015.4 | X | 
| python-dateutil | 2.6.1 | X | 
| bottleneck | 1.2.1 | |
| numexpr | 2.6.2 | |
| pytest (dev) | 4.0.2 | |
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
| Package | Minimum Version | 
|---|---|
| beautifulsoup4 | 4.6.0 | 
| fastparquet | 0.2.1 | 
| gcsfs | 0.2.2 | 
| lxml | 3.8.0 | 
| matplotlib | 2.2.2 | 
| openpyxl | 2.4.8 | 
| pyarrow | 0.9.0 | 
| pymysql | 0.7.1 | 
| pytables | 3.4.2 | 
| scipy | 0.19.0 | 
| sqlalchemy | 1.1.4 | 
| xarray | 0.8.2 | 
| xlrd | 1.1.0 | 
| xlsxwriter | 0.9.8 | 
| xlwt | 1.2.0 | 
See Dependencies and Optional dependencies for more.
Other API changes#
- DatetimeTZDtype will now standardize pytz timezones to a common timezone instance (GH 24713)
- Timestamp and Timedelta scalars now implement the to_numpy() method as aliases to Timestamp.to_datetime64() and Timedelta.to_timedelta64(), respectively (GH 24653)
- Timestamp.strptime() will now raise a NotImplementedError (GH 25016)
- Comparing Timestamp with unsupported objects now returns NotImplemented instead of raising TypeError. This implies that unsupported rich comparisons are delegated to the other object, and are now consistent with Python 3 behavior for datetime objects (GH 24011)
- Bug in DatetimeIndex.snap() which didn’t preserve the name of the input Index (GH 25575)
- The arg argument in DataFrameGroupBy.agg() has been renamed to func (GH 26089)
- The arg argument in Window.aggregate() has been renamed to func (GH 26372)
- Most pandas classes had a __bytes__ method, which was used for getting a python2-style bytestring representation of the object. This method has been removed as part of dropping Python 2 (GH 26447)
- The .str accessor has been disabled for 1-level MultiIndex; use MultiIndex.to_flat_index() if necessary (GH 23679)
- Removed support of the gtk package for clipboards (GH 26563)
- Using an unsupported version of Beautiful Soup 4 will now raise an ImportError instead of a ValueError (GH 27063)
- Series.to_excel() and DataFrame.to_excel() will now raise a ValueError when saving timezone-aware data (GH 27008, GH 7056)
- ExtensionArray.argsort() places NA values at the end of the sorted array (GH 21801)
- DataFrame.to_hdf() and Series.to_hdf() will now raise a NotImplementedError when saving a MultiIndex with extension data types to a fixed format (GH 7775)
- Passing duplicate names in read_csv() will now raise a ValueError (GH 17346)
Deprecations#
Sparse subclasses#
The SparseSeries and SparseDataFrame subclasses are deprecated. Their functionality is better provided
by a Series or DataFrame with sparse values.
Previous way
df = pd.SparseDataFrame({"A": [0, 0, 1, 2]})
df.dtypes
New way
In [59]: df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 1, 2])})
In [60]: df.dtypes
Out[60]: 
A    Sparse[int64, 0]
Length: 1, dtype: object
The memory usage of the two approaches is identical (GH 19239).
msgpack format#
The msgpack format is deprecated as of 0.25 and will be removed in a future version. It is recommended to use pyarrow for on-the-wire transmission of pandas objects. (GH 27084)
Other deprecations#
- The deprecated .ix[] indexer now raises a more visible FutureWarning instead of DeprecationWarning (GH 26438).
- Deprecated the M (months) and Y (year) units for the units parameter of pandas.to_timedelta(), pandas.Timedelta() and pandas.TimedeltaIndex() (GH 16344)
- pandas.concat() has deprecated the join_axes keyword. Instead, use DataFrame.reindex() or DataFrame.reindex_like() on the result or on the inputs (GH 21951)
- The SparseArray.values attribute is deprecated. You can use np.asarray(...) or the SparseArray.to_dense() method instead (GH 26421).
- The functions pandas.to_datetime() and pandas.to_timedelta() have deprecated the box keyword. Instead, use to_numpy(), Timestamp.to_datetime64() or Timedelta.to_timedelta64() (GH 24416).
- The DataFrame.compound() and Series.compound() methods are deprecated and will be removed in a future version (GH 26405).
- The internal attributes _start, _stop and _step of RangeIndex have been deprecated. Use the public attributes start, stop and step instead (GH 26581).
- The Series.ftype(), Series.ftypes() and DataFrame.ftypes() methods are deprecated and will be removed in a future version. Instead, use Series.dtype() and DataFrame.dtypes() (GH 26705).
- The Series.get_values(), DataFrame.get_values(), Index.get_values(), SparseArray.get_values() and Categorical.get_values() methods are deprecated. Use np.asarray(..) or to_numpy() instead (GH 19617).
- The 'outer' method on NumPy ufuncs, e.g. np.subtract.outer, has been deprecated on Series objects. Convert the input to an array with Series.array first (GH 27186)
- Timedelta.resolution() is deprecated and replaced with Timedelta.resolution_string(). In a future version, Timedelta.resolution() will be changed to behave like the standard library datetime.timedelta.resolution (GH 21344)
- read_table() has been undeprecated (GH 25220)
- Index.dtype_str is deprecated (GH 18262)
- Series.imag and Series.real are deprecated (GH 18262)
- Series.put() is deprecated (GH 18262)
- Index.item() and Series.item() are deprecated (GH 18262)
- The default value ordered=None in CategoricalDtype has been deprecated in favor of ordered=False. When converting between categorical types, ordered=True must be explicitly passed in order to be preserved (GH 26336)
- Index.contains() is deprecated. Use key in index (__contains__) instead (GH 17753).
- DataFrame.get_dtype_counts() is deprecated (GH 18262)
- Categorical.ravel() will return a Categorical instead of a np.ndarray (GH 27199)
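As a rough sketch of migrating away from two of the deprecations above (Index.contains() and the SparseArray.values attribute), assuming a current pandas install:

```python
import numpy as np
import pandas as pd

idx = pd.Index(["a", "b", "c"])

# Instead of the deprecated Index.contains(), use the `in` operator.
assert "a" in idx
assert "z" not in idx

# Instead of the deprecated SparseArray.values attribute,
# materialize the dense values explicitly.
sparse = pd.arrays.SparseArray([0, 0, 1, 2])
dense = np.asarray(sparse)
print(dense)  # [0 0 1 2]
```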
Removal of prior version deprecations/changes#
- Removed the previously deprecated sheetname keyword in read_excel() (GH 16442, GH 20938)
- Removed the previously deprecated TimeGrouper (GH 16942)
- Removed the previously deprecated parse_cols keyword in read_excel() (GH 16488)
- Removed the previously deprecated pd.options.html.border (GH 16970)
- Removed the previously deprecated convert_objects (GH 11221)
- Removed the previously deprecated select method of DataFrame and Series (GH 17633)
- Removed the previously deprecated behavior of Series treated as list-like in rename_categories() (GH 17982)
- Removed the previously deprecated DataFrame.reindex_axis and Series.reindex_axis (GH 17842)
- Removed the previously deprecated behavior of altering column or index labels with Series.rename_axis() or DataFrame.rename_axis() (GH 17842)
- Removed the previously deprecated tupleize_cols keyword argument in read_html(), read_csv() and DataFrame.to_csv() (GH 17877, GH 17820)
- Removed the previously deprecated DataFrame.from_csv and Series.from_csv (GH 17812)
- Removed the previously deprecated raise_on_error keyword argument in DataFrame.where() and DataFrame.mask() (GH 17744)
- Removed the previously deprecated ordered and categories keyword arguments in astype (GH 17742)
- Removed the previously deprecated cdate_range (GH 17691)
- Removed the previously deprecated True option for the dropna keyword argument in SeriesGroupBy.nth() (GH 17493)
- Removed the previously deprecated convert keyword argument in Series.take() and DataFrame.take() (GH 17352)
- Removed the previously deprecated behavior of arithmetic operations with datetime.date objects (GH 21152)
Performance improvements#
- Significant speedup in SparseArray initialization that benefits most operations, fixing a performance regression introduced in v0.20.0 (GH 24985)
- DataFrame.to_stata() is now faster when outputting data with any string or non-native endian columns (GH 25045)
- Improved performance of Series.searchsorted(). The speedup is especially large when the dtype is int8/int16/int32 and the searched key is within the integer bounds for the dtype (GH 22034)
- Improved performance of GroupBy.quantile() (GH 20405)
- Improved performance of slicing and other selected operations on a RangeIndex (GH 26565, GH 26617, GH 26722)
- RangeIndex now performs standard lookup without instantiating an actual hashtable, hence saving memory (GH 16685)
- Improved performance of read_csv() by faster tokenizing and faster parsing of small float numbers (GH 25784)
- Improved performance of read_csv() by faster parsing of N/A and boolean values (GH 25804)
- Improved performance of IntervalIndex.is_monotonic, IntervalIndex.is_monotonic_increasing and IntervalIndex.is_monotonic_decreasing by removing conversion to MultiIndex (GH 24813)
- Improved performance of DataFrame.to_csv() when writing datetime dtypes (GH 25708)
- Improved performance of read_csv() by much faster parsing of MM/YYYY and DD/MM/YYYY datetime formats (GH 25922)
- Improved performance of nanops for dtypes that cannot store NaNs. The speedup is particularly prominent for Series.all() and Series.any() (GH 25070)
- Improved performance of Series.map() for dictionary mappers on categorical series by mapping the categories instead of mapping all values (GH 23785)
- Improved performance of IntervalIndex.intersection() (GH 24813)
- Improved performance of read_csv() by faster concatenating of date columns without extra conversion to string for integer/float zero and float NaN, and by faster checking whether a string could possibly be a date (GH 25754)
- Improved performance of IntervalIndex.is_unique by removing conversion to MultiIndex (GH 24813)
- Restored performance of DatetimeIndex.__iter__() by re-enabling a specialized code path (GH 26702)
- Improved performance when building a MultiIndex with at least one CategoricalIndex level (GH 22044)
- Improved performance by removing the need for a garbage collection when checking for SettingWithCopyWarning (GH 27031)
- Changed the default value of the cache parameter of to_datetime() to True (GH 26043)
- Improved performance of DatetimeIndex and PeriodIndex slicing given non-unique, monotonic data (GH 27136).
- Improved performance of pd.read_json() for index-oriented data (GH 26773)
- Improved performance of MultiIndex.shape() (GH 27384).
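The cache change for to_datetime() can be seen in a small sketch; with repetitive inputs, each unique string is parsed only once (the timing benefit itself is not shown here):

```python
import pandas as pd

# cache=True (now the default) parses each unique date string once
# and reuses the result, which helps with large repetitive inputs.
dates = ["2019-07-18", "2019-07-18", "2019-07-19"] * 1000
parsed = pd.to_datetime(dates, cache=True)
print(parsed[0])  # 2019-07-18 00:00:00
```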
Bug fixes#
Categorical#
- Bug in DataFrame.at() and Series.at() that would raise an exception if the index was a CategoricalIndex (GH 20629)
- Fixed bug in comparison of ordered Categorical that contained missing values with a scalar which sometimes incorrectly resulted in True (GH 26504)
- Bug in DataFrame.dropna() when the DataFrame has a CategoricalIndex containing Interval objects incorrectly raised a TypeError (GH 25087)
Datetimelike#
- Bug in to_datetime() which would raise an (incorrect) ValueError when called with a date far into the future and the format argument specified, instead of raising OutOfBoundsDatetime (GH 23830)
- Bug in to_datetime() which would raise InvalidIndexError: Reindexing only valid with uniquely valued Index objects when called with cache=True, with arg including at least two different elements from the set {None, numpy.nan, pandas.NaT} (GH 22305)
- Bug in DataFrame and Series where timezone aware data with dtype='datetime64[ns]' was not cast to naive (GH 25843)
- Improved Timestamp type checking in various datetime functions to prevent exceptions when using a subclassed datetime (GH 25851)
- Bug in Series and DataFrame repr where np.datetime64('NaT') and np.timedelta64('NaT') with dtype=object would be represented as NaN (GH 25445)
- Bug in to_datetime() which did not replace the invalid argument with NaT when errors is set to coerce (GH 26122)
- Bug in adding DateOffset with a nonzero month to DatetimeIndex would raise ValueError (GH 26258)
- Bug in to_datetime() which raised an unhandled OverflowError when called with a mix of invalid dates and NaN values with format='%Y%m%d' and errors='coerce' (GH 25512)
- Bug in isin() for datetimelike indexes; DatetimeIndex, TimedeltaIndex and PeriodIndex where the levels parameter was ignored (GH 26675)
- Bug in to_datetime() which raises TypeError for format='%Y%m%d' when called for invalid integer dates with length >= 6 digits with errors='ignore'
- Bug when comparing a PeriodIndex against a zero-dimensional numpy array (GH 26689)
- Bug in constructing a Series or DataFrame from a numpy datetime64 array with a non-ns unit and out-of-bound timestamps generating rubbish data, which will now correctly raise an OutOfBoundsDatetime error (GH 26206).
- Bug in date_range() with an unnecessary OverflowError being raised for very large or very small dates (GH 26651)
- Bug where adding Timestamp to a np.timedelta64 object would raise instead of returning a Timestamp (GH 24775)
- Bug where comparing a zero-dimensional numpy array containing a np.datetime64 object to a Timestamp would incorrectly raise TypeError (GH 26916)
- Bug in to_datetime() which would raise ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True when called with cache=True, with arg including datetime strings with different offsets (GH 26097)
Timedelta#
- Bug in TimedeltaIndex.intersection() where for non-monotonic indices in some cases an empty Index was returned when in fact an intersection existed (GH 25913)
- Bug with comparisons between Timedelta and NaT raising TypeError (GH 26039)
- Bug when adding or subtracting a BusinessHour to a Timestamp with the resulting time landing in a following or prior day respectively (GH 26381)
- Bug when comparing a TimedeltaIndex against a zero-dimensional numpy array (GH 26689)
Timezones#
- Bug in DatetimeIndex.to_frame() where timezone aware data would be converted to timezone naive data (GH 25809)
- Bug in to_datetime() with utc=True and datetime strings that would apply previously parsed UTC offsets to subsequent arguments (GH 24992)
- Bug in Timestamp.tz_localize() and Timestamp.tz_convert() not propagating freq (GH 25241)
- Bug in Series.at() where setting a Timestamp with a timezone raises TypeError (GH 25506)
- Bug in DataFrame.update() when updating with timezone aware data would return timezone naive data (GH 25807)
- Bug in to_datetime() where an uninformative RuntimeError was raised when passing a naive Timestamp with datetime strings with mixed UTC offsets (GH 25978)
- Bug in to_datetime() with unit='ns' would drop timezone information from the parsed argument (GH 26168)
- Bug in DataFrame.join() where joining a timezone aware index with a timezone aware column would result in a column of NaN (GH 26335)
- Bug in date_range() where ambiguous or nonexistent start or end times were not handled by the ambiguous or nonexistent keywords respectively (GH 27088)
- Bug in DatetimeIndex.union() when combining a timezone aware and timezone unaware DatetimeIndex (GH 21671)
- Bug when applying a numpy reduction function (e.g. numpy.minimum()) to a timezone aware Series (GH 15552)
Numeric#
- Bug in to_numeric() in which large negative numbers were being improperly handled (GH 24910)
- Bug in to_numeric() in which numbers were being coerced to float, even though errors was not coerce (GH 24910)
- Bug in to_numeric() in which invalid values for errors were being allowed (GH 26466)
- Bug in format in which floating point complex numbers were not being formatted to proper display precision and trimming (GH 25514)
- Bug in error messages in DataFrame.corr() and Series.corr(). Added the possibility of using a callable (GH 25729)
- Bug in Series.divmod() and Series.rdivmod() which would raise an (incorrect) ValueError rather than return a pair of Series objects as result (GH 25557)
- Raises a helpful exception when a non-numeric index is sent to interpolate() with methods which require a numeric index (GH 21662)
- Bug in eval() when comparing floats with scalar operators, for example: x < -0.1 (GH 25928)
- Fixed bug where casting an all-boolean array to an integer extension array failed (GH 25211)
- Bug in divmod with a Series object containing zeros incorrectly raising AttributeError (GH 26987)
- Inconsistency in Series floor-division (//) and divmod filling positive//zero with NaN instead of Inf (GH 27321)
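The divmod and zero-division fixes above can be sketched as follows (behavior as of this release):

```python
import numpy as np
import pandas as pd

s = pd.Series([7, -7, 0])

# Series.divmod() returns a pair of Series (quotient, remainder)
# instead of raising.
quotient, remainder = s.divmod(3)
print(quotient.tolist())   # [2, -3, 0]
print(remainder.tolist())  # [1, 2, 0]

# Positive // zero now fills with inf rather than NaN.
result = pd.Series([1, -1]) // 0
assert np.isinf(result.iloc[0]) and result.iloc[0] > 0
```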
Conversion#
- Bug in DataFrame.astype() when passing a dict of columns and types where the errors parameter was ignored (GH 25905)
Strings#
- Bug in the __name__ attribute of several methods of Series.str, which were set incorrectly (GH 23551)
- Improved error message when passing a Series of wrong dtype to Series.str.cat() (GH 22722)
Interval#
- Construction of Interval is restricted to numeric, Timestamp and Timedelta endpoints (GH 23013)
- Fixed bug in Series/DataFrame not displaying NaN in IntervalIndex with missing values (GH 25984)
- Bug in IntervalIndex.get_loc() where a KeyError would be incorrectly raised for a decreasing IntervalIndex (GH 25860)
- Bug in Index constructor where passing mixed closed Interval objects would result in a ValueError instead of an object dtype Index (GH 27172)
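A minimal sketch of the endpoint restriction on Interval construction (error message elided):

```python
import pandas as pd

# Numeric, Timestamp and Timedelta endpoints are accepted.
iv = pd.Interval(0, 5, closed="right")
assert 3 in iv

# Other endpoint types are rejected at construction time.
try:
    pd.Interval("a", "z")
except ValueError as exc:
    print("rejected:", exc)
```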
Indexing#
- Improved exception message when calling DataFrame.iloc() with a list of non-numeric objects (GH 25753).
- Improved exception message when calling .iloc or .loc with a boolean indexer of a different length (GH 26658).
- Bug in KeyError exception message when indexing a MultiIndex with a non-existent key not displaying the original key (GH 27250).
- Bug in .iloc and .loc with a boolean indexer not raising an IndexError when too few items are passed (GH 26658).
- Bug in DataFrame.loc() and Series.loc() where KeyError was not raised for a MultiIndex when the key was less than or equal to the number of levels in the MultiIndex (GH 14885).
- Bug in which DataFrame.append() produced an erroneous warning indicating that a KeyError will be thrown in the future when the data to be appended contains new columns (GH 22252).
- Bug in which DataFrame.to_csv() caused a segfault for a reindexed data frame, when the indices were a single-level MultiIndex (GH 26303).
- Fixed bug where assigning an arrays.PandasArray to a DataFrame would raise an error (GH 26390)
- Allow keyword arguments for a callable local reference used in the DataFrame.query() string (GH 26426)
- Fixed a KeyError when indexing a MultiIndex level with a list containing exactly one label, which is missing (GH 27148)
- Bug which produced AttributeError on partial matching of Timestamp in a MultiIndex (GH 26944)
- Bug in Categorical and CategoricalIndex with Interval values when using the in operator (__contains__) with objects that are not comparable to the values in the Interval (GH 23705)
- Bug in DataFrame.loc() and DataFrame.iloc() on a DataFrame with a single timezone-aware datetime64[ns] column incorrectly returning a scalar instead of a Series (GH 27110)
- Bug in CategoricalIndex and Categorical incorrectly raising ValueError instead of TypeError when a list is passed using the in operator (__contains__) (GH 21729)
- Bug in setting a new value in a Series with a Timedelta object incorrectly casting the value to an integer (GH 22717)
- Bug in Series setting a new key (__setitem__) with a timezone-aware datetime incorrectly raising ValueError (GH 12862)
- Bug in DataFrame.iloc() when indexing with a read-only indexer (GH 17192)
- Bug in Series setting an existing tuple key (__setitem__) with timezone-aware datetime values incorrectly raising TypeError (GH 20441)
Missing#
- Fixed misleading exception message in Series.interpolate() if the argument order is required, but omitted (GH 10633, GH 24014).
- Fixed class type displayed in exception message in DataFrame.dropna() if an invalid axis parameter is passed (GH 25555)
- A ValueError will now be thrown by DataFrame.fillna() when limit is not a positive integer (GH 27042)
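The new limit validation in DataFrame.fillna() and Series.fillna() can be sketched as:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 3.0])

# With a positive integer limit, at most that many NaNs are filled.
print(s.fillna(0.0, limit=1).tolist())  # [0.0, nan, 3.0]

# A non-positive limit now raises a ValueError up front.
try:
    s.fillna(0.0, limit=0)
except ValueError as exc:
    print("rejected:", exc)
```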
MultiIndex#
- Bug in which an incorrect exception was raised by Timedelta when testing membership of a MultiIndex (GH 24570)
IO#
- Bug in DataFrame.to_html() where values were truncated using display options instead of outputting the full content (GH 17004)
- Fixed bug in missing text when using to_clipboard() if copying utf-16 characters in Python 3 on Windows (GH 25040)
- Bug in read_json() for orient='table' when it tries to infer dtypes by default, which is not applicable as dtypes are already defined in the JSON schema (GH 21345)
- Bug in read_json() for orient='table' and a float index, as it infers the index dtype by default, which is not applicable because the index dtype is already defined in the JSON schema (GH 25433)
- Bug in read_json() for orient='table' and strings of float column names, as it makes a column name type conversion to Timestamp, which is not applicable because column names are already defined in the JSON schema (GH 25435)
- Bug in json_normalize() for errors='ignore' where missing values in the input data were filled in the resulting DataFrame with the string "nan" instead of numpy.nan (GH 25468)
- DataFrame.to_html() now raises TypeError when using an invalid type for the classes parameter instead of AssertionError (GH 25608)
- Bug in DataFrame.to_string() and DataFrame.to_latex() that would lead to incorrect output when the header keyword is used (GH 16718)
- Bug in read_csv() not properly interpreting UTF-8 encoded filenames on Windows on Python 3.6+ (GH 15086)
- Improved performance in pandas.read_stata() and pandas.io.stata.StataReader when converting columns that have missing values (GH 25772)
- Bug in DataFrame.to_html() where header numbers would ignore display options when rounding (GH 17280)
- Bug in read_hdf() where reading a table from an HDF5 file written directly with PyTables fails with a ValueError when using a sub-selection via the start or stop arguments (GH 11188)
- Bug in read_hdf() not properly closing the store after a KeyError is raised (GH 25766)
- Improved the explanation for the failure when value labels are repeated in Stata dta files and suggested workarounds (GH 25772)
- Improved pandas.read_stata() and pandas.io.stata.StataReader to read incorrectly formatted 118 format files saved by Stata (GH 25960)
- Improved the col_space parameter in DataFrame.to_html() to accept a string so CSS length values can be set correctly (GH 25941)
- Fixed bug in loading objects from S3 that contain # characters in the URL (GH 25945)
- Added a use_bqstorage_api parameter to read_gbq() to speed up downloads of large data frames. This feature requires version 0.10.0 of the pandas-gbq library as well as the google-cloud-bigquery-storage and fastavro libraries (GH 26104)
- Fixed memory leak in DataFrame.to_json() when dealing with numeric data (GH 24889)
- Bug in read_json() where date strings with Z were not converted to a UTC timezone (GH 26168)
- Added a cache_dates=True parameter to read_csv(), which allows caching unique dates when they are parsed (GH 25990)
- DataFrame.to_excel() now raises a ValueError when the caller's dimensions exceed the limitations of Excel (GH 26051)
- Fixed bug in pandas.read_csv() where a BOM would result in incorrect parsing using engine='python' (GH 26545)
- read_excel() now raises a ValueError when input is of type pandas.io.excel.ExcelFile and the engine param is passed, since pandas.io.excel.ExcelFile has an engine defined (GH 26566)
- Bug while selecting from HDFStore with where='' specified (GH 26610).
- Fixed bug in DataFrame.to_excel() where custom objects (i.e. PeriodIndex) inside merged cells were not being converted into types safe for the Excel writer (GH 27006)
- Bug in read_hdf() where reading a timezone aware DatetimeIndex would raise a TypeError (GH 11926)
- Bug in to_msgpack() and read_msgpack() which would raise a ValueError rather than a FileNotFoundError for an invalid path (GH 27160)
- Fixed bug in DataFrame.to_parquet() which would raise a ValueError when the dataframe had no columns (GH 27339)
- Allow parsing of PeriodDtype columns when using read_csv() (GH 26934)
Plotting#
- Fixed bug where api.extensions.ExtensionArray could not be used in matplotlib plotting (GH 25587)
- Improved the error message in DataFrame.plot() if non-numerics are passed (GH 25481)
- Bug in incorrect ticklabel positions when plotting an index that is non-numeric / non-datetime (GH 7612, GH 15912, GH 22334)
- Fixed bug causing plots of PeriodIndex timeseries to fail if the frequency is a multiple of the frequency rule code (GH 14763)
- Fixed bug when plotting a DatetimeIndex with datetime.timezone.utc timezone (GH 17173)
GroupBy/resample/rolling#
- Bug in Resampler.agg() with a timezone aware index where OverflowError would raise when passing a list of functions (GH 22660)
- Bug in DataFrameGroupBy.nunique() in which the names of column levels were lost (GH 23222)
- Bug in GroupBy.agg() when applying an aggregation function to timezone aware data (GH 23683)
- Bug in GroupBy.first() and GroupBy.last() where timezone information would be dropped (GH 21603)
- Bug in GroupBy.size() when grouping only NA values (GH 23050)
- Bug in Series.groupby() where the observed kwarg was previously ignored (GH 24880)
- Bug in Series.groupby() where using groupby with a MultiIndex Series with a list of labels equal to the length of the series caused incorrect grouping (GH 25704)
- Ensured that ordering of outputs in groupby aggregation functions is consistent across all versions of Python (GH 25692)
- Ensured that result group order is correct when grouping on an ordered Categorical and specifying observed=True (GH 25871, GH 25167)
- Bug in Rolling.min() and Rolling.max() that caused a memory leak (GH 25893)
- Bug in Rolling.count() and Expanding.count() where the axis keyword was previously ignored (GH 13503)
- Bug in GroupBy.idxmax() and GroupBy.idxmin() with a datetime column would return an incorrect dtype (GH 25444, GH 15306)
- Bug in GroupBy.cumsum(), GroupBy.cumprod(), GroupBy.cummin() and GroupBy.cummax() with a categorical column having absent categories, would return incorrect result or segfault (GH 16771)
- Bug in GroupBy.nth() where NA values in the grouping would return incorrect results (GH 26011)
- Bug in SeriesGroupBy.transform() where transforming an empty group would raise a ValueError (GH 26208)
- Bug in DataFrame.groupby() where passing a Grouper would return incorrect groups when using the .groups accessor (GH 26326)
- Bug in GroupBy.agg() where incorrect results are returned for uint64 columns (GH 26310)
- Bug in Rolling.median() and Rolling.quantile() where MemoryError is raised with an empty window (GH 26005)
- Bug in Rolling.median() and Rolling.quantile() where incorrect results are returned with closed='left' and closed='neither' (GH 26005)
- Improved Rolling, Window and ExponentialMovingWindow functions to exclude nuisance columns from results instead of raising errors, and raise a DataError only if all columns are nuisance (GH 12537)
- Bug in Rolling.max() and Rolling.min() where incorrect results are returned with an empty variable window (GH 26005)
- Raise a helpful exception when an unsupported weighted window function is used as an argument of Window.aggregate() (GH 26597)
Reshaping#
- Bug in pandas.merge() that appended the string None to a column name when None was assigned in suffixes, instead of keeping the column name as-is (GH 24782).
- Bug in merge() when merging by index name would sometimes result in an incorrectly numbered index (missing index values are now assigned NA) (GH 24212, GH 25009)
- to_records() now accepts dtypes in its column_dtypes parameter (GH 24895)
- Bug in concat() where the order of an OrderedDict (and a dict in Python 3.6+) was not respected when passed in as the objs argument (GH 21510)
- Bug in pivot_table() where columns with NaN values are dropped even if the dropna argument is False, when the aggfunc argument contains a list (GH 22159)
- Bug in concat() where the resulting freq of two DatetimeIndex with the same freq would be dropped (GH 3232).
- Bug in merge() where merging with equivalent Categorical dtypes was raising an error (GH 22501)
- Bug in DataFrame instantiating with a dict of iterators or generators (e.g. pd.DataFrame({'A': reversed(range(3))})) raised an error (GH 26349).
- Bug in DataFrame instantiating with a range (e.g. pd.DataFrame(range(3))) raised an error (GH 26342).
- Bug in DataFrame constructor when passing non-empty tuples would cause a segmentation fault (GH 25691)
- Bug in Series.apply() failed when the series is a timezone aware DatetimeIndex (GH 25959)
- Bug in pandas.cut() where large bins could incorrectly raise an error due to an integer overflow (GH 26045)
- Bug in DataFrame.sort_index() where an error is thrown when a multi-indexed DataFrame is sorted on all levels with the initial level sorted last (GH 26053)
- Bug in Series.nlargest() treating True as smaller than False (GH 26154)
- Bug in DataFrame.pivot_table() with an IntervalIndex as pivot index would raise TypeError (GH 25814)
- Bug in which DataFrame.from_dict() ignored the order of an OrderedDict when orient='index' (GH 8425).
- Bug in DataFrame.transpose() where transposing a DataFrame with a timezone-aware datetime column would incorrectly raise ValueError (GH 26825)
- Bug in pivot_table() when pivoting a timezone aware column as the values would remove timezone information (GH 14948)
- Bug in merge_asof() when specifying multiple by columns where one is datetime64[ns, tz] dtype (GH 26649)
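The constructor fixes for iterators and ranges listed above can be sketched as:

```python
import pandas as pd

# A dict of iterators/generators is now accepted by the constructor.
df = pd.DataFrame({"A": reversed(range(3))})
print(df["A"].tolist())  # [2, 1, 0]

# So is a bare range.
df2 = pd.DataFrame(range(3))
print(df2.shape)  # (3, 1)
```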
Sparse#
- Significant speedup in SparseArray initialization that benefits most operations, fixing a performance regression introduced in v0.20.0 (GH 24985)
- Bug in the SparseFrame constructor where passing None as the data would cause default_fill_value to be ignored (GH 16807)
- Bug in SparseDataFrame when adding a column in which the length of values does not match the length of the index, an AssertionError is raised instead of a ValueError (GH 25484)
- Introduce a better error message in Series.sparse.from_coo() so it returns a TypeError for inputs that are not coo matrices (GH 26554)
- Bug in numpy.modf() on a SparseArray. Now a tuple of SparseArray is returned (GH 26946).
Build changes#
- Fix install error with PyPy on macOS (GH 26536) 
ExtensionArray#
- Bug in factorize() when passing an ExtensionArray with a custom na_sentinel (GH 25696).
- Bug in Series.count() miscounting NA values in ExtensionArrays (GH 26835)
- Added Series.__array_ufunc__ to better handle NumPy ufuncs applied to Series backed by extension arrays (GH 23293).
- The keyword argument deep has been removed from ExtensionArray.copy() (GH 27083)
Other#
- Removed unused C functions from the vendored UltraJSON implementation (GH 26198)
- Allow Index and RangeIndex to be passed to numpy min and max functions (GH 26125)
- Use actual class name in the repr of empty objects of a Series subclass (GH 27001).
- Bug in DataFrame where passing an object array of timezone-aware datetime objects would incorrectly raise ValueError (GH 13287)
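The numpy reduction support for indexes mentioned above, as a quick sketch:

```python
import numpy as np
import pandas as pd

idx = pd.Index([3, 1, 2])
rng = pd.RangeIndex(5)

# Index and RangeIndex can be passed to numpy min/max directly.
print(np.min(idx), np.max(idx))  # 1 3
print(np.min(rng), np.max(rng))  # 0 4
```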
Contributors#
A total of 231 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
- 1_x7 + 
- Abdullah İhsan Seçer + 
- Adam Bull + 
- Adam Hooper 
- Albert Villanova del Moral 
- Alex Watt + 
- AlexTereshenkov + 
- Alexander Buchkovsky 
- Alexander Hendorf + 
- Alexander Nordin + 
- Alexander Ponomaroff 
- Alexandre Batisse + 
- Alexandre Decan + 
- Allen Downey + 
- Alyssa Fu Ward + 
- Andrew Gaspari + 
- Andrew Wood + 
- Antoine Viscardi + 
- Antonio Gutierrez + 
- Arno Veenstra + 
- ArtinSarraf 
- Batalex + 
- Baurzhan Muftakhidinov 
- Benjamin Rowell 
- Bharat Raghunathan + 
- Bhavani Ravi + 
- Big Head + 
- Brett Randall + 
- Bryan Cutler + 
- C John Klehm + 
- Caleb Braun + 
- Cecilia + 
- Chris Bertinato + 
- Chris Stadler + 
- Christian Haege + 
- Christian Hudon 
- Christopher Whelan 
- Chuanzhu Xu + 
- Clemens Brunner 
- Damian Kula + 
- Daniel Hrisca + 
- Daniel Luis Costa + 
- Daniel Saxton 
- DanielFEvans + 
- David Liu + 
- Deepyaman Datta + 
- Denis Belavin + 
- Devin Petersohn + 
- Diane Trout + 
- EdAbati + 
- Enrico Rotundo + 
- EternalLearner42 + 
- Evan + 
- Evan Livelo + 
- Fabian Rost + 
- Flavien Lambert + 
- Florian Rathgeber + 
- Frank Hoang + 
- Gaibo Zhang + 
- Gioia Ballin 
- Giuseppe Romagnuolo + 
- Gordon Blackadder + 
- Gregory Rome + 
- Guillaume Gay 
- HHest + 
- Hielke Walinga + 
- How Si Wei + 
- Hubert 
- Huize Wang + 
- Hyukjin Kwon + 
- Ian Dunn + 
- Inevitable-Marzipan + 
- Irv Lustig 
- JElfner + 
- Jacob Bundgaard + 
- James Cobon-Kerr + 
- Jan-Philip Gehrcke + 
- Jarrod Millman + 
- Jayanth Katuri + 
- Jeff Reback 
- Jeremy Schendel 
- Jiang Yue + 
- Joel Ostblom 
- Johan von Forstner + 
- Johnny Chiu + 
- Jonas + 
- Jonathon Vandezande + 
- Jop Vermeer + 
- Joris Van den Bossche 
- Josh 
- Josh Friedlander + 
- Justin Zheng 
- Kaiqi Dong 
- Kane + 
- Kapil Patel + 
- Kara de la Marck + 
- Katherine Surta + 
- Katrin Leinweber + 
- Kendall Masse 
- Kevin Sheppard 
- Kyle Kosic + 
- Lorenzo Stella + 
- Maarten Rietbergen + 
- Mak Sze Chun 
- Marc Garcia 
- Mateusz Woś 
- Matias Heikkilä 
- Mats Maiwald + 
- Matthew Roeschke 
- Max Bolingbroke + 
- Max Kovalovs + 
- Max van Deursen + 
- Michael 
- Michael Davis + 
- Michael P. Moran + 
- Mike Cramblett + 
- Min ho Kim + 
- Misha Veldhoen + 
- Mukul Ashwath Ram + 
- MusTheDataGuy + 
- Nanda H Krishna + 
- Nicholas Musolino 
- Noam Hershtig + 
- Noora Husseini + 
- Paul 
- Paul Reidy 
- Pauli Virtanen 
- Pav A + 
- Peter Leimbigler + 
- Philippe Ombredanne + 
- Pietro Battiston 
- Richard Eames + 
- Roman Yurchak 
- Ruijing Li 
- Ryan 
- Ryan Joyce + 
- Ryan Nazareth 
- Ryan Rehman + 
- Sakar Panta + 
- Samuel Sinayoko 
- Sandeep Pathak + 
- Sangwoong Yoon 
- Saurav Chakravorty 
- Scott Talbert + 
- Sergey Kopylov + 
- Shantanu Gontia + 
- Shivam Rana + 
- Shorokhov Sergey + 
- Simon Hawkins 
- Soyoun(Rose) Kim 
- Stephan Hoyer 
- Stephen Cowley + 
- Stephen Rauch 
- Sterling Paramore + 
- Steven + 
- Stijn Van Hoey 
- Sumanau Sareen + 
- Takuya N + 
- Tan Tran + 
- Tao He + 
- Tarbo Fukazawa 
- Terji Petersen + 
- Thein Oo 
- ThibTrip + 
- Thijs Damsma + 
- Thiviyan Thanapalasingam 
- Thomas A Caswell 
- Thomas Kluiters + 
- Tilen Kusterle + 
- Tim Gates + 
- Tim Hoffmann 
- Tim Swast 
- Tom Augspurger 
- Tom Neep + 
- Tomáš Chvátal + 
- Tyler Reddy 
- Vaibhav Vishal + 
- Vasily Litvinov + 
- Vibhu Agarwal + 
- Vikramjeet Das + 
- Vladislav + 
- Víctor Moron Tejero + 
- Wenhuan 
- Will Ayd + 
- William Ayd 
- Wouter De Coster + 
- Yoann Goular + 
- Zach Angell + 
- alimcmaster1 
- anmyachev + 
- chris-b1 
- danielplawrence + 
- endenis + 
- enisnazif + 
- ezcitron + 
- fjetter 
- froessler 
- gfyoung 
- gwrome + 
- h-vetinari 
- haison + 
- hannah-c + 
- heckeop + 
- iamshwin + 
- jamesoliverh + 
- jbrockmendel 
- jkovacevic + 
- killerontherun1 + 
- knuu + 
- kpapdac + 
- kpflugshaupt + 
- krsnik93 + 
- leerssej + 
- lrjball + 
- mazayo + 
- nathalier + 
- nrebena + 
- nullptr + 
- pilkibun + 
- pmaxey83 + 
- rbenes + 
- robbuckley 
- shawnbrown + 
- sudhir mohanraj + 
- tadeja + 
- tamuhey + 
- thatneat 
- topper-123 
- willweil + 
- yehia67 + 
- yhaque1213 +