What’s new in 0.25.0 (July 18, 2019)¶
Warning
Starting with the 0.25.x series of releases, pandas only supports Python 3.5.3 and higher. See Dropping Python 2.7 for more details.
Warning
The minimum supported Python version will be bumped to 3.6 in a future release.
Warning
Panel has been fully removed. For N-D labeled data structures, please
use xarray
Warning
read_pickle() and read_msgpack() are only guaranteed backwards compatible back to
pandas version 0.20.3 (GH27082)
These are the changes in pandas 0.25.0. See Release notes for a full changelog including other versions of pandas.
Enhancements¶
GroupBy aggregation with relabeling¶
pandas has added special groupby behavior, known as “named aggregation”, for naming the output columns when applying multiple aggregation functions to specific columns (GH18366, GH26512).
In [1]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
...: 'height': [9.1, 6.0, 9.5, 34.0],
...: 'weight': [7.9, 7.5, 9.9, 198.0]})
...:
In [2]: animals
Out[2]:
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
[4 rows x 3 columns]
In [3]: animals.groupby("kind").agg(
...: min_height=pd.NamedAgg(column='height', aggfunc='min'),
...: max_height=pd.NamedAgg(column='height', aggfunc='max'),
...: average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
...: )
...:
Out[3]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
[2 rows x 3 columns]
Pass the desired column names as the **kwargs to .agg. The values of **kwargs
should be tuples where the first element is the column selection, and the second element is the
aggregation function to apply. pandas provides the pandas.NamedAgg namedtuple to make it clearer
what the arguments to the function are, but plain tuples are accepted as well.
In [4]: animals.groupby("kind").agg(
...: min_height=('height', 'min'),
...: max_height=('height', 'max'),
...: average_weight=('weight', np.mean),
...: )
...:
Out[4]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
[2 rows x 3 columns]
Named aggregation is the recommended replacement for the deprecated “dict-of-dicts” approach to naming the output of column-specific aggregations (Deprecate groupby.agg() with a dictionary when renaming).
A similar approach is now available for Series groupby objects as well. Because there’s no need for column selection, the values can just be the functions to apply
In [5]: animals.groupby("kind").height.agg(
...: min_height="min",
...: max_height="max",
...: )
...:
Out[5]:
min_height max_height
kind
cat 9.1 9.5
dog 6.0 34.0
[2 rows x 2 columns]
This type of aggregation is the recommended alternative to the deprecated behavior when passing a dict to a Series groupby aggregation (Deprecate groupby.agg() with a dictionary when renaming).
See Named aggregation for more.
GroupBy aggregation with multiple lambdas¶
You can now provide multiple lambda functions to a list-like aggregation in
pandas.core.groupby.GroupBy.agg (GH26430).
In [6]: animals.groupby('kind').height.agg([
...: lambda x: x.iloc[0], lambda x: x.iloc[-1]
...: ])
...:
Out[6]:
<lambda_0> <lambda_1>
kind
cat 9.1 9.5
dog 6.0 34.0
[2 rows x 2 columns]
In [7]: animals.groupby('kind').agg([
...: lambda x: x.iloc[0] - x.iloc[1],
...: lambda x: x.iloc[0] + x.iloc[1]
...: ])
...:
Out[7]:
height weight
<lambda_0> <lambda_1> <lambda_0> <lambda_1>
kind
cat -0.4 18.6 -2.0 17.8
dog -28.0 40.0 -190.5 205.5
[2 rows x 4 columns]
Previously, these raised a SpecificationError.
Better repr for MultiIndex¶
Printing of MultiIndex instances now shows tuples of each row and ensures
that the tuple items are vertically aligned, so it’s now easier to understand
the structure of the MultiIndex. (GH13480):
The repr now looks like this:
In [8]: pd.MultiIndex.from_product([['a', 'abc'], range(500)])
Out[8]:
MultiIndex([( 'a', 0),
( 'a', 1),
( 'a', 2),
( 'a', 3),
( 'a', 4),
( 'a', 5),
( 'a', 6),
( 'a', 7),
( 'a', 8),
( 'a', 9),
...
('abc', 490),
('abc', 491),
('abc', 492),
('abc', 493),
('abc', 494),
('abc', 495),
('abc', 496),
('abc', 497),
('abc', 498),
('abc', 499)],
length=1000)
Previously, outputting a MultiIndex printed all the levels and
codes of the MultiIndex, which was visually unappealing and made
the output more difficult to navigate. For example (limiting the range to 5):
In [1]: pd.MultiIndex.from_product([['a', 'abc'], range(5)])
Out[1]: MultiIndex(levels=[['a', 'abc'], [0, 1, 2, 3]],
...: codes=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3]])
In the new repr, all values will be shown, if the number of rows is smaller
than options.display.max_seq_items (default: 100 items). Horizontally,
the output will truncate, if it’s wider than options.display.width
(default: 80 characters).
Shorter truncated repr for Series and DataFrame¶
Currently, the default display options of pandas ensure that when a Series
or DataFrame has more than 60 rows, its repr gets truncated to this maximum
of 60 rows (the display.max_rows option). However, this still gives
a repr that takes up a large part of the vertical screen real estate. Therefore,
a new option display.min_rows is introduced with a default of 10, which
determines the number of rows shown in the truncated repr:
- For small Series or DataFrames, up to max_rows number of rows is shown (default: 60).
- For larger Series or DataFrames with a length above max_rows, only min_rows number of rows is shown (default: 10, i.e. the first and last 5 rows).
This dual option allows you to still see the full content of relatively small
objects (e.g. df.head(20) shows all 20 rows), while giving a brief repr
for large objects.
To restore the previous behaviour of a single threshold, set
pd.options.display.min_rows = None.
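As a quick illustration (a minimal sketch; the exact rendering depends on your terminal and display options):
import pandas as pd

s = pd.Series(range(100))

# Longer than max_rows (60), so only min_rows (10) rows appear in the repr
print(s)

# Restore the previous single-threshold behaviour
pd.options.display.min_rows = None
print(s)  # truncated at max_rows (60) again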
JSON normalize with max_level param support¶
json_normalize() normalizes the provided input dict to all
nested levels. The new max_level parameter provides more control over
the level at which to end normalization (GH23843). For example:
from pandas.io.json import json_normalize
data = [{
'CreatedBy': {'Name': 'User001'},
'Lookup': {'TextField': 'Some text',
'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
'Image': {'a': 'b'}
}]
json_normalize(data, max_level=1)
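Continuing the snippet above, a rough sketch of the difference (column names use the default '.' separator; this is illustrative, not verbatim output):
json_normalize(data)
# columns roughly: CreatedBy.Name, Lookup.TextField,
#                  Lookup.UserField.Id, Lookup.UserField.Name, Image.a

json_normalize(data, max_level=1)
# columns roughly: CreatedBy.Name, Lookup.TextField,
#                  Lookup.UserField (kept as a dict), Image.a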
Series.explode to split list-like values to rows¶
Series and DataFrame have gained the Series.explode() and DataFrame.explode() methods to transform list-likes to individual rows. See the section on Exploding a list-like column in the docs for more information (GH16538, GH10511).
Here is a typical use case: you have a comma-separated string in a column.
In [9]: df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
...: {'var1': 'd,e,f', 'var2': 2}])
...:
In [10]: df
Out[10]:
var1 var2
0 a,b,c 1
1 d,e,f 2
[2 rows x 2 columns]
Creating a long form DataFrame is now straightforward using chained operations
In [11]: df.assign(var1=df.var1.str.split(',')).explode('var1')
Out[11]:
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
1 f 2
[6 rows x 2 columns]
Other enhancements¶
- DataFrame.plot() keywords logy, logx and loglog can now accept the value 'sym' for symlog scaling (GH24867)
- Added support for ISO week year format ('%G-%V-%u') when parsing datetimes using to_datetime() (GH16607)
- Indexing of DataFrame and Series now accepts zero-dimensional np.ndarray (GH24919)
- Timestamp.replace() now supports the fold argument to disambiguate DST transition times (GH25017)
- DataFrame.at_time() and Series.at_time() now support datetime.time objects with timezones (GH24043)
- DataFrame.pivot_table() now accepts an observed parameter which is passed to underlying calls to DataFrame.groupby() to speed up grouping categorical data (GH24923)
- Series.str has gained the Series.str.casefold() method to remove all case distinctions present in a string (GH25405)
- DataFrame.set_index() now works for instances of abc.Iterator, provided their output is of the same length as the calling frame (GH22484, GH24984)
- DatetimeIndex.union() now supports the sort argument. The behavior of the sort parameter matches that of Index.union() (GH24994)
- RangeIndex.union() now supports the sort argument. If sort=False an unsorted Int64Index is always returned. sort=None is the default and returns a monotonically increasing RangeIndex if possible or a sorted Int64Index if not (GH24471)
- TimedeltaIndex.intersection() now also supports the sort keyword (GH24471)
- DataFrame.rename() now supports the errors argument to raise errors when attempting to rename nonexistent keys (GH13473)
- Added Sparse accessor for working with a DataFrame whose values are sparse (GH25681)
- RangeIndex has gained start, stop, and step attributes (GH25710)
- datetime.timezone objects are now supported as arguments to timezone methods and constructors (GH25065)
- DataFrame.query() and DataFrame.eval() now support quoting column names with backticks to refer to names with spaces (GH6508); see the sketch after this list
- merge_asof() now gives a clearer error message when merge keys are categoricals that are not equal (GH26136)
- pandas.core.window.Rolling() supports exponential (or Poisson) window type (GH21303)
- Error message for missing required imports now includes the original import error's text (GH23868)
- DatetimeIndex and TimedeltaIndex now have a mean method (GH24757)
- DataFrame.describe() now formats integer percentiles without decimal point (GH26660)
- Added support for reading SPSS .sav files using read_spss() (GH26537)
- Added new option plotting.backend to be able to select a plotting backend different than the existing matplotlib one. Use pandas.set_option('plotting.backend', '<backend-module>') where '<backend-module>' is a library implementing the pandas plotting API (GH14130)
- pandas.offsets.BusinessHour supports multiple opening hours intervals (GH15481)
- read_excel() can now use openpyxl to read Excel files via the engine='openpyxl' argument. This will become the default in a future release (GH11499)
- pandas.io.excel.read_excel() supports reading OpenDocument tables. Specify engine='odf' to enable. Consult the IO User Guide for more details (GH9070)
- Interval, IntervalIndex, and IntervalArray have gained an is_empty attribute denoting if the given interval(s) are empty (GH27219)
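As a small sketch of the backtick support mentioned above (the frame and its column names are illustrative):
import pandas as pd

df = pd.DataFrame({"max speed": [389.0, 24.0, 80.5],
                   "type": ["falcon", "parrot", "lion"]})

# Backticks let query()/eval() refer to column names containing spaces
df.query("`max speed` > 50")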
Backwards incompatible API changes¶
Indexing with date strings with UTC offsets¶
Indexing a DataFrame or Series with a DatetimeIndex with a
date string with a UTC offset would previously ignore the UTC offset. Now, the UTC offset
is respected in indexing. (GH24076, GH16785)
In [12]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))
In [13]: df
Out[13]:
0
2019-01-01 00:00:00-08:00 0
[1 rows x 1 columns]
Previous behavior:
In [3]: df['2019-01-01 00:00:00+04:00':'2019-01-01 01:00:00+04:00']
Out[3]:
0
2019-01-01 00:00:00-08:00 0
New behavior:
In [14]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
Out[14]:
0
2019-01-01 00:00:00-08:00 0
[1 rows x 1 columns]
MultiIndex constructed from levels and codes¶
Constructing a MultiIndex with NaN levels or codes value < -1 was allowed previously.
Now, construction with codes value < -1 is not allowed and NaN levels’ corresponding codes
would be reassigned as -1. (GH19387)
Previous behavior:
In [1]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
...: codes=[[0, -1, 1, 2, 3, 4]])
...:
Out[1]: MultiIndex(levels=[[nan, None, NaT, 128, 2]],
codes=[[0, -1, 1, 2, 3, 4]])
In [2]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
Out[2]: MultiIndex(levels=[[1, 2]],
codes=[[0, -2]])
New behavior:
In [15]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
....: codes=[[0, -1, 1, 2, 3, 4]])
....:
Out[15]:
MultiIndex([(nan,),
(nan,),
(nan,),
(nan,),
(128,),
( 2,)],
)
In [16]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-225a01af3975> in <module>
----> 1 pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
/pandas/pandas/core/indexes/multi.py in __new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity)
314
315 if verify_integrity:
--> 316 new_codes = result._verify_integrity()
317 result._codes = new_codes
318
/pandas/pandas/core/indexes/multi.py in _verify_integrity(self, codes, levels)
387 )
388 if len(level_codes) and level_codes.min() < -1:
--> 389 raise ValueError(f"On level {i}, code value ({level_codes.min()}) < -1")
390 if not level.is_unique:
391 raise ValueError(
ValueError: On level 0, code value (-2) < -1
GroupBy.apply on DataFrame evaluates first group only once¶
The implementation of DataFrameGroupBy.apply()
previously evaluated the supplied function consistently twice on the first group
to infer if it is safe to use a fast code path. Particularly for functions with
side effects, this was an undesired behavior and may have led to surprises. (GH2936, GH2656, GH7739, GH10519, GH12155, GH20084, GH21417)
Now every group is evaluated only a single time.
In [17]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
In [18]: df
Out[18]:
a b
0 x 1
1 y 2
[2 rows x 2 columns]
In [19]: def func(group):
....: print(group.name)
....: return group
....:
Previous behavior:
In [3]: df.groupby('a').apply(func)
x
x
y
Out[3]:
a b
0 x 1
1 y 2
New behavior:
In [20]: df.groupby("a").apply(func)
x
y
Out[20]:
a b
0 x 1
1 y 2
[2 rows x 2 columns]
Concatenating sparse values¶
When passed DataFrames whose values are sparse, concat() will now return a
Series or DataFrame with sparse values, rather than a SparseDataFrame (GH25702).
In [21]: df = pd.DataFrame({"A": pd.SparseArray([0, 1])})
Previous behavior:
In [2]: type(pd.concat([df, df]))
pandas.core.sparse.frame.SparseDataFrame
New behavior:
In [22]: type(pd.concat([df, df]))
Out[22]: pandas.core.frame.DataFrame
This now matches the existing behavior of concat on Series with sparse values.
concat() will continue to return a SparseDataFrame when all the values
are instances of SparseDataFrame.
This change also affects routines using concat() internally, like get_dummies(),
which now returns a DataFrame in all cases (previously a SparseDataFrame was
returned if all the columns were dummy encoded, and a DataFrame otherwise).
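For example, a minimal sketch (the input values are illustrative): get_dummies() with sparse=True now yields a regular DataFrame whose columns hold sparse values.
import pandas as pd

dummies = pd.get_dummies(pd.Series(["a", "b", "a"]), sparse=True)
type(dummies)   # pandas.core.frame.DataFrame, no longer SparseDataFrame
dummies.dtypes  # columns backed by sparse values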
Providing any SparseSeries or SparseDataFrame to concat() will
cause a SparseSeries or SparseDataFrame to be returned, as before.
The .str-accessor performs stricter type checks¶
Due to the lack of more fine-grained dtypes, Series.str so far only checked whether the data was
of object dtype. Series.str will now infer the dtype of the data within the Series; in particular,
'bytes'-only data will raise an exception (except for Series.str.decode(), Series.str.get(),
Series.str.len(), Series.str.slice()), see GH23163, GH23011, GH23551.
Previous behavior:
In [1]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)
In [2]: s
Out[2]:
0 b'a'
1 b'ba'
2 b'cba'
dtype: object
In [3]: s.str.startswith(b'a')
Out[3]:
0 True
1 False
2 False
dtype: bool
New behavior:
In [23]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)
In [24]: s
Out[24]:
0 b'a'
1 b'ba'
2 b'cba'
Length: 3, dtype: object
In [25]: s.str.startswith(b'a')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-25-ac784692b361> in <module>
----> 1 s.str.startswith(b'a')
/pandas/pandas/core/strings/accessor.py in wrapper(self, *args, **kwargs)
98 f"inferred dtype '{self._inferred_dtype}'."
99 )
--> 100 raise TypeError(msg)
101 return func(self, *args, **kwargs)
102
TypeError: Cannot use .str.startswith with values of inferred dtype 'bytes'.
Categorical dtypes are preserved during GroupBy¶
Previously, columns that were categorical, but not the groupby key(s) would be converted to object dtype during groupby operations. pandas now will preserve these dtypes. (GH18502)
In [26]: cat = pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)
In [27]: df = pd.DataFrame({'payload': [-1, -2, -1, -2], 'col': cat})
In [28]: df
Out[28]:
payload col
0 -1 foo
1 -2 bar
2 -1 bar
3 -2 qux
[4 rows x 2 columns]
In [29]: df.dtypes
Out[29]:
payload int64
col category
Length: 2, dtype: object
Previous Behavior:
In [5]: df.groupby('payload').first().col.dtype
Out[5]: dtype('O')
New Behavior:
In [30]: df.groupby('payload').first().col.dtype
Out[30]: CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True)
Incompatible Index type unions¶
When performing Index.union() operations between objects of incompatible dtypes,
the result will be a base Index of dtype object. This behavior holds true for
unions between Index objects that previously would have been prohibited. The dtype
of empty Index objects will now be evaluated before performing union operations
rather than simply returning the other Index object. Index.union() can now be
considered commutative, such that A.union(B) == B.union(A) (GH23525).
Previous behavior:
In [1]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
...
ValueError: can only call with other PeriodIndex-ed objects
In [2]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[2]: Int64Index([1, 2, 3], dtype='int64')
New behavior:
In [31]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
Out[31]: Index([1991-09-05, 1991-09-06, 1, 2, 3], dtype='object')
In [32]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[32]: Index([1, 2, 3], dtype='object')
Note that integer- and floating-dtype indexes are considered “compatible”. The integer values are coerced to floating point, which may result in loss of precision. See Set operations on Index objects for more.
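A minimal sketch of such a union (the values are illustrative):
import pandas as pd

# Integer and float indexes are "compatible": the integer values are coerced to float
pd.Index([1, 2, 3]).union(pd.Index([2.5, 3.5]))
# roughly: Float64Index([1.0, 2.0, 2.5, 3.0, 3.5], dtype='float64')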
DataFrame GroupBy ffill/bfill no longer return group labels¶
The methods ffill, bfill, pad and backfill of
DataFrameGroupBy
previously included the group labels in the return value, which was
inconsistent with other groupby transforms. Now only the filled values
are returned. (GH21521)
In [33]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
In [34]: df
Out[34]:
a b
0 x 1
1 y 2
[2 rows x 2 columns]
Previous behavior:
In [3]: df.groupby("a").ffill()
Out[3]:
a b
0 x 1
1 y 2
New behavior:
In [35]: df.groupby("a").ffill()
Out[35]:
b
0 1
1 2
[2 rows x 1 columns]
DataFrame describe on an empty Categorical / object column will return top and freq¶
When calling DataFrame.describe() with an empty categorical / object
column, the ‘top’ and ‘freq’ columns were previously omitted, which was inconsistent with
the output for non-empty columns. Now the ‘top’ and ‘freq’ columns will always be included,
with numpy.nan in the case of an empty DataFrame (GH26397)
In [36]: df = pd.DataFrame({"empty_col": pd.Categorical([])})
In [37]: df
Out[37]:
Empty DataFrame
Columns: [empty_col]
Index: []
[0 rows x 1 columns]
Previous behavior:
In [3]: df.describe()
Out[3]:
empty_col
count 0
unique 0
New behavior:
In [38]: df.describe()
Out[38]:
empty_col
count 0
unique 0
top NaN
freq NaN
[4 rows x 1 columns]
__str__ methods now call __repr__ rather than vice versa¶
pandas has until now mostly defined string representations in a pandas object's
__str__/__unicode__/__bytes__ methods, and called __str__ from the __repr__
method, if a specific __repr__ method is not found. This is not needed for Python 3.
In pandas 0.25, the string representations of pandas objects are now generally
defined in __repr__, and calls to __str__ in general now pass the call on to
the __repr__, if a specific __str__ method doesn’t exist, as is standard for Python.
This change is backward compatible for direct usage of pandas, but if you subclass
pandas objects and give your subclasses specific __str__/__repr__ methods,
you may have to adjust your __str__/__repr__ methods (GH26495).
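For subclass authors, a minimal sketch of the practical effect (the class name and values are illustrative):
import pandas as pd

class MySeries(pd.Series):
    # Only __repr__ is defined; in 0.25+, str() falls through to it as well
    def __repr__(self):
        return "<MySeries>"

s = MySeries([1, 2, 3])
repr(s)  # '<MySeries>'
str(s)   # now also '<MySeries>'; previously str() used pandas' own __str__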
Indexing an IntervalIndex with Interval objects¶
Indexing methods for IntervalIndex have been modified to require exact matches only for Interval queries.
IntervalIndex methods previously matched on any overlapping Interval. Behavior with scalar points, e.g. querying
with an integer, is unchanged (GH16316).
In [39]: ii = pd.IntervalIndex.from_tuples([(0, 4), (1, 5), (5, 8)])
In [40]: ii
Out[40]:
IntervalIndex([(0, 4], (1, 5], (5, 8]],
closed='right',
dtype='interval[int64]')
The in operator (__contains__) now only returns True for exact matches to Intervals in the IntervalIndex, whereas
this would previously return True for any Interval overlapping an Interval in the IntervalIndex.
Previous behavior:
In [4]: pd.Interval(1, 2, closed='neither') in ii
Out[4]: True
In [5]: pd.Interval(-10, 10, closed='both') in ii
Out[5]: True
New behavior:
In [41]: pd.Interval(1, 2, closed='neither') in ii
Out[41]: False
In [42]: pd.Interval(-10, 10, closed='both') in ii
Out[42]: False
The get_loc() method now only returns locations for exact matches to Interval queries, as opposed to the previous behavior of
returning locations for overlapping matches. A KeyError will be raised if an exact match is not found.
Previous behavior:
In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: array([0, 1])
In [7]: ii.get_loc(pd.Interval(2, 6))
Out[7]: array([0, 1, 2])
New behavior:
In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: 1
In [7]: ii.get_loc(pd.Interval(2, 6))
---------------------------------------------------------------------------
KeyError: Interval(2, 6, closed='right')
Likewise, get_indexer() and get_indexer_non_unique() will also only return locations for exact matches
to Interval queries, with -1 denoting that an exact match was not found.
These indexing changes extend to querying a Series or DataFrame with an IntervalIndex index.
In [43]: s = pd.Series(list('abc'), index=ii)
In [44]: s
Out[44]:
(0, 4] a
(1, 5] b
(5, 8] c
Length: 3, dtype: object
Selecting from a Series or DataFrame using [] (__getitem__) or loc now only returns exact matches for Interval queries.
Previous behavior:
In [8]: s[pd.Interval(1, 5)]
Out[8]:
(0, 4] a
(1, 5] b
dtype: object
In [9]: s.loc[pd.Interval(1, 5)]
Out[9]:
(0, 4] a
(1, 5] b
dtype: object
New behavior:
In [45]: s[pd.Interval(1, 5)]
Out[45]: 'b'
In [46]: s.loc[pd.Interval(1, 5)]
Out[46]: 'b'
Similarly, a KeyError will be raised for non-exact matches instead of returning overlapping matches.
Previous behavior:
In [9]: s[pd.Interval(2, 3)]
Out[9]:
(0, 4] a
(1, 5] b
dtype: object
In [10]: s.loc[pd.Interval(2, 3)]
Out[10]:
(0, 4] a
(1, 5] b
dtype: object
New behavior:
In [6]: s[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')
In [7]: s.loc[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')
The overlaps() method can be used to create a boolean indexer that replicates the
previous behavior of returning overlapping matches.
New behavior:
In [47]: idxr = s.index.overlaps(pd.Interval(2, 3))
In [48]: idxr
Out[48]: array([ True, True, False])
In [49]: s[idxr]
Out[49]:
(0, 4] a
(1, 5] b
Length: 2, dtype: object
In [50]: s.loc[idxr]
Out[50]:
(0, 4] a
(1, 5] b
Length: 2, dtype: object
Binary ufuncs on Series now align¶
Applying a binary ufunc like numpy.power() now aligns the inputs
when both are Series (GH23293).
In [51]: s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
In [52]: s2 = pd.Series([3, 4, 5], index=['d', 'c', 'b'])
In [53]: s1
Out[53]:
a 1
b 2
c 3
Length: 3, dtype: int64
In [54]: s2
Out[54]:
d 3
c 4
b 5
Length: 3, dtype: int64
Previous behavior
In [5]: np.power(s1, s2)
Out[5]:
a 1
b 16
c 243
dtype: int64
New behavior
In [55]: np.power(s1, s2)
Out[55]:
a 1.0
b 32.0
c 81.0
d NaN
Length: 4, dtype: float64
This matches the behavior of other binary operations in pandas, like Series.add().
To retain the previous behavior, convert the other Series to an array before
applying the ufunc.
In [56]: np.power(s1, s2.array)
Out[56]:
a 1
b 16
c 243
Length: 3, dtype: int64
Categorical.argsort now places missing values at the end¶
Categorical.argsort() now places missing values at the end of the array, making it
consistent with NumPy and the rest of pandas (GH21801).
In [57]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)
Previous behavior
In [2]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)
In [3]: cat.argsort()
Out[3]: array([1, 2, 0])
In [4]: cat[cat.argsort()]
Out[4]:
[NaN, a, b]
categories (2, object): [a < b]
New behavior
In [58]: cat.argsort()
Out[58]: array([2, 0, 1])
In [59]: cat[cat.argsort()]
Out[59]:
['a', 'b', NaN]
Categories (2, object): ['a' < 'b']
Column order is preserved when passing a list of dicts to DataFrame¶
Starting with Python 3.7 the key-order of dict is guaranteed. In practice, this has been true since
Python 3.6. The DataFrame constructor now treats a list of dicts in the same way as
it does a list of OrderedDict, i.e. preserving the order of the dicts.
This change applies only when pandas is running on Python>=3.6 (GH27309).
In [60]: data = [
....: {'name': 'Joe', 'state': 'NY', 'age': 18},
....: {'name': 'Jane', 'state': 'KY', 'age': 19, 'hobby': 'Minecraft'},
....: {'name': 'Jean', 'state': 'OK', 'age': 20, 'finances': 'good'}
....: ]
....:
Previous Behavior:
The columns were lexicographically sorted previously,
In [1]: pd.DataFrame(data)
Out[1]:
age finances hobby name state
0 18 NaN NaN Joe NY
1 19 NaN Minecraft Jane KY
2 20 good NaN Jean OK
New Behavior:
The column order now matches the insertion-order of the keys in the dict,
considering all the records from top to bottom. As a consequence, the column
order of the resulting DataFrame has changed compared to previous pandas versions.
In [61]: pd.DataFrame(data)
Out[61]:
name state age hobby finances
0 Joe NY 18 NaN NaN
1 Jane KY 19 Minecraft NaN
2 Jean OK 20 NaN good
[3 rows x 5 columns]
Increased minimum versions for dependencies¶
Due to dropping support for Python 2.7, a number of optional dependencies have updated minimum versions (GH25725, GH24942, GH25752). Independently, some minimum supported versions of dependencies were updated (GH23519, GH25554). If installed, we now require:
| Package | Minimum Version | Required |
|---|---|---|
| numpy | 1.13.3 | X |
| pytz | 2015.4 | X |
| python-dateutil | 2.6.1 | X |
| bottleneck | 1.2.1 | |
| numexpr | 2.6.2 | |
| pytest (dev) | 4.0.2 | |
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
| Package | Minimum Version |
|---|---|
| beautifulsoup4 | 4.6.0 |
| fastparquet | 0.2.1 |
| gcsfs | 0.2.2 |
| lxml | 3.8.0 |
| matplotlib | 2.2.2 |
| openpyxl | 2.4.8 |
| pyarrow | 0.9.0 |
| pymysql | 0.7.1 |
| pytables | 3.4.2 |
| scipy | 0.19.0 |
| sqlalchemy | 1.1.4 |
| xarray | 0.8.2 |
| xlrd | 1.1.0 |
| xlsxwriter | 0.9.8 |
| xlwt | 1.2.0 |
See Dependencies and Optional dependencies for more.
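To check which versions are installed in your environment, pandas provides a helper that prints the versions of pandas and of its required and optional dependencies:
import pandas as pd

pd.show_versions()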
Other API changes¶
- DatetimeTZDtype will now standardize pytz timezones to a common timezone instance (GH24713)
- Timestamp and Timedelta scalars now implement the to_numpy() method as aliases to Timestamp.to_datetime64() and Timedelta.to_timedelta64(), respectively. (GH24653)
- Timestamp.strptime() will now raise a NotImplementedError (GH25016)
- Comparing Timestamp with unsupported objects now returns NotImplemented instead of raising TypeError. This implies that unsupported rich comparisons are delegated to the other object, and are now consistent with Python 3 behavior for datetime objects (GH24011)
- Bug in DatetimeIndex.snap() which didn't preserve the name of the input Index (GH25575)
- The arg argument in pandas.core.groupby.DataFrameGroupBy.agg() has been renamed to func (GH26089)
- The arg argument in pandas.core.window._Window.aggregate() has been renamed to func (GH26372)
- Most pandas classes had a __bytes__ method, which was used for getting a python2-style bytestring representation of the object. This method has been removed as a part of dropping Python 2 (GH26447)
- The .str-accessor has been disabled for 1-level MultiIndex, use MultiIndex.to_flat_index() if necessary (GH23679)
- Removed support of gtk package for clipboards (GH26563)
- Using an unsupported version of Beautiful Soup 4 will now raise an ImportError instead of a ValueError (GH27063)
- Series.to_excel() and DataFrame.to_excel() will now raise a ValueError when saving timezone aware data. (GH27008, GH7056)
- ExtensionArray.argsort() places NA values at the end of the sorted array. (GH21801)
- DataFrame.to_hdf() and Series.to_hdf() will now raise a NotImplementedError when saving a MultiIndex with extension data types for a fixed format. (GH7775)
- Passing duplicate names in read_csv() will now raise a ValueError (GH17346)
Deprecations¶
Sparse subclasses¶
The SparseSeries and SparseDataFrame subclasses are deprecated. Their functionality is better-provided
by a Series or DataFrame with sparse values.
Previous way
df = pd.SparseDataFrame({"A": [0, 0, 1, 2]})
df.dtypes
New way
In [62]: df = pd.DataFrame({"A": pd.SparseArray([0, 0, 1, 2])})
In [63]: df.dtypes
Out[63]:
A Sparse[int64, 0]
Length: 1, dtype: object
The memory usage of the two approaches is identical. See Migrating for more (GH19239).
msgpack format¶
The msgpack format is deprecated as of 0.25 and will be removed in a future version. It is recommended to use pyarrow for on-the-wire transmission of pandas objects. (GH27084)
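As one possible replacement, a hedged sketch of round-tripping a DataFrame through pyarrow via the Parquet format (requires pyarrow to be installed; the file name is illustrative):
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

df.to_parquet("data.parquet", engine="pyarrow")
roundtripped = pd.read_parquet("data.parquet", engine="pyarrow")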
Other deprecations¶
- The deprecated .ix[] indexer now raises a more visible FutureWarning instead of DeprecationWarning (GH26438).
- Deprecated the units=M (months) and units=Y (year) parameters for units of pandas.to_timedelta(), pandas.Timedelta() and pandas.TimedeltaIndex() (GH16344)
- pandas.concat() has deprecated the join_axes keyword. Instead, use DataFrame.reindex() or DataFrame.reindex_like() on the result or on the inputs (GH21951); see the sketch after this list
- The SparseArray.values attribute is deprecated. You can use np.asarray(...) or the SparseArray.to_dense() method instead (GH26421).
- The functions pandas.to_datetime() and pandas.to_timedelta() have deprecated the box keyword. Instead, use to_numpy() or Timestamp.to_datetime64() or Timedelta.to_timedelta64(). (GH24416)
- The DataFrame.compound() and Series.compound() methods are deprecated and will be removed in a future version (GH26405).
- The internal attributes _start, _stop and _step of RangeIndex have been deprecated. Use the public attributes start, stop and step instead (GH26581).
- The Series.ftype(), Series.ftypes() and DataFrame.ftypes() methods are deprecated and will be removed in a future version. Instead, use Series.dtype() and DataFrame.dtypes() (GH26705).
- The Series.get_values(), DataFrame.get_values(), Index.get_values(), SparseArray.get_values() and Categorical.get_values() methods are deprecated. One of np.asarray(..) or to_numpy() can be used instead (GH19617).
- The 'outer' method on NumPy ufuncs, e.g. np.subtract.outer, has been deprecated on Series objects. Convert the input to an array with Series.array first (GH27186)
- Timedelta.resolution() is deprecated and replaced with Timedelta.resolution_string(). In a future version, Timedelta.resolution() will be changed to behave like the standard library datetime.timedelta.resolution (GH21344)
- read_table() has been undeprecated. (GH25220)
- Index.dtype_str is deprecated. (GH18262)
- Series.imag and Series.real are deprecated. (GH18262)
- Series.put() is deprecated. (GH18262)
- Index.item() and Series.item() are deprecated. (GH18262)
- The default value ordered=None in CategoricalDtype has been deprecated in favor of ordered=False. When converting between categorical types ordered=True must be explicitly passed in order to be preserved. (GH26336)
- Index.contains() is deprecated. Use key in index (__contains__) instead (GH17753).
- DataFrame.get_dtype_counts() is deprecated. (GH18262)
- Categorical.ravel() will return a Categorical instead of a np.ndarray (GH27199)
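For the deprecated join_axes keyword mentioned above, a minimal replacement sketch using DataFrame.reindex() (the frames are illustrative):
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
df2 = pd.DataFrame({"b": [3, 4]}, index=["y", "z"])

# Previously: pd.concat([df1, df2], axis=1, join_axes=[df1.index])
# Deprecation-free equivalent: reindex the result instead
result = pd.concat([df1, df2], axis=1).reindex(df1.index)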
Removal of prior version deprecations/changes¶
- Removed the previously deprecated sheetname keyword in read_excel() (GH16442, GH20938)
- Removed the previously deprecated TimeGrouper (GH16942)
- Removed the previously deprecated parse_cols keyword in read_excel() (GH16488)
- Removed the previously deprecated pd.options.html.border (GH16970)
- Removed the previously deprecated convert_objects (GH11221)
- Removed the previously deprecated select method of DataFrame and Series (GH17633)
- Removed the previously deprecated behavior of Series treated as list-like in rename_categories() (GH17982)
- Removed the previously deprecated DataFrame.reindex_axis and Series.reindex_axis (GH17842)
- Removed the previously deprecated behavior of altering column or index labels with Series.rename_axis() or DataFrame.rename_axis() (GH17842)
- Removed the previously deprecated tupleize_cols keyword argument in read_html(), read_csv(), and DataFrame.to_csv() (GH17877, GH17820)
- Removed the previously deprecated DataFrame.from_csv and Series.from_csv (GH17812)
- Removed the previously deprecated raise_on_error keyword argument in DataFrame.where() and DataFrame.mask() (GH17744)
- Removed the previously deprecated ordered and categories keyword arguments in astype (GH17742)
- Removed the previously deprecated cdate_range (GH17691)
- Removed the previously deprecated True option for the dropna keyword argument in SeriesGroupBy.nth() (GH17493)
- Removed the previously deprecated convert keyword argument in Series.take() and DataFrame.take() (GH17352)
- Removed the previously deprecated behavior of arithmetic operations with datetime.date objects (GH21152)
Performance improvements¶
- Significant speedup in SparseArray initialization that benefits most operations, fixing performance regression introduced in v0.20.0 (GH24985)
- DataFrame.to_stata() is now faster when outputting data with any string or non-native endian columns (GH25045)
- Improved performance of Series.searchsorted(). The speedup is especially large when the dtype is int8/int16/int32 and the searched key is within the integer bounds for the dtype (GH22034)
- Improved performance of pandas.core.groupby.GroupBy.quantile() (GH20405)
- Improved performance of slicing and other selected operation on a RangeIndex (GH26565, GH26617, GH26722)
- RangeIndex now performs standard lookup without instantiating an actual hashtable, hence saving memory (GH16685)
- Improved performance of read_csv() by faster tokenizing and faster parsing of small float numbers (GH25784)
- Improved performance of read_csv() by faster parsing of N/A and boolean values (GH25804)
- Improved performance of IntervalIndex.is_monotonic, IntervalIndex.is_monotonic_increasing and IntervalIndex.is_monotonic_decreasing by removing conversion to MultiIndex (GH24813)
- Improved performance of DataFrame.to_csv() when writing datetime dtypes (GH25708)
- Improved performance of read_csv() by much faster parsing of MM/YYYY and DD/MM/YYYY datetime formats (GH25922)
- Improved performance of nanops for dtypes that cannot store NaNs. Speedup is particularly prominent for Series.all() and Series.any() (GH25070)
- Improved performance of Series.map() for dictionary mappers on categorical series by mapping the categories instead of mapping all values (GH23785)
- Improved performance of IntervalIndex.intersection() (GH24813)
- Improved performance of read_csv() by faster concatenating date columns without extra conversion to string for integer/float zero and float NaN; by faster checking the string for the possibility of being a date (GH25754)
- Improved performance of IntervalIndex.is_unique by removing conversion to MultiIndex (GH24813)
- Restored performance of DatetimeIndex.__iter__() by re-enabling specialized code path (GH26702)
- Improved performance when building MultiIndex with at least one CategoricalIndex level (GH22044)
- Improved performance by removing the need for a garbage collect when checking for SettingWithCopyWarning (GH27031)
- For to_datetime() changed default value of cache parameter to True (GH26043)
- Improved performance of DatetimeIndex and PeriodIndex slicing given non-unique, monotonic data (GH27136).
- Improved performance of pd.read_json() for index-oriented data. (GH26773)
- Improved performance of MultiIndex.shape() (GH27384).
Bug fixes¶
Categorical¶
- Bug in DataFrame.at() and Series.at() that would raise exception if the index was a CategoricalIndex (GH20629)
- Fixed bug in comparison of ordered Categorical that contained missing values with a scalar which sometimes incorrectly resulted in True (GH26504)
- Bug in DataFrame.dropna() when the DataFrame has a CategoricalIndex containing Interval objects incorrectly raised a TypeError (GH25087)
Datetimelike¶
- Bug in to_datetime() which would raise an (incorrect) ValueError when called with a date far into the future and the format argument specified instead of raising OutOfBoundsDatetime (GH23830)
- Bug in to_datetime() which would raise InvalidIndexError: Reindexing only valid with uniquely valued Index objects when called with cache=True, with arg including at least two different elements from the set {None, numpy.nan, pandas.NaT} (GH22305)
- Bug in DataFrame and Series where timezone aware data with dtype='datetime64[ns]' was not cast to naive (GH25843)
- Improved Timestamp type checking in various datetime functions to prevent exceptions when using a subclassed datetime (GH25851)
- Bug in Series and DataFrame repr where np.datetime64('NaT') and np.timedelta64('NaT') with dtype=object would be represented as NaN (GH25445)
- Bug in to_datetime() which does not replace the invalid argument with NaT when errors is set to coerce (GH26122)
- Bug in adding DateOffset with nonzero month to DatetimeIndex would raise ValueError (GH26258)
- Bug in to_datetime() which raises unhandled OverflowError when called with mix of invalid dates and NaN values with format='%Y%m%d' and errors='coerce' (GH25512)
- Bug in isin() for datetimelike indexes; DatetimeIndex, TimedeltaIndex and PeriodIndex where the levels parameter was ignored. (GH26675)
- Bug in to_datetime() which raises TypeError for format='%Y%m%d' when called for invalid integer dates with length >= 6 digits with errors='ignore'
- Bug when comparing a PeriodIndex against a zero-dimensional numpy array (GH26689)
- Bug in constructing a Series or DataFrame from a numpy datetime64 array with a non-ns unit and out-of-bound timestamps generating rubbish data, which will now correctly raise an OutOfBoundsDatetime error (GH26206).
- Bug in date_range() with unnecessary OverflowError being raised for very large or very small dates (GH26651)
- Bug where adding Timestamp to a np.timedelta64 object would raise instead of returning a Timestamp (GH24775)
- Bug where comparing a zero-dimensional numpy array containing a np.datetime64 object to a Timestamp would incorrectly raise TypeError (GH26916)
- Bug in to_datetime() which would raise ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True when called with cache=True, with arg including datetime strings with different offset (GH26097)
Timedelta¶
- Bug in TimedeltaIndex.intersection() where for non-monotonic indices in some cases an empty Index was returned when in fact an intersection existed (GH25913)
- Bug with comparisons between Timedelta and NaT raising TypeError (GH26039)
- Bug when adding or subtracting a BusinessHour to a Timestamp with the resulting time landing in a following or prior day respectively (GH26381)
- Bug when comparing a TimedeltaIndex against a zero-dimensional numpy array (GH26689)
Timezones¶
- Bug in DatetimeIndex.to_frame() where timezone aware data would be converted to timezone naive data (GH25809)
- Bug in to_datetime() with utc=True and datetime strings that would apply previously parsed UTC offsets to subsequent arguments (GH24992)
- Bug in Timestamp.tz_localize() and Timestamp.tz_convert() does not propagate freq (GH25241)
- Bug in Series.at() where setting Timestamp with timezone raises TypeError (GH25506)
- Bug in DataFrame.update() when updating with timezone aware data would return timezone naive data (GH25807)
- Bug in to_datetime() where an uninformative RuntimeError was raised when passing a naive Timestamp with datetime strings with mixed UTC offsets (GH25978)
- Bug in to_datetime() with unit='ns' would drop timezone information from the parsed argument (GH26168)
- Bug in DataFrame.join() where joining a timezone aware index with a timezone aware column would result in a column of NaN (GH26335)
- Bug in date_range() where ambiguous or nonexistent start or end times were not handled by the ambiguous or nonexistent keywords respectively (GH27088)
- Bug in DatetimeIndex.union() when combining a timezone aware and timezone unaware DatetimeIndex (GH21671)
- Bug when applying a numpy reduction function (e.g. numpy.minimum()) to a timezone aware Series (GH15552)
Numeric¶
- Bug in to_numeric() in which large negative numbers were being improperly handled (GH24910)
- Bug in to_numeric() in which numbers were being coerced to float, even though errors was not coerce (GH24910)
- Bug in to_numeric() in which invalid values for errors were being allowed (GH26466)
- Bug in format in which floating point complex numbers were not being formatted to proper display precision and trimming (GH25514)
- Bug in error messages in DataFrame.corr() and Series.corr(). Added the possibility of using a callable. (GH25729)
- Bug in Series.divmod() and Series.rdivmod() which would raise an (incorrect) ValueError rather than return a pair of Series objects as result (GH25557)
- Raises a helpful exception when a non-numeric index is sent to interpolate() with methods which require numeric index. (GH21662)
- Bug in eval() when comparing floats with scalar operators, for example: x < -0.1 (GH25928)
- Fixed bug where casting all-boolean array to integer extension array failed (GH25211)
- Bug in divmod with a Series object containing zeros incorrectly raising AttributeError (GH26987)
- Inconsistency in Series floor-division (//) and divmod filling positive//zero with NaN instead of Inf (GH27321)
Conversion¶
- Bug in DataFrame.astype() when passing a dict of columns and types the errors parameter was ignored. (GH25905)
Strings¶
- Bug in the __name__ attribute of several methods of Series.str, which were set incorrectly (GH23551)
- Improved error message when passing Series of wrong dtype to Series.str.cat() (GH22722)
Interval¶
- Construction of Interval is restricted to numeric, Timestamp and Timedelta endpoints (GH23013)
- Fixed bug in Series/DataFrame not displaying NaN in IntervalIndex with missing values (GH25984)
- Bug in IntervalIndex.get_loc() where a KeyError would be incorrectly raised for a decreasing IntervalIndex (GH25860)
- Bug in Index constructor where passing mixed closed Interval objects would result in a ValueError instead of an object dtype Index (GH27172)
Indexing¶
- Improved exception message when calling DataFrame.iloc() with a list of non-numeric objects (GH25753).
- Improved exception message when calling .iloc or .loc with a boolean indexer with different length (GH26658).
- Bug in KeyError exception message when indexing a MultiIndex with a non-existent key not displaying the original key (GH27250).
- Bug in .iloc and .loc with a boolean indexer not raising an IndexError when too few items are passed (GH26658).
- Bug in DataFrame.loc() and Series.loc() where KeyError was not raised for a MultiIndex when the key was less than or equal to the number of levels in the MultiIndex (GH14885).
- Bug in which DataFrame.append() produced an erroneous warning indicating that a KeyError will be thrown in the future when the data to be appended contains new columns (GH22252).
- Bug in which DataFrame.to_csv() caused a segfault for a reindexed data frame, when the indices were single-level MultiIndex (GH26303).
- Fixed bug where assigning an arrays.PandasArray to a pandas.core.frame.DataFrame would raise an error (GH26390)
- Allow keyword arguments for callable local reference used in the DataFrame.query() string (GH26426)
- Fixed a KeyError when indexing a MultiIndex level with a list containing exactly one label, which is missing (GH27148)
- Bug which produced AttributeError on partial matching Timestamp in a MultiIndex (GH26944)
- Bug in Categorical and CategoricalIndex with Interval values when using the in operator (__contains__) with objects that are not comparable to the values in the Interval (GH23705)
- Bug in DataFrame.loc() and DataFrame.iloc() on a DataFrame with a single timezone-aware datetime64[ns] column incorrectly returning a scalar instead of a Series (GH27110)
- Bug in CategoricalIndex and Categorical incorrectly raising ValueError instead of TypeError when a list is passed using the in operator (__contains__) (GH21729)
- Bug in setting a new value in a Series with a Timedelta object incorrectly casting the value to an integer (GH22717)
- Bug in Series setting a new key (__setitem__) with a timezone-aware datetime incorrectly raising ValueError (GH12862)
- Bug in DataFrame.iloc() when indexing with a read-only indexer (GH17192)
- Bug in Series setting an existing tuple key (__setitem__) with timezone-aware datetime values incorrectly raising TypeError (GH20441)
Missing¶
- Fixed misleading exception message in Series.interpolate() if argument order is required, but omitted (GH10633, GH24014).
- Fixed class type displayed in exception message in DataFrame.dropna() if invalid axis parameter passed (GH25555)
- A ValueError will now be thrown by DataFrame.fillna() when limit is not a positive integer (GH27042)
MultiIndex¶
- Bug in which an incorrect exception was raised by Timedelta when testing the membership of MultiIndex (GH24570)
IO¶
- Bug in DataFrame.to_html() where values were truncated using display options instead of outputting the full content (GH17004)
- Fixed bug in missing text when using to_clipboard() if copying utf-16 characters in Python 3 on Windows (GH25040)
- Bug in read_json() for orient='table' when it tries to infer dtypes by default, which is not applicable as dtypes are already defined in the JSON schema (GH21345)
- Bug in read_json() for orient='table' and float index, as it infers index dtype by default, which is not applicable because index dtype is already defined in the JSON schema (GH25433)
- Bug in read_json() for orient='table' and string of float column names, as it makes a column name type conversion to Timestamp, which is not applicable because column names are already defined in the JSON schema (GH25435)
- Bug in json_normalize() for errors='ignore' where missing values in the input data, were filled in resulting DataFrame with the string "nan" instead of numpy.nan (GH25468)
- DataFrame.to_html() now raises TypeError when using an invalid type for the classes parameter instead of AssertionError (GH25608)
- Bug in DataFrame.to_string() and DataFrame.to_latex() that would lead to incorrect output when the header keyword is used (GH16718)
- Bug in read_csv() not properly interpreting the UTF8 encoded filenames on Windows on Python 3.6+ (GH15086)
- Improved performance in pandas.read_stata() and pandas.io.stata.StataReader when converting columns that have missing values (GH25772)
- Bug in DataFrame.to_html() where header numbers would ignore display options when rounding (GH17280)
- Bug in read_hdf() where reading a table from an HDF5 file written directly with PyTables fails with a ValueError when using a sub-selection via the start or stop arguments (GH11188)
- Bug in read_hdf() not properly closing store after a KeyError is raised (GH25766)
- Improved the explanation for the failure when value labels are repeated in Stata dta files and suggested work-arounds (GH25772)
- Improved pandas.read_stata() and pandas.io.stata.StataReader to read incorrectly formatted 118 format files saved by Stata (GH25960)
- Improved the col_space parameter in DataFrame.to_html() to accept a string so CSS length values can be set correctly (GH25941)
- Fixed bug in loading objects from S3 that contain # characters in the URL (GH25945)
- Adds use_bqstorage_api parameter to read_gbq() to speed up downloads of large data frames. This feature requires version 0.10.0 of the pandas-gbq library as well as the google-cloud-bigquery-storage and fastavro libraries. (GH26104)
- Fixed memory leak in DataFrame.to_json() when dealing with numeric data (GH24889)
- Bug in read_json() where date strings with Z were not converted to a UTC timezone (GH26168)
- Added cache_dates=True parameter to read_csv(), which allows to cache unique dates when they are parsed (GH25990)
- DataFrame.to_excel() now raises a ValueError when the caller's dimensions exceed the limitations of Excel (GH26051)
- Fixed bug in pandas.read_csv() where a BOM would result in incorrect parsing using engine='python' (GH26545)
- read_excel() now raises a ValueError when input is of type pandas.io.excel.ExcelFile and engine param is passed since pandas.io.excel.ExcelFile has an engine defined (GH26566)
- Bug while selecting from HDFStore with where='' specified (GH26610).
- Fixed bug in DataFrame.to_excel() where custom objects (i.e. PeriodIndex) inside merged cells were not being converted into types safe for the Excel writer (GH27006)
- Bug in read_hdf() where reading a timezone aware DatetimeIndex would raise a TypeError (GH11926)
- Bug in to_msgpack() and read_msgpack() which would raise a ValueError rather than a FileNotFoundError for an invalid path (GH27160)
- Fixed bug in DataFrame.to_parquet() which would raise a ValueError when the dataframe had no columns (GH27339)
- Allow parsing of PeriodDtype columns when using read_csv() (GH26934)
Plotting¶
- Fixed bug where api.extensions.ExtensionArray could not be used in matplotlib plotting (GH25587)
- Bug in an error message in DataFrame.plot(). Improved the error message if non-numerics are passed to DataFrame.plot() (GH25481)
- Bug in incorrect ticklabel positions when plotting an index that are non-numeric / non-datetime (GH7612, GH15912, GH22334)
- Fixed bug causing plots of PeriodIndex timeseries to fail if the frequency is a multiple of the frequency rule code (GH14763)
- Fixed bug when plotting a DatetimeIndex with datetime.timezone.utc timezone (GH17173)
GroupBy/resample/rolling¶
- Bug in pandas.core.resample.Resampler.agg() with a timezone aware index where OverflowError would raise when passing a list of functions (GH22660)
- Bug in pandas.core.groupby.DataFrameGroupBy.nunique() in which the names of column levels were lost (GH23222)
- Bug in pandas.core.groupby.GroupBy.agg() when applying an aggregation function to timezone aware data (GH23683)
- Bug in pandas.core.groupby.GroupBy.first() and pandas.core.groupby.GroupBy.last() where timezone information would be dropped (GH21603)
- Bug in pandas.core.groupby.GroupBy.size() when grouping only NA values (GH23050)
- Bug in Series.groupby() where observed kwarg was previously ignored (GH24880)
- Bug in Series.groupby() where using groupby with a MultiIndex Series with a list of labels equal to the length of the series caused incorrect grouping (GH25704)
- Ensured that ordering of outputs in groupby aggregation functions is consistent across all versions of Python (GH25692)
- Ensured that result group order is correct when grouping on an ordered Categorical and specifying observed=True (GH25871, GH25167)
- Bug in pandas.core.window.Rolling.min() and pandas.core.window.Rolling.max() that caused a memory leak (GH25893)
- Bug in pandas.core.window.Rolling.count() and pandas.core.window.Expanding.count was previously ignoring the axis keyword (GH13503)
- Bug in pandas.core.groupby.GroupBy.idxmax() and pandas.core.groupby.GroupBy.idxmin() with datetime column would return incorrect dtype (GH25444, GH15306)
- Bug in pandas.core.groupby.GroupBy.cumsum(), pandas.core.groupby.GroupBy.cumprod(), pandas.core.groupby.GroupBy.cummin() and pandas.core.groupby.GroupBy.cummax() with categorical column having absent categories, would return incorrect result or segfault (GH16771)
- Bug in pandas.core.groupby.GroupBy.nth() where NA values in the grouping would return incorrect results (GH26011)
- Bug in pandas.core.groupby.SeriesGroupBy.transform() where transforming an empty group would raise a ValueError (GH26208)
- Bug in pandas.core.frame.DataFrame.groupby() where passing a pandas.core.groupby.grouper.Grouper would return incorrect groups when using the .groups accessor (GH26326)
- Bug in pandas.core.groupby.GroupBy.agg() where incorrect results are returned for uint64 columns. (GH26310)
- Bug in pandas.core.window.Rolling.median() and pandas.core.window.Rolling.quantile() where MemoryError is raised with empty window (GH26005)
- Bug in pandas.core.window.Rolling.median() and pandas.core.window.Rolling.quantile() where incorrect results are returned with closed='left' and closed='neither' (GH26005)
- Improved pandas.core.window.Rolling, pandas.core.window.Window and pandas.core.window.ExponentialMovingWindow functions to exclude nuisance columns from results instead of raising errors and raise a DataError only if all columns are nuisance (GH12537)
- Bug in pandas.core.window.Rolling.max() and pandas.core.window.Rolling.min() where incorrect results are returned with an empty variable window (GH26005)
- Raise a helpful exception when an unsupported weighted window function is used as an argument of pandas.core.window.Window.aggregate() (GH26597)
Reshaping¶
- Bug in pandas.merge() that adds a string of None, if None is assigned in suffixes, instead of keeping the column name as-is (GH24782).
- Bug in merge() when merging by index name would sometimes result in an incorrectly numbered index (missing index values are now assigned NA) (GH24212, GH25009)
- to_records() now accepts dtypes to its column_dtypes parameter (GH24895)
- Bug in concat() where order of OrderedDict (and dict in Python 3.6+) is not respected, when passed in as objs argument (GH21510)
- Bug in pivot_table() where columns with NaN values are dropped even if dropna argument is False, when the aggfunc argument contains a list (GH22159)
- Bug in concat() where the resulting freq of two DatetimeIndex with the same freq would be dropped (GH3232).
- Bug in merge() where merging with equivalent Categorical dtypes was raising an error (GH22501)
- Bug in DataFrame instantiating with a dict of iterators or generators (e.g. pd.DataFrame({'A': reversed(range(3))})) raised an error (GH26349).
- Bug in DataFrame instantiating with a range (e.g. pd.DataFrame(range(3))) raised an error (GH26342).
- Bug in DataFrame constructor when passing non-empty tuples would cause a segmentation fault (GH25691)
- Bug in Series.apply() failed when the series is a timezone aware DatetimeIndex (GH25959)
- Bug in pandas.cut() where large bins could incorrectly raise an error due to an integer overflow (GH26045)
- Bug in DataFrame.sort_index() where an error is thrown when a multi-indexed DataFrame is sorted on all levels with the initial level sorted last (GH26053)
- Bug in Series.nlargest() treats True as smaller than False (GH26154)
- Bug in DataFrame.pivot_table() with an IntervalIndex as pivot index would raise TypeError (GH25814)
- Bug in which DataFrame.from_dict() ignored order of OrderedDict when orient='index' (GH8425).
- Bug in DataFrame.transpose() where transposing a DataFrame with a timezone-aware datetime column would incorrectly raise ValueError (GH26825)
- Bug in pivot_table() when pivoting a timezone aware column as the values would remove timezone information (GH14948)
- Bug in merge_asof() when specifying multiple by columns where one is datetime64[ns, tz] dtype (GH26649)
Sparse¶
- Significant speedup in SparseArray initialization that benefits most operations, fixing performance regression introduced in v0.20.0 (GH24985)
- Bug in SparseFrame constructor where passing None as the data would cause default_fill_value to be ignored (GH16807)
- Bug in SparseDataFrame when adding a column in which the length of values does not match length of index, AssertionError is raised instead of raising ValueError (GH25484)
- Introduce a better error message in Series.sparse.from_coo() so it returns a TypeError for inputs that are not coo matrices (GH26554)
- Bug in numpy.modf() on a SparseArray. Now a tuple of SparseArray is returned (GH26946).
ExtensionArray¶
- Bug in factorize() when passing an ExtensionArray with a custom na_sentinel (GH25696).
- Series.count() miscounts NA values in ExtensionArrays (GH26835)
- Added Series.__array_ufunc__ to better handle NumPy ufuncs applied to Series backed by extension arrays (GH23293).
- Keyword argument deep has been removed from ExtensionArray.copy() (GH27083)
Other¶
- Removed unused C functions from vendored UltraJSON implementation (GH26198)
- Allow Index and RangeIndex to be passed to numpy min and max functions (GH26125)
- Use actual class name in repr of empty objects of a Series subclass (GH27001).
- Bug in DataFrame where passing an object array of timezone-aware datetime objects would incorrectly raise ValueError (GH13287)
Contributors¶
A total of 231 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
1_x7 +
Abdullah İhsan Seçer +
Adam Bull +
Adam Hooper
Albert Villanova del Moral
Alex Watt +
AlexTereshenkov +
Alexander Buchkovsky
Alexander Hendorf +
Alexander Nordin +
Alexander Ponomaroff
Alexandre Batisse +
Alexandre Decan +
Allen Downey +
Alyssa Fu Ward +
Andrew Gaspari +
Andrew Wood +
Antoine Viscardi +
Antonio Gutierrez +
Arno Veenstra +
ArtinSarraf
Batalex +
Baurzhan Muftakhidinov
Benjamin Rowell
Bharat Raghunathan +
Bhavani Ravi +
Big Head +
Brett Randall +
Bryan Cutler +
C John Klehm +
Caleb Braun +
Cecilia +
Chris Bertinato +
Chris Stadler +
Christian Haege +
Christian Hudon
Christopher Whelan
Chuanzhu Xu +
Clemens Brunner
Damian Kula +
Daniel Hrisca +
Daniel Luis Costa +
Daniel Saxton
DanielFEvans +
David Liu +
Deepyaman Datta +
Denis Belavin +
Devin Petersohn +
Diane Trout +
EdAbati +
Enrico Rotundo +
EternalLearner42 +
Evan +
Evan Livelo +
Fabian Rost +
Flavien Lambert +
Florian Rathgeber +
Frank Hoang +
Gaibo Zhang +
Gioia Ballin
Giuseppe Romagnuolo +
Gordon Blackadder +
Gregory Rome +
Guillaume Gay
HHest +
Hielke Walinga +
How Si Wei +
Hubert
Huize Wang +
Hyukjin Kwon +
Ian Dunn +
Inevitable-Marzipan +
Irv Lustig
JElfner +
Jacob Bundgaard +
James Cobon-Kerr +
Jan-Philip Gehrcke +
Jarrod Millman +
Jayanth Katuri +
Jeff Reback
Jeremy Schendel
Jiang Yue +
Joel Ostblom
Johan von Forstner +
Johnny Chiu +
Jonas +
Jonathon Vandezande +
Jop Vermeer +
Joris Van den Bossche
Josh
Josh Friedlander +
Justin Zheng
Kaiqi Dong
Kane +
Kapil Patel +
Kara de la Marck +
Katherine Surta +
Katrin Leinweber +
Kendall Masse
Kevin Sheppard
Kyle Kosic +
Lorenzo Stella +
Maarten Rietbergen +
Mak Sze Chun
Marc Garcia
Mateusz Woś
Matias Heikkilä
Mats Maiwald +
Matthew Roeschke
Max Bolingbroke +
Max Kovalovs +
Max van Deursen +
Michael
Michael Davis +
Michael P. Moran +
Mike Cramblett +
Min ho Kim +
Misha Veldhoen +
Mukul Ashwath Ram +
MusTheDataGuy +
Nanda H Krishna +
Nicholas Musolino
Noam Hershtig +
Noora Husseini +
Paul
Paul Reidy
Pauli Virtanen
Pav A +
Peter Leimbigler +
Philippe Ombredanne +
Pietro Battiston
Richard Eames +
Roman Yurchak
Ruijing Li
Ryan
Ryan Joyce +
Ryan Nazareth
Ryan Rehman +
Sakar Panta +
Samuel Sinayoko
Sandeep Pathak +
Sangwoong Yoon
Saurav Chakravorty
Scott Talbert +
Sergey Kopylov +
Shantanu Gontia +
Shivam Rana +
Shorokhov Sergey +
Simon Hawkins
Soyoun(Rose) Kim
Stephan Hoyer
Stephen Cowley +
Stephen Rauch
Sterling Paramore +
Steven +
Stijn Van Hoey
Sumanau Sareen +
Takuya N +
Tan Tran +
Tao He +
Tarbo Fukazawa
Terji Petersen +
Thein Oo
ThibTrip +
Thijs Damsma +
Thiviyan Thanapalasingam
Thomas A Caswell
Thomas Kluiters +
Tilen Kusterle +
Tim Gates +
Tim Hoffmann
Tim Swast
Tom Augspurger
Tom Neep +
Tomáš Chvátal +
Tyler Reddy
Vaibhav Vishal +
Vasily Litvinov +
Vibhu Agarwal +
Vikramjeet Das +
Vladislav +
Víctor Moron Tejero +
Wenhuan
Will Ayd +
William Ayd
Wouter De Coster +
Yoann Goular +
Zach Angell +
alimcmaster1
anmyachev +
chris-b1
danielplawrence +
endenis +
enisnazif +
ezcitron +
fjetter
froessler
gfyoung
gwrome +
h-vetinari
haison +
hannah-c +
heckeop +
iamshwin +
jamesoliverh +
jbrockmendel
jkovacevic +
killerontherun1 +
knuu +
kpapdac +
kpflugshaupt +
krsnik93 +
leerssej +
lrjball +
mazayo +
nathalier +
nrebena +
nullptr +
pilkibun +
pmaxey83 +
rbenes +
robbuckley
shawnbrown +
sudhir mohanraj +
tadeja +
tamuhey +
thatneat
topper-123
willweil +
yehia67 +
yhaque1213 +