What’s new in 0.25.0 (July 18, 2019)#
Warning
Starting with the 0.25.x series of releases, pandas only supports Python 3.5.3 and higher. See Dropping Python 2.7 for more details.
Warning
The minimum supported Python version will be bumped to 3.6 in a future release.
Warning
Panel has been fully removed. For N-D labeled data structures, please use xarray.
Warning
read_pickle() and read_msgpack() are only guaranteed backwards compatible back to pandas version 0.20.3 (GH 27082).
These are the changes in pandas 0.25.0. See Release notes for a full changelog including other versions of pandas.
Enhancements#
GroupBy aggregation with relabeling#
pandas has added special groupby behavior, known as “named aggregation”, for naming the output columns when applying multiple aggregation functions to specific columns (GH 18366, GH 26512).
In [1]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
...: 'height': [9.1, 6.0, 9.5, 34.0],
...: 'weight': [7.9, 7.5, 9.9, 198.0]})
...:
In [2]: animals
Out[2]:
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
In [3]: animals.groupby("kind").agg(
...: min_height=pd.NamedAgg(column='height', aggfunc='min'),
...: max_height=pd.NamedAgg(column='height', aggfunc='max'),
...: average_weight=pd.NamedAgg(column='weight', aggfunc="mean"),
...: )
...:
Out[3]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
Pass the desired column names as the **kwargs to .agg. The values of **kwargs should be tuples where the first element is the column selection, and the second element is the aggregation function to apply. pandas provides the pandas.NamedAgg namedtuple to make it clearer what the arguments to the function are, but plain tuples are accepted as well.
In [4]: animals.groupby("kind").agg(
...: min_height=('height', 'min'),
...: max_height=('height', 'max'),
...: average_weight=('weight', 'mean'),
...: )
...:
Out[4]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
Named aggregation is the recommended replacement for the deprecated “dict-of-dicts” approach to naming the output of column-specific aggregations (Deprecate groupby.agg() with a dictionary when renaming).
A similar approach is now available for Series groupby objects as well. Because there's no need for column selection, the values can just be the functions to apply:
In [5]: animals.groupby("kind").height.agg(
...: min_height="min",
...: max_height="max",
...: )
...:
Out[5]:
min_height max_height
kind
cat 9.1 9.5
dog 6.0 34.0
This type of aggregation is the recommended alternative to the deprecated behavior when passing a dict to a Series groupby aggregation (Deprecate groupby.agg() with a dictionary when renaming).
See Named aggregation for more.
GroupBy aggregation with multiple lambdas#
You can now provide multiple lambda functions to a list-like aggregation in GroupBy.agg (GH 26430).
In [6]: animals.groupby('kind').height.agg([
...: lambda x: x.iloc[0], lambda x: x.iloc[-1]
...: ])
...:
Out[6]:
<lambda_0> <lambda_1>
kind
cat 9.1 9.5
dog 6.0 34.0
In [7]: animals.groupby('kind').agg([
...: lambda x: x.iloc[0] - x.iloc[1],
...: lambda x: x.iloc[0] + x.iloc[1]
...: ])
...:
Out[7]:
height weight
<lambda_0> <lambda_1> <lambda_0> <lambda_1>
kind
cat -0.4 18.6 -2.0 17.8
dog -28.0 40.0 -190.5 205.5
Previously, these raised a SpecificationError.
Better repr for MultiIndex#
Printing of MultiIndex instances now shows the tuples of each row and ensures that the tuple items are vertically aligned, so it's now easier to understand the structure of the MultiIndex (GH 13480):
The repr now looks like this:
In [8]: pd.MultiIndex.from_product([['a', 'abc'], range(500)])
Out[8]:
MultiIndex([( 'a', 0),
( 'a', 1),
( 'a', 2),
( 'a', 3),
( 'a', 4),
( 'a', 5),
( 'a', 6),
( 'a', 7),
( 'a', 8),
( 'a', 9),
...
('abc', 490),
('abc', 491),
('abc', 492),
('abc', 493),
('abc', 494),
('abc', 495),
('abc', 496),
('abc', 497),
('abc', 498),
('abc', 499)],
length=1000)
Previously, outputting a MultiIndex printed all the levels and codes of the MultiIndex, which was visually unappealing and made the output more difficult to navigate. For example (limiting the range to 5):
In [1]: pd.MultiIndex.from_product([['a', 'abc'], range(5)])
Out[1]: MultiIndex(levels=[['a', 'abc'], [0, 1, 2, 3]],
...: codes=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3]])
In the new repr, all values will be shown if the number of rows is smaller than options.display.max_seq_items (default: 100 items). Horizontally, the output will truncate if it's wider than options.display.width (default: 80 characters).
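A minimal sketch of how the vertical truncation threshold interacts with the new repr, using pd.option_context to keep the setting local:

```python
import pandas as pd

# 20 tuples: below the default display.max_seq_items of 100, so every
# tuple is printed and no "..." marker appears.
mi = pd.MultiIndex.from_product([["a", "b"], range(10)])
assert "..." not in repr(mi)

# Lowering the threshold below the number of items forces truncation.
with pd.option_context("display.max_seq_items", 10):
    assert "..." in repr(mi)
```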
Shorter truncated repr for Series and DataFrame#
Currently, the default display options of pandas ensure that when a Series or DataFrame has more than 60 rows, its repr gets truncated to this maximum of 60 rows (the display.max_rows option). However, this still gives a repr that takes up a large part of the vertical screen estate. Therefore, a new option display.min_rows is introduced with a default of 10, which determines the number of rows shown in the truncated repr:
- For small Series or DataFrames, up to max_rows rows are shown (default: 60).
- For larger Series or DataFrames with a length above max_rows, only min_rows rows are shown (default: 10, i.e. the first and last 5 rows).
This dual option allows one to still see the full content of relatively small objects (e.g. df.head(20) shows all 20 rows), while giving a brief repr for large objects.
To restore the previous behaviour of a single threshold, set pd.options.display.min_rows = None.
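A minimal sketch of the two-threshold behavior, assuming the defaults described above:

```python
import pandas as pd

df = pd.DataFrame({"a": range(100)})  # longer than max_rows (default 60)

# Truncated repr shows only min_rows rows (half at the head, half at the tail).
with pd.option_context("display.min_rows", 4):
    short = repr(df)  # 2 head rows, an ellipsis row, 2 tail rows

# min_rows=None restores the previous single-threshold behaviour (max_rows).
with pd.option_context("display.min_rows", None):
    single = repr(df)

assert ".." in short                              # ellipsis row present
assert short.count("\n") < single.count("\n")     # far fewer rows shown
```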
JSON normalize with max_level param support#
json_normalize() normalizes the provided input dict to all nested levels. The new max_level parameter provides more control over the level at which to end normalization (GH 23843). For example:
from pandas.io.json import json_normalize
data = [{
'CreatedBy': {'Name': 'User001'},
'Lookup': {'TextField': 'Some text',
'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
'Image': {'a': 'b'}
}]
json_normalize(data, max_level=1)
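With max_level=1, only the first level is flattened; deeper dicts remain as column values. A sketch using pd.json_normalize (the top-level name the function gained in later pandas versions):

```python
import pandas as pd

data = [{
    'CreatedBy': {'Name': 'User001'},
    'Lookup': {'TextField': 'Some text',
               'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
    'Image': {'a': 'b'}
}]

flat = pd.json_normalize(data, max_level=1)
# One level deep: 'Lookup.UserField' stays a dict column rather than
# being expanded into 'Lookup.UserField.Id' / 'Lookup.UserField.Name'.
print(flat.columns.tolist())
```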
Series.explode to split list-like values to rows#
Series and DataFrame have gained explode() methods to transform list-likes to individual rows. See the section on Exploding a list-like column in the docs for more information (GH 16538, GH 10511).
Here is a typical use case: you have a comma-separated string in a column.
In [9]: df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
...: {'var1': 'd,e,f', 'var2': 2}])
...:
In [10]: df
Out[10]:
var1 var2
0 a,b,c 1
1 d,e,f 2
Creating a long-form DataFrame is now straightforward using chained operations:
In [11]: df.assign(var1=df.var1.str.split(',')).explode('var1')
Out[11]:
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
1 f 2
Other enhancements#
- DataFrame.plot() keywords logy, logx and loglog can now accept the value 'sym' for symlog scaling (GH 24867)
- Added support for ISO week year format ('%G-%V-%u') when parsing datetimes using to_datetime() (GH 16607)
- Indexing of DataFrame and Series now accepts zerodim np.ndarray (GH 24919)
- Timestamp.replace() now supports the fold argument to disambiguate DST transition times (GH 25017)
- DataFrame.at_time() and Series.at_time() now support datetime.time objects with timezones (GH 24043)
- DataFrame.pivot_table() now accepts an observed parameter which is passed to underlying calls to DataFrame.groupby() to speed up grouping categorical data (GH 24923)
- Series.str has gained a Series.str.casefold() method to remove all case distinctions present in a string (GH 25405)
- DataFrame.set_index() now works for instances of abc.Iterator, provided their output is of the same length as the calling frame (GH 22484, GH 24984)
- DatetimeIndex.union() now supports the sort argument. The behavior of the sort parameter matches that of Index.union() (GH 24994)
- RangeIndex.union() now supports the sort argument. If sort=False an unsorted Int64Index is always returned. sort=None is the default and returns a monotonically increasing RangeIndex if possible or a sorted Int64Index if not (GH 24471)
- TimedeltaIndex.intersection() now also supports the sort keyword (GH 24471)
- DataFrame.rename() now supports the errors argument to raise errors when attempting to rename nonexistent keys (GH 13473)
- Added a Sparse accessor for working with a DataFrame whose values are sparse (GH 25681)
- RangeIndex has gained start, stop, and step attributes (GH 25710)
- datetime.timezone objects are now supported as arguments to timezone methods and constructors (GH 25065)
- DataFrame.query() and DataFrame.eval() now support quoting column names with backticks to refer to names with spaces (GH 6508)
- merge_asof() now gives a clearer error message when merge keys are categoricals that are not equal (GH 26136)
- Rolling() supports exponential (or Poisson) window type (GH 21303)
- Error message for missing required imports now includes the original import error's text (GH 23868)
- DatetimeIndex and TimedeltaIndex now have a mean method (GH 24757)
- DataFrame.describe() now formats integer percentiles without a decimal point (GH 26660)
- Added support for reading SPSS .sav files using read_spss() (GH 26537)
- Added a new option plotting.backend to be able to select a plotting backend different than the existing matplotlib one. Use pandas.set_option('plotting.backend', '<backend-module>') where <backend-module> is a library implementing the pandas plotting API (GH 14130)
- pandas.offsets.BusinessHour supports multiple opening hours intervals (GH 15481)
- read_excel() can now use openpyxl to read Excel files via the engine='openpyxl' argument. This will become the default in a future release (GH 11499)
- pandas.io.excel.read_excel() supports reading OpenDocument tables. Specify engine='odf' to enable. Consult the IO User Guide for more details (GH 9070)
- Interval, IntervalIndex, and IntervalArray have gained an is_empty attribute denoting if the given interval(s) are empty (GH 27219)
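Two of the items above can be illustrated with a short sketch (the column names and dates here are illustrative, not from the release notes):

```python
import pandas as pd

# Backtick-quoted column names in DataFrame.query (GH 6508):
# a column whose name contains a space can now be referenced directly.
df = pd.DataFrame({"max speed": [1, 4, 7], "kind": ["cat", "dog", "cat"]})
fast = df.query("`max speed` > 2")
assert len(fast) == 2

# ISO week-year parsing with '%G-%V-%u' (GH 16607):
# ISO year 2019, week 29, day 5 (Friday) corresponds to 2019-07-19.
ts = pd.to_datetime("2019-29-5", format="%G-%V-%u")
assert (ts.year, ts.month, ts.day) == (2019, 7, 19)
```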
Backwards incompatible API changes#
Indexing with date strings with UTC offsets#
Indexing a DataFrame or Series with a DatetimeIndex with a date string with a UTC offset would previously ignore the UTC offset. Now, the UTC offset is respected in indexing (GH 24076, GH 16785).
In [12]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))
In [13]: df
Out[13]:
0
2019-01-01 00:00:00-08:00 0
Previous behavior:
In [3]: df['2019-01-01 00:00:00+04:00':'2019-01-01 01:00:00+04:00']
Out[3]:
0
2019-01-01 00:00:00-08:00 0
New behavior:
In [14]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
Out[14]:
0
2019-01-01 00:00:00-08:00 0
MultiIndex constructed from levels and codes#
Constructing a MultiIndex with NaN levels or codes value < -1 was allowed previously. Now, construction with codes value < -1 is not allowed and NaN levels' corresponding codes would be reassigned as -1 (GH 19387).
Previous behavior:
In [1]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
...: codes=[[0, -1, 1, 2, 3, 4]])
...:
Out[1]: MultiIndex(levels=[[nan, None, NaT, 128, 2]],
codes=[[0, -1, 1, 2, 3, 4]])
In [2]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
Out[2]: MultiIndex(levels=[[1, 2]],
codes=[[0, -2]])
New behavior:
In [15]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
....: codes=[[0, -1, 1, 2, 3, 4]])
....:
Out[15]:
MultiIndex([(nan,),
(nan,),
(nan,),
(nan,),
(128,),
( 2,)],
)
In [16]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[16], line 1
----> 1 pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
File ~/work/pandas/pandas/pandas/core/indexes/multi.py:341, in MultiIndex.__new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity)
338 result.sortorder = sortorder
340 if verify_integrity:
--> 341 new_codes = result._verify_integrity()
342 result._codes = new_codes
344 result._reset_identity()
File ~/work/pandas/pandas/pandas/core/indexes/multi.py:425, in MultiIndex._verify_integrity(self, codes, levels, levels_to_verify)
419 raise ValueError(
420 f"On level {i}, code max ({level_codes.max()}) >= length of "
421 f"level ({len(level)}). NOTE: this index is in an "
422 "inconsistent state"
423 )
424 if len(level_codes) and level_codes.min() < -1:
--> 425 raise ValueError(f"On level {i}, code value ({level_codes.min()}) < -1")
426 if not level.is_unique:
427 raise ValueError(
428 f"Level values must be unique: {list(level)} on level {i}"
429 )
ValueError: On level 0, code value (-2) < -1
GroupBy.apply on DataFrame evaluates first group only once#
The implementation of DataFrameGroupBy.apply() previously evaluated the supplied function consistently twice on the first group to infer whether it is safe to use a fast code path. Particularly for functions with side effects, this was undesired behavior and may have led to surprises (GH 2936, GH 2656, GH 7739, GH 10519, GH 12155, GH 20084, GH 21417).
Now every group is evaluated only a single time.
In [17]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
In [18]: df
Out[18]:
a b
0 x 1
1 y 2
In [19]: def func(group):
....: print(group.name)
....: return group
....:
Previous behavior:
In [3]: df.groupby('a').apply(func)
x
x
y
Out[3]:
a b
0 x 1
1 y 2
New behavior:
In [3]: df.groupby('a').apply(func)
x
y
Out[3]:
a b
0 x 1
1 y 2
Concatenating sparse values#
When passed DataFrames whose values are sparse, concat() will now return a Series or DataFrame with sparse values, rather than a SparseDataFrame (GH 25702).
In [20]: df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 1])})
Previous behavior:
In [2]: type(pd.concat([df, df]))
pandas.core.sparse.frame.SparseDataFrame
New behavior:
In [21]: type(pd.concat([df, df]))
Out[21]: pandas.DataFrame
This now matches the existing behavior of concat on Series with sparse values. concat() will continue to return a SparseDataFrame when all the values are instances of SparseDataFrame.
This change also affects routines using concat() internally, like get_dummies(), which now returns a DataFrame in all cases (previously a SparseDataFrame was returned if all the columns were dummy encoded, and a DataFrame otherwise).
Providing any SparseSeries or SparseDataFrame to concat() will cause a SparseSeries or SparseDataFrame to be returned, as before.
The .str-accessor performs stricter type checks#
Due to the lack of more fine-grained dtypes, Series.str so far only checked whether the data was of object dtype. Series.str will now infer the dtype of the data within the Series; in particular, 'bytes'-only data will raise an exception (except for Series.str.decode(), Series.str.get(), Series.str.len(), Series.str.slice()), see GH 23163, GH 23011, GH 23551.
Previous behavior:
In [1]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)
In [2]: s
Out[2]:
0 b'a'
1 b'ba'
2 b'cba'
dtype: object
In [3]: s.str.startswith(b'a')
Out[3]:
0 True
1 False
2 False
dtype: bool
New behavior:
In [22]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)
In [23]: s
Out[23]:
0 b'a'
1 b'ba'
2 b'cba'
dtype: object
In [24]: s.str.startswith(b'a')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[24], line 1
----> 1 s.str.startswith(b'a')
File ~/work/pandas/pandas/pandas/core/strings/accessor.py:139, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
134 if self._inferred_dtype not in allowed_types:
135 msg = (
136 f"Cannot use .str.{func_name} with values of "
137 f"inferred dtype '{self._inferred_dtype}'."
138 )
--> 139 raise TypeError(msg)
140 return func(self, *args, **kwargs)
TypeError: Cannot use .str.startswith with values of inferred dtype 'bytes'.
Categorical dtypes are preserved during GroupBy#
Previously, columns that were categorical, but not the groupby key(s), would be converted to object dtype during groupby operations. pandas now will preserve these dtypes (GH 18502).
In [25]: cat = pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)
In [26]: df = pd.DataFrame({'payload': [-1, -2, -1, -2], 'col': cat})
In [27]: df
Out[27]:
payload col
0 -1 foo
1 -2 bar
2 -1 bar
3 -2 qux
In [28]: df.dtypes
Out[28]:
payload int64
col category
dtype: object
Previous Behavior:
In [5]: df.groupby('payload').first().col.dtype
Out[5]: dtype('O')
New Behavior:
In [29]: df.groupby('payload').first().col.dtype
Out[29]: CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True, categories_dtype=object)
Incompatible Index type unions#
When performing Index.union() operations between objects of incompatible dtypes, the result will be a base Index of dtype object. This behavior holds true for unions between Index objects that previously would have been prohibited. The dtype of empty Index objects will now be evaluated before performing union operations rather than simply returning the other Index object. Index.union() can now be considered commutative, such that A.union(B) == B.union(A) (GH 23525).
Previous behavior:
In [1]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
...
ValueError: can only call with other PeriodIndex-ed objects
In [2]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[2]: Int64Index([1, 2, 3], dtype='int64')
New behavior:
In [3]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
Out[3]: Index([1991-09-05, 1991-09-06, 1, 2, 3], dtype='object')
In [4]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[4]: Index([1, 2, 3], dtype='object')
Note that integer- and floating-dtype indexes are considered “compatible”. The integer values are coerced to floating point, which may result in loss of precision. See Set operations on Index objects for more.
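A minimal sketch of both cases (the exact Index subclass names shown in reprs differ across pandas versions):

```python
import pandas as pd

left = pd.Index([0, 1, 2])      # integer dtype
right = pd.Index([0.5, 1.5])    # floating dtype

# Integer and floating indexes are "compatible": the integers are coerced
# to float64 in the result, which can lose precision for huge integers.
result = left.union(right)
assert result.dtype == "float64"

# Truly incompatible dtypes now fall back to a base object-dtype Index
# instead of raising.
mixed = pd.Index([1, 2, 3]).union(pd.Index(["a", "b"]))
assert mixed.dtype == object
```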
DataFrame GroupBy ffill/bfill no longer return group labels#
The methods ffill, bfill, pad and backfill of DataFrameGroupBy previously included the group labels in the return value, which was inconsistent with other groupby transforms. Now only the filled values are returned (GH 21521).
In [30]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
In [31]: df
Out[31]:
a b
0 x 1
1 y 2
Previous behavior:
In [3]: df.groupby("a").ffill()
Out[3]:
a b
0 x 1
1 y 2
New behavior:
In [32]: df.groupby("a").ffill()
Out[32]:
b
0 1
1 2
DataFrame describe on an empty Categorical / object column will return top and freq#
When calling DataFrame.describe() with an empty categorical / object column, the 'top' and 'freq' columns were previously omitted, which was inconsistent with the output for non-empty columns. Now the 'top' and 'freq' columns will always be included, with numpy.nan in the case of an empty DataFrame (GH 26397).
In [33]: df = pd.DataFrame({"empty_col": pd.Categorical([])})
In [34]: df
Out[34]:
Empty DataFrame
Columns: [empty_col]
Index: []
Previous behavior:
In [3]: df.describe()
Out[3]:
empty_col
count 0
unique 0
New behavior:
In [35]: df.describe()
Out[35]:
empty_col
count 0
unique 0
top NaN
freq NaN
__str__ methods now call __repr__ rather than vice versa#
pandas has until now mostly defined string representations in pandas objects' __str__/__unicode__/__bytes__ methods, and called __str__ from the __repr__ method if a specific __repr__ method is not found. This is not needed for Python 3.
In pandas 0.25, the string representations of pandas objects are now generally defined in __repr__, and calls to __str__ in general now pass the call on to __repr__ if a specific __str__ method doesn't exist, as is standard for Python.
This change is backward compatible for direct usage of pandas, but if you subclass pandas objects and give your subclasses specific __str__/__repr__ methods, you may have to adjust your __str__/__repr__ methods (GH 26495).
Indexing an IntervalIndex with Interval objects#
Indexing methods for IntervalIndex have been modified to require exact matches only for Interval queries. IntervalIndex methods previously matched on any overlapping Interval. Behavior with scalar points, e.g. querying with an integer, is unchanged (GH 16316).
In [36]: ii = pd.IntervalIndex.from_tuples([(0, 4), (1, 5), (5, 8)])
In [37]: ii
Out[37]: IntervalIndex([(0, 4], (1, 5], (5, 8]], dtype='interval[int64, right]')
The in operator (__contains__) now only returns True for exact matches to Intervals in the IntervalIndex, whereas this would previously return True for any Interval overlapping an Interval in the IntervalIndex.
Previous behavior:
In [4]: pd.Interval(1, 2, closed='neither') in ii
Out[4]: True
In [5]: pd.Interval(-10, 10, closed='both') in ii
Out[5]: True
New behavior:
In [38]: pd.Interval(1, 2, closed='neither') in ii
Out[38]: False
In [39]: pd.Interval(-10, 10, closed='both') in ii
Out[39]: False
The get_loc() method now only returns locations for exact matches to Interval queries, as opposed to the previous behavior of returning locations for overlapping matches. A KeyError will be raised if an exact match is not found.
Previous behavior:
In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: array([0, 1])
In [7]: ii.get_loc(pd.Interval(2, 6))
Out[7]: array([0, 1, 2])
New behavior:
In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: 1
In [7]: ii.get_loc(pd.Interval(2, 6))
---------------------------------------------------------------------------
KeyError: Interval(2, 6, closed='right')
Likewise, get_indexer() and get_indexer_non_unique() will also only return locations for exact matches to Interval queries, with -1 denoting that an exact match was not found.
These indexing changes extend to querying a Series or DataFrame with an IntervalIndex index.
In [40]: s = pd.Series(list('abc'), index=ii)
In [41]: s
Out[41]:
(0, 4] a
(1, 5] b
(5, 8] c
dtype: object
Selecting from a Series or DataFrame using [] (__getitem__) or loc now only returns exact matches for Interval queries.
Previous behavior:
In [8]: s[pd.Interval(1, 5)]
Out[8]:
(0, 4] a
(1, 5] b
dtype: object
In [9]: s.loc[pd.Interval(1, 5)]
Out[9]:
(0, 4] a
(1, 5] b
dtype: object
New behavior:
In [42]: s[pd.Interval(1, 5)]
Out[42]: 'b'
In [43]: s.loc[pd.Interval(1, 5)]
Out[43]: 'b'
Similarly, a KeyError will be raised for non-exact matches instead of returning overlapping matches.
Previous behavior:
In [9]: s[pd.Interval(2, 3)]
Out[9]:
(0, 4] a
(1, 5] b
dtype: object
In [10]: s.loc[pd.Interval(2, 3)]
Out[10]:
(0, 4] a
(1, 5] b
dtype: object
New behavior:
In [6]: s[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')
In [7]: s.loc[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')
The overlaps() method can be used to create a boolean indexer that replicates the previous behavior of returning overlapping matches.
New behavior:
In [44]: idxr = s.index.overlaps(pd.Interval(2, 3))
In [45]: idxr
Out[45]: array([ True, True, False])
In [46]: s[idxr]
Out[46]:
(0, 4] a
(1, 5] b
dtype: object
In [47]: s.loc[idxr]
Out[47]:
(0, 4] a
(1, 5] b
dtype: object
Binary ufuncs on Series now align#
Applying a binary ufunc like numpy.power() now aligns the inputs when both are Series (GH 23293).
In [48]: s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
In [49]: s2 = pd.Series([3, 4, 5], index=['d', 'c', 'b'])
In [50]: s1
Out[50]:
a 1
b 2
c 3
dtype: int64
In [51]: s2
Out[51]:
d 3
c 4
b 5
dtype: int64
Previous behavior:
In [5]: np.power(s1, s2)
Out[5]:
a 1
b 16
c 243
dtype: int64
New behavior:
In [52]: np.power(s1, s2)
Out[52]:
a 1.0
b 32.0
c 81.0
d NaN
dtype: float64
This matches the behavior of other binary operations in pandas, like Series.add(). To retain the previous behavior, convert the other Series to an array before applying the ufunc.
In [53]: np.power(s1, s2.array)
Out[53]:
a 1
b 16
c 243
dtype: int64
Categorical.argsort now places missing values at the end#
Categorical.argsort() now places missing values at the end of the array, making it consistent with NumPy and the rest of pandas (GH 21801).
In [54]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)
Previous behavior:
In [2]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)
In [3]: cat.argsort()
Out[3]: array([1, 2, 0])
In [4]: cat[cat.argsort()]
Out[4]:
[NaN, a, b]
categories (2, object): [a < b]
New behavior:
In [55]: cat.argsort()
Out[55]: array([2, 0, 1])
In [56]: cat[cat.argsort()]
Out[56]:
['a', 'b', NaN]
Categories (2, object): ['a' < 'b']
Column order is preserved when passing a list of dicts to DataFrame#
Starting with Python 3.7 the key-order of dict is guaranteed. In practice, this has been true since Python 3.6. The DataFrame constructor now treats a list of dicts in the same way as it does a list of OrderedDict, i.e. preserving the order of the dicts. This change applies only when pandas is running on Python >= 3.6 (GH 27309).
In [57]: data = [
....: {'name': 'Joe', 'state': 'NY', 'age': 18},
....: {'name': 'Jane', 'state': 'KY', 'age': 19, 'hobby': 'Minecraft'},
....: {'name': 'Jean', 'state': 'OK', 'age': 20, 'finances': 'good'}
....: ]
....:
Previous Behavior:
The columns were lexicographically sorted previously,
In [1]: pd.DataFrame(data)
Out[1]:
age finances hobby name state
0 18 NaN NaN Joe NY
1 19 NaN Minecraft Jane KY
2 20 good NaN Jean OK
New Behavior:
The column order now matches the insertion-order of the keys in the dict, considering all the records from top to bottom. As a consequence, the column order of the resulting DataFrame has changed compared to previous pandas versions.
In [58]: pd.DataFrame(data)
Out[58]:
name state age hobby finances
0 Joe NY 18 NaN NaN
1 Jane KY 19 Minecraft NaN
2 Jean OK 20 NaN good
Increased minimum versions for dependencies#
Due to dropping support for Python 2.7, a number of optional dependencies have updated minimum versions (GH 25725, GH 24942, GH 25752). Independently, some minimum supported versions of dependencies were updated (GH 23519, GH 25554). If installed, we now require:
| Package | Minimum Version | Required |
|---|---|---|
| numpy | 1.13.3 | X |
| pytz | 2015.4 | X |
| python-dateutil | 2.6.1 | X |
| bottleneck | 1.2.1 | |
| numexpr | 2.6.2 | |
| pytest (dev) | 4.0.2 | |
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
| Package | Minimum Version |
|---|---|
| beautifulsoup4 | 4.6.0 |
| fastparquet | 0.2.1 |
| gcsfs | 0.2.2 |
| lxml | 3.8.0 |
| matplotlib | 2.2.2 |
| openpyxl | 2.4.8 |
| pyarrow | 0.9.0 |
| pymysql | 0.7.1 |
| pytables | 3.4.2 |
| scipy | 0.19.0 |
| sqlalchemy | 1.1.4 |
| xarray | 0.8.2 |
| xlrd | 1.1.0 |
| xlsxwriter | 0.9.8 |
| xlwt | 1.2.0 |
See Dependencies and Optional dependencies for more.
Other API changes#
- DatetimeTZDtype will now standardize pytz timezones to a common timezone instance (GH 24713)
- Timestamp and Timedelta scalars now implement the to_numpy() method as aliases to Timestamp.to_datetime64() and Timedelta.to_timedelta64(), respectively (GH 24653)
- Timestamp.strptime() will now raise a NotImplementedError (GH 25016)
- Comparing Timestamp with unsupported objects now returns NotImplemented instead of raising TypeError. This implies that unsupported rich comparisons are delegated to the other object, and are now consistent with Python 3 behavior for datetime objects (GH 24011)
- Bug in DatetimeIndex.snap() which didn't preserve the name of the input Index (GH 25575)
- The arg argument in DataFrameGroupBy.agg() has been renamed to func (GH 26089)
- The arg argument in Window.aggregate() has been renamed to func (GH 26372)
- Most pandas classes had a __bytes__ method, which was used for getting a python2-style bytestring representation of the object. This method has been removed as a part of dropping Python 2 (GH 26447)
- The .str-accessor has been disabled for 1-level MultiIndex, use MultiIndex.to_flat_index() if necessary (GH 23679)
- Removed support of the gtk package for clipboards (GH 26563)
- Using an unsupported version of Beautiful Soup 4 will now raise an ImportError instead of a ValueError (GH 27063)
- Series.to_excel() and DataFrame.to_excel() will now raise a ValueError when saving timezone-aware data (GH 27008, GH 7056)
- ExtensionArray.argsort() places NA values at the end of the sorted array (GH 21801)
- DataFrame.to_hdf() and Series.to_hdf() will now raise a NotImplementedError when saving a MultiIndex with extension data types for a fixed format (GH 7775)
- Passing duplicate names in read_csv() will now raise a ValueError (GH 17346)
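For example, the new to_numpy() aliases on the scalar types (a quick sketch of the second item above):

```python
import numpy as np
import pandas as pd

ts = pd.Timestamp("2019-07-18")
td = pd.Timedelta("1 day")

# to_numpy() mirrors to_datetime64() / to_timedelta64() (GH 24653)
assert isinstance(ts.to_numpy(), np.datetime64)
assert isinstance(td.to_numpy(), np.timedelta64)
assert ts.to_numpy() == ts.to_datetime64()
```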
Deprecations#
Sparse subclasses#
The SparseSeries and SparseDataFrame subclasses are deprecated. Their functionality is better provided by a Series or DataFrame with sparse values.
Previous way
df = pd.SparseDataFrame({"A": [0, 0, 1, 2]})
df.dtypes
New way
In [59]: df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 1, 2])})
In [60]: df.dtypes
Out[60]:
A Sparse[int64, 0]
dtype: object
The memory usage of the two approaches is identical (GH 19239).
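The sparse-values column can be inspected through the .sparse accessor; a minimal sketch (the density value depends on the data):

```python
import pandas as pd

df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 1, 2])})

# Only the two non-fill values are physically stored: density is 2/4.
assert df["A"].sparse.density == 0.5
# The underlying scalar dtype is carried on the SparseDtype.
assert df["A"].dtype.subtype == "int64"
```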
msgpack format#
The msgpack format is deprecated as of 0.25 and will be removed in a future version. It is recommended to use pyarrow for on-the-wire transmission of pandas objects. (GH 27084)
Other deprecations#
- The deprecated .ix[] indexer now raises a more visible FutureWarning instead of DeprecationWarning (GH 26438)
- Deprecated the units=M (months) and units=Y (year) parameters for units of pandas.to_timedelta(), pandas.Timedelta() and pandas.TimedeltaIndex() (GH 16344)
- pandas.concat() has deprecated the join_axes keyword. Instead, use DataFrame.reindex() or DataFrame.reindex_like() on the result or on the inputs (GH 21951)
- The SparseArray.values attribute is deprecated. You can use np.asarray(...) or the SparseArray.to_dense() method instead (GH 26421)
- The functions pandas.to_datetime() and pandas.to_timedelta() have deprecated the box keyword. Instead, use to_numpy() or Timestamp.to_datetime64() or Timedelta.to_timedelta64() (GH 24416)
- The DataFrame.compound() and Series.compound() methods are deprecated and will be removed in a future version (GH 26405)
- The internal attributes _start, _stop and _step of RangeIndex have been deprecated. Use the public attributes start, stop and step instead (GH 26581)
- The Series.ftype(), Series.ftypes() and DataFrame.ftypes() methods are deprecated and will be removed in a future version. Instead, use Series.dtype() and DataFrame.dtypes() (GH 26705)
- The Series.get_values(), DataFrame.get_values(), Index.get_values(), SparseArray.get_values() and Categorical.get_values() methods are deprecated. One of np.asarray(..) or to_numpy() can be used instead (GH 19617)
- The 'outer' method on NumPy ufuncs, e.g. np.subtract.outer, has been deprecated on Series objects. Convert the input to an array with Series.array first (GH 27186)
- Timedelta.resolution() is deprecated and replaced with Timedelta.resolution_string(). In a future version, Timedelta.resolution() will be changed to behave like the standard library datetime.timedelta.resolution (GH 21344)
- read_table() has been undeprecated (GH 25220)
- Index.dtype_str is deprecated (GH 18262)
- Series.imag and Series.real are deprecated (GH 18262)
- Series.put() is deprecated (GH 18262)
- Index.item() and Series.item() are deprecated (GH 18262)
- The default value ordered=None in CategoricalDtype has been deprecated in favor of ordered=False. When converting between categorical types, ordered=True must be explicitly passed in order to be preserved (GH 26336)
- Index.contains() is deprecated. Use key in index (__contains__) instead (GH 17753)
- DataFrame.get_dtype_counts() is deprecated (GH 18262)
- Categorical.ravel() will return a Categorical instead of a np.ndarray (GH 27199)
Removal of prior version deprecations/changes#
- Removed the previously deprecated sheetname keyword in read_excel() (GH 16442, GH 20938)
- Removed the previously deprecated TimeGrouper (GH 16942)
- Removed the previously deprecated parse_cols keyword in read_excel() (GH 16488)
- Removed the previously deprecated pd.options.html.border (GH 16970)
- Removed the previously deprecated convert_objects (GH 11221)
- Removed the previously deprecated select method of DataFrame and Series (GH 17633)
- Removed the previously deprecated behavior of Series treated as list-like in rename_categories() (GH 17982)
- Removed the previously deprecated DataFrame.reindex_axis and Series.reindex_axis (GH 17842)
- Removed the previously deprecated behavior of altering column or index labels with Series.rename_axis() or DataFrame.rename_axis() (GH 17842)
- Removed the previously deprecated tupleize_cols keyword argument in read_html(), read_csv(), and DataFrame.to_csv() (GH 17877, GH 17820)
- Removed the previously deprecated DataFrame.from_csv and Series.from_csv (GH 17812)
- Removed the previously deprecated raise_on_error keyword argument in DataFrame.where() and DataFrame.mask() (GH 17744)
- Removed the previously deprecated ordered and categories keyword arguments in astype (GH 17742)
- Removed the previously deprecated cdate_range (GH 17691)
- Removed the previously deprecated True option for the dropna keyword argument in SeriesGroupBy.nth() (GH 17493)
- Removed the previously deprecated convert keyword argument in Series.take() and DataFrame.take() (GH 17352)
- Removed the previously deprecated behavior of arithmetic operations with datetime.date objects (GH 21152)
Performance improvements#
- Significant speedup in SparseArray initialization that benefits most operations, fixing a performance regression introduced in v0.20.0 (GH 24985)
- DataFrame.to_stata() is now faster when outputting data with any string or non-native endian columns (GH 25045)
- Improved performance of Series.searchsorted(). The speedup is especially large when the dtype is int8/int16/int32 and the searched key is within the integer bounds for the dtype (GH 22034)
- Improved performance of GroupBy.quantile() (GH 20405)
- Improved performance of slicing and other selected operations on a RangeIndex (GH 26565, GH 26617, GH 26722)
- RangeIndex now performs standard lookups without instantiating an actual hashtable, hence saving memory (GH 16685)
- Improved performance of read_csv() by faster tokenizing and faster parsing of small float numbers (GH 25784)
- Improved performance of read_csv() by faster parsing of N/A and boolean values (GH 25804)
- Improved performance of IntervalIndex.is_monotonic, IntervalIndex.is_monotonic_increasing and IntervalIndex.is_monotonic_decreasing by removing conversion to MultiIndex (GH 24813)
- Improved performance of DataFrame.to_csv() when writing datetime dtypes (GH 25708)
- Improved performance of read_csv() by much faster parsing of MM/YYYY and DD/MM/YYYY datetime formats (GH 25922)
- Improved performance of nanops for dtypes that cannot store NaNs. The speedup is particularly prominent for Series.all() and Series.any() (GH 25070)
- Improved performance of Series.map() for dictionary mappers on categorical series by mapping the categories instead of mapping all values (GH 23785)
- Improved performance of IntervalIndex.intersection() (GH 24813)
- Improved performance of read_csv() by faster concatenation of date columns without extra conversion to string for integer/float zero and float NaN, and by faster checking whether a string could possibly be a date (GH 25754)
- Improved performance of IntervalIndex.is_unique by removing conversion to MultiIndex (GH 24813)
- Restored performance of DatetimeIndex.__iter__() by re-enabling a specialized code path (GH 26702)
- Improved performance when building a MultiIndex with at least one CategoricalIndex level (GH 22044)
- Improved performance by removing the need for a garbage collection when checking for SettingWithCopyWarning (GH 27031)
- For to_datetime(), changed the default value of the cache parameter to True (GH 26043)
- Improved performance of DatetimeIndex and PeriodIndex slicing given non-unique, monotonic data (GH 27136)
- Improved performance of pd.read_json() for index-oriented data (GH 26773)
- Improved performance of MultiIndex.shape() (GH 27384)
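One performance change above is user-visible as an API default: to_datetime() now caches parsed dates by default. A minimal sketch of the behavior (cache=True is spelled out here for clarity, though it is now the default):

```python
import pandas as pd

# With cache=True (now the default), each unique date string is parsed
# only once, which speeds up parsing of highly repetitive inputs.
dates = ["2019-07-18"] * 5 + ["2019-07-19"] * 5
parsed = pd.to_datetime(dates, cache=True)
print(parsed[0])  # 2019-07-18 00:00:00
```

The result is identical to cache=False; only the parsing work is deduplicated.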
Bug fixes#
Categorical#
- Bug in DataFrame.at() and Series.at() that would raise an exception if the index was a CategoricalIndex (GH 20629)
- Fixed bug in comparison of an ordered Categorical that contained missing values with a scalar, which sometimes incorrectly resulted in True (GH 26504)
- Bug in DataFrame.dropna() when the DataFrame has a CategoricalIndex containing Interval objects incorrectly raised a TypeError (GH 25087)
Datetimelike#
- Bug in to_datetime() which would raise an (incorrect) ValueError when called with a date far into the future and the format argument specified, instead of raising OutOfBoundsDatetime (GH 23830)
- Bug in to_datetime() which would raise InvalidIndexError: Reindexing only valid with uniquely valued Index objects when called with cache=True, with arg including at least two different elements from the set {None, numpy.nan, pandas.NaT} (GH 22305)
- Bug in DataFrame and Series where timezone-aware data with dtype='datetime64[ns]' was not cast to naive (GH 25843)
- Improved Timestamp type checking in various datetime functions to prevent exceptions when using a subclassed datetime (GH 25851)
- Bug in Series and DataFrame repr where np.datetime64('NaT') and np.timedelta64('NaT') with dtype=object would be represented as NaN (GH 25445)
- Bug in to_datetime() which did not replace an invalid argument with NaT when errors was set to 'coerce' (GH 26122)
- Bug in adding a DateOffset with a nonzero month to a DatetimeIndex, which would raise ValueError (GH 26258)
- Bug in to_datetime() which raised an unhandled OverflowError when called with a mix of invalid dates and NaN values with format='%Y%m%d' and errors='coerce' (GH 25512)
- Bug in isin() for datetimelike indexes (DatetimeIndex, TimedeltaIndex and PeriodIndex) where the levels parameter was ignored (GH 26675)
- Bug in to_datetime() which raised TypeError for format='%Y%m%d' when called for invalid integer dates with length >= 6 digits with errors='ignore'
- Bug when comparing a PeriodIndex against a zero-dimensional numpy array (GH 26689)
- Bug in constructing a Series or DataFrame from a numpy datetime64 array with a non-ns unit and out-of-bound timestamps generating rubbish data, which will now correctly raise an OutOfBoundsDatetime error (GH 26206)
- Bug in date_range() with an unnecessary OverflowError being raised for very large or very small dates (GH 26651)
- Bug where adding a Timestamp to a np.timedelta64 object would raise instead of returning a Timestamp (GH 24775)
- Bug where comparing a zero-dimensional numpy array containing a np.datetime64 object to a Timestamp would incorrectly raise TypeError (GH 26916)
- Bug in to_datetime() which would raise ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True when called with cache=True, with arg including datetime strings with different offsets (GH 26097)
Timedelta#
- Bug in TimedeltaIndex.intersection() where for non-monotonic indices in some cases an empty Index was returned when in fact an intersection existed (GH 25913)
- Bug with comparisons between Timedelta and NaT raising TypeError (GH 26039)
- Bug when adding or subtracting a BusinessHour to/from a Timestamp with the resulting time landing in a following or prior day respectively (GH 26381)
- Bug when comparing a TimedeltaIndex against a zero-dimensional numpy array (GH 26689)
Timezones#
- Bug in DatetimeIndex.to_frame() where timezone-aware data would be converted to timezone-naive data (GH 25809)
- Bug in to_datetime() with utc=True and datetime strings that would apply previously parsed UTC offsets to subsequent arguments (GH 24992)
- Bug in Timestamp.tz_localize() and Timestamp.tz_convert() not propagating freq (GH 25241)
- Bug in Series.at() where setting a Timestamp with a timezone raised TypeError (GH 25506)
- Bug in DataFrame.update() where updating with timezone-aware data would return timezone-naive data (GH 25807)
- Bug in to_datetime() where an uninformative RuntimeError was raised when passing a naive Timestamp with datetime strings with mixed UTC offsets (GH 25978)
- Bug in to_datetime() with unit='ns' that would drop timezone information from the parsed argument (GH 26168)
- Bug in DataFrame.join() where joining a timezone-aware index with a timezone-aware column would result in a column of NaN (GH 26335)
- Bug in date_range() where ambiguous or nonexistent start or end times were not handled by the ambiguous or nonexistent keywords respectively (GH 27088)
- Bug in DatetimeIndex.union() when combining a timezone-aware and a timezone-unaware DatetimeIndex (GH 21671)
- Bug when applying a numpy reduction function (e.g. numpy.minimum()) to a timezone-aware Series (GH 15552)
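Several of the timezone fixes above concern to_datetime() with mixed UTC offsets. A minimal sketch of the intended behavior: with utc=True, strings carrying different offsets are all normalized to UTC rather than leaking a previously parsed offset into later values:

```python
import pandas as pd

# Two strings with different UTC offsets; utc=True converts both to UTC.
idx = pd.to_datetime(
    ["2019-07-18 00:00:00+00:00", "2019-07-18 00:00:00+05:00"], utc=True
)
print(idx.tz)  # UTC
print(idx[1])  # 2019-07-17 19:00:00+00:00
```

Each element keeps its own instant in time; only the representation is unified to UTC.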
Numeric#
- Bug in to_numeric() in which large negative numbers were being improperly handled (GH 24910)
- Bug in to_numeric() in which numbers were being coerced to float, even though errors was not 'coerce' (GH 24910)
- Bug in to_numeric() in which invalid values for errors were being allowed (GH 26466)
- Bug in format in which floating point complex numbers were not being formatted to proper display precision and trimming (GH 25514)
- Bug in error messages in DataFrame.corr() and Series.corr(). Added the possibility of using a callable (GH 25729)
- Bug in Series.divmod() and Series.rdivmod() which would raise an (incorrect) ValueError rather than return a pair of Series objects as result (GH 25557)
- Raises a helpful exception when a non-numeric index is sent to interpolate() with methods which require a numeric index (GH 21662)
- Bug in eval() when comparing floats with scalar operators, for example: x < -0.1 (GH 25928)
- Fixed bug where casting an all-boolean array to an integer extension array failed (GH 25211)
- Bug in divmod with a Series object containing zeros incorrectly raising AttributeError (GH 26987)
- Inconsistency in Series floor-division (//) and divmod filling positive//zero with NaN instead of Inf (GH 27321)
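As a sketch of the Series.divmod() fix above (GH 25557), the method now returns a pair of Series rather than raising:

```python
import pandas as pd

s = pd.Series([7, 8, 9])
# divmod returns a 2-tuple: element-wise floor-division and remainder.
quotient, remainder = s.divmod(3)
print(quotient.tolist())   # [2, 2, 3]
print(remainder.tolist())  # [1, 2, 0]
```

This mirrors Python's built-in divmod(a, b) applied element-wise.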
Conversion#
- Bug in DataFrame.astype() where, when passing a dict of columns and types, the errors parameter was ignored (GH 25905)
Strings#
- Bug in the __name__ attribute of several methods of Series.str, which were set incorrectly (GH 23551)
- Improved error message when passing a Series of wrong dtype to Series.str.cat() (GH 22722)
Interval#
- Construction of Interval is restricted to numeric, Timestamp and Timedelta endpoints (GH 23013)
- Fixed bug in Series/DataFrame not displaying NaN in an IntervalIndex with missing values (GH 25984)
- Bug in IntervalIndex.get_loc() where a KeyError would be incorrectly raised for a decreasing IntervalIndex (GH 25860)
- Bug in the Index constructor where passing mixed-closed Interval objects would result in a ValueError instead of an object-dtype Index (GH 27172)
Indexing#
- Improved exception message when calling DataFrame.iloc() with a list of non-numeric objects (GH 25753)
- Improved exception message when calling .iloc or .loc with a boolean indexer of different length (GH 26658)
- Bug in the KeyError exception message when indexing a MultiIndex with a non-existent key, not displaying the original key (GH 27250)
- Bug in .iloc and .loc with a boolean indexer not raising an IndexError when too few items are passed (GH 26658)
- Bug in DataFrame.loc() and Series.loc() where a KeyError was not raised for a MultiIndex when the key was less than or equal to the number of levels in the MultiIndex (GH 14885)
- Bug in which DataFrame.append() produced an erroneous warning indicating that a KeyError will be thrown in the future when the data to be appended contains new columns (GH 22252)
- Bug in which DataFrame.to_csv() caused a segfault for a reindexed data frame, when the indices were a single-level MultiIndex (GH 26303)
- Fixed bug where assigning an arrays.PandasArray to a DataFrame would raise an error (GH 26390)
- Allow keyword arguments for a callable local reference used in the DataFrame.query() string (GH 26426)
- Fixed a KeyError when indexing a MultiIndex level with a list containing exactly one label, which is missing (GH 27148)
- Bug which produced AttributeError on partial matching of a Timestamp in a MultiIndex (GH 26944)
- Bug in Categorical and CategoricalIndex with Interval values when using the in operator (__contains__) with objects that are not comparable to the values in the Interval (GH 23705)
- Bug in DataFrame.loc() and DataFrame.iloc() on a DataFrame with a single timezone-aware datetime64[ns] column incorrectly returning a scalar instead of a Series (GH 27110)
- Bug in CategoricalIndex and Categorical incorrectly raising ValueError instead of TypeError when a list is passed using the in operator (__contains__) (GH 21729)
- Bug in setting a new value in a Series with a Timedelta object incorrectly casting the value to an integer (GH 22717)
- Bug in Series setting a new key (__setitem__) with a timezone-aware datetime incorrectly raising ValueError (GH 12862)
- Bug in DataFrame.iloc() when indexing with a read-only indexer (GH 17192)
- Bug in Series setting an existing tuple key (__setitem__) with timezone-aware datetime values incorrectly raising TypeError (GH 20441)
Missing#
- Fixed misleading exception message in Series.interpolate() if the order argument is required but omitted (GH 10633, GH 24014)
- Fixed class type displayed in exception message in DataFrame.dropna() if an invalid axis parameter is passed (GH 25555)
- A ValueError will now be thrown by DataFrame.fillna() when limit is not a positive integer (GH 27042)
MultiIndex#
- Bug in which an incorrect exception was raised by Timedelta when testing membership of a MultiIndex (GH 24570)
IO#
- Bug in DataFrame.to_html() where values were truncated using display options instead of outputting the full content (GH 17004)
- Fixed bug in missing text when using to_clipboard() if copying utf-16 characters in Python 3 on Windows (GH 25040)
- Bug in read_json() for orient='table' when it tries to infer dtypes by default, which is not applicable as dtypes are already defined in the JSON schema (GH 21345)
- Bug in read_json() for orient='table' and a float index, as it infers the index dtype by default, which is not applicable because the index dtype is already defined in the JSON schema (GH 25433)
- Bug in read_json() for orient='table' and strings of float column names, as it makes a column name type conversion to Timestamp, which is not applicable because column names are already defined in the JSON schema (GH 25435)
- Bug in json_normalize() for errors='ignore' where missing values in the input data were filled in the resulting DataFrame with the string "nan" instead of numpy.nan (GH 25468)
- DataFrame.to_html() now raises TypeError when using an invalid type for the classes parameter instead of AssertionError (GH 25608)
- Bug in DataFrame.to_string() and DataFrame.to_latex() that would lead to incorrect output when the header keyword is used (GH 16718)
- Bug in read_csv() not properly interpreting UTF8-encoded filenames on Windows on Python 3.6+ (GH 15086)
- Improved performance in pandas.read_stata() and pandas.io.stata.StataReader when converting columns that have missing values (GH 25772)
- Bug in DataFrame.to_html() where header numbers would ignore display options when rounding (GH 17280)
- Bug in read_hdf() where reading a table from an HDF5 file written directly with PyTables fails with a ValueError when using a sub-selection via the start or stop arguments (GH 11188)
- Bug in read_hdf() not properly closing the store after a KeyError is raised (GH 25766)
- Improved the explanation for the failure when value labels are repeated in Stata dta files and suggested workarounds (GH 25772)
- Improved pandas.read_stata() and pandas.io.stata.StataReader to read incorrectly formatted 118 format files saved by Stata (GH 25960)
- Improved the col_space parameter in DataFrame.to_html() to accept a string so CSS length values can be set correctly (GH 25941)
- Fixed bug in loading objects from S3 that contain # characters in the URL (GH 25945)
- Adds use_bqstorage_api parameter to read_gbq() to speed up downloads of large data frames. This feature requires version 0.10.0 of the pandas-gbq library as well as the google-cloud-bigquery-storage and fastavro libraries (GH 26104)
- Fixed memory leak in DataFrame.to_json() when dealing with numeric data (GH 24889)
- Bug in read_json() where date strings with Z were not converted to a UTC timezone (GH 26168)
- Added cache_dates=True parameter to read_csv(), which allows caching unique dates when they are parsed (GH 25990)
- DataFrame.to_excel() now raises a ValueError when the caller's dimensions exceed the limitations of Excel (GH 26051)
- Fixed bug in pandas.read_csv() where a BOM would result in incorrect parsing using engine='python' (GH 26545)
- read_excel() now raises a ValueError when the input is of type pandas.io.excel.ExcelFile and the engine param is passed, since pandas.io.excel.ExcelFile has an engine defined (GH 26566)
- Bug while selecting from HDFStore with where='' specified (GH 26610)
- Fixed bug in DataFrame.to_excel() where custom objects (i.e. PeriodIndex) inside merged cells were not being converted into types safe for the Excel writer (GH 27006)
- Bug in read_hdf() where reading a timezone-aware DatetimeIndex would raise a TypeError (GH 11926)
- Bug in to_msgpack() and read_msgpack() which would raise a ValueError rather than a FileNotFoundError for an invalid path (GH 27160)
- Fixed bug in DataFrame.to_parquet() which would raise a ValueError when the dataframe had no columns (GH 27339)
- Allow parsing of PeriodDtype columns when using read_csv() (GH 26934)
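As a sketch of the new cache_dates parameter in read_csv() mentioned above (GH 25990), repeated date strings are parsed once and reused:

```python
import io

import pandas as pd

# cache_dates=True (the default) caches unique date strings during parsing,
# which helps for files with many repeated dates.
csv = "day,value\n2019-07-18,1\n2019-07-18,2\n2019-07-19,3\n"
df = pd.read_csv(io.StringIO(csv), parse_dates=["day"], cache_dates=True)
print(df["day"].dtype)  # datetime64[ns]
```

The parsed result is identical with caching disabled; the flag only affects parsing speed.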
Plotting#
- Fixed bug where api.extensions.ExtensionArray could not be used in matplotlib plotting (GH 25587)
- Bug in an error message in DataFrame.plot(). Improved the error message if non-numerics are passed to DataFrame.plot() (GH 25481)
- Bug in incorrect ticklabel positions when plotting an index that is non-numeric / non-datetime (GH 7612, GH 15912, GH 22334)
- Fixed bug causing plots of PeriodIndex timeseries to fail if the frequency is a multiple of the frequency rule code (GH 14763)
- Fixed bug when plotting a DatetimeIndex with datetime.timezone.utc timezone (GH 17173)
GroupBy/resample/rolling#
- Bug in Resampler.agg() with a timezone-aware index where OverflowError would raise when passing a list of functions (GH 22660)
- Bug in DataFrameGroupBy.nunique() in which the names of column levels were lost (GH 23222)
- Bug in GroupBy.agg() when applying an aggregation function to timezone-aware data (GH 23683)
- Bug in GroupBy.first() and GroupBy.last() where timezone information would be dropped (GH 21603)
- Bug in GroupBy.size() when grouping only NA values (GH 23050)
- Bug in Series.groupby() where the observed kwarg was previously ignored (GH 24880)
- Bug in Series.groupby() where using groupby with a MultiIndex Series with a list of labels equal to the length of the series caused incorrect grouping (GH 25704)
- Ensured that ordering of outputs in groupby aggregation functions is consistent across all versions of Python (GH 25692)
- Ensured that result group order is correct when grouping on an ordered Categorical and specifying observed=True (GH 25871, GH 25167)
- Bug in Rolling.min() and Rolling.max() that caused a memory leak (GH 25893)
- Bug in Rolling.count() and Expanding.count() where the axis keyword was previously ignored (GH 13503)
- Bug in GroupBy.idxmax() and GroupBy.idxmin() with a datetime column returning an incorrect dtype (GH 25444, GH 15306)
- Bug in GroupBy.cumsum(), GroupBy.cumprod(), GroupBy.cummin() and GroupBy.cummax() with a categorical column having absent categories, which would return an incorrect result or segfault (GH 16771)
- Bug in GroupBy.nth() where NA values in the grouping would return incorrect results (GH 26011)
- Bug in SeriesGroupBy.transform() where transforming an empty group would raise a ValueError (GH 26208)
- Bug in DataFrame.groupby() where passing a Grouper would return incorrect groups when using the .groups accessor (GH 26326)
- Bug in GroupBy.agg() where incorrect results were returned for uint64 columns (GH 26310)
- Bug in Rolling.median() and Rolling.quantile() where a MemoryError was raised with an empty window (GH 26005)
- Bug in Rolling.median() and Rolling.quantile() where incorrect results were returned with closed='left' and closed='neither' (GH 26005)
- Improved Rolling, Window and ExponentialMovingWindow functions to exclude nuisance columns from results instead of raising errors, and to raise a DataError only if all columns are nuisance (GH 12537)
- Bug in Rolling.max() and Rolling.min() where incorrect results were returned with an empty variable window (GH 26005)
- Raise a helpful exception when an unsupported weighted window function is used as an argument of Window.aggregate() (GH 26597)
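As a sketch of the ordered-Categorical grouping fix above (GH 25871, GH 25167), result groups now follow category order when observed=True:

```python
import pandas as pd

cat = pd.Categorical(["b", "a", "b"], categories=["a", "b"], ordered=True)
df = pd.DataFrame({"g": cat, "x": [1, 2, 3]})

# With observed=True the result groups follow category order: "a" then "b".
out = df.groupby("g", observed=True)["x"].sum()
print(out.tolist())  # [2, 4]
```

Group "a" holds only x=2, while group "b" holds x=1 and x=3, summing to 4.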
Reshaping#
- Bug in pandas.merge() adding a string of None, if None is assigned in suffixes, instead of keeping the column name as-is (GH 24782)
- Bug in merge() when merging by index name would sometimes result in an incorrectly numbered index (missing index values are now assigned NA) (GH 24212, GH 25009)
- to_records() now accepts dtypes in its column_dtypes parameter (GH 24895)
- Bug in concat() where the order of an OrderedDict (and of a dict in Python 3.6+) was not respected when passed in as the objs argument (GH 21510)
- Bug in pivot_table() where columns with NaN values were dropped even if the dropna argument was False, when the aggfunc argument contained a list (GH 22159)
- Bug in concat() where the resulting freq of two DatetimeIndex with the same freq would be dropped (GH 3232)
- Bug in merge() where merging with equivalent Categorical dtypes was raising an error (GH 22501)
- Bug in the DataFrame constructor where instantiating with a dict of iterators or generators (e.g. pd.DataFrame({'A': reversed(range(3))})) raised an error (GH 26349)
- Bug in the DataFrame constructor where instantiating with a range (e.g. pd.DataFrame(range(3))) raised an error (GH 26342)
- Bug in the DataFrame constructor where passing non-empty tuples would cause a segmentation fault (GH 25691)
- Bug in Series.apply() that failed when the series is a timezone-aware DatetimeIndex (GH 25959)
- Bug in pandas.cut() where large bins could incorrectly raise an error due to an integer overflow (GH 26045)
- Bug in DataFrame.sort_index() where an error is thrown when a multi-indexed DataFrame is sorted on all levels with the initial level sorted last (GH 26053)
- Bug in Series.nlargest() treating True as smaller than False (GH 26154)
- Bug in DataFrame.pivot_table() with an IntervalIndex as pivot index that would raise a TypeError (GH 25814)
- Bug in which DataFrame.from_dict() ignored the order of an OrderedDict when orient='index' (GH 8425)
- Bug in DataFrame.transpose() where transposing a DataFrame with a timezone-aware datetime column would incorrectly raise ValueError (GH 26825)
- Bug in pivot_table() where pivoting a timezone-aware column as the values would remove timezone information (GH 14948)
- Bug in merge_asof() when specifying multiple by columns where one is datetime64[ns, tz] dtype (GH 26649)
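As a sketch of the constructor fix above (GH 26349), a dict of iterators or generators is now accepted by the DataFrame constructor:

```python
import pandas as pd

# Previously this raised; iterators in a dict are now materialized into columns.
df = pd.DataFrame({"A": reversed(range(3))})
print(df["A"].tolist())  # [2, 1, 0]
```

The iterator is consumed once and its values become the column, just as a list would.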
Sparse#
- Significant speedup in SparseArray initialization that benefits most operations, fixing a performance regression introduced in v0.20.0 (GH 24985)
- Bug in the SparseFrame constructor where passing None as the data would cause default_fill_value to be ignored (GH 16807)
- Bug in SparseDataFrame when adding a column in which the length of values does not match the length of the index: an AssertionError is raised instead of a ValueError (GH 25484)
- Introduce a better error message in Series.sparse.from_coo() so it returns a TypeError for inputs that are not coo matrices (GH 26554)
- Bug in numpy.modf() on a SparseArray. Now a tuple of SparseArray is returned (GH 26946)
Build changes#
- Fix install error with PyPy on macOS (GH 26536)
ExtensionArray#
- Bug in factorize() when passing an ExtensionArray with a custom na_sentinel (GH 25696)
- Series.count() miscounts NA values in ExtensionArrays (GH 26835)
- Added Series.__array_ufunc__ to better handle NumPy ufuncs applied to Series backed by extension arrays (GH 23293)
- Keyword argument deep has been removed from ExtensionArray.copy() (GH 27083)
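As a sketch of the Series.count() fix for ExtensionArrays above (GH 26835), NA values in an extension-backed Series are now excluded correctly. The nullable Int64 dtype is used here purely as an illustrative extension dtype:

```python
import pandas as pd

# "Int64" (capital I) is a nullable extension dtype; None becomes a missing value.
s = pd.Series([1, 2, None], dtype="Int64")
print(s.count())  # 2
```

count() returns the number of non-missing values, so the None is not counted.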
Other#
- Removed unused C functions from the vendored UltraJSON implementation (GH 26198)
- Allow Index and RangeIndex to be passed to the numpy min and max functions (GH 26125)
- Use the actual class name in the repr of empty objects of a Series subclass (GH 27001)
- Bug in DataFrame where passing an object array of timezone-aware datetime objects would incorrectly raise ValueError (GH 13287)
Contributors#
A total of 231 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
1_x7 +
Abdullah İhsan Seçer +
Adam Bull +
Adam Hooper
Albert Villanova del Moral
Alex Watt +
AlexTereshenkov +
Alexander Buchkovsky
Alexander Hendorf +
Alexander Nordin +
Alexander Ponomaroff
Alexandre Batisse +
Alexandre Decan +
Allen Downey +
Alyssa Fu Ward +
Andrew Gaspari +
Andrew Wood +
Antoine Viscardi +
Antonio Gutierrez +
Arno Veenstra +
ArtinSarraf
Batalex +
Baurzhan Muftakhidinov
Benjamin Rowell
Bharat Raghunathan +
Bhavani Ravi +
Big Head +
Brett Randall +
Bryan Cutler +
C John Klehm +
Caleb Braun +
Cecilia +
Chris Bertinato +
Chris Stadler +
Christian Haege +
Christian Hudon
Christopher Whelan
Chuanzhu Xu +
Clemens Brunner
Damian Kula +
Daniel Hrisca +
Daniel Luis Costa +
Daniel Saxton
DanielFEvans +
David Liu +
Deepyaman Datta +
Denis Belavin +
Devin Petersohn +
Diane Trout +
EdAbati +
Enrico Rotundo +
EternalLearner42 +
Evan +
Evan Livelo +
Fabian Rost +
Flavien Lambert +
Florian Rathgeber +
Frank Hoang +
Gaibo Zhang +
Gioia Ballin
Giuseppe Romagnuolo +
Gordon Blackadder +
Gregory Rome +
Guillaume Gay
HHest +
Hielke Walinga +
How Si Wei +
Hubert
Huize Wang +
Hyukjin Kwon +
Ian Dunn +
Inevitable-Marzipan +
Irv Lustig
JElfner +
Jacob Bundgaard +
James Cobon-Kerr +
Jan-Philip Gehrcke +
Jarrod Millman +
Jayanth Katuri +
Jeff Reback
Jeremy Schendel
Jiang Yue +
Joel Ostblom
Johan von Forstner +
Johnny Chiu +
Jonas +
Jonathon Vandezande +
Jop Vermeer +
Joris Van den Bossche
Josh
Josh Friedlander +
Justin Zheng
Kaiqi Dong
Kane +
Kapil Patel +
Kara de la Marck +
Katherine Surta +
Katrin Leinweber +
Kendall Masse
Kevin Sheppard
Kyle Kosic +
Lorenzo Stella +
Maarten Rietbergen +
Mak Sze Chun
Marc Garcia
Mateusz Woś
Matias Heikkilä
Mats Maiwald +
Matthew Roeschke
Max Bolingbroke +
Max Kovalovs +
Max van Deursen +
Michael
Michael Davis +
Michael P. Moran +
Mike Cramblett +
Min ho Kim +
Misha Veldhoen +
Mukul Ashwath Ram +
MusTheDataGuy +
Nanda H Krishna +
Nicholas Musolino
Noam Hershtig +
Noora Husseini +
Paul
Paul Reidy
Pauli Virtanen
Pav A +
Peter Leimbigler +
Philippe Ombredanne +
Pietro Battiston
Richard Eames +
Roman Yurchak
Ruijing Li
Ryan
Ryan Joyce +
Ryan Nazareth
Ryan Rehman +
Sakar Panta +
Samuel Sinayoko
Sandeep Pathak +
Sangwoong Yoon
Saurav Chakravorty
Scott Talbert +
Sergey Kopylov +
Shantanu Gontia +
Shivam Rana +
Shorokhov Sergey +
Simon Hawkins
Soyoun(Rose) Kim
Stephan Hoyer
Stephen Cowley +
Stephen Rauch
Sterling Paramore +
Steven +
Stijn Van Hoey
Sumanau Sareen +
Takuya N +
Tan Tran +
Tao He +
Tarbo Fukazawa
Terji Petersen +
Thein Oo
ThibTrip +
Thijs Damsma +
Thiviyan Thanapalasingam
Thomas A Caswell
Thomas Kluiters +
Tilen Kusterle +
Tim Gates +
Tim Hoffmann
Tim Swast
Tom Augspurger
Tom Neep +
Tomáš Chvátal +
Tyler Reddy
Vaibhav Vishal +
Vasily Litvinov +
Vibhu Agarwal +
Vikramjeet Das +
Vladislav +
Víctor Moron Tejero +
Wenhuan
Will Ayd +
William Ayd
Wouter De Coster +
Yoann Goular +
Zach Angell +
alimcmaster1
anmyachev +
chris-b1
danielplawrence +
endenis +
enisnazif +
ezcitron +
fjetter
froessler
gfyoung
gwrome +
h-vetinari
haison +
hannah-c +
heckeop +
iamshwin +
jamesoliverh +
jbrockmendel
jkovacevic +
killerontherun1 +
knuu +
kpapdac +
kpflugshaupt +
krsnik93 +
leerssej +
lrjball +
mazayo +
nathalier +
nrebena +
nullptr +
pilkibun +
pmaxey83 +
rbenes +
robbuckley
shawnbrown +
sudhir mohanraj +
tadeja +
tamuhey +
thatneat
topper-123
willweil +
yehia67 +
yhaque1213 +