Warning
Starting with the 0.25.x series of releases, pandas only supports Python 3.5.3 and higher. See Dropping Python 2.7 for more details.
The minimum supported Python version will be bumped to 3.6 in a future release.
Panel has been fully removed. For N-D labeled data structures, please use xarray.
read_pickle() and read_msgpack() are only guaranteed backwards compatible back to pandas version 0.20.3 (GH27082)
These are the changes in pandas 0.25.0. See Release Notes for a full changelog including other versions of pandas.
Pandas has added special groupby behavior, known as “named aggregation”, for naming the output columns when applying multiple aggregation functions to specific columns (GH18366, GH26512).
In [1]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
   ...:                         'height': [9.1, 6.0, 9.5, 34.0],
   ...:                         'weight': [7.9, 7.5, 9.9, 198.0]})

In [2]: animals
Out[2]:
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

[4 rows x 3 columns]

In [3]: animals.groupby("kind").agg(
   ...:     min_height=pd.NamedAgg(column='height', aggfunc='min'),
   ...:     max_height=pd.NamedAgg(column='height', aggfunc='max'),
   ...:     average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
   ...: )
Out[3]:
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

[2 rows x 3 columns]
Pass the desired column names as the **kwargs to .agg. The values of **kwargs should be tuples where the first element is the column selection and the second element is the aggregation function to apply. Pandas provides the pandas.NamedAgg namedtuple to make it clearer what the arguments to the function are, but plain tuples are accepted as well.
In [4]: animals.groupby("kind").agg(
   ...:     min_height=('height', 'min'),
   ...:     max_height=('height', 'max'),
   ...:     average_weight=('weight', np.mean),
   ...: )
Out[4]:
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

[2 rows x 3 columns]
Named aggregation is the recommended replacement for the deprecated “dict-of-dicts” approach to naming the output of column-specific aggregations (Deprecate groupby.agg() with a dictionary when renaming).
A similar approach is now available for Series groupby objects as well. Because there's no need for column selection, the values can just be the functions to apply.
In [5]: animals.groupby("kind").height.agg(
   ...:     min_height="min",
   ...:     max_height="max",
   ...: )
Out[5]:
      min_height  max_height
kind
cat          9.1         9.5
dog          6.0        34.0

[2 rows x 2 columns]
This type of aggregation is the recommended alternative to the deprecated behavior when passing a dict to a Series groupby aggregation (Deprecate groupby.agg() with a dictionary when renaming).
See Named aggregation for more.
You can now provide multiple lambda functions to a list-like aggregation in pandas.core.groupby.GroupBy.agg (GH26430).
In [6]: animals.groupby('kind').height.agg([
   ...:     lambda x: x.iloc[0], lambda x: x.iloc[-1]
   ...: ])
Out[6]:
      <lambda_0>  <lambda_1>
kind
cat          9.1         9.5
dog          6.0        34.0

[2 rows x 2 columns]

In [7]: animals.groupby('kind').agg([
   ...:     lambda x: x.iloc[0] - x.iloc[1],
   ...:     lambda x: x.iloc[0] + x.iloc[1]
   ...: ])
Out[7]:
         height                 weight
     <lambda_0> <lambda_1>  <lambda_0> <lambda_1>
kind
cat        -0.4       18.6        -2.0       17.8
dog       -28.0       40.0      -190.5      205.5

[2 rows x 4 columns]
Previously, these raised a SpecificationError.
Printing of MultiIndex instances now shows tuples of each row and ensures that the tuple items are vertically aligned, so it’s now easier to understand the structure of the MultiIndex. (GH13480):
The repr now looks like this:
In [8]: pd.MultiIndex.from_product([['a', 'abc'], range(500)])
Out[8]:
MultiIndex([(  'a',   0),
            (  'a',   1),
            (  'a',   2),
            (  'a',   3),
            (  'a',   4),
            (  'a',   5),
            (  'a',   6),
            (  'a',   7),
            (  'a',   8),
            (  'a',   9),
            ...
            ('abc', 490),
            ('abc', 491),
            ('abc', 492),
            ('abc', 493),
            ('abc', 494),
            ('abc', 495),
            ('abc', 496),
            ('abc', 497),
            ('abc', 498),
            ('abc', 499)],
           length=1000)
Previously, outputting a MultiIndex printed all the levels and codes of the MultiIndex, which was visually unappealing and made the output more difficult to navigate. For example (limiting the range to 5):
In [1]: pd.MultiIndex.from_product([['a', 'abc'], range(5)])
Out[1]: MultiIndex(levels=[['a', 'abc'], [0, 1, 2, 3, 4]],
                   codes=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]])
In the new repr, all values will be shown if the number of rows is smaller than options.display.max_seq_items (default: 100 items). Horizontally, the output will truncate if it's wider than options.display.width (default: 80 characters).
Currently, the default display options of pandas ensure that when a Series or DataFrame has more than 60 rows, its repr gets truncated to this maximum of 60 rows (the display.max_rows option). However, this still gives a repr that takes up a large part of the vertical screen real estate. Therefore, a new option display.min_rows is introduced with a default of 10, which determines the number of rows shown in the truncated repr:
For small Series or DataFrames, up to max_rows number of rows is shown (default: 60).
For larger Series or DataFrames with a length above max_rows, only min_rows number of rows is shown (default: 10, i.e. the first and last 5 rows).
This dual option makes it possible to still see the full content of relatively small objects (e.g. df.head(20) shows all 20 rows), while giving a brief repr for large objects.
To restore the previous behaviour of a single threshold, set pd.options.display.min_rows = None.
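The interaction of the two thresholds can be sketched as follows (a minimal illustration using the option names described above; the chosen values are arbitrary):

```python
import pandas as pd

s = pd.Series(range(100))

# The Series has more rows than display.max_rows, so the repr is truncated
# to display.min_rows rows (here 4: the first and last 2).
with pd.option_context("display.max_rows", 60, "display.min_rows", 4):
    r = repr(s)

print(r)
```

A Series shorter than `max_rows` would still be printed in full; `min_rows` only kicks in once truncation happens.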
json_normalize() normalizes the provided input dict to all nested levels. The new max_level parameter provides more control over which level to end normalization (GH23843):
from pandas.io.json import json_normalize
data = [{
    'CreatedBy': {'Name': 'User001'},
    'Lookup': {'TextField': 'Some text',
               'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
    'Image': {'a': 'b'}
}]
json_normalize(data, max_level=1)
Series and DataFrame have gained the DataFrame.explode() methods to transform list-likes to individual rows. See section on Exploding list-like column in docs for more information (GH16538, GH10511)
Here is a typical use case: you have a comma-separated string in a column.
In [9]: df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
   ...:                    {'var1': 'd,e,f', 'var2': 2}])

In [10]: df
Out[10]:
    var1  var2
0  a,b,c     1
1  d,e,f     2

[2 rows x 2 columns]
Creating a long-form DataFrame is now straightforward using chained operations:
In [11]: df.assign(var1=df.var1.str.split(',')).explode('var1')
Out[11]:
  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

[6 rows x 2 columns]
DataFrame.plot() keywords logy, logx and loglog can now accept the value 'sym' for symlog scaling. (GH24867)
Added support for ISO week year format (‘%G-%V-%u’) when parsing datetimes using to_datetime() (GH16607)
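As a brief sketch of the ISO week date format (the example date is my own; ISO year 2019, week 01, weekday 1 is the Monday before the calendar new year):

```python
import pandas as pd

# %G = ISO year, %V = ISO week number, %u = ISO weekday (1 = Monday).
# ISO week 2019-W01 starts on Monday 2018-12-31.
ts = pd.to_datetime("2019-01-1", format="%G-%V-%u")
print(ts)  # 2018-12-31 00:00:00
```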
Indexing of DataFrame and Series now accepts zerodim np.ndarray (GH24919)
Timestamp.replace() now supports the fold argument to disambiguate DST transition times (GH25017)
DataFrame.at_time() and Series.at_time() now support datetime.time objects with timezones (GH24043)
DataFrame.pivot_table() now accepts an observed parameter which is passed to underlying calls to DataFrame.groupby() to speed up grouping categorical data. (GH24923)
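A minimal sketch of the effect of the observed keyword (the column names here are illustrative, not from the release notes):

```python
import pandas as pd

df = pd.DataFrame({
    "cat": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
    "value": [1, 2, 3],
})

# observed=True restricts the result to categories actually present in the
# data ("a" and "b"), skipping the work for the unobserved category "c".
wide = pd.pivot_table(df, index="cat", values="value", observed=True)
print(wide)
```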
Series.str has gained a Series.str.casefold() method to remove all case distinctions present in a string (GH25405)
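Casefolding is a more aggressive form of lowercasing; for example (a small sketch), it handles mappings such as the German eszett that str.lower misses:

```python
import pandas as pd

s = pd.Series(["lower", "CAPITALS", "Straße"])

print(s.str.lower())     # "Straße" keeps its "ß"
print(s.str.casefold())  # "Straße" becomes "strasse"
```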
DataFrame.set_index() now works for instances of abc.Iterator, provided their output is of the same length as the calling frame (GH22484, GH24984)
DatetimeIndex.union() now supports the sort argument. The behavior of the sort parameter matches that of Index.union() (GH24994)
RangeIndex.union() now supports the sort argument. If sort=False an unsorted Int64Index is always returned. sort=None is the default and returns a monotonically increasing RangeIndex if possible or a sorted Int64Index if not (GH24471)
TimedeltaIndex.intersection() now also supports the sort keyword (GH24471)
DataFrame.rename() now supports the errors argument to raise errors when attempting to rename nonexistent keys (GH13473)
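A minimal sketch of the new keyword (labels chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})

# The default (errors='ignore') silently skips unknown labels...
renamed = df.rename(columns={"a": "x", "missing": "y"})
print(list(renamed.columns))  # ['x', 'b']

# ...while errors='raise' turns them into a KeyError.
try:
    df.rename(columns={"missing": "y"}, errors="raise")
except KeyError as exc:
    print("raised:", exc)
```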
Added Sparse accessor for working with a DataFrame whose values are sparse (GH25681)
RangeIndex has gained start, stop, and step attributes (GH25710)
datetime.timezone objects are now supported as arguments to timezone methods and constructors (GH25065)
DataFrame.query() and DataFrame.eval() now support quoting column names with backticks to refer to names with spaces (GH6508)
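For instance (a small sketch with an illustrative column name):

```python
import pandas as pd

df = pd.DataFrame({"column name with spaces": [1, 2, 3], "b": [4, 5, 6]})

# Backticks let query() refer to a column whose name is not a valid
# Python identifier.
result = df.query("`column name with spaces` > 1")
print(result)
```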
merge_asof() now gives a clearer error message when merge keys are categoricals that are not equal (GH26136)
pandas.core.window.Rolling() supports exponential (or Poisson) window type (GH21303)
Error message for missing required imports now includes the original import error’s text (GH23868)
DatetimeIndex and TimedeltaIndex now have a mean method (GH24757)
DataFrame.describe() now formats integer percentiles without decimal point (GH26660)
Added support for reading SPSS .sav files using read_spss() (GH26537)
Added new option plotting.backend to be able to select a plotting backend different from the existing matplotlib one. Use pandas.set_option('plotting.backend', '<backend-module>'), where <backend-module> is a library implementing the pandas plotting API (GH14130)
pandas.offsets.BusinessHour supports multiple opening hours intervals (GH15481)
read_excel() can now use openpyxl to read Excel files via the engine='openpyxl' argument. This will become the default in a future release (GH11499)
pandas.io.excel.read_excel() supports reading OpenDocument tables. Specify engine='odf' to enable. Consult the IO User Guide for more details (GH9070)
Interval, IntervalIndex, and IntervalArray have gained an is_empty attribute denoting if the given interval(s) are empty (GH27219)
Indexing a DataFrame or Series with a DatetimeIndex with a date string with a UTC offset would previously ignore the UTC offset. Now, the UTC offset is respected in indexing. (GH24076, GH16785)
In [12]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))

In [13]: df
Out[13]:
                           0
2019-01-01 00:00:00-08:00  0

[1 rows x 1 columns]
Previous behavior:
In [3]: df['2019-01-01 00:00:00+04:00':'2019-01-01 01:00:00+04:00']
Out[3]:
                           0
2019-01-01 00:00:00-08:00  0
New behavior:
In [14]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
Out[14]:
                           0
2019-01-01 00:00:00-08:00  0

[1 rows x 1 columns]
Constructing a MultiIndex with NaN levels or with codes values < -1 was allowed previously. Now, construction with codes values < -1 is not allowed, and the codes corresponding to NaN levels are reassigned to -1. (GH19387)

Previous behavior:

In [1]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
   ...:               codes=[[0, -1, 1, 2, 3, 4]])
Out[1]: MultiIndex(levels=[[nan, None, NaT, 128, 2]],
                   codes=[[0, -1, 1, 2, 3, 4]])

In [2]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
Out[2]: MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
New behavior:

In [15]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
   ....:               codes=[[0, -1, 1, 2, 3, 4]])
Out[15]:
MultiIndex([(nan,),
            (nan,),
            (nan,),
            (nan,),
            (128,),
            (  2,)],
           )

In [16]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
---------------------------------------------------------------------------
ValueError: On level 0, code value (-2) < -1
Groupby.apply on DataFrame evaluates first group only once
The implementation of DataFrameGroupBy.apply() previously evaluated the supplied function consistently twice on the first group to infer if it is safe to use a fast code path. Particularly for functions with side effects, this was an undesired behavior and may have led to surprises. (GH2936, GH2656, GH7739, GH10519, GH12155, GH20084, GH21417)
Now every group is evaluated only a single time.
In [17]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [18]: df
Out[18]:
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]

In [19]: def func(group):
   ....:     print(group.name)
   ....:     return group
Previous behavior:

In [3]: df.groupby('a').apply(func)
x
x
y
Out[3]:
   a  b
0  x  1
1  y  2
New behavior:

In [20]: df.groupby("a").apply(func)
x
y
Out[20]:
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]
When passed DataFrames whose values are sparse, concat() will now return a Series or DataFrame with sparse values, rather than a SparseDataFrame (GH25702).
In [21]: df = pd.DataFrame({"A": pd.SparseArray([0, 1])})
Previous behavior:

In [2]: type(pd.concat([df, df]))
pandas.core.sparse.frame.SparseDataFrame
New behavior:

In [22]: type(pd.concat([df, df]))
Out[22]: pandas.core.frame.DataFrame
This now matches the existing behavior of concat on Series with sparse values. concat() will continue to return a SparseDataFrame when all the values are instances of SparseDataFrame.
This change also affects routines using concat() internally, like get_dummies(), which now returns a DataFrame in all cases (previously a SparseDataFrame was returned if all the columns were dummy encoded, and a DataFrame otherwise).
Providing any SparseSeries or SparseDataFrame to concat() will cause a SparseSeries or SparseDataFrame to be returned, as before.
The .str-accessor performs stricter type checks

Due to the lack of more fine-grained dtypes, Series.str so far only checked whether the data was of object dtype. Series.str will now infer the dtype of the data within the Series; in particular, 'bytes'-only data will raise an exception (except for Series.str.decode(), Series.str.get(), Series.str.len(), Series.str.slice()), see GH23163, GH23011, GH23551.
Previous behavior:

In [1]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

In [2]: s
Out[2]:
0      b'a'
1     b'ba'
2    b'cba'
dtype: object

In [3]: s.str.startswith(b'a')
Out[3]:
0     True
1    False
2    False
dtype: bool
New behavior:

In [23]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

In [24]: s
Out[24]:
0      b'a'
1     b'ba'
2    b'cba'
Length: 3, dtype: object

In [25]: s.str.startswith(b'a')
---------------------------------------------------------------------------
TypeError: Cannot use .str.startswith with values of inferred dtype 'bytes'.
Previously, columns that were categorical, but not the groupby key(s), would be converted to object dtype during groupby operations. Pandas now will preserve these dtypes. (GH18502)
In [26]: cat = pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)

In [27]: df = pd.DataFrame({'payload': [-1, -2, -1, -2], 'col': cat})

In [28]: df
Out[28]:
   payload  col
0       -1  foo
1       -2  bar
2       -1  bar
3       -2  qux

[4 rows x 2 columns]

In [29]: df.dtypes
Out[29]:
payload       int64
col        category
Length: 2, dtype: object
Previous Behavior:
In [5]: df.groupby('payload').first().col.dtype
Out[5]: dtype('O')
New Behavior:
In [30]: df.groupby('payload').first().col.dtype
Out[30]: CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True)
When performing Index.union() operations between objects of incompatible dtypes, the result will be a base Index of dtype object. This behavior holds true for unions between Index objects that previously would have been prohibited. The dtype of empty Index objects will now be evaluated before performing union operations rather than simply returning the other Index object. Index.union() can now be considered commutative, such that A.union(B) == B.union(A) (GH23525).
Previous behavior:

In [1]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
...
ValueError: can only call with other PeriodIndex-ed objects

In [2]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[2]: Int64Index([1, 2, 3], dtype='int64')
New behavior:

In [31]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
Out[31]: Index([1991-09-05, 1991-09-06, 1, 2, 3], dtype='object')

In [32]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[32]: Index([1, 2, 3], dtype='object')
Note that integer- and floating-dtype indexes are considered “compatible”. The integer values are coerced to floating point, which may result in loss of precision. See Set operations on Index objects for more.
The methods ffill, bfill, pad and backfill of DataFrameGroupBy previously included the group labels in the return value, which was inconsistent with other groupby transforms. Now only the filled values are returned. (GH21521)
In [33]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [34]: df
Out[34]:
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]
Previous behavior:

In [3]: df.groupby("a").ffill()
Out[3]:
   a  b
0  x  1
1  y  2
New behavior:

In [35]: df.groupby("a").ffill()
Out[35]:
   b
0  1
1  2

[2 rows x 1 columns]
When calling DataFrame.describe() with an empty categorical / object column, the ‘top’ and ‘freq’ columns were previously omitted, which was inconsistent with the output for non-empty columns. Now the ‘top’ and ‘freq’ columns will always be included, with numpy.nan in the case of an empty DataFrame (GH26397)
In [36]: df = pd.DataFrame({"empty_col": pd.Categorical([])})

In [37]: df
Out[37]:
Empty DataFrame
Columns: [empty_col]
Index: []

[0 rows x 1 columns]
Previous behavior:

In [3]: df.describe()
Out[3]:
       empty_col
count          0
unique         0
New behavior:

In [38]: df.describe()
Out[38]:
       empty_col
count          0
unique         0
top          NaN
freq         NaN

[4 rows x 1 columns]
__str__ methods now call __repr__ rather than vice versa

Pandas has until now mostly defined string representations in a Pandas object's __str__/__unicode__/__bytes__ methods, and called __str__ from the __repr__ method if a specific __repr__ method was not found. This is not needed for Python 3. In Pandas 0.25, the string representations of Pandas objects are now generally defined in __repr__, and calls to __str__ in general now pass the call on to __repr__ if a specific __str__ method doesn't exist, as is standard for Python. This change is backward compatible for direct usage of Pandas, but if you subclass Pandas objects and give your subclasses specific __str__/__repr__ methods, you may have to adjust them (GH26495).
Indexing methods for IntervalIndex have been modified to require exact matches only for Interval queries. IntervalIndex methods previously matched on any overlapping Interval. Behavior with scalar points, e.g. querying with an integer, is unchanged (GH16316).
In [39]: ii = pd.IntervalIndex.from_tuples([(0, 4), (1, 5), (5, 8)])

In [40]: ii
Out[40]:
IntervalIndex([(0, 4], (1, 5], (5, 8]],
              closed='right',
              dtype='interval[int64]')
The in operator (__contains__) now only returns True for exact matches to Intervals in the IntervalIndex, whereas this would previously return True for any Interval overlapping an Interval in the IntervalIndex.
Previous behavior:

In [4]: pd.Interval(1, 2, closed='neither') in ii
Out[4]: True

In [5]: pd.Interval(-10, 10, closed='both') in ii
Out[5]: True
New behavior:

In [41]: pd.Interval(1, 2, closed='neither') in ii
Out[41]: False

In [42]: pd.Interval(-10, 10, closed='both') in ii
Out[42]: False
The get_loc() method now only returns locations for exact matches to Interval queries, as opposed to the previous behavior of returning locations for overlapping matches. A KeyError will be raised if an exact match is not found.
Previous behavior:

In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: array([0, 1])

In [7]: ii.get_loc(pd.Interval(2, 6))
Out[7]: array([0, 1, 2])
New behavior:

In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: 1

In [7]: ii.get_loc(pd.Interval(2, 6))
---------------------------------------------------------------------------
KeyError: Interval(2, 6, closed='right')
Likewise, get_indexer() and get_indexer_non_unique() will also only return locations for exact matches to Interval queries, with -1 denoting that an exact match was not found.
These indexing changes extend to querying a Series or DataFrame with an IntervalIndex index.
In [43]: s = pd.Series(list('abc'), index=ii)

In [44]: s
Out[44]:
(0, 4]    a
(1, 5]    b
(5, 8]    c
Length: 3, dtype: object
Selecting from a Series or DataFrame using [] (__getitem__) or loc now only returns exact matches for Interval queries.
Previous behavior:

In [8]: s[pd.Interval(1, 5)]
Out[8]:
(0, 4]    a
(1, 5]    b
dtype: object

In [9]: s.loc[pd.Interval(1, 5)]
Out[9]:
(0, 4]    a
(1, 5]    b
dtype: object
New behavior:

In [45]: s[pd.Interval(1, 5)]
Out[45]: 'b'

In [46]: s.loc[pd.Interval(1, 5)]
Out[46]: 'b'
Similarly, a KeyError will be raised for non-exact matches instead of returning overlapping matches.
Previous behavior:

In [9]: s[pd.Interval(2, 3)]
Out[9]:
(0, 4]    a
(1, 5]    b
dtype: object

In [10]: s.loc[pd.Interval(2, 3)]
Out[10]:
(0, 4]    a
(1, 5]    b
dtype: object
New behavior:

In [6]: s[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')

In [7]: s.loc[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')
The overlaps() method can be used to create a boolean indexer that replicates the previous behavior of returning overlapping matches.
In [47]: idxr = s.index.overlaps(pd.Interval(2, 3))

In [48]: idxr
Out[48]: array([ True,  True, False])

In [49]: s[idxr]
Out[49]:
(0, 4]    a
(1, 5]    b
Length: 2, dtype: object

In [50]: s.loc[idxr]
Out[50]:
(0, 4]    a
(1, 5]    b
Length: 2, dtype: object
Applying a binary ufunc like numpy.power() now aligns the inputs when both are Series (GH23293).
In [51]: s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [52]: s2 = pd.Series([3, 4, 5], index=['d', 'c', 'b'])

In [53]: s1
Out[53]:
a    1
b    2
c    3
Length: 3, dtype: int64

In [54]: s2
Out[54]:
d    3
c    4
b    5
Length: 3, dtype: int64
Previous behavior
In [5]: np.power(s1, s2)
Out[5]:
a      1
b     16
c    243
dtype: int64
New behavior
In [55]: np.power(s1, s2)
Out[55]:
a     1.0
b    32.0
c    81.0
d     NaN
Length: 4, dtype: float64
This matches the behavior of other binary operations in pandas, like Series.add(). To retain the previous behavior, convert the other Series to an array before applying the ufunc.
In [56]: np.power(s1, s2.array)
Out[56]:
a      1
b     16
c    243
Length: 3, dtype: int64
Categorical.argsort() now places missing values at the end of the array, making it consistent with NumPy and the rest of pandas (GH21801).
In [57]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)
Previous behavior:

In [2]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)

In [3]: cat.argsort()
Out[3]: array([1, 2, 0])

In [4]: cat[cat.argsort()]
Out[4]:
[NaN, a, b]
Categories (2, object): [a < b]
New behavior:

In [58]: cat.argsort()
Out[58]: array([2, 0, 1])

In [59]: cat[cat.argsort()]
Out[59]:
[a, b, NaN]
Categories (2, object): [a < b]
Starting with Python 3.7 the key-order of dict is guaranteed. In practice, this has been true since Python 3.6. The DataFrame constructor now treats a list of dicts in the same way as it does a list of OrderedDict, i.e. preserving the order of the dicts. This change applies only when pandas is running on Python>=3.6 (GH27309).
In [60]: data = [
   ....:     {'name': 'Joe', 'state': 'NY', 'age': 18},
   ....:     {'name': 'Jane', 'state': 'KY', 'age': 19, 'hobby': 'Minecraft'},
   ....:     {'name': 'Jean', 'state': 'OK', 'age': 20, 'finances': 'good'}
   ....: ]
Previously, the columns were lexicographically sorted:
In [1]: pd.DataFrame(data)
Out[1]:
   age finances      hobby  name state
0   18      NaN        NaN   Joe    NY
1   19      NaN  Minecraft  Jane    KY
2   20     good        NaN  Jean    OK
The column order now matches the insertion-order of the keys in the dict, considering all the records from top to bottom. As a consequence, the column order of the resulting DataFrame has changed compared to previous pandas versions.
In [61]: pd.DataFrame(data)
Out[61]:
   name state  age      hobby finances
0   Joe    NY   18        NaN      NaN
1  Jane    KY   19  Minecraft      NaN
2  Jean    OK   20        NaN     good

[3 rows x 5 columns]
Due to dropping support for Python 2.7, a number of optional dependencies have updated minimum versions (GH25725, GH24942, GH25752). Independently, some minimum supported versions of dependencies were updated (GH23519, GH25554). If installed, we now require:
Package          Minimum Version  Required
numpy            1.13.3           X
pytz             2015.4           X
python-dateutil  2.6.1            X
bottleneck       1.2.1
numexpr          2.6.2
pytest (dev)     4.0.2
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
Package         Minimum Version
beautifulsoup4  4.6.0
fastparquet     0.2.1
gcsfs           0.2.2
lxml            3.8.0
matplotlib      2.2.2
openpyxl        2.4.8
pyarrow         0.9.0
pymysql         0.7.1
pytables        3.4.2
scipy           0.19.0
sqlalchemy      1.1.4
xarray          0.8.2
xlrd            1.1.0
xlsxwriter      0.9.8
xlwt            1.2.0
See Dependencies and Optional dependencies for more.
DatetimeTZDtype will now standardize pytz timezones to a common timezone instance (GH24713)
Timestamp and Timedelta scalars now implement the to_numpy() method as aliases to Timestamp.to_datetime64() and Timedelta.to_timedelta64(), respectively. (GH24653)
Timestamp.strptime() will now raise a NotImplementedError (GH25016)
Comparing Timestamp with unsupported objects now returns NotImplemented instead of raising TypeError. This implies that unsupported rich comparisons are delegated to the other object, and are now consistent with Python 3 behavior for datetime objects (GH24011)
Bug in DatetimeIndex.snap() which didn't preserve the name of the input Index (GH25575)
The arg argument in pandas.core.groupby.DataFrameGroupBy.agg() has been renamed to func (GH26089)
The arg argument in pandas.core.window._Window.aggregate() has been renamed to func (GH26372)
Most Pandas classes had a __bytes__ method, which was used for getting a Python 2-style bytestring representation of the object. This method has been removed as part of dropping Python 2 (GH26447)
The .str-accessor has been disabled for 1-level MultiIndex, use MultiIndex.to_flat_index() if necessary (GH23679)
Removed support of the gtk package for clipboards (GH26563)
Using an unsupported version of Beautiful Soup 4 will now raise an ImportError instead of a ValueError (GH27063)
Series.to_excel() and DataFrame.to_excel() will now raise a ValueError when saving timezone aware data. (GH27008, GH7056)
ExtensionArray.argsort() places NA values at the end of the sorted array. (GH21801)
DataFrame.to_hdf() and Series.to_hdf() will now raise a NotImplementedError when saving a MultiIndex with extension data types for a fixed format. (GH7775)
Passing duplicate names in read_csv() will now raise a ValueError (GH17346)
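A minimal sketch of the stricter check (the data and names are illustrative):

```python
import io

import pandas as pd

data = io.StringIO("1,2,3\n4,5,6")

# Passing the same label twice in `names` now raises instead of silently
# accepting the duplicate.
try:
    pd.read_csv(data, names=["a", "b", "a"])
except ValueError as exc:
    print("raised:", exc)
```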
The SparseSeries and SparseDataFrame subclasses are deprecated. Their functionality is better-provided by a Series or DataFrame with sparse values.
Previous way
df = pd.SparseDataFrame({"A": [0, 0, 1, 2]})
df.dtypes
New way
In [62]: df = pd.DataFrame({"A": pd.SparseArray([0, 0, 1, 2])})

In [63]: df.dtypes
Out[63]:
A    Sparse[int64, 0]
Length: 1, dtype: object
The memory usage of the two approaches is identical. See Migrating for more (GH19239).
The msgpack format is deprecated as of 0.25 and will be removed in a future version. It is recommended to use pyarrow for on-the-wire transmission of pandas objects. (GH27084)
The deprecated .ix[] indexer now raises a more visible FutureWarning instead of DeprecationWarning (GH26438).
Deprecated the units=M (months) and units=Y (year) parameters for units of pandas.to_timedelta(), pandas.Timedelta() and pandas.TimedeltaIndex() (GH16344)
pandas.concat() has deprecated the join_axes-keyword. Instead, use DataFrame.reindex() or DataFrame.reindex_like() on the result or on the inputs (GH21951)
The SparseArray.values attribute is deprecated. You can use np.asarray(...) or the SparseArray.to_dense() method instead (GH26421).
The functions pandas.to_datetime() and pandas.to_timedelta() have deprecated the box keyword. Instead, use to_numpy() or Timestamp.to_datetime64() or Timedelta.to_timedelta64(). (GH24416)
The DataFrame.compound() and Series.compound() methods are deprecated and will be removed in a future version (GH26405).
The internal attributes _start, _stop and _step of RangeIndex have been deprecated. Use the public attributes start, stop and step instead (GH26581).
The Series.ftype(), Series.ftypes() and DataFrame.ftypes() methods are deprecated and will be removed in a future version. Instead, use Series.dtype() and DataFrame.dtypes() (GH26705).
The Series.get_values(), DataFrame.get_values(), Index.get_values(), SparseArray.get_values() and Categorical.get_values() methods are deprecated. One of np.asarray(..) or to_numpy() can be used instead (GH19617).
The ‘outer’ method on NumPy ufuncs, e.g. np.subtract.outer has been deprecated on Series objects. Convert the input to an array with Series.array first (GH27186)
Timedelta.resolution() is deprecated and replaced with Timedelta.resolution_string(). In a future version, Timedelta.resolution() will be changed to behave like the standard library datetime.timedelta.resolution (GH21344)
read_table() has been undeprecated. (GH25220)
Index.dtype_str is deprecated. (GH18262)
Series.imag and Series.real are deprecated. (GH18262)
Series.put() is deprecated. (GH18262)
Index.item() and Series.item() are deprecated. (GH18262)
The default value ordered=None in CategoricalDtype has been deprecated in favor of ordered=False. When converting between categorical types, ordered=True must be explicitly passed in order to be preserved. (GH26336)
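To preserve orderedness, pass ordered=True explicitly, as in this minimal sketch (category names are illustrative):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# ordered=True must now be passed explicitly to be preserved
dtype = CategoricalDtype(categories=["low", "high"], ordered=True)
s = pd.Series(["low", "high", "low"]).astype(dtype)
```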
Index.contains() is deprecated. Use key in index (__contains__) instead (GH17753).
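The membership check now uses the standard `in` operator, for example:

```python
import pandas as pd

idx = pd.Index(["a", "b", "c"])
# Deprecated: idx.contains("a"); use the `in` operator instead
found = "a" in idx
missing = "z" in idx
```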
DataFrame.get_dtype_counts() is deprecated. (GH18262)
Categorical.ravel() will return a Categorical instead of an np.ndarray (GH27199)
Removed Panel (GH25047, GH25191, GH25231)
Removed the previously deprecated sheetname keyword in read_excel() (GH16442, GH20938)
Removed the previously deprecated TimeGrouper (GH16942)
Removed the previously deprecated parse_cols keyword in read_excel() (GH16488)
Removed the previously deprecated pd.options.html.border (GH16970)
Removed the previously deprecated convert_objects (GH11221)
Removed the previously deprecated select method of DataFrame and Series (GH17633)
Removed the previously deprecated behavior of Series treated as list-like in rename_categories() (GH17982)
Removed the previously deprecated DataFrame.reindex_axis and Series.reindex_axis (GH17842)
Removed the previously deprecated behavior of altering column or index labels with Series.rename_axis() or DataFrame.rename_axis() (GH17842)
Removed the previously deprecated tupleize_cols keyword argument in read_html(), read_csv(), and DataFrame.to_csv() (GH17877, GH17820)
Removed the previously deprecated DataFrame.from_csv and Series.from_csv (GH17812)
Removed the previously deprecated raise_on_error keyword argument in DataFrame.where() and DataFrame.mask() (GH17744)
Removed the previously deprecated ordered and categories keyword arguments in astype (GH17742)
Removed the previously deprecated cdate_range (GH17691)
Removed the previously deprecated True option for the dropna keyword argument in SeriesGroupBy.nth() (GH17493)
Removed the previously deprecated convert keyword argument in Series.take() and DataFrame.take() (GH17352)
Removed the previously deprecated behavior of arithmetic operations with datetime.date objects (GH21152)
Significant speedup in SparseArray initialization that benefits most operations, fixing performance regression introduced in v0.20.0 (GH24985)
DataFrame.to_stata() is now faster when outputting data with any string or non-native endian columns (GH25045)
Improved performance of Series.searchsorted(). The speedup is especially large when the dtype is int8/int16/int32 and the searched key is within the integer bounds for the dtype (GH22034)
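A small usage sketch of searchsorted on an integer Series (assumed toy data):

```python
import numpy as np
import pandas as pd

# A sorted int32 Series: 0, 2, 4, ..., 98
s = pd.Series(np.arange(0, 100, 2, dtype="int32"))
# Position of the first element >= 10 (side='left' by default)
pos = s.searchsorted(10)
```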
Improved performance of pandas.core.groupby.GroupBy.quantile() (GH20405)
Improved performance of slicing and other selected operations on a RangeIndex (GH26565, GH26617, GH26722)
RangeIndex now performs standard lookup without instantiating an actual hashtable, hence saving memory (GH16685)
Improved performance of read_csv() by faster tokenizing and faster parsing of small float numbers (GH25784)
Improved performance of read_csv() by faster parsing of N/A and boolean values (GH25804)
Improved performance of IntervalIndex.is_monotonic, IntervalIndex.is_monotonic_increasing and IntervalIndex.is_monotonic_decreasing by removing conversion to MultiIndex (GH24813)
Improved performance of DataFrame.to_csv() when writing datetime dtypes (GH25708)
Improved performance of read_csv() by much faster parsing of MM/YYYY and DD/MM/YYYY datetime formats (GH25922)
Improved performance of nanops for dtypes that cannot store NaNs. Speedup is particularly prominent for Series.all() and Series.any() (GH25070)
Improved performance of Series.map() for dictionary mappers on categorical series by mapping the categories instead of mapping all values (GH23785)
Improved performance of IntervalIndex.intersection() (GH24813)
Improved performance of read_csv() by faster concatenating date columns without extra conversion to string for integer/float zero and float NaN; by faster checking the string for the possibility of being a date (GH25754)
Improved performance of IntervalIndex.is_unique by removing conversion to MultiIndex (GH24813)
Restored performance of DatetimeIndex.__iter__() by re-enabling specialized code path (GH26702)
Improved performance when building MultiIndex with at least one CategoricalIndex level (GH22044)
Improved performance by removing the need for a garbage collect when checking for SettingWithCopyWarning (GH27031)
The default value of the cache parameter in to_datetime() has been changed to True (GH26043)
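With the new default, repeated date strings are parsed only once; a minimal sketch with assumed data:

```python
import pandas as pd

dates = ["2019-01-01", "2019-01-02"] * 3
# cache=True is now the default; the unique strings are parsed once
parsed = pd.to_datetime(dates, cache=True)
```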
Improved performance of DatetimeIndex and PeriodIndex slicing given non-unique, monotonic data (GH27136).
Improved performance of pd.read_json() for index-oriented data. (GH26773)
Improved performance of MultiIndex.shape() (GH27384).
Bug in DataFrame.at() and Series.at() that would raise an exception if the index was a CategoricalIndex (GH20629)
Fixed bug in comparison of ordered Categorical that contained missing values with a scalar which sometimes incorrectly resulted in True (GH26504)
Bug in DataFrame.dropna() when the DataFrame has a CategoricalIndex containing Interval objects incorrectly raised a TypeError (GH25087)
Bug in to_datetime() which would raise an (incorrect) ValueError when called with a date far into the future and the format argument specified instead of raising OutOfBoundsDatetime (GH23830)
Bug in to_datetime() which would raise InvalidIndexError: Reindexing only valid with uniquely valued Index objects when called with cache=True, with arg including at least two different elements from the set {None, numpy.nan, pandas.NaT} (GH22305)
Bug in DataFrame and Series where timezone aware data with dtype='datetime64[ns]' was not cast to naive (GH25843)
Improved Timestamp type checking in various datetime functions to prevent exceptions when using a subclassed datetime (GH25851)
Bug in Series and DataFrame repr where np.datetime64('NaT') and np.timedelta64('NaT') with dtype=object would be represented as NaN (GH25445)
Bug in to_datetime() which did not replace the invalid argument with NaT when errors is set to coerce (GH26122)
Bug where adding a DateOffset with a nonzero month to a DatetimeIndex would raise ValueError (GH26258)
Bug in to_datetime() which raised an unhandled OverflowError when called with a mix of invalid dates and NaN values with format='%Y%m%d' and errors='coerce' (GH25512)
Bug in isin() for datetimelike indexes (DatetimeIndex, TimedeltaIndex and PeriodIndex) where the levels parameter was ignored. (GH26675)
Bug in to_datetime() which raised TypeError for format='%Y%m%d' when called with invalid integer dates of length >= 6 digits and errors='ignore'
Bug when comparing a PeriodIndex against a zero-dimensional numpy array (GH26689)
Bug in constructing a Series or DataFrame from a numpy datetime64 array with a non-ns unit and out-of-bound timestamps generating rubbish data, which will now correctly raise an OutOfBoundsDatetime error (GH26206).
Bug in date_range() with unnecessary OverflowError being raised for very large or very small dates (GH26651)
Bug where adding Timestamp to a np.timedelta64 object would raise instead of returning a Timestamp (GH24775)
Bug where comparing a zero-dimensional numpy array containing a np.datetime64 object to a Timestamp would incorrectly raise TypeError (GH26916)
Bug in to_datetime() which would raise ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True when called with cache=True, with arg including datetime strings with different offset (GH26097)
Bug in TimedeltaIndex.intersection() where for non-monotonic indices in some cases an empty Index was returned when in fact an intersection existed (GH25913)
Bug with comparisons between Timedelta and NaT raising TypeError (GH26039)
Bug when adding or subtracting a BusinessHour to a Timestamp with the resulting time landing in a following or prior day respectively (GH26381)
Bug when comparing a TimedeltaIndex against a zero-dimensional numpy array (GH26689)
Bug in DatetimeIndex.to_frame() where timezone aware data would be converted to timezone naive data (GH25809)
Bug in to_datetime() with utc=True and datetime strings that would apply previously parsed UTC offsets to subsequent arguments (GH24992)
Bug in Timestamp.tz_localize() and Timestamp.tz_convert() not propagating freq (GH25241)
Bug in Series.at() where setting Timestamp with timezone raises TypeError (GH25506)
Bug in DataFrame.update() when updating with timezone aware data would return timezone naive data (GH25807)
Bug in to_datetime() where an uninformative RuntimeError was raised when passing a naive Timestamp with datetime strings with mixed UTC offsets (GH25978)
Bug in to_datetime() with unit='ns' would drop timezone information from the parsed argument (GH26168)
Bug in DataFrame.join() where joining a timezone aware index with a timezone aware column would result in a column of NaN (GH26335)
Bug in date_range() where ambiguous or nonexistent start or end times were not handled by the ambiguous or nonexistent keywords respectively (GH27088)
Bug in DatetimeIndex.union() when combining a timezone aware and timezone unaware DatetimeIndex (GH21671)
Bug when applying a numpy reduction function (e.g. numpy.minimum()) to a timezone aware Series (GH15552)
Bug in to_numeric() in which large negative numbers were being improperly handled (GH24910)
Bug in to_numeric() in which numbers were being coerced to float, even though errors was not coerce (GH24910)
Bug in to_numeric() in which invalid values for errors were being allowed (GH26466)
Bug in format in which floating point complex numbers were not being formatted to proper display precision and trimming (GH25514)
Bug in error messages in DataFrame.corr() and Series.corr(). Added the possibility of using a callable. (GH25729)
Bug in Series.divmod() and Series.rdivmod() which would raise an (incorrect) ValueError rather than return a pair of Series objects as result (GH25557)
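The fixed behavior can be sketched as follows (toy values):

```python
import pandas as pd

s = pd.Series([7, 8])
# Now returns a pair of Series (quotient, remainder) rather than raising
quotient, remainder = s.divmod(3)
```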
Raises a helpful exception when a non-numeric index is sent to interpolate() with methods which require numeric index. (GH21662)
Bug in eval() when comparing floats with scalar operators, for example: x < -0.1 (GH25928)
Fixed bug where casting all-boolean array to integer extension array failed (GH25211)
Bug in divmod with a Series object containing zeros incorrectly raising AttributeError (GH26987)
Inconsistency in Series floor-division (//) and divmod filling positive//zero with NaN instead of Inf (GH27321)
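The corrected fill values for division by zero can be sketched like this (illustrative data):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, -1.0, 0.0])
# positive // 0 now yields inf (consistent with divmod), not NaN;
# 0 // 0 remains NaN
result = s // 0
```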
Bug in DataFrame.astype() when passing a dict of columns and types, the errors parameter was ignored. (GH25905)
Bug in the __name__ attribute of several methods of Series.str, which were set incorrectly (GH23551)
Improved error message when passing Series of wrong dtype to Series.str.cat() (GH22722)
Construction of Interval is restricted to numeric, Timestamp and Timedelta endpoints (GH23013)
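A minimal sketch of a valid Interval with Timestamp endpoints (illustrative dates):

```python
import pandas as pd

# Numeric, Timestamp and Timedelta endpoints remain supported
iv = pd.Interval(pd.Timestamp("2019-01-01"), pd.Timestamp("2019-01-02"))
```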
Fixed bug in Series/DataFrame not displaying NaN in IntervalIndex with missing values (GH25984)
Bug in IntervalIndex.get_loc() where a KeyError would be incorrectly raised for a decreasing IntervalIndex (GH25860)
Bug in Index constructor where passing mixed closed Interval objects would result in a ValueError instead of an object dtype Index (GH27172)
Improved exception message when calling DataFrame.iloc() with a list of non-numeric objects (GH25753).
Improved exception message when calling .iloc or .loc with a boolean indexer of a different length (GH26658).
Bug in KeyError exception message when indexing a MultiIndex with a non-existent key not displaying the original key (GH27250).
Bug in .iloc and .loc with a boolean indexer not raising an IndexError when too few items are passed (GH26658).
Bug in DataFrame.loc() and Series.loc() where KeyError was not raised for a MultiIndex when the key was less than or equal to the number of levels in the MultiIndex (GH14885).
Bug in which DataFrame.append() produced an erroneous warning indicating that a KeyError will be thrown in the future when the data to be appended contains new columns (GH22252).
Bug in which DataFrame.to_csv() caused a segfault for a reindexed data frame, when the indices were single-level MultiIndex (GH26303).
Fixed bug where assigning an arrays.PandasArray to a pandas.core.frame.DataFrame would raise an error (GH26390)
Allow keyword arguments for callable local reference used in the DataFrame.query() string (GH26426)
Fixed a KeyError when indexing a MultiIndex level with a list containing exactly one label, which is missing (GH27148)
Bug which produced AttributeError on partial matching Timestamp in a MultiIndex (GH26944)
Bug in Categorical and CategoricalIndex with Interval values when using the in operator (__contains__) with objects that are not comparable to the values in the Interval (GH23705)
Bug in DataFrame.loc() and DataFrame.iloc() on a DataFrame with a single timezone-aware datetime64[ns] column incorrectly returning a scalar instead of a Series (GH27110)
Bug in CategoricalIndex and Categorical incorrectly raising ValueError instead of TypeError when a list is passed using the in operator (__contains__) (GH21729)
Bug in setting a new value in a Series with a Timedelta object incorrectly casting the value to an integer (GH22717)
Bug in Series setting a new key (__setitem__) with a timezone-aware datetime incorrectly raising ValueError (GH12862)
Bug in DataFrame.iloc() when indexing with a read-only indexer (GH17192)
Bug in Series setting an existing tuple key (__setitem__) with timezone-aware datetime values incorrectly raising TypeError (GH20441)
Fixed misleading exception message in Series.interpolate() if argument order is required, but omitted (GH10633, GH24014).
Fixed class type displayed in exception message in DataFrame.dropna() if invalid axis parameter passed (GH25555)
A ValueError will now be thrown by DataFrame.fillna() when limit is not a positive integer (GH27042)
Bug in which an incorrect exception was raised by Timedelta when testing the membership of MultiIndex (GH24570)
Bug in DataFrame.to_html() where values were truncated using display options instead of outputting the full content (GH17004)
Fixed bug in missing text when using to_clipboard() if copying utf-16 characters in Python 3 on Windows (GH25040)
Bug in read_json() for orient='table' when it tries to infer dtypes by default, which is not applicable as dtypes are already defined in the JSON schema (GH21345)
Bug in read_json() for orient='table' and float index, as it infers index dtype by default, which is not applicable because index dtype is already defined in the JSON schema (GH25433)
Bug in read_json() for orient='table' and string of float column names, as it makes a column name type conversion to Timestamp, which is not applicable because column names are already defined in the JSON schema (GH25435)
Bug in json_normalize() for errors='ignore' where missing values in the input data were filled in the resulting DataFrame with the string "nan" instead of numpy.nan (GH25468)
DataFrame.to_html() now raises TypeError when using an invalid type for the classes parameter instead of AssertionError (GH25608)
Bug in DataFrame.to_string() and DataFrame.to_latex() that would lead to incorrect output when the header keyword is used (GH16718)
Bug in read_csv() not properly interpreting the UTF8 encoded filenames on Windows on Python 3.6+ (GH15086)
Improved performance in pandas.read_stata() and pandas.io.stata.StataReader when converting columns that have missing values (GH25772)
Bug in DataFrame.to_html() where header numbers would ignore display options when rounding (GH17280)
Bug in read_hdf() where reading a table from an HDF5 file written directly with PyTables fails with a ValueError when using a sub-selection via the start or stop arguments (GH11188)
Bug in read_hdf() not properly closing store after a KeyError is raised (GH25766)
Improved the explanation for the failure when value labels are repeated in Stata dta files and suggested work-arounds (GH25772)
Improved pandas.read_stata() and pandas.io.stata.StataReader to read incorrectly formatted 118 format files saved by Stata (GH25960)
Improved the col_space parameter in DataFrame.to_html() to accept a string so CSS length values can be set correctly (GH25941)
Fixed bug in loading objects from S3 that contain # characters in the URL (GH25945)
Adds use_bqstorage_api parameter to read_gbq() to speed up downloads of large data frames. This feature requires version 0.10.0 of the pandas-gbq library as well as the google-cloud-bigquery-storage and fastavro libraries. (GH26104)
Fixed memory leak in DataFrame.to_json() when dealing with numeric data (GH24889)
Bug in read_json() where date strings with Z were not converted to a UTC timezone (GH26168)
Added cache_dates=True parameter to read_csv(), which allows caching unique dates when they are parsed (GH25990)
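A minimal sketch of the new keyword (assumed inline CSV data):

```python
import io
import pandas as pd

data = "date,value\n2019-01-01,1\n2019-01-01,2\n2019-01-02,3\n"
# cache_dates=True parses each unique date string only once
df = pd.read_csv(io.StringIO(data), parse_dates=["date"], cache_dates=True)
```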
DataFrame.to_excel() now raises a ValueError when the caller’s dimensions exceed the limitations of Excel (GH26051)
Fixed bug in pandas.read_csv() where a BOM would result in incorrect parsing using engine='python' (GH26545)
read_excel() now raises a ValueError when input is of type pandas.io.excel.ExcelFile and the engine param is passed, since pandas.io.excel.ExcelFile has an engine defined (GH26566)
Bug while selecting from HDFStore with where='' specified (GH26610).
Fixed bug in DataFrame.to_excel() where custom objects (i.e. PeriodIndex) inside merged cells were not being converted into types safe for the Excel writer (GH27006)
Bug in read_hdf() where reading a timezone aware DatetimeIndex would raise a TypeError (GH11926)
Bug in to_msgpack() and read_msgpack() which would raise a ValueError rather than a FileNotFoundError for an invalid path (GH27160)
Fixed bug in DataFrame.to_parquet() which would raise a ValueError when the dataframe had no columns (GH27339)
Allow parsing of PeriodDtype columns when using read_csv() (GH26934)
Fixed bug where api.extensions.ExtensionArray could not be used in matplotlib plotting (GH25587)
Bug in an error message in DataFrame.plot(). Improved the error message if non-numerics are passed to DataFrame.plot() (GH25481)
Bug in incorrect ticklabel positions when plotting an index that is non-numeric / non-datetime (GH7612, GH15912, GH22334)
Fixed bug causing plots of PeriodIndex timeseries to fail if the frequency is a multiple of the frequency rule code (GH14763)
Fixed bug when plotting a DatetimeIndex with datetime.timezone.utc timezone (GH17173)
Bug in pandas.core.resample.Resampler.agg() with a timezone aware index where OverflowError would raise when passing a list of functions (GH22660)
Bug in pandas.core.groupby.DataFrameGroupBy.nunique() in which the names of column levels were lost (GH23222)
Bug in pandas.core.groupby.GroupBy.agg() when applying an aggregation function to timezone aware data (GH23683)
Bug in pandas.core.groupby.GroupBy.first() and pandas.core.groupby.GroupBy.last() where timezone information would be dropped (GH21603)
Bug in pandas.core.groupby.GroupBy.size() when grouping only NA values (GH23050)
Bug in Series.groupby() where observed kwarg was previously ignored (GH24880)
Bug in Series.groupby() where using groupby with a MultiIndex Series with a list of labels equal to the length of the series caused incorrect grouping (GH25704)
Ensured that ordering of outputs in groupby aggregation functions is consistent across all versions of Python (GH25692)
Ensured that result group order is correct when grouping on an ordered Categorical and specifying observed=True (GH25871, GH25167)
Bug in pandas.core.window.Rolling.min() and pandas.core.window.Rolling.max() that caused a memory leak (GH25893)
Bug in pandas.core.window.Rolling.count() and pandas.core.window.Expanding.count() which previously ignored the axis keyword (GH13503)
Bug in pandas.core.groupby.GroupBy.idxmax() and pandas.core.groupby.GroupBy.idxmin() with datetime column would return incorrect dtype (GH25444, GH15306)
Bug in pandas.core.groupby.GroupBy.cumsum(), pandas.core.groupby.GroupBy.cumprod(), pandas.core.groupby.GroupBy.cummin() and pandas.core.groupby.GroupBy.cummax() with categorical column having absent categories, would return incorrect result or segfault (GH16771)
Bug in pandas.core.groupby.GroupBy.nth() where NA values in the grouping would return incorrect results (GH26011)
Bug in pandas.core.groupby.SeriesGroupBy.transform() where transforming an empty group would raise a ValueError (GH26208)
Bug in pandas.core.frame.DataFrame.groupby() where passing a pandas.core.groupby.grouper.Grouper would return incorrect groups when using the .groups accessor (GH26326)
Bug in pandas.core.groupby.GroupBy.agg() where incorrect results are returned for uint64 columns. (GH26310)
Bug in pandas.core.window.Rolling.median() and pandas.core.window.Rolling.quantile() where MemoryError is raised with empty window (GH26005)
Bug in pandas.core.window.Rolling.median() and pandas.core.window.Rolling.quantile() where incorrect results are returned with closed='left' and closed='neither' (GH26005)
Improved pandas.core.window.Rolling, pandas.core.window.Window and pandas.core.window.EWM functions to exclude nuisance columns from results instead of raising errors and raise a DataError only if all columns are nuisance (GH12537)
Bug in pandas.core.window.Rolling.max() and pandas.core.window.Rolling.min() where incorrect results are returned with an empty variable window (GH26005)
Raise a helpful exception when an unsupported weighted window function is used as an argument of pandas.core.window.Window.aggregate() (GH26597)
Bug in pandas.merge() which added a string of None if None was assigned in suffixes, instead of keeping the column name as-is (GH24782).
Bug in merge() when merging by index name would sometimes result in an incorrectly numbered index (missing index values are now assigned NA) (GH24212, GH25009)
to_records() now accepts dtypes to its column_dtypes parameter (GH24895)
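A short sketch of the new column_dtypes keyword (illustrative data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [0.5, 1.5]})
# Override the record dtype of column A while leaving B as float64
rec = df.to_records(column_dtypes={"A": "int32"})
```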
Bug in concat() where order of OrderedDict (and dict in Python 3.6+) is not respected, when passed in as objs argument (GH21510)
Bug in pivot_table() where columns with NaN values are dropped even if dropna argument is False, when the aggfunc argument contains a list (GH22159)
Bug in concat() where the resulting freq of two DatetimeIndex with the same freq would be dropped (GH3232).
Bug in merge() where merging with equivalent Categorical dtypes was raising an error (GH22501)
Bug where instantiating a DataFrame with a dict of iterators or generators (e.g. pd.DataFrame({'A': reversed(range(3))})) raised an error (GH26349).
Bug where instantiating a DataFrame with a range (e.g. pd.DataFrame(range(3))) raised an error (GH26342).
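Both construction patterns now work as expected; a minimal sketch:

```python
import pandas as pd

# Previously raised; a range and a dict of iterators now construct fine
df1 = pd.DataFrame(range(3))
df2 = pd.DataFrame({"A": reversed(range(3))})
```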
Bug in DataFrame constructor when passing non-empty tuples would cause a segmentation fault (GH25691)
Bug in Series.apply() failed when the series is a timezone aware DatetimeIndex (GH25959)
Bug in pandas.cut() where large bins could incorrectly raise an error due to an integer overflow (GH26045)
Bug in DataFrame.sort_index() where an error is thrown when a multi-indexed DataFrame is sorted on all levels with the initial level sorted last (GH26053)
Bug in Series.nlargest() treating True as smaller than False (GH26154)
Bug in DataFrame.pivot_table() with an IntervalIndex as pivot index would raise TypeError (GH25814)
Bug in which DataFrame.from_dict() ignored order of OrderedDict when orient='index' (GH8425).
Bug in DataFrame.transpose() where transposing a DataFrame with a timezone-aware datetime column would incorrectly raise ValueError (GH26825)
Bug in pivot_table() when pivoting a timezone aware column as the values would remove timezone information (GH14948)
Bug in merge_asof() when specifying multiple by columns where one is datetime64[ns, tz] dtype (GH26649)
Bug in SparseFrame constructor where passing None as the data would cause default_fill_value to be ignored (GH16807)
Bug in SparseDataFrame when adding a column in which the length of values does not match the length of the index: an AssertionError was raised instead of a ValueError (GH25484)
Introduce a better error message in Series.sparse.from_coo() so it returns a TypeError for inputs that are not coo matrices (GH26554)
Bug in numpy.modf() on a SparseArray. Now a tuple of SparseArray is returned (GH26946).
Fix install error with PyPy on macOS (GH26536)
Bug in factorize() when passing an ExtensionArray with a custom na_sentinel (GH25696).
Series.count() miscounts NA values in ExtensionArrays (GH26835)
Added Series.__array_ufunc__ to better handle NumPy ufuncs applied to Series backed by extension arrays (GH23293).
Keyword argument deep has been removed from ExtensionArray.copy() (GH27083)
Removed unused C functions from vendored UltraJSON implementation (GH26198)
Allow Index and RangeIndex to be passed to numpy min and max functions (GH26125)
Use actual class name in repr of empty objects of a Series subclass (GH27001).
Bug in DataFrame where passing an object array of timezone-aware datetime objects would incorrectly raise ValueError (GH13287)
A total of 231 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
1_x7 +
Abdullah İhsan Seçer +
Adam Bull +
Adam Hooper
Albert Villanova del Moral
Alex Watt +
AlexTereshenkov +
Alexander Buchkovsky
Alexander Hendorf +
Alexander Nordin +
Alexander Ponomaroff
Alexandre Batisse +
Alexandre Decan +
Allen Downey +
Alyssa Fu Ward +
Andrew Gaspari +
Andrew Wood +
Antoine Viscardi +
Antonio Gutierrez +
Arno Veenstra +
ArtinSarraf
Batalex +
Baurzhan Muftakhidinov
Benjamin Rowell
Bharat Raghunathan +
Bhavani Ravi +
Big Head +
Brett Randall +
Bryan Cutler +
C John Klehm +
Caleb Braun +
Cecilia +
Chris Bertinato +
Chris Stadler +
Christian Haege +
Christian Hudon
Christopher Whelan
Chuanzhu Xu +
Clemens Brunner
Damian Kula +
Daniel Hrisca +
Daniel Luis Costa +
Daniel Saxton
DanielFEvans +
David Liu +
Deepyaman Datta +
Denis Belavin +
Devin Petersohn +
Diane Trout +
EdAbati +
Enrico Rotundo +
EternalLearner42 +
Evan +
Evan Livelo +
Fabian Rost +
Flavien Lambert +
Florian Rathgeber +
Frank Hoang +
Gaibo Zhang +
Gioia Ballin
Giuseppe Romagnuolo +
Gordon Blackadder +
Gregory Rome +
Guillaume Gay
HHest +
Hielke Walinga +
How Si Wei +
Hubert
Huize Wang +
Hyukjin Kwon +
Ian Dunn +
Inevitable-Marzipan +
Irv Lustig
JElfner +
Jacob Bundgaard +
James Cobon-Kerr +
Jan-Philip Gehrcke +
Jarrod Millman +
Jayanth Katuri +
Jeff Reback
Jeremy Schendel
Jiang Yue +
Joel Ostblom
Johan von Forstner +
Johnny Chiu +
Jonas +
Jonathon Vandezande +
Jop Vermeer +
Joris Van den Bossche
Josh
Josh Friedlander +
Justin Zheng
Kaiqi Dong
Kane +
Kapil Patel +
Kara de la Marck +
Katherine Surta +
Katrin Leinweber +
Kendall Masse
Kevin Sheppard
Kyle Kosic +
Lorenzo Stella +
Maarten Rietbergen +
Mak Sze Chun
Marc Garcia
Mateusz Woś
Matias Heikkilä
Mats Maiwald +
Matthew Roeschke
Max Bolingbroke +
Max Kovalovs +
Max van Deursen +
Michael
Michael Davis +
Michael P. Moran +
Mike Cramblett +
Min ho Kim +
Misha Veldhoen +
Mukul Ashwath Ram +
MusTheDataGuy +
Nanda H Krishna +
Nicholas Musolino
Noam Hershtig +
Noora Husseini +
Paul
Paul Reidy
Pauli Virtanen
Pav A +
Peter Leimbigler +
Philippe Ombredanne +
Pietro Battiston
Richard Eames +
Roman Yurchak
Ruijing Li
Ryan
Ryan Joyce +
Ryan Nazareth
Ryan Rehman +
Sakar Panta +
Samuel Sinayoko
Sandeep Pathak +
Sangwoong Yoon
Saurav Chakravorty
Scott Talbert +
Sergey Kopylov +
Shantanu Gontia +
Shivam Rana +
Shorokhov Sergey +
Simon Hawkins
Soyoun(Rose) Kim
Stephan Hoyer
Stephen Cowley +
Stephen Rauch
Sterling Paramore +
Steven +
Stijn Van Hoey
Sumanau Sareen +
Takuya N +
Tan Tran +
Tao He +
Tarbo Fukazawa
Terji Petersen +
Thein Oo
ThibTrip +
Thijs Damsma +
Thiviyan Thanapalasingam
Thomas A Caswell
Thomas Kluiters +
Tilen Kusterle +
Tim Gates +
Tim Hoffmann
Tim Swast
Tom Augspurger
Tom Neep +
Tomáš Chvátal +
Tyler Reddy
Vaibhav Vishal +
Vasily Litvinov +
Vibhu Agarwal +
Vikramjeet Das +
Vladislav +
Víctor Moron Tejero +
Wenhuan
Will Ayd +
William Ayd
Wouter De Coster +
Yoann Goular +
Zach Angell +
alimcmaster1
anmyachev +
chris-b1
danielplawrence +
endenis +
enisnazif +
ezcitron +
fjetter
froessler
gfyoung
gwrome +
h-vetinari
haison +
hannah-c +
heckeop +
iamshwin +
jamesoliverh +
jbrockmendel
jkovacevic +
killerontherun1 +
knuu +
kpapdac +
kpflugshaupt +
krsnik93 +
leerssej +
lrjball +
mazayo +
nathalier +
nrebena +
nullptr +
pilkibun +
pmaxey83 +
rbenes +
robbuckley
shawnbrown +
sudhir mohanraj +
tadeja +
tamuhey +
thatneat
topper-123
willweil +
yehia67 +
yhaque1213 +