What’s new in 1.1.0 (July 28, 2020)¶
These are the changes in pandas 1.1.0. See Release notes for a full changelog including other versions of pandas.
Enhancements¶
KeyErrors raised by loc specify missing labels¶
Previously, if labels were missing for a .loc call, a KeyError was raised stating that this was no longer supported.
Now the error message also includes a list of the missing labels (max 10 items, display width 80 characters). See GH34272.
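As a minimal sketch of what this looks like in practice, the snippet below requests two labels that are absent from a small, made-up Series and captures the resulting message. The exact wording of the message varies between pandas versions, so only the presence of the missing labels is relied upon here.

```python
import pandas as pd

# Hypothetical data for illustration only.
ser = pd.Series([10, 20, 30], index=["a", "b", "c"])

try:
    ser.loc[["a", "x", "y"]]  # "x" and "y" are not in the index
except KeyError as err:
    # The message names the labels that could not be found.
    message = str(err)

print(message)
```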
All dtypes can now be converted to StringDtype¶
Previously, declaring or converting to StringDtype was in general only possible if the data was already only str or nan-like (GH31204).
StringDtype now works in all situations where astype(str) or dtype=str work:
For example, the below now works:
In [1]: ser = pd.Series([1, "abc", np.nan], dtype="string")
In [2]: ser
Out[2]:
0 1
1 abc
2 <NA>
Length: 3, dtype: string
In [3]: ser[0]
Out[3]: '1'
In [4]: pd.Series([1, 2, np.nan], dtype="Int64").astype("string")
Out[4]:
0 1
1 2
2 <NA>
Length: 3, dtype: string
Non-monotonic PeriodIndex partial string slicing¶
PeriodIndex now supports partial string slicing for non-monotonic indexes, mirroring DatetimeIndex behavior (GH31096).
For example:
In [5]: dti = pd.date_range("2014-01-01", periods=30, freq="30D")
In [6]: pi = dti.to_period("D")
In [7]: ser_monotonic = pd.Series(np.arange(30), index=pi)
In [8]: shuffler = list(range(0, 30, 2)) + list(range(1, 31, 2))
In [9]: ser = ser_monotonic[shuffler]
In [10]: ser
Out[10]:
2014-01-01 0
2014-03-02 2
2014-05-01 4
2014-06-30 6
2014-08-29 8
..
2015-09-23 21
2015-11-22 23
2016-01-21 25
2016-03-21 27
2016-05-20 29
Freq: D, Length: 30, dtype: int64
In [11]: ser["2014"]
Out[11]:
2014-01-01 0
2014-03-02 2
2014-05-01 4
2014-06-30 6
2014-08-29 8
2014-10-28 10
2014-12-27 12
2014-01-31 1
2014-04-01 3
2014-05-31 5
2014-07-30 7
2014-09-28 9
2014-11-27 11
Freq: D, Length: 13, dtype: int64
In [12]: ser.loc["May 2015"]
Out[12]:
2015-05-26 17
Freq: D, Length: 1, dtype: int64
Comparing two DataFrame or two Series and summarizing the differences¶
We’ve added DataFrame.compare() and Series.compare() for comparing two DataFrame or two Series objects and summarizing their differences (GH30429).
In [13]: df = pd.DataFrame(
....: {
....: "col1": ["a", "a", "b", "b", "a"],
....: "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
....: "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
....: },
....: columns=["col1", "col2", "col3"],
....: )
....:
In [14]: df
Out[14]:
col1 col2 col3
0 a 1.0 1.0
1 a 2.0 2.0
2 b 3.0 3.0
3 b NaN 4.0
4 a 5.0 5.0
[5 rows x 3 columns]
In [15]: df2 = df.copy()
In [16]: df2.loc[0, 'col1'] = 'c'
In [17]: df2.loc[2, 'col3'] = 4.0
In [18]: df2
Out[18]:
col1 col2 col3
0 c 1.0 1.0
1 a 2.0 2.0
2 b 3.0 4.0
3 b NaN 4.0
4 a 5.0 5.0
[5 rows x 3 columns]
In [19]: df.compare(df2)
Out[19]:
col1 col3
self other self other
0 a c NaN NaN
2 NaN NaN 3.0 4.0
[2 rows x 4 columns]
See User Guide for more details.
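The Series variant works the same way. The sketch below uses two small, made-up Series and also shows the keep_shape and keep_equal parameters, which control whether unchanged positions are retained in the result.

```python
import pandas as pd

# Hypothetical data for illustration only.
s1 = pd.Series(["a", "b", "c", "d"])
s2 = pd.Series(["a", "x", "c", "y"])

# Default: only the positions that differ are kept,
# laid out as "self" and "other" columns.
diff = s1.compare(s2)

# keep_shape=True keeps every row; equal positions show up as NaN.
full = s1.compare(s2, keep_shape=True)

# Adding keep_equal=True shows the equal values instead of NaN.
full_equal = s1.compare(s2, keep_shape=True, keep_equal=True)

print(diff)
```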
Allow NA in groupby key¶
We’ve added a dropna keyword to DataFrame.groupby() and Series.groupby() in order to
allow NA values in group keys. Set dropna to False to include
NA values in groupby keys. The default for dropna is True, which keeps backwards
compatibility (GH3729).
In [20]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
In [21]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])
In [22]: df_dropna
Out[22]:
a b c
0 1 2.0 3
1 1 NaN 4
2 2 1.0 3
3 1 2.0 2
[4 rows x 3 columns]
# Default ``dropna`` is set to True, which will exclude NaNs in keys
In [23]: df_dropna.groupby(by=["b"], dropna=True).sum()
Out[23]:
a c
b
1.0 2 3
2.0 2 5
[2 rows x 2 columns]
# In order to allow NaN in keys, set ``dropna`` to False
In [24]: df_dropna.groupby(by=["b"], dropna=False).sum()
Out[24]:
a c
b
1.0 2 3
2.0 2 5
NaN 1 4
[3 rows x 2 columns]
The default value of the dropna argument is True, which means NA values are not included in group keys.
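The same keyword applies to Series.groupby(). The following sketch groups a made-up Series by a separate key Series that contains NaN and compares the two settings.

```python
import numpy as np
import pandas as pd

# Hypothetical values and grouping keys for illustration only.
values = pd.Series([10, 20, 30, 40])
keys = pd.Series([1.0, np.nan, 1.0, np.nan])

# Default (dropna=True): rows whose key is NaN are left out.
dropped = values.groupby(keys).sum()

# dropna=False: NaN becomes a group of its own.
kept = values.groupby(keys, dropna=False).sum()

print(dropped)
print(kept)
```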
Sorting with keys¶
We’ve added a key argument to the DataFrame and Series sorting methods, including
DataFrame.sort_values(), DataFrame.sort_index(), Series.sort_values(),
and Series.sort_index(). The key can be any callable function which is applied
column-by-column to each column used for sorting, before sorting is performed (GH27237).
See sort_values with keys and sort_index with keys for more information.
In [25]: s = pd.Series(['C', 'a', 'B'])
In [26]: s
Out[26]:
0 C
1 a
2 B
Length: 3, dtype: object
In [27]: s.sort_values()
Out[27]:
2 B
0 C
1 a
Length: 3, dtype: object
Note how this is sorted with capital letters first. If we apply the Series.str.lower()
method, we get
In [28]: s.sort_values(key=lambda x: x.str.lower())
Out[28]:
1 a
2 B
0 C
Length: 3, dtype: object
When applied to a DataFrame, the key is applied per-column to all columns or a subset if
by is specified, e.g.
In [29]: df = pd.DataFrame({'a': ['C', 'C', 'a', 'a', 'B', 'B'],
....: 'b': [1, 2, 3, 4, 5, 6]})
....:
In [30]: df
Out[30]:
a b
0 C 1
1 C 2
2 a 3
3 a 4
4 B 5
5 B 6
[6 rows x 2 columns]
In [31]: df.sort_values(by=['a'], key=lambda col: col.str.lower())
Out[31]:
a b
2 a 3
3 a 4
4 B 5
5 B 6
0 C 1
1 C 2
[6 rows x 2 columns]
For more details, see examples and documentation in DataFrame.sort_values(),
Series.sort_values(), and sort_index().
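For sort_index() the key receives the index rather than the values. The sketch below, on a made-up Series, sorts the index case-insensitively; note that the key has to be vectorized, that is, it takes the whole Index and returns an array-like of the same length.

```python
import pandas as pd

# Hypothetical data for illustration only.
s = pd.Series([1, 2, 3], index=["b", "C", "a"])

# Default ordering places the uppercase label first.
default_order = s.sort_index()

# The key is called once with the whole Index, not once per label.
folded_order = s.sort_index(key=lambda idx: idx.str.lower())

print(folded_order)
```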
Fold argument support in Timestamp constructor¶
Timestamp now supports the keyword-only fold argument according to PEP 495, similar to the parent datetime.datetime class. It supports both accepting fold as an initialization argument and inferring fold from other constructor arguments (GH25057, GH31338). Support is limited to dateutil timezones as pytz doesn’t support fold.
For example:
In [32]: ts = pd.Timestamp("2019-10-27 01:30:00+00:00")
In [33]: ts.fold
Out[33]: 0
In [34]: ts = pd.Timestamp(year=2019, month=10, day=27, hour=1, minute=30,
....: tz="dateutil/Europe/London", fold=1)
....:
In [35]: ts
Out[35]: Timestamp('2019-10-27 01:30:00+0000', tz='dateutil//usr/share/zoneinfo/Europe/London')
For more on working with fold, see Fold subsection in the user guide.
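To make the effect of fold concrete, the sketch below builds the same ambiguous wall-clock time twice, reusing the date and timezone from the example above. It assumes dateutil can find timezone data for Europe/London on the machine where it runs.

```python
from datetime import timedelta

import pandas as pd

# 01:30 on 2019-10-27 happens twice in London because clocks go back at 02:00.
first = pd.Timestamp(year=2019, month=10, day=27, hour=1, minute=30,
                     tz="dateutil/Europe/London", fold=0)
second = pd.Timestamp(year=2019, month=10, day=27, hour=1, minute=30,
                      tz="dateutil/Europe/London", fold=1)

# fold=0 selects the first occurrence (summer time, UTC+1),
# fold=1 selects the second occurrence (winter time, UTC+0).
print(first.utcoffset(), second.utcoffset())

# The two timestamps are one real hour apart.
gap = second - first
```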
Parsing timezone-aware format with different timezones in to_datetime¶
to_datetime() now supports parsing formats containing timezone names (%Z) and UTC offsets (%z) from different timezones and then converting them to UTC by setting utc=True. This returns a DatetimeIndex with timezone at UTC, as opposed to an Index with object dtype if utc=True is not set (GH32792).
For example:
In [36]: tz_strs = ["2010-01-01 12:00:00 +0100", "2010-01-01 12:00:00 -0100",
....: "2010-01-01 12:00:00 +0300", "2010-01-01 12:00:00 +0400"]
....:
In [37]: pd.to_datetime(tz_strs, format='%Y-%m-%d %H:%M:%S %z', utc=True)
Out[37]:
DatetimeIndex(['2010-01-01 11:00:00+00:00', '2010-01-01 13:00:00+00:00',
'2010-01-01 09:00:00+00:00', '2010-01-01 08:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
In [38]: pd.to_datetime(tz_strs, format='%Y-%m-%d %H:%M:%S %z')
Out[38]:
Index([2010-01-01 12:00:00+01:00, 2010-01-01 12:00:00-01:00,
2010-01-01 12:00:00+03:00, 2010-01-01 12:00:00+04:00],
dtype='object')
Grouper and resample now support the arguments origin and offset¶
Grouper and DataFrame.resample() now support the arguments origin and offset. They let the user control the timestamp on which to adjust the grouping. (GH31809)
The bins of the grouping are adjusted based on the beginning of the day of the time series starting point. This works well with frequencies that are multiples of a day (like 30D) or that divide a day (like 90s or 1min). But it can create inconsistencies with some frequencies that do not meet this criterion. To change this behavior you can now specify a fixed timestamp with the argument origin.
Two arguments are now deprecated (more information in the documentation of DataFrame.resample()):
- base should be replaced by offset.
- loffset should be replaced by directly adding an offset to the index of the DataFrame after it has been resampled.
Small example of the use of origin:
In [39]: start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
In [40]: middle = '2000-10-02 00:00:00'
In [41]: rng = pd.date_range(start, end, freq='7min')
In [42]: ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
In [43]: ts
Out[43]:
2000-10-01 23:30:00 0
2000-10-01 23:37:00 3
2000-10-01 23:44:00 6
2000-10-01 23:51:00 9
2000-10-01 23:58:00 12
2000-10-02 00:05:00 15
2000-10-02 00:12:00 18
2000-10-02 00:19:00 21
2000-10-02 00:26:00 24
Freq: 7T, Length: 9, dtype: int64
Resample with the default behavior 'start_day' (origin is 2000-10-01 00:00:00):
In [44]: ts.resample('17min').sum()
Out[44]:
2000-10-01 23:14:00 0
2000-10-01 23:31:00 9
2000-10-01 23:48:00 21
2000-10-02 00:05:00 54
2000-10-02 00:22:00 24
Freq: 17T, Length: 5, dtype: int64
In [45]: ts.resample('17min', origin='start_day').sum()
Out[45]:
2000-10-01 23:14:00 0
2000-10-01 23:31:00 9
2000-10-01 23:48:00 21
2000-10-02 00:05:00 54
2000-10-02 00:22:00 24
Freq: 17T, Length: 5, dtype: int64
Resample using a fixed origin:
In [46]: ts.resample('17min', origin='epoch').sum()
Out[46]:
2000-10-01 23:18:00 0
2000-10-01 23:35:00 18
2000-10-01 23:52:00 27
2000-10-02 00:09:00 39
2000-10-02 00:26:00 24
Freq: 17T, Length: 5, dtype: int64
In [47]: ts.resample('17min', origin='2000-01-01').sum()
Out[47]:
2000-10-01 23:24:00 3
2000-10-01 23:41:00 15
2000-10-01 23:58:00 45
2000-10-02 00:15:00 45
Freq: 17T, Length: 4, dtype: int64
If needed, you can adjust the bins with the argument offset (a Timedelta), which is added to the default origin.
For a full example, see: Use origin or offset to adjust the start of the bins.
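A short sketch of offset, rebuilding the same series as in the origin example above. The final step illustrates one way to replace the deprecated loffset by shifting the labels after resampling; the one-minute shift is an arbitrary value chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-10-01 23:30:00", "2000-10-02 00:30:00", freq="7min")
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)

# offset is added to the default origin (start of the first day),
# so the 17-minute bins are aligned to 23:30 instead of 00:00.
with_offset = ts.resample("17min", offset="23h30min").sum()

# Replacement for loffset: move the labels after resampling.
relabelled = with_offset.copy()
relabelled.index = relabelled.index + pd.Timedelta("1min")

print(with_offset)
```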
fsspec now used for filesystem handling¶
For reading and writing to filesystems other than local and reading from HTTP(S),
the optional dependency fsspec will be used to dispatch operations (GH33452).
This will give unchanged functionality for S3 and GCS storage, which were already
supported, but also add support for several other storage implementations such as
Azure Datalake and Blob, SSH, FTP, Dropbox and GitHub. For docs and capabilities,
see the fsspec docs.
The existing capability to interface with S3 and GCS will be unaffected by this
change, as fsspec will still bring in the same packages as before.
Other enhancements¶
- Compatibility with matplotlib 3.3.0 (GH34850)
- IntegerArray.astype() now supports datetime64 dtype (GH32538)
- IntegerArray now implements the sum operation (GH33172)
- Added pandas.errors.InvalidIndexError (GH34570).
- Added DataFrame.value_counts() (GH5377)
- Added a pandas.api.indexers.FixedForwardWindowIndexer() class to support forward-looking windows during rolling operations.
- Added a pandas.api.indexers.VariableOffsetWindowIndexer() class to support rolling operations with non-fixed offsets (GH34994)
- describe() now includes a datetime_is_numeric keyword to control how datetime columns are summarized (GH30164, GH34798)
- Styler may now render CSS more efficiently where multiple cells have the same styling (GH30876)
- highlight_null() now accepts subset argument (GH31345)
- When writing directly to a sqlite connection DataFrame.to_sql() now supports the multi method (GH29921)
- pandas.errors.OptionError is now exposed in pandas.errors (GH27553)
- Added api.extensions.ExtensionArray.argmax() and api.extensions.ExtensionArray.argmin() (GH24382)
- timedelta_range() will now infer a frequency when passed start, stop, and periods (GH32377)
- Positional slicing on a IntervalIndex now supports slices with step > 1 (GH31658)
- Series.str now has a fullmatch method that matches a regular expression against the entire string in each row of the Series, similar to re.fullmatch (GH32806).
- DataFrame.sample() will now also allow array-like and BitGenerator objects to be passed to random_state as seeds (GH32503)
- Index.union() will now raise RuntimeWarning for MultiIndex objects if the object inside are unsortable. Pass sort=False to suppress this warning (GH33015)
- Added Series.dt.isocalendar() and DatetimeIndex.isocalendar() that returns a DataFrame with year, week, and day calculated according to the ISO 8601 calendar (GH33206, GH34392).
- The DataFrame.to_feather() method now supports additional keyword arguments (e.g. to set the compression) that are added in pyarrow 0.17 (GH33422).
- The cut() will now accept parameter ordered with default ordered=True. If ordered=False and no labels are provided, an error will be raised (GH33141)
- DataFrame.to_csv(), DataFrame.to_pickle(), and DataFrame.to_json() now support passing a dict of compression arguments when using the gzip and bz2 protocols. This can be used to set a custom compression level, e.g., df.to_csv(path, compression={'method': 'gzip', 'compresslevel': 1}) (GH33196)
- melt() has gained an ignore_index (default True) argument that, if set to False, prevents the method from dropping the index (GH17440).
- Series.update() now accepts objects that can be coerced to a Series, such as dict and list, mirroring the behavior of DataFrame.update() (GH33215)
- transform() and aggregate() have gained engine and engine_kwargs arguments that support executing functions with Numba (GH32854, GH33388)
- interpolate() now supports SciPy interpolation method scipy.interpolate.CubicSpline as method cubicspline (GH33670)
- DataFrameGroupBy and SeriesGroupBy now implement the sample method for doing random sampling within groups (GH31775)
- DataFrame.to_numpy() now supports the na_value keyword to control the NA sentinel in the output array (GH33820)
- Added api.extension.ExtensionArray.equals to the extension array interface, similar to Series.equals() (GH27081)
- The minimum supported dta version has increased to 105 in read_stata() and StataReader (GH26667).
- to_stata() supports compression using the compression keyword argument. Compression can either be inferred or explicitly set using a string or a dictionary containing both the method and any additional arguments that are passed to the compression library. Compression was also added to the low-level Stata-file writers StataWriter, StataWriter117, and StataWriterUTF8 (GH26599).
- HDFStore.put() now accepts a track_times parameter. This parameter is passed to the create_table method of PyTables (GH32682).
- Series.plot() and DataFrame.plot() now accepts xlabel and ylabel parameters to present labels on x and y axis (GH9093).
- Made pandas.core.window.rolling.Rolling and pandas.core.window.expanding.Expanding iterable (GH11704)
- Made option_context a contextlib.ContextDecorator, which allows it to be used as a decorator over an entire function (GH34253).
- DataFrame.to_csv() and Series.to_csv() now accept an errors argument (GH22610)
- transform() now allows func to be pad, backfill and cumcount (GH31269).
- read_json() now accepts an nrows parameter. (GH33916).
- DataFrame.hist(), Series.hist(), core.groupby.DataFrameGroupBy.hist(), and core.groupby.SeriesGroupBy.hist() have gained the legend argument. Set to True to show a legend in the histogram. (GH6279)
- concat() and append() now preserve extension dtypes, for example combining a nullable integer column with a numpy integer column will no longer result in object dtype but preserve the integer dtype (GH33607, GH34339, GH34095).
- read_gbq() now allows to disable progress bar (GH33360).
- read_gbq() now supports the max_results kwarg from pandas-gbq (GH34639).
- DataFrame.cov() and Series.cov() now support a new parameter ddof to support delta degrees of freedom as in the corresponding numpy methods (GH34611).
- DataFrame.to_html() and DataFrame.to_string()’s col_space parameter now accepts a list or dict to change only some specific columns’ width (GH28917).
- DataFrame.to_excel() can now also write OpenOffice spreadsheet (.ods) files (GH27222)
- explode() now accepts ignore_index to reset the index, similar to pd.concat() or DataFrame.sort_values() (GH34932).
- DataFrame.to_markdown() and Series.to_markdown() now accept index argument as an alias for tabulate’s showindex (GH32667)
- read_csv() now accepts string values like “0”, “0.0”, “1”, “1.0” as convertible to the nullable Boolean dtype (GH34859)
- pandas.core.window.ExponentialMovingWindow now supports a times argument that allows mean to be calculated with observations spaced by the timestamps in times (GH34839)
- DataFrame.agg() and Series.agg() now accept named aggregation for renaming the output columns/indexes. (GH26513)
- compute.use_numba now exists as a configuration option that utilizes the numba engine when available (GH33966, GH35374)
- Series.plot() now supports asymmetric error bars. Previously, if Series.plot() received a “2xN” array with error values for yerr and/or xerr, the left/lower values (first row) were mirrored, while the right/upper values (second row) were ignored. Now, the first row represents the left/lower error values and the second row the right/upper error values. (GH9536)
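A few of the smaller additions listed above can be tried out together. The sketch below uses made-up data and touches Series.str.fullmatch, Series.dt.isocalendar(), DataFrame.value_counts(), and option_context used as a decorator; the function name max_rows_inside is hypothetical.

```python
import pandas as pd

# Series.str.fullmatch: the pattern has to match the entire string.
codes = pd.Series(["abc", "abcd", "xabc"])
whole = codes.str.fullmatch("abc")

# Series.dt.isocalendar: ISO 8601 year, week, and day as a DataFrame.
iso = pd.Series(pd.to_datetime(["2020-07-28"])).dt.isocalendar()

# DataFrame.value_counts: counts of unique rows.
pairs = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
row_counts = pairs.value_counts()

# option_context as a decorator: the option applies for the whole call.
@pd.option_context("display.max_rows", 5)
def max_rows_inside():
    return pd.get_option("display.max_rows")

print(whole.tolist(), iso, row_counts, max_rows_inside(), sep="\n")
```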
Notable bug fixes¶
These are bug fixes that might have notable behavior changes.
MultiIndex.get_indexer interprets method argument correctly¶
This restores the behavior of MultiIndex.get_indexer() with method='backfill' or method='pad' to the behavior before pandas 0.23.0. In particular, MultiIndexes are treated as a list of tuples and padding or backfilling is done with respect to the ordering of these lists of tuples (GH29896).
As an example of this, given:
In [48]: df = pd.DataFrame({
....: 'a': [0, 0, 0, 0],
....: 'b': [0, 2, 3, 4],
....: 'c': ['A', 'B', 'C', 'D'],
....: }).set_index(['a', 'b'])
....:
In [49]: mi_2 = pd.MultiIndex.from_product([[0], [-1, 0, 1, 3, 4, 5]])
The differences in reindexing df with mi_2 and using method='backfill' can be seen here:
pandas >= 0.23, < 1.1.0:
In [1]: df.reindex(mi_2, method='backfill')
Out[1]:
c
0 -1 A
0 A
1 D
3 A
4 A
5 C
pandas < 0.23, >= 1.1.0
In [50]: df.reindex(mi_2, method='backfill')
Out[50]:
c
0 -1 A
0 A
1 B
3 C
4 D
5 NaN
[6 rows x 1 columns]
And the differences in reindexing df with mi_2 and using method='pad' can be seen here:
pandas >= 0.23, < 1.1.0
In [1]: df.reindex(mi_2, method='pad')
Out[1]:
c
0 -1 NaN
0 NaN
1 D
3 NaN
4 A
5 C
pandas < 0.23, >= 1.1.0
In [51]: df.reindex(mi_2, method='pad')
Out[51]:
c
0 -1 NaN
0 A
1 A
3 C
4 D
5 D
[6 rows x 1 columns]
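The restored semantics can be reproduced without pandas. The sketch below is an illustration only, not the pandas implementation: it treats both indexes as sorted lists of tuples and uses bisect to find the position that pad or backfill would select. The helper names are hypothetical.

```python
from bisect import bisect_left, bisect_right

# The index of df and mi_2 from the example above, written as tuples.
labels = [(0, 0), (0, 2), (0, 3), (0, 4)]
targets = [(0, -1), (0, 0), (0, 1), (0, 3), (0, 4), (0, 5)]

def backfill_positions(labels, targets):
    """Position of the first label >= target, or -1 if there is none."""
    out = []
    for t in targets:
        i = bisect_left(labels, t)
        out.append(i if i < len(labels) else -1)
    return out

def pad_positions(labels, targets):
    """Position of the last label <= target, or -1 if there is none."""
    return [bisect_right(labels, t) - 1 for t in targets]

print(backfill_positions(labels, targets))
print(pad_positions(labels, targets))
```

Mapping positions 0 to 3 onto the values A to D, and -1 onto NaN, gives the same columns as the pandas >= 1.1.0 outputs shown above.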
Failed label-based lookups always raise KeyError¶
Label lookups series[key], series.loc[key] and frame.loc[key]
used to raise either KeyError or TypeError depending on the type of
key and type of Index. These now consistently raise KeyError (GH31867)
In [52]: ser1 = pd.Series(range(3), index=[0, 1, 2])
In [53]: ser2 = pd.Series(range(3), index=pd.date_range("2020-02-01", periods=3))
Previous behavior:
In [3]: ser1[1.5]
...
TypeError: cannot do label indexing on Int64Index with these indexers [1.5] of type float
In [4]: ser1["foo"]
...
KeyError: 'foo'
In [5]: ser1.loc[1.5]
...
TypeError: cannot do label indexing on Int64Index with these indexers [1.5] of type float
In [6]: ser1.loc["foo"]
...
KeyError: 'foo'
In [7]: ser2.loc[1]
...
TypeError: cannot do label indexing on DatetimeIndex with these indexers [1] of type int
In [8]: ser2.loc[pd.Timestamp(0)]
...
KeyError: Timestamp('1970-01-01 00:00:00')
New behavior:
In [3]: ser1[1.5]
...
KeyError: 1.5
In [4]: ser1["foo"]
...
KeyError: 'foo'
In [5]: ser1.loc[1.5]
...
KeyError: 1.5
In [6]: ser1.loc["foo"]
...
KeyError: 'foo'
In [7]: ser2.loc[1]
...
KeyError: 1
In [8]: ser2.loc[pd.Timestamp(0)]
...
KeyError: Timestamp('1970-01-01 00:00:00')
Similarly, DataFrame.at() and Series.at() will raise a TypeError instead of a ValueError if an incompatible key is passed, and KeyError if a missing key is passed, matching the behavior of .loc[] (GH31722)
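A minimal sketch of the missing-key case, using a made-up frame: both .at and .loc report a missing row label with the same exception type. The helper lookup_error is hypothetical and only records the name of the exception raised.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])

def lookup_error(getter):
    """Return the name of the exception raised by getter(), or None."""
    try:
        getter()
    except Exception as err:
        return type(err).__name__
    return None

# "z" is not a row label, so both lookups fail in the same way.
at_missing = lookup_error(lambda: df.at["z", "a"])
loc_missing = lookup_error(lambda: df.loc["z", "a"])

print(at_missing, loc_missing)
```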
Failed Integer Lookups on MultiIndex Raise KeyError¶
Indexing with integers with a MultiIndex that has an integer-dtype
first level incorrectly failed to raise KeyError when one or more of
those integer keys is not present in the first level of the index (GH33539)
In [54]: idx = pd.Index(range(4))
In [55]: dti = pd.date_range("2000-01-03", periods=3)
In [56]: mi = pd.MultiIndex.from_product([idx, dti])
In [57]: ser = pd.Series(range(len(mi)), index=mi)
Previous behavior:
In [5]: ser[[5]]
Out[5]: Series([], dtype: int64)
New behavior:
In [5]: ser[[5]]
...
KeyError: '[5] not in index'
DataFrame.merge() preserves right frame’s row order¶
DataFrame.merge() now preserves the right frame’s row order when executing a right merge (GH27453)
In [58]: left_df = pd.DataFrame({'animal': ['dog', 'pig'],
....: 'max_speed': [40, 11]})
....:
In [59]: right_df = pd.DataFrame({'animal': ['quetzal', 'pig'],
....: 'max_speed': [80, 11]})
....:
In [60]: left_df
Out[60]:
animal max_speed
0 dog 40
1 pig 11
[2 rows x 2 columns]
In [61]: right_df
Out[61]:
animal max_speed
0 quetzal 80
1 pig 11
[2 rows x 2 columns]
Previous behavior:
>>> left_df.merge(right_df, on=['animal', 'max_speed'], how="right")
animal max_speed
0 pig 11
1 quetzal 80
New behavior:
In [62]: left_df.merge(right_df, on=['animal', 'max_speed'], how="right")
Out[62]:
animal max_speed
0 quetzal 80
1 pig 11
[2 rows x 2 columns]
Assignment to multiple columns of a DataFrame when some columns do not exist¶
Assignment to multiple columns of a DataFrame when some of the columns do not exist would previously assign the values to the last column. Now, new columns will be constructed with the right values. (GH13658)
In [63]: df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})
In [64]: df
Out[64]:
a b
0 0 3
1 1 4
2 2 5
[3 rows x 2 columns]
Previous behavior:
In [3]: df[['a', 'c']] = 1
In [4]: df
Out[4]:
a b
0 1 1
1 1 1
2 1 1
New behavior:
In [65]: df[['a', 'c']] = 1
In [66]: df
Out[66]:
a b c
0 1 3 1
1 1 4 1
2 1 5 1
[3 rows x 3 columns]
Consistency across groupby reductions¶
Using DataFrame.groupby() with as_index=True and the aggregation nunique would include the grouping column(s) in the columns of the result. Now the grouping column(s) only appear in the index, consistent with other reductions. (GH32579)
In [67]: df = pd.DataFrame({"a": ["x", "x", "y", "y"], "b": [1, 1, 2, 3]})
In [68]: df
Out[68]:
a b
0 x 1
1 x 1
2 y 2
3 y 3
[4 rows x 2 columns]
Previous behavior:
In [3]: df.groupby("a", as_index=True).nunique()
Out[3]:
a b
a
x 1 1
y 1 2
New behavior:
In [69]: df.groupby("a", as_index=True).nunique()
Out[69]:
b
a
x 1
y 2
[2 rows x 1 columns]
Using DataFrame.groupby() with as_index=False and the function idxmax, idxmin, mad, nunique, sem, skew, or std would modify the grouping column. Now the grouping column remains unchanged, consistent with other reductions. (GH21090, GH10355)
Previous behavior:
In [3]: df.groupby("a", as_index=False).nunique()
Out[3]:
a b
0 1 1
1 1 2
New behavior:
In [70]: df.groupby("a", as_index=False).nunique()
Out[70]:
a b
0 x 1
1 y 2
[2 rows x 2 columns]
The method size() would previously ignore as_index=False. Now the grouping columns are returned as columns, making the result a DataFrame instead of a Series. (GH32599)
Previous behavior:
In [3]: df.groupby("a", as_index=False).size()
Out[3]:
a
x 2
y 2
dtype: int64
New behavior:
In [71]: df.groupby("a", as_index=False).size()
Out[71]:
a size
0 x 2
1 y 2
[2 rows x 2 columns]
agg() lost results with as_index=False when relabeling columns¶
Previously, agg() lost the result columns when the as_index option was
set to False and the result columns were relabeled. In this case the result values were replaced with
the previous index (GH32240).
In [72]: df = pd.DataFrame({"key": ["x", "y", "z", "x", "y", "z"],
....: "val": [1.0, 0.8, 2.0, 3.0, 3.6, 0.75]})
....:
In [73]: df
Out[73]:
key val
0 x 1.00
1 y 0.80
2 z 2.00
3 x 3.00
4 y 3.60
5 z 0.75
[6 rows x 2 columns]
Previous behavior:
In [2]: grouped = df.groupby("key", as_index=False)
In [3]: result = grouped.agg(min_val=pd.NamedAgg(column="val", aggfunc="min"))
In [4]: result
Out[4]:
min_val
0 x
1 y
2 z
New behavior:
In [74]: grouped = df.groupby("key", as_index=False)
In [75]: result = grouped.agg(min_val=pd.NamedAgg(column="val", aggfunc="min"))
In [76]: result
Out[76]:
key min_val
0 x 1.00
1 y 0.80
2 z 0.75
[3 rows x 2 columns]
apply and applymap on DataFrame evaluates first row/column only once¶
In [77]: df = pd.DataFrame({'a': [1, 2], 'b': [3, 6]})
In [78]: def func(row):
....: print(row)
....: return row
....:
Previous behavior:
In [4]: df.apply(func, axis=1)
a 1
b 3
Name: 0, dtype: int64
a 1
b 3
Name: 0, dtype: int64
a 2
b 6
Name: 1, dtype: int64
Out[4]:
a b
0 1 3
1 2 6
New behavior:
In [79]: df.apply(func, axis=1)
a 1
b 3
Name: 0, Length: 2, dtype: int64
a 2
b 6
Name: 1, Length: 2, dtype: int64
Out[79]:
a b
0 1 3
1 2 6
[2 rows x 2 columns]
Backwards incompatible API changes¶
Added check_freq argument to testing.assert_frame_equal and testing.assert_series_equal¶
The check_freq argument was added to testing.assert_frame_equal() and testing.assert_series_equal() in pandas 1.1.0 and defaults to True. testing.assert_frame_equal() and testing.assert_series_equal() now raise AssertionError if the indexes do not have the same frequency. Before pandas 1.1.0, the index frequency was not checked.
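The sketch below shows the practical effect on two made-up Series whose indexes hold the same timestamps but differ in their freq attribute. Rebuilding the index from its raw values is just one convenient way to obtain an index without a frequency.

```python
import pandas as pd

idx = pd.date_range("2020-01-01", periods=3, freq="D")
with_freq = pd.Series([1, 2, 3], index=idx)

# Same timestamps, but the rebuilt index carries no frequency.
without_freq = pd.Series([1, 2, 3], index=pd.DatetimeIndex(idx.values))

try:
    pd.testing.assert_series_equal(with_freq, without_freq)
    default_passes = True
except AssertionError:
    # check_freq defaults to True, so the differing freq is reported.
    default_passes = False

# Opting out of the frequency check restores the earlier behavior.
pd.testing.assert_series_equal(with_freq, without_freq, check_freq=False)

print(default_passes)
```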
Increased minimum versions for dependencies¶
Some minimum supported versions of dependencies were updated (GH33718, GH29766, GH29723, pytables >= 3.4.3). If installed, we now require:
| Package | Minimum Version | Required | Changed |
|---|---|---|---|
| numpy | 1.15.4 | X | X |
| pytz | 2015.4 | X | |
| python-dateutil | 2.7.3 | X | X |
| bottleneck | 1.2.1 | | |
| numexpr | 2.6.2 | | |
| pytest (dev) | 4.0.2 | | |
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
| Package | Minimum Version | Changed |
|---|---|---|
| beautifulsoup4 | 4.6.0 | |
| fastparquet | 0.3.2 | |
| fsspec | 0.7.4 | |
| gcsfs | 0.6.0 | X |
| lxml | 3.8.0 | |
| matplotlib | 2.2.2 | |
| numba | 0.46.0 | |
| openpyxl | 2.5.7 | |
| pyarrow | 0.13.0 | |
| pymysql | 0.7.1 | |
| pytables | 3.4.3 | X |
| s3fs | 0.4.0 | X |
| scipy | 1.2.0 | X |
| sqlalchemy | 1.1.4 | |
| xarray | 0.8.2 | |
| xlrd | 1.1.0 | |
| xlsxwriter | 0.9.8 | |
| xlwt | 1.2.0 | |
| pandas-gbq | 1.2.0 | X |
See Dependencies and Optional dependencies for more.
Deprecations¶
- Lookups on a Series with a single-item list containing a slice (e.g. ser[[slice(0, 4)]]) are deprecated and will raise in a future version. Either convert the list to a tuple, or pass the slice directly instead (GH31333)
- DataFrame.mean() and DataFrame.median() with numeric_only=None will include datetime64 and datetime64tz columns in a future version (GH29941)
- Setting values with .loc using a positional slice is deprecated and will raise in a future version. Use .loc with labels or .iloc with positions instead (GH31840)
- DataFrame.to_dict() has deprecated accepting short names for orient and will raise in a future version (GH32515)
- Categorical.to_dense() is deprecated and will be removed in a future version, use np.asarray(cat) instead (GH32639)
- The fastpath keyword in the SingleBlockManager constructor is deprecated and will be removed in a future version (GH33092)
- Providing suffixes as a set in pandas.merge() is deprecated. Provide a tuple instead (GH33740, GH34741).
- Indexing a Series with a multi-dimensional indexer like [:, None] to return an ndarray now raises a FutureWarning. Convert to a NumPy array before indexing instead (GH27837)
- Index.is_mixed() is deprecated and will be removed in a future version, check index.inferred_type directly instead (GH32922)
- Passing any arguments but the first one to read_html() as positional arguments is deprecated. All other arguments should be given as keyword arguments (GH27573).
- Passing any arguments but path_or_buf (the first one) to read_json() as positional arguments is deprecated. All other arguments should be given as keyword arguments (GH27573).
- Passing any arguments but the first two to read_excel() as positional arguments is deprecated. All other arguments should be given as keyword arguments (GH27573).
- pandas.api.types.is_categorical() is deprecated and will be removed in a future version; use pandas.api.types.is_categorical_dtype() instead (GH33385)
- Index.get_value() is deprecated and will be removed in a future version (GH19728)
- Series.dt.week and Series.dt.weekofyear are deprecated and will be removed in a future version, use Series.dt.isocalendar().week instead (GH33595)
- DatetimeIndex.week and DatetimeIndex.weekofyear are deprecated and will be removed in a future version, use DatetimeIndex.isocalendar().week instead (GH33595)
- DatetimeArray.week and DatetimeArray.weekofyear are deprecated and will be removed in a future version, use DatetimeArray.isocalendar().week instead (GH33595)
- DateOffset.__call__() is deprecated and will be removed in a future version, use offset + other instead (GH34171)
- apply_index() is deprecated and will be removed in a future version. Use offset + other instead (GH34580)
- DataFrame.tshift() and Series.tshift() are deprecated and will be removed in a future version, use DataFrame.shift() and Series.shift() instead (GH11631)
- Indexing an Index object with a float key is deprecated, and will raise an IndexError in the future. You can manually convert to an integer key instead (GH34191).
- The squeeze keyword in groupby() is deprecated and will be removed in a future version (GH32380)
- The tz keyword in Period.to_timestamp() is deprecated and will be removed in a future version; use per.to_timestamp(...).tz_localize(tz) instead (GH34522)
- DatetimeIndex.to_perioddelta() is deprecated and will be removed in a future version. Use index - index.to_period(freq).to_timestamp() instead (GH34853)
- DataFrame.melt() accepting a value_name that already exists is deprecated, and will be removed in a future version (GH34731)
- The center keyword in the DataFrame.expanding() function is deprecated and will be removed in a future version (GH20647)
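Several of these deprecations come with a direct replacement. The sketch below applies four of the suggested replacements to made-up data; the shift call assumes the intent of the old tshift(1) was to move the index forward by one day.

```python
import numpy as np
import pandas as pd

# Series.dt.week -> Series.dt.isocalendar().week
dates = pd.Series(pd.date_range("2020-01-01", periods=3, freq="D"))
weeks = dates.dt.isocalendar().week

# Series.tshift(1) -> Series.shift(1, freq=...)
ts = pd.Series([1, 2, 3], index=pd.date_range("2020-01-01", periods=3, freq="D"))
shifted = ts.shift(1, freq="D")  # values unchanged, index moved by one day

# Categorical.to_dense() -> np.asarray(cat)
cat = pd.Categorical(["a", "b", "a"])
dense = np.asarray(cat)

# Period.to_timestamp(tz=...) -> to_timestamp().tz_localize(...)
per = pd.Period("2020-01", freq="M")
stamp = per.to_timestamp().tz_localize("UTC")

print(weeks.tolist(), shifted.index[0], dense, stamp)
```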
Performance improvements¶
- Performance improvement in flex arithmetic ops between DataFrame and Series with axis=0 (GH31296)
- Performance improvement in arithmetic ops between DataFrame and Series with axis=1 (GH33600)
- The internal index method _shallow_copy() now copies cached attributes over to the new index, avoiding creating these again on the new index. This can speed up many operations that depend on creating copies of existing indexes (GH28584, GH32640, GH32669)
- Significant performance improvement when creating a DataFrame with sparse values from scipy.sparse matrices using the DataFrame.sparse.from_spmatrix() constructor (GH32821, GH32825, GH32826, GH32856, GH32858).
- Performance improvement for groupby methods first() and last() (GH34178)
- Performance improvement in factorize() for nullable (integer and Boolean) dtypes (GH33064).
- Performance improvement when constructing Categorical objects (GH33921)
- Fixed performance regression in pandas.qcut() and pandas.cut() (GH33921)
- Performance improvement in reductions (sum, prod, min, max) for nullable (integer and Boolean) dtypes (GH30982, GH33261, GH33442).
- Performance improvement in arithmetic operations between two DataFrame objects (GH32779)
- Performance improvement in pandas.core.groupby.RollingGroupby (GH34052)
- Performance improvement in arithmetic operations (sub, add, mul, div) for MultiIndex (GH34297)
- Performance improvement in DataFrame[bool_indexer] when bool_indexer is a list (GH33924)
- Significant performance improvement of io.formats.style.Styler.render() with styles added with various ways such as io.formats.style.Styler.apply(), io.formats.style.Styler.applymap() or io.formats.style.Styler.bar() (GH19917)
Bug fixes¶
Categorical¶
- Passing an invalid fill_value to Categorical.take() raises a ValueError instead of TypeError (GH33660)
- Combining a Categorical with integer categories and which contains missing values with a float dtype column in operations such as concat() or append() will now result in a float column instead of an object dtype column (GH33607)
- Bug where merge() was unable to join on non-unique categorical indices (GH28189)
- Bug when passing categorical data to Index constructor along with dtype=object incorrectly returning a CategoricalIndex instead of object-dtype Index (GH32167)
- Bug where Categorical comparison operator __ne__ would incorrectly evaluate to False when either element was missing (GH32276)
- Categorical.fillna() now accepts Categorical other argument (GH32420)
- Repr of Categorical was not distinguishing between int and str (GH33676)
Datetimelike¶
Passing an integer dtype other than int64 to np.array(period_index, dtype=...) will now raise TypeError instead of incorrectly using int64 (GH32255)
Series.to_timestamp() now raises a TypeError if the axis is not a PeriodIndex. Previously an AttributeError was raised (GH33327)
Series.to_period() now raises a TypeError if the axis is not a DatetimeIndex. Previously an AttributeError was raised (GH33327)
Period no longer accepts tuples for the freq argument (GH34658)
Bug in Timestamp where constructing a Timestamp from ambiguous epoch time and calling constructor again changed the Timestamp.value() property (GH24329)
DatetimeArray.searchsorted(), TimedeltaArray.searchsorted(), PeriodArray.searchsorted() not recognizing non-pandas scalars and incorrectly raising ValueError instead of TypeError (GH30950)
Bug in Timestamp where constructing Timestamp with dateutil timezone less than 128 nanoseconds before daylight saving time switch from winter to summer would result in nonexistent time (GH31043)
Bug in Period.to_timestamp(), Period.start_time() with microsecond frequency returning a timestamp one nanosecond earlier than the correct time (GH31475)
Timestamp raised a confusing error message when year, month or day is missing (GH31200)
Bug in DatetimeIndex constructor incorrectly accepting bool-dtype inputs (GH32668)
Bug in DatetimeIndex.searchsorted() not accepting a list or Series as its argument (GH32762)
Bug where PeriodIndex() raised when passed a Series of strings (GH26109)
Bug in Timestamp arithmetic when adding or subtracting an np.ndarray with timedelta64 dtype (GH33296)
Bug in DatetimeIndex.to_period() not inferring the frequency when called with no arguments (GH33358)
Bug in DatetimeIndex.tz_localize() incorrectly retaining freq in some cases where the original freq is no longer valid (GH30511)
Bug in DatetimeIndex.intersection() losing freq and timezone in some cases (GH33604)
Bug in DatetimeIndex.get_indexer() where incorrect output would be returned for mixed datetime-like targets (GH33741)
Bug in DatetimeIndex addition and subtraction with some types of DateOffset objects incorrectly retaining an invalid freq attribute (GH33779)
Bug in DatetimeIndex where setting the freq attribute on an index could silently change the freq attribute on another index viewing the same data (GH33552)
DataFrame.min() and DataFrame.max() were not returning consistent results with Series.min() and Series.max() when called on objects initialized with empty pd.to_datetime()
Bug in DatetimeIndex.intersection() and TimedeltaIndex.intersection() with results not having the correct name attribute (GH33904)
Bug in DatetimeArray.__setitem__(), TimedeltaArray.__setitem__(), PeriodArray.__setitem__() incorrectly allowing values with int64 dtype to be silently cast (GH33717)
Bug in subtracting TimedeltaIndex from Period incorrectly raising TypeError in some cases where it should succeed and IncompatibleFrequency in some cases where it should raise TypeError (GH33883)
Bug in constructing a Series or Index from a read-only NumPy array with non-ns resolution which converted to object dtype instead of coercing to datetime64[ns] dtype when within the timestamp bounds (GH34843).
The freq keyword in Period, date_range(), period_range(), pd.tseries.frequencies.to_offset() no longer allows tuples, pass as string instead (GH34703)
Bug in DataFrame.append() when appending a Series containing a scalar tz-aware Timestamp to an empty DataFrame resulted in an object column instead of datetime64[ns, tz] dtype (GH35038)
OutOfBoundsDatetime issues an improved error message when timestamp is out of implementation bounds. (GH32967)
Bug in AbstractHolidayCalendar.holidays() when no rules were defined (GH31415)
Bug in Tick comparisons raising TypeError when comparing against timedelta-like objects (GH34088)
Bug in Tick multiplication raising TypeError when multiplying by a float (GH34486)
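The Series.to_timestamp() change (GH33327) can be sketched as follows. This assumes pandas 1.1 or later, and the names ser and raised are illustrative:

```python
import pandas as pd

# A Series with a default integer index, i.e. not a PeriodIndex.
ser = pd.Series([1, 2, 3])

try:
    ser.to_timestamp()
    raised = None
except TypeError as exc:
    # Per the note above, a TypeError is expected here rather than
    # the AttributeError raised by older versions.
    raised = exc

print(type(raised).__name__)
```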
Timedelta¶
Bug in constructing a Timedelta with a high precision integer that would round the Timedelta components (GH31354)
Bug in dividing np.nan or None by Timedelta incorrectly returning NaT (GH31869)
Timedelta now understands µs as an identifier for microsecond (GH32899)
Timedelta string representation now includes nanoseconds, when nanoseconds are non-zero (GH9309)
Bug in comparing a Timedelta object against an np.ndarray with timedelta64 dtype incorrectly viewing all entries as unequal (GH33441)
Bug in timedelta_range() that produced an extra point on an edge case (GH30353, GH33498)
Bug in DataFrame.resample() that produced an extra point on an edge case (GH30353, GH13022, GH33498)
Bug in DataFrame.resample() that ignored the loffset argument when dealing with timedelta (GH7687, GH33498)
Bug in Timedelta and pandas.to_timedelta() that ignored the unit argument for string input (GH12136)
Timezones¶
Bug in to_datetime() with infer_datetime_format=True where timezone names (e.g. UTC) would not be parsed correctly (GH33133)
Numeric¶
Bug in DataFrame.floordiv() with axis=0 not treating division-by-zero like Series.floordiv() (GH31271)
Bug in to_numeric() with string argument "uint64" and errors="coerce" silently fails (GH32394)
Bug in to_numeric() with downcast="unsigned" fails for empty data (GH32493)
Bug in DataFrame.mean() with numeric_only=False and either datetime64 dtype or PeriodDtype column incorrectly raising TypeError (GH32426)
Bug in DataFrame.count() with level="foo" and index level "foo" containing NaNs causes segmentation fault (GH21824)
Bug in DataFrame.diff() with axis=1 returning incorrect results with mixed dtypes (GH32995)
Bug in DataFrame.corr() and DataFrame.cov() raising when handling nullable integer columns with pandas.NA (GH33803)
Bug in arithmetic operations between DataFrame objects with non-overlapping columns with duplicate labels causing an infinite loop (GH35194)
Bug in DataFrame and Series addition and subtraction between object-dtype objects and datetime64 dtype objects (GH33824)
Bug in Index.difference() giving incorrect results when comparing a Float64Index and object Index (GH35217)
Bug in DataFrame reductions (e.g. df.min(), df.max()) with ExtensionArray dtypes (GH34520, GH32651)
Series.interpolate() and DataFrame.interpolate() now raise a ValueError if limit_direction is 'forward' or 'both' and method is 'backfill' or 'bfill', or if limit_direction is 'backward' or 'both' and method is 'pad' or 'ffill' (GH34746)
Conversion¶
Bug in Series construction from NumPy array with big-endian datetime64 dtype (GH29684)
Bug in Timedelta construction with large nanoseconds keyword value (GH32402)
Bug in DataFrame construction where sets would be duplicated rather than raising (GH32582)
The DataFrame constructor no longer accepts a list of DataFrame objects. Because of changes to NumPy, DataFrame objects are now consistently treated as 2D objects, so a list of DataFrame objects is considered 3D, and no longer acceptable for the DataFrame constructor (GH32289).
Bug in DataFrame when initializing a frame with lists and assigning columns with a nested list for MultiIndex (GH32173)
Improved error message for invalid construction of list when creating a new index (GH35190)
Strings¶
Bug in the astype() method when converting “string” dtype data to nullable integer dtype (GH32450).
Fixed issue where taking min or max of a StringArray or Series with StringDtype type would raise. (GH31746)
Bug in Series.str.cat() returning NaN output when other had Index type (GH33425)
pandas.api.types.is_string_dtype() no longer incorrectly identifies categorical series as string.
Interval¶
Bug in IntervalArray incorrectly allowing the underlying data to be changed when setting values (GH32782)
Indexing¶
DataFrame.xs() now raises a TypeError if a level keyword is supplied and the axis is not a MultiIndex. Previously an AttributeError was raised (GH33610)
Bug in slicing on a DatetimeIndex with a partial-timestamp dropping high-resolution indices near the end of a year, quarter, or month (GH31064)
Bug in PeriodIndex.get_loc() treating higher-resolution strings differently from PeriodIndex.get_value() (GH31172)
Bug in Series.at() and DataFrame.at() not matching .loc behavior when looking up an integer in a Float64Index (GH31329)
Bug in PeriodIndex.is_monotonic() incorrectly returning True when containing leading NaT entries (GH31437)
Bug in DatetimeIndex.get_loc() raising KeyError with converted-integer key instead of the user-passed key (GH31425)
Bug in Series.xs() incorrectly returning Timestamp instead of datetime64 in some object-dtype cases (GH31630)
Bug in DataFrame.iat() incorrectly returning Timestamp instead of datetime in some object-dtype cases (GH32809)
Bug in DataFrame.at() when either columns or index is non-unique (GH33041)
Bug in Series.loc() and DataFrame.loc() when indexing with an integer key on an object-dtype Index that is not all-integers (GH31905)
Bug in DataFrame.iloc.__setitem__() on a DataFrame with duplicate columns incorrectly setting values for all matching columns (GH15686, GH22036)
Bug in DataFrame.loc() and Series.loc() with a DatetimeIndex, TimedeltaIndex, or PeriodIndex incorrectly allowing lookups of non-matching datetime-like dtypes (GH32650)
Bug in Series.__getitem__() indexing with non-standard scalars, e.g. np.dtype (GH32684)
Bug in Index constructor where an unhelpful error message was raised for NumPy scalars (GH33017)
Bug in DataFrame.lookup() incorrectly raising an AttributeError when frame.index or frame.columns is not unique; this will now raise a ValueError with a helpful error message (GH33041)
Bug in Interval where a Timedelta could not be added or subtracted from a Timestamp interval (GH32023)
Bug in DataFrame.copy() not invalidating _item_cache after copy caused post-copy value updates to not be reflected (GH31784)
Fixed regression in DataFrame.loc() and Series.loc() throwing an error when a datetime64[ns, tz] value is provided (GH32395)
Bug in Series.__getitem__() with an integer key and a MultiIndex with leading integer level failing to raise KeyError if the key is not present in the first level (GH33355)
Bug in DataFrame.iloc() when slicing a single column DataFrame with ExtensionDtype (e.g. df.iloc[:, :1]) returning an invalid result (GH32957)
Bug in DatetimeIndex.insert() and TimedeltaIndex.insert() causing index freq to be lost when setting an element into an empty Series (GH33573)
Bug in Series.__setitem__() with an IntervalIndex and a list-like key of integers (GH33473)
Bug in Series.__getitem__() allowing missing labels with np.ndarray, Index, Series indexers but not list; these now all raise KeyError (GH33646)
Bug in DataFrame.truncate() and Series.truncate() where index was assumed to be monotone increasing (GH33756)
Indexing with a list of strings representing datetimes failed on DatetimeIndex or PeriodIndex (GH11278)
Bug in Series.at() when used with a MultiIndex would raise an exception on valid inputs (GH26989)
Bug in DataFrame.loc() with dictionary of values changes columns with dtype of int to float (GH34573)
Bug in Series.loc() when used with a MultiIndex would raise an IndexingError when accessing a None value (GH34318)
Bug in DataFrame.reset_index() and Series.reset_index() would not preserve data types on an empty DataFrame or Series with a MultiIndex (GH19602)
Bug in Series and DataFrame indexing with a time key on a DatetimeIndex with NaT entries (GH35114)
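The consistent KeyError behavior for missing labels in Series.__getitem__() (GH33646) can be sketched like this. It assumes pandas 1.1 or later, and the labels and the outcomes dictionary are illustrative:

```python
import numpy as np
import pandas as pd

ser = pd.Series([10, 20], index=["a", "b"])
keys = ["a", "z"]  # "z" is not present in the index

# Try the same missing label through each kind of list-like indexer.
outcomes = {}
for indexer in (keys, np.array(keys), pd.Index(keys), pd.Series(keys)):
    kind = type(indexer).__name__
    try:
        ser[indexer]
        outcomes[kind] = "no error"
    except KeyError:
        outcomes[kind] = "KeyError"

print(outcomes)
```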
Missing¶
Calling fillna() on an empty Series now correctly returns a shallow copied object. The behaviour is now consistent with Index, DataFrame and a non-empty Series (GH32543).
Bug in Series.replace() when argument to_replace is of type dict/list and is used on a Series containing <NA> was raising a TypeError. The method now handles this by ignoring <NA> values when doing the comparison for the replacement (GH32621)
Bug in any() and all() incorrectly returning <NA> for all False or all True values using the nullable Boolean dtype and with skipna=False (GH33253)
Clarified documentation on interpolate with method=akima. The der parameter must be scalar or None (GH33426)
DataFrame.interpolate() uses the correct axis convention now. Previously interpolating along columns led to interpolation along indices and vice versa. Furthermore, interpolating with the methods pad, ffill, bfill and backfill is identical to using these methods with DataFrame.fillna() (GH12918, GH29146)
Bug in DataFrame.interpolate() when called on a DataFrame with column names of string type was throwing a ValueError. The method is now independent of the type of the column names (GH33956)
Passing NA into a format string using format specs will now work. For example "{:.1f}".format(pd.NA) would previously raise a ValueError, but will now return the string "<NA>" (GH34740)
Bug in Series.map() not raising on invalid na_action (GH32815)
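The format-spec behavior described for NA (GH34740) can be reproduced directly, assuming pandas 1.1 or later:

```python
import pandas as pd

# A float format spec cannot be applied to pd.NA, so the NA
# representation is returned instead of raising a ValueError.
formatted = "{:.1f}".format(pd.NA)
print(formatted)  # <NA>
```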
MultiIndex¶
DataFrame.swaplevel() now raises a TypeError if the axis is not a MultiIndex. Previously an AttributeError was raised (GH31126)
Bug in DataFrame.loc() when used with a MultiIndex. The returned values were not in the same order as the given inputs (GH22797)
In [80]: df = pd.DataFrame(np.arange(4),
   ....:                   index=[["a", "a", "b", "b"], [1, 2, 1, 2]])
   ....:
# Rows are now ordered as the requested keys
In [81]: df.loc[(['b', 'a'], [2, 1]), :]
Out[81]:
     0
b 2  3
  1  2
a 2  1
  1  0

[4 rows x 1 columns]
Bug in MultiIndex.intersection() was not guaranteed to preserve order when sort=False. (GH31325)
In [82]: left = pd.MultiIndex.from_arrays([["b", "a"], [2, 1]])
In [83]: right = pd.MultiIndex.from_arrays([["a", "b", "c"], [1, 2, 3]])
# Common elements are now guaranteed to be ordered by the left side
In [84]: left.intersection(right, sort=False)
Out[84]:
MultiIndex([('b', 2),
            ('a', 1)],
           )
Bug in DataFrame.truncate() was dropping MultiIndex names. (GH34564)
Bug when joining two MultiIndex without specifying level with different columns. Return-indexers parameter was ignored. (GH34074)
IO¶
Passing a set as the names argument to pandas.read_csv(), pandas.read_table(), or pandas.read_fwf() will raise ValueError: Names should be an ordered collection. (GH34946)
Bug in print-out when display.precision is zero. (GH20359)
Bug in read_json() where integer overflow was occurring when json contains big number strings. (GH30320)
read_csv() will now raise a ValueError when the arguments header and prefix both are not None. (GH27394)
Bug in DataFrame.to_json() was raising NotFoundError when path_or_buf was an S3 URI (GH28375)
Bug in DataFrame.to_parquet() overwriting pyarrow’s default for coerce_timestamps; following pyarrow’s default allows writing nanosecond timestamps with version="2.0" (GH31652).
Bug in read_csv() was raising TypeError when sep=None was used in combination with comment keyword (GH31396)
Bug in HDFStore that caused it to set to int64 the dtype of a datetime64 column when reading a DataFrame in Python 3 from fixed format written in Python 2 (GH31750)
read_sas() now handles dates and datetimes larger than Timestamp.max returning them as datetime.datetime objects (GH20927)
Bug in DataFrame.to_json() where Timedelta objects would not be serialized correctly with date_format="iso" (GH28256)
read_csv() will raise a ValueError when the column names passed in parse_dates are missing in the DataFrame (GH31251)
Bug in read_excel() where a UTF-8 string with a high surrogate would cause a segmentation violation (GH23809)
Bug in read_csv() was causing a file descriptor leak on an empty file (GH31488)
Bug in read_csv() was causing a segfault when there were blank lines between the header and data rows (GH28071)
Bug in read_csv() was raising a misleading exception on a permissions issue (GH23784)
Bug in read_csv() was raising an IndexError when header=None and there were two extra data columns
Bug in read_sas() was raising an AttributeError when reading files from Google Cloud Storage (GH33069)
Bug in DataFrame.to_sql() where an AttributeError was raised when saving an out of bounds date (GH26761)
Bug in read_excel() did not correctly handle multiple embedded spaces in OpenDocument text cells. (GH32207)
Bug in read_json() was raising TypeError when reading a list of Booleans into a Series. (GH31464)
Bug in pandas.io.json.json_normalize() where location specified by record_path doesn’t point to an array. (GH26284)
pandas.read_hdf() has a more explicit error message when loading an unsupported HDF file (GH9539)
Bug in read_feather() was raising an ArrowIOError when reading an s3 or http file path (GH29055)
Bug in to_excel() could not handle the column name render and was raising a KeyError (GH34331)
Bug in execute() was raising a ProgrammingError for some DB-API drivers when the SQL statement contained the % character and no parameters were present (GH34211)
Bug in StataReader() which resulted in categorical variables with different dtypes when reading data using an iterator. (GH31544)
HDFStore.keys() now has an optional include parameter that allows the retrieval of all native HDF5 table names (GH29916)
TypeError exceptions raised by read_csv() and read_table() were showing as parser_f when an unexpected keyword argument was passed (GH25648)
Bug in read_excel() for ODS files removes 0.0 values (GH27222)
Bug in ujson.encode() was raising an OverflowError with numbers larger than sys.maxsize (GH34395)
Bug in HDFStore.append_to_multiple() was raising a ValueError when the min_itemsize parameter is set (GH11238)
Bug in create_table() now raises an error when column argument was not specified in data_columns on input (GH28156)
read_json() can now read a line-delimited JSON file from a file URL when lines and chunksize are set.
Bug in DataFrame.to_sql() when reading DataFrames with -np.inf entries with MySQL now has a more explicit ValueError (GH34431)
Bug where capitalised file extensions were not decompressed by read_* functions (GH35164)
Bug in read_excel() that was raising a TypeError when header=None and index_col is given as a list (GH31783)
Bug in read_excel() where datetime values are used in the header in a MultiIndex (GH34748)
read_excel() no longer takes **kwds arguments. This means that passing in the keyword argument chunksize now raises a TypeError (previously raised a NotImplementedError), while passing in the keyword argument encoding now raises a TypeError (GH34464)
Bug in DataFrame.to_records() was incorrectly losing timezone information in timezone-aware datetime64 columns (GH32535)
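The names validation in read_csv() (GH34946) can be sketched as follows, assuming pandas 1.1 or later. The in-memory CSV text is made up for illustration:

```python
import io

import pandas as pd

csv_text = "1,2\n3,4\n"

# A set has no defined order, so it is rejected as `names`.
try:
    pd.read_csv(io.StringIO(csv_text), names={"a", "b"})
    message = None
except ValueError as exc:
    message = str(exc)

# An ordered collection such as a list is accepted.
ordered = pd.read_csv(io.StringIO(csv_text), names=["a", "b"])

print(message)
print(list(ordered.columns))
```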
Plotting¶
DataFrame.plot() for line/bar now accepts color by dictionary (GH8193).
Bug in DataFrame.plot.hist() where weights are not working for multiple columns (GH33173)
Bug in DataFrame.boxplot() and DataFrame.plot.boxplot() lost color attributes of medianprops, whiskerprops, capprops and boxprops (GH30346)
Bug in DataFrame.hist() where the order of column argument was ignored (GH29235)
Bug in DataFrame.plot.scatter() that when adding multiple plots with different cmap, colorbars always use the first cmap (GH33389)
Bug in DataFrame.plot.scatter() was adding a colorbar to the plot even if the argument c was assigned to a column containing color names (GH34316)
Bug in pandas.plotting.bootstrap_plot() was causing cluttered axes and overlapping labels (GH34905)
Bug in DataFrame.plot.scatter() caused an error when plotting variable marker sizes (GH32904)
GroupBy/resample/rolling¶
Using a pandas.api.indexers.BaseIndexer with count, min, max, median, skew, cov, corr will now return correct results for any monotonic pandas.api.indexers.BaseIndexer descendant (GH32865)
DataFrameGroupBy.mean() and SeriesGroupBy.mean() (and similarly for median(), std() and var()) now raise a TypeError if a non-accepted keyword argument is passed into them. Previously an UnsupportedFunctionCall was raised (AssertionError if min_count was passed into median()) (GH31485)
Bug in GroupBy.apply() raises ValueError when the by axis is not sorted, has duplicates, and the applied func does not mutate passed in objects (GH30667)
Bug in DataFrameGroupBy.transform() produces an incorrect result with transformation functions (GH30918)
Bug in GroupBy.transform() was returning the wrong result when grouping by multiple keys of which some were categorical and others not (GH32494)
Bug in GroupBy.count() causes segmentation fault when grouped-by columns contain NaNs (GH32841)
Bug in DataFrame.groupby() and Series.groupby() produces inconsistent type when aggregating Boolean Series (GH32894)
Bug in DataFrameGroupBy.sum() and SeriesGroupBy.sum() where a large negative number would be returned when the number of non-null values was below min_count for nullable integer dtypes (GH32861)
Bug in SeriesGroupBy.quantile() was raising on nullable integers (GH33136)
Bug in DataFrame.resample() where an AmbiguousTimeError would be raised when the resulting timezone aware DatetimeIndex had a DST transition at midnight (GH25758)
Bug in DataFrame.groupby() where a ValueError would be raised when grouping by a categorical column with read-only categories and sort=False (GH33410)
Bug in GroupBy.agg(), GroupBy.transform(), and GroupBy.resample() where subclasses are not preserved (GH28330)
Bug in SeriesGroupBy.agg() where any column name was accepted in the named aggregation of SeriesGroupBy previously. The behaviour now allows only str and callables; anything else raises TypeError. (GH34422)
Bug in DataFrame.groupby() lost the name of the Index when one of the agg keys referenced an empty list (GH32580)
Bug in Rolling.apply() where center=True was ignored when engine='numba' was specified (GH34784)
Bug in DataFrame.ewm.cov() was throwing AssertionError for MultiIndex inputs (GH34440)
Bug in core.groupby.DataFrameGroupBy.quantile() raised TypeError for non-numeric types rather than dropping the columns (GH27892)
Bug in core.groupby.DataFrameGroupBy.transform() when func='nunique' and columns are of type datetime64, the result would also be of type datetime64 instead of int64 (GH35109)
Bug in DataFrame.groupby() raising an AttributeError when selecting a column and aggregating with as_index=False (GH35246).
Bug in DataFrameGroupBy.first() and DataFrameGroupBy.last() that would raise an unnecessary ValueError when grouping on multiple Categoricals (GH34951)
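The keyword validation for groupby reductions (GH31485) can be sketched as below, assuming pandas 1.1 or later. The keyword not_a_real_option is deliberately invented and is not a pandas parameter:

```python
import pandas as pd

df = pd.DataFrame({"key": ["x", "x", "y"], "value": [1.0, 2.0, 3.0]})

try:
    # `not_a_real_option` is a made-up keyword used only to trigger
    # the validation; it is not accepted by mean().
    df.groupby("key").mean(not_a_real_option=True)
    error_type = None
except TypeError as exc:
    error_type = type(exc).__name__

print(error_type)
```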
Reshaping¶
Bug affecting all numeric and Boolean reduction methods not returning subclassed data type. (GH25596)
Bug in DataFrame.pivot_table() when only MultiIndexed columns is set (GH17038)
Bug in DataFrame.unstack() and Series.unstack() can take tuple names in MultiIndexed data (GH19966)
Bug in DataFrame.pivot_table() when margin is True and only column is defined (GH31016)
Fixed incorrect error message in DataFrame.pivot() when columns is set to None. (GH30924)
Bug in crosstab() when inputs are two Series and have tuple names, the output will keep a dummy MultiIndex as columns. (GH18321)
DataFrame.pivot() can now take lists for index and columns arguments (GH21425)
Bug in concat() where the resulting indices are not copied when copy=True (GH29879)
Bug in SeriesGroupBy.aggregate() was resulting in aggregations being overwritten when they shared the same name (GH30880)
Bug where Index.astype() would lose the name attribute when converting from Float64Index to Int64Index, or when casting to an ExtensionArray dtype (GH32013)
Series.append() will now raise a TypeError when passed a DataFrame or a sequence containing DataFrame (GH31413)
DataFrame.replace() and Series.replace() will raise a TypeError if to_replace is not an expected type. Previously the replace would fail silently (GH18634)
Bug on inplace operation of a Series that was adding a column to the DataFrame from which it was originally dropped (using inplace=True) (GH30484)
Bug in DataFrame.apply() where callback was called with Series parameter even though raw=True requested. (GH32423)
Bug in DataFrame.pivot_table() losing timezone information when creating a MultiIndex level from a column with timezone-aware dtype (GH32558)
Bug in concat() where passing a non-dict mapping as objs would raise a TypeError (GH32863)
DataFrame.agg() now provides a more descriptive SpecificationError message when attempting to aggregate a non-existent column (GH32755)
Bug in DataFrame.unstack() when MultiIndex columns and MultiIndex rows were used (GH32624, GH24729 and GH28306)
Appending a dictionary to a DataFrame without passing ignore_index=True will raise TypeError: Can only append a dict if ignore_index=True instead of TypeError: Can only append a Series if ignore_index=True or if the Series has a name (GH30871)
Bug in DataFrame.corrwith(), DataFrame.memory_usage(), DataFrame.dot(), DataFrame.idxmin(), DataFrame.idxmax(), DataFrame.duplicated(), DataFrame.isin(), DataFrame.count(), Series.explode(), Series.asof() and DataFrame.asof() not returning subclassed types. (GH31331)
Bug in concat() was not allowing for concatenation of DataFrame and Series with duplicate keys (GH33654)
Bug in cut() raised an error when the argument labels contains duplicates (GH33141)
Bug in DataFrame.aggregate() and Series.aggregate() was causing a recursive loop in some cases (GH34224)
Fixed bug in melt() where melting MultiIndex columns with col_level > 0 would raise a KeyError on id_vars (GH34129)
Bug in Series.where() with an empty Series and empty cond having non-bool dtype (GH34592)
Fixed regression where DataFrame.apply() would raise ValueError for elements with S dtype (GH34529)
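As a sketch of DataFrame.pivot() accepting a list for index (GH21425), assuming pandas 1.1 or later; the column names and values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "year": [2019, 2019, 2020, 2020],
        "region": ["east", "west", "east", "west"],
        "metric": ["sales", "sales", "sales", "sales"],
        "value": [1, 2, 3, 4],
    }
)

# Passing a list for `index` yields a two-level row index.
wide = df.pivot(index=["year", "region"], columns="metric", values="value")
print(wide)
```

Each (year, region) pair becomes one row, and each distinct metric becomes a column.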
Sparse¶
Creating a SparseArray from timezone-aware dtype will issue a warning before dropping timezone information, instead of doing so silently (GH32501)
Bug in arrays.SparseArray.from_spmatrix() wrongly read scipy sparse matrix (GH31991)
Bug in Series.sum() with SparseArray raised a TypeError (GH25777)
Bug where DataFrame containing an all-sparse SparseArray filled with NaN when indexed by a list-like (GH27781, GH29563)
The repr of SparseDtype now includes the repr of its fill_value attribute. Previously it used fill_value’s string representation (GH34352)
Bug where empty DataFrame could not be cast to SparseDtype (GH33113)
Bug in arrays.SparseArray() was returning the incorrect type when indexing a sparse dataframe with an iterable (GH34526, GH34540)
ExtensionArray¶
Fixed bug where Series.value_counts() would raise on empty input of Int64 dtype (GH33317)
Fixed bug in concat() when concatenating DataFrame objects with non-overlapping columns resulting in object-dtype columns rather than preserving the extension dtype (GH27692, GH33027)
Fixed bug where StringArray.isna() would return False for NA values when pandas.options.mode.use_inf_as_na was set to True (GH33655)
Fixed bug in Series construction with EA dtype and index but no data or scalar data fails (GH26469)
Fixed bug that caused Series.__repr__() to crash for extension types whose elements are multidimensional arrays (GH33770).
Fixed bug where Series.update() would raise a ValueError for ExtensionArray dtypes with missing values (GH33980)
Fixed bug where StringArray.memory_usage() was not implemented (GH33963)
Fixed bug where DataFrameGroupBy() would ignore the min_count argument for aggregations on nullable Boolean dtypes (GH34051)
Fixed bug where the constructor of DataFrame with dtype='string' would fail (GH27953, GH33623)
Bug where DataFrame column set to scalar extension type was considered an object type rather than the extension type (GH34832)
Fixed bug in IntegerArray.astype() to correctly copy the mask as well (GH34931).
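Two of the extension-dtype fixes above can be exercised with a short sketch, assuming pandas 1.1 or later. The data is made up for illustration:

```python
import pandas as pd

# value_counts() on an empty nullable-integer Series (GH33317)
# should return an empty result rather than raise.
empty = pd.Series([], dtype="Int64")
counts = empty.value_counts()

# Constructing a DataFrame with dtype="string" (GH27953, GH33623).
frame = pd.DataFrame({"a": ["x", "y"]}, dtype="string")

print(len(counts))
print(frame.dtypes)
```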
Other¶
Set operations on an object-dtype Index now always return object-dtype results (GH31401)
Fixed pandas.testing.assert_series_equal() to correctly raise if the left argument is a different subclass with check_series_type=True (GH32670).
Getting a missing attribute in a DataFrame.query() or DataFrame.eval() string raises the correct AttributeError (GH32408)
Fixed bug in pandas.testing.assert_series_equal() where dtypes were checked for Interval and ExtensionArray operands when check_dtype was False (GH32747)
Bug in DataFrame.__dir__() caused a segfault when using unicode surrogates in a column name (GH25509)
Bug in DataFrame.equals() and Series.equals() in allowing subclasses to be equal (GH34402).
Contributors¶
A total of 368 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
3vts +
A Brooks +
Abbie Popa +
Achmad Syarif Hidayatullah +
Adam W Bagaskarta +
Adrian Mastronardi +
Aidan Montare +
Akbar Septriyan +
Akos Furton +
Alejandro Hall +
Alex Hall +
Alex Itkes +
Alex Kirko
Ali McMaster +
Alvaro Aleman +
Amy Graham +
Andrew Schonfeld +
Andrew Shumanskiy +
Andrew Wieteska +
Angela Ambroz
Anjali Singh +
Anna Daglis
Anthony Milbourne +
Antony Lee +
Ari Sosnovsky +
Arkadeep Adhikari +
Arunim Samudra +
Ashkan +
Ashwin Prakash Nalwade +
Ashwin Srinath +
Atsushi Nukariya +
Ayappan +
Ayla Khan +
Bart +
Bart Broere +
Benjamin Beier Liu +
Benjamin Fischer +
Bharat Raghunathan
Bradley Dice +
Brendan Sullivan +
Brian Strand +
Carsten van Weelden +
Chamoun Saoma +
ChrisRobo +
Christian Chwala
Christopher Whelan
Christos Petropoulos +
Chuanzhu Xu
CloseChoice +
Clément Robert +
CuylenE +
DanBasson +
Daniel Saxton
Danilo Horta +
DavaIlhamHaeruzaman +
Dave Hirschfeld
Dave Hughes
David Rouquet +
David S +
Deepyaman Datta
Dennis Bakhuis +
Derek McCammond +
Devjeet Roy +
Diane Trout
Dina +
Dom +
Drew Seibert +
EdAbati
Emiliano Jordan +
Erfan Nariman +
Eric Groszman +
Erik Hasse +
Erkam Uyanik +
Evan D +
Evan Kanter +
Fangchen Li +
Farhan Reynaldo +
Farhan Reynaldo Hutabarat +
Florian Jetter +
Fred Reiss +
GYHHAHA +
Gabriel Moreira +
Gabriel Tutui +
Galuh Sahid
Gaurav Chauhan +
George Hartzell +
Gim Seng +
Giovanni Lanzani +
Gordon Chen +
Graham Wetzler +
Guillaume Lemaitre
Guillem Sánchez +
HH-MWB +
Harshavardhan Bachina
How Si Wei
Ian Eaves
Iqrar Agalosi Nureyza +
Irv Lustig
Iva Laginja +
JDkuba
Jack Greisman +
Jacob Austin +
Jacob Deppen +
Jacob Peacock +
Jake Tae +
Jake Vanderplas +
James Cobon-Kerr
Jan Červenka +
Jan Škoda
Jane Chen +
Jean-Francois Zinque +
Jeanderson Barros Candido +
Jeff Reback
Jered Dominguez-Trujillo +
Jeremy Schendel
Jesse Farnham
Jiaxiang
Jihwan Song +
Joaquim L. Viegas +
Joel Nothman
John Bodley +
John Paton +
Jon Thielen +
Joris Van den Bossche
Jose Manuel Martí +
Joseph Gulian +
Josh Dimarsky
Joy Bhalla +
João Veiga +
Julian de Ruiter +
Justin Essert +
Justin Zheng
KD-dev-lab +
Kaiqi Dong
Karthik Mathur +
Kaushal Rohit +
Kee Chong Tan
Ken Mankoff +
Kendall Masse
Kenny Huynh +
Ketan +
Kevin Anderson +
Kevin Bowey +
Kevin Sheppard
Kilian Lieret +
Koki Nishihara +
Krishna Chivukula +
KrishnaSai2020 +
Lesley +
Lewis Cowles +
Linda Chen +
Linxiao Wu +
Lucca Delchiaro Costabile +
MBrouns +
Mabel Villalba
Mabroor Ahmed +
Madhuri Palanivelu +
Mak Sze Chun
Malcolm +
Marc Garcia
Marco Gorelli
Marian Denes +
Martin Bjeldbak Madsen +
Martin Durant +
Martin Fleischmann +
Martin Jones +
Martin Winkel
Martina Oefelein +
Marvzinc +
María Marino +
Matheus Cardoso +
Mathis Felardos +
Matt Roeschke
Matteo Felici +
Matteo Santamaria +
Matthew Roeschke
Matthias Bussonnier
Max Chen
Max Halford +
Mayank Bisht +
Megan Thong +
Michael Marino +
Miguel Marques +
Mike Kutzma
Mohammad Hasnain Mohsin Rajan +
Mohammad Jafar Mashhadi +
MomIsBestFriend
Monica +
Natalie Jann
Nate Armstrong +
Nathanael +
Nick Newman +
Nico Schlömer +
Niklas Weber +
ObliviousParadigm +
Olga Lyashevska +
OlivierLuG +
Pandas Development Team
Parallels +
Patrick +
Patrick Cando +
Paul Lilley +
Paul Sanders +
Pearcekieser +
Pedro Larroy +
Pedro Reys
Peter Bull +
Peter Steinbach +
Phan Duc Nhat Minh +
Phil Kirlin +
Pierre-Yves Bourguignon +
Piotr Kasprzyk +
Piotr Niełacny +
Prakhar Pandey
Prashant Anand +
Puneetha Pai +
Quang Nguyễn +
Rafael Jaimes III +
Rafif +
RaisaDZ +
Rakshit Naidu +
Ram Rachum +
Red +
Ricardo Alanis +
Richard Shadrach +
Rik-de-Kort
Robert de Vries
Robin to Roxel +
Roger Erens +
Rohith295 +
Roman Yurchak
Ror +
Rushabh Vasani
Ryan
Ryan Nazareth
SAI SRAVAN MEDICHERLA +
SHUBH CHATTERJEE +
Sam Cohan
Samira-g-js +
Sandu Ursu +
Sang Agung +
SanthoshBala18 +
Sasidhar Kasturi +
SatheeshKumar Mohan +
Saul Shanabrook
Scott Gigante +
Sebastian Berg +
Sebastián Vanrell
Sergei Chipiga +
Sergey +
ShilpaSugan +
Simon Gibbons
Simon Hawkins
Simon Legner +
Soham Tiwari +
Song Wenhao +
Souvik Mandal
Spencer Clark
Steffen Rehberg +
Steffen Schmitz +
Stijn Van Hoey
Stéphan Taljaard
SultanOrazbayev +
Sumanau Sareen
SurajH1 +
Suvayu Ali +
Terji Petersen
Thomas J Fan +
Thomas Li
Thomas Smith +
Tim Swast
Tobias Pitters +
Tom +
Tom Augspurger
Uwe L. Korn
Valentin Iovene +
Vandana Iyer +
Venkatesh Datta +
Vijay Sai Mutyala +
Vikas Pandey
Vipul Rai +
Vishwam Pandya +
Vladimir Berkutov +
Will Ayd
Will Holmgren
William +
William Ayd
Yago González +
Yosuke KOBAYASHI +
Zachary Lawrence +
Zaky Bilfagih +
Zeb Nicholls +
alimcmaster1
alm +
andhikayusup +
andresmcneill +
avinashpancham +
benabel +
bernie gray +
biddwan09 +
brock +
chris-b1
cleconte987 +
dan1261 +
david-cortes +
davidwales +
dequadras +
dhuettenmoser +
dilex42 +
elmonsomiat +
epizzigoni +
fjetter
gabrielvf1 +
gdex1 +
gfyoung
guru kiran +
h-vishal
iamshwin
jamin-aws-ospo +
jbrockmendel
jfcorbett +
jnecus +
kernc
kota matsuoka +
kylekeppler +
leandermaben +
link2xt +
manoj_koneni +
marydmit +
masterpiga +
maxime.song +
mglasder +
moaraccounts +
mproszewska
neilkg
nrebena
ossdev07 +
paihu
pan Jacek +
partev +
patrick +
pedrooa +
pizzathief +
proost
pvanhauw +
rbenes
rebecca-palmer
rhshadrach +
rjfs +
s-scherrer +
sage +
sagungrp +
salem3358 +
saloni30 +
smartswdeveloper +
smartvinnetou +
themien +
timhunderwood +
tolhassianipar +
tonywu1999
tsvikas
tv3141
venkateshdatta1993 +
vivikelapoutre +
willbowditch +
willpeppo +
za +
zaki-indra +