What’s new in 1.3.0 (??)

These are the changes in pandas 1.3.0. See Release notes for a full changelog including other versions of pandas.

Warning

When reading new Excel 2007+ (.xlsx) files, the default argument engine=None to read_excel() will now result in using the openpyxl engine in all cases when the option io.excel.xlsx.reader is set to "auto". Previously, some cases would use the xlrd engine instead. See What’s new 1.2.0 for background on this change.

Enhancements

Custom HTTP(s) headers when reading csv or json files

When reading from a remote URL that is not handled by fsspec (i.e. HTTP and HTTPS), the dictionary passed to storage_options will be used to create the headers included in the request. This can be used to control the User-Agent header or send other custom headers (GH36688). For example:

In [1]: headers = {"User-Agent": "pandas"}

In [2]: df = pd.read_csv(
   ...:     "https://download.bls.gov/pub/time.series/cu/cu.item",
   ...:     sep="\t",
   ...:     storage_options=headers
   ...: )
   ...: 

Read and write XML documents

We added I/O support to read and render shallow versions of XML documents with pandas.read_xml() and DataFrame.to_xml(). Using lxml as the parser, both XPath 1.0 and XSLT 1.0 are available (GH27554).

In [1]: xml = """<?xml version='1.0' encoding='utf-8'?>
   ...: <data>
   ...:  <row>
   ...:     <shape>square</shape>
   ...:     <degrees>360</degrees>
   ...:     <sides>4.0</sides>
   ...:  </row>
   ...:  <row>
   ...:     <shape>circle</shape>
   ...:     <degrees>360</degrees>
   ...:     <sides/>
   ...:  </row>
   ...:  <row>
   ...:     <shape>triangle</shape>
   ...:     <degrees>180</degrees>
   ...:     <sides>3.0</sides>
   ...:  </row>
   ...:  </data>"""

In [2]: df = pd.read_xml(xml)
In [3]: df
Out[3]:
      shape  degrees  sides
0    square      360    4.0
1    circle      360    NaN
2  triangle      180    3.0

In [4]: df.to_xml()
Out[4]:
<?xml version='1.0' encoding='utf-8'?>
<data>
  <row>
    <index>0</index>
    <shape>square</shape>
    <degrees>360</degrees>
    <sides>4.0</sides>
  </row>
  <row>
    <index>1</index>
    <shape>circle</shape>
    <degrees>360</degrees>
    <sides/>
  </row>
  <row>
    <index>2</index>
    <shape>triangle</shape>
    <degrees>180</degrees>
    <sides>3.0</sides>
  </row>
</data>

For more, see Writing XML in the user guide on IO tools.
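As a rough sketch of selecting nodes with an XPath expression and round-tripping through DataFrame.to_xml(), the following uses the standard-library parser (parser="etree") so that it runs without lxml. The parser choice, the StringIO wrapper, and the variable names are assumptions made for illustration; the etree parser supports only a limited XPath subset.

```python
from io import StringIO

import pandas as pd

xml = """<?xml version='1.0' encoding='utf-8'?>
<data>
  <row><shape>square</shape><degrees>360</degrees><sides>4.0</sides></row>
  <row><shape>circle</shape><degrees>360</degrees><sides/></row>
  <row><shape>triangle</shape><degrees>180</degrees><sides>3.0</sides></row>
</data>"""

# Select every <row> element; each child element becomes a column.
df = pd.read_xml(StringIO(xml), xpath=".//row", parser="etree")

# Write the frame back out without the index, then read it again.
round_trip = df.to_xml(index=False, parser="etree")
df_again = pd.read_xml(StringIO(round_trip), parser="etree")

print(df)
print(df_again.equals(df))
```

Wrapping the document in StringIO rather than passing the raw string is a deliberate choice here, since it treats the XML as a file-like buffer.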

Other enhancements

Notable bug fixes

These are bug fixes that might have notable behavior changes.

Preserve dtypes in combine_first()

combine_first() will now preserve dtypes (GH7509)

In [3]: df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=[0, 1, 2])

In [4]: df1
Out[4]: 
   A  B
0  1  1
1  2  2
2  3  3

In [5]: df2 = pd.DataFrame({"B": [4, 5, 6], "C": [1, 2, 3]}, index=[2, 3, 4])

In [6]: df2
Out[6]: 
   B  C
2  4  1
3  5  2
4  6  3

In [7]: combined = df1.combine_first(df2)

pandas 1.2.x

In [1]: combined.dtypes
Out[1]:
A    float64
B    float64
C    float64
dtype: object

pandas 1.3.0

In [8]: combined.dtypes
Out[8]: 
A    float64
B      int64
C    float64
dtype: object

Try operating inplace when setting values with loc and iloc

When setting an entire column using loc or iloc, pandas will try to insert the values into the existing data rather than create an entirely new array.

In [9]: df = pd.DataFrame(range(3), columns=["A"], dtype="float64")

In [10]: values = df.values

In [11]: new = np.array([5, 6, 7], dtype="int64")

In [12]: df.loc[[0, 1, 2], "A"] = new

In both the new and old behavior, the data in values is overwritten, but in the old behavior the dtype of df["A"] changed to int64.

pandas 1.2.x

In [1]: df.dtypes
Out[1]:
A    int64
dtype: object
In [2]: np.shares_memory(df["A"].values, new)
Out[2]: False
In [3]: np.shares_memory(df["A"].values, values)
Out[3]: False

In pandas 1.3.0, df continues to share data with values.

pandas 1.3.0

In [13]: df.dtypes
Out[13]: 
A    float64
dtype: object

In [14]: np.shares_memory(df["A"], new)
Out[14]: False

In [15]: np.shares_memory(df["A"], values)
Out[15]: True

Consistent Casting With Setting Into Boolean Series

Setting non-boolean values into a Series with dtype=bool now consistently casts to dtype=object (GH38709)

In [16]: orig = pd.Series([True, False])

In [17]: ser = orig.copy()

In [18]: ser.iloc[1] = np.nan

In [19]: ser2 = orig.copy()

In [20]: ser2.iloc[1] = 2.0

pandas 1.2.x

In [1]: ser
Out[1]:
0    1.0
1    NaN
dtype: float64

In [2]: ser2
Out[2]:
0    True
1     2.0
dtype: object

pandas 1.3.0

In [21]: ser
Out[21]: 
0    True
1     NaN
dtype: object

In [22]: ser2
Out[22]: 
0    True
1     2.0
dtype: object

Increased minimum versions for dependencies

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package          Minimum Version  Required  Changed
numpy            1.16.5           X
pytz             2017.3           X
python-dateutil  2.7.3            X
bottleneck       1.2.1
numexpr          2.6.8
pytest (dev)     5.0.1
mypy (dev)       0.800                      X
setuptools       38.6.0                     X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package         Minimum Version  Changed
beautifulsoup4  4.6.0
fastparquet     0.3.2
fsspec          0.7.4
gcsfs           0.6.0
lxml            4.3.0
matplotlib      2.2.3
numba           0.46.0
openpyxl        3.0.0            X
pyarrow         0.15.0
pymysql         0.7.11
pytables        3.5.1
s3fs            0.4.0
scipy           1.2.0
sqlalchemy      1.2.8
tabulate        0.8.7            X
xarray          0.12.0
xlrd            1.2.0
xlsxwriter      1.0.2
xlwt            1.3.0
pandas-gbq      0.12.0

See Dependencies and Optional dependencies for more.
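One way to compare an environment against the required minimums listed above is sketched below. The helper names (version_tuple, meets_minimum) are hypothetical, and the version parsing is a deliberate simplification that ignores non-numeric parts such as "rc1"; it is not how pandas itself checks versions.

```python
from importlib import metadata

# Minimums copied from the required packages listed above.
MINIMUM_VERSIONS = {
    "numpy": "1.16.5",
    "pytz": "2017.3",
    "python-dateutil": "2.7.3",
}


def version_tuple(version):
    """Turn '1.16.5' into (1, 16, 5); non-numeric parts are skipped."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def meets_minimum(package, minimum):
    """Return True/False for an installed package, or None if it is absent."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return version_tuple(installed) >= version_tuple(minimum)


report = {
    name: meets_minimum(name, minimum)
    for name, minimum in MINIMUM_VERSIONS.items()
}
print(report)
```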

Other API changes

  • Partially initialized CategoricalDtype objects (i.e. those with categories=None) will no longer compare as equal to fully initialized dtype objects.

  • Accessing _constructor_expanddim on a DataFrame and _constructor_sliced on a Series now raise an AttributeError. Previously a NotImplementedError was raised (GH38782)
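The CategoricalDtype change in this list can be illustrated with a minimal sketch; the variable names and the example categories are made up for illustration.

```python
import pandas as pd

# A partially initialized dtype: no categories were given.
partial = pd.CategoricalDtype()

# A fully initialized dtype with explicit categories.
full = pd.CategoricalDtype(categories=["a", "b"])

# With the change described above, this comparison is no longer True.
print(partial == full)
```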

Deprecations

  • Deprecated allowing scalars to be passed to the Categorical constructor (GH38433)

  • Deprecated allowing subclass-specific keyword arguments in the Index constructor, use the specific subclass directly instead (GH14093, GH21311, GH22315, GH26974)

  • Deprecated astype of datetimelike (timedelta64[ns], datetime64[ns], DatetimeTZDtype, PeriodDtype) to integer dtypes, use values.view(...) instead (GH38544)

  • Deprecated MultiIndex.is_lexsorted() and MultiIndex.lexsort_depth(), use MultiIndex.is_monotonic_increasing() instead (GH32259)

  • Deprecated keyword try_cast in Series.where(), Series.mask(), DataFrame.where(), DataFrame.mask(); cast results manually if desired (GH38836)

  • Deprecated comparison of Timestamp object with datetime.date objects. Instead of e.g. ts <= mydate use ts <= pd.Timestamp(mydate) or ts.date() <= mydate (GH36131)

  • Deprecated Rolling.win_type returning "freq" (GH38963)

  • Deprecated Rolling.is_datetimelike (GH38963)

  • Deprecated DataFrame indexer for Series.__setitem__() and DataFrame.__setitem__() (GH39004)

  • Deprecated core.window.ewm.ExponentialMovingWindow.vol() (GH39220)

  • Using .astype to convert between datetime64[ns] dtype and DatetimeTZDtype is deprecated and will raise in a future version, use obj.tz_localize or obj.dt.tz_localize instead (GH38622)

  • Deprecated casting datetime.date objects to datetime64 when used as fill_value in DataFrame.unstack(), DataFrame.shift(), Series.shift(), and DataFrame.reindex(), pass pd.Timestamp(dateobj) instead (GH39767)

  • Deprecated allowing partial failure in Series.transform() and DataFrame.transform() when func is list-like or dict-like; will raise if any function fails on a column in a future version (GH40211)
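A few of the replacements suggested in the deprecations above can be sketched as follows. The dates and variable names are arbitrary, and the snippet shows only the recommended spellings, not the deprecated ones.

```python
import datetime

import pandas as pd

ts = pd.Timestamp("2021-03-01 12:00")
mydate = datetime.date(2021, 3, 1)

# Instead of comparing a Timestamp with a datetime.date directly,
# convert one side so both operands have the same type.
date_not_after_ts = pd.Timestamp(mydate) <= ts
same_or_earlier_day = ts.date() <= mydate

# Instead of .astype from datetime64[ns] to a timezone-aware dtype,
# localize explicitly.
naive = pd.Series(pd.to_datetime(["2021-01-01", "2021-01-02"]))
aware = naive.dt.tz_localize("UTC")

# Instead of passing a datetime.date as fill_value, pass a Timestamp.
shifted = naive.shift(1, fill_value=pd.Timestamp(mydate))

print(date_not_after_ts, same_or_earlier_day)
print(aware.dtype)
print(shifted)
```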

Performance improvements

Bug fixes

Categorical

Datetimelike

Timedelta

  • Bug in constructing Timedelta from np.timedelta64 objects with non-nanosecond units that are out of bounds for timedelta64[ns] (GH38965)

  • Bug in constructing a TimedeltaIndex incorrectly accepting np.datetime64("NaT") objects (GH39462)

  • Bug in constructing Timedelta from input string with only symbols and no digits failed to raise an error (GH39710)

  • Bug in TimedeltaIndex and to_timedelta() failing to raise when passed non-nanosecond timedelta64 arrays that overflow when converting to timedelta64[ns] (GH40008)

Timezones

  • Bug in different tzinfo objects representing UTC not being treated as equivalent (GH39216)

  • Bug in dateutil.tz.gettz("UTC") not being recognized as equivalent to other UTC-representing tzinfos (GH39276)

Numeric

Conversion

Strings

Interval

  • Bug in IntervalIndex.intersection() and IntervalIndex.symmetric_difference() always returning object-dtype when operating with CategoricalIndex (GH38653, GH38741)

  • Bug in IntervalIndex.intersection() returning duplicates when at least one of the Indexes has duplicates which are present in the other (GH38743)

  • IntervalIndex.union(), IntervalIndex.intersection(), IntervalIndex.difference(), and IntervalIndex.symmetric_difference() now cast to the appropriate dtype instead of raising TypeError when operating with another IntervalIndex with incompatible dtype (GH39267)

  • PeriodIndex.union(), PeriodIndex.intersection(), PeriodIndex.symmetric_difference(), PeriodIndex.difference() now cast to object dtype instead of raising IncompatibleFrequency when operating with another PeriodIndex with incompatible dtype (GH??)

Indexing

  • Bug in Index.union() dropping duplicate Index values when Index was not monotonic or sort was set to False (GH36289, GH31326)

  • Bug in CategoricalIndex.get_indexer() failing to raise InvalidIndexError when non-unique (GH38372)

  • Bug in inserting many new columns into a DataFrame causing incorrect subsequent indexing behavior (GH38380)

  • Bug in DataFrame.__setitem__() raising ValueError when setting multiple values to duplicate columns (GH15695)

  • Bug in DataFrame.loc(), Series.loc(), DataFrame.__getitem__() and Series.__getitem__() returning incorrect elements for non-monotonic DatetimeIndex for string slices (GH33146)

  • Bug in DataFrame.reindex() and Series.reindex() with timezone aware indexes raising TypeError for method="ffill" and method="bfill" and specified tolerance (GH38566)

  • Bug in DataFrame.reindex() with datetime64[ns] or timedelta64[ns] incorrectly casting to integers when the fill_value requires casting to object dtype (GH39755)

  • Bug in DataFrame.__setitem__() raising ValueError with empty DataFrame and specified columns for string indexer and non empty DataFrame to set (GH38831)

  • Bug in DataFrame.loc.__setitem__() raising ValueError when expanding unique column for DataFrame with duplicate columns (GH38521)

  • Bug in DataFrame.iloc.__setitem__() and DataFrame.loc.__setitem__() with mixed dtypes when setting with a dictionary value (GH38335)

  • Bug in DataFrame.__setitem__() not raising ValueError when right hand side is a DataFrame with wrong number of columns (GH38604)

  • Bug in Series.__setitem__() raising ValueError when setting a Series with a scalar indexer (GH38303)

  • Bug in DataFrame.loc() dropping levels of MultiIndex when DataFrame used as input has only one row (GH10521)

  • Bug in DataFrame.__getitem__() and Series.__getitem__() always raising KeyError when slicing an Index with milliseconds using existing strings (GH33589)

  • Bug in setting timedelta64 or datetime64 values into numeric Series failing to cast to object dtype (GH39086, GH39619)

  • Bug in setting Interval values into a Series or DataFrame with mismatched IntervalDtype incorrectly casting the new values to the existing dtype (GH39120)

  • Bug in setting datetime64 values into a Series with integer-dtype incorrectly casting the datetime64 values to integers (GH39266)

  • Bug in setting np.datetime64("NaT") into a Series with DatetimeTZDtype incorrectly treating the timezone-naive value as timezone-aware (GH39769)

  • Bug in Index.get_loc() not raising KeyError when method is specified for NaN value when NaN is not in Index (GH39382)

  • Bug in DatetimeIndex.insert() when inserting np.datetime64("NaT") into a timezone-aware index incorrectly treating the timezone-naive value as timezone-aware (GH39769)

  • Bug in Index.insert(), in setting a new column that cannot be held in the existing frame.columns, and in Series.reset_index() or DataFrame.reset_index() incorrectly raising instead of casting to a compatible dtype (GH39068)

  • Bug in RangeIndex.append() where a single object of length 1 was concatenated incorrectly (GH39401)

  • Bug in setting numpy.timedelta64 values into an object-dtype Series using a boolean indexer (GH39488)

  • Bug in setting numeric values into a boolean-dtype Series using at or iat failing to cast to object-dtype (GH39582)

  • Bug in DataFrame.loc.__setitem__() when setting-with-expansion incorrectly raising when the index in the expanding axis contains duplicates (GH40096)
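As one illustration from the list above, the following sketch exercises the DataFrame.__setitem__() entry about a right-hand side with the wrong number of columns. The frames and column names are invented for the example, and a list of two keys is used as the indexer.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
too_wide = pd.DataFrame({"x": [5, 6], "y": [7, 8], "z": [9, 10]})

try:
    # Three columns are assigned to two keys, so the shapes do not match.
    df[["a", "b"]] = too_wide
    raised = False
except ValueError as err:
    raised = True
    print(f"ValueError: {err}")
```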

Missing

MultiIndex

  • Bug in DataFrame.drop() raising TypeError when MultiIndex is non-unique and level is not provided (GH36293)

  • Bug in MultiIndex.intersection() duplicating NaN in result (GH38623)

  • Bug in MultiIndex.equals() incorrectly returning True when the MultiIndex contains NaN, even when the entries are ordered differently (GH38439)

  • Bug in MultiIndex.intersection() always returning empty when intersecting with CategoricalIndex (GH38653)
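The MultiIndex.equals() entry above can be sketched with two indexes that hold the same tuples in a different order; the tuple values are arbitrary.

```python
import numpy as np
import pandas as pd

# Same entries, different order, with NaN present in both levels.
mi1 = pd.MultiIndex.from_tuples([(81.0, np.nan), (np.nan, np.nan)])
mi2 = pd.MultiIndex.from_tuples([(np.nan, np.nan), (81.0, np.nan)])

# With the fix, differently ordered indexes are not reported as equal.
print(mi1.equals(mi2))
```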

I/O

  • Bug in Index.__repr__() when display.max_seq_items=1 (GH38415)

  • Bug in read_csv() not recognizing scientific notation if decimal is set for engine="python" (GH31920)

  • Bug in read_csv() interpreting an NA value as a comment when the NA value contains the comment string; fixed for engine="python" (GH34002)

  • Bug in read_csv() raising IndexError with multiple header columns and index_col specified when file has no data rows (GH38292)

  • Bug in read_csv() not accepting usecols with different length than names for engine="python" (GH16469)

  • Bug in read_csv() returning object dtype when delimiter="," with usecols and parse_dates specified for engine="python" (GH35873)

  • Bug in read_csv() raising TypeError when names and parse_dates is specified for engine="c" (GH33699)

  • Bug in read_clipboard(), DataFrame.to_clipboard() not working in WSL (GH38527)

  • Allow custom error values for parse_dates argument of read_sql(), read_sql_query() and read_sql_table() (GH35185)

  • Bug in to_hdf() raising KeyError when trying to apply for subclasses of DataFrame or Series (GH33748)

  • Bug in put() raising a wrong TypeError when saving a DataFrame with non-string dtype (GH34274)

  • Bug in json_normalize() resulting in the first element of a generator object not being included in the returned DataFrame (GH35923)

  • Bug in read_csv() applying thousands separator to date columns when column should be parsed for dates and usecols is specified for engine="python" (GH39365)

  • Bug in read_excel() forward filling MultiIndex names with multiple header and index columns specified (GH34673)

  • read_excel() now respects set_option() (GH34252)

  • Bug in read_csv() not switching true_values and false_values for nullable boolean dtype (GH34655)

  • Bug in read_json() when orient="split" does not maintain numeric string index (GH28556)

  • read_sql() returned an empty generator if chunksize was non-zero and the query returned no results. It now returns a generator with a single empty DataFrame (GH34411)

  • Bug in read_hdf() returning unexpected records when filtering on categorical string columns using where parameter (GH39189)

  • Bug in read_sas() raising ValueError when datetimes were null (GH39725)
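Two of the entries above, json_normalize() with a generator and read_sql() with chunksize on an empty result, can be checked with a small sketch. The in-memory SQLite database, table name, and record contents are assumptions chosen so that the example is self-contained.

```python
import sqlite3

import pandas as pd

# A generator of records: every element, including the first, should
# end up as a row in the resulting frame.
records = ({"id": i, "name": name} for i, name in enumerate(["a", "b", "c"]))
normalized = pd.json_normalize(records)

# An empty table queried in chunks: the generator should yield one
# empty DataFrame rather than nothing at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
chunks = list(pd.read_sql("SELECT a FROM t", conn, chunksize=10))
conn.close()

print(normalized)
print(len(chunks), chunks[0].empty)
```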

Period

  • Comparisons of Period objects or Index, Series, or DataFrame with mismatched PeriodDtype now behave like other mismatched-type comparisons, returning False for equals, True for not-equal, and raising TypeError for inequality checks (GH39274)
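A minimal sketch of the comparison behavior described above, using a monthly PeriodIndex and a daily Period; the specific dates and frequencies are arbitrary choices for the example.

```python
import pandas as pd

monthly = pd.period_range("2021-01", periods=2, freq="M")
daily = pd.Period("2021-01-01", freq="D")

# Equality checks against a mismatched frequency are element-wise.
equal = monthly == daily
not_equal = monthly != daily

# Ordering checks between mismatched frequencies are rejected.
try:
    monthly < daily
    raised_type_error = False
except TypeError:
    raised_type_error = True

print(equal, not_equal, raised_type_error)
```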

Plotting

  • Bug in scatter_matrix() raising when 2d ax argument passed (GH16253)

  • Prevent warnings when matplotlib’s constrained_layout is enabled (GH25261)

Groupby/resample/rolling

  • Bug in DataFrameGroupBy.agg() and SeriesGroupBy.agg() with PeriodDtype columns incorrectly casting results too aggressively (GH38254)

  • Bug in SeriesGroupBy.value_counts() where unobserved categories in a grouped categorical series were not tallied (GH38672)

  • Bug in SeriesGroupBy.value_counts() where error was raised on an empty series (GH39172)

  • Bug in GroupBy.indices() would contain non-existent indices when null values were present in the groupby keys (GH9304)

  • Fixed bug in DataFrameGroupBy.sum() and SeriesGroupBy.sum() causing loss of precision; these methods now use Kahan summation (GH38778)

  • Fixed bug in DataFrameGroupBy.cumsum(), SeriesGroupBy.cumsum(), DataFrameGroupBy.mean() and SeriesGroupBy.mean() causing loss of precision; these methods now use Kahan summation (GH38934)

  • Bug in Resampler.aggregate() and DataFrame.transform() raising TypeError instead of SpecificationError when missing keys had mixed dtypes (GH39025)

  • Bug in DataFrameGroupBy.idxmin() and DataFrameGroupBy.idxmax() with ExtensionDtype columns (GH38733)

  • Bug in Series.resample() would raise when the index was a PeriodIndex consisting of NaT (GH39227)

  • Bug in core.window.rolling.RollingGroupby.corr() and core.window.expanding.ExpandingGroupby.corr() where the groupby column would return 0 instead of np.nan when providing other that was longer than each group (GH39591)

  • Bug in core.window.expanding.ExpandingGroupby.corr() and core.window.expanding.ExpandingGroupby.cov() where 1 would be returned instead of np.nan when providing other that was longer than each group (GH39591)

  • Bug in GroupBy.mean(), GroupBy.median() and DataFrame.pivot_table() not propagating metadata (GH28283)

  • Bug in Series.rolling() and DataFrame.rolling() not calculating window bounds correctly when window is an offset and dates are in descending order (GH40002)

  • Bug in SeriesGroupBy and DataFrameGroupBy on an empty Series or DataFrame would lose index, columns, and/or data types when directly using the methods idxmax, idxmin, mad, min, max, sum, prod, and skew or using them through apply, aggregate, or resample (GH26411)

  • Bug in DataFrameGroupBy.apply() where a MultiIndex would be created instead of an Index if a core.window.rolling.RollingGroupby object was created (GH39732)

  • Bug in DataFrameGroupBy.sample() where error was raised when weights was specified and the index was an Int64Index (GH39927)

  • Bug in DataFrameGroupBy.aggregate() and Resampler.aggregate() would sometimes raise SpecificationError when passed a dictionary and columns were missing; will now always raise a KeyError instead (GH40004)

  • Bug in DataFrameGroupBy.sample() where column selection was not applied to sample result (GH39928)

  • Bug in core.window.ewm.ExponentialMovingWindow when calling __getitem__ would incorrectly raise a ValueError when providing times (GH40164)

  • Bug in core.window.ewm.ExponentialMovingWindow when calling __getitem__ would not retain com, span, alpha or halflife attributes (GH40164)
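The precision fixes in this list mention Kahan summation. The following is a pure-Python sketch of that technique, written only to illustrate the idea; it is not pandas' own implementation, and the function names and input values are invented for the example.

```python
def naive_sum(values):
    """Plain left-to-right floating-point accumulation."""
    total = 0.0
    for value in values:
        total += value
    return total


def kahan_sum(values):
    """Compensated summation that carries forward the rounding error."""
    total = 0.0
    compensation = 0.0  # low-order bits lost in previous additions
    for value in values:
        adjusted = value - compensation
        new_total = total + adjusted
        # (new_total - total) is what was actually added; subtracting
        # adjusted leaves the rounding error of this step.
        compensation = (new_total - total) - adjusted
        total = new_total
    return total


# Each 1.0 is too small to change 1e16 when added on its own.
values = [1e16, 1.0, 1.0, 1.0, 1.0]

print(naive_sum(values))
print(kahan_sum(values))
```

With these inputs the naive loop never moves away from 1e16, while the compensated loop recovers the four small additions.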

Reshaping

Sparse

  • Bug in DataFrame.sparse.to_coo() raising KeyError with columns that are a numeric Index without a 0 (GH18414)

  • Bug in SparseArray.astype() with copy=False producing incorrect results when going from integer dtype to floating dtype (GH34456)

ExtensionArray

Other

Contributors