Version 0.22.0 (December 29, 2017)¶
This is a major release from 0.21.1 and includes a single, API-breaking change. We recommend that all users upgrade to this version after carefully reading the release note (singular!).
Backwards incompatible API changes¶
pandas 0.22.0 changes the handling of empty and all-NA sums and products. The summary is that
The sum of an empty or all-NA
Seriesis now0The product of an empty or all-NA
Seriesis now1We’ve added a
min_countparameter to.sum()and.prod()controlling the minimum number of valid values for the result to be valid. If fewer thanmin_countnon-NA values are present, the result is NA. The default is0. To returnNaN, the 0.21 behavior, usemin_count=1.
Some background: In pandas 0.21, we fixed a long-standing inconsistency
in the return value of all-NA series depending on whether or not bottleneck
was installed. See Sum/prod of all-NaN or empty Series/DataFrames is now consistently NaN. At the same
time, we changed the sum and prod of an empty Series to also be NaN.
Based on feedback, we’ve partially reverted those changes.
Arithmetic operations¶
The default sum for empty or all-NA Series is now 0.
pandas 0.21.x
In [1]: pd.Series([]).sum()
Out[1]: nan
In [2]: pd.Series([np.nan]).sum()
Out[2]: nan
pandas 0.22.0
In [1]: pd.Series([]).sum()
Out[1]: 0.0
In [2]: pd.Series([np.nan]).sum()
Out[2]: 0.0
The default behavior is the same as pandas 0.20.3 with bottleneck installed. It
also matches the behavior of NumPy’s np.nansum on empty and all-NA arrays.
To have the sum of an empty series return NaN (the default behavior of
pandas 0.20.3 without bottleneck, or pandas 0.21.x), use the min_count
keyword.
In [3]: pd.Series([]).sum(min_count=1)
Out[3]: nan
Thanks to the skipna parameter, the .sum on an all-NA
series is conceptually the same as the .sum of an empty one with
skipna=True (the default).
In [4]: pd.Series([np.nan]).sum(min_count=1) # skipna=True by default
Out[4]: nan
The min_count parameter refers to the minimum number of non-null values
required for a non-NA sum or product.
Series.prod() has been updated to behave the same as Series.sum(),
returning 1 instead.
In [5]: pd.Series([]).prod()
Out[5]: 1.0
In [6]: pd.Series([np.nan]).prod()
Out[6]: 1.0
In [7]: pd.Series([]).prod(min_count=1)
Out[7]: nan
These changes affect DataFrame.sum() and DataFrame.prod() as well.
Finally, a few less obvious places in pandas are affected by this change.
Grouping by a Categorical¶
Grouping by a Categorical and summing now returns 0 instead of
NaN for categories with no observations. The product now returns 1
instead of NaN.
pandas 0.21.x
In [8]: grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])
In [9]: pd.Series([1, 2]).groupby(grouper).sum()
Out[9]:
a 3.0
b NaN
dtype: float64
pandas 0.22
In [8]: grouper = pd.Categorical(["a", "a"], categories=["a", "b"])
In [9]: pd.Series([1, 2]).groupby(grouper).sum()
Out[9]:
a 3
b 0
Length: 2, dtype: int64
To restore the 0.21 behavior of returning NaN for unobserved groups,
use min_count>=1.
In [10]: pd.Series([1, 2]).groupby(grouper).sum(min_count=1)
Out[10]:
a 3.0
b NaN
Length: 2, dtype: float64
Resample¶
The sum and product of all-NA bins has changed from NaN to 0 for
sum and 1 for product.
pandas 0.21.x
In [11]: s = pd.Series([1, 1, np.nan, np.nan],
....: index=pd.date_range('2017', periods=4))
....: s
Out[11]:
2017-01-01 1.0
2017-01-02 1.0
2017-01-03 NaN
2017-01-04 NaN
Freq: D, dtype: float64
In [12]: s.resample('2d').sum()
Out[12]:
2017-01-01 2.0
2017-01-03 NaN
Freq: 2D, dtype: float64
pandas 0.22.0
In [11]: s = pd.Series([1, 1, np.nan, np.nan], index=pd.date_range("2017", periods=4))
In [12]: s.resample("2d").sum()
Out[12]:
2017-01-01 2.0
2017-01-03 0.0
Freq: 2D, Length: 2, dtype: float64
To restore the 0.21 behavior of returning NaN, use min_count>=1.
In [13]: s.resample("2d").sum(min_count=1)
Out[13]:
2017-01-01 2.0
2017-01-03 NaN
Freq: 2D, Length: 2, dtype: float64
In particular, upsampling and taking the sum or product is affected, as upsampling introduces missing values even if the original series was entirely valid.
pandas 0.21.x
In [14]: idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])
In [15]: pd.Series([1, 2], index=idx).resample('12H').sum()
Out[15]:
2017-01-01 00:00:00 1.0
2017-01-01 12:00:00 NaN
2017-01-02 00:00:00 2.0
Freq: 12H, dtype: float64
pandas 0.22.0
In [14]: idx = pd.DatetimeIndex(["2017-01-01", "2017-01-02"])
In [15]: pd.Series([1, 2], index=idx).resample("12H").sum()
Out[15]:
2017-01-01 00:00:00 1
2017-01-01 12:00:00 0
2017-01-02 00:00:00 2
Freq: 12H, Length: 3, dtype: int64
Once again, the min_count keyword is available to restore the 0.21 behavior.
In [16]: pd.Series([1, 2], index=idx).resample("12H").sum(min_count=1)
Out[16]:
2017-01-01 00:00:00 1.0
2017-01-01 12:00:00 NaN
2017-01-02 00:00:00 2.0
Freq: 12H, Length: 3, dtype: float64
Rolling and expanding¶
Rolling and expanding already have a min_periods keyword that behaves
similar to min_count. The only case that changes is when doing a rolling
or expanding sum with min_periods=0. Previously this returned NaN,
when fewer than min_periods non-NA values were in the window. Now it
returns 0.
pandas 0.21.1
In [17]: s = pd.Series([np.nan, np.nan])
In [18]: s.rolling(2, min_periods=0).sum()
Out[18]:
0 NaN
1 NaN
dtype: float64
pandas 0.22.0
In [17]: s = pd.Series([np.nan, np.nan])
In [18]: s.rolling(2, min_periods=0).sum()
Out[18]:
0 0.0
1 0.0
Length: 2, dtype: float64
The default behavior of min_periods=None, implying that min_periods
equals the window size, is unchanged.
Compatibility¶
If you maintain a library that should work across pandas versions, it
may be easiest to exclude pandas 0.21 from your requirements. Otherwise, all your
sum() calls would need to check if the Series is empty before summing.
With setuptools, in your setup.py use:
install_requires=['pandas!=0.21.*', ...]
With conda, use
requirements:
run:
- pandas !=0.21.0,!=0.21.1
Note that the inconsistency in the return value for all-NA series is still there for pandas 0.20.3 and earlier. Avoiding pandas 0.21 will only help with the empty case.
Contributors¶
A total of 1 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
Tom Augspurger