v0.20.1 (May 5, 2017)¶
This is a major release from 0.19.2 and includes a number of API changes, deprecations, new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
Highlights include:
- New .agg() API for Series/DataFrame similar to the groupby-rolling-resample APIs, see here
- Integration with the feather-format, including new top-level pd.read_feather() and DataFrame.to_feather() methods, see here
- The .ix indexer has been deprecated, see here
- Panel has been deprecated, see here
- Addition of an IntervalIndex and Interval scalar type, see here
- Improved user API when grouping by index levels in .groupby(), see here
- Improved support for UInt64 dtypes, see here
- A new orient for JSON serialization, orient='table', that uses the Table Schema spec and that gives the possibility for a more interactive repr in the Jupyter Notebook, see here
- Experimental support for exporting styled DataFrames (DataFrame.style) to Excel, see here
- Window binary corr/cov operations now return a MultiIndexed DataFrame rather than a Panel, as Panel is now deprecated, see here
- Support for S3 handling now uses s3fs, see here
- Google BigQuery support now uses the pandas-gbq library, see here
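Two of the highlights above can be sketched briefly: the Interval scalar / IntervalIndex types (shown here via pd.cut, which produces Interval-valued bins) and the Table Schema JSON orient. The toy data below is an assumption for illustration, not taken from the release notes.

```python
import pandas as pd

# Interval scalar / IntervalIndex: pd.cut produces Interval-valued bins
ages = pd.Series([5, 15, 25, 35])
binned = pd.cut(ages, bins=[0, 18, 65])
assert isinstance(binned.iloc[0], pd.Interval)  # e.g. the interval (0, 18]

# orient='table': the JSON output embeds a Table Schema description of the frame
df = pd.DataFrame({'a': [1, 2]}, index=pd.Index(['x', 'y'], name='idx'))
payload = df.to_json(orient='table')
assert '"schema"' in payload and '"data"' in payload
```

The embedded schema is what allows richer, typed reprs of the frame in tools such as the Jupyter Notebook.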
Warning
Pandas has changed the internal structure and layout of the code base. This can affect imports that are not from the top-level pandas.* namespace; please see the changes here. Check the API Changes and deprecations before updating.
Note
This is a combined release for 0.20.0 and 0.20.1.
Version 0.20.1 contains one additional change for backwards-compatibility with downstream projects using pandas’ utils
routines. (GH16250)
What’s new in v0.20.0
- New features
- agg API for DataFrame/Series
- dtype keyword for data IO
- .to_datetime() has gained an origin parameter
- Groupby enhancements
- Better support for compressed URLs in read_csv
- Pickle file I/O now supports compression
- UInt64 support improved
- GroupBy on categoricals
- Table schema output
- SciPy sparse matrix from/to SparseDataFrame
- Excel output for styled DataFrames
- IntervalIndex
- Other enhancements
- Backwards incompatible API changes
- Possible incompatibility for HDF5 formats created with pandas < 0.13.0
- Map on Index types now return other Index types
- Accessing datetime fields of Index now return Index
- pd.unique will now be consistent with extension types
- S3 file handling
- Partial string indexing changes
- Concat of different float dtypes will not automatically upcast
- Pandas Google BigQuery support has moved
- Memory usage for Index is more accurate
- DataFrame.sort_index changes
- Groupby describe formatting
- Window binary corr/cov operations return a MultiIndex DataFrame
- HDFStore where string comparison
- Index.intersection and inner join now preserve the order of the left Index
- Pivot table always returns a DataFrame
- Other API changes
- Reorganization of the library: privacy changes
- Deprecations
- Removal of prior version deprecations/changes
- Performance improvements
- Bug fixes
- Contributors
New features¶
agg API for DataFrame/Series¶
Series & DataFrame have been enhanced to support the aggregation API. This is a familiar API from groupby, window operations, and resampling. It allows aggregation operations in a concise way by using agg() and transform(). The full documentation is here (GH1623).
Here is a sample:
In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
...: index=pd.date_range('1/1/2000', periods=10))
...:
In [2]: df.iloc[3:7] = np.nan
In [3]: df
Out[3]:
A B C
2000-01-01 0.469112 -0.282863 -1.509059
2000-01-02 -1.135632 1.212112 -0.173215
2000-01-03 0.119209 -1.044236 -0.861849
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.113648 -1.478427 0.524988
2000-01-09 0.404705 0.577046 -1.715002
2000-01-10 -1.039268 -0.370647 -1.157892
[10 rows x 3 columns]
One can operate using string function names, callables, lists, or dictionaries of these.
Using a single function is equivalent to .apply.
In [4]: df.agg('sum')
Out[4]:
A -1.068226
B -1.387015
C -4.892029
Length: 3, dtype: float64
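The equivalence with .apply noted above can be checked directly; the small frame below is an assumption for illustration only.

```python
import pandas as pd

# illustrative frame (not from the release notes)
df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})

# a single aggregation via .agg() matches a column-wise .apply reduction
assert df.agg('sum').equals(df.apply(lambda s: s.sum()))
```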
Multiple aggregations with a list of functions.
In [5]: df.agg(['sum', 'min'])
Out[5]:
A B C
sum -1.068226 -1.387015 -4.892029
min -1.135632 -1.478427 -1.715002
[2 rows x 3 columns]
Using a dict provides the ability to apply specific aggregations per column.
You will get a matrix-like output of all of the aggregators, with one column per unique function. Entries for functions not applied to a particular column will be NaN:
In [6]: df.agg({'A': ['sum', 'min'], 'B': ['min', 'max']})
Out[6]:
A B
max NaN 1.212112
min -1.135632 -1.478427
sum -1.068226 NaN
[3 rows x 2 columns]
The API also supports a .transform() function for broadcasting results.
In [7]: df.transform(['abs', lambda x: x - x.min()])
Out[7]:
A B C
abs <lambda> abs <lambda> abs <lambda>
2000-01-01 0.469112 1.604745 0.282863 1.195563 1.509059 0.205944
2000-01-02 1.135632 0.000000 1.212112 2.690539 0.173215 1.541787
2000-01-03 0.119209 1.254841 1.044236 0.434191 0.861849 0.853153
2000-01-04 NaN NaN NaN NaN NaN NaN
2000-01-05 NaN NaN NaN NaN NaN NaN
2000-01-06 NaN NaN NaN NaN NaN NaN
2000-01-07 NaN NaN NaN NaN NaN NaN
2000-01-08 0.113648 1.249281 1.478427 0.000000 0.524988 2.239990
2000-01-09 0.404705 1.540338 0.577046 2.055473 1.715002 0.000000
2000-01-10 1.039268 0.096364 0.370647 1.107780 1.157892 0.557110
[10 rows x 6 columns]
When presented with mixed dtypes that cannot be aggregated, .agg() will only take the valid aggregations. This is similar to how groupby .agg() works. (GH15015)
In [8]: df = pd.DataFrame({'A': [1, 2, 3],
...: 'B': [1., 2., 3.],
...: 'C': ['foo', 'bar', 'baz'],
...: 'D': pd.date_range('20130101', periods=3)})
...:
In [9]: df.dtypes
Out[9]:
A int64
B float64
C object
D datetime64[ns]
Length: 4, dtype: object
In [10]: df.agg(['min', 'sum'])
Out[10]:
A B C D
min 1 1.0 bar 2013-01-01
sum 6 6.0 foobarbaz NaT
[2 rows x 4 columns]
dtype keyword for data IO¶
The 'python' engine for read_csv(), as well as the read_fwf() function for parsing fixed-width text files and read_excel() for parsing Excel files, now accept the dtype keyword argument for specifying the types of specific columns (GH14295). See the io docs for more information.
In [11]: data = "a b\n1 2\n3 4"
In [12]: pd.read_fwf(StringIO(data)).dtypes
Out[12]:
a int64
b int64
Length: 2, dtype: object
In [13]: pd.read_fwf(StringIO(data), dtype={'a': 'float64', 'b': 'object'}).dtypes