What’s New¶
These are new features and improvements of note in each release.
v0.20.1 (May 5, 2017)¶
This is a major release from 0.19.2 and includes a number of API changes, deprecations, new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
Highlights include:
- New .agg() API for Series/DataFrame similar to the groupby-rolling-resample APIs, see here
- Integration with the feather-format, including a new top-level pd.read_feather() function and DataFrame.to_feather() method, see here
- The .ix indexer has been deprecated, see here
- Panel has been deprecated, see here
- Addition of an IntervalIndex and Interval scalar type, see here
- Improved user API when grouping by index levels in .groupby(), see here
- Improved support for UInt64 dtypes, see here
- A new orient for JSON serialization, orient='table', that uses the Table Schema spec and gives the possibility of a more interactive repr in the Jupyter Notebook, see here
- Experimental support for exporting styled DataFrames (DataFrame.style) to Excel, see here
- Window binary corr/cov operations now return a MultiIndexed DataFrame rather than a Panel, as Panel is now deprecated, see here
- Support for S3 handling now uses s3fs, see here
- Google BigQuery support now uses the pandas-gbq library, see here
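A few of the highlighted APIs can be exercised together in a short sketch (illustrative only, not taken from the release notes; it assumes the optional feather-format package is installed and uses a hypothetical example.feather file name):

import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.random.randn(100)})

# New Interval/IntervalIndex types: pd.cut now returns interval-backed bins
bins = pd.cut(df['value'], 4)

# Feather I/O round-trip via the new top-level reader and DataFrame method
df.to_feather('example.feather')               # hypothetical file name
roundtrip = pd.read_feather('example.feather')

# Table Schema JSON: the new orient='table' enables a richer repr in Jupyter
schema_json = df.head().to_json(orient='table')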
Warning
Pandas has changed the internal structure and layout of the codebase.
This can affect imports that are not from the top-level pandas.* namespace; please see the changes here.
Check the API Changes and deprecations before updating.
Note
This is a combined release for 0.20.0 and 0.20.1.
Version 0.20.1 contains one additional change for backwards-compatibility with downstream projects using pandas’ utils
routines. (GH16250)
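For example, downstream code that previously reached into pandas internals can use the public pandas.api namespace instead; a small illustrative sketch (not part of the release notes):

import pandas as pd
# public type-introspection helpers live under pandas.api.types
from pandas.api.types import is_integer_dtype, is_datetime64_any_dtype

s = pd.Series(pd.date_range('2017-01-01', periods=3))
is_datetime64_any_dtype(s)   # True
is_integer_dtype(s)          # False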
What’s new in v0.20.0

- New features
  - agg API for DataFrame/Series
  - dtype keyword for data IO
  - to_datetime() has gained an origin parameter
  - Groupby Enhancements
  - Better support for compressed URLs in read_csv
  - Pickle file I/O now supports compression
  - UInt64 Support Improved
  - GroupBy on Categoricals
  - Table Schema Output
  - SciPy sparse matrix from/to SparseDataFrame
  - Excel output for styled DataFrames
  - IntervalIndex
  - Other Enhancements
- Backwards incompatible API changes
  - Possible incompatibility for HDF5 formats created with pandas < 0.13.0
  - Map on Index types now return other Index types
  - Accessing datetime fields of Index now return Index
  - pd.unique will now be consistent with extension types
  - S3 File Handling
  - Partial String Indexing Changes
  - Concat of different float dtypes will not automatically upcast
  - Pandas Google BigQuery support has moved
  - Memory Usage for Index is more Accurate
  - DataFrame.sort_index changes
  - Groupby Describe Formatting
  - Window Binary Corr/Cov operations return a MultiIndex DataFrame
  - HDFStore where string comparison
  - Index.intersection and inner join now preserve the order of the left Index
  - Pivot Table always returns a DataFrame
  - Other API Changes
- Reorganization of the library: Privacy Changes
- Deprecations
- Removal of prior version deprecations/changes
- Performance Improvements
- Bug Fixes
New features¶
agg API for DataFrame/Series¶
Series & DataFrame have been enhanced to support the aggregation API. This is a familiar API from groupby, window operations, and resampling. It allows aggregation operations to be expressed in a concise way by using agg() and transform(). The full documentation is here (GH1623).
Here is a sample:
In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
...: index=pd.date_range('1/1/2000', periods=10))
...:
In [2]: df.iloc[3:7] = np.nan
In [3]: df
Out[3]:
A B C
2000-01-01 1.474071 -0.064034 -1.282782
2000-01-02 0.781836 -1.071357 0.441153
2000-01-03 2.353925 0.583787 0.221471
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.901805 1.171216 0.520260
2000-01-09 -1.197071 -1.066969 -0.303421
2000-01-10 -0.858447 0.306996 -0.028665
One can operate using string function names, callables, lists, or dictionaries of these.
Using a single function is equivalent to .apply.
In [4]: df.agg('sum')
Out[4]:
A 3.456119
B -0.140361
C -0.431984
dtype: float64
Multiple aggregations with a list of functions.
In [5]: df.agg(['sum', 'min'])
Out[5]:
A B C
sum 3.456119 -0.140361 -0.431984
min -1.197071 -1.071357 -1.282782
Using a dict provides the ability to apply specific aggregations per column.
You will get a matrix-like output of all of the aggregators: the output has one row per unique function, and functions that are not applied to a particular column will be NaN:
In [6]: df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
Out[6]:
B A
max 1.171216 NaN
min -1.071357 -1.197071
sum NaN 3.456119
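Callables behave like string names; a small illustrative example (not from the original notes), reusing the df from the transcript above:

import numpy as np
df.agg(np.mean)                                # a single callable, applied column-wise
df.agg(['sum', lambda x: x.max() - x.min()])   # mix a string name with a lambda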
The API also supports a .transform() function for broadcasting results.
In [7]: df.transform(['abs', lambda x: x - x.min()])
Out[7]:
A B C
abs <lambda> abs <lambda> abs <lambda>
2000-01-01 1.474071 2.671143 0.064034 1.007322 1.282782 0.000000
2000-01-02 0.781836 1.978907 1.071357 0.000000 0.441153 1.723935
2000-01-03 2.353925 3.550996 0.583787 1.655143 0.221471 1.504252
2000-01-04 NaN NaN NaN NaN NaN NaN
2000-01-05 NaN NaN NaN NaN NaN NaN
2000-01-06 NaN NaN NaN NaN NaN NaN
2000-01-07 NaN NaN NaN NaN NaN NaN
2000-01-08 0.901805 2.098877 1.171216 2.242573 0.520260 1.803042
2000-01-09 1.197071 0.000000 1.066969 0.004388 0.303421 0.979361
2000-01-10 0.858447 0.338624 0.306996 1.378353 0.028665 1.254117
When presented with mixed dtypes that cannot be aggregated, .agg() will only take the valid aggregations. This is similar to how groupby .agg() works. (GH15015)
In [8]: df = pd.DataFrame({'A': [1, 2, 3],
...: 'B': [1., 2., 3.],
...: 'C': ['foo', 'bar', 'baz'],
...: 'D': pd.date_range('20130101', periods=3)})
...:
In [9]: df.dtypes
Out[9]:
A int64
B float64
C object
D datetime64[ns]
dtype: object
In [10]: df.agg(['min', 'sum'])
Out[10]:
A B C D
min 1 1.0 bar 2013-01-01
sum 6 6.0 foobarbaz NaT
dtype keyword for data IO¶
The 'python' engine for read_csv(), as well as the read_fwf() function for parsing fixed-width text files and read_excel() for parsing Excel files, now accept the dtype keyword argument for specifying the types of specific columns (GH14295). See the io docs for more information.
In [11]: data = "a b\n1 2\n3 4"
In [12]: pd.read_fwf(StringIO(data)).dtypes
Out[12]:
a int64
b int64
dtype: object
In [13]: pd.read_fwf(StringIO(data), dtype={'a':'float64', 'b':'object'}).dtypes
Out[13]:
a float64
b object
dtype: object
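The same keyword now also works with the 'python' parser engine of read_csv(); a minimal sketch (the StringIO import below is likewise assumed by the transcript above):

from io import StringIO
import pandas as pd

data = "a,b\n1,2\n3,4"
# the 'python' engine honours dtype just like read_fwf and read_excel
df = pd.read_csv(StringIO(data), engine='python',
                 dtype={'a': 'float64', 'b': 'object'})
df.dtypes   # a -> float64, b -> object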