What’s New¶
These are new features and improvements of note in each release.
v0.19.0 (October 2, 2016)¶
This is a major release from 0.18.1 and includes a number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
Highlights include:
- merge_asof() for asof-style time-series joining, see here.
- .rolling() is now time-series aware, see here.
- read_csv() now supports parsing Categorical data, see here.
- A function union_categoricals() has been added for combining categoricals, see here.
- PeriodIndex now has its own period dtype, and changed to be more consistent with other Index classes. See here.
- Sparse data structures gained enhanced support of int and bool dtypes, see here.
- Comparison operations with Series no longer ignore the index, see here for an overview of the API changes.
- Introduction of a pandas development API for utility functions, see here.
- Deprecation of Panel4D and PanelND. We recommend to represent these types of n-dimensional data with the xarray package.
- Removal of the previously deprecated modules pandas.io.data, pandas.io.wb, pandas.tools.rplot.
Warning
pandas >= 0.19.0 will no longer silence numpy ufunc warnings upon import, see here.
What’s new in v0.19.0

- New features
  - merge_asof for asof-style time-series joining
  - .rolling() is now time-series aware
  - read_csv has improved support for duplicate column names
  - read_csv supports parsing Categorical directly
  - Categorical Concatenation
  - Semi-Month Offsets
  - New Index methods
  - Google BigQuery Enhancements
  - Fine-grained numpy errstate
  - get_dummies now returns integer dtypes
  - Downcast values to smallest possible dtype in to_numeric
  - pandas development API
  - Other enhancements
- API changes
  - Series.tolist() will now return Python types
  - Series operators for different indexes
  - Series type promotion on assignment
  - .to_datetime() changes
  - Merging changes
  - .describe() changes
  - Period changes
  - Index + / - no longer used for set operations
  - Index.difference and .symmetric_difference changes
  - Index.unique consistently returns Index
  - MultiIndex constructors, groupby and set_index preserve categorical dtypes
  - read_csv will progressively enumerate chunks
  - Sparse Changes
  - Indexer dtype changes
  - Other API Changes
- Deprecations
- Removal of prior version deprecations/changes
- Performance Improvements
- Bug Fixes
New features¶
merge_asof
for asof-style time-series joining¶
A long-time requested feature has been added through the merge_asof() function, to support asof-style joining of time-series (GH1870, GH13695, GH13709, GH13902). Full documentation is here.
merge_asof() performs an asof merge, which is similar to a left-join except that we match on the nearest key rather than on equal keys.
In [1]: left = pd.DataFrame({'a': [1, 5, 10],
...: 'left_val': ['a', 'b', 'c']})
...:
In [2]: right = pd.DataFrame({'a': [1, 2, 3, 6, 7],
...: 'right_val': [1, 2, 3, 6, 7]})
...:
In [3]: left
Out[3]:
a left_val
0 1 a
1 5 b
2 10 c
In [4]: right
Out[4]:
a right_val
0 1 1
1 2 2
2 3 3
3 6 6
4 7 7
We typically want to match exactly when possible, and use the most recent value otherwise.
In [5]: pd.merge_asof(left, right, on='a')
Out[5]:
a left_val right_val
0 1 a 1
1 5 b 3
2 10 c 7
We can also match rows ONLY against prior data, and not on exact matches.
In [6]: pd.merge_asof(left, right, on='a', allow_exact_matches=False)
Out[6]:
a left_val right_val
0 1 a NaN
1 5 b 3.0
2 10 c 7.0
In a typical time-series example, we have trades
and quotes
and we want to asof-join
them.
This also illustrates using the by
parameter to group data before merging.
In [7]: trades = pd.DataFrame({
...: 'time': pd.to_datetime(['20160525 13:30:00.023',
...: '20160525 13:30:00.038',
...: '20160525 13:30:00.048',
...: '20160525 13:30:00.048',
...: '20160525 13:30:00.048']),
...: 'ticker': ['MSFT', 'MSFT',
...: 'GOOG', 'GOOG', 'AAPL'],
...: 'price': [51.95, 51.95,
...: 720.77, 720.92, 98.00],
...: 'quantity': [75, 155,
...: 100, 100, 100]},
...: columns=['time', 'ticker', 'price', 'quantity'])
...:
In [8]: quotes = pd.DataFrame({
...: 'time': pd.to_datetime(['20160525 13:30:00.023',
...: '20160525 13:30:00.023',
...: '20160525 13:30:00.030',
...: '20160525 13:30:00.041',
...: '20160525 13:30:00.048',
...: '20160525 13:30:00.049',
...: '20160525 13:30:00.072',
...: '20160525 13:30:00.075']),
...: 'ticker': ['GOOG', 'MSFT', 'MSFT',
...: 'MSFT', 'GOOG', 'AAPL', 'GOOG',
...: 'MSFT'],
...: 'bid': [720.50, 51.95, 51.97, 51.99,
...: 720.50, 97.99, 720.50, 52.01],
...: 'ask': [720.93, 51.96, 51.98, 52.00,
...: 720.93, 98.01, 720.88, 52.03]},
...: columns=['time', 'ticker', 'bid', 'ask'])
...:
In [9]: trades
Out[9]:
time ticker price quantity
0 2016-05-25 13:30:00.023 MSFT 51.95 75
1 2016-05-25 13:30:00.038 MSFT 51.95 155
2 2016-05-25 13:30:00.048 GOOG 720.77 100
3 2016-05-25 13:30:00.048 GOOG 720.92 100
4 2016-05-25 13:30:00.048 AAPL 98.00 100
In [10]: quotes
Out[10]:
time ticker bid ask
0 2016-05-25 13:30:00.023 GOOG 720.50 720.93
1 2016-05-25 13:30:00.023 MSFT 51.95 51.96
2 2016-05-25 13:30:00.030 MSFT 51.97 51.98
3 2016-05-25 13:30:00.041 MSFT 51.99 52.00
4 2016-05-25 13:30:00.048 GOOG 720.50 720.93
5 2016-05-25 13:30:00.049 AAPL 97.99 98.01
6 2016-05-25 13:30:00.072 GOOG 720.50 720.88
7 2016-05-25 13:30:00.075 MSFT 52.01 52.03
An asof merge joins on the on field, typically a datetimelike field, which must be ordered; in this case we are also using a grouper in the by field. This is like a left-outer join, except that forward filling happens automatically, taking the most recent non-NaN value.
In [11]: pd.merge_asof(trades, quotes,
....: on='time',
....: by='ticker')
....:
Out[11]:
time ticker price quantity bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96
1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98
2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93
4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
This returns a merged DataFrame with the entries in the same order as the original left
passed DataFrame (trades
in this case), with the fields of the quotes
merged.
.rolling()
is now time-series aware¶
.rolling()
objects are now time-series aware and can accept a time-series offset (or convertible) for the window
argument (GH13327, GH12995).
See the full documentation here.
In [12]: dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
....: index=pd.date_range('20130101 09:00:00', periods=5, freq='s'))
....:
In [13]: dft
Out[13]:
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 2.0
2013-01-01 09:00:03 NaN
2013-01-01 09:00:04 4.0
This is a regular frequency index, so using an integer window parameter effectively rolls along that frequency.
In [14]: dft.rolling(2).sum()
Out[14]:
B
2013-01-01 09:00:00 NaN
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 3.0
2013-01-01 09:00:03 NaN
2013-01-01 09:00:04 NaN
In [15]: dft.rolling(2, min_periods=1).sum()
Out[15]:
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 3.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:04 4.0
Specifying an offset allows a more intuitive specification of the rolling frequency.
In [16]: dft.rolling('2s').sum()
Out[16]:
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 3.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:04 4.0
Using a non-regular, but still monotonic index, rolling with an integer window does not impart any special calculation.
In [17]: dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
....: index=pd.Index([pd.Timestamp('20130101 09:00:00'),
....: pd.Timestamp('20130101 09:00:02'),
....: pd.Timestamp('20130101 09:00:03'),
....: pd.Timestamp('20130101 09:00:05'),
....: pd.Timestamp('20130101 09:00:06')],
....: name='foo'))
....:
In [18]: dft
Out[18]:
B
foo
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
In [19]: dft.rolling(2).sum()
Out[19]:
B
foo
2013-01-01 09:00:00 NaN
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 NaN
Using the time-specification generates variable windows for this sparse data.
In [20]: dft.rolling('2s').sum()
Out[20]:
B
foo
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
Furthermore, we now allow an optional on
parameter to specify a column (rather than the
default of the index) in a DataFrame.
In [21]: dft = dft.reset_index()
In [22]: dft
Out[22]:
foo B
0 2013-01-01 09:00:00 0.0
1 2013-01-01 09:00:02 1.0
2 2013-01-01 09:00:03 2.0
3 2013-01-01 09:00:05 NaN
4 2013-01-01 09:00:06 4.0
In [23]: dft.rolling('2s', on='foo').sum()
Out[23]:
foo B
0 2013-01-01 09:00:00 0.0
1 2013-01-01 09:00:02 1.0
2 2013-01-01 09:00:03 3.0
3 2013-01-01 09:00:05 NaN
4 2013-01-01 09:00:06 4.0
read_csv
has improved support for duplicate column names¶
Duplicate column names are now supported in read_csv()
whether
they are in the file or passed in as the names
parameter (GH7160, GH9424)
In [24]: data = '0,1,2\n3,4,5'
In [25]: names = ['a', 'b', 'a']
Previous behavior:
In [2]: pd.read_csv(StringIO(data), names=names)
Out[2]:
a b a
0 2 1 2
1 5 4 5
The first a
column contained the same data as the second a
column, when it should have
contained the values [0, 3]
.
New behavior:
In [26]: pd.read_csv(StringIO(data), names=names)
Out[26]:
a b a.1
0 0 1 2
1 3 4 5
read_csv
supports parsing Categorical
directly¶
The read_csv()
function now supports parsing a Categorical
column when
specified as a dtype (GH10153). Depending on the structure of the data,
this can result in a faster parse time and lower memory usage compared to
converting to Categorical
after parsing. See the io docs here.
In [27]: data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'
In [28]: pd.read_csv(StringIO(data))
Out[28]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3
In [29]: pd.read_csv(StringIO(data)).dtypes
Out[29]:
col1 object
col2 object
col3 int64
dtype: object
In [30]: pd.read_csv(StringIO(data), dtype='category').dtypes
Out[30]:
col1 category
col2 category
col3 category
dtype: object
Individual columns can be parsed as a Categorical
using a dict specification
In [31]: pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes
Out[31]:
col1 category
col2 object
col3 int64
dtype: object
Note
The resulting categories will always be parsed as strings (object dtype).
If the categories are numeric they can be converted using the
to_numeric()
function, or as appropriate, another converter
such as to_datetime()
.
In [32]: df = pd.read_csv(StringIO(data), dtype='category')
In [33]: df.dtypes
Out[33]:
col1 category
col2 category
col3 category
dtype: object
In [34]: df['col3']
Out[34]:
0 1
1 2
2 3
Name: col3, dtype: category
Categories (3, object): [1, 2, 3]
In [35]: df['col3'].cat.categories = pd.to_numeric(df['col3'].cat.categories)
In [36]: df['col3']
Out[36]:
0 1
1 2
2 3
Name: col3, dtype: category
Categories (3, int64): [1, 2, 3]
Categorical Concatenation¶
- A function union_categoricals() has been added for combining categoricals, see Unioning Categoricals (GH13361, GH13763, GH13846, GH14173)

In [37]: from pandas.types.concat import union_categoricals

In [38]: a = pd.Categorical(["b", "c"])

In [39]: b = pd.Categorical(["a", "b"])

In [40]: union_categoricals([a, b])
Out[40]:
[b, c, a, b]
Categories (3, object): [b, c, a]

- concat and append now can concat category dtypes with different categories as object dtype (GH13524)

In [41]: s1 = pd.Series(['a', 'b'], dtype='category')

In [42]: s2 = pd.Series(['b', 'c'], dtype='category')

Previous behavior:

In [1]: pd.concat([s1, s2])
ValueError: incompatible categories in categorical concat

New behavior:

In [43]: pd.concat([s1, s2])
Out[43]:
0    a
1    b
0    b
1    c
dtype: object
Semi-Month Offsets¶
Pandas has gained new frequency offsets, SemiMonthEnd
(‘SM’) and SemiMonthBegin
(‘SMS’).
These provide date offsets anchored (by default) to the 15th and end of month, and 15th and 1st of month respectively.
(GH1543)
In [44]: from pandas.tseries.offsets import SemiMonthEnd, SemiMonthBegin
SemiMonthEnd:
In [45]: pd.Timestamp('2016-01-01') + SemiMonthEnd()
Out[45]: Timestamp('2016-01-15 00:00:00')
In [46]: pd.date_range('2015-01-01', freq='SM', periods=4)
Out[46]: DatetimeIndex(['2015-01-15', '2015-01-31', '2015-02-15', '2015-02-28'], dtype='datetime64[ns]', freq='SM-15')
SemiMonthBegin:
In [47]: pd.Timestamp('2016-01-01') + SemiMonthBegin()
Out[47]: Timestamp('2016-01-15 00:00:00')
In [48]: pd.date_range('2015-01-01', freq='SMS', periods=4)
Out[48]: DatetimeIndex(['2015-01-01', '2015-01-15', '2015-02-01', '2015-02-15'], dtype='datetime64[ns]', freq='SMS-15')
Using the anchoring suffix, you can also specify the day of month to use instead of the 15th.
In [49]: pd.date_range('2015-01-01', freq='SMS-16', periods=4)
Out[49]: DatetimeIndex(['2015-01-01', '2015-01-16', '2015-02-01', '2015-02-16'], dtype='datetime64[ns]', freq='SMS-16')
In [50]: pd.date_range('2015-01-01', freq='SM-14', periods=4)
Out[50]: DatetimeIndex(['2015-01-14', '2015-01-31', '2015-02-14', '2015-02-28'], dtype='datetime64[ns]', freq='SM-14')
New Index methods¶
The following methods and options are added to Index
, to be more consistent with the Series
and DataFrame
API.
Index
now supports the .where()
function for same shape indexing (GH13170)
In [51]: idx = pd.Index(['a', 'b', 'c'])
In [52]: idx.where([True, False, True])
Out[52]: Index([u'a', nan, u'c'], dtype='object')
Index
now supports .dropna()
to exclude missing values (GH6194)
In [53]: idx = pd.Index([1, 2, np.nan, 4])
In [54]: idx.dropna()
Out[54]: Float64Index([1.0, 2.0, 4.0], dtype='float64')
For MultiIndex
, values are dropped if any level is missing by default. Specifying
how='all'
only drops values where all levels are missing.
In [55]: midx = pd.MultiIndex.from_arrays([[1, 2, np.nan, 4],
....: [1, 2, np.nan, np.nan]])
....:
In [56]: midx
Out[56]:
MultiIndex(levels=[[1, 2, 4], [1, 2]],
labels=[[0, 1, -1, 2], [0, 1, -1, -1]])
In [57]: midx.dropna()
Out[57]:
MultiIndex(levels=[[1, 2, 4], [1, 2]],
labels=[[0, 1], [0, 1]])
In [58]: midx.dropna(how='all')
Out[58]:
MultiIndex(levels=[[1, 2, 4], [1, 2]],
labels=[[0, 1, 2], [0, 1, -1]])
Index
now supports .str.extractall()
which returns a DataFrame
, see the docs here (GH10008, GH13156)
In [59]: idx = pd.Index(["a1a2", "b1", "c1"])
In [60]: idx.str.extractall("[ab](?P<digit>\d)")
Out[60]:
digit
match
0 0 1
1 2
1 0 1
Index.astype()
now accepts an optional boolean argument copy
, which allows optional copying if the requirements on dtype are satisfied (GH13209)
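As a small illustrative sketch of the copy argument (the data values here are made up, not from the release notes): when the index already satisfies the requested dtype, copy=False may return the index without copying the underlying data.

import pandas as pd
idx = pd.Index([1, 2, 3])
# dtype requirement already satisfied, so no cast is needed;
# copy=False may reuse the existing data rather than copying it
same = idx.astype('int64', copy=False)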
Google BigQuery Enhancements¶
Fine-grained numpy errstate¶
Previous versions of pandas would permanently silence numpy’s ufunc error handling when pandas
was imported. Pandas did this in order to silence the warnings that would arise from using numpy ufuncs on missing data, which are usually represented as NaN
s. Unfortunately, this silenced legitimate warnings arising in non-pandas code in the application. Starting with 0.19.0, pandas will use the numpy.errstate
context manager to silence these warnings in a more fine-grained manner, only around where these operations are actually used in the pandas codebase. (GH13109, GH13145)
After upgrading pandas, you may see new RuntimeWarnings
being issued from your code. These are likely legitimate, and the underlying cause likely existed in the code when using previous versions of pandas that simply silenced the warning. Use numpy.errstate around the source of the RuntimeWarning
to control how these conditions are handled.
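For code that intentionally operates on missing data, a minimal sketch of silencing a specific warning with numpy.errstate (the values are illustrative):

import numpy as np
# np.sqrt of a negative float would otherwise emit a RuntimeWarning
# ("invalid value encountered"); the context manager suppresses it locally.
with np.errstate(invalid='ignore'):
    result = np.sqrt(np.array([-1.0, 4.0]))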
get_dummies
now returns integer dtypes¶
The pd.get_dummies
function now returns dummy-encoded columns as small integers, rather than floats (GH8725). This should provide an improved memory footprint.
Previous behavior:
In [1]: pd.get_dummies(['a', 'b', 'a', 'c']).dtypes
Out[1]:
a float64
b float64
c float64
dtype: object
New behavior:
In [61]: pd.get_dummies(['a', 'b', 'a', 'c']).dtypes
Out[61]:
a uint8
b uint8
c uint8
dtype: object
Downcast values to smallest possible dtype in to_numeric
¶
pd.to_numeric() now accepts a downcast parameter, which will downcast the data if possible to the smallest specified numerical dtype (GH13352)
In [62]: s = ['1', 2, 3]
In [63]: pd.to_numeric(s, downcast='unsigned')
Out[63]: array([1, 2, 3], dtype=uint8)
In [64]: pd.to_numeric(s, downcast='integer')
Out[64]: array([1, 2, 3], dtype=int8)
pandas development API¶
As part of making the pandas API more uniform and accessible in the future, we have created a standard sub-package of pandas, pandas.api, to hold public APIs. We are starting by exposing type introspection functions in pandas.api.types. More sub-packages and officially sanctioned APIs will be published in future versions of pandas (GH13147, GH13634)
The following are now part of this API:
In [65]: import pprint
In [66]: from pandas.api import types
In [67]: funcs = [ f for f in dir(types) if not f.startswith('_') ]
In [68]: pprint.pprint(funcs)
['is_any_int_dtype',
'is_bool',
'is_bool_dtype',
'is_categorical',
'is_categorical_dtype',
'is_complex',
'is_complex_dtype',
'is_datetime64_any_dtype',
'is_datetime64_dtype',
'is_datetime64_ns_dtype',
'is_datetime64tz_dtype',
'is_datetimetz',
'is_dict_like',
'is_dtype_equal',
'is_extension_type',
'is_float',
'is_float_dtype',
'is_floating_dtype',
'is_hashable',
'is_int64_dtype',
'is_integer',
'is_integer_dtype',
'is_iterator',
'is_list_like',
'is_named_tuple',
'is_number',
'is_numeric_dtype',
'is_object_dtype',
'is_period',
'is_period_dtype',
'is_re',
'is_re_compilable',
'is_scalar',
'is_sequence',
'is_sparse',
'is_string_dtype',
'is_timedelta64_dtype',
'is_timedelta64_ns_dtype',
'pandas_dtype']
Note
Calling these functions from the internal module pandas.core.common will now show a DeprecationWarning (GH13990).
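A minimal usage sketch of two of the introspection functions listed above (illustrative values):

import pandas as pd
from pandas.api import types
# dtype introspection on a Series
types.is_integer_dtype(pd.Series([1, 2, 3]))  # True
# duck-type introspection on a plain Python object
types.is_list_like([1, 2])                    # True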
Other enhancements¶
- Timestamp can now accept positional and keyword parameters similar to datetime.datetime() (GH10758, GH11630)

In [69]: pd.Timestamp(2012, 1, 1)
Out[69]: Timestamp('2012-01-01 00:00:00')

In [70]: pd.Timestamp(year=2012, month=1, day=1, hour=8, minute=30)
Out[70]: Timestamp('2012-01-01 08:30:00')

- The .resample() function now accepts an on= or level= parameter for resampling on a datetimelike column or MultiIndex level (GH13500)

In [71]: df = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
   ....:                    'a': np.arange(5)},
   ....:                   index=pd.MultiIndex.from_arrays([
   ....:                       [1, 2, 3, 4, 5],
   ....:                       pd.date_range('2015-01-01', freq='W', periods=5)],
   ....:                       names=['v', 'd']))
   ....:

In [72]: df
Out[72]:
                a       date
v d
1 2015-01-04    0 2015-01-04
2 2015-01-11    1 2015-01-11
3 2015-01-18    2 2015-01-18
4 2015-01-25    3 2015-01-25
5 2015-02-01    4 2015-02-01

In [73]: df.resample('M', on='date').sum()
Out[73]:
            a
date
2015-01-31  6
2015-02-28  4

In [74]: df.resample('M', level='d').sum()
Out[74]:
            a
d
2015-01-31  6
2015-02-28  4

- The .get_credentials() method of GbqConnector can now first try to fetch the application default credentials. See the docs for more details (GH13577).
- The .tz_localize() method of DatetimeIndex and Timestamp has gained the errors keyword, so you can potentially coerce nonexistent timestamps to NaT. The default behavior remains raising a NonExistentTimeError (GH13057)
- .to_hdf/read_hdf() now accept path objects (e.g. pathlib.Path, py.path.local) for the file path (GH11773)
- pd.read_csv() with engine='python' has gained support for the decimal (GH12933), na_filter (GH13321) and memory_map options (GH13381).
- Consistent with the Python API, pd.read_csv() will now interpret +inf as positive infinity (GH13274)
- pd.read_html() has gained support for the na_values, converters, keep_default_na options (GH13461)
- Categorical.astype() now accepts an optional boolean argument copy, effective when dtype is categorical (GH13209)
- DataFrame has gained the .asof() method to return the last non-NaN values according to the selected subset (GH13358)
- The DataFrame constructor will now respect key ordering if a list of OrderedDict objects is passed in (GH13304)
- pd.read_html() has gained support for the decimal option (GH12907)
- Series has gained the properties .is_monotonic, .is_monotonic_increasing, .is_monotonic_decreasing, similar to Index (GH13336)
- DataFrame.to_sql() now allows a single value as the SQL type for all columns (GH11886).
- Series.append now supports the ignore_index option (GH13677)
- .to_stata() and StataWriter can now write variable labels to Stata dta files using a dictionary to map column names to labels (GH13535, GH13536)
- .to_stata() and StataWriter will automatically convert datetime64[ns] columns to Stata format %tc, rather than raising a ValueError (GH12259)
- read_stata() and StataReader raise with a more explicit error message when reading Stata files with repeated value labels when convert_categoricals=True (GH13923)
- DataFrame.style will now render sparsified MultiIndexes (GH11655)
- DataFrame.style will now show column level names (e.g. DataFrame.columns.names) (GH13775)
- DataFrame has gained support to re-order the columns based on the values in a row using df.sort_values(by='...', axis=1) (GH10806)

In [75]: df = pd.DataFrame({'A': [2, 7], 'B': [3, 5], 'C': [4, 8]},
   ....:                   index=['row1', 'row2'])
   ....:

In [76]: df
Out[76]:
      A  B  C
row1  2  3  4
row2  7  5  8

In [77]: df.sort_values(by='row2', axis=1)
Out[77]:
      B  A  C
row1  3  2  4
row2  5  7  8

- Added documentation to I/O regarding the perils of reading in columns with mixed dtypes and how to handle it (GH13746)
- to_html() now has a border argument to control the value in the opening <table> tag. The default is the value of the html.border option, which defaults to 1. This also affects the notebook HTML repr, but since Jupyter's CSS includes a border-width attribute, the visual effect is the same. (GH11563)
- Raise ImportError in the sql functions when sqlalchemy is not installed and a connection string is used (GH11920).
- Compatibility with matplotlib 2.0. Older versions of pandas should also work with matplotlib 2.0 (GH13333)
- Timestamp, Period, DatetimeIndex, PeriodIndex and the .dt accessor have gained a .is_leap_year property to check whether the date belongs to a leap year (GH13727)
- astype() will now accept a dict of column name to data types mapping as the dtype argument (GH12086)
- pd.read_json and DataFrame.to_json have gained support for reading and writing json lines with the lines option, see Line delimited json (GH9180)
- read_excel() now supports the true_values and false_values keyword arguments (GH13347)
- groupby() will now accept a scalar and a single-element list for specifying level on a non-MultiIndex grouper (GH13907)
- Non-convertible dates in an excel date column will be returned without conversion and the column will be object dtype, rather than raising an exception (GH10001)
- pd.Timedelta(None) is now accepted and will return NaT, mirroring pd.Timestamp (GH13687)
- pd.read_stata() can now handle some format 111 files, which are produced by SAS when generating Stata dta files (GH11526)
- Series and Index now support divmod, which will return a tuple of series or indices. This behaves like a standard binary operator with regards to broadcasting rules (GH14208); see the sketch after this list.
API changes¶
Series.tolist()
will now return Python types¶
Series.tolist()
will now return Python types in the output, mimicking NumPy .tolist()
behavior (GH10904)
In [78]: s = pd.Series([1,2,3])
Previous behavior:
In [7]: type(s.tolist()[0])
Out[7]:
<class 'numpy.int64'>
New behavior:
In [79]: type(s.tolist()[0])
Out[79]: int
Series
operators for different indexes¶
The following Series operators have been changed to make all operators consistent, including DataFrame (GH1134, GH4581, GH13538)

- Series comparison operators now raise ValueError when the indexes are different.
- Series logical operators align the indexes of both the left and right hand side.
Warning
Until 0.18.1, comparing Series with the same length would succeed even if the .index are different (the result ignored .index). As of 0.19.0, this raises a ValueError in order to be more strict. This section also describes how to keep the previous behavior or align different indexes, using flexible comparison methods like .eq.
As a result, Series
and DataFrame
operators behave as below:
Arithmetic operators¶
Arithmetic operators align both index
(no changes).
In [80]: s1 = pd.Series([1, 2, 3], index=list('ABC'))
In [81]: s2 = pd.Series([2, 2, 2], index=list('ABD'))
In [82]: s1 + s2
Out[82]:
A 3.0
B 4.0
C NaN
D NaN
dtype: float64
In [83]: df1 = pd.DataFrame([1, 2, 3], index=list('ABC'))
In [84]: df2 = pd.DataFrame([2, 2, 2], index=list('ABD'))
In [85]: df1 + df2
Out[85]:
0
A 3.0
B 4.0
C NaN
D NaN
Comparison operators¶
Comparison operators raise ValueError
when .index
are different.
Previous Behavior (Series
):
Series
compared values ignoring the .index
as long as both had the same length:
In [1]: s1 == s2
Out[1]:
A False
B True
C False
dtype: bool
New behavior (Series
):
In [2]: s1 == s2
Out[2]:
ValueError: Can only compare identically-labeled Series objects
Note
To achieve the same result as previous versions (compare values based on locations ignoring .index
), compare both .values
.
In [86]: s1.values == s2.values
Out[86]: array([False, True, False], dtype=bool)
If you want to compare Series
aligning its .index
, see flexible comparison methods section below:
In [87]: s1.eq(s2)
Out[87]:
A False
B True
C False
D False
dtype: bool
Current Behavior (DataFrame
, no change):
In [3]: df1 == df2
Out[3]:
ValueError: Can only compare identically-labeled DataFrame objects
Logical operators¶
Logical operators align both .index
of left and right hand side.
Previous behavior (Series
), only left hand side index
was kept:
In [4]: s1 = pd.Series([True, False, True], index=list('ABC'))
In [5]: s2 = pd.Series([True, True, True], index=list('ABD'))
In [6]: s1 & s2
Out[6]:
A True
B False
C False
dtype: bool
New behavior (Series
):
In [88]: s1 = pd.Series([True, False, True], index=list('ABC'))
In [89]: s2 = pd.Series([True, True, True], index=list('ABD'))
In [90]: s1 & s2
Out[90]:
A True
B False
C False
D False
dtype: bool
Note
Series
logical operators fill a NaN
result with False
.
Note
To achieve the same result as previous versions (compare values based on only left hand side index), you can use reindex_like
:
In [91]: s1 & s2.reindex_like(s1)
Out[91]:
A True
B False
C False
dtype: bool
Current Behavior (DataFrame
, no change):
In [92]: df1 = pd.DataFrame([True, False, True], index=list('ABC'))
In [93]: df2 = pd.DataFrame([True, True, True], index=list('ABD'))
In [94]: df1 & df2
Out[94]:
0
A True
B False
C NaN
D NaN
Flexible comparison methods¶
Series flexible comparison methods like eq, ne, le, lt, ge and gt now align both index. Use these methods if you want to compare two Series which have different indexes.
In [95]: s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
In [96]: s2 = pd.Series([2, 2, 2], index=['b', 'c', 'd'])
In [97]: s1.eq(s2)
Out[97]:
a False
b True
c False
d False
dtype: bool
In [98]: s1.ge(s2)
Out[98]:
a False
b True
c True
d False
dtype: bool
Previously, this worked the same as comparison operators (see above).
Series
type promotion on assignment¶
A Series will now correctly promote its dtype for assignment with incompatible values to the current dtype (GH13234)
In [99]: s = pd.Series()
Previous behavior:
In [2]: s["a"] = pd.Timestamp("2016-01-01")
In [3]: s["b"] = 3.0
TypeError: invalid type promotion
New behavior:
In [100]: s["a"] = pd.Timestamp("2016-01-01")
In [101]: s["b"] = 3.0
In [102]: s
Out[102]:
a 2016-01-01 00:00:00
b 3
dtype: object
In [103]: s.dtype
Out[103]: dtype('O')
.to_datetime()
changes¶
Previously, if .to_datetime() encountered mixed integers/floats and strings, but no datetimes, with errors='coerce' it would convert all to NaT.
Previous behavior:
In [2]: pd.to_datetime([1, 'foo'], errors='coerce')
Out[2]: DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)
Current behavior:
This will now convert integers/floats with the default unit of ns
.
In [104]: pd.to_datetime([1, 'foo'], errors='coerce')
Out[104]: DatetimeIndex(['1970-01-01 00:00:00.000000001', 'NaT'], dtype='datetime64[ns]', freq=None)
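For reference, the unit keyword controls how such numbers are interpreted; a brief sketch with illustrative values:

import pandas as pd
pd.to_datetime([1], unit='s')   # DatetimeIndex(['1970-01-01 00:00:01'], ...)
pd.to_datetime([1])             # default unit is 'ns': 1970-01-01 00:00:00.000000001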
Bug fixes related to .to_datetime()
:
- Bug in pd.to_datetime() when passing integers or floats, and no unit and errors='coerce' (GH13180).
- Bug in pd.to_datetime() when passing invalid datatypes (e.g. bool); it will now respect the errors keyword (GH13176)
- Bug in pd.to_datetime() which overflowed on int8 and int16 dtypes (GH13451)
- Bug in pd.to_datetime() raising AttributeError with NaN and an invalid string when errors='ignore' (GH12424)
- Bug in pd.to_datetime() which did not cast floats correctly when unit was specified, resulting in truncated datetimes (GH13834)
Merging changes¶
Merging will now preserve the dtype of the join keys (GH8596)
In [105]: df1 = pd.DataFrame({'key': [1], 'v1': [10]})
In [106]: df1
Out[106]:
key v1
0 1 10
In [107]: df2 = pd.DataFrame({'key': [1, 2], 'v1': [20, 30]})
In [108]: df2
Out[108]:
key v1
0 1 20
1 2 30
Previous behavior:
In [5]: pd.merge(df1, df2, how='outer')
Out[5]:
key v1
0 1.0 10.0
1 1.0 20.0
2 2.0 30.0
In [6]: pd.merge(df1, df2, how='outer').dtypes
Out[6]:
key float64
v1 float64
dtype: object
New behavior:
We are able to preserve the join keys
In [109]: pd.merge(df1, df2, how='outer')
Out[109]:
key v1
0 1 10
1 1 20
2 2 30
In [110]: pd.merge(df1, df2, how='outer').dtypes
Out[110]:
key int64
v1 int64
dtype: object
Of course, if you have missing values that are introduced, then the resulting dtype will be upcast, which is unchanged from previous versions.
In [111]: pd.merge(df1, df2, how='outer', on='key')
Out[111]:
key v1_x v1_y
0 1 10.0 20
1 2 NaN 30
In [112]: pd.merge(df1, df2, how='outer', on='key').dtypes
Out[112]:
key int64
v1_x float64
v1_y int64
dtype: object
.describe()
changes¶
Percentile identifiers in the index of a .describe()
output will now be rounded to the least precision that keeps them distinct (GH13104)
In [113]: s = pd.Series([0, 1, 2, 3, 4])
In [114]: df = pd.DataFrame([0, 1, 2, 3, 4])
Previous behavior:
The percentiles were rounded to at most one decimal place, which could raise ValueError
for a data frame if the percentiles were duplicated.
In [3]: s.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])
Out[3]:
count 5.000000
mean 2.000000
std 1.581139
min 0.000000
0.0% 0.000400
0.1% 0.002000
0.1% 0.004000
50% 2.000000
99.9% 3.996000
100.0% 3.998000
100.0% 3.999600
max 4.000000
dtype: float64
In [4]: df.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])
Out[4]:
...
ValueError: cannot reindex from a duplicate axis
New behavior:
In [115]: s.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])
Out[115]:
count 5.000000
mean 2.000000
std 1.581139
min 0.000000
0.01% 0.000400
0.05% 0.002000
0.1% 0.004000
50% 2.000000
99.9% 3.996000
99.95% 3.998000
99.99% 3.999600
max 4.000000
dtype: float64
In [116]: df.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])
Out[116]:
0
count 5.000000
mean 2.000000
std 1.581139
min 0.000000
0.01% 0.000400
0.05% 0.002000
0.1% 0.004000
50% 2.000000
99.9% 3.996000
99.95% 3.998000
99.99% 3.999600
max 4.000000
Furthermore:
- Passing duplicated percentiles will now raise a ValueError.
- Bug in .describe() on a DataFrame with a mixed-dtype column index, which would previously raise a TypeError (GH13288)
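A minimal sketch of the new duplicate check (illustrative; only the raising behavior is from the release notes):

import pandas as pd
s = pd.Series([0, 1, 2, 3, 4])
# duplicated percentiles are now rejected rather than silently handled
s.describe(percentiles=[0.25, 0.25])   # raises ValueError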
Period
changes¶
PeriodIndex
now has period
dtype¶
PeriodIndex
now has its own period
dtype. The period
dtype is a
pandas extension dtype like category
or the timezone aware dtype (datetime64[ns, tz]
) (GH13941).
As a consequence of this change, PeriodIndex
no longer has an integer dtype:
Previous behavior:
In [1]: pi = pd.PeriodIndex(['2016-08-01'], freq='D')
In [2]: pi
Out[2]: PeriodIndex(['2016-08-01'], dtype='int64', freq='D')
In [3]: pd.api.types.is_integer_dtype(pi)
Out[3]: True
In [4]: pi.dtype
Out[4]: dtype('int64')
New behavior:
In [117]: pi = pd.PeriodIndex(['2016-08-01'], freq='D')
In [118]: pi
Out[118]: PeriodIndex(['2016-08-01'], dtype='period[D]', freq='D')
In [119]: pd.api.types.is_integer_dtype(pi)
Out[119]: False
In [120]: pd.api.types.is_period_dtype(pi)
Out[120]: True
In [121]: pi.dtype
Out[121]: period[D]
In [122]: type(pi.dtype)
Out[122]: pandas.types.dtypes.PeriodDtype
Period('NaT')
now returns pd.NaT
¶
Previously, Period had its own Period('NaT') representation, different from pd.NaT. Now Period('NaT') has been changed to return pd.NaT. (GH12759, GH13582)
Previous behavior:
In [5]: pd.Period('NaT', freq='D')
Out[5]: Period('NaT', 'D')
New behavior:
These now result in pd.NaT without needing to provide the freq option.
In [123]: pd.Period('NaT')
Out[123]: NaT
In [124]: pd.Period(None)
Out[124]: NaT
To be compatible with Period
addition and subtraction, pd.NaT
now supports addition and subtraction with int
. Previously it raised ValueError
.
Previous behavior:
In [5]: pd.NaT + 1
...
ValueError: Cannot add integral value to Timestamp without freq.
New behavior:
In [125]: pd.NaT + 1
Out[125]: NaT
In [126]: pd.NaT - 1
Out[126]: NaT
PeriodIndex.values
now returns array of Period
object¶
.values
is changed to return an array of Period
objects, rather than an array
of integers (GH13988).
Previous behavior:
In [6]: pi = pd.PeriodIndex(['2011-01', '2011-02'], freq='M')
In [7]: pi.values
array([492, 493])
New behavior:
In [127]: pi = pd.PeriodIndex(['2011-01', '2011-02'], freq='M')
In [128]: pi.values
Out[128]: array([Period('2011-01', 'M'), Period('2011-02', 'M')], dtype=object)
Index +
/ -
no longer used for set operations¶
Addition and subtraction of the base Index type and of DatetimeIndex
(not the numeric index types)
previously performed set operations (set union and difference). This
behavior was already deprecated since 0.15.0 (in favor of using the specific
.union()
and .difference()
methods), and is now disabled. When
possible, +
and -
are now used for element-wise operations, for
example for concatenating strings or subtracting datetimes
(GH8227, GH14127).
Previous behavior:
In [1]: pd.Index(['a', 'b']) + pd.Index(['a', 'c'])
FutureWarning: using '+' to provide set union with Indexes is deprecated, use '|' or .union()
Out[1]: Index(['a', 'b', 'c'], dtype='object')
New behavior: the same operation will now perform element-wise addition:
In [129]: pd.Index(['a', 'b']) + pd.Index(['a', 'c'])
Out[129]: Index([u'aa', u'bc'], dtype='object')
Note that numeric Index objects already performed element-wise operations.
For example, the behavior of adding two integer Indexes is unchanged.
The base Index
is now made consistent with this behavior.
In [130]: pd.Index([1, 2, 3]) + pd.Index([2, 3, 4])
Out[130]: Int64Index([3, 5, 7], dtype='int64')
Further, because of this change, it is now possible to subtract two DatetimeIndex objects resulting in a TimedeltaIndex:
Previous behavior:
In [1]: pd.DatetimeIndex(['2016-01-01', '2016-01-02']) - pd.DatetimeIndex(['2016-01-02', '2016-01-03'])
FutureWarning: using '-' to provide set differences with datetimelike Indexes is deprecated, use .difference()
Out[1]: DatetimeIndex(['2016-01-01'], dtype='datetime64[ns]', freq=None)
New behavior:
In [131]: pd.DatetimeIndex(['2016-01-01', '2016-01-02']) - pd.DatetimeIndex(['2016-01-02', '2016-01-03'])
Out[131]: TimedeltaIndex(['-1 days', '-1 days'], dtype='timedelta64[ns]', freq=None)
Index.difference
and .symmetric_difference
changes¶
Index.difference
and Index.symmetric_difference
will now, more consistently, treat NaN
values as any other values. (GH13514)
In [132]: idx1 = pd.Index([1, 2, 3, np.nan])
In [133]: idx2 = pd.Index([0, 1, np.nan])
Previous behavior:
In [3]: idx1.difference(idx2)
Out[3]: Float64Index([nan, 2.0, 3.0], dtype='float64')
In [4]: idx1.symmetric_difference(idx2)
Out[4]: Float64Index([0.0, nan, 2.0, 3.0], dtype='float64')
New behavior:
In [134]: idx1.difference(idx2)
Out[134]: Float64Index([2.0, 3.0], dtype='float64')
In [135]: idx1.symmetric_difference(idx2)
Out[135]: Float64Index([0.0, 2.0, 3.0], dtype='float64')
Index.unique
consistently returns Index
¶
Index.unique()
now returns unique values as an
Index
of the appropriate dtype
. (GH13395).
Previously, most Index
classes returned np.ndarray
, and DatetimeIndex
,
TimedeltaIndex
and PeriodIndex
returned Index
to keep metadata like timezone.
Previous behavior:
In [1]: pd.Index([1, 2, 3]).unique()
Out[1]: array([1, 2, 3])
In [2]: pd.DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], tz='Asia/Tokyo').unique()
Out[2]:
DatetimeIndex(['2011-01-01 00:00:00+09:00', '2011-01-02 00:00:00+09:00',
'2011-01-03 00:00:00+09:00'],
dtype='datetime64[ns, Asia/Tokyo]', freq=None)
New behavior:
In [136]: pd.Index([1, 2, 3]).unique()
Out[136]: Int64Index([1, 2, 3], dtype='int64')
In [137]: pd.DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], tz='Asia/Tokyo').unique()
Out[137]:
DatetimeIndex(['2011-01-01 00:00:00+09:00', '2011-01-02 00:00:00+09:00',
'2011-01-03 00:00:00+09:00'],
dtype='datetime64[ns, Asia/Tokyo]', freq=None)
MultiIndex
constructors, groupby
and set_index
preserve categorical dtypes¶
MultiIndex.from_arrays
and MultiIndex.from_product
will now preserve categorical dtype
in MultiIndex
levels (GH13743, GH13854).
In [138]: cat = pd.Categorical(['a', 'b'], categories=list("bac"))
In [139]: lvl1 = ['foo', 'bar']
In [140]: midx = pd.MultiIndex.from_arrays([cat, lvl1])
In [141]: midx
Out[141]:
MultiIndex(levels=[[u'b', u'a', u'c'], [u'bar', u'foo']],
labels=[[1, 0], [1, 0]])
Previous behavior:
In [4]: midx.levels[0]
Out[4]: Index(['b', 'a', 'c'], dtype='object')
In [5]: midx.get_level_values(0)
Out[5]: Index(['a', 'b'], dtype='object')
New behavior: the single level is now a CategoricalIndex
:
In [142]: midx.levels[0]
Out[142]: CategoricalIndex([u'b', u'a', u'c'], categories=[u'b', u'a', u'c'], ordered=False, dtype='category')
In [143]: midx.get_level_values(0)
Out[143]: CategoricalIndex([u'a', u'b'], categories=[u'b', u'a', u'c'], ordered=False, dtype='category')
An analogous change has been made to MultiIndex.from_product
.
As a consequence, groupby
and set_index
also preserve categorical dtypes in indexes
In [144]: df = pd.DataFrame({'A': [0, 1], 'B': [10, 11], 'C': cat})
In [145]: df_grouped = df.groupby(by=['A', 'C']).first()
In [146]: df_set_idx = df.set_index(['A', 'C'])
Previous behavior:
In [11]: df_grouped.index.levels[1]
Out[11]: Index(['b', 'a', 'c'], dtype='object', name='C')
In [12]: df_grouped.reset_index().dtypes
Out[12]:
A int64
C object
B float64
dtype: object
In [13]: df_set_idx.index.levels[1]
Out[13]: Index(['b', 'a', 'c'], dtype='object', name='C')
In [14]: df_set_idx.reset_index().dtypes
Out[14]:
A int64
C object
B int64
dtype: object
New behavior:
In [147]: df_grouped.index.levels[1]
Out[147]: CategoricalIndex([u'b', u'a', u'c'], categories=[u'b', u'a', u'c'], ordered=False, name=u'C', dtype='category')
In [148]: df_grouped.reset_index().dtypes
Out[148]:
A int64
C category
B float64
dtype: object
In [149]: df_set_idx.index.levels[1]
Out[149]: CategoricalIndex([u'b', u'a', u'c'], categories=[u'b', u'a', u'c'], ordered=False, name=u'C', dtype='category')
In [150]: df_set_idx.reset_index().dtypes
Out[150]:
A int64
C category
B int64
dtype: object
read_csv
will progressively enumerate chunks¶
When read_csv()
is called with chunksize=n
and without specifying an index,
each chunk used to have an independently generated index from 0
to n-1
.
They are now given instead a progressive index, starting from 0
for the first chunk,
from n
for the second, and so on, so that, when concatenated, they are identical to
the result of calling read_csv()
without the chunksize=
argument
(GH12185).
In [151]: data = 'A,B\n0,1\n2,3\n4,5\n6,7'
Previous behavior:
In [2]: pd.concat(pd.read_csv(StringIO(data), chunksize=2))
Out[2]:
A B
0 0 1
1 2 3
0 4 5
1 6 7
New behavior:
In [152]: pd.concat(pd.read_csv(StringIO(data), chunksize=2))
Out[152]:
A B
0 0 1
1 2 3
2 4 5
3 6 7
Sparse Changes¶
These changes allow pandas to handle sparse data with more dtypes, and aim to make data handling a smoother experience.
int64
and bool
support enhancements¶
Sparse data structures have gained enhanced support of int64 and bool dtypes (GH667, GH13849).
Previously, sparse data were float64
dtype by default, even if all inputs were of int
or bool
dtype. You had to specify dtype
explicitly to create sparse data with int64
dtype. Also, fill_value
had to be specified explicitly because the default was np.nan
which doesn’t appear in int64
or bool
data.
In [1]: pd.SparseArray([1, 2, 0, 0])
Out[1]:
[1.0, 2.0, 0.0, 0.0]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3], dtype=int32)
# specifying int64 dtype, but all values are stored in sp_values because
# fill_value default is np.nan
In [2]: pd.SparseArray([1, 2, 0, 0], dtype=np.int64)
Out[2]:
[1, 2, 0, 0]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3], dtype=int32)
In [3]: pd.SparseArray([1, 2, 0, 0], dtype=np.int64, fill_value=0)
Out[3]:
[1, 2, 0, 0]
Fill: 0
IntIndex
Indices: array([0, 1], dtype=int32)
As of v0.19.0, sparse data keeps the input dtype, and uses more appropriate fill_value
defaults (0
for int64
dtype, False
for bool
dtype).
In [153]: pd.SparseArray([1, 2, 0, 0], dtype=np.int64)
Out[153]:
[1, 2, 0, 0]
Fill: 0
IntIndex
Indices: array([0, 1], dtype=int32)
In [154]: pd.SparseArray([True, False, False, False])
Out[154]:
[True, False, False, False]
Fill: False
IntIndex
Indices: array([0], dtype=int32)
See the docs for more details.
Operators now preserve dtypes¶
- Sparse data structures now can preserve dtype after arithmetic ops (GH13848)

In [155]: s = pd.SparseSeries([0, 2, 0, 1], fill_value=0, dtype=np.int64)

In [156]: s.dtype
Out[156]: dtype('int64')

In [157]: s + 1
Out[157]:
0    1
1    3
2    1
3    2
dtype: int64
BlockIndex
Block locations: array([1, 3], dtype=int32)
Block lengths: array([1, 1], dtype=int32)

- Sparse data structures now support astype to convert internal dtype (GH13900)

In [158]: s = pd.SparseSeries([1., 0., 2., 0.], fill_value=0)

In [159]: s
Out[159]:
0    1.0
1    0.0
2    2.0
3    0.0
dtype: float64
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 1], dtype=int32)

In [160]: s.astype(np.int64)
Out[160]:
0    1
1    0
2    2
3    0
dtype: int64
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 1], dtype=int32)

astype fails if the data contains values which cannot be converted to the specified dtype. Note that this limitation also applies to fill_value, whose default is np.nan.

In [7]: pd.SparseSeries([1., np.nan, 2., np.nan], fill_value=np.nan).astype(np.int64)
Out[7]:
ValueError: unable to coerce current fill_value nan to int64 dtype
Other sparse fixes¶
- Subclassed SparseDataFrame and SparseSeries now preserve class types when slicing or transposing (GH13787)
- SparseArray with bool dtype now supports logical (bool) operators (GH14000)
- Bug in SparseSeries with MultiIndex [] indexing may raise IndexError (GH13144)
- Bug in SparseSeries with MultiIndex [] indexing result may have normal Index (GH13144)
- Bug in SparseDataFrame in which axis=None did not default to axis=0 (GH13048)
- Bug in SparseSeries and SparseDataFrame creation with object dtype may raise TypeError (GH11633)
- Bug in SparseDataFrame doesn't respect passed SparseArray or SparseSeries's dtype and fill_value (GH13866)
- Bug in SparseArray and SparseSeries don't apply ufunc to fill_value (GH13853)
- Bug in SparseSeries.abs incorrectly keeps negative fill_value (GH13853)
- Bug in single row slicing on multi-type SparseDataFrames, types were previously forced to float (GH13917)
- Bug in SparseSeries slicing changes integer dtype to float (GH8292)
- Bug in SparseDataFrame comparison ops may raise TypeError (GH13001)
- Bug in SparseDataFrame.isnull raises ValueError (GH8276)
- Bug in SparseSeries representation with bool dtype may raise IndexError (GH13110)
- Bug in SparseSeries and SparseDataFrame of bool or int64 dtype may display its values like float64 dtype (GH13110)
- Bug in sparse indexing using SparseArray with bool dtype may return incorrect result (GH13985)
- Bug in SparseArray created from SparseSeries may lose dtype (GH13999)
- Bug in SparseSeries comparison with dense returns normal Series rather than SparseSeries (GH13999)
Indexer dtype changes¶
Note
This change only affects 64 bit python running on Windows, and only affects relatively advanced indexing operations
Methods such as Index.get_indexer
that return an indexer array, coerce that array to a “platform int”, so that it can be
directly used in 3rd party library operations like numpy.take
. Previously, a platform int was defined as np.int_
which corresponds to a C integer, but the correct type, and what is being used now, is np.intp
, which corresponds
to the C integer size that can hold a pointer (GH3033, GH13972).
These types are the same on many platforms, but for 64 bit python on Windows,
np.int_
is 32 bits, and np.intp
is 64 bits. Changing this behavior improves performance for many
operations on that platform.
Previous behavior:
In [1]: i = pd.Index(['a', 'b', 'c'])
In [2]: i.get_indexer(['b', 'b', 'c']).dtype
Out[2]: dtype('int32')
New behavior:
In [1]: i = pd.Index(['a', 'b', 'c'])
In [2]: i.get_indexer(['b', 'b', 'c']).dtype
Out[2]: dtype('int64')
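The size difference behind this change is easy to inspect directly; a small sketch (the output is platform-dependent):

import numpy as np
# On 64-bit Windows np.int_ is 32 bits wide, while np.intp matches the
# pointer size (64 bits); on most other 64-bit platforms both are 64 bits.
print(np.dtype(np.int_).itemsize, np.dtype(np.intp).itemsize)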
Other API Changes¶
- Timestamp.to_pydatetime will issue a UserWarning when warn=True, and the instance has a non-zero number of nanoseconds; previously this would print a message to stdout (GH14101).
- Series.unique() with datetime and timezone now returns an array of Timestamp with timezone (GH13565).
- Panel.to_sparse() will raise a NotImplementedError exception when called (GH13778).
- Index.reshape() will raise a NotImplementedError exception when called (GH12882).
- .filter() enforces mutual exclusion of the keyword arguments (GH12399).
- eval's upcasting rules for float32 types have been updated to be more consistent with NumPy's rules. New behavior will not upcast to float64 if you multiply a pandas float32 object by a scalar float64 (GH12388).
- An UnsupportedFunctionCall error is now raised if NumPy ufuncs like np.mean are called on groupby or resample objects (GH12811).
- __setitem__ will no longer apply a callable rhs as a function instead of storing it. Call where directly to get the previous behavior (GH13299).
- Calls to .sample() will respect the random seed set via numpy.random.seed(n) (GH13161)
- Styler.apply is now more strict about the outputs your function must return. For axis=0 or axis=1, the output shape must be identical. For axis=None, the output must be a DataFrame with identical columns and index labels (GH13222).
- Float64Index.astype(int) will now raise ValueError if Float64Index contains NaN values (GH13149)
- TimedeltaIndex.astype(int) and DatetimeIndex.astype(int) will now return Int64Index instead of np.array (GH13209)
- Passing Period with multiple frequencies to normal Index now returns Index with object dtype (GH13664)
- PeriodIndex.fillna with Period of a different freq now coerces to object dtype (GH13664)
- Faceted boxplots from DataFrame.boxplot(by=col) now return a Series when return_type is not None. Previously these returned an OrderedDict. Note that when return_type=None, the default, these still return a 2-D NumPy array (GH12216, GH7096).
- pd.read_hdf will now raise a ValueError instead of KeyError, if a mode other than r, r+ and a is supplied (GH13623)
- pd.read_csv(), pd.read_table(), and pd.read_hdf() raise the builtin FileNotFoundError exception for Python 3.x when called on a nonexistent file; this is back-ported as IOError in Python 2.x (GH14086)
- More informative exceptions are passed through the csv parser. The exception type would now be the original exception type instead of CParserError (GH13652).
- pd.read_csv() in the C engine will now issue a ParserWarning or raise a ValueError when sep encoded is more than one character long (GH14065)
- DataFrame.values will now return float64 with a DataFrame of mixed int64 and uint64 dtypes, conforming to np.find_common_type (GH10364, GH13917)
- groupby.groups will now return a dictionary of Index objects, rather than a dictionary of np.ndarray or lists (GH14293); see the sketch after this list.
Deprecations¶
- Series.reshape and Categorical.reshape have been deprecated and will be removed in a subsequent release (GH12882)
- PeriodIndex.to_datetime has been deprecated in favor of PeriodIndex.to_timestamp (GH8254)
- Timestamp.to_datetime has been deprecated in favor of Timestamp.to_pydatetime (GH8254)
- Index.to_datetime and DatetimeIndex.to_datetime have been deprecated in favor of pd.to_datetime (GH8254)
- The pandas.core.datetools module has been deprecated and will be removed in a subsequent release (GH14094)
- SparseList has been deprecated and will be removed in a future version (GH13784)
- DataFrame.to_html() and DataFrame.to_latex() have dropped the colSpace parameter in favor of col_space (GH13857)
- DataFrame.to_sql() has deprecated the flavor parameter, as it is superfluous when SQLAlchemy is not installed (GH13611)
- Deprecated read_csv keywords:
  - compact_ints and use_unsigned have been deprecated and will be removed in a future version (GH13320)
  - buffer_lines has been deprecated and will be removed in a future version (GH13360)
  - as_recarray has been deprecated and will be removed in a future version (GH13373)
  - skip_footer has been deprecated in favor of skipfooter and will be removed in a future version (GH13349)
- Top-level pd.ordered_merge() has been renamed to pd.merge_ordered() and the original name will be removed in a future version (GH13358)
- The Timestamp.offset property (and named arg in the constructor) has been deprecated in favor of freq (GH12160)
- pd.tseries.util.pivot_annual is deprecated. Use pivot_table as an alternative; an example is here (GH736)
- pd.tseries.util.isleapyear has been deprecated and will be removed in a subsequent release. Datetime-likes now have a .is_leap_year property (GH13727)
- Panel4D and PanelND constructors are deprecated and will be removed in a future version. The recommended way to represent these types of n-dimensional data is with the xarray package. Pandas provides a to_xarray() method to automate this conversion (GH13564)
- pandas.tseries.frequencies.get_standard_freq is deprecated. Use pandas.tseries.frequencies.to_offset(freq).rule_code instead (GH13874)
- pandas.tseries.frequencies.to_offset's freqstr keyword is deprecated in favor of freq (GH13874)
- Categorical.from_array has been deprecated and will be removed in a future version (GH13854)
Removal of prior version deprecations/changes¶
- The SparsePanel class has been removed (GH13778)
- The pd.sandbox module has been removed in favor of the external library pandas-qt (GH13670)
- The pandas.io.data and pandas.io.wb modules are removed in favor of the pandas-datareader package (GH13724).
- The pandas.tools.rplot module has been removed in favor of the seaborn package (GH13855)
- DataFrame.to_csv() has dropped the engine parameter, as was deprecated in 0.17.1 (GH11274, GH13419)
- DataFrame.to_dict() has dropped the outtype parameter in favor of orient (GH13627, GH8486)
- pd.Categorical has dropped setting of the ordered attribute directly in favor of the set_ordered method (GH13671)
- pd.Categorical has dropped the levels attribute in favor of categories (GH8376)
- DataFrame.to_sql() has dropped the mysql option for the flavor parameter (GH13611)
- Panel.shift() has dropped the lags parameter in favor of periods (GH14041)
- pd.Index has dropped the diff method in favor of difference (GH13669)
- pd.DataFrame has dropped the to_wide method in favor of to_panel (GH14039)
- Series.to_csv has dropped the nanRep parameter in favor of na_rep (GH13804)
- Series.xs, DataFrame.xs, Panel.xs, Panel.major_xs, and Panel.minor_xs have dropped the copy parameter (GH13781)
- str.split has dropped the return_type parameter in favor of expand (GH13701)
- Removal of the legacy time rules (offset aliases), deprecated since 0.17.0 (these had been aliases since 0.8.0) (GH13590, GH13868). Legacy time rules now raise ValueError. For the list of currently supported offsets, see here.
- The default value for the return_type parameter for DataFrame.plot.box and DataFrame.boxplot changed from None to "axes". These methods will now return a matplotlib axes by default instead of a dictionary of artists. See here (GH6581).
- The tquery and uquery functions in the pandas.io.sql module are removed (GH5950).
Performance Improvements¶
- Improved performance of sparse IntIndex.intersect (GH13082)
- Improved performance of sparse arithmetic with BlockIndex when the number of blocks is large, though it is recommended to use IntIndex in such cases (GH13082)
- Improved performance of DataFrame.quantile() as it now operates per-block (GH11623)
- Improved performance of float64 hash table operations, fixing some very slow indexing and groupby operations in python 3 (GH13166, GH13334)
- Improved performance of DataFrameGroupBy.transform (GH12737)
- Improved performance of Index and Series .duplicated (GH10235)
- Improved performance of Index.difference (GH12044)
- Improved performance of RangeIndex.is_monotonic_increasing and is_monotonic_decreasing (GH13749)
- Improved performance of datetime string parsing in DatetimeIndex (GH13692)
- Improved performance of hashing Period (GH12817)
- Improved performance of factorize of datetime with timezone (GH13750)
- Improved performance by lazily creating indexing hashtables on larger Indexes (GH14266)
- Improved performance of groupby.groups (GH14293)
- Removed unnecessary materializing of a MultiIndex when introspecting for memory usage (GH14308)
Bug Fixes¶
- Bug in
groupby().shift()
, which could cause a segfault or corruption in rare circumstances when grouping by columns with missing values (GH13813) - Bug in
groupby().cumsum()
calculatingcumprod
whenaxis=1
. (GH13994) - Bug in
pd.to_timedelta()
in which theerrors
parameter was not being respected (GH13613) - Bug in
io.json.json_normalize()
, where non-ascii keys raised an exception (GH13213) - Bug when passing a not-default-indexed
Series
asxerr
oryerr
in.plot()
(GH11858) - Bug in area plot draws legend incorrectly if subplot is enabled or legend is moved after plot (matplotlib 1.5.0 is required to draw area plot legend properly) (GH9161, GH13544)
- Bug in
DataFrame
assignment with an object-dtypedIndex
where the resultant column is mutable to the original object. (GH13522) - Bug in matplotlib
AutoDataFormatter
; this restores the second scaled formatting and re-adds micro-second scaled formatting (GH13131) - Bug in selection from a
HDFStore
with a fixed format andstart
and/orstop
specified will now return the selected range (GH8287) - Bug in
Categorical.from_codes()
where an unhelpful error was raised when an invalidordered
parameter was passed in (GH14058) - Bug in
Series
construction from a tuple of integers on windows not returning default dtype (int64) (GH13646) - Bug in
TimedeltaIndex
addition with a Datetime-like object where addition overflow was not being caught (GH14068) - Bug in
.groupby(..).resample(..)
when the same object is called multiple times (GH13174) - Bug in
.to_records()
when index name is a unicode string (GH13172) - Bug in calling
.memory_usage()
on object which doesn’t implement (GH12924) - Regression in
Series.quantile
with nans (also shows up in.median()
and.describe()
); furthermore now names theSeries
with the quantile (GH13098, GH13146) - Bug in
SeriesGroupBy.transform
with datetime values and missing groups (GH13191) - Bug where empty
Series
were incorrectly coerced in datetime-like numeric operations (GH13844) - Bug in
Categorical
constructor when passed aCategorical
containing datetimes with timezones (GH14190) - Bug in
Series.str.extractall()
withstr
index raisesValueError
(GH13156) - Bug in
Series.str.extractall()
with single group and quantifier (GH13382) - Bug in
DatetimeIndex
andPeriod
subtraction raisesValueError
orAttributeError
rather thanTypeError
(GH13078) - Bug in
Index
andSeries
created withNaN
andNaT
mixed data may not havedatetime64
dtype (GH13324) - Bug in
Index
andSeries
may ignorenp.datetime64('nat')
andnp.timdelta64('nat')
to infer dtype (GH13324) - Bug in
PeriodIndex
andPeriod
subtraction raisesAttributeError
(GH13071) - Bug in
PeriodIndex
construction returning afloat64
index in some circumstances (GH13067) - Bug in
.resample(..)
with aPeriodIndex
not changing itsfreq
appropriately when empty (GH13067) - Bug in
.resample(..)
with aPeriodIndex
not retaining its type or name with an emptyDataFrame
appropriately when empty (GH13212) - Bug in
groupby(..).apply(..)
when the passed function returns scalar values per group (GH13468). - Bug in
groupby(..).resample(..)
where passing some keywords would raise an exception (GH13235) - Bug in
.tz_convert
on a tz-awareDateTimeIndex
that relied on index being sorted for correct results (GH13306) - Bug in
.tz_localize
withdateutil.tz.tzlocal
may return incorrect result (GH13583) - Bug in
DatetimeTZDtype
dtype withdateutil.tz.tzlocal
cannot be regarded as valid dtype (GH13583) - Bug in
pd.read_hdf()
where attempting to load an HDF file with a single dataset, that had one or more categorical columns, failed unless the key argument was set to the name of the dataset. (GH13231) - Bug in
.rolling()
that allowed a negative integer window in contruction of theRolling()
object, but would later fail on aggregation (GH13383) - Bug in
Series
indexing with tuple-valued data and a numeric index (GH13509) - Bug in printing
pd.DataFrame
where unusual elements with theobject
dtype were causing segfaults (GH13717) - Bug in ranking
Series
which could result in segfaults (GH13445) - Bug in various index types, which did not propagate the name of passed index (GH12309)
- Bug in
DatetimeIndex
, which did not honour thecopy=True
(GH13205) - Bug in
DatetimeIndex.is_normalized
returns incorrectly for normalized date_range in case of local timezones (GH13459) - Bug in
pd.concat
and.append
may coercesdatetime64
andtimedelta
toobject
dtype containing python built-indatetime
ortimedelta
rather thanTimestamp
orTimedelta
(GH13626) - Bug in
PeriodIndex.append
may raise AttributeError
when the result isobject
dtype (GH13221) - Bug in
CategoricalIndex.append
may accept normallist
(GH13626) - Bug in
pd.concat
and.append
with the same timezone get reset to UTC (GH7795) - Bug in
Series
andDataFrame
.append
raisesAmbiguousTimeError
if data contains datetime near DST boundary (GH13626) - Bug in
DataFrame.to_csv()
in which float values were being quoted even though quotations were specified for non-numeric values only (GH12922, GH13259) - Bug in
DataFrame.describe()
raisingValueError
with only boolean columns (GH13898) - Bug in
MultiIndex
slicing where extra elements were returned when level is non-unique (GH12896) - Bug in
.str.replace
does not raiseTypeError
for invalid replacement (GH13438) - Bug in
MultiIndex.from_arrays
which didn’t check for input array lengths matching (GH13599) - Bug in
cartesian_product
andMultiIndex.from_product
which may raise with empty input arrays (GH12258) - Bug in
pd.read_csv()
which may cause a segfault or corruption when iterating in large chunks over a stream/file under rare circumstances (GH13703) - Bug in
pd.read_csv()
which caused errors to be raised when a dictionary containing scalars is passed in forna_values
(GH12224) - Bug in
pd.read_csv()
which caused BOM files to be incorrectly parsed by not ignoring the BOM (GH4793) - Bug in
pd.read_csv()
withengine='python'
which raised errors when a numpy array was passed in forusecols
(GH12546) - Bug in
pd.read_csv()
where the index columns were being incorrectly parsed when parsed as dates with athousands
parameter (GH14066) - Bug in
pd.read_csv()
withengine='python'
in whichNaN
values weren’t being detected after data was converted to numeric values (GH13314) - Bug in
pd.read_csv()
in which thenrows
argument was not properly validated for both engines (GH10476) - Bug in
pd.read_csv()
withengine='python'
in which infinities of mixed-case forms were not being interpreted properly (GH13274) - Bug in
pd.read_csv()
withengine='python'
in which trailingNaN
values were not being parsed (GH13320) - Bug in
pd.read_csv()
withengine='python'
when reading from atempfile.TemporaryFile
on Windows with Python 3 (GH13398) - Bug in
pd.read_csv()
that preventsusecols
kwarg from accepting single-byte unicode strings (GH13219) - Bug in
pd.read_csv()
that preventsusecols
from being an empty set (GH13402) - Bug in
pd.read_csv()
in the C engine where the NULL character was not being parsed as NULL (GH14012) - Bug in
pd.read_csv()
withengine='c'
in which NULLquotechar
was not accepted even thoughquoting
was specified asNone
(GH13411) - Bug in
pd.read_csv()
withengine='c'
in which fields were not properly cast to float when quoting was specified as non-numeric (GH13411) - Bug in
pd.read_csv()
in Python 2.x with non-UTF8 encoded, multi-character separated data (GH3404) - Bug in
pd.read_csv()
, where aliases for utf-xx (e.g. UTF-xx, UTF_xx, utf_xx) raised UnicodeDecodeError (GH13549) - Bug in
pd.read_csv
,pd.read_table
,pd.read_fwf
,pd.read_stata
andpd.read_sas
where files were opened by parsers but not closed if bothchunksize
anditerator
wereNone
. (GH13940) - Bug in
StataReader
,StataWriter
,XportReader
andSAS7BDATReader
where a file was not properly closed when an error was raised. (GH13940) - Bug in
pd.pivot_table()
wheremargins_name
is ignored whenaggfunc
is a list (GH13354) - Bug in
pd.Series.str.zfill
,center
,ljust
,rjust
, andpad
when passing non-integers, did not raiseTypeError
(GH13598) - Bug in checking for any null objects in a
TimedeltaIndex
, which always returnedTrue
(GH13603) - Bug in
Series
arithmetic raisesTypeError
if it contains datetime-like asobject
dtype (GH13043) - Bug
Series.isnull()
andSeries.notnull()
ignorePeriod('NaT')
(GH13737) - Bug
Series.fillna()
andSeries.dropna()
don’t affect Period('NaT')
(GH13737) - Bug in
.fillna(value=np.nan)
incorrectly raisesKeyError
on acategory
dtypedSeries
(GH14021) - Bug in extension dtype creation where the created types were not is/identical (GH13285)
- Bug in
.resample(..)
where incorrect warnings were triggered by IPython introspection (GH13618) - Bug in
NaT
-Period
raisesAttributeError
(GH13071) - Bug in
Series
comparison may output incorrect result if rhs containsNaT
(GH9005) - Bug in
Series
andIndex
comparison may output incorrect result if it containsNaT
withobject
dtype (GH13592) - Bug in
Period
addition raisesTypeError
ifPeriod
is on right hand side (GH13069) - Bug in
Period
andSeries
orIndex
comparison raisesTypeError
(GH13200) - Bug in
pd.set_eng_float_format()
that would prevent NaN and Inf from formatting (GH11981) - Bug in
.unstack
withCategorical
dtype resets.ordered
toTrue
(GH13249) - Clean some compile time warnings in datetime parsing (GH13607)
- Bug in
factorize
raisesAmbiguousTimeError
if data contains datetime near DST boundary (GH13750) - Bug in
.set_index
raisesAmbiguousTimeError
if new index contains DST boundary and multi levels (GH12920) - Bug in
.shift
raisesAmbiguousTimeError
if data contains datetime near DST boundary (GH13926) - Bug in
pd.read_hdf()
returns incorrect result when aDataFrame
with acategorical
column and a query which doesn’t match any values (GH13792) - Bug in
.iloc
when indexing with a non lex-sorted MultiIndex (GH13797) - Bug in
.loc
when indexing with date strings in a reverse sortedDatetimeIndex
(GH14316) - Bug in
Series
comparison operators when dealing with zero dim NumPy arrays (GH13006) - Bug in
.combine_first
may return incorrectdtype
(GH7630, GH10567) - Bug in
groupby
whereapply
returns different result depending on whether first result isNone
or not (GH12824) - Bug in
groupby(..).nth()
where the group key is included inconsistently if called after.head()/.tail()
(GH12839) - Bug in
.to_html
,.to_latex
and.to_string
silently ignore custom datetime formatter passed through theformatters
key word (GH10690) - Bug in
DataFrame.iterrows()
, not yielding aSeries
subclass if defined (GH13977) - Bug in
pd.to_numeric
whenerrors='coerce'
and input contains non-hashable objects (GH13324) - Bug in invalid
Timedelta
arithmetic and comparison may raiseValueError
rather thanTypeError
(GH13624) - Bug in invalid datetime parsing in
to_datetime
andDatetimeIndex
may raiseTypeError
rather thanValueError
(GH11169, GH11287) - Bug in
Index
created with tz-awareTimestamp
and mismatchedtz
option incorrectly coerces timezone (GH13692) - Bug in
DatetimeIndex
with nanosecond frequency does not include timestamp specified withend
(GH13672) - Bug in
Series
when setting a slice with a np.timedelta64
(GH14155) - Bug in
Index
raisesOutOfBoundsDatetime
ifdatetime
exceedsdatetime64[ns]
bounds, rather than coercing toobject
dtype (GH13663) - Bug in
Index
may ignore specifieddatetime64
ortimedelta64
passed asdtype
(GH13981) - Bug in
RangeIndex
could be created with no arguments rather than raising TypeError
(GH13793) - Bug in
.value_counts()
raisesOutOfBoundsDatetime
if data exceedsdatetime64[ns]
bounds (GH13663) - Bug in
DatetimeIndex
may raiseOutOfBoundsDatetime
if inputnp.datetime64
has other unit thanns
(GH9114) - Bug in
Series
creation withnp.datetime64
which has other unit thanns
asobject
dtype results in incorrect values (GH13876) - Bug in
resample
with timedelta data where data was casted to float (GH13119). - Bug in
pd.isnull()
pd.notnull()
raiseTypeError
if input datetime-like has other unit thanns
(GH13389) - Bug in
pd.merge()
may raiseTypeError
if input datetime-like has other unit thanns
(GH13389) - Bug in
HDFStore
/read_hdf()
discardedDatetimeIndex.name
iftz
was set (GH13884) - Bug in
Categorical.remove_unused_categories()
changes.codes
dtype to platform int (GH13261) - Bug in
groupby
withas_index=False
returns all NaN’s when grouping on multiple columns including a categorical one (GH13204) - Bug in
df.groupby(...)[...]
where getitem withInt64Index
raised an error (GH13731) - Bug in the CSS classes assigned to
DataFrame.style
for index names. Previously they were assigned"col_heading level<n> col<c>"
wheren
was the number of levels + 1. Now they are assigned"index_name level<n>"
, wheren
is the correct level for that MultiIndex. - Bug where
pd.read_gbq()
could throwImportError: No module named discovery
as a result of a naming conflict with another python package called apiclient (GH13454) - Bug in
Index.union
returns an incorrect result with a named empty index (GH13432) - Bugs in
Index.difference
andDataFrame.join
raise in Python3 when using mixed-integer indexes (GH13432, GH12814) - Bug in subtract tz-aware
datetime.datetime
from tz-awaredatetime64
series (GH14088) - Bug in
.to_excel()
when DataFrame contains a MultiIndex which contains a label with a NaN value (GH13511) - Bug in invalid frequency offset string like “D1”, “-2-3H” may not raise
ValueError
(GH13930) - Bug in
concat
andgroupby
for hierarchical frames withRangeIndex
levels (GH13542). - Bug in
Series.str.contains()
for Series containing onlyNaN
values ofobject
dtype (GH14171) - Bug in
agg()
function on groupby dataframe changes dtype ofdatetime64[ns]
column tofloat64
(GH12821) - Bug in using NumPy ufunc with
PeriodIndex
to add or subtract integer raiseIncompatibleFrequency
. Note that using standard operator like+
or-
is recommended, because standard operators use a more efficient path (GH13980) - Bug in operations on
NaT
returningfloat
instead ofdatetime64[ns]
(GH12941) - Bug in
Series
flexible arithmetic methods (like.add()
) raisesValueError
whenaxis=None
(GH13894) - Bug in
DataFrame.to_csv()
withMultiIndex
columns in which a stray empty line was added (GH6618) - Bug in
DatetimeIndex
,TimedeltaIndex
andPeriodIndex.equals()
may returnTrue
when input isn’tIndex
but contains the same values (GH13107) - Bug in assignment against datetime with timezone may not work if it contains datetime near DST boundary (GH14146)
- Bug in
pd.eval()
andHDFStore
query truncating long float literals with python 2 (GH14241) - Bug in
Index
raisesKeyError
displaying incorrect column when column is not in the df and columns contains duplicate values (GH13822) - Bug in
Period
andPeriodIndex
creating wrong dates when frequency has combined offset aliases (GH13874) - Bug in
.to_string()
when called with an integerline_width
andindex=False
raises an UnboundLocalError exception becauseidx
was referenced before assignment. - Bug in
eval()
where theresolvers
argument would not accept a list (GH14095) - Bugs in
stack
,get_dummies
,make_axis_dummies
which don’t preserve categorical dtypes in (multi)indexes (GH13854) PeriodIndex
can now acceptlist
andarray
which containspd.NaT
(GH13430)- Bug in
df.groupby
where.median()
returns arbitrary values if grouped dataframe contains empty bins (GH13629) - Bug in
Index.copy()
wherename
parameter was ignored (GH14302)
v0.18.1 (May 3, 2016)¶
This is a minor bug-fix release from 0.18.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.
Highlights include:
.groupby(...)
has been enhanced to provide convenient syntax when working with.rolling(..)
,.expanding(..)
and.resample(..)
per group, see herepd.to_datetime()
has gained the ability to assemble dates from aDataFrame
, see here- Method chaining improvements, see here.
- Custom business hour offset, see here.
- Many bug fixes in the handling of
sparse
, see here - Expanded the Tutorials section with a feature on modern pandas, courtesy of @TomAugsburger. (GH13045).
What’s new in v0.18.1
New features¶
Custom Business Hour¶
The CustomBusinessHour
is a mixture of BusinessHour
and CustomBusinessDay
which
allows you to specify arbitrary holidays. For details,
see Custom Business Hour (GH11514)
In [1]: from datetime import datetime; from pandas.tseries.offsets import CustomBusinessHour
In [2]: from pandas.tseries.holiday import USFederalHolidayCalendar
In [3]: bhour_us = CustomBusinessHour(calendar=USFederalHolidayCalendar())
Friday before MLK Day
In [4]: dt = datetime(2014, 1, 17, 15)
In [5]: dt + bhour_us
Out[5]: Timestamp('2014-01-17 16:00:00')
Tuesday after MLK Day (Monday is skipped because it’s a holiday)
In [6]: dt + bhour_us * 2
Out[6]: Timestamp('2014-01-20 09:00:00')
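As a further usage sketch (an assumption about usage, not something shown in the original notes), the same offset can be passed as a date_range frequency to generate business-hour stamps that skip the holiday:

# hedged sketch: the custom offset also works as a date_range frequency,
# stepping hour by business hour and skipping the MLK Day holiday above
pd.date_range('2014-01-17 12:00', periods=6, freq=bhour_us)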
.groupby(..)
syntax with window and resample operations¶
.groupby(...)
has been enhanced to provide convenient syntax when working with .rolling(..)
, .expanding(..)
and .resample(..)
per group, see (GH12486, GH12738).
You can now use .rolling(..)
and .expanding(..)
as methods on groupbys. These return another deferred object (similar to what .rolling()
and .expanding()
do on ungrouped pandas objects). You can then operate on these RollingGroupby
objects in a similar manner.
Previously you would have to do this to get a rolling window mean per-group:
In [7]: df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
...: 'B': np.arange(40)})
...:
In [8]: df
Out[8]:
A B
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
.. .. ..
33 3 33
34 3 34
35 3 35
36 3 36
37 3 37
38 3 38
39 3 39
[40 rows x 2 columns]
In [9]: df.groupby('A').apply(lambda x: x.rolling(4).B.mean())
Out[9]:
A
1 0 NaN
1 NaN
2 NaN
3 1.5
4 2.5
5 3.5
6 4.5
...
3 33 NaN
34 NaN
35 33.5
36 34.5
37 35.5
38 36.5
39 37.5
Name: B, dtype: float64
Now you can do:
In [10]: df.groupby('A').rolling(4).B.mean()
Out[10]:
A
1 0 NaN
1 NaN
2 NaN
3 1.5
4 2.5
5 3.5
6 4.5
...
3 33 NaN
34 NaN
35 33.5
36 34.5
37 35.5
38 36.5
39 37.5
Name: B, dtype: float64
For .resample(..)
type of operations, previously you would have to:
In [11]: df = pd.DataFrame({'date': pd.date_range(start='2016-01-01',
....: periods=4,
....: freq='W'),
....: 'group': [1, 1, 2, 2],
....: 'val': [5, 6, 7, 8]}).set_index('date')
....:
In [12]: df
Out[12]:
group val
date
2016-01-03 1 5
2016-01-10 1 6
2016-01-17 2 7
2016-01-24 2 8
In [13]: df.groupby('group').apply(lambda x: x.resample('1D').ffill())
Out[13]:
group val
group date
1 2016-01-03 1 5
2016-01-04 1 5
2016-01-05 1 5
2016-01-06 1 5
2016-01-07 1 5
2016-01-08 1 5
2016-01-09 1 5
... ... ...
2 2016-01-18 2 7
2016-01-19 2 7
2016-01-20 2 7
2016-01-21 2 7
2016-01-22 2 7
2016-01-23 2 7
2016-01-24 2 8
[16 rows x 2 columns]
Now you can do:
In [14]: df.groupby('group').resample('1D').ffill()
Out[14]:
group val
group date
1 2016-01-03 1 5
2016-01-04 1 5
2016-01-05 1 5
2016-01-06 1 5
2016-01-07 1 5
2016-01-08 1 5
2016-01-09 1 5
... ... ...
2 2016-01-18 2 7
2016-01-19 2 7
2016-01-20 2 7
2016-01-21 2 7
2016-01-22 2 7
2016-01-23 2 7
2016-01-24 2 8
[16 rows x 2 columns]
Method chaining improvements¶
The following methods / indexers now accept a callable
. It is intended to make
these more useful in method chains, see the documentation.
(GH11485, GH12533)
.where()
and.mask()
.loc[]
,iloc[]
and.ix[]
[]
indexing
.where()
and .mask()
¶
These can accept a callable for the condition and other
arguments.
In [15]: df = pd.DataFrame({'A': [1, 2, 3],
....: 'B': [4, 5, 6],
....: 'C': [7, 8, 9]})
....:
In [16]: df.where(lambda x: x > 4, lambda x: x + 10)
Out[16]:
A B C
0 11 14 7
1 12 5 8
2 13 6 9
.loc[]
, .iloc[]
, .ix[]
¶
These can accept a callable, or a tuple of callables, as a slicer. The callable can return a valid boolean indexer or anything else that is valid input for these indexers.
# callable returns bool indexer
In [17]: df.loc[lambda x: x.A >= 2, lambda x: x.sum() > 10]
Out[17]:
B C
1 5 8
2 6 9
# callable returns list of labels
In [18]: df.loc[lambda x: [1, 2], lambda x: ['A', 'B']]
Out[18]:
A B
1 2 5
2 3 6
[]
indexing¶
Finally, you can use a callable in []
indexing of Series, DataFrame and Panel.
The callable must return a valid input for []
indexing depending on its
class and index type.
In [19]: df[lambda x: 'A']
Out[19]:
0 1
1 2
2 3
Name: A, dtype: int64
Using these methods / indexers, you can chain data selection operations without using a temporary variable.
In [20]: bb = pd.read_csv('data/baseball.csv', index_col='id')
In [21]: (bb.groupby(['year', 'team'])
....: .sum()
....: .loc[lambda df: df.r > 100]
....: )
....:
Out[21]:
stint g ab r h X2b X3b hr rbi sb cs bb \
year team
2007 CIN 6 379 745 101 203 35 2 36 125.0 10.0 1.0 105
DET 5 301 1062 162 283 54 4 37 144.0 24.0 7.0 97
HOU 4 311 926 109 218 47 6 14 77.0 10.0 4.0 60
LAN 11 413 1021 153 293 61 3 36 154.0 7.0 5.0 114
NYN 13 622 1854 240 509 101 3 61 243.0 22.0 4.0 174
SFN 5 482 1305 198 337 67 6 40 171.0 26.0 7.0 235
TEX 2 198 729 115 200 40 4 28 115.0 21.0 4.0 73
TOR 4 459 1408 187 378 96 2 58 223.0 4.0 2.0 190
so ibb hbp sh sf gidp
year team
2007 CIN 127.0 14.0 1.0 1.0 15.0 18.0
DET 176.0 3.0 10.0 4.0 8.0 28.0
HOU 212.0 3.0 9.0 16.0 6.0 17.0
LAN 141.0 8.0 9.0 3.0 8.0 29.0
NYN 310.0 24.0 23.0 18.0 15.0 48.0
SFN 188.0 51.0 8.0 16.0 6.0 41.0
TEX 140.0 4.0 5.0 2.0 8.0 16.0
TOR 265.0 16.0 12.0 4.0 16.0 38.0
Partial string indexing on DateTimeIndex
when part of a MultiIndex
¶
Partial string indexing now matches on DateTimeIndex
when part of a MultiIndex
(GH10331)
In [22]: dft2 = pd.DataFrame(np.random.randn(20, 1),
....: columns=['A'],
....: index=pd.MultiIndex.from_product([pd.date_range('20130101',
....: periods=10,
....: freq='12H'),
....: ['a', 'b']]))
....:
In [23]: dft2
Out[23]:
A
2013-01-01 00:00:00 a 1.474071
b -0.064034
2013-01-01 12:00:00 a -1.282782
b 0.781836
2013-01-02 00:00:00 a -1.071357
b 0.441153
2013-01-02 12:00:00 a 2.353925
... ...
2013-01-04 00:00:00 b -0.845696
2013-01-04 12:00:00 a -1.340896
b 1.846883
2013-01-05 00:00:00 a -1.328865
b 1.682706
2013-01-05 12:00:00 a -1.717693
b 0.888782
[20 rows x 1 columns]
In [24]: dft2.loc['2013-01-05']
Out[24]:
A
2013-01-05 00:00:00 a -1.328865
b 1.682706
2013-01-05 12:00:00 a -1.717693
b 0.888782
On other levels
In [25]: idx = pd.IndexSlice
In [26]: dft2 = dft2.swaplevel(0, 1).sort_index()
In [27]: dft2
Out[27]:
A
a 2013-01-01 00:00:00 1.474071
2013-01-01 12:00:00 -1.282782
2013-01-02 00:00:00 -1.071357
2013-01-02 12:00:00 2.353925
2013-01-03 00:00:00 0.221471
2013-01-03 12:00:00 0.758527
2013-01-04 00:00:00 -0.964980
... ...
b 2013-01-02 12:00:00 0.583787
2013-01-03 00:00:00 -0.744471
2013-01-03 12:00:00 1.729689
2013-01-04 00:00:00 -0.845696
2013-01-04 12:00:00 1.846883
2013-01-05 00:00:00 1.682706
2013-01-05 12:00:00 0.888782
[20 rows x 1 columns]
In [28]: dft2.loc[idx[:, '2013-01-05'], :]
Out[28]:
A
a 2013-01-05 00:00:00 -1.328865
2013-01-05 12:00:00 -1.717693
b 2013-01-05 00:00:00 1.682706
2013-01-05 12:00:00 0.888782
Assembling Datetimes¶
pd.to_datetime()
has gained the ability to assemble datetimes from a passed in DataFrame
or a dict. (GH8158).
In [29]: df = pd.DataFrame({'year': [2015, 2016],
....: 'month': [2, 3],
....: 'day': [4, 5],
....: 'hour': [2, 3]})
....:
In [30]: df
Out[30]:
day hour month year
0 4 2 2 2015
1 5 3 3 2016
Assembling using the passed frame.
In [31]: pd.to_datetime(df)
Out[31]:
0 2015-02-04 02:00:00
1 2016-03-05 03:00:00
dtype: datetime64[ns]
You can pass only the columns that you need to assemble.
In [32]: pd.to_datetime(df[['year', 'month', 'day']])
Out[32]:
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
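The assembly also works from a plain dict with the same column keys, as mentioned above; a minimal sketch (values made up):

# hedged sketch: assembling datetimes from a dict of equal-length lists
pd.to_datetime({'year': [2015, 2016], 'month': [2, 3], 'day': [4, 5]})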
Other Enhancements¶
pd.read_csv()
now supportsdelim_whitespace=True
for the Python engine (GH12958)pd.read_csv()
now supports opening ZIP files that contain a single CSV, via extension inference or explicit compression='zip'
(GH12175)pd.read_csv()
now supports opening files using xz compression, via extension inference or when compression='xz'
is explicitly specified; xz
compression is also supported by DataFrame.to_csv
in the same way (GH11852)pd.read_msgpack()
now always gives writeable ndarrays even when compression is used (GH12359).pd.read_msgpack()
now supports serializing and de-serializing categoricals with msgpack (GH12573).to_json()
now supportsNDFrames
that contain categorical and sparse data (GH10778)interpolate()
now supportsmethod='akima'
(GH7588).pd.read_excel()
now accepts path objects (e.g.pathlib.Path
,py.path.local
) for the file path, in line with otherread_*
functions (GH12655)Added
.weekday_name
property as a component toDatetimeIndex
and the.dt
accessor. (GH11128)Index.take
now handlesallow_fill
andfill_value
consistently (GH12631)
In [33]: idx = pd.Index([1., 2., 3., 4.], dtype='float')

# default, allow_fill=True, fill_value=None
In [34]: idx.take([2, -1])
Out[34]: Float64Index([3.0, 4.0], dtype='float64')

In [35]: idx.take([2, -1], fill_value=True)
Out[35]: Float64Index([3.0, nan], dtype='float64')
Index
now supports.str.get_dummies()
which returnsMultiIndex
, see Creating Indicator Variables (GH10008, GH10103)
In [36]: idx = pd.Index(['a|b', 'a|c', 'b|c'])

In [37]: idx.str.get_dummies('|')
Out[37]:
MultiIndex(levels=[[0, 1], [0, 1], [0, 1]],
           labels=[[1, 1, 0], [1, 0, 1], [0, 1, 1]],
           names=[u'a', u'b', u'c'])
pd.crosstab()
has gained a normalize
argument for normalizing frequency tables (GH12569). Examples in the updated docs here. .resample(..).interpolate()
is now supported (GH12925). .isin()
now accepts passed sets
(GH12988); see the sketch after this list
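A quick, hedged sketch of the last two items above (normalize in pd.crosstab and set input to .isin); the data is made up:

import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': ['u', 'v', 'u']})
pd.crosstab(df.a, df.b, normalize=True)   # cell frequencies sum to 1.0

s = pd.Series([1, 2, 3])
s.isin({1, 3})                            # a set is now accepted directly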
Sparse changes¶
These changes conform sparse handling to return the correct types and work to make a smoother experience with indexing.
SparseArray.take
now returns a scalar for scalar input, SparseArray
for others. Furthermore, it handles a negative indexer with the same rule as Index
(GH10560, GH12796)
In [38]: s = pd.SparseArray([np.nan, np.nan, 1, 2, 3, np.nan, 4, 5, np.nan, 6])
In [39]: s.take(0)
Out[39]: nan
In [40]: s.take([1, 2, 3])
Out[40]:
[nan, 1.0, 2.0]
Fill: nan
IntIndex
Indices: array([1, 2], dtype=int32)
- Bug in
SparseSeries[]
indexing withEllipsis
raisesKeyError
(GH9467) - Bug in
SparseArray[]
indexing with tuples are not handled properly (GH12966) - Bug in
SparseSeries.loc[]
with list-like input raisesTypeError
(GH10560) - Bug in
SparseSeries.iloc[]
with scalar input may raiseIndexError
(GH10560) - Bug in
SparseSeries.loc[]
,.iloc[]
withslice
returnsSparseArray
, rather thanSparseSeries
(GH10560) - Bug in
SparseDataFrame.loc[]
,.iloc[]
may results in denseSeries
, rather thanSparseSeries
(GH12787) - Bug in
SparseArray
addition ignoresfill_value
of right hand side (GH12910) - Bug in
SparseArray
mod raisesAttributeError
(GH12910) - Bug in
SparseArray
pow calculates1 ** np.nan
asnp.nan
which should be 1 (GH12910) - Bug in
SparseArray
comparison may output an incorrect result or raise ValueError
(GH12971) - Bug in
SparseSeries.__repr__
raisesTypeError
when it is longer thanmax_rows
(GH10560) - Bug in
SparseSeries.shape
ignoresfill_value
(GH10452) - Bug in
SparseSeries
andSparseArray
may have differentdtype
from its dense values (GH12908) - Bug in
SparseSeries.reindex
incorrectly handles fill_value
(GH12797) - Bug in
SparseArray.to_frame()
results inDataFrame
, rather thanSparseDataFrame
(GH9850) - Bug in
SparseSeries.value_counts()
does not countfill_value
(GH6749) - Bug in
SparseArray.to_dense()
does not preservedtype
(GH10648) - Bug in
SparseArray.to_dense()
incorrectly handles fill_value
(GH12797) - Bug in
pd.concat()
ofSparseSeries
results in dense (GH10536) - Bug in
pd.concat()
ofSparseDataFrame
incorrectly handles fill_value
(GH9765) - Bug in
pd.concat()
ofSparseDataFrame
may raiseAttributeError
(GH12174) - Bug in
SparseArray.shift()
may raiseNameError
orTypeError
(GH12908)
API changes¶
.groupby(..).nth()
changes¶
The index in .groupby(..).nth()
output is now more consistent when the as_index
argument is passed (GH11039):
In [41]: df = pd.DataFrame({'A' : ['a', 'b', 'a'],
....: 'B' : [1, 2, 3]})
....:
In [42]: df
Out[42]:
A B
0 a 1
1 b 2
2 a 3
Previous Behavior:
In [3]: df.groupby('A', as_index=True)['B'].nth(0)
Out[3]:
0 1
1 2
Name: B, dtype: int64
In [4]: df.groupby('A', as_index=False)['B'].nth(0)
Out[4]:
0 1
1 2
Name: B, dtype: int64
New Behavior:
In [43]: df.groupby('A', as_index=True)['B'].nth(0)
Out[43]:
A
a 1
b 2
Name: B, dtype: int64
In [44]: df.groupby('A', as_index=False)['B'].nth(0)
Out[44]:
0 1
1 2
Name: B, dtype: int64
Furthermore, previously, a .groupby
would always sort, regardless if sort=False
was passed with .nth()
.
In [45]: np.random.seed(1234)
In [46]: df = pd.DataFrame(np.random.randn(100, 2), columns=['a', 'b'])
In [47]: df['c'] = np.random.randint(0, 4, 100)
Previous Behavior:
In [4]: df.groupby('c', sort=True).nth(1)
Out[4]:
a b
c
0 -0.334077 0.002118
1 0.036142 -2.074978
2 -0.720589 0.887163
3 0.859588 -0.636524
In [5]: df.groupby('c', sort=False).nth(1)
Out[5]:
a b
c
0 -0.334077 0.002118
1 0.036142 -2.074978
2 -0.720589 0.887163
3 0.859588 -0.636524
New Behavior:
In [48]: df.groupby('c', sort=True).nth(1)
Out[48]:
a b
c
0 -0.334077 0.002118
1 0.036142 -2.074978
2 -0.720589 0.887163
3 0.859588 -0.636524
In [49]: df.groupby('c', sort=False).nth(1)
Out[49]:
a b
c
2 -0.720589 0.887163
3 0.859588 -0.636524
0 -0.334077 0.002118
1 0.036142 -2.074978
numpy function compatibility¶
Compatibility between pandas array-like methods (e.g. sum
and take
) and their numpy
counterparts has been greatly increased by augmenting the signatures of the pandas
methods so
as to accept arguments that can be passed in from numpy
, even if they are not necessarily
used in the pandas
implementation (GH12644, GH12638, GH12687)
.searchsorted()
forIndex
andTimedeltaIndex
now accept asorter
argument to maintain compatibility with numpy’ssearchsorted
function (GH12238)- Bug in numpy compatibility of
np.round()
on aSeries
(GH12600)
An example of this signature augmentation is illustrated below:
In [50]: sp = pd.SparseDataFrame([1, 2, 3])
In [51]: sp
Out[51]:
0
0 1
1 2
2 3
Previous behaviour:
In [2]: np.cumsum(sp, axis=0)
...
TypeError: cumsum() takes at most 2 arguments (4 given)
New behaviour:
In [52]: np.cumsum(sp, axis=0)
Out[52]:
0
0 1
1 3
2 6
Using .apply
on groupby resampling¶
Using apply
on resampling groupby operations (using a pd.TimeGrouper
) now has the same output types as similar apply
calls on other groupby operations. (GH11742).
In [53]: df = pd.DataFrame({'date': pd.to_datetime(['10/10/2000', '11/10/2000']),
....: 'value': [10, 13]})
....:
In [54]: df
Out[54]:
date value
0 2000-10-10 10
1 2000-11-10 13
Previous behavior:
In [1]: df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x.value.sum())
Out[1]:
...
TypeError: cannot concatenate a non-NDFrame object
# Output is a Series
In [2]: df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x[['value']].sum())
Out[2]:
date
2000-10-31 value 10
2000-11-30 value 13
dtype: int64
New Behavior:
# Output is a Series
In [55]: df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x.value.sum())
Out[55]:
date
2000-10-31 10
2000-11-30 13
Freq: M, dtype: int64
# Output is a DataFrame
In [56]: df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x[['value']].sum())
Out[56]:
value
date
2000-10-31 10
2000-11-30 13
Changes in read_csv
exceptions¶
In order to standardize the read_csv
API for both the c
and python
engines, both will now raise an
EmptyDataError
, a subclass of ValueError
, in response to empty columns or header (GH12493, GH12506); a short handling sketch follows the list below
Previous behaviour:
In [1]: df = pd.read_csv(StringIO(''), engine='c')
...
ValueError: No columns to parse from file
In [2]: df = pd.read_csv(StringIO(''), engine='python')
...
StopIteration
New behaviour:
In [1]: df = pd.read_csv(StringIO(''), engine='c')
...
pandas.io.common.EmptyDataError: No columns to parse from file
In [2]: df = pd.read_csv(StringIO(''), engine='python')
...
pandas.io.common.EmptyDataError: No columns to parse from file
In addition to this error change, several others have been made as well:
CParserError
now sub-classesValueError
instead of just aException
(GH12551)- A
CParserError
is now raised instead of a genericException
inread_csv
when thec
engine cannot parse a column (GH12506) - A
ValueError
is now raised instead of a genericException
inread_csv
when thec
engine encounters aNaN
value in an integer column (GH12506) - A
ValueError
is now raised instead of a genericException
inread_csv
whentrue_values
is specified, and thec
engine encounters an element in a column containing unencodable bytes (GH12506) pandas.parser.OverflowError
exception has been removed and has been replaced with Python’s built-inOverflowError
exception (GH12506)pd.read_csv()
no longer allows a combination of strings and integers for theusecols
parameter (GH12678)
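Because EmptyDataError subclasses ValueError, existing handlers keep working unchanged; a minimal, hedged sketch (in this release the class lives in pandas.io.common, and later versions expose it as pandas.errors.EmptyDataError):

import pandas as pd
from io import StringIO

try:
    pd.read_csv(StringIO(''))
except ValueError as err:           # also catches EmptyDataError
    print(type(err).__name__, err)  # EmptyDataError No columns to parse from file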
to_datetime
error changes¶
Bugs in pd.to_datetime()
when passing a unit
with convertible entries and errors='coerce'
or with non-convertible entries and errors='ignore'
have been fixed. Furthermore, an OutOfBoundsDatetime
exception will be raised when an out-of-range value is encountered for that unit when errors='raise'
. (GH11758, GH13052, GH13059)
Previous behaviour:
In [27]: pd.to_datetime(1420043460, unit='s', errors='coerce')
Out[27]: NaT
In [28]: pd.to_datetime(11111111, unit='D', errors='ignore')
OverflowError: Python int too large to convert to C long
In [29]: pd.to_datetime(11111111, unit='D', errors='raise')
OverflowError: Python int too large to convert to C long
New behaviour:
In [2]: pd.to_datetime(1420043460, unit='s', errors='coerce')
Out[2]: Timestamp('2014-12-31 16:31:00')
In [3]: pd.to_datetime(11111111, unit='D', errors='ignore')
Out[3]: 11111111
In [4]: pd.to_datetime(11111111, unit='D', errors='raise')
OutOfBoundsDatetime: cannot convert input with unit 'D'
Other API changes¶
.swaplevel()
forSeries
,DataFrame
,Panel
, andMultiIndex
now features defaults for its first two parametersi
andj
that swap the two innermost levels of the index. (GH12934).searchsorted()
forIndex
andTimedeltaIndex
now accept asorter
argument to maintain compatibility with numpy’ssearchsorted
function (GH12238)Period
andPeriodIndex
now raisesIncompatibleFrequency
error which inheritsValueError
rather than rawValueError
(GH12615)Series.apply
for category dtype now applies the passed function to each of the.categories
(and not the.codes
), and returns acategory
dtype if possible (GH12473); see the sketch after this list
will now raise aTypeError
ifparse_dates
is neither a boolean, list, or dictionary (matches the doc-string) (GH5636)- The default for
.query()/.eval()
is nowengine=None
, which will usenumexpr
if it’s installed; otherwise it will fallback to thepython
engine. This mimics the pre-0.18.1 behavior ifnumexpr
is installed (and which, previously, if numexpr was not installed,.query()/.eval()
would raise). (GH12749) pd.show_versions()
now includespandas_datareader
version (GH12740)- Provide a proper
__name__
and__qualname__
attributes for generic functions (GH12021) pd.concat(ignore_index=True)
now usesRangeIndex
as default (GH12695)pd.merge()
andDataFrame.join()
will show aUserWarning
when merging/joining a single- with a multi-leveled dataframe (GH9455, GH12219)- Compat with
scipy
> 0.17 for deprecatedpiecewise_polynomial
interpolation method; support for the replacementfrom_derivatives
method (GH12887)
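As a hedged sketch of the category-dtype Series.apply change above (values are illustrative): the function is applied once per category rather than per code, and the categorical dtype is kept when possible:

import pandas as pd

s = pd.Series(['a', 'b', 'a']).astype('category')
s.apply(lambda x: x.upper())   # applied to the categories; result stays category dtype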
Performance Improvements¶
- Improved speed of SAS reader (GH12656, GH12961)
- Performance improvements in
.groupby(..).cumcount()
(GH11039) - Improved memory usage in
pd.read_csv()
when usingskiprows=an_integer
(GH13005) - Improved performance of
DataFrame.to_sql
when checking case sensitivity for tables. Now only checks if table has been created correctly when table name is not lower case. (GH12876) - Improved performance of
Period
construction and time series plotting (GH12903, GH11831). - Improved performance of
.str.encode()
and.str.decode()
methods (GH13008) - Improved performance of
to_numeric
if input is numeric dtype (GH12777) - Improved performance of sparse arithmetic with
IntIndex
(GH13036)
Bug Fixes¶
usecols
parameter inpd.read_csv
is now respected even when the lines of a CSV file are not even (GH12203)- Bug in
groupby.transform(..)
whenaxis=1
is specified with a non-monotonic ordered index (GH12713) - Bug in
Period
andPeriodIndex
creation raisesKeyError
iffreq="Minute"
is specified. Note that “Minute” freq is deprecated in v0.17.0, and recommended to usefreq="T"
instead (GH11854) - Bug in
.resample(...).count()
with aPeriodIndex
always raising aTypeError
(GH12774) - Bug in
.resample(...)
with aPeriodIndex
casting to aDatetimeIndex
when empty (GH12868) - Bug in
.resample(...)
with aPeriodIndex
when resampling to an existing frequency (GH12770) - Bug in printing data which contains
Period
with differentfreq
raisesValueError
(GH12615) - Bug in
Series
construction withCategorical
anddtype='category'
is specified (GH12574) - Bugs in concatenation with a coercable dtype was too aggressive, resulting in different dtypes in outputformatting when an object was longer than
display.max_rows
(GH12411, GH12045, GH11594, GH10571, GH12211) - Bug in
float_format
option with option not being validated as a callable. (GH12706) - Bug in
GroupBy.filter
whendropna=False
and no groups fulfilled the criteria (GH12768) - Bug in
__name__
of.cum*
functions (GH12021) - Bug in
.astype()
of a Float64Index/Int64Index
to anInt64Index
(GH12881) - Bug in roundtripping an integer based index in
.to_json()/.read_json()
whenorient='index'
(the default) (GH12866) - Bug in plotting
Categorical
dtypes cause error when attempting stacked bar plot (GH13019) - Compat with >=
numpy
1.11 forNaT
comparisons (GH12969) - Bug in
.drop()
with a non-uniqueMultiIndex
. (GH12701) - Bug in
.concat
of datetime tz-aware and naive DataFrames (GH12467) - Bug in correctly raising a
ValueError
in.resample(..).fillna(..)
when passing a non-string (GH12952) - Bug fixes in various encoding and header processing issues in
pd.read_sas()
(GH12659, GH12654, GH12647, GH12809) - Bug in
pd.crosstab()
which would silently ignore aggfunc
ifvalues=None
(GH12569). - Potential segfault in
DataFrame.to_json
when serialisingdatetime.time
(GH11473). - Potential segfault in
DataFrame.to_json
when attempting to serialise 0d array (GH11299). - Segfault in
to_json
when attempting to serialise aDataFrame
orSeries
with non-ndarray values; now supports serialization ofcategory
,sparse
, anddatetime64[ns, tz]
dtypes (GH10778). - Bug in
DataFrame.to_json
with unsupported dtype not passed to default handler (GH12554). - Bug in
.align
not returning the sub-class (GH12983) - Bug in aligning a
Series
with aDataFrame
(GH13037) - Bug in
ABCPanel
in whichPanel4D
was not being considered as a valid instance of this generic type (GH12810) - Bug in consistency of
.name
on.groupby(..).apply(..)
cases (GH12363) - Bug in
Timestamp.__repr__
that causedpprint
to fail in nested structures (GH12622) - Bug in
Timedelta.min
andTimedelta.max
, the properties now report the true minimum/maximumtimedeltas
as recognized by pandas. See the documentation. (GH12727) - Bug in
.quantile()
with interpolation may coerce tofloat
unexpectedly (GH12772) - Bug in
.quantile()
with emptySeries
may return scalar rather than emptySeries
(GH12772) - Bug in
.loc
with out-of-bounds in a large indexer would raiseIndexError
rather thanKeyError
(GH12527) - Bug in resampling when using a
TimedeltaIndex
and.asfreq()
, would previously not include the final fencepost (GH12926) - Bug in equality testing with a
Categorical
in aDataFrame
(GH12564) - Bug in
GroupBy.first()
,.last()
returns incorrect row whenTimeGrouper
is used (GH7453) - Bug in
pd.read_csv()
with thec
engine when specifyingskiprows
with newlines in quoted items (GH10911, GH12775) - Bug in
DataFrame
timezone lost when assigning tz-aware datetimeSeries
with alignment (GH12981) - Bug in
.value_counts()
whennormalize=True
anddropna=True
where nulls still contributed to the normalized count (GH12558) - Bug in
Series.value_counts()
loses name if its dtype iscategory
(GH12835) - Bug in
Series.value_counts()
loses timezone info (GH12835) - Bug in
Series.value_counts(normalize=True)
withCategorical
raisesUnboundLocalError
(GH12835) - Bug in
Panel.fillna()
ignoringinplace=True
(GH12633) - Bug in
pd.read_csv()
when specifyingnames
,usecols
, andparse_dates
simultaneously with thec
engine (GH9755) - Bug in
pd.read_csv()
when specifyingdelim_whitespace=True
andlineterminator
simultaneously with thec
engine (GH12912) - Bug in
Series.rename
,DataFrame.rename
andDataFrame.rename_axis
not treatingSeries
as mappings to relabel (GH12623). - Cleanup in
.rolling.min
and.rolling.max
to enhance dtype handling (GH12373) - Bug in
groupby
where complex types are coerced to float (GH12902) - Bug in
Series.map
raisesTypeError
if its dtype iscategory
or tz-awaredatetime
(GH12473) - Bugs on 32bit platforms for some test comparisons (GH12972)
- Bug in index coercion when falling back from
RangeIndex
construction (GH12893) - Better error message in window functions when invalid argument (e.g. a float window) is passed (GH12669)
- Bug in slicing subclassed
DataFrame
defined to return subclassedSeries
may return normalSeries
(GH11559) - Bug in
.str
accessor methods may raiseValueError
if input hasname
and the result isDataFrame
orMultiIndex
(GH12617) - Bug in
DataFrame.last_valid_index()
andDataFrame.first_valid_index()
on empty frames (GH12800) - Bug in
CategoricalIndex.get_loc
returns different result from regularIndex
(GH12531) - Bug in
PeriodIndex.resample
where name not propagated (GH12769) - Bug in
date_range
closed
keyword and timezones (GH12684). - Bug in
pd.concat
raisesAttributeError
when input data contains tz-aware datetime and timedelta (GH12620) - Bug in
pd.concat
did not handle emptySeries
properly (GH11082) - Bug in
.plot.bar
alignment when width
is specified withint
(GH12979) - Bug in
fill_value
is ignored if the argument to a binary operator is a constant (GH12723) - Bug in
pd.read_html()
when using bs4 flavor and parsing table with a header and only one column (GH9178) - Bug in
.pivot_table
whenmargins=True
anddropna=True
where nulls still contributed to margin count (GH12577) - Bug in
.pivot_table
whendropna=False
where table index/column names disappear (GH12133) - Bug in
pd.crosstab()
whenmargins=True
anddropna=False
which raised (GH12642) - Bug in
Series.name
whenname
attribute can be a hashable type (GH12610) - Bug in
.describe()
resets categorical columns information (GH11558) - Bug where
loffset
argument was not applied when callingresample().count()
on a timeseries (GH12725) pd.read_excel()
now accepts column names associated with keyword argumentnames
(GH12870)- Bug in
pd.to_numeric()
withIndex
returnsnp.ndarray
, rather thanIndex
(GH12777) - Bug in
pd.to_numeric()
with datetime-like may raiseTypeError
(GH12777) - Bug in
pd.to_numeric()
with scalar raisesValueError
(GH12777)
v0.18.0 (March 13, 2016)¶
This is a major release from 0.17.1 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
Warning
pandas >= 0.18.0 no longer supports compatibility with Python version 2.6 and 3.3 (GH7718, GH11273)
Warning
numexpr
version 2.4.4 will now show a warning and not be used as a computation back-end for pandas because of some buggy behavior. This does not affect other versions (>= 2.1 and >= 2.4.6). (GH12489)
Highlights include:
- Moving and expanding window functions are now methods on Series and DataFrame,
similar to
.groupby
, see here. - Adding support for a
RangeIndex
as a specialized form of theInt64Index
for memory savings, see here. - API breaking change to the
.resample
method to make it more.groupby
like, see here. - Removal of support for positional indexing with floats, which was deprecated
since 0.14.0. This will now raise a
TypeError
, see here. - The
.to_xarray()
function has been added for compatibility with the xarray package, see here. - The
read_sas
function has been enhanced to readsas7bdat
files, see here. - Addition of the .str.extractall() method, and API changes to the .str.extract() method and .str.cat() method.
pd.test()
top-level nose test runner is available (GH4327).
Check the API Changes and deprecations before updating.
What’s new in v0.18.0
- New features
- Window functions are now methods
- Changes to rename
- Range Index
- Changes to str.extract
- Addition of str.extractall
- Changes to str.cat
- Datetimelike rounding
- Formatting of Integers in FloatIndex
- Changes to dtype assignment behaviors
- to_xarray
- Latex Representation
pd.read_sas()
changes- Other enhancements
- Backwards incompatible API changes
- Performance Improvements
- Bug Fixes
New features¶
Window functions are now methods¶
Window functions have been refactored to be methods on Series/DataFrame
objects, rather than top-level functions, which are now deprecated. This allows these window-type functions, to have a similar API to that of .groupby
. See the full documentation here (GH11603, GH12373)
In [1]: np.random.seed(1234)
In [2]: df = pd.DataFrame({'A' : range(10), 'B' : np.random.randn(10)})
In [3]: df
Out[3]:
A B
0 0 0.471435
1 1 -1.190976
2 2 1.432707
3 3 -0.312652
4 4 -0.720589
5 5 0.887163
6 6 0.859588
7 7 -0.636524
8 8 0.015696
9 9 -2.242685
Previous Behavior:
In [8]: pd.rolling_mean(df,window=3)
FutureWarning: pd.rolling_mean is deprecated for DataFrame and will be removed in a future version, replace with
DataFrame.rolling(window=3,center=False).mean()
Out[8]:
A B
0 NaN NaN
1 NaN NaN
2 1 0.237722
3 2 -0.023640
4 3 0.133155
5 4 -0.048693
6 5 0.342054
7 6 0.370076
8 7 0.079587
9 8 -0.954504
New Behavior:
In [4]: r = df.rolling(window=3)
These show a descriptive repr
In [5]: r
Out[5]: Rolling [window=3,center=False,axis=0]
with tab-completion of available methods and properties.
In [9]: r.
r.A r.agg r.apply r.count r.exclusions r.max r.median r.name r.skew r.sum
r.B r.aggregate r.corr r.cov r.kurt r.mean r.min r.quantile r.std r.var
The methods operate on the Rolling
object itself
In [6]: r.mean()
Out[6]:
A B
0 NaN NaN
1 NaN NaN
2 1.0 0.237722
3 2.0 -0.023640
4 3.0 0.133155
5 4.0 -0.048693
6 5.0 0.342054
7 6.0 0.370076
8 7.0 0.079587
9 8.0 -0.954504
They provide getitem accessors
In [7]: r['A'].mean()
Out[7]:
0 NaN
1 NaN
2 1.0
3 2.0
4 3.0
5 4.0
6 5.0
7 6.0
8 7.0
9 8.0
Name: A, dtype: float64
And multiple aggregations
In [8]: r.agg({'A' : ['mean','std'],
...: 'B' : ['mean','std']})
...:
Out[8]:
A B
mean std mean std
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 1.0 1.0 0.237722 1.327364
3 2.0 1.0 -0.023640 1.335505
4 3.0 1.0 0.133155 1.143778
5 4.0 1.0 -0.048693 0.835747
6 5.0 1.0 0.342054 0.920379
7 6.0 1.0 0.370076 0.871850
8 7.0 1.0 0.079587 0.750099
9 8.0 1.0 -0.954504 1.162285
Changes to rename¶
Series.rename
and NDFrame.rename_axis
can now take a scalar or list-like
argument for altering the Series or axis name, in addition to their old behaviors of altering labels. (GH9494, GH11965)
In [9]: s = pd.Series(np.random.randn(5))
In [10]: s.rename('newname')
Out[10]:
0 1.150036
1 0.991946
2 0.953324
3 -2.021255
4 -0.334077
Name: newname, dtype: float64
In [11]: df = pd.DataFrame(np.random.randn(5, 2))
In [12]: (df.rename_axis("indexname")
....: .rename_axis("columns_name", axis="columns"))
....:
Out[12]:
columns_name 0 1
indexname
0 0.002118 0.405453
1 0.289092 1.321158
2 -1.546906 -0.202646
3 -0.655969 0.193421
4 0.553439 1.318152
The new functionality works well in method chains. Previously these methods only accepted functions or dicts mapping a label to a new label. This continues to work as before for function or dict-like values.
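A small, hedged sketch of the scalar form inside a method chain (names are illustrative):

import numpy as np
import pandas as pd

(pd.Series(np.random.randn(5))
   .abs()
   .rename('magnitude')      # scalar argument sets the Series name
   .to_frame())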
Range Index¶
A RangeIndex
has been added to the Int64Index
sub-classes to support a memory saving alternative for common use cases. This has a similar implementation to the python range
object (xrange
in python 2), in that it only stores the start, stop, and step values for the index. It will transparently interact with the user API, converting to Int64Index
if needed.
This will now be the default constructed index for NDFrame
objects, rather than the previous Int64Index
. (GH939, GH12070, GH12071, GH12109, GH12888)
Previous Behavior:
In [3]: s = pd.Series(range(1000))
In [4]: s.index
Out[4]:
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
990, 991, 992, 993, 994, 995, 996, 997, 998, 999], dtype='int64', length=1000)
In [6]: s.index.nbytes
Out[6]: 8000
New Behavior:
In [13]: s = pd.Series(range(1000))
In [14]: s.index
Out[14]: RangeIndex(start=0, stop=1000, step=1)
In [15]: s.index.nbytes
Out[15]: 72
Changes to str.extract¶
The .str.extract method takes a regular expression with capture groups, finds the first match in each subject string, and returns the contents of the capture groups (GH11386).
In v0.18.0, the expand
argument was added to
extract
.
expand=False
: it returns aSeries
,Index
, orDataFrame
, depending on the subject and regular expression pattern (same behavior as pre-0.18.0).expand=True
: it always returns aDataFrame
, which is more consistent and less confusing from the perspective of a user.
Currently the default is expand=None
which gives a FutureWarning
and uses expand=False
. To avoid this warning, please explicitly specify expand
.
In [1]: pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=None)
FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame)
but in a future version of pandas this will be changed to expand=True (return DataFrame)
Out[1]:
0 1
1 2
2 NaN
dtype: object
Extracting a regular expression with one group returns a Series if
expand=False
.
In [16]: pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)
Out[16]:
0 1
1 2
2 NaN
dtype: object
It returns a DataFrame
with one column if expand=True
.
In [17]: pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=True)
Out[17]:
0
0 1
1 2
2 NaN
Calling on an Index
with a regex with exactly one capture group
returns an Index
if expand=False
.
In [18]: s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])
In [19]: s.index
Out[19]: Index([u'A11', u'B22', u'C33'], dtype='object')
In [20]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
Out[20]: Index([u'A', u'B', u'C'], dtype='object', name=u'letter')
It returns a DataFrame
with one column if expand=True
.
In [21]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
Out[21]:
letter
0 A
1 B
2 C
Calling on an Index
with a regex with more than one capture group
raises ValueError
if expand=False
.
>>> s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
ValueError: only one regex group is supported with Index
It returns a DataFrame
if expand=True
.
In [22]: s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
Out[22]:
letter 1
0 A 11
1 B 22
2 C 33
In summary, extract(expand=True)
always returns a DataFrame
with a row for every subject string, and a column for every capture
group.
Addition of str.extractall¶
The .str.extractall method was added
(GH11386). Unlike extract
, which returns only the first
match:
In [23]: s = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
In [24]: s
Out[24]:
A a1a2
B b1
C c1
dtype: object
In [25]: s.str.extract("(?P<letter>[ab])(?P<digit>\d)", expand=False)
Out[25]:
letter digit
A a 1
B b 1
C NaN NaN
The extractall
method returns all matches.
In [26]: s.str.extractall("(?P<letter>[ab])(?P<digit>\d)")
Out[26]:
letter digit
match
A 0 a 1
1 a 2
B 0 b 1
Changes to str.cat¶
The method .str.cat()
concatenates the members of a Series
. Before, if NaN
values were present in the Series, calling .str.cat()
on it would return NaN
, unlike the rest of the Series.str.*
API. This behavior has been amended to ignore NaN
values by default. (GH11435).
A new, friendlier ValueError
is added to protect against the mistake of supplying the sep
as an arg, rather than as a kwarg. (GH11334).
In [27]: pd.Series(['a','b',np.nan,'c']).str.cat(sep=' ')
Out[27]: 'a b c'
In [28]: pd.Series(['a','b',np.nan,'c']).str.cat(sep=' ', na_rep='?')
Out[28]: 'a b ? c'
In [2]: pd.Series(['a','b',np.nan,'c']).str.cat(' ')
ValueError: Did you mean to supply a `sep` keyword?
Datetimelike rounding¶
DatetimeIndex
, Timestamp
, TimedeltaIndex
, Timedelta
have gained the .round()
, .floor()
and .ceil()
method for datetimelike rounding, flooring and ceiling. (GH4314, GH11963)
Naive datetimes
In [29]: dr = pd.date_range('20130101 09:12:56.1234', periods=3)
In [30]: dr
Out[30]:
DatetimeIndex(['2013-01-01 09:12:56.123400', '2013-01-02 09:12:56.123400',
'2013-01-03 09:12:56.123400'],
dtype='datetime64[ns]', freq='D')
In [31]: dr.round('s')
Out[31]:
DatetimeIndex(['2013-01-01 09:12:56', '2013-01-02 09:12:56',
'2013-01-03 09:12:56'],
dtype='datetime64[ns]', freq=None)
# Timestamp scalar
In [32]: dr[0]
Out[32]: Timestamp('2013-01-01 09:12:56.123400', freq='D')
In [33]: dr[0].round('10s')
Out[33]: Timestamp('2013-01-01 09:13:00')
Tz-aware are rounded, floored and ceiled in local times
In [34]: dr = dr.tz_localize('US/Eastern')
In [35]: dr
Out[35]:
DatetimeIndex(['2013-01-01 09:12:56.123400-05:00',
'2013-01-02 09:12:56.123400-05:00',
'2013-01-03 09:12:56.123400-05:00'],
dtype='datetime64[ns, US/Eastern]', freq='D')
In [36]: dr.round('s')
Out[36]:
DatetimeIndex(['2013-01-01 09:12:56-05:00', '2013-01-02 09:12:56-05:00',
'2013-01-03 09:12:56-05:00'],
dtype='datetime64[ns, US/Eastern]', freq=None)
Timedeltas
In [37]: t = pd.timedelta_range('1 days 2 hr 13 min 45 us', periods=3, freq='d')
In [38]: t
Out[38]:
TimedeltaIndex(['1 days 02:13:00.000045', '2 days 02:13:00.000045',
'3 days 02:13:00.000045'],
dtype='timedelta64[ns]', freq='D')
In [39]: t.round('10min')
Out[39]: TimedeltaIndex(['1 days 02:10:00', '2 days 02:10:00', '3 days 02:10:00'], dtype='timedelta64[ns]', freq=None)
# Timedelta scalar
In [40]: t[0]
Out[40]: Timedelta('1 days 02:13:00.000045')
In [41]: t[0].round('2h')
Out[41]: Timedelta('1 days 02:00:00')
In addition, .round()
, .floor()
and .ceil()
will be available through the .dt
accessor of Series
.
In [42]: s = pd.Series(dr)
In [43]: s
Out[43]:
0 2013-01-01 09:12:56.123400-05:00
1 2013-01-02 09:12:56.123400-05:00
2 2013-01-03 09:12:56.123400-05:00
dtype: datetime64[ns, US/Eastern]
In [44]: s.dt.round('D')
Out[44]:
0 2013-01-01 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
Formatting of Integers in FloatIndex¶
Integers in FloatIndex
, e.g. 1., are now formatted with a decimal point and a 0
digit, e.g. 1.0
(GH11713)
This change not only affects the display to the console, but also the output of IO methods like .to_csv
or .to_html
.
Previous Behavior:
In [2]: s = pd.Series([1,2,3], index=np.arange(3.))
In [3]: s
Out[3]:
0 1
1 2
2 3
dtype: int64
In [4]: s.index
Out[4]: Float64Index([0.0, 1.0, 2.0], dtype='float64')
In [5]: print(s.to_csv(path=None))
0,1
1,2
2,3
New Behavior:
In [45]: s = pd.Series([1,2,3], index=np.arange(3.))
In [46]: s
Out[46]:
0.0 1
1.0 2
2.0 3
dtype: int64
In [47]: s.index
Out[47]: Float64Index([0.0, 1.0, 2.0], dtype='float64')
In [48]: print(s.to_csv(path=None))
0.0,1
1.0,2
2.0,3
Changes to dtype assignment behaviors¶
When a DataFrame’s slice is updated with a new slice of the same dtype, the dtype of the DataFrame will now remain the same. (GH10503)
Previous Behavior:
In [5]: df = pd.DataFrame({'a': [0, 1, 1],
'b': pd.Series([100, 200, 300], dtype='uint32')})
In [7]: df.dtypes
Out[7]:
a int64
b uint32
dtype: object
In [8]: ix = df['a'] == 1
In [9]: df.loc[ix, 'b'] = df.loc[ix, 'b']
In [11]: df.dtypes
Out[11]:
a int64
b int64
dtype: object
New Behavior:
In [49]: df = pd.DataFrame({'a': [0, 1, 1],
....: 'b': pd.Series([100, 200, 300], dtype='uint32')})
....:
In [50]: df.dtypes
Out[50]:
a int64
b uint32
dtype: object
In [51]: ix = df['a'] == 1
In [52]: df.loc[ix, 'b'] = df.loc[ix, 'b']
In [53]: df.dtypes
Out[53]:
a int64
b uint32
dtype: object
When a DataFrame’s integer slice is partially updated with a new slice of floats that could potentially be downcast to integer without losing precision, the dtype of the slice will be set to float instead of integer.
Previous Behavior:
In [4]: df = pd.DataFrame(np.array(range(1,10)).reshape(3,3),
columns=list('abc'),
index=[[4,4,8], [8,10,12]])
In [5]: df
Out[5]:
a b c
4 8 1 2 3
10 4 5 6
8 12 7 8 9
In [7]: df.ix[4, 'c'] = np.array([0., 1.])
In [8]: df
Out[8]:
a b c
4 8 1 2 0
10 4 5 1
8 12 7 8 9
New Behavior:
In [54]: df = pd.DataFrame(np.array(range(1,10)).reshape(3,3),
....: columns=list('abc'),
....: index=[[4,4,8], [8,10,12]])
....:
In [55]: df
Out[55]:
a b c
4 8 1 2 3
10 4 5 6
8 12 7 8 9
In [56]: df.ix[4, 'c'] = np.array([0., 1.])
In [57]: df
Out[57]:
a b c
4 8 1 2 0.0
10 4 5 1.0
8 12 7 8 9.0
to_xarray¶
In a future version of pandas, we will be deprecating Panel
and other > 2 ndim objects. In order to provide for continuity,
all NDFrame
objects have gained the .to_xarray()
method in order to convert to xarray
objects, which has
a pandas-like interface for > 2 ndim. (GH11972)
See the xarray full-documentation here.
In [1]: p = pd.Panel(np.arange(2*3*4).reshape(2,3,4))
In [2]: p.to_xarray()
Out[2]:
<xarray.DataArray (items: 2, major_axis: 3, minor_axis: 4)>
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])
Coordinates:
* items (items) int64 0 1
* major_axis (major_axis) int64 0 1 2
* minor_axis (minor_axis) int64 0 1 2 3
Latex Representation¶
DataFrame
has gained a ._repr_latex_()
method in order to allow for conversion to latex in a ipython/jupyter notebook using nbconvert. (GH11778)
Note that this must be activated by setting the option pd.options.display.latex.repr = True
(GH12182)
For example, if you have a jupyter notebook you plan to convert to latex using nbconvert, place the statement pd.options.display.latex.repr = True
in the first cell to have the contained DataFrame output also stored as latex.
The options display.latex.escape
and display.latex.longtable
have also been added to the configuration and are used automatically by the to_latex
method. See the available options docs for more info.
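A minimal, hedged sketch for a notebook destined for nbconvert (values are illustrative):

import pandas as pd

pd.options.display.latex.repr = True    # enable the _repr_latex_ output
pd.options.display.latex.escape = True  # escape special LaTeX characters
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df._repr_latex_()   # the LaTeX string nbconvert embeds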
pd.read_sas()
changes¶
read_sas
has gained the ability to read SAS7BDAT files, including compressed files. The files can be read in entirety, or incrementally. For full details see here. (GH4052)
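A hedged sketch of both reading modes ('data.sas7bdat' is a placeholder path):

import pandas as pd

df = pd.read_sas('data.sas7bdat')        # read the file in its entirety
for chunk in pd.read_sas('data.sas7bdat', chunksize=10000):
    ...                                  # or process it incrementally, chunk by chunk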
Other enhancements¶
- Handle truncated floats in SAS xport files (GH11713)
- Added option to hide index in
Series.to_string
(GH11729) read_excel
now supports s3 urls of the formats3://bucketname/filename
(GH11447)- add support for
AWS_S3_HOST
env variable when reading from s3 (GH12198) - A simple version of
Panel.round()
is now implemented (GH11763) - For Python 3.x,
round(DataFrame)
,round(Series)
,round(Panel)
will work (GH11763) sys.getsizeof(obj)
returns the memory usage of a pandas object, including the values it contains (GH11597)Series
gained anis_unique
attribute (GH11946)DataFrame.quantile
andSeries.quantile
now acceptinterpolation
keyword (GH10174).- Added
DataFrame.style.format
for more flexible formatting of cell values (GH11692) DataFrame.select_dtypes
now allows thenp.float16
typecode (GH11990)pivot_table()
now accepts most iterables for thevalues
parameter (GH12017)- Added Google
BigQuery
service account authentication support, which enables authentication on remote servers. (GH11881, GH12572). For further details see here HDFStore
is now iterable:for k in store
is equivalent tofor k in store.keys()
(GH12221).- Add missing methods/fields to
.dt
forPeriod
(GH8848) - The entire codebase has been
PEP
-ified (GH12096)
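A tiny, hedged illustration of two of the enhancements listed above, Series.is_unique and the new interpolation keyword for quantile (expected results shown as comments):

import pandas as pd

s = pd.Series([1, 2, 3])
s.is_unique                               # True: no duplicate values
s.quantile(0.5, interpolation='nearest')  # returns an actual data point: 2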
Backwards incompatible API changes¶
- the leading whitespaces have been removed from the output of
.to_string(index=False)
method (GH11833) - the
out
parameter has been removed from theSeries.round()
method. (GH11763) DataFrame.round()
leaves non-numeric columns unchanged in its return, rather than raises. (GH11885)DataFrame.head(0)
andDataFrame.tail(0)
return empty frames, rather thanself
. (GH11937)Series.head(0)
andSeries.tail(0)
return empty series, rather thanself
. (GH11937)to_msgpack
andread_msgpack
encoding now defaults to'utf-8'
. (GH12170)- the order of keyword arguments to text file parsing functions (
.read_csv()
,.read_table()
,.read_fwf()
) changed to group related arguments. (GH11555) NaTType.isoformat
now returns the string'NaT
to allow the result to be passed to the constructor ofTimestamp
. (GH12300)
NaT and Timedelta operations¶
NaT
and Timedelta
have expanded arithmetic operations, which are extended to Series
arithmetic where applicable. Operations defined for datetime64[ns]
or timedelta64[ns]
are now also defined for NaT
(GH11564).
NaT
now supports arithmetic operations with integers and floats.
In [58]: pd.NaT * 1
Out[58]: NaT
In [59]: pd.NaT * 1.5
Out[59]: NaT
In [60]: pd.NaT / 2
Out[60]: NaT
In [61]: pd.NaT * np.nan
Out[61]: NaT
NaT
defines more arithmetic operations with datetime64[ns]
and timedelta64[ns]
.
In [62]: pd.NaT / pd.NaT
Out[62]: nan
In [63]: pd.Timedelta('1s') / pd.NaT
Out[63]: nan
NaT
may represent either a datetime64[ns]
null or a timedelta64[ns]
null.
Given the ambiguity, it is treated as a timedelta64[ns]
, which allows more operations
to succeed.
In [64]: pd.NaT + pd.NaT
Out[64]: NaT
# same as
In [65]: pd.Timedelta('1s') + pd.Timedelta('1s')
Out[65]: Timedelta('0 days 00:00:02')
as opposed to
In [3]: pd.Timestamp('19900315') + pd.Timestamp('19900315')
TypeError: unsupported operand type(s) for +: 'Timestamp' and 'Timestamp'
However, when wrapped in a Series whose dtype is datetime64[ns] or timedelta64[ns], the dtype information is respected.
In [1]: pd.Series([pd.NaT], dtype='<M8[ns]') + pd.Series([pd.NaT], dtype='<M8[ns]')
TypeError: can only operate on a datetimes for subtraction,
but the operator [__add__] was passed
In [66]: pd.Series([pd.NaT], dtype='<m8[ns]') + pd.Series([pd.NaT], dtype='<m8[ns]')
Out[66]:
0 NaT
dtype: timedelta64[ns]
Timedelta division by floats now works.
In [67]: pd.Timedelta('1s') / 2.0
Out[67]: Timedelta('0 days 00:00:00.500000')
Subtraction of a timedelta64[ns] Series from a Timestamp works (GH11925)
In [68]: ser = pd.Series(pd.timedelta_range('1 day', periods=3))
In [69]: ser
Out[69]:
0 1 days
1 2 days
2 3 days
dtype: timedelta64[ns]
In [70]: pd.Timestamp('2012-01-01') - ser
Out[70]:
0 2011-12-31
1 2011-12-30
2 2011-12-29
dtype: datetime64[ns]
NaT.isoformat() now returns 'NaT'. This change allows pd.Timestamp to rehydrate any timestamp-like object from its isoformat (GH12300).
Changes to msgpack¶
Forward incompatible changes in msgpack writing format were made over 0.17.0 and 0.18.0; older versions of pandas cannot read files packed by newer versions (GH12129, GH10527).
Bugs in to_msgpack and read_msgpack introduced in 0.17.0 and fixed in 0.18.0 caused files packed in Python 2 to be unreadable by Python 3 (GH12142). The following table describes the backward and forward compatibility of msgpacks.
Warning
Packed with | Can be unpacked with |
---|---|
pre-0.17 / Python 2 | any |
pre-0.17 / Python 3 | any |
0.17 / Python 2 | ==0.17 / Python 2, >=0.18 / any Python |
0.17 / Python 3 | >=0.18 / any Python |
0.18 | >= 0.18 |
0.18.0 is backward-compatible for reading files packed by older versions, except for files packed with 0.17 in Python 2, which can only be unpacked in Python 2.
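A minimal round-trip sketch with the msgpack I/O of this era (the filename is hypothetical; note that msgpack support was removed from much later pandas versions):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df.to_msgpack('frame.msg')           # hypothetical filename
result = pd.read_msgpack('frame.msg')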
Signature change for .rank¶
Series.rank and DataFrame.rank now have the same signature (GH11759)
Previous signature
In [3]: pd.Series([0,1]).rank(method='average', na_option='keep',
ascending=True, pct=False)
Out[3]:
0 1
1 2
dtype: float64
In [4]: pd.DataFrame([0,1]).rank(axis=0, numeric_only=None,
method='average', na_option='keep',
ascending=True, pct=False)
Out[4]:
0
0 1
1 2
New signature
In [71]: pd.Series([0,1]).rank(axis=0, method='average', numeric_only=None,
....: na_option='keep', ascending=True, pct=False)
....:
Out[71]:
0 1.0
1 2.0
dtype: float64
In [72]: pd.DataFrame([0,1]).rank(axis=0, method='average', numeric_only=None,
....: na_option='keep', ascending=True, pct=False)
....:
Out[72]:
0
0 1.0
1 2.0
Bug in QuarterBegin with n=0¶
In previous versions, the behavior of the QuarterBegin offset was inconsistent depending on the date when the n parameter was 0. (GH11406)
The general semantics of anchored offsets for n=0 is to not move the date when it is an anchor point (e.g., a quarter start date), and otherwise roll forward to the next anchor point.
In [73]: d = pd.Timestamp('2014-02-01')
In [74]: d
Out[74]: Timestamp('2014-02-01 00:00:00')
In [75]: d + pd.offsets.QuarterBegin(n=0, startingMonth=2)
Out[75]: Timestamp('2014-02-01 00:00:00')
In [76]: d + pd.offsets.QuarterBegin(n=0, startingMonth=1)
Out[76]: Timestamp('2014-04-01 00:00:00')
For the QuarterBegin offset in previous versions, the date would be rolled backwards if the date was in the same month as the quarter start date.
In [3]: d = pd.Timestamp('2014-02-15')
In [4]: d + pd.offsets.QuarterBegin(n=0, startingMonth=2)
Out[4]: Timestamp('2014-02-01 00:00:00')
This behavior has been corrected in version 0.18.0, which is consistent with other anchored offsets like MonthBegin and YearBegin.
In [77]: d = pd.Timestamp('2014-02-15')
In [78]: d + pd.offsets.QuarterBegin(n=0, startingMonth=2)
Out[78]: Timestamp('2014-05-01 00:00:00')
Resample API¶
Like the change in the window functions API above, .resample(...) is changing to have a more groupby-like API. (GH11732, GH12702, GH12202, GH12332, GH12334, GH12348, GH12448).
In [79]: np.random.seed(1234)
In [80]: df = pd.DataFrame(np.random.rand(10,4),
....: columns=list('ABCD'),
....: index=pd.date_range('2010-01-01 09:00:00', periods=10, freq='s'))
....:
In [81]: df
Out[81]:
A B C D
2010-01-01 09:00:00 0.191519 0.622109 0.437728 0.785359
2010-01-01 09:00:01 0.779976 0.272593 0.276464 0.801872
2010-01-01 09:00:02 0.958139 0.875933 0.357817 0.500995
2010-01-01 09:00:03 0.683463 0.712702 0.370251 0.561196
2010-01-01 09:00:04 0.503083 0.013768 0.772827 0.882641
2010-01-01 09:00:05 0.364886 0.615396 0.075381 0.368824
2010-01-01 09:00:06 0.933140 0.651378 0.397203 0.788730
2010-01-01 09:00:07 0.316836 0.568099 0.869127 0.436173
2010-01-01 09:00:08 0.802148 0.143767 0.704261 0.704581
2010-01-01 09:00:09 0.218792 0.924868 0.442141 0.909316
Previous API:
You would write a resampling operation that immediately evaluates. If a how parameter was not provided, it would default to how='mean'.
In [6]: df.resample('2s')
Out[6]:
A B C D
2010-01-01 09:00:00 0.485748 0.447351 0.357096 0.793615
2010-01-01 09:00:02 0.820801 0.794317 0.364034 0.531096
2010-01-01 09:00:04 0.433985 0.314582 0.424104 0.625733
2010-01-01 09:00:06 0.624988 0.609738 0.633165 0.612452
2010-01-01 09:00:08 0.510470 0.534317 0.573201 0.806949
You could also specify a how directly:
In [7]: df.resample('2s', how='sum')
Out[7]:
A B C D
2010-01-01 09:00:00 0.971495 0.894701 0.714192 1.587231
2010-01-01 09:00:02 1.641602 1.588635 0.728068 1.062191
2010-01-01 09:00:04 0.867969 0.629165 0.848208 1.251465
2010-01-01 09:00:06 1.249976 1.219477 1.266330 1.224904
2010-01-01 09:00:08 1.020940 1.068634 1.146402 1.613897
New API:
Now, you can write .resample(...) as a 2-stage operation like .groupby(...), which yields a Resampler.
In [82]: r = df.resample('2s')
In [83]: r
Out[83]: DatetimeIndexResampler [freq=<2 * Seconds>, axis=0, closed=left, label=left, convention=start, base=0]
Downsampling¶
You can then use this object to perform operations. These are downsampling operations (going from a higher frequency to a lower one).
In [84]: r.mean()
Out[84]:
A B C D
2010-01-01 09:00:00 0.485748 0.447351 0.357096 0.793615
2010-01-01 09:00:02 0.820801 0.794317 0.364034 0.531096
2010-01-01 09:00:04 0.433985 0.314582 0.424104 0.625733
2010-01-01 09:00:06 0.624988 0.609738 0.633165 0.612452
2010-01-01 09:00:08 0.510470 0.534317 0.573201 0.806949
In [85]: r.sum()
Out[85]:
A B C D
2010-01-01 09:00:00 0.971495 0.894701 0.714192 1.587231
2010-01-01 09:00:02 1.641602 1.588635 0.728068 1.062191
2010-01-01 09:00:04 0.867969 0.629165 0.848208 1.251465
2010-01-01 09:00:06 1.249976 1.219477 1.266330 1.224904
2010-01-01 09:00:08 1.020940 1.068634 1.146402 1.613897
Furthermore, resample now supports getitem operations to perform the resample on specific columns.
In [86]: r[['A','C']].mean()
Out[86]:
A C
2010-01-01 09:00:00 0.485748 0.357096
2010-01-01 09:00:02 0.820801 0.364034
2010-01-01 09:00:04 0.433985 0.424104
2010-01-01 09:00:06 0.624988 0.633165
2010-01-01 09:00:08 0.510470 0.573201
and .aggregate type operations:
In [87]: r.agg({'A' : 'mean', 'B' : 'sum'})
Out[87]:
A B
2010-01-01 09:00:00 0.485748 0.894701
2010-01-01 09:00:02 0.820801 1.588635
2010-01-01 09:00:04 0.433985 0.629165
2010-01-01 09:00:06 0.624988 1.219477
2010-01-01 09:00:08 0.510470 1.068634
These accessors can, of course, be combined:
In [88]: r[['A','B']].agg(['mean','sum'])
Out[88]:
A B
mean sum mean sum
2010-01-01 09:00:00 0.485748 0.971495 0.447351 0.894701
2010-01-01 09:00:02 0.820801 1.641602 0.794317 1.588635
2010-01-01 09:00:04 0.433985 0.867969 0.314582 0.629165
2010-01-01 09:00:06 0.624988 1.249976 0.609738 1.219477
2010-01-01 09:00:08 0.510470 1.020940 0.534317 1.068634
Upsampling¶
Upsampling operations take you from a lower frequency to a higher frequency. These are now performed with the Resampler objects with backfill(), ffill(), fillna() and asfreq() methods.
In [89]: s = pd.Series(np.arange(5, dtype='int64'),
   ....:               index=pd.date_range('2010-01-01', periods=5, freq='Q'))
....:
In [90]: s
Out[90]:
2010-03-31 0
2010-06-30 1
2010-09-30 2
2010-12-31 3
2011-03-31 4
Freq: Q-DEC, dtype: int64
Previously
In [6]: s.resample('M', fill_method='ffill')
Out[6]:
2010-03-31 0
2010-04-30 0
2010-05-31 0
2010-06-30 1
2010-07-31 1
2010-08-31 1
2010-09-30 2
2010-10-31 2
2010-11-30 2
2010-12-31 3
2011-01-31 3
2011-02-28 3
2011-03-31 4
Freq: M, dtype: int64
New API
In [91]: s.resample('M').ffill()
Out[91]:
2010-03-31 0
2010-04-30 0
2010-05-31 0
2010-06-30 1
2010-07-31 1
2010-08-31 1
2010-09-30 2
2010-10-31 2
2010-11-30 2
2010-12-31 3
2011-01-31 3
2011-02-28 3
2011-03-31 4
Freq: M, dtype: int64
Note
In the new API, you can either downsample OR upsample. The prior implementation would allow you to pass an aggregator function (like mean) even though you were upsampling, which was a bit confusing.
Previous API will work but with deprecations¶
Warning
The new resample API includes internal changes so that pre-0.18.0 code keeps working, with a deprecation warning, in most cases: since the resample operation now returns a deferred object, pandas can intercept operations and do what the (pre 0.18.0) API did (with a warning). Here is a typical use case:
In [4]: r = df.resample('2s')
In [6]: r*10
pandas/tseries/resample.py:80: FutureWarning: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)
Out[6]:
A B C D
2010-01-01 09:00:00 4.857476 4.473507 3.570960 7.936154
2010-01-01 09:00:02 8.208011 7.943173 3.640340 5.310957
2010-01-01 09:00:04 4.339846 3.145823 4.241039 6.257326
2010-01-01 09:00:06 6.249881 6.097384 6.331650 6.124518
2010-01-01 09:00:08 5.104699 5.343172 5.732009 8.069486
However, getting and assignment operations directly on a Resampler will raise a ValueError:
In [7]: r.iloc[0] = 5
ValueError: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)
There is one situation where the new API cannot perform all the operations of the original code. This code intended to resample every 2s, take the mean AND then take the min of those results.
In [4]: df.resample('2s').min()
Out[4]:
A 0.433985
B 0.314582
C 0.357096
D 0.531096
dtype: float64
The new API will:
In [92]: df.resample('2s').min()
Out[92]:
A B C D
2010-01-01 09:00:00 0.191519 0.272593 0.276464 0.785359
2010-01-01 09:00:02 0.683463 0.712702 0.357817 0.500995
2010-01-01 09:00:04 0.364886 0.013768 0.075381 0.368824
2010-01-01 09:00:06 0.316836 0.568099 0.397203 0.436173
2010-01-01 09:00:08 0.218792 0.143767 0.442141 0.704581
The good news is the return dimensions will differ between the new API and the old API, so this should loudly raise an exception.
To replicate the original operation:
In [93]: df.resample('2s').mean().min()
Out[93]:
A 0.433985
B 0.314582
C 0.357096
D 0.531096
dtype: float64
Changes to eval¶
In prior versions, new column assignments in an eval expression resulted in an inplace change to the DataFrame. (GH9297, GH8664, GH10486)
In [94]: df = pd.DataFrame({'a': np.linspace(0, 10, 5), 'b': range(5)})
In [95]: df
Out[95]:
a b
0 0.0 0
1 2.5 1
2 5.0 2
3 7.5 3
4 10.0 4
In [12]: df.eval('c = a + b')
FutureWarning: eval expressions containing an assignment currently default to operating inplace.
This will change in a future version of pandas, use inplace=True to avoid this warning.
In [13]: df
Out[13]:
a b c
0 0.0 0 0.0
1 2.5 1 3.5
2 5.0 2 7.0
3 7.5 3 10.5
4 10.0 4 14.0
In version 0.18.0, a new inplace keyword was added to choose whether the assignment should be done inplace or return a copy.
In [96]: df
Out[96]:
a b c
0 0.0 0 0.0
1 2.5 1 3.5
2 5.0 2 7.0
3 7.5 3 10.5
4 10.0 4 14.0
In [97]: df.eval('d = c - b', inplace=False)
Out[97]:
a b c d
0 0.0 0 0.0 0.0
1 2.5 1 3.5 2.5
2 5.0 2 7.0 5.0
3 7.5 3 10.5 7.5
4 10.0 4 14.0 10.0
In [98]: df
Out[98]:
a b c
0 0.0 0 0.0
1 2.5 1 3.5
2 5.0 2 7.0
3 7.5 3 10.5
4 10.0 4 14.0
In [99]: df.eval('d = c - b', inplace=True)
In [100]: df
Out[100]:
a b c d
0 0.0 0 0.0 0.0
1 2.5 1 3.5 2.5
2 5.0 2 7.0 5.0
3 7.5 3 10.5 7.5
4 10.0 4 14.0 10.0
Warning
For backwards compatibility, inplace defaults to True if not specified. This will change in a future version of pandas. If your code depends on an inplace assignment you should update to explicitly set inplace=True.
The inplace keyword parameter was also added to the query method.
In [101]: df.query('a > 5')
Out[101]:
a b c d
3 7.5 3 10.5 7.5
4 10.0 4 14.0 10.0
In [102]: df.query('a > 5', inplace=True)
In [103]: df
Out[103]:
a b c d
3 7.5 3 10.5 7.5
4 10.0 4 14.0 10.0
Warning
Note that the default value for inplace in a query is False, which is consistent with prior versions.
eval has also been updated to allow multi-line expressions for multiple assignments. These expressions will be evaluated one at a time in order. Only assignments are valid for multi-line expressions.
In [104]: df
Out[104]:
a b c d
3 7.5 3 10.5 7.5
4 10.0 4 14.0 10.0
In [105]: df.eval("""
.....: e = d + a
.....: f = e - 22
.....: g = f / 2.0""", inplace=True)
.....:
In [106]: df
Out[106]:
a b c d e f g
3 7.5 3 10.5 7.5 15.0 -7.0 -3.5
4 10.0 4 14.0 10.0 20.0 -2.0 -1.0
Other API Changes¶
- DataFrame.between_time and Series.between_time now only parse a fixed set of time strings. Parsing of date strings is no longer supported and raises a ValueError. (GH11818)

In [107]: s = pd.Series(range(10), pd.date_range('2015-01-01', freq='H', periods=10))

In [108]: s.between_time("7:00am", "9:00am")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-108-1f395af72989> in <module>()
----> 1 s.between_time("7:00am", "9:00am")

/home/joris/scipy/pandas/pandas/core/generic.pyc in between_time(self, start_time, end_time, include_start, include_end)
   4042             indexer = self.index.indexer_between_time(
   4043                 start_time, end_time, include_start=include_start,
-> 4044                 include_end=include_end)
   4045             return self.take(indexer, convert=False)
   4046         except AttributeError:

/home/joris/scipy/pandas/pandas/tseries/index.pyc in indexer_between_time(self, start_time, end_time, include_start, include_end)
   1878         values_between_time : TimeSeries
   1879         """
-> 1880         start_time = to_time(start_time)
   1881         end_time = to_time(end_time)
   1882         time_micros = self._get_time_micros()

/home/joris/scipy/pandas/pandas/tseries/tools.pyc in to_time(arg, format, infer_time_format, errors)
    760         return _convert_listlike(arg, format)
    761
--> 762     return _convert_listlike(np.array([arg]), format)[0]
    763
    764

/home/joris/scipy/pandas/pandas/tseries/tools.pyc in _convert_listlike(arg, format)
    740         elif errors == 'raise':
    741             raise ValueError("Cannot convert arg {arg} to "
--> 742                              "a time".format(arg=arg))
    743         elif errors == 'ignore':
    744             return arg

ValueError: Cannot convert arg ['7:00am'] to a time

This will now raise.

In [2]: s.between_time('20150101 07:00:00','20150101 09:00:00')
ValueError: Cannot convert arg ['20150101 07:00:00'] to a time.
- .memory_usage() now includes values in the index, as does memory_usage in .info() (GH11597)
- DataFrame.to_latex() now supports non-ascii encodings (eg utf-8) in Python 2 with the parameter encoding (GH7061)
- pandas.merge() and DataFrame.merge() will show a specific error message when trying to merge with an object that is not of type DataFrame or a subclass (GH12081)
- DataFrame.unstack and Series.unstack now take a fill_value keyword to allow direct replacement of missing values when an unstack results in missing values in the resulting DataFrame. As an added benefit, specifying fill_value will preserve the data type of the original stacked data. (GH9746) See the sketch after this list.
- As part of the new API for window functions and resampling, aggregation functions have been clarified, raising more informative error messages on invalid aggregations. (GH9052). A full set of examples are presented in groupby.
- Statistical functions for NDFrame objects (like sum(), mean(), min()) will now raise if non-numpy-compatible arguments are passed in for **kwargs (GH12301)
- .to_latex and .to_html gain a decimal parameter like .to_csv; the default is '.' (GH12031)
- More helpful error message when constructing a DataFrame with empty data but with indices (GH8020)
- .describe() will now properly handle bool dtype as a categorical (GH6625)
- More helpful error message with an invalid .transform with user defined input (GH10165)
- Exponentially weighted functions now allow specifying alpha directly (GH10789) and raise ValueError if parameters violate 0 < alpha <= 1 (GH12492)
Deprecations¶
- The functions pd.rolling_*, pd.expanding_*, and pd.ewm* are deprecated and replaced by the corresponding method call. Note that the new suggested syntax includes all of the arguments (even if default) (GH11603)

In [1]: s = pd.Series(range(3))

In [2]: pd.rolling_mean(s,window=2,min_periods=1)
FutureWarning: pd.rolling_mean is deprecated for Series and will be removed in a future version, replace with
        Series.rolling(min_periods=1,window=2,center=False).mean()
Out[2]:
0    0.0
1    0.5
2    1.5
dtype: float64

In [3]: pd.rolling_cov(s, s, window=2)
FutureWarning: pd.rolling_cov is deprecated for Series and will be removed in a future version, replace with
        Series.rolling(window=2).cov(other=<Series>)
Out[3]:
0    NaN
1    0.5
2    0.5
dtype: float64

- The freq and how arguments to the .rolling, .expanding, and .ewm (new) functions are deprecated, and will be removed in a future version. You can simply resample the input prior to creating a window function. (GH11603). For example, instead of s.rolling(window=5,freq='D').max() to get the max value on a rolling 5 Day window, one could use s.resample('D').mean().rolling(window=5).max(), which first resamples the data to daily data, then provides a rolling 5 day window.
- pd.tseries.frequencies.get_offset_name function is deprecated. Use offset's .freqstr property as alternative (GH11192)
- pandas.stats.fama_macbeth routines are deprecated and will be removed in a future version (GH6077)
- pandas.stats.ols, pandas.stats.plm and pandas.stats.var routines are deprecated and will be removed in a future version (GH6077)
- show a FutureWarning rather than a DeprecationWarning on using long-time deprecated syntax in HDFStore.select, where the where clause is not a string-like (GH12027)
- The pandas.options.display.mpl_style configuration has been deprecated and will be removed in a future version of pandas. This functionality is better handled by matplotlib's style sheets (GH11783).
Removal of deprecated float indexers¶
In GH4892 indexing with floating point numbers on a non-Float64Index was deprecated (in version 0.14.0). In 0.18.0, this deprecation warning is removed and these will now raise a TypeError. (GH12165, GH12333)
In [109]: s = pd.Series([1, 2, 3], index=[4, 5, 6])
In [110]: s
Out[110]:
4 1
5 2
6 3
dtype: int64
In [111]: s2 = pd.Series([1, 2, 3], index=list('abc'))
In [112]: s2
Out[112]:
a 1
b 2
c 3
dtype: int64
Previous Behavior:
# this is label indexing
In [2]: s[5.0]
FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[2]: 2
# this is positional indexing
In [3]: s.iloc[1.0]
FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[3]: 2
# this is label indexing
In [4]: s.loc[5.0]
FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[4]: 2
# .ix would coerce 1.0 to the positional 1, and index
In [5]: s2.ix[1.0] = 10
FutureWarning: scalar indexers for index type Index should be integers and not floating point
In [6]: s2
Out[6]:
a 1
b 10
c 3
dtype: int64
New Behavior:
For iloc, getting & setting via a float scalar will always raise.
In [3]: s.iloc[2.0]
TypeError: cannot do label indexing on <class 'pandas.indexes.numeric.Int64Index'> with these indexers [2.0] of <type 'float'>
Other indexers will coerce to a like integer for both getting and setting. The FutureWarning has been dropped for .loc, .ix and [].
In [113]: s[5.0]
Out[113]: 2
In [114]: s.loc[5.0]
Out[114]: 2
In [115]: s.ix[5.0]
Out[115]: 2
and setting
In [116]: s_copy = s.copy()
In [117]: s_copy[5.0] = 10
In [118]: s_copy
Out[118]:
4 1
5 10
6 3
dtype: int64
In [119]: s_copy = s.copy()
In [120]: s_copy.loc[5.0] = 10
In [121]: s_copy
Out[121]:
4 1
5 10
6 3
dtype: int64
In [122]: s_copy = s.copy()
In [123]: s_copy.ix[5.0] = 10
In [124]: s_copy
Out[124]:
4 1
5 10
6 3
dtype: int64
Positional setting with .ix and a float indexer will ADD this value to the index, rather than previously setting the value by position.
In [125]: s2.ix[1.0] = 10
In [126]: s2
Out[126]:
a 1
b 2
c 3
1.0 10
dtype: int64
Slicing will also coerce integer-like floats to integers for a non-Float64Index.
In [127]: s.loc[5.0:6]
Out[127]:
5 2
6 3
dtype: int64
In [128]: s.ix[5.0:6]
Out[128]:
5 2
6 3
dtype: int64
Note that for floats that are NOT coercible to ints, the label based bounds will be excluded
In [129]: s.loc[5.1:6]
Out[129]:
6 3
dtype: int64
In [130]: s.ix[5.1:6]
Out[130]:
6 3
dtype: int64
Float indexing on a Float64Index is unchanged.
In [131]: s = pd.Series([1, 2, 3], index=np.arange(3.))
In [132]: s[1.0]
Out[132]: 2
In [133]: s[1.0:2.5]
Out[133]:
1.0 2
2.0 3
dtype: int64
Removal of prior version deprecations/changes¶
- Removal of rolling_corr_pairwise in favor of .rolling().corr(pairwise=True) (GH4950); see the sketch after this list.
- Removal of expanding_corr_pairwise in favor of .expanding().corr(pairwise=True) (GH4950)
- Removal of DataMatrix module. This was not imported into the pandas namespace in any event (GH12111)
- Removal of cols keyword in favor of subset in DataFrame.duplicated() and DataFrame.drop_duplicates() (GH6680)
- Removal of the read_frame and frame_query (both aliases for pd.read_sql) and write_frame (alias of to_sql) functions in the pd.io.sql namespace, deprecated since 0.14.0 (GH6292).
- Removal of the order keyword from .factorize() (GH6930)
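A minimal sketch of the replacement for the removed pairwise functions, using illustrative random data:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(20, 2), columns=['A', 'B'])

# Previously: pd.rolling_corr_pairwise(df, window=5)
pairwise = df.rolling(window=5).corr(pairwise=True)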
Performance Improvements¶
- Improved performance of andrews_curves (GH11534)
- Improved huge DatetimeIndex, PeriodIndex and TimedeltaIndex's ops performance including NaT (GH10277)
- Improved performance of pandas.concat (GH11958)
- Improved performance of StataReader (GH11591)
- Improved performance in construction of Categoricals with Series of datetimes containing NaT (GH12077)
- Improved performance of ISO 8601 date parsing for dates without separators (GH11899), leading zeros (GH11871) and with whitespace preceding the time zone (GH9714)
Bug Fixes¶
- Bug in GroupBy.size when data-frame is empty. (GH11699)
- Bug in Period.end_time when a multiple of time period is requested (GH11738)
- Regression in .clip with tz-aware datetimes (GH11838)
- Bug in date_range when the boundaries fell on the frequency (GH11804, GH12409)
- Bug in consistency of passing nested dicts to .groupby(...).agg(...) (GH9052)
- Accept unicode in Timedelta constructor (GH11995)
- Bug in value label reading for StataReader when reading incrementally (GH12014)
- Bug in vectorized DateOffset when n parameter is 0 (GH11370)
- Compat for numpy 1.11 w.r.t. NaT comparison changes (GH12049)
- Bug in read_csv when reading from a StringIO in threads (GH11790)
- Bug in not treating NaT as a missing value in datetimelikes when factorizing & with Categoricals (GH12077)
- Bug in getitem when the values of a Series were tz-aware (GH12089)
- Bug in Series.str.get_dummies when one of the variables was 'name' (GH12180)
- Bug in pd.concat while concatenating tz-aware NaT series. (GH11693, GH11755, GH12217)
- Bug in pd.read_stata with version <= 108 files (GH12232)
- Bug in Series.resample using a frequency of Nano when the index is a DatetimeIndex and contains non-zero nanosecond parts (GH12037)
- Bug in resampling with .nunique and a sparse index (GH12352)
- Removed some compiler warnings (GH12471)
- Work around compat issues with boto in python 3.5 (GH11915)
- Bug in NaT subtraction from Timestamp or DatetimeIndex with timezones (GH11718)
- Bug in subtraction of Series of a single tz-aware Timestamp (GH12290)
- Use compat iterators in PY2 to support .next() (GH12299)
- Bug in Timedelta.round with negative values (GH11690)
- Bug in .loc against CategoricalIndex may result in normal Index (GH11586)
- Bug in DataFrame.info when duplicated column names exist (GH11761)
- Bug in .copy of datetime tz-aware objects (GH11794)
- Bug in Series.apply and Series.map where timedelta64 was not boxed (GH11349)
- Bug in DataFrame.set_index() with tz-aware Series (GH12358)
- Bug in subclasses of DataFrame where AttributeError did not propagate (GH11808)
- Bug in groupby on tz-aware data where selection was not returning Timestamp (GH11616)
- Bug in pd.read_clipboard and pd.to_clipboard functions not supporting Unicode; upgrade included pyperclip to v1.5.15 (GH9263)
- Bug in DataFrame.query containing an assignment (GH8664)
- Bug in from_msgpack where __contains__() fails for columns of the unpacked DataFrame, if the DataFrame has object columns. (GH11880)
- Bug in .resample on categorical data with TimedeltaIndex (GH12169)
- Bug in timezone info lost when broadcasting scalar datetime to DataFrame (GH11682)
- Bug in Index creation from Timestamp with mixed tz coerces to UTC (GH11488)
- Bug in to_numeric where it does not raise if input is more than one dimension (GH11776)
- Bug in parsing timezone offset strings with non-zero minutes (GH11708)
- Bug in df.plot using incorrect colors for bar plots under matplotlib 1.5+ (GH11614)
- Bug in the groupby plot method when using keyword arguments (GH11805).
- Bug in DataFrame.duplicated and drop_duplicates causing spurious matches when setting keep=False (GH11864)
- Bug in .loc result with duplicated key may have Index with incorrect dtype (GH11497)
- Bug in pd.rolling_median where memory allocation failed even with sufficient memory (GH11696)
- Bug in DataFrame.style with spurious zeros (GH12134)
- Bug in DataFrame.style with integer columns not starting at 0 (GH12125)
- Bug in .style.bar may not be rendered properly using specific browser (GH11678)
- Bug in rich comparison of Timedelta with a numpy.array of Timedelta that caused an infinite recursion (GH11835)
- Bug in DataFrame.round dropping column index name (GH11986)
- Bug in df.replace while replacing value in mixed dtype DataFrame (GH11698)
- Bug in Index prevents copying name of passed Index, when a new name is not provided (GH11193)
- Bug in read_excel failing to read any non-empty sheets when empty sheets exist and sheetname=None (GH11711)
- Bug in read_excel failing to raise NotImplemented error when keywords parse_dates and date_parser are provided (GH11544)
- Bug in read_sql with pymysql connections failing to return chunked data (GH11522)
- Bug in .to_csv ignoring formatting parameters decimal, na_rep, float_format for float indexes (GH11553)
- Bug in Int64Index and Float64Index preventing the use of the modulo operator (GH9244)
- Bug in MultiIndex.drop for not lexsorted multi-indexes (GH12078)
- Bug in DataFrame when masking an empty DataFrame (GH11859)
- Bug in .plot potentially modifying the colors input when the number of columns didn't match the number of series provided (GH12039).
- Bug in Series.plot failing when index has a CustomBusinessDay frequency (GH7222).
- Bug in .to_sql for datetime.time values with sqlite fallback (GH8341)
- Bug in read_excel failing to read data with one column when squeeze=True (GH12157)
- Bug in read_excel failing to read one empty column (GH12292, GH9002)
- Bug in .groupby where a KeyError was not raised for a wrong column if there was only one row in the dataframe (GH11741)
- Bug in .read_csv with dtype specified on empty data producing an error (GH12048)
- Bug in .read_csv where strings like '2E' are treated as valid floats (GH12237)
- Bug in building pandas with debugging symbols (GH12123)
- Removed millisecond property of DatetimeIndex. This would always raise a ValueError (GH12019).
- Bug in Series constructor with read-only data (GH11502)
- Removed pandas.util.testing.choice(). Should use np.random.choice(), instead. (GH12386)
- Bug in .loc setitem indexer preventing the use of a TZ-aware DatetimeIndex (GH12050)
- Bug in .style indexes and multi-indexes not appearing (GH11655)
- Bug in to_msgpack and from_msgpack which did not correctly serialize or deserialize NaT (GH12307).
- Bug in .skew and .kurt due to roundoff error for highly similar values (GH11974)
- Bug in Timestamp constructor where microsecond resolution was lost if HHMMSS were not separated with ':' (GH10041)
- Bug in buffer_rd_bytes where src->buffer could be freed more than once if reading failed, causing a segfault (GH12098)
- Bug in crosstab where arguments with non-overlapping indexes would return a KeyError (GH10291)
- Bug in DataFrame.apply in which reduction was not being prevented for cases in which dtype was not a numpy dtype (GH12244)
- Bug when initializing categorical series with a scalar value. (GH12336)
- Bug when specifying a UTC DatetimeIndex by setting utc=True in .to_datetime (GH11934)
- Bug when increasing the buffer size of CSV reader in read_csv (GH12494)
- Bug when setting columns of a DataFrame with duplicate column names (GH12344)
v0.17.1 (November 21, 2015)¶
Note
We are proud to announce that pandas has become a sponsored project of the NumFOCUS organization. This will help ensure the success of development of pandas as a world-class open-source project.
This is a minor bug-fix release from 0.17.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.
Highlights include:
- Support for Conditional HTML Formatting, see here
- Releasing the GIL on the csv reader & other ops, see here
- Fixed regression in DataFrame.drop_duplicates from 0.16.2, causing incorrect results on integer values (GH11376)
New features¶
Conditional HTML Formatting¶
Warning
This is a new feature and is under active development. We'll be adding features and possibly making breaking changes in future releases. Feedback is welcome.
We've added experimental support for conditional HTML formatting: the visual styling of a DataFrame based on the data. The styling is accomplished with HTML and CSS. Access the styler class via the pandas.DataFrame.style attribute, an instance of Styler with your data attached.
Here's a quick example:

In [1]: np.random.seed(123)

In [2]: df = DataFrame(np.random.randn(10, 5), columns=list('abcde'))

In [3]: html = df.style.background_gradient(cmap='viridis', low=.5)
We can render the HTML to get the following table.
a | b | c | d | e | |
---|---|---|---|---|---|
0 | -1.085631 | 0.997345 | 0.282978 | -1.506295 | -0.5786 |
1 | 1.651437 | -2.426679 | -0.428913 | 1.265936 | -0.86674 |
2 | -0.678886 | -0.094709 | 1.49139 | -0.638902 | -0.443982 |
3 | -0.434351 | 2.20593 | 2.186786 | 1.004054 | 0.386186 |
4 | 0.737369 | 1.490732 | -0.935834 | 1.175829 | -1.253881 |
5 | -0.637752 | 0.907105 | -1.428681 | -0.140069 | -0.861755 |
6 | -0.255619 | -2.798589 | -1.771533 | -0.699877 | 0.927462 |
7 | -0.173636 | 0.002846 | 0.688223 | -0.879536 | 0.283627 |
8 | -0.805367 | -1.727669 | -0.3909 | 0.573806 | 0.338589 |
9 | -0.01183 | 2.392365 | 0.412912 | 0.978736 | 2.238143 |
Styler interacts nicely with the Jupyter Notebook.
See the documentation for more.
Enhancements¶
- DatetimeIndex now supports conversion to strings with astype(str) (GH10442)
- Support for compression (gzip/bz2) in pandas.DataFrame.to_csv() (GH7615)
- pd.read_* functions can now also accept pathlib.Path, or py._path.local.LocalPath objects for the filepath_or_buffer argument. (GH11033)
- The DataFrame and Series functions .to_csv(), .to_html() and .to_latex() can now handle paths beginning with tildes (e.g. ~/Documents/) (GH11438)
- DataFrame now uses the fields of a namedtuple as columns, if columns are not supplied (GH11181)
- DataFrame.itertuples() now returns namedtuple objects, when possible. (GH11269, GH11625)
- Added axvlines_kwds to parallel coordinates plot (GH10709)
- Option to .info() and .memory_usage() to provide for deep introspection of memory consumption. Note that this can be expensive to compute and therefore is an optional parameter. (GH11595)

In [4]: df = DataFrame({'A' : ['foo']*1000})

In [5]: df['B'] = df['A'].astype('category')

# shows the '+' as we have object dtypes
In [6]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
A    1000 non-null object
B    1000 non-null category
dtypes: category(1), object(1)
memory usage: 8.9+ KB

# we have an accurate memory assessment (but can be expensive to compute this)
In [7]: df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
A    1000 non-null object
B    1000 non-null category
dtypes: category(1), object(1)
memory usage: 48.0 KB

- Index now has a fillna method (GH10089)

In [8]: pd.Index([1, np.nan, 3]).fillna(2)
Out[8]: Float64Index([1.0, 2.0, 3.0], dtype='float64')

- Series of type category now make .str.<...> and .dt.<...> accessor methods / properties available, if the categories are of that type. (GH10661)

In [9]: s = pd.Series(list('aabb')).astype('category')

In [10]: s
Out[10]:
0    a
1    a
2    b
3    b
dtype: category
Categories (2, object): [a, b]

In [11]: s.str.contains("a")
Out[11]:
0     True
1     True
2    False
3    False
dtype: bool

In [12]: date = pd.Series(pd.date_range('1/1/2015', periods=5)).astype('category')

In [13]: date
Out[13]:
0   2015-01-01
1   2015-01-02
2   2015-01-03
3   2015-01-04
4   2015-01-05
dtype: category
Categories (5, datetime64[ns]): [2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05]

In [14]: date.dt.day
Out[14]:
0    1
1    2
2    3
3    4
4    5
dtype: int64

- pivot_table now has a margins_name argument so you can use something other than the default of 'All' (GH3335); see the sketch after this list.
- Implement export of datetime64[ns, tz] dtypes with a fixed HDF5 store (GH11411)
- Pretty printing sets (e.g. in DataFrame cells) now uses set literal syntax ({x, y}) instead of Legacy Python syntax (set([x, y])) (GH11215)
- Improve the error message in pandas.io.gbq.to_gbq() when a streaming insert fails (GH11285) and when the DataFrame does not match the schema of the destination table (GH11359)
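A minimal sketch of the new margins_name argument; the data is illustrative:

import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 3]})

# The margins row is labelled 'Total' instead of the default 'All'
pd.pivot_table(df, index='key', values='val',
               aggfunc='sum', margins=True, margins_name='Total')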
API changes¶
- raise NotImplementedError in Index.shift for non-supported index types (GH8038)
- min and max reductions on datetime64 and timedelta64 dtyped series now result in NaT and not nan (GH11245); see the sketch after this list.
- Indexing with a null key will raise a TypeError, instead of a ValueError (GH11356)
- Series.ptp will now ignore missing values by default (GH11163)
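A small sketch of the min/max change on an all-NaT series; the data is illustrative:

import pandas as pd

s = pd.Series([pd.NaT, pd.NaT], dtype='datetime64[ns]')

# Previously this returned nan; it now returns NaT
s.min()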
Performance Improvements¶
- Checking monotonic-ness before sorting on an index (GH11080)
- Series.dropna performance improvement when its dtype can't contain NaN (GH11159)
- Release the GIL on most datetime field operations (e.g. DatetimeIndex.year, Series.dt.year), normalization, and conversion to and from Period, DatetimeIndex.to_period and PeriodIndex.to_timestamp (GH11263)
- Release the GIL on some rolling algos: rolling_median, rolling_mean, rolling_max, rolling_min, rolling_var, rolling_kurt, rolling_skew (GH11450)
- Release the GIL when reading and parsing text files in read_csv, read_table (GH11272)
- Improved performance of rolling_median (GH11450)
- Improved performance of to_excel (GH11352)
- Performance bug in repr of Categorical categories, which was rendering the strings before chopping them for display (GH11305)
- Performance improvement in Categorical.remove_unused_categories, (GH11643).
- Improved performance of Series constructor with no data and DatetimeIndex (GH11433)
- Improved performance of shift, cumprod, and cumsum with groupby (GH4095)
Bug Fixes¶
- SparseArray.__iter__() now does not cause PendingDeprecationWarning in Python 3.5 (GH11622)
- Regression from 0.16.2 for output formatting of long floats/nan, restored in (GH11302)
- Series.sort_index() now correctly handles the inplace option (GH11402)
- Incorrectly distributed .c file in the build on PyPi when reading a csv of floats and passing na_values=<a scalar> would show an exception (GH11374)
- Bug in .to_latex() output broken when the index has a name (GH10660)
- Bug in HDFStore.append with strings whose encoded length exceeded the max unencoded length (GH11234)
- Bug in merging datetime64[ns, tz] dtypes (GH11405)
- Bug in HDFStore.select when comparing with a numpy scalar in a where clause (GH11283)
- Bug in using DataFrame.ix with a multi-index indexer (GH11372)
- Bug in date_range with ambiguous endpoints (GH11626)
- Prevent adding new attributes to the accessors .str, .dt and .cat. Retrieving such a value was not possible, so error out on setting it. (GH10673)
- Bug in tz-conversions with an ambiguous time and .dt accessors (GH11295)
- Bug in output formatting when using an index of ambiguous times (GH11619)
- Bug in comparisons of Series vs list-likes (GH11339)
- Bug in DataFrame.replace with a datetime64[ns, tz] and a non-compat to_replace (GH11326, GH11153)
- Bug in isnull where numpy.datetime64('NaT') in a numpy.array was not determined to be null (GH11206)
- Bug in list-like indexing with a mixed-integer Index (GH11320)
- Bug in pivot_table with margins=True when indexes are of Categorical dtype (GH10993)
- Bug in DataFrame.plot cannot use hex strings colors (GH10299)
- Regression in DataFrame.drop_duplicates from 0.16.2, causing incorrect results on integer values (GH11376)
- Bug in pd.eval where unary ops in a list error (GH11235)
- Bug in squeeze() with zero length arrays (GH11230, GH8999)
- Bug in describe() dropping column names for hierarchical indexes (GH11517)
- Bug in DataFrame.pct_change() not propagating axis keyword on .fillna method (GH11150)
- Bug in .to_csv() when a mix of integer and string column names are passed as the columns parameter (GH11637)
- Bug in indexing with a range, (GH11652)
- Bug in inference of numpy scalars and preserving dtype when setting columns (GH11638)
- Bug in to_sql using unicode column names giving UnicodeEncodeError (GH11431).
- Fix regression in setting of xticks in plot (GH11529).
- Bug in holiday.dates where observance rules could not be applied to holiday and doc enhancement (GH11477, GH11533)
- Fix plotting issues when having plain Axes instances instead of SubplotAxes (GH11520, GH11556).
- Bug in DataFrame.to_latex() produces an extra rule when header=False (GH7124)
- Bug in df.groupby(...).apply(func) when a func returns a Series containing a new datetimelike column (GH11324)
- Bug in pandas.json when file to load is big (GH11344)
- Bugs in to_excel with duplicate columns (GH11007, GH10982, GH10970)
- Fixed a bug that prevented the construction of an empty series of dtype datetime64[ns, tz] (GH11245).
- Bug in read_excel with multi-index containing integers (GH11317)
- Bug in to_excel with openpyxl 2.2+ and merging (GH11408)
- Bug in DataFrame.to_dict() produces a np.datetime64 object instead of Timestamp when only datetime is present in data (GH11327)
- Bug in DataFrame.corr() raising an exception when computing Kendall correlation for DataFrames with boolean and non-boolean columns (GH11560)
- Bug in the link-time error caused by C inline functions on FreeBSD 10+ (with clang) (GH10510)
- Bug in DataFrame.to_csv in passing through arguments for formatting MultiIndexes, including date_format (GH7791)
- Bug in DataFrame.join() with how='right' producing a TypeError (GH11519)
- Bug in Series.quantile with empty list results having Index with object dtype (GH11588)
- Bug in pd.merge results in empty Int64Index rather than Index(dtype=object) when the merge result is empty (GH11588)
- Bug in Categorical.remove_unused_categories when having NaN values (GH11599)
- Bug in DataFrame.to_sparse() loses column names for MultiIndexes (GH11600)
- Bug in DataFrame.round() with non-unique column index producing a Fatal Python error (GH11611)
- Bug in DataFrame.round() with decimals being a non-unique indexed Series producing extra columns (GH11618)
v0.17.0 (October 9, 2015)¶
This is a major release from 0.16.2 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
Warning
pandas >= 0.17.0 will no longer support compatibility with Python version 3.2 (GH9118)
Warning
The pandas.io.data
package is deprecated and will be replaced by the
pandas-datareader package.
This will allow the data modules to be independently updated to your pandas
installation. The API for pandas-datareader v0.1.1
is exactly the same
as in pandas v0.17.0
(GH8961, GH10861).
After installing pandas-datareader, you can easily change your imports:
from pandas.io import data, wb
becomes
from pandas_datareader import data, wb
Highlights include:
- Release the Global Interpreter Lock (GIL) on some cython operations, see here
- Plotting methods are now available as attributes of the .plot accessor, see here
- The sorting API has been revamped to remove some long-time inconsistencies, see here
- Support for a datetime64[ns] with timezones as a first-class dtype, see here
- The default for to_datetime will now be to raise when presented with unparseable formats, previously this would return the original input. Also, date parse functions now return consistent results. See here
- The default for dropna in HDFStore has changed to False, to store by default all rows even if they are all NaN, see here
- Datetime accessor (dt) now supports Series.dt.strftime to generate formatted strings for datetime-likes, and Series.dt.total_seconds to generate each duration of the timedelta in seconds. See here
- Period and PeriodIndex can handle multiplied freq like 3D, which corresponds to a 3-day span. See here
- Development installed versions of pandas will now have PEP440 compliant version strings (GH9518)
- Development support for benchmarking with the Air Speed Velocity library (GH8361)
- Support for reading SAS xport files, see here
- Documentation comparing SAS to pandas, see here
- Removal of the automatic TimeSeries broadcasting, deprecated since 0.8.0, see here
- Display format with plain text can optionally align with Unicode East Asian Width, see here
- Compatibility with Python 3.5 (GH11097)
- Compatibility with matplotlib 1.5.0 (GH11111)
Check the API Changes and deprecations before updating.
What’s new in v0.17.0
- New features
  - Datetime with TZ
  - Releasing the GIL
  - Plot submethods
  - Additional methods for dt accessor
  - Period Frequency Enhancement
  - Support for SAS XPORT files
  - Support for Math Functions in .eval()
  - Changes to Excel with MultiIndex
  - Google BigQuery Enhancements
  - Display Alignment with Unicode East Asian Width
  - Other enhancements
- Backwards incompatible API changes
  - Changes to sorting API
  - Changes to to_datetime and to_timedelta
  - Changes to Index Comparisons
  - Changes to Boolean Comparisons vs. None
  - HDFStore dropna behavior
  - Changes to display.precision option
  - Changes to Categorical.unique
  - Changes to bool passed as header in Parsers
  - Other API Changes
- Deprecations
- Removal of prior version deprecations/changes
- Performance Improvements
- Bug Fixes
New features¶
Datetime with TZ¶
We are adding an implementation that natively supports datetime with timezones. A Series or a DataFrame column previously could be assigned a datetime with timezones, and would work as an object dtype. This had performance issues with a large number of rows. See the docs for more details. (GH8260, GH10763, GH11034).
The new implementation allows for having a single-timezone across all rows, with operations in a performant manner.
In [1]: df = DataFrame({'A' : date_range('20130101',periods=3),
...: 'B' : date_range('20130101',periods=3,tz='US/Eastern'),
...: 'C' : date_range('20130101',periods=3,tz='CET')})
...:
In [2]: df
Out[2]:
A B C
0 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00+01:00
1 2013-01-02 2013-01-02 00:00:00-05:00 2013-01-02 00:00:00+01:00
2 2013-01-03 2013-01-03 00:00:00-05:00 2013-01-03 00:00:00+01:00
In [3]: df.dtypes
Out[3]:
A datetime64[ns]
B datetime64[ns, US/Eastern]
C datetime64[ns, CET]
dtype: object
In [4]: df.B
Out[4]:
0 2013-01-01 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-03 00:00:00-05:00
Name: B, dtype: datetime64[ns, US/Eastern]
In [5]: df.B.dt.tz_localize(None)
Out[5]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
Name: B, dtype: datetime64[ns]
This uses a new-dtype representation as well, that is very similar in look-and-feel to its numpy cousin datetime64[ns]
In [6]: df['B'].dtype
Out[6]: datetime64[ns, US/Eastern]
In [7]: type(df['B'].dtype)
Out[7]: pandas.types.dtypes.DatetimeTZDtype
Note
There is a slightly different string repr for the underlying DatetimeIndex as a result of the dtype changes, but functionally these are the same.
Previous Behavior:
In [1]: pd.date_range('20130101',periods=3,tz='US/Eastern')
Out[1]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
'2013-01-03 00:00:00-05:00'],
dtype='datetime64[ns]', freq='D', tz='US/Eastern')
In [2]: pd.date_range('20130101',periods=3,tz='US/Eastern').dtype
Out[2]: dtype('<M8[ns]')
New Behavior:
In [8]: pd.date_range('20130101',periods=3,tz='US/Eastern')
Out[8]:
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
'2013-01-03 00:00:00-05:00'],
dtype='datetime64[ns, US/Eastern]', freq='D')
In [9]: pd.date_range('20130101',periods=3,tz='US/Eastern').dtype
Out[9]: datetime64[ns, US/Eastern]
Releasing the GIL¶
We are releasing the global-interpreter-lock (GIL) on some cython operations. This will allow other threads to run simultaneously during computation, potentially allowing performance improvements from multi-threading. Notably groupby, nsmallest, value_counts and some indexing operations benefit from this. (GH8882)
For example the groupby expression in the following code will have the GIL released during the factorization step, e.g. df.groupby('key') as well as the .sum() operation.
N = 1000000
ngroups = 10
df = DataFrame({'key' : np.random.randint(0,ngroups,size=N),
'data' : np.random.randn(N) })
df.groupby('key')['data'].sum()
Releasing the GIL could benefit an application that uses threads for user interactions (e.g. QT), or performing multi-threaded computations. A nice example of a library that can handle these types of computation-in-parallel is the dask library.
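A rough sketch of the kind of multi-threaded use that can now overlap; this is illustrative only (the helper name and sizes are hypothetical), and any speedup depends on the workload:

from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

N = 1000000
ngroups = 10
df = pd.DataFrame({'key': np.random.randint(0, ngroups, size=N),
                   'data': np.random.randn(N)})

def group_sum(frame):
    # The GIL is released inside the groupby factorization and the sum,
    # so these calls can overlap across threads.
    return frame.groupby('key')['data'].sum()

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(group_sum, [df, df]))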
Plot submethods¶
The Series and DataFrame .plot() method allows for customizing plot types by supplying the kind keyword arguments. Unfortunately, many of these kinds of plots use different required and optional keyword arguments, which makes it difficult to discover what any given plot kind uses out of the dozens of possible arguments.
To alleviate this issue, we have added a new, optional plotting interface, which exposes each kind of plot as a method of the .plot attribute. Instead of writing series.plot(kind=<kind>, ...), you can now also use series.plot.<kind>(...):
In [10]: df = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b'])
In [11]: df.plot.bar()
As a result of this change, these methods are now all discoverable via tab-completion:
In [12]: df.plot.<TAB>
df.plot.area df.plot.barh df.plot.density df.plot.hist df.plot.line df.plot.scatter
df.plot.bar df.plot.box df.plot.hexbin df.plot.kde df.plot.pie
Each method signature only includes relevant arguments. Currently, these are limited to required arguments, but in the future these will include optional arguments, as well. For an overview, see the new Plotting API documentation.
Additional methods for dt accessor¶
strftime¶
We are now supporting a Series.dt.strftime method for datetime-likes to generate a formatted string (GH10110). Examples:
# DatetimeIndex
In [13]: s = pd.Series(pd.date_range('20130101', periods=4))
In [14]: s
Out[14]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
dtype: datetime64[ns]
In [15]: s.dt.strftime('%Y/%m/%d')
Out[15]:
0 2013/01/01
1 2013/01/02
2 2013/01/03
3 2013/01/04
dtype: object
# PeriodIndex
In [16]: s = pd.Series(pd.period_range('20130101', periods=4))
In [17]: s
Out[17]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
dtype: object
In [18]: s.dt.strftime('%Y/%m/%d')
Out[18]:
0 2013/01/01
1 2013/01/02
2 2013/01/03
3 2013/01/04
dtype: object
The string format is the same as in the Python standard library; details can be found here
total_seconds¶
pd.Series of type timedelta64 has a new method .dt.total_seconds() returning the duration of the timedelta in seconds (GH10817)
# TimedeltaIndex
In [19]: s = pd.Series(pd.timedelta_range('1 minutes', periods=4))
In [20]: s
Out[20]:
0 0 days 00:01:00
1 1 days 00:01:00
2 2 days 00:01:00
3 3 days 00:01:00
dtype: timedelta64[ns]
In [21]: s.dt.total_seconds()
Out[21]:
0 60.0
1 86460.0
2 172860.0
3 259260.0
dtype: float64
Period Frequency Enhancement¶
Period, PeriodIndex and period_range can now accept multiplied freq. Also, Period.freq and PeriodIndex.freq are now stored as a DateOffset instance like DatetimeIndex, and not as str (GH7811)
A multiplied freq represents a span of corresponding length. The example below creates a period of 3 days. Addition and subtraction will shift the period by its span.
In [22]: p = pd.Period('2015-08-01', freq='3D')
In [23]: p
Out[23]: Period('2015-08-01', '3D')
In [24]: p + 1
Out[24]: Period('2015-08-04', '3D')
In [25]: p - 2
Out[25]: Period('2015-07-26', '3D')
In [26]: p.to_timestamp()
Out[26]: Timestamp('2015-08-01 00:00:00')
In [27]: p.to_timestamp(how='E')
Out[27]: Timestamp('2015-08-03 00:00:00')
You can use the multiplied freq in PeriodIndex and period_range.
In [28]: idx = pd.period_range('2015-08-01', periods=4, freq='2D')
In [29]: idx
Out[29]: PeriodIndex(['2015-08-01', '2015-08-03', '2015-08-05', '2015-08-07'], dtype='period[2D]', freq='2D')
In [30]: idx + 1
Out[30]: PeriodIndex(['2015-08-03', '2015-08-05', '2015-08-07', '2015-08-09'], dtype='period[2D]', freq='2D')
Support for SAS XPORT files¶
read_sas() provides support for reading SAS XPORT format files. (GH4052).
df = pd.read_sas('sas_xport.xpt')
It is also possible to obtain an iterator and read an XPORT file incrementally.
for df in pd.read_sas('sas_xport.xpt', chunksize=10000):
    do_something(df)
See the docs for more details.
Support for Math Functions in .eval()¶
eval() now supports calling math functions (GH4893)

df = pd.DataFrame({'a': np.random.randn(10)})
df.eval("b = sin(a)")

The supported math functions are sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2.
These functions map to the intrinsics for the NumExpr engine. For the Python engine, they are mapped to NumPy calls.
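A small sketch comparing the two engines on the example above; both should produce the same values (this assumes numexpr is installed for the default engine):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(10)})

via_numexpr = df.eval("sin(a)")                   # NumExpr intrinsic
via_python = df.eval("sin(a)", engine='python')   # falls back to np.sin

assert np.allclose(via_numexpr, via_python)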
Changes to Excel with MultiIndex¶
In version 0.16.2 a DataFrame with MultiIndex columns could not be written to Excel via to_excel. That functionality has been added (GH10564), along with updating read_excel so that the data can be read back with no loss of information by specifying which columns/rows make up the MultiIndex in the header and index_col parameters (GH4679)
See the documentation for more details.
In [31]: df = pd.DataFrame([[1,2,3,4], [5,6,7,8]],
....: columns = pd.MultiIndex.from_product([['foo','bar'],['a','b']],
....: names = ['col1', 'col2']),
....: index = pd.MultiIndex.from_product([['j'], ['l', 'k']],
....: names = ['i1', 'i2']))
....:
In [32]: df
Out[32]:
col1 foo bar
col2 a b a b
i1 i2
j l 1 2 3 4
k 5 6 7 8
In [33]: df.to_excel('test.xlsx')
In [34]: df = pd.read_excel('test.xlsx', header=[0,1], index_col=[0,1])
In [35]: df
Out[35]:
col1 foo bar
col2 a b a b
i1 i2
j l 1 2 3 4
k 5 6 7 8
Previously, it was necessary to specify the has_index_names argument in read_excel, if the serialized data had index names. For version 0.17.0 the output format of to_excel has been changed to make this keyword unnecessary; the old and new output layouts are compared in the original documentation.
Warning
Excel files saved in version 0.16.2 or prior that had index names will still be able to be read in, but the has_index_names argument must be specified as True.
Google BigQuery Enhancements¶
- Added ability to automatically create a table/dataset using the pandas.io.gbq.to_gbq() function if the destination table/dataset does not exist. (GH8325, GH11121).
- Added ability to replace an existing table and schema when calling the pandas.io.gbq.to_gbq() function via the if_exists argument. See the docs for more details (GH8325).
- InvalidColumnOrder and InvalidPageToken in the gbq module will raise ValueError instead of IOError.
- The generate_bq_schema() function is now deprecated and will be removed in a future version (GH11121)
- The gbq module will now support Python 3 (GH11094).
Display Alignment with Unicode East Asian Width¶
Warning
Enabling this option will affect the performance of printing DataFrame and Series (about 2 times slower). Use only when it is actually required.
Some East Asian countries use Unicode characters whose width corresponds to two Latin characters. If a DataFrame or Series contains these characters, the default output cannot be aligned properly. The following options are added to enable precise handling for these characters.
- display.unicode.east_asian_width: Whether to use the Unicode East Asian Width to calculate the display text width. (GH2612)
- display.unicode.ambiguous_as_wide: Whether to handle Unicode characters belonging to Ambiguous as Wide. (GH11102)
In [36]: df = pd.DataFrame({u'国籍': ['UK', u'日本'], u'名前': ['Alice', u'しのぶ']})
In [37]: df;
In [38]: pd.set_option('display.unicode.east_asian_width', True)
In [39]: df;
For further details, see here
Other enhancements¶
Support for
openpyxl
>= 2.2. The API for style support is now stable (GH10125)merge
now accepts the argumentindicator
which adds a Categorical-type column (by default called_merge
) to the output object that takes on the values (GH8790)Observation Origin _merge
valueMerge key only in 'left'
frameleft_only
Merge key only in 'right'
frameright_only
Merge key in both frames both
In [40]: df1 = pd.DataFrame({'col1':[0,1], 'col_left':['a','b']}) In [41]: df2 = pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]}) In [42]: pd.merge(df1, df2, on='col1', how='outer', indicator=True) Out[42]: col1 col_left col_right _merge 0 0 a NaN left_only 1 1 b 2.0 both 2 2 NaN 2.0 right_only 3 2 NaN 2.0 right_only
For more, see the updated docs
pd.to_numeric
is a new function to coerce strings to numbers (possibly with coercion) (GH11133)pd.merge
will now allow duplicate column names if they are not merged upon (GH10639).pd.pivot
will now allow passing index asNone
(GH3962).pd.concat
will now use existing Series names if provided (GH10698).In [43]: foo = pd.Series([1,2], name='foo') In [44]: bar = pd.Series([1,2]) In [45]: baz = pd.Series([4,5])
Previous Behavior:
In [1] pd.concat([foo, bar, baz], 1) Out[1]: 0 1 2 0 1 1 4 1 2 2 5
New Behavior:
In [46]: pd.concat([foo, bar, baz], 1) Out[46]: foo 0 1 0 1 1 4 1 2 2 5
DataFrame
has gained thenlargest
andnsmallest
methods (GH10393)Add a
limit_direction
keyword argument that works withlimit
to enableinterpolate
to fillNaN
values forward, backward, or both (GH9218, GH10420, GH11115)In [47]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13]) In [48]: ser.interpolate(limit=1, limit_direction='both') Out[48]: 0 NaN 1 5.0 2 5.0 3 7.0 4 NaN 5 11.0 6 13.0 dtype: float64
Added a
DataFrame.round
method to round the values to a variable number of decimal places (GH10568).In [49]: df = pd.DataFrame(np.random.random([3, 3]), columns=['A', 'B', 'C'], ....: index=['first', 'second', 'third']) ....: In [50]: df Out[50]: A B C first 0.342764 0.304121 0.417022 second 0.681301 0.875457 0.510422 third 0.669314 0.585937 0.624904 In [51]: df.round(2) Out[51]: A B C first 0.34 0.30 0.42 second 0.68 0.88 0.51 third 0.67 0.59 0.62 In [52]: df.round({'A': 0, 'C': 2}) Out[52]: A B C first 0.0 0.304121 0.42 second 1.0 0.875457 0.51 third 1.0 0.585937 0.62
drop_duplicates
andduplicated
now accept akeep
keyword to target first, last, and all duplicates. Thetake_last
keyword is deprecated, see here (GH6511, GH8505)In [53]: s = pd.Series(['A', 'B', 'C', 'A', 'B', 'D']) In [54]: s.drop_duplicates() Out[54]: 0 A 1 B 2 C 5 D dtype: object In [55]: s.drop_duplicates(keep='last') Out[55]: 2 C 3 A 4 B 5 D dtype: object In [56]: s.drop_duplicates(keep=False) Out[56]: 2 C 5 D dtype: object
Reindex now has a
tolerance
argument that allows for finer control of Limits on filling while reindexing (GH10411):In [57]: df = pd.DataFrame({'x': range(5), ....: 't': pd.date_range('2000-01-01', periods=5)}) ....: In [58]: df.reindex([0.1, 1.9, 3.5], ....: method='nearest', ....: tolerance=0.2) ....: Out[58]: t x 0.1 2000-01-01 0.0 1.9 2000-01-03 2.0 3.5 NaT NaN
When used on a
DatetimeIndex
,TimedeltaIndex
orPeriodIndex
,tolerance
will coerced into aTimedelta
if possible. This allows you to specify tolerance with a string:In [59]: df = df.set_index('t') In [60]: df.reindex(pd.to_datetime(['1999-12-31']), ....: method='nearest', ....: tolerance='1 day') ....: Out[60]: x 1999-12-31 0
- `tolerance` is also exposed by the lower level `Index.get_indexer` and `Index.get_loc` methods.

- Added functionality to use the `base` argument when resampling a `TimedeltaIndex` (GH10530)

- `DatetimeIndex` can be instantiated using strings containing `NaT` (GH7599)

- `to_datetime` can now accept the `yearfirst` keyword (GH7599)

- `pandas.tseries.offsets` larger than the `Day` offset can now be used with a `Series` for addition/subtraction (GH10699). See the docs for more details.

- `pd.Timedelta.total_seconds()` now returns the Timedelta duration to ns precision (previously microsecond precision) (GH10939)

- `PeriodIndex` now supports arithmetic with `np.ndarray` (GH10638)

- Support pickling of `Period` objects (GH10439)

- `.as_blocks` will now take a `copy` optional argument to return a copy of the data; the default is to copy (no change in behavior from prior versions) (GH9607)

- The `regex` argument to `DataFrame.filter` now handles numeric column names instead of raising `ValueError` (GH10384).

- Enable reading gzip compressed files via URL, either by explicitly setting the compression parameter or by inferring from the presence of the HTTP Content-Encoding header in the response (GH8685)

- Enable writing Excel files in memory using StringIO/BytesIO (GH7074)

- Enable serialization of lists and dicts to strings in `ExcelWriter` (GH8188)

- SQL io functions now accept a SQLAlchemy connectable. (GH7877)

- `pd.read_sql` and `to_sql` can accept a database URI as the `con` parameter (GH10214)

- `read_sql_table` will now allow reading from views (GH10750).

- Enable writing complex values to `HDFStores` when using the `table` format (GH10447)

- Enable `pd.read_hdf` to be used without specifying a key when the HDF file contains a single dataset (GH10443)

- `pd.read_stata` will now read Stata 118 type files. (GH9882)

- The `msgpack` submodule has been updated to 0.4.6 with backward compatibility (GH10581)

- `DataFrame.to_dict` now accepts the `orient='index'` keyword argument (GH10844).

- `DataFrame.apply` will return a Series of dicts if the passed function returns a dict and `reduce=True` (GH8735).

- Allow passing kwargs to the interpolation methods (GH10378).

- Improved error message when concatenating an empty iterable of `DataFrame` objects (GH9157)

- `pd.read_csv` can now read bz2-compressed files incrementally, and the C parser can read bz2-compressed files from AWS S3 (GH11070, GH11072).

- In `pd.read_csv`, recognize `s3n://` and `s3a://` URLs as designating S3 file storage (GH11070, GH11071).

- Read CSV files from AWS S3 incrementally, instead of first downloading the entire file. (Full file download still required for compressed files in Python 2.) (GH11070, GH11073)

- `pd.read_csv` is now able to infer the compression type for files read from AWS S3 storage (GH11070, GH11074); see the sketch after this list.
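A rough sketch of these I/O enhancements; the URL and bucket names below are hypothetical:

import pandas as pd

# gzip over a URL: set the compression explicitly (or rely on the
# HTTP Content-Encoding header, per GH8685)
df = pd.read_csv('https://example.com/data.csv.gz', compression='gzip')

# s3://, s3n:// and s3a:// URLs are recognized as S3 storage;
# compression can be inferred for S3 files (GH11074)
df = pd.read_csv('s3://my-bucket/data.csv')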
Backwards incompatible API changes¶
Changes to sorting API¶
The sorting API has had some longtime inconsistencies. (GH9816, GH8239).
Here is a summary of the API PRIOR to 0.17.0:
- `Series.sort` is INPLACE while `DataFrame.sort` returns a new object.
- `Series.order` returns a new object.
- It was possible to use `Series/DataFrame.sort_index` to sort by values by passing the `by` keyword.
- `Series/DataFrame.sortlevel` worked only on a `MultiIndex` for sorting by index.
To address these issues, we have revamped the API:
- We have introduced a new method, `DataFrame.sort_values()`, which is the merger of `DataFrame.sort()`, `Series.sort()`, and `Series.order()`, to handle sorting of values.
- The existing methods `Series.sort()`, `Series.order()`, and `DataFrame.sort()` have been deprecated and will be removed in a future version.
- The `by` argument of `DataFrame.sort_index()` has been deprecated and will be removed in a future version.
- The existing method `.sort_index()` will gain the `level` keyword to enable level sorting.
We now have two distinct and non-overlapping methods of sorting. A `*` marks items that will show a `FutureWarning`.
To sort by the values:
Previous | Replacement |
---|---|
* `Series.order()` | `Series.sort_values()` |
* `Series.sort()` | `Series.sort_values(inplace=True)` |
* `DataFrame.sort(columns=...)` | `DataFrame.sort_values(by=...)` |
To sort by the index:
Previous | Replacement |
---|---|
`Series.sort_index()` | `Series.sort_index()` |
`Series.sortlevel(level=...)` | `Series.sort_index(level=...)` |
`DataFrame.sort_index()` | `DataFrame.sort_index()` |
`DataFrame.sortlevel(level=...)` | `DataFrame.sort_index(level=...)` |
* `DataFrame.sort()` | `DataFrame.sort_index()` |
We have also deprecated and changed similar methods in two Series-like classes, `Index` and `Categorical`.
Previous | Replacement |
---|---|
* `Index.order()` | `Index.sort_values()` |
* `Categorical.order()` | `Categorical.sort_values()` |
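A minimal before/after sketch of the new value-sorting API:

import pandas as pd

s = pd.Series([3, 1, 2])
s.sort_values()              # replaces the deprecated s.order()
s.sort_values(inplace=True)  # replaces the deprecated in-place s.sort()

df = pd.DataFrame({'a': [2, 1], 'b': [1, 2]})
df.sort_values(by='a')       # replaces the deprecated df.sort(columns='a')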
Changes to to_datetime and to_timedelta¶
Error handling¶
The default for `pd.to_datetime` error handling has changed to `errors='raise'`.
In prior versions it was `errors='ignore'`. Furthermore, the `coerce` argument
has been deprecated in favor of `errors='coerce'`. This means that invalid parsing
will raise rather than return the original input as in previous versions. (GH10636)
Previous Behavior:
In [2]: pd.to_datetime(['2009-07-31', 'asd'])
Out[2]: array(['2009-07-31', 'asd'], dtype=object)
New Behavior:
In [3]: pd.to_datetime(['2009-07-31', 'asd'])
ValueError: Unknown string format
Of course you can coerce this as well.
In [61]: to_datetime(['2009-07-31', 'asd'], errors='coerce')
Out[61]: DatetimeIndex(['2009-07-31', 'NaT'], dtype='datetime64[ns]', freq=None)
To keep the previous behavior, you can use errors='ignore'
:
In [62]: to_datetime(['2009-07-31', 'asd'], errors='ignore')
Out[62]: array(['2009-07-31', 'asd'], dtype=object)
Furthermore, `pd.to_timedelta` has gained a similar API, of `errors='raise'|'ignore'|'coerce'`, and the `coerce` keyword has been deprecated in favor of `errors='coerce'`.
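For instance, a sketch of the shared `errors` keyword on `pd.to_timedelta`:

import pandas as pd

pd.to_timedelta(['1 day', 'garbage'], errors='coerce')  # invalid input becomes NaT
pd.to_timedelta(['1 day', 'garbage'], errors='ignore')  # returns the input unchanged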
Consistent Parsing¶
The string parsing of `to_datetime`, `Timestamp` and `DatetimeIndex` has been made consistent. (GH7599)
Prior to v0.17.0, `Timestamp` and `to_datetime` could incorrectly parse a year-only datetime string using today's date, whereas `DatetimeIndex` used the beginning of the year. `Timestamp` and `to_datetime` could also raise a `ValueError` for some kinds of datetime strings that `DatetimeIndex` can parse, such as a quarterly string.
Previous Behavior:
In [1]: Timestamp('2012Q2')
Traceback
...
ValueError: Unable to parse 2012Q2
# Results in today's date.
In [2]: Timestamp('2014')
Out [2]: 2014-08-12 00:00:00
v0.17.0 can parse them as shown below. This works on `DatetimeIndex` as well.
New Behavior:
In [63]: Timestamp('2012Q2')
Out[63]: Timestamp('2012-04-01 00:00:00')
In [64]: Timestamp('2014')
Out[64]: Timestamp('2014-01-01 00:00:00')
In [65]: DatetimeIndex(['2012Q2', '2014'])
Out[65]: DatetimeIndex(['2012-04-01', '2014-01-01'], dtype='datetime64[ns]', freq=None)
Note
If you want to perform calculations based on today's date, use `Timestamp.now()` and `pandas.tseries.offsets`.
In [66]: import pandas.tseries.offsets as offsets
In [67]: Timestamp.now()
Out[67]: Timestamp('2016-10-02 16:23:30.154237')
In [68]: Timestamp.now() + offsets.DateOffset(years=1)
Out[68]: Timestamp('2017-10-02 16:23:30.166584')
Changes to Index Comparisons¶
The equality operator on `Index` should behave similarly to `Series` (GH9947, GH10637).
Starting in v0.17.0, comparing `Index` objects of different lengths will raise
a `ValueError`. This is to be consistent with the behavior of `Series`.
Previous Behavior:
In [2]: pd.Index([1, 2, 3]) == pd.Index([1, 4, 5])
Out[2]: array([ True, False, False], dtype=bool)
In [3]: pd.Index([1, 2, 3]) == pd.Index([2])
Out[3]: array([False, True, False], dtype=bool)
In [4]: pd.Index([1, 2, 3]) == pd.Index([1, 2])
Out[4]: False
New Behavior:
In [8]: pd.Index([1, 2, 3]) == pd.Index([1, 4, 5])
Out[8]: array([ True, False, False], dtype=bool)
In [9]: pd.Index([1, 2, 3]) == pd.Index([2])
ValueError: Lengths must match to compare
In [10]: pd.Index([1, 2, 3]) == pd.Index([1, 2])
ValueError: Lengths must match to compare
Note that this is different from the numpy
behavior where a comparison can
be broadcast:
In [69]: np.array([1, 2, 3]) == np.array([1])
Out[69]: array([ True, False, False], dtype=bool)
or it can return False if broadcasting can not be done:
In [70]: np.array([1, 2, 3]) == np.array([1, 2])
Out[70]: False
Changes to Boolean Comparisons vs. None¶
Boolean comparisons of a `Series` vs `None` will now be equivalent to comparing with `np.nan`, rather than raising `TypeError`. (GH1079)
In [71]: s = Series(range(3))
In [72]: s.iloc[1] = None
In [73]: s
Out[73]:
0 0.0
1 NaN
2 2.0
dtype: float64
Previous Behavior:
In [5]: s==None
TypeError: Could not compare <type 'NoneType'> type with Series
New Behavior:
In [74]: s==None
Out[74]:
0 False
1 False
2 False
dtype: bool
Usually you simply want to know which values are null.
In [75]: s.isnull()
Out[75]:
0 False
1 True
2 False
dtype: bool
Warning
You generally will want to use `isnull/notnull` for these types of comparisons, as `isnull/notnull` tells you which elements are null. One has to be mindful that `nan`'s don't compare equal, but `None`'s do. Note that pandas/numpy uses the fact that `np.nan != np.nan`, and treats `None` like `np.nan`.
In [76]: None == None
Out[76]: True
In [77]: np.nan == np.nan
Out[77]: False
HDFStore dropna behavior¶
The default behavior for HDFStore write functions with `format='table'` is now to keep rows that are all missing. Previously, the behavior was to drop rows that were all missing save the index. The previous behavior can be replicated using the `dropna=True` option. (GH9382)
Previous Behavior:
In [78]: df_with_missing = pd.DataFrame({'col1':[0, np.nan, 2],
....: 'col2':[1, np.nan, np.nan]})
....:
In [79]: df_with_missing
Out[79]:
col1 col2
0 0.0 1.0
1 NaN NaN
2 2.0 NaN
In [27]: df_with_missing.to_hdf('file.h5',
   ....:                        'df_with_missing',
   ....:                        format='table',
   ....:                        mode='w')
In [28]: pd.read_hdf('file.h5', 'df_with_missing')
Out[28]:
col1 col2
0 0 1
2 2 NaN
New Behavior:
In [80]: df_with_missing.to_hdf('file.h5',
....: 'df_with_missing',
....: format='table',
....: mode='w')
....:
In [81]: pd.read_hdf('file.h5', 'df_with_missing')
Out[81]:
col1 col2
0 0.0 1.0
1 NaN NaN
2 2.0 NaN
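To replicate the previous row-dropping behavior, pass `dropna=True` when writing; a sketch reusing `df_with_missing` from the example above:

df_with_missing.to_hdf('file.h5', 'df_with_missing',
                       format='table', mode='w', dropna=True)
pd.read_hdf('file.h5', 'df_with_missing')  # the all-NaN row is now dropped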
See the docs for more details.
Changes to `display.precision` option¶
The display.precision
option has been clarified to refer to decimal places (GH10451).
Earlier versions of pandas would format floating point numbers to have one less decimal place than the value in `display.precision`.
In [1]: pd.set_option('display.precision', 2)
In [2]: pd.DataFrame({'x': [123.456789]})
Out[2]:
x
0 123.5
If interpreting precision as “significant figures” this did work for scientific notation but that same interpretation did not work for values with standard formatting. It was also out of step with how numpy handles formatting.
Going forward the value of `display.precision` will directly control the number of places after the decimal, for regular formatting as well as scientific notation, similar to how numpy's `precision` print option works.
In [82]: pd.set_option('display.precision', 2)
In [83]: pd.DataFrame({'x': [123.456789]})
Out[83]:
x
0 123.46
To preserve output behavior with prior versions the default value of `display.precision` has been reduced to `6` from `7`.
Changes to `Categorical.unique`¶
`Categorical.unique` now returns new `Categoricals` with `categories` and `codes` that are unique, rather than returning `np.array` (GH10508)
- unordered category: values and categories are sorted by appearance order.
- ordered category: values are sorted by appearance order, categories keep existing order.
In [84]: cat = pd.Categorical(['C', 'A', 'B', 'C'],
....: categories=['A', 'B', 'C'],
....: ordered=True)
....:
In [85]: cat
Out[85]:
[C, A, B, C]
Categories (3, object): [A < B < C]
In [86]: cat.unique()
Out[86]:
[C, A, B]
Categories (3, object): [A < B < C]
In [87]: cat = pd.Categorical(['C', 'A', 'B', 'C'],
....: categories=['A', 'B', 'C'])
....:
In [88]: cat
Out[88]:
[C, A, B, C]
Categories (3, object): [A, B, C]
In [89]: cat.unique()
Out[89]:
[C, A, B]
Categories (3, object): [C, A, B]
Changes to `bool` passed as `header` in Parsers¶
In earlier versions of pandas, if a bool was passed to the `header` argument of
`read_csv`, `read_excel`, or `read_html` it was implicitly converted to
an integer, resulting in `header=0` for `False` and `header=1` for `True`
(GH6113)

A `bool` input to `header` will now raise a `TypeError`:
In [29]: df = pd.read_csv('data.csv', header=False)
TypeError: Passing a bool to header is invalid. Use header=None for no header or
header=int or list-like of ints to specify the row(s) making up the column names
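The valid replacements, sketched ('data.csv' is a hypothetical file):

import pandas as pd

pd.read_csv('data.csv', header=None)  # no header row in the file
pd.read_csv('data.csv', header=0)     # first row holds the column names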
Other API Changes¶
- Line and kde plots with `subplots=True` now use default colors, not all black. Specify `color='k'` to draw all lines in black (GH9894)

- Calling the `.value_counts()` method on a Series with a `categorical` dtype now returns a Series with a `CategoricalIndex` (GH10704)

- The metadata properties of subclasses of pandas objects will now be serialized (GH10553).

- `groupby` using `Categorical` follows the same rule as `Categorical.unique` described above (GH10508)

- Constructing a `DataFrame` with an array of `complex64` dtype previously meant the corresponding column was automatically promoted to the `complex128` dtype. Pandas will now preserve the itemsize of the input for complex data (GH10952)

- Some numeric reduction operators would return `ValueError`, rather than `TypeError`, on object types that include strings and numbers (GH11131)

- Passing the currently unsupported `chunksize` argument to `read_excel` or `ExcelFile.parse` will now raise `NotImplementedError` (GH8011)

- Allow an `ExcelFile` object to be passed into `read_excel` (GH11198)

- `DatetimeIndex.union` does not infer `freq` if `self` and the input have `None` as `freq` (GH11086)

- `NaT`'s methods now either raise `ValueError`, or return `np.nan` or `NaT` (GH9513); see the sketch after this table.

Behavior | Methods |
---|---|
return `np.nan` | `weekday`, `isoweekday` |
return `NaT` | `date`, `now`, `replace`, `to_datetime`, `today` |
return `np.datetime64('NaT')` | `to_datetime64` (unchanged) |
raise `ValueError` | All other public methods (names not beginning with underscores) |
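A short sketch illustrating the table above:

import pandas as pd

pd.NaT.weekday()        # nan
pd.NaT.date()           # NaT
pd.NaT.to_datetime64()  # numpy.datetime64('NaT')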
Deprecations¶
For `Series` the following indexing functions are deprecated (GH10177).

Deprecated Function | Replacement |
---|---|
`.irow(i)` | `.iloc[i]` or `.iat[i]` |
`.iget(i)` | `.iloc[i]` or `.iat[i]` |
`.iget_value(i)` | `.iloc[i]` or `.iat[i]` |

For `DataFrame` the following indexing functions are deprecated (GH10177).

Deprecated Function | Replacement |
---|---|
`.irow(i)` | `.iloc[i]` |
`.iget_value(i, j)` | `.iloc[i, j]` or `.iat[i, j]` |
`.icol(j)` | `.iloc[:, j]` |
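For example, the positional replacements sketched:

import pandas as pd

s = pd.Series([10, 20, 30])
s.iloc[1]      # replaces s.irow(1), s.iget(1) and s.iget_value(1)
s.iat[1]       # fast scalar equivalent

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.iloc[0, 1]  # replaces df.iget_value(0, 1)
df.iloc[:, 0]  # replaces df.icol(0)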
Note

These indexing functions have been deprecated in the documentation since 0.11.0.
- `Categorical.name` was deprecated to make `Categorical` more `numpy.ndarray` like. Use `Series(cat, name="whatever")` instead (GH10482).

- Setting missing values (NaN) in a `Categorical`'s `categories` will issue a warning (GH10748). You can still have missing values in the `values`.

- `drop_duplicates` and `duplicated`'s `take_last` keyword was deprecated in favor of `keep`. (GH6511, GH8505)

- `Series.nsmallest` and `nlargest`'s `take_last` keyword was deprecated in favor of `keep`. (GH10792)

- `DataFrame.combineAdd` and `DataFrame.combineMult` are deprecated. They can easily be replaced by using the `add` and `mul` methods: `DataFrame.add(other, fill_value=0)` and `DataFrame.mul(other, fill_value=1.)` (GH10735); see the sketch after this list.

- `TimeSeries` deprecated in favor of `Series` (note that this has been an alias since 0.13.0), (GH10890)

- `SparsePanel` deprecated and will be removed in a future version (GH11157).

- `Series.is_time_series` deprecated in favor of `Series.index.is_all_dates` (GH11135)

- Legacy offsets (like `'A@JAN'`) are deprecated (note that this has been an alias since 0.8.0) (GH10878)

- `WidePanel` deprecated in favor of `Panel`, `LongPanel` in favor of `DataFrame` (note these have been aliases since < 0.11.0), (GH10892)

- `DataFrame.convert_objects` has been deprecated in favor of the type-specific functions `pd.to_datetime`, `pd.to_timestamp` and `pd.to_numeric` (new in 0.17.0) (GH11133).
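The `combineAdd`/`combineMult` replacement named above, sketched:

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [10, 20], 'b': [1, 1]})

df1.add(df2, fill_value=0)   # replaces df1.combineAdd(df2)
df1.mul(df2, fill_value=1.)  # replaces df1.combineMult(df2)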
Removal of prior version deprecations/changes¶
- Removal of `na_last` parameters from `Series.order()` and `Series.sort()`, in favor of `na_position`. (GH5231)

- Removal of `percentile_width` from `.describe()`, in favor of `percentiles`. (GH7088)

- Removal of `colSpace` parameter from `DataFrame.to_string()`, in favor of `col_space`, circa version 0.8.0.

- Removal of automatic time-series broadcasting (GH2304)
In [90]: np.random.seed(1234)

In [91]: df = DataFrame(np.random.randn(5,2),columns=list('AB'),index=date_range('20130101',periods=5))

In [92]: df
Out[92]:
                   A         B
2013-01-01  0.471435 -1.190976
2013-01-02  1.432707 -0.312652
2013-01-03 -0.720589  0.887163
2013-01-04  0.859588 -0.636524
2013-01-05  0.015696 -2.242685

Previously

In [3]: df + df.A
FutureWarning: TimeSeries broadcasting along DataFrame index by default is deprecated.
Please use DataFrame.<op> to explicitly broadcast arithmetic operations along the index

Out[3]:
                   A         B
2013-01-01  0.942870 -0.719541
2013-01-02  2.865414  1.120055
2013-01-03 -1.441177  0.166574
2013-01-04  1.719177  0.223065
2013-01-05  0.031393 -2.226989

Current

In [93]: df.add(df.A,axis='index')
Out[93]:
                   A         B
2013-01-01  0.942870 -0.719541
2013-01-02  2.865414  1.120055
2013-01-03 -1.441177  0.166574
2013-01-04  1.719177  0.223065
2013-01-05  0.031393 -2.226989
- Remove `table` keyword in `HDFStore.put/append`, in favor of using `format=` (GH4645)

- Remove `kind` in `read_excel/ExcelFile` as it is unused (GH4712)

- Remove `infer_type` keyword from `pd.read_html` as it is unused (GH4770, GH7032)

- Remove `offset` and `timeRule` keywords from `Series.tshift/shift`, in favor of `freq` (GH4853, GH4864)

- Remove `pd.load/pd.save` aliases in favor of `pd.to_pickle/pd.read_pickle` (GH3787); see the sketch after this list.
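The pickle replacement in the last item, sketched ('frame.pkl' is a hypothetical path):

import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
df.to_pickle('frame.pkl')          # replaces pd.save(df, 'frame.pkl')
df = pd.read_pickle('frame.pkl')   # replaces pd.load('frame.pkl')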
Performance Improvements¶
- Development support for benchmarking with the Air Speed Velocity library (GH8361)
- Added vbench benchmarks for alternative ExcelWriter engines and reading Excel files (GH7171)
- Performance improvements in `Categorical.value_counts` (GH10804)
- Performance improvements in `SeriesGroupBy.nunique`, `SeriesGroupBy.value_counts` and `SeriesGroupBy.transform` (GH10820, GH11077)
- Performance improvements in `DataFrame.drop_duplicates` with integer dtypes (GH10917)
- Performance improvements in `DataFrame.duplicated` with wide frames. (GH10161, GH11180)
- 4x improvement in `timedelta` string parsing (GH6755, GH10426)
- 8x improvement in `timedelta64` and `datetime64` ops (GH6755)
- Significantly improved performance of indexing `MultiIndex` with slicers (GH10287)
- 8x improvement in `iloc` using list-like input (GH10791)
- Improved performance of `Series.isin` for datetimelike/integer Series (GH10287)
- 20x improvement in `concat` of Categoricals when categories are identical (GH10587)
- Improved performance of `to_datetime` when the specified format string is ISO8601 (GH10178)
- 2x improvement of `Series.value_counts` for float dtype (GH10821)
- Enable `infer_datetime_format` in `to_datetime` when date components do not have 0 padding (GH11142)
- Regression from 0.16.1 in constructing `DataFrame` from a nested dictionary (GH11084)
- Performance improvements in addition/subtraction operations for `DateOffset` with `Series` or `DatetimeIndex` (GH10744, GH11205)
Bug Fixes¶
- Bug in incorrect computation of `.mean()` on `timedelta64[ns]` because of overflow (GH9442)
- Bug in `.isin` on older numpies (GH11232)
- Bug in `DataFrame.to_html(index=False)` rendering an unnecessary `name` row (GH10344)
- Bug in `DataFrame.to_latex()` where the `column_format` argument could not be passed (GH9402)
- Bug in `DatetimeIndex` when localizing with `NaT` (GH10477)
- Bug in `Series.dt` ops in preserving meta-data (GH10477)
- Bug in preserving `NaT` when passed in an otherwise invalid `to_datetime` construction (GH10477)
- Bug in `DataFrame.apply` when the function returns a categorical series. (GH9573)
- Bug in `to_datetime` with invalid dates and formats supplied (GH10154)
- Bug in `Index.drop_duplicates` dropping name(s) (GH10115)
- Bug in `Series.quantile` dropping name (GH10881)
- Bug in `pd.Series` when setting a value on an empty `Series` whose index has a frequency. (GH10193)
- Bug in `pd.Series.interpolate` with invalid `order` keyword values. (GH10633)
- Bug in `DataFrame.plot` raising `ValueError` when a color name is specified by multiple characters (GH10387)
- Bug in `Index` construction with a mixed list of tuples (GH10697)
- Bug in `DataFrame.reset_index` when the index contains `NaT`. (GH10388)
- Bug in `ExcelReader` when a worksheet is empty (GH6403)
- Bug in `BinGrouper.group_info` where returned values are not compatible with base class (GH10914)
- Bug in clearing the cache on `DataFrame.pop` and a subsequent inplace op (GH10912)
- Bug in indexing with a mixed-integer `Index` causing an `ImportError` (GH10610)
- Bug in `Series.count` when the index has nulls (GH10946)
- Bug in pickling of a non-regular freq `DatetimeIndex` (GH11002)
- Bug causing `DataFrame.where` to not respect the `axis` parameter when the frame has a symmetric shape. (GH9736)
- Bug in `Table.select_column` where the name is not preserved (GH10392)
- Bug in `offsets.generate_range` where `start` and `end` have finer precision than `offset` (GH9907)
- Bug in `pd.rolling_*` where `Series.name` would be lost in the output (GH10565)
- Bug in `stack` when index or columns are not unique. (GH10417)
- Bug in setting a `Panel` when an axis has a multi-index (GH10360)
- Bug in `USFederalHolidayCalendar` where `USMemorialDay` and `USMartinLutherKingJr` were incorrect (GH10278 and GH9760)
- Bug in `.sample()` where the returned object, if set, gives an unnecessary `SettingWithCopyWarning` (GH10738)
- Bug in `.sample()` where weights passed as `Series` were not aligned along the axis before being treated positionally, potentially causing problems if the weight indices were not aligned with the sampled object. (GH10738)
- Regression fixed in (GH9311, GH6620, GH9345), where groupby with a datetime-like was converting to float with certain aggregators (GH10979)
- Bug in `DataFrame.interpolate` with `axis=1` and `inplace=True` (GH10395)
- Bug in `io.sql.get_schema` when specifying multiple columns as the primary key (GH10385).
- Bug in `groupby(sort=False)` with a datetime-like `Categorical` raising `ValueError` (GH10505)
- Bug in `groupby(axis=1)` with `filter()` throwing `IndexError` (GH11041)
- Bug in `test_categorical` on big-endian builds (GH10425)
- Bug in `Series.shift` and `DataFrame.shift` not supporting categorical data (GH9416)
- Bug in `Series.map` using a categorical `Series` raising `AttributeError` (GH10324)
- Bug in `MultiIndex.get_level_values` including `Categorical` raising `AttributeError` (GH10460)
- Bug in `pd.get_dummies` with `sparse=True` not returning `SparseDataFrame` (GH10531)
- Bug in `Index` subtypes (such as `PeriodIndex`) not returning their own type for `.drop` and `.insert` methods (GH10620)
- Bug in `algos.outer_join_indexer` when the `right` array is empty (GH10618)
- Bug in `filter` (regression from 0.16.0) and `transform` when grouping on multiple keys, one of which is datetime-like (GH10114)
- Bug in `to_datetime` and `to_timedelta` causing `Index` name to be lost (GH10875)
- Bug in `len(DataFrame.groupby)` causing `IndexError` when there's a column containing only NaNs (GH11016)
- Bug that caused a segfault when resampling an empty Series (GH10228)
- Bug in `DatetimeIndex` and `PeriodIndex.value_counts` resetting the name from its result, but retaining it in the result's `Index`. (GH10150)
- Bug in `pd.eval` using the `numexpr` engine coercing a 1 element numpy array to a scalar (GH10546)
- Bug in `pd.concat` with `axis=0` when a column is of dtype `category` (GH10177)
- Bug in `read_msgpack` where the input type is not always checked (GH10369, GH10630)
- Bug in `pd.read_csv` with the kwargs `index_col=False`, `index_col=['a', 'b']` or `dtype` (GH10413, GH10467, GH10577)
- Bug in `Series.from_csv` with the `header` kwarg not setting the `Series.name` or the `Series.index.name` (GH10483)
- Bug in `groupby.var` which caused variance to be inaccurate for small float values (GH10448)
- Bug in `Series.plot(kind='hist')` where the Y Label was not informative (GH10485)
- Bug in `read_csv` when using a converter which generates a `uint8` type (GH9266)
- Bug causing a memory leak in time-series line and area plots (GH9003)
- Bug when setting a `Panel` sliced along the major or minor axes when the right-hand side is a `DataFrame` (GH11014)
- Bug that returns `None` and does not raise `NotImplementedError` when operator functions (e.g. `.add`) of `Panel` are not implemented (GH7692)
- Bug in line and kde plots that cannot accept multiple colors when `subplots=True` (GH9894)
- Bug in `DataFrame.plot` raising `ValueError` when a color name is specified by multiple characters (GH10387)
- Bug in left and right `align` of `Series` with `MultiIndex` may be inverted (GH10665)
- Bug in left and right `join` with `MultiIndex` may be inverted (GH10741)
- Bug in `read_stata` when reading a file with a different order set in `columns` (GH10757)
- Bug in `Categorical` not representing properly when a category contains `tz` or `Period` (GH10713)
- Bug in `Categorical.__iter__` not returning correct `datetime` and `Period` (GH10713)
- Bug in indexing with a `PeriodIndex` on an object with a `PeriodIndex` (GH4125)
- Bug in `read_csv` with `engine='c'`: EOF preceded by a comment, blank line, etc. was not handled correctly (GH10728, GH10548)
- Reading "famafrench" data via `DataReader` results in an HTTP 404 error because the website url changed (GH10591).
- Bug in `read_msgpack` where the DataFrame to decode has duplicate column names (GH9618)
- Bug in `io.common.get_filepath_or_buffer` which caused reading of valid S3 files to fail if the bucket also contained keys for which the user does not have read permission (GH10604)
- Bug in vectorised setting of timestamp columns with python `datetime.date` and numpy `datetime64` (GH10408, GH10412)
- Bug in `Index.take` that may add an unnecessary `freq` attribute (GH10791)
- Bug in `merge` with an empty `DataFrame` that may raise `IndexError` (GH10824)
- Bug in `to_latex` raising an unexpected keyword argument error for some documented arguments (GH10888)
- Bug in indexing of a large `DataFrame` where `IndexError` is uncaught (GH10645 and GH10692)
- Bug in `read_csv` when using the `nrows` or `chunksize` parameters if the file contains only a header line (GH9535)
- Bug in serialization of `category` types in HDF5 in the presence of alternate encodings. (GH10366)
- Bug in `pd.DataFrame` when constructing an empty DataFrame with a string dtype (GH9428)
- Bug in `pd.DataFrame.diff` when the DataFrame is not consolidated (GH10907)
- Bug in `pd.unique` for arrays with the `datetime64` or `timedelta64` dtype that meant an array with object dtype was returned instead of the original dtype (GH9431)
- Bug in `Timedelta` raising an error when slicing from 0s (GH10583)
- Bug in `DatetimeIndex.take` and `TimedeltaIndex.take` that may not raise `IndexError` against an invalid index (GH10295)
- Bug in `Series([np.nan]).astype('M8[ms]')`, which now returns `Series([pd.NaT])` (GH10747)
- Bug in `PeriodIndex.order` resetting freq (GH10295)
- Bug in `date_range` when `freq` divides `end` as nanos (GH10885)
- Bug in `iloc` allowing memory outside the bounds of a Series to be accessed with negative integers (GH10779)
- Bug in `read_msgpack` where encoding is not respected (GH10581)
- Bug preventing access to the first index when using `iloc` with a list containing the appropriate negative integer (GH10547, GH10779)
- Bug in the `TimedeltaIndex` formatter causing an error while trying to save a `DataFrame` with a `TimedeltaIndex` using `to_csv` (GH10833)
- Bug in `DataFrame.where` when handling Series slicing (GH10218, GH9558)
- Bug where `pd.read_gbq` throws `ValueError` when BigQuery returns zero rows (GH10273)
- Bug in `to_json` which was causing a segmentation fault when serializing a 0-rank ndarray (GH9576)
- Bug in plotting functions that may raise `IndexError` when plotted on `GridSpec` (GH10819)
- Bug in plot results that may show unnecessary minor ticklabels (GH10657)
- Bug in `groupby` with incorrect computation for aggregation on a `DataFrame` with `NaT` (e.g. `first`, `last`, `min`). (GH10590, GH11010)
- Bug when constructing a `DataFrame` where passing a dictionary with only scalar values and specifying columns did not raise an error (GH10856)
- Bug in `.var()` causing roundoff errors for highly similar values (GH10242)
- Bug in `DataFrame.plot(subplots=True)` with duplicated columns outputting incorrect results (GH10962)
- Bug in `Index` arithmetic that may result in an incorrect class (GH10638)
- Bug in `date_range` resulting in empty output if freq is negative annually, quarterly or monthly (GH11018)
- Bug in `DatetimeIndex` being unable to infer a negative freq (GH11018)
- Remove use of some deprecated numpy comparison operations, mainly in tests. (GH10569)
- Bug in `Index` dtype that may not be applied properly (GH11017)
- Bug in `io.gbq` when testing for the minimum google api client version (GH10652)
- Bug in `DataFrame` construction from a nested `dict` with `timedelta` keys (GH11129)
- Bug in `.fillna` which may raise `TypeError` when data contains datetime dtype (GH7095, GH11153)
- Bug in `.groupby` when the number of keys to group by is the same as the length of the index (GH11185)
- Bug in `convert_objects` where converted values might not be returned if all null and `coerce` (GH9589)
- Bug in `convert_objects` where the `copy` keyword was not respected (GH9589)
v0.16.2 (June 12, 2015)¶
This is a minor bug-fix release from 0.16.1 and includes a large number of
bug fixes along with some new features (the `pipe()` method), enhancements, and performance improvements.
We recommend that all users upgrade to this version.
Highlights include:
What’s new in v0.16.2
New features¶
Pipe¶
We've introduced a new method `DataFrame.pipe()`. As suggested by the name, `pipe` should be used to pipe data through a chain of function calls.
The goal is to avoid confusing nested function calls like
# df is a DataFrame
# f, g, and h are functions that take and return DataFrames
f(g(h(df), arg1=1), arg2=2, arg3=3)
The logic flows from inside out, and function names are separated from their keyword arguments. This can be rewritten as
(df.pipe(h)
.pipe(g, arg1=1)
.pipe(f, arg2=2, arg3=3)
)
Now both the code and the logic flow from top to bottom. Keyword arguments are next to their functions. Overall the code is much more readable.
In the example above, the functions `f`, `g`, and `h` each expected the DataFrame as the first positional argument.
When the function you wish to apply takes its data anywhere other than the first argument, pass a tuple
of `(function, keyword)` indicating where the DataFrame should flow. For example:
In [1]: import statsmodels.formula.api as sm
In [2]: bb = pd.read_csv('data/baseball.csv', index_col='id')
# sm.poisson takes (formula, data)
In [3]: (bb.query('h > 0')
...: .assign(ln_h = lambda df: np.log(df.h))
...: .pipe((sm.poisson, 'data'), 'hr ~ ln_h + year + g + C(lg)')
...: .fit()
...: .summary()
...: )
...:
Optimization terminated successfully.
Current function value: 2.116284
Iterations 24
Out[3]:
<class 'statsmodels.iolib.summary.Summary'>
"""
Poisson Regression Results
==============================================================================
Dep. Variable: hr No. Observations: 68
Model: Poisson Df Residuals: 63
Method: MLE Df Model: 4
Date:                Sun, 02 Oct 2016   Pseudo R-squ.:                  0.6878
Time: 16:23:30 Log-Likelihood: -143.91
converged: True LL-Null: -460.91
LLR p-value: 6.774e-136
===============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept -1267.3636 457.867 -2.768 0.006 -2164.767 -369.960
C(lg)[T.NL] -0.2057 0.101 -2.044 0.041 -0.403 -0.008
ln_h 0.9280 0.191 4.866 0.000 0.554 1.302
year 0.6301 0.228 2.762 0.006 0.183 1.077
g 0.0099 0.004 2.754 0.006 0.003 0.017
===============================================================================
"""
The pipe method is inspired by unix pipes, which stream text through
processes. More recently dplyr and magrittr have introduced the
popular (%>%)
pipe operator for R.
See the documentation for more. (GH10129)
Other Enhancements¶
- Added `rsplit` to Index/Series StringMethods (GH10303)

- Removed the hard-coded size limits on the `DataFrame` HTML representation in the IPython notebook, and leave this to IPython itself (only for IPython v3.0 or greater). This eliminates the duplicate scroll bars that appeared in the notebook with large frames (GH10231).

  Note that the notebook has a `toggle output scrolling` feature to limit the display of very large frames (by clicking left of the output). You can also configure the way DataFrames are displayed using the pandas options, see here.

- The `axis` parameter of `DataFrame.quantile` now also accepts `index` and `columns`. (GH9543) See the sketch below.
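A sketch of the new axis aliases on `DataFrame.quantile`:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df.quantile(0.5, axis='index')    # equivalent to axis=0
df.quantile(0.5, axis='columns')  # equivalent to axis=1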
API Changes¶
- `Holiday` now raises `NotImplementedError` if both `offset` and `observance` are used in the constructor, instead of returning an incorrect result (GH10217). See the sketch below.
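A sketch of the now-forbidden combination (the holiday name is hypothetical):

from pandas.tseries.holiday import Holiday, nearest_workday
from pandas.tseries.offsets import Day

# supplying both offset and observance now raises NotImplementedError
Holiday('Hypothetical Day', month=1, day=1,
        offset=Day(1), observance=nearest_workday)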
Performance Improvements¶
Bug Fixes¶
- Bug in `Series.hist` raising an error when a one-row `Series` was given (GH10214)
- Bug where `HDFStore.select` modifies the passed columns list (GH7212)
- Bug in `Categorical` repr with `display.width` of `None` in Python 3 (GH10087)
- Bug in `to_json` with certain orients and a `CategoricalIndex` would segfault (GH10317)
- Bug where some of the nan funcs do not have consistent return dtypes (GH10251)
- Bug in `DataFrame.quantile` on checking that a valid axis was passed (GH9543)
- Bug in `groupby.apply` aggregation for `Categorical` not preserving categories (GH10138)
- Bug in `to_csv` where `date_format` is ignored if the `datetime` is fractional (GH10209)
- Bug in `DataFrame.to_json` with mixed data types (GH10289)
- Bug in cache updating when consolidating (GH10264)
- Bug in `mean()` where integer dtypes can overflow (GH10172)
- Bug where `Panel.from_dict` does not set dtype when specified (GH10058)
- Bug in `Index.union` raising `AttributeError` when passing array-likes. (GH10149)
- Bug in `Timestamp`'s `microsecond`, `quarter`, `dayofyear`, `week` and `daysinmonth` properties returning `np.int` type, not built-in `int`. (GH10050)
- Bug in `NaT` raising `AttributeError` when accessing the `daysinmonth` and `dayofweek` properties. (GH10096)
- Bug in Index repr when using the `max_seq_items=None` setting (GH10182).
- Bug in getting timezone data with `dateutil` on various platforms (GH9059, GH8639, GH9663, GH10121)
- Bug in displaying datetimes with mixed frequencies; display 'ms' datetimes to the proper precision. (GH10170)
- Bug in `setitem` where type promotion is applied to the entire block (GH10280)
- Bug in `Series` arithmetic methods that may incorrectly hold names (GH10068)
- Bug in `GroupBy.get_group` when grouping on multiple keys, one of which is categorical. (GH10132)
- Bug in `DatetimeIndex` and `TimedeltaIndex` names being lost after timedelta arithmetic (GH9926)
- Bug in `DataFrame` construction from a nested `dict` with `datetime64` (GH10160)
- Bug in `Series` construction from a `dict` with `datetime64` keys (GH9456)
- Bug in `Series.plot(label="LABEL")` not correctly setting the label (GH10119)
- Bug in `plot` not defaulting to the matplotlib `axes.grid` setting (GH9792)
- Bug causing strings containing an exponent, but no decimal, to be parsed as `int` instead of `float` in `engine='python'` for the `read_csv` parser (GH9565)
- Bug in `Series.align` resetting `name` when `fill_value` is specified (GH10067)
- Bug in `read_csv` causing the index name not to be set on an empty DataFrame (GH10184)
- Bug in `SparseSeries.abs` resetting `name` (GH10241)
- Bug in `TimedeltaIndex` slicing that may reset freq (GH10292)
- Bug in `GroupBy.get_group` raising `ValueError` when the group key contains `NaT` (GH6992)
- Bug in the `SparseSeries` constructor ignoring the input data name (GH10258)
- Bug in `Categorical.remove_categories` causing a `ValueError` when removing the `NaN` category if the underlying dtype is floating-point (GH10156)
- Bug where infer_freq infers a timerule (WOM-5XXX) unsupported by to_offset (GH9425)
- Bug in `DataFrame.to_hdf()` where table format would raise a seemingly unrelated error for invalid (non-string) column names. This is now explicitly forbidden. (GH9057)
- Bug in handling masking of an empty `DataFrame` (GH10126).
- Bug where the MySQL interface could not handle numeric table/column names (GH10255)
- Bug in `read_csv` with a `date_parser` that returned a `datetime64` array of other time resolution than `[ns]` (GH10245)
- Bug in `Panel.apply` when the result has ndim=0 (GH10332)
- Bug in `read_hdf` where `auto_close` could not be passed (GH9327).
- Bug in `read_hdf` where open stores could not be used (GH10330).
- Bug in adding empty `DataFrame`s; this now results in a `DataFrame` that `.equals` an empty `DataFrame` (GH10181).
- Bug in `to_hdf` and `HDFStore` which did not check that complib choices were valid (GH4582, GH8874).
v0.16.1 (May 11, 2015)¶
This is a minor bug-fix release from 0.16.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.
Highlights include:
- Support for a `CategoricalIndex`, a category based index, see here
- New section on how-to-contribute to pandas, see here
- Revised "Merge, join, and concatenate" documentation, including graphical examples to make it easier to understand each operation, see here
- New method `sample` for drawing random samples from Series, DataFrames and Panels. See here
- The default `Index` printing has changed to a more uniform format, see here
- `BusinessHour` datetime-offset is now supported, see here
- Further enhancement to the `.str` accessor to make string operations easier, see here
What’s new in v0.16.1
Warning
In pandas 0.17.0, the sub-package pandas.io.data
will be removed in favor of a separately installable package. See here for details (GH8961)
Enhancements¶
CategoricalIndex¶
We introduce a CategoricalIndex
, a new type of index object that is useful for supporting
indexing with duplicates. This is a container around a Categorical
(introduced in v0.15.0)
and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1,
setting the index of a DataFrame/Series
with a category
dtype would convert this to regular object-based Index
.
In [1]: df = DataFrame({'A' : np.arange(6),
...: 'B' : Series(list('aabbca')).astype('category',
...: categories=list('cab'))
...: })
...:
In [2]: df
Out[2]:
A B
0 0 a
1 1 a
2 2 b
3 3 b
4 4 c
5 5 a
In [3]: df.dtypes
Out[3]:
A int64
B category
dtype: object
In [4]: df.B.cat.categories
Out[4]: Index([u'c', u'a', u'b'], dtype='object')
setting the index will create a `CategoricalIndex`
In [5]: df2 = df.set_index('B')
In [6]: df2.index
Out[6]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')
indexing with __getitem__/.iloc/.loc/.ix
works similarly to an Index with duplicates.
The indexers MUST be in the category or the operation will raise.
In [7]: df2.loc['a']
Out[7]:
A
B
a 0
a 1
a 5
and preserves the CategoricalIndex
In [8]: df2.loc['a'].index
Out[8]: CategoricalIndex([u'a', u'a', u'a'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')
sorting will order by the order of the categories
In [9]: df2.sort_index()
Out[9]:
A
B
c 4
a 0
a 1
a 5
b 2
b 3
groupby operations on the index will preserve the index nature as well
In [10]: df2.groupby(level=0).sum()
Out[10]:
A
B
c 4
a 6
b 5
In [11]: df2.groupby(level=0).sum().index
Out[11]: CategoricalIndex([u'c', u'a', u'b'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')
reindexing operations will return a resulting index based on the type of the passed
indexer, meaning that passing a list will return a plain-old `Index`; indexing with
a `Categorical` will return a `CategoricalIndex`, indexed according to the categories
of the PASSED `Categorical` dtype. This allows one to arbitrarily index these even with
values NOT in the categories, similarly to how you can reindex ANY pandas index.
In [12]: df2.reindex(['a','e'])
Out[12]:
A
B
a 0.0
a 1.0
a 5.0
e NaN
In [13]: df2.reindex(['a','e']).index
Out[13]: Index([u'a', u'a', u'a', u'e'], dtype='object', name=u'B')
In [14]: df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))
Out[14]:
A
B
a 0.0
a 1.0
a 5.0
e NaN
In [15]: df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index
Out[15]: CategoricalIndex([u'a', u'a', u'a', u'e'], categories=[u'a', u'b', u'c', u'd', u'e'], ordered=False, name=u'B', dtype='category')
See the documentation for more. (GH7629, GH10038, GH10039)
Sample¶
Series, DataFrames, and Panels now have a new method: `sample()`.
The method accepts a specific number of rows or columns to return, or a fraction of the
total number of rows or columns. It also has options for sampling with or without replacement,
for passing in a column for weights for non-uniform sampling, and for setting seed values to
facilitate replication. (GH2419)
In [16]: example_series = Series([0,1,2,3,4,5])
# When no arguments are passed, returns 1 row.
In [17]: example_series.sample()
Out[17]:
3 3
dtype: int64
# One may specify either a number of rows:
In [18]: example_series.sample(n=3)
Out[18]:
5 5
1 1
4 4
dtype: int64
# Or a fraction of the rows:
In [19]: example_series.sample(frac=0.5)
Out[19]:
4 4
1 1
0 0
dtype: int64
# weights are accepted.
In [20]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
In [21]: example_series.sample(n=3, weights=example_weights)
Out[21]:
2 2
3 3
5 5
dtype: int64
# weights will also be normalized if they do not sum to one,
# and missing values will be treated as zeros.
In [22]: example_weights2 = [0.5, 0, 0, 0, None, np.nan]
In [23]: example_series.sample(n=1, weights=example_weights2)
Out[23]:
0 0
dtype: int64
When applied to a DataFrame, one may pass the name of a column to specify sampling weights when sampling from rows.
In [24]: df = DataFrame({'col1':[9,8,7,6], 'weight_column':[0.5, 0.4, 0.1, 0]})
In [25]: df.sample(n=3, weights='weight_column')
Out[25]:
col1 weight_column
0 9 0.5
1 8 0.4
2 7 0.1
String Methods Enhancements¶
Continuing from v0.16.0, the following enhancements make string operations easier and more consistent with standard python string operations.
- Added `StringMethods` (the `.str` accessor) to `Index` (GH9068)

  The `.str` accessor is now available for both `Series` and `Index`.

In [26]: idx = Index([' jack', 'jill ', ' jesse ', 'frank'])

In [27]: idx.str.strip()
Out[27]: Index([u'jack', u'jill', u'jesse', u'frank'], dtype='object')

- One special case for the `.str` accessor on `Index` is that if a string method returns `bool`, the `.str` accessor will return a `np.array` instead of a boolean `Index` (GH8875). This enables the following expression to work naturally:

In [28]: idx = Index(['a1', 'a2', 'b1', 'b2'])

In [29]: s = Series(range(4), index=idx)

In [30]: s
Out[30]:
a1    0
a2    1
b1    2
b2    3
dtype: int64

In [31]: idx.str.startswith('a')
Out[31]: array([ True,  True, False, False], dtype=bool)

In [32]: s[s.index.str.startswith('a')]
Out[32]:
a1    0
a2    1
dtype: int64
- The following new methods are accessible via the `.str` accessor to apply the function to each value. (GH9766, GH9773, GH10031, GH10045, GH10052)

  Methods: `capitalize()`, `swapcase()`, `normalize()`, `partition()`, `rpartition()`, `index()`, `rindex()`, `translate()`
- `split` now takes an `expand` keyword to specify whether to expand dimensionality. `return_type` is deprecated. (GH9847)

In [33]: s = Series(['a,b', 'a,c', 'b,c'])

# return Series
In [34]: s.str.split(',')
Out[34]:
0    [a, b]
1    [a, c]
2    [b, c]
dtype: object

# return DataFrame
In [35]: s.str.split(',', expand=True)
Out[35]:
   0  1
0  a  b
1  a  c
2  b  c

In [36]: idx = Index(['a,b', 'a,c', 'b,c'])

# return Index
In [37]: idx.str.split(',')
Out[37]: Index([[u'a', u'b'], [u'a', u'c'], [u'b', u'c']], dtype='object')

# return MultiIndex
In [38]: idx.str.split(',', expand=True)
Out[38]:
MultiIndex(levels=[[u'a', u'b'], [u'b', u'c']],
           labels=[[0, 0, 1], [0, 1, 1]])
- Improved `extract` and `get_dummies` methods for `Index.str` (GH9980)
Other Enhancements¶
- `BusinessHour` offset is now supported, which represents business hours starting from 09:00 - 17:00 on `BusinessDay` by default. See here for details. (GH7905)

In [39]: from pandas.tseries.offsets import BusinessHour

In [40]: Timestamp('2014-08-01 09:00') + BusinessHour()
Out[40]: Timestamp('2014-08-01 10:00:00')

In [41]: Timestamp('2014-08-01 07:00') + BusinessHour()
Out[41]: Timestamp('2014-08-01 10:00:00')

In [42]: Timestamp('2014-08-01 16:30') + BusinessHour()
Out[42]: Timestamp('2014-08-04 09:30:00')

- `DataFrame.diff` now takes an `axis` parameter that determines the direction of differencing (GH9727)

- Allow `clip`, `clip_lower`, and `clip_upper` to accept array-like arguments as thresholds (this is a regression from 0.11.0). These methods now have an `axis` parameter which determines how the Series or DataFrame will be aligned with the threshold(s). (GH6966)

- `DataFrame.mask()` and `Series.mask()` now support the same keywords as `where` (GH8801)

- The `drop` function can now accept an `errors` keyword to suppress the `ValueError` raised when any label does not exist in the target data. (GH6736)

In [43]: df = DataFrame(np.random.randn(3, 3), columns=['A', 'B', 'C'])

In [44]: df.drop(['A', 'X'], axis=1, errors='ignore')
Out[44]:
          B         C
0  1.058969 -0.397840
1  1.047579  1.045938
2 -0.122092  0.124713

- Add support for separating years and quarters using dashes, for example 2014-Q1. (GH9688)

- Allow conversion of values with dtype `datetime64` or `timedelta64` to strings using `astype(str)` (GH9757); see the sketch after this list.

- The `get_dummies` function now accepts a `sparse` keyword. If set to `True`, the returned `DataFrame` is sparse, e.g. a `SparseDataFrame`. (GH8823)

- `Period` now accepts `datetime64` as value input. (GH9054)

- Allow timedelta string conversion when the leading zero is missing from the time definition, i.e. 0:00:00 vs 00:00:00. (GH9570)

- Allow `Panel.shift` with `axis='items'` (GH9890)

- Trying to write an Excel file now raises `NotImplementedError` if the `DataFrame` has a `MultiIndex`, instead of writing a broken Excel file. (GH9794)

- Allow `Categorical.add_categories` to accept `Series` or `np.array`. (GH9927)

- Add/delete `str/dt/cat` accessors dynamically from `__dir__`. (GH9910)

- Add `normalize` as a `dt` accessor method. (GH10047)

- `DataFrame` and `Series` now have a `_constructor_expanddim` property as an overridable constructor for one higher dimensionality data. This should be used only when it is really needed, see here

- `pd.lib.infer_dtype` now returns `'bytes'` in Python 3 where appropriate. (GH10032)
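For instance, the `astype(str)` conversion mentioned above, as a sketch:

import pandas as pd

s = pd.Series(pd.date_range('2015-01-01', periods=2))
s.astype(str)  # elementwise string conversion (object dtype)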
API changes¶
- When passing in an ax to `df.plot( ..., ax=ax)`, the `sharex` kwarg will now default to `False`. The result is that the visibility of xlabels and xticklabels will no longer be changed. You have to do that by yourself for the right axes in your figure or set `sharex=True` explicitly (but this changes the visibility for all axes in the figure, not only the one which is passed in!). If pandas creates the subplots itself (e.g. no passed in ax kwarg), then the default is still `sharex=True` and the visibility changes are applied.

- `assign()` now inserts new columns in alphabetical order. Previously the order was arbitrary. (GH9777)

- By default, `read_csv` and `read_table` will now try to infer the compression type based on the file extension. Set `compression=None` to restore the previous behavior (no decompression). (GH9770) See the sketch below.
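A sketch of the inference change ('data.csv.gz' is a hypothetical file):

import pandas as pd

pd.read_csv('data.csv.gz')                    # gzip inferred from the extension
pd.read_csv('data.csv.gz', compression=None)  # previous behavior: no decompression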
Index Representation¶
The string representation of `Index` and its sub-classes have now been unified. These will show a single-line display if there are few values; a wrapped multi-line display for a lot of values (but less than `display.max_seq_items`); and if there are lots of items (> `display.max_seq_items`) a truncated display (the head and tail of the data) will be shown. The formatting for `MultiIndex` is unchanged (a multi-line wrapped display). The display width responds to the option `display.max_seq_items`, which defaults to 100. (GH6482)
Previous Behavior
In [2]: pd.Index(range(4),name='foo')
Out[2]: Int64Index([0, 1, 2, 3], dtype='int64')
In [3]: pd.Index(range(104),name='foo')
Out[3]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')
In [4]: pd.date_range('20130101',periods=4,name='foo',tz='US/Eastern')
Out[4]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-01-04 00:00:00-05:00]
Length: 4, Freq: D, Timezone: US/Eastern
In [5]: pd.date_range('20130101',periods=104,name='foo',tz='US/Eastern')
Out[5]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-04-14 00:00:00-04:00]
Length: 104, Freq: D, Timezone: US/Eastern
New Behavior
In [45]: pd.set_option('display.width', 80)
In [46]: pd.Index(range(4), name='foo')
Out[46]: Int64Index([0, 1, 2, 3], dtype='int64', name=u'foo')
In [47]: pd.Index(range(30), name='foo')
Out[47]:
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
dtype='int64', name=u'foo')
In [48]: pd.Index(range(104), name='foo')
Out[48]:
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
94, 95, 96, 97, 98, 99, 100, 101, 102, 103],
dtype='int64', name=u'foo', length=104)
In [49]: pd.CategoricalIndex(['a','bb','ccc','dddd'], ordered=True, name='foobar')
Out[49]: CategoricalIndex([u'a', u'bb', u'ccc', u'dddd'], categories=[u'a', u'bb', u'ccc', u'dddd'], ordered=True, name=u'foobar', dtype='category')
In [50]: pd.CategoricalIndex(['a','bb','ccc','dddd']*10, ordered=True, name='foobar')
Out[50]:
CategoricalIndex([u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd',
u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd',
u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd',
u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd',
u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd'],
categories=[u'a', u'bb', u'ccc', u'dddd'], ordered=True, name=u'foobar', dtype='category')
In [51]: pd.CategoricalIndex(['a','bb','ccc','dddd']*100, ordered=True, name='foobar')
Out[51]:
CategoricalIndex([u'a', u'bb', u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd',
u'a', u'bb',
...
u'ccc', u'dddd', u'a', u'bb', u'ccc', u'dddd', u'a', u'bb',
u'ccc', u'dddd'],
categories=[u'a', u'bb', u'ccc', u'dddd'], ordered=True, name=u'foobar', dtype='category', length=400)
In [52]: pd.date_range('20130101',periods=4, name='foo', tz='US/Eastern')
Out[52]:
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
'2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00'],
dtype='datetime64[ns, US/Eastern]', name=u'foo', freq='D')
In [53]: pd.date_range('20130101',periods=25, freq='D')
Out[53]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
'2013-01-09', '2013-01-10', '2013-01-11', '2013-01-12',
'2013-01-13', '2013-01-14', '2013-01-15', '2013-01-16',
'2013-01-17', '2013-01-18', '2013-01-19', '2013-01-20',
'2013-01-21', '2013-01-22', '2013-01-23', '2013-01-24',
'2013-01-25'],
dtype='datetime64[ns]', freq='D')
In [54]: pd.date_range('20130101',periods=104, name='foo', tz='US/Eastern')
Out[54]:
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
'2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00',
'2013-01-05 00:00:00-05:00', '2013-01-06 00:00:00-05:00',
'2013-01-07 00:00:00-05:00', '2013-01-08 00:00:00-05:00',
'2013-01-09 00:00:00-05:00', '2013-01-10 00:00:00-05:00',
...
'2013-04-05 00:00:00-04:00', '2013-04-06 00:00:00-04:00',
'2013-04-07 00:00:00-04:00', '2013-04-08 00:00:00-04:00',
'2013-04-09 00:00:00-04:00', '2013-04-10 00:00:00-04:00',
'2013-04-11 00:00:00-04:00', '2013-04-12 00:00:00-04:00',
'2013-04-13 00:00:00-04:00', '2013-04-14 00:00:00-04:00'],
dtype='datetime64[ns, US/Eastern]', name=u'foo', length=104, freq='D')
Performance Improvements¶
Bug Fixes¶
- Bug where labels did not appear properly in the legend of `DataFrame.plot()`; passing `label=` arguments works, and Series indices are no longer mutated. (GH9542)
- Bug in json serialization causing a segfault when a frame had zero length. (GH9805)
- Bug in `read_csv` where missing trailing delimiters would cause a segfault. (GH5664)
- Bug in retaining the index name on appending (GH9862)
- Bug in `scatter_matrix` drawing unexpected axis ticklabels (GH5662)
- Fixed bug in `StataWriter` resulting in changes to the input `DataFrame` upon save (GH9795).
- Bug in `transform` causing a length mismatch when null entries were present and a fast aggregator was being used (GH9697)
- Bug in `equals` causing false negatives when block order differed (GH9330)
- Bug in grouping with multiple `pd.Grouper` where one is non-time based (GH10063)
- Bug in `read_sql_table` erroring when reading a postgres table with a timezone (GH7139)
- Bug in `DataFrame` slicing that may not retain metadata (GH9776)
- Bug where `TimedeltaIndex` were not properly serialized in a fixed `HDFStore` (GH9635)
- Bug with the `TimedeltaIndex` constructor ignoring `name` when given another `TimedeltaIndex` as data (GH10025).
- Bug in `DataFrameFormatter._get_formatted_index` not applying `max_colwidth` to the `DataFrame` index (GH7856)
- Bug in `.loc` with a read-only ndarray data source (GH10043)
- Bug in `groupby.apply()` that would raise if a passed user-defined function returned only `None` (for all input). (GH9685)
- Always use temporary files in pytables tests (GH9992)
- Bug in plotting continuously using `secondary_y` that may not show the legend properly. (GH9610, GH9779)
- Bug in `DataFrame.plot(kind="hist")` resulting in `TypeError` when the `DataFrame` contains non-numeric columns (GH9853)
- Bug where repeated plotting of a `DataFrame` with a `DatetimeIndex` may raise `TypeError` (GH9852)
- Bug in `setup.py` that would allow an incompatible cython version to build (GH9827)
- Bug in plotting `secondary_y` incorrectly attaching a `right_ax` property to secondary axes specifying itself recursively. (GH9861)
- Bug in `Series.quantile` on an empty Series of type `Datetime` or `Timedelta` (GH9675)
- Bug in `where` causing incorrect results when upcasting was required (GH9731)
- Bug in `FloatArrayFormatter` where the decision boundary for displaying "small" floats in decimal format was off by one order of magnitude for a given display.precision (GH9764)
- Fixed bug where `DataFrame.plot()` raised an error when both `color` and `style` keywords were passed and there was no color symbol in the style strings (GH9671)
- Not showing a `DeprecationWarning` on combining list-likes with an `Index` (GH10083)
- Bug in `read_csv` and `read_table` when using the `skip_rows` parameter if blank lines are present. (GH9832)
- Bug in `read_csv()` interpreting `index_col=True` as `1` (GH9798)
- Bug in index equality comparisons using `==` failing on Index/MultiIndex type incompatibility (GH9785)
- Bug in which `SparseDataFrame` could not take nan as a column name (GH8822)
- Bug in `to_msgpack` and `read_msgpack` zlib and blosc compression support (GH9783)
- Bug where `GroupBy.size` doesn't attach the index name properly if grouped by `TimeGrouper` (GH9925)
- Bug causing an exception in slice assignments because `length_of_indexer` returns wrong results (GH9995)
- Bug in the csv parser causing lines with initial whitespace plus one non-space character to be skipped. (GH9710)
- Bug in the C csv parser causing spurious NaNs when data started with a newline followed by whitespace. (GH10022)
- Bug causing elements with a null group to spill into the final group when grouping by a `Categorical` (GH9603)
- Bug where .iloc and .loc behavior was not consistent on empty dataframes (GH9964)
- Bug in invalid attribute access on a `TimedeltaIndex` incorrectly raising `ValueError` instead of `AttributeError` (GH9680)
- Bug in unequal comparisons between categorical data and a scalar which was not in the categories (e.g. `Series(Categorical(list("abc"), ordered=True)) > "d"`). This returned `False` for all elements, but now raises a `TypeError`. Equality comparisons also now return `False` for `==` and `True` for `!=`. (GH9848)
- Bug in DataFrame `__setitem__` when the right hand side is a dictionary (GH9874)
- Bug in `where` when the dtype is `datetime64/timedelta64`, but the dtype of other is not (GH9804)
- Bug in `MultiIndex.sortlevel()` where a unicode level name breaks results (GH9856)
- Bug in which `groupby.transform` incorrectly enforced output dtypes to match input dtypes. (GH9807)
- Bug in the `DataFrame` constructor when the `columns` parameter is set, and `data` is an empty list (GH9939)
- Bug in bar plot with `log=True` raising `TypeError` if all values are less than 1 (GH9905)
- Bug in horizontal bar plot ignoring `log=True` (GH9905)
- Bug in PyTables queries that did not return proper results using the index (GH8265, GH9676)
- Bug where dividing a dataframe containing values of type `Decimal` by another `Decimal` would raise. (GH9787)
- Bug where using DataFrames asfreq would remove the name of the index. (GH9885)
- Bug causing an extra index point when resampling BM/BQ (GH9756)
- Changed caching in `AbstractHolidayCalendar` to be at the instance level rather than at the class level as the latter can result in unexpected behaviour. (GH9552)
- Fixed latex output for multi-indexed dataframes (GH9778)
- Bug causing an exception when setting an empty range using `DataFrame.loc` (GH9596)
- Bug in hiding ticklabels with subplots and shared axes when adding a new plot to an existing grid of axes (GH9158)
- Bug in `transform` and `filter` when grouping on a categorical variable (GH9921)
- Bug in `transform` when groups are equal in number and dtype to the input index (GH9700)
- Google BigQuery connector now imports dependencies on a per-method basis. (GH9713)
- Updated BigQuery connector to no longer use deprecated `oauth2client.tools.run()` (GH8327)
- Bug in subclassed `DataFrame`: it may not return the correct class when slicing or subsetting it. (GH9632)
- Bug in `.median()` where non-float null values are not handled correctly (GH10040)
- Bug in `Series.fillna()` where it raises if a numerically convertible string is given (GH10092)
v0.16.0 (March 22, 2015)¶
This is a major release from 0.15.2 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
Highlights include:
- `DataFrame.assign` method, see here
- `Series.to_coo/from_coo` methods to interact with `scipy.sparse`, see here
- Backwards incompatible change to `Timedelta` to conform the `.seconds` attribute with `datetime.timedelta`, see here
- Changes to the `.loc` slicing API to conform with the behavior of `.ix`, see here
- Changes to the default for ordering in the `Categorical` constructor, see here
- Enhancement to the `.str` accessor to make string operations easier, see here
- The `pandas.tools.rplot`, `pandas.sandbox.qtpandas` and `pandas.rpy` modules are deprecated. We refer users to external packages like seaborn, pandas-qt and rpy2 for similar or equivalent functionality, see here
Check the API Changes and deprecations before updating.
What’s new in v0.16.0
New features¶
DataFrame Assign¶
Inspired by dplyr's mutate verb, DataFrame has a new assign() method.
The function signature for assign is simply **kwargs. The keys are the column names for the new fields, and the values are either a value to be inserted (for example, a Series or NumPy array), or a function of one argument to be called on the DataFrame. The new values are inserted, and the entire DataFrame (with all original and new columns) is returned.
In [1]: iris = pd.read_csv('data/iris.data')
In [2]: iris.head()
Out[2]:
SepalLength SepalWidth PetalLength PetalWidth Name
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
In [3]: iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength']).head()
Out[3]:
SepalLength SepalWidth PetalLength PetalWidth Name sepal_ratio
0 5.1 3.5 1.4 0.2 Iris-setosa 0.686275
1 4.9 3.0 1.4 0.2 Iris-setosa 0.612245
2 4.7 3.2 1.3 0.2 Iris-setosa 0.680851
3 4.6 3.1 1.5 0.2 Iris-setosa 0.673913
4 5.0 3.6 1.4 0.2 Iris-setosa 0.720000
Above was an example of inserting a precomputed value. We can also pass in a function to be evaluated.
In [4]: iris.assign(sepal_ratio = lambda x: (x['SepalWidth'] /
...: x['SepalLength'])).head()
...:
Out[4]:
SepalLength SepalWidth PetalLength PetalWidth Name sepal_ratio
0 5.1 3.5 1.4 0.2 Iris-setosa 0.686275
1 4.9 3.0 1.4 0.2 Iris-setosa 0.612245
2 4.7 3.2 1.3 0.2 Iris-setosa 0.680851
3 4.6 3.1 1.5 0.2 Iris-setosa 0.673913
4 5.0 3.6 1.4 0.2 Iris-setosa 0.720000
The power of assign comes when used in chains of operations. For example, we can limit the DataFrame to just those rows with a SepalLength greater than 5, calculate the ratio, and plot:
In [5]: (iris.query('SepalLength > 5')
...: .assign(SepalRatio = lambda x: x.SepalWidth / x.SepalLength,
...: PetalRatio = lambda x: x.PetalWidth / x.PetalLength)
...: .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
...:
Out[5]: <matplotlib.axes._subplots.AxesSubplot at 0x7f3dbb896a90>
See the documentation for more. (GH9229)
Interaction with scipy.sparse¶
Added SparseSeries.to_coo() and SparseSeries.from_coo() methods (GH8048) for converting to and from scipy.sparse.coo_matrix instances (see here). For example, given a SparseSeries with a MultiIndex we can convert to a scipy.sparse.coo_matrix by specifying the row and column labels as index levels:
In [6]: from numpy import nan
In [7]: s = pd.Series([3.0, nan, 1.0, 3.0, nan, nan])
In [8]: s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
...: (1, 2, 'a', 1),
...: (1, 1, 'b', 0),
...: (1, 1, 'b', 1),
...: (2, 1, 'b', 0),
...: (2, 1, 'b', 1)],
...: names=['A', 'B', 'C', 'D'])
...:
In [9]: s
Out[9]:
A B C D
1 2 a 0 3.0
1 NaN
1 b 0 1.0
1 3.0
2 1 b 0 NaN
1 NaN
dtype: float64
# SparseSeries
In [10]: ss = s.to_sparse()
In [11]: ss
Out[11]:
A B C D
1 2 a 0 3.0
1 NaN
1 b 0 1.0
1 3.0
2 1 b 0 NaN
1 NaN
dtype: float64
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 2], dtype=int32)
In [12]: A, rows, columns = ss.to_coo(row_levels=['A', 'B'],
....: column_levels=['C', 'D'],
....: sort_labels=False)
....:
In [13]: A
Out[13]:
<3x4 sparse matrix of type '<type 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
In [14]: A.todense()
Out[14]:
matrix([[ 3., 0., 0., 0.],
[ 0., 0., 1., 3.],
[ 0., 0., 0., 0.]])
In [15]: rows
Out[15]: [(1, 2), (1, 1), (2, 1)]
In [16]: columns
Out[16]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
The from_coo method is a convenience method for creating a SparseSeries from a scipy.sparse.coo_matrix:
In [17]: from scipy import sparse
In [18]: A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])),
....: shape=(3, 4))
....:
In [19]: A
Out[19]:
<3x4 sparse matrix of type '<type 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
In [20]: A.todense()
Out[20]:
matrix([[ 0., 0., 1., 2.],
[ 3., 0., 0., 0.],
[ 0., 0., 0., 0.]])
In [21]: ss = pd.SparseSeries.from_coo(A)
In [22]: ss
Out[22]:
0 2 1.0
3 2.0
1 0 3.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([3], dtype=int32)
String Methods Enhancements¶
The following new methods are accessible via the .str accessor to apply the function to each value. This is intended to make string handling more consistent with standard methods on strings. (GH9282, GH9352, GH9386, GH9387, GH9439)

Methods: isalnum(), isalpha(), isdigit(), isspace(), islower(), isupper(), istitle(), isnumeric(), isdecimal(), find(), rfind(), ljust(), rjust(), zfill()

In [23]: s = pd.Series(['abcd', '3456', 'EFGH'])

In [24]: s.str.isalpha()
Out[24]:
0     True
1    False
2     True
dtype: bool

In [25]: s.str.find('ab')
Out[25]:
0    0
1   -1
2   -1
dtype: int64
Series.str.pad() and Series.str.center() now accept a fillchar option to specify the filling character (GH9352)

In [26]: s = pd.Series(['12', '300', '25'])

In [27]: s.str.pad(5, fillchar='_')
Out[27]:
0    ___12
1    __300
2    ___25
dtype: object
Added Series.str.slice_replace(), which previously raised NotImplementedError (GH8888)

In [28]: s = pd.Series(['ABCD', 'EFGH', 'IJK'])

In [29]: s.str.slice_replace(1, 3, 'X')
Out[29]:
0    AXD
1    EXH
2     IX
dtype: object

# replaced with empty char
In [30]: s.str.slice_replace(0, 1)
Out[30]:
0    BCD
1    FGH
2     JK
dtype: object
Other enhancements¶
- Reindex now supports method='nearest' for frames or series with a monotonic increasing or decreasing index (GH9258):

In [31]: df = pd.DataFrame({'x': range(5)})

In [32]: df.reindex([0.2, 1.8, 3.5], method='nearest')
Out[32]:
     x
0.2  0
1.8  2
3.5  4

This method is also exposed by the lower level Index.get_indexer and Index.get_loc methods.

- The read_excel() function's sheetname argument now accepts a list and None, to get multiple or all sheets respectively. If more than one sheet is specified, a dictionary is returned. (GH9450)

# Returns the 1st and 4th sheet, as a dictionary of DataFrames.
pd.read_excel('path_to_file.xls', sheetname=['Sheet1', 3])

- Allow Stata files to be read incrementally with an iterator; support for long strings in Stata files. See the docs here (GH9493).
- Paths beginning with ~ will now be expanded to begin with the user's home directory (GH9066)
- Added time interval selection in get_data_yahoo (GH9071)
- Added Timestamp.to_datetime64() to complement Timedelta.to_timedelta64() (GH9255)
- tseries.frequencies.to_offset() now accepts Timedelta as input (GH9064)
- A lag parameter was added to the autocorrelation method of Series; it defaults to lag-1 autocorrelation (GH9192)
- Timedelta will now accept a nanoseconds keyword in its constructor (GH9273)
- SQL code now safely escapes table and column names (GH8986)
- Added auto-complete for Series.str.<tab>, Series.dt.<tab> and Series.cat.<tab> (GH9322)
- Index.get_indexer now supports method='pad' and method='backfill' for any target array, not just monotonic targets. These methods also work for monotonic decreasing as well as monotonic increasing indexes (GH9258).
- Index.asof now works on all index types (GH9258).
- A verbose argument has been added to io.read_excel(); it defaults to False. Set it to True to print sheet names as they are parsed. (GH9450)
- Added days_in_month (compatibility alias daysinmonth) property to Timestamp, DatetimeIndex, Period, PeriodIndex, and Series.dt (GH9572); this and a few other items are illustrated in the sketch after this list
- Added a decimal option to to_csv to provide formatting for non-'.' decimal separators (GH781)
- Added a normalize option for Timestamp to normalize to midnight (GH8794)
- Added an example of DataFrame import to R using an HDF5 file and the rhdf5 library. See the documentation for more (GH9636).
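As a minimal sketch, not from the original notes, here is how a few of the enhancements above can be exercised; it assumes pandas 0.16.0 or later:

import pandas as pd

# days_in_month on a Timestamp (daysinmonth is the compatibility alias)
ts = pd.Timestamp('2015-02-14')
print(ts.days_in_month)          # 28

# nanoseconds keyword in the Timedelta constructor
td = pd.Timedelta(seconds=1, nanoseconds=500)
print(td)                        # 0 days 00:00:01.000000500

# lag parameter for Series.autocorr (defaults to lag-1)
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
print(s.autocorr(lag=2))         # 1.0 for a perfectly linear series

# decimal option in to_csv for non-'.' separators
# (sep=';' avoids clashing with the ',' decimal mark)
df = pd.DataFrame({'x': [1.5, 2.25]})
print(df.to_csv(decimal=',', sep=';'))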
Backwards incompatible API changes¶
Changes in Timedelta¶
In v0.15.0 a new scalar type Timedelta was introduced, which is a sub-class of datetime.timedelta. Mentioned here was a notice of an API change w.r.t. the .seconds accessor. The intent was to provide a user-friendly set of accessors that give the 'natural' value for that unit, e.g. if you had a Timedelta('1 day, 10:11:12'), then .seconds would return 12. However, this is at odds with the definition of datetime.timedelta, which defines .seconds as 10 * 3600 + 11 * 60 + 12 == 36672.

So in v0.16.0, we are restoring the API to match that of datetime.timedelta. Further, the component values are still available through the .components accessor. This affects the .seconds and .microseconds accessors, and removes the .hours, .minutes, and .milliseconds accessors. These changes affect TimedeltaIndex and the Series .dt accessor as well. (GH9185, GH9139)
Previous Behavior
In [2]: t = pd.Timedelta('1 day, 10:11:12.100123')
In [3]: t.days
Out[3]: 1
In [4]: t.seconds
Out[4]: 12
In [5]: t.microseconds
Out[5]: 123
New Behavior
In [33]: t = pd.Timedelta('1 day, 10:11:12.100123')
In [34]: t.days
Out[34]: 1
In [35]: t.seconds
Out[35]: 36672
In [36]: t.microseconds
Out[36]: 100123
Using .components allows full component access:
In [37]: t.components
Out[37]: Components(days=1, hours=10, minutes=11, seconds=12, milliseconds=100, microseconds=123, nanoseconds=0)
In [38]: t.components.seconds
Out[38]: 12
Indexing Changes¶
The behavior of a small sub-set of edge cases for using .loc has changed (GH8613). Furthermore, we have improved the content of the error messages that are raised:

- Slicing with .loc where the start and/or stop bound is not found in the index is now allowed; this previously would raise a KeyError. This makes the behavior the same as .ix in this case. This change is only for slicing, not when indexing with a single label.

In [39]: df = pd.DataFrame(np.random.randn(5, 4),
   ....:                   columns=list('ABCD'),
   ....:                   index=pd.date_range('20130101', periods=5))
   ....:

In [40]: df
Out[40]: