Version 0.16.1 (May 11, 2015)¶
This is a minor bug-fix release from 0.16.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.
Highlights include:
Support for a CategoricalIndex, a category based index, see here
New section on how-to-contribute to pandas, see here
Revised “Merge, join, and concatenate” documentation, including graphical examples to make it easier to understand each operation, see here
New method sample for drawing random samples from Series, DataFrames and Panels. See here
The default Index printing has changed to a more uniform format, see here
BusinessHour datetime-offset is now supported, see here
Further enhancement to the .str accessor to make string operations easier, see here
What’s new in v0.16.1
Warning
In pandas 0.17.0, the sub-package pandas.io.data will be removed in favor of a separately installable package (GH8961).
Enhancements¶
CategoricalIndex¶
We introduce a CategoricalIndex, a new type of index object that is useful for supporting
indexing with duplicates. This is a container around a Categorical (introduced in v0.15.0)
and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1,
setting the index of a DataFrame/Series with a category dtype would convert this to a regular object-based Index.
In [1]: df = pd.DataFrame({'A': np.arange(6),
...: 'B': pd.Series(list('aabbca'))
...: .astype('category', categories=list('cab'))
...: })
...:
In [2]: df
Out[2]:
A B
0 0 a
1 1 a
2 2 b
3 3 b
4 4 c
5 5 a
In [3]: df.dtypes
Out[3]:
A int64
B category
dtype: object
In [4]: df.B.cat.categories
Out[4]: Index(['c', 'a', 'b'], dtype='object')
Setting the index will create a CategoricalIndex:
In [5]: df2 = df.set_index('B')
In [6]: df2.index
Out[6]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
Indexing with __getitem__/.iloc/.loc/.ix works similarly to an Index with duplicates.
The indexers MUST be in the categories or the operation will raise.
In [7]: df2.loc['a']
Out[7]:
A
B
a 0
a 1
a 5
The result preserves the CategoricalIndex:
In [8]: df2.loc['a'].index
Out[8]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
Sorting will order by the order of the categories:
In [9]: df2.sort_index()
Out[9]:
A
B
c 4
a 0
a 1
a 5
b 2
b 3
Groupby operations on the index will preserve the index nature as well:
In [10]: df2.groupby(level=0).sum()
Out[10]:
A
B
c 4
a 6
b 5
In [11]: df2.groupby(level=0).sum().index
Out[11]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
Reindexing operations will return a resulting index based on the type of the passed
indexer: passing a list will return a plain-old Index, while indexing with
a Categorical will return a CategoricalIndex, indexed according to the categories
of the PASSED Categorical dtype. This allows one to arbitrarily index these even with
values NOT in the categories, similarly to how you can reindex ANY pandas index.
In [12]: df2.reindex(['a', 'e'])
Out[12]:
A
B
a 0.0
a 1.0
a 5.0
e NaN
In [13]: df2.reindex(['a', 'e']).index
Out[13]: pd.Index(['a', 'a', 'a', 'e'], dtype='object', name='B')
In [14]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde')))
Out[14]:
A
B
a 0.0
a 1.0
a 5.0
e NaN
In [15]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde'))).index
Out[15]: pd.CategoricalIndex(['a', 'a', 'a', 'e'],
categories=['a', 'b', 'c', 'd', 'e'],
ordered=False, name='B',
dtype='category')
See the documentation for more. (GH7629, GH10038, GH10039)
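As a compact recap of the behaviour described in this section, here is a minimal sketch. It builds the categorical column with pd.Categorical(..., categories=...) instead of the astype call used in the transcript; that substitution is an assumption made so the snippet also runs on later pandas releases, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same data as the transcript; pd.Categorical fixes the category order c, a, b.
df = pd.DataFrame({
    "A": np.arange(6),
    "B": pd.Categorical(list("aabbca"), categories=list("cab")),
})

# Setting a category-dtype column as the index yields a CategoricalIndex.
df2 = df.set_index("B")
print(type(df2.index).__name__)

# Label lookup behaves like an Index with duplicates and keeps the index type.
rows_a = df2.loc["a"]
print(rows_a)

# Sorting follows the category order (c, a, b), not alphabetical order.
sorted_labels = list(df2.sort_index().index)
print(sorted_labels)
```

The point of the sketch is that the category order, not the lexical order of the labels, drives sorting.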
Sample¶
Series, DataFrames, and Panels now have a new method: sample().
The method accepts a specific number of rows or columns to return, or a fraction of the
total number of rows or columns. It also has options for sampling with or without replacement,
for passing in a column for weights for non-uniform sampling, and for setting seed values to
facilitate replication. (GH2419)
In [1]: example_series = pd.Series([0, 1, 2, 3, 4, 5])
# When no arguments are passed, returns 1 row
In [2]: example_series.sample()
Out[2]:
3 3
Length: 1, dtype: int64
# One may specify either a number of rows:
In [3]: example_series.sample(n=3)
Out[3]:
2 2
1 1
0 0
Length: 3, dtype: int64
# Or a fraction of the rows:
In [4]: example_series.sample(frac=0.5)
Out[4]:
1 1
5 5
3 3
Length: 3, dtype: int64
# weights are accepted.
In [5]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
In [6]: example_series.sample(n=3, weights=example_weights)
Out[6]:
2 2
4 4
3 3
Length: 3, dtype: int64
# weights will also be normalized if they do not sum to one,
# and missing values will be treated as zeros.
In [7]: example_weights2 = [0.5, 0, 0, 0, None, np.nan]
In [8]: example_series.sample(n=1, weights=example_weights2)
Out[8]:
0 0
Length: 1, dtype: int64
When applied to a DataFrame, one may pass the name of a column to specify sampling weights when sampling from rows.
In [9]: df = pd.DataFrame({"col1": [9, 8, 7, 6], "weight_column": [0.5, 0.4, 0.1, 0]})
In [10]: df.sample(n=3, weights="weight_column")
Out[10]:
col1 weight_column
0 9 0.5
1 8 0.4
2 7 0.1
[3 rows x 2 columns]
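The options for replacement and seeding mentioned above are not exercised in the transcript. The following is a minimal sketch of them, assuming the seed is supplied through the random_state keyword of sample(); the seed values and variable names are arbitrary choices for illustration.

```python
import pandas as pd

series = pd.Series([0, 1, 2, 3, 4, 5])

# The same seed gives the same draw, which makes a sample reproducible.
first = series.sample(n=3, random_state=42)
second = series.sample(n=3, random_state=42)
print(first.equals(second))

# With replacement, the sample may be larger than the data itself.
oversampled = series.sample(n=10, replace=True, random_state=0)
print(len(oversampled))

# Entries with zero weight are never selected.
weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
weighted = series.sample(n=3, weights=weights, random_state=1)
print(sorted(weighted.index))

# For a DataFrame, a column name can supply the row weights.
df = pd.DataFrame({"col1": [9, 8, 7, 6], "weight_column": [0.5, 0.4, 0.1, 0]})
picked = df.sample(n=3, weights="weight_column", random_state=2)
print(sorted(picked.index))
```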
String methods enhancements¶
Continuing from v0.16.0, the following enhancements make string operations easier and more consistent with standard Python string operations.
Added StringMethods (.str accessor) to Index (GH9068)
The .str accessor is now available for both Series and Index.
In [11]: idx = pd.Index([" jack", "jill ", " jesse ", "frank"])
In [12]: idx.str.strip()
Out[12]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')
One special case for the .str accessor on Index is that if a string method returns bool, the .str accessor will return a np.array instead of a boolean Index (GH8875). This enables the following expression to work naturally:
In [13]: idx = pd.Index(["a1", "a2", "b1", "b2"])
In [14]: s = pd.Series(range(4), index=idx)
In [15]: s
Out[15]:
a1 0
a2 1
b1 2
b2 3
Length: 4, dtype: int64
In [16]: idx.str.startswith("a")
Out[16]: array([ True, True, False, False])
In [17]: s[s.index.str.startswith("a")]
Out[17]:
a1 0
a2 1
Length: 2, dtype: int64
The following new methods are accessible via the .str accessor to apply the function to each value. (GH9766, GH9773, GH10031, GH10045, GH10052)
Methods: capitalize(), swapcase(), normalize(), partition(), rpartition(), index(), rindex(), translate()
split now takes an expand keyword to specify whether to expand dimensionality. return_type is deprecated. (GH9847)
In [18]: s = pd.Series(["a,b", "a,c", "b,c"])
# return Series
In [19]: s.str.split(",")
Out[19]:
0 [a, b]
1 [a, c]
2 [b, c]
Length: 3, dtype: object
# return DataFrame
In [20]: s.str.split(",", expand=True)
Out[20]:
0 1
0 a b
1 a c
2 b c
[3 rows x 2 columns]
In [21]: idx = pd.Index(["a,b", "a,c", "b,c"])
# return Index
In [22]: idx.str.split(",")
Out[22]: Index([['a', 'b'], ['a', 'c'], ['b', 'c']], dtype='object')
# return MultiIndex
In [23]: idx.str.split(",", expand=True)
Out[23]:
MultiIndex([('a', 'b'),
('a', 'c'),
('b', 'c')],
)
Improved extract and get_dummies methods for Index.str (GH9980)
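Several of the newly listed methods have no transcript of their own. The sketch below tries three of them (capitalize, swapcase, partition) and repeats the boolean-mask idiom on an Index; the sample strings and variable names are made up for illustration.

```python
import pandas as pd

s = pd.Series(["alpha-beta", "Gamma-delta"])

# Element-wise counterparts of the built-in str methods.
capitalized = s.str.capitalize().tolist()
swapped = s.str.swapcase().tolist()
print(capitalized, swapped)

# partition splits on the first separator and returns the three parts
# (text before, the separator itself, text after) as columns.
parts = s.str.partition("-")
print(parts)

# On an Index, boolean string methods return a NumPy array usable as a mask.
idx = pd.Index(["a1", "a2", "b1", "b2"])
values = pd.Series(range(4), index=idx)
mask = idx.str.startswith("a")
print(values[mask].tolist())
```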
Other enhancements¶
BusinessHour offset is now supported, which represents business hours starting from 09:00 - 17:00 on BusinessDay by default. See here for details. (GH7905)
In [24]: pd.Timestamp("2014-08-01 09:00") + pd.tseries.offsets.BusinessHour()
Out[24]: Timestamp('2014-08-01 10:00:00')
In [25]: pd.Timestamp("2014-08-01 07:00") + pd.tseries.offsets.BusinessHour()
Out[25]: Timestamp('2014-08-01 10:00:00')
In [26]: pd.Timestamp("2014-08-01 16:30") + pd.tseries.offsets.BusinessHour()
Out[26]: Timestamp('2014-08-04 09:30:00')
DataFrame.diff now takes an axis parameter that determines the direction of differencing (GH9727)
Allow clip, clip_lower, and clip_upper to accept array-like arguments as thresholds (This is a regression from 0.11.0). These methods now have an axis parameter which determines how the Series or DataFrame will be aligned with the threshold(s). (GH6966)
DataFrame.mask() and Series.mask() now support the same keywords as where (GH8801)
drop function can now accept errors keyword to suppress ValueError raised when any label does not exist in the target data. (GH6736)
In [27]: df = pd.DataFrame(np.random.randn(3, 3), columns=["A", "B", "C"])
In [28]: df.drop(["A", "X"], axis=1, errors="ignore")
Out[28]:
B C
0 -0.706771 -1.039575
1 -0.424972 0.567020
2 -1.087401 -0.673690
[3 rows x 2 columns]
Add support for separating years and quarters using dashes, for example 2014-Q1. (GH9688)
Allow conversion of values with dtype datetime64 or timedelta64 to strings using astype(str) (GH9757)
get_dummies function now accepts sparse keyword. If set to True, the return DataFrame is sparse, e.g. SparseDataFrame. (GH8823)
Period now accepts datetime64 as value input. (GH9054)
Allow timedelta string conversion when leading zero is missing from time definition, ie 0:00:00 vs 00:00:00. (GH9570)
Allow Panel.shift with axis='items' (GH9890)
Trying to write an excel file now raises NotImplementedError if the DataFrame has a MultiIndex instead of writing a broken Excel file. (GH9794)
Allow Categorical.add_categories to accept Series or np.array. (GH9927)
Add/delete str/dt/cat accessors dynamically from __dir__. (GH9910)
Add normalize as a dt accessor method. (GH10047)
DataFrame and Series now have _constructor_expanddim property as overridable constructor for one higher dimensionality data. This should be used only when it is really needed, see here
pd.lib.infer_dtype now returns 'bytes' in Python 3 where appropriate. (GH10032)
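A few of the keyword additions in this list can be tried together. The following is a minimal sketch on a small hand-made frame; the column names and the threshold of 5 are hypothetical and not taken from the release notes.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 4, 9], "B": [2, 6, 12]})

# axis=1 differences across columns (B minus A) instead of down the rows.
col_diff = df.diff(axis=1)
print(col_diff)

# mask accepts the same keywords as where; here values above 5 become -1.
masked = df.mask(df > 5, other=-1)
print(masked)

# errors="ignore" skips labels that are absent ("X") instead of raising.
dropped = df.drop(["A", "X"], axis=1, errors="ignore")
print(list(dropped.columns))
```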
API changes¶
When passing in an ax to df.plot( ..., ax=ax), the sharex kwarg will now default to False. The result is that the visibility of xlabels and xticklabels will no longer be changed. You have to do that yourself for the right axes in your figure or set sharex=True explicitly (but this changes the visibility for all axes in the figure, not only the one which is passed in!). If pandas creates the subplots itself (e.g. no passed in ax kwarg), then the default is still sharex=True and the visibility changes are applied.
assign() now inserts new columns in alphabetical order. Previously the order was arbitrary. (GH9777)
By default, read_csv and read_table will now try to infer the compression type based on the file extension. Set compression=None to restore the previous behavior (no decompression). (GH9770)
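The compression inference can be checked with a throwaway gzip file. This is a minimal sketch; the file name example.csv.gz and its contents are invented for illustration, and the file is written with the standard-library gzip module rather than with pandas.

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a small gzip-compressed CSV into a temporary directory.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "example.csv.gz")
with gzip.open(gz_path, "wt") as handle:
    handle.write("a,b\n1,2\n3,4\n")

# No compression argument: the ".gz" extension is used to pick the codec.
inferred = pd.read_csv(gz_path)

# Naming the codec explicitly should give the same frame.
explicit = pd.read_csv(gz_path, compression="gzip")
print(inferred.equals(explicit))
```

Passing compression=None for such a file asks pandas to read the compressed bytes as if they were plain text, which is the pre-0.16.1 behaviour the note refers to.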
Index representation¶
The string representation of Index and its sub-classes has now been unified. These will show a single-line display if there are few values; a wrapped multi-line display for a lot of values (but fewer than display.max_seq_items); and a truncated display (the head and tail of the data) if there are lots of items (> display.max_seq_items). The formatting for MultiIndex is unchanged (a multi-line wrapped display). The display width responds to the option display.max_seq_items, which defaults to 100. (GH6482)
Previous behavior
In [2]: pd.Index(range(4), name='foo')
Out[2]: Int64Index([0, 1, 2, 3], dtype='int64')
In [3]: pd.Index(range(104), name='foo')
Out[3]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')
In [4]: pd.date_range('20130101', periods=4, name='foo', tz='US/Eastern')
Out[4]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-01-04 00:00:00-05:00]
Length: 4, Freq: D, Timezone: US/Eastern
In [5]: pd.date_range('20130101', periods=104, name='foo', tz='US/Eastern')
Out[5]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-04-14 00:00:00-04:00]
Length: 104, Freq: D, Timezone: US/Eastern
New behavior
In [29]: pd.set_option("display.width", 80)
In [30]: pd.Index(range(4), name="foo")
Out[30]: RangeIndex(start=0, stop=4, step=1, name='foo')
In [31]: pd.Index(range(30), name="foo")
Out[31]: RangeIndex(start=0, stop=30, step=1, name='foo')
In [32]: pd.Index(range(104), name="foo")
Out[32]: RangeIndex(start=0, stop=104, step=1, name='foo')
In [33]: pd.CategoricalIndex(["a", "bb", "ccc", "dddd"], ordered=True, name="foobar")
Out[33]: CategoricalIndex(['a', 'bb', 'ccc', 'dddd'], categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, name='foobar', dtype='category')
In [34]: pd.CategoricalIndex(["a", "bb", "ccc", "dddd"] * 10, ordered=True, name="foobar")
Out[34]:
CategoricalIndex(['a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a',
'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb',
'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc',
'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd',
'a', 'bb', 'ccc', 'dddd'],
categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, name='foobar', dtype='category')
In [35]: pd.CategoricalIndex(["a", "bb", "ccc", "dddd"] * 100, ordered=True, name="foobar")
Out[35]:
CategoricalIndex(['a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a',
'bb',
...
'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc',
'dddd'],
categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, name='foobar', dtype='category', length=400)
In [36]: pd.date_range("20130101", periods=4, name="foo", tz="US/Eastern")
Out[36]:
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
'2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00'],
dtype='datetime64[ns, US/Eastern]', name='foo', freq='D')
In [37]: pd.date_range("20130101", periods=25, freq="D")
Out[37]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
'2013-01-09', '2013-01-10', '2013-01-11', '2013-01-12',
'2013-01-13', '2013-01-14', '2013-01-15', '2013-01-16',
'2013-01-17', '2013-01-18', '2013-01-19', '2013-01-20',
'2013-01-21', '2013-01-22', '2013-01-23', '2013-01-24',
'2013-01-25'],
dtype='datetime64[ns]', freq='D')
In [38]: pd.date_range("20130101", periods=104, name="foo", tz="US/Eastern")
Out[38]:
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
'2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00',
'2013-01-05 00:00:00-05:00', '2013-01-06 00:00:00-05:00',
'2013-01-07 00:00:00-05:00', '2013-01-08 00:00:00-05:00',
'2013-01-09 00:00:00-05:00', '2013-01-10 00:00:00-05:00',
...
'2013-04-05 00:00:00-04:00', '2013-04-06 00:00:00-04:00',
'2013-04-07 00:00:00-04:00', '2013-04-08 00:00:00-04:00',
'2013-04-09 00:00:00-04:00', '2013-04-10 00:00:00-04:00',
'2013-04-11 00:00:00-04:00', '2013-04-12 00:00:00-04:00',
'2013-04-13 00:00:00-04:00', '2013-04-14 00:00:00-04:00'],
dtype='datetime64[ns, US/Eastern]', name='foo', length=104, freq='D')
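The effect of display.max_seq_items on truncation can be observed by changing the option temporarily. This is a minimal sketch, assuming an Index of 50 made-up string labels and limits of 100 and 10 chosen only for illustration.

```python
import pandas as pd

labels = pd.Index(["item%d" % i for i in range(50)])

# Fewer values than the limit: every element appears in the repr.
with pd.option_context("display.max_seq_items", 100):
    full_text = repr(labels)

# More values than the limit: only the head and tail appear.
with pd.option_context("display.max_seq_items", 10):
    short_text = repr(labels)

print(short_text)
```

option_context restores the previous setting on exit, so the experiment does not affect later output.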
Performance improvements¶
Bug fixes¶
Bug where labels did not appear properly in the legend of DataFrame.plot(), passing label= arguments works, and Series indices are no longer mutated. (GH9542)
Bug in json serialization causing a segfault when a frame had zero length. (GH9805)
Bug in read_csv where missing trailing delimiters would cause segfault. (GH5664)
Bug in retaining index name on appending (GH9862)
Bug in scatter_matrix draws unexpected axis ticklabels (GH5662)
Fixed bug in StataWriter resulting in changes to input DataFrame upon save (GH9795).
Bug in transform causing length mismatch when null entries were present and a fast aggregator was being used (GH9697)
Bug in equals causing false negatives when block order differed (GH9330)
Bug in grouping with multiple pd.Grouper where one is non-time based (GH10063)
Bug in read_sql_table error when reading postgres table with timezone (GH7139)
Bug in DataFrame slicing may not retain metadata (GH9776)
Bug where TimedeltaIndex were not properly serialized in fixed HDFStore (GH9635)
Bug with TimedeltaIndex constructor ignoring name when given another TimedeltaIndex as data (GH10025).
Bug in DataFrameFormatter._get_formatted_index with not applying max_colwidth to the DataFrame index (GH7856)
Bug in .loc with a read-only ndarray data source (GH10043)
Bug in groupby.apply() that would raise if a passed user defined function returned only None (for all input). (GH9685)
Always use temporary files in pytables tests (GH9992)
Bug in plotting continuously using secondary_y may not show legend properly. (GH9610, GH9779)
Bug in DataFrame.plot(kind="hist") results in TypeError when DataFrame contains non-numeric columns (GH9853)
Bug where repeated plotting of DataFrame with a DatetimeIndex may raise TypeError (GH9852)
Bug in setup.py that would allow an incompat cython version to build (GH9827)
Bug in plotting secondary_y incorrectly attaches right_ax property to secondary axes specifying itself recursively. (GH9861)
Bug in Series.quantile on empty Series of type Datetime or Timedelta (GH9675)
Bug in where causing incorrect results when upcasting was required (GH9731)
Bug in FloatArrayFormatter where decision boundary for displaying “small” floats in decimal format is off by one order of magnitude for a given display.precision (GH9764)
Fixed bug where DataFrame.plot() raised an error when both color and style keywords were passed and there was no color symbol in the style strings (GH9671)
Not showing a DeprecationWarning on combining list-likes with an Index (GH10083)
Bug in read_csv and read_table when using skip_rows parameter if blank lines are present. (GH9832)
Bug in read_csv() interprets index_col=True as 1 (GH9798)
Bug in index equality comparisons using == failing on Index/MultiIndex type incompatibility (GH9785)
Bug in which SparseDataFrame could not take nan as a column name (GH8822)
Bug in to_msgpack and read_msgpack zlib and blosc compression support (GH9783)
Bug GroupBy.size doesn’t attach index name properly if grouped by TimeGrouper (GH9925)
Bug causing an exception in slice assignments because length_of_indexer returns wrong results (GH9995)
Bug in csv parser causing lines with initial white space plus one non-space character to be skipped. (GH9710)
Bug in C csv parser causing spurious NaNs when data started with newline followed by white space. (GH10022)
Bug causing elements with a null group to spill into the final group when grouping by a Categorical (GH9603)
Bug where .iloc and .loc behavior is not consistent on empty dataframes (GH9964)
Bug in invalid attribute access on a TimedeltaIndex incorrectly raised ValueError instead of AttributeError (GH9680)
Bug in unequal comparisons between categorical data and a scalar, which was not in the categories (e.g. Series(Categorical(list("abc"), ordered=True)) > "d"). This returned False for all elements, but now raises a TypeError. Equality comparisons also now return False for == and True for !=. (GH9848)
Bug in DataFrame __setitem__ when right hand side is a dictionary (GH9874)
Bug in where when dtype is datetime64/timedelta64, but dtype of other is not (GH9804)
Bug in MultiIndex.sortlevel() results in unicode level name breaks (GH9856)
Bug in which groupby.transform incorrectly enforced output dtypes to match input dtypes. (GH9807)
Bug in DataFrame constructor when columns parameter is set, and data is an empty list (GH9939)
Bug in bar plot with log=True raises TypeError if all values are less than 1 (GH9905)
Bug in horizontal bar plot ignores log=True (GH9905)
Bug in PyTables queries that did not return proper results using the index (GH8265, GH9676)
Bug where dividing a dataframe containing values of type Decimal by another Decimal would raise. (GH9787)
Bug where using DataFrames asfreq would remove the name of the index. (GH9885)
Bug causing extra index point when resample BM/BQ (GH9756)
Changed caching in AbstractHolidayCalendar to be at the instance level rather than at the class level as the latter can result in unexpected behaviour. (GH9552)
Fixed latex output for MultiIndexed dataframes (GH9778)
Bug causing an exception when setting an empty range using DataFrame.loc (GH9596)
Bug in hiding ticklabels with subplots and shared axes when adding a new plot to an existing grid of axes (GH9158)
Bug in transform and filter when grouping on a categorical variable (GH9921)
Bug in transform when groups are equal in number and dtype to the input index (GH9700)
Google BigQuery connector now imports dependencies on a per-method basis. (GH9713)
Updated BigQuery connector to no longer use deprecated oauth2client.tools.run() (GH8327)
Bug in subclassed DataFrame. It may not return the correct class, when slicing or subsetting it. (GH9632)
Bug in .median() where non-float null values are not handled correctly (GH10040)
Bug in Series.fillna() where it raises if a numerically convertible string is given (GH10092)
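The categorical comparison fix in the list above (GH9848) changes behaviour that user code can observe, so a small check may help. This is a minimal sketch built on the expression quoted in that entry; the variable names are illustrative.

```python
import pandas as pd

cat_series = pd.Series(pd.Categorical(list("abc"), ordered=True))

# Equality against a value outside the categories is False for every element,
# and inequality is True for every element.
equal_mask = cat_series == "d"
not_equal_mask = cat_series != "d"
print(equal_mask.tolist(), not_equal_mask.tolist())

# An ordering comparison against such a value has no meaning, so it raises.
try:
    cat_series > "d"
    raised = False
except TypeError:
    raised = True
print(raised)
```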
Contributors¶
A total of 58 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
Alfonso MHC +
Andy Hayden
Artemy Kolchinsky
Chris Gilmer +
Chris Grinolds +
Dan Birken
David BROCHART +
David Hirschfeld +
David Stephens
Dr. Leo +
Evan Wright +
Frans van Dunné +
Hatem Nassrat +
Henning Sperr +
Hugo Herter +
Jan Schulz
Jeff Blackburne +
Jeff Reback
Jim Crist +
Jonas Abernot +
Joris Van den Bossche
Kerby Shedden
Leo Razoumov +
Manuel Riel +
Mortada Mehyar
Nick Burns +
Nick Eubank +
Olivier Grisel
Phillip Cloud
Pietro Battiston
Roy Hyunjin Han
Sam Zhang +
Scott Sanderson +
Sinhrks +
Stephan Hoyer
Tiago Antao
Tom Ajamian +
Tom Augspurger
Tomaz Berisa +
Vikram Shirgur +
Vladimir Filimonov
William Hogman +
Yasin A +
Younggun Kim +
behzad nouri
dsm054
floydsoft +
flying-sheep +
gfr +
jnmclarty
jreback
ksanghai +
lucas +
mschmohl +
ptype +
rockg
scls19fr +
sinhrks