v0.16.1 (May 11, 2015)
This is a minor bug-fix release from 0.16.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.
Highlights include:
- Support for a CategoricalIndex, a category based index, see here
- New section on how-to-contribute to pandas, see here
- Revised “Merge, join, and concatenate” documentation, including graphical examples to make it easier to understand each operation, see here
- New method sample for drawing random samples from Series, DataFrames and Panels, see here
- The default Index printing has changed to a more uniform format, see here
- BusinessHour datetime-offset is now supported, see here
- Further enhancement to the .str accessor to make string operations easier, see here
What’s new in v0.16.1
Warning
In pandas 0.17.0, the sub-package pandas.io.data will be removed in favor of a separately installable package (GH8961).
Enhancements
CategoricalIndex
We introduce a CategoricalIndex, a new type of index object that is useful for supporting indexing with duplicates. This is a container around a Categorical (introduced in v0.15.0) and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1, setting the index of a DataFrame/Series with a category dtype would convert this to a regular object-based Index.
In [1]: df = pd.DataFrame({'A': np.arange(6),
   ...:                    'B': pd.Series(list('aabbca'))
   ...:                    .astype('category', categories=list('cab'))
   ...:                    })
   ...:
In [2]: df
Out[2]:
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a
In [3]: df.dtypes
Out[3]:
A       int64
B    category
dtype: object
In [4]: df.B.cat.categories
Out[4]: Index(['c', 'a', 'b'], dtype='object')
Setting the index will create a CategoricalIndex:
In [5]: df2 = df.set_index('B')
In [6]: df2.index
Out[6]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
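The session above uses the `astype('category', categories=...)` spelling from this release, which later pandas versions may not accept. As a minimal sketch, assuming only that `pd.Categorical` and `DataFrame.set_index` behave as described in this section, the same frame can be built by constructing the `Categorical` directly:

```python
import numpy as np
import pandas as pd

# Build the category column explicitly; 'c', 'a', 'b' is the category order.
cat = pd.Categorical(list('aabbca'), categories=list('cab'))
df = pd.DataFrame({'A': np.arange(6), 'B': cat})

# Moving the category column into the index yields a CategoricalIndex.
df2 = df.set_index('B')
print(type(df2.index).__name__)
print(list(df2.index.categories))
```

The category order is carried over from the column to the index unchanged.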
Indexing with __getitem__/.iloc/.loc/.ix works similarly to an Index with duplicates. The indexers MUST be in the categories or the operation will raise.
In [7]: df2.loc['a']
Out[7]:
   A
B
a  0
a  1
a  5
Indexing also preserves the CategoricalIndex:
In [8]: df2.loc['a'].index
Out[8]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
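To illustrate the rule that indexers must be in the categories, here is a small sketch. The helper `safe_loc` is hypothetical (not part of pandas), and the sketch assumes that a label outside the categories surfaces as a `KeyError`, which may differ between pandas versions:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {'A': np.arange(6)},
    index=pd.CategoricalIndex(list('aabbca'), categories=list('cab'), name='B'),
)


def safe_loc(frame, label):
    """Hypothetical helper: frame.loc[label], or None if the lookup raises KeyError."""
    try:
        return frame.loc[label]
    except KeyError:
        return None


hit = safe_loc(df2, 'a')   # 'a' is a category, so the matching rows come back
miss = safe_loc(df2, 'z')  # 'z' is not a category, so the lookup raises
print(hit)
print(miss)
```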
Sorting orders the result by the order of the categories:
In [9]: df2.sort_index()
Out[9]:
   A
B
c  4
a  0
a  1
a  5
b  2
b  3
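As a rough comparison, the sketch below sorts the same labels twice: once as a CategoricalIndex with categories `['c', 'a', 'b']` and once as a plain Index. It assumes `sort_index` follows the category order as described above, whereas the plain Index sorts lexically:

```python
import numpy as np
import pandas as pd

labels = list('aabbca')
df_cat = pd.DataFrame(
    {'A': np.arange(6)},
    index=pd.CategoricalIndex(labels, categories=list('cab'), name='B'),
)
df_plain = pd.DataFrame({'A': np.arange(6)}, index=pd.Index(labels, name='B'))

# Category order ('c', 'a', 'b') drives the categorical sort.
cat_sorted = list(df_cat.sort_index().index)
# A plain Index of strings sorts alphabetically instead.
plain_sorted = list(df_plain.sort_index().index)

print(cat_sorted)
print(plain_sorted)
```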
Groupby operations on the index preserve its categorical nature as well:
In [10]: df2.groupby(level=0).sum()
Out[10]:
   A
B
c  4
a  6
b  5
In [11]: df2.groupby(level=0).sum().index
Out[11]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
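The groupby behaviour can be checked with a short sketch. The `observed=False` keyword is an assumption about newer pandas versions (it did not exist in 0.16.1) and is passed here only to make the handling of categories explicit:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {'A': np.arange(6)},
    index=pd.CategoricalIndex(list('aabbca'), categories=list('cab'), name='B'),
)

# Group on the index level; the result keeps a CategoricalIndex.
totals = df2.groupby(level=0, observed=False).sum()

print(totals)
print(type(totals.index).__name__)
```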
Reindexing operations return a resulting index based on the type of the passed indexer: passing a list returns a plain-old Index, while indexing with a Categorical returns a CategoricalIndex, indexed according to the categories of the PASSED Categorical dtype. This allows one to arbitrarily index these even with values NOT in the categories, similarly to how you can reindex ANY pandas index.
In [12]: df2.reindex(['a', 'e'])
Out[12]:
     A
B
a  0.0
a  1.0
a  5.0
e  NaN
In [13]: df2.reindex(['a', 'e']).index
Out[13]: pd.Index(['a', 'a', 'a', 'e'], dtype='object', name='B')
In [14]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde')))
Out[14]:
     A
B
a  0.0
a  1.0
a  5.0
e  NaN
In [15]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde'))).index
Out[15]: pd.CategoricalIndex(['a', 'a', 'a', 'e'],
                             categories=['a', 'b', 'c', 'd', 'e'],
                             ordered=False, name='B',
                             dtype='category')
See the documentation for more. (GH7629, GH10038, GH10039)
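The dependence of the result type on the passed indexer can be sketched as follows. This example deliberately uses a frame with unique labels (named `df3` here, an illustrative choice), since some newer pandas versions may reject reindexing an axis that contains duplicate labels:

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame(
    {'A': np.arange(3)},
    index=pd.CategoricalIndex(list('abc'), name='B'),
)

# A plain list as the indexer: the result carries a plain Index.
by_list = df3.reindex(['a', 'e'])

# A Categorical as the indexer: the result carries a CategoricalIndex
# whose categories come from the passed Categorical, not from df3.
by_cat = df3.reindex(pd.Categorical(['a', 'e'], categories=list('abe')))

print(type(by_list.index).__name__)
print(type(by_cat.index).__name__, list(by_cat.index.categories))
```

In both cases the label `'e'`, which is not present in `df3`, produces a row of missing values.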
Sample
Series, DataFrames, and Panels now have a new method: sample(). The method accepts a specific number of rows or columns to return, or a fraction of the total number of rows or columns. It also has options for sampling with or without replacement, for passing in a column of weights for non-uniform sampling, and for setting seed values to facilitate replication. (GH2419)
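The options listed above can be sketched together on a small DataFrame. The column names `value` and `w` are illustrative assumptions, and the seed `0` is arbitrary:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(6), 'w': [0, 0, 0, 1, 0, 0]})

# A fixed seed makes the draw repeatable.
three = df.sample(n=3, random_state=0)
again = df.sample(n=3, random_state=0)

# A fraction of the rows instead of an absolute count.
half = df.sample(frac=0.5, random_state=0)

# Sampling with replacement allows more draws than there are rows.
boot = df.sample(n=10, replace=True, random_state=0)

# Weights taken from a column; only row 3 has non-zero weight here.
picked = df.sample(n=1, weights='w', random_state=0)

print(three.equals(again), len(half), len(boot), picked.index.tolist())
```

Because all the weight sits on a single row, the weighted draw always returns that row regardless of the seed.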
In [1]: example_series = pd.Series([0, 1, 2, 3, 4, 5])
# When no arguments are passed, returns 1 row
In [2]: example_series.sample()
Out[2]:
3 3
Length: 1, dtype: int64
# One may specify either a number of rows:
In [3]: example_series.sample(n=3)