MultiIndex / Advanced Indexing¶
This section covers indexing with a MultiIndex
and more advanced indexing features.
See the Indexing and Selecting Data for general indexing documentation.
Warning
Whether a copy or a reference is returned for a setting operation, may
depend on the context. This is sometimes called chained assignment
and
should be avoided. See Returning a View versus Copy
Warning
In 0.15.0 Index
has internally been refactored to no longer sub-class ndarray
but instead subclass PandasObject
, similarly to the rest of the pandas objects. This should be
a transparent change with only very limited API implications (See the Internal Refactoring)
See the cookbook for some advanced strategies
Hierarchical indexing (MultiIndex)¶
Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).
In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with the all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.
See the cookbook for some advanced strategies
Creating a MultiIndex (hierarchical index) object¶
The MultiIndex
object is the hierarchical analogue of the standard
Index
object which typically stores the axis labels in pandas objects. You
can think of MultiIndex
an array of tuples where each tuple is unique. A
MultiIndex
can be created from a list of arrays (using
MultiIndex.from_arrays
), an array of tuples (using
MultiIndex.from_tuples
), or a crossed set of iterables (using
MultiIndex.from_product
). The Index
constructor will attempt to return
a MultiIndex
when it is passed a list of tuples. The following examples
demo different ways to initialize MultiIndexes.
In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
...: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
...:
In [2]: tuples = list(zip(*arrays))
In [3]: tuples
Out[3]:
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
In [4]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [5]: index
Out[5]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=[u'first', u'second'])
In [6]: s = pd.Series(np.random.randn(8), index=index)
In [7]: s
Out[7]:
first second
bar one 0.469112
two -0.282863
baz one -1.509059
two -1.135632
foo one 1.212112
two -0.173215
qux one 0.119209
two -1.044236
dtype: float64
When you want every pairing of the elements in two iterables, it can be easier
to use the MultiIndex.from_product
function:
In [8]: iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
In [9]: pd.MultiIndex.from_product(iterables, names=['first', 'second'])
Out[9]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=[u'first', u'second'])
As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:
In [10]: arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
....: np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
....:
In [11]: s = pd.Series(np.random.randn(8), index=arrays)
In [12]: s
Out[12]:
bar one -0.861849
two -2.104569
baz one -0.494929
two 1.071804
foo one 0.721555
two -0.706771
qux one -1.039575
two 0.271860
dtype: float64
In [13]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
In [14]: df
Out[14]:
0 1 2 3
bar one -0.424972 0.567020 0.276232 -1.087401
two -0.673690 0.113648 -1.478427 0.524988
baz one 0.404705 0.577046 -1.715002 -1.039268
two -0.370647 -1.157892 -1.344312 0.844885
foo one 1.075770 -0.109050 1.643563 -1.469388
two 0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524 0.413738 0.276662 -0.472035
two -0.013960 -0.362543 -0.006154 -0.923061
All of the MultiIndex
constructors accept a names
argument which stores
string names for the levels themselves. If no names are provided, None
will
be assigned:
In [15]: df.index.names
Out[15]: FrozenList([None, None])
This index can back any axis of a pandas object, and the number of levels of the index is up to you:
In [16]: df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
In [17]: df
Out[17]:
first bar baz foo qux \
second one two one two one two one
A 0.895717 0.805244 -1.206412 2.565646 1.431256 1.340309 -1.170299
B 0.410835 0.813850 0.132003 -0.827317 -0.076467 -1.187678 1.130127
C -1.413681 1.607920 1.024180 0.569605 0.875906 -2.211372 0.974466
first
second two
A -0.226169
B -1.436737
C -2.006747
In [18]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[18]:
first bar baz foo
second one two one two one two
first second
bar one -0.410001 -0.078638 0.545952 -1.219217 -1.226825 0.769804
two -1.281247 -0.727707 -0.121306 -0.097883 0.695775 0.341734
baz one 0.959726 -1.110336 -0.619976 0.149748 -0.732339 0.687738
two 0.176444 0.403310 -0.154951 0.301624 -2.179861 -1.369849
foo one -0.954208 1.462696 -1.743161 -0.826591 -0.345352 1.314232
two 0.690579 0.995761 2.396780 0.014871 3.357427 -0.317441
We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes.
It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:
In [19]: pd.Series(np.random.randn(8), index=tuples)
Out[19]:
(bar, one) -1.236269
(bar, two) 0.896171
(baz, one) -0.487602
(baz, two) -0.082240
(foo, one) -2.182937
(foo, two) 0.380396
(qux, one) 0.084844
(qux, two) 0.432390
dtype: float64
The reason that the MultiIndex
matters is that it can allow you to do
grouping, selection, and reshaping operations as we will describe below and in
subsequent areas of the documentation. As you will see in later sections, you
can find yourself working with hierarchically-indexed data without creating a
MultiIndex
explicitly yourself. However, when loading data from a file, you
may wish to generate your own MultiIndex
when preparing the data set.
Note that how the index is displayed by be controlled using the
multi_sparse
option in pandas.set_printoptions
:
In [20]: pd.set_option('display.multi_sparse', False)
In [21]: df
Out[21]:
first bar bar baz baz foo foo qux \
second one two one two one two one
A 0.895717 0.805244 -1.206412 2.565646 1.431256 1.340309 -1.170299
B 0.410835 0.813850 0.132003 -0.827317 -0.076467 -1.187678 1.130127
C -1.413681 1.607920 1.024180 0.569605 0.875906 -2.211372 0.974466
first qux
second two
A -0.226169
B -1.436737
C -2.006747
In [22]: pd.set_option('display.multi_sparse', True)
Reconstructing the level labels¶
The method get_level_values
will return a vector of the labels for each
location at a particular level:
In [23]: index.get_level_values(0)
Out[23]: Index([u'bar', u'bar', u'baz', u'baz', u'foo', u'foo', u'qux', u'qux'], dtype='object', name=u'first')
In [24]: index.get_level_values('second')
Out[24]: Index([u'one', u'two', u'one', u'two', u'one', u'two', u'one', u'two'], dtype='object', name=u'second')
Basic indexing on axis with MultiIndex¶
One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:
In [25]: df['bar']
Out[25]:
second one two
A 0.895717 0.805244
B 0.410835 0.813850
C -1.413681 1.607920
In [26]: df['bar', 'one']
Out[26]:
A 0.895717
B 0.410835
C -1.413681
Name: (bar, one), dtype: float64
In [27]: df['bar']['one']
Out[27]:
A 0.895717
B 0.410835
C -1.413681
Name: one, dtype: float64
In [28]: s['qux']
Out[28]:
one -1.039575
two 0.271860
dtype: float64
See Cross-section with hierarchical index for how to select on a deeper level.
Note
The repr of a MultiIndex
shows ALL the defined levels of an index, even
if the they are not actually used. When slicing an index, you may notice this.
For example:
# original multi-index
In [29]: df.columns
Out[29]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=[u'first', u'second'])
# sliced
In [30]: df[['foo','qux']].columns
Out[30]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[2, 2, 3, 3], [0, 1, 0, 1]],
names=[u'first', u'second'])
This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see the actual used levels.
In [31]: df[['foo','qux']].columns.values
Out[31]: array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')], dtype=object)
# for a specific level
In [32]: df[['foo','qux']].columns.get_level_values(0)
Out[32]: Index([u'foo', u'foo', u'qux', u'qux'], dtype='object', name=u'first')
To reconstruct the multiindex with only the used levels
In [33]: pd.MultiIndex.from_tuples(df[['foo','qux']].columns.values)
Out[33]:
MultiIndex(levels=[[u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
Data alignment and using reindex
¶
Operations between differently-indexed objects having MultiIndex
on the
axes will work as you expect; data alignment will work the same as an Index of
tuples:
In [34]: s + s[:-2]
Out[34]:
bar one -1.723698
two -4.209138
baz one -0.989859
two 2.143608
foo one 1.443110
two -1.413542
qux one NaN
two NaN
dtype: float64
In [35]: s + s[::2]
Out[35]:
bar one -1.723698
two NaN
baz one -0.989859
two NaN
foo one 1.443110
two NaN
qux one -2.079150
two NaN
dtype: float64
reindex
can be called with another MultiIndex
or even a list or array
of tuples:
In [36]: s.reindex(index[:3])
Out[36]:
first second
bar one -0.861849
two -2.104569
baz one -0.494929
dtype: float64
In [37]: s.reindex([('foo', 'two'), ('bar', 'one'), ('qux', 'one'), ('baz', 'one')])
Out[37]:
foo two -0.706771
bar one -0.861849
qux one -1.039575
baz one -0.494929
dtype: float64
Advanced indexing with hierarchical index¶
Syntactically integrating MultiIndex
in advanced indexing with .loc/.ix
is a
bit challenging, but we’ve made every effort to do so. for example the
following works as you would expect:
In [38]: df = df.T
In [39]: df
Out[39]:
A B C
first second
bar one 0.895717 0.410835 -1.413681
two 0.805244 0.813850 1.607920
baz one -1.206412 0.132003 1.024180
two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
qux one -1.170299 1.130127 0.974466
two -0.226169 -1.436737 -2.006747
In [40]: df.loc['bar']
Out[40]:
A B C
second
one 0.895717 0.410835 -1.413681
two 0.805244 0.813850 1.607920
In [41]: df.loc['bar', 'two']
Out[41]:
A 0.805244
B 0.813850
C 1.607920
Name: (bar, two), dtype: float64
“Partial” slicing also works quite nicely.
In [42]: df.loc['baz':'foo']
Out[42]:
A B C
first second
baz one -1.206412 0.132003 1.024180
two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
You can slice with a ‘range’ of values, by providing a slice of tuples.
In [43]: df.loc[('baz', 'two'):('qux', 'one')]
Out[43]:
A B C
first second
baz two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
qux one -1.170299 1.130127 0.974466
In [44]: df.loc[('baz', 'two'):'foo']
Out[44]:
A B C
first second
baz two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
Passing a list of labels or tuples works similar to reindexing:
In [45]: df.ix[[('bar', 'two'), ('qux', 'one')]]
Out[45]:
A B C
first second
bar two 0.805244 0.813850 1.607920
qux one -1.170299 1.130127 0.974466
Using slicers¶
New in version 0.14.0.
In 0.14.0 we added a new way to slice multi-indexed objects. You can slice a multi-index by providing multiple indexers.
You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers.
You can use slice(None)
to select all the contents of that level. You do not need to specify all the
deeper levels, they will be implied as slice(None)
.
As usual, both sides of the slicers are included as this is label indexing.
Warning
You should specify all axes in the .loc
specifier, meaning the indexer for the index and
for the columns. There are some ambiguous cases where the passed indexer could be mis-interpreted
as indexing both axes, rather than into say the MuliIndex for the rows.
You should do this:
df.loc[(slice('A1','A3'),.....),:]
rather than this:
df.loc[(slice('A1','A3'),.....)]
In [46]: def mklbl(prefix,n):
....: return ["%s%s" % (prefix,i) for i in range(n)]
....:
In [47]: miindex = pd.MultiIndex.from_product([mklbl('A',4),
....: mklbl('B',2),
....: mklbl('C',4),
....: mklbl('D',2)])
....:
In [48]: micolumns = pd.MultiIndex.from_tuples([('a','foo'),('a','bar'),
....: ('b','foo'),('b','bah')],
....: names=['lvl0', 'lvl1'])
....:
In [49]: dfmi = pd.DataFrame(np.arange(len(miindex)*len(micolumns)).reshape((len(miindex),len(micolumns))),
....: index=miindex,
....: columns=micolumns).sort_index().sort_index(axis=1)
....:
In [50]: dfmi
Out[50]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 9 8 11 10
D1 13 12 15 14
C2 D0 17 16 19 18
D1 21 20 23 22
C3 D0 25 24 27 26
... ... ... ... ...
A3 B1 C0 D1 229 228 231 230
C1 D0 233 232 235 234
D1 237 236 239 238
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 249 248 251 250
D1 253 252 255 254
[64 rows x 4 columns]
Basic multi-index slicing using slices, lists, and labels.
In [51]: dfmi.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:]
Out[51]:
lvl0 a b
lvl1 bar foo bah foo
A1 B0 C1 D0 73 72 75 74
D1 77 76 79 78
C3 D0 89 88 91 90
D1 93 92 95 94
B1 C1 D0 105 104 107 106
D1 109 108 111 110
C3 D0 121 120 123 122
... ... ... ... ...
A3 B0 C1 D1 205 204 207 206
C3 D0 217 216 219 218
D1 221 220 223 222
B1 C1 D0 233 232 235 234
D1 237 236 239 238
C3 D0 249 248 251 250
D1 253 252 255 254
[24 rows x 4 columns]
You can use a pd.IndexSlice
to have a more natural syntax using :
rather than using slice(None)
In [52]: idx = pd.IndexSlice
In [53]: dfmi.loc[idx[:,:,['C1','C3']],idx[:,'foo']]
Out[53]:
lvl0 a b
lvl1 foo foo
A0 B0 C1 D0 8 10
D1 12 14
C3 D0 24 26
D1 28 30
B1 C1 D0 40 42
D1 44 46
C3 D0 56 58
... ... ...
A3 B0 C1 D1 204 206
C3 D0 216 218
D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
[32 rows x 2 columns]
It is possible to perform quite complicated selections using this method on multiple axes at the same time.
In [54]: dfmi.loc['A1',(slice(None),'foo')]
Out[54]:
lvl0 a b
lvl1 foo foo
B0 C0 D0 64 66
D1 68 70
C1 D0 72 74
D1 76 78
C2 D0 80 82
D1 84 86
C3 D0 88 90
... ... ...
B1 C0 D1 100 102
C1 D0 104 106
D1 108 110
C2 D0 112 114
D1 116 118
C3 D0 120 122
D1 124 126
[16 rows x 2 columns]
In [55]: dfmi.loc[idx[:,:,['C1','C3']],idx[:,'foo']]
Out[55]:
lvl0 a b
lvl1 foo foo
A0 B0 C1 D0 8 10
D1 12 14
C3 D0 24 26
D1 28 30
B1 C1 D0 40 42
D1 44 46
C3 D0 56 58
... ... ...
A3 B0 C1 D1 204 206
C3 D0 216 218
D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
[32 rows x 2 columns]
Using a boolean indexer you can provide selection related to the values.
In [56]: mask = dfmi[('a','foo')]>200
In [57]: dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']]
Out[57]:
lvl0 a b
lvl1 foo foo
A3 B0 C1 D1 204 206
C3 D0 216 218
D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
You can also specify the axis
argument to .loc
to interpret the passed
slicers on a single axis.
In [58]: dfmi.loc(axis=0)[:,:,['C1','C3']]
Out[58]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C1 D0 9 8 11 10
D1 13 12 15 14
C3 D0 25 24 27 26
D1 29 28 31 30
B1 C1 D0 41 40 43 42
D1 45 44 47 46
C3 D0 57 56 59 58
... ... ... ... ...
A3 B0 C1 D1 205 204 207 206
C3 D0 217 216 219 218
D1 221 220 223 222
B1 C1 D0 233 232 235 234
D1 237 236 239 238
C3 D0 249 248 251 250
D1 253 252 255 254
[32 rows x 4 columns]
Furthermore you can set the values using these methods
In [59]: df2 = dfmi.copy()
In [60]: df2.loc(axis=0)[:,:,['C1','C3']] = -10
In [61]: df2
Out[61]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 -10 -10 -10 -10
D1 -10 -10 -10 -10
C2 D0 17 16 19 18
D1 21 20 23 22
C3 D0 -10 -10 -10 -10
... ... ... ... ...
A3 B1 C0 D1 229 228 231 230
C1 D0 -10 -10 -10 -10
D1 -10 -10 -10 -10
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 -10 -10 -10 -10
D1 -10 -10 -10 -10
[64 rows x 4 columns]
You can use a right-hand-side of an alignable object as well.
In [62]: df2 = dfmi.copy()
In [63]: df2.loc[idx[:,:,['C1','C3']],:] = df2*1000
In [64]: df2
Out[64]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 9000 8000 11000 10000
D1 13000 12000 15000 14000
C2 D0 17 16 19 18
D1 21 20 23 22
C3 D0 25000 24000 27000 26000
... ... ... ... ...
A3 B1 C0 D1 229 228 231 230
C1 D0 233000 232000 235000 234000
D1 237000 236000 239000 238000
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 249000 248000 251000 250000
D1 253000 252000 255000 254000
[64 rows x 4 columns]
Cross-section¶
The xs
method of DataFrame
additionally takes a level argument to make
selecting data at a particular level of a MultiIndex easier.
In [65]: df
Out[65]:
A B C
first second
bar one 0.895717 0.410835 -1.413681
two 0.805244 0.813850 1.607920
baz one -1.206412 0.132003 1.024180
two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
qux one -1.170299 1.130127 0.974466
two -0.226169 -1.436737 -2.006747
In [66]: df.xs('one', level='second')
Out[66]:
A B C
first
bar 0.895717 0.410835 -1.413681
baz -1.206412 0.132003 1.024180
foo 1.431256 -0.076467 0.875906
qux -1.170299 1.130127 0.974466
# using the slicers (new in 0.14.0)
In [67]: df.loc[(slice(None),'one'),:]
Out[67]:
A B C
first second
bar one 0.895717 0.410835 -1.413681
baz one -1.206412 0.132003 1.024180
foo one 1.431256 -0.076467 0.875906
qux one -1.170299 1.130127 0.974466
You can also select on the columns with xs()
, by
providing the axis argument
In [68]: df = df.T
In [69]: df.xs('one', level='second', axis=1)
Out[69]:
first bar baz foo qux
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
# using the slicers (new in 0.14.0)
In [70]: df.loc[:,(slice(None),'one')]
Out[70]:
first bar baz foo qux
second one one one one
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
xs()
also allows selection with multiple keys
In [71]: df.xs(('one', 'bar'), level=('second', 'first'), axis=1)
Out[71]:
first bar
second one
A 0.895717
B 0.410835
C -1.413681
# using the slicers (new in 0.14.0)
In [72]: df.loc[:,('bar','one')]
Out[72]:
A 0.895717
B 0.410835
C -1.413681
Name: (bar, one), dtype: float64
New in version 0.13.0.
You can pass drop_level=False
to xs()
to retain
the level that was selected
In [73]: df.xs('one', level='second', axis=1, drop_level=False)
Out[73]:
first bar baz foo qux
second one one one one
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
versus the result with drop_level=True
(the default value)
In [74]: df.xs('one', level='second', axis=1, drop_level=True)
Out[74]:
first bar baz foo qux
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
Advanced reindexing and alignment¶
The parameter level
has been added to the reindex
and align
methods
of pandas objects. This is useful to broadcast values across a level. For
instance:
In [75]: midx = pd.MultiIndex(levels=[['zero', 'one'], ['x','y']],
....: labels=[[1,1,0,0],[1,0,1,0]])
....:
In [76]: df = pd.DataFrame(np.random.randn(4,2), index=midx)
In [77]: df
Out[77]:
0 1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
In [78]: df2 = df.mean(level=0)
In [79]: df2
Out[79]:
0 1
zero 1.271532 0.713416
one 1.060074 -0.109716
In [80]: df2.reindex(df.index, level=0)
Out[80]:
0 1
one y 1.060074 -0.109716
x 1.060074 -0.109716
zero y 1.271532 0.713416
x 1.271532 0.713416
# aligning
In [81]: df_aligned, df2_aligned = df.align(df2, level=0)
In [82]: df_aligned
Out[82]:
0 1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
In [83]: df2_aligned
Out[83]:
0 1
one y 1.060074 -0.109716
x 1.060074 -0.109716
zero y 1.271532 0.713416
x 1.271532 0.713416
Swapping levels with swaplevel()
¶
The swaplevel
function can switch the order of two levels:
In [84]: df[:5]
Out[84]:
0 1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
In [85]: df[:5].swaplevel(0, 1, axis=0)
Out[85]:
0 1
y one 1.519970 -0.493662
x one 0.600178 0.274230
y zero 0.132885 -0.023688
x zero 2.410179 1.450520
Reordering levels with reorder_levels()
¶
The reorder_levels
function generalizes the swaplevel
function,
allowing you to permute the hierarchical index levels in one step:
In [86]: df[:5].reorder_levels([1,0], axis=0)
Out[86]:
0 1
y one 1.519970 -0.493662
x one 0.600178 0.274230
y zero 0.132885 -0.023688
x zero 2.410179 1.450520
Sorting a MultiIndex
¶
For MultiIndex-ed objects to be indexed & sliced effectively, they need
to be sorted. As with any index, you can use sort_index
.
In [87]: import random; random.shuffle(tuples)
In [88]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))
In [89]: s
Out[89]:
baz two 0.206053
qux one -0.251905
bar two -2.213588
one 1.063327
baz one 1.266143
qux two 0.299368
foo one -0.863838
two 0.408204
dtype: float64
In [90]: s.sort_index()
Out[90]:
bar one 1.063327
two -2.213588
baz one 1.266143
two 0.206053
foo one -0.863838
two 0.408204
qux one -0.251905
two 0.299368
dtype: float64
In [91]: s.sort_index(level=0)
Out[91]:
bar one 1.063327
two -2.213588
baz one 1.266143
two 0.206053
foo one -0.863838
two 0.408204
qux one -0.251905
two 0.299368
dtype: float64
In [92]: s.sort_index(level=1)
Out[92]:
bar one 1.063327
baz one 1.266143
foo one -0.863838
qux one -0.251905
bar two -2.213588
baz two 0.206053
foo two 0.408204
qux two 0.299368
dtype: float64
You may also pass a level name to sort_index
if the MultiIndex levels
are named.
In [93]: s.index.set_names(['L1', 'L2'], inplace=True)
In [94]: s.sort_index(level='L1')
Out[94]:
L1 L2
bar one 1.063327
two -2.213588
baz one 1.266143
two 0.206053
foo one -0.863838
two 0.408204
qux one -0.251905
two 0.299368
dtype: float64
In [95]: s.sort_index(level='L2')
Out[95]:
L1 L2
bar one 1.063327
baz one 1.266143
foo one -0.863838
qux one -0.251905
bar two -2.213588
baz two 0.206053
foo two 0.408204
qux two 0.299368
dtype: float64
On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:
In [96]: df.T.sort_index(level=1, axis=1)
Out[96]:
zero one zero one
x x y y
0 2.410179 0.600178 0.132885 1.519970
1 1.450520 0.274230 -0.023688 -0.493662
Indexing will work even if the data are not sorted, but will be rather
inefficient (and show a PerformanceWarning
). It will also
return a copy of the data rather than a view:
In [97]: dfm = pd.DataFrame({'jim': [0, 0, 1, 1],
....: 'joe': ['x', 'x', 'z', 'y'],
....: 'jolie': np.random.rand(4)})
....:
In [98]: dfm = dfm.set_index(['jim', 'joe'])
In [99]: dfm
Out[99]:
jolie
jim joe
0 x 0.490671
x 0.120248
1 z 0.537020
y 0.110968
In [4]: dfm.loc[(1, 'z')]
PerformanceWarning: indexing past lexsort depth may impact performance.
Out[4]:
jolie
jim joe
1 z 0.64094
Furthermore if you try to index something that is not fully lexsorted, this can raise:
In [5]: dfm.loc[(0,'y'):(1, 'z')]
KeyError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'
The is_lexsorted()
method on an Index
show if the index is sorted, and the lexsort_depth
property returns the sort depth:
In [100]: dfm.index.is_lexsorted()
Out[100]: False
In [101]: dfm.index.lexsort_depth
Out[101]: 1
In [102]: dfm = dfm.sort_index()
In [103]: dfm
Out[103]:
jolie
jim joe
0 x 0.490671
x 0.120248
1 y 0.110968
z 0.537020
In [104]: dfm.index.is_lexsorted()
Out[104]: True
In [105]: dfm.index.lexsort_depth
Out[105]: 2
And now selection works as expected.
In [106]: dfm.loc[(0,'y'):(1, 'z')]
Out[106]:
jolie
jim joe
1 y 0.110968
z 0.537020
Take Methods¶
Similar to numpy ndarrays, pandas Index, Series, and DataFrame also provides
the take
method that retrieves elements along a given axis at the given
indices. The given indices must be either a list or an ndarray of integer
index positions. take
will also accept negative integers as relative positions to the end of the object.
In [107]: index = pd.Index(np.random.randint(0, 1000, 10))
In [108]: index
Out[108]: Int64Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')
In [109]: positions = [0, 9, 3]
In [110]: index[positions]
Out[110]: Int64Index([214, 329, 567], dtype='int64')
In [111]: index.take(positions)
Out[111]: Int64Index([214, 329, 567], dtype='int64')
In [112]: ser = pd.Series(np.random.randn(10))
In [113]: ser.iloc[positions]
Out[113]:
0 -0.179666
9 1.824375
3 0.392149
dtype: float64
In [114]: ser.take(positions)
Out[114]:
0 -0.179666
9 1.824375
3 0.392149
dtype: float64
For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.
In [115]: frm = pd.DataFrame(np.random.randn(5, 3))
In [116]: frm.take([1, 4, 3])
Out[116]:
0 1 2
1 -1.237881 0.106854 -1.276829
4 0.629675 -1.425966 1.857704
3 0.979542 -1.633678 0.615855
In [117]: frm.take([0, 2], axis=1)
Out[117]:
0 2
0 0.595974 0.601544
1 -1.237881 -1.276829
2 -0.767101 1.499591
3 0.979542 0.615855
4 0.629675 1.857704
It is important to note that the take
method on pandas objects are not
intended to work on boolean indices and may return unexpected results.
In [118]: arr = np.random.randn(10)
In [119]: arr.take([False, False, True, True])
Out[119]: array([-1.1935, -1.1935, 0.6775, 0.6775])
In [120]: arr[[0, 1]]
Out[120]: array([-1.1935, 0.6775])
In [121]: ser = pd.Series(np.random.randn(10))
In [122]: ser.take([False, False, True, True])
Out[122]:
0 0.233141
0 0.233141
1 -0.223540
1 -0.223540
dtype: float64
In [123]: ser.ix[[0, 1]]
Out[123]:
0 0.233141
1 -0.223540
dtype: float64
Finally, as a small note on performance, because the take
method handles
a narrower range of inputs, it can offer performance that is a good deal
faster than fancy indexing.
Index Types¶
We have discussed MultiIndex
in the previous sections pretty extensively. DatetimeIndex
and PeriodIndex
are shown here. TimedeltaIndex
are here.
In the following sub-sections we will highlite some other index types.
CategoricalIndex¶
New in version 0.16.1.
We introduce a CategoricalIndex
, a new type of index object that is useful for supporting
indexing with duplicates. This is a container around a Categorical
(introduced in v0.15.0)
and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1,
setting the index of a DataFrame/Series
with a category
dtype would convert this to regular object-based Index
.
In [124]: df = pd.DataFrame({'A': np.arange(6),
.....: 'B': list('aabbca')})
.....:
In [125]: df['B'] = df['B'].astype('category', categories=list('cab'))
In [126]: df
Out[126]:
A B
0 0 a
1 1 a
2 2 b
3 3 b
4 4 c
5 5 a
In [127]: df.dtypes
Out[127]:
A int64
B category
dtype: object
In [128]: df.B.cat.categories
Out[128]: Index([u'c', u'a', u'b'], dtype='object')
Setting the index, will create create a CategoricalIndex
In [129]: df2 = df.set_index('B')
In [130]: df2.index
Out[130]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')
Indexing with __getitem__/.iloc/.loc/.ix
works similarly to an Index
with duplicates.
The indexers MUST be in the category or the operation will raise.
In [131]: df2.loc['a']
Out[131]:
A
B
a 0
a 1
a 5
These PRESERVE the CategoricalIndex
In [132]: df2.loc['a'].index
Out[132]: CategoricalIndex([u'a', u'a', u'a'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')
Sorting will order by the order of the categories
In [133]: df2.sort_index()
Out[133]:
A
B
c 4
a 0
a 1
a 5
b 2
b 3
Groupby operations on the index will preserve the index nature as well
In [134]: df2.groupby(level=0).sum()
Out[134]:
A
B
c 4
a 6
b 5
In [135]: df2.groupby(level=0).sum().index
Out[135]: CategoricalIndex([u'c', u'a', u'b'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')
Reindexing operations, will return a resulting index based on the type of the passed
indexer, meaning that passing a list will return a plain-old-Index
; indexing with
a Categorical
will return a CategoricalIndex
, indexed according to the categories
of the PASSED Categorical
dtype. This allows one to arbitrarly index these even with
values NOT in the categories, similarly to how you can reindex ANY pandas index.
In [136]: df2.reindex(['a','e'])
Out[136]:
A
B
a 0.0
a 1.0
a 5.0
e NaN
In [137]: df2.reindex(['a','e']).index
Out[137]: Index([u'a', u'a', u'a', u'e'], dtype='object', name=u'B')
In [138]: df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))
Out[138]:
A
B
a 0.0
a 1.0
a 5.0
e NaN
In [139]: df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index
Out[139]: CategoricalIndex([u'a', u'a', u'a', u'e'], categories=[u'a', u'b', u'c', u'd', u'e'], ordered=False, name=u'B', dtype='category')
Warning
Reshaping and Comparison operations on a CategoricalIndex
must have the same categories
or a TypeError
will be raised.
In [9]: df3 = pd.DataFrame({'A' : np.arange(6),
'B' : pd.Series(list('aabbca')).astype('category')})
In [11]: df3 = df3.set_index('B')
In [11]: df3.index
Out[11]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], categories=[u'a', u'b', u'c'], ordered=False, name=u'B', dtype='category')
In [12]: pd.concat([df2, df3]
TypeError: categories must match existing categories when appending
Int64Index and RangeIndex¶
Warning
Indexing on an integer-based Index with floats has been clarified in 0.18.0, for a summary of the changes, see here.
Int64Index
is a fundamental basic index in pandas. This is an Immutable array implementing an ordered, sliceable set.
Prior to 0.18.0, the Int64Index
would provide the default index for all NDFrame
objects.
RangeIndex
is a sub-class of Int64Index
added in version 0.18.0, now providing the default index for all NDFrame
objects.
RangeIndex
is an optimized version of Int64Index
that can represent a monotonic ordered set. These are analagous to python range types.
Float64Index¶
Note
As of 0.14.0, Float64Index
is backed by a native float64
dtype
array. Prior to 0.14.0, Float64Index
was backed by an object
dtype
array. Using a float64
dtype in the backend speeds up arithmetic
operations by about 30x and boolean indexing operations on the
Float64Index
itself are about 2x as fast.
New in version 0.13.0.
By default a Float64Index
will be automatically created when passing floating, or mixed-integer-floating values in index creation.
This enables a pure label-based slicing paradigm that makes [],ix,loc
for scalar indexing and slicing work exactly the
same.
In [140]: indexf = pd.Index([1.5, 2, 3, 4.5, 5])
In [141]: indexf
Out[141]: Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64')
In [142]: sf = pd.Series(range(5), index=indexf)
In [143]: sf
Out[143]:
1.5 0
2.0 1
3.0 2
4.5 3
5.0 4
dtype: int64
Scalar selection for [],.ix,.loc
will always be label based. An integer will match an equal float index (e.g. 3
is equivalent to 3.0
)
In [144]: sf[3]
Out[144]: 2
In [145]: sf[3.0]
Out[145]: 2
In [146]: sf.ix[3]
Out[146]: 2
In [147]: sf.ix[3.0]
Out[147]: 2
In [148]: sf.loc[3]
Out[148]: 2
In [149]: sf.loc[3.0]
Out[149]: 2
The only positional indexing is via iloc
In [150]: sf.iloc[3]
Out[150]: 3
A scalar index that is not found will raise KeyError
Slicing is ALWAYS on the values of the index, for [],ix,loc
and ALWAYS positional with iloc
In [151]: sf[2:4]
Out[151]:
2.0 1
3.0 2
dtype: int64
In [152]: sf.ix[2:4]
Out[152]:
2.0 1
3.0 2
dtype: int64
In [153]: sf.loc[2:4]
Out[153]:
2.0 1
3.0 2
dtype: int64
In [154]: sf.iloc[2:4]
Out[154]:
3.0 2
4.5 3
dtype: int64
In float indexes, slicing using floats is allowed
In [155]: sf[2.1:4.6]
Out[155]:
3.0 2
4.5 3
dtype: int64
In [156]: sf.loc[2.1:4.6]
Out[156]:
3.0 2
4.5 3
dtype: int64
In non-float indexes, slicing using floats will raise a TypeError
In [1]: pd.Series(range(5))[3.5]
TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)
In [1]: pd.Series(range(5))[3.5:4.5]
TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)
Warning
Using a scalar float indexer for .iloc
has been removed in 0.18.0, so the following will raise a TypeError
In [3]: pd.Series(range(5)).iloc[3.0]
TypeError: cannot do positional indexing on <class 'pandas.indexes.range.RangeIndex'> with these indexers [3.0] of <type 'float'>
Further the treatment of .ix
with a float indexer on a non-float index, will be label based, and thus coerce the index.
In [157]: s2 = pd.Series([1, 2, 3], index=list('abc'))
In [158]: s2
Out[158]:
a 1
b 2
c 3
dtype: int64
In [159]: s2.ix[1.0] = 10
In [160]: s2
Out[160]:
a 1
b 2
c 3
1.0 10
dtype: int64
Here is a typical use-case for using this type of indexing. Imagine that you have a somewhat irregular timedelta-like indexing scheme, but the data is recorded as floats. This could for example be millisecond offsets.
In [161]: dfir = pd.concat([pd.DataFrame(np.random.randn(5,2),
.....: index=np.arange(5) * 250.0,
.....: columns=list('AB')),
.....: pd.DataFrame(np.random.randn(6,2),
.....: index=np.arange(4,10) * 250.1,
.....: columns=list('AB'))])
.....:
In [162]: dfir
Out[162]:
A B
0.0 0.997289 -1.693316
250.0 -0.179129 -1.598062
500.0 0.936914 0.912560
750.0 -1.003401 1.632781
1000.0 -0.724626 0.178219
1000.4 0.310610 -0.108002
1250.5 -0.974226 -1.147708
1500.6 -2.281374 0.760010
1750.7 -0.742532 1.533318
2000.8 2.495362 -0.432771
2250.9 -0.068954 0.043520
Selection operations then will always work on a value basis, for all selection operators.
In [163]: dfir[0:1000.4]
Out[163]:
A B
0.0 0.997289 -1.693316
250.0 -0.179129 -1.598062
500.0 0.936914 0.912560
750.0 -1.003401 1.632781
1000.0 -0.724626 0.178219
1000.4 0.310610 -0.108002
In [164]: dfir.loc[0:1001,'A']
Out[164]:
0.0 0.997289
250.0 -0.179129
500.0 0.936914
750.0 -1.003401
1000.0 -0.724626
1000.4 0.310610
Name: A, dtype: float64
In [165]: dfir.loc[1000.4]
Out[165]:
A 0.310610
B -0.108002
Name: 1000.4, dtype: float64
You could then easily pick out the first 1 second (1000 ms) of data then.
In [166]: dfir[0:1000]
Out[166]:
A B
0.0 0.997289 -1.693316
250.0 -0.179129 -1.598062
500.0 0.936914 0.912560
750.0 -1.003401 1.632781
1000.0 -0.724626 0.178219
Of course if you need integer based selection, then use iloc
In [167]: dfir.iloc[0:5]
Out[167]:
A B
0.0 0.997289 -1.693316
250.0 -0.179129 -1.598062
500.0 0.936914 0.912560
750.0 -1.003401 1.632781
1000.0 -0.724626 0.178219