MultiIndex / advanced indexing#
This section covers indexing with a MultiIndex and other advanced indexing features.
See the Indexing and Selecting Data for general indexing documentation.
Warning
Whether a copy or a reference is returned for a setting operation may
depend on the context. This is sometimes called chained assignment
and
should be avoided. See Returning a View versus Copy.
See the cookbook for some advanced strategies.
Hierarchical indexing (MultiIndex)#
Hierarchical / Multi-level indexing is very exciting as it opens the door to some
quite sophisticated data analysis and manipulation, especially for working with
higher dimensional data. In essence, it enables you to store and manipulate
data with an arbitrary number of dimensions in lower dimensional data
structures like Series
(1d) and DataFrame
(2d).
In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.
See the cookbook for some advanced strategies.
Creating a MultiIndex (hierarchical index) object#
The MultiIndex
object is the hierarchical analogue of the standard
Index
object which typically stores the axis labels in pandas objects. You
can think of MultiIndex
as an array of tuples where each tuple is unique. A
MultiIndex
can be created from a list of arrays (using
MultiIndex.from_arrays()
), an array of tuples (using
MultiIndex.from_tuples()
), a crossed set of iterables (using
MultiIndex.from_product()
), or a DataFrame
(using
MultiIndex.from_frame()
). The Index
constructor will attempt to return
a MultiIndex
when it is passed a list of tuples. The following examples
demonstrate different ways to initialize MultiIndexes.
In [1]: arrays = [
...: ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
...: ["one", "two", "one", "two", "one", "two", "one", "two"],
...: ]
...:
In [2]: tuples = list(zip(*arrays))
In [3]: tuples
Out[3]:
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
In [4]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
In [5]: index
Out[5]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')],
names=['first', 'second'])
In [6]: s = pd.Series(np.random.randn(8), index=index)
In [7]: s
Out[7]:
first second
bar one 0.469112
two -0.282863
baz one -1.509059
two -1.135632
foo one 1.212112
two -0.173215
qux one 0.119209
two -1.044236
dtype: float64
When you want every pairing of the elements in two iterables, it can be easier
to use the MultiIndex.from_product()
method:
In [8]: iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]
In [9]: pd.MultiIndex.from_product(iterables, names=["first", "second"])
Out[9]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')],
names=['first', 'second'])
You can also construct a MultiIndex
from a DataFrame
directly, using
the method MultiIndex.from_frame()
. This is a complementary method to
MultiIndex.to_frame()
.
In [10]: df = pd.DataFrame(
....: [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
....: columns=["first", "second"],
....: )
....:
In [11]: pd.MultiIndex.from_frame(df)
Out[11]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('foo', 'one'),
('foo', 'two')],
names=['first', 'second'])
As a convenience, you can pass a list of arrays directly into Series
or
DataFrame
to construct a MultiIndex
automatically:
In [12]: arrays = [
....: np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
....: np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
....: ]
....:
In [13]: s = pd.Series(np.random.randn(8), index=arrays)
In [14]: s
Out[14]:
bar one -0.861849
two -2.104569
baz one -0.494929
two 1.071804
foo one 0.721555
two -0.706771
qux one -1.039575
two 0.271860
dtype: float64
In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
In [16]: df
Out[16]:
0 1 2 3
bar one -0.424972 0.567020 0.276232 -1.087401
two -0.673690 0.113648 -1.478427 0.524988
baz one 0.404705 0.577046 -1.715002 -1.039268
two -0.370647 -1.157892 -1.344312 0.844885
foo one 1.075770 -0.109050 1.643563 -1.469388
two 0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524 0.413738 0.276662 -0.472035
two -0.013960 -0.362543 -0.006154 -0.923061
All of the MultiIndex
constructors accept a names
argument which stores
string names for the levels themselves. If no names are provided, None
will
be assigned:
In [17]: df.index.names
Out[17]: FrozenList([None, None])
This index can back any axis of a pandas object, and the number of levels of the index is up to you:
In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)
In [19]: df
Out[19]:
first bar baz ... foo qux
second one two one ... two one two
A 0.895717 0.805244 -1.206412 ... 1.340309 -1.170299 -0.226169
B 0.410835 0.813850 0.132003 ... -1.187678 1.130127 -1.436737
C -1.413681 1.607920 1.024180 ... -2.211372 0.974466 -2.006747
[3 rows x 8 columns]
In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[20]:
first bar baz foo
second one two one two one two
first second
bar one -0.410001 -0.078638 0.545952 -1.219217 -1.226825 0.769804
two -1.281247 -0.727707 -0.121306 -0.097883 0.695775 0.341734
baz one 0.959726 -1.110336 -0.619976 0.149748 -0.732339 0.687738
two 0.176444 0.403310 -0.154951 0.301624 -2.179861 -1.369849
foo one -0.954208 1.462696 -1.743161 -0.826591 -0.345352 1.314232
two 0.690579 0.995761 2.396780 0.014871 3.357427 -0.317441
We’ve “sparsified” the higher levels of the indexes to make the console output a
bit easier on the eyes. Note that how the index is displayed can be controlled using the
multi_sparse
option in pandas.set_options()
:
In [21]: with pd.option_context("display.multi_sparse", False):
....: df
....:
It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:
In [22]: pd.Series(np.random.randn(8), index=tuples)
Out[22]:
(bar, one) -1.236269
(bar, two) 0.896171
(baz, one) -0.487602
(baz, two) -0.082240
(foo, one) -2.182937
(foo, two) 0.380396
(qux, one) 0.084844
(qux, two) 0.432390
dtype: float64
The reason that the MultiIndex
matters is that it can allow you to do
grouping, selection, and reshaping operations as we will describe below and in
subsequent areas of the documentation. As you will see in later sections, you
can find yourself working with hierarchically-indexed data without creating a
MultiIndex
explicitly yourself. However, when loading data from a file, you
may wish to generate your own MultiIndex
when preparing the data set.
Reconstructing the level labels#
The method get_level_values()
will return a vector of the labels for each
location at a particular level:
In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')
In [24]: index.get_level_values("second")
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')
Basic indexing on axis with MultiIndex#
One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:
In [25]: df["bar"]
Out[25]:
second one two
A 0.895717 0.805244
B 0.410835 0.813850
C -1.413681 1.607920
In [26]: df["bar", "one"]
Out[26]:
A 0.895717
B 0.410835
C -1.413681
Name: (bar, one), dtype: float64
In [27]: df["bar"]["one"]
Out[27]:
A 0.895717
B 0.410835
C -1.413681
Name: one, dtype: float64
In [28]: s["qux"]
Out[28]:
one -1.039575
two 0.271860
dtype: float64
See Cross-section with hierarchical index for how to select on a deeper level.
Defined levels#
The MultiIndex
keeps all the defined levels of an index, even
if they are not actually used. When slicing an index, you may notice this.
For example:
In [29]: df.columns.levels # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
In [30]: df[["foo","qux"]].columns.levels # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
This is done to avoid a recomputation of the levels in order to make slicing
highly performant. If you want to see only the used levels, you can use the
get_level_values()
method.
In [31]: df[["foo", "qux"]].columns.to_numpy()
Out[31]:
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
dtype=object)
# for a specific level
In [32]: df[["foo", "qux"]].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')
To reconstruct the MultiIndex
with only the used levels, the
remove_unused_levels()
method may be used.
In [33]: new_mi = df[["foo", "qux"]].columns.remove_unused_levels()
In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])
Data alignment and using reindex
#
Operations between differently-indexed objects having MultiIndex
on the
axes will work as you expect; data alignment will work the same as an Index of
tuples:
In [35]: s + s[:-2]
Out[35]:
bar one -1.723698
two -4.209138
baz one -0.989859
two 2.143608
foo one 1.443110
two -1.413542
qux one NaN
two NaN
dtype: float64
In [36]: s + s[::2]
Out[36]:
bar one -1.723698
two NaN
baz one -0.989859
two NaN
foo one 1.443110
two NaN
qux one -2.079150
two NaN
dtype: float64
The reindex()
method of Series
/DataFrames
can be
called with another MultiIndex
, or even a list or array of tuples:
In [37]: s.reindex(index[:3])
Out[37]:
first second
bar one -0.861849
two -2.104569
baz one -0.494929
dtype: float64
In [38]: s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")])
Out[38]:
foo two -0.706771
bar one -0.861849
qux one -1.039575
baz one -0.494929
dtype: float64
Advanced indexing with hierarchical index#
Syntactically integrating MultiIndex
in advanced indexing with .loc
is a
bit challenging, but we’ve made every effort to do so. In general, MultiIndex
keys take the form of tuples. For example, the following works as you would expect:
In [39]: df = df.T
In [40]: df
Out[40]:
A B C
first second
bar one 0.895717 0.410835 -1.413681
two 0.805244 0.813850 1.607920
baz one -1.206412 0.132003 1.024180
two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
qux one -1.170299 1.130127 0.974466
two -0.226169 -1.436737 -2.006747
In [41]: df.loc[("bar", "two")]
Out[41]:
A 0.805244
B 0.813850
C 1.607920
Name: (bar, two), dtype: float64
Note that df.loc['bar', 'two']
would also work in this example, but this shorthand
notation can lead to ambiguity in general.
If you also want to index a specific column with .loc
, you must use a tuple
like this:
In [42]: df.loc[("bar", "two"), "A"]
Out[42]: 0.8052440253863785
You don’t have to specify all levels of the MultiIndex
by passing only the
first elements of the tuple. For example, you can use “partial” indexing to
get all elements with bar
in the first level as follows:
In [43]: df.loc["bar"]
Out[43]:
A B C
second
one 0.895717 0.410835 -1.413681
two 0.805244 0.813850 1.607920
This is a shortcut for the slightly more verbose notation df.loc[('bar',),]
(equivalent
to df.loc['bar',]
in this example).
“Partial” slicing also works quite nicely.
In [44]: df.loc["baz":"foo"]
Out[44]:
A B C
first second
baz one -1.206412 0.132003 1.024180
two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
You can slice with a ‘range’ of values, by providing a slice of tuples.
In [45]: df.loc[("baz", "two"):("qux", "one")]
Out[45]:
A B C
first second
baz two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
qux one -1.170299 1.130127 0.974466
In [46]: df.loc[("baz", "two"):"foo"]
Out[46]:
A B C
first second
baz two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
Passing a list of labels or tuples works similar to reindexing:
In [47]: df.loc[[("bar", "two"), ("qux", "one")]]
Out[47]:
A B C
first second
bar two 0.805244 0.813850 1.607920
qux one -1.170299 1.130127 0.974466
Note
It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).
Importantly, a list of tuples indexes several complete MultiIndex
keys,
whereas a tuple of lists refer to several values within a level:
In [48]: s = pd.Series(
....: [1, 2, 3, 4, 5, 6],
....: index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
....: )
....:
In [49]: s.loc[[("A", "c"), ("B", "d")]] # list of tuples
Out[49]:
A c 1
B d 5
dtype: int64
In [50]: s.loc[(["A", "B"], ["c", "d"])] # tuple of lists
Out[50]:
A c 1
d 2
B c 4
d 5
dtype: int64
Using slicers#
You can slice a MultiIndex
by providing multiple indexers.
You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers.
You can use slice(None)
to select all the contents of that level. You do not need to specify all the
deeper levels, they will be implied as slice(None)
.
As usual, both sides of the slicers are included as this is label indexing.
Warning
You should specify all axes in the .loc
specifier, meaning the indexer for the index and
for the columns. There are some ambiguous cases where the passed indexer could be misinterpreted
as indexing both axes, rather than into say the MultiIndex
for the rows.
You should do this:
df.loc[(slice("A1", "A3"), ...), :] # noqa: E999
You should not do this:
df.loc[(slice("A1", "A3"), ...)] # noqa: E999
In [51]: def mklbl(prefix, n):
....: return ["%s%s" % (prefix, i) for i in range(n)]
....:
In [52]: miindex = pd.MultiIndex.from_product(
....: [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
....: )
....:
In [53]: micolumns = pd.MultiIndex.from_tuples(
....: [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")], names=["lvl0", "lvl1"]
....: )
....:
In [54]: dfmi = (
....: pd.DataFrame(
....: np.arange(len(miindex) * len(micolumns)).reshape(
....: (len(miindex), len(micolumns))
....: ),
....: index=miindex,
....: columns=micolumns,
....: )
....: .sort_index()
....: .sort_index(axis=1)
....: )
....:
In [55]: dfmi
Out[55]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 9 8 11 10
D1 13 12 15 14
C2 D0 17 16 19 18
... ... ... ... ...
A3 B1 C1 D1 237 236 239 238
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 249 248 251 250
D1 253 252 255 254
[64 rows x 4 columns]
Basic MultiIndex slicing using slices, lists, and labels.
In [56]: dfmi.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]
Out[56]:
lvl0 a b
lvl1 bar foo bah foo
A1 B0 C1 D0 73 72 75 74
D1 77 76 79 78
C3 D0 89 88 91 90
D1 93 92 95 94
B1 C1 D0 105 104 107 106
... ... ... ... ...
A3 B0 C3 D1 221 220 223 222
B1 C1 D0 233 232 235 234
D1 237 236 239 238
C3 D0 249 248 251 250
D1 253 252 255 254
[24 rows x 4 columns]
You can use pandas.IndexSlice
to facilitate a more natural syntax
using :
, rather than using slice(None)
.
In [57]: idx = pd.IndexSlice
In [58]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[58]:
lvl0 a b
lvl1 foo foo
A0 B0 C1 D0 8 10
D1 12 14
C3 D0 24 26
D1 28 30
B1 C1 D0 40 42
... ... ...
A3 B0 C3 D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
[32 rows x 2 columns]
It is possible to perform quite complicated selections using this method on multiple axes at the same time.
In [59]: dfmi.loc["A1", (slice(None), "foo")]
Out[59]:
lvl0 a b
lvl1 foo foo
B0 C0 D0 64 66
D1 68 70
C1 D0 72 74
D1 76 78
C2 D0 80 82
... ... ...
B1 C1 D1 108 110
C2 D0 112 114
D1 116 118
C3 D0 120 122
D1 124 126
[16 rows x 2 columns]
In [60]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[60]:
lvl0 a b
lvl1 foo foo
A0 B0 C1 D0 8 10
D1 12 14
C3 D0 24 26
D1 28 30
B1 C1 D0 40 42
... ... ...
A3 B0 C3 D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
[32 rows x 2 columns]
Using a boolean indexer you can provide selection related to the values.
In [61]: mask = dfmi[("a", "foo")] > 200
In [62]: dfmi.loc[idx[mask, :, ["C1", "C3"]], idx[:, "foo"]]
Out[62]:
lvl0 a b
lvl1 foo foo
A3 B0 C1 D1 204 206
C3 D0 216 218
D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
You can also specify the axis
argument to .loc
to interpret the passed
slicers on a single axis.
In [63]: dfmi.loc(axis=0)[:, :, ["C1", "C3"]]
Out[63]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C1 D0 9 8 11 10
D1 13 12 15 14
C3 D0 25 24 27 26
D1 29 28 31 30
B1 C1 D0 41 40 43 42
... ... ... ... ...
A3 B0 C3 D1 221 220 223 222
B1 C1 D0 233 232 235 234
D1 237 236 239 238
C3 D0 249 248 251 250
D1 253 252 255 254
[32 rows x 4 columns]
Furthermore, you can set the values using the following methods.
In [64]: df2 = dfmi.copy()
In [65]: df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10
In [66]: df2
Out[66]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 -10 -10 -10 -10
D1 -10 -10 -10 -10
C2 D0 17 16 19 18
... ... ... ... ...
A3 B1 C1 D1 -10 -10 -10 -10
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 -10 -10 -10 -10
D1 -10 -10 -10 -10
[64 rows x 4 columns]
You can use a right-hand-side of an alignable object as well.
In [67]: df2 = dfmi.copy()
In [68]: df2.loc[idx[:, :, ["C1", "C3"]], :] = df2 * 1000
In [69]: df2
Out[69]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 9000 8000 11000 10000
D1 13000 12000 15000 14000
C2 D0 17 16 19 18
... ... ... ... ...
A3 B1 C1 D1 237000 236000 239000 238000
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 249000 248000 251000 250000
D1 253000 252000 255000 254000
[64 rows x 4 columns]
Cross-section#
The xs()
method of DataFrame
additionally takes a level argument to make
selecting data at a particular level of a MultiIndex
easier.
In [70]: df
Out[70]:
A B C
first second
bar one 0.895717 0.410835 -1.413681
two 0.805244 0.813850 1.607920
baz one -1.206412 0.132003 1.024180
two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
qux one -1.170299 1.130127 0.974466
two -0.226169 -1.436737 -2.006747
In [71]: df.xs("one", level="second")
Out[71]:
A B C
first
bar 0.895717 0.410835 -1.413681
baz -1.206412 0.132003 1.024180
foo 1.431256 -0.076467 0.875906
qux -1.170299 1.130127 0.974466
# using the slicers
In [72]: df.loc[(slice(None), "one"), :]
Out[72]:
A B C
first second
bar one 0.895717 0.410835 -1.413681
baz one -1.206412 0.132003 1.024180
foo one 1.431256 -0.076467 0.875906
qux one -1.170299 1.130127 0.974466
You can also select on the columns with xs
, by
providing the axis argument.
In [73]: df = df.T
In [74]: df.xs("one", level="second", axis=1)
Out[74]:
first bar baz foo qux
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
# using the slicers
In [75]: df.loc[:, (slice(None), "one")]
Out[75]:
first bar baz foo qux
second one one one one
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
xs
also allows selection with multiple keys.
In [76]: df.xs(("one", "bar"), level=("second", "first"), axis=1)
Out[76]:
first bar
second one
A 0.895717
B 0.410835
C -1.413681
# using the slicers
In [77]: df.loc[:, ("bar", "one")]
Out[77]:
A 0.895717
B 0.410835
C -1.413681
Name: (bar, one), dtype: float64
You can pass drop_level=False
to xs
to retain
the level that was selected.
In [78]: df.xs("one", level="second", axis=1, drop_level=False)
Out[78]:
first bar baz foo qux
second one one one one
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
Compare the above with the result using drop_level=True
(the default value).
In [79]: df.xs("one", level="second", axis=1, drop_level=True)
Out[79]:
first bar baz foo qux
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
Advanced reindexing and alignment#
Using the parameter level
in the reindex()
and
align()
methods of pandas objects is useful to broadcast
values across a level. For instance:
In [80]: midx = pd.MultiIndex(
....: levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
....: )
....:
In [81]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)
In [82]: df
Out[82]:
0 1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
In [83]: df2 = df.groupby(level=0).mean()
In [84]: df2
Out[84]:
0 1
one 1.060074 -0.109716
zero 1.271532 0.713416
In [85]: df2.reindex(df.index, level=0)
Out[85]:
0 1
one y 1.060074 -0.109716
x 1.060074 -0.109716
zero y 1.271532 0.713416
x 1.271532 0.713416
# aligning
In [86]: df_aligned, df2_aligned = df.align(df2, level=0)
In [87]: df_aligned
Out[87]:
0 1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
In [88]: df2_aligned
Out[88]:
0 1
one y 1.060074 -0.109716
x 1.060074 -0.109716
zero y 1.271532 0.713416
x 1.271532 0.713416
Swapping levels with swaplevel
#
The swaplevel()
method can switch the order of two levels:
In [89]: df[:5]
Out[89]:
0 1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
In [90]: df[:5].swaplevel(0, 1, axis=0)
Out[90]:
0 1
y one 1.519970 -0.493662
x one 0.600178 0.274230
y zero 0.132885 -0.023688
x zero 2.410179 1.450520
Reordering levels with reorder_levels
#
The reorder_levels()
method generalizes the swaplevel
method, allowing you to permute the hierarchical index levels in one step:
In [91]: df[:5].reorder_levels([1, 0], axis=0)
Out[91]:
0 1
y one 1.519970 -0.493662
x one 0.600178 0.274230
y zero 0.132885 -0.023688
x zero 2.410179 1.450520
Renaming names of an Index
or MultiIndex
#
The rename()
method is used to rename the labels of a
MultiIndex
, and is typically used to rename the columns of a DataFrame
.
The columns
argument of rename
allows a dictionary to be specified
that includes only the columns you wish to rename.
In [92]: df.rename(columns={0: "col0", 1: "col1"})
Out[92]:
col0 col1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
This method can also be used to rename specific labels of the main index
of the DataFrame
.
In [93]: df.rename(index={"one": "two", "y": "z"})
Out[93]:
0 1
two z 1.519970 -0.493662
x 0.600178 0.274230
zero z 0.132885 -0.023688
x 2.410179 1.450520
The rename_axis()
method is used to rename the name of a
Index
or MultiIndex
. In particular, the names of the levels of a
MultiIndex
can be specified, which is useful if reset_index()
is later
used to move the values from the MultiIndex
to a column.
In [94]: df.rename_axis(index=["abc", "def"])
Out[94]:
0 1
abc def
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
Note that the columns of a DataFrame
are an index, so that using
rename_axis
with the columns
argument will change the name of that
index.
In [95]: df.rename_axis(columns="Cols").columns
Out[95]: RangeIndex(start=0, stop=2, step=1, name='Cols')
Both rename
and rename_axis
support specifying a dictionary,
Series
or a mapping function to map labels/names to new values.
When working with an Index
object directly, rather than via a DataFrame
,
Index.set_names()
can be used to change the names.
In [96]: mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])
In [97]: mi.names
Out[97]: FrozenList(['x', 'y'])
In [98]: mi2 = mi.rename("new name", level=0)
In [99]: mi2
Out[99]:
MultiIndex([(1, 'a'),
(1, 'b'),
(2, 'a'),
(2, 'b')],
names=['new name', 'y'])
You cannot set the names of the MultiIndex via a level.
In [100]: mi.levels[0].name = "name via level"
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[100], line 1
----> 1 mi.levels[0].name = "name via level"
File ~/work/pandas/pandas/pandas/core/indexes/base.py:1675, in Index.name(self, value)
1671 @name.setter
1672 def name(self, value: Hashable) -> None:
1673 if self._no_setting_name:
1674 # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1675 raise RuntimeError(
1676 "Cannot set name on a level of a MultiIndex. Use "
1677 "'MultiIndex.set_names' instead."
1678 )
1679 maybe_extract_name(value, None, type(self))
1680 self._name = value
RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead.
Use Index.set_names()
instead.
Sorting a MultiIndex
#
For MultiIndex
-ed objects to be indexed and sliced effectively,
they need to be sorted. As with any index, you can use sort_index()
.
In [101]: import random
In [102]: random.shuffle(tuples)
In [103]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))
In [104]: s
Out[104]:
foo two 0.206053
qux two -0.251905
foo one -2.213588
bar two 1.063327
baz two 1.266143
one 0.299368
qux one -0.863838
bar one 0.408204
dtype: float64
In [105]: s.sort_index()
Out[105]:
bar one 0.408204
two 1.063327
baz one 0.299368
two 1.266143
foo one -2.213588
two 0.206053
qux one -0.863838
two -0.251905
dtype: float64
In [106]: s.sort_index(level=0)
Out[106]:
bar one 0.408204
two 1.063327
baz one 0.299368
two 1.266143
foo one -2.213588
two 0.206053
qux one -0.863838
two -0.251905
dtype: float64
In [107]: s.sort_index(level=1)
Out[107]:
bar one 0.408204
baz one 0.299368
foo one -2.213588
qux one -0.863838
bar two 1.063327
baz two 1.266143
foo two 0.206053
qux two -0.251905
dtype: float64
You may also pass a level name to sort_index
if the MultiIndex
levels
are named.
In [108]: s.index = s.index.set_names(["L1", "L2"])
In [109]: s.sort_index(level="L1")
Out[109]:
L1 L2
bar one 0.408204
two 1.063327
baz one 0.299368
two 1.266143
foo one -2.213588
two 0.206053
qux one -0.863838
two -0.251905
dtype: float64
In [110]: s.sort_index(level="L2")
Out[110]:
L1 L2
bar one 0.408204
baz one 0.299368
foo one -2.213588
qux one -0.863838
bar two 1.063327
baz two 1.266143
foo two 0.206053
qux two -0.251905
dtype: float64
On higher dimensional objects, you can sort any of the other axes by level if
they have a MultiIndex
:
In [111]: df.T.sort_index(level=1, axis=1)
Out[111]:
one zero one zero
x x y y
0 0.600178 2.410179 1.519970 0.132885
1 0.274230 1.450520 -0.493662 -0.023688
Indexing will work even if the data are not sorted, but will be rather
inefficient (and show a PerformanceWarning
). It will also
return a copy of the data rather than a view:
In [112]: dfm = pd.DataFrame(
.....: {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)}
.....: )
.....:
In [113]: dfm = dfm.set_index(["jim", "joe"])
In [114]: dfm
Out[114]:
jolie
jim joe
0 x 0.490671
x 0.120248
1 z 0.537020
y 0.110968
In [115]: dfm.loc[(1, 'z')]
Out[115]:
jolie
jim joe
1 z 0.53702
Furthermore, if you try to index something that is not fully lexsorted, this can raise:
In [116]: dfm.loc[(0, 'y'):(1, 'z')]
---------------------------------------------------------------------------
UnsortedIndexError Traceback (most recent call last)
Cell In[116], line 1
----> 1 dfm.loc[(0, 'y'):(1, 'z')]
File ~/work/pandas/pandas/pandas/core/indexing.py:1153, in _LocationIndexer.__getitem__(self, key)
1150 axis = self.axis or 0
1152 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1153 return self._getitem_axis(maybe_callable, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1373, in _LocIndexer._getitem_axis(self, key, axis)
1371 if isinstance(key, slice):
1372 self._validate_key(key, axis)
-> 1373 return self._get_slice_axis(key, axis=axis)
1374 elif com.is_bool_indexer(key):
1375 return self._getbool_axis(key, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1405, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
1402 return obj.copy(deep=False)
1404 labels = obj._get_axis(axis)
-> 1405 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
1407 if isinstance(indexer, slice):
1408 return self.obj._slice(indexer, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6601, in Index.slice_indexer(self, start, end, step)
6557 def slice_indexer(
6558 self,
6559 start: Hashable | None = None,
6560 end: Hashable | None = None,
6561 step: int | None = None,
6562 ) -> slice:
6563 """
6564 Compute the slice indexer for input labels and step.
6565
(...)
6599 slice(1, 3, None)
6600 """
-> 6601 start_slice, end_slice = self.slice_locs(start, end, step=step)
6603 # return a slice
6604 if not is_scalar(start_slice):
File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2789, in MultiIndex.slice_locs(self, start, end, step)
2737 """
2738 For an ordered MultiIndex, compute the slice locations for input
2739 labels.
(...)
2785 sequence of such.
2786 """
2787 # This function adds nothing to its parent implementation (the magic
2788 # happens in get_slice_bound method), but it adds meaningful doc.
-> 2789 return super().slice_locs(start, end, step)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6818, in Index.slice_locs(self, start, end, step)
6816 start_slice = None
6817 if start is not None:
-> 6818 start_slice = self.get_slice_bound(start, "left")
6819 if start_slice is None:
6820 start_slice = 0
File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2733, in MultiIndex.get_slice_bound(self, label, side)
2731 if not isinstance(label, tuple):
2732 label = (label,)
-> 2733 return self._partial_tup_index(label, side=side)
File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2793, in MultiIndex._partial_tup_index(self, tup, side)
2791 def _partial_tup_index(self, tup: tuple, side: Literal["left", "right"] = "left"):
2792 if len(tup) > self._lexsort_depth:
-> 2793 raise UnsortedIndexError(
2794 f"Key length ({len(tup)}) was greater than MultiIndex lexsort depth "
2795 f"({self._lexsort_depth})"
2796 )
2798 n = len(tup)
2799 start, end = 0, len(self)
UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'
The is_monotonic_increasing()
method on a MultiIndex
shows if the
index is sorted:
In [117]: dfm.index.is_monotonic_increasing
Out[117]: False
In [118]: dfm = dfm.sort_index()
In [119]: dfm
Out[119]:
jolie
jim joe
0 x 0.490671
x 0.120248
1 y 0.110968
z 0.537020
In [120]: dfm.index.is_monotonic_increasing
Out[120]: True
And now selection works as expected.
In [121]: dfm.loc[(0, "y"):(1, "z")]
Out[121]:
jolie
jim joe
1 y 0.110968
z 0.537020
Take methods#
Similar to NumPy ndarrays, pandas Index
, Series
, and DataFrame
also provides
the take()
method that retrieves elements along a given axis at the given
indices. The given indices must be either a list or an ndarray of integer
index positions. take
will also accept negative integers as relative positions to the end of the object.
In [122]: index = pd.Index(np.random.randint(0, 1000, 10))
In [123]: index
Out[123]: Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')
In [124]: positions = [0, 9, 3]
In [125]: index[positions]
Out[125]: Index([214, 329, 567], dtype='int64')
In [126]: index.take(positions)
Out[126]: Index([214, 329, 567], dtype='int64')
In [127]: ser = pd.Series(np.random.randn(10))
In [128]: ser.iloc[positions]
Out[128]:
0 -0.179666
9 1.824375
3 0.392149
dtype: float64
In [129]: ser.take(positions)
Out[129]:
0 -0.179666
9 1.824375
3 0.392149
dtype: float64
For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.
In [130]: frm = pd.DataFrame(np.random.randn(5, 3))
In [131]: frm.take([1, 4, 3])
Out[131]:
0 1 2
1 -1.237881 0.106854 -1.276829
4 0.629675 -1.425966 1.857704
3 0.979542 -1.633678 0.615855
In [132]: frm.take([0, 2], axis=1)
Out[132]:
0 2
0 0.595974 0.601544
1 -1.237881 -1.276829
2 -0.767101 1.499591
3 0.979542 0.615855
4 0.629675 1.857704
It is important to note that the take
method on pandas objects are not
intended to work on boolean indices and may return unexpected results.
In [133]: arr = np.random.randn(10)
In [134]: arr.take([False, False, True, True])
Out[134]: array([-1.1935, -1.1935, 0.6775, 0.6775])
In [135]: arr[[0, 1]]
Out[135]: array([-1.1935, 0.6775])
In [136]: ser = pd.Series(np.random.randn(10))
In [137]: ser.take([False, False, True, True])
Out[137]:
0 0.233141
0 0.233141
1 -0.223540
1 -0.223540
dtype: float64
In [138]: ser.iloc[[0, 1]]
Out[138]:
0 0.233141
1 -0.223540
dtype: float64
Finally, as a small note on performance, because the take
method handles
a narrower range of inputs, it can offer performance that is a good deal
faster than fancy indexing.
In [139]: arr = np.random.randn(10000, 5)
In [140]: indexer = np.arange(10000)
In [141]: random.shuffle(indexer)
In [142]: %timeit arr[indexer]
.....: %timeit arr.take(indexer, axis=0)
.....:
252 us +- 5.57 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)
96.1 us +- 7.77 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
In [143]: ser = pd.Series(arr[:, 0])
In [144]: %timeit ser.iloc[indexer]
.....: %timeit ser.take(indexer)
.....:
144 us +- 11.9 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
165 us +- 22.3 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
Index types#
We have discussed MultiIndex
in the previous sections pretty extensively.
Documentation about DatetimeIndex
and PeriodIndex
are shown here,
and documentation about TimedeltaIndex
is found here.
In the following sub-sections we will highlight some other index types.
CategoricalIndex#
CategoricalIndex
is a type of index that is useful for supporting
indexing with duplicates. This is a container around a Categorical
and allows efficient indexing and storage of an index with a large number of duplicated elements.
In [145]: from pandas.api.types import CategoricalDtype
In [146]: df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})
In [147]: df["B"] = df["B"].astype(CategoricalDtype(list("cab")))
In [148]: df
Out[148]:
A B
0 0 a
1 1 a
2 2 b
3 3 b
4 4 c
5 5 a
In [149]: df.dtypes
Out[149]:
A int64
B category
dtype: object
In [150]: df["B"].cat.categories
Out[150]: Index(['c', 'a', 'b'], dtype='object')
Setting the index will create a CategoricalIndex
.
In [151]: df2 = df.set_index("B")
In [152]: df2.index
Out[152]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')
Indexing with __getitem__/.iloc/.loc
works similarly to an Index
with duplicates.
The indexers must be in the category or the operation will raise a KeyError
.
In [153]: df2.loc["a"]
Out[153]:
A
B
a 0
a 1
a 5
The CategoricalIndex
is preserved after indexing:
In [154]: df2.loc["a"].index
Out[154]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')
Sorting the index will sort by the order of the categories (recall that we
created the index with CategoricalDtype(list('cab'))
, so the sorted
order is cab
).
In [155]: df2.sort_index()
Out[155]:
A
B
c 4
a 0
a 1
a 5
b 2
b 3
Groupby operations on the index will preserve the index nature as well.
In [156]: df2.groupby(level=0, observed=True).sum()
Out[156]:
A
B
c 4
a 6
b 5
In [157]: df2.groupby(level=0, observed=True).sum().index
Out[157]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')
Reindexing operations will return a resulting index based on the type of the passed
indexer. Passing a list will return a plain-old Index
; indexing with
a Categorical
will return a CategoricalIndex
, indexed according to the categories
of the passed Categorical
dtype. This allows one to arbitrarily index these even with
values not in the categories, similarly to how you can reindex any pandas index.
In [158]: df3 = pd.DataFrame(
.....: {"A": np.arange(3), "B": pd.Series(list("abc")).astype("category")}
.....: )
.....:
In [159]: df3 = df3.set_index("B")
In [160]: df3
Out[160]:
A
B
a 0
b 1
c 2
In [161]: df3.reindex(["a", "e"])
Out[161]:
A
B
a 0.0
e NaN
In [162]: df3.reindex(["a", "e"]).index
Out[162]: Index(['a', 'e'], dtype='object', name='B')
In [163]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe")))
Out[163]:
A
B
a 0.0
e NaN
In [164]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))).index
Out[164]: CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, dtype='category', name='B')
Warning
Reshaping and Comparison operations on a CategoricalIndex
must have the same categories
or a TypeError
will be raised.
In [165]: df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")})
In [166]: df4["B"] = df4["B"].astype(CategoricalDtype(list("ab")))
In [167]: df4 = df4.set_index("B")
In [168]: df4.index
Out[168]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, dtype='category', name='B')
In [169]: df5 = pd.DataFrame({"A": np.arange(2), "B": list("bc")})
In [170]: df5["B"] = df5["B"].astype(CategoricalDtype(list("bc")))
In [171]: df5 = df5.set_index("B")
In [172]: df5.index
Out[172]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, dtype='category', name='B')
In [173]: pd.concat([df4, df5])
Out[173]:
A
B
b 0
a 1
b 0
c 1
RangeIndex#
RangeIndex
is a sub-class of Index
that provides the default index for all DataFrame
and Series
objects.
RangeIndex
is an optimized version of Index
that can represent a monotonic ordered set. These are analogous to Python range types.
A RangeIndex
will always have an int64
dtype.
In [174]: idx = pd.RangeIndex(5)
In [175]: idx
Out[175]: RangeIndex(start=0, stop=5, step=1)
RangeIndex
is the default index for all DataFrame
and Series
objects:
In [176]: ser = pd.Series([1, 2, 3])
In [177]: ser.index
Out[177]: RangeIndex(start=0, stop=3, step=1)
In [178]: df = pd.DataFrame([[1, 2], [3, 4]])
In [179]: df.index
Out[179]: RangeIndex(start=0, stop=2, step=1)
In [180]: df.columns
Out[180]: RangeIndex(start=0, stop=2, step=1)
A RangeIndex
will behave similarly to a Index
with an int64
dtype and operations on a RangeIndex
,
whose result cannot be represented by a RangeIndex
, but should have an integer dtype, will be converted to an Index
with int64
.
For example:
In [181]: idx[[0, 2]]
Out[181]: Index([0, 2], dtype='int64')
IntervalIndex#
IntervalIndex
together with its own dtype, IntervalDtype
as well as the Interval
scalar type, allow first-class support in pandas
for interval notation.
The IntervalIndex
allows some unique indexing and is also used as a
return type for the categories in cut()
and qcut()
.
Indexing with an IntervalIndex
#
An IntervalIndex
can be used in Series
and in DataFrame
as the index.
In [182]: df = pd.DataFrame(
.....: {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
.....: )
.....:
In [183]: df
Out[183]:
A
(0, 1] 1
(1, 2] 2
(2, 3] 3
(3, 4] 4
Label based indexing via .loc
along the edges of an interval works as you would expect,
selecting that particular interval.
In [184]: df.loc[2]
Out[184]:
A 2
Name: (1, 2], dtype: int64
In [185]: df.loc[[2, 3]]
Out[185]:
A
(1, 2] 2
(2, 3] 3
If you select a label contained within an interval, this will also select the interval.
In [186]: df.loc[2.5]
Out[186]:
A 3
Name: (2, 3], dtype: int64
In [187]: df.loc[[2.5, 3.5]]
Out[187]:
A
(2, 3] 3
(3, 4] 4
Selecting using an Interval
will only return exact matches.
In [188]: df.loc[pd.Interval(1, 2)]
Out[188]:
A 2
Name: (1, 2], dtype: int64
Trying to select an Interval
that is not exactly contained in the IntervalIndex
will raise a KeyError
.
In [189]: df.loc[pd.Interval(0.5, 2.5)]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[189], line 1
----> 1 df.loc[pd.Interval(0.5, 2.5)]
File ~/work/pandas/pandas/pandas/core/indexing.py:1153, in _LocationIndexer.__getitem__(self, key)
1150 axis = self.axis or 0
1152 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1153 return self._getitem_axis(maybe_callable, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1393, in _LocIndexer._getitem_axis(self, key, axis)
1391 # fall thru to straight lookup
1392 self._validate_key(key, axis)
-> 1393 return self._get_label(key, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1343, in _LocIndexer._get_label(self, label, axis)
1341 def _get_label(self, label, axis: AxisInt):
1342 # GH#5567 this will fail if the label is not present in the axis.
-> 1343 return self.obj.xs(label, axis=axis)
File ~/work/pandas/pandas/pandas/core/generic.py:4236, in NDFrame.xs(self, key, axis, level, drop_level)
4234 new_index = index[loc]
4235 else:
-> 4236 loc = index.get_loc(key)
4238 if isinstance(loc, np.ndarray):
4239 if loc.dtype == np.bool_:
File ~/work/pandas/pandas/pandas/core/indexes/interval.py:676, in IntervalIndex.get_loc(self, key)
674 matches = mask.sum()
675 if matches == 0:
--> 676 raise KeyError(key)
677 if matches == 1:
678 return mask.argmax()
KeyError: Interval(0.5, 2.5, closed='right')
Selecting all Intervals
that overlap a given Interval
can be performed using the
overlaps()
method to create a boolean indexer.
In [190]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))
In [191]: idxr
Out[191]: array([ True, True, True, False])
In [192]: df[idxr]
Out[192]:
A
(0, 1] 1
(1, 2] 2
(2, 3] 3
Binning data with cut
and qcut
#
cut()
and qcut()
both return a Categorical
object, and the bins they
create are stored as an IntervalIndex
in its .categories
attribute.
In [193]: c = pd.cut(range(4), bins=2)
In [194]: c
Out[194]:
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]
In [195]: c.categories
Out[195]: IntervalIndex([(-0.003, 1.5], (1.5, 3.0]], dtype='interval[float64, right]')
cut()
also accepts an IntervalIndex
for its bins
argument, which enables
a useful pandas idiom. First, We call cut()
with some data and bins
set to a
fixed number, to generate the bins. Then, we pass the values of .categories
as the
bins
argument in subsequent calls to cut()
, supplying new data which will be
binned into the same bins.
In [196]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[196]:
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]
Any value which falls outside all bins will be assigned a NaN
value.
Generating ranges of intervals#
If we need intervals on a regular frequency, we can use the interval_range()
function
to create an IntervalIndex
using various combinations of start
, end
, and periods
.
The default frequency for interval_range
is a 1 for numeric intervals, and calendar day for
datetime-like intervals:
In [197]: pd.interval_range(start=0, end=5)
Out[197]: IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')
In [198]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4)
Out[198]: IntervalIndex([(2017-01-01, 2017-01-02], (2017-01-02, 2017-01-03], (2017-01-03, 2017-01-04], (2017-01-04, 2017-01-05]], dtype='interval[datetime64[ns], right]')
In [199]: pd.interval_range(end=pd.Timedelta("3 days"), periods=3)
Out[199]: IntervalIndex([(0 days 00:00:00, 1 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00], (2 days 00:00:00, 3 days 00:00:00]], dtype='interval[timedelta64[ns], right]')
The freq
parameter can used to specify non-default frequencies, and can utilize a variety
of frequency aliases with datetime-like intervals:
In [200]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[200]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]], dtype='interval[float64, right]')
In [201]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")
Out[201]: IntervalIndex([(2017-01-01, 2017-01-08], (2017-01-08, 2017-01-15], (2017-01-15, 2017-01-22], (2017-01-22, 2017-01-29]], dtype='interval[datetime64[ns], right]')
In [202]: pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9H")
Out[202]: IntervalIndex([(0 days 00:00:00, 0 days 09:00:00], (0 days 09:00:00, 0 days 18:00:00], (0 days 18:00:00, 1 days 03:00:00]], dtype='interval[timedelta64[ns], right]')
Additionally, the closed
parameter can be used to specify which side(s) the intervals
are closed on. Intervals are closed on the right side by default.
In [203]: pd.interval_range(start=0, end=4, closed="both")
Out[203]: IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]], dtype='interval[int64, both]')
In [204]: pd.interval_range(start=0, end=4, closed="neither")
Out[204]: IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='interval[int64, neither]')
Specifying start
, end
, and periods
will generate a range of evenly spaced
intervals from start
to end
inclusively, with periods
number of elements
in the resulting IntervalIndex
:
In [205]: pd.interval_range(start=0, end=6, periods=4)
Out[205]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]')
In [206]: pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3)
Out[206]: IntervalIndex([(2018-01-01, 2018-01-20 08:00:00], (2018-01-20 08:00:00, 2018-02-08 16:00:00], (2018-02-08 16:00:00, 2018-02-28]], dtype='interval[datetime64[ns], right]')
Miscellaneous indexing FAQ#
Integer indexing#
Label-based indexing with integer axis labels is a thorny topic. It has been
discussed heavily on mailing lists and among various members of the scientific
Python community. In pandas, our general viewpoint is that labels matter more
than integer locations. Therefore, with an integer axis index only
label-based indexing is possible with the standard tools like .loc
. The
following code will generate exceptions:
In [207]: s = pd.Series(range(5))
In [208]: s[-1]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/range.py:414, in RangeIndex.get_loc(self, key)
413 try:
--> 414 return self._range.index(new_key)
415 except ValueError as err:
ValueError: -1 is not in range
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[208], line 1
----> 1 s[-1]
File ~/work/pandas/pandas/pandas/core/series.py:1040, in Series.__getitem__(self, key)
1037 return self._values[key]
1039 elif key_is_scalar:
-> 1040 return self._get_value(key)
1042 # Convert generator to list before going through hashable part
1043 # (We will iterate through the generator there to check for slices)
1044 if is_iterator(key):
File ~/work/pandas/pandas/pandas/core/series.py:1156, in Series._get_value(self, label, takeable)
1153 return self._values[label]
1155 # Similar to Index.get_value, but we do not fall back to positional
-> 1156 loc = self.index.get_loc(label)
1158 if is_integer(loc):
1159 return self._values[loc]
File ~/work/pandas/pandas/pandas/core/indexes/range.py:416, in RangeIndex.get_loc(self, key)
414 return self._range.index(new_key)
415 except ValueError as err:
--> 416 raise KeyError(key) from err
417 if isinstance(key, Hashable):
418 raise KeyError(key)
KeyError: -1
In [209]: df = pd.DataFrame(np.random.randn(5, 4))
In [210]: df
Out[210]:
0 1 2 3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703 1.347213 -0.243487 0.514704
2 1.162969 -0.287725 -0.179734 0.993962
3 -0.212673 0.909872 -0.733333 -0.349893
4 0.456434 -0.306735 0.553396 0.166221
In [211]: df.loc[-2:]
Out[211]:
0 1 2 3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703 1.347213 -0.243487 0.514704
2 1.162969 -0.287725 -0.179734 0.993962
3 -0.212673 0.909872 -0.733333 -0.349893
4 0.456434 -0.306735 0.553396 0.166221
This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop “falling back” on position-based indexing).
Non-monotonic indexes require exact matches#
If the index of a Series
or DataFrame
is monotonically increasing or decreasing, then the bounds
of a label-based slice can be outside the range of the index, much like slice indexing a
normal Python list
. Monotonicity of an index can be tested with the is_monotonic_increasing()
and
is_monotonic_decreasing()
attributes.
In [212]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=["data"], data=list(range(5)))
In [213]: df.index.is_monotonic_increasing
Out[213]: True
# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
In [214]: df.loc[0:4, :]
Out[214]:
data
2 0
3 1
3 2
4 3
# slice is are outside the index, so empty DataFrame is returned
In [215]: df.loc[13:15, :]
Out[215]:
Empty DataFrame
Columns: [data]
Index: []
On the other hand, if the index is not monotonic, then both slice bounds must be unique members of the index.
In [216]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))
In [217]: df.index.is_monotonic_increasing
Out[217]: False
# OK because 2 and 4 are in the index
In [218]: df.loc[2:4, :]
Out[218]:
data
2 0
3 1
1 2
4 3
# 0 is not in the index
In [219]: df.loc[0:4, :]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3790, in Index.get_loc(self, key)
3789 try:
-> 3790 return self._engine.get_loc(casted_key)
3791 except KeyError as err:
File index.pyx:152, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:219, in pandas._libs.index.IndexEngine._get_loc_duplicates()
File index.pyx:227, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()
File index.pyx:119, in pandas._libs.index._unpack_bool_indexer()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[219], line 1
----> 1 df.loc[0:4, :]
File ~/work/pandas/pandas/pandas/core/indexing.py:1147, in _LocationIndexer.__getitem__(self, key)
1145 if self._is_scalar_access(key):
1146 return self.obj._get_value(*key, takeable=self._takeable)
-> 1147 return self._getitem_tuple(key)
1148 else:
1149 # we by definition only have the 0th axis
1150 axis = self.axis or 0
File ~/work/pandas/pandas/pandas/core/indexing.py:1339, in _LocIndexer._getitem_tuple(self, tup)
1336 if self._multi_take_opportunity(tup):
1337 return self._multi_take(tup)
-> 1339 return self._getitem_tuple_same_dim(tup)
File ~/work/pandas/pandas/pandas/core/indexing.py:994, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
991 if com.is_null_slice(key):
992 continue
--> 994 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
995 # We should never have retval.ndim < self.ndim, as that should
996 # be handled by the _getitem_lowerdim call above.
997 assert retval.ndim == self.ndim
File ~/work/pandas/pandas/pandas/core/indexing.py:1373, in _LocIndexer._getitem_axis(self, key, axis)
1371 if isinstance(key, slice):
1372 self._validate_key(key, axis)
-> 1373 return self._get_slice_axis(key, axis=axis)
1374 elif com.is_bool_indexer(key):
1375 return self._getbool_axis(key, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1405, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
1402 return obj.copy(deep=False)
1404 labels = obj._get_axis(axis)
-> 1405 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
1407 if isinstance(indexer, slice):
1408 return self.obj._slice(indexer, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6601, in Index.slice_indexer(self, start, end, step)
6557 def slice_indexer(
6558 self,
6559 start: Hashable | None = None,
6560 end: Hashable | None = None,
6561 step: int | None = None,
6562 ) -> slice:
6563 """
6564 Compute the slice indexer for input labels and step.
6565
(...)
6599 slice(1, 3, None)
6600 """
-> 6601 start_slice, end_slice = self.slice_locs(start, end, step=step)
6603 # return a slice
6604 if not is_scalar(start_slice):
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6818, in Index.slice_locs(self, start, end, step)
6816 start_slice = None
6817 if start is not None:
-> 6818 start_slice = self.get_slice_bound(start, "left")
6819 if start_slice is None:
6820 start_slice = 0
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6743, in Index.get_slice_bound(self, label, side)
6740 return self._searchsorted_monotonic(label, side)
6741 except ValueError:
6742 # raise the original KeyError
-> 6743 raise err
6745 if isinstance(slc, np.ndarray):
6746 # get_loc may return a boolean array, which
6747 # is OK as long as they are representable by a slice.
6748 assert is_bool_dtype(slc.dtype)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6737, in Index.get_slice_bound(self, label, side)
6735 # we need to look up the label
6736 try:
-> 6737 slc = self.get_loc(label)
6738 except KeyError as err:
6739 try:
File ~/work/pandas/pandas/pandas/core/indexes/base.py:3797, in Index.get_loc(self, key)
3792 if isinstance(casted_key, slice) or (
3793 isinstance(casted_key, abc.Iterable)
3794 and any(isinstance(x, slice) for x in casted_key)
3795 ):
3796 raise InvalidIndexError(key)
-> 3797 raise KeyError(key) from err
3798 except TypeError:
3799 # If we have a listlike key, _check_indexing_error will raise
3800 # InvalidIndexError. Otherwise we fall through and re-raise
3801 # the TypeError.
3802 self._check_indexing_error(key)
KeyError: 0
# 3 is not a unique label
In [220]: df.loc[2:3, :]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[220], line 1
----> 1 df.loc[2:3, :]
File ~/work/pandas/pandas/pandas/core/indexing.py:1147, in _LocationIndexer.__getitem__(self, key)
1145 if self._is_scalar_access(key):
1146 return self.obj._get_value(*key, takeable=self._takeable)
-> 1147 return self._getitem_tuple(key)
1148 else:
1149 # we by definition only have the 0th axis
1150 axis = self.axis or 0
File ~/work/pandas/pandas/pandas/core/indexing.py:1339, in _LocIndexer._getitem_tuple(self, tup)
1336 if self._multi_take_opportunity(tup):
1337 return self._multi_take(tup)
-> 1339 return self._getitem_tuple_same_dim(tup)
File ~/work/pandas/pandas/pandas/core/indexing.py:994, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
991 if com.is_null_slice(key):
992 continue
--> 994 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
995 # We should never have retval.ndim < self.ndim, as that should
996 # be handled by the _getitem_lowerdim call above.
997 assert retval.ndim == self.ndim
File ~/work/pandas/pandas/pandas/core/indexing.py:1373, in _LocIndexer._getitem_axis(self, key, axis)
1371 if isinstance(key, slice):
1372 self._validate_key(key, axis)
-> 1373 return self._get_slice_axis(key, axis=axis)
1374 elif com.is_bool_indexer(key):
1375 return self._getbool_axis(key, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexing.py:1405, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
1402 return obj.copy(deep=False)
1404 labels = obj._get_axis(axis)
-> 1405 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
1407 if isinstance(indexer, slice):
1408 return self.obj._slice(indexer, axis=axis)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6601, in Index.slice_indexer(self, start, end, step)
6557 def slice_indexer(
6558 self,
6559 start: Hashable | None = None,
6560 end: Hashable | None = None,
6561 step: int | None = None,
6562 ) -> slice:
6563 """
6564 Compute the slice indexer for input labels and step.
6565
(...)
6599 slice(1, 3, None)
6600 """
-> 6601 start_slice, end_slice = self.slice_locs(start, end, step=step)
6603 # return a slice
6604 if not is_scalar(start_slice):
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6824, in Index.slice_locs(self, start, end, step)
6822 end_slice = None
6823 if end is not None:
-> 6824 end_slice = self.get_slice_bound(end, "right")
6825 if end_slice is None:
6826 end_slice = len(self)
File ~/work/pandas/pandas/pandas/core/indexes/base.py:6751, in Index.get_slice_bound(self, label, side)
6749 slc = lib.maybe_booleans_to_slice(slc.view("u1"))
6750 if isinstance(slc, np.ndarray):
-> 6751 raise KeyError(
6752 f"Cannot get {side} slice bound for non-unique "
6753 f"label: {repr(original_label)}"
6754 )
6756 if isinstance(slc, slice):
6757 if side == "left":
KeyError: 'Cannot get right slice bound for non-unique label: 3'
Index.is_monotonic_increasing
and Index.is_monotonic_decreasing
only check that
an index is weakly monotonic. To check for strict monotonicity, you can combine one of those with
the is_unique()
attribute.
In [221]: weakly_monotonic = pd.Index(["a", "b", "c", "c"])
In [222]: weakly_monotonic
Out[222]: Index(['a', 'b', 'c', 'c'], dtype='object')
In [223]: weakly_monotonic.is_monotonic_increasing
Out[223]: True
In [224]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
Out[224]: False
Endpoints are inclusive#
Compared with standard Python sequence slicing in which the slice endpoint is
not inclusive, label-based slicing in pandas is inclusive. The primary
reason for this is that it is often not possible to easily determine the
“successor” or next element after a particular label in an index. For example,
consider the following Series
:
In [225]: s = pd.Series(np.random.randn(6), index=list("abcdef"))
In [226]: s
Out[226]:
a -0.101684
b -0.734907
c -0.130121
d -0.476046
e 0.759104
f 0.213379
dtype: float64
Suppose we wished to slice from c
to e
, using integers this would be
accomplished as such:
In [227]: s[2:5]
Out[227]:
c -0.130121
d -0.476046
e 0.759104
dtype: float64
However, if you only had c
and e
, determining the next element in the
index can be somewhat complicated. For example, the following does not work:
In [228]: s.loc['c':'e' + 1]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[228], line 1
----> 1 s.loc['c':'e' + 1]
TypeError: can only concatenate str (not "int") to str
A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design choice to make label-based slicing include both endpoints:
In [229]: s.loc["c":"e"]
Out[229]:
c -0.130121
d -0.476046
e 0.759104
dtype: float64
This is most definitely a “practicality beats purity” sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly in the way that standard Python integer slicing works.
Indexing potentially changes underlying Series dtype#
The different indexing operation can potentially change the dtype of a Series
.
In [230]: series1 = pd.Series([1, 2, 3])
In [231]: series1.dtype
Out[231]: dtype('int64')
In [232]: res = series1.reindex([0, 4])
In [233]: res.dtype
Out[233]: dtype('float64')
In [234]: res
Out[234]:
0 1.0
4 NaN
dtype: float64
In [235]: series2 = pd.Series([True])
In [236]: series2.dtype
Out[236]: dtype('bool')
In [237]: res = series2.reindex_like(series1)
In [238]: res.dtype
Out[238]: dtype('O')
In [239]: res
Out[239]:
0 True
1 NaN
2 NaN
dtype: object
This is because the (re)indexing operations above silently inserts NaNs
and the dtype
changes accordingly. This can cause some issues when using numpy
ufuncs
such as numpy.logical_and
.
See the GH 2388 for a more detailed discussion.