Indexing and selecting data¶
The axis labeling information in pandas objects serves many purposes:
- Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display.
- Enables automatic and explicit data alignment.
- Allows intuitive getting and setting of subsets of the data set.
In this section, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of pandas objects. The primary focus will be on Series and DataFrame as they have received more development attention in this area.
Note
The Python and NumPy indexing operators []
and attribute operator .
provide quick and easy access to pandas data structures across a wide range
of use cases. This makes interactive work intuitive, as there’s little new
to learn if you already know how to deal with Python dictionaries and NumPy
arrays. However, since the type of the data to be accessed isn’t known in
advance, directly using standard operators has some optimization limits. For
production code, we recommended that you take advantage of the optimized
pandas data access methods exposed in this chapter.
Warning
Whether a copy or a reference is returned for a setting operation, may
depend on the context. This is sometimes called chained assignment
and
should be avoided. See Returning a View versus Copy.
Warning
Indexing on an integer-based Index with floats has been clarified in 0.18.0, for a summary of the changes, see here.
See the MultiIndex / Advanced Indexing for MultiIndex
and more advanced indexing documentation.
See the cookbook for some advanced strategies.
Different choices for indexing¶
Object selection has had a number of user-requested additions in order to support more explicit location based indexing. Pandas now supports three types of multi-axis indexing.
.loc
is primarily label based, but may also be used with a boolean array..loc
will raiseKeyError
when the items are not found. Allowed inputs are:A single label, e.g.
5
or'a'
(Note that5
is interpreted as a label of the index. This use is not an integer position along the index.).A list or array of labels
['a', 'b', 'c']
.A slice object with labels
'a':'f'
(Note that contrary to usual python slices, both the start and the stop are included, when present in the index! See Slicing with labels and Endpoints are inclusive.)A boolean array
A
callable
function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).New in version 0.18.1.
See more at Selection by Label.
.iloc
is primarily integer position based (from0
tolength-1
of the axis), but may also be used with a boolean array..iloc
will raiseIndexError
if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing. (this conforms with Python/NumPy slice semantics). Allowed inputs are:An integer e.g.
5
.A list or array of integers
[4, 3, 0]
.A slice object with ints
1:7
.A boolean array.
A
callable
function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).New in version 0.18.1.
See more at Selection by Position, Advanced Indexing and Advanced Hierarchical.
.loc
,.iloc
, and also[]
indexing can accept acallable
as indexer. See more at Selection By Callable.
Getting values from an object with multi-axes selection uses the following
notation (using .loc
as an example, but the following applies to .iloc
as
well). Any of the axes accessors may be the null slice :
. Axes left out of
the specification are assumed to be :
, e.g. p.loc['a']
is equivalent to
p.loc['a', :, :]
.
Object Type | Indexers |
---|---|
Series | s.loc[indexer] |
DataFrame | df.loc[row_indexer,column_indexer] |
Basics¶
As mentioned when introducing the data structures in the last section, the primary function of indexing with []
(a.k.a. __getitem__
for those familiar with implementing class behavior in Python) is selecting out
lower-dimensional slices. The following table shows return type values when
indexing pandas objects with []
:
Object Type | Selection | Return Value Type |
---|---|---|
Series | series[label] |
scalar value |
DataFrame | frame[colname] |
Series corresponding to colname |
Here we construct a simple time series data set to use for illustrating the indexing functionality:
In [1]: dates = pd.date_range('1/1/2000', periods=8)
In [2]: df = pd.DataFrame(np.random.randn(8, 4),
...: index=dates, columns=['A', 'B', 'C', 'D'])
...:
In [3]: df
Out[3]:
A B C D
2000-01-01 0.469112 -0.282863 -1.509059 -1.135632
2000-01-02 1.212112 -0.173215 0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929 1.071804
2000-01-04 0.721555 -0.706771 -1.039575 0.271860
2000-01-05 -0.424972 0.567020 0.276232 -1.087401
2000-01-06 -0.673690 0.113648 -1.478427 0.524988
2000-01-07 0.404705 0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312 0.844885
Note
None of the indexing functionality is time series specific unless specifically stated.
Thus, as per above, we have the most basic indexing using []
:
In [4]: s = df['A']
In [5]: s[dates[5]]
Out[5]: -0.6736897080883706
You can pass a list of columns to []
to select columns in that order.
If a column is not contained in the DataFrame, an exception will be
raised. Multiple columns can also be set in this manner:
In [6]: df
Out[6]:
A B C D
2000-01-01 0.469112 -0.282863 -1.509059 -1.135632
2000-01-02 1.212112 -0.173215 0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929 1.071804
2000-01-04 0.721555 -0.706771 -1.039575 0.271860
2000-01-05 -0.424972 0.567020 0.276232 -1.087401
2000-01-06 -0.673690 0.113648 -1.478427 0.524988
2000-01-07 0.404705 0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312 0.844885
In [7]: df[['B', 'A']] = df[['A', 'B']]
In [8]: df
Out[8]:
A B C D
2000-01-01 -0.282863 0.469112 -1.509059 -1.135632
2000-01-02 -0.173215 1.212112 0.119209 -1.044236
2000-01-03 -2.104569 -0.861849 -0.494929 1.071804
2000-01-04 -0.706771 0.721555 -1.039575 0.271860
2000-01-05 0.567020 -0.424972 0.276232 -1.087401
2000-01-06 0.113648 -0.673690 -1.478427 0.524988
2000-01-07 0.577046 0.404705 -1.715002 -1.039268
2000-01-08 -1.157892 -0.370647 -1.344312 0.844885
You may find this useful for applying a transform (in-place) to a subset of the columns.
Warning
pandas aligns all AXES when setting Series
and DataFrame
from .loc
, and .iloc
.
This will not modify df
because the column alignment is before value assignment.
In [9]: df[['A', 'B']]
Out[9]:
A B
2000-01-01 -0.282863 0.469112
2000-01-02 -0.173215 1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771 0.721555
2000-01-05 0.567020 -0.424972
2000-01-06 0.113648 -0.673690
2000-01-07 0.577046 0.404705
2000-01-08 -1.157892 -0.370647
In [10]: df.loc[:, ['B', 'A']] = df[['A', 'B']]
In [11]: df[['A', 'B']]
Out[11]:
A B
2000-01-01 -0.282863 0.469112
2000-01-02 -0.173215 1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771 0.721555
2000-01-05 0.567020 -0.424972
2000-01-06 0.113648 -0.673690
2000-01-07 0.577046 0.404705
2000-01-08 -1.157892 -0.370647
The correct way to swap column values is by using raw values:
In [12]: df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()
In [13]: df[['A', 'B']]
Out[13]:
A B
2000-01-01 0.469112 -0.282863
2000-01-02 1.212112 -0.173215
2000-01-03 -0.861849 -2.104569
2000-01-04 0.721555 -0.706771
2000-01-05 -0.424972 0.567020
2000-01-06 -0.673690 0.113648
2000-01-07 0.404705 0.577046
2000-01-08 -0.370647 -1.157892
Attribute access¶
You may access an index on a Series
or column on a DataFrame
directly
as an attribute:
In [14]: sa = pd.Series([1, 2, 3], index=list('abc'))
In [15]: dfa = df.copy()
In [16]: sa.b
Out[16]: 2
In [17]: dfa.A