Internals¶
This section provides a look into some of the pandas internals.
Indexing¶
pandas implements a few objects that can serve as valid containers for the axis labels:
Index
: the generic “ordered set” object, an ndarray of object dtype assuming nothing about its contents. The labels must be hashable (and likely immutable) and unique. Populates a dict of label to location in Cython to do O(1) lookups.
Int64Index
: a version of Index highly optimized for 64-bit integer data, such as time stamps
Float64Index
: a version of Index highly optimized for 64-bit float data
MultiIndex
: the standard hierarchical index object
DatetimeIndex
: an Index object with Timestamp boxed elements (the underlying implementation stores int64 values)
TimedeltaIndex
: an Index object with Timedelta boxed elements (the underlying implementation stores int64 values)
PeriodIndex
: an Index object with Period elements
There are functions that make the creation of a regular index easy:
date_range
: fixed frequency date range generated from a time rule or DateOffset. Returns a DatetimeIndex of Timestamp elements
period_range
: fixed frequency date range generated from a time rule or DateOffset. Returns a PeriodIndex of Period elements, representing timespans
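As a rough illustration of the two helpers above, the following sketch builds a three-day range with each one. The start date and the daily frequency are arbitrary choices made for this example, and the type checks assume a reasonably recent pandas release, where these functions return a DatetimeIndex and a PeriodIndex respectively.

```python
import pandas as pd

# Three consecutive days as points in time
dates = pd.date_range('2020-01-01', periods=3, freq='D')

# The same three days as spans of time
periods = pd.period_range('2020-01-01', periods=3, freq='D')

# Each helper returns a specialised Index, and indexing it gives a boxed scalar
print(type(dates).__name__, type(dates[0]).__name__)
print(type(periods).__name__, type(periods[0]).__name__)
```

Both objects can be passed directly as the index of a Series or DataFrame.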
The motivation for having an Index class in the first place was to enable
different implementations of indexing. This means that it’s possible for you,
the user, to implement a custom Index subclass that may be better suited to
a particular application than the ones provided in pandas.
From an internal implementation point of view, the relevant methods that an
Index must define are one or more of the following (depending on how
incompatible the new object internals are with the Index functions):
get_loc
: returns an “indexer” (an integer, or in some cases a slice object) for a label
slice_locs
: returns the “range” to slice between two labels
get_indexer
: computes the indexing vector for reindexing / data alignment purposes. See the source / docstrings for more on this
get_indexer_non_unique
: computes the indexing vector for reindexing / data alignment purposes when the index is non-unique. See the source / docstrings for more on this
reindex
: does any pre-conversion of the input index, then calls get_indexer
union, intersection
: compute the union or intersection of two Index objects
insert
: inserts a new label into an Index, yielding a new object
delete
: deletes a label, yielding a new object
drop
: deletes a set of labels
take
: analogous to ndarray.take
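To make the list above more concrete, here is a small sketch that calls several of these methods on a plain Index of strings. The labels are made up for illustration, and the behaviour noted in the comments is what a stock Index does; a custom subclass would be free to implement the same methods differently.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c', 'd'])

loc = idx.get_loc('b')                      # integer position of a single label
bounds = idx.slice_locs('b', 'c')           # (start, stop) positions, stop is exclusive
indexer = idx.get_indexer(['c', 'a', 'z'])  # -1 marks labels that are not present

# With sorted duplicate labels, get_loc returns a slice instead of an integer
dup_loc = pd.Index(['a', 'b', 'b', 'c']).get_loc('b')

other = pd.Index(['c', 'd', 'e'])
both = idx.intersection(other)
either = idx.union(other)

# Each of these returns a new Index; idx itself is left unchanged
inserted = idx.insert(1, 'x')
deleted = idx.delete(0)
dropped = idx.drop(['a', 'd'])
taken = idx.take([3, 0])
```

Note that insert, delete and take work on integer positions, while drop works on labels.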
MultiIndex¶
Internally, the MultiIndex consists of a few things: the levels, the
integer labels, and the level names:
In [1]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])
In [2]: index
Out[2]:
MultiIndex(levels=[[0, 1, 2], [u'one', u'two']],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
names=[u'first', u'second'])
In [3]: index.levels
Out[3]: FrozenList([[0, 1, 2], [u'one', u'two']])
In [4]: index.labels
Out[4]: FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
In [5]: index.names
Out[5]: FrozenList([u'first', u'second'])
You can probably guess that the labels determine which unique element is
identified with that location at each layer of the index. It’s important to
note that sortedness is determined solely from the integer labels and does
not check (or care) whether the levels themselves are sorted. Fortunately, the
constructors from_tuples and from_arrays ensure that this is true, but
if you compute the levels and labels yourself, please be careful.
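The relationship between levels and integer labels can be sketched by rebuilding the index entries by hand. One assumption to flag: later pandas releases expose the integer labels under the attribute name codes rather than labels, so the snippet looks for either one.

```python
import pandas as pd

index = pd.MultiIndex.from_product([range(3), ['one', 'two']],
                                   names=['first', 'second'])

# Assumption: newer pandas names the integer labels `codes`, older pandas `labels`
codes = index.codes if hasattr(index, 'codes') else index.labels

# For every position, look up each layer's integer label in that layer's level
rebuilt = [
    tuple(level[code] for level, code in zip(index.levels, position))
    for position in zip(*codes)
]

print(rebuilt[:3])
```

The rebuilt tuples match the entries of the index itself, which shows that the levels hold only the unique values while the integer labels say which of them sits at each position.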
Subclassing pandas Data Structures¶
Warning
There are some easier alternatives before considering subclassing pandas
data structures.
This section describes how to subclass pandas data structures to meet more specific needs. There are 2 points which need attention:
- Override constructor properties.
- Define original properties.
Note
You can find a nice example in the geopandas project.
Override Constructor Properties¶
Each data structure has constructor properties that specify which class is used to construct the result of an operation. By overriding these properties, you can retain your subclasses through pandas data manipulations.
There are 3 constructors to be defined:
_constructor
: Used when a manipulation result has the same dimensions as the original.
_constructor_sliced
: Used when a manipulation result has one dimension fewer than the original, such as slicing a single column from a DataFrame.
_constructor_expanddim
: Used when a manipulation result has one dimension more than the original, such as Series.to_frame() and DataFrame.to_panel().
The following table shows how pandas data structures define constructor properties by default.
Property Attributes | Series | DataFrame | Panel |
---|---|---|---|
_constructor | Series | DataFrame | Panel |
_constructor_sliced | NotImplementedError | Series | DataFrame |
_constructor_expanddim | DataFrame | Panel | NotImplementedError |
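The example below shows how to define SubclassedSeries and SubclassedDataFrame, overriding the constructor properties.
from pandas import Series, DataFrame

class SubclassedSeries(Series):
    # same-dimension results stay SubclassedSeries
    @property
    def _constructor(self):
        return SubclassedSeries
    # results with one more dimension (e.g. to_frame) become SubclassedDataFrame
    @property
    def _constructor_expanddim(self):
        return SubclassedDataFrame

class SubclassedDataFrame(DataFrame):
    # same-dimension results stay SubclassedDataFrame
    @property
    def _constructor(self):
        return SubclassedDataFrame
    # results with one fewer dimension (e.g. a single column) become SubclassedSeries
    @property
    def _constructor_sliced(self):
        return SubclassedSeries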
>>> s = SubclassedSeries([1, 2, 3])
>>> type(s)
<class '__main__.SubclassedSeries'>
>>> to_framed = s.to_frame()
>>> type(to_framed)
<class '__main__.SubclassedDataFrame'>
>>> df = SubclassedDataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
>>> df
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
>>> type(df)
<class '__main__.SubclassedDataFrame'>
>>> sliced1 = df[['A', 'B']]
>>> sliced1
   A  B
0  1  4
1  2  5
2  3  6
>>> type(sliced1)
<class '__main__.SubclassedDataFrame'>
>>> sliced2 = df['A']
>>> sliced2
0    1
1    2
2    3
Name: A, dtype: int64
>>> type(sliced2)
<class '__main__.SubclassedSeries'>
Define Original Properties¶
To give your subclass additional properties, you need to let pandas know which properties are added, because pandas maps unknown properties to data names by overriding __getattribute__. Defining original properties can be done in one of 2 ways:
- Define _internal_names and _internal_names_set for temporary properties which WILL NOT be passed to manipulation results.
- Define _metadata for normal properties which will be passed to manipulation results.
Below is an example that defines 2 original properties, “internal_cache” as a temporary property and “added_property” as a normal property:
import pandas as pd
from pandas import DataFrame

class SubclassedDataFrame2(DataFrame):
    # temporary properties
    _internal_names = pd.DataFrame._internal_names + ['internal_cache']
    _internal_names_set = set(_internal_names)
    # normal properties
    _metadata = ['added_property']
    @property
    def _constructor(self):
        return SubclassedDataFrame2
>>> df = SubclassedDataFrame2({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
>>> df
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
>>> df.internal_cache = 'cached'
>>> df.added_property = 'property'
>>> df.internal_cache
'cached'
>>> df.added_property
'property'
# properties defined in _internal_names are reset after manipulation
>>> df[['A', 'B']].internal_cache
AttributeError: 'SubclassedDataFrame2' object has no attribute 'internal_cache'
# properties defined in _metadata are retained
>>> df[['A', 'B']].added_property
'property'