Cookbook¶
This is a respository for short and sweet examples and links for useful pandas recipes. We encourage users to add to this documentation.
This is a great First Pull Request (to add interesting links and/or put short code inline for existing links)
Idioms¶
These are some neat pandas idioms
How to split a frame with a boolean criterion?
Selection¶
The indexing docs.
Indexing using both row labels and conditionals, see here
Use loc for label-oriented slicing and iloc positional slicing, see here
Extend a panel frame by transposing, adding a new dimension, and transposing back to the original dimensions, see here
Mask a panel by using np.where and then reconstructing the panel with the new masked values here
Using ~ to take the complement of a boolean array, see here
MultiIndexing¶
The multindexing docs.
Missing Data¶
The missing data docs.
Fill forward a reversed timeseries
In [1]: df = pd.DataFrame(np.random.randn(6,1), index=pd.date_range('2013-08-01', periods=6, freq='B'), columns=list('A'))
In [2]: df.ix[3,'A'] = np.nan
In [3]: df
A
2013-08-01 0.469112
2013-08-02 -0.282863
2013-08-05 -1.509059
2013-08-06 NaN
2013-08-07 1.212112
2013-08-08 -0.173215
[6 rows x 1 columns]
In [4]: df.reindex(df.index[::-1]).ffill()
A
2013-08-08 -0.173215
2013-08-07 1.212112
2013-08-06 1.212112
2013-08-05 -1.509059
2013-08-02 -0.282863
2013-08-01 0.469112
[6 rows x 1 columns]
Replace¶
Grouping¶
The grouping docs.
Apply to different items in a group
Replacing values with groupby means
Sort by group with aggregation
Create multiple aggregated columns
Create a value counts column and reassign back to the DataFrame
Splitting¶
Timeseries¶
Turn a matrix with hours in columns and days in rows into a continous row sequence in the form of a time series. How to rearrange a python pandas dataframe?
Plotting¶
The Plotting docs.
Setting x-axis major and minor labels
Data In/Out¶
Performance comparison of SQL vs HDF5
CSV¶
The CSV docs
Reading only certain rows of a csv chunk-by-chunk
Reading the first few lines of a frame
Reading a file that is compressed but not by gzip/bz2 (the native compresed formats which read_csv understands). This example shows a WinZipped file, but is a general application of opening the file within a context manager and using that handle to read. See here
Reading CSV with Unix timestamps and converting to local timezone
Excel¶
The Excel docs
Reading from a filelike handle
Reading HTML tables from a server that cannot handle the default request header
HDFStore¶
The HDFStores docs
Simple Queries with a Timestamp Index
Managing heteregenous data using a linked multiple table hierarchy
Merging on-disk tables with millions of rows
Deduplicating a large store by chunks, essentially a recusive reduction operation. Shows a function for taking in data from csv file and creating a store by chunks, with date parsing as well. See here
Creating a store chunk-by-chunk from a csv file
Appending to a store, while creating a unique index
Reading in a sequence of files, then providing a global unique index to a store while appending
Troubleshoot HDFStore exceptions
Setting min_itemsize with strings
Using ptrepack to create a completely-sorted-index on a store
Storing Attributes to a group node
In [5]: df = DataFrame(np.random.randn(8,3))
In [6]: store = HDFStore('test.h5')
In [7]: store.put('df',df)
# you can store an arbitrary python object via pickle
In [8]: store.get_storer('df').attrs.my_attribute = dict(A = 10)
In [9]: store.get_storer('df').attrs.my_attribute
{'A': 10}
Aliasing Axis Names¶
To globally provide aliases for axis names, one can define these 2 functions:
In [10]: def set_axis_alias(cls, axis, alias):
....: if axis not in cls._AXIS_NUMBERS:
....: raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
....: cls._AXIS_ALIASES[alias] = axis
....:
In [11]: def clear_axis_alias(cls, axis, alias):
....: if axis not in cls._AXIS_NUMBERS:
....: raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
....: cls._AXIS_ALIASES.pop(alias,None)
....:
In [12]: set_axis_alias(DataFrame,'columns', 'myaxis2')
In [13]: df2 = DataFrame(randn(3,2),columns=['c1','c2'],index=['i1','i2','i3'])
In [14]: df2.sum(axis='myaxis2')
i1 -0.499427
i2 0.966720
i3 0.174175
dtype: float64
In [15]: clear_axis_alias(DataFrame,'columns', 'myaxis2')