Sparse data structures
Note
SparseSeries and SparseDataFrame have been deprecated. Their purpose is served equally well by a Series or DataFrame with sparse values. See Migrating for tips on migrating.
Pandas provides data structures for efficiently storing sparse data.
These are not necessarily sparse in the typical “mostly 0” sense. Rather, you can view these
objects as being “compressed”, where any data matching a specific value (NaN / missing value,
though any value can be chosen, including 0) is omitted. The compressed values are not actually
stored in the array.
In [1]: arr = np.random.randn(10)
In [2]: arr[2:-2] = np.nan
In [3]: ts = pd.Series(pd.arrays.SparseArray(arr))
In [4]: ts
Out[4]:
0 0.469112
1 -0.282863
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 -0.861849
9 -2.104569
dtype: Sparse[float64, nan]
Notice the dtype, Sparse[float64, nan]. The nan means that elements in the
array that are nan aren’t actually stored; only the non-nan elements are.
Those non-nan elements have a float64 dtype.
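The same idea applies when the omitted value is something other than NaN. The following is a minimal sketch, not taken from the original text, that uses 0 as the fill value and then inspects what is actually stored. It assumes a pandas version where the array class is available as pd.arrays.SparseArray; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Mostly-zero data, so 0 (rather than NaN) is chosen as the value to omit.
dense = np.array([0.0, 0.0, 1.5, 0.0, 0.0, 0.0, -2.0, 0.0])
sparr = pd.arrays.SparseArray(dense, fill_value=0)

# The dtype records both the stored element type and the fill value.
print(sparr.dtype)

# Only the elements that differ from the fill value are kept.
print(sparr.sp_values)

# The omitted value, and the fraction of elements that are stored (2 of 8).
print(sparr.fill_value)
print(sparr.density)

# Converting back to a dense NumPy array restores the omitted zeros.
restored = np.asarray(sparr)
print(np.array_equal(restored, dense))
```

Here sp_values holds just the two non-zero entries, and density reports how much of the array is physically stored.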
The sparse objects exist for memory efficiency reasons. Suppose you had a
large, mostly NA DataFrame:
In [5]: df = pd.DataFrame(np.random.randn(10000, 4))
In [6]: df.iloc[:9998] = np.nan
In [7]: sdf = df.astype(pd.SparseDtype("float", np.nan))
In [8]: sdf.head()
Out[8]:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
In [9]: sdf.dtypes