Series and DataFrame have a method pct_change() to compute the percent change over a given number of periods (using fill_method to fill NA/null values before computing the percent change).
Series
DataFrame
pct_change()
fill_method
In [1]: ser = pd.Series(np.random.randn(8)) In [2]: ser.pct_change() Out[2]: 0 NaN 1 -1.602976 2 4.334938 3 -0.247456 4 -2.067345 5 -1.142903 6 -1.688214 7 -9.759729 dtype: float64
In [3]: df = pd.DataFrame(np.random.randn(10, 4)) In [4]: df.pct_change(periods=3) Out[4]: 0 1 2 3 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN NaN 3 -0.218320 -1.054001 1.987147 -0.510183 4 -0.439121 -1.816454 0.649715 -4.822809 5 -0.127833 -3.042065 -5.866604 -1.776977 6 -2.596833 -1.959538 -2.111697 -3.798900 7 -0.117826 -2.169058 0.036094 -0.067696 8 2.492606 -1.357320 -1.205802 -1.558697 9 -1.012977 2.324558 -1.003744 -0.371806
Series.cov() can be used to compute covariance between series (excluding missing values).
Series.cov()
In [5]: s1 = pd.Series(np.random.randn(1000)) In [6]: s2 = pd.Series(np.random.randn(1000)) In [7]: s1.cov(s2) Out[7]: 0.0006801088174310875
Analogously, DataFrame.cov() to compute pairwise covariances among the series in the DataFrame, also excluding NA/null values.
DataFrame.cov()
Note
Assuming the missing data are missing at random this results in an estimate for the covariance matrix which is unbiased. However, for many applications this estimate may not be acceptable because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.
In [8]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"]) In [9]: frame.cov() Out[9]: a b c d e a 1.000882 -0.003177 -0.002698 -0.006889 0.031912 b -0.003177 1.024721 0.000191 0.009212 0.000857 c -0.002698 0.000191 0.950735 -0.031743 -0.005087 d -0.006889 0.009212 -0.031743 1.002983 -0.047952 e 0.031912 0.000857 -0.005087 -0.047952 1.042487
DataFrame.cov also supports an optional min_periods keyword that specifies the required minimum number of observations for each column pair in order to have a valid result.
DataFrame.cov
min_periods
In [10]: frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"]) In [11]: frame.loc[frame.index[:5], "a"] = np.nan In [12]: frame.loc[frame.index[5:10], "b"] = np.nan In [13]: frame.cov() Out[13]: a b c a 1.123670 -0.412851 0.018169 b -0.412851 1.154141 0.305260 c 0.018169 0.305260 1.301149 In [14]: frame.cov(min_periods=12) Out[14]: a b c a 1.123670 NaN 0.018169 b NaN 1.154141 0.305260 c 0.018169 0.305260 1.301149
Correlation may be computed using the corr() method. Using the method parameter, several methods for computing correlations are provided:
corr()
method
Method name
Description
pearson (default)
Standard correlation coefficient
kendall
Kendall Tau correlation coefficient
spearman
Spearman rank correlation coefficient
All of these are currently computed using pairwise complete observations. Wikipedia has articles covering the above correlation coefficients:
Pearson correlation coefficient
Kendall rank correlation coefficient
Spearman’s rank correlation coefficient
Please see the caveats associated with this method of calculating correlation matrices in the covariance section.
In [15]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"]) In [16]: frame.iloc[::2] = np.nan # Series with Series In [17]: frame["a"].corr(frame["b"]) Out[17]: 0.013479040400098775 In [18]: frame["a"].corr(frame["b"], method="spearman") Out[18]: -0.007289885159540637 # Pairwise correlation of DataFrame columns In [19]: frame.corr() Out[19]: a b c d e a 1.000000 0.013479 -0.049269 -0.042239 -0.028525 b 0.013479 1.000000 -0.020433 -0.011139 0.005654 c -0.049269 -0.020433 1.000000 0.018587 -0.054269 d -0.042239 -0.011139 0.018587 1.000000 -0.017060 e -0.028525 0.005654 -0.054269 -0.017060 1.000000
Note that non-numeric columns will be automatically excluded from the correlation calculation.
Like cov, corr also supports the optional min_periods keyword:
cov
corr
In [20]: frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"]) In [21]: frame.loc[frame.index[:5], "a"] = np.nan In [22]: frame.loc[frame.index[5:10], "b"] = np.nan In [23]: frame.corr() Out[23]: a b c a 1.000000 -0.121111 0.069544 b -0.121111 1.000000 0.051742 c 0.069544 0.051742 1.000000 In [24]: frame.corr(min_periods=12) Out[24]: a b c a 1.000000 NaN 0.069544 b NaN 1.000000 0.051742 c 0.069544 0.051742 1.000000
New in version 0.24.0.
The method argument can also be a callable for a generic correlation calculation. In this case, it should be a single function that produces a single value from two ndarray inputs. Suppose we wanted to compute the correlation based on histogram intersection:
# histogram intersection In [25]: def histogram_intersection(a, b): ....: return np.minimum(np.true_divide(a, a.sum()), np.true_divide(b, b.sum())).sum() ....: In [26]: frame.corr(method=histogram_intersection) Out[26]: a b c a 1.000000 -6.404882 -2.058431 b -6.404882 1.000000 -19.255743 c -2.058431 -19.255743 1.000000
A related method corrwith() is implemented on DataFrame to compute the correlation between like-labeled Series contained in different DataFrame objects.
corrwith()
In [27]: index = ["a", "b", "c", "d", "e"] In [28]: columns = ["one", "two", "three", "four"] In [29]: df1 = pd.DataFrame(np.random.randn(5, 4), index=index, columns=columns) In [30]: df2 = pd.DataFrame(np.random.randn(4, 4), index=index[:4], columns=columns) In [31]: df1.corrwith(df2) Out[31]: one -0.125501 two -0.493244 three 0.344056 four 0.004183 dtype: float64 In [32]: df2.corrwith(df1, axis=1) Out[32]: a -0.675817 b 0.458296 c 0.190809 d -0.186275 e NaN dtype: float64
The rank() method produces a data ranking with ties being assigned the mean of the ranks (by default) for the group:
rank()
In [33]: s = pd.Series(np.random.randn(5), index=list("abcde")) In [34]: s["d"] = s["b"] # so there's a tie In [35]: s.rank() Out[35]: a 5.0 b 2.5 c 1.0 d 2.5 e 4.0 dtype: float64
rank() is also a DataFrame method and can rank either the rows (axis=0) or the columns (axis=1). NaN values are excluded from the ranking.
axis=0
axis=1
NaN
In [36]: df = pd.DataFrame(np.random.randn(10, 6)) In [37]: df[4] = df[2][:5] # some ties In [38]: df Out[38]: 0 1 2 3 4 5 0 -0.904948 -1.163537 -1.457187 0.135463 -1.457187 0.294650 1 -0.976288 -0.244652 -0.748406 -0.999601 -0.748406 -0.800809 2 0.401965 1.460840 1.256057 1.308127 1.256057 0.876004 3 0.205954 0.369552 -0.669304 0.038378 -0.669304 1.140296 4 -0.477586 -0.730705 -1.129149 -0.601463 -1.129149 -0.211196 5 -1.092970 -0.689246 0.908114 0.204848 NaN 0.463347 6 0.376892 0.959292 0.095572 -0.593740 NaN -0.069180 7 -1.002601 1.957794 -0.120708 0.094214 NaN -1.467422 8 -0.547231 0.664402 -0.519424 -0.073254 NaN -1.263544 9 -0.250277 -0.237428 -1.056443 0.419477 NaN 1.375064 In [39]: df.rank(1) Out[39]: 0 1 2 3 4 5 0 4.0 3.0 1.5 5.0 1.5 6.0 1 2.0 6.0 4.5 1.0 4.5 3.0 2 1.0 6.0 3.5 5.0 3.5 2.0 3 4.0 5.0 1.5 3.0 1.5 6.0 4 5.0 3.0 1.5 4.0 1.5 6.0 5 1.0 2.0 5.0 3.0 NaN 4.0 6 4.0 5.0 3.0 1.0 NaN 2.0 7 2.0 5.0 3.0 4.0 NaN 1.0 8 2.0 5.0 3.0 4.0 NaN 1.0 9 2.0 3.0 1.0 4.0 NaN 5.0
rank optionally takes a parameter ascending which by default is true; when false, data is reverse-ranked, with larger values assigned a smaller rank.
rank
ascending
rank supports different tie-breaking methods, specified with the method parameter:
average : average rank of tied group min : lowest rank in the group max : highest rank in the group first : ranks assigned in the order they appear in the array
average : average rank of tied group
average
min : lowest rank in the group
min
max : highest rank in the group
max
first : ranks assigned in the order they appear in the array
first
See the window operations user guide for an overview of windowing functions.