Comparison with R / R libraries
pandas aims to provide much of the data manipulation and analysis
functionality that people use R for, so this page takes a more detailed
look at the R language and its many third-party libraries as they relate
to pandas. In comparisons with R and CRAN libraries, we care about the
following things:
- Functionality / flexibility: what can/cannot be done with each tool
- Performance: how fast operations are. Hard numbers/benchmarks are preferable
- Ease-of-use: is one tool easier/harder to use (you may have to be the judge of this, given side-by-side code comparisons)
This page also serves as a translation guide for users of these R packages.
To transfer DataFrame objects from pandas to R, one option is to
use HDF5 files; see External Compatibility for an
example.
Quick Reference
We’ll start with a quick reference guide that pairs some common R operations, written with dplyr, with their pandas equivalents.
Querying, Filtering, Sampling
| R | pandas | 
|---|---|
| dim(df) | df.shape | 
| head(df) | df.head() | 
| slice(df, 1:10) | df.iloc[:10] | 
| filter(df, col1 == 1, col2 == 1) | df.query('col1 == 1 & col2 == 1') | 
| df[df$col1 == 1 & df$col2 == 1,] | df[(df.col1 == 1) & (df.col2 == 1)] | 
| select(df, col1, col2) | df[['col1', 'col2']] | 
| select(df, col1:col3) | df.loc[:, 'col1':'col3'] | 
| select(df, -(col1:col3)) | df.drop(cols_to_drop, axis=1) but see [1] | 
| distinct(select(df, col1)) | df[['col1']].drop_duplicates() | 
| distinct(select(df, col1, col2)) | df[['col1', 'col2']].drop_duplicates() | 
| sample_n(df, 10) | df.sample(n=10) | 
| sample_frac(df, 0.01) | df.sample(frac=0.01) | 
| [1] | R’s shorthand for a subrange of columns (select(df, col1:col3)) can be approached cleanly in pandas if you have the list of columns, for example df[cols[1:3]] or df.drop(cols[1:3], axis=1), but doing this by column name is a bit messy. | 
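The filtering and column-range rows above can be tried on a small frame. The following is a minimal sketch, not part of the original reference: the frame `df` and its columns `col1` to `col4` are hypothetical, chosen only for illustration. It checks that `query` and boolean masking select the same rows, and shows one possible way to drop a range of columns by name by reusing a `.loc` label slice.

```python
import pandas as pd

# Hypothetical toy frame; the column names are chosen only for illustration.
df = pd.DataFrame(
    {
        "col1": [1, 1, 2, 1],
        "col2": [1, 2, 1, 1],
        "col3": [10, 20, 30, 40],
        "col4": [0.1, 0.2, 0.3, 0.4],
    }
)

# filter(df, col1 == 1, col2 == 1): the query string and the boolean mask
# should pick out the same rows.
by_query = df.query("col1 == 1 & col2 == 1")
by_mask = df[(df["col1"] == 1) & (df["col2"] == 1)]
assert by_query.equals(by_mask)

# select(df, col1:col3): label slices with .loc include both endpoints.
kept = df.loc[:, "col1":"col3"]

# select(df, -(col1:col3)): reuse the label slice to name the columns to drop.
dropped = df.drop(columns=df.loc[:, "col1":"col3"].columns)

print(list(kept.columns))
print(list(dropped.columns))
```

Unlike positional slices, the `.loc` label slice includes its right endpoint, which is why `'col1':'col3'` covers three columns here.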
Sorting
| R | pandas | 
|---|---|
| arrange(df, col1, col2) | df.sort_values(['col1', 'col2']) | 
| arrange(df, desc(col1)) | df.sort_values('col1', ascending=False) | 
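As a rough illustration of the sorting rows, the sketch below runs both calls on a made-up frame (the data is hypothetical). It also shows a mixed-direction sort, which goes slightly beyond the table: `sort_values` accepts a list for `ascending`, one flag per sort key.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"col1": [2, 1, 2, 1], "col2": [4, 3, 1, 2]})

# arrange(df, col1, col2): sort by col1, breaking ties with col2.
asc = df.sort_values(["col1", "col2"])

# arrange(df, desc(col1)): descending sort on a single column.
desc = df.sort_values("col1", ascending=False)

# arrange(df, col1, desc(col2)): one ascending flag per key.
mixed = df.sort_values(["col1", "col2"], ascending=[True, False])

# sort_values keeps the original row labels; reset them if a fresh
# 0..n-1 index is wanted, which is closer to what R displays.
asc_renumbered = asc.reset_index(drop=True)

print(asc)
print(mixed)
```

Note that `sort_values` returns a new frame and leaves `df` unchanged.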
Transforming
| R | pandas | 
|---|---|
| select(df, col_one = col1) | df.rename(columns={'col1': 'col_one'})['col_one'] | 
| rename(df, col_one = col1) | df.rename(columns={'col1': 'col_one'}) | 
| mutate(df, c=a-b) | df.assign(c=df.a-df.b) | 
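The `rename` and `assign` rows can be sketched on a small frame as follows. The data is hypothetical, and the chained form with a lambda is just one possible style: it lets `assign` refer to the frame produced by the previous step, much like a dplyr pipeline.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"a": [5, 7, 9], "b": [1, 2, 3], "col1": [10, 20, 30]})

# rename(df, col_one = col1)
renamed = df.rename(columns={"col1": "col_one"})

# mutate(df, c = a - b)
with_c = df.assign(c=df["a"] - df["b"])

# Chained version: the lambda receives the intermediate frame, so it
# does not depend on the name of the original variable.
chained = (
    df.rename(columns={"col1": "col_one"})
      .assign(c=lambda d: d["a"] - d["b"])
)

print(chained)
```

Both `rename` and `assign` return new frames, so `df` itself keeps its original columns.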
Grouping and Summarizing
| R | pandas | 
|---|---|
| summary(df) | df.describe() | 
| gdf <- group_by(df, col1) | gdf = df.groupby('col1') | 
| summarise(gdf, avg=mean(col1, na.rm=TRUE)) | df.groupby('col1').agg({'col1': 'mean'}) | 
| summarise(gdf, total=sum(col1)) | df.groupby('col1').sum() | 
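To make the grouping rows concrete, here is a minimal sketch on made-up data. It assumes a separate value column, `col2`, which is hypothetical and not in the table above, so that the grouping key and the summarized column differ. It uses named aggregation to produce `avg` and `total` columns in one call, similar in spirit to `summarise`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: col1 is the grouping key, col2 the value column.
df = pd.DataFrame(
    {"col1": ["x", "x", "y", "y"], "col2": [1.0, np.nan, 3.0, 5.0]}
)

# gdf <- group_by(df, col1)
gdf = df.groupby("col1")

# summarise(gdf, avg = mean(col2, na.rm = TRUE), total = sum(col2, na.rm = TRUE))
# Named aggregation: output_name=(input_column, function).
summary = gdf.agg(avg=("col2", "mean"), total=("col2", "sum"))

print(summary)
```

pandas skips missing values in `mean` and `sum` by default, which corresponds to `na.rm=TRUE` in R; R's `sum` without `na.rm=TRUE` would return `NA` for group `x`.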
Base R
Slicing with R’s c
R makes it easy to access data.frame columns by name:
df <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5), d=rnorm(5), e=rnorm(5))
df[, c("a", "c", "e")]
or by integer location:
df <- data.frame(matrix(rnorm(1000), ncol=100))
df[, c(1:10, 25:30, 40, 50:100)]
Selecting multiple columns by name in pandas is straightforward:
In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=list('abc'))
In [2]: df[['a', 'c']]
Out[2]: 
          a         c
0  0.469112 -1.509059
1 -1.135632 -0.173215
2  0.119209 -0.861849
3 -2.104569  1.071804
4  0.721555 -1.039575
5  0.271860  0.567020
6  0.276232 -0.673690
7  0.113648  0.524988
8  0.404705 -1.715002
9 -1.039268 -1.157892
In [3]: df.loc[:, ['a', 'c']]