Comparison with R / R libraries¶
Since pandas aims to provide much of the data manipulation and analysis functionality that people use R for, this page takes a more detailed look at the R language and its many third-party libraries as they relate to pandas. In comparisons with R and CRAN libraries, we care about the following things:
- Functionality / flexibility: what can/cannot be done with each tool
- Performance: how fast operations are. Hard numbers/benchmarks are preferable
- Ease-of-use: is one tool easier/harder to use (you may have to be the judge of this, given side-by-side code comparisons)
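This page also serves as a translation guide for users of these R packages. Wherever np or pd appears in the pandas examples, it refers to the usual aliases (import numpy as np and import pandas as pd).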
Base R¶
subset¶
New in version 0.13.
The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame where one column’s values are less than or equal to another column’s values:
df <- data.frame(a=rnorm(10), b=rnorm(10))
subset(df, a <= b)
df[df$a <= df$b,] # note the comma
In pandas, there are a few ways to perform subsetting. You can use query(), pass a boolean expression as if it were an index/slice, or use standard boolean indexing with loc:
In [1]: from pandas import DataFrame
In [2]: from numpy import random
In [3]: df = DataFrame({'a': random.randn(10), 'b': random.randn(10)})
In [4]: df.query('a <= b')
a b
0 0.026864 0.731942
1 -2.071188 -0.714342
5 -0.550596 -0.391094
[3 rows x 2 columns]
In [5]: df[df.a <= df.b]
a b
0 0.026864 0.731942
1 -2.071188 -0.714342
5 -0.550596 -0.391094
[3 rows x 2 columns]
In [6]: df.loc[df.a <= df.b]
a b
0 0.026864 0.731942
1 -2.071188 -0.714342
5 -0.550596 -0.391094
[3 rows x 2 columns]
For more details and examples see the query documentation.
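As a self-contained check, the following sketch confirms that the three pandas forms above select the same rows. The seed and the names rng, by_query, by_mask and by_loc are assumptions made for illustration, and the imports use the conventional np/pd aliases rather than the from-imports of the session above.

```python
import numpy as np
import pandas as pd

# A fixed seed (an arbitrary choice) keeps the example reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.standard_normal(10), 'b': rng.standard_normal(10)})

by_query = df.query('a <= b')   # expression string, like R's subset(df, a <= b)
by_mask = df[df.a <= df.b]      # boolean mask inside [], like df[df$a <= df$b, ]
by_loc = df.loc[df.a <= df.b]   # the same mask passed through .loc

# All three keep the original index labels and return identical frames.
assert by_query.equals(by_mask) and by_mask.equals(by_loc)
```

Unlike the R form, none of the pandas forms needs a trailing comma: a boolean mask inside [] always selects rows.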
with¶
New in version 0.13.
An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:
df <- data.frame(a=rnorm(10), b=rnorm(10))
with(df, a + b)
df$a + df$b # same as the previous expression
In pandas the equivalent expression, using the eval() method, would be:
In [7]: df = DataFrame({'a': random.randn(10), 'b': random.randn(10)})
In [8]: df.eval('a + b')
0 0.436549
1 -2.071980
2 -2.090933
3 0.447128
4 1.947388
5 -1.181133
6 2.974307
7 0.707994
8 1.424054
9 0.146349
dtype: float64
In [9]: df.a + df.b # same as the previous expression
0 0.436549
1 -2.071980
2 -2.090933
3 0.447128
4 1.947388
5 -1.181133
6 2.974307
7 0.707994
8 1.424054
9 0.146349
dtype: float64
In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the eval documentation.
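The equivalence between the two pandas spellings can be checked directly. This is a minimal sketch; the seed and the names via_eval and via_attr are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility
df = pd.DataFrame({'a': rng.standard_normal(10), 'b': rng.standard_normal(10)})

# Column names are resolved inside the string, like with(df, a + b) in R.
via_eval = df.eval('a + b')

# Explicit column access, like df$a + df$b in R.
via_attr = df.a + df.b

# Both are float64 Series on the same index with matching values.
assert via_eval.index.equals(via_attr.index)
assert np.allclose(via_eval, via_attr)
```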
zoo¶
xts¶
plyr¶
plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around three data structures in R, a for arrays, l for lists, and d for data.frame. The table below shows how these data structures could be mapped in Python.
R | Python |
---|---|
array | list |
list | dictionary or list of objects |
data.frame | DataFrame |
ddply¶
An expression using a data.frame called df in R where you want to summarize x by month and week:
require(plyr)
df <- data.frame(
x = runif(120, 1, 168),
y = runif(120, 7, 334),
z = runif(120, 1.7, 20.7),
month = rep(c(5,6,7,8),30),
week = sample(1:4, 120, TRUE)
)
ddply(df, .(month, week), summarize,
mean = round(mean(x), 2),
sd = round(sd(x), 2))
In pandas the equivalent expression, using the groupby() method, would be:
In [10]: df = DataFrame({
....: 'x': random.uniform(1., 168., 120),
....: 'y': random.uniform(7., 334., 120),
....: 'z': random.uniform(1.7, 20.7, 120),
....: 'month': [5,6,7,8]*30,
....: 'week': random.randint(1,4, 120)  # upper bound is exclusive, so weeks are 1-3
....: })
....:
In [11]: grouped = df.groupby(['month','week'])
In [12]: print(grouped['x'].agg([np.mean, np.std]))
mean std
month week
5 1 76.668813 50.666094
2 68.651122 43.948919
3 69.108800 42.935478
6 1 94.854115 58.036739
2 97.374311 46.307307
3 113.711425 50.303955
7 1 83.197105 45.647110
2 94.956183 51.086197
3 94.872662 47.258354
8 1 106.109188 49.332236
2 73.117347 53.968658
3 89.748792 53.828984
[12 rows x 2 columns]
For more details and examples see the groupby documentation.
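The ddply call above rounds both statistics to two decimals and returns a flat data.frame. The following is a minimal sketch of a result with the same shape in pandas, assuming a pandas version that supports named aggregation (0.25 or later). The seed and the name summary are illustrative choices, and only the x column is generated because it is the only one summarized.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # arbitrary seed for reproducibility
df = pd.DataFrame({
    'x': rng.uniform(1.0, 168.0, 120),
    'month': [5, 6, 7, 8] * 30,
    # integers() excludes the upper bound, so 5 gives weeks 1-4 like sample(1:4, ...)
    'week': rng.integers(1, 5, 120),
})

summary = (
    df.groupby(['month', 'week'])['x']
      .agg(mean='mean', sd='std')  # named aggregation: output column = function
      .round(2)                    # mirrors round(..., 2) in the ddply call
      .reset_index()               # flat columns instead of a hierarchical index
)
```

The std aggregation in pandas is the sample standard deviation (ddof=1), which matches R's sd().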
reshape / reshape2¶
melt.array¶
An expression using a three-dimensional array called a in R where you want to melt it into a data.frame:
a <- array(c(1:23, NA), c(2,3,4))
data.frame(melt(a))
In Python, since a is a NumPy array, you can iterate over it with np.ndenumerate() in a list comprehension.
In [13]: a = np.array(list(range(1,24))+[np.nan]).reshape(2,3,4)
In [14]: DataFrame([tuple(list(x)+[val]) for x, val in np.ndenumerate(a)])
0 1 2 3
0 0 0 0 1
1 0 0 1 2
2 0 0 2 3
3 0 0 3 4
4 0 1 0 5
5 0 1 1 6
6 0 1 2 7
7 0 1 3 8
8 0 2 0 9
9 0 2 1 10
10 0 2 2 11
11 0 2 3 12
12 1 0 0 13
13 1 0 1 14
14 1 0 2 15
.. .. .. ...
[24 rows x 4 columns]
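A standalone version of the same idea, written as a sketch with named columns. The column names dim0, dim1, dim2 and value, and the variable name long_df, are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# 2 x 3 x 4 array holding 1..23 plus one missing value (float dtype because of NaN).
a = np.array(list(range(1, 24)) + [np.nan]).reshape(2, 3, 4)

# np.ndenumerate yields ((i, j, k), value) pairs in row-major order;
# appending the value to the index tuple gives one long-format row per cell.
long_df = pd.DataFrame(
    [idx + (val,) for idx, val in np.ndenumerate(a)],
    columns=['dim0', 'dim1', 'dim2', 'value'],
)
```

NumPy fills the array in row-major order and indexes from 0, whereas R's array() fills in column-major order and indexes from 1, so the index columns will not match the output of R's melt row for row.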
melt.list¶
An expression using a list called a in R where you want to melt it into a data.frame:
a <- as.list(c(1:4, NA))
data.frame(melt(a))
In Python, this list would be a list of tuples, so the DataFrame() constructor converts it to a DataFrame as required.
In [15]: a = list(enumerate(list(range(1,5))+[np.nan]))
In [16]: DataFrame(a)
0 1
0 0 1
1 1 2
2 2 3
3 3 4
4 4 NaN
[5 rows x 2 columns]
For more details and examples see the Intro to Data Structures documentation.
melt.data.frame¶
An expression using a data.frame called cheese in R where you want to reshape the data.frame:
cheese <- data.frame(
first = c('John', 'Mary'),
last = c('Doe', 'Bo'),
height = c(5.5, 6.0),
weight = c(130, 150)
)
melt(cheese, id=c("first", "last"))
In pandas, the melt() function is the equivalent:
In [17]: cheese = DataFrame({'first' : ['John', 'Mary'],
....: 'last' : ['Doe', 'Bo'],
....: 'height' : [5.5, 6.0],
....: 'weight' : [130, 150]})
....:
In [18]: pd.melt(cheese, id_vars=['first', 'last'])
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
[4 rows x 4 columns]
In [19]: cheese.set_index(['first', 'last']).stack() # alternative way
first last
John Doe height 5.5
weight 130.0
Mary Bo height 6.0
weight 150.0
dtype: float64
For more details and examples see the reshaping documentation.
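The two approaches above differ mainly in layout: melt() returns flat columns, while stack() returns a Series with a hierarchical index. The sketch below shows one way to bring the stacked result into the same long layout. The level name variable and the column name value are chosen to match melt's defaults, and the names melted and stacked are illustrative.

```python
import pandas as pd

cheese = pd.DataFrame({
    'first': ['John', 'Mary'],
    'last': ['Doe', 'Bo'],
    'height': [5.5, 6.0],
    'weight': [130, 150],
})

melted = pd.melt(cheese, id_vars=['first', 'last'],
                 var_name='variable', value_name='value')

# stack() moves the remaining columns into the innermost index level;
# naming that level and resetting the index gives melt's long layout.
stacked = (
    cheese.set_index(['first', 'last'])
          .stack()
          .rename_axis(['first', 'last', 'variable'])
          .reset_index(name='value')
)
```

The rows come out in a different order: melt() groups them by variable, whereas stack() groups them by the original row.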
cast¶
An expression using a data.frame called df in R to cast into a higher dimensional array:
df <- data.frame(
x = runif(12, 1, 168),
y = runif(12, 7, 334),
z = runif(12, 1.7, 20.7),
month = rep(c(5,6,7),4),
week = rep(c(1,2), 6)
)
mdf <- melt(df, id=c("month", "week"))
acast(mdf, week ~ month ~ variable, mean)
In pandas, the closest equivalent is pivot_table(), which returns a DataFrame with a hierarchical index rather than a higher-dimensional array:
In [20]: df = DataFrame({
....: 'x': random.uniform(1., 168., 12),
....: 'y': random.uniform(7., 334., 12),
....: 'z': random.uniform(1.7, 20.7, 12),
....: 'month': [5,6,7]*4,
....: 'week': [1,2]*6
....: })
....:
In [21]: mdf = pd.melt(df, id_vars=['month', 'week'])
In [22]: pd.pivot_table(mdf, values='value', index=['variable','week'],
....: columns=['month'], aggfunc=np.mean)
....:
month 5 6 7
variable week
x 1 81.157131 131.364666 63.034819
2 47.736485 120.352953 99.131511
y 1 61.777944 18.588993 188.790929
2 303.348717 174.792974 209.744832
z 1 2.795230 13.532019 12.555273
2 13.226753 13.347298 4.249712
[6 rows x 3 columns]
For more details and examples see the reshaping documentation.
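R's acast returns a true three-dimensional array (week by month by variable), while pivot_table() returns a two-dimensional table. If an array is needed, one possible approach is to pivot with both month and variable in the columns and then reshape. This is a sketch under the assumption that every (week, month, variable) combination is present, as it is for this data; the seed and the names wide and cube are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # arbitrary seed for reproducibility
df = pd.DataFrame({
    'x': rng.uniform(1.0, 168.0, 12),
    'y': rng.uniform(7.0, 334.0, 12),
    'z': rng.uniform(1.7, 20.7, 12),
    'month': [5, 6, 7] * 4,
    'week': [1, 2] * 6,
})
mdf = pd.melt(df, id_vars=['month', 'week'])

# One row per week, one column per (month, variable) pair, cells averaged.
wide = mdf.pivot_table(values='value', index='week',
                       columns=['month', 'variable'], aggfunc='mean')

# Columns are sorted by month, then variable, so the 2 x 9 table
# reshapes cleanly into a week x month x variable array.
cube = wide.to_numpy().reshape(2, 3, 3)
```

Here cube[0, 0, 0] holds the mean of x for week 1 and month 5, the first entry along each axis.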