Comparison with R / R libraries¶
Since pandas aims to provide much of the data manipulation and analysis functionality that people use R for, this page offers a more detailed look at the R language and its many third-party libraries as they relate to pandas. In comparisons with R and CRAN libraries, we care about the following things:
- Functionality / flexibility: what can/cannot be done with each tool
- Performance: how fast are operations. Hard numbers/benchmarks are preferable
- Ease-of-use: Is one tool easier/harder to use (you may have to be the judge of this, given side-by-side code comparisons)
This page is also here to offer a bit of a translation guide for users of these R packages.
Base R¶
aggregate¶
In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df and splitting it into groups by1 and by2:
df <- data.frame(
v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))
aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)
The groupby() method is similar to the base R aggregate function.
In [1]: import numpy as np; import pandas as pd; from pandas import DataFrame, Series
In [2]: df = DataFrame({
...: 'v1': [1,3,5,7,8,3,5,np.nan,4,5,7,9],
...: 'v2': [11,33,55,77,88,33,55,np.nan,44,55,77,99],
...: 'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
...: 'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan,
...: np.nan]
...: })
...:
In [3]: g = df.groupby(['by1','by2'])
In [4]: g[['v1','v2']].mean()
Out[4]:
v1 v2
by1 by2
1 95 5 55
99 5 55
2 95 7 77
99 NaN NaN
big damp 3 33
blue dry 3 33
red red 4 44
wet 1 11
[8 rows x 2 columns]
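One behavioral difference worth noting: pandas drops rows whose group keys are NaN, which is why the NaN levels of by1 and by2 never appear in the output above. A minimal sketch (the dropna= keyword assumes pandas 1.1 or later):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'v1': [1, 3, 5, 7, 8, 3],
    'by1': ['red', 'blue', np.nan, 'red', np.nan, 'blue'],
})

# By default groupby silently drops rows whose key is NaN,
# so only the 'red' and 'blue' groups appear here:
means = df.groupby('by1')['v1'].mean()
print(means)

# Recent pandas (>= 1.1) can keep the NaN group with dropna=False:
means_all = df.groupby('by1', dropna=False)['v1'].mean()
print(means_all)
```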
For more details and examples see the groupby documentation.
match / %in%¶
A common way to select data in R is with %in%, which is defined using the function match. The %in% operator returns a logical vector indicating whether there is a match:
s <- 0:4
s %in% c(2,4)
The isin() method is similar to the R %in% operator:
In [5]: s = pd.Series(np.arange(5),dtype=np.float32)
In [6]: s.isin([2, 4])
Out[6]:
0 False
1 False
2 True
3 False
4 True
dtype: bool
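R negates membership with !(s %in% c(2, 4)); in pandas the same is done by inverting the boolean mask with the ~ operator. A small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), dtype=np.float32)

# ~ inverts the boolean mask, analogous to !(s %in% c(2, 4)) in R
mask = s.isin([2, 4])
not_in = s[~mask]
print(not_in)
```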
The match function returns a vector of the positions of matches of its first argument in its second:
s <- 0:4
match(s, c(2,4))
The pd.match function can be used to replicate this, here wrapping the result in a Series and using np.nan as the sentinel for values without a match:
In [7]: s = pd.Series(np.arange(5),dtype=np.float32)
In [8]: Series(pd.match(s,[2,4],np.nan))
Out[8]:
0 NaN
1 NaN
2 0
3 NaN
4 1
dtype: float64
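Note that pd.match was removed in later pandas versions. Assuming a recent pandas, the same positions can be recovered with Index.get_indexer, which returns -1 for non-matches (a sketch):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), dtype=np.float32)

# get_indexer returns, for each element of s, its position in [2, 4],
# with -1 for elements that do not match (like R's match returning NA)
positions = pd.Index([2, 4]).get_indexer(s)

# swap the -1 sentinel for NaN to get the same shape of result as above
result = pd.Series(np.where(positions == -1, np.nan, positions))
print(result)
```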
For more details and examples see the reshaping documentation.
tapply¶
tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly irregular. Using a data.frame called baseball, and retrieving the maximum batting average for each team:
baseball <-
data.frame(team = gl(5, 5,
labels = paste("Team", LETTERS[1:5])),
player = sample(letters, 25),
batting.average = runif(25, .200, .400))
tapply(baseball$batting.average, baseball$team, max)
In pandas we may use the pivot_table() method to handle this:
In [9]: import random
In [10]: import string
In [11]: baseball = DataFrame({
....: 'team': ["team %d" % (x+1) for x in range(5)]*5,
....: 'player': random.sample(list(string.ascii_lowercase),25),
....: 'batting avg': np.random.uniform(.200, .400, 25)
....: })
....:
In [12]: baseball.pivot_table(values='batting avg', cols='team', aggfunc=np.max)
Out[12]:
team
team 1 0.321235
team 2 0.399140
team 3 0.386815
team 4 0.387197
team 5 0.392086
Name: batting avg, dtype: float64
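The rows=/cols= keywords used above were later renamed to index=/columns=, and in current pandas this tapply-style reduction is most directly written as a groupby aggregation. A sketch:

```python
import random
import string

import numpy as np
import pandas as pd

baseball = pd.DataFrame({
    'team': ["team %d" % (x + 1) for x in range(5)] * 5,
    'player': random.sample(list(string.ascii_lowercase), 25),
    'batting avg': np.random.uniform(.200, .400, 25),
})

# maximum batting average per team, mirroring tapply(..., max)
team_max = baseball.groupby('team')['batting avg'].max()
print(team_max)
```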
For more details and examples see the reshaping documentation.
subset¶
New in version 0.13.
The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame where one column’s values are less than another column’s values:
df <- data.frame(a=rnorm(10), b=rnorm(10))
subset(df, a <= b)
df[df$a <= df$b,] # note the comma
In pandas, there are a few ways to perform subsetting: the query() method, standard boolean indexing with [], and label-based indexing with loc:
In [13]: df = DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
In [14]: df.query('a <= b')
Out[14]:
a b
2 -0.838260 0.980077
6 -0.017685 0.027505
7 -0.182877 0.703105
9 -1.717420 -0.986426
[4 rows x 2 columns]
In [15]: df[df.a <= df.b]
Out[15]:
a b
2 -0.838260 0.980077
6 -0.017685 0.027505
7 -0.182877 0.703105
9 -1.717420 -0.986426
[4 rows x 2 columns]
In [16]: df.loc[df.a <= df.b]
Out[16]:
a b
2 -0.838260 0.980077
6 -0.017685 0.027505
7 -0.182877 0.703105
9 -1.717420 -0.986426
[4 rows x 2 columns]
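query() can also reference Python variables with the @ prefix, which has no direct analogue in base R's subset. A small sketch using a hypothetical threshold variable:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

# @threshold refers to the local Python variable, not a column
threshold = 0.0
subset_df = df.query('a <= @threshold')
print(subset_df)
```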
For more details and examples see the query documentation.
with¶
New in version 0.13.
An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:
df <- data.frame(a=rnorm(10), b=rnorm(10))
with(df, a + b)
df$a + df$b # same as the previous expression
In pandas the equivalent expression, using the eval() method, would be:
In [17]: df = DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
In [18]: df.eval('a + b')
Out[18]:
0 -0.163194
1 0.985872
2 2.864538
3 0.782622
4 0.962818
5 1.974849
6 0.258445
7 -2.288045
8 -0.800437
9 2.667426
dtype: float64
In [19]: df.a + df.b # same as the previous expression
Out[19]:
0 -0.163194
1 0.985872
2 2.864538
3 0.782622
4 0.962818
5 1.974849
6 0.258445
7 -2.288045
8 -0.800437
9 2.667426
dtype: float64
In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the eval documentation.
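In recent pandas versions, eval() also supports assignment inside the expression, much like R's within(df, c <- a + b). A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

# assignment inside the expression returns a new frame with column c,
# leaving the original df unchanged (analogous to within in R)
result = df.eval('c = a + b')
print(result.head())
```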
zoo¶
xts¶
plyr¶
plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around three data structures in R: a for arrays, l for lists, and d for data.frame. The table below shows how these data structures could be mapped in Python.
| R          | Python                        |
|------------|-------------------------------|
| array      | list                          |
| lists      | dictionary or list of objects |
| data.frame | dataframe                     |
ddply¶
An expression using a data.frame called df in R where you want to summarize x by month:
require(plyr)
df <- data.frame(
x = runif(120, 1, 168),
y = runif(120, 7, 334),
z = runif(120, 1.7, 20.7),
month = rep(c(5,6,7,8),30),
week = sample(1:4, 120, TRUE)
)
ddply(df, .(month, week), summarize,
mean = round(mean(x), 2),
sd = round(sd(x), 2))
In pandas the equivalent expression, using the groupby() method, would be:
In [20]: df = DataFrame({
....: 'x': np.random.uniform(1., 168., 120),
....: 'y': np.random.uniform(7., 334., 120),
....: 'z': np.random.uniform(1.7, 20.7, 120),
....: 'month': [5,6,7,8]*30,
....: 'week': np.random.randint(1,4, 120)
....: })
....:
In [21]: grouped = df.groupby(['month','week'])
In [22]: print(grouped['x'].agg([np.mean, np.std]))
mean std
month week
5 1 74.750543 37.602035
2 91.420601 56.817107
3 80.270102 55.994654
6 1 81.840060 50.966643
2 97.434542 59.919288
3 79.867371 47.377914
7 1 83.997435 39.391772
2 86.244632 41.066830
3 108.811608 45.048738
8 1 81.647843 50.264539
2 94.056653 47.677568
3 76.004631 47.048914
[12 rows x 2 columns]
For more details and examples see the groupby documentation.
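One subtlety: np.random.randint has an exclusive upper bound, so randint(1, 5, ...) is the exact analogue of R's sample(1:4, 120, TRUE). A sketch that also mirrors ddply's rounding:

```python
import numpy as np
import pandas as pd

# randint's upper bound is exclusive: randint(1, 5) draws from 1..4,
# matching R's sample(1:4, ..., TRUE)
df = pd.DataFrame({
    'x': np.random.uniform(1., 168., 120),
    'month': [5, 6, 7, 8] * 30,
    'week': np.random.randint(1, 5, 120),
})

# mirror ddply's round(mean(x), 2) and round(sd(x), 2)
summary = df.groupby(['month', 'week'])['x'].agg(['mean', 'std']).round(2)
print(summary)
```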
reshape / reshape2¶
melt.array¶
An expression using a 3 dimensional array called a in R where you want to melt it into a data.frame:
a <- array(c(1:23, NA), c(2,3,4))
data.frame(melt(a))
In Python, a is an ndarray rather than a list, so we can iterate over its entries with np.ndenumerate in a list comprehension:
In [23]: a = np.array(list(range(1, 24)) + [np.nan]).reshape(2, 3, 4)
In [24]: DataFrame([tuple(list(x)+[val]) for x, val in np.ndenumerate(a)])
Out[24]:
0 1 2 3
0 0 0 0 1
1 0 0 1 2
2 0 0 2 3
3 0 0 3 4
4 0 1 0 5
5 0 1 1 6
6 0 1 2 7
7 0 1 3 8
8 0 2 0 9
9 0 2 1 10
10 0 2 2 11
11 0 2 3 12
12 1 0 0 13
13 1 0 1 14
14 1 0 2 15
.. .. .. ...
[24 rows x 4 columns]
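reshape2's melt labels the output columns Var1, Var2, Var3, and value; passing columns= to DataFrame reproduces that labeling (while keeping Python's 0-based, row-major indices rather than R's 1-based, column-major ones). A sketch:

```python
import numpy as np
import pandas as pd

a = np.array(list(range(1, 24)) + [np.nan]).reshape(2, 3, 4)

# name the columns to mirror reshape2's Var1/Var2/Var3/value output
melted = pd.DataFrame(
    [list(idx) + [val] for idx, val in np.ndenumerate(a)],
    columns=['Var1', 'Var2', 'Var3', 'value'],
)
print(melted.head())
```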
melt.list¶
An expression using a list called a in R where you want to melt it into a data.frame:
a <- as.list(c(1:4, NA))
data.frame(melt(a))
In Python, this list would be a list of tuples, so the DataFrame() constructor converts it to a DataFrame as required.
In [25]: a = list(enumerate(list(range(1, 5)) + [np.nan]))
In [26]: DataFrame(a)
Out[26]:
0 1
0 0 1
1 1 2
2 2 3
3 3 4
4 4 NaN
[5 rows x 2 columns]
For more details and examples see the Intro to Data Structures documentation.
melt.data.frame¶
An expression using a data.frame called cheese in R where you want to reshape the data.frame:
cheese <- data.frame(
first = c('John', 'Mary'),
last = c('Doe', 'Bo'),
height = c(5.5, 6.0),
weight = c(130, 150)
)
melt(cheese, id=c("first", "last"))
In Python, the melt() function is the equivalent:
In [27]: cheese = DataFrame({'first' : ['John', 'Mary'],
....: 'last' : ['Doe', 'Bo'],
....: 'height' : [5.5, 6.0],
....: 'weight' : [130, 150]})
....:
In [28]: pd.melt(cheese, id_vars=['first', 'last'])
Out[28]:
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
[4 rows x 4 columns]
In [29]: cheese.set_index(['first', 'last']).stack() # alternative way
Out[29]:
first last
John Doe height 5.5
weight 130.0
Mary Bo height 6.0
weight 150.0
dtype: float64
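The stacked result is a Series with a MultiIndex; calling reset_index on it recovers a flat frame in the same long layout that melt produces (the level_2 name arises because the original columns were unnamed). A sketch:

```python
import pandas as pd

cheese = pd.DataFrame({'first': ['John', 'Mary'],
                       'last': ['Doe', 'Bo'],
                       'height': [5.5, 6.0],
                       'weight': [130, 150]})

# stack, then flatten the MultiIndex back into ordinary columns,
# renaming the unnamed stacked level to 'variable' as melt would
long_df = (cheese.set_index(['first', 'last'])
                 .stack()
                 .reset_index(name='value')
                 .rename(columns={'level_2': 'variable'}))
print(long_df)
```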
For more details and examples see the reshaping documentation.
cast¶
In R, acast casts a molten data.frame into a higher dimensional array. Starting from a data.frame called df:
df <- data.frame(
x = runif(12, 1, 168),
y = runif(12, 7, 334),
z = runif(12, 1.7, 20.7),
month = rep(c(5,6,7),4),
week = rep(c(1,2), 6)
)
mdf <- melt(df, id=c("month", "week"))
acast(mdf, week ~ month ~ variable, mean)
In Python the best way is to make use of pivot_table():
In [30]: df = DataFrame({
....: 'x': np.random.uniform(1., 168., 12),
....: 'y': np.random.uniform(7., 334., 12),
....: 'z': np.random.uniform(1.7, 20.7, 12),
....: 'month': [5,6,7]*4,
....: 'week': [1,2]*6
....: })
....:
In [31]: mdf = pd.melt(df, id_vars=['month', 'week'])
In [32]: pd.pivot_table(mdf, values='value', rows=['variable','week'],
....: cols=['month'], aggfunc=np.mean)
....:
Out[32]:
month 5 6 7
variable week
x 1 89.863679 78.824388 50.832050
2 132.209447 36.715123 75.566345
y 1 216.526257 110.507591 11.484571
2 153.506838 239.965235 160.223954
z 1 15.536152 8.826941 7.015962
2 14.646656 17.064267 11.806954
[6 rows x 3 columns]
Similarly for dcast, which uses a data.frame called df in R to aggregate information based on Animal and FeedType:
df <- data.frame(
Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
'Animal2', 'Animal3'),
FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
Amount = c(10, 7, 4, 2, 5, 6, 2)
)
dcast(df, Animal ~ FeedType, sum, fill=NaN)
# Alternative method using base R
with(df, tapply(Amount, list(Animal, FeedType), sum))
Python can approach this in two different ways. First, similar to above, using pivot_table():
In [33]: df = DataFrame({
....: 'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
....: 'Animal2', 'Animal3'],
....: 'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
....: 'Amount': [10, 7, 4, 2, 5, 6, 2],
....: })
....:
In [34]: df.pivot_table(values='Amount', rows='Animal', cols='FeedType', aggfunc='sum')
Out[34]:
FeedType A B
Animal
Animal1 10 5
Animal2 2 13
Animal3 6 NaN
[3 rows x 2 columns]
The second approach is to use the groupby() method:
In [35]: df.groupby(['Animal','FeedType'])['Amount'].sum()
Out[35]:
Animal FeedType
Animal1 A 10
B 5
Animal2 A 2
B 13
Animal3 A 6
Name: Amount, dtype: int64
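The groupby result is a Series with a MultiIndex; unstack() pivots the FeedType level into columns, reproducing the pivot_table / dcast layout, with the missing (Animal3, B) combination becoming NaN. A sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
               'Animal2', 'Animal3'],
    'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
    'Amount': [10, 7, 4, 2, 5, 6, 2],
})

# sum per (Animal, FeedType), then pivot FeedType into columns;
# the missing (Animal3, B) combination appears as NaN, like dcast's fill=NaN
table = df.groupby(['Animal', 'FeedType'])['Amount'].sum().unstack()
print(table)
```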
For more details and examples see the reshaping documentation or the groupby documentation.