Comparison with R / R libraries¶
Since pandas aims to provide a lot of the data manipulation and analysis functionality that people use R for, this page was started to provide a more detailed look at the R language and its many third party libraries as they relate to pandas. In comparisons with R and CRAN libraries, we care about the following things:
- Functionality / flexibility: what can/cannot be done with each tool
- Performance: how fast are operations. Hard numbers/benchmarks are preferable
- Ease-of-use: Is one tool easier/harder to use (you may have to be the judge of this, given side-by-side code comparisons)
This page is also here to offer a bit of a translation guide for users of these R packages.
Base R¶
Slicing with R’s c¶
R makes it easy to access data.frame columns by name
df <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5), d=rnorm(5), e=rnorm(5))
df[, c("a", "c", "e")]
or by integer location
df <- data.frame(matrix(rnorm(1000), ncol=100))
df[, c(1:10, 25:30, 40, 50:100)]
Selecting multiple columns by name in pandas is straightforward
In [1]: df = DataFrame(np.random.randn(10, 3), columns=list('abc'))
In [2]: df[['a', 'c']]
Out[2]:
a c
0 -0.010277 1.754450
1 -1.979042 0.026731
2 -0.171905 -0.668032
3 0.156823 -0.287102
4 -0.654693 2.486931
5 0.314941 -0.209642
6 -0.482069 0.713264
7 1.524014 -0.483850
8 1.615149 0.673194
9 1.512817 -0.017685
In [3]: df.loc[:, ['a', 'c']]
Out[3]:
a c
0 -0.010277 1.754450
1 -1.979042 0.026731
2 -0.171905 -0.668032
3 0.156823 -0.287102
4 -0.654693 2.486931
5 0.314941 -0.209642
6 -0.482069 0.713264
7 1.524014 -0.483850
8 1.615149 0.673194
9 1.512817 -0.017685
Selecting multiple noncontiguous columns by integer location can be achieved with a combination of the iloc indexer attribute and numpy.r_.
In [4]: named = list('abcdefg')
In [5]: n = 30
In [6]: columns = named + np.arange(len(named), n).tolist()
In [7]: df = DataFrame(np.random.randn(n, n), columns=columns)
In [8]: df.iloc[:, np.r_[:10, 24:30]]
Out[8]:
a b c d e f g \
0 -0.182877 1.556201 -1.717420 -3.017047 1.201081 0.980077 -0.026234
1 -1.679855 -0.912033 0.895000 0.759727 1.053398 -0.854995 0.514409
2 -0.242666 -0.153091 0.571129 1.049663 -0.200188 0.169303 0.127031
3 0.442832 -1.344020 -0.497400 -1.255580 -0.000235 2.493078 -1.483518
4 0.939187 -2.739487 -0.573693 -1.233017 -0.803782 -1.527202 0.680366
5 0.398306 -1.886066 -0.488051 1.022238 -1.097735 0.182293 0.166052
6 -1.621497 -1.229428 0.340857 -0.240188 -0.640714 -0.620492 1.395629
.. ... ... ... ... ... ... ...
23 -0.636174 -0.236723 -0.542805 0.475213 1.683208 -0.759966 1.525081
24 1.093811 -1.589366 -0.402728 0.333326 -1.036511 0.756512 -1.622032
25 -0.634240 0.286129 0.835316 0.826629 -0.735065 -0.285695 -1.094918
26 -0.235821 -1.058125 -1.137497 1.768844 1.973471 0.747723 -0.532274
27 -0.056739 -1.497562 0.697053 1.246539 -0.369645 1.788288 -0.425494
28 0.951542 1.199311 0.361042 -1.004705 0.124648 -0.564100 -1.675530
29 -0.210387 -0.236078 1.614966 0.897880 -0.840843 -0.403887 -0.559663
7 8 9 24 25 26 27 \
0 -0.170669 0.013884 0.027505 1.366784 1.864420 -1.119262 2.489142
1 -1.659617 -0.430681 -0.158656 0.014234 -1.392410 -0.049204 1.503745
2 0.272596 0.604549 -2.485841 -0.477770 1.454589 -0.392196 -1.705802
3 -2.641568 0.667512 1.527341 -0.766140 -0.199585 1.569370 0.628867
4 -0.373935 -2.253353 -1.046336 1.421232 1.055904 -0.429448 0.023700
5 1.008944 0.979035 -0.525153 -0.058021 -0.173119 -1.375517 0.270460
6 0.306483 0.433323 -0.522057 -0.245659 0.240485 0.962699 -0.172859
.. ... ... ... ... ... ... ...
23 0.251470 -0.137568 -0.556620 -1.226969 -0.459633 -0.733977 -1.221608
24 -0.468504 -0.656569 1.187661 0.714776 -0.459475 -2.880218 0.629157
25 0.782185 0.026271 -0.671403 0.185990 1.271593 -0.722660 1.232652
26 -1.982213 0.189767 0.952486 0.220958 -0.345484 -0.615855 0.513608
27 1.867720 0.456395 -0.673380 -1.201060 0.133686 1.563471 0.532838
28 1.351386 -0.415076 1.449936 0.045704 0.079924 -0.846139 1.179317
29 -1.589124 1.188404 0.137041 0.522771 1.131500 -0.863309 -1.412090
28 29
0 1.115041 -0.078156
1 1.088485 -0.119904
2 0.091318 1.444209
3 -1.485142 0.583219
4 -0.177887 1.346323
5 -0.042217 0.923272
6 -0.323793 -0.704276
.. ... ...
23 -0.348523 -1.829344
24 -0.090673 0.283814
25 0.964679 -0.327307
26 0.109942 -0.346066
27 -1.326245 0.520447
28 -0.812521 0.290519
29 1.344344 -1.151327
[30 rows x 16 columns]
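The np.r_ trick above may be easier to see on a small frame; a minimal sketch (the frame and column names here are illustrative, not from the example above):

```python
import numpy as np
import pandas as pd

# a small frame to make the np.r_ index construction visible
df = pd.DataFrame(np.arange(20).reshape(4, 5), columns=list('abcde'))

# np.r_ concatenates slice specifications into one integer array:
# :2 expands to [0, 1] and 3:5 expands to [3, 4]
sub = df.iloc[:, np.r_[:2, 3:5]]
```

`sub` keeps columns a, b, d, and e, mirroring R's `df[, c(1:2, 4:5)]`.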
aggregate¶
In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df and splitting it by the columns by1 and by2:
df <- data.frame(
v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))
aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)
The groupby() method is similar to the base R aggregate function.
In [9]: from pandas import DataFrame
In [10]: df = DataFrame({
....: 'v1': [1,3,5,7,8,3,5,np.nan,4,5,7,9],
....: 'v2': [11,33,55,77,88,33,55,np.nan,44,55,77,99],
....: 'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
....: 'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan,
....: np.nan]
....: })
....:
In [11]: g = df.groupby(['by1','by2'])
In [12]: g[['v1','v2']].mean()
Out[12]:
v1 v2
by1 by2
1 95 5 55
99 5 55
2 95 7 77
99 NaN NaN
big damp 3 33
blue dry 3 33
red red 4 44
wet 1 11
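Note that R's aggregate returns a flat data.frame, whereas the groupby result above carries a hierarchical index. Passing as_index=False (or calling reset_index() on the result) produces the flat shape; a small sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    'v1': [1, 3, 5, 7, 8, 3],
    'by1': ['red', 'blue', 'red', 'blue', 'red', 'blue'],
})

# as_index=False keeps the group keys as ordinary columns,
# matching the flat data.frame that R's aggregate returns
flat = df.groupby('by1', as_index=False)['v1'].mean()
```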
For more details and examples see the groupby documentation.
match / %in%¶
A common way to select data in R is using %in% which is defined using the function match. The operator %in% is used to return a logical vector indicating if there is a match or not:
s <- 0:4
s %in% c(2,4)
The isin() method is similar to the R %in% operator:
In [13]: s = pd.Series(np.arange(5),dtype=np.float32)
In [14]: s.isin([2, 4])
Out[14]:
0 False
1 False
2 True
3 False
4 True
dtype: bool
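R's negated form, !(s %in% c(2, 4)), also has a direct analogue: the ~ operator inverts the boolean mask returned by isin(). A small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), dtype=np.float32)

# ~ inverts the boolean mask, giving the "not in" selection
kept = s[~s.isin([2, 4])]
```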
The match function returns a vector of the positions of matches of its first argument in its second:
s <- 0:4
match(s, c(2,4))
pandas provides a match() function that can be used to replicate this:
In [15]: s = pd.Series(np.arange(5),dtype=np.float32)
In [16]: pd.Series(pd.match(s,[2,4],np.nan))
Out[16]:
0 NaN
1 NaN
2 0
3 NaN
4 1
dtype: float64
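In later pandas versions pd.match was deprecated and removed; as a hedged alternative, Index.get_indexer performs the same position lookup, returning -1 (rather than NaN) for values with no match:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5))

# position of each element of s in the index; -1 marks no match
positions = pd.Index([2, 4]).get_indexer(s)
```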
For more details and examples see the reshaping documentation.
tapply¶
tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly irregular. Using a data.frame called baseball, and retrieving information based on the array team:
baseball <-
data.frame(team = gl(5, 5,
labels = paste("Team", LETTERS[1:5])),
player = sample(letters, 25),
batting.average = runif(25, .200, .400))
tapply(baseball$batting.average, baseball$team,
       max)
In pandas we may use the pivot_table() method to handle this:
In [17]: import random
In [18]: import string
In [19]: baseball = DataFrame({
....: 'team': ["team %d" % (x+1) for x in range(5)]*5,
....: 'player': random.sample(list(string.ascii_lowercase),25),
....: 'batting avg': np.random.uniform(.200, .400, 25)
....: })
....:
In [20]: baseball.pivot_table(values='batting avg', columns='team', aggfunc=np.max)
Out[20]:
team
team 1 0.382841
team 2 0.395048
team 3 0.387240
team 4 0.383183
team 5 0.364851
Name: batting avg, dtype: float64
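Since this tapply call aggregates over a single grouping variable, a plain groupby reduction gives the same per-team maxima without pivoting; a sketch with the same construction as above:

```python
import numpy as np
import pandas as pd

baseball = pd.DataFrame({
    'team': ["team %d" % (x + 1) for x in range(5)] * 5,
    'batting avg': np.random.uniform(.200, .400, 25),
})

# one max per team, equivalent to tapply(..., max) over a single factor
per_team_max = baseball.groupby('team')['batting avg'].max()
```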
For more details and examples see the reshaping documentation.
subset¶
New in version 0.13.
The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame where one column’s values are less than another column’s values:
df <- data.frame(a=rnorm(10), b=rnorm(10))
subset(df, a <= b)
df[df$a <= df$b,] # note the comma
In pandas, there are a few ways to perform subsetting. You can use query() or pass an expression as if it were an index/slice as well as standard boolean indexing:
In [21]: df = DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
In [22]: df.query('a <= b')
Out[22]:
a b
0 -0.869907 0.118969
1 -1.243405 1.059719
2 -2.103633 -0.585222
3 -0.160015 0.476629
4 -1.361355 1.317180
8 -0.945780 -0.571590
9 -0.761339 -0.073684
In [23]: df[df.a <= df.b]
Out[23]:
a b
0 -0.869907 0.118969
1 -1.243405 1.059719
2 -2.103633 -0.585222
3 -0.160015 0.476629
4 -1.361355 1.317180
8 -0.945780 -0.571590
9 -0.761339 -0.073684
In [24]: df.loc[df.a <= df.b]
Out[24]:
a b
0 -0.869907 0.118969
1 -1.243405 1.059719
2 -2.103633 -0.585222
3 -0.160015 0.476629
4 -1.361355 1.317180
8 -0.945780 -0.571590
9 -0.761339 -0.073684
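query() can also reference local Python variables with the @ prefix, something base R's subset has no direct analogue for; a small sketch (the variable name limit is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

limit = 0.0
# @limit refers to the local Python variable, not a column
result = df.query('a <= @limit')
```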
For more details and examples see the query documentation.
with¶
New in version 0.13.
An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:
df <- data.frame(a=rnorm(10), b=rnorm(10))
with(df, a + b)
df$a + df$b # same as the previous expression
In pandas the equivalent expression, using the eval() method, would be:
In [25]: df = DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
In [26]: df.eval('a + b')
Out[26]:
0 -0.251791
1 -0.061720
2 -1.395907
3 -0.095439
4 0.277944
5 -0.445539
6 1.199896
7 -0.524619
8 -0.657901
9 2.069409
dtype: float64
In [27]: df.a + df.b # same as the previous expression
Out[27]:
0 -0.251791
1 -0.061720
2 -1.395907
3 -0.095439
4 0.277944
5 -0.445539
6 1.199896
7 -0.524619
8 -0.657901
9 2.069409
dtype: float64
In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the eval documentation.
zoo¶
xts¶
plyr¶
plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around three data structures in R: a for arrays, l for lists, and d for data.frame. The table below shows how these data structures map to Python.

R | Python
---|---
array | list
lists | dictionary or list of objects
data.frame | DataFrame
ddply¶
An expression using a data.frame called df in R where you want to summarize x by month:
require(plyr)
df <- data.frame(
x = runif(120, 1, 168),
y = runif(120, 7, 334),
z = runif(120, 1.7, 20.7),
month = rep(c(5,6,7,8),30),
week = sample(1:4, 120, TRUE)
)
ddply(df, .(month, week), summarize,
mean = round(mean(x), 2),
sd = round(sd(x), 2))
In pandas the equivalent expression, using the groupby() method, would be:
In [28]: df = DataFrame({
....: 'x': np.random.uniform(1., 168., 120),
....: 'y': np.random.uniform(7., 334., 120),
....: 'z': np.random.uniform(1.7, 20.7, 120),
....: 'month': [5,6,7,8]*30,
....: 'week': np.random.randint(1,4, 120)
....: })
....:
In [29]: grouped = df.groupby(['month','week'])
In [30]: print(grouped['x'].agg([np.mean, np.std]))
mean std
month week
5 1 77.357995 49.531314
2 85.764198 58.738812
3 89.386008 57.720890
6 1 91.738228 62.007638
2 91.100447 46.713644
3 76.136174 56.952827
7 1 83.534178 52.964956
2 92.955649 49.004203
3 83.708346 46.338509
8 1 84.729214 53.924477
2 72.545882 55.905134
3 71.868187 50.813235
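The ddply call above also rounds to two decimals; the agg result can be rounded the same way. A sketch (the resulting column names come from the aggregation function names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': np.random.uniform(1., 168., 120),
    'month': [5, 6, 7, 8] * 30,
    'week': np.random.randint(1, 4, 120),
})

grouped = df.groupby(['month', 'week'])
# round(2) mirrors round(mean(x), 2) / round(sd(x), 2) in the ddply call
summary = grouped['x'].agg([np.mean, np.std]).round(2)
```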
For more details and examples see the groupby documentation.
reshape / reshape2¶
melt.array¶
An expression using a 3 dimensional array called a in R where you want to melt it into a data.frame:
a <- array(c(1:23, NA), c(2,3,4))
data.frame(melt(a))
In Python, a is a NumPy array rather than a list; np.ndenumerate iterates over it, yielding (index, value) pairs that a list comprehension can assemble into rows:
In [31]: a = np.array(list(range(1,24))+[np.NAN]).reshape(2,3,4)
In [32]: DataFrame([tuple(list(x)+[val]) for x, val in np.ndenumerate(a)])
Out[32]:
0 1 2 3
0 0 0 0 1
1 0 0 1 2
2 0 0 2 3
3 0 0 3 4
4 0 1 0 5
5 0 1 1 6
6 0 1 2 7
.. .. .. .. ..
17 1 1 1 18
18 1 1 2 19
19 1 1 3 20
20 1 2 0 21
21 1 2 1 22
22 1 2 2 23
23 1 2 3 NaN
[24 rows x 4 columns]
melt.list¶
An expression using a list called a in R where you want to melt it into a data.frame:
a <- as.list(c(1:4, NA))
data.frame(melt(a))
In Python, this list would be a list of tuples, so the DataFrame() constructor converts it to a DataFrame as required.
In [33]: a = list(enumerate(list(range(1,5))+[np.NAN]))
In [34]: DataFrame(a)
Out[34]:
0 1
0 0 1
1 1 2
2 2 3
3 3 4
4 4 NaN
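R's melt on a list labels the resulting columns value and L1; passing columns= to the DataFrame constructor reproduces those names (the ordering here assumes the (index, value) tuple layout built above):

```python
import numpy as np
import pandas as pd

a = list(enumerate(list(range(1, 5)) + [np.nan]))

# name the columns to mirror R's melt output for a list (L1, value)
df = pd.DataFrame(a, columns=['L1', 'value'])
```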
For more details and examples see the Intro to Data Structures documentation.
melt.data.frame¶
An expression using a data.frame called cheese in R where you want to reshape the data.frame:
cheese <- data.frame(
first = c('John', 'Mary'),
last = c('Doe', 'Bo'),
height = c(5.5, 6.0),
weight = c(130, 150)
)
melt(cheese, id=c("first", "last"))
In Python, the melt() function is the equivalent of R's melt:
In [35]: cheese = DataFrame({'first' : ['John', 'Mary'],
....: 'last' : ['Doe', 'Bo'],
....: 'height' : [5.5, 6.0],
....: 'weight' : [130, 150]})
....:
In [36]: pd.melt(cheese, id_vars=['first', 'last'])
Out[36]:
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
In [37]: cheese.set_index(['first', 'last']).stack() # alternative way
Out[37]:
first last
John Doe height 5.5
weight 130.0
Mary Bo height 6.0
weight 150.0
dtype: float64
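melt() also accepts var_name and value_name, much like variable.name / value.name in R's melt, when other column labels are wanted; a sketch (the labels measurement and amount are illustrative):

```python
import pandas as pd

cheese = pd.DataFrame({'first': ['John', 'Mary'],
                       'last': ['Doe', 'Bo'],
                       'height': [5.5, 6.0],
                       'weight': [130, 150]})

# var_name / value_name rename the two melted columns
long = pd.melt(cheese, id_vars=['first', 'last'],
               var_name='measurement', value_name='amount')
```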
For more details and examples see the reshaping documentation.
cast¶
In R, acast casts a melted data.frame called df into a higher-dimensional array:
df <- data.frame(
x = runif(12, 1, 168),
y = runif(12, 7, 334),
z = runif(12, 1.7, 20.7),
month = rep(c(5,6,7),4),
week = rep(c(1,2), 6)
)
mdf <- melt(df, id=c("month", "week"))
acast(mdf, week ~ month ~ variable, mean)
In Python the best way is to make use of pivot_table():
In [38]: df = DataFrame({
....: 'x': np.random.uniform(1., 168., 12),
....: 'y': np.random.uniform(7., 334., 12),
....: 'z': np.random.uniform(1.7, 20.7, 12),
....: 'month': [5,6,7]*4,
....: 'week': [1,2]*6
....: })
....:
In [39]: mdf = pd.melt(df, id_vars=['month', 'week'])
In [40]: pd.pivot_table(mdf, values='value', index=['variable','week'],
....: columns=['month'], aggfunc=np.mean)
....:
Out[40]:
month 5 6 7
variable week
x 1 58.488427 32.594687 149.838258
2 88.972028 109.131941 59.435615
y 1 34.774928 173.914293 167.835338
2 126.265859 70.692387 123.789140
z 1 8.408572 3.194041 9.885935
2 18.922270 9.083850 7.874963
Similarly for dcast, which uses a data.frame called df in R to aggregate information based on Animal and FeedType:
df <- data.frame(
Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
'Animal2', 'Animal3'),
FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
Amount = c(10, 7, 4, 2, 5, 6, 2)
)
dcast(df, Animal ~ FeedType, sum, fill=NaN)
# Alternative method using base R
with(df, tapply(Amount, list(Animal, FeedType), sum))
Python can approach this in two different ways. Firstly, similar to above using pivot_table():
In [41]: df = DataFrame({
....: 'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
....: 'Animal2', 'Animal3'],
....: 'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
....: 'Amount': [10, 7, 4, 2, 5, 6, 2],
....: })
....:
In [42]: df.pivot_table(values='Amount', index='Animal', columns='FeedType', aggfunc='sum')
Out[42]:
FeedType A B
Animal
Animal1 10 5
Animal2 2 13
Animal3 6 NaN
The second approach is to use the groupby() method:
In [43]: df.groupby(['Animal','FeedType'])['Amount'].sum()
Out[43]:
Animal FeedType
Animal1 A 10
B 5
Animal2 A 2
B 13
Animal3 A 6
Name: Amount, dtype: int64
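To recover the wide layout that dcast produces from the groupby result, the inner index level can be unstacked into columns:

```python
import pandas as pd

df = pd.DataFrame({
    'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
               'Animal2', 'Animal3'],
    'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
    'Amount': [10, 7, 4, 2, 5, 6, 2],
})

# unstack() pivots the inner index level (FeedType) into columns,
# giving the same wide table as dcast / pivot_table
wide = df.groupby(['Animal', 'FeedType'])['Amount'].sum().unstack()
```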
For more details and examples see the reshaping documentation or the groupby documentation.