Remote Data Access¶
Functions from pandas.io.data extract data from various Internet sources into a DataFrame. Currently the following sources are supported:
- Yahoo! Finance
- Google Finance
- St. Louis FED (FRED)
- Kenneth French’s data library
- World Bank
It should be noted, that various sources support different kinds of data, so not all sources implement the same methods and the data elements returned might also differ.
Yahoo! Finance¶
In [1]: import pandas.io.data as web
In [2]: import datetime
In [3]: start = datetime.datetime(2010, 1, 1)
In [4]: end = datetime.datetime(2013, 1, 27)
In [5]: f=web.DataReader("F", 'yahoo', start, end)
In [6]: f.ix['2010-01-04']
Out[6]:
Open 10.17
High 10.28
Low 10.05
Close 10.28
Volume 60855800.00
Adj Close 9.68
Name: 2010-01-04 00:00:00, dtype: float64
Yahoo! Finance Options¶
*Experimental*
The Options class allows the download of options data from Yahoo! Finance.
The get_all_data method downloads and caches option data for all expiry months and provides a formatted DataFrame with a hierarchical index, so its easy to get to the specific option you want.
In [7]: from pandas.io.data import Options
In [8]: aapl = Options('aapl', 'yahoo')
In [9]: data = aapl.get_all_data()
In [10]: data.iloc[0:5, 0:5]
Out[10]:
Last Chg Bid Ask Vol
Strike Expiry Type Symbol
27.86 2015-01-17 call AAPL150117C00027860 63.06 0 NaN NaN 34
put AAPL150117P00027860 0.01 0 NaN NaN 65
28.57 2015-01-17 call AAPL150117C00028570 61.63 0 NaN NaN 0
put AAPL150117P00028570 0.01 0 NaN NaN 5
29.29 2015-01-17 call AAPL150117C00029290 61.02 0 NaN NaN 0
#Show the $100 strike puts at all expiry dates:
In [11]: data.loc[(100, slice(None), 'put'),:].iloc[0:5, 0:5]
Out[11]:
Last Chg Bid Ask Vol
Strike Expiry Type Symbol
100 2014-07-11 put AAPL140711P00100000 4.90 0.30 NaN NaN 106
2014-07-19 put AAPL140719P00100000 5.00 0.30 NaN NaN 742
2014-07-25 put AAPL140725P00100000 6.20 0.79 NaN NaN 9
2014-08-01 put AAPL140801P00100000 5.84 0.20 NaN NaN 65
AAPL7140801P00100000 7.10 0.00 NaN NaN 5
#Show the volume traded of $100 strike puts at all expiry dates:
In [12]: data.loc[(100, slice(None), 'put'),'Vol'].head()
Out[12]:
Strike Expiry Type Symbol
100 2014-07-11 put AAPL140711P00100000 106
2014-07-19 put AAPL140719P00100000 742
2014-07-25 put AAPL140725P00100000 9
2014-08-01 put AAPL140801P00100000 65
AAPL7140801P00100000 5
Name: Vol, dtype: int64
If you don’t want to download all the data, more specific requests can be made.
In [13]: import datetime
In [14]: expiry = datetime.date(2016, 1, 1)
In [15]: data = aapl.get_call_data(expiry=expiry)
In [16]: data.iloc[0:5:, 0:5]
Out[16]:
Last Chg Bid Ask Vol
Strike Expiry Type Symbol
34.29 2016-01-15 call AAPL160115C00034290 62.00 0.00 NaN NaN 5
35.71 2016-01-15 call AAPL160115C00035710 398.00 0.00 NaN NaN 2
37.14 2016-01-15 call AAPL160115C00037140 47.54 0.00 NaN NaN 0
38.57 2016-01-15 call AAPL160115C00038570 45.82 0.00 NaN NaN 0
40.00 2016-01-15 call AAPL160115C00040000 54.75 -0.65 NaN NaN 2
Note that if you call get_all_data first, this second call will happen much faster, as the data is cached.
Google Finance¶
In [17]: import pandas.io.data as web
In [18]: import datetime
In [19]: start = datetime.datetime(2010, 1, 1)
In [20]: end = datetime.datetime(2013, 1, 27)
In [21]: f=web.DataReader("F", 'google', start, end)
In [22]: f.ix['2010-01-04']
Out[22]:
Open 10.17
High 10.28
Low 10.05
Close 10.28
Volume 60855796
Name: 2010-01-04 00:00:00, dtype: object
FRED¶
In [23]: import pandas.io.data as web
In [24]: import datetime
In [25]: start = datetime.datetime(2010, 1, 1)
In [26]: end = datetime.datetime(2013, 1, 27)
In [27]: gdp=web.DataReader("GDP", "fred", start, end)
In [28]: gdp.ix['2013-01-01']
Out[28]:
GDP 16535.3
Name: 2013-01-01 00:00:00, dtype: float64
# Multiple series:
In [29]: inflation = web.DataReader(["CPIAUCSL", "CPILFESL"], "fred", start, end)
In [30]: inflation.head()
Out[30]:
CPIAUCSL CPILFESL
DATE
2010-01-01 217.466 220.543
2010-02-01 217.251 220.662
2010-03-01 217.305 220.753
2010-04-01 217.376 220.817
2010-05-01 217.299 221.026
Fama/French¶
Dataset names are listed at Fama/French Data Library.
In [31]: import pandas.io.data as web
In [32]: ip=web.DataReader("5_Industry_Portfolios", "famafrench")
In [33]: ip[4].ix[192607]
Out[33]:
1 Cnsmr 5.43
2 Manuf 2.73
3 HiTec 1.83
4 Hlth 1.64
5 Other 2.15
Name: 192607, dtype: float64
World Bank¶
pandas users can easily access thousands of panel data series from the World Bank’s World Development Indicators by using the wb I/O functions.
For example, if you wanted to compare the Gross Domestic Products per capita in constant dollars in North America, you would use the search function:
In [1]: from pandas.io import wb
In [2]: wb.search('gdp.*capita.*const').iloc[:,:2]
Out[2]:
id name
3242 GDPPCKD GDP per Capita, constant US$, millions
5143 NY.GDP.PCAP.KD GDP per capita (constant 2005 US$)
5145 NY.GDP.PCAP.KN GDP per capita (constant LCU)
5147 NY.GDP.PCAP.PP.KD GDP per capita, PPP (constant 2005 internation...
Then you would use the download function to acquire the data from the World Bank’s servers:
In [3]: dat = wb.download(indicator='NY.GDP.PCAP.KD', country=['US', 'CA', 'MX'], start=2005, end=2008)
In [4]: print(dat)
NY.GDP.PCAP.KD
country year
Canada 2008 36005.5004978584
2007 36182.9138439757
2006 35785.9698172849
2005 35087.8925933298
Mexico 2008 8113.10219480083
2007 8119.21298908649
2006 7961.96818458178
2005 7666.69796097264
United States 2008 43069.5819857208
2007 43635.5852068142
2006 43228.111147107
2005 42516.3934699993
The resulting dataset is a properly formatted DataFrame with a hierarchical index, so it is easy to apply .groupby transformations to it:
In [6]: dat['NY.GDP.PCAP.KD'].groupby(level=0).mean()
Out[6]:
country
Canada 35765.569188
Mexico 7965.245332
United States 43112.417952
dtype: float64
Now imagine you want to compare GDP to the share of people with cellphone contracts around the world.
In [7]: wb.search('cell.*%').iloc[:,:2]
Out[7]:
id name
3990 IT.CEL.SETS.FE.ZS Mobile cellular telephone users, female (% of ...
3991 IT.CEL.SETS.MA.ZS Mobile cellular telephone users, male (% of po...
4027 IT.MOB.COV.ZS Population coverage of mobile cellular telepho...
Notice that this second search was much faster than the first one because pandas now has a cached list of available data series.
In [13]: ind = ['NY.GDP.PCAP.KD', 'IT.MOB.COV.ZS']
In [14]: dat = wb.download(indicator=ind, country='all', start=2011, end=2011).dropna()
In [15]: dat.columns = ['gdp', 'cellphone']
In [16]: print(dat.tail())
gdp cellphone
country year
Swaziland 2011 2413.952853 94.9
Tunisia 2011 3687.340170 100.0
Uganda 2011 405.332501 100.0
Zambia 2011 767.911290 62.0
Zimbabwe 2011 419.236086 72.4
Finally, we use the statsmodels package to assess the relationship between our two variables using ordinary least squares regression. Unsurprisingly, populations in rich countries tend to use cellphones at a higher rate:
In [17]: import numpy as np
In [18]: import statsmodels.formula.api as smf
In [19]: mod = smf.ols("cellphone ~ np.log(gdp)", dat).fit()
In [20]: print(mod.summary())
OLS Regression Results
==============================================================================
Dep. Variable: cellphone R-squared: 0.297
Model: OLS Adj. R-squared: 0.274
Method: Least Squares F-statistic: 13.08
Date: Thu, 25 Jul 2013 Prob (F-statistic): 0.00105
Time: 15:24:42 Log-Likelihood: -139.16
No. Observations: 33 AIC: 282.3
Df Residuals: 31 BIC: 285.3
Df Model: 1
===============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept 16.5110 19.071 0.866 0.393 -22.384 55.406
np.log(gdp) 9.9333 2.747 3.616 0.001 4.331 15.535
==============================================================================
Omnibus: 36.054 Durbin-Watson: 2.071
Prob(Omnibus): 0.000 Jarque-Bera (JB): 119.133
Skew: -2.314 Prob(JB): 1.35e-26
Kurtosis: 11.077 Cond. No. 45.8
==============================================================================