Remote Data Access

Functions from pandas.io.data extract data from various Internet sources into a DataFrame. Currently the following sources are supported:

  • Yahoo! Finance
  • Google Finance
  • St. Louis FED (FRED)
  • Kenneth French’s data library
  • World Bank

It should be noted, that various sources support different kinds of data, so not all sources implement the same methods and the data elements returned might also differ.

Yahoo! Finance

In [1]: import pandas.io.data as web

In [2]: import datetime

In [3]: start = datetime.datetime(2010, 1, 1)

In [4]: end = datetime.datetime(2013, 1, 27)

In [5]: f=web.DataReader("F", 'yahoo', start, end)

In [6]: f.ix['2010-01-04']
Out[6]: 
Open               10.17
High               10.28
Low                10.05
Close              10.28
Volume       60855800.00
Adj Close           9.52
Name: 2010-01-04 00:00:00, dtype: float64

Yahoo! Finance Options

*Experimental*

The Options class allows the download of options data from Yahoo! Finance.

The get_all_data method downloads and caches option data for all expiry months and provides a formatted DataFrame with a hierarchical index, so its easy to get to the specific option you want.

In [7]: from pandas.io.data import Options

In [8]: aapl = Options('aapl', 'yahoo')

In [9]: data = aapl.get_all_data()

In [10]: data.iloc[0:5, 0:5]
Out[10]: 
                                             Last    Bid    Ask   Chg  PctChg
Strike Expiry     Type Symbol                                                
27.86  2015-01-17 call AAPL150117C00027860  81.05  81.05  81.30  0.50  +0.62%
                  put  AAPL150117P00027860   0.02   0.00   0.02  0.00   0.00%
28.57  2015-01-17 call AAPL150117C00028570  80.45  80.30  80.60  0.25  +0.31%
                  put  AAPL150117P00028570   0.01   0.00   0.02  0.00   0.00%
29.29  2015-01-17 call AAPL150117C00029290  78.70  78.30  80.95  0.00   0.00%

#Show the $100 strike puts at all expiry dates:
In [11]: data.loc[(100, slice(None), 'put'),:].iloc[0:5, 0:5]
Out[11]: 
                                            Last   Bid   Ask   Chg   PctChg
Strike Expiry     Type Symbol                                              
100    2014-11-14 put  AAPL141114P00100000  0.07  0.05  0.07  0.00    0.00%
       2014-11-22 put  AAPL141122P00100000  0.13  0.13  0.14 -0.03  -18.75%
       2014-11-28 put  AAPL141128P00100000  0.18  0.16  0.19 -0.03  -14.29%
       2014-12-05 put  AAPL141205P00100000  0.28  0.24  0.30 -0.04  -12.50%
       2014-12-12 put  AAPL141212P00100000  0.38  0.35  0.39 -0.09  -19.15%

#Show the volume traded of $100 strike puts at all expiry dates:
In [12]: data.loc[(100, slice(None), 'put'),'Vol'].head()
Out[12]: 
Strike  Expiry      Type  Symbol             
100     2014-11-14  put   AAPL141114P00100000     671
        2014-11-22  put   AAPL141122P00100000    3239
        2014-11-28  put   AAPL141128P00100000     119
        2014-12-05  put   AAPL141205P00100000      12
        2014-12-12  put   AAPL141212P00100000      22
Name: Vol, dtype: int64

If you don’t want to download all the data, more specific requests can be made.

In [13]: import datetime

In [14]: expiry = datetime.date(2016, 1, 1)

In [15]: data = aapl.get_call_data(expiry=expiry)

In [16]: data.iloc[0:5:, 0:5]
Out[16]: 
                                             Last   Bid    Ask   Chg  PctChg
Strike Expiry     Type Symbol                                               
34.29  2016-01-15 call AAPL160115C00034290  74.25  72.1  76.95  0.00   0.00%
35.71  2016-01-15 call AAPL160115C00035710  72.65  70.6  75.00  0.00   0.00%
37.14  2016-01-15 call AAPL160115C00037140  72.55  69.5  74.00  0.00   0.00%
38.57  2016-01-15 call AAPL160115C00038570  58.35  68.0  72.70  0.00   0.00%
40.00  2016-01-15 call AAPL160115C00040000  69.10  67.0  71.25  0.55  +0.80%

Note that if you call get_all_data first, this second call will happen much faster, as the data is cached.

If a given expiry date is not available, data for the next available expiry will be returned (January 15, 2015 in the above example).

Available expiry dates can be accessed from the expiry_dates property.

In [17]: aapl.expiry_dates
Out[17]: 
[datetime.date(2014, 11, 14),
 datetime.date(2014, 11, 22),
 datetime.date(2014, 11, 28),
 datetime.date(2014, 12, 5),
 datetime.date(2014, 12, 12),
 datetime.date(2014, 12, 20),
 datetime.date(2015, 1, 17),
 datetime.date(2015, 2, 20),
 datetime.date(2015, 4, 17),
 datetime.date(2015, 7, 17),
 datetime.date(2016, 1, 15),
 datetime.date(2017, 1, 20)]

In [18]: data = aapl.get_call_data(expiry=aapl.expiry_dates[0])

In [19]: data.iloc[0:5:, 0:5]
Out[19]: 
                                             Last    Bid    Ask   Chg  PctChg
Strike Expiry     Type Symbol                                                
80     2014-11-14 call AAPL141114C00080000  29.05  28.70  29.30  0.30  +1.04%
84     2014-11-14 call AAPL141114C00084000  24.80  24.70  25.20  0.00   0.00%
85     2014-11-14 call AAPL141114C00085000  24.05  23.90  24.05  0.67  +2.87%
86     2014-11-14 call AAPL141114C00086000  22.76  22.75  23.25  0.00   0.00%
87     2014-11-14 call AAPL141114C00087000  21.74  21.90  22.10  0.00   0.00%

A list-like object containing dates can also be passed to the expiry parameter, returning options data for all expiry dates in the list.

In [20]: data = aapl.get_near_stock_price(expiry=aapl.expiry_dates[0:3])

In [21]: data.iloc[0:5:, 0:5]
Out[21]: 
                                            Last   Bid   Ask   Chg  PctChg
Strike Expiry     Type Symbol                                             
109    2014-11-22 call AAPL141122C00109000  1.48  1.46  1.50  0.01  +0.68%
       2014-11-28 call AAPL141128C00109000  1.79  1.77  1.82  0.03  +1.70%
110    2014-11-14 call AAPL141114C00110000  0.55  0.55  0.57 -0.05  -8.33%
       2014-11-22 call AAPL141122C00110000  1.02  1.01  1.04 -0.02  -1.92%
       2014-11-28 call AAPL141128C00110000  1.32  1.30  1.35  0.02  +1.54%

The month and year parameters can be used to get all options data for a given month.

Google Finance

In [22]: import pandas.io.data as web

In [23]: import datetime

In [24]: start = datetime.datetime(2010, 1, 1)

In [25]: end = datetime.datetime(2013, 1, 27)

In [26]: f=web.DataReader("F", 'google', start, end)

In [27]: f.ix['2010-01-04']
Out[27]: 
Open         10.17
High         10.28
Low          10.05
Close        10.28
Volume    60855796
Name: 2010-01-04 00:00:00, dtype: object

FRED

In [28]: import pandas.io.data as web

In [29]: import datetime

In [30]: start = datetime.datetime(2010, 1, 1)

In [31]: end = datetime.datetime(2013, 1, 27)

In [32]: gdp=web.DataReader("GDP", "fred", start, end)

In [33]: gdp.ix['2013-01-01']
Out[33]: 
GDP    16502.4
Name: 2013-01-01 00:00:00, dtype: float64

# Multiple series:
In [34]: inflation = web.DataReader(["CPIAUCSL", "CPILFESL"], "fred", start, end)

In [35]: inflation.head()
Out[35]: 
            CPIAUCSL  CPILFESL
DATE                          
2010-01-01   217.466   220.543
2010-02-01   217.251   220.662
2010-03-01   217.305   220.753
2010-04-01   217.376   220.817
2010-05-01   217.299   221.026

Fama/French

Dataset names are listed at Fama/French Data Library.

In [36]: import pandas.io.data as web

In [37]: ip=web.DataReader("5_Industry_Portfolios", "famafrench")

In [38]: ip[4].ix[192607]
Out[38]: 
1 Cnsmr    5.43
2 Manuf    2.73
3 HiTec    1.83
4 Hlth     1.64
5 Other    2.15
Name: 192607, dtype: float64

World Bank

pandas users can easily access thousands of panel data series from the World Bank’s World Development Indicators by using the wb I/O functions.

Indicators

Either from exploring the World Bank site, or using the search function included, every world bank indicator is accessible.

For example, if you wanted to compare the Gross Domestic Products per capita in constant dollars in North America, you would use the search function:

In [1]: from pandas.io import wb

In [2]: wb.search('gdp.*capita.*const').iloc[:,:2]
Out[2]:
                     id                                               name
3242            GDPPCKD             GDP per Capita, constant US$, millions
5143     NY.GDP.PCAP.KD                 GDP per capita (constant 2005 US$)
5145     NY.GDP.PCAP.KN                      GDP per capita (constant LCU)
5147  NY.GDP.PCAP.PP.KD  GDP per capita, PPP (constant 2005 internation...

Then you would use the download function to acquire the data from the World Bank’s servers:

In [3]: dat = wb.download(indicator='NY.GDP.PCAP.KD', country=['US', 'CA', 'MX'], start=2005, end=2008)

In [4]: print(dat)
                      NY.GDP.PCAP.KD
country       year
Canada        2008  36005.5004978584
              2007  36182.9138439757
              2006  35785.9698172849
              2005  35087.8925933298
Mexico        2008  8113.10219480083
              2007  8119.21298908649
              2006  7961.96818458178
              2005  7666.69796097264
United States 2008  43069.5819857208
              2007  43635.5852068142
              2006   43228.111147107
              2005  42516.3934699993

The resulting dataset is a properly formatted DataFrame with a hierarchical index, so it is easy to apply .groupby transformations to it:

In [6]: dat['NY.GDP.PCAP.KD'].groupby(level=0).mean()
Out[6]:
country
Canada           35765.569188
Mexico            7965.245332
United States    43112.417952
dtype: float64

Now imagine you want to compare GDP to the share of people with cellphone contracts around the world.

In [7]: wb.search('cell.*%').iloc[:,:2]
Out[7]:
                     id                                               name
3990  IT.CEL.SETS.FE.ZS  Mobile cellular telephone users, female (% of ...
3991  IT.CEL.SETS.MA.ZS  Mobile cellular telephone users, male (% of po...
4027      IT.MOB.COV.ZS  Population coverage of mobile cellular telepho...

Notice that this second search was much faster than the first one because pandas now has a cached list of available data series.

In [13]: ind = ['NY.GDP.PCAP.KD', 'IT.MOB.COV.ZS']
In [14]: dat = wb.download(indicator=ind, country='all', start=2011, end=2011).dropna()
In [15]: dat.columns = ['gdp', 'cellphone']
In [16]: print(dat.tail())
                        gdp  cellphone
country   year
Swaziland 2011  2413.952853       94.9
Tunisia   2011  3687.340170      100.0
Uganda    2011   405.332501      100.0
Zambia    2011   767.911290       62.0
Zimbabwe  2011   419.236086       72.4

Finally, we use the statsmodels package to assess the relationship between our two variables using ordinary least squares regression. Unsurprisingly, populations in rich countries tend to use cellphones at a higher rate:

In [17]: import numpy as np
In [18]: import statsmodels.formula.api as smf
In [19]: mod = smf.ols("cellphone ~ np.log(gdp)", dat).fit()
In [20]: print(mod.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:              cellphone   R-squared:                       0.297
Model:                            OLS   Adj. R-squared:                  0.274
Method:                 Least Squares   F-statistic:                     13.08
Date:                Thu, 25 Jul 2013   Prob (F-statistic):            0.00105
Time:                        15:24:42   Log-Likelihood:                -139.16
No. Observations:                  33   AIC:                             282.3
Df Residuals:                      31   BIC:                             285.3
Df Model:                           1
===============================================================================
                  coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept      16.5110     19.071      0.866      0.393       -22.384    55.406
np.log(gdp)     9.9333      2.747      3.616      0.001         4.331    15.535
==============================================================================
Omnibus:                       36.054   Durbin-Watson:                   2.071
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              119.133
Skew:                          -2.314   Prob(JB):                     1.35e-26
Kurtosis:                      11.077   Cond. No.                         45.8
==============================================================================

Country Codes

New in version 0.15.1.

The country argument accepts a string or list of mixed two or three character ISO country codes, as well as dynamic World Bank exceptions to the ISO standards.

For a list of the the hard-coded country codes (used solely for error handling logic) see pandas.io.wb.country_codes.

Problematic Country Codes & Indicators

Note

The World Bank’s country list and indicators are dynamic. As of 0.15.1, wb.download() is more flexible. To achieve this, the warning and exception logic changed.

The world bank converts some country codes, in their response, which makes error checking by pandas difficult. Retired indicators still persist in the search.

Given the new flexibility of 0.15.1, improved error handling by the user may be necessary for fringe cases.

To help identify issues:

There are at least 4 kinds of country codes:

  1. Standard (2/3 digit ISO) - returns data, will warn and error properly.
  2. Non-standard (WB Exceptions) - returns data, but will falsely warn.
  3. Blank - silently missing from the response.
  4. Bad - causes the entire response from WB to fail, always exception inducing.

There are at least 3 kinds of indicators:

  1. Current - Returns data.
  2. Retired - Appears in search results, yet won’t return data.
  3. Bad - Will not return data.

Use the errors argument to control warnings and exceptions. Setting errors to ignore or warn, won’t stop failed responses. (ie, 100% bad indicators, or a single “bad” (#4 above) country code).

See docstrings for more info.