Version 0.10.1 (January 22, 2013)#
This is a minor release from 0.10.0 and includes new features, enhancements, and bug fixes. In particular, there is substantial new HDFStore functionality contributed by Jeff Reback.
An undesired API breakage with functions taking the inplace option has been
reverted and deprecation warnings added.
API changes#
- Functions taking an inplace option return the calling object as before; a deprecation warning has been added
- Groupby aggregations Max/Min no longer exclude non-numeric data (GH2700) 
- Resampling an empty DataFrame now returns an empty DataFrame instead of raising an exception (GH2640) 
- The file reader will now raise an exception when NA values are found in an explicitly specified integer column instead of converting the column to float (GH2631) 
- DatetimeIndex.unique now returns a DatetimeIndex with the same name and timezone instead of an array (GH2563) 
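The file-reader change above is easy to demonstrate. This is a minimal sketch, assuming a current pandas where the behavior still holds: a CSV with a missing value in a column explicitly typed as integer now raises instead of silently converting to float.

```python
import io

import pandas as pd

# Row 2 has an empty (NA) value in column "a", which we declare as int64.
data = "a,b\n1,2\n,4\n"

try:
    pd.read_csv(io.StringIO(data), dtype={"a": "int64"})
    raised = False
except ValueError:
    # The NA value in an explicitly integer column raises rather than
    # quietly upcasting the column to float64.
    raised = True

print(raised)
```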
New features#
- MySQL support for database (contribution from Dan Allan) 
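The 0.10-era database API lived in pandas.io.sql (read_frame / write_frame with a MySQLdb connection). As a hedged sketch of the same round trip using the standard-library sqlite3 driver as a stand-in for MySQL (the table name "demo" is illustrative):

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for a MySQL connection.
con = sqlite3.connect(":memory:")

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df.to_sql("demo", con, index=False)  # write the frame to a SQL table

out = pd.read_sql("SELECT * FROM demo", con)  # read it back
print(out)
```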
HDFStore#
You may need to upgrade your existing data files. Please visit the compatibility section in the main docs.
You can designate (and index) certain columns of a table that you want to be able to
perform queries on, by passing a list to data_columns
In [1]: store = pd.HDFStore("store.h5")
In [2]: df = pd.DataFrame(
   ...:     np.random.randn(8, 3),
   ...:     index=pd.date_range("1/1/2000", periods=8),
   ...:     columns=["A", "B", "C"],
   ...: )
   ...: 
In [3]: df["string"] = "foo"
In [4]: df.loc[df.index[4:6], "string"] = np.nan
In [5]: df.loc[df.index[7:9], "string"] = "bar"
In [6]: df["string2"] = "cool"
In [7]: df
Out[7]: 
                   A         B         C string string2
2000-01-01  0.469112 -0.282863 -1.509059    foo    cool
2000-01-02 -1.135632  1.212112 -0.173215    foo    cool
2000-01-03  0.119209 -1.044236 -0.861849    foo    cool
2000-01-04 -2.104569 -0.494929  1.071804    foo    cool
2000-01-05  0.721555 -0.706771 -1.039575    NaN    cool
2000-01-06  0.271860 -0.424972  0.567020    NaN    cool
2000-01-07  0.276232 -1.087401 -0.673690    foo    cool
2000-01-08  0.113648 -1.478427  0.524988    bar    cool
# on-disk operations
In [8]: store.append("df", df, data_columns=["B", "C", "string", "string2"])
In [9]: store.select("df", "B>0 and string=='foo'")
Out[9]: 
                   A         B         C string string2
2000-01-02 -1.135632  1.212112 -0.173215    foo    cool
# this is in-memory version of this type of selection
In [10]: df[(df.B > 0) & (df.string == "foo")]
Out[10]: 
                   A         B         C string string2
2000-01-02 -1.135632  1.212112 -0.173215    foo    cool
Retrieving unique values in an indexable or data column.
# note that this is deprecated as of 0.14.0
# can be replicated by: store.select_column('df','index').unique()
store.unique("df", "index")
store.unique("df", "string")
You can now store datetime64 in data columns
In [11]: df_mixed = df.copy()
In [12]: df_mixed["datetime64"] = pd.Timestamp("20010102")
In [13]: df_mixed.loc[df_mixed.index[3:4], ["A", "B"]] = np.nan
In [14]: store.append("df_mixed", df_mixed)
In [15]: df_mixed1 = store.select("df_mixed")
In [16]: df_mixed1
Out[16]: 
                   A         B         C string string2 datetime64
2000-01-01  0.469112 -0.282863 -1.509059    foo    cool 2001-01-02
2000-01-02 -1.135632  1.212112 -0.173215    foo    cool 2001-01-02
2000-01-03  0.119209 -1.044236 -0.861849    foo    cool 2001-01-02
2000-01-04       NaN       NaN  1.071804    foo    cool 2001-01-02
2000-01-05  0.721555 -0.706771 -1.039575    NaN    cool 2001-01-02
2000-01-06  0.271860 -0.424972  0.567020    NaN    cool 2001-01-02
2000-01-07  0.276232 -1.087401 -0.673690    foo    cool 2001-01-02
2000-01-08  0.113648 -1.478427  0.524988    bar    cool 2001-01-02
In [17]: df_mixed1.dtypes.value_counts()
Out[17]: 
float64           3
object            2
datetime64[ns]    1
Name: count, dtype: int64
You can pass the columns keyword to select to filter the returned
columns; this is equivalent to passing a
Term('columns', list_of_columns_to_filter)
In [18]: store.select("df", columns=["A", "B"])
Out[18]: 
                   A         B
2000-01-01  0.469112 -0.282863
2000-01-02 -1.135632  1.212112
2000-01-03  0.119209 -1.044236
2000-01-04 -2.104569 -0.494929
2000-01-05  0.721555 -0.706771
2000-01-06  0.271860 -0.424972
2000-01-07  0.276232 -1.087401
2000-01-08  0.113648 -1.478427
HDFStore now serializes MultiIndex dataframes when appending tables.
In [19]: index = pd.MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
   ....:                               ['one', 'two', 'three']],
   ....:                       labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
   ....:                               [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
   ....:                       names=['foo', 'bar'])
   ....:
In [20]: df = pd.DataFrame(np.random.randn(10, 3), index=index,
   ....:                   columns=['A', 'B', 'C'])
   ....:
In [21]: df
Out[21]:
                  A         B         C
foo bar
foo one   -0.116619  0.295575 -1.047704
    two    1.640556  1.905836  2.772115
    three  0.088787 -1.144197 -0.633372
bar one    0.925372 -0.006438 -0.820408
    two   -0.600874 -1.039266  0.824758
baz two   -0.824095 -0.337730 -0.927764
    three -0.840123  0.248505 -0.109250
qux one    0.431977 -0.460710  0.336505
    two   -3.207595 -1.535854  0.409769
    three -0.673145 -0.741113 -0.110891
In [22]: store.append('mi', df)
In [23]: store.select('mi')
Out[23]:
                  A         B         C
foo bar
foo one   -0.116619  0.295575 -1.047704
    two    1.640556  1.905836  2.772115
    three  0.088787 -1.144197 -0.633372
bar one    0.925372 -0.006438 -0.820408
    two   -0.600874 -1.039266  0.824758
baz two   -0.824095 -0.337730 -0.927764
    three -0.840123  0.248505 -0.109250
qux one    0.431977 -0.460710  0.336505
    two   -3.207595 -1.535854  0.409769
    three -0.673145 -0.741113 -0.110891
# the levels are automatically included as data columns
In [24]: store.select('mi', "foo='bar'")
Out[24]:
                A         B         C
foo bar
bar one  0.925372 -0.006438 -0.820408
    two -0.600874 -1.039266  0.824758
Multi-table creation via append_to_multiple and selection via
select_as_multiple can create and select from multiple tables, returning a
combined result, by using where on a selector table.
In [19]: df_mt = pd.DataFrame(
   ....:     np.random.randn(8, 6),
   ....:     index=pd.date_range("1/1/2000", periods=8),
   ....:     columns=["A", "B", "C", "D", "E", "F"],
   ....: )
   ....: 
In [20]: df_mt["foo"] = "bar"
# you can also create the tables individually
In [21]: store.append_to_multiple(
   ....:     {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
   ....: )
   ....: 
In [22]: store
Out[22]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
# individual tables were created
In [23]: store.select("df1_mt")
Out[23]: 
                   A         B
2000-01-01  0.404705  0.577046
2000-01-02 -1.344312  0.844885
2000-01-03  0.357021 -0.674600
2000-01-04  0.276662 -0.472035
2000-01-05  0.895717  0.805244
2000-01-06 -1.170299 -0.226169
2000-01-07 -0.076467 -1.187678
2000-01-08  1.024180  0.569605
In [24]: store.select("df2_mt")
Out[24]: 
                   C         D         E         F  foo
2000-01-01 -1.715002 -1.039268 -0.370647 -1.157892  bar
2000-01-02  1.075770 -0.109050  1.643563 -1.469388  bar
2000-01-03 -1.776904 -0.968914 -1.294524  0.413738  bar
2000-01-04 -0.013960 -0.362543 -0.006154 -0.923061  bar
2000-01-05 -1.206412  2.565646  1.431256  1.340309  bar
2000-01-06  0.410835  0.813850  0.132003 -0.827317  bar
2000-01-07  1.130127 -1.436737 -1.413681  1.607920  bar
2000-01-08  0.875906 -2.211372  0.974466 -2.006747  bar
# as a multiple
In [25]: store.select_as_multiple(
   ....:     ["df1_mt", "df2_mt"], where=["A>0", "B>0"], selector="df1_mt"
   ....: )
   ....: 
Out[25]: 
                   A         B         C         D         E         F  foo
2000-01-01  0.404705  0.577046 -1.715002 -1.039268 -0.370647 -1.157892  bar
2000-01-05  0.895717  0.805244 -1.206412  2.565646  1.431256  1.340309  bar
2000-01-08  1.024180  0.569605  0.875906 -2.211372  0.974466 -2.006747  bar
Enhancements#
- HDFStore now can read native PyTables format tables 
- You can pass nan_rep = 'my_nan_rep' to append, to change the default NaN representation on disk (which converts to/from np.nan); this defaults to nan 
- You can pass index to append. This defaults to True. This will automagically create indices on the indexables and data columns of the table 
- You can pass chunksize=an integer to append, to change the writing chunksize (default is 50000). This will significantly lower your memory usage on writing. 
- You can pass expectedrows=an integer to the first append, to set the TOTAL number of rows that PyTables will expect. This will optimize read/write performance. 
- select now supports passing start and stop to limit the rows considered in a selection 
- Greatly improved ISO8601 (e.g., yyyy-mm-dd) date parsing for file parsers (GH2698) 
- Allow DataFrame.merge to handle combinatorial sizes too large for 64-bit integer (GH2690) 
- Series now has unary negation (-series) and inversion (~series) operators (GH2686) 
- DataFrame.plot now includes a logx parameter to change the x-axis to log scale (GH2327) 
- Series arithmetic operators can now handle constant and ndarray input (GH2574) 
- ExcelFile now takes a kind argument to specify the file type (GH2613) 
- A faster implementation for Series.str methods (GH2602) 
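The new unary operators noted above (GH2686) are straightforward to illustrate; a minimal sketch with an arbitrary numeric and boolean Series:

```python
import pandas as pd

s = pd.Series([1, -2, 3])
b = pd.Series([True, False, True])

print(list(-s))  # element-wise negation -> [-1, 2, -3]
print(list(~b))  # element-wise inversion -> [False, True, False]
```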
Bug fixes#
- HDFStore tables can now store float32 types correctly (they cannot be mixed with float64 however) 
- Fixed Google Analytics prefix when specifying request segment (GH2713). 
- Function to reset Google Analytics token store so users can recover from improperly setup client secrets (GH2687). 
- Fixed groupby bug resulting in segfault when passing in MultiIndex (GH2706) 
- Fixed bug where passing a Series with datetime64 values into to_datetime results in bogus output values (GH2699) 
- Fixed bug in pattern in HDFStore expressions when the pattern is not a valid regex (GH2694) 
- Fixed performance issues while aggregating boolean data (GH2692) 
- When given a boolean mask key and a Series of new values, Series __setitem__ will now align the incoming values with the original Series (GH2686) 
- Fixed MemoryError caused by performing counting sort on sorting MultiIndex levels with a very large number of combinatorial values (GH2684) 
- Fixed bug that causes plotting to fail when the index is a DatetimeIndex with a fixed-offset timezone (GH2683) 
- Corrected business day subtraction logic when the offset is more than 5 bdays and the starting date is on a weekend (GH2680) 
- Fixed C file parser behavior when the file has more columns than data (GH2668) 
- Fixed file reader bug that misaligned columns with data in the presence of an implicit column and a specified usecols value 
- DataFrames with numerical or datetime indices are now sorted prior to plotting (GH2609) 
- Fixed DataFrame.from_records error when passed columns, index, but empty records (GH2633) 
- Several bugs fixed for Series operations when dtype is datetime64 (GH2689, GH2629, GH2626) 
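The Series __setitem__ alignment fix (GH2686) can be sketched as follows, assuming current pandas behavior: when assigning a Series of new values through a boolean mask, the incoming values are matched by index label, not by position.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=list("abcd"))
mask = s > 2  # True for labels 'c' and 'd'

# Incoming values are aligned with the original Series' index,
# so the order in which they are supplied does not matter.
s[mask] = pd.Series({"d": 40, "c": 30})

print(list(s))  # [1, 2, 30, 40]
```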
See the full release notes or issue tracker on GitHub for a complete list.
Contributors#
A total of 17 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
- Andy Hayden + 
- Anton I. Sipos + 
- Chang She 
- Christopher Whelan 
- Damien Garaud + 
- Dan Allan + 
- Dieter Vandenbussche 
- Garrett Drapala + 
- Jay Parlar + 
- Thouis (Ray) Jones + 
- Vincent Arel-Bundock + 
- Wes McKinney 
- elpres 
- herrfz + 
- jreback 
- svaksha + 
- y-p