This is a minor release from 0.10.0 and includes new features, enhancements, and bug fixes. In particular, there is substantial new HDFStore functionality contributed by Jeff Reback.
An undesired API breakage with functions taking the inplace option has been reverted and deprecation warnings added.
Functions taking an inplace option return the calling object as before. A deprecation message has been added.
Groupby aggregations Max/Min no longer exclude non-numeric data (GH2700)
Resampling an empty DataFrame now returns an empty DataFrame instead of raising an exception (GH2640)
The file reader will now raise an exception when NA values are found in an explicitly specified integer column instead of converting the column to float (GH2631)
DatetimeIndex.unique now returns a DatetimeIndex with the same name and timezone instead of an array (GH2563)
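Two of the API changes above can be sketched as follows. This is a minimal illustration written against a recent pandas rather than 0.10.1; the inline CSV text, the column names, and the index name "when" are made up for the example.

```python
import io

import pandas as pd

# A missing value in a column explicitly declared as integer:
# the reader raises instead of silently converting the column to float.
csv_text = "a,b\n1,2\n,3\n"  # hypothetical data; row 2 has no value for "a"
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"a": "int64"})
    raised = False
except ValueError:
    raised = True

# unique() on a DatetimeIndex keeps the index type, name, and timezone.
idx = pd.DatetimeIndex(
    ["2013-01-01", "2013-01-01", "2013-01-02"], tz="UTC", name="when"
)
uniq = idx.unique()
print(raised, type(uniq).__name__, uniq.name, uniq.tz)
```

If the column may legitimately contain missing values, reading it as float (or leaving the dtype unspecified) avoids the exception.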
MySQL support for database (contribution from Dan Allan)
You may need to upgrade your existing data files. Please visit the compatibility section in the main docs.
You can designate (and index) the columns of a table that you want to be able to query by passing a list to data_columns.
In [1]: store = pd.HDFStore("store.h5")

In [2]: df = pd.DataFrame(
   ...:     np.random.randn(8, 3),
   ...:     index=pd.date_range("1/1/2000", periods=8),
   ...:     columns=["A", "B", "C"],
   ...: )
   ...:

In [3]: df["string"] = "foo"

In [4]: df.loc[df.index[4:6], "string"] = np.nan

In [5]: df.loc[df.index[7:9], "string"] = "bar"

In [6]: df["string2"] = "cool"

In [7]: df
Out[7]:
                   A         B         C string string2
2000-01-01  0.469112 -0.282863 -1.509059    foo    cool
2000-01-02 -1.135632  1.212112 -0.173215    foo    cool
2000-01-03  0.119209 -1.044236 -0.861849    foo    cool
2000-01-04 -2.104569 -0.494929  1.071804    foo    cool
2000-01-05  0.721555 -0.706771 -1.039575    NaN    cool
2000-01-06  0.271860 -0.424972  0.567020    NaN    cool
2000-01-07  0.276232 -1.087401 -0.673690    foo    cool
2000-01-08  0.113648 -1.478427  0.524988    bar    cool

# on-disk operations
In [8]: store.append("df", df, data_columns=["B", "C", "string", "string2"])

In [9]: store.select("df", "B>0 and string=='foo'")
Out[9]:
                   A         B         C string string2
2000-01-02 -1.135632  1.212112 -0.173215    foo    cool

# this is in-memory version of this type of selection
In [10]: df[(df.B > 0) & (df.string == "foo")]
Out[10]:
                   A         B         C string string2
2000-01-02 -1.135632  1.212112 -0.173215    foo    cool
Retrieving unique values in an indexable or data column.
# note that this is deprecated as of 0.14.0
# can be replicated by: store.select_column('df','index').unique()
store.unique("df", "index")
store.unique("df", "string")
You can now store datetime64 in data columns
In [11]: df_mixed = df.copy()

In [12]: df_mixed["datetime64"] = pd.Timestamp("20010102")

In [13]: df_mixed.loc[df_mixed.index[3:4], ["A", "B"]] = np.nan

In [14]: store.append("df_mixed", df_mixed)

In [15]: df_mixed1 = store.select("df_mixed")

In [16]: df_mixed1
Out[16]:
                   A         B         C string string2 datetime64
2000-01-01  0.469112 -0.282863 -1.509059    foo    cool 2001-01-02
2000-01-02 -1.135632  1.212112 -0.173215    foo    cool 2001-01-02
2000-01-03  0.119209 -1.044236 -0.861849    foo    cool 2001-01-02
2000-01-04       NaN       NaN  1.071804    foo    cool 2001-01-02
2000-01-05  0.721555 -0.706771 -1.039575    NaN    cool 2001-01-02
2000-01-06  0.271860 -0.424972  0.567020    NaN    cool 2001-01-02
2000-01-07  0.276232 -1.087401 -0.673690    foo    cool 2001-01-02
2000-01-08  0.113648 -1.478427  0.524988    bar    cool 2001-01-02

In [17]: df_mixed1.dtypes.value_counts()
Out[17]:
float64           3
object            2
datetime64[ns]    1
dtype: int64
You can pass the columns keyword to select to restrict which columns are returned; this is equivalent to passing Term('columns', list_of_columns_to_filter).
In [18]: store.select("df", columns=["A", "B"])
Out[18]:
                   A         B
2000-01-01  0.469112 -0.282863
2000-01-02 -1.135632  1.212112
2000-01-03  0.119209 -1.044236
2000-01-04 -2.104569 -0.494929
2000-01-05  0.721555 -0.706771
2000-01-06  0.271860 -0.424972
2000-01-07  0.276232 -1.087401
2000-01-08  0.113648 -1.478427
HDFStore now serializes MultiIndex dataframes when appending tables.
In [19]: index = pd.MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
   ....:                               ['one', 'two', 'three']],
   ....:                       labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
   ....:                               [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
   ....:                       names=['foo', 'bar'])
   ....:

In [20]: df = pd.DataFrame(np.random.randn(10, 3), index=index,
   ....:                   columns=['A', 'B', 'C'])
   ....:

In [21]: df
Out[21]:
                  A         B         C
foo bar
foo one   -0.116619  0.295575 -1.047704
    two    1.640556  1.905836  2.772115
    three  0.088787 -1.144197 -0.633372
bar one    0.925372 -0.006438 -0.820408
    two   -0.600874 -1.039266  0.824758
baz two   -0.824095 -0.337730 -0.927764
    three -0.840123  0.248505 -0.109250
qux one    0.431977 -0.460710  0.336505
    two   -3.207595 -1.535854  0.409769
    three -0.673145 -0.741113 -0.110891

In [22]: store.append('mi', df)

In [23]: store.select('mi')
Out[23]:
                  A         B         C
foo bar
foo one   -0.116619  0.295575 -1.047704
    two    1.640556  1.905836  2.772115
    three  0.088787 -1.144197 -0.633372
bar one    0.925372 -0.006438 -0.820408
    two   -0.600874 -1.039266  0.824758
baz two   -0.824095 -0.337730 -0.927764
    three -0.840123  0.248505 -0.109250
qux one    0.431977 -0.460710  0.336505
    two   -3.207595 -1.535854  0.409769
    three -0.673145 -0.741113 -0.110891

# the levels are automatically included as data columns
In [24]: store.select('mi', "foo='bar'")
Out[24]:
                A         B         C
foo bar
bar one  0.925372 -0.006438 -0.820408
    two -0.600874 -1.039266  0.824758
Multi-table creation via append_to_multiple and selection via select_as_multiple can create/select from multiple tables and return a combined result, by using where on a selector table.
In [19]: df_mt = pd.DataFrame(
   ....:     np.random.randn(8, 6),
   ....:     index=pd.date_range("1/1/2000", periods=8),
   ....:     columns=["A", "B", "C", "D", "E", "F"],
   ....: )
   ....:

In [20]: df_mt["foo"] = "bar"

# you can also create the tables individually
In [21]: store.append_to_multiple(
   ....:     {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
   ....: )
   ....:

In [22]: store
Out[22]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

# individual tables were created
In [23]: store.select("df1_mt")
Out[23]:
                   A         B
2000-01-01  0.404705  0.577046
2000-01-02 -1.344312  0.844885
2000-01-03  0.357021 -0.674600
2000-01-04  0.276662 -0.472035
2000-01-05  0.895717  0.805244
2000-01-06 -1.170299 -0.226169
2000-01-07 -0.076467 -1.187678
2000-01-08  1.024180  0.569605

In [24]: store.select("df2_mt")
Out[24]:
                   C         D         E         F  foo
2000-01-01 -1.715002 -1.039268 -0.370647 -1.157892  bar
2000-01-02  1.075770 -0.109050  1.643563 -1.469388  bar
2000-01-03 -1.776904 -0.968914 -1.294524  0.413738  bar
2000-01-04 -0.013960 -0.362543 -0.006154 -0.923061  bar
2000-01-05 -1.206412  2.565646  1.431256  1.340309  bar
2000-01-06  0.410835  0.813850  0.132003 -0.827317  bar
2000-01-07  1.130127 -1.436737 -1.413681  1.607920  bar
2000-01-08  0.875906 -2.211372  0.974466 -2.006747  bar

# as a multiple
In [25]: store.select_as_multiple(
   ....:     ["df1_mt", "df2_mt"], where=["A>0", "B>0"], selector="df1_mt"
   ....: )
   ....:
Out[25]:
                   A         B         C         D         E         F  foo
2000-01-01  0.404705  0.577046 -1.715002 -1.039268 -0.370647 -1.157892  bar
2000-01-05  0.895717  0.805244 -1.206412  2.565646  1.431256  1.340309  bar
2000-01-08  1.024180  0.569605  0.875906 -2.211372  0.974466 -2.006747  bar
Enhancements
HDFStore can now read native PyTables table format tables
You can pass nan_rep='my_nan_rep' to append to change the default NaN representation on disk (which converts to/from np.nan); this defaults to nan.
You can pass index to append. This defaults to True, which automatically creates indices on the indexables and data columns of the table.
You can pass chunksize=<an integer> to append to change the writing chunk size (default is 50000). This will significantly lower your memory usage on writing.
You can pass expectedrows=<an integer> to the first append to set the total number of rows that PyTables should expect. This will optimize read/write performance.
select now supports passing start and stop to limit the range of rows over which the selection is made.
Greatly improved ISO8601 (e.g., yyyy-mm-dd) date parsing for file parsers (GH2698)
Allow DataFrame.merge to handle combinatorial sizes too large for 64-bit integer (GH2690)
Series now has unary negation (-series) and inversion (~series) operators (GH2686)
DataFrame.plot now includes a logx parameter to change the x-axis to log scale (GH2327)
Series arithmetic operators can now handle constant and ndarray input (GH2574)
ExcelFile now takes a kind argument to specify the file type (GH2613)
A faster implementation for Series.str methods (GH2602)
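The Series operator enhancements listed above (unary negation, inversion, and arithmetic with a constant or an ndarray) can be illustrated with a short sketch. The values below are made up for the example, and the snippet is written against a recent pandas rather than 0.10.1.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, -2, 3])
flags = pd.Series([True, False, True])

negated = -s        # unary negation, element by element
inverted = ~flags   # inversion of a boolean Series

plus_const = s + 10                     # arithmetic with a constant
times_array = s * np.array([2, 2, 2])   # arithmetic with an ndarray of equal length

print(negated.tolist(), inverted.tolist())
print(plus_const.tolist(), times_array.tolist())
```

An ndarray operand is matched by position, so it needs the same length as the Series.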
Bug Fixes
HDFStore tables can now store float32 types correctly (cannot be mixed with float64 however)
Fixed Google Analytics prefix when specifying request segment (GH2713).
Function to reset the Google Analytics token store so users can recover from improperly set up client secrets (GH2687).
Fixed groupby bug resulting in segfault when passing in MultiIndex (GH2706)
Fixed bug where passing a Series with datetime64 values into to_datetime results in bogus output values (GH2699)
Fixed bug in pattern in HDFStore expressions when pattern is not a valid regex (GH2694)
Fixed performance issues while aggregating boolean data (GH2692)
When given a boolean mask key and a Series of new values, Series __setitem__ will now align the incoming values with the original Series (GH2686)
Fixed MemoryError caused by performing counting sort on sorting MultiIndex levels with a very large number of combinatorial values (GH2684)
Fixed bug that causes plotting to fail when the index is a DatetimeIndex with a fixed-offset timezone (GH2683)
Corrected business day subtraction logic when the offset is more than 5 bdays and the starting date is on a weekend (GH2680)
Fixed C file parser behavior when the file has more columns than data (GH2668)
Fixed file reader bug that misaligned columns with data in the presence of an implicit column and a specified usecols value
DataFrames with numerical or datetime indices are now sorted prior to plotting (GH2609)
Fixed DataFrame.from_records error when passed columns, index, but empty records (GH2633)
Several bugs fixed for Series operations when dtype is datetime64 (GH2689, GH2629, GH2626)
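The alignment behavior described for Series __setitem__ with a boolean mask (GH2686) can be sketched as follows. The labels and values are hypothetical, and the snippet is written against a recent pandas rather than 0.10.1.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])
mask = s > 2  # True for labels "c" and "d"

# The new values are deliberately listed in a different order than s;
# assignment matches them by index label, not by position.
new_values = pd.Series([40.0, 30.0], index=["d", "c"])
s[mask] = new_values

print(s.tolist())
```

Because the incoming Series is aligned on its labels, "c" receives 30.0 and "d" receives 40.0 even though they appear in the opposite order in new_values.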
See the full release notes or issue tracker on GitHub for a complete list.
A total of 17 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
Andy Hayden +
Anton I. Sipos +
Chang She
Christopher Whelan
Damien Garaud +
Dan Allan +
Dieter Vandenbussche
Garrett Drapala +
Jay Parlar +
Thouis (Ray) Jones +
Vincent Arel-Bundock +
Wes McKinney
elpres
herrfz +
jreback
svaksha +
y-p