v.0.7.0 (February 9, 2012)¶
New features¶
- New unified merge function for efficiently performing full gamut of database / relational-algebra operations. Refactored existing join methods to use the new infrastructure, resulting in substantial performance gains (GH220, GH249, GH267)
- New unified concatenation function for concatenating
Series, DataFrame or Panel objects along an axis. Can form union or
intersection of the other axes. Improves performance of
Series.appendandDataFrame.append(GH468, GH479, GH273) - Can pass multiple DataFrames to
DataFrame.append to concatenate (stack) and multiple Series to
Series.appendtoo - Can pass list of dicts (e.g., a list of JSON objects) to DataFrame constructor (GH526)
- You can now set multiple columns in a
DataFrame via
__getitem__, useful for transformation (GH342) - Handle differently-indexed output values in
DataFrame.apply(GH498)
In [1]: df = pd.DataFrame(np.random.randn(10, 4))
In [2]: df.apply(lambda x: x.describe())
Out[2]:
0 1 2 3
count 10.000000 10.000000 10.000000 10.000000
mean 0.190912 -0.395125 -0.731920 -0.403130
std 0.730951 0.813266 1.112016 0.961912
min -0.861849 -2.104569 -1.776904 -1.469388
25% -0.411391 -0.698728 -1.501401 -1.076610
50% 0.380863 -0.228039 -1.191943 -1.004091
75% 0.658444 0.057974 -0.034326 0.461706
max 1.212112 0.577046 1.643563 1.071804
[8 rows x 4 columns]
- Add
reorder_levelsmethod to Series and DataFrame (GH534) - Add dict-like
getfunction to DataFrame and Panel (GH521) - Add
DataFrame.iterrowsmethod for efficiently iterating through the rows of a DataFrame - Add
DataFrame.to_panelwith code adapted fromLongPanel.to_long - Add
reindex_axismethod added to DataFrame - Add
leveloption to binary arithmetic functions onDataFrameandSeries - Add
leveloption to thereindexandalignmethods on Series and DataFrame for broadcasting values across a level (GH542, GH552, others) - Add attribute-based item access to
Paneland add IPython completion (GH563) - Add
logyoption toSeries.plotfor log-scaling on the Y axis - Add
indexandheaderoptions toDataFrame.to_string - Can pass multiple DataFrames to
DataFrame.jointo join on index (GH115) - Can pass multiple Panels to
Panel.join(GH115) - Added
justifyargument toDataFrame.to_stringto allow different alignment of column headers - Add
sortoption to GroupBy to allow disabling sorting of the group keys for potential speedups (GH595) - Can pass MaskedArray to Series constructor (GH563)
- Add Panel item access via attributes and IPython completion (GH554)
- Implement
DataFrame.lookup, fancy-indexing analogue for retrieving values given a sequence of row and column labels (GH338) - Can pass a list of functions to aggregate with groupby on a DataFrame, yielding an aggregated result with hierarchical columns (GH166)
- Can call
cumminandcummaxon Series and DataFrame to get cumulative minimum and maximum, respectively (GH647) value_rangeadded as utility function to get min and max of a dataframe (GH288)- Added
encodingargument toread_csv,read_table,to_csvandfrom_csvfor non-ascii text (GH717) - Added
absmethod to pandas objects - Added
crosstabfunction for easily computing frequency tables - Added
isinmethod to index objects - Added
levelargument toxsmethod of DataFrame.
API changes to integer indexing¶
One of the potentially riskiest API changes in 0.7.0, but also one of the most important, was a complete review of how integer indexes are handled with regard to label-based indexing. Here is an example:
In [3]: s = pd.Series(np.random.randn(10), index=range(0, 20, 2)) In [4]: s Out[4]: 0 -1.294524 2 0.413738 4 0.276662 6 -0.472035 8 -0.013960 10 -0.362543 12 -0.006154 14 -0.923061 16 0.895717 18 0.805244 Length: 10, dtype: float64 In [5]: s[0] Out[5]: -1.2945235902555294 In [6]: s[2] Out[6]: 0.41373810535784006 In [7]: s[4] Out[7]: 0.2766617129497566
This is all exactly identical to the behavior before. However, if you ask for a
key not contained in the Series, in versions 0.6.1 and prior, Series would
fall back on a location-based lookup. This now raises a KeyError:
In [2]: s[1]
KeyError: 1
This change also has the same impact on DataFrame:
In [3]: df = pd.DataFrame(np.random.randn(8, 4), index=range(0, 16, 2))
In [4]: df
0 1 2 3
0 0.88427 0.3363 -0.1787 0.03162
2 0.14451 -0.1415 0.2504 0.58374
4 -1.44779 -0.9186 -1.4996 0.27163
6 -0.26598 -2.4184 -0.2658 0.11503
8 -0.58776 0.3144 -0.8566 0.61941
10 0.10940 -0.7175 -1.0108 0.47990
12 -1.16919 -0.3087 -0.6049 -0.43544
14 -0.07337 0.3410 0.0424 -0.16037
In [5]: df.ix[3]
KeyError: 3
In order to support purely integer-based indexing, the following methods have been added:
| Method | Description |
|---|---|
Series.iget_value(i) |
Retrieve value stored at location i |
Series.iget(i) |
Alias for iget_value |
DataFrame.irow(i) |
Retrieve the i-th row |
DataFrame.icol(j) |
Retrieve the j-th column |
DataFrame.iget_value(i, j) |
Retrieve the value at row i and column j |
API tweaks regarding label-based slicing¶
Label-based slicing using ix now requires that the index be sorted
(monotonic) unless both the start and endpoint are contained in the index:
In [1]: s = pd.Series(np.random.randn(6), index=list('gmkaec'))
In [2]: s
Out[2]:
g -1.182230
m -0.276183
k -0.243550
a 1.628992
e 0.073308
c -0.539890
dtype: float64
Then this is OK:
In [3]: s.ix['k':'e']
Out[3]:
k -0.243550
a 1.628992
e 0.073308
dtype: float64
But this is not:
In [12]: s.ix['b':'h']
KeyError 'b'
If the index had been sorted, the “range selection” would have been possible:
In [4]: s2 = s.sort_index()
In [5]: s2
Out[5]:
a 1.628992
c -0.539890
e 0.073308
g -1.182230
k -0.243550
m -0.276183
dtype: float64
In [6]: s2.ix['b':'h']
Out[6]:
c -0.539890
e 0.073308
g -1.182230
dtype: float64
Changes to Series [] operator¶
As as notational convenience, you can pass a sequence of labels or a label
slice to a Series when getting and setting values via [] (i.e. the
__getitem__ and __setitem__ methods). The behavior will be the same as
passing similar input to ix except in the case of integer indexing:
In [8]: s = pd.Series(np.random.randn(6), index=list('acegkm'))
In [9]: s
Out[9]:
a -1.206412
c 2.565646
e 1.431256
g 1.340309
k -1.170299
m -0.226169
Length: 6, dtype: float64
In [10]: s[['m', 'a', 'c', 'e']]
Out[10]:
m -0.226169
a -1.206412
c 2.565646
e 1.431256
Length: 4, dtype: float64
In [11]: s['b':'l']
Out[11]:
c 2.565646
e 1.431256
g 1.340309
k -1.170299
Length: 4, dtype: float64
In [12]: s['c':'k']
Out[12]:
c 2.565646
e 1.431256
g 1.340309
k -1.170299
Length: 4, dtype: float64
In the case of integer indexes, the behavior will be exactly as before
(shadowing ndarray):
In [13]: s = pd.Series(np.random.randn(6), index=range(0, 12, 2)) In [14]: s[[4, 0, 2]] Out[14]: 4 0.132003 0 0.410835 2 0.813850 Length: 3, dtype: float64 In [15]: s[1:5] Out[15]: 2 0.813850 4 0.132003 6 -0.827317 8 -0.076467 Length: 4, dtype: float64
If you wish to do indexing with sequences and slicing on an integer index with
label semantics, use ix.
Other API changes¶
- The deprecated
LongPanelclass has been completely removed - If
Series.sortis called on a column of a DataFrame, an exception will now be raised. Before it was possible to accidentally mutate a DataFrame’s column by doingdf[col].sort()instead of the side-effect free methoddf[col].order()(GH316) - Miscellaneous renames and deprecations which will (harmlessly) raise
FutureWarning dropadded as an optional parameter toDataFrame.reset_index(GH699)
Performance improvements¶
- Cythonized GroupBy aggregations no longer presort the data, thus achieving a significant speedup (GH93). GroupBy aggregations with Python functions significantly sped up by clever manipulation of the ndarray data type in Cython (GH496).
- Better error message in DataFrame constructor when passed column labels don’t match data (GH497)
- Substantially improve performance of multi-GroupBy aggregation when a Python function is passed, reuse ndarray object in Cython (GH496)
- Can store objects indexed by tuples and floats in HDFStore (GH492)
- Don’t print length by default in Series.to_string, add length option (GH489)
- Improve Cython code for multi-groupby to aggregate without having to sort the data (GH93)
- Improve MultiIndex reindexing speed by storing tuples in the MultiIndex, test for backwards unpickling compatibility
- Improve column reindexing performance by using specialized Cython take function
- Further performance tweaking of Series.__getitem__ for standard use cases
- Avoid Index dict creation in some cases (i.e. when getting slices, etc.), regression from prior versions
- Friendlier error message in setup.py if NumPy not installed
- Use common set of NA-handling operations (sum, mean, etc.) in Panel class also (GH536)
- Default name assignment when calling
reset_indexon DataFrame with a regular (non-hierarchical) index (GH476) - Use Cythonized groupers when possible in Series/DataFrame stat ops with
levelparameter passed (GH545) - Ported skiplist data structure to C to speed up
rolling_medianby about 5-10x in most typical use cases (GH374)
Contributors¶
A total of 18 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
- Adam Klein
- Bayle Shanks +
- Chris Billington +
- Dieter Vandenbussche
- Fabrizio Pollastri +
- Graham Taylor +
- Gregg Lind +
- Josh Klein +
- Luca Beltrame
- Olivier Grisel +
- Skipper Seabold
- Thomas Kluyver
- Thomas Wiecki +
- Wes McKinney
- Wouter Overmeire
- Yaroslav Halchenko
- fabriziop +
- theandygross +