v.0.7.0 (February 9, 2012)¶
New features¶
- New unified merge function for efficiently performing full gamut of database / relational-algebra operations. Refactored existing join methods to use the new infrastructure, resulting in substantial performance gains (GH220, GH249, GH267)
- New unified concatenation function for concatenating
Series, DataFrame or Panel objects along an axis. Can form union or
intersection of the other axes. Improves performance of
Series.append
andDataFrame.append
(GH468, GH479, GH273) - Can pass multiple DataFrames to
DataFrame.append to concatenate (stack) and multiple Series to
Series.append
too - Can pass list of dicts (e.g., a list of JSON objects) to DataFrame constructor (GH526)
- You can now set multiple columns in a
DataFrame via
__getitem__
, useful for transformation (GH342) - Handle differently-indexed output values in
DataFrame.apply
(GH498)
In [1]: df = pd.DataFrame(np.random.randn(10, 4))
In [2]: df.apply(lambda x: x.describe())
Out[2]:
0 1 2 3
count 10.000000 10.000000 10.000000 10.000000
mean 0.190912 -0.395125 -0.731920 -0.403130
std 0.730951 0.813266 1.112016 0.961912
min -0.861849 -2.104569 -1.776904 -1.469388
25% -0.411391 -0.698728 -1.501401 -1.076610
50% 0.380863 -0.228039 -1.191943 -1.004091
75% 0.658444 0.057974 -0.034326 0.461706
max 1.212112 0.577046 1.643563 1.071804
[8 rows x 4 columns]
- Add
reorder_levels
method to Series and DataFrame (GH534) - Add dict-like
get
function to DataFrame and Panel (GH521) - Add
DataFrame.iterrows
method for efficiently iterating through the rows of a DataFrame - Add
DataFrame.to_panel
with code adapted fromLongPanel.to_long
- Add
reindex_axis
method added to DataFrame - Add
level
option to binary arithmetic functions onDataFrame
andSeries
- Add
level
option to thereindex
andalign
methods on Series and DataFrame for broadcasting values across a level (GH542, GH552, others) - Add attribute-based item access to
Panel
and add IPython completion (GH563) - Add
logy
option toSeries.plot
for log-scaling on the Y axis - Add
index
andheader
options toDataFrame.to_string
- Can pass multiple DataFrames to
DataFrame.join
to join on index (GH115) - Can pass multiple Panels to
Panel.join
(GH115) - Added
justify
argument toDataFrame.to_string
to allow different alignment of column headers - Add
sort
option to GroupBy to allow disabling sorting of the group keys for potential speedups (GH595) - Can pass MaskedArray to Series constructor (GH563)
- Add Panel item access via attributes and IPython completion (GH554)
- Implement
DataFrame.lookup
, fancy-indexing analogue for retrieving values given a sequence of row and column labels (GH338) - Can pass a list of functions to aggregate with groupby on a DataFrame, yielding an aggregated result with hierarchical columns (GH166)
- Can call
cummin
andcummax
on Series and DataFrame to get cumulative minimum and maximum, respectively (GH647) value_range
added as utility function to get min and max of a dataframe (GH288)- Added
encoding
argument toread_csv
,read_table
,to_csv
andfrom_csv
for non-ascii text (GH717) - Added
abs
method to pandas objects - Added
crosstab
function for easily computing frequency tables - Added
isin
method to index objects - Added
level
argument toxs
method of DataFrame.
API changes to integer indexing¶
One of the potentially riskiest API changes in 0.7.0, but also one of the most important, was a complete review of how integer indexes are handled with regard to label-based indexing. Here is an example:
In [3]: s = pd.Series(np.random.randn(10), index=range(0, 20, 2)) In [4]: s Out[4]: 0 -1.294524 2 0.413738 4 0.276662 6 -0.472035 8 -0.013960 10 -0.362543 12 -0.006154 14 -0.923061 16 0.895717 18 0.805244 Length: 10, dtype: float64 In [5]: s[0] Out[5]: -1.2945235902555294 In [6]: s[2] Out[6]: 0.41373810535784006 In [7]: s[4] Out[7]: 0.2766617129497566
This is all exactly identical to the behavior before. However, if you ask for a
key not contained in the Series, in versions 0.6.1 and prior, Series would
fall back on a location-based lookup. This now raises a KeyError
:
In [2]: s[1]
KeyError: 1
This change also has the same impact on DataFrame:
In [3]: df = pd.DataFrame(np.random.randn(8, 4), index=range(0, 16, 2))
In [4]: df
0 1 2 3
0 0.88427 0.3363 -0.1787 0.03162
2 0.14451 -0.1415 0.2504 0.58374
4 -1.44779 -0.9186 -1.4996 0.27163
6 -0.26598 -2.4184 -0.2658 0.11503
8 -0.58776 0.3144 -0.8566 0.61941
10 0.10940 -0.7175 -1.0108 0.47990
12 -1.16919 -0.3087 -0.6049 -0.43544
14 -0.07337 0.3410 0.0424 -0.16037
In [5]: df.ix[3]
KeyError: 3
In order to support purely integer-based indexing, the following methods have been added:
Method | Description |
---|---|
Series.iget_value(i) |
Retrieve value stored at location i |
Series.iget(i) |
Alias for iget_value |
DataFrame.irow(i) |
Retrieve the i -th row |
DataFrame.icol(j) |
Retrieve the j -th column |
DataFrame.iget_value(i, j) |
Retrieve the value at row i and column j |
API tweaks regarding label-based slicing¶
Label-based slicing using ix
now requires that the index be sorted
(monotonic) unless both the start and endpoint are contained in the index:
In [1]: s = pd.Series(np.random.randn(6), index=list('gmkaec'))
In [2]: s
Out[2]:
g -1.182230
m -0.276183
k -0.243550
a 1.628992
e 0.073308
c -0.539890
dtype: float64
Then this is OK:
In [3]: s.ix['k':'e']
Out[3]:
k -0.243550
a 1.628992
e 0.073308
dtype: float64
But this is not:
In [12]: s.ix['b':'h']
KeyError 'b'
If the index had been sorted, the “range selection” would have been possible:
In [4]: s2 = s.sort_index()
In [5]: s2
Out[5]:
a 1.628992
c -0.539890
e 0.073308
g -1.182230
k -0.243550
m -0.276183
dtype: float64
In [6]: s2.ix['b':'h']
Out[6]:
c -0.539890
e 0.073308
g -1.182230
dtype: float64
Changes to Series []
operator¶
As as notational convenience, you can pass a sequence of labels or a label
slice to a Series when getting and setting values via []
(i.e. the
__getitem__
and __setitem__
methods). The behavior will be the same as
passing similar input to ix
except in the case of integer indexing:
In [8]: s = pd.Series(np.random.randn(6), index=list('acegkm')) In [9]: s Out[9]: a -1.206412 c 2.565646 e 1.431256 g 1.340309 k -1.170299 m -0.226169 Length: 6, dtype: float64 In [10]: s[['m', 'a', 'c', 'e']] Out[10]: m -0.226169 a -1.206412 c 2.565646 e 1.431256 Length: 4, dtype: float64 In [11]: s['b':'l'] Out[11]: c 2.565646 e 1.431256 g 1.340309 k -1.170299 Length: 4, dtype: float64 In [12]: s['c':'k'] Out[12]: c 2.565646 e 1.431256 g 1.340309 k -1.170299 Length: 4, dtype: float64
In the case of integer indexes, the behavior will be exactly as before
(shadowing ndarray
):
In [13]: s = pd.Series(np.random.randn(6), index=range(0, 12, 2)) In [14]: s[[4, 0, 2]] Out[14]: 4 0.132003 0 0.410835 2 0.813850 Length: 3, dtype: float64 In [15]: s[1:5] Out[15]: 2 0.813850 4 0.132003 6 -0.827317 8 -0.076467 Length: 4, dtype: float64
If you wish to do indexing with sequences and slicing on an integer index with
label semantics, use ix
.
Other API changes¶
- The deprecated
LongPanel
class has been completely removed - If
Series.sort
is called on a column of a DataFrame, an exception will now be raised. Before it was possible to accidentally mutate a DataFrame’s column by doingdf[col].sort()
instead of the side-effect free methoddf[col].order()
(GH316) - Miscellaneous renames and deprecations which will (harmlessly) raise
FutureWarning
drop
added as an optional parameter toDataFrame.reset_index
(GH699)
Performance improvements¶
- Cythonized GroupBy aggregations no longer presort the data, thus achieving a significant speedup (GH93). GroupBy aggregations with Python functions significantly sped up by clever manipulation of the ndarray data type in Cython (GH496).
- Better error message in DataFrame constructor when passed column labels don’t match data (GH497)
- Substantially improve performance of multi-GroupBy aggregation when a Python function is passed, reuse ndarray object in Cython (GH496)
- Can store objects indexed by tuples and floats in HDFStore (GH492)
- Don’t print length by default in Series.to_string, add length option (GH489)
- Improve Cython code for multi-groupby to aggregate without having to sort the data (GH93)
- Improve MultiIndex reindexing speed by storing tuples in the MultiIndex, test for backwards unpickling compatibility
- Improve column reindexing performance by using specialized Cython take function
- Further performance tweaking of Series.__getitem__ for standard use cases
- Avoid Index dict creation in some cases (i.e. when getting slices, etc.), regression from prior versions
- Friendlier error message in setup.py if NumPy not installed
- Use common set of NA-handling operations (sum, mean, etc.) in Panel class also (GH536)
- Default name assignment when calling
reset_index
on DataFrame with a regular (non-hierarchical) index (GH476) - Use Cythonized groupers when possible in Series/DataFrame stat ops with
level
parameter passed (GH545) - Ported skiplist data structure to C to speed up
rolling_median
by about 5-10x in most typical use cases (GH374)
Contributors¶
A total of 18 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
- Adam Klein
- Bayle Shanks +
- Chris Billington +
- Dieter Vandenbussche
- Fabrizio Pollastri +
- Graham Taylor +
- Gregg Lind +
- Josh Klein +
- Luca Beltrame
- Olivier Grisel +
- Skipper Seabold
- Thomas Kluyver
- Thomas Wiecki +
- Wes McKinney
- Wouter Overmeire
- Yaroslav Halchenko
- fabriziop +
- theandygross +