v0.16.2 (June 12, 2015)
This is a minor bug-fix release from 0.16.1 and includes a large number of
bug fixes along with some new features (the pipe() method), enhancements,
and performance improvements.
We recommend that all users upgrade to this version.
Highlights include:
New features
Pipe
We’ve introduced a new method, DataFrame.pipe(). As suggested by the name, pipe should be used to pipe data through a chain of function calls.
The goal is to avoid confusing nested function calls like:
# df is a DataFrame
# f, g, and h are functions that take and return DataFrames
f(g(h(df), arg1=1), arg2=2, arg3=3)
The logic flows from inside out, and function names are separated from their keyword arguments. This can be rewritten as
(df.pipe(h)
   .pipe(g, arg1=1)
   .pipe(f, arg2=2, arg3=3)
)
Now both the code and the logic flow from top to bottom. Keyword arguments are next to their functions. Overall the code is much more readable.
In the example above, the functions f, g, and h each expected the DataFrame as the first positional argument.
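As a rough illustration of the nested-versus-piped comparison above, the following sketch uses three small helper functions. The names drop_missing, scale, and add_total are hypothetical stand-ins for h, g, and f; they are not part of pandas.

```python
import pandas as pd


# Hypothetical helpers for illustration only; each takes a DataFrame
# as its first positional argument and returns a new DataFrame.
def drop_missing(df):
    return df.dropna()


def scale(df, column, factor=1.0):
    out = df.copy()
    out[column] = out[column] * factor
    return out


def add_total(df, columns, name="total"):
    return df.assign(**{name: df[columns].sum(axis=1)})


df = pd.DataFrame({"a": [1.0, 2.0, None], "b": [10.0, 20.0, 30.0]})

# Nested form: the logic reads from the inside out.
nested = add_total(scale(drop_missing(df), "a", factor=2.0),
                   columns=["a", "b"], name="total")

# Piped form: the same steps, read from top to bottom.
piped = (df.pipe(drop_missing)
           .pipe(scale, "a", factor=2.0)
           .pipe(add_total, columns=["a", "b"], name="total"))

# Both forms should produce the same frame.
same_result = piped.equals(nested)
```

Extra positional and keyword arguments given to pipe are forwarded to the function after the DataFrame itself.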
When the function you wish to apply takes its data anywhere other than the first argument, pass a tuple of (function, keyword) indicating where the DataFrame should flow. For example:
In [1]: import statsmodels.formula.api as sm

In [2]: bb = pd.read_csv('data/baseball.csv', index_col='id')

# sm.ols takes (formula, data)
In [3]: (bb.query('h > 0')
   ...:    .assign(ln_h=lambda df: np.log(df.h))
   ...:    .pipe((sm.ols, 'data'), 'hr ~ ln_h + year + g + C(lg)')
   ...:    .fit()
   ...:    .summary()
   ...: )
   ...:
Out[3]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:                     hr   R-squared:                       0.685
Model:                            OLS   Adj. R-squared:                  0.665
Method:                 Least Squares   F-statistic:                     34.28
Date:                Tue, 22 Oct 2019   Prob (F-statistic):           3.48e-15
Time:                        13:59:53   Log-Likelihood:                -205.92
No. Observations:                  68   AIC:                             421.8
Df Residuals:                      63   BIC:                             432.9
Df Model:                           4
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept   -8484.7720   4664.146     -1.819      0.074   -1.78e+04     835.780
C(lg)[T.NL]    -2.2736      1.325     -1.716      0.091      -4.922       0.375
ln_h           -1.3542      0.875     -1.547      0.127      -3.103       0.395
year            4.2277      2.324      1.819      0.074      -0.417       8.872
g               0.1841      0.029      6.258      0.000       0.125       0.243
==============================================================================
Omnibus:                       10.875   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               17.298
Skew:                           0.537   Prob(JB):                     0.000175
Kurtosis:                       5.225   Cond. No.                     1.49e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
"""
The pipe method is inspired by Unix pipes, which stream text through
processes. More recently dplyr and magrittr have introduced the
popular (%>%) pipe operator for R.
See the documentation for more. (GH10129)
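The (function, keyword) tuple form can also be tried without statsmodels or the baseball data set. The sketch below is a minimal stand-in: summarize is a hypothetical function, invented here, that takes its data through a keyword argument named data rather than as the first positional argument.

```python
import pandas as pd


# Hypothetical function for illustration: the DataFrame is expected
# as the keyword argument `data`, not as the first positional argument.
def summarize(label, data):
    return "{}: {} rows".format(label, len(data))


df = pd.DataFrame({"h": [0, 3, 5]})

# The tuple tells pipe to pass the DataFrame as data=...;
# "hitters" is forwarded as the first positional argument (label).
result = df.query("h > 0").pipe((summarize, "data"), "hitters")
```

Here the call made by pipe should be equivalent to summarize("hitters", data=df.query("h > 0")).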
Other enhancements
- Added rsplit to Index/Series StringMethods (GH10303)
- Removed the hard-coded size limits on the DataFrame HTML representation in the IPython notebook, and leave this to IPython itself (only for IPython v3.0 or greater). This eliminates the duplicate scroll bars that appeared in the notebook with large frames (GH10231).
  Note that the notebook has a toggle output scrolling feature to limit the display of very large frames (by clicking left of the output). You can also configure the way DataFrames are displayed using the pandas options.
- The axis parameter of DataFrame.quantile now also accepts index and column. (GH9543)
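A short sketch of two of the enhancements above, using small made-up data. It assumes a reasonably recent pandas; only the axis="index" spelling is shown for quantile, since that is the one this example has been written against.

```python
import pandas as pd

# rsplit splits from the right; n=1 keeps everything before the
# last separator together.
s = pd.Series(["a_b_c", "d_e_f"])
right_split = s.str.rsplit("_", n=1).tolist()

# For comparison, split with n=1 splits from the left.
left_split = s.str.split("_", n=1).tolist()

# quantile with a named axis instead of the integer 0.
df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
medians = df.quantile(0.5, axis="index")
```

The named axis should give the same result as axis=0, that is, one median per column.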
API changes
- Holiday now raises NotImplementedError if both offset and observance are used in the constructor instead of returning an incorrect result (GH10217).
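The Holiday change can be checked with a small sketch like the one below. The holiday name and date are made up for illustration; nearest_workday is one of the observance functions shipped in pandas.tseries.holiday.

```python
import pandas as pd
from pandas.tseries.holiday import Holiday, nearest_workday

# Passing both offset and observance is ambiguous, so the
# constructor is expected to refuse it rather than guess.
try:
    Holiday(
        "Example Day",  # hypothetical holiday, for illustration only
        month=7,
        day=4,
        offset=pd.DateOffset(days=1),
        observance=nearest_workday,
    )
    raised = False
except NotImplementedError:
    raised = True

# Using only one of the two arguments remains valid.
observed = Holiday("Example Day", month=7, day=4, observance=nearest_workday)
```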
Performance improvements
Bug fixes
- Bug in Series.hist raises an error when a one row Series was given (GH10214)
- Bug where HDFStore.select modifies the passed columns list (GH7212)
- Bug in Categorical repr with display.width of None in Python 3 (GH10087)
- Bug in to_json with certain orients and a CategoricalIndex would segfault (GH10317)
- Bug where some of the nan functions do not have consistent return dtypes (GH10251)
- Bug in DataFrame.quantile on checking that a valid axis was passed (GH9543)
- Bug in groupby.apply aggregation for Categorical not preserving categories (GH10138)
- Bug in to_csv where date_format is ignored if the datetime is fractional (GH10209)
- Bug in DataFrame.to_json with mixed data types (GH10289)
- Bug in cache updating when consolidating (GH10264)
- Bug in mean() where integer dtypes can overflow (GH10172)
- Bug where Panel.from_dict does not set dtype when specified (GH10058)
- Bug in Index.union raises AttributeError when passing array-likes. (GH10149)
- Bug in Timestamp’s microsecond, quarter, dayofyear, week and daysinmonth properties return np.int type, not built-in int. (GH10050)
- Bug in NaT raises AttributeError when accessing the daysinmonth and dayofweek properties. (GH10096)
- Bug in Index repr when using the max_seq_items=None setting (GH10182).
- Bug in getting timezone data with dateutil on various platforms (GH9059, GH8639, GH9663, GH10121)
- Bug in displaying datetimes with mixed frequencies; display ‘ms’ datetimes to the proper precision. (GH10170)
- Bug in setitem where type promotion is applied to the entire block (GH10280)
- Bug in Series arithmetic methods may incorrectly hold names (GH10068)
- Bug in GroupBy.get_group when grouping on multiple keys, one of which is categorical. (GH10132)
- Bug in DatetimeIndex and TimedeltaIndex names are lost after timedelta arithmetics (GH9926)
- Bug in DataFrame construction from nested dict with datetime64 (GH10160)
- Bug in Series construction from dict with datetime64 keys (GH9456)
- Bug in Series.plot(label="LABEL") not correctly setting the label (GH10119)
- Bug in plot not defaulting to matplotlib axes.grid setting (GH9792)
- Bug causing strings containing an exponent, but no decimal to be parsed as int instead of float in engine='python' for the read_csv parser (GH9565)
- Bug in Series.align resets name when fill_value is specified (GH10067)
- Bug in read_csv causing index name not to be set on an empty DataFrame (GH10184)
- Bug in SparseSeries.abs resets name (GH10241)
- Bug in TimedeltaIndex slicing may reset freq (GH10292)
- Bug in GroupBy.get_group raises ValueError when group key contains NaT (GH6992)
- Bug in SparseSeries constructor ignores input data name (GH10258)
- Bug in Categorical.remove_categories causing a ValueError when removing the NaN category if underlying dtype is floating-point (GH10156)
- Bug where infer_freq infers time rule (WOM-5XXX) unsupported by to_offset (GH9425)
- Bug in DataFrame.to_hdf() where table format would raise a seemingly unrelated error for invalid (non-string) column names. This is now explicitly forbidden. (GH9057)
- Bug to handle masking empty DataFrame (GH10126).
- Bug where MySQL interface could not handle numeric table/column names (GH10255)
- Bug in read_csv with a date_parser that returned a datetime64 array of other time resolution than [ns] (GH10245)
- Bug in Panel.apply when the result has ndim=0 (GH10332)
- Bug in read_hdf where auto_close could not be passed (GH9327).
- Bug in read_hdf where open stores could not be used (GH10330).
- Bug in adding empty DataFrames, now results in a DataFrame that .equals an empty DataFrame (GH10181).
- Bug in to_hdf and HDFStore which did not check that complib choices were valid (GH4582, GH8874).
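A few of the fixes listed above can be spot-checked on a current installation. The sketch below covers three of them (the Timestamp property types, exponent parsing with engine='python', and Series.align keeping the name). The sample values are made up, and it assumes a pandas version in which these fixes are present.

```python
import io

import pandas as pd

# GH10050: these Timestamp properties should be built-in ints.
ts = pd.Timestamp("2015-06-12 10:30:00.000123")
prop_names = ["microsecond", "quarter", "dayofyear", "week", "daysinmonth"]
props = {name: getattr(ts, name) for name in prop_names}

# GH9565: a string with an exponent but no decimal point should be
# parsed as float by the Python parser engine.
csv_text = io.StringIO("a\n1e3\n2e3\n")
parsed = pd.read_csv(csv_text, engine="python")

# GH10067: align with fill_value should keep each Series' name.
left = pd.Series([1, 2], index=[0, 1], name="x")
right = pd.Series([3, 4], index=[1, 2], name="y")
aligned_left, aligned_right = left.align(right, fill_value=0)
```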
Contributors
A total of 34 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
- Andrew Rosenfeld
- Artemy Kolchinsky
- Bernard Willers +
- Christer van der Meeren
- Christian Hudon +
- Constantine Glen Evans +
- Daniel Julius Lasiman +
- Evan Wright
- Francesco Brundu +
- Gaëtan de Menten +
- Jake VanderPlas
- James Hiebert +
- Jeff Reback
- Joris Van den Bossche
- Justin Lecher +
- Ka Wo Chen +
- Kevin Sheppard
- Mortada Mehyar
- Morton Fox +
- Robin Wilson +
- Sinhrks
- Stephan Hoyer
- Thomas Grainger
- Tom Ajamian
- Tom Augspurger
- Yoshiki Vázquez Baeza
- Younggun Kim
- austinc +
- behzad nouri
- jreback
- lexual
- rekcahpassyla +
- scls19fr
- sinhrks