PDEP-4: Consistent datetime parsing

Abstract

The suggestion is that:

Motivation and Scope

Pandas date parsing is very flexible, but arguably too much so - see https://github.com/pandas-dev/pandas/issues/12585 and linked issues for how much confusion this causes. Pandas can switch format midway through parsing a single input, and though this is documented, it regularly breaks users' expectations.

Simple example:

In [1]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00'])
Out[1]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)

The user almost certainly intended the data to be read as "12th of January, 13th of January". However, it's read as "1st of December, 13th of January". No warning or error is raised.

Currently, the only way to ensure consistent parsing is by explicitly passing format=. The argument infer_datetime_format isn't strict, can be passed together with format, and can still break users' expectations:

In [2]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00'], infer_datetime_format=True)
Out[2]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)
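
By contrast, here is a minimal sketch of the explicit route, assuming the day-first layout the user most likely intended. With format= passed, every element is parsed with the same format, and an element that doesn't match raises an error instead of being silently reinterpreted. The extra '2000/01/14' value is made up for illustration.

```python
import pandas as pd

dates = ['12-01-2000 00:00:00', '13-01-2000 00:00:00']

# An explicit day-first format is applied to every element alike.
parsed = pd.to_datetime(dates, format='%d-%m-%Y %H:%M:%S')

# An element that doesn't match the format raises instead of being reinterpreted.
try:
    pd.to_datetime(dates + ['2000/01/14'], format='%d-%m-%Y %H:%M:%S')
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```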

Detailed Description

Concretely, the suggestion is:

If a user has dates in a mixed format, they can still use flexible parsing and accept the risks that poses, e.g.:

In [3]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00'], format='mixed')
Out[3]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)

or, if their dates are all ISO8601,

In [4]: pd.to_datetime(['2020-01-01', '2020-01-01 03:00'], format='ISO8601')
Out[4]: DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 03:00:00'], dtype='datetime64[ns]', freq=None)

Usage and Impact

My expectation is that the impact would be a net-positive:

As far as I can tell, there is no chance of introducing bugs.

Implementation

The whatsnew notes read

In the next major version release, 2.0, several larger API changes are being considered without a formal deprecation.

I'd suggest making this change as part of the above, because:

Note that this wouldn't mean getting rid of dateutil.parser, as that would still be used within guess_datetime_format. With this proposal, however, subsequent rows would be parsed with the guessed format, rather than by repeatedly calling dateutil.parser and risking having it silently switch format.
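
The "guess once, then reuse the format" idea can be sketched with the standard library alone. This is an illustration only, not pandas' implementation: guess_format, parse_consistently and the CANDIDATE_FORMATS list are hypothetical names, and the real guess_datetime_format builds on dateutil.parser rather than a fixed list of candidates.

```python
from datetime import datetime

# Hypothetical candidate list, tried in order (month-first before day-first).
CANDIDATE_FORMATS = [
    '%m-%d-%Y %H:%M:%S',
    '%d-%m-%Y %H:%M:%S',
    '%Y-%m-%d %H:%M:%S',
]


def guess_format(value):
    """Return the first candidate format that parses `value` (hypothetical helper)."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            continue
    raise ValueError(f'could not guess a format for {value!r}')


def parse_consistently(values):
    """Guess the format from the first element, then apply it to every element."""
    fmt = guess_format(values[0])
    # strptime raises ValueError for any element that doesn't match the guess,
    # so the format can never silently switch midway.
    return [datetime.strptime(v, fmt) for v in values]
```

Under these assumptions, parse_consistently(['12-01-2000 00:00:00', '13-01-2000 00:00:00']) raises a ValueError: the first element is guessed as month-first, and the second cannot be parsed with that format, so the inconsistency surfaces instead of being hidden.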

Finally, the function guess_datetime_format (currently available via from pandas._libs.tslibs.parsing import guess_datetime_format) would be made public, under pandas.tools.

Out of scope

We could make guess_datetime_format smarter by using a random sample of elements to infer the format.
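
As a rough sketch of what sample-based inference might look like, again using only the standard library: guess_format_from_sample and CANDIDATE_FORMATS are hypothetical, and nothing here reflects how pandas would actually implement it.

```python
import random
from datetime import datetime

# Hypothetical candidate list, tried in order.
CANDIDATE_FORMATS = ['%m-%d-%Y %H:%M:%S', '%d-%m-%Y %H:%M:%S']


def guess_format_from_sample(values, k=10, seed=None):
    """Return the first candidate format that parses every sampled element, or None."""
    rng = random.Random(seed)
    sample = rng.sample(values, min(k, len(values)))
    for fmt in CANDIDATE_FORMATS:
        try:
            for v in sample:
                datetime.strptime(v, fmt)
        except ValueError:
            continue  # at least one sampled element rules this format out
        return fmt
    return None
```

With the two-element example from the motivation, the sample contains '13-01-2000 00:00:00', which rules out the month-first candidate, so the day-first format is returned.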

PDEP History