PDEP-9: Allow third-party projects to register pandas connectors with a standard API
- Created: 5 March 2023
- Status: Rejected
- Discussion: #51799 #53005
- Author: Marc Garcia
- Revision: 1
PDEP Summary
This document proposes that third-party projects implementing I/O or memory connectors for pandas can register them using Python's entrypoint system and make them available to pandas users through the usual pandas I/O interface. For example, packages independent from pandas could implement a reader for DuckDB and a writer for Delta Lake, and when installed in the user's environment, the user would be able to use them as if they were implemented in pandas:
```python
import pandas

pandas.load_io_plugins()

df = pandas.read_duckdb("SELECT * FROM 'my_dataset.parquet';")
df.to_deltalake('/delta/my_dataset')
```
This would make it easy to extend the existing set of connectors, adding support for new formats, database engines, data lake technologies, out-of-core connectors, the new ADBC interface, and others, while at the same time reducing the maintenance cost of the pandas code base.
Current state
pandas supports importing and exporting data from different formats using I/O connectors, currently implemented in `pandas/io`, as well as connectors to in-memory structures like Python structures or other library formats. In many cases, those connectors wrap an existing Python library, while in some others, pandas implements the logic to read and write to a particular format.
In some cases, different engines exist for the same format. The API to use those connectors is `pandas.read_<format>(engine='<engine-name>', ...)` to import data, and `DataFrame.to_<format>(engine='<engine-name>', ...)` to export data.
For objects exported to memory (like a Python dict) the API is the same as for I/O, `DataFrame.to_<format>(...)`. For formats imported from objects in memory, the API is different, using the `from_` prefix instead of `read_`: `DataFrame.from_<format>(...)`.
In some cases, the pandas API provides `DataFrame.to_*` methods that are not used to export the data to a disk or memory object, but instead to transform the index of a `DataFrame`: `DataFrame.to_period` and `DataFrame.to_timestamp`.
Dependencies of the connectors are not loaded by default, and are imported when the connector is used. If the dependencies are not installed, an `ImportError` is raised.
```python
>>> pandas.read_gbq(query)
Traceback (most recent call last):
  ...
ImportError: Missing optional dependency 'pandas-gbq'.
pandas-gbq is required to load data from Google BigQuery.
See the docs: https://pandas-gbq.readthedocs.io.
Use pip or conda to install pandas-gbq.
```
Supported formats
The list of formats can be found in the IO guide. A more detailed table, including in-memory objects and the I/O connectors available via the DataFrame `Styler`, is presented next:
Format | Reader | Writer | Engines
---|---|---|---
CSV | X | X | `c`, `python`, `pyarrow`
FWF | X | | `c`, `python`, `pyarrow`
JSON | X | X | `ujson`, `pyarrow`
HTML | X | X | `lxml`, `bs4`/`html5lib` (parameter `flavor`)
LaTeX | | X |
XML | X | X | `lxml`, `etree` (parameter `parser`)
Clipboard | X | X |
Excel | X | X | `xlrd`, `openpyxl`, `odf`, `pyxlsb` (each engine supports different file formats)
HDF5 | X | X |
Feather | X | X |
Parquet | X | X | `pyarrow`, `fastparquet`
ORC | X | X |
Stata | X | X |
SAS | X | |
SPSS | X | |
Pickle | X | X |
SQL | X | X | `sqlalchemy`, `dbapi2` (inferred from the type of the `con` parameter)
BigQuery | X | X |
dict | X | X |
records | X | X |
string | | X |
markdown | | X |
xarray | | X |
At the time of writing this document, the `io/` module contains close to 100,000 lines of Python, C and Cython code.
There are no objective criteria for when a format is included in pandas, and the list above is mostly the result of developers being interested in implementing connectors for a certain format in pandas.
The number of formats available for data that can be processed with pandas is constantly increasing, and it is difficult for pandas to keep up to date even with popular formats. It possibly makes sense to have connectors to PyArrow, PySpark, Iceberg, DuckDB, Hive, Polars, and many others.
At the same time, some of the formats are not frequently used as shown in the 2019 user survey. Those less popular formats include SPSS, SAS, Google BigQuery and Stata. Note that only I/O formats (and not memory formats like records or xarray) were included in the survey.
The maintenance cost of supporting all formats lies not only in maintaining the code and reviewing pull requests, but also in the significant time spent on CI systems installing dependencies, compiling code, running tests, etc.
In some cases, the main maintainers of some of the connectors are not part of the pandas core development team, but people specialized in one of the formats.
Proposal
While the current pandas approach has worked reasonably well, it is difficult to find a stable solution that keeps the maintenance burden on pandas manageable while still letting users interact with all the formats and representations they are interested in, in an easy and intuitive way.
Third-party packages are already able to implement connectors to pandas, but there are some limitations to it:
- Given the large number of formats supported by pandas itself, third-party connectors are likely seen as second-class citizens, not important enough to be used, or not well supported.
- There is no standard API for external I/O connectors, and users need to learn each of them individually. Since the pandas I/O API is inconsistent (it uses read/to instead of read/write or from/to), developers in many cases ignore the convention. Also, even if developers follow the pandas convention, the namespaces would be different, since developers of connectors will rarely monkeypatch their functions into the `pandas` or `DataFrame` namespaces.
- Method chaining is not possible with third-party I/O connectors to export data, unless authors monkeypatch the `DataFrame` class, which should not be encouraged.
This document proposes to open the development of pandas I/O connectors to third-party libraries in a standard way that overcomes those limitations.
Proposal implementation
Implementing this proposal would not require major changes to pandas, and the API defined next would be used.
User API
Users will be able to install third-party packages implementing pandas connectors using the standard packaging tools (pip, conda, etc.). These connectors should implement entrypoints that pandas will use to automatically create the corresponding methods `pandas.read_*`, `pandas.DataFrame.to_*` and `pandas.Series.to_*`. Arbitrary function or method names will not be created by this interface; only the `read_*` and `to_*` patterns will be allowed.
By simply installing the appropriate packages and calling the function `pandas.load_io_plugins()`, users will be able to use code like this:
```python
import pandas

pandas.load_io_plugins()

df = pandas.read_duckdb("SELECT * FROM 'dataset.parquet';")
df.to_hive(hive_conn, "hive_table")
```
This API allows for method chaining:
```python
(pandas.read_duckdb("SELECT * FROM 'dataset.parquet';")
       .to_hive(hive_conn, "hive_table"))
```
The total number of I/O functions and methods is expected to be small, as users in general use only a small subset of formats. The number could actually be reduced from the current state if the less popular formats (such as SAS, SPSS, BigQuery, etc.) were moved out of the pandas core into third-party packages. Moving these connectors is not part of this proposal and could be discussed later in a separate proposal.
Plugin registration
Third-party packages would implement entrypoints to define the connectors that they implement, under the `dataframe.io` group. For example, a hypothetical project `pandas_duckdb` implementing a `read_duckdb` function could use `pyproject.toml` to define the following entry point:
```toml
[project.entry-points."dataframe.io"]
reader_duckdb = "pandas_duckdb:read_duckdb"
```
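For illustration only, the hypothetical `pandas_duckdb` package could implement the registered function along these lines. This is a sketch, not part of the proposal; the `duckdb.sql(...).df()` calls are assumptions about the DuckDB Python API, and the error handling a real connector would need is omitted:

```python
# pandas_duckdb/__init__.py -- hypothetical connector package (sketch only)
import duckdb  # third-party dependency of the connector, not of pandas
import pandas


def read_duckdb(query: str) -> pandas.DataFrame:
    """Run a DuckDB query and return the result as a pandas DataFrame.

    Returning a PyArrow Table (or any object implementing the dataframe
    interchange API) would also be acceptable under this proposal.
    """
    # duckdb.sql() returns a relation; .df() materializes it as a
    # pandas DataFrame (assuming a recent duckdb release).
    return duckdb.sql(query).df()
```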
When the user calls `pandas.load_io_plugins()`, it would read the entrypoint registry for the `dataframe.io` group and dynamically create methods in the `pandas`, `pandas.DataFrame` and `pandas.Series` namespaces for them. Only entrypoints with names starting with `reader_` or `writer_` would be processed by pandas, and the functions registered in the entrypoints would be made available to pandas users in the corresponding pandas namespaces. The text after the prefix `reader_` or `writer_` would be used as the name of the function. In the example above, the entrypoint name `reader_duckdb` would create `pandas.read_duckdb`. An entrypoint named `writer_hive` would create the methods `DataFrame.to_hive` and `Series.to_hive`.
Entrypoints not starting with `reader_` or `writer_` would be ignored by this interface, but would not raise an exception, since they can be used for future extensions of this API or other related dataframe I/O interfaces.
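The following is a minimal sketch of how `pandas.load_io_plugins()` could process the entrypoint group. It is not the proposed implementation, and it assumes Python 3.10+, where `importlib.metadata.entry_points` accepts a `group` keyword:

```python
import importlib.metadata

import pandas


def load_io_plugins():
    """Sketch: create read_*/to_* wrappers from the 'dataframe.io' entrypoints."""
    for entrypoint in importlib.metadata.entry_points(group="dataframe.io"):
        kind, _, format_name = entrypoint.name.partition("_")
        if kind == "reader":
            # e.g. "reader_duckdb" -> pandas.read_duckdb
            setattr(pandas, f"read_{format_name}", entrypoint.load())
        elif kind == "writer":
            # e.g. "writer_hive" -> DataFrame.to_hive and Series.to_hive
            writer = entrypoint.load()

            def to_method(self, *args, _writer=writer, **kwargs):
                return _writer(self, *args, **kwargs)

            setattr(pandas.DataFrame, f"to_{format_name}", to_method)
            setattr(pandas.Series, f"to_{format_name}", to_method)
        # any other prefix is ignored, as described above
```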
Internal API
Connectors will use the dataframe interchange API to provide data to pandas. When data is read from a connector, and before returning it to the user as a response to `pandas.read_<format>`, data will be parsed from the data interchange interface and converted to a pandas DataFrame. In practice, connectors are likely to return a pandas DataFrame or a PyArrow Table, but the interface will support any object implementing the dataframe interchange API.
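As an example of how this could work, the wrapper pandas creates around a registered reader could normalize the connector's return value roughly as sketched below. `pandas.api.interchange.from_dataframe` is the existing pandas helper for the interchange protocol; the wrapper itself is a hypothetical illustration:

```python
import pandas
from pandas.api.interchange import from_dataframe


def wrap_reader(connector_read_func):
    """Sketch: ensure a registered reader always returns a pandas DataFrame."""

    def read_wrapper(*args, **kwargs):
        result = connector_read_func(*args, **kwargs)
        if isinstance(result, pandas.DataFrame):
            return result
        # Any other object implementing the dataframe interchange API
        # (e.g. a PyArrow Table) is converted to a pandas DataFrame.
        return from_dataframe(result)

    return read_wrapper
```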
Connector guidelines
In order to provide a better and more consistent experience to users, guidelines will be created to unify terminology and behavior. Some of the topics to unify are defined next.
Guidelines to avoid name conflicts. Since it is expected that more than one implementation will exist for certain formats, as already happens, guidelines on how to name connectors would be created. The easiest approach is probably to use a name of the form `to_<format>_<implementation-id>` when more than one connector for a format is expected to exist. For example, for LanceDB it is likely that only one connector will exist, and the name `lance` can be used (which would create `pandas.read_lance` or `DataFrame.to_lance`). But for a new `csv` reader based on the Arrow2 Rust implementation, the guidelines could recommend using `csv_arrow2`, which would create `pandas.read_csv_arrow2`, etc.
Existence and naming of parameters. Many connectors are likely to provide similar features, like loading only a subset of columns in the data, or dealing with paths. Examples of recommendations to connector developers could be:
- `columns`: Use this argument to let the user load a subset of columns. Allow a list or tuple.
- `path`: Use this argument if the dataset is a file on disk. Allow a string, a `pathlib.Path` object, or a file descriptor. For a string, allow URLs that will be automatically downloaded, compressed files that will be automatically uncompressed, etc. Specific libraries can be recommended to deal with those in an easier and more consistent way.
- `schema`: For datasets that don't have a schema (e.g. `csv`), allow providing an Apache Arrow schema instance, and automatically infer types if it is not provided.
Note that the above are only examples of guidelines for illustration, and not a proposal of the guidelines, which would be developed independently after this PDEP is approved.
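For illustration only, a reader following parameter guidelines like the ones above might expose a signature such as the following; the format name, types and defaults are hypothetical:

```python
from __future__ import annotations

import pathlib

import pandas
import pyarrow


def read_myformat(
    path: str | pathlib.Path,
    columns: list[str] | tuple[str, ...] | None = None,
    schema: pyarrow.Schema | None = None,
) -> pandas.DataFrame:
    """Hypothetical reader following the parameter guidelines: `path` accepts
    a string or pathlib.Path, `columns` selects a subset of columns, and
    `schema` optionally provides an Apache Arrow schema (types are inferred
    when it is omitted)."""
    ...
```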
Connector registry and documentation. To simplify the discovery of connectors and their documentation, connector developers can be encouraged to register their projects in a central location and to use a standard structure for documentation. This would allow the creation of a unified website to find the available connectors and their documentation. It would also allow customizing the documentation for specific implementations and including their final API.
Connector examples
This section lists specific examples of connectors that could immediately benefit from this proposal.
PyArrow currently provides `Table.from_pandas` and `Table.to_pandas`. With the new interface, it could also register `pandas.read_pyarrow` and `DataFrame.to_pyarrow`, so pandas users can use the converters with the interface they are used to, when PyArrow is installed in the environment. Better integration with PyArrow tables was discussed in #51760.
Current API:
```python
pyarrow.Table.from_pandas(table.to_pandas()
                               .query('my_col > 0'))
```
Proposed API:
```python
(pandas.read_pyarrow(table)
       .query('my_col > 0')
       .to_pyarrow())
```
Polars, Vaex and other dataframe frameworks could benefit from third-party projects that make their interoperability with pandas use a more explicit API. Integration with Polars was requested in #47368.
Current API:
```python
polars.DataFrame(df.to_pandas()
                   .query('my_col > 0'))
```
Proposed API:
```python
(pandas.read_polars(df)
       .query('my_col > 0')
       .to_polars())
```
DuckDB provides an out-of-core engine able to push predicates before the data is loaded, making much better use of memory and significantly decreasing loading time. pandas, because of its eager nature, is not able to easily implement this itself, but could benefit from a DuckDB loader. The loader can already be implemented inside pandas (it has already been proposed in #45678) or as a third-party extension with an arbitrary API. But this proposal would allow the creation of a third-party extension with a standard and intuitive API:
pandas.read_duckdb("SELECT *
FROM 'dataset.parquet'
WHERE my_col > 0")
Out-of-core algorithms push some operations like filtering or grouping to the loading of the data. While this is not currently possible, connectors implementing out-of-core algorithms could be developed using this interface.
Big data systems such as Hive, Iceberg, Presto, etc. could benefit from a standard way to load data into pandas. Also, regular SQL databases that can return their query results as Arrow would benefit from better and faster connectors than the existing ones based on SQLAlchemy and Python structures.
Any other format, including domain-specific formats, could easily implement pandas connectors with a clear and intuitive API.
Limitations
The implementation of this proposal has some limitations discussed here:
- Lack of support for multiple engines. The current pandas I/O API supports multiple engines for the same format (for the same function or method name), for example `read_csv(engine='pyarrow', ...)`. Supporting engines requires that all engines for a particular format use the same signature (the same parameters), which is not ideal. Different connectors are likely to have different parameters, and using `*args` and `**kwargs` provides users with a more complex and difficult experience. For this reason, this proposal prefers that function and method names be unique instead of supporting an option for engines.
- Lack of support for type checking of connectors. This PDEP proposes creating functions and methods dynamically, and those are not supported for type checking using stubs. This is already the case for other dynamically created components of pandas, such as custom accessors.
- No improvements to the current I/O API. In the discussions of this proposal it has been considered to improve the current pandas I/O API to fix the inconsistency of using read/to (instead of, for example, read/write), to avoid using `to_`-prefixed methods for non-I/O operations, or to use a dedicated namespace (e.g. `DataFrame.io`) for the connectors. All of these changes are out of scope for this PDEP.
Future plans
This PDEP is exclusively to support a better API for existing or future connectors. It is out of scope for this PDEP to implement changes to any connectors existing in the pandas code base.
Some ideas for future discussion related to this PDEP include:
- Automatically loading I/O plugins when pandas is imported.
- Removing some of the least frequently used connectors from the pandas code base, such as SAS, SPSS or Google BigQuery, and moving them to third-party connectors registered with this interface.
- Discussing a better API for pandas connectors. For example, using `read_*` methods instead of `from_*` methods, renaming `to_*` methods not used as I/O connectors, using a consistent terminology like from/to, read/write, load/dump, etc., or using a dedicated namespace for connectors (e.g. `pandas.io` instead of the general `pandas` namespace).
- Implementing as I/O connectors some of the formats supported by the `DataFrame` constructor.
PDEP-9 History
- 5 March 2023: Initial version
- 30 May 2023: Major refactoring to use the existing pandas API, to use the dataframe interchange API, and to require the user to explicitly load the plugins
- 13 June 2023: The PDEP did not get any support after several iterations, and it has been closed as rejected by the author