.. _developer: {{ header }} .. currentmodule:: pandas ********* Developer ********* This section will focus on downstream applications of pandas. .. _apache.parquet: Storing pandas DataFrame objects in Apache Parquet format --------------------------------------------------------- The `Apache Parquet `__ format provides key-value metadata at the file and column level, stored in the footer of the Parquet file: .. code-block:: shell 5: optional list key_value_metadata where ``KeyValue`` is .. code-block:: shell struct KeyValue { 1: required string key 2: optional string value } So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a ``pandas`` metadata key in the ``FileMetaData`` with the value stored as : .. code-block:: text {'index_columns': [, , ...], 'column_indexes': [, , ..., ], 'columns': [, , ...], 'pandas_version': $VERSION, 'creator': { 'library': $LIBRARY, 'version': $LIBRARY_VERSION }} The "descriptor" values ```` in the ``'index_columns'`` field are strings (referring to a column) or dictionaries with values as described below. The ````/```` and so forth are dictionaries containing the metadata for each column, *including the index columns*. This has JSON form: .. code-block:: text {'name': column_name, 'field_name': parquet_column_name, 'pandas_type': pandas_type, 'numpy_type': numpy_type, 'metadata': metadata} See below for the detailed specification for these. Index metadata descriptors ~~~~~~~~~~~~~~~~~~~~~~~~~~ ``RangeIndex`` can be stored as metadata only, not requiring serialization. The descriptor format for these as is follows: .. code-block:: python index = pd.RangeIndex(0, 10, 2) { "kind": "range", "name": index.name, "start": index.start, "stop": index.stop, "step": index.step, } Other index types must be serialized as data columns along with the other DataFrame columns. The metadata for these is a string indicating the name of the field in the data columns, for example ``'__index_level_0__'``. If an index has a non-None ``name`` attribute, and there is no other column with a name matching that value, then the ``index.name`` value can be used as the descriptor. Otherwise (for unnamed indexes and ones with names colliding with other column names) a disambiguating name with pattern matching ``__index_level_\d+__`` should be used. In cases of named indexes as data columns, ``name`` attribute is always stored in the column descriptors as above. Column metadata ~~~~~~~~~~~~~~~ ``pandas_type`` is the logical type of the column, and is one of: * Boolean: ``'bool'`` * Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'`` * Floats: ``'float16', 'float32', 'float64'`` * Date and Time Types: ``'datetime', 'datetimetz'``, ``'timedelta'`` * String: ``'unicode', 'bytes'`` * Categorical: ``'categorical'`` * Other Python objects: ``'object'`` The ``numpy_type`` is the physical storage type of the column, which is the result of ``str(dtype)`` for the underlying NumPy array that holds the data. So for ``datetimetz`` this is ``datetime64[ns]`` and for categorical, it may be any of the supported integer categorical types. The ``metadata`` field is ``None`` except for: * ``datetimetz``: ``{'timezone': zone, 'unit': 'ns'}``, e.g. ``{'timezone', 'America/New_York', 'unit': 'ns'}``. The ``'unit'`` is optional, and if omitted it is assumed to be nanoseconds. * ``categorical``: ``{'num_categories': K, 'ordered': is_ordered, 'type': $TYPE}`` * Here ``'type'`` is optional, and can be a nested pandas type specification here (but not categorical) * ``unicode``: ``{'encoding': encoding}`` * The encoding is optional, and if not present is UTF-8 * ``object``: ``{'encoding': encoding}``. Objects can be serialized and stored in ``BYTE_ARRAY`` Parquet columns. The encoding can be one of: * ``'pickle'`` * ``'bson'`` * ``'json'`` * ``timedelta``: ``{'unit': 'ns'}``. The ``'unit'`` is optional, and if omitted it is assumed to be nanoseconds. This metadata is optional altogether For types other than these, the ``'metadata'`` key can be omitted. Implementations can assume ``None`` if the key is not present. As an example of fully-formed metadata: .. code-block:: text {'index_columns': ['__index_level_0__'], 'column_indexes': [ {'name': None, 'field_name': 'None', 'pandas_type': 'unicode', 'numpy_type': 'object', 'metadata': {'encoding': 'UTF-8'}} ], 'columns': [ {'name': 'c0', 'field_name': 'c0', 'pandas_type': 'int8', 'numpy_type': 'int8', 'metadata': None}, {'name': 'c1', 'field_name': 'c1', 'pandas_type': 'bytes', 'numpy_type': 'object', 'metadata': None}, {'name': 'c2', 'field_name': 'c2', 'pandas_type': 'categorical', 'numpy_type': 'int16', 'metadata': {'num_categories': 1000, 'ordered': False}}, {'name': 'c3', 'field_name': 'c3', 'pandas_type': 'datetimetz', 'numpy_type': 'datetime64[ns]', 'metadata': {'timezone': 'America/Los_Angeles'}}, {'name': 'c4', 'field_name': 'c4', 'pandas_type': 'object', 'numpy_type': 'object', 'metadata': {'encoding': 'pickle'}}, {'name': None, 'field_name': '__index_level_0__', 'pandas_type': 'int64', 'numpy_type': 'int64', 'metadata': None} ], 'pandas_version': '1.4.0', 'creator': { 'library': 'pyarrow', 'version': '0.13.0' }}