Developer#

This section will focus on downstream applications of pandas.

Storing pandas DataFrame objects in Apache Parquet format#

The Apache Parquet format provides key-value metadata at the file and column level, stored in the footer of the Parquet file:

5: optional list<KeyValue> key_value_metadata

where KeyValue is

struct KeyValue {
  1: required string key
  2: optional string value
}

So that a pandas.DataFrame can be faithfully reconstructed, we store a pandas metadata key in the FileMetaData, with a value of the following form:

{'index_columns': [<descr0>, <descr1>, ...],
 'column_indexes': [<ci0>, <ci1>, ..., <ciN>],
 'columns': [<c0>, <c1>, ...],
 'pandas_version': $VERSION,
 'creator': {
   'library': $LIBRARY,
   'version': $LIBRARY_VERSION
 }}
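
As a rough illustration of the layout above, the sketch below assembles the metadata value, JSON-encodes it, and places it in a plain dict that stands in for the footer's key-value metadata. The key name `b"pandas"`, the helper name `build_pandas_metadata`, and the UTF-8 byte encoding are assumptions made for the example; a real writer would attach the pair to the file through its Parquet library.

```python
import json


def build_pandas_metadata(index_columns, column_indexes, columns,
                          pandas_version, library, library_version):
    """Assemble the top-level metadata dictionary (hypothetical helper)."""
    return {
        "index_columns": index_columns,
        "column_indexes": column_indexes,
        "columns": columns,
        "pandas_version": pandas_version,
        "creator": {"library": library, "version": library_version},
    }


meta = build_pandas_metadata(
    index_columns=["__index_level_0__"],
    column_indexes=[],
    columns=[],
    pandas_version="1.4.0",
    library="pyarrow",
    library_version="0.13.0",
)

# Stand-in for key_value_metadata: one key mapped to one string value.
# The key name b"pandas" is an assumption of this sketch.
key_value_metadata = {b"pandas": json.dumps(meta).encode("utf8")}

# A reader reverses the process to recover the dictionary.
decoded = json.loads(key_value_metadata[b"pandas"].decode("utf8"))
```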

The “descriptor” values <descr0>, <descr1>, and so forth in the 'index_columns' field are either strings (each referring to a column) or dictionaries with values as described below.

The <c0>, <ci0>, and so forth are dictionaries containing the metadata for each column, including the index columns. Each has the JSON form:

{'name': column_name,
 'field_name': parquet_column_name,
 'pandas_type': pandas_type,
 'numpy_type': numpy_type,
 'metadata': metadata}
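
A minimal sketch of building one such dictionary follows. The function name `column_descriptor` is hypothetical, and defaulting `metadata` to `None` simply mirrors the rule that implementations can assume `None` when the key is absent.

```python
def column_descriptor(name, field_name, pandas_type, numpy_type, metadata=None):
    """Return the per-column metadata dictionary (hypothetical helper)."""
    return {
        "name": name,                # name as seen by pandas, may be None
        "field_name": field_name,    # name of the column in the Parquet file
        "pandas_type": pandas_type,  # logical type, e.g. 'int8'
        "numpy_type": numpy_type,    # physical storage type, e.g. 'int8'
        "metadata": metadata,        # type-specific extras, or None
    }


c0 = column_descriptor("c0", "c0", "int8", "int8")
```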

See below for the detailed specification for these.

Index metadata descriptors#

A RangeIndex can be stored as metadata only, without serializing its values. The descriptor format for these is as follows:

index = pd.RangeIndex(0, 10, 2)
{
    "kind": "range",
    "name": index.name,
    "start": index.start,
    "stop": index.stop,
    "step": index.step,
}
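
The round trip for such a descriptor can be sketched as below. To keep the example free of dependencies, a `SimpleNamespace` stands in for `pd.RangeIndex(0, 10, 2)` and the built-in `range` stands in for the reconstructed index; both helper names are hypothetical.

```python
from types import SimpleNamespace


def range_index_descriptor(index):
    """Describe a RangeIndex-like object exposing name, start, stop, step."""
    return {
        "kind": "range",
        "name": index.name,
        "start": index.start,
        "stop": index.stop,
        "step": index.step,
    }


def range_from_descriptor(descr):
    """Rebuild the index positions as a built-in range (stand-in only)."""
    if descr["kind"] != "range":
        raise ValueError("not a range descriptor")
    return range(descr["start"], descr["stop"], descr["step"])


# Stand-in for pd.RangeIndex(0, 10, 2); pandas is deliberately not imported.
index = SimpleNamespace(name=None, start=0, stop=10, step=2)
descr = range_index_descriptor(index)
values = list(range_from_descriptor(descr))
```

No values are written to the file in this case: the four fields are enough to regenerate every position.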

Other index types must be serialized as data columns along with the other DataFrame columns. The descriptor for each of these is a string giving the name of the field in the data columns, for example '__index_level_0__'.

If an index has a non-None name attribute, and there is no other column with a name matching that value, then the index.name value can be used as the descriptor. Otherwise (for unnamed indexes and ones whose names collide with other column names) a disambiguating name matching the pattern __index_level_\d+__ should be used. For named indexes stored as data columns, the name attribute is always stored in the column descriptors as above.
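
The naming rule can be sketched as a small function. The helper name `index_field_names` is hypothetical, and using the index level's position as the number in the generated name is an assumption of this sketch; the rule itself only requires a name matching the pattern.

```python
import re

INDEX_LEVEL_PATTERN = re.compile(r"__index_level_\d+__")


def index_field_names(index_names, column_names):
    """Pick a field name for each index level stored as a data column."""
    taken = set(column_names)
    result = []
    for level, name in enumerate(index_names):
        if name is not None and name not in taken:
            # Named index with no collision: the name itself is the descriptor.
            result.append(name)
        else:
            # Unnamed or colliding: fall back to a disambiguating name.
            # Numbering by level position is an assumption of this sketch.
            result.append(f"__index_level_{level}__")
    return result


names = index_field_names([None, "a", "b"], ["b", "c"])
```

Here the first level is unnamed and the third collides with the data column 'b', so both receive generated names while 'a' is kept.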

Column metadata#

pandas_type is the logical type of the column, and is one of:

Boolean: 'bool'
Integers: 'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'
Floats: 'float16', 'float32', 'float64'
Date and Time Types: 'datetime', 'datetimetz', 'timedelta'
String: 'unicode', 'bytes'
Categorical: 'categorical'
Other Python objects: 'object'

The numpy_type is the physical storage type of the column, which is the result of str(dtype) for the underlying NumPy array that holds the data. For datetimetz this is datetime64[ns]; for categorical, it may be any of the supported integer categorical types.

The metadata field is None except for:

datetimetz: {'timezone': zone, 'unit': 'ns'}, e.g. {'timezone': 'America/New_York', 'unit': 'ns'}. The 'unit' is optional, and if omitted it is assumed to be nanoseconds.
categorical: {'num_categories': K, 'ordered': is_ordered, 'type': $TYPE}
- Here 'type' is optional, and can be a nested pandas type specification (but not categorical)
unicode: {'encoding': encoding}
- The encoding is optional, and if not present is UTF-8
object: {'encoding': encoding}. Objects can be serialized and stored in BYTE_ARRAY Parquet columns. The encoding can be one of:
- 'pickle'
- 'bson'
- 'json'
timedelta: {'unit': 'ns'}. The 'unit' is optional, and if omitted it is assumed to be nanoseconds. This metadata is optional altogether.

For types other than these, the 'metadata' key can be omitted. Implementations can assume None if the key is not present.
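
A reader has to apply these defaults when optional entries are missing. The sketch below shows one way to do that; the function name `effective_metadata` is hypothetical, and returning `None` when nothing applies is a choice made for this example.

```python
def effective_metadata(pandas_type, metadata=None):
    """Fill in the documented defaults for optional metadata entries."""
    filled = dict(metadata or {})
    if pandas_type in ("datetimetz", "timedelta"):
        # An omitted unit is assumed to be nanoseconds.
        filled.setdefault("unit", "ns")
    elif pandas_type == "unicode":
        # An omitted encoding is UTF-8.
        filled.setdefault("encoding", "UTF-8")
    # Other types pass through unchanged; an empty result means None.
    return filled or None


tz_meta = effective_metadata("datetimetz", {"timezone": "America/Los_Angeles"})
int_meta = effective_metadata("int8")
```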

As an example of fully-formed metadata:

{'index_columns': ['__index_level_0__'],
 'column_indexes': [
     {'name': None,
      'field_name': 'None',
      'pandas_type': 'unicode',
      'numpy_type': 'object',
      'metadata': {'encoding': 'UTF-8'}}
 ],
 'columns': [
     {'name': 'c0',
      'field_name': 'c0',
      'pandas_type': 'int8',
      'numpy_type': 'int8',
      'metadata': None},
     {'name': 'c1',
      'field_name': 'c1',
      'pandas_type': 'bytes',
      'numpy_type': 'object',
      'metadata': None},
     {'name': 'c2',
      'field_name': 'c2',
      'pandas_type': 'categorical',
      'numpy_type': 'int16',
      'metadata': {'num_categories': 1000, 'ordered': False}},
     {'name': 'c3',
      'field_name': 'c3',
      'pandas_type': 'datetimetz',
      'numpy_type': 'datetime64[ns]',
      'metadata': {'timezone': 'America/Los_Angeles'}},
     {'name': 'c4',
      'field_name': 'c4',
      'pandas_type': 'object',
      'numpy_type': 'object',
      'metadata': {'encoding': 'pickle'}},
     {'name': None,
      'field_name': '__index_level_0__',
      'pandas_type': 'int64',
      'numpy_type': 'int64',
      'metadata': None}
 ],
 'pandas_version': '1.4.0',
 'creator': {
   'library': 'pyarrow',
   'version': '0.13.0'
 }}
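
To show how a consumer might use such metadata, the sketch below separates the serialized index fields from the ordinary data fields. It runs on a trimmed copy of the example above; the helper name `split_fields` is hypothetical, and treating dictionary descriptors as metadata-only indexes follows the descriptor rules given earlier.

```python
def split_fields(meta):
    """Return (data_fields, index_fields, range_descriptors)."""
    # String descriptors name index levels stored as data columns.
    index_fields = [d for d in meta["index_columns"] if isinstance(d, str)]
    # Dictionary descriptors (e.g. kind 'range') carry no stored data.
    range_descriptors = [d for d in meta["index_columns"] if isinstance(d, dict)]
    data_fields = [
        col["field_name"]
        for col in meta["columns"]
        if col["field_name"] not in index_fields
    ]
    return data_fields, index_fields, range_descriptors


# Trimmed copy of the example metadata, keeping only the keys used here.
example = {
    "index_columns": ["__index_level_0__"],
    "columns": [
        {"name": "c0", "field_name": "c0"},
        {"name": "c1", "field_name": "c1"},
        {"name": None, "field_name": "__index_level_0__"},
    ],
}

data_fields, index_fields, range_descriptors = split_fields(example)
```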