This section will focus on downstream applications of pandas.
The Apache Parquet format provides key-value metadata at the file and column level, stored in the footer of the Parquet file:
5: optional list<KeyValue> key_value_metadata
where KeyValue is
struct KeyValue {
  1: required string key
  2: optional string value
}
So that a pandas.DataFrame can be faithfully reconstructed, we store a pandas metadata key in the FileMetaData, with the value stored as:
{'index_columns': [<descr0>, <descr1>, ...],
 'column_indexes': [<ci0>, <ci1>, ..., <ciN>],
 'columns': [<c0>, <c1>, ...],
 'pandas_version': $VERSION,
 'creator': {
   'library': $LIBRARY,
   'version': $LIBRARY_VERSION
 }}
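As a rough illustration of how this value could travel through the key-value metadata, the sketch below serializes a metadata dictionary to JSON and stores it under a pandas key in a plain Python dict that stands in for FileMetaData.key_value_metadata. The helper names attach_pandas_metadata and read_pandas_metadata are hypothetical, and both the bytes key and the UTF-8 JSON encoding are assumptions of this sketch rather than requirements stated here.

```python
import json

PANDAS_KEY = b"pandas"  # assumed key name, held as bytes in this sketch


def attach_pandas_metadata(key_value_metadata, pandas_metadata):
    """Return a copy of the key-value mapping with the pandas entry added."""
    updated = dict(key_value_metadata)
    # Assumption: the value is JSON text encoded as UTF-8.
    updated[PANDAS_KEY] = json.dumps(pandas_metadata).encode("utf-8")
    return updated


def read_pandas_metadata(key_value_metadata):
    """Decode the pandas entry, or return None if there is none."""
    raw = key_value_metadata.get(PANDAS_KEY)
    if raw is None:
        return None
    return json.loads(raw.decode("utf-8"))


metadata = {
    "index_columns": [
        {"kind": "range", "name": None, "start": 0, "stop": 3, "step": 1}
    ],
    "column_indexes": [],
    "columns": [
        {"name": "c0", "field_name": "c0", "pandas_type": "int8",
         "numpy_type": "int8", "metadata": None},
    ],
    "pandas_version": "0.20.0",
    # Hypothetical creator, used only for illustration.
    "creator": {"library": "example-writer", "version": "0.0.1"},
}

footer = attach_pandas_metadata({b"other_key": b"other_value"}, metadata)
round_tripped = read_pandas_metadata(footer)
```

Keeping unrelated keys untouched matters because other tools may store their own entries in the same key-value list.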
The “descriptor” values <descr0> in the 'index_columns' field are strings (referring to a column) or dictionaries with values as described below.
The <c0>/<ci0> and so forth are dictionaries containing the metadata for each column, including the index columns. This has JSON form:
{'name': column_name, 'field_name': parquet_column_name, 'pandas_type': pandas_type, 'numpy_type': numpy_type, 'metadata': metadata}
See below for the detailed specification for these.
RangeIndex can be stored as metadata only, not requiring serialization. The descriptor format for these is as follows:
index = pd.RangeIndex(0, 10, 2)
{
    "kind": "range",
    "name": index.name,
    "start": index.start,
    "stop": index.stop,
    "step": index.step,
}
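The round trip for a range descriptor can be sketched without pandas by using Python's built-in range, which exposes the same start, stop, and step attributes as pd.RangeIndex. The function names here are hypothetical, and since a built-in range has no name attribute, the name is passed separately as an assumption of this sketch.

```python
def range_index_descriptor(index, name=None):
    """Build the metadata-only descriptor for a range-like index."""
    return {
        "kind": "range",
        "name": name,
        "start": index.start,
        "stop": index.stop,
        "step": index.step,
    }


def range_from_descriptor(descriptor):
    """Rebuild the range and its name from a descriptor."""
    if descriptor.get("kind") != "range":
        raise ValueError("not a range descriptor")
    rebuilt = range(descriptor["start"], descriptor["stop"], descriptor["step"])
    return rebuilt, descriptor["name"]


descriptor = range_index_descriptor(range(0, 10, 2), name="row")
rebuilt, rebuilt_name = range_from_descriptor(descriptor)
```

Because the descriptor carries everything needed to rebuild the index, no data column has to be written for it.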
Other index types must be serialized as data columns along with the other DataFrame columns. The metadata for these is a string indicating the name of the field in the data columns, for example '__index_level_0__'.
If an index has a non-None name attribute, and there is no other column with a name matching that value, then the index.name value can be used as the descriptor. Otherwise (for unnamed indexes and ones whose names collide with other column names), a disambiguating name matching the pattern __index_level_\d+__ should be used. When a named index is stored as a data column, its name attribute is always stored in the column descriptors as above.
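The naming rule above can be sketched as a small helper. This is an illustration only: index_field_name and is_generated_index_name are hypothetical names, and the sketch assumes index names are either None or strings.

```python
import re

# Pattern for the disambiguating names described in the text.
INDEX_LEVEL_PATTERN = re.compile(r"__index_level_\d+__")


def index_field_name(index_name, level, column_names):
    """Pick the descriptor string for a serialized index level."""
    # A named index keeps its name unless it collides with a data column.
    if index_name is not None and index_name not in column_names:
        return index_name
    # Unnamed or colliding indexes get a generated, level-numbered name.
    return f"__index_level_{level}__"


def is_generated_index_name(field_name):
    """Report whether a field name follows the generated pattern."""
    return INDEX_LEVEL_PATTERN.fullmatch(field_name) is not None
```

For example, an index named 'a' stored next to a data column 'a' would fall back to a generated name, while an index named 'id' would keep its name.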
pandas_type is the logical type of the column, and is one of:
Boolean: 'bool'
Integers: 'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'
Floats: 'float16', 'float32', 'float64'
Date and Time Types: 'datetime', 'datetimetz', 'timedelta'
String: 'unicode', 'bytes'
Categorical: 'categorical'
Other Python objects: 'object'
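A reader of this metadata may want to check that a pandas_type value is one of the names listed above. The following is a minimal sketch of such a check; the constant and function names are hypothetical, and the grouping keys are informal labels chosen for this example.

```python
# Logical type names, grouped as in the list above.
PANDAS_TYPES_BY_KIND = {
    "boolean": ("bool",),
    "integer": ("int8", "int16", "int32", "int64",
                "uint8", "uint16", "uint32", "uint64"),
    "float": ("float16", "float32", "float64"),
    "datetime": ("datetime", "datetimetz", "timedelta"),
    "string": ("unicode", "bytes"),
    "categorical": ("categorical",),
    "object": ("object",),
}

# Flattened set for quick membership checks.
PANDAS_TYPES = frozenset(
    name for names in PANDAS_TYPES_BY_KIND.values() for name in names
)


def is_known_pandas_type(value):
    """Return True if value is one of the listed pandas_type names."""
    return value in PANDAS_TYPES
```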
The numpy_type is the physical storage type of the column, which is the result of str(dtype) for the underlying NumPy array that holds the data. So for datetimetz this is datetime64[ns] and for categorical, it may be any of the supported integer categorical types.
The metadata field is None except for:
datetimetz: {'timezone': zone, 'unit': 'ns'}, e.g. {'timezone': 'America/New_York', 'unit': 'ns'}. The 'unit' is optional, and if omitted it is assumed to be nanoseconds.
categorical: {'num_categories': K, 'ordered': is_ordered, 'type': $TYPE}
Here 'type' is optional, and can be a nested pandas type specification (but not categorical).
unicode: {'encoding': encoding}
The encoding is optional, and if not present is UTF-8.
object: {'encoding': encoding}. Objects can be serialized and stored in BYTE_ARRAY Parquet columns. The encoding can be one of:
'pickle'
'bson'
'json'
timedelta: {'unit': 'ns'}. The 'unit' is optional, and if omitted it is assumed to be nanoseconds. This metadata is optional altogether.
For types other than these, the 'metadata' key can be omitted. Implementations can assume None if the key is not present.
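The defaults described above (a missing 'metadata' key treated as None, nanoseconds when 'unit' is omitted, UTF-8 when 'encoding' is omitted) can be applied when reading a column descriptor. The sketch below is one possible way to do that; resolve_column_metadata is a hypothetical name, and raising an error for a datetimetz column without a 'timezone' is a choice made for this example.

```python
def resolve_column_metadata(column):
    """Return the column's metadata with the documented defaults filled in."""
    pandas_type = column["pandas_type"]
    metadata = column.get("metadata")  # a missing key is treated as None

    if pandas_type in ("datetimetz", "timedelta"):
        resolved = dict(metadata or {})
        resolved.setdefault("unit", "ns")  # nanoseconds when omitted
        # Sketch-level choice: only 'unit' is optional for datetimetz.
        if pandas_type == "datetimetz" and "timezone" not in resolved:
            raise ValueError("datetimetz metadata needs a 'timezone'")
        return resolved

    if pandas_type == "unicode":
        resolved = dict(metadata or {})
        resolved.setdefault("encoding", "UTF-8")  # UTF-8 when omitted
        return resolved

    # categorical, object, and all other types are returned unchanged.
    return metadata
```

Copying the dictionary before filling in defaults leaves the original descriptor untouched, so it can still be written back exactly as it was read.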
As an example of fully-formed metadata:
{'index_columns': ['__index_level_0__'],
 'column_indexes': [
     {'name': None,
      'field_name': 'None',
      'pandas_type': 'unicode',
      'numpy_type': 'object',
      'metadata': {'encoding': 'UTF-8'}}
 ],
 'columns': [
     {'name': 'c0',
      'field_name': 'c0',
      'pandas_type': 'int8',
      'numpy_type': 'int8',
      'metadata': None},
     {'name': 'c1',
      'field_name': 'c1',
      'pandas_type': 'bytes',
      'numpy_type': 'object',
      'metadata': None},
     {'name': 'c2',
      'field_name': 'c2',
      'pandas_type': 'categorical',
      'numpy_type': 'int16',
      'metadata': {'num_categories': 1000, 'ordered': False}},
     {'name': 'c3',
      'field_name': 'c3',
      'pandas_type': 'datetimetz',
      'numpy_type': 'datetime64[ns]',
      'metadata': {'timezone': 'America/Los_Angeles'}},
     {'name': 'c4',
      'field_name': 'c4',
      'pandas_type': 'object',
      'numpy_type': 'object',
      'metadata': {'encoding': 'pickle'}},
     {'name': None,
      'field_name': '__index_level_0__',
      'pandas_type': 'int64',
      'numpy_type': 'int64',
      'metadata': None}
 ],
 'pandas_version': '0.20.0',
 'creator': {
   'library': 'pyarrow',
   'version': '0.13.0'
 }}
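When reading metadata like this, a consumer typically needs to tell ordinary data columns apart from index levels that were serialized as columns. A minimal sketch of that step follows, using a trimmed copy of the example; split_columns is a hypothetical helper, and the sketch simply skips dictionary descriptors (such as range indexes) since they have no data column.

```python
def split_columns(pandas_metadata):
    """Separate data columns from serialized index columns by field name."""
    # String descriptors name the fields that hold index data.
    index_fields = {
        descr for descr in pandas_metadata["index_columns"]
        if isinstance(descr, str)
    }
    data_fields, serialized_index_fields = [], []
    for column in pandas_metadata["columns"]:
        if column["field_name"] in index_fields:
            serialized_index_fields.append(column["field_name"])
        else:
            data_fields.append(column["field_name"])
    return data_fields, serialized_index_fields


# Trimmed version of the example metadata (two data columns, one index).
example = {
    "index_columns": ["__index_level_0__"],
    "columns": [
        {"name": "c0", "field_name": "c0", "pandas_type": "int8",
         "numpy_type": "int8", "metadata": None},
        {"name": "c1", "field_name": "c1", "pandas_type": "bytes",
         "numpy_type": "object", "metadata": None},
        {"name": None, "field_name": "__index_level_0__",
         "pandas_type": "int64", "numpy_type": "int64", "metadata": None},
    ],
}

data_fields, serialized_index_fields = split_columns(example)
```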