pandas.DataFrame.to_orc

DataFrame.to_orc(path=None, *, engine='pyarrow', index=None, engine_kwargs=None)

Write a DataFrame to the Optimized Row Columnar (ORC) format.

ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It provides efficient compression and encoding schemes, making it well-suited for large-scale data storage and analytics. This method requires the pyarrow library.

Parameters:

path : str, file-like object or None, default None

If a string, it will be used as the root directory path when writing a partitioned dataset. By file-like object, we refer to objects with a write() method, such as a file handle (e.g. via builtin open function). If path is None, a bytes object is returned.

The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.orc. A remote example could be: s3://bucket/path/to/table.orc.

Certain URL schemes may require additional packages. For example, S3 URLs require the s3fs library. See Optional dependencies for a full list.

engine : {‘pyarrow’}, default ‘pyarrow’

ORC library to use.

index : bool, optional

If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, the behavior is similar to infer: the dataframe’s index(es) will be saved, but a RangeIndex is stored as a range in the metadata instead of as values, so it requires little space and is faster to write. Other indexes will be included as columns in the file output.

engine_kwargs : dict[str, Any] or None, default None

Additional keyword arguments passed to pyarrow.orc.write_table().

Returns:

bytes or None: A bytes object containing the DataFrame data if path is None; otherwise None.

Raises:

NotImplementedError: Dtype of one or more columns is category, unsigned integers, interval, period or sparse.
ValueError: engine is not pyarrow.

See also

read_orc: Read an ORC file.
DataFrame.to_parquet: Write a Parquet file.
DataFrame.to_csv: Write a CSV file.
DataFrame.to_sql: Write to a SQL table.
DataFrame.to_hdf: Write to HDF.

Notes

Find more information on ORC here.
Before using this function you should read the user guide about ORC and install optional dependencies.
This function requires the pyarrow library.
For supported dtypes, refer to the supported ORC features in Arrow.
Currently, timezones in datetime columns are not preserved when a dataframe is converted into ORC files.

Examples

>>> df = pd.DataFrame(data={"col1": [1, 2], "col2": [4, 3]})
>>> df.to_orc("df.orc")
>>> pd.read_orc("df.orc")
   col1  col2
0     1     4
1     2     3

If you want a buffer holding the ORC content, call to_orc() without a path and wrap the returned bytes in io.BytesIO:

>>> import io
>>> b = io.BytesIO(df.to_orc())
>>> b.seek(0)
0
>>> content = b.read()