IO Tools (Text, CSV, HDF5, …)¶
The pandas I/O API is a set of top level reader
functions accessed like
pandas.read_csv()
that generally return a pandas object. The corresponding
writer
functions are object methods that are accessed like
DataFrame.to_csv()
. Below is a table containing available readers
and
writers
.
Format Type | Data Description | Reader | Writer |
---|---|---|---|
text | CSV | read_csv | to_csv |
text | JSON | read_json | to_json |
text | HTML | read_html | to_html |
text | Local clipboard | read_clipboard | to_clipboard |
binary | MS Excel | read_excel | to_excel |
binary | HDF5 Format | read_hdf | to_hdf |
binary | Feather Format | read_feather | to_feather |
binary | Parquet Format | read_parquet | to_parquet |
binary | Msgpack | read_msgpack | to_msgpack |
binary | Stata | read_stata | to_stata |
binary | SAS | read_sas | |
binary | Python Pickle Format | read_pickle | to_pickle |
SQL | SQL | read_sql | to_sql |
SQL | Google Big Query | read_gbq | to_gbq |
Here is an informal performance comparison for some of these IO methods.
Note
For examples that use the StringIO
class, make sure you import it
according to your Python version, i.e. from StringIO import StringIO
for
Python 2 and from io import StringIO
for Python 3.
CSV & Text files¶
The two workhorse functions for reading text files (a.k.a. flat files) are
read_csv()
and read_table()
. They both use the same parsing code to
intelligently convert tabular data into a DataFrame
object. See the
cookbook for some advanced strategies.
Parsing options¶
The functions read_csv()
and read_table()
accept the following
common arguments:
Basic¶
- filepath_or_buffer : various
- Either a path to a file (a
str
,pathlib.Path
, orpy._path.local.LocalPath
), URL (including http, ftp, and S3 locations), or any object with aread()
method (such as an open file orStringIO
). - sep : str, defaults to
','
forread_csv()
,\t
forread_table()
- Delimiter to use. If sep is
None
, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool,csv.Sniffer
. In addition, separators longer than 1 character and different from'\s+'
will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example:'\\r\\t'
. - delimiter : str, default
None
- Alternative argument name for sep.
- delim_whitespace : boolean, default False
Specifies whether or not whitespace (e.g.
' '
or'\t'
) will be used as the delimiter. Equivalent to settingsep='\s+'
. If this option is set toTrue
, nothing should be passed in for thedelimiter
parameter.New in version 0.18.1: support for the Python parser.
Column and Index Locations and Names¶
- header : int or list of ints, default
'infer'
Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to
header=0
and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical toheader=None
. Explicitly passheader=0
to be able to replace existing names.The header can be a list of ints that specify row locations for a multi-index on the columns e.g.
[0,1,3]
. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines ifskip_blank_lines=True
, so header=0 denotes the first line of data rather than the first line of the file.- names : array-like, default
None
- List of column names to use. If file contains no header row, then you should
explicitly pass
header=None
. Duplicates in this list will cause aUserWarning
to be issued. - index_col : int or sequence or
False
, defaultNone
- Column to use as the row labels of the
DataFrame
. If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters at the end of each line, you might considerindex_col=False
to force pandas to not use the first column as the index (row names). - usecols : list-like or callable, default
None
Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid list-like usecols parameter would be
[0, 1, 2]
or['foo', 'bar', 'baz']
.Element order is ignored, so
usecols=[0, 1]
is the same as[1, 0]
. To instantiate a DataFrame fromdata
with element order preserved usepd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]
for columns in['foo', 'bar']
order orpd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]
for['bar', 'foo']
order.If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True:
In [1]: data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3' In [2]: pd.read_csv(StringIO(data)) Out[2]: col1 col2 col3 0 a b 1 1 a b 2 2 c d 3 In [3]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ['COL1', 'COL3'])