IO Tools (Text, CSV, HDF5, …)¶
The pandas I/O API is a set of top level reader
functions accessed like pd.read_csv()
that generally return a pandas
object. The corresponding writer
functions are object methods that are accessed like df.to_csv()
Format Type | Data Description | Reader | Writer |
---|---|---|---|
text | CSV | read_csv | to_csv |
text | JSON | read_json | to_json |
text | HTML | read_html | to_html |
text | Local clipboard | read_clipboard | to_clipboard |
binary | MS Excel | read_excel | to_excel |
binary | HDF5 Format | read_hdf | to_hdf |
binary | Feather Format | read_feather | to_feather |
binary | Msgpack | read_msgpack | to_msgpack |
binary | Stata | read_stata | to_stata |
binary | SAS | read_sas | |
binary | Python Pickle Format | read_pickle | to_pickle |
SQL | SQL | read_sql | to_sql |
SQL | Google Big Query | read_gbq | to_gbq |
Here is an informal performance comparison for some of these IO methods.
Note
For examples that use the StringIO
class, make sure you import it
according to your Python version, i.e. from StringIO import StringIO
for
Python 2 and from io import StringIO
for Python 3.
CSV & Text files¶
The two workhorse functions for reading text files (a.k.a. flat files) are
read_csv()
and read_table()
. They both use the same parsing code to
intelligently convert tabular data into a DataFrame object. See the
cookbook for some advanced strategies.
Parsing options¶
read_csv()
and read_table()
accept the following arguments:
Basic¶
- filepath_or_buffer : various
- Either a path to a file (a
str
,pathlib.Path
, orpy._path.local.LocalPath
), URL (including http, ftp, and S3 locations), or any object with aread()
method (such as an open file orStringIO
). - sep : str, defaults to
','
forread_csv()
,\t
forread_table()
- Delimiter to use. If sep is
None
, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used automatically. In addition, separators longer than 1 character and different from'\s+'
will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example:'\\r\\t'
. - delimiter : str, default
None
- Alternative argument name for sep.
- delim_whitespace : boolean, default False
Specifies whether or not whitespace (e.g.
' '
or'\t'
) will be used as the delimiter. Equivalent to settingsep='\s+'
. If this option is set to True, nothing should be passed in for thedelimiter
parameter.New in version 0.18.1: support for the Python parser.
Column and Index Locations and Names¶
- header : int or list of ints, default
'infer'
- Row number(s) to use as the column names, and the start of the data. Default
behavior is as if
header=0
if nonames
passed, otherwise as ifheader=None
. Explicitly passheader=0
to be able to replace existing names. The header can be a list of ints that specify row locations for a multi-index on the columns e.g.[0,1,3]
. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines ifskip_blank_lines=True
, so header=0 denotes the first line of data rather than the first line of the file. - names : array-like, default
None
- List of column names to use. If file contains no header row, then you should
explicitly pass
header=None
. Duplicates in this list are not allowed unlessmangle_dupe_cols=True
, which is the default. - index_col : int or sequence or
False
, defaultNone
- Column to use as the row labels of the DataFrame. If a sequence is given, a
MultiIndex is used. If you have a malformed file with delimiters at the end of
each line, you might consider
index_col=False
to force pandas to not use the first column as the index (row names). - usecols : array-like or callable, default
None
Return a subset of the columns. If array-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid array-like usecols parameter would be [0, 1, 2] or [‘foo’, ‘bar’, ‘baz’].
If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True:
In [1]: data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3' In [2]: pd.read_csv(StringIO(data)) Out[2]: col1 col2 col3 0 a b 1 1 a b 2 2 c d 3 In [3]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ['COL1', 'COL3'])