.. _10min_tut_06_stats:

{{ header }}

.. ipython:: python

    import pandas as pd
Data used for this tutorial:
.. include:: includes/titanic.rst

.. ipython:: python

    titanic = pd.read_csv("data/titanic.csv")
    titanic.head()
How to calculate summary statistics
-----------------------------------

Aggregating statistics
~~~~~~~~~~~~~~~~~~~~~~

.. image:: ../../_static/schemas/06_aggregate.svg
   :align: center
• What is the average age of the Titanic passengers?

  .. ipython:: python

      titanic["Age"].mean()
Different statistics are available and can be applied to columns with
numerical data. Operations in general exclude missing data and operate
across rows by default.

.. image:: ../../_static/schemas/06_reduction.svg
   :align: center
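The defaults described above can be checked on a small frame. The following is a minimal sketch that uses hypothetical toy values (not the Titanic data) to show how ``skipna`` and ``axis`` change the result of ``mean``:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen only to make the arithmetic easy to follow.
toy = pd.DataFrame(
    {
        "Age": [20.0, 30.0, np.nan, 40.0],
        "Fare": [7.0, 8.0, 9.0, 10.0],
    }
)

# Default behaviour: the missing age is skipped, so the mean uses 3 values.
mean_default = toy["Age"].mean()

# With skipna=False the missing value propagates and the result is NaN.
mean_strict = toy["Age"].mean(skipna=False)

# Default axis=0 reduces across rows and returns one value per column.
per_column = toy.mean()

# axis=1 reduces across columns and returns one value per row.
per_row = toy.mean(axis=1)

print(mean_default, mean_strict)
print(per_column)
print(per_row)
```

Note that the third row still gets a value with ``axis=1``, because the missing age is skipped there as well.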
• What is the median age and ticket fare price of the Titanic passengers?

  .. ipython:: python

      titanic[["Age", "Fare"]].median()

  The statistic applied to multiple columns of a ``DataFrame`` (the
  selection of two columns returns a ``DataFrame``, see the
  :ref:`subset data tutorial <10min_tut_03_subset>`) is calculated for
  each numeric column.
The aggregating statistic can be calculated for multiple columns at the
same time. Remember the ``describe`` method from the
:ref:`first tutorial <10min_tut_01_tableoriented>`?

.. ipython:: python

    titanic[["Age", "Fare"]].describe()

Instead of the predefined statistics, specific combinations of
aggregating statistics for given columns can be defined using the
:meth:`DataFrame.agg` method:

.. ipython:: python

    titanic.agg(
        {
            "Age": ["min", "max", "median", "skew"],
            "Fare": ["min", "max", "median", "mean"],
        }
    )
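When the requested statistics differ per column, the result contains one row per statistic and a missing value wherever a statistic was not requested for a column. A minimal sketch on hypothetical toy values (not the Titanic data) illustrates this:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.25, 71.28, 7.92, 53.10],
    }
)

# "max" is requested only for Age and "mean" only for Fare.
summary = toy.agg({"Age": ["min", "max"], "Fare": ["min", "mean"]})

# The result is indexed by statistic name, with one column per input column.
# Cells for combinations that were not requested are NaN.
print(summary)
```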
To user guide Details about descriptive statistics are provided in the user guide section on :ref:`descriptive statistics `. .. raw:: html
Aggregating statistics grouped by category
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. image:: ../../_static/schemas/06_groupby.svg
   :align: center
• What is the average age for male versus female Titanic passengers?

  .. ipython:: python

      titanic[["Sex", "Age"]].groupby("Sex").mean()

  As our interest is the average age for each gender, a subselection on
  these two columns is made first: ``titanic[["Sex", "Age"]]``. Next, the
  :meth:`~DataFrame.groupby` method is applied on the ``Sex`` column to
  make a group per category. The average age *for each gender* is
  calculated and returned.
Calculating a given statistic (e.g. ``mean`` age) *for each category in
a column* (e.g. male/female in the ``Sex`` column) is a common pattern.
The ``groupby`` method is used to support this type of operation. This
fits in the more general ``split-apply-combine`` pattern:

- **Split** the data into groups
- **Apply** a function to each group independently
- **Combine** the results into a data structure

The apply and combine steps are typically done together in pandas.

In the previous example, we explicitly selected the 2 columns first.
Without that selection, passing ``numeric_only=True`` applies the
``mean`` method to every column containing numerical data:

.. ipython:: python

    titanic.groupby("Sex").mean(numeric_only=True)

It does not make much sense to get the average value of the ``Pclass``.
If we are only interested in the average age for each gender, the
selection of columns (square brackets ``[]`` as usual) is supported
on the grouped data as well:

.. ipython:: python

    titanic.groupby("Sex")["Age"].mean()

.. image:: ../../_static/schemas/06_groupby_select_detail.svg
   :align: center

.. note::
    The ``Pclass`` column contains numerical data but actually
    represents 3 categories (or factors) with respectively the labels ‘1’,
    ‘2’ and ‘3’. Calculating statistics on these does not make much
    sense. Therefore, pandas provides a ``Categorical`` data type to
    handle this type of data. More information is provided in the user
    guide :ref:`categorical` section.
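To make the split-apply-combine steps concrete, the sketch below spells them out by hand on hypothetical toy values (not the Titanic data) and compares the outcome with the ``groupby`` one-liner. It is meant as an illustration of the pattern, not as a description of how pandas implements it internally:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, 38.0, 26.0, 35.0],
    }
)

# Split: iterating over a groupby yields (category, sub-DataFrame) pairs.
groups = {sex: group for sex, group in toy.groupby("Sex")}

# Apply: compute the statistic for each group independently.
means = {sex: group["Age"].mean() for sex, group in groups.items()}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(means, name="Age").rename_axis("Sex")

# The same result in one step, with apply and combine done together.
builtin = toy.groupby("Sex")["Age"].mean()

print(manual)
print(builtin)
```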
• What is the mean ticket fare price for each of the sex and cabin class combinations?

  .. ipython:: python

      titanic.groupby(["Sex", "Pclass"])["Fare"].mean()

  Grouping can be done by multiple columns at the same time. Provide the
  column names as a list to the :meth:`~DataFrame.groupby` method.
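Grouping by several columns returns a result indexed by every observed combination of the group keys. The following sketch, on hypothetical toy values (not the Titanic data), shows how to look up one combination and how ``unstack`` can turn the result into a table with one column per class:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male", "male"],
        "Pclass": [3, 1, 3, 1, 3],
        "Fare": [7.0, 71.0, 8.0, 53.0, 9.0],
    }
)

# One mean fare per (Sex, Pclass) combination; the index has two levels.
fares = toy.groupby(["Sex", "Pclass"])["Fare"].mean()

# A single combination is selected with a tuple of keys.
male_third_class = fares.loc[("male", 3)]

# Move the Pclass level to the columns to get a Sex-by-Pclass table.
table = fares.unstack("Pclass")

print(fares)
print(table)
```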
To user guide A full description on the split-apply-combine approach is provided in the user guide section on :ref:`groupby operations `. .. raw:: html
Count number of records by category
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. image:: ../../_static/schemas/06_valuecounts.svg
   :align: center
• What is the number of passengers in each of the cabin classes?

  .. ipython:: python

      titanic["Pclass"].value_counts()

  The :meth:`~Series.value_counts` method counts the number of records for
  each category in a column.
The method is a shortcut, as it is actually a groupby operation in
combination with counting the number of records within each group:

.. ipython:: python

    titanic.groupby("Pclass")["Pclass"].count()

.. note::
    Both ``size`` and ``count`` can be used in combination with
    ``groupby``. Whereas ``size`` includes ``NaN`` values and just provides
    the number of rows (size of the table), ``count`` excludes the missing
    values. In the ``value_counts`` method, use the ``dropna`` argument to
    include or exclude the ``NaN`` values.
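The difference between ``size`` and ``count``, and the effect of ``dropna`` in ``value_counts``, only shows up when missing values are present. A minimal sketch on hypothetical toy values (not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing ages, for illustration only.
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 3, 3, 3],
        "Age": [38.0, np.nan, 22.0, 26.0, np.nan],
    }
)

# size: number of rows per group, missing values included.
sizes = toy.groupby("Pclass")["Age"].size()

# count: number of non-missing values per group.
counts = toy.groupby("Pclass")["Age"].count()

# value_counts drops NaN by default; dropna=False keeps it as its own entry.
without_nan = toy["Age"].value_counts()
with_nan = toy["Age"].value_counts(dropna=False)

print(sizes)
print(counts)
print(with_nan)
```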
To user guide The user guide has a dedicated section on ``value_counts`` , see the page on :ref:`discretization `. .. raw:: html

#### REMEMBER

- Aggregation statistics can be calculated on entire columns or rows.
- ``groupby`` provides the power of the *split-apply-combine* pattern.
- ``value_counts`` is a convenient shortcut to count the number of
  entries in each category of a variable.
To user guide A full description on the split-apply-combine approach is provided in the user guide pages about :ref:`groupby operations `. .. raw:: html