4.15. DataFrame Statistics

  • .count()

  • .value_counts()

  • .nunique()

  • .sum()

  • .cumsum()

  • .prod()

  • .cumprod()

  • .min()

  • .idxmin()

  • .cummin()

  • .max()

  • .idxmax()

  • .cummax()

  • .mean()

  • .median()

  • .mode()

  • .rolling(window=3).mean()

  • .abs()

  • .std()

  • .mad() - Mean absolute deviation

  • .sem() - Standard Error of the Mean (SEM)

  • .skew() - Skewness (3rd moment)

  • .kurt() - Kurtosis (4th moment)

  • .quantile(.33) - Sample quantile (value at %). Quantile also known as Percentile

  • .quantile([.25, .5, .75])

  • .var()

  • .corr()

  • .describe()

  • .nsmallest()

  • .values()

  • .rank()

4.15.1. SetUp

>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>>
>>>
>>> df = pd.DataFrame(
...     columns = ['Morning', 'Noon', 'Evening', 'Midnight'],
...     index = pd.date_range('1999-12-30', periods=7),
...     data = np.random.randn(7, 4))
>>>
>>> df
             Morning      Noon   Evening  Midnight
1999-12-30  1.764052  0.400157  0.978738  2.240893
1999-12-31  1.867558 -0.977278  0.950088 -0.151357
2000-01-01 -0.103219  0.410599  0.144044  1.454274
2000-01-02  0.761038  0.121675  0.443863  0.333674
2000-01-03  1.494079 -0.205158  0.313068 -0.854096
2000-01-04 -2.552990  0.653619  0.864436 -0.742165
2000-01-05  2.269755 -1.454366  0.045759 -0.187184

4.15.2. Count

Number of non-null observations:

>>> df.count()
Morning     7
Noon        7
Evening     7
Midnight    7
dtype: int64
>>> df.value_counts()  
Morning    Noon       Evening   Midnight
-2.552990   0.653619  0.864436  -0.742165    1
-0.103219   0.410599  0.144044   1.454274    1
 0.761038   0.121675  0.443863   0.333674    1
 1.494079  -0.205158  0.313068  -0.854096    1
 1.764052   0.400157  0.978738   2.240893    1
 1.867558  -0.977278  0.950088  -0.151357    1
 2.269755  -1.454366  0.045759  -0.187184    1
dtype: int64
>>> df.nunique()
Morning     7
Noon        7
Evening     7
Midnight    7
dtype: int64

4.15.3. Sum

Sum of values:

df.sum() Morning 5.500273 Noon -1.050752 Evening 3.739996 Midnight 2.094039 dtype: float64

Cumulative sum:

>>> df.cumsum()
             Morning      Noon   Evening  Midnight
1999-12-30  1.764052  0.400157  0.978738  2.240893
1999-12-31  3.631610 -0.577121  1.928826  2.089536
2000-01-01  3.528391 -0.166522  2.072870  3.543809
2000-01-02  4.289429 -0.044847  2.516733  3.877484
2000-01-03  5.783508 -0.250005  2.829801  3.023388
2000-01-04  3.230518  0.403613  3.694237  2.281223
2000-01-05  5.500273 -1.050752  3.739996  2.094039

4.15.4. Product

Product of values:

>>> df.prod()
Morning     2.240538
Noon       -0.003810
Evening     0.000736
Midnight    0.019528
dtype: float64

Cumulative product:

>>> df.cumprod()
             Morning      Noon   Evening  Midnight
1999-12-30  1.764052  0.400157  0.978738  2.240893
1999-12-31  3.294470 -0.391065  0.929888 -0.339175
2000-01-01 -0.340051 -0.160571  0.133944 -0.493254
2000-01-02 -0.258792 -0.019537  0.059453 -0.164586
2000-01-03 -0.386656  0.004008  0.018613  0.140572
2000-01-04  0.987128  0.002620  0.016090 -0.104328
2000-01-05  2.240538 -0.003810  0.000736  0.019528

4.15.5. Extremes

Minimum, index of minimum and cumulative minimum:

>>> df.min()
Morning    -2.552990
Noon       -1.454366
Evening     0.045759
Midnight   -0.854096
dtype: float64
>>> df.idxmin()
Morning    2000-01-04
Noon       2000-01-05
Evening    2000-01-05
Midnight   2000-01-03
dtype: datetime64[ns]
>>> df.cummin()
             Morning      Noon   Evening  Midnight
1999-12-30  1.764052  0.400157  0.978738  2.240893
1999-12-31  1.764052 -0.977278  0.950088 -0.151357
2000-01-01 -0.103219 -0.977278  0.144044 -0.151357
2000-01-02 -0.103219 -0.977278  0.144044 -0.151357
2000-01-03 -0.103219 -0.977278  0.144044 -0.854096
2000-01-04 -2.552990 -0.977278  0.144044 -0.854096
2000-01-05 -2.552990 -1.454366  0.045759 -0.854096

Maximum, index of maximum and cumulative maximum:

>>> df.max()
Morning     2.269755
Noon        0.653619
Evening     0.978738
Midnight    2.240893
dtype: float64
>>> df.idxmax()
Morning    2000-01-05
Noon       2000-01-04
Evening    1999-12-30
Midnight   1999-12-30
dtype: datetime64[ns]
>>> df.cummax()
             Morning      Noon   Evening  Midnight
1999-12-30  1.764052  0.400157  0.978738  2.240893
1999-12-31  1.867558  0.400157  0.978738  2.240893
2000-01-01  1.867558  0.410599  0.978738  2.240893
2000-01-02  1.867558  0.410599  0.978738  2.240893
2000-01-03  1.867558  0.410599  0.978738  2.240893
2000-01-04  1.867558  0.653619  0.978738  2.240893
2000-01-05  2.269755  0.653619  0.978738  2.240893

4.15.6. Average

Arithmetic mean of values:

>>> df.mean()
Morning     0.785753
Noon       -0.150107
Evening     0.534285
Midnight    0.299148
dtype: float64

Arithmetic median of values:

>>> df.median()
Morning     1.494079
Noon        0.121675
Evening     0.443863
Midnight   -0.151357
dtype: float64

Mode:

>>> df.mode()
    Morning      Noon   Evening  Midnight
0 -2.552990 -1.454366  0.045759 -0.854096
1 -0.103219 -0.977278  0.144044 -0.742165
2  0.761038 -0.205158  0.313068 -0.187184
3  1.494079  0.121675  0.443863 -0.151357
4  1.764052  0.400157  0.864436  0.333674
5  1.867558  0.410599  0.950088  1.454274
6  2.269755  0.653619  0.978738  2.240893

Rolling Average:

>>> df.rolling(window=30)
Rolling [window=30,center=False,axis=0,method=single]
>>>
>>> df.rolling(window=3).mean()
             Morning      Noon   Evening  Midnight
1999-12-30       NaN       NaN       NaN       NaN
1999-12-31       NaN       NaN       NaN       NaN
2000-01-01  1.176130 -0.055507  0.690957  1.181270
2000-01-02  0.841792 -0.148335  0.512665  0.545530
2000-01-03  0.717299  0.109038  0.300325  0.311284
2000-01-04 -0.099291  0.190045  0.540456 -0.420862
2000-01-05  0.403615 -0.335302  0.407754 -0.594482
../../_images/pandas-dataframe-stats-rolling.png

Figure 4.9. Rolling Average

4.15.7. Distribution

Absolute value:

>>> df.abs()
             Morning      Noon   Evening  Midnight
1999-12-30  1.764052  0.400157  0.978738  2.240893
1999-12-31  1.867558  0.977278  0.950088  0.151357
2000-01-01  0.103219  0.410599  0.144044  1.454274
2000-01-02  0.761038  0.121675  0.443863  0.333674
2000-01-03  1.494079  0.205158  0.313068  0.854096
2000-01-04  2.552990  0.653619  0.864436  0.742165
2000-01-05  2.269755  1.454366  0.045759  0.187184

Standard deviation:

>>> df.std()
Morning     1.671798
Noon        0.787967
Evening     0.393169
Midnight    1.151785
dtype: float64
../../_images/pandas-dataframe-stats-stdev.png

Figure 4.10. Standard Deviation

Mean absolute deviation:

>>> df.mad()
Morning     1.214980
Noon        0.624708
Evening     0.340116
Midnight    0.894684
dtype: float64

Standard Error of the Mean (SEM):

>>> df.sem()
Morning     0.631880
Noon        0.297824
Evening     0.148604
Midnight    0.435334
dtype: float64
../../_images/pandas-dataframe-stats-sem.png

Figure 4.11. Standard Error of the Mean (SEM)

Skewness (3rd moment):

>>> df.skew()
Morning    -1.602706
Noon       -0.907414
Evening     0.031047
Midnight    0.915190
dtype: float64
../../_images/pandas-dataframe-stats-skew.png

Figure 4.12. Skewness

Kurtosis (4th moment):

>>> df.kurt()
Morning     2.502051
Noon       -0.588010
Evening    -2.208781
Midnight   -0.351782
dtype: float64
../../_images/pandas-dataframe-stats-kurt.png

Figure 4.13. Kurtosis

Sample quantile (value at %). Quantile also known as Percentile:

>>> df.quantile(.33)
Morning     0.743753
Noon       -0.220601
Evening     0.309687
Midnight   -0.198283
Name: 0.33, dtype: float64
>>> df.quantile([.25, .5, .75])
       Morning      Noon   Evening  Midnight
0.25  0.328909 -0.591218  0.228556 -0.464674
0.50  1.494079  0.121675  0.443863 -0.151357
0.75  1.815805  0.405378  0.907262  0.893974

Variance:

>>> df.var()
Morning     2.794907
Noon        0.620892
Evening     0.154582
Midnight    1.326610
dtype: float64

Correlation Coefficient:

>>> df.corr()
           Morning      Noon   Evening  Midnight
Morning   1.000000 -0.698340 -0.190219  0.201034
Noon     -0.698340  1.000000  0.307686  0.359761
Evening  -0.190219  0.307686  1.000000  0.136436
Midnight  0.201034  0.359761  0.136436  1.000000
../../_images/pandas-dataframe-stats-corr.png

Figure 4.14. Correlation Coefficient

4.15.8. Describe

>>> df.describe()
        Morning      Noon   Evening  Midnight
count  7.000000  7.000000  7.000000  7.000000
mean   0.785753 -0.150107  0.534285  0.299148
std    1.671798  0.787967  0.393169  1.151785
min   -2.552990 -1.454366  0.045759 -0.854096
25%    0.328909 -0.591218  0.228556 -0.464674
50%    1.494079  0.121675  0.443863 -0.151357
75%    1.815805  0.405378  0.907262  0.893974
max    2.269755  0.653619  0.978738  2.240893

4.15.9. Examples

>>> import pandas as pd
>>>
>>>
>>> DATA = 'https://python.astrotech.io/_static/phones-en.csv'
>>>
>>> df = pd.read_csv(DATA, parse_dates=['date'])
>>> df.drop(columns='index', inplace=True)
date

The date and time of the entry

duration

The duration (in seconds) for each call, the amount of data (in MB) for each data entry, and the number of texts sent (usually 1) for each sms entry

item

A description of the event occurring – can be one of call, sms, or data

month

The billing month that each entry belongs to – of form YYYY-MM

network

The mobile network that was called/texted for each entry

network_type

Whether the number being called was a mobile, international ('world'), voicemail, landline, or other ('special') number.

Source 1

How many rows the dataset:

>>> df['item'].count()
830

What was the longest phone call / data entry?:

>>> df['duration'].max()
10528.0

How many seconds of phone calls are recorded in total?:

>>> df.loc[ df['item'] == 'call' ]['duration'].sum()
92321.0

How many entries are there for each month?:

>>> df['month'].value_counts()
2014-11    230
2015-01    205
2014-12    157
2015-02    137
2015-03    101
Name: month, dtype: int64

Number of non-null unique network entries:

>>> df['network'].nunique()
9

4.15.10. Other

  • .nsmallest()

  • .values()

  • .rank()

4.15.11. References

1

Lynn, Shane. Summarising, Aggregating, and Grouping data in Python Pandas. https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/ Access date: 2019-12-03. 2019.

4.15.12. Assignments

Code 4.66. Solution
"""
* Assignment: DataFrame Statistics
* Complexity: medium
* Lines of code: 15 lines
* Time: 21 min

English:
    TODO: English Translation
    X. Run doctests - all must succeed

Polish:
    1. Ustaw ziarno losowości na zero
    2. Stwórz `df: pd.DataFrame` z 50 wierszami:
        a. kolumna `mileage` - losowe `int` [0, 200_000)
        b. kolumna `consumption` - losowe `int` [0, 20)
    3. Dodaj kolumnę `status` o wartościach:
        a. `old` jeżeli `mileage` powyżej 100_000 km
        b. `young` jeżeli `mileage` od 10_000 km do 50_000 km
        c. `new` jeżeli `mileage` od 0 do 10_000 km
    4. Używając `pd.cut` dodaj kolumnę `type`:
        a. jeżeli `consumption` [0, 1] `type` to `electric`
        b. jeżeli `consumption` [2, 10] `type` to `car`
        c. jeżeli `consumption` 11 i więcej, `type` to `truck`
    5. Przeanalizuj dane statystycznie:
        a. Zapisz podstawowe statystyki opisowe (`DataFrame.describe()`) do `result: pd.DataFrame`
        b. Sprawdź liczność grup (`DataFrame.count()`, `Series.value_counts()`)
    6. Uruchom doctesty - wszystkie muszą się powieść

Extra Task:
    1. (wymaga wiedzy z przyszłych rozdziałów)
    2. Narysuj histogram dla `consumption`
    3. Pogrupuj dane po `type` i `status` a następnie wypisz statystyki opisowe
    4. Pogrupuj dane po `type` i `status`, wypisz statystyki opisowe a następnie je transponuj

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> pd.set_option('display.width', 500)
    >>> pd.set_option('display.max_columns', 10)
    >>> pd.set_option('display.max_rows', 10)

    >>> assert result is not Ellipsis, \
    'Assign result to variable: `result`'
    >>> assert type(result) is pd.DataFrame, \
    'Variable `result` must be a `pd.DataFrame` type'

    >>> result  # doctest: +NORMALIZE_WHITESPACE
                mileage  consumption
    count      50.00000    50.000000
    mean   110421.02000     9.320000
    std     53170.24328     6.244802
    min      7877.00000     0.000000
    25%     71239.75000     4.000000
    50%    115186.00000     9.000000
    75%    154889.00000    14.750000
    max    199827.00000    20.000000
"""

import pandas as pd
import numpy as np
np.random.seed(0)


df = pd.DataFrame({
    'mileage': np.random.randint(0, 200_000, size=50),
    'consumption': np.random.randint(0, 21, size=50),
})


# type: pd.DataFrame
result = ...