4  Descriptive Statistics

Code
YouTubeVideo('AiDqx1eZzTo', width=672, height=378)

With all of the above background around analytics, we are ready to jump right in! We will start with descriptive statistics, which are key summary attributes of a dataset that help describe or summarize the dataset in a meaningful way.

Descriptive statistics help us understand the data at the highest level, and are generally what we seek when we perform exploratory analysis on a dataset for the first time. (We will cover exploratory data analysis next, after a quick review of descriptive statistics.)

Descriptive statistics include measures that summarize the:

  • central tendency of the data,
  • variability, or spread, in the data, and
  • association between variables in the data.

Descriptive statistics do not allow us to make conclusions or predictions beyond the data we have analyzed, or reach conclusions regarding any hypotheses we might have made.

Below is a summary listing of the commonly used descriptive statistics. We cover them only briefly, because we will rarely have to calculate any of these by hand - the software will almost always do it for us.

4.0.1 Measures of Central Tendency

  • Mean: The mean is the most commonly used measure of central tendency. It is simply the average of all observations, which is obtained by summing all the values in the dataset and dividing by the total number of observations.

  • Geometric Mean: The geometric mean is calculated by multiplying all the values in the data, and taking the n-th root, where n is the number of observations. Geometric mean may be useful when values compound over time, but is otherwise not very commonly used.

  • Median: The median is the middle value in the dataset. By definition, half the data will be above the median value, and the other half below it. There are rules around how to compute the median when the count of data values is odd or even, but those nuances don’t really matter much when one is dealing with thousands or millions of observations.

  • Mode: Mode is the most commonly occurring value in the data.
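All four measures are available in Python’s standard-library statistics module. A minimal sketch on a small made-up dataset:

```python
import statistics

# A small made-up dataset for illustration
data = [2, 4, 4, 4, 8, 16]

print(statistics.mean(data))            # arithmetic mean: sum divided by n
print(statistics.geometric_mean(data))  # n-th root of the product of the values
print(statistics.median(data))          # middle value of the sorted data
print(statistics.mode(data))            # most frequently occurring value: 4
```

Note how the geometric mean (about 5.04 here) is pulled less toward the large value 16 than the arithmetic mean is.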

4.0.2 Measures of Variability

  • Range: Range is simply the maximum value minus the minimum value in our dataset.
  • Variance: Variance is the average of the squared differences around the mean - which means that for every observation we calculate its difference from the mean, and square the difference. We then add up these squared values, and divide this sum by n (or by n-1 for the sample variance) to obtain the variance. One problem with variance is that it is a quantity expressed in units-squared, a concept intuitively difficult for humans to understand.
  • Standard Deviation: Standard deviation is the square root of the variance, and takes care of the units-squared problem.
  • Coefficient of Variation: The coefficient of variation is the ratio of the standard deviation to the mean. Being a ratio, it makes it easier for humans to comprehend the scale of variation in a distribution compared to its mean.
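These measures can likewise be computed with the standard-library statistics module. A sketch (pvariance and pstdev use the population formula, i.e. they divide by n):

```python
import statistics

data = [2, 4, 4, 4, 8, 16]

rng = max(data) - min(data)        # range: max minus min
var = statistics.pvariance(data)   # population variance (divides by n)
sd = statistics.pstdev(data)       # standard deviation: square root of variance
cv = sd / statistics.mean(data)    # coefficient of variation: sd relative to mean

print(rng, var, sd, cv)
```

Use statistics.variance and statistics.stdev instead if you want the sample (n-1) versions.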
Code
YouTubeVideo('Ddkfq9fT62U', width=672, height=378)

4.0.3 Measures of Association

  • Covariance: Covariance measures the linear association between two variables. To calculate it, for each pair of observations we subtract each variable’s mean from its value, and multiply the two deviations together. The products are then summed up, and divided by the number of observations. Covariance is in the units of both of the variables, and is impacted by scale (for example, if distance is a variable, whether you are measuring it in meters or kilometers will impact the computation). You can find detailed examples in any primer on high school statistics, and we will just use software to calculate covariance when we need to.
  • Correlation: Correlation is the covariance between two variables divided by the product of the standard deviations of the two variables. Dividing covariance by the standard deviations has the effect of removing the impact of the scale of the units of the observations, ie, it takes out the units and you do not have to think about whether one used meters or kilometers in the calculations. This makes correlation an easy-to-interpret number, as it always lies between -1 and +1.

4.0.4 Analyzing Distributions

  • Percentiles: Percentiles divide the dataset into a hundred equally sized buckets; the k-th percentile is the value below which k percent of the observations lie. Of course, the 50th percentile is the same as the median.
  • Quartiles: Similar to percentiles, except that they divide the dataset into four equal buckets. Again, the 2nd quartile is the same as the median.
  • Z-Scores: These are used to scale observations by subtracting the mean and dividing by the standard deviation. We will see these when we get to deep learning, or some of the machine learning algorithms that require inputs to be roughly in the same range. Standardizing variables by calculating z-scores is a standard practice in many situations when performing data analytics.
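A sketch with the standard-library statistics module: statistics.quantiles returns cut points (n=4 for quartiles, n=100 for percentiles), and z-scores are just (x - mean) / standard deviation:

```python
import statistics

data = list(range(1, 101))   # a made-up dataset: the integers 1 through 100

quartiles = statistics.quantiles(data, n=4)      # [Q1, Q2, Q3]; Q2 is the median
percentiles = statistics.quantiles(data, n=100)  # 99 cut points

mu = statistics.mean(data)
sigma = statistics.pstdev(data)
z = [(x - mu) / sigma for x in data]             # standardized values

print(quartiles[1], statistics.median(data))     # both equal the median, 50.5
```

After standardizing, the z-scores always have mean 0 and standard deviation 1, which is why they put variables on a comparable scale.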

4.0.5 A special note on standard deviation

What is Standard Deviation useful for?
When you see a number for standard deviation, the question is - how do you interpret it? A useful way to think about standard deviation is as setting upper and lower limits on where data points are likely to lie on either side of the mean.

If you know your data is normally distributed (or is bell shaped), the empirical rule (below) applies. However, most of the time we have no way of knowing if the distribution is normal or not. In such cases, we can use Chebyshev’s rule, also listed below.

I personally find Chebyshev’s rule to be very useful - if I know the mean, and someone tells me the standard deviation, then I know that at least 75% of the data lies within 2 standard deviations of the mean.

Empirical Rule
For a normal distribution:
- Approximately 68.27% of the data values will be within 1 standard deviation of the mean.
- Approximately 95.45% of the data values will be within 2 standard deviations of the mean.
- Almost all the data values (approximately 99.73%) will be within 3 standard deviations of the mean.

Chebyshev’s Theorem
For any distribution:
- At least 3/4 of the data lie within 2 standard deviations of the mean.
- At least 8/9 of the data lie within 3 standard deviations of the mean.
- In general, at least \(1 - \frac{1}{k^2}\) of the data lie within \(k\) standard deviations of the mean, for any \(k > 1\).
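Chebyshev’s bound is easy to check empirically. A sketch on a deliberately skewed, non-normal sample of random exponential values; the observed fractions should meet or exceed the guaranteed 3/4 and 8/9:

```python
import random
import statistics

random.seed(42)
# A deliberately skewed, non-normal (exponential) sample
data = [random.expovariate(1.0) for _ in range(10_000)]

mu = statistics.mean(data)
sd = statistics.pstdev(data)

for k in (2, 3):
    within = sum(abs(x - mu) <= k * sd for x in data) / len(data)
    bound = 1 - 1 / k**2
    print(f"within {k} sd: {within:.3f} (Chebyshev guarantees at least {bound:.3f})")
```

Because Chebyshev’s theorem holds for any distribution, the observed fractions can be (and usually are) well above the guaranteed minimums.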

4.0.6 A special note on correlation

While Pearson’s correlation coefficient is generally the default, it works only when both the variables are numeric. This becomes an issue when the variables are categorical - for example, when one variable is nationality and the other education.

There are multiple ways to calculate correlation. Below is an extract from the ydata-profiling library (formerly pandas_profiling), which calculates several types of correlations between variables.

Note: The pandas_profiling package was renamed to ydata-profiling in 2023. Install with pip install ydata-profiling; import as from ydata_profiling import ProfileReport.

(Source: https://ydata-profiling.ydata.ai/)

  1. Pearson’s r (generally the default, can calculate using pandas)
    The Pearson’s correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, -1 indicating total negative linear correlation, 0 indicating no linear correlation and 1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r. To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.

  2. Spearman’s \(\rho\) (supported by pandas)
    The Spearman’s rank correlation coefficient (\(\rho\)) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson’s r. Its value lies between -1 and +1, -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation and 1 indicating total positive monotonic correlation. To calculate \(\rho\) for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.

  3. Kendall’s \(\tau\) (supported by pandas)
    Similarly to Spearman’s rank correlation coefficient, the Kendall rank correlation coefficient (\(\tau\)) measures ordinal association between two variables. Its value lies between -1 and +1, -1 indicating total negative correlation, 0 indicating no correlation and 1 indicating total positive correlation. To calculate \(\tau\) for two variables \(X\) and \(Y\), one determines the number of concordant and discordant pairs of observations. \(\tau\) is given by the number of concordant pairs minus the discordant pairs divided by the total number of pairs.

  4. Phik (\(\phi k\)) (use library phik)
    Phik (\(\phi k\)) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. (Interval variables are a special case of ordinal variables where the ordered points are equidistant.)

  5. Cramér’s V (\(\phi c\)) (use custom function, or PyCorr library)
    Cramér’s V is an association measure for nominal random variables (nominal random variables are categorical variables with no order, eg, country names). The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér’s V have been proved to be biased, even for large samples.
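The first three of these can be computed directly in pandas via DataFrame.corr. A sketch with a made-up nonlinear but perfectly monotonic relationship (y = x²), where the rank-based measures reach 1 while Pearson’s r does not:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 4, 9, 16, 25],   # y = x**2: nonlinear but perfectly monotonic
})

print(df.corr(method="pearson"))   # r < 1: the relationship is not linear
print(df.corr(method="spearman"))  # rho = 1: perfectly monotonic
print(df.corr(method="kendall"))   # tau = 1: all pairs are concordant
```

For Phik and Cramér’s V, which are not built into pandas, you would reach for the libraries named above.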