Business Analytics

Acknowledgments: Sources where any material was referenced from or adapted have been identified in-line with the text.

Datafiles referred to in the text on this site are downloadable from https://drive.google.com/drive/folders/1WRv9AkvXHlzKK4L2xym4PSmb_io0wskf?usp=sharing

Introduction to Analytics

Analytics are ubiquitous: they are all around us, and they make our daily lives a lot simpler.

Google knows to look for not just the words you search for, but can almost guess (fairly accurately) what those words mean, and what you are really searching for. Netflix and YouTube almost know what you might want to watch next. Gmail can classify your email into real email, junk, promotions, and social messages. CVS knows what coupons to offer you. Your newsfeed shows you the stories you would be interested in. Your employer probably has a good idea of whether you are a flight risk. LinkedIn can show you jobs that are suited to your skills. Companies can react to your reviews, even though they receive thousands of reviews every day. Your computer can recognize your face. Zillow can reasonably accurately estimate the value of any home.

All of this is made possible by data and analytics. And while it may look like magic, in the end it is really mostly linear algebra and calculus at work behind the scenes.

In this set of notes, structured almost as a book, we are going to look behind the curtain and see what makes all of this possible. We will examine the ‘why’, the ‘what’, as well as the ‘how’. That means we will try to understand why something makes sense, what problems can be solved, and how to solve such problems, with practical examples.

Problems that data can solve often don't look like they can be solved by data; they almost always appear to be things that require human intelligence and judgment. This also means we will always be thinking about restructuring and reformulating problems into forms that make them amenable to the analytic tools we have at hand. This makes analytics as much a creative art as a matter of math and algorithms.

Who needs analytics?

So who needs analytics? Anyone who needs to make a decision needs analytics. Analytics support humans in making decisions, and can sometimes completely take the task of making decisions off the plates of humans.

Around us, we see many examples of analytics in action. The phrases in the parentheses suggest possible analytical tools that can help with the task described.

  • Governments analyze and predict pensions and healthcare bills. (Time series)
  • Google calculates whether you will or will not click an ad (Classification)
  • Amazon shows you what you will buy next (Association rules)
  • Insurance companies predict who will live, die, or have an accident (Classification)
  • Epidemiologists forecast the outbreak of diseases (Regression, RNNs)
  • Netflix wants to know what you would like to watch next (Recommender systems)

Defining analytics

But before we dive deeper, let us pause for a moment to think about what we mean by analytics. A quick search will reveal several definitions of analytics, and they are probably all accurate. A key thing about analytics, though, is that they are data-based, and that they provide us an intuition or an understanding which we did not have before. Another way of saying this is that analytics provide insights. Business analytics are actionable insights from data. Understanding the fundamental concepts underlying business analytics is essential for success at all levels in most organizations today.

So here is an attempted definition of analytics: Analytics are data-based actionable insights.
- They are data-based – which means opinions alone are not analytics
- They are actionable – which means they drive decisions, helping select a course of action among multiple available
- They are insightful – which means they uncover things that weren’t known before with certainty

We will define analytics broadly to mean anything that allows us to profit from data. Profit includes not only improving the bottom line by increasing revenues or reducing costs, but also anything that helps us achieve our goals, for example, making our customers happier, improving the dependability of our products, or improving patient outcomes in a healthcare setting. Depending on the business, these may or may not have a directly defined relationship to profits.

Analyzing data with a view to profiting from it has been called many different things, such as data mining, business analytics, data science, decision science, and countless other phrases, and you will find people on the internet who can eloquently debate the fine differences between all of these.

But as mentioned before, we will not delve into the semantics, and will focus on everything that allows us to profit from data – no matter what it is called by scholars. In the end, terminology makes no difference; our goal is to use data and improve outcomes – for ourselves, for our families, for our customers, for our shareholders. To achieve this goal, we will not limit ourselves to one branch or a single narrow interpretation of analytics. If it is something that can help us, we will include it in our arsenal.

What we will cover

A lot of data analytical work today relies on machine learning and artificial intelligence algorithms. This book will provide a high-level understanding of how these algorithms structure the problem, and how they go about solving it.

What do analytics look like?

Analytics manifest themselves in multiple forms:
1. Analytical dashboards and reports: providing information to support decisions. This is the most common use case for consuming analytics.
2. Embedded analytics: Analytics embedded in an application, for example, providing intelligent responses based on user interactions such as a bot responding to a query, or a workflow routing a transaction to a particular path based on data.
3. Automated analytics: Analytics embedded in a process where an analytic drives an automated decision or application behavior, for example an instant decision on a credit card.

The boundary between embedded and automated analytics can be fuzzy.

The practical perspective

Analytics vary in sophistication, and data can be presented in different forms. For example, data may be available as:
- Raw, or near-raw, data: Counts, observed facts, sensor readings
- Summarizations: Subtotals, sorts, filters, pivots, averages, basic descriptive statistics
- Time series: Comparison of the same measure over time
- Predictive analytics: Classification, prediction, clustering

We will address all of the above in greater detail. One key thing to keep in mind is that greater sophistication and complexity does not mean superior analytics, fitness-for-purpose is more important. In practice, the use of analytics takes multiple forms, most of which can be bucketed into the following categories:

  • Inform a decision by providing facts and numbers (e.g., in a report)
  • Recommend a course of action based on data
  • Automatically take a decision and execute

Given that we repeatedly emphasize decision making as a key goal for analytics, a slight distinction between the types of decisions is relevant. This matters because the way we build our solutions, with repeatability, performance, and scalability in mind, depends upon the nature and frequency of the decisions to be supported. Broadly, there are:

  • One-time decisions, that require discoveries of useful patterns and relationships as a one time exercise, and
  • Repeated decisions, that often need to be made at scale and require automation. These can benefit from even small improvements in decision-making accuracy.

Terminology confusion

Often, we see a distinction being drawn between descriptive and predictive analytics.

  • Descriptive Analytics: Describe what has happened in the past through reports, dashboards, statistics, traditional data mining etc.
  • Predictive Analytics: Use modeling techniques based on past data to predict the future or determine correlations between variables. Includes linear regression, time series analysis, data mining to find patterns for prediction, simulations.
  • Prescriptive Analytics: Identify the best course of action to take based on a rule, and the rule may be a model based prediction.

Other descriptions for analytics include exploratory, inferential, causal, mechanistic etc.

Then there are some other phrases one might come across in the context of business analytics. Here are some more:

  • Data Mining is the process of discovering meaningful correlations, patterns and trends by sifting through large amounts of data stored in repositories. Data mining employs pattern recognition technologies, as well as statistical and mathematical techniques.
    Source: Gartner, quoted from Data Mining for Business Intelligence by Shmueli et al
  • Data science is the field of study that combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data.
    Source: DataRobot website
  • Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
    Source: Wikipedia

What about big data?

Big data essentially means datasets that are too large for traditional data processing systems, and therefore require new processing technologies. But what was big yesterday is probably only medium sized, or even small, by today's standards, so the phrase big data does not have a precise definition. What is big data today might be just right-sized for the phone of the future.

Big data technologies support data processing, data engineering, and also data science activities – for example, Apache Spark, a big data solution, has an entire library of machine learning algorithms built in.

AI, ML and Deep Learning

Yet another way to slice the business analytics landscape is into AI, ML and deep learning.

  • Artificial Intelligence: the effort to automate intellectual tasks normally performed by humans. This term has the most expansive scope, and includes Machine Learning and Deep Learning.
  • Machine Learning: A machine-learning system is one that is trained rather than explicitly programmed. It’s presented with many examples relevant to a task, and it finds statistical structure in these examples that eventually allows the system to come up with rules for automating the task.
  • Deep Learning: Deep learning is a specific subfield of machine learning: learning from data with an emphasis on learning successive layers of increasingly meaningful representations. The ‘deep’ in deep learning stands for this idea of successive layers of representations. Neural networks (NNs) are nearly always the technology that underlies deep learning systems.

Source: Deep Learning with Python, Francois Chollet

Does terminology matter?

This may be repetitive, but necessary to stress - our goal is to profit from data, using any and all computational techniques and resources we can get our hands on. We will use many of these terms interchangeably. We will mostly talk about business analytics in the context of improving decisions and business outcomes, and refer to the general universe of tools and methods as data science. These tools and methods include Business Intelligence, Data Mining, Forecasting, Modeling, Machine Learning and AI, Big Data, NLP and other techniques that help us profit from data.

That makes sense because, for the businessman, the differences between all of these "types of analytics" are merely semantic. What we are trying to do is to generate value of some sort using data and numbers, and the creation of something useful is more important than what we call it. We will stick with analytics to mean anything we can do with data that can provide us a profit, or help us achieve our goals.

Where are analytics used?

The widest applications of data science and analytics have been in marketing for tasks such as:
- Targeted marketing
- Online advertising
- Recommendations for cross-selling

Yet just about every other business area has benefited from advances in data science:
- Accounting
- HR
- Compliance
- Supply chain
- Web analytics
- You name it…

Analytics as strategic assets

Data and data science capabilities are strategic assets. What that means is that in the early days of adoption they provide a competitive advantage that allows us to outdo our competition. As they mature and become part of the mainstream, they become essential for survival.

Both data and data science capabilities are increasingly the deciding factor behind who wins and who loses. Both the data and the capabilities are needed: having lots of data is an insufficient condition for success – the capability to apply data science to the available data, for profitable use cases, is key.
This is because the best data science team can yield little value without the right data with some predictive power, and the best data cannot yield insights without the right data science talent.

Organizations need to invest in both data and data science to benefit from analytics.

This is why understanding the fundamental concepts underlying business analytics is essential for success at all levels in most organizations. This invariably involves knowing at a conceptual level the data science process, capabilities of algorithms such as the ideas behind classification and predictions, AI and ML, and the evaluation of models. We will cover all of these in due time.

Our goal

The goal for these notes is that, when you are done going through them, you should be able to:
- When faced with a business problem, be able to assess whether and how data can be used to arrive at a solution or improve performance
- Act competently and with confidence when working with data science teams
- Identify opportunities where data science teams can apply technical solutions, monitor their implementation progress, and review the business benefits
- Oversee analytical teams and direct their work
- Identify data driven competitive threats, and be able to formulate a strategic response
- Critically evaluate data science proposals (from internal teams, consultants)
- Simplify and concisely explain data driven approaches to leadership, management and others


Descriptive Statistics

With all of the above background around analytics, we are ready to jump right in! We will start with descriptive statistics, which are key summary attributes of a dataset that help describe or summarize the dataset in a meaningful way.

Descriptive statistics help us understand the data at the highest level, and are generally what we seek when we perform exploratory analysis on a dataset for the first time. (We will cover exploratory data analysis next, after a quick review of descriptive statistics.)

Descriptive statistics include measures that summarize the:

  • central tendency, for example the mean,
  • dispersion or variability, for example the range or the standard deviation, and
  • shape of a dataset’s distribution, for example the quartiles and percentiles

Descriptive statistics do not allow us to make conclusions or predictions beyond the data we have analyzed, or reach conclusions regarding any hypotheses we might have made.

Below is a summary listing of the commonly used descriptive statistics. We cover them briefly, because rarely will we have to calculate any of these by hand as the software will almost always do it for us.

Measures of Central Tendency

  • Mean: The mean is the most commonly used measure of central tendency. It is simply the average of all observations, which is obtained by summing all the values in the dataset and dividing by the total number of observations.
  • Geometric Mean: The geometric mean is calculated by multiplying all the values in the data, and taking the n-th root, where n is the number of observations. Geometric mean may be useful when values compound over time, but is otherwise not very commonly used.

  • Median: The median is the middle value in the dataset. By definition, half the data values will be greater than the median, and the other half lower. There are rules for computing the median when the count of data values is odd or even, but those nuances don't really matter much when one is dealing with thousands or millions of observations.

  • Mode: Mode is the most commonly occurring value in the data. (A short code sketch of these measures follows this list.)
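
As an illustration, here is a minimal sketch of computing these measures of central tendency with pandas and scipy. The series name and values are made up purely for illustration.

```python
# Illustrative only: a made-up series of values named 'price'
import pandas as pd
from scipy import stats

data = pd.Series([12, 15, 15, 18, 22, 30, 45], name="price")

print(data.mean())            # arithmetic mean
print(stats.gmean(data))      # geometric mean (values must be positive)
print(data.median())          # median
print(data.mode().tolist())   # mode(s); pandas returns all equally frequent values
```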

Measures of Variability

  • Range: Range simply is the maximum value minus the minimum value in our dataset.
  • Variance: Variance is the average of the squared differences around the mean - which means that for every observation we calculate its difference from the mean, and square that difference. We then add up these squared values, and divide the sum by n (or by n - 1 for the sample variance) to obtain the variance. One problem with variance is that it is a quantity expressed in units-squared, a concept intuitively difficult for humans to understand.
  • Standard Deviation: Standard deviation is the square root of the variance, and takes care of the units-squared problem.
  • Coefficient of Variation: The coefficient of variation is the ratio of the standard deviation to the mean. Being a ratio, it makes it easier for humans to comprehend the scale of variation in a distribution relative to its mean. (See the code sketch after this list.)
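
A minimal sketch of the variability measures above, again on a made-up series. Note that pandas defaults to the sample versions (dividing by n - 1), so ddof=0 is passed where the population versions are wanted.

```python
import pandas as pd

data = pd.Series([12, 15, 15, 18, 22, 30, 45])  # illustrative values

value_range = data.max() - data.min()           # range
variance = data.var(ddof=0)                     # population variance (divide by n)
sample_variance = data.var()                    # sample variance (divide by n-1, pandas default)
std_dev = data.std(ddof=0)                      # standard deviation
coeff_variation = std_dev / data.mean()         # coefficient of variation

print(value_range, variance, sample_variance, std_dev, coeff_variation)
```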

Measures of Association

  • Covariance: Covariance measures the linear association between two variables. To calculate it, for each observation the mean of each variable is subtracted from that variable's value, and the two deviations are multiplied together. These products are then summed up, and divided by the number of observations. Covariance is in the units of both of the variables, and is impacted by scale (for example, if distance is a variable, whether you are measuring it in meters or kilometers will impact the computation). You can find detailed examples in any primer on high school statistics, and we will just use software to calculate covariance when we need to.
  • Correlation: Correlation is the covariance between two variables divided by the product of the standard deviations of the two variables. Dividing covariance by the standard deviations has the effect of removing the impact of the scale of the units of the observations, ie, it takes out the units and you do not have to think about whether meters or kilometers were used in the calculations. This makes correlation an easy number to interpret, as it always lies between -1 and +1. (See the code sketch after this list.)
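
A minimal sketch of covariance and correlation with pandas; the two columns (height_cm, weight_kg) and their values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150, 160, 165, 172, 180, 185],
    "weight_kg": [52, 60, 63, 70, 78, 84],
})

print(df["height_cm"].cov(df["weight_kg"]))    # covariance, in cm x kg units
print(df["height_cm"].corr(df["weight_kg"]))   # Pearson correlation, unitless, between -1 and +1
print(df.cov())                                # full covariance matrix
print(df.corr())                               # full correlation matrix
```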

Analyzing Distributions

  • Percentiles: Percentiles divide the dataset into a hundred equally sized buckets; the p-th percentile is the value below which p percent of the observations lie. Of course, the 50th percentile is the same as the median.
  • Quartiles: Similar to percentiles, only that they divide the dataset into four equal buckets. Again, the 2nd quartile is the same as the median.
  • Z-Scores: These are used to scale observations by subtracting the mean and dividing by the standard deviation. We will see these when we get to deep learning, or to some of the machine learning algorithms that require inputs to be roughly in the same range. Standardizing variables by calculating z-scores is standard practice in many data analytics situations. (See the code sketch after this list.)
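
A minimal sketch of percentiles, quartiles and z-scores, using the same kind of made-up series as before.

```python
import pandas as pd

data = pd.Series([12, 15, 15, 18, 22, 30, 45])  # illustrative values

print(data.quantile(0.90))                      # 90th percentile
print(data.quantile([0.25, 0.50, 0.75]))        # quartiles; the 0.50 quantile is the median
z_scores = (data - data.mean()) / data.std()    # standardize: subtract mean, divide by std dev
print(z_scores)
```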

A special note on standard deviation

What is Standard Deviation useful for?
When you see a number for standard deviation, the question is - how do you interpret it? A useful way to think about standard deviation is to think of it as setting an upper and lower limit on where data points would lie on either side of the mean.

If you know your data is normally distributed (or is bell shaped), the empirical rule (below) applies. However most of the time we have no way of knowing if the distribution is normal or not. In such cases, we can use Chebyshev's rule, also listed below.

I personally find Chebyshev's rule to be very useful - if I know the mean, and someone tells me the standard deviation, then I know that at least 75% of the data lies within two standard deviations of the mean. (Both rules are checked numerically in the short sketch below.)

Empirical Rule
For a normal distribution:
- Approximately 68.27% of the data values will be within 1 standard deviation of the mean.
- Approximately 95.45% of the data values will be within 2 standard deviations of the mean.
- Almost all the data values will be within 3 standard deviations of the mean.

Chebyshev’s Theorem
For any distribution:
- At least 3/4 of the data lie within 2 standard deviations of the mean
- At least 8/9 of the data lie within 3 standard deviations of the mean
- In general, at least 1 - 1/k² of the data lie within k standard deviations of the mean, for any k greater than 1
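
As a quick sanity check of the empirical rule and Chebyshev's bound, the sketch below draws a normal sample with numpy and measures coverage; the sample size and the loc/scale parameters are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=100, scale=15, size=100_000)  # simulated, roughly bell-shaped data

mean, std = x.mean(), x.std()
for k in (1, 2, 3):
    within = np.mean(np.abs(x - mean) <= k * std)        # observed share within k std devs
    chebyshev_bound = 1 - 1 / k**2 if k > 1 else 0.0     # Chebyshev lower bound (trivial for k=1)
    print(f"within {k} sd: {within:.4f} (Chebyshev lower bound: {chebyshev_bound:.2f})")
```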

A special note on correlation

While Pearson's correlation coefficient is generally the default, it works only when both variables are numeric. This becomes an issue when the variables are categorical, for example, when one variable is nationality and the other is education.

There are multiple ways to calculate correlation. Below is an extract from the pandas_profiling library, which calculates several types of correlations between variables.

(Source: https://pandas-profiling.ydata.ai/)

  1. Pearson’s r (generally the default, can calculate using pandas)
    The Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and 1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r. To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.

  2. Spearman’s ρ (supported by pandas)
    The Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and 1 indicating total positive monotonic correlation. To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.

  3. Kendall’s τ (supported by pandas)
    Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and 1 indicating total positive correlation. To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.

  4. Phik (φk) (use the phik library)
    Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. (Interval variables are a special case of ordinal variables where the ordered points are equidistant.)

  5. Cramér's V (use a custom function, or the PyCorr library)
    Cramér's V is an association measure for nominal random variables (nominal random variables are categorical variables with no order, e.g., country names). The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been proved to be biased, even for large samples. (A code sketch covering these correlation measures follows below.)
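
Here is a minimal sketch of computing these measures in Python. The DataFrame and its columns are hypothetical, the phik usage (shown commented out) assumes the phik library is installed, and Cramér's V is computed with a small custom function based on the chi-squared statistic, as suggested above.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data mixing numeric and categorical columns
df = pd.DataFrame({
    "age": [23, 35, 47, 52, 61, 30, 44, 55],
    "income": [30, 48, 65, 70, 72, 40, 60, 68],
    "education": ["HS", "BA", "MA", "MA", "PhD", "BA", "BA", "MA"],
    "region": ["N", "S", "N", "S", "N", "S", "N", "S"],
})

# Pearson, Spearman and Kendall on the numeric columns (built into pandas)
print(df[["age", "income"]].corr(method="pearson"))
print(df[["age", "income"]].corr(method="spearman"))
print(df[["age", "income"]].corr(method="kendall"))

# Phik works across numeric and categorical columns; importing phik is assumed to
# add the phik_matrix() accessor to DataFrames (requires: pip install phik)
# import phik
# print(df.phik_matrix())

def cramers_v(x, y):
    """Cramér's V for two categorical variables, via the chi-squared statistic."""
    table = pd.crosstab(x, y)
    chi2, p, dof, expected = chi2_contingency(table)
    n = table.values.sum()
    r, k = table.shape
    return np.sqrt((chi2 / n) / min(r - 1, k - 1))

print(cramers_v(df["education"], df["region"]))
```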