Statistics

Comprehensive notes on statistics fundamentals and visualization.

Published January 15, 2025 ET

So, I took statistics in college, and I also took probability in high school. Both were satisfying to learn, and I very much enjoyed them at the time.

But I had little opportunity to use statistics after that. I encountered it to a serious degree in Econometrics, but as with Statistics, Econometrics never really came up for me after taking the class.

The thing is, since then, I've recognized my need to master it, particularly for the application to both machine learning and finance. Essentially, if you want to predict the future with any kind of certainty, you need to know probability, and if you want to know probability, often it comes down to a historical analysis, the roots of which lie in statistics.

Therefore, I'm starting fresh. And luckily, there exists such a thing as Khan Academy.

Now, even though I'd love to master all of statistics, this time around I'm going to learn it my way. That is to say, I'm going to focus on the parts that lead to immediate applications, and I'm going to judge what makes the most sense for what I'm trying to do.

My current need for statistics comprehension lies in finance. I need to understand Cumulative Distribution Functions for the purpose of understanding the Black-Scholes formula (B.S.F.).

Intro

First, I want to take a big picture look at statistics.

Categorical / non-categorical variables

On Khan Academy, they immediately make the distinction between categorical and non-categorical variables. Categorical variables take one of a limited set of labels (hot/cold), while non-categorical (quantitative) variables take numeric values that can be counted or measured (number of calories in a beverage).
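To make the distinction concrete for myself, here is a minimal sketch. The `drinks` data is made up for illustration; the point is only that a categorical column gets summarized by counting labels, while a quantitative column can be averaged.

```python
from collections import Counter
from statistics import mean

# Hypothetical rows: (temperature label, calories in the beverage).
drinks = [("hot", 120), ("cold", 95), ("hot", 210), ("cold", 0)]

# Categorical variable: the sensible summary is a count per label.
temperature_counts = Counter(label for label, _ in drinks)

# Quantitative variable: arithmetic such as an average is meaningful.
average_calories = mean(calories for _, calories in drinks)

print(temperature_counts)
print(average_calories)
```

Averaging "hot" and "cold" would make no sense, which is a quick test for whether a variable is categorical.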

The different ways in which data can be represented:

Pictograph

  • chart that uses pictures or icons to show how often each value of a variable occurs
  • answers question: which subject had the most of this variable, e.g. which person drank the most cups of coffee?

Bar Graph

  • all kinds can answer the questions:

    1. what were the mean, median, and mode values of the set?
  • Two-Column Bar graph (or two-column bar chart):

    • shows two values of a variable side by side for each subject
    • answers the questions:
      1. which subject had the biggest difference in this variable between the two measurements, e.g. which student had the biggest increase in test score between the mid-term and the final?
  • Bar graph (or bar chart):

    • shows the values of a variable for different subjects
    • answers the questions:
      1. which subject had the most of some value, e.g. which house at Hogwarts had the most witches and wizards this year?
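The questions a bar graph answers can also be answered straight from the numbers behind the bars. This is a rough sketch with invented data (the house counts and the student scores are assumptions of mine, not real figures), using Python's standard `statistics` module.

```python
from statistics import mean, median, mode

# Hypothetical bar heights: witches and wizards per house this year.
house_counts = {"Gryffindor": 70, "Hufflepuff": 65, "Ravenclaw": 70, "Slytherin": 75}

# Bar graph question: which subject had the most?
largest_house = max(house_counts, key=house_counts.get)

# Any bar graph: mean, median, and mode of the bar heights.
heights = list(house_counts.values())
summary = (mean(heights), median(heights), mode(heights))

# Two-column bar graph: hypothetical (mid-term, final) scores per student.
scores = {"Ana": (70, 85), "Ben": (80, 82), "Cy": (60, 90)}
biggest_gain = max(scores, key=lambda name: scores[name][1] - scores[name][0])

print(largest_house, summary, biggest_gain)
```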

Pie graph (or pie chart or circle chart)

  • you know what it looks like... shows what share of the total each category makes up

Venn Diagram

  • picture that shows how many items fall into each of one or more categories, including the overlap between categories
  • any two-category Venn diagram can be replaced by a two-way table

Two-way table (two-way frequency table)

  • table that shows how many items fall into each combination of two categorical variables
  • can sometimes also be looked at as "joint distributions" along two dimensions
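Here is a small sketch of why a two-category Venn diagram and a two-way table carry the same information. The survey answers are hypothetical; each cell of the table corresponds to one region of the diagram.

```python
from collections import Counter

# Hypothetical survey: (likes_cats, likes_dogs) for each respondent.
answers = [
    (True, True), (True, False), (False, True),
    (True, True), (False, False), (True, False),
]

# The two-way table: each (cats, dogs) combination mapped to its count.
table = Counter(answers)

both = table[(True, True)]         # overlap of the two circles
cats_only = table[(True, False)]   # cats circle, outside the overlap
dogs_only = table[(False, True)]   # dogs circle, outside the overlap
neither = table[(False, False)]    # outside both circles

print(both, cats_only, dogs_only, neither)
```

Dividing every cell by the number of respondents would turn these counts into the joint distribution along the two dimensions.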

Stem-and-Leaf plots (stem plots)

  • table that shows every value in a set, but in a consolidated view
  • rather than listing the full value, values are grouped by their ten, hundred, etc. (the stem), and only the remaining digit (the leaf) is listed after that prefix
  • Example: there are 10 people in a team, and they scored 01, 05, 17, 19, 18, 21, 21, 24, 25, and 29. Rather than listing it like that, list it like this:
    • 0: 1, 5
    • 1: 7, 8, 9
    • 2: 1, 1, 4, 5, 9
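The grouping above can be sketched in a few lines. The function name `stem_and_leaf` and its `leaf_unit` parameter are my own choices for illustration; it sorts the values first so that leaves come out in ascending order, which is the usual convention, and it skips stems that have no values.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=10):
    """Group values by stem (value // leaf_unit) with sorted leaves."""
    stems = defaultdict(list)
    for value in sorted(values):
        stems[value // leaf_unit].append(value % leaf_unit)
    return dict(stems)

# The team scores from the example above.
scores = [1, 5, 17, 19, 18, 21, 21, 24, 25, 29]
plot = stem_and_leaf(scores)

for stem, leaves in plot.items():
    print(f"{stem}: {', '.join(str(leaf) for leaf in leaves)}")
```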

Marginal VS Conditional distribution

Marginal distribution

A marginal distribution is like a single column in a table, where every row represents some bucket of the total pool of data. All the rows in that column add up to the total. So, say the first row represents the number of people who chose chocolate, the second row the number who chose vanilla, the third row strawberry, etc. If you add up the value in each row, you'll get the total number of people who chose flavors. That's a marginal distribution. In a two-way table, these are the totals written in the margins, which is where the name comes from. Each row can be expressed as a count (like 12 out of 100) or as a percentage (12%).
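As a sketch of the idea, here is a marginal distribution computed from a two-way table of flavor by container. The counts and the helper name `marginal_by_flavor` are invented for illustration.

```python
# Hypothetical two-way counts: (flavor, container) -> number of people.
counts = {
    ("chocolate", "cone"): 7, ("chocolate", "bowl"): 5,
    ("vanilla", "cone"): 3, ("vanilla", "bowl"): 5,
    ("strawberry", "cone"): 2, ("strawberry", "bowl"): 3,
}

def marginal_by_flavor(counts):
    """Sum over containers so only the flavor totals remain."""
    totals = {}
    for (flavor, _container), n in counts.items():
        totals[flavor] = totals.get(flavor, 0) + n
    return totals

flavor_totals = marginal_by_flavor(counts)       # the distribution as counts
grand_total = sum(flavor_totals.values())
flavor_shares = {f: n / grand_total for f, n in flavor_totals.items()}  # as fractions

print(flavor_totals)
print(flavor_shares)
```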

Conditional distribution

Just like a marginal distribution, but restricted to the cases where some other factor is true. So, taking the ice cream example, you do the same tally, but instead of counting all the people, you count only the people who got their ice cream in cones rather than in bowls or sundaes.
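A minimal sketch of the same idea in code, again with made-up counts and a hypothetical helper `conditional_on_container`: keep only the rows where the condition holds, then divide by that smaller total rather than by everyone.

```python
# Hypothetical two-way counts: (flavor, container) -> number of people.
counts = {
    ("chocolate", "cone"): 7, ("chocolate", "bowl"): 5,
    ("vanilla", "cone"): 3, ("vanilla", "bowl"): 5,
    ("strawberry", "cone"): 2, ("strawberry", "bowl"): 3,
}

def conditional_on_container(counts, container):
    """Distribution of flavors among people who chose the given container."""
    subset = {flavor: n for (flavor, c), n in counts.items() if c == container}
    total = sum(subset.values())
    return {flavor: n / total for flavor, n in subset.items()}

cone_shares = conditional_on_container(counts, "cone")
print(cone_shares)
```

With these numbers, vanilla is 3 out of the 12 cone orders, even though it is 8 out of all 25 orders.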

Distribution shapes

Say you're making an inventory of the guests at your party... their ages form a distribution. Most are around the age of 65, with the rest trickling out evenly on either side, i.e. there are as many 80 y/o folks as there are 50 y/o folks.
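To see what "trickling out evenly" looks like in numbers, here is a quick sketch with an invented guest list. In a perfectly symmetric set like this one, every guest below the center is mirrored by one the same distance above it, and the mean and median coincide.

```python
from statistics import mean, median

# Hypothetical guest ages, centered on 65.
ages = [50, 55, 60, 65, 65, 65, 70, 75, 80]

center = median(ages)

# Symmetric if the distances below the center mirror the distances above it.
is_symmetric = sorted(age - center for age in ages) == sorted(center - age for age in ages)

print(center, mean(ages), is_symmetric)
```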