HP1100 NTU Psychology Statistics: Module Review (Basic Descriptive Statistics)

Note: this post is part of a series of posts regarding HP1100 (Fundamentals of Social Science Research)

In the previous post, we talked about the independent samples t-test. However, conducting that test requires you to already understand quite a number of concepts, including the mean, median, variance and standard deviation. All of these fall under descriptive statistics, because they are terms used to describe the data at hand (no inferences about the population just yet).

Describing a Dataset

Typically, when we describe a dataset, we look at 2 things: central tendency, and spread. Central tendency is a single value that attempts to describe all the datapoints in the dataset, whereas spread describes how far apart the points are from each other.

In the above example, let the blue points on the number line represent our hypothetical data points. Given that the points cluster towards the left a lot more than the right, the central tendency (red line) will also sit towards the left. The spread (purple arrow) is the other piece of information: a measurement of how far the points are from each other.

Central Tendency Measurements

There are several ways to reflect the central tendency of the dataset — and each method is actually preferred under certain conditions.

(Arithmetic) Mean/Average. This is the default: typically, we use this as the measurement of central tendency because it encompasses the most information from the data out of all 3 measures of central tendency. The formula is simple: we just sum up all the points and then divide by the total number of points.
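For reference, the arithmetic mean is usually written as below. The symbols are the conventional ones and are an assumption here: x_i is the i-th datapoint, n is the number of datapoints, and x̄ ("x-bar") is the mean.

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```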

Side note: the funny looking symbol in the formula (Σ) is the summation symbol, which sums up whatever comes after it (here, all the values of x, representing our datapoints). Do familiarise yourself with the notation!

So say you have five values: 4, 6, 7, 8, 10. The average can be computed by summing up all the values (giving 35), then dividing that sum by 5. By doing this, you will get the value of 7.
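If you prefer to check this kind of arithmetic in code, here is a minimal Python sketch of the same computation. The function name `arithmetic_mean` is just a hypothetical helper for illustration.

```python
# Hypothetical helper: the arithmetic mean as "sum divided by count".
def arithmetic_mean(values):
    return sum(values) / len(values)

data = [4, 6, 7, 8, 10]
total = sum(data)              # 4 + 6 + 7 + 8 + 10 = 35
mean = arithmetic_mean(data)   # 35 / 5 = 7.0
print(total, mean)
```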

Extension: when we say mean, we typically are talking about the arithmetic mean. But in situations such as compound interest or average speeds, other types of mean are used: the geometric mean (for compounding growth rates) and the harmonic mean (for rates such as speed).
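To get a feel for these two other means, here is a small sketch using Python's built-in `statistics` module (version 3.8 or later). The growth factors and speeds are made-up numbers purely for illustration.

```python
from statistics import geometric_mean, harmonic_mean

# Hypothetical compound-interest example: growth of +10% one year, +20% the next.
growth_factors = [1.10, 1.20]
avg_growth = geometric_mean(growth_factors)  # roughly 1.149 per year

# Hypothetical speed example: the same distance driven at 40 km/h, then at 60 km/h.
speeds = [40, 60]
avg_speed = harmonic_mean(speeds)  # 48 km/h, not the arithmetic mean of 50 km/h

print(avg_growth, avg_speed)
```

The harmonic mean comes out lower than 50 because you spend more time driving at the slower speed.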

Median. This is the measure of central tendency that is preferred when there are extreme outliers in the dataset. It is less sensitive to outliers than the average, simply because it just takes the middle value (positional information) of the dataset as the central tendency, rather than using the values of all the datapoints (as the average does).

As you can see from the example above — let’s say in our dataset the points generally tend to be within the green circle — other than one extreme outlier on the right. Because of the outlier, the mean is pulled towards the extreme value all the way out of the green circle (remember, by using average, each datapoint contributes equally to the mean — even if it is very far out!)

Using the median however, this measure of central tendency remains within the green circle, where the majority of the points are concentrated. This is because to compute the median, you first rank the datapoints by their POSITIONS on the number line in ascending order, then just take the middle value (with an even number of datapoints, take the average of the two middle values). Some cases are illustrated below.

Relatively simple, right? What I want to highlight is that the median takes in less information about the dataset compared to the mean: it only takes in the positional information (and the value at the middle position, of course). You can visualise this by imagining that I give you 5 numbers in ascending order, but block out all the numbers other than the middle one. Even in this scenario, you are still able to compute the median (because all you need is the center value)! If you try to compute the mean, however, you are unable to, because you need more information: the values of ALL the datapoints.

As such, whenever possible — preserve as much information as you can! (i.e. use the mean). However, the median can be used when there are extreme outliers — as it better preserves the central tendency when the data is heavily skewed (presence of extreme datapoints).
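Here is a quick sketch of the outlier effect using Python's `statistics` module. Both datasets are hypothetical; the second is the same as the first except that the last value has been replaced by an extreme one.

```python
from statistics import mean, median

typical = [2, 3, 4, 5, 6]
with_outlier = [2, 3, 4, 5, 100]  # same data, but the last value is extreme

print(mean(typical), median(typical))            # mean 4, median 4
print(mean(with_outlier), median(with_outlier))  # mean 22.8, median still 4

# With an even number of points, the median is the average of the two middle values.
print(median([1, 2, 3, 4]))  # 2.5
```

The single extreme value drags the mean from 4 up to 22.8, while the median does not move at all.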

Mode. This is the measure of central tendency that uses the least amount of information out of the 3. To be honest I don’t really think it is used much — it uses too little information as compared to the median, which seems robust enough for most cases where the dataset is funky. But anyways.

Mode is simply the value that occurs with the highest frequency. So say you have a dataset: 2, 2, 3, 5, 7. The mode would be 2, because it occurs twice in the dataset, as compared to 3, 5 or 7, which occur once each. As you probably realise, the mode here is quite far away from the “center” of the data, because it uses very little information about the dataset other than the most frequently occurring value. As such, it is seldom used, but we still learn it, I guess for completeness of our knowledge.
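Finding the mode is just a counting exercise. One way to sketch it in Python is with `collections.Counter`, using the same dataset as above:

```python
from collections import Counter

data = [2, 2, 3, 5, 7]
counts = Counter(data)  # how many times each value occurs
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # the value 2 occurs 2 times
```

Note that a dataset can have ties for the most frequent value; this sketch simply reports one of them.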

Spread Measurements

Spread measurements concern how far the datapoints are from each other. Again, there are multiple metrics for this.

Range. Probably the simplest one — maximum value minus minimum value. In a dataset: 2,2,3,5,7, the range would be 7–2 = 5.

Interquartile Range (IQR). Understanding this requires an understanding of quartiles, which is linked to your understanding of the median. Simply put, the median is the 50th percentile of your dataset (middle position). Split the first half into 2 again: the median of that first half is the 1st quartile, or the 25th percentile. The second half’s median would be the 3rd quartile, or the 75th percentile.

The IQR is defined as the 3rd quartile’s value minus the 1st quartile’s value. That is — 10–2 = 8, in the above example.

Side note: when computing quartiles, people tend to get confused between the position and the value. These are NOT the same. Percentile/Quartile/Median uses POSITIONAL information — you find the POSITION of the datapoint in question, then you take the VALUE of the datapoint at that position. Be clear about this!
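The position-then-value idea can be sketched in Python as follows. The dataset is hypothetical, and the function assumes one particular convention: when there is an odd number of points, the middle point is left out of both halves. Other conventions exist (software packages often interpolate), so quartiles from other tools may differ slightly.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    Assumption: with an odd number of points, the middle point is left
    out of both halves. Needs at least 2 datapoints.
    """
    data = sorted(values)  # positions only make sense once the data is sorted
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

data = [1, 2, 4, 6, 8, 10, 12]  # hypothetical dataset
q1, q2, q3 = quartiles(data)
data_range = max(data) - min(data)
iqr = q3 - q1
print(q1, q2, q3)        # 2 6 10
print(data_range, iqr)   # 11 8
```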

Deviation. This term is a bit special: it reflects how far a datapoint is away from the mean. Yes, one datapoint. It is not used to describe the dataset as a whole, but rather quantifies how far a SINGLE datapoint is away from the mean. The formula is simple: just take the value of the datapoint in question and subtract the mean from it. It is also known as the error at times.

Squared Deviation. Similar to deviation, just square the whole formula. This is important because the sign of the deviation changes depending on which side of the mean the point is on. You want to get rid of the difference in signs because you are interested in how far the point is from the mean, not which side it is on (equally far points from the mean, whether on the left or on the right, should have the same distance metric from the mean).

Variance. Often used as the measure of spread. It is the average squared deviation of all the datapoints in the dataset. Naturally, the usage of this term now goes back to describing the dataset as a whole, rather than individual points.

Standard Deviation (SD). Another extremely common measure of variability, more commonly used as input for various formulas in statistical tests. It is just the square root of the variance. The reason you take the square root is that when you computed the squared deviation/variance, you squared the unit of measurement of your initial scale (say you were measuring in cm: the unit of the squared deviation/variance is cm²). As such, taking the square root brings the unit back to the original scale (the unit of the SD is the same as the original scale, cm).
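The chain from deviation to squared deviation to variance to SD can be sketched step by step in Python. This reuses the five values from the mean example and treats them as the full dataset, so the variance here is the plain average of the squared deviations.

```python
import math

data = [4, 6, 7, 8, 10]
mean = sum(data) / len(data)                       # 7.0

deviations = [x - mean for x in data]              # [-3.0, -1.0, 0.0, 1.0, 3.0]
squared_deviations = [d ** 2 for d in deviations]  # [9.0, 1.0, 0.0, 1.0, 9.0]

# Variance: the average of the squared deviations.
variance = sum(squared_deviations) / len(data)     # 20 / 5 = 4.0
sd = math.sqrt(variance)                           # 2.0

print(sum(deviations))  # 0.0: the raw deviations cancel out, hence the squaring
print(variance, sd)
```

The raw deviations always sum to zero, which is another way of seeing why they have to be squared before averaging.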

Important Note on Variance & SD

The formulas for variance & SD above are valid, but they work only at the population level (when you have the FULL dataset). As I mentioned in my previous post, in statistics you often don’t have data for the whole population; instead you are working with a sample.

Due to the nature of the statistical tests, when we compute variance, we typically want the variance of the population, even though we technically only have data from our sample. As a result, we usually use the sample to estimate the population variance, and we do so by adjusting the formula slightly. This adjustment involves subtracting 1 from the denominator (dividing by n − 1 instead of n), which is termed Bessel’s Correction. (It is not compulsory to understand this, but if you are interested I have attached a short explanation below.)

As such, the formulas are adjusted slightly when working with sample variances/SD.
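Written out in standard notation, the two sets of formulas look like this. The symbols are the conventional ones and are an assumption here: N and μ are the population size and population mean, while n and x̄ are the sample size and sample mean.

```latex
% Population (full dataset of N points, population mean \mu)
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
\qquad
\sigma = \sqrt{\sigma^2}

% Sample (n points, sample mean \bar{x}), with Bessel's correction
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2
\qquad
s = \sqrt{s^2}
```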

Do note that the relationship between variance and SD is still maintained — the latter is still the square root of the former. What changed is just the denominator of the terms.

Do also note the change in notation: for the population, the variance/SD is denoted by the sigma symbol (σ² for variance, σ for SD), whereas the sample variance/SD is denoted with an s (s² for variance, s for SD).

Last note. It can get confusing, because at the end of the day you are using the sample spread to estimate the population spread, and when you try to memorise it by this logic you end up losing track of which formula you are supposed to use. Here’s something that helped me: the formula names are with respect to the data you have at hand. Meaning, when what you have is the sample data, use the sample variance formula. When what you have is the population data (impossible in real life, a very possible hypothetical in exams), use the population variance formula.
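Python's `statistics` module happens to follow the same naming logic, which makes for a handy sanity check: `pvariance`/`pstdev` divide by n (population formulas), while `variance`/`stdev` divide by n − 1 (sample formulas). A small sketch on the five values used earlier:

```python
from statistics import pvariance, pstdev, variance, stdev

data = [4, 6, 7, 8, 10]  # sum of squared deviations from the mean is 20

# Treating the data as the whole population: divide by n = 5.
pop_var, pop_sd = pvariance(data), pstdev(data)   # variance 4, SD 2

# Treating the data as a sample: divide by n - 1 = 4.
samp_var, samp_sd = variance(data), stdev(data)   # variance 5, SD about 2.236

print(pop_var, pop_sd)
print(samp_var, samp_sd)
```

Same data, different denominators: pick the pair of functions that matches the data you have at hand.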

Either way, regardless of which formula you use, keep in mind that you are trying to get (estimate) the population parameters. That’s the ultimate purpose of basically everything you do in statistics.

Optional Topic: Bessel’s Correction for Variance

Why subtract 1 from the degrees of freedom? If you are curious like me, you would have asked the professors this question. And if my guess is correct, they would probably have given you some non-committal answer, partially because they are sick of people asking this question (it’s more common than you think hahaha), and also because explaining it takes a bit more work than would be allowed at the moment. You have come to the right place if you want to know though, because I certainly wasn’t satisfied with the answer given in class XD.

To understand this, you first need to understand what is meant by degrees of freedom (df). It is a statistical concept representing how “free” the data is to vary. The higher the df, the fewer restrictions you are placing on the dataset. The real definition is as follows: the number of independent values (that are free to vary) in the computation of a statistic.

In the following cases, keep in mind that the mean and variance are related concepts, but knowing one does not necessarily give you information about the other (with the exception of certain distributions at least). Your goal is to find out the population variance.

Population Case. Say I tell you that you have 3 datapoints in the population, along with a fixed population mean value of 5. The population mean is thought of as a fixed value existing even before you collect any data, a characteristic of the population if you will. Knowing this population mean of 5, there are still numerous ways that you can arrange the datapoints to get a mean value of 5. Examples include: (0,5,10), (1,5,9), (2,6,7) and so on. Each of these cases has a mean value of 5, but the individual values vary freely. To compute the variance, you can directly plug the values into your population variance formula. As a result, all 3 points are still independent, and df = 3.

Sample Case. When what we have is sample data, the population mean is not a known value. In the computation of variance however, we need to know the population mean. As a result, we have to use the sample mean as an estimate for the population mean, computing the mean using the datapoints. Essentially, you “used up” a piece of information by computing the mean from the data, rather than knowing it beforehand as in the population case.

This difference is subtle, but important. In the population case, all datapoints are still free to vary because the population mean is already a known value — “independent” of your datapoints. However, by estimating the population mean using your sample data, your mean is no longer “independent” of your datapoints.

Put another way, in the sample case you are basically fixing a datapoint in the dataset because one point is no longer free to vary independently. You can visualise this by imagining that you know the values of 2 datapoints and the mean. With this information, the last datapoint is forced to take on a certain value, meaning that it is no longer free to vary.

By computing the sample mean, we introduce a dependency: once (n-1) values are chosen, the last value is fixed to maintain the mean. This reduces the degrees of freedom by 1, because one independent piece of information is lost (df = n − 1, which is 2 in our 3-point example). The above does NOT apply in the population case because the mean was not computed; it was known beforehand. And therefore, your mean is “independent” of your dataset in the population case, but not in the sample case.
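The "last value is forced" idea can be sketched in a few lines of Python. The function name `forced_last_value` is a hypothetical helper; the numbers come from the (2,6,7) example with a mean of 5.

```python
def forced_last_value(known_values, mean, n):
    """Return the one remaining value, given the other n - 1 values and the mean."""
    if len(known_values) != n - 1:
        raise ValueError("expected exactly n - 1 known values")
    # The n values must sum to n * mean, so the last one is whatever is left over.
    return n * mean - sum(known_values)

last = forced_last_value([2, 6], mean=5, n=3)
print(last)  # 7, completing the dataset (2, 6, 7)
```

Once the mean is pinned down by the data, only n − 1 of the values can be chosen freely; the last one has no freedom left.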

Bessel’s correction is the name for this correction (subtracting one from the degrees of freedom). It is needed because for a sample, the mean is not a pre-determined fixed parameter, unlike in a population. This correction ultimately makes the sample variance an unbiased estimate of the population variance, because it accounts for the restriction you imposed during your computation. Of course, the end result is that the estimated population variance is larger than what you would get by dividing by n (because the denominator is smaller). This makes sense: the sample mean is computed from the very datapoints you are measuring deviations from, so on average it sits closer to them than the true population mean does, and the uncorrected formula would therefore tend to underestimate the population variance.
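If you want to see the unbiasedness for yourself, here is a small Python sketch. It takes the tiny population (0, 5, 10) from the population case, lists every possible sample of size 2, and averages the two versions of the variance formula over all of them. The setup is an assumption for illustration: samples are drawn with replacement and order matters, giving 9 equally likely samples.

```python
from itertools import product

def variance_with_denominator(sample, denominator):
    """Sum of squared deviations from the sample mean, divided by `denominator`."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / denominator

population = [0, 5, 10]
mu = sum(population) / len(population)
true_variance = sum((x - mu) ** 2 for x in population) / len(population)

n = 2
# Every possible ordered sample of size 2, drawn with replacement (9 in total).
samples = list(product(population, repeat=n))

avg_divide_by_n = sum(variance_with_denominator(s, n) for s in samples) / len(samples)
avg_bessel = sum(variance_with_denominator(s, n - 1) for s in samples) / len(samples)

print(true_variance)    # about 16.67
print(avg_divide_by_n)  # about 8.33, too small on average
print(avg_bessel)       # about 16.67, matches the population variance
```

Dividing by n gives estimates that are too small on average, while dividing by n − 1 hits the population variance exactly when averaged over all possible samples.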

Conclusion

And we’re done! This post contains some of the most fundamental tools we use to describe a dataset, and you will need to be familiar with these concepts before we move on to use them in more complex scenarios. Do take some time to familiarise yourself with it!

In the next post, I’ll be talking about distributions, and you will see how the concepts you learnt in this post (mean, variance) are used to describe these distributions as well. We will cover things like the normal distribution, t distribution, and perhaps expand a bit on why these distributions are important in Psychology as well. Stay tuned for that!
