Correlation Analysis Concept

What Is Correlation Analysis?

We often want to know if there's a relationship between two things we can measure with numbers (two quantitative variables). For example:

  • Is there a relationship between students' height and weight?
  • Are study hours related to exam scores?
  • Is the age of a car related to its price?

Correlation analysis is a statistical method that measures the strength and direction of the linear relationship (straight-line pattern) between two such variables.

Correlation Coefficient

Just saying "there's a relationship" isn't enough. We need a definite measure so everyone has the same understanding. This standard measure is called the Correlation Coefficient, usually denoted by the letter $r$.

The correlation coefficient ($r$) gives us two important pieces of information:

  1. Direction of the Relationship:
    • Positive ($r > 0$): If one variable increases, the other variable tends to increase as well (and vice versa). Example: Taller people usually weigh more.
    • Negative ($r < 0$): If one variable increases, the other variable tends to decrease (and vice versa). Example: The older a car is, the lower its price usually is.
  2. Strength of the Relationship:
    • How close the value of $r$ is to +1 or -1 indicates how strong the linear relationship is. The closer to +1 or -1, the stronger the relationship (the data points cluster more closely around a straight line).
    • If the value of $r$ is close to 0, the linear relationship is weak or even non-existent (the data points show no straight-line pattern).

Range of $r$ Values: The value of the correlation coefficient always lies between -1 and +1.

$$-1 \le r \le +1$$
  • $r = +1$: Perfect positive linear correlation.
  • $r = -1$: Perfect negative linear correlation.
  • $r = 0$: No linear correlation.
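
To make the range and the special cases concrete, here is a minimal Python sketch. It assumes the usual Pearson formula for $r$ (the sum of products of deviations divided by the square root of the product of the summed squared deviations), which the text above does not spell out. The function name `pearson_r` and all data values are hypothetical, invented only for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
study_hours = [1, 2, 3, 4, 5, 6]
exam_scores = [52, 58, 61, 70, 74, 83]

r_study = pearson_r(study_hours, exam_scores)
r_up = pearson_r([1, 2, 3, 4], [10, 20, 30, 40])         # points on a rising line
r_down = pearson_r([1, 2, 3, 4], [40, 30, 20, 10])       # points on a falling line
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x**2: related, but not linearly

print(f"study hours vs scores: r = {r_study:.3f}")
print(f"rising line:           r = {r_up:.3f}")
print(f"falling line:          r = {r_down:.3f}")
print(f"symmetric curve:       r = {r_curve:.3f}")
```

The last case is worth noting: the two variables are perfectly related by a curve, yet $r$ comes out as 0 because there is no straight-line trend.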

Coefficient of Determination

Sometimes, we want to know how much of the variation (ups and downs in value) in one variable can be explained by the other variable. This measure is called the Coefficient of Determination, which is the square of the correlation coefficient ($r^2$).

For example, if $r = 0.8$ between study hours and exam scores, then $r^2 = (0.8)^2 = 0.64$. This means about 64% of the variation in students' exam scores can be explained by the differences in their study hours. The rest (36%) might be influenced by other factors (intelligence, study methods, etc.).

The value of $r^2$ is always between 0 and 1.

$$0 \le r^2 \le 1$$

The closer $r^2$ is to 1, the better variable X explains the variation in variable Y.
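
The "share of variation explained" reading can be checked numerically. The sketch below assumes a simple least-squares straight-line fit of Y on X, then compares the squared correlation coefficient with the fraction of the total variation in Y that the fitted line accounts for. The data and variable names are hypothetical, chosen only for illustration.

```python
from math import sqrt

# Hypothetical data, invented for illustration only.
study_hours = [1, 2, 3, 4, 5, 6]
exam_scores = [52, 58, 61, 70, 74, 83]

n = len(study_hours)
mean_x = sum(study_hours) / n
mean_y = sum(exam_scores) / n

# Sums of deviations used by both the correlation and the line fit.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(study_hours, exam_scores))
sxx = sum((x - mean_x) ** 2 for x in study_hours)
syy = sum((y - mean_y) ** 2 for y in exam_scores)

r = sxy / sqrt(sxx * syy)

# Least-squares line: score = intercept + slope * hours.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * x for x in study_hours]

# Total variation in Y, and the part the line fails to explain.
ss_total = syy
ss_residual = sum((y - p) ** 2 for y, p in zip(exam_scores, predicted))
explained_share = 1 - ss_residual / ss_total

print(f"r^2             = {r ** 2:.4f}")
print(f"explained share = {explained_share:.4f}")
```

For a straight-line fit with a single explanatory variable the two printed numbers agree up to rounding, which is why $r^2$ can be read as the proportion of variation explained.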

Correlation Does Not Imply Causation

Just because two variables are strongly correlated doesn't mean one variable causes the change in the other. There might be other unmeasured factors affecting both.

Example:

Ice cream sales and drowning incidents might be positively correlated (both increase in the summer), but it doesn't mean eating ice cream causes drowning. The underlying cause is the summer season (hot weather).
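
A small simulation can illustrate how a shared cause produces this kind of correlation. In the sketch below, every number is invented: temperature drives both series, and neither series influences the other. The helper names (`pearson_r`, `residuals`) and the step of removing each variable's straight-line dependence on temperature are assumptions made for this illustration, not a method described in the text.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def residuals(ys, xs):
    """What remains of ys after subtracting its least-squares line fit on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated data: temperature is the only thing the two series share.
temperature = [rng.uniform(15, 35) for _ in range(200)]
ice_cream_sales = [10 * t + rng.gauss(0, 20) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 1.5) for t in temperature]

# Raw correlation between the two series.
r_raw = pearson_r(ice_cream_sales, drownings)

# Correlation after removing each series' linear dependence on temperature.
r_adjusted = pearson_r(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)

print(f"raw correlation:               {r_raw:.2f}")
print(f"after accounting for temperature: {r_adjusted:.2f}")
```

In this simulated setup the raw correlation is strongly positive, while the adjusted value is close to 0, consistent with temperature being the shared cause.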

So, correlation analysis helps us understand the strength and direction of a linear relationship, but it doesn't explain why that relationship exists.