What Is Correlation Analysis?
We often want to know if there's a relationship between two things we can measure with numbers (two quantitative variables). For example:
- Is there a relationship between students' height and weight?
- Are study hours related to exam scores?
- Is the age of a car related to its price?
Correlation analysis is a statistical method that measures the strength and direction of the linear relationship (straight-line pattern) between two such variables.
Correlation Coefficient
Just saying "there's a relationship" isn't enough. We need a definite measure so everyone has the same understanding. This standard measure is called the Correlation Coefficient, usually denoted by the letter $r$.
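The text does not say which coefficient it has in mind. Assuming it is the Pearson product-moment coefficient (the usual meaning of $r$ for a linear relationship), the definition for $n$ paired observations $(x_i, y_i)$ with sample means $\bar{x}$ and $\bar{y}$ can be written as:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when the two variables tend to sit above or below their means together, and negative when they move in opposite directions. The denominator rescales the result so that it does not depend on the units of measurement.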
The correlation coefficient ($r$) gives us two important pieces of information:
- Direction of the Relationship:
  - Positive ($r > 0$): If one variable increases, the other variable tends to increase as well (and vice versa). Example: Taller people usually weigh more.
  - Negative ($r < 0$): If one variable increases, the other variable tends to decrease (and vice versa). Example: The older a car is, the lower its price usually is.
- Strength of the Relationship:
  - How close the value of $r$ is to +1 or -1 indicates how strong the linear relationship is. The closer to +1 or -1, the stronger the relationship (the data points cluster more closely around a straight line).
  - If the value of $r$ is close to 0, the linear relationship is weak or non-existent (the data points show no straight-line pattern, although a curved relationship may still be present).
Range of Values: The value of the correlation coefficient always lies between -1 and +1.
- $r = +1$: Perfect positive linear correlation.
- $r = -1$: Perfect negative linear correlation.
- $r = 0$: No linear correlation.
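The three reference values above can be checked with a few lines of code. This is a minimal sketch, assuming the Pearson coefficient is the one meant; the function name `pearson_r` and the small data sets are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative data, not real measurements.
hours = [1, 2, 3, 4, 5]
perfect_up = [2 * h + 1 for h in hours]      # exact rising line
perfect_down = [10 - 3 * h for h in hours]   # exact falling line
u_shape = [(h - 3) ** 2 for h in hours]      # symmetric curve, no linear trend

print("rising line: ", pearson_r(hours, perfect_up))
print("falling line:", pearson_r(hours, perfect_down))
print("U-shape:     ", pearson_r(hours, u_shape))
```

The rising and falling lines should give values of +1 and -1 (up to floating-point rounding). The U-shaped data gives a value of 0 even though the second variable is completely determined by the first, which is a reminder that $r$ only measures the straight-line part of a relationship.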
Coefficient of Determination
Sometimes, we want to know how much of the variation (ups and downs in value) in one variable can be explained by the other variable. This measure is called the Coefficient of Determination, which is the square of the correlation coefficient ($r^2$).
For example, if $r = 0.8$ between study hours and exam scores, then $r^2 = 0.64$. This means about 64% of the variation in students' exam scores can be explained by the differences in their study hours. The rest (36%) might be influenced by other factors (intelligence, study methods, etc.).
The value of $r^2$ is always between 0 and 1.
The closer $r^2$ is to 1, the better variable X explains the variation in variable Y.
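To see why the squared coefficient can be read as "share of variation explained", one can fit a least-squares line and compare the variation it leaves unexplained with the total variation. The sketch below assumes a simple straight-line fit of Y on X; the helper name `r_and_explained` and the study-hours data are hypothetical.

```python
import math

def r_and_explained(xs, ys):
    """Return (r, fraction of the variation in ys explained by a line in xs)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)  # total variation in ys
    r = sxy / math.sqrt(sxx * syy)

    # Least-squares line y = intercept + slope * x.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Variation left over after the line has done its best.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    explained = 1 - ss_res / syy
    return r, explained

# Made-up illustrative data: study hours and exam scores for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 60, 57, 70, 68, 80]

r, explained = r_and_explained(hours, scores)
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}, explained share = {explained:.3f}")
```

For a straight-line fit with one explanatory variable, the last two printed numbers agree: the square of the correlation coefficient equals one minus the unexplained share of the variation.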
Correlation Does Not Imply Causation
Just because two variables are strongly correlated doesn't mean one variable causes the change in the other. There might be other unmeasured factors affecting both.
Example:
Ice cream sales and drowning incidents might be positively correlated (both increase in the summer), but that doesn't mean eating ice cream causes drowning. Hot summer weather is the underlying factor that drives both.
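The ice cream example can be imitated with a small simulation. Everything below is synthetic: the temperature range, the coefficients, and the noise levels are arbitrary assumptions chosen only to illustrate the idea, not real sales or accident figures. Temperature drives both series, and neither series influences the other.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def residuals(xs, ys):
    """What is left of ys after removing the least-squares line fitted on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

random.seed(0)  # fixed seed so the run is repeatable
n_days = 500

# Synthetic data: temperature is the common cause of both series.
temperature = [random.uniform(10, 35) for _ in range(n_days)]
ice_cream_sales = [20 * t + random.gauss(0, 40) for t in temperature]
drownings = [0.3 * t + random.gauss(0, 1.5) for t in temperature]

# Correlation between the two series as observed.
r_raw = pearson_r(ice_cream_sales, drownings)

# Correlation after the linear effect of temperature is removed from both.
r_partial = pearson_r(residuals(temperature, ice_cream_sales),
                      residuals(temperature, drownings))

print(f"raw correlation:                   {r_raw:.2f}")
print(f"after accounting for temperature:  {r_partial:.2f}")
```

Under these assumptions the raw correlation comes out strongly positive, while the correlation between the temperature-adjusted series is close to zero. The apparent link between ice cream and drowning disappears once the shared factor is taken into account.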
So, correlation analysis helps us understand the strength and direction of a linear relationship, but it doesn't explain why that relationship exists.