
Linear Regression Concept

What Is Linear Regression?

With Scatter Diagrams, we can see the relationship between two variables (X data and Y data).

Now, if the points on the scatter diagram seem to form a straight pattern (there's a linear correlation, whether positive or negative), we can try to draw a straight line that best fits through the middle of that cluster of points. This line is called the Linear Regression Line. The process of finding this line is called Linear Regression.

The "Best-Fit" Line

The Linear Regression Line is often called the best-fit line. Why? Because out of the many possible straight lines that could be drawn, this is the line whose position is "closest" to all the data points overall. This line attempts to summarize the trend or linear pattern present in the data.

Example of a Regression Line

Let's say we have data on study time (hours) and exam scores again. The points tend to rise (positive correlation).

[Figure: Regression line for the relationship between study time and exam scores. The line shows the linear trend (regression line) of the data.]

See the line above? That is the linear regression line. It shows the general trend: as study time (X) increases, the exam score (Y) also tends to increase, following the direction of the line.

What is the use of this regression line?

One of its main uses is for prediction. For example, if a new student studies for 7 hours, we can use this regression line to estimate what their exam score might be, even though we don't have exact data for 7 hours.
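
As a rough illustration, here is a minimal Python sketch (using NumPy and made-up study-time/score data, so all numbers are purely hypothetical) that fits a regression line and then predicts a score for 7 hours of study:

```python
import numpy as np

# Hypothetical data: study time in hours (x) and exam scores (y)
x = np.array([1, 2, 3, 4, 5, 6, 8])
y = np.array([52, 55, 61, 65, 70, 74, 83])

# Fit a straight line y-hat = a + b*x by least squares.
# np.polyfit returns coefficients from highest degree down: [slope, intercept]
b, a = np.polyfit(x, y, deg=1)

# Predict the exam score for a student who studies 7 hours
predicted = a + b * 7
print(f"slope b = {b:.2f}, intercept a = {a:.2f}")
print(f"predicted score for 7 hours of study: {predicted:.1f}")
```

Notice that 7 hours does not appear in the data at all; the fitted line fills in that gap for us.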

Mathematical Concept

The linear regression line (the best-fit line) is found using a method called the Least Squares Method. The idea is to find the straight line that minimizes the sum of the squared vertical distances from each data point to the line.
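
Written out, the least squares criterion picks the intercept $a$ and slope $b$ that make the sum of squared residuals over the $n$ data points as small as possible:

$$S(a, b) = \sum_{i=1}^{n} \bigl(y_i - (a + b x_i)\bigr)^2$$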

Mathematically, the linear regression line has the form:

$$\hat{y} = a + bx$$

Where:

  • $\hat{y}$ (read: y-hat) is the value of y predicted by the regression line.
  • $x$ is the value of the independent variable.
  • $b$ is the slope of the line, indicating how much $\hat{y}$ changes for each one-unit change in $x$.
  • $a$ is the y-intercept, which is the predicted value of $\hat{y}$ when $x = 0$.
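
To make the roles of $a$ and $b$ concrete, suppose (purely hypothetically) that a fitted line came out as $\hat{y} = 40 + 4.5x$. Then each additional hour of study is predicted to raise the exam score by about 4.5 points, and a student who studies 0 hours is predicted to score about 40.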

The values of $b$ and $a$ are calculated from the $(x, y)$ data we have using the following formulas:

$$b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$
$$a = \bar{y} - b\bar{x}$$

Formula key:

  • $n$ is the number of data pairs.
  • $\sum x$ is the sum of all x values.
  • $\sum y$ is the sum of all y values.
  • $\sum xy$ is the sum of the products of each x and y pair.
  • $\sum x^2$ is the sum of the squares of each x value.
  • $\bar{x}$ is the mean of the x values ($\frac{\sum x}{n}$).
  • $\bar{y}$ is the mean of the y values ($\frac{\sum y}{n}$).

With these formulas, we can obtain the single straight line that is considered to best represent the linear relationship pattern in our data.
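
As a quick sketch of how these formulas are applied in practice, the Python snippet below (using the same made-up study-time and score data as before, so the numbers are only illustrative) computes the sums and then $b$ and $a$ exactly as written above:

```python
# Hypothetical (x, y) data: study time in hours and exam scores
xs = [1, 2, 3, 4, 5, 6, 8]
ys = [52, 55, 61, 65, 70, 74, 83]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Slope: b = [n*Σxy - (Σx)(Σy)] / [n*Σx² - (Σx)²]
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept: a = ȳ - b*x̄
a = sum_y / n - b * (sum_x / n)

print(f"regression line: y-hat = {a:.2f} + {b:.2f}x")
```

The resulting $a$ and $b$ define the regression line, which can then be used for prediction as in the earlier example.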