Objective:
The goal is to create a salary estimator based on the years of experience of the employees of an anonymous company, using a simple Linear Regression model built from scratch and applied to Kaggle's “Salary_Data.csv” dataset. Then I will compare the performance of this model against the prebuilt Sklearn linear regression model.
I won't explain the theory here, only the basic background equations needed to understand the code.
Table of contents
- Dataset
- Linear model definition
- Loss function definition
- Gradient definition
- Gradient descent algorithm
- Training the model on data
- Model evaluation
- Salary estimation
- Comparison with Sklearn model
- Conclusion
Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
1. Dataset
The dataset consists of “YearsExperience” and “Salary” information for 30 employees of an anonymous company.
Get the original dataset here: https://www.kaggle.com/karthickveerakumar/salary-data-simple-linear-regression
Importing dataset
df = pd.read_csv("data/Salary_Data.csv", delimiter= ',')
A simple dataframe visualization using markdown format:
import tabulate
print(df.head(7).to_markdown())
| | YearsExperience | Salary |
|---:|------------------:|---------:|
| 0 | 1.1 | 39343 |
| 1 | 1.3 | 46205 |
| 2 | 1.5 | 37731 |
| 3 | 2 | 43525 |
| 4 | 2.2 | 39891 |
| 5 | 2.9 | 56642 |
| 6 | 3 | 60150 |
A better visualization for this blog is obtained by exporting the dataframe to PNG format:
#!pip install dataframe-image
import dataframe_image as dfi
# I export the dataframe visualization as png
# so I can use markdown to show the dataframe below
dfi.export(df.head(7), "img/linear_regression_dataframe.png")
Data visualization
# setting the plots background color
plt.rcParams.update({'axes.facecolor':'#f8fafc'})
df.plot.scatter('YearsExperience', 'Salary',
figsize=(10,6), color='darkorange',
marker='o', edgecolor='k', s=100)
plt.xlabel('Years Experience', fontsize = 12)
plt.ylabel('Annual Salary (USD)', fontsize = 12)
plt.title('Annual Salary v/s Years of Experience', fontsize = 15)
plt.show()
One can see on the plot that the employees' annual salaries look linearly correlated with years of experience!
2. Linear model definition
In statistics, linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is called simple linear regression.
Let's define y as the real-valued random variable to be explained (the dependent variable) and x as the explanatory (independent) variable. The model assumes that, on average, y is an affine function of x. Writing the model this way implicitly assumes a notion of causality, in the sense that y depends on x:
For a given sequence of m observations
$$ (x_i, y_i), \quad i = 1, \dots, m $$
$$ y_i = \beta_0 + \beta_1 x_i + \epsilon_i $$
$$ \text{Using matrix notation } x = [x_1, x_2, \dots, x_m]^T \text{ and } y = [y_1, y_2, \dots, y_m]^T, $$
$$ \text{the model is: } y = X\beta + \epsilon, $$
$$ \text{where } X = [1 \;\; x] \text{ and } \beta = [\beta_0 \;\; \beta_1]^T. $$
The goal is to find the parameters which would provide the “best” fit in some sense for the data points:
$$ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i $$
Converting data into arrays
To do calculations, we need to convert our data into arrays that will feed our custom model…
x = df["YearsExperience"].values.reshape(-1, 1)
y = df["Salary"].values.reshape(-1, 1)
print(x.shape, y.shape)
(30, 1) (30, 1)
…and add a column of ones to the feature:
$$ X = [1 \;\; x] $$
in order to form the product:
$$ X\beta, \quad \text{where} \quad \beta = [\beta_0 \;\; \beta_1]^T $$
X = np.hstack((np.ones(x.shape), x))
X[0:5]
array([[1. , 1.1],
[1. , 1.3],
[1. , 1.5],
[1. , 2. ],
[1. , 2.2]])
Defining model
def model(X, Beta):
    # linear model: returns the matrix product X.Beta
    return np.matmul(X, Beta)
Parameters initialization
For visualization purposes, I add a constant offset to the initial values of Beta.
Beta = np.random.randn(2, 1) + 10000
print(Beta)
[[10000.63920093]
[ 9999.95848202]]
Checking model’s initial state
y_init = model(X, Beta)
plt.figure(figsize=(10,6))
plt.plot(x, y_init, label='Custom model initial state',
c= 'purple', marker='None', linestyle= 'solid')
plt.scatter(x, y, c= 'darkorange', label='Dataset', marker='o', edgecolor='k', s=100)
plt.legend(facecolor='white')
plt.title('Annual Salary v/s Years of Experience', fontsize = 15)
plt.xlabel('Years Experience', fontsize = 12)
plt.ylabel('Annual Salary (USD)', fontsize = 12)
plt.xticks(np.arange(1, 12, 1))
plt.yticks(np.arange(20000, 140000, 10000))
plt.grid(linestyle='--')
plt.show()
One can clearly see that the model is not yet fitted to our data.
3. Loss function definition using Mean Squared Error (MSE)
From the previous equations, we know that the residuals are defined by:
$$ \epsilon = y - X\beta $$
Then to measure the error of our model we define the Loss function as the mean of the squares of the residuals:
$$ J(\beta) = \frac{1}{m} \sum (y - X\beta)^2 $$
def lossFunction(X, y, Beta):
    m = y.shape[0]  # m is the number of samples
    return 1/m * np.sum(np.square(y - model(X, Beta)))
The Loss is the value that we want to minimize in order to find the best regression line. Let’s check the initial Loss value:
print(lossFunction(X, y, Beta))
199243474.14642814
Notice the high value of our Loss! One could scale the values of the x feature in order to reduce the magnitude of the intermediate calculations… but for this first try we can keep the non-scaled values.
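For reference, such a scaling step could look like the sketch below (x_scaled and X_scaled are my own names; the rest of this post keeps the raw values):
# optional: standardize the feature to zero mean and unit variance,
# which shrinks the loss and gradient magnitudes during training
x_scaled = (x - x.mean()) / x.std()
X_scaled = np.hstack((np.ones(x_scaled.shape), x_scaled))
# caution: coefficients learned on X_scaled live in the scaled space and
# would have to be transformed back before being read as USD per year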
4. Gradient definition
Now we calculate the gradient of the Loss function:
$$ \frac{\partial J(\beta)}{\partial \beta} = - \frac{2}{m} X^T(y - X\beta) $$
For the calculations we absorb the minus sign and compute the same gradient in the equivalent form:
$$ \frac{\partial J(\beta)}{\partial \beta} = \frac{2}{m} X^T(X\beta - y) $$
def gradient(X, y, Beta):
    m = y.shape[0]
    # gradient of the MSE loss, written as 2/m * X^T.(X.Beta - y)
    return 2/m * np.matmul(X.T, model(X, Beta) - y)
Minimizing the Loss function means that, at the end of training, its gradient should be close to zero. Let's check the initial values of the gradient:
print(gradient(X, y, Beta))
[[ -25738.49612919]
[-128193.59810319]]
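Before using this gradient for training, we can sanity-check the analytic formula against a finite-difference approximation (numericalGradient is a helper of mine, not part of the original post); the two results should agree to several digits:
def numericalGradient(X, y, Beta, h=1e-4):
    # central finite differences: dJ/dBeta_j ~ (J(Beta + h*e_j) - J(Beta - h*e_j)) / (2h)
    grad = np.zeros_like(Beta)
    for j in range(Beta.shape[0]):
        e = np.zeros_like(Beta)
        e[j] = h
        grad[j] = (lossFunction(X, y, Beta + e) - lossFunction(X, y, Beta - e)) / (2 * h)
    return grad

print(numericalGradient(X, y, Beta))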
5. Gradient descent algorithm
At each iteration, we update the parameters of the model using the gradient descent rule:
$$\beta' = \beta - \alpha \frac{\partial J(\beta) }{\partial \beta}$$
def gradientDescent(X, y, Beta, learning_rate, n_iterations):
    loss_history = np.zeros(n_iterations)
    for i in range(n_iterations):
        # move the parameters against the gradient and record the loss
        Beta = Beta - learning_rate * gradient(X, y, Beta)
        loss_history[i] = lossFunction(X, y, Beta)
    return Beta, loss_history, gradient(X, y, Beta)
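The loop above always runs the full number of iterations. As a variant (just a sketch; gradientDescentTol and the tolerance value are my own choices, not part of the original post), one could stop early once the gradient norm is almost zero:
def gradientDescentTol(X, y, Beta, learning_rate, n_iterations, tol=1e-2):
    # same update rule, but stop as soon as the gradient norm drops below tol
    for i in range(n_iterations):
        g = gradient(X, y, Beta)
        if np.linalg.norm(g) < tol:
            break
        Beta = Beta - learning_rate * g
    return Beta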
6. Training the model on data
First I define the hyperparameters:
N_ITERATIONS = 20000
LEARNING_RATE = 0.001
Then I train the model on the data…
Final_Beta, loss_history, Final_gradient = gradientDescent(X, y, Beta, LEARNING_RATE, N_ITERATIONS)
… check the final values of the gradient (close to zero, as expected!) …
print(Final_gradient)
[[-1.38763209]
[ 0.20591879]]
… and get the final regression coefficients:
print(f"Intercept = {Final_Beta[0][0]:.2f}")
print(f"coefficient = {Final_Beta[1][0]:.2f}")
Intercept = 25788.92
coefficient = 9450.45
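Before looking at the learning curve, we can cross-check these coefficients against the closed-form ordinary least squares solution (a verification I added; Beta_closed is my own name, and the formula assumes that X^T X is invertible, which holds here):
# closed-form OLS solution: Beta = (X^T X)^(-1) X^T y
Beta_closed = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(Beta_closed)
The gradient descent coefficients should match these values almost exactly, up to the small remaining gradient.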
Let's have a look at the learning curve:
plt.figure(figsize=(10,6))
plt.plot(range(N_ITERATIONS), loss_history, c= 'purple')
plt.title('Learning curve', fontsize = 15)
plt.xlabel('Iterations', fontsize = 12)
plt.ylabel('Loss value', fontsize = 12)
plt.show()
7. Model evaluation
Regression line visualization
y_pred_custom = model(X, Final_Beta)
Let's have a look at the regression line:
plt.figure(figsize=(10,6))
plt.plot(x, y_pred_custom, label='Custom model prediction',
c= 'purple', marker='None', linestyle= 'solid')
plt.scatter(x, y, c= 'darkorange', label='Dataset', marker='o', edgecolor='k', s=100)
plt.legend(facecolor='white', loc="upper left")
plt.title('Annual Salary v/s Years of Experience', fontsize = 15)
plt.xlabel('Years Experience', fontsize = 12)
plt.ylabel('Annual Salary (USD)', fontsize = 12)
plt.xticks(np.arange(1, 12, 1))
plt.yticks(np.arange(30000, 140000, 10000))
plt.grid(linestyle='--')
plt.show()
We can see on the plot that this time the model fits the data pretty well!
Coefficient of determination
This coefficient indicates the quality of predictions made with our model.
$$ R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} $$
def determinationCoef(y, y_pred):
    SSR = ((y - y_pred)**2).sum()    # residual sum of squares (SSR)
    SST = ((y - y.mean())**2).sum()  # total sum of squares (SST)
    return 1 - SSR/SST
print(f"R2 = {determinationCoef(y, y_pred_custom):0.2f}")
R2 = 0.96
We almost reach the maximum value of 1! The model can accurately estimate the annual salary of an employee in the company given their years of experience.
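To double-check this value, we can compare against scikit-learn's own implementation of the coefficient of determination (a quick cross-check I added; r2_score lives in sklearn.metrics and should agree with the value above):
from sklearn.metrics import r2_score
# r2_score implements the same 1 - SSR/SST formula
print(f"R2 (sklearn) = {r2_score(y, y_pred_custom):0.2f}")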
8. Salary estimation
Now let's imagine that we want an estimate of the salary for someone who has 12 years of experience in the company.
x_estimation = np.array([12])
X_estimation = np.hstack((np.ones(x_estimation.shape), x_estimation))
y_estimation = model(X_estimation, Final_Beta)
print(f'The estimated annual salary of someone who has 12 years of experience is \
{y_estimation[0]:.0f} USD \n... according to our dataset.')
The estimated annual salary of someone who has 12 years of experience is 139194 USD
... according to our dataset.
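To estimate several values at once, the two steps above could be wrapped in a small helper (estimate_salary is a hypothetical convenience function of mine, not part of the original code):
def estimate_salary(years_list, Beta=Final_Beta):
    # build the [1, x] design matrix for the new inputs and apply the model
    x_new = np.asarray(years_list, dtype=float).reshape(-1, 1)
    X_new = np.hstack((np.ones(x_new.shape), x_new))
    return model(X_new, Beta).ravel()

print(estimate_salary([5, 12]))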
Let’s visualize the result on the plot:
plt.figure(figsize=(10,6))
plt.scatter(x_estimation, y_estimation, label='Estimation',
c= 'yellowgreen', marker='o', edgecolor='k', s=200)
plt.plot(x, y_pred_custom, label='Custom model prediction',
c= 'purple', marker='None', linestyle= 'solid')
plt.scatter(x, y, c= 'darkorange', label='Dataset', marker='o', edgecolor='k', s=100)
plt.legend(facecolor='white', loc="upper left")
plt.title('Annual Salary v/s Years of Experience', fontsize = 15)
plt.xlabel('Years Experience', fontsize = 12)
plt.ylabel('Annual Salary (USD)', fontsize = 12)
plt.xticks(np.arange(1, 13, 1))
plt.yticks(np.arange(30000, 150000, 10000))
plt.grid(linestyle='--')
plt.show()
One can see on plot the estimated salary for someone who has 12 years of experience in the company.
9. Comparison with sklearn LinearRegression model
Scikit-Learn already provides a well-built, general-purpose linear regression model (for linear, multiple linear, or polynomial regression).
I will apply it to the Salary_Data.csv dataset in order to check the quality of my custom linear regression model against it.
As I did previously, I define the X and y variables:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values.reshape(-1, 1)
print(X.shape, y.shape)
(30, 1) (30, 1)
Next I define a linear regression model using Sklearn library…
from sklearn.linear_model import LinearRegression
LR_model = LinearRegression()
…fit the model to the data…
LR_model.fit(X, y)
LinearRegression()
…and get the resulting regression coefficients
print(f"Intercept = {LR_model.intercept_[0]:.2f}")
print(f"coefficient = {LR_model.coef_[0][0]:.2f}")
Intercept = 25792.20
coefficient = 9449.96
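Scikit-learn can also report the coefficient of determination of its own fit directly (my addition; for regressors, the score method returns R2):
print(f"R2 = {LR_model.score(X, y):0.2f}")
It should agree with the R2 obtained for the custom model above.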
Now let's see how well the model fits the data:
y_pred_sklearn = LR_model.predict(X)
plt.figure(figsize=(10,6))
plt.plot(X, y_pred_sklearn, label='Sklearn model prediction',
c= 'darkturquoise', marker='None', linestyle= 'solid')
plt.scatter(X, y, c= 'darkorange', label='Dataset', marker='o', edgecolor='k', s=100)
plt.legend(facecolor='white', loc="upper left")
plt.title('Annual Salary v/s Years of Experience', fontsize = 15)
plt.xlabel('Years Experience', fontsize = 12)
plt.ylabel('Annual Salary (USD)', fontsize = 12)
plt.xticks(np.arange(1, 12, 1))
plt.yticks(np.arange(30000, 140000, 10000))
plt.grid(linestyle='--')
plt.show()
Salary estimation:
X_estimation = np.array([12]).reshape(-1,1)
y_estimation = LR_model.predict(X_estimation)
print(f'The estimated annual salary of someone who has 12 years of experience is \
{y_estimation[0][0]:.0f} USD')
The estimated annual salary of someone who has 12 years of experience is 139192 USD
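As one last numerical check (my addition), we can compare the two models' predictions on the training points directly; they should differ by only a few USD:
# largest absolute disagreement between custom and Sklearn predictions
print(np.abs(y_pred_custom - y_pred_sklearn).max())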
10. Conclusion
The custom model built from scratch predicts salary from years of experience quite well and fits the data closely: I get an R2 determination coefficient of 0.96 (close to 1), which indicates good-quality predictions.
Comparing the regression coefficients and the estimated salaries for both models, one can see that the custom model gives practically the same results as the Sklearn model.
import tabulate
import pandas as pd
results = {
'Model': ["Custom LR model", 'Sklearn LR model'],
'Intercept': [25788.92, 25792.20],
'coefficient': [9450.45, 9449.96],
'12 Years exp. estimated Ann. Salary': [139194, 139192]}
df_results = pd.DataFrame(data=results).set_index('Model')
print(df_results.to_markdown())
| Model | Intercept | coefficient | 12 Years exp. estimated Ann. Salary |
|:-----------------|------------:|--------------:|--------------------------------------:|
| Custom LR model | 25788.9 | 9450.45 | 139194 |
| Sklearn LR model | 25792.2 | 9449.96 | 139192 |
Now we have a rather precise tool which can help human resources estimate the annual salary of an employee of this anonymous company according to their years of experience in the company.