When I set out to truly understand linear regression, I didn’t just want to call sklearn.fit() and move on. I wanted to implement it from scratch using both the closed-form mathematical solution and iterative numerical optimization. This post compares three approaches to salary prediction: an analytic solution using the least squares method, a numerical solution using gradient descent, and the scikit-learn implementation.
→ View Full Implementation on Kaggle
→ View Full Implementation on GitHub
Task: Predict salary as a function of years of experience using univariate linear regression
Datasets tested:
This project demonstrates the mathematical foundations of linear regression by implementing both analytic and numerical methods from scratch, and validates these implementations against scikit-learn’s optimized library.
The analytic solution employs the ordinary least squares (OLS) method to determine model parameters in closed form. Given the linear model:
\[f(X) = \beta_0 + X\beta_1 \quad \text{or} \quad \hat{\mathbf{y}} = \mathbf{X}_b\boldsymbol{\beta}\]

where $\mathbf{X}_b$ is the feature matrix augmented with a column of ones, $\boldsymbol{\beta} = (\beta_0, \beta_1)^T$ collects the intercept and slope, and $\hat{\mathbf{y}}$ is the vector of predictions.
The parameters $\boldsymbol{\beta}$ are computed by minimizing the residual sum of squares (RSS):
\[\text{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^{N}(y_i - f(x_i))^2 = (\mathbf{y} - \mathbf{X}_b\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}_b\boldsymbol{\beta})\]

Expanding this expression:

\[\text{RSS}(\boldsymbol{\beta}) = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}_b^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}_b^T\mathbf{X}_b\boldsymbol{\beta}\]

Taking the derivative with respect to $\boldsymbol{\beta}$ and setting it to zero:

\[\frac{\partial \text{RSS}}{\partial \boldsymbol{\beta}} = -2\mathbf{X}_b^T\mathbf{y} + 2\mathbf{X}_b^T\mathbf{X}_b\boldsymbol{\beta} = 0\]

Solving for $\boldsymbol{\beta}$ yields the Normal Equation:

\[\boldsymbol{\beta} = (\mathbf{X}_b^T\mathbf{X}_b)^{-1}\mathbf{X}_b^T\mathbf{y}\]

This closed-form solution is exact (up to numerical precision) and requires no hyperparameter tuning.
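The Normal Equation translates almost directly into NumPy. The sketch below uses placeholder data rather than the project's dataset (the project's `univariate_linear_regression()` method implements the same idea); `np.linalg.solve` is used instead of forming the explicit inverse, which is generally more stable.

```python
import numpy as np

# Placeholder data: years of experience vs. salary (illustrative values only)
X = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
y = np.array([40.0, 45.0, 52.0, 66.0, 85.0])  # salary in thousands

# Prepend a column of ones so beta[0] is the intercept and beta[1] the slope
X_b = np.c_[np.ones_like(X), X]

# Normal Equation: beta = (X_b^T X_b)^(-1) X_b^T y,
# solved as a linear system rather than via an explicit matrix inverse
beta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
print("intercept, slope:", beta)
```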
Advantages:
Limitations:
The numerical approach implements gradient descent optimization to iteratively find the optimal parameters. Instead of computing the solution directly, we start with random parameters and gradually improve them.
We use the Mean Squared Error (MSE) as our cost function:
\[J(w, b) = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2\]

where $\hat{y}_i = wx_i + b$ is the prediction, $w$ is the weight (slope), and $b$ is the bias (intercept).
The parameters are updated iteratively using the gradient of the cost function:
\[w = w - \alpha \frac{\partial J}{\partial w}, \quad b = b - \alpha \frac{\partial J}{\partial b}\]

where $\alpha$ is the learning rate, and the gradients are:
\[\frac{\partial J}{\partial w} = -\frac{2}{N}\sum_{i=1}^{N} x_i(y_i - \hat{y}_i)\]

\[\frac{\partial J}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)\]

Advantages:
Limitations:
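The update rule above amounts to a short loop. The following is a minimal standalone sketch (not the project's `Gradient_descent()` method); the data, learning rate, and iteration count are placeholders chosen for illustration.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.01, iterations=1000):
    """Fit y ≈ w*X + b by repeatedly stepping along the negative MSE gradient."""
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(iterations):
        y_hat = w * X + b
        # Gradients of the MSE cost with respect to w and b
        dw = -(2.0 / n) * np.sum(X * (y - y_hat))
        db = -(2.0 / n) * np.sum(y - y_hat)
        # Move downhill, scaled by the learning rate
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b

# Example usage with placeholder data (salary in thousands)
X = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
y = np.array([40.0, 45.0, 52.0, 66.0, 85.0])
w, b = gradient_descent(X, y, learning_rate=0.02, iterations=5000)
print(f"slope ≈ {w:.3f}, intercept ≈ {b:.3f}")
```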
Scikit-learn’s LinearRegression provides a highly optimized implementation. Under the hood, it delegates to efficient LAPACK least-squares routines (an SVD-based solver, via scipy.linalg.lstsq for dense input) rather than explicitly inverting $\mathbf{X}_b^T\mathbf{X}_b$, which gives better numerical stability.
```python
from sklearn.linear_model import LinearRegression

# X: years of experience, shape (n_samples, 1); y: salaries, shape (n_samples,)
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)
```
This serves as our benchmark for validating the custom implementations.
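As a quick illustration of that validation, the scikit-learn coefficients can be checked against the Normal Equation solution. This is a sketch using the same placeholder data as above, not the project's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data (illustrative only)
X = np.array([[1.0], [2.0], [3.0], [5.0], [8.0]])
y = np.array([40.0, 45.0, 52.0, 66.0, 85.0])

# Closed-form solution via the Normal Equation
X_b = np.c_[np.ones(len(X)), X]
beta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# scikit-learn benchmark
model = LinearRegression().fit(X, y)

# The two should agree up to floating-point precision
assert np.allclose(beta, [model.intercept_, model.coef_[0]])
```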
The code is organized as a Python class containing three main methods:
- `univariate_linear_regression()`: Analytic solution using matrix operations
- `Gradient_descent()`: Numerical solution with iterative optimization
- `skleanrn_LR()`: Wrapper for scikit-learn's implementation

The implementation includes comprehensive visualizations:
Both custom implementations are validated using the coefficient of determination:

\[R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}\]

where $\bar{y}$ is the mean of the true values. $R^2 = 1$ indicates a perfect fit; $R^2 = 0$ indicates the model performs no better than predicting the mean.
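A minimal implementation of this metric is shown below; it should agree with `sklearn.metrics.r2_score` up to floating-point precision.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)   # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot
```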
Analytic (Normal Equation):
Gradient Descent:
Scikit-learn:
Linear regression with MSE is a convex optimization problem. The cost function forms a bowl-shaped surface (quadratic) with a single global minimum. This guarantees that:
Visualizing the cost surface reveals how gradient descent navigates the parameter space. It takes steps proportional to the negative gradient, always moving downhill, eventually settling at the minimum where $\nabla J = 0$.
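One way to produce such a visualization is to evaluate $J(w, b)$ on a grid of parameter values and draw its contours. The sketch below uses placeholder data and arbitrary grid ranges.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data (illustrative only)
X = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
y = np.array([40.0, 45.0, 52.0, 66.0, 85.0])

# Evaluate the MSE cost J(w, b) on a grid of (slope, intercept) values
w_grid, b_grid = np.meshgrid(np.linspace(0, 12, 100), np.linspace(20, 50, 100))
J = np.mean((y[:, None, None] - (w_grid * X[:, None, None] + b_grid)) ** 2, axis=0)

# Contours of the bowl-shaped (convex) cost surface
plt.contour(w_grid, b_grid, J, levels=30)
plt.xlabel("w (slope)")
plt.ylabel("b (intercept)")
plt.title("MSE cost surface")
plt.show()
```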
1. Real-world salary data:
2. Synthetic linear data:
3. Synthetic data with offset:
Each example includes:
Possible extensions to explore:
Libraries Used:
Code Organization:
```python
class LinearRegressionComparison:
    def univariate_linear_regression(self, X, y): ...
    def Gradient_descent(self, X, y, learning_rate, iterations): ...
    def skleanrn_LR(self, X, y): ...
```
View the complete implementation on Kaggle →
Hastie, T., Tibshirani, R., & Friedman, J. (2017). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.
Tags: #MachineLearning #LinearRegression #GradientDescent #Mathematics #NumPy #FromScratch #Optimization
*Created: May 2022 | Published: December 2025*