When I set out to build a digit classifier from scratch, I wanted to truly understand what happens under the hood of modern neural networks. This post is a deep dive into the mathematics behind a three-layer neural network trained with the Adam optimization algorithm: no black boxes, just pure NumPy and math.
→ View Full Implementation on GitHub
Modern deep learning frameworks like TensorFlow and PyTorch are powerful, but they abstract away the underlying mathematics. Building a neural network from first principles helps us understand:
- how activations and the forward pass turn raw pixels into predictions
- how backpropagation and the chain rule assign responsibility for the error to each weight
- how optimizers like Adam actually update the parameters
Task: Classify handwritten digits (0-9) from 28×28 grayscale images
Network Architecture: 784 input neurons → 128 (ReLU) → 40 (ReLU) → 10 (Softmax); see the configuration table at the end of this post.
Total Parameters: 106,050 trainable weights and biases
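To make the shapes concrete, here is a quick NumPy sketch that builds parameter tensors of these sizes and verifies the count. He-style initialization is my assumption here; only the layer sizes and totals come from the architecture above.

```python
import numpy as np

# Layer sizes from the configuration table; He-style init is an assumption,
# only the shapes and parameter counts come from the post.
layer_sizes = [784, 128, 40, 10]

params, total = {}, 0
for l in range(1, len(layer_sizes)):
    fan_in, fan_out = layer_sizes[l - 1], layer_sizes[l]
    params[f"W{l}"] = np.random.randn(fan_out, fan_in) * np.sqrt(2.0 / fan_in)
    params[f"b{l}"] = np.zeros((fan_out, 1))
    total += fan_out * fan_in + fan_out

print(total)  # 106050 trainable weights and biases
```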
Neural networks need non-linearity to learn complex patterns. I used two key activation functions:
ReLU (Rectified Linear Unit) for hidden layers:
\[\text{ReLU}(x) = \max(0, x)\]Why ReLU? It’s simple, fast, and avoids the vanishing gradient problem that plagues sigmoid activations in deep networks. The derivative is equally simple:
\[\frac{d\text{ReLU}(x)}{dx} = \begin{cases} 0 & \text{if } x \leq 0 \\ 1 & \text{if } x > 0 \end{cases}\]Softmax for the output layer:
\[\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{10} e^{x_j}}\]Softmax converts raw scores into a probability distribution, perfect for multi-class classification. The outputs sum to 1, making them interpretable as class probabilities.
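In NumPy, both activations and the ReLU derivative are only a few lines each. A minimal sketch (the max-subtraction in softmax is a standard numerical-stability trick and an assumption on my part, not necessarily what the original code does):

```python
import numpy as np

def relu(z):
    # max(0, z), applied element-wise
    return np.maximum(0, z)

def relu_derivative(z):
    # 1 where z > 0, 0 elsewhere; acts as a gradient gate during backprop
    return (z > 0).astype(z.dtype)

def softmax(z):
    # z has shape (10, m): one column of raw scores per sample
    z_shifted = z - z.max(axis=0, keepdims=True)  # subtract max for numerical stability
    exp_z = np.exp(z_shifted)
    return exp_z / exp_z.sum(axis=0, keepdims=True)
```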
I used Mean Squared Error (MSE) to measure prediction quality:
\[J = \frac{1}{N} \sum_{i=1}^{N} (y_{\text{true}}^{(i)} - y_{\text{pred}}^{(i)})^2\]While cross-entropy is more common for classification, MSE works well here and simplifies the math for backpropagation.
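A direct NumPy translation of this loss, assuming one-hot labels and predictions stored as (10, m) arrays:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    # y_true, y_pred: shape (10, m) -- one-hot labels and softmax outputs
    m = y_true.shape[1]
    # Sum the squared error per sample, then average over the N (= m) samples
    return np.sum((y_true - y_pred) ** 2) / m
```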
Forward propagation is the process of transforming input images into predictions. Each layer performs two operations: a linear transformation followed by an activation:
\[Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}, \quad A^{[l]} = g^{[l]}(Z^{[l]})\]where:
- $W^{[l]}$ and $b^{[l]}$ are the weights and biases of layer $l$
- $A^{[l-1]}$ is the previous layer's activation, with $A^{[0]} = X$ (the flattened input image)
- $g^{[l]}$ is ReLU for the hidden layers and Softmax for the output layer
The final output $A^{[3]}$ gives us 10 probabilities, one for each digit class.
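Stacked together, the forward pass is three matrix multiplications plus activations. A minimal sketch, assuming the (784, m) input layout and the activation helpers and parameter dictionary sketched earlier:

```python
def forward(X, params):
    # Layer 1: 784 -> 128, ReLU
    Z1 = params["W1"] @ X + params["b1"]
    A1 = relu(Z1)
    # Layer 2: 128 -> 40, ReLU
    Z2 = params["W2"] @ A1 + params["b2"]
    A2 = relu(Z2)
    # Layer 3: 40 -> 10, Softmax
    Z3 = params["W3"] @ A2 + params["b3"]
    A3 = softmax(Z3)
    # Cache the intermediate values needed for backpropagation
    cache = {"X": X, "Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2, "Z3": Z3, "A3": A3}
    return A3, cache
```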
Backpropagation computes how much each weight contributed to the error, allowing us to update them in the right direction. This is where calculus and the chain rule shine.
Starting from the output, the error signal takes a remarkably simple form:
\[\delta^{[3]} = A^{[3]} - Y\]This is just the difference between predictions and true labels! (Strictly speaking, this is the exact gradient when softmax is paired with cross-entropy; with MSE it is a convenient simplification that skips the softmax Jacobian.) From here we can compute weight and bias gradients:
\[\frac{\partial J}{\partial W^{[3]}} = \frac{1}{m}\,\delta^{[3]} \cdot (A^{[2]})^T\] \[\frac{\partial J}{\partial b^{[3]}} = \frac{1}{m}\sum_{i=1}^{m} \delta_i^{[3]}\]For hidden layers, we propagate the error backward using the chain rule:
\[\delta^{[2]} = (W^{[3]})^T \delta^{[3]} \odot \text{ReLU}'(Z^{[2]})\]The $\odot$ symbol means element-wise multiplication (Hadamard product). The ReLU derivative acts as a gate—it only lets gradients flow through neurons that were active (positive) during forward propagation.
We repeat this process for all layers, computing gradients from output to input.
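In NumPy, the whole backward pass is a handful of lines. A sketch that mirrors the equations above, reusing the cache from the forward pass and applying the 1/m averaging to both weight and bias gradients:

```python
def backward(Y, params, cache):
    # Y: one-hot labels, shape (10, m)
    m = Y.shape[1]
    grads = {}

    # Output layer: delta3 = A3 - Y
    delta3 = cache["A3"] - Y
    grads["W3"] = (delta3 @ cache["A2"].T) / m
    grads["b3"] = delta3.sum(axis=1, keepdims=True) / m

    # Hidden layer 2: propagate the error through W3, gate with ReLU'
    delta2 = (params["W3"].T @ delta3) * relu_derivative(cache["Z2"])
    grads["W2"] = (delta2 @ cache["A1"].T) / m
    grads["b2"] = delta2.sum(axis=1, keepdims=True) / m

    # Hidden layer 1
    delta1 = (params["W2"].T @ delta2) * relu_derivative(cache["Z1"])
    grads["W1"] = (delta1 @ cache["X"].T) / m
    grads["b1"] = delta1.sum(axis=1, keepdims=True) / m

    return grads
```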
Standard gradient descent updates weights by simply subtracting the gradient times a learning rate:
\[W = W - \alpha \nabla J\]but plain gradient descent can be slow, gets stuck or oscillates in ravines of the loss surface, and applies the same learning rate to every parameter. Enter Adam (Adaptive Moment Estimation), which combines the best ideas from momentum and RMSprop.
Adam maintains two moving averages for each parameter: a first moment (an exponentially decaying average of the gradients) and a second moment (an exponentially decaying average of the squared gradients).
Momentum equation:
\[m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla J\]Think of momentum as velocity: it helps the optimizer build up speed in directions with consistent gradients and dampens oscillations.
RMSprop equation:
\[v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla J)^2\]This tracks the “variance” of gradients. Parameters with large, noisy gradients get smaller effective learning rates.
Since we initialize $m_0 = 0$ and $v_0 = 0$, early estimates are biased toward zero. Adam corrects this:
\[\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}\]Finally, we update parameters using both moments:
\[W_t = W_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\]Each parameter gets its own adaptive learning rate based on its gradient history. This makes Adam robust and fast.
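Collected into one function, the update for a single parameter tensor looks like this. A sketch only; the default values (α = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e-8) are the Adam paper's suggestions and an assumption on my part, not necessarily the post's exact settings:

```python
import numpy as np

def adam_update(param, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponentially decaying average of gradients (momentum)
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponentially decaying average of squared gradients (RMSprop)
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized moments
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter adaptive step
    param = param - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```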
Hyperparameters I used:
Putting it all together, each training iteration follows this sequence (a sketch of the loop appears below):
1. Forward propagation: compute $Z^{[l]}$ and $A^{[l]}$ for each layer, ending with the predictions $A^{[3]}$
2. Loss: evaluate the MSE cost $J$
3. Backpropagation: compute the error signals $\delta^{[l]}$ and the gradients for every weight and bias
4. Adam update: refresh the moment estimates and apply the adaptive parameter update
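Here is what one such iteration looks like when the pieces above are stitched together; a sketch only, with batching, shuffling, and logging left out:

```python
import numpy as np

# Adam state: one (m, v) pair per parameter tensor, plus a global step counter t
adam_state = {k: (np.zeros_like(p), np.zeros_like(p)) for k, p in params.items()}

def train_step(X_batch, Y_batch, params, adam_state, t, alpha=0.001):
    A3, cache = forward(X_batch, params)       # 1. forward pass
    loss = mse_loss(Y_batch, A3)               # 2. loss, for monitoring
    grads = backward(Y_batch, params, cache)   # 3. backpropagation
    for k in params:                           # 4. Adam update per parameter
        m, v = adam_state[k]
        params[k], m, v = adam_update(params[k], grads[k], m, v, t, alpha=alpha)
        adam_state[k] = (m, v)
    return loss
```

The step counter t starts at 1 and increments after every update, which is what the bias-correction terms $1 - \beta_1^t$ and $1 - \beta_2^t$ expect.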
Building this network from scratch revealed several key insights:
Accuracy: Measures the proportion of correct predictions
\[\text{Accuracy} = \frac{1}{m}\sum_{i=1}^{m} \mathbb{1}[\arg\max(y_{\text{true}}^{(i)}) = \arg\max(y_{\text{pred}}^{(i)})]\]where $\mathbb{1}[\cdot]$ is the indicator function that returns 1 if the condition is true, 0 otherwise. In other words: for each sample, we check if the predicted class (highest probability) matches the true class (position of 1 in one-hot encoding).
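With one-hot labels and softmax outputs stored as (10, m) arrays, this is essentially a one-liner in NumPy (a sketch):

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Predicted class = index of the largest softmax output; true class = position of the 1
    return np.mean(np.argmax(y_pred, axis=0) == np.argmax(y_true, axis=0))
```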
R² Score: Measures fit quality (1 = perfect; 0 = no better than always predicting the mean; negative values indicate a worse fit than the mean)
\[R^2 = 1 - \frac{\sum_i (y_{\text{true}}^{(i)} - y_{\text{pred}}^{(i)})^2}{\sum_i (y_{\text{true}}^{(i)} - \bar{y}_{\text{true}})^2}\]
Possible extensions to explore:
Network Configuration:
| Layer | Input Size | Output Size | Activation | Parameters |
|---|---|---|---|---|
| Input | 784 | 784 | - | 0 |
| Hidden 1 | 784 | 128 | ReLU | 100,480 |
| Hidden 2 | 128 | 40 | ReLU | 5,160 |
| Output | 40 | 10 | Softmax | 410 |
View the complete implementation on GitHub →
Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980.
Tags: #DeepLearning #NeuralNetworks #Adam #Optimization #NumPy #FromScratch #Mathematics
Have questions or suggestions? Feel free to reach out or open an issue on the GitHub repository!