Every formula, every abbreviation, every concept — explained simply, with real math.
The human brain has ~86 billion neurons. Each neuron receives signals, processes them, and fires an output. Deep Learning copies this idea with math.
Dendrites receive signals → Soma (cell body) sums them → Axon fires if threshold is crossed
Inputs x multiply by weights w → Sum → Activation function → Output
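A minimal NumPy sketch of one such artificial neuron, with made-up example inputs and weights (the bias term shifts the firing threshold):

```python
import numpy as np

def step(z):
    """Step activation: fire (1) if the threshold 0 is crossed, else 0."""
    return 1.0 if z >= 0 else 0.0

x = np.array([0.5, 0.3, 0.8])   # inputs ("dendrite signals"), made-up values
w = np.array([0.4, -0.2, 0.6])  # weights, made-up values
b = -0.5                        # bias shifts the firing threshold

z = np.dot(w, x) + b            # weighted sum (the "soma")
y = step(z)                     # activation (the "axon" fires or not)
print(z, y)                     # z is about 0.12 -> fires 1
```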
McCulloch–Pitts, 1943
Without an activation function, no matter how many layers you stack, the network is just a linear function. Activation adds non-linearity — the power to learn complex patterns.
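A quick NumPy check of that claim: two purely linear layers compose into a single linear layer with the merged weight matrix, so depth adds nothing without an activation in between (random example matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 4))   # layer 1 weights
W2 = rng.normal(size=(3, 5))   # layer 2 weights
x  = rng.normal(size=4)        # an input vector

two_linear_layers = W2 @ (W1 @ x)   # "deep" network with no activation
one_linear_layer  = (W2 @ W1) @ x   # a single merged linear layer
print(np.allclose(two_linear_layers, one_linear_layer))   # True
```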
| Name | Formula | Output Range | Use Case |
|---|---|---|---|
| Step | 1 if z≥0, else 0 | {0, 1} | Original MCP (not used today) |
| Sigmoid (σ) | σ(z) = 1 / (1 + e⁻ᶻ) | (0, 1) | Binary output, probabilities |
| Tanh | (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ) | (−1, 1) | Hidden layers (zero-centered) |
| ReLU | max(0, z) | [0, ∞) | Hidden layers — most popular |
| Leaky ReLU | max(αz, z), α≈0.01 | (−∞, ∞) | Fixes "dying ReLU" problem |
| Softmax | exp(zᵢ) / Σⱼ exp(zⱼ) | (0, 1), sum = 1 | Multi-class output layer |
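For reference, a minimal NumPy sketch of the activations in the table (applied element-wise; the max subtraction in softmax is a numerical-stability detail, not part of the formula):

```python
import numpy as np

def sigmoid(z):                  # output in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                     # output in (-1, 1), zero-centered
    return np.tanh(z)

def relu(z):                     # output in [0, inf)
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):   # small slope alpha for z < 0
    return np.maximum(alpha * z, z)

def softmax(z):                  # outputs in (0, 1) and sum to 1
    e = np.exp(z - np.max(z))    # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
print(relu(z))                   # [2.  0.  0.5]
print(softmax(z))                # [0.79 0.04 0.18] (rounded)
```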
The perceptron takes inputs, applies weights, and passes the sum through an activation function. It can classify only linearly separable data.
A multi-layer perceptron (MLP) stacks layers of neurons. Each layer extracts more abstract features from the previous layer's output.
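A minimal forward-pass sketch of such a stack, using made-up layer sizes and random weights: each layer is a weighted sum followed by an activation, and the output layer turns scores into class probabilities with softmax:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes: 4 inputs -> 8 hidden neurons -> 3 output classes
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)

def forward(x):
    h = np.maximum(0.0, W1 @ x + b1)   # hidden layer: weighted sum + ReLU
    z = W2 @ h + b2                    # output layer: weighted sum (scores)
    e = np.exp(z - z.max())
    return e / e.sum()                 # softmax -> probabilities over 3 classes

x = rng.normal(size=4)                 # one made-up input example
print(forward(x))                      # three probabilities summing to 1
```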
| Name | Abbreviation | Formula | When to use |
|---|---|---|---|
| Mean Squared Error | MSE | (1/n) Σ(y − ŷ)² | Regression |
| Binary Cross-Entropy | BCE | −[y log(ŷ) + (1−y)log(1−ŷ)] | Binary classification |
| Categorical Cross-Entropy | CCE | −Σ yₖ log(ŷₖ) | Multi-class (MNIST!) |
| Mean Absolute Error | MAE | (1/n) Σ|y − ŷ| | Regression, robust |
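A minimal NumPy sketch of two of these losses on a made-up prediction (the true class is 2, predicted with 70% confidence):

```python
import numpy as np

y_true = np.array([0.0, 0.0, 1.0])      # one-hot label: the true class is 2
y_pred = np.array([0.1, 0.2, 0.7])      # the network's softmax output

mse = np.mean((y_true - y_pred) ** 2)   # Mean Squared Error
cce = -np.sum(y_true * np.log(y_pred))  # Categorical Cross-Entropy

print(round(mse, 4))   # 0.0467
print(round(cce, 4))   # 0.3567  (= -log(0.7))
```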
After computing the loss, we go backwards to compute how much each weight contributed to the error.
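A minimal sketch of that idea for a single weight: one linear neuron with a squared-error loss, where the chain rule gives the gradient and each update nudges the weight against it (made-up data point):

```python
# One linear neuron y_hat = w * x with squared-error loss L = (y - y_hat)^2
x, y = 2.0, 10.0    # one made-up training example
w = 1.0             # initial weight
eta = 0.05          # learning rate

for _ in range(20):
    y_hat = w * x                   # forward pass
    grad = 2 * (y_hat - y) * x      # dL/dw via the chain rule
    w -= eta * grad                 # update: move w against its gradient
print(round(w, 3))                  # 5.0, the value that makes w * x equal y
```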
| Hyperparameter | Symbol | Typical Value | What it does |
|---|---|---|---|
| Learning Rate | η or α | 0.001 | Step size for weight updates |
| Batch Size | B | 32, 64, 128 | Samples per gradient update |
| Epochs | E | 10–100 | Full passes through training data |
| Momentum | β | 0.9 | Smooths gradient direction |
| Dropout Rate | p | 0.2–0.5 | Fraction of neurons dropped |
| Weight Decay | λ | 1e-4 | L2 regularization strength |
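To make the table concrete, here is a hedged sketch of where each of these knobs appears in a plain NumPy training loop (mini-batch SGD with momentum and weight decay on a made-up one-weight regression problem; dropout is omitted since there is no hidden layer here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: y = 3*x + noise, 1024 samples
X = rng.normal(size=1024)
Y = 3.0 * X + 0.1 * rng.normal(size=1024)

eta, B, E = 0.001, 32, 100     # learning rate, batch size, epochs
beta, lam = 0.9, 1e-4          # momentum, weight decay (L2) strength
w, velocity = 0.0, 0.0

for epoch in range(E):                         # E full passes through the data
    order = rng.permutation(len(X))
    for i in range(0, len(X), B):              # one mini-batch of B samples
        xb, yb = X[order[i:i + B]], Y[order[i:i + B]]
        grad = 2 * np.mean((w * xb - yb) * xb) + lam * w   # MSE gradient + L2 term
        velocity = beta * velocity + grad      # momentum smooths the direction
        w -= eta * velocity                    # step size set by the learning rate
print(round(w, 2))                             # close to 3.0
```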
An MLP for images would need a huge number of weights (a 28×28 image → 784 inputs; one hidden layer of 500 neurons = 784 × 500 = 392,000 weights, and that is for a tiny image!). A CNN solves this by using local connections and shared weights — a filter slides across the image.
A small matrix called a filter (or kernel) slides across the image, multiplying element-wise and summing → producing a Feature Map.
EXAMPLE: 3×3 filter on 5×5 image (one step)
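A minimal NumPy sketch of this example with made-up pixel values: the line inside the loops is the single step (multiply the 3×3 patch element-wise by the filter, then sum), and the two loops slide the filter across the whole 5×5 image to build a 3×3 feature map:

```python
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)   # made-up 5x5 "image" (values 0..24)
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])              # a simple vertical-edge filter

feature_map = np.zeros((3, 3))                     # 5 - 3 + 1 = 3 positions per axis
for i in range(3):
    for j in range(3):
        patch = image[i:i + 3, j:j + 3]            # region under the filter
        feature_map[i, j] = np.sum(patch * kernel) # element-wise multiply, then sum
print(feature_map)                                 # every entry is -6.0 for this image
```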
| Term | Abbreviation | Meaning |
|---|---|---|
| Kernel / Filter | K | Small learnable weight matrix that slides over input |
| Feature Map | FM | Output of convolution — shows where a feature is detected |
| Stride | S | How many pixels the filter moves each step |
| Padding | P | Zeros added to border to control output size |
| Channels | C | Depth of input (1 for grayscale, 3 for RGB) |
| Number of Filters | F | How many different features to detect per layer |
| Receptive Field | RF | Region of input that a neuron "sees" |
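Kernel size, stride and padding together fix the feature-map size; a small sketch of the standard output-size formula, assuming square inputs and filters:

```python
def conv_output_size(W, K, S=1, P=0):
    """Feature-map width for input width W, kernel size K, stride S, padding P."""
    return (W - K + 2 * P) // S + 1

print(conv_output_size(28, 3))         # 26: a 3x3 filter shrinks 28x28 to 26x26
print(conv_output_size(28, 3, P=1))    # 28: padding of 1 keeps the size ("same")
print(conv_output_size(28, 2, S=2))    # 14: stride 2 halves the size
```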
Pooling reduces the spatial size, keeps the important features, and adds some translation invariance.
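A minimal NumPy sketch of 2×2 max pooling with stride 2 on a made-up 4×4 feature map; each output value is the largest value in its 2×2 block:

```python
import numpy as np

fm = np.array([[1., 3., 2., 0.],
               [4., 6., 1., 2.],
               [0., 2., 8., 5.],
               [1., 1., 3., 7.]])            # made-up 4x4 feature map

# Split into 2x2 blocks and keep only the maximum of each block
pooled = fm.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)    # [[6. 2.]
                 #  [2. 8.]]
```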
Batch normalization is applied after conv or dense layers to stabilize training by normalizing the activations within each mini-batch.
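A minimal NumPy sketch of the batch-norm computation: normalize each feature over the mini-batch, then rescale with the learnable parameters gamma and beta (fixed to 1 and 0 here for illustration):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of activations, then scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                        # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta                # learnable rescaling

batch = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(32, 4))
out = batch_norm(batch)
print(out.mean(axis=0).round(3))               # ~[0. 0. 0. 0.]
print(out.std(axis=0).round(3))                # ~[1. 1. 1. 1.]
```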
MNIST = Modified National Institute of Standards and Technology dataset.
70,000 grayscale images of handwritten digits (0–9), each 28×28 pixels.
| Property | Value |
|---|---|
| Total images | 70,000 (60k train + 10k test) |
| Image size | 28 × 28 pixels = 784 values |
| Color | Grayscale (1 channel), pixel values 0–255 |
| Classes | 10 (digits 0 through 9) |
Load 60,000 training images. Each pixel is 0–255. Normalize to 0–1 by dividing by 255.
Labels are integers 0–9. We convert them to one-hot vectors for the categorical cross-entropy loss.
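A hedged sketch of both preprocessing steps, assuming TensorFlow/Keras (the text does not name a framework; any MNIST loader works the same way):

```python
from tensorflow import keras

# Load MNIST: 60,000 training and 10,000 test images, each 28x28 grayscale
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Normalize pixels from 0-255 to 0-1
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# One-hot encode the integer labels 0-9 for categorical cross-entropy
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

print(x_train.shape, y_train.shape)   # (60000, 28, 28) (60000, 10)
```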
Data augmentation creates more training data by transforming existing images. The CNN learns to be invariant to these transforms.
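A hedged sketch of typical augmentations for digits, again assuming Keras (the layer choices and exact ranges are illustrative, not prescribed by the text):

```python
import numpy as np
from tensorflow import keras

# Small random transforms: digits stay readable but are never pixel-identical
augment = keras.Sequential([
    keras.layers.RandomRotation(0.03),          # rotate by up to roughly ±10 degrees
    keras.layers.RandomTranslation(0.1, 0.1),   # shift by up to 10% of the image
    keras.layers.RandomZoom(0.1),               # zoom in or out by up to 10%
])

# Apply to a batch shaped (batch, height, width, channels), e.g. MNIST images
fake_batch = np.random.default_rng(0).random((8, 28, 28, 1)).astype("float32")
augmented = augment(fake_batch, training=True)  # random transforms run in training mode
print(augmented.shape)                          # (8, 28, 28, 1)
```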
| Model | Year | Key Innovation | Layers |
|---|---|---|---|
| LeNet-5 | 1998 | First practical CNN (digits!) | 7 |
| AlexNet | 2012 | Deep CNN, ReLU, Dropout, GPU | 8 |
| VGGNet | 2014 | Very deep, all 3×3 filters | 16–19 |
| GoogLeNet | 2014 | Inception modules, 1×1 conv | 22 |
| ResNet | 2015 | Skip/residual connections | 50–152 |
| DenseNet | 2017 | Dense connections all layers | 121–201 |
| EfficientNet | 2019 | Compound scaling | B0–B7 |