Deep Learning, Perceptron & CNN
The Complete Guide

Every formula, every abbreviation, every concept — explained simply, with real math.

Section 01

🏗️ Foundations — What is a Neuron?

The human brain has ~86 billion neurons. Each neuron receives signals, processes them, and fires an output. Deep Learning copies this idea with math.

Biological vs Mathematical Neuron

🧬 Biological Neuron

Dendrites receive signals → Soma (cell body) sums them → Axon fires if threshold is crossed

🔢 Math Neuron (MCP)

Inputs x multiply by weights w → Sum → Activation function → Output

McCulloch–Pitts, 1943

The Core Math of One Neuron

--- Step 1: Weighted Sum (Pre-activation) ---
z = w₁x₁ + w₂x₂ + w₃x₃ + ... + wₙxₙ + b

--- Compact: Dot product ---
z = wᵀx + b     (also written as: z = W·x + b)

--- Step 2: Activation Function ---
output = f(z) = f(wᵀx + b)

WHERE:
x = input vector [x₁, x₂, ..., xₙ]
w = weight vector [w₁, w₂, ..., wₙ]
b = bias (scalar)    wᵀ = transpose of w
f = activation function
💡 Why bias b? It shifts the activation, like the y-intercept in y = mx + c. Without bias, the neuron is forced to pass through the origin.
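A minimal NumPy sketch of these two steps (the input, weight, and bias values below are made up for illustration; sigmoid is used as the example activation, see the table that follows):

--- Python sketch: one neuron ---
import numpy as np

x = np.array([0.5, -1.0, 2.0])      # input vector
w = np.array([0.8, 0.2, -0.5])      # weight vector
b = 0.1                             # bias (scalar)

z = np.dot(w, x) + b                # Step 1: weighted sum = 0.4 - 0.2 - 1.0 + 0.1 = -0.7
output = 1.0 / (1.0 + np.exp(-z))   # Step 2: sigmoid activation, sigmoid(-0.7) ≈ 0.33
print(z, output)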

Activation Functions — The "Decider"

Without an activation function, no matter how many layers you stack, the network is just a linear function. Activation adds non-linearity — the power to learn complex patterns.

Name        | Formula                      | Output Range  | Use Case
Step        | 1 if z ≥ 0, else 0           | {0, 1}        | Original MCP (not used today)
Sigmoid (σ) | σ(z) = 1 / (1 + e⁻ᶻ)         | (0, 1)        | Binary output, probabilities
Tanh        | (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ)      | (−1, 1)       | Hidden layers (zero-centered)
ReLU        | max(0, z)                    | [0, ∞)        | Hidden layers — most popular
Leaky ReLU  | max(αz, z), α ≈ 0.01         | (−∞, ∞)       | Fixes "dying ReLU" problem
Softmax     | eᶻⁱ / Σⱼ eᶻʲ                 | (0, 1), sum=1 | Multi-class output layer
--- Softmax (for multi-class, e.g. 10 digits) ---
softmax(zᵢ) = eᶻⁱ / (eᶻ¹ + eᶻ² + ... + eᶻᵏ)

Example: z = [2.0, 1.0, 0.1] for 3 classes
eᶻ = [7.39, 2.72, 1.11] → sum = 11.22
softmax = [0.659, 0.242, 0.099] → these are PROBABILITIES, sum = 1.0
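A NumPy sketch that reproduces this worked example (subtracting the max is a standard numerical-stability trick and does not change the result):

--- Python sketch: softmax ---
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))               # ≈ [0.659, 0.242, 0.099], sums to 1.0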
Section 02

🔵 Perceptron → MLP — Building Neurons into Networks

Single Perceptron (Rosenblatt, 1958)

Takes inputs, applies weights, passes through activation. Can classify linearly separable data.

Diagram: inputs x₁, x₂, x₃ weighted by w₁, w₂, w₃ → Σ + b → z = wᵀx + b → f(z) → ŷ   (the bias b feeds into the sum)

Perceptron Learning Rule

--- Update weights after each sample ---
wᵢ := wᵢ + η · (y − ŷ) · xᵢ
b := b + η · (y − ŷ)

η (eta) = learning rate  |  y = true label  |  ŷ = predicted label
(y − ŷ) = error signal: +1, 0, or -1
⚠️ Perceptron Limitation: Can only classify linearly separable data (can draw a straight line between classes). It cannot learn XOR. Solution → add hidden layers → MLP!
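A minimal NumPy sketch of this update rule on a toy linearly separable problem (an AND gate, chosen purely for illustration):

--- Python sketch: perceptron learning rule ---
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # inputs
y = np.array([0, 0, 0, 1])                       # AND labels (linearly separable)

w, b, eta = np.zeros(2), 0.0, 0.1                # weights, bias, learning rate η

for epoch in range(20):
    for xi, yi in zip(X, y):
        y_hat = 1 if np.dot(w, xi) + b >= 0 else 0   # step activation
        w += eta * (yi - y_hat) * xi                 # wᵢ := wᵢ + η·(y − ŷ)·xᵢ
        b += eta * (yi - y_hat)                      # b := b + η·(y − ŷ)

print(w, b)   # a separating line for AND; the same loop never converges for XOR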

MLP — Multi-Layer Perceptron

Stack layers of neurons. Each layer extracts more abstract features from the previous layer's output.

Layer 0 (Input) → Layer 1 (Hidden 1) → Layer 2 (Hidden 2) → Layer 3 (Output)
--- Forward Pass through MLP ---
h¹ = f(W¹·x + b¹)    ← hidden layer 1
h² = f(W²·h¹ + b²)    ← hidden layer 2
ŷ = softmax(W³·h² + b³) ← output layer

W¹, W², W³ = weight MATRICES (each row = one neuron's weights)
b¹, b², b³ = bias vectors
f = activation function (ReLU for hidden layers)
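A NumPy sketch of this forward pass with randomly initialized weights (the layer sizes 784 → 128 → 64 → 10 are illustrative choices, not prescribed by the guide):

--- Python sketch: MLP forward pass ---
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

x = rng.random(784)                                # e.g. a flattened 28×28 image
W1, b1 = rng.normal(0, 0.01, (128, 784)), np.zeros(128)
W2, b2 = rng.normal(0, 0.01, (64, 128)),  np.zeros(64)
W3, b3 = rng.normal(0, 0.01, (10, 64)),   np.zeros(10)

h1 = relu(W1 @ x + b1)                             # hidden layer 1
h2 = relu(W2 @ h1 + b2)                            # hidden layer 2
y_hat = softmax(W3 @ h2 + b3)                      # output layer: 10 class probabilities
print(y_hat.sum())                                 # 1.0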

Loss Functions

Name                      | Abbr. | Formula                       | When to use
Mean Squared Error        | MSE   | (1/n) Σ(y − ŷ)²               | Regression
Binary Cross-Entropy      | BCE   | −[y log(ŷ) + (1−y) log(1−ŷ)]  | Binary classification
Categorical Cross-Entropy | CCE   | −Σ yₖ log(ŷₖ)                 | Multi-class (MNIST!)
Mean Absolute Error       | MAE   | (1/n) Σ|y − ŷ|                | Regression, robust to outliers
--- Categorical Cross-Entropy (used for MNIST, 10 classes) ---
L = −Σₖ yₖ · log(ŷₖ)

Example: true label = digit 3 → one-hot vector y = [0,0,0,1,0,0,0,0,0,0]
ŷ (softmax output) = [.01,.01,.01,.90,.01,.01,.01,.01,.01,.01]
L = −(1·log(0.90)) = −(−0.105) = 0.105    ← small loss, good prediction!
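A NumPy sketch that reproduces this example (the small eps only guards against log(0)):

--- Python sketch: categorical cross-entropy ---
import numpy as np

def cce(y, y_hat, eps=1e-12):
    return -np.sum(y * np.log(y_hat + eps))   # −Σₖ yₖ·log(ŷₖ)

y     = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])            # one-hot: true digit is 3
y_hat = np.array([.01, .01, .01, .90, .01, .01, .01, .01, .01, .01])
print(cce(y, y_hat))   # ≈ 0.105 — small loss, good prediction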
Section 03

🔗 Deep Learning — Training the Network

Forward Pass → Loss → Backpropagation → Update

Forward Pass → Compute Loss → Backprop → Update Weights → Repeat (Epoch)

Backpropagation & Chain Rule

After computing the loss, we go backwards to compute how much each weight contributed to the error.

--- Chain Rule (Calculus): how to differentiate nested functions ---
If L depends on ŷ, which depends on z, which depends on w:

∂L/∂w = (∂L/∂ŷ) · (∂ŷ/∂z) · (∂z/∂w)

∂L/∂ŷ = gradient of loss w.r.t output
∂ŷ/∂z = gradient of activation function
∂z/∂w = gradient of the linear layer = x (the input itself!)

--- For ReLU: ∂f/∂z = 1 if z>0, else 0 ---
--- For Sigmoid: ∂σ/∂z = σ(z)·(1−σ(z)) ---
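A sketch of the chain rule on one sigmoid neuron with a squared-error loss, checked against a numerical gradient (the input, label, and weights are toy values chosen for illustration):

--- Python sketch: chain rule, verified numerically ---
import numpy as np

x, y = np.array([1.0, 2.0]), 1.0             # toy input and true label
w, b = np.array([0.3, -0.2]), 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b):
    y_hat = sigmoid(np.dot(w, x) + b)
    return 0.5 * (y_hat - y) ** 2             # L = ½(ŷ − y)²

# Analytic gradient via the chain rule: ∂L/∂w = (∂L/∂ŷ)·(∂ŷ/∂z)·(∂z/∂w)
z = np.dot(w, x) + b
y_hat = sigmoid(z)
dL_dyhat = (y_hat - y)                        # ∂L/∂ŷ
dyhat_dz = y_hat * (1 - y_hat)                # ∂σ/∂z = σ(z)·(1 − σ(z))
grad_w = dL_dyhat * dyhat_dz * x              # ∂z/∂w = x

# Numerical check with finite differences
eps = 1e-6
num = np.array([(loss(w + eps * np.eye(2)[i], b) - loss(w, b)) / eps for i in range(2)])
print(grad_w, num)                            # the two gradients should match closely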

Gradient Descent & Optimizers

--- Vanilla Gradient Descent ---
w := w − η · ∂L/∂w

η (eta / alpha) = learning rate (hyperparameter, e.g. 0.001)
∂L/∂w = gradient (how steeply loss changes with w)

--- SGD with Momentum ---
v := β·v + (1−β)·∂L/∂w    (v = velocity)
w := w − η·v
β (beta) ≈ 0.9 = momentum coefficient

--- Adam Optimizer (most popular) ---
m := β₁·m + (1−β₁)·g         ← 1st moment (mean of gradients)
v := β₂·v + (1−β₂)·g²        ← 2nd moment (variance)
m̂ = m/(1−β₁ᵗ)    v̂ = v/(1−β₂ᵗ) ← bias correction
w := w − η · m̂ / (√v̂ + ε)

β₁≈0.9, β₂≈0.999, ε=10⁻⁸, η≈0.001
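A minimal NumPy sketch of a single Adam update step, following the formulas above (the toy gradient values are made up):

--- Python sketch: one Adam step ---
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g               # 1st moment (mean of gradients)
    v = b2 * v + (1 - b2) * g**2            # 2nd moment (uncentered variance)
    m_hat = m / (1 - b1**t)                 # bias correction
    v_hat = v / (1 - b2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([0.5, -0.3])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 4):                       # pretend the same gradient arrives 3 steps in a row
    g = np.array([0.1, -0.2])
    w, m, v = adam_step(w, g, m, v, t)
print(w)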

Key Hyperparameters

Hyperparameter | Symbol | Typical Value | What it does
Learning Rate  | η or α | 0.001         | Step size for weight updates
Batch Size     | B      | 32, 64, 128   | Samples per gradient update
Epochs         | E      | 10–100        | Full passes through training data
Momentum       | β      | 0.9           | Smooths gradient direction
Dropout Rate   | p      | 0.2–0.5       | Fraction of neurons dropped
Weight Decay   | λ      | 1e-4          | L2 regularization strength
Section 04

🖼️ CNN — Convolutional Neural Network

An MLP for images would need millions of weights (a 28×28 image → 784 inputs; one hidden layer of 500 neurons = 392,000 weights!). CNN solves this by using local connections and shared weights — a filter slides across the image.

CNN Architecture Pipeline (for MNIST)

INPUT (image, 28×28×1) → CONV (conv layer) → ACT (ReLU) → POOL (MaxPool 2×2) → FLAT (flatten) → FC (Dense 128) → OUT (Softmax, 10)

① Convolution — The Core Operation

A small matrix called a filter (or kernel) slides across the image, multiplying element-wise and summing → producing a Feature Map.

--- 2D Convolution at position (i,j) ---
S(i,j) = Σₘ Σₙ I(i+m, j+n) · K(m,n) + b

I = input image (or feature map from previous layer)
K = kernel / filter (learnable weights, e.g. 3×3)
S(i,j) = output (feature map) at position (i,j)
m,n = kernel indices  |  b = bias

--- Output size formula ---
Output size = ⌊(N − F + 2P) / S⌋ + 1
N = input size  |  F = filter size  |  P = padding  |  S = stride

Example: Input 28×28, Filter 3×3, P=0, S=1
→ Output = (28−3+0)/1 + 1 = 26×26

EXAMPLE: 3×3 filter on 5×5 image (one step)

Input (patch)        Filter K (3×3)
1 0 1                 1 −1  1
0 1 0                 0  1  0
1 0 1                −1  1 −1

Element-wise multiply & sum:
(1+0+1) + (0+1+0) + (−1+0−1) = 1  →  Feature map value = 1
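A naive NumPy sketch of "valid" convolution (strictly cross-correlation, which is what CNN libraries compute) that reproduces the patch example and the output-size formula:

--- Python sketch: naive 2D convolution ---
import numpy as np

def conv2d(I, K, b=0.0):
    H, W = I.shape
    f = K.shape[0]
    out = np.zeros((H - f + 1, W - f + 1))         # ⌊(N − F + 2·0)/1⌋ + 1 per dimension
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i+f, j:j+f] * K) + b
    return out

patch = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [1, 0, 1]])
K = np.array([[ 1, -1,  1],
              [ 0,  1,  0],
              [-1,  1, -1]])
print(conv2d(patch, K))                            # [[1.]] — matches the worked example
print(conv2d(np.zeros((28, 28)), K).shape)         # (26, 26) — matches the output-size formula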

Key CNN Terms

Term              | Abbr. | Meaning
Kernel / Filter   | K     | Small learnable weight matrix that slides over the input
Feature Map       | FM    | Output of convolution — shows where a feature is detected
Stride            | S     | How many pixels the filter moves each step
Padding           | P     | Zeros added to the border to control output size
Channels          | C     | Depth of the input (1 for grayscale, 3 for RGB)
Number of Filters | F     | How many different features to detect per layer
Receptive Field   | RF    | Region of the input that a neuron "sees"

② Pooling — Downsample

Reduces spatial size, keeps important features, adds some translation invariance.

--- Max Pooling (most common) with 2×2 window, stride 2 ---
MaxPool(region) = max value in that region

Input 4×4 → MaxPool 2×2, S=2 → Output 2×2

[ 1 3 2 4 ]
[ 5 6 1 2 ]   →→→   [ 6 4 ]
[ 7 2 4 1 ]         [ 7 4 ]
[ 0 3 1 2 ]

--- Average Pooling: takes mean instead of max ---
--- Global Average Pooling (GAP): one number per feature map ---
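A NumPy sketch of 2×2 max pooling with stride 2 that reproduces the 4×4 example above:

--- Python sketch: 2×2 max pooling ---
import numpy as np

def maxpool2x2(x):
    H, W = x.shape
    # split into non-overlapping 2×2 blocks, then take each block's max
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 4, 1],
              [0, 3, 1, 2]])
print(maxpool2x2(x))   # [[6 4]
                       #  [7 4]]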

③ Fully Connected (Dense) Layer + Softmax

--- After Conv+Pool layers, flatten to 1D vector ---
Flatten: [H × W × C] → [H·W·C]    (a long 1D vector)

--- Then Dense layer ---
z = W·x_flat + b
h = ReLU(z)

--- Output layer (for 10 digit classes) ---
ŷ = softmax(W_out · h + b_out)
ŷ ∈ ℝ¹⁰    each element = probability of being that digit

④ Batch Normalization (BN)

Applied after conv or dense layers to stabilize training by normalizing each mini-batch.

μ_B = (1/m) Σ xᵢ                  ← batch mean
σ²_B = (1/m) Σ (xᵢ − μ_B)²       ← batch variance
x̂ᵢ = (xᵢ − μ_B) / √(σ²_B + ε)   ← normalize
yᵢ = γ·x̂ᵢ + β                  ← scale & shift (learnable)

γ, β = learnable parameters  |  ε = small constant for stability
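A NumPy sketch of the batch-norm forward pass for one feature across a mini-batch (γ and β are fixed here; in a real layer they are learned):

--- Python sketch: batch normalization (forward) ---
import numpy as np

def batchnorm_forward(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean()                            # μ_B: batch mean
    var = x.var()                            # σ²_B: batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # scale & shift

x = np.array([2.0, 4.0, 6.0, 8.0])           # one unit's activations across a batch of 4
print(batchnorm_forward(x))                  # ≈ zero mean, unit variance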
Section 05

🔢 MNIST Image Classification — Step by Step

MNIST = Modified National Institute of Standards and Technology dataset.
70,000 grayscale images of handwritten digits (0–9), each 28×28 pixels.

📦 Total Images: 70,000 (60k train + 10k test)
📐 Image Size: 28 × 28 pixels = 784 values
🎨 Color: Grayscale (1 channel), pixel values 0–255
🏷️ Classes: 10 (digits 0 through 9)

Step 1: Load & Normalize the Data

Load 60,000 training images. Each pixel is 0–255. Normalize to 0–1 by dividing by 255.

x_normalized = x / 255.0
Shape: (60000, 28, 28, 1)    ← (samples, H, W, Channels)
Keep channel=1 for grayscale CNN input
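The guide itself is framework-agnostic; assuming TensorFlow/Keras is available, this step could look like the following sketch:

--- Python sketch: load & normalize MNIST (Keras assumed) ---
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

x_train = x_train.astype("float32") / 255.0     # normalize 0–255 → 0–1
x_train = x_train.reshape(-1, 28, 28, 1)        # add the channel dimension
print(x_train.shape)                            # (60000, 28, 28, 1)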
Step 2: One-Hot Encode the Labels

Labels are integers 0–9. We convert to vectors for cross-entropy loss.

Label = 3  →  one-hot =
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
index: 0 1 2 3 4 5 6 7 8 9

Label = 7  → [0,0,0,0,0,0,0,1,0,0]
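A NumPy one-liner for one-hot encoding (Keras users can call keras.utils.to_categorical instead):

--- Python sketch: one-hot encoding ---
import numpy as np

labels = np.array([3, 7, 0])
one_hot = np.eye(10)[labels]     # row k of the 10×10 identity is the one-hot vector for digit k
print(one_hot[0])                # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]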
Step 3: Build the CNN Architecture

--- Complete CNN for MNIST ---

INPUT: 28×28×1   (grayscale image)

CONV1: 32 filters, 3×3, padding='same', stride=1
         Output: 28×28×32
RELU1: apply ReLU → max(0,z)

POOL1: MaxPool 2×2, stride=2
         Output: 14×14×32

CONV2: 64 filters, 3×3, padding='same'
         Output: 14×14×64
RELU2: ReLU

POOL2: MaxPool 2×2, stride=2
         Output: 7×7×64 = 3136 values

FLATTEN: 3136×1 vector

FC1: Dense(128), ReLU
DROPOUT: p=0.5 (randomly zero 50% of neurons)

OUTPUT: Dense(10), Softmax
         ŷ ∈ ℝ¹⁰, Σŷᵢ = 1.0
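The layer names above read like Keras, so here is one possible Keras sketch of the same architecture (an assumption about tooling, not the only way to build it):

--- Python sketch: the CNN in Keras ---
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),   # CONV1 + RELU1 → 28×28×32
    layers.MaxPooling2D((2, 2)),                                    # POOL1 → 14×14×32
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),   # CONV2 + RELU2 → 14×14×64
    layers.MaxPooling2D((2, 2)),                                    # POOL2 → 7×7×64
    layers.Flatten(),                                               # 3136-vector
    layers.Dense(128, activation="relu"),                           # FC1
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),                         # OUTPUT
])
model.summary()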
Step 4: Compile — Choose Loss, Optimizer, Metric

Loss function: Categorical Cross-Entropy
L = −Σₖ yₖ · log(ŷₖ)

Optimizer: Adam (η=0.001)
Metric: Accuracy = (correct predictions) / (total samples)
Step 5: Training Loop — Forward + Backward + Update

--- For each epoch (full pass over data): ---

For each mini-batch B of size 32:
  ① Forward: ŷ = CNN(x_batch)
  ② Loss: L = CCE(y_batch, ŷ)
  ③ Backward: compute ∂L/∂W for all layers (backprop)
  ④ Update: W := W − η · ∂L/∂W (Adam)

Steps per epoch = 60000 / 32 = 1875 steps

After each epoch: evaluate on validation set
val_accuracy should increase each epoch
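Continuing the Keras sketch: compile with CCE and Adam, then train with mini-batches of 32 (model, x_train, and the one-hot labels y_train_onehot are names carried over from the earlier sketches):

--- Python sketch: compile + train (Keras assumed) ---
from tensorflow import keras

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(
    x_train, y_train_onehot,
    batch_size=32,                 # → 60000 / 32 = 1875 steps per epoch
    epochs=10,
    validation_split=0.1,          # hold out 10% of training data for validation
)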
Step 6: Predict — Inference on New Images

ŷ = CNN(x_new)                     ← softmax vector
predicted_class = argmax(ŷ)     ← index of highest probability

Example output: ŷ = [.00, .00, .01, .97, .00, .01, .00, .00, .01, .00]
argmax(ŷ) = 3  → Predicted digit: 3
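An inference sketch continuing the Keras example (x_new is an assumed array of new images with shape (n, 28, 28, 1)):

--- Python sketch: predict + argmax ---
import numpy as np

y_hat = model.predict(x_new)                 # softmax vectors, shape (n, 10)
predicted_class = np.argmax(y_hat, axis=1)   # index of highest probability per image
print(predicted_class[:5])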
Step 7: Evaluate — Measure Performance

Accuracy = (# correct) / (# total) × 100%
A simple CNN achieves ~99% on MNIST test set!

Confusion Matrix: 10×10 grid showing which digits
are confused with each other (e.g. 4 mistaken as 9)

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-Score = 2 · (Precision · Recall) / (Precision + Recall)
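A sketch of these metrics with scikit-learn (an assumed dependency; the formulas above can also be coded by hand). The toy labels stand in for the test labels and argmax(ŷ):

--- Python sketch: accuracy, confusion matrix, precision/recall/F1 ---
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

y_true = [3, 7, 3, 9, 4]                          # toy integer class labels
y_pred = [3, 7, 3, 4, 4]

print(accuracy_score(y_true, y_pred))             # 0.8
print(confusion_matrix(y_true, y_pred))           # confusion grid over the labels present
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(prec, rec, f1)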

Parameter Count — How Big is Our CNN?

--- CONV1: 32 filters, 3×3, input channels=1 ---
Params = (3×3×1 + 1) × 32 = 10 × 32 = 320

--- CONV2: 64 filters, 3×3, input channels=32 ---
Params = (3×3×32 + 1) × 64 = 289 × 64 = 18,496

--- FC1: 3136 inputs, 128 outputs ---
Params = (3136 + 1) × 128 = 401,536

--- Output: 128 inputs, 10 outputs ---
Params = (128 + 1) × 10 = 1,290

Total ≈ 421,642 parameters
(vs ~600k for a plain MLP — AND CNN extracts spatial features!)
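A tiny sketch that recomputes these counts from the two formulas (filter² × in_channels + 1) × filters and (inputs + 1) × outputs:

--- Python sketch: parameter counting ---
def conv_params(f, c_in, n_filters):
    return (f * f * c_in + 1) * n_filters      # +1 for each filter's bias

def dense_params(n_in, n_out):
    return (n_in + 1) * n_out                  # +1 for each output unit's bias

total = (conv_params(3, 1, 32)        # CONV1: 320
         + conv_params(3, 32, 64)     # CONV2: 18,496
         + dense_params(3136, 128)    # FC1: 401,536
         + dense_params(128, 10))     # OUTPUT: 1,290
print(total)                          # 421,642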
Section 06

⚙️ Training — Deep Dive

🧯 Overfitting vs Underfitting
Total Error = Bias² + Variance + Irreducible Noise

Underfitting = High Bias = model too simple, can't learn pattern
Overfitting = High Variance = model memorizes training data
Sweet Spot = Low bias AND low variance

Signs:
  train_loss↓ & val_loss↓ → good, keep training
  train_loss↓ & val_loss↑ → overfitting! use dropout/regularize
  train_loss high & val_loss high → underfitting, need more capacity
🛡️ Regularization (L1, L2, Dropout)
--- L2 (Ridge / Weight Decay) ---
L_total = L + λ Σ wᵢ²
Shrinks all weights → simpler model

--- L1 (Lasso) ---
L_total = L + λ Σ|wᵢ|
Pushes some weights exactly to 0 → sparse model

--- Dropout ---
During training: randomly set p fraction of neurons to 0
During inference: use all neurons, scale by (1-p)
Effect: prevents co-adaptation, like training an ensemble

λ = regularization strength (hyperparameter)
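A NumPy sketch of both ideas: an L2 penalty added to the loss, and a dropout mask (standard, non-inverted dropout, matching the scale-at-inference description above):

--- Python sketch: L2 penalty + dropout ---
import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(weights, lam=1e-4):
    return lam * sum(np.sum(w ** 2) for w in weights)   # λ Σ wᵢ²

def dropout(h, p=0.5, training=True):
    if not training:
        return h * (1 - p)                   # inference: use all neurons, scale by (1−p)
    mask = rng.random(h.shape) >= p          # keep each neuron with probability 1−p
    return h * mask

h = np.array([0.5, 1.2, -0.3, 2.0])
print(dropout(h, p=0.5))                     # roughly half the activations zeroed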
📉 Learning Rate Scheduling
--- Step Decay ---
η = η₀ × γ^⌊epoch/step_size⌋

--- Exponential Decay ---
η = η₀ × e^(−λ·epoch)

--- Cosine Annealing ---
ηₜ = η_min + 0.5(η_max − η_min)(1 + cos(πt/T))

Intuition: start with large η to explore, reduce to fine-tune
ReduceLROnPlateau: reduce η if val_loss stops improving
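The three schedules as plain Python functions of the epoch (the decay constants are illustrative defaults, not values from the guide):

--- Python sketch: learning rate schedules ---
import math

def step_decay(epoch, eta0=0.001, gamma=0.5, step_size=10):
    return eta0 * gamma ** (epoch // step_size)

def exp_decay(epoch, eta0=0.001, lam=0.05):
    return eta0 * math.exp(-lam * epoch)

def cosine_annealing(t, T=50, eta_min=1e-5, eta_max=0.001):
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

print([round(step_decay(e), 6) for e in (0, 10, 20)])   # 0.001, 0.0005, 0.00025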
🔄 Data Augmentation

Create more training data by transforming existing images. CNN learns to be invariant to these transforms.

Rotation: rotate image ±10–20°
Translation: shift image left/right/up/down
Flipping: horizontal flip (not for digits! a flipped or rotated 6 can pass for a 9)
Zoom: scale in/out
Noise: add Gaussian noise N(0, σ²)
Brightness: randomly adjust brightness/contrast

Effect: reduces overfitting, improves generalization
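A NumPy-only sketch of two digit-safe augmentations, a small shift and Gaussian noise (np.roll wraps pixels around the border, which is acceptable for a sketch; frameworks offer richer rotation/zoom transforms):

--- Python sketch: simple augmentations ---
import numpy as np

rng = np.random.default_rng(0)

def shift(img, dx, dy):
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)   # translate by (dx, dy) pixels

def add_noise(img, sigma=0.05):
    return np.clip(img + rng.normal(0, sigma, img.shape), 0.0, 1.0)   # N(0, σ²) noise

img = rng.random((28, 28))              # stand-in for a normalized MNIST image
augmented = add_noise(shift(img, 2, -1))
print(augmented.shape)                  # (28, 28)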
🏗️ Famous CNN Architectures
Model        | Year | Key Innovation                   | Layers
LeNet-5      | 1998 | First practical CNN (digits!)    | 7
AlexNet      | 2012 | Deep CNN, ReLU, Dropout, GPU     | 8
VGGNet       | 2014 | Very deep, all 3×3 filters       | 16–19
GoogLeNet    | 2014 | Inception modules, 1×1 conv      | 22
ResNet       | 2015 | Skip/residual connections        | 50–152
DenseNet     | 2017 | Dense connections between layers | 121–201
EfficientNet | 2019 | Compound scaling                 | B0–B7
--- ResNet Skip Connection (key formula) ---
H(x) = F(x) + x
F(x) = what the layers learn  |  x = identity shortcut
Allows training of very deep networks (solves vanishing gradient)
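A tiny NumPy sketch of the skip connection H(x) = F(x) + x (a toy fully connected "block"; real ResNets use conv layers and batch norm):

--- Python sketch: residual (skip) connection ---
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(0, 0.1, (16, 16)), rng.normal(0, 0.1, (16, 16))

def residual_block(x):
    f = np.maximum(0, W1 @ x)        # F(x): what the layers learn
    f = W2 @ f
    return np.maximum(0, f + x)      # add the identity shortcut x, then activate

x = rng.random(16)
print(residual_block(x).shape)       # (16,) — same shape, so the shortcut adds cleanly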
Section 07

📖 Complete Glossary & Abbreviations

AI          | Artificial Intelligence — machines mimicking human intelligence
ML          | Machine Learning — learning patterns from data
DL          | Deep Learning — ML with deep neural networks
MCP         | McCulloch–Pitts Neuron (1943) — first math neuron model
MLP         | Multi-Layer Perceptron — neural network with hidden layers
CNN         | Convolutional Neural Network — for image data
RNN         | Recurrent Neural Network — for sequential data
LSTM        | Long Short-Term Memory — improved RNN
NN / ANN    | Neural Network / Artificial Neural Network
ReLU        | Rectified Linear Unit — activation f(z) = max(0, z)
σ (sigma)   | Sigmoid activation function — outputs (0, 1)
GD          | Gradient Descent — optimization algorithm
SGD         | Stochastic Gradient Descent — updates from one sample (or mini-batch) per step
Adam        | Adaptive Moment Estimation — widely used default optimizer
MSE         | Mean Squared Error — regression loss
BCE         | Binary Cross-Entropy — binary classification loss
CCE         | Categorical Cross-Entropy — multi-class loss
BN          | Batch Normalization — stabilizes training
FC          | Fully Connected (Dense) layer
CONV        | Convolutional layer
POOL        | Pooling layer (MaxPool, AvgPool)
FM          | Feature Map — output of a conv layer
RF          | Receptive Field — input region a neuron sees
η (eta)     | Learning Rate — step size for updates
λ (lambda)  | Regularization strength (L1/L2)
∇L          | Gradient of Loss — direction of steepest ascent
∂ (partial) | Partial derivative — derivative w.r.t. one variable
w, b        | Weights and Bias — learnable parameters
ŷ (y-hat)   | Predicted output
H, W, C     | Height, Width, Channels of a feature map
P, S        | Padding, Stride — conv hyperparameters
argmax      | Returns the index of the maximum value
GAP         | Global Average Pooling — one number per FM
TP/FP/FN    | True Positives / False Positives / False Negatives
MNIST       | Modified NIST dataset — 70k handwritten digits
GPU         | Graphics Processing Unit — fast parallel compute for DL
Epoch       | One full pass through the entire training dataset
Batch       | Subset of data used for one gradient update
🧠 Deep Learning Guide  ·  All formulas, all terms, simple English  ·  Made for learners