Deep Learning, Perceptron & CNN
The Complete Guide

Every formula, every abbreviation, every concept — explained simply, with real math.

Section 01

🏗️ Foundations — What is a Neuron?

The human brain has ~86 billion neurons. Each neuron receives signals, processes them, and fires an output. Deep Learning copies this idea with math.

Biological vs Mathematical Neuron

🧬 Biological Neuron

Dendrites receive signals → Soma (cell body) sums them → Axon fires if threshold is crossed

🔢 Math Neuron (MCP)

Inputs x multiply by weights w → Sum → Activation function → Output

McCulloch–Pitts, 1943

The Core Math of One Neuron

--- Step 1: Weighted Sum (Pre-activation) ---
z = w₁x₁ + w₂x₂ + w₃x₃ + ... + wₙxₙ + b

--- Compact: Dot product ---
z = wᵀx + b     (also written as: z = W·x + b)

--- Step 2: Activation Function ---
output = f(z) = f(wᵀx + b)

WHERE:
x = input vector [x₁, x₂, ..., xₙ]
w = weight vector [w₁, w₂, ..., wₙ]
b = bias (scalar)    wᵀ = transpose of w
f = activation function
💡 Why bias b? It shifts the activation, like the y-intercept in y = mx + c. Without bias, the neuron is forced to pass through the origin.
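A minimal NumPy sketch of these two steps (the input, weight, and bias values below are made up for illustration; sigmoid is used as the example activation, see the table that follows):

--- Python sketch: one neuron ---
import numpy as np

x = np.array([0.5, -1.0, 2.0])      # input vector
w = np.array([0.8, 0.2, -0.5])      # weight vector
b = 0.1                             # bias (scalar)

z = np.dot(w, x) + b                # Step 1: weighted sum = 0.4 - 0.2 - 1.0 + 0.1 = -0.7
output = 1.0 / (1.0 + np.exp(-z))   # Step 2: sigmoid activation, sigmoid(-0.7) ≈ 0.33
print(z, output)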

Activation Functions — The "Decider"

Without an activation function, no matter how many layers you stack, the network is just a linear function. Activation adds non-linearity — the power to learn complex patterns.

Name        | Formula                      | Output Range  | Use Case
Step        | 1 if z ≥ 0, else 0           | {0, 1}        | Original MCP (not used today)
Sigmoid (σ) | σ(z) = 1 / (1 + e⁻ᶻ)         | (0, 1)        | Binary output, probabilities
Tanh        | (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ)      | (−1, 1)       | Hidden layers (zero-centered)
ReLU        | max(0, z)                    | [0, ∞)        | Hidden layers — most popular
Leaky ReLU  | max(αz, z), α ≈ 0.01         | (−∞, ∞)       | Fixes "dying ReLU" problem
Softmax     | eᶻⁱ / Σⱼ eᶻʲ                 | (0, 1), sum=1 | Multi-class output layer
--- Softmax (for multi-class, e.g. 10 digits) ---
softmax(zᵢ) = eᶻⁱ / (eᶻ¹ + eᶻ² + ... + eᶻᵏ)

Example: z = [2.0, 1.0, 0.1] for 3 classes
eᶻ = [7.39, 2.72, 1.11] → sum = 11.22
softmax = [0.659, 0.242, 0.099] → these are PROBABILITIES, sum = 1.0
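A NumPy sketch that reproduces this worked example (subtracting the max is a standard numerical-stability trick and does not change the result):

--- Python sketch: softmax ---
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))               # ≈ [0.659, 0.242, 0.099], sums to 1.0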
Section 02

🔵 Perceptron → MLP — Building Neurons into Networks

Single Perceptron (Rosenblatt, 1958)

Takes inputs, applies weights, passes through activation. Can classify linearly separable data.

Diagram: inputs x₁, x₂, x₃ weighted by w₁, w₂, w₃ → Σ + b → z = wᵀx + b → f(z) → ŷ   (the bias b feeds into the sum)

Perceptron Learning Rule

--- Update weights after each sample ---
wᵢ := wᵢ + η · (y − ŷ) · xᵢ
b := b + η · (y − ŷ)

η (eta) = learning rate  |  y = true label  |  ŷ = predicted label
(y − ŷ) = error signal: +1, 0, or -1
⚠️ Perceptron Limitation: Can only classify linearly separable data (can draw a straight line between classes). It cannot learn XOR. Solution → add hidden layers → MLP!
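A minimal NumPy sketch of this update rule on a toy linearly separable problem (an AND gate, chosen purely for illustration):

--- Python sketch: perceptron learning rule ---
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # inputs
y = np.array([0, 0, 0, 1])                       # AND labels (linearly separable)

w, b, eta = np.zeros(2), 0.0, 0.1                # weights, bias, learning rate η

for epoch in range(20):
    for xi, yi in zip(X, y):
        y_hat = 1 if np.dot(w, xi) + b >= 0 else 0   # step activation
        w += eta * (yi - y_hat) * xi                 # wᵢ := wᵢ + η·(y − ŷ)·xᵢ
        b += eta * (yi - y_hat)                      # b := b + η·(y − ŷ)

print(w, b)   # a separating line for AND; the same loop never converges for XOR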

MLP — Multi-Layer Perceptron

Stack layers of neurons. Each layer extracts more abstract features from the previous layer's output.

Layer 0 (Input) → Layer 1 (Hidden 1) → Layer 2 (Hidden 2) → Layer 3 (Output)
--- Forward Pass through MLP ---
h¹ = f(W¹·x + b¹)    ← hidden layer 1
h² = f(W²·h¹ + b²)    ← hidden layer 2
ŷ = softmax(W³·h² + b³) ← output layer

W¹, W², W³ = weight MATRICES (each row = one neuron's weights)
b¹, b², b³ = bias vectors
f = activation function (ReLU for hidden layers)
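A NumPy sketch of this forward pass with randomly initialized weights (the layer sizes 784 → 128 → 64 → 10 are illustrative choices, not prescribed by the guide):

--- Python sketch: MLP forward pass ---
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

x = rng.random(784)                                # e.g. a flattened 28×28 image
W1, b1 = rng.normal(0, 0.01, (128, 784)), np.zeros(128)
W2, b2 = rng.normal(0, 0.01, (64, 128)),  np.zeros(64)
W3, b3 = rng.normal(0, 0.01, (10, 64)),   np.zeros(10)

h1 = relu(W1 @ x + b1)                             # hidden layer 1
h2 = relu(W2 @ h1 + b2)                            # hidden layer 2
y_hat = softmax(W3 @ h2 + b3)                      # output layer: 10 class probabilities
print(y_hat.sum())                                 # 1.0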

Loss Functions

Name                      | Abbr. | Formula                       | When to use
Mean Squared Error        | MSE   | (1/n) Σ(y − ŷ)²               | Regression
Binary Cross-Entropy      | BCE   | −[y log(ŷ) + (1−y) log(1−ŷ)]  | Binary classification
Categorical Cross-Entropy | CCE   | −Σ yₖ log(ŷₖ)                 | Multi-class (MNIST!)
Mean Absolute Error       | MAE   | (1/n) Σ|y − ŷ|                | Regression, robust to outliers
--- Categorical Cross-Entropy (used for MNIST, 10 classes) ---
L = −Σₖ yₖ · log(ŷₖ)

Example: true label = digit 3 → one-hot vector y = [0,0,0,1,0,0,0,0,0,0]
ŷ (softmax output) = [.01,.01,.01,.90,.01,.01,.01,.01,.01,.01]
L = −(1·log(0.90)) = −(−0.105) = 0.105    ← small loss, good prediction!
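A NumPy sketch that reproduces this example (the small eps only guards against log(0)):

--- Python sketch: categorical cross-entropy ---
import numpy as np

def cce(y, y_hat, eps=1e-12):
    return -np.sum(y * np.log(y_hat + eps))   # −Σₖ yₖ·log(ŷₖ)

y     = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])            # one-hot: true digit is 3
y_hat = np.array([.01, .01, .01, .90, .01, .01, .01, .01, .01, .01])
print(cce(y, y_hat))   # ≈ 0.105 — small loss, good prediction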
Section 03

🔗 Deep Learning — Training the Network

Forward Pass → Loss → Backpropagation → Update

Forward Pass → Compute Loss → Backprop → Update Weights → Repeat (Epoch)

Backpropagation & Chain Rule

After computing the loss, we go backwards to compute how much each weight contributed to the error.

--- Chain Rule (Calculus): how to differentiate nested functions ---
If L depends on ŷ, which depends on z, which depends on w:

∂L/∂w = (∂L/∂ŷ) · (∂ŷ/∂z) · (∂z/∂w)

∂L/∂ŷ = gradient of loss w.r.t output
∂ŷ/∂z = gradient of activation function
∂z/∂w = gradient of the linear layer = x (the input itself!)

--- For ReLU: ∂f/∂z = 1 if z>0, else 0 ---
--- For Sigmoid: ∂σ/∂z = σ(z)·(1−σ(z)) ---
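A sketch of the chain rule on one sigmoid neuron with a squared-error loss, checked against a numerical gradient (the input, label, and weights are toy values chosen for illustration):

--- Python sketch: chain rule, verified numerically ---
import numpy as np

x, y = np.array([1.0, 2.0]), 1.0             # toy input and true label
w, b = np.array([0.3, -0.2]), 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b):
    y_hat = sigmoid(np.dot(w, x) + b)
    return 0.5 * (y_hat - y) ** 2             # L = ½(ŷ − y)²

# Analytic gradient via the chain rule: ∂L/∂w = (∂L/∂ŷ)·(∂ŷ/∂z)·(∂z/∂w)
z = np.dot(w, x) + b
y_hat = sigmoid(z)
dL_dyhat = (y_hat - y)                        # ∂L/∂ŷ
dyhat_dz = y_hat * (1 - y_hat)                # ∂σ/∂z = σ(z)·(1 − σ(z))
grad_w = dL_dyhat * dyhat_dz * x              # ∂z/∂w = x

# Numerical check with finite differences
eps = 1e-6
num = np.array([(loss(w + eps * np.eye(2)[i], b) - loss(w, b)) / eps for i in range(2)])
print(grad_w, num)                            # the two gradients should match closely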

Gradient Descent & Optimizers

--- Vanilla Gradient Descent ---
w := w − η · ∂L/∂w

η (eta / alpha) = learning rate (hyperparameter, e.g. 0.001)
∂L/∂w = gradient (how steeply loss changes with w)

--- SGD with Momentum ---
v := β·v + (1−β)·∂L/∂w    (v = velocity)
w := w − η·v
β (beta) ≈ 0.9 = momentum coefficient

--- Adam Optimizer (most popular) ---
m := β₁·m + (1−β₁)·g         ← 1st moment (mean of gradients)
v := β₂·v + (1−β₂)·g²        ← 2nd moment (variance)
m̂ = m/(1−β₁ᵗ)    v̂ = v/(1−β₂ᵗ) ← bias correction
w := w − η · m̂ / (√v̂ + ε)

β₁≈0.9, β₂≈0.999, ε=10⁻⁸, η≈0.001
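A minimal NumPy sketch of a single Adam update step, following the formulas above (the toy gradient values are made up):

--- Python sketch: one Adam step ---
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g               # 1st moment (mean of gradients)
    v = b2 * v + (1 - b2) * g**2            # 2nd moment (uncentered variance)
    m_hat = m / (1 - b1**t)                 # bias correction
    v_hat = v / (1 - b2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([0.5, -0.3])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 4):                       # pretend the same gradient arrives 3 steps in a row
    g = np.array([0.1, -0.2])
    w, m, v = adam_step(w, g, m, v, t)
print(w)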

Key Hyperparameters

Hyperparameter | Symbol | Typical Value | What it does
Learning Rate  | η or α | 0.001         | Step size for weight updates
Batch Size     | B      | 32, 64, 128   | Samples per gradient update
Epochs         | E      | 10–100        | Full passes through training data
Momentum       | β      | 0.9           | Smooths gradient direction
Dropout Rate   | p      | 0.2–0.5       | Fraction of neurons dropped
Weight Decay   | λ      | 1e-4          | L2 regularization strength
Section 04

🖼️ CNN — Convolutional Neural Network

An MLP for images would need millions of weights (a 28×28 image → 784 inputs; one hidden layer of 500 neurons = 392,000 weights!). CNN solves this by using local connections and shared weights — a filter slides across the image.

CNN Architecture Pipeline (for MNIST)

INPUT (image, 28×28×1) → CONV (conv layer) → ACT (ReLU) → POOL (MaxPool 2×2) → FLAT (flatten) → FC (Dense 128) → OUT (Softmax, 10)

① Convolution — The Core Operation

A small matrix called a filter (or kernel) slides across the image, multiplying element-wise and summing → producing a Feature Map.

--- 2D Convolution at position (i,j) ---
S(i,j) = Σₘ Σₙ I(i+m, j+n) · K(m,n) + b

I = input image (or feature map from previous layer)
K = kernel / filter (learnable weights, e.g. 3×3)
S(i,j) = output (feature map) at position (i,j)
m,n = kernel indices  |  b = bias

--- Output size formula ---
Output size = ⌊(N − F + 2P) / S⌋ + 1
N = input size  |  F = filter size  |  P = padding  |  S = stride

Example: Input 28×28, Filter 3×3, P=0, S=1
→ Output = (28−3+0)/1 + 1 = 26×26

EXAMPLE: 3×3 filter on 5×5 image (one step)

Input (patch)        Filter K (3×3)
1 0 1                 1 −1  1
0 1 0                 0  1  0
1 0 1                −1  1 −1

Element-wise multiply & sum:
(1+0+1) + (0+1+0) + (−1+0−1) = 1  →  Feature map value = 1
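A naive NumPy sketch of "valid" convolution (strictly cross-correlation, which is what CNN libraries compute) that reproduces the patch example and the output-size formula:

--- Python sketch: naive 2D convolution ---
import numpy as np

def conv2d(I, K, b=0.0):
    H, W = I.shape
    f = K.shape[0]
    out = np.zeros((H - f + 1, W - f + 1))         # ⌊(N − F + 2·0)/1⌋ + 1 per dimension
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i+f, j:j+f] * K) + b
    return out

patch = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [1, 0, 1]])
K = np.array([[ 1, -1,  1],
              [ 0,  1,  0],
              [-1,  1, -1]])
print(conv2d(patch, K))                            # [[1.]] — matches the worked example
print(conv2d(np.zeros((28, 28)), K).shape)         # (26, 26) — matches the output-size formula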

Key CNN Terms

Term              | Abbr. | Meaning
Kernel / Filter   | K     | Small learnable weight matrix that slides over the input
Feature Map       | FM    | Output of convolution — shows where a feature is detected
Stride            | S     | How many pixels the filter moves each step
Padding           | P     | Zeros added to the border to control output size
Channels          | C     | Depth of the input (1 for grayscale, 3 for RGB)
Number of Filters | F     | How many different features to detect per layer
Receptive Field   | RF    | Region of the input that a neuron "sees"

② Pooling — Downsample

Reduces spatial size, keeps important features, adds some translation invariance.

--- Max Pooling (most common) with 2×2 window, stride 2 ---
MaxPool(region) = max value in that region

Input 4×4 → MaxPool 2×2, S=2 → Output 2×2

[ 1 3 2 4 ]
[ 5 6 1 2 ]   →→→   [ 6 4 ]
[ 7 2 4 1 ]         [ 7 4 ]
[ 0 3 1 2 ]

--- Average Pooling: takes mean instead of max ---
--- Global Average Pooling (GAP): one number per feature map ---
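A NumPy sketch of 2×2 max pooling with stride 2 that reproduces the 4×4 example above:

--- Python sketch: 2×2 max pooling ---
import numpy as np

def maxpool2x2(x):
    H, W = x.shape
    # split into non-overlapping 2×2 blocks, then take each block's max
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 4, 1],
              [0, 3, 1, 2]])
print(maxpool2x2(x))   # [[6 4]
                       #  [7 4]]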

③ Fully Connected (Dense) Layer + Softmax

--- After Conv+Pool layers, flatten to 1D vector ---
Flatten: [H × W × C] → [H·W·C]    (a long 1D vector)

--- Then Dense layer ---
z = W·x_flat + b
h = ReLU(z)

--- Output layer (for 10 digit classes) ---
ŷ = softmax(W_out · h + b_out)
ŷ ∈ ℝ¹⁰    each element = probability of being that digit

④ Batch Normalization (BN)

Applied after conv or dense layers to stabilize training by normalizing each mini-batch.

μ_B = (1/m) Σ xᵢ                  ← batch mean
σ²_B = (1/m) Σ (xᵢ − μ_B)²       ← batch variance
x̂ᵢ = (xᵢ − μ_B) / √(σ²_B + ε)   ← normalize
yᵢ = γ·x̂ᵢ + β                  ← scale & shift (learnable)

γ, β = learnable parameters  |  ε = small constant for stability
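A NumPy sketch of the batch-norm forward pass for one feature across a mini-batch (γ and β are fixed here; in a real layer they are learned):

--- Python sketch: batch normalization (forward) ---
import numpy as np

def batchnorm_forward(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean()                            # μ_B: batch mean
    var = x.var()                            # σ²_B: batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # scale & shift

x = np.array([2.0, 4.0, 6.0, 8.0])           # one unit's activations across a batch of 4
print(batchnorm_forward(x))                  # ≈ zero mean, unit variance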
Section 05

🔢 MNIST Image Classification — Step by Step

MNIST = Modified National Institute of Standards and Technology dataset.
70,000 grayscale images of handwritten digits (0–9), each 28×28 pixels.

📦 Total Images: 70,000 (60k train + 10k test)
📐 Image Size: 28 × 28 pixels = 784 values
🎨 Color: Grayscale (1 channel), pixel values 0–255
🏷️ Classes: 10 (digits 0 through 9)

Step 1: Load & Normalize the Data

Load 60,000 training images. Each pixel is 0–255. Normalize to 0–1 by dividing by 255.

x_normalized = x / 255.0
Shape: (60000, 28, 28, 1)    ← (samples, H, W, Channels)
Keep channel=1 for grayscale CNN input
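The guide itself is framework-agnostic; assuming TensorFlow/Keras is available, this step could look like the following sketch:

--- Python sketch: load & normalize MNIST (Keras assumed) ---
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

x_train = x_train.astype("float32") / 255.0     # normalize 0–255 → 0–1
x_train = x_train.reshape(-1, 28, 28, 1)        # add the channel dimension
print(x_train.shape)                            # (60000, 28, 28, 1)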
Step 2: One-Hot Encode the Labels

Labels are integers 0–9. We convert to vectors for cross-entropy loss.

Label = 3  →  one-hot =
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
index: 0 1 2 3 4 5 6 7 8 9

Label = 7  → [0,0,0,0,0,0,0,1,0,0]
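A NumPy one-liner for one-hot encoding (Keras users can call keras.utils.to_categorical instead):

--- Python sketch: one-hot encoding ---
import numpy as np

labels = np.array([3, 7, 0])
one_hot = np.eye(10)[labels]     # row k of the 10×10 identity is the one-hot vector for digit k
print(one_hot[0])                # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]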
Step 3: Build the CNN Architecture

--- Complete CNN for MNIST ---

INPUT: 28×28×1   (grayscale image)

CONV1: 32 filters, 3×3, padding='same', stride=1
         Output: 28×28×32
RELU1: apply ReLU → max(0,z)

POOL1: MaxPool 2×2, stride=2
         Output: 14×14×32

CONV2: 64 filters, 3×3, padding='same'
         Output: 14×14×64
RELU2: ReLU

POOL2: MaxPool 2×2, stride=2
         Output: 7×7×64 = 3136 values

FLATTEN: 3136×1 vector

FC1: Dense(128), ReLU
DROPOUT: p=0.5 (randomly zero 50% of neurons)

OUTPUT: Dense(10), Softmax
         ŷ ∈ ℝ¹⁰, Σŷᵢ = 1.0
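The layer names above read like Keras, so here is one possible Keras sketch of the same architecture (an assumption about tooling, not the only way to build it):

--- Python sketch: the CNN in Keras ---
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),   # CONV1 + RELU1 → 28×28×32
    layers.MaxPooling2D((2, 2)),                                    # POOL1 → 14×14×32
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),   # CONV2 + RELU2 → 14×14×64
    layers.MaxPooling2D((2, 2)),                                    # POOL2 → 7×7×64
    layers.Flatten(),                                               # 3136-vector
    layers.Dense(128, activation="relu"),                           # FC1
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),                         # OUTPUT
])
model.summary()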
Step 4: Compile — Choose Loss, Optimizer, Metric

Loss function: Categorical Cross-Entropy
L = −Σₖ yₖ · log(ŷₖ)

Optimizer: Adam (η=0.001)
Metric: Accuracy = (correct predictions) / (total samples)
Step 5: Training Loop — Forward + Backward + Update

--- For each epoch (full pass over data): ---

For each mini-batch B of size 32:
  ① Forward: ŷ = CNN(x_batch)
  ② Loss: L = CCE(y_batch, ŷ)
  ③ Backward: compute ∂L/∂W for all layers (backprop)
  ④ Update: W := W − η · ∂L/∂W (Adam)

Steps per epoch = 60000 / 32 = 1875 steps

After each epoch: evaluate on validation set
val_accuracy should increase each epoch
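Continuing the Keras sketch: compile with CCE and Adam, then train with mini-batches of 32 (model, x_train, and the one-hot labels y_train_onehot are names carried over from the earlier sketches):

--- Python sketch: compile + train (Keras assumed) ---
from tensorflow import keras

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(
    x_train, y_train_onehot,
    batch_size=32,                 # → 60000 / 32 = 1875 steps per epoch
    epochs=10,
    validation_split=0.1,          # hold out 10% of training data for validation
)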
Step 6: Predict — Inference on New Images

ŷ = CNN(x_new)                     ← softmax vector
predicted_class = argmax(ŷ)     ← index of highest probability

Example output: ŷ = [.00, .00, .01, .97, .00, .01, .00, .00, .01, .00]
argmax(ŷ) = 3  → Predicted digit: 3
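An inference sketch continuing the Keras example (x_new is an assumed array of new images with shape (n, 28, 28, 1)):

--- Python sketch: predict + argmax ---
import numpy as np

y_hat = model.predict(x_new)                 # softmax vectors, shape (n, 10)
predicted_class = np.argmax(y_hat, axis=1)   # index of highest probability per image
print(predicted_class[:5])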
Step 7: Evaluate — Measure Performance

Accuracy = (# correct) / (# total) × 100%
A simple CNN achieves ~99% on MNIST test set!

Confusion Matrix: 10×10 grid showing which digits
are confused with each other (e.g. 4 mistaken as 9)

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-Score = 2 · (Precision · Recall) / (Precision + Recall)
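A sketch of these metrics with scikit-learn (an assumed dependency; the formulas above can also be coded by hand). The toy labels stand in for the test labels and argmax(ŷ):

--- Python sketch: accuracy, confusion matrix, precision/recall/F1 ---
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

y_true = [3, 7, 3, 9, 4]                          # toy integer class labels
y_pred = [3, 7, 3, 4, 4]

print(accuracy_score(y_true, y_pred))             # 0.8
print(confusion_matrix(y_true, y_pred))           # confusion grid over the labels present
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(prec, rec, f1)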

Parameter Count — How Big is Our CNN?

--- CONV1: 32 filters, 3×3, input channels=1 ---
Params = (3×3×1 + 1) × 32 = 10 × 32 = 320

--- CONV2: 64 filters, 3×3, input channels=32 ---
Params = (3×3×32 + 1) × 64 = 289 × 64 = 18,496

--- FC1: 3136 inputs, 128 outputs ---
Params = (3136 + 1) × 128 = 401,536

--- Output: 128 inputs, 10 outputs ---
Params = (128 + 1) × 10 = 1,290

Total ≈ 421,642 parameters
(vs ~600k for a plain MLP — AND CNN extracts spatial features!)
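A tiny sketch that recomputes these counts from the two formulas (filter² × in_channels + 1) × filters and (inputs + 1) × outputs:

--- Python sketch: parameter counting ---
def conv_params(f, c_in, n_filters):
    return (f * f * c_in + 1) * n_filters      # +1 for each filter's bias

def dense_params(n_in, n_out):
    return (n_in + 1) * n_out                  # +1 for each output unit's bias

total = (conv_params(3, 1, 32)        # CONV1: 320
         + conv_params(3, 32, 64)     # CONV2: 18,496
         + dense_params(3136, 128)    # FC1: 401,536
         + dense_params(128, 10))     # OUTPUT: 1,290
print(total)                          # 421,642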
Section 06

⚙️ Training — Deep Dive

🧯 Overfitting vs Underfitting
Total Error = Bias² + Variance + Irreducible Noise

Underfitting = High Bias = model too simple, can't learn pattern
Overfitting = High Variance = model memorizes training data
Sweet Spot = Low bias AND low variance

Signs:
  train_loss↓ & val_loss↓ → good, keep training
  train_loss↓ & val_loss↑ → overfitting! use dropout/regularize
  train_loss high & val_loss high → underfitting, need more capacity
🛡️ Regularization (L1, L2, Dropout)
--- L2 (Ridge / Weight Decay) ---
L_total = L + λ Σ wᵢ²
Shrinks all weights → simpler model

--- L1 (Lasso) ---
L_total = L + λ Σ|wᵢ|
Pushes some weights exactly to 0 → sparse model

--- Dropout ---
During training: randomly set p fraction of neurons to 0
During inference: use all neurons, scale by (1-p)
Effect: prevents co-adaptation, like training an ensemble

λ = regularization strength (hyperparameter)
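A NumPy sketch of both ideas: an L2 penalty added to the loss, and a dropout mask (standard, non-inverted dropout, matching the scale-at-inference description above):

--- Python sketch: L2 penalty + dropout ---
import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(weights, lam=1e-4):
    return lam * sum(np.sum(w ** 2) for w in weights)   # λ Σ wᵢ²

def dropout(h, p=0.5, training=True):
    if not training:
        return h * (1 - p)                   # inference: use all neurons, scale by (1−p)
    mask = rng.random(h.shape) >= p          # keep each neuron with probability 1−p
    return h * mask

h = np.array([0.5, 1.2, -0.3, 2.0])
print(dropout(h, p=0.5))                     # roughly half the activations zeroed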
📉 Learning Rate Scheduling
--- Step Decay ---
η = η₀ × γ^⌊epoch/step_size⌋

--- Exponential Decay ---
η = η₀ × e^(−λ·epoch)

--- Cosine Annealing ---
ηₜ = η_min + 0.5(η_max − η_min)(1 + cos(πt/T))

Intuition: start with large η to explore, reduce to fine-tune
ReduceLROnPlateau: reduce η if val_loss stops improving
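The three schedules as plain Python functions of the epoch (the decay constants are illustrative defaults, not values from the guide):

--- Python sketch: learning rate schedules ---
import math

def step_decay(epoch, eta0=0.001, gamma=0.5, step_size=10):
    return eta0 * gamma ** (epoch // step_size)

def exp_decay(epoch, eta0=0.001, lam=0.05):
    return eta0 * math.exp(-lam * epoch)

def cosine_annealing(t, T=50, eta_min=1e-5, eta_max=0.001):
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

print([round(step_decay(e), 6) for e in (0, 10, 20)])   # 0.001, 0.0005, 0.00025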
🔄 Data Augmentation

Create more training data by transforming existing images. CNN learns to be invariant to these transforms.

Rotation: rotate image ±10–20°
Translation: shift image left/right/up/down
Flipping: horizontal flip (not for digits! a flipped or rotated 6 can pass for a 9)
Zoom: scale in/out
Noise: add Gaussian noise N(0, σ²)
Brightness: randomly adjust brightness/contrast

Effect: reduces overfitting, improves generalization
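A NumPy-only sketch of two digit-safe augmentations, a small shift and Gaussian noise (np.roll wraps pixels around the border, which is acceptable for a sketch; frameworks offer richer rotation/zoom transforms):

--- Python sketch: simple augmentations ---
import numpy as np

rng = np.random.default_rng(0)

def shift(img, dx, dy):
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)   # translate by (dx, dy) pixels

def add_noise(img, sigma=0.05):
    return np.clip(img + rng.normal(0, sigma, img.shape), 0.0, 1.0)   # N(0, σ²) noise

img = rng.random((28, 28))              # stand-in for a normalized MNIST image
augmented = add_noise(shift(img, 2, -1))
print(augmented.shape)                  # (28, 28)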
🏗️ Famous CNN Architectures
Model        | Year | Key Innovation                   | Layers
LeNet-5      | 1998 | First practical CNN (digits!)    | 7
AlexNet      | 2012 | Deep CNN, ReLU, Dropout, GPU     | 8
VGGNet       | 2014 | Very deep, all 3×3 filters       | 16–19
GoogLeNet    | 2014 | Inception modules, 1×1 conv      | 22
ResNet       | 2015 | Skip/residual connections        | 50–152
DenseNet     | 2017 | Dense connections between layers | 121–201
EfficientNet | 2019 | Compound scaling                 | B0–B7
--- ResNet Skip Connection (key formula) ---
H(x) = F(x) + x
F(x) = what the layers learn  |  x = identity shortcut
Allows training of very deep networks (solves vanishing gradient)
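A tiny NumPy sketch of the skip connection H(x) = F(x) + x (a toy fully connected "block"; real ResNets use conv layers and batch norm):

--- Python sketch: residual (skip) connection ---
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(0, 0.1, (16, 16)), rng.normal(0, 0.1, (16, 16))

def residual_block(x):
    f = np.maximum(0, W1 @ x)        # F(x): what the layers learn
    f = W2 @ f
    return np.maximum(0, f + x)      # add the identity shortcut x, then activate

x = rng.random(16)
print(residual_block(x).shape)       # (16,) — same shape, so the shortcut adds cleanly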
Section 07

📖 Complete Glossary & Abbreviations

AI          | Artificial Intelligence — machines mimicking human intelligence
ML          | Machine Learning — learning patterns from data
DL          | Deep Learning — ML with deep neural networks
MCP         | McCulloch–Pitts Neuron (1943) — first math neuron model
MLP         | Multi-Layer Perceptron — neural network with hidden layers
CNN         | Convolutional Neural Network — for image data
RNN         | Recurrent Neural Network — for sequential data
LSTM        | Long Short-Term Memory — improved RNN
NN / ANN    | Neural Network / Artificial Neural Network
ReLU        | Rectified Linear Unit — activation f(z) = max(0, z)
σ (sigma)   | Sigmoid activation function — outputs (0, 1)
GD          | Gradient Descent — optimization algorithm
SGD         | Stochastic Gradient Descent — updates from one sample (or mini-batch) per step
Adam        | Adaptive Moment Estimation — widely used default optimizer
MSE         | Mean Squared Error — regression loss
BCE         | Binary Cross-Entropy — binary classification loss
CCE         | Categorical Cross-Entropy — multi-class loss
BN          | Batch Normalization — stabilizes training
FC          | Fully Connected (Dense) layer
CONV        | Convolutional layer
POOL        | Pooling layer (MaxPool, AvgPool)
FM          | Feature Map — output of a conv layer
RF          | Receptive Field — input region a neuron sees
η (eta)     | Learning Rate — step size for updates
λ (lambda)  | Regularization strength (L1/L2)
∇L          | Gradient of Loss — direction of steepest ascent
∂ (partial) | Partial derivative — derivative w.r.t. one variable
w, b        | Weights and Bias — learnable parameters
ŷ (y-hat)   | Predicted output
H, W, C     | Height, Width, Channels of a feature map
P, S        | Padding, Stride — conv hyperparameters
argmax      | Returns the index of the maximum value
GAP         | Global Average Pooling — one number per FM
TP/FP/FN    | True Positives / False Positives / False Negatives
MNIST       | Modified NIST dataset — 70k handwritten digits
GPU         | Graphics Processing Unit — fast parallel compute for DL
Epoch       | One full pass through the entire training dataset
Batch       | Subset of data used for one gradient update
🧠 Deep Learning Guide  ·  All formulas, all terms, simple English  ·  Made for learners