How Learning Rate Shapes the Journey of a Neural Network#
A Deep Learning Blog Post by Joanna Reyda D. Santos
Introduction#
Training a neural network is similar to teaching a student.
If the pace is too slow, learning becomes unproductive.
If the pace is too fast, the student becomes overwhelmed.
In deep learning, this “pace of learning” is controlled by a single but influential hyperparameter: the learning rate.
What is Learning Rate?#
During training, model weights are updated using gradient descent:
Where:
$\(\( \theta \)\)$: model weights
$\(\( L \)\)$: loss function
$\(\( \nabla_\theta L \)\)$: gradient
$\(\( \eta \)\)$: learning rate
Small Learning Rate (0.0001)#
learns slowly
stable but may get stuck in local minima
Ideal Learning Rate (0.001)#
balanced
allows smooth convergence
often yields the best overall accuracy
Large Learning Rate (0.01)#
unstable
may overshoot minima
risk of divergence
Experiment Overview#
To visualize how learning rate affects model behavior, I trained a simple CNN on the MNIST dataset using three different learning rates:
Learning Rate |
Expected Behavior |
|---|---|
0.0001 |
slow learning |
0.001 |
optimal |
0.01 |
unstable |
Dataset: MNIST (28×28 grayscale images)
Model: Simple CNN
Epochs: 5
Optimizer: Adam
CNN Model Code#
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
class SimpleCNN(nn.Module):
def __init__(self):
super().__init__()
self.conv = nn.Sequential(
nn.Conv2d(1, 16, 3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(16, 32, 3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2)
)
self.fc = nn.Sequential(
nn.Flatten(),
nn.Linear(32*7*7, 128),
nn.ReLU(),
nn.Linear(128, 10)
)
def forward(self, x):
return self.fc(self.conv(x))
Dataset Loading#
transform = transforms.Compose([transforms.ToTensor()])
train_set = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_set = datasets.MNIST(root="./data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=64, shuffle=False)
Training Function#
def train_model(lr):
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)
train_losses = []
val_accs = []
for epoch in range(5):
model.train()
total_loss = 0
for x, y in train_loader:
x, y = x.to(device), y.to(device)
optimizer.zero_grad()
preds = model(x)
loss = criterion(preds, y)
loss.backward()
optimizer.step()
total_loss += loss.item()
train_losses.append(total_loss / len(train_loader))
model.eval()
correct = 0
with torch.no_grad():
for x, y in test_loader:
x, y = x.to(device), y.to(device)
preds = model(x)
correct += (preds.argmax(1) == y).sum().item()
acc = correct / len(test_set)
val_accs.append(acc)
print(f"LR={lr} | Epoch {epoch+1}: Loss={train_losses[-1]:.4f}, Acc={acc:.4f}")
return train_losses, val_accs
Running the Experiments#
lrs = [0.0001, 0.001, 0.01]
results = {}
for lr in lrs:
print("="*60)
print(f"Training with LR = {lr}")
results[lr] = train_model(lr)
Plotting the Results#
plt.figure(figsize=(14,5))
# Loss Curves
plt.subplot(1,2,1)
for lr in lrs:
plt.plot(results[lr][0], label=f"LR={lr}")
plt.title("Training Loss Curve")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
# Accuracy Curves
plt.subplot(1,2,2)
for lr in lrs:
plt.plot(results[lr][1], label=f"LR={lr}")
plt.title("Validation Accuracy Curve")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
Results & Interpretation#
Learning Rate = 0.001 (Optimal)#
The training curve is stable and smooth.
Achieves the highest accuracy (~98%).
Converges efficiently.
Learning Rate = 0.0001 (Too Small)#
Training is very slow.
Loss decreases gradually.
Needs more epochs to reach high accuracy.
Learning Rate = 0.01 (Too Large)#
Highly unstable.
Accuracy oscillates and may not improve.
Sometimes diverges entirely.
Reflections#
From this experiment, I learned that even small changes in learning rate dramatically affect:
Training stability
Speed of convergence
Final performance
A poorly chosen learning rate cannot be compensated by simply training longer.
Because of this, tuning the learning rate is often the first and most important hyperparameter decision.
Conclusion#
The learning rate is one of the most influential hyperparameters in deep learning.
A well-chosen learning rate enables efficient and stable learning, while a poor choice can prevent learning altogether.
This experiment demonstrated how different learning rates affect the behavior of a CNN model on MNIST and highlighted why selecting the right value is crucial for model success.
References#
PyTorch Documentation: https://pytorch.org/docs/stable/index.html
Deep Learning (Goodfellow, Bengio, Courville): https://www.deeplearningbook.org/
MNIST Dataset Paper: http://yann.lecun.com/exdb/mnist/
DS413 Lecture Notes (Instructor-provided materials)