Implementation of The CNN Model

This section describes the representative model in detail: the strategies applied, the reasoning behind them, and guidance for reproducing the exact model.

  • A Jupyter notebook was chosen as the development and testing environment for convenience.
  • Everything detailed here can be traced back to the actual notebook.
  • The notebook is divided into 8 labeled sections, each named for the task it handles. It walks through the whole process, from declaring the model to training it, with notes for each section.

1. Data Pipeline & Preprocessing

CIFAR-10 consists of 50,000 training and 10,000 test images (32×32 RGB). Because the download mirror was temporarily down, the dataset was downloaded manually and placed in a local folder with one subdirectory per class. ImageFolder was then used to map the subdirectories to class labels.

A custom TransformedSubset wrapper was necessary because the subsets returned by PyTorch’s random_split share the parent dataset’s transform, so there is no built-in way to give the training and validation splits different transforms.
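
The notebook's wrapper is not reproduced here. The following is a minimal sketch of what such a wrapper could look like, assuming the base dataset is loaded without a transform so that each item is still an untransformed (image, label) pair. The class body and the attribute names `subset` and `transform` are illustrative assumptions.

```python
from torch.utils.data import Dataset


class TransformedSubset(Dataset):
    """Wrap a subset and apply a transform lazily on item access.

    Hypothetical sketch: assumes `subset[i]` returns an (image, label) pair
    whose image has not been transformed yet.
    """

    def __init__(self, subset, transform=None):
        self.subset = subset
        self.transform = transform

    def __len__(self):
        return len(self.subset)

    def __getitem__(self, index):
        image, label = self.subset[index]
        if self.transform is not None:
            image = self.transform(image)
        return image, label
```

With a wrapper of this kind, each split produced by random_split can be wrapped separately, one with the training transform and one with the validation/test transform.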

from torchvision import transforms

# Training pipeline: random augmentation, then tensor conversion and normalization.
transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),   # per-channel mean for CIFAR-10
                         (0.2023, 0.1994, 0.2010))   # per-channel std for CIFAR-10
])

# Validation/test pipeline: no augmentation, same normalization statistics.
transform_val_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2023, 0.1994, 0.2010))
])
  • Augmentation (flip, crop, rotation) artificially expands the effective training set and encourages the network to learn features that are invariant to these changes, reducing overfitting on small 32×32 images. It is applied to the training split only, so validation and test results stay deterministic.
  • Normalization uses precomputed CIFAR-10 channel means and standard deviations. Centering pixel values around zero tends to stabilize gradient flow and accelerate convergence.
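
To make the normalization concrete, the sketch below applies the same per-channel formula, (x - mean) / std, to the two extreme values that ToTensor can produce (0.0 and 1.0). The helper name `normalized_range` is made up for illustration.

```python
# Per-channel statistics copied from the transforms above.
MEAN = (0.4914, 0.4822, 0.4465)
STD = (0.2023, 0.1994, 0.2010)


def normalized_range(mean, std):
    """Return the (min, max) a channel can take after Normalize.

    ToTensor scales pixels to [0, 1], and Normalize computes (x - mean) / std,
    so the extremes come from x = 0 and x = 1.
    """
    return (0.0 - mean) / std, (1.0 - mean) / std


for name, m, s in zip("RGB", MEAN, STD):
    low, high = normalized_range(m, s)
    print(f"{name}: [{low:.3f}, {high:.3f}]")
```

Every channel ends up spanning roughly -2.5 to 2.8, centered near zero, instead of the original 0 to 1.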

For reference, see the Data Transformation section of the notebook.

2. Model Architecture: theCNN

The model follows a classic progressive feature-extraction pattern, combined with modern stabilizers (batch normalization, dropout) and a smooth activation (SiLU) for reliable convergence.

import torch.nn as nn


class theCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32), nn.SiLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.SiLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128), nn.SiLU(),
            nn.MaxPool2d(2, 2)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 256),
            nn.SiLU(),
            nn.Dropout(0.4),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        return self.classifier(self.features(x))
  • Channel Progression (3 → 32 → 64 → 128): Gradually increases representational capacity while each of the two pooling layers halves the spatial resolution (32 → 16 → 8).
  • Batch Normalization: Placed after every convolution to mitigate internal covariate shift, which allows higher learning rates and acts as a mild regularizer.
  • SiLU over ReLU: SiLU (x * sigmoid(x)) is a smooth, non-monotonic activation that often yields better convergence and slightly higher accuracy on image tasks than the piecewise-linear ReLU.
  • Dropout (0.4): Randomly zeroes 40% of the 256 hidden activations in the classifier during training. This discourages co-adaptation between features, reduces the chance of overfitting, and improves generalization.
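
The 128 * 8 * 8 input size of the first linear layer follows from the layer settings above. The sketch below traces the spatial size with the standard output-size formula, floor((n + 2p - k) / s) + 1. The helper name `out_size` is made up for illustration.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


size = 32                                      # CIFAR-10 input is 32x32
size = out_size(size, kernel=3, padding=1)     # conv 3 -> 32: stays 32
size = out_size(size, kernel=3, padding=1)     # conv 32 -> 64: stays 32
size = out_size(size, kernel=2, stride=2)      # max pool: 32 -> 16
size = out_size(size, kernel=3, padding=1)     # conv 64 -> 128: stays 16
size = out_size(size, kernel=2, stride=2)      # max pool: 16 -> 8

flattened = 128 * size * size
print(size, flattened)  # 8 8192
```

If the input resolution or the number of pooling layers changes, the first nn.Linear must be resized accordingly.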

3. Training Configuration & Optimization

The training loop runs for 30 epochs on a CPU and uses checkpointing to keep the weights with the best validation performance.

import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss(reduction='mean', label_smoothing=0.1)
scheduler = CosineAnnealingLR(optimizer, T_max=30)
  • AdamW + Weight Decay: AdamW applies weight decay directly to the weights instead of folding it into the gradient, so the regularization strength is not distorted by Adam's adaptive scaling as it is with an L2 penalty in standard Adam.
  • Label Smoothing (0.1): Prevents the model from becoming overconfident by distributing a small probability mass across the incorrect classes. This tends to improve calibration and keeps the validation loss from drifting away from the training loss.
  • Cosine Annealing Scheduler: Gradually lowers the learning rate along a cosine curve over the 30 epochs. This allows aggressive early exploration followed by fine-grained weight refinement near the end of training.
  • Checkpointing: Saves model.state_dict() only when validation accuracy improves, so the weights loaded afterwards come from the best validation epoch rather than the final (potentially overfitted) epoch.
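
As a worked illustration of two of the settings above, the sketch below evaluates the closed-form cosine annealing curve for lr=0.001 and T_max=30 (assuming the scheduler's default minimum learning rate of 0) and builds the smoothed target distribution that label_smoothing=0.1 implies for 10 classes. The helper names `cosine_lr` and `smoothed_target` are made up for illustration and are not part of PyTorch.

```python
import math


def cosine_lr(epoch, base_lr=0.001, t_max=30, eta_min=0.0):
    """Closed-form cosine annealing value at a given epoch."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


def smoothed_target(true_class, num_classes=10, smoothing=0.1):
    """Mix a one-hot target with a uniform distribution over all classes."""
    uniform = smoothing / num_classes
    target = [uniform] * num_classes
    target[true_class] += 1.0 - smoothing
    return target


for epoch in (0, 15, 30):
    print(epoch, f"{cosine_lr(epoch):.6f}")

print([round(p, 2) for p in smoothed_target(true_class=3)])
```

The learning rate starts at 0.001, passes 0.0005 at the midpoint, and reaches 0 at epoch 30. The true class receives a target probability of 0.91 and every other class 0.01.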

4. Evaluation & Real-World Inference

The trained model achieved ~82% accuracy on both the validation and test sets, consistent with well-tuned small CNNs on CIFAR-10. External inference was then run on manually collected non-CIFAR images, which were resized to 32×32 and passed through the same normalization pipeline. The model correctly classified aircraft, cars, trucks, and birds with high confidence (>85%). The sole misclassification was a bulldog puppy labeled as a bird, which highlights a known limitation: models trained on clean, centered, low-resolution datasets struggle with out-of-distribution poses, lighting, and juvenile animal morphologies.

5. Model Reusability

To deploy or test the trained weights in a new environment, the model architecture must be declared identically before the checkpoint is loaded:

import torch

DEVICE = torch.device("cpu")
model = theCNN().to(DEVICE)

checkpoint = torch.load("../Checkpoints/best_model.pth", map_location=DEVICE)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()  # switch BatchNorm and Dropout to inference behavior

Note: The saved state_dict contains only parameter and buffer tensors keyed by layer name, not the architecture itself. Keeping the model class definition identical ensures that every saved tensor maps onto the matching layer, which gives reproducible inference across machines.
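
The loading code above expects a dictionary with a "model_state_dict" key. A minimal sketch of a matching save-and-reload round trip is shown below, using a tiny stand-in module instead of theCNN so that it runs on its own. The names `TinyNet` and `save_checkpoint`, the temporary file path, and the extra `val_acc` entry are assumptions for illustration, not the notebook's code.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Stand-in for theCNN; any nn.Module is handled the same way."""

    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8), nn.Linear(8, 2))

    def forward(self, x):
        return self.body(x)


def save_checkpoint(model, path, val_acc):
    """Store the weights under the key that the loading code reads."""
    torch.save({"model_state_dict": model.state_dict(), "val_acc": val_acc}, path)


path = os.path.join(tempfile.mkdtemp(), "best_model.pth")
trained = TinyNet()
save_checkpoint(trained, path, val_acc=0.0)  # placeholder accuracy value

# The class must be declared before loading: the file holds tensors only.
restored = TinyNet()
checkpoint = torch.load(path, map_location=torch.device("cpu"))
restored.load_state_dict(checkpoint["model_state_dict"])
restored.eval()

print(list(checkpoint["model_state_dict"].keys()))
```

The printed keys include the BatchNorm running statistics alongside the weights and biases. load_state_dict matches tensors by these names, so a class definition that produces different names or shapes raises an error.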


Discussion

  • Architectural Improvements: Current research and industry-standard models suggest many ways to improve on this baseline through more sophisticated architecture design and training techniques.
  • Scope of Implementation: I did not try many variations of activation functions, loss functions, or backpropagation techniques here, because this was an introductory project. Although I have studied many of these ideas and the corresponding mathematics, the simplicity of the task did not call for more complicated processes.
  • Project Context: CIFAR-10 is generally considered one of the best starting points for implementing a CNN, which suits the educational goals of this project.