Fine-Tuning EfficientNet-B0 for Painting Style Classification

Introduction

I fine-tuned EfficientNet-B0 to classify artworks into 9 painting styles using the Hugging Face dataset keremberke/painting-style-classification.
The aim was to build a fully custom PyTorch training pipeline (dataset preparation, augmentation, transfer learning, and evaluation) and to understand both what works and what limits accuracy for this task.

👉 Model card: milliyin/painting-style-classification
https://huggingface.co/milliyin/painting-style-classification

Dataset Preparation

I downloaded the full dataset (not the mini split) from Hugging Face as ZIP archives for the train, validation, and test splits, then organized it into this folder structure:

dataset/
  images/train
  images/validation
  images/test
  jsonl/train.jsonl
  jsonl/validation.jsonl
  jsonl/test.jsonl

Images were extracted, renamed with zero-padded IDs, and assigned numeric labels based on their original folder names (e.g., baroque → 4, renaissance → 5, surrealism → 8).
I also generated .jsonl files containing metadata for each split (image ID, label, split type) and a combined JSONL for all splits.
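The metadata step could be sketched roughly as follows. The field names (image_id, label, split), the helper name write_split_jsonl, and the six-digit zero padding are assumptions for illustration; only the three label values listed above are taken from the actual mapping, and the other six styles are left out rather than guessed.

```python
import json
from pathlib import Path

# Partial label map: only these three pairs appear in the write-up.
KNOWN_LABELS = {"baroque": 4, "renaissance": 5, "surrealism": 8}


def write_split_jsonl(records, split, out_path, label_map=KNOWN_LABELS, pad=6):
    """Write one JSON object per line for a single split.

    `records` is an iterable of (numeric_id, style_name) pairs.
    Returns the number of lines written.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out_path.open("w", encoding="utf-8") as fh:
        for numeric_id, style in records:
            row = {
                "image_id": f"{numeric_id:0{pad}d}",  # zero-padded ID (assumed width)
                "label": label_map[style],            # numeric label from folder name
                "split": split,
            }
            fh.write(json.dumps(row) + "\n")
            count += 1
    return count
```

Calling write_split_jsonl([(1, "baroque")], "train", "dataset/jsonl/train.jsonl") would produce a single line with image_id "000001" and label 4.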

To work with the data easily, I implemented a custom dataset loader (FolderDataset) that reads these JSONLs and exposes each split by name, e.g. dataset['train'].
A second wrapper (PaintingDataset) applies the transforms and returns (image, label) pairs for PyTorch.
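A minimal sketch of how two such classes could fit together is shown below. The class names come from the text, but everything inside them is an assumption: the JSONL field names (image_id, label, split), the .jpg extension, and the image_path helper are hypothetical, and the notebook's real implementation may differ.

```python
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class FolderDataset:
    """Reads dataset/jsonl/<split>.jsonl and exposes rows via ds['train'] etc."""

    def __init__(self, root, splits=("train", "validation", "test")):
        self.root = Path(root)
        self.splits = {}
        for split in splits:
            path = self.root / "jsonl" / f"{split}.jsonl"
            with path.open(encoding="utf-8") as fh:
                self.splits[split] = [json.loads(line) for line in fh if line.strip()]

    def __getitem__(self, split):
        return self.splits[split]

    def image_path(self, row):
        # Assumed layout: images/<split>/<image_id>.jpg
        return self.root / "images" / row["split"] / f"{row['image_id']}.jpg"


class PaintingDataset(Dataset):
    """Wraps one split and returns (image, label) pairs for a DataLoader."""

    def __init__(self, folder_dataset, split, transform=None):
        self.folder_dataset = folder_dataset
        self.rows = folder_dataset[split]
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        image = Image.open(self.folder_dataset.image_path(row)).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, row["label"]
```

Keeping the JSONL reader separate from the PyTorch wrapper means the same metadata can feed different transform pipelines for training and evaluation.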

Data Augmentation

For training, I applied:

  • Resize to 224×224 (EfficientNet-B0 input size)
  • Random horizontal flip (50% probability)
  • Random rotation up to 15°
  • Color jitter (brightness, contrast, saturation, hue)
  • Random affine translation
  • Normalization to ImageNet stats

For validation and test, only resizing and normalization were applied.

This augmentation strategy was designed to help the model generalize from ~4k images without overfitting.

Model Architecture

I started from torchvision.models.efficientnet_b0 with ImageNet (IMAGENET1K_V1) pretrained weights.
The final classifier layer was replaced with:

  • Dropout (0.2)
  • Fully-connected layer to 9 output classes (matching dataset styles)

Transfer Learning Strategy

At the start, I froze all layers up to ~layer 100, both to speed up convergence and to protect the pretrained features from catastrophic forgetting.
The plan was to gradually unfreeze:

  • Epoch 10: unfreeze more layers (freeze_until_layer=50)
  • Epoch 20: unfreeze all layers for full fine-tuning

This step-wise unfreezing let the classifier head and the later blocks adapt before the earlier convolutional blocks were updated.
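One way to implement such a schedule is sketched below. It makes two assumptions the post does not confirm: that "layer" means an index into model.parameters(), and that epochs are counted from zero. The helper names are hypothetical.

```python
from torch import nn


def freeze_until_for_epoch(epoch):
    """Return the freeze boundary for a zero-based epoch index (assumed convention)."""
    if epoch >= 20:
        return 0      # everything trainable
    if epoch >= 10:
        return 50
    return 100


def apply_freeze(model, freeze_until_layer):
    """Freeze the first `freeze_until_layer` parameter tensors and train the rest.

    Returns the number of trainable parameter tensors.
    """
    trainable = 0
    for idx, param in enumerate(model.parameters()):
        param.requires_grad = idx >= freeze_until_layer
        trainable += int(param.requires_grad)
    return trainable
```

Calling apply_freeze(model, freeze_until_for_epoch(epoch)) at the top of each epoch applies the schedule. If the optimizer was constructed over all of model.parameters(), frozen tensors receive no gradient and PyTorch optimizers skip them, so the optimizer does not need to be rebuilt when layers are unfrozen.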

Training Setup

  • Loss Function: CrossEntropyLoss with label smoothing (label_smoothing=0.1)
  • Optimizer: AdamW (lr=1e-4, weight_decay=0.01)
  • Scheduler: ReduceLROnPlateau (monitors validation accuracy; halves the LR after 5 epochs without improvement)
  • Batch Size: 32
  • Epochs: 50
  • Device: CUDA

The training loop tracked train loss/accuracy and validation loss/accuracy each epoch.
If the validation accuracy improved, the model was saved as best_efficientnet_b0.pth.
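The setup above maps onto PyTorch roughly as follows. The hyperparameters are the ones listed; the helper names, and the choice to save a state_dict rather than the whole model object, are assumptions.

```python
import torch
from torch import nn


def build_training_components(model):
    """Create the loss, optimizer, and LR scheduler described in the setup."""
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
    # mode="max" because the monitored quantity is validation accuracy.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", factor=0.5, patience=5
    )
    return criterion, optimizer, scheduler


def end_of_epoch(model, scheduler, val_acc, best_acc, path="best_efficientnet_b0.pth"):
    """Step the scheduler and checkpoint on improvement. Returns the new best accuracy."""
    scheduler.step(val_acc)
    if val_acc > best_acc:
        torch.save(model.state_dict(), path)  # assumed: weights only
        return val_acc
    return best_acc
```

In the epoch loop, best_acc would start at 0.0 and be updated with best_acc = end_of_epoch(model, scheduler, val_acc, best_acc) after each validation pass.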

Evaluation & Results

The best validation accuracy achieved was:

✅ 60.15% after 50 epochs

I also generated a classification report and plotted loss/accuracy curves to analyze overfitting patterns.
I tested inference on individual images, reporting the top-1 predicted style and its confidence score.
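Single-image inference of this kind can be sketched as below. The function name predict_top1 and the class_names argument are hypothetical; the confidence is simply the softmax probability of the winning class, which is a common convention but not necessarily a calibrated one.

```python
import torch
from torch import nn


@torch.no_grad()
def predict_top1(model, image_tensor, class_names):
    """Return (style_name, confidence) for one preprocessed image of shape (3, H, W)."""
    model.eval()
    device = next(model.parameters()).device
    logits = model(image_tensor.unsqueeze(0).to(device))  # add a batch dimension
    probs = torch.softmax(logits, dim=1).squeeze(0)
    confidence, index = probs.max(dim=0)
    return class_names[index.item()], confidence.item()
```

The image tensor is expected to have gone through the same resize and normalization as the validation data.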

Why Did It Plateau Around ~60%?

  1. High Inter-Class Similarity – Certain styles (e.g., Romanticism vs. Realism) share strong visual overlap.
  2. Label Noise – Open datasets may have inconsistent labels.
  3. Data Imbalance – Some styles had fewer samples, causing uneven learning.
  4. Limited Early Unfreezing – Freezing many layers for too long limited domain adaptation from natural photos to paintings.
  5. Moderate Augmentation – Could be stronger to handle variations in scan quality, lighting, and framing.
  6. Model Size – EfficientNet-B0 is compact; larger backbones may better capture fine texture differences.
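The data-imbalance hypothesis in particular is cheap to check from the metadata files. The sketch below assumes each JSONL row has a numeric "label" field; the helper names are hypothetical.

```python
import json
from collections import Counter


def label_counts(jsonl_path):
    """Count how many rows each numeric label has in one split's JSONL file."""
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                counts[json.loads(line)["label"]] += 1
    return counts


def imbalance_ratio(counts):
    """Largest class size divided by smallest; 1.0 means perfectly balanced."""
    return max(counts.values()) / min(counts.values())
```

A ratio well above 1 on the training split would support the imbalance explanation, while a ratio near 1 would point toward the other causes on the list.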

How to Improve

  • Earlier & Gradual Unfreezing – Allow backbone adaptation sooner.
  • Stronger Augmentations – Use RandAugment, CutMix, Mixup, or color-space perturbations.
  • Class-Balanced Sampling – Reduce bias toward majority classes.
  • Bigger Backbone – Try EfficientNet-B2/B3, ConvNeXt-Tiny, or ViT models.
  • Curated Splits – Avoid artist overlap between train/validation to measure generalization accurately.
  • TTA & Ensembling – Test-time augmentation and combining predictions from several models typically give small accuracy gains.
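As one example from this list, class-balanced sampling can be sketched with PyTorch's WeightedRandomSampler. This is an illustration of the idea rather than something the notebook already does; the helper name make_balanced_sampler and the seed argument are hypothetical.

```python
from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler


def make_balanced_sampler(labels, seed=0):
    """Sample each example with probability inversely proportional to its class size."""
    counts = Counter(labels)
    weights = torch.tensor([1.0 / counts[y] for y in labels], dtype=torch.double)
    generator = torch.Generator().manual_seed(seed)
    return WeightedRandomSampler(
        weights, num_samples=len(labels), replacement=True, generator=generator
    )
```

The sampler would be passed to the training DataLoader via sampler=... in place of shuffle=True. Because it samples with replacement, minority-class images are seen several times per epoch, which makes augmentation more important.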

You can explore the complete training pipeline, dataset processing, and fine-tuning notebook here:
👉 GitHub Notebook: painting-style-classification-finetune/finetune.ipynb

Benefits of This Project

  1. Custom Dataset Handling – Built a JSONL + folder-based loader for structured control.
  2. End-to-End Pipeline – Covered raw data → augmentation → model training → evaluation → inference.
  3. Transfer Learning Practice – Applied freezing/unfreezing and domain adaptation strategies.
  4. Error Analysis Mindset – Turned a “60% wall” into a checklist of targeted improvements.

Conclusion

This project provided a hands-on look at training image classification models for nuanced visual categories like art styles.
With a solid baseline at ~60% validation accuracy, there is plenty of room to iterate, particularly on augmentation, layer unfreezing, and backbone scaling.
