Building an SMS Spam Classifier with PyTorch and Hugging Face

Introduction

I recently built a simple SMS spam classifier using PyTorch, Scikit-learn, and the Hugging Face sms_spam dataset. The project trains a small feed-forward neural network (in effect, logistic regression with one hidden layer added) to decide whether an SMS message is spam or not spam.

I wanted to understand the end-to-end pipeline of a binary NLP classification task by writing my own model and training code rather than relying solely on pre-trained transformers.

Model Architecture and Workflow

The neural network (CircleModelV0) used in this project consists of:

  • Input Layer: 8713 features derived from CountVectorizer applied to SMS texts.
  • Hidden Layer: 64 neurons with ReLU activation to introduce non-linearity.
  • Output Layer: Single neuron with sigmoid activation for binary output (spam or ham).
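The three layers listed above can be sketched in PyTorch as follows. This is a minimal sketch rather than the repository's actual code: the class name and layer sizes come from this post, while the attribute name `layers` and the constructor arguments are assumptions made for illustration.

```python
import torch
from torch import nn


class CircleModelV0(nn.Module):
    """Bag-of-words counts -> 64 ReLU units -> 1 sigmoid unit."""

    def __init__(self, in_features: int = 8713, hidden_units: int = 64):
        super().__init__()
        # `layers` is an illustrative attribute name, not taken from the repository.
        self.layers = nn.Sequential(
            nn.Linear(in_features, hidden_units),  # input -> hidden
            nn.ReLU(),                             # non-linearity
            nn.Linear(hidden_units, 1),            # hidden -> single output neuron
            nn.Sigmoid(),                          # squash to a spam probability
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Returns one probability per row, shape (batch_size, 1).
        return self.layers(x)


model = CircleModelV0()
dummy_batch = torch.zeros(4, 8713)  # four all-zero "messages" as a shape check
probs = model(dummy_batch)
print(probs.shape)  # torch.Size([4, 1])
```

With these sizes the network has 557,761 trainable parameters, almost all of them in the first linear layer.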

Training Workflow

  1. Dataset Loading
    The dataset, which contains 5,574 SMS messages labeled as spam or ham, was loaded with Hugging Face’s datasets library.

  2. Preprocessing with Scikit-learn
    The SMS texts were converted into numerical vectors using CountVectorizer, which produces a sparse bag-of-words matrix with 8713 columns, one per vocabulary term.

  3. Data Preparation for PyTorch
    The feature vectors and labels were converted to PyTorch tensors and split into training and testing sets.

  4. Model Definition and Training
    Using torch.nn.Module, I implemented the CircleModelV0 class. The training loop included forward propagation, binary cross-entropy loss computation, backpropagation, and parameter updates using the Adam optimizer.

  5. Saving the Model and Vectorizer
    After training, the model was saved as full_model.pth and the vectorizer as vectorizer.pkl for later inference.
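Steps 2 to 5 above can be sketched end to end as follows. This is a sketch under stated assumptions, not the repository's training script: the eight-message corpus is a hypothetical stand-in so the example runs offline (the real data would come from something like `datasets.load_dataset("sms_spam")`), and the learning rate, epoch count, split ratio, and output directory are illustrative choices. One deliberate swap: the sketch saves the model's `state_dict` rather than pickling the whole module, even though the filename full_model.pth suggests the original saved the full model.

```python
import pickle
import tempfile
from pathlib import Path

import torch
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in corpus (1 = spam, 0 = ham), used so the sketch runs offline.
texts = [
    "you have won a free prize",
    "claim your free cash reward now",
    "urgent winner call now to claim",
    "free entry text win to this number",
    "are we still meeting for lunch",
    "see you at home tonight",
    "can you call me after class",
    "thanks for the lift this morning",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Step 2: turn each message into a vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts).toarray()  # dense array, (n_messages, vocab_size)

# Step 3: split into train/test sets, then convert to float tensors.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=42
)
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
y_test = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)


class CircleModelV0(nn.Module):
    def __init__(self, in_features: int, hidden_units: int = 64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features, hidden_units), nn.ReLU(),
            nn.Linear(hidden_units, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)


# Step 4: full-batch training with binary cross-entropy and Adam.
model = CircleModelV0(in_features=X_train.shape[1])
loss_fn = nn.BCELoss()  # expects probabilities, which the sigmoid provides
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # lr is an assumption

losses = []
for epoch in range(50):  # epoch count is an assumption
    model.train()
    preds = model(X_train)          # forward pass
    loss = loss_fn(preds, y_train)  # loss computation
    optimizer.zero_grad()
    loss.backward()                 # backpropagation
    optimizer.step()                # parameter update
    losses.append(loss.item())

# Step 5: save the weights and the fitted vectorizer side by side.
out_dir = Path(tempfile.mkdtemp())  # illustrative location
torch.save(model.state_dict(), out_dir / "full_model.pth")
with open(out_dir / "vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
```

Saving the fitted vectorizer matters as much as saving the model: at inference time the text has to be mapped onto exactly the same vocabulary columns the network was trained on.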

Running Inference

A separate script, inference.py, loads the saved model and vectorizer and returns a prediction for a given message. For example:

Input: you have won a free prize
Output: Spam

Because the trained model and the fitted vectorizer are both saved, the same pipeline can be reused for predictions in other scripts or experiments without retraining.
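The prediction step can be sketched as a single function. The contents of inference.py are not shown in this post, so everything here is an assumption: the function name `predict`, the 0.5 threshold, and the "Ham" label are illustrative. In the real script the model and vectorizer would first be restored from full_model.pth and vectorizer.pkl (for example with `torch.load` and `pickle.load`). To stay runnable on its own, the demo instead uses a tiny hand-weighted stand-in model, not the trained network.

```python
import torch
from sklearn.feature_extraction.text import CountVectorizer
from torch import nn


def predict(message: str, model: nn.Module, vectorizer: CountVectorizer,
            threshold: float = 0.5) -> str:
    """Return "Spam" or "Ham" for a single SMS message."""
    # transform (not fit_transform): reuse the vocabulary learned at training time.
    features = vectorizer.transform([message])
    x = torch.tensor(features.toarray(), dtype=torch.float32)
    model.eval()
    with torch.inference_mode():
        prob = model(x).item()  # single probability from the sigmoid output
    return "Spam" if prob >= threshold else "Ham"


# Hypothetical stand-ins so the sketch runs without the saved files.
demo_vectorizer = CountVectorizer().fit(
    ["you have won a free prize", "see you at lunch tomorrow"]
)
demo_model = nn.Sequential(
    nn.Linear(len(demo_vectorizer.vocabulary_), 1), nn.Sigmoid()
)
with torch.no_grad():
    # Hand-set weights: spam-like words push the score up, the bias pulls it down.
    demo_model[0].weight.zero_()
    demo_model[0].bias.fill_(-3.0)
    for word in ("won", "free", "prize"):
        demo_model[0].weight[0, demo_vectorizer.vocabulary_[word]] = 2.0

print(predict("you have won a free prize", demo_model, demo_vectorizer))  # Spam
print(predict("see you at lunch tomorrow", demo_model, demo_vectorizer))  # Ham
```

Words the vectorizer never saw during fitting are ignored by `transform`, so a message made up entirely of unseen words becomes an all-zero vector and is scored by the bias terms alone.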

🔗 Project Repository

You can find the complete code, including training and inference scripts, in the GitHub repository here:
👉 milliyin/sms-spam-model-train

Benefits of This Project

  1. Hands-on PyTorch Training – Implemented a neural network from scratch without pre-built classifiers.
  2. Clear NLP Workflow Understanding – Learned how to process textual data into model-ready tensors.
  3. Efficient Inference Pipeline – Created a streamlined script to reuse the trained model for quick predictions.
  4. Utilized Public Datasets – Used a publicly available Hugging Face dataset, which makes the training run easy to reproduce.

Conclusion

This SMS spam classification project deepened my understanding of NLP preprocessing, PyTorch model training, and practical deployment pipelines. Such projects bridge the gap between theoretical knowledge and real-world applications, providing a strong foundation for building more advanced models in the future.

Hashtags

#PyTorch #SpamDetection #NLP #MachineLearning #HuggingFace