torchTextClassifiers: Modernizing Text Classification for French National Statistics

PyData Global 2025 Recap

An in-depth look at how Insee modernized text classification by transitioning from fastText to a PyTorch-based model using the open-source torchTextClassifiers package, highlighting deployment and best practices in public administration.

PyData

PyTorch

Text Classification

Machine Learning

NLP

MLOps

Lecture Overview

Discover how Insee (the French National Statistics Institute) transitioned from fastText to a PyTorch-based model for text classification by developing and open-sourcing the torchTextClassifiers Python package. This presentation covers the creation, deployment, and practical applications of torchTextClassifiers in modernizing automatic coding systems, to the benefit of Insee and other European National Statistical Institutes (NSIs).

Learning Objectives:

The rationale for moving from fastText to a PyTorch-based model in production
Packaging a PyTorch-based model architecture and open-source collaboration
Key features and architecture of torchTextClassifiers
Deployment strategies within a public administration (MLOps, cloud native tools, security)
Lessons learned and best practices for similar transitions

Insee, France’s National Institute of Statistics and Economic Studies, has long relied on fastText for automatic coding tasks. Recognizing the need to modernize and future-proof this critical functionality, we developed torchTextClassifiers, an open-source Python package that enables easy training and deployment of a PyTorch-based model for text classification and paves the way for further innovation in this domain.

This session will delve into the motivations behind replacing the archived fastText package, the design and implementation of torchTextClassifiers, and its integration into Insee’s production environment. We’ll discuss the challenges faced during this transition, including model compatibility, performance optimization, and user adoption.
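For readers unfamiliar with the approach, the following is a minimal sketch of a fastText-style classifier written in plain PyTorch: character n-grams are hashed into buckets, their embeddings are averaged, and a linear layer produces class scores. It is not the torchTextClassifiers API. The names `hash_ngrams`, `encode_batch`, and `BagOfNgramsClassifier`, the CRC32 hashing, the bucket count, the embedding size, the number of classes, and the two sample descriptions are all assumptions made for illustration.

```python
import zlib

import torch
from torch import nn


def hash_ngrams(text: str, n: int = 3, num_buckets: int = 10_000) -> list[int]:
    """Map a text to hashed character n-gram ids (hypothetical helper)."""
    # Boundary markers let the model distinguish prefixes and suffixes.
    padded = f"<{text.lower()}>"
    grams = [padded[i : i + n] for i in range(max(1, len(padded) - n + 1))]
    # crc32 is deterministic across runs, unlike Python's built-in hash().
    return [zlib.crc32(g.encode("utf-8")) % num_buckets for g in grams]


def encode_batch(texts: list[str], num_buckets: int = 10_000):
    """Flatten a batch into the (ids, offsets) pair that EmbeddingBag expects."""
    ids: list[int] = []
    offsets: list[int] = []
    for text in texts:
        offsets.append(len(ids))  # start position of this text in the flat id list
        ids.extend(hash_ngrams(text, num_buckets=num_buckets))
    return (
        torch.tensor(ids, dtype=torch.long),
        torch.tensor(offsets, dtype=torch.long),
    )


class BagOfNgramsClassifier(nn.Module):
    """Averaged n-gram embeddings followed by one linear layer (illustrative)."""

    def __init__(self, num_buckets: int = 10_000, embed_dim: int = 32, num_classes: int = 4):
        super().__init__()
        self.embedding = nn.EmbeddingBag(num_buckets, embed_dim, mode="mean")
        self.linear = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        return self.linear(self.embedding(token_ids, offsets))


# Made-up activity descriptions; a real system would map these to NAF codes.
texts = ["boulangerie artisanale", "conseil en informatique"]
model = BagOfNgramsClassifier()
token_ids, offsets = encode_batch(texts)
logits = model(token_ids, offsets)  # one row of class scores per input text
```

The model here is untrained, so the scores are meaningless until it is fitted with a standard loss such as `nn.CrossEntropyLoss`. The point of the sketch is only that the fastText recipe maps onto a few standard PyTorch modules, which is what makes a PyTorch-based replacement feasible.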

Speakers:

Cédric Couralet

Cédric Couralet, Data Scientist at Insee, is an open-source enthusiast, with expertise in software architecture and secure system design.

Meilame Tayebjee

Meilame Tayebjee is a Data Scientist at the Innovation Lab of the French National Institute of Statistics and Economic Studies (Insee), focusing on the deployment of machine learning models, the enhancement of MLOps best practices, and the development of torchTextClassifiers, a PyTorch package designed to streamline the training of deep learning models for text classification.

Tayebjee is also pursuing a PhD in Computer Science jointly at CREST and Inria, with research centered on foundational Transformer-based models for the analysis of healthcare pathways.

Slide deck

This is a classifier that assigns economic activity codes (NAF codes) to documents based on their textual description.

https://github.com/InseeFrLab/torchTextClassifiers

Demo

Citation

BibTeX citation:

@online{bochman2025,
  author = {Bochman, Oren},
  title = {torchTextClassifiers: {Modernizing} {Text} Classification
    for {French} {National} {Statistics}},
  date = {2025-12-09},
  url = {https://orenbochman.github.io/posts/2025/2025-12-09-pydata-torchTextClassifiers/},
  langid = {en}
}

For attribution, please cite this work as:

Bochman, Oren. 2025. “torchTextClassifiers: Modernizing Text Classification for French National Statistics.” December 9, 2025. https://orenbochman.github.io/posts/2025/2025-12-09-pydata-torchTextClassifiers/.