Harnessing Generative Models for Synthetic Non-Life Insurance Data

PyData Global 2025 Recap

An in-depth exploration of using various generative models to create synthetic non-life insurance premium data, including validation techniques and model comparisons.
PyData
Insurance
Generative Models
Synthetic Data
Machine Learning
Author

Oren Bochman

Published

Tuesday, December 9, 2025

Keywords

PyData, Insurance, Generative Models, Synthetic Data, Machine Learning

Lecture Overview

This talk examined a synthetic non-life insurance premium dataset generated with several generative models.
A Conditional Gaussian Mixture Model served as the benchmark.
Validation of the generated data involved several steps: data visualization, univariate comparisons, and PCA and UMAP representations of the training data versus the generated samples.
In addition, the consistency of the generated records was checked, and the Kolmogorov–Smirnov statistical test and predictive modeling of claim frequency and severity with Generalised Linear Models (GLMs) under a Tweedie distribution were used as measures of the generated data’s quality, followed by an analysis of feature importance.
For further comparison, advanced deep learning architectures were also employed:

  • Conditional Variational Autoencoders (CVAEs),
  • CVAEs enhanced with a Transformer Decoder,
  • a Conditional Diffusion Model, and Large Language Models.

The analysis assesses each model’s ability to capture the underlying distributions, preserve complex dependencies, and maintain relationships intrinsic to the premium data.
These findings provide insightful directions for enhancing synthetic data generation in insurance, with potential applications in risk modeling, pricing strategies with data scarcity, and regulatory compliance.

Whereas discriminative models are used for classification and regression, generative models aim to learn the joint probability distribution of the data.
These models focus on generating new data points similar to the training data.
Open insurance datasets are rare because they encode an insurer’s proprietary risk structures, limiting researchers’ access to comprehensive data for analysis and for assessing new approaches.
Generative models enable reproducible experimentation and innovation.

What You’ll Learn:

In the talk I explore several generative models used to produce synthetic data.

  1. Conditional Gaussian Mixture Models used as a benchmark;
  2. Conditional Variational Autoencoders;
  3. Conditional Variational Autoencoders with a Transformer Decoder;
  4. Conditional Diffusion Model;
  5. Large Language Models.

Finally, the talk presented the overall results and compared the different approaches.

Prerequisites:
  • Basic Python and PyTorch
  • Some familiarity with neural networks (e.g., feed-forward, softmax)
  • No need for prior experience in building models from scratch

Tools and Frameworks:

We will introduce certain modern frameworks in the workshop, but the emphasis will be on first principles, using vanilla Python and LLM calls to build AI-powered systems.

workshop repo

Speaker:

Claudio Giorgio Giancaterino

  • Statistics & Actuarial background
  • Actuary during the day
  • Data Scientist in the free time
  • cf. links

Outline

About the Speaker

Agenda

Motivations

Data scarcity

Anatomy of Non-Life Insurance Risk Data

The Data

Datasets used

  • These datasets are kind of similar to one another.

  • I thought that non-life insurance was a much broader field.

  • I should look into other datasets.

Unlocking Data Quality

Synthetic Data Generation Trials

The Models

Is this like a party?

Conditional Gaussian Mixture Model (CGMM)

  • I covered Gaussian mixtures in my notes on Bayesian Mixture Models
  • EM is fine for point estimates, but it seems a hierarchical MCMC would handle the ClaimOcc variable just as well without explicit conditioning.
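As a concrete intuition for the benchmark, a conditional Gaussian mixture can be approximated by fitting one mixture per value of the conditioning variable and sampling from the matching mixture. This sketch uses scikit-learn's `GaussianMixture`; the features and the ClaimOcc-style condition are invented for illustration, not taken from the talk's dataset.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy continuous features (e.g., driver age, vehicle power): hypothetical
# stand-ins for real policy covariates, grouped by a binary condition
# such as ClaimOcc.
X0 = rng.normal(loc=[40.0, 5.0], scale=[10.0, 1.5], size=(500, 2))  # no claim
X1 = rng.normal(loc=[30.0, 7.0], scale=[8.0, 2.0], size=(200, 2))   # claim

# "Conditional" GMM: one mixture fitted per condition value.
gmms = {c: GaussianMixture(n_components=2, random_state=0).fit(X)
        for c, X in [(0, X0), (1, X1)]}

def sample_conditional(c, n):
    """Draw n synthetic rows given the condition c."""
    samples, _ = gmms[c].sample(n)
    return samples

synthetic = sample_conditional(1, 100)
print(synthetic.shape)  # (100, 2)
```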

Conditional Variational Auto-Encoder (CVAE)

Intuition: the model is like a forger, learning to produce convincing imitations of the real data.
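A minimal PyTorch sketch of the conditioning idea: the condition c (say, a one-hot region or a ClaimOcc flag) is concatenated to both the encoder and decoder inputs, so decoding a prior sample z with a chosen c yields class-conditional synthetic rows. All dimensions and the ELBO weighting are illustrative, not the speaker's actual architecture.

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """Minimal conditional VAE sketch; dimensions are illustrative."""
    def __init__(self, x_dim=4, c_dim=2, z_dim=2, h=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, h), nn.ReLU())
        self.mu = nn.Linear(h, z_dim)
        self.logvar = nn.Linear(h, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, h), nn.ReLU(),
                                 nn.Linear(h, x_dim))

    def forward(self, x, c):
        hid = self.enc(torch.cat([x, c], dim=-1))
        mu, logvar = self.mu(hid), self.logvar(hid)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(torch.cat([z, c], dim=-1)), mu, logvar

    def sample(self, c):
        """Generate synthetic rows for a given condition by decoding z ~ N(0, I)."""
        z = torch.randn(c.size(0), self.mu.out_features)
        return self.dec(torch.cat([z, c], dim=-1))

model = CVAE()
x = torch.randn(8, 4)
c = torch.eye(2)[torch.randint(0, 2, (8,))]  # one-hot conditions
x_hat, mu, logvar = model(x, c)

# ELBO-style loss: reconstruction error plus KL divergence to the prior.
recon = ((x_hat - x) ** 2).sum(dim=-1).mean()
kl = (-0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=-1)).mean()
loss = recon + kl
print(x_hat.shape, float(loss))
```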

Conditional Variational Auto-Encoder with a Transformer based Decoder (CTVAE)

Intuition: write a story, but start from an outline; the Transformer decoder turns the outline into a richer story.
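One way to sketch that "outline" intuition in PyTorch: treat the latent code plus condition as the decoder memory (the outline), and let learned per-feature query tokens attend to it through `nn.TransformerDecoder` to emit the features (the story). This is an illustrative guess at the architecture, not the speaker's implementation.

```python
import torch
import torch.nn as nn

class TransformerDecoderHead(nn.Module):
    """Sketch of a CTVAE decoder: z plus condition c form the memory,
    and one learned query token per feature attends to it. Dimensions
    are illustrative."""
    def __init__(self, z_dim=2, c_dim=2, n_feats=4, d_model=16):
        super().__init__()
        self.memory_proj = nn.Linear(z_dim + c_dim, d_model)
        self.queries = nn.Parameter(torch.randn(n_feats, d_model))  # one per feature
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, 1)

    def forward(self, z, c):
        mem = self.memory_proj(torch.cat([z, c], dim=-1)).unsqueeze(1)  # (B, 1, D)
        q = self.queries.unsqueeze(0).expand(z.size(0), -1, -1)         # (B, F, D)
        return self.out(self.decoder(q, mem)).squeeze(-1)               # (B, F)

head = TransformerDecoderHead()
x_hat = head(torch.randn(8, 2), torch.randn(8, 2))
print(x_hat.shape)  # torch.Size([8, 4])
```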

Conditional Diffusion Model (CDM)

Intuition: gradually add noise to the data (as with images in diffusion models), then learn to reverse the process by denoising step by step.
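The forward (noising) half of that intuition has a closed form worth seeing once: under a DDPM-style schedule, a clean row x0 can be jumped directly to any noise level t. The schedule endpoints and T below are the common illustrative defaults, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule as in DDPM; T and the endpoints are illustrative.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product \bar{alpha}_t

def q_sample(x0, t, eps):
    """Closed-form forward process: jump straight from the clean row x0
    to its noised version at step t."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.normal(size=5)            # a clean (standardized) tabular row
eps = rng.normal(size=5)
x_early = q_sample(x0, 10, eps)    # barely noised: still close to x0
x_late = q_sample(x0, T - 1, eps)  # almost pure noise: close to eps
# Training regresses eps from (x_t, t, condition); sampling runs the
# learned denoiser backwards from pure noise to a synthetic row.
```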

LLM

  • Unclear how using an LLM would be applicable to the tabular datasets discussed above.
  • What is the context, and what are the sequences?

Perhaps this is just a feed-forward neural network or a transformer doing regression.
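One published recipe (e.g., GReaT-style tabular generators) answers the "what are the sequences?" question by serializing each row as a short sentence, fine-tuning an LLM on those sentences, and parsing sampled text back into rows. The sketch below shows only the serialization step, with hypothetical column names; it is a guess at the approach, not necessarily what the speaker did.

```python
import pandas as pd

# Hypothetical policy rows; column names are illustrative.
df = pd.DataFrame({"DriverAge": [43, 27], "Region": ["R1", "R2"], "ClaimOcc": [0, 1]})

def row_to_text(row):
    # Column order is often shuffled per row so the LLM learns
    # order-invariant dependencies; kept fixed here for clarity.
    return ", ".join(f"{col} is {row[col]}" for col in df.columns)

def text_to_row(text):
    # Parse a sampled sentence back into a record (values stay strings).
    parts = (p.split(" is ") for p in text.split(", "))
    return {k: v for k, v in parts}

encoded = df.apply(row_to_text, axis=1).tolist()
print(encoded[0])             # "DriverAge is 43, Region is R1, ClaimOcc is 0"
decoded = text_to_row(encoded[1])
print(decoded)                # {'DriverAge': '27', 'Region': 'R2', 'ClaimOcc': '1'}
```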

Validation

Validation by Consistency of Records
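A consistency check can be as simple as a set of domain rules evaluated per generated record. The rules and column names below (Exposure, ClaimOcc, ClaimAmount) are hypothetical examples of the idea, not the speaker's actual checks.

```python
import pandas as pd

# Hypothetical synthetic output: a consistent record should only carry a
# positive claim amount when a claim actually occurred, and exposure must
# lie in (0, 1].
synth = pd.DataFrame({
    "Exposure":    [0.5, 1.0, 0.8, 1.2],
    "ClaimOcc":    [0,   1,   0,   1],
    "ClaimAmount": [0.0, 1200.0, 350.0, 500.0],
})

rules = {
    "claim_amount_iff_claim": (synth["ClaimAmount"] > 0) == (synth["ClaimOcc"] == 1),
    "exposure_in_range": (synth["Exposure"] > 0) & (synth["Exposure"] <= 1),
}

# Fraction of consistent records per rule.
report = {name: ok.mean() for name, ok in rules.items()}
print(report)
```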

Validation by Kolmogorov–Smirnov Test
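A sketch of how the per-column KS check might look with scipy's two-sample test, using gamma-distributed stand-ins for a premium-like column rather than the talk's data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# One "real" column and two synthetic versions of it.
real = rng.gamma(shape=2.0, scale=300.0, size=2000)
good_synth = rng.gamma(shape=2.0, scale=300.0, size=2000)  # same distribution
bad_synth = rng.gamma(shape=2.0, scale=600.0, size=2000)   # scale is off

stat_good, p_good = ks_2samp(real, good_synth)
stat_bad, p_bad = ks_2samp(real, bad_synth)
print(f"good: KS={stat_good:.3f} p={p_good:.3f}")
print(f"bad:  KS={stat_bad:.3f} p={p_bad:.3g}")
# A small KS statistic (large p-value) means the test cannot tell the
# synthetic column apart from the real one; run it column by column.
```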

Validation by Data Visualization - Univariate Analysis

Validation by Data Visualization - Correlation Matrix

Validation by Data Visualization - 3D PCA
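The PCA comparison can be made quantitative rather than purely visual: fit PCA on the real data only, project both sets into the same space, and compare the spread along each component. This toy sketch uses correlated Gaussians in place of the real covariates.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy correlated data standing in for real and synthetic covariates.
cov = np.array([[1.0, 0.8, 0.3],
                [0.8, 1.0, 0.5],
                [0.3, 0.5, 1.0]])
real = rng.multivariate_normal(np.zeros(3), cov, size=1000)
synth = rng.multivariate_normal(np.zeros(3), cov, size=1000)

# Fit on real data only, then project both into the same space;
# overlapping point clouds suggest the main directions of variation survive.
pca = PCA(n_components=2).fit(real)
real_2d, synth_2d = pca.transform(real), pca.transform(synth)

# Compare per-component spread instead of eyeballing a scatter plot.
print(real_2d.std(axis=0), synth_2d.std(axis=0))
```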

Validation by Data Visualization - 3D UMAP

Validation by Predictive Modeling - Frequency and Severity Prediction
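A minimal sketch of the Tweedie-GLM quality check in the train-on-synthetic/test-on-real spirit, using scikit-learn's `TweedieRegressor` (a power between 1 and 2 gives the compound Poisson-gamma shape of pure premium). The data-generating process here is simulated, not the talk's dataset; in practice the same GLM would be fitted on real and on synthetic data and the held-out scores compared.

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor

rng = np.random.default_rng(0)

def make_data(n):
    """Simulated pure-premium data: Poisson frequency times gamma severity."""
    X = rng.normal(size=(n, 3))
    mu = np.exp(0.3 * X[:, 0] - 0.2 * X[:, 1] + 0.1)
    y = rng.poisson(mu) * rng.gamma(2.0, 100.0, size=n)
    return X, y

X_train, y_train = make_data(2000)
X_test, y_test = make_data(1000)

# power=1.5 selects a compound Poisson-gamma Tweedie family; the log link
# keeps predicted premiums positive.
glm = TweedieRegressor(power=1.5, link="log", alpha=1e-4, max_iter=1000)
glm.fit(X_train, y_train)
print(f"D^2 on held-out data: {glm.score(X_test, y_test):.3f}")
```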

Validation by SHAP Feature Importance

Results and Conclusions

Overall Results

Conclusions

My Reflections:

  • The speaker is a very smart and accomplished person.
  • The validation part is very interesting; it would be worth hearing what he has to say about model validation in general.
  • The model intuitions are neat. Worth reviewing and noting down!
  • How was the data used with the LLM?
  • How is the LLM trained?

Citation

BibTeX citation:
@online{bochman2025,
  author = {Bochman, Oren},
  title = {Harnessing {Generative} {Models} for {Synthetic} {Non-Life}
    {Insurance} {Data}},
  date = {2025-12-09},
  url = {https://orenbochman.github.io/posts/2025/2025-12-09-pydata-synthetic-insurance-data/},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2025. “Harnessing Generative Models for Synthetic Non-Life Insurance Data.” December 9, 2025. https://orenbochman.github.io/posts/2025/2025-12-09-pydata-synthetic-insurance-data/.