The Art of Clustering: The Good, The Bad and The Beautiful
Levine argues that clustering is useful when the first question is not “what prediction should we make?” but “what is in this data?”
Levine uses a movie dataset of roughly 5,000–100,000 films, with plots and posters, as a demonstration case.
Core claim: clustering does not simply “discover” structure; it creates a lens or perspective on the data.
A typical clustering pipeline includes:
- encoding raw data into numerical representations, such as sentence embeddings;
- dimensionality reduction, especially with Uniform Manifold Approximation and Projection (UMAP);
- clustering, for example with Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN);
- representation or labeling, where large language models can help name clusters.
The “good”:
- clustering helps explore unlabeled, unstructured data such as tickets, reviews, survey responses, or film summaries;
- it can surface patterns before the analyst knows what questions to ask;
- large language models are especially useful as a final representation layer, turning weak keyword labels like “new, young, life, family” into interpretable labels like “coming-of-age drama.”
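As an illustration of the labeling step, the sketch below only assembles a prompt from a cluster's keywords and a few example documents. It does not call any model. The function name `build_label_prompt`, the prompt wording, and the placeholder summaries are all hypothetical, and the keywords are the ones quoted above.

```python
def build_label_prompt(keywords, sample_texts, max_samples=3):
    """Assemble a plain-text prompt asking a language model to name one cluster."""
    samples = "\n".join(f"- {text}" for text in sample_texts[:max_samples])
    return (
        "The following keywords and example documents describe one cluster.\n"
        f"Keywords: {', '.join(keywords)}\n"
        f"Examples:\n{samples}\n"
        "Reply with a label of at most four words that names the common theme."
    )


prompt = build_label_prompt(
    keywords=["new", "young", "life", "family"],
    sample_texts=["(plot summary 1)", "(plot summary 2)"],  # placeholders
)
print(prompt)
```

Passing example documents along with the keywords gives the model more context than the keywords alone. How many examples to include is a judgment call.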
The “bad”:
- clustering is not a single button; every design choice changes the result;
- different encoders capture different structures: term frequency–inverse document frequency (TF-IDF) matches exact terms precisely but misses semantic similarity, while embeddings capture relations between different words, such as “hitman” and “assassin”;
- dimensionality-reduction and clustering parameters can substantially change the apparent story;
- there is no absolute “ground truth” cluster structure in many real datasets.
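The encoder point can be made concrete with a small, self-contained TF-IDF implementation. This is a sketch using one common smoothed IDF variant, not any particular library's exact output, and the three sentences are invented for illustration. Because TF-IDF only sees shared terms, two sentences that use “hitman” and “assassin” but share no words score zero similarity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF: log((1 + N) / (1 + df)) + 1
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented example sentences.
docs = [
    "The hitman takes one last job",
    "An assassin accepts his final contract",
    "The hitman hides from the police",
]
vectors = tfidf_vectors(docs)
synonym_similarity = cosine(vectors[0], vectors[1])      # no shared terms
shared_word_similarity = cosine(vectors[0], vectors[2])  # shares "the", "hitman"
print(synonym_similarity, shared_word_similarity)
```

An embedding model would be expected to place the first two sentences close together. Which of the two behaviors is "right" depends on what the analysis is for.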
The talk emphasizes that real data is usually hierarchical:
- clusters contain subclusters;
- boundaries are often fuzzy;
- the clean, separated cluster picture is usually a simplification.
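The nesting of clusters within clusters can be illustrated by cutting one hierarchy at two levels. The sketch below uses SciPy's agglomerative clustering with average linkage, a swapped-in technique chosen for brevity rather than anything shown in the talk. The points are synthetic: four tight groups arranged as two well-separated pairs.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic data: four tight groups of three points, arranged as two pairs.
centers = [0.0, 1.0, 10.0, 11.0]
offsets = [(-0.1, 0.0), (0.0, 0.1), (0.1, 0.0)]
points = np.array([(c + dx, dy) for c in centers for dx, dy in offsets])

# Build one hierarchy, then cut it at a coarse and a fine level.
tree = linkage(points, method="average")
coarse = fcluster(tree, t=2, criterion="maxclust")
fine = fcluster(tree, t=4, criterion="maxclust")

# Map each fine cluster to the coarse cluster(s) its points fall in.
parents = {}
for fine_id, coarse_id in zip(fine.tolist(), coarse.tolist()):
    parents.setdefault(fine_id, set()).add(coarse_id)
print(parents)
```

Every fine cluster sits inside exactly one coarse cluster, so neither cut is "the" clustering. They are two resolutions of the same structure.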
The speaker shows that large language model labels can be impressive but also misleading:
- a cluster labeled “World War II and Nazi Germany” contained films from before World War II;
- this demonstrates that labels must be checked against domain knowledge and temporal plausibility.
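A temporal plausibility check of this kind is easy to automate when release years are available. The sketch below is hypothetical throughout: the `Film` record, the placeholder titles and years, and the cutoff year are assumptions. The cutoff in particular is a judgment the analyst has to make for each label.

```python
from dataclasses import dataclass


@dataclass
class Film:
    title: str
    year: int


def films_before(cluster, earliest_year):
    """Return cluster members released before the earliest plausible year."""
    return [film for film in cluster if film.year < earliest_year]


# Hypothetical cluster contents; titles and years are placeholders.
cluster_label = "World War II and Nazi Germany"
earliest_plausible_year = 1939  # assumption: the analyst picks this cutoff
cluster = [
    Film("Film A", 1931),
    Film("Film B", 1943),
    Film("Film C", 1998),
    Film("Film D", 1936),
]

suspects = films_before(cluster, earliest_plausible_year)
for film in suspects:
    print(f"{film.title} ({film.year}) predates the label {cluster_label!r}")
```

Flagged films are not necessarily errors. They are items to inspect by hand before trusting the label.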
The “beautiful”:
- combining embeddings, UMAP, HDBSCAN, and large language model labeling can produce rich, interpretable maps of data;
- examples include clusters such as “space sci-fi,” “royalty and fairytales,” and “neo-noir crime”;
- multimodal embeddings such as Contrastive Language–Image Pretraining (CLIP) allow clustering and visualizing movie posters, not just text.
Visualization is presented as a major part of the workflow:
- DataMapPlot is highlighted for static and interactive cluster maps;
- Bokeh is used for rotating three-dimensional visualizations;
- D3.js, Matplotlib, and Seaborn are also mentioned.
Practical advice:
- use clustering for exploration, and do not treat cluster assignments as labels without validating them;
- understand how early pipeline choices affect downstream conclusions;
- distinguish between exploring data and producing labels for supervised learning;
- choose clustering parameters according to the decision the analysis is meant to support.
In the question-and-answer section:
- HDBSCAN is described as useful because it handles noise and does not require specifying the number of clusters in advance;
- users can tune parameters to increase the proportion of data assigned to clusters, but forcing all data into clusters may distort the resulting picture of the data;
- the speaker’s default text-clustering pipeline is sentence transformers, dimensionality reduction, and HDBSCAN;
- EVOC is mentioned as a newer algorithm worth watching;
- UMAP is explained as an attempt to preserve both local and global structure when reducing high-dimensional embeddings to two or three dimensions.
Final takeaway: clustering is best understood as a design process for building a useful view of the data, not as an objective mechanism that reveals one true structure.
Reflection
Citation
@online{bochman2026,
author = {Bochman, Oren},
title = {The {Art} of {Clustering}},
date = {2026-04-28},
url = {https://orenbochman.github.io/posts/2026/04-30-ODSC-AI-2026-Day-3/talk2.html},
langid = {en}
}