The Art of Clustering – Oren Bochman’s Blog

The Art of Clustering: The Good, The Bad and The Beautiful

Seth Levine
- LinkedIn
- Contentsquare
- slides

Notes

Levine argues that clustering is useful when the first question is not “what prediction should we make?” but “what is in this data?”
Levine uses a movie dataset of roughly 5,000–100,000 films, with plots and posters, as a demonstration case.
Core claim: clustering does not simply “discover” structure; it creates a lens or perspective on the data.
A typical clustering pipeline includes:
- encoding raw data into numerical representations, such as sentence embeddings;
- dimensionality reduction, especially with Uniform Manifold Approximation and Projection (UMAP);
- clustering, for example with Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN);
- representation or labeling, where large language models can help name clusters.
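The first three stages above can be sketched in a few lines of Python. This is a minimal sketch, not Levine's code: it assumes the third-party packages `sentence-transformers`, `umap-learn`, and `hdbscan` are installed, and the model name and every parameter value are illustrative assumptions rather than settings from the talk.

```python
def cluster_texts(texts, model_name="all-MiniLM-L6-v2", n_components=5,
                  n_neighbors=15, min_cluster_size=10, random_state=42):
    """Encode texts, reduce dimensionality, and cluster; returns (reduced, labels)."""
    # Imports are deferred so this module loads even where the heavy
    # third-party packages are not installed.
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    # 1. Encoding: one dense vector per text (model name is an assumption).
    embeddings = SentenceTransformer(model_name).encode(texts)

    # 2. Dimensionality reduction: a few dimensions for clustering;
    #    use n_components=2 or 3 instead when the goal is plotting.
    reducer = umap.UMAP(n_components=n_components, n_neighbors=n_neighbors,
                        min_dist=0.0, metric="cosine", random_state=random_state)
    reduced = reducer.fit_transform(embeddings)

    # 3. Clustering: HDBSCAN assigns the label -1 to points it treats as noise.
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
    return reduced, labels
```

Calling `cluster_texts(plots)` on a list of plot summaries returns the reduced coordinates and one integer label per summary. Each default is a design choice in the sense the talk describes: changing `n_neighbors` or `min_cluster_size` will generally change the clusters that come out.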
The “good”:
- clustering helps explore unlabeled, unstructured data such as tickets, reviews, survey responses, or film summaries;
- it can surface patterns before the analyst knows what questions to ask;
- large language models are especially useful as a final representation layer, turning weak keyword labels like “new, young, life, family” into interpretable labels like “coming-of-age drama.”
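One way to picture the labeling step is as a prompt built from a cluster's keywords and a few member texts. The sketch below only assembles the prompt; the function name, the prompt wording, and the sample summaries are all hypothetical, and the call to an actual language model is left out because the notes do not say which one Levine used.

```python
def build_label_prompt(keywords, sample_texts, max_samples=3):
    """Assemble a prompt asking a language model for a short cluster label."""
    samples = "\n".join(f"- {text}" for text in sample_texts[:max_samples])
    return (
        "You are naming a cluster of film plot summaries.\n"
        f"Top keywords: {', '.join(keywords)}\n"
        f"Example summaries:\n{samples}\n"
        "Reply with a label of at most five words."
    )

prompt = build_label_prompt(
    ["new", "young", "life", "family"],
    [
        "A teenager leaves home for the first time.",  # hypothetical summary
        "Two siblings grow up in a small town.",       # hypothetical summary
    ],
)
print(prompt)
```

Including sample texts alongside the keywords gives the model more context than the keyword list alone, which is what lets it move from “new, young, life, family” toward a label a person would recognize.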
The “bad”:
- clustering is not a single button; every design choice changes the result;
- different encoders capture different structures: term frequency–inverse document frequency (TF-IDF) matches exact terms precisely but misses semantic similarity, while embeddings place related words such as “hitman” and “assassin” close together;
- dimensionality-reduction and clustering parameters can substantially change the apparent story;
- there is no absolute “ground truth” cluster structure in many real datasets.
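The TF-IDF limitation can be seen with a toy example in plain Python. This is an illustrative sketch: the three plot lines are invented, the tokenizer is a bare whitespace split, and the smoothed IDF formula is one common variant rather than anything specified in the talk.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using a smoothed TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Invented plot lines, for illustration only.
plots = [
    "a retired hitman returns for revenge",
    "an aging assassin seeks vengeance",
    "a retired teacher returns home",
]
vecs = tfidf_vectors(plots)
sim_synonyms = cosine(vecs[0], vecs[1])  # same story, no shared words
sim_shared = cosine(vecs[0], vecs[2])    # different story, shared words
print(sim_synonyms, sim_shared)
```

The first two plots describe nearly the same film but share no tokens, so their TF-IDF similarity is exactly zero, while the unrelated third plot scores higher because it repeats “a”, “retired”, and “returns”. A sentence-embedding encoder would be expected to rank these pairs the other way round.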
The talk emphasizes that real data is usually hierarchical:
- clusters contain subclusters;
- boundaries are often fuzzy;
- the clean, separated cluster picture is usually a simplification.
The speaker shows that large language model labels can be impressive but also misleading:
- a cluster labeled “World War II and Nazi Germany” contained films from before World War II;
- this demonstrates that labels must be checked against domain knowledge and temporal plausibility.
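A temporal plausibility check of this kind is easy to automate once the analyst decides what date range a label implies. The sketch below is hypothetical throughout: the film records are placeholders, and the 1939 cutoff is an assumed choice for a “World War II” label, not a rule from the talk.

```python
def flag_implausible(films, earliest_year):
    """Return films released before the earliest year the label could plausibly cover."""
    return [film for film in films if film["year"] < earliest_year]

# Placeholder records standing in for the members of one cluster.
cluster = [
    {"title": "Film A", "year": 1927},
    {"title": "Film B", "year": 1943},
    {"title": "Film C", "year": 1998},
]

# 1939 is the analyst's assumed cutoff for a "World War II" label.
suspects = flag_implausible(cluster, earliest_year=1939)
print([film["title"] for film in suspects])  # prints ['Film A']
```

A flagged film does not prove the label wrong, but a large share of flagged members is a signal to reread the cluster before trusting its name.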
The “beautiful”:
- combining embeddings, UMAP, HDBSCAN, and large language model labeling can produce rich, interpretable maps of data;
- examples include clusters such as “space sci-fi,” “royalty and fairytales,” and “neo-noir crime”;
- multimodal embeddings such as Contrastive Language–Image Pretraining (CLIP) allow clustering and visualizing movie posters, not just text.
Visualization is presented as a major part of the workflow:
- DataMapPlot is highlighted for static and interactive cluster maps;
- Bokeh is used for rotating three-dimensional visualizations;
- D3.js, Matplotlib, and Seaborn are also mentioned.
Practical advice:
- use clustering for exploration, not for automatic labeling without validation;
- understand how early pipeline choices affect downstream conclusions;
- distinguish between exploring data and producing labels for supervised learning;
- choose clustering parameters according to the decision the analysis is meant to support.
In the question-and-answer section:
- HDBSCAN is described as useful because it handles noise and does not require specifying the number of clusters in advance;
- users can tune parameters to increase the proportion of data assigned to clusters, but forcing all data into clusters may distort the resulting picture of the data;
- the speaker’s default text-clustering pipeline is sentence transformers, dimensionality reduction, and HDBSCAN;
- EVOC is mentioned as a newer algorithm worth watching;
- UMAP is explained as an attempt to preserve both local and global structure when reducing high-dimensional embeddings to two or three dimensions.
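To make the noise trade-off from the HDBSCAN answers concrete, it helps to track how much of the data ends up in a cluster as parameters change. The helper below is a small sketch in plain Python; it relies on the HDBSCAN convention that noise points receive the label -1, while the function name and the example labels are hypothetical.

```python
from collections import Counter

def cluster_coverage(labels, noise_label=-1):
    """Return (fraction of points assigned to a cluster, size of each cluster)."""
    labels = list(labels)
    sizes = Counter(label for label in labels if label != noise_label)
    assigned = sum(sizes.values())
    coverage = assigned / len(labels) if labels else 0.0
    return coverage, dict(sizes)

# Hypothetical labels, as a clustering run might return them.
labels = [0, 0, 1, -1, 1, -1, 2, 0]
coverage, sizes = cluster_coverage(labels)
print(coverage, sizes)  # prints 0.75 {0: 3, 1: 2, 2: 1}
```

Comparing coverage across runs shows what a parameter change buys: a setting that lifts coverage toward 1.0 may be absorbing genuine outliers into clusters rather than finding better structure.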
Final takeaway: clustering is best understood as a design process for building a useful view of the data, not as an objective mechanism that reveals one true structure.

Reflection

Citation

BibTeX citation:

@online{bochman2026,
  author = {Bochman, Oren},
  title = {The {Art} of {Clustering}},
  date = {2026-04-28},
  url = {https://orenbochman.github.io/posts/2026/04-30-ODSC-AI-2026-Day-3/talk2.html},
  langid = {en}
}

For attribution, please cite this work as:

Bochman, Oren. 2026. “The Art of Clustering.” April 28. https://orenbochman.github.io/posts/2026/04-30-ODSC-AI-2026-Day-3/talk2.html.