Hands-on with Blosc2: Accelerating Your Python Data Workflows

PyData Global 2025 Recap

A practical tutorial on using Blosc2 to optimize data workflows in Python, focusing on out-of-core computations and efficient handling of large datasets.
PyData
Data Compression
Python
Author

Oren Bochman

Published

Wednesday, December 10, 2025

Keywords

PyData, Blosc2, Data Compression, Python, I/O Optimization

pydata global

Lecture Overview

As datasets grow, I/O becomes a primary bottleneck, slowing down scientific computing and data analysis. This tutorial provides a hands-on introduction to Blosc2, a powerful meta-compressor designed to turn I/O-bound workflows into CPU-bound ones. We will move beyond basic compression and explore how to structure data for high-performance computation.

Participants will learn to use the python-blosc2 library to compress and decompress data with various codecs and filters, optimizing for speed and ratio.

The core of the tutorial will focus on the Blosc2 NDArray object, a chunked, N-dimensional array that lives on disk or in memory.

Through a series of interactive exercises, you will learn how to perform out-of-core mathematical operations and analytics directly on compressed arrays, effectively handling datasets larger than available RAM.

We will also cover practical topics like data storage backends, two-level partitioning for faster data slicing, and how to integrate Blosc2 into existing NumPy-based workflows. You will leave this session with the practical skills needed to significantly accelerate your data pipelines and manage massive datasets with ease.

What You’ll Learn:
  • Understand the core concepts behind the Blosc2 meta-compressor.
  • Compress and decompress NumPy arrays, tuning parameters for optimal performance.
  • Create, manipulate, and slice Blosc2 NDArray objects for out-of-core processing.
  • Perform efficient mathematical computations directly on compressed data.
  • Store and retrieve compressed datasets using different storage backends.
  • Integrate Blosc2 into your existing data analysis workflows to mitigate I/O bottlenecks.
Audience & Prerequisites:

This tutorial is for data scientists, engineers, and researchers who work with large numerical datasets in Python.

Prerequisites: Attendees should have intermediate Python programming skills and be comfortable with the basics of NumPy arrays. No prior experience with Blosc2 is necessary.

Setup: Participants will need a laptop and can follow along using a provided cloud-based environment (e.g., Binder) or a local installation of Python, Jupyter, and the python-blosc2 library.

Tools and Frameworks:

We will introduce a few modern frameworks in the workshop, but the emphasis will be on first principles, using vanilla Python and NumPy.

workshop repo

Speakers:

Francesc Alted

I am a curious person who studied Physics (BSc, MSc) and Applied Maths (MSc). I spent over a year at CERN for my MSc in High Energy Physics. However, I found maths and computer science equally fascinating, so I left academia to pursue these fields. Over the years, I developed a passion for handling large datasets and using compression to enable their analysis on commodity hardware accessible to everyone.

I am the CEO of ironArray SLU and lead the Blosc Development Team. I am currently interested in determining, ahead of time, which combinations of codecs and filters can provide a personalized compression experience. I am also excited about providing an easy and effective way to share Blosc2 datasets over the network via Caterva2 and Cat2Cloud, a software-as-a-service for handling and computing with datasets directly in the cloud.

As an Open Source believer, I started the PyTables project more than 20 years ago. After 25 years in this business, I have started several other useful open source projects, such as Blosc2, Caterva2, and Btune; those efforts won me two prizes that mean a lot to me:

  • 2023: NumFOCUS Project Sustainability Award
  • 2017: Google’s Open Source Peer Bonus

You can learn more about what I am working on by reading my latest blogs.

Luke Shaw

  • Degree in Physics, Princeton University, 2019
  • Masters in Applied Mathematics, University of Edinburgh, 2020
  • PhD in Applied Mathematics, Universitat Jaume I, 2024
  • Working at ironArray as an engineer and product owner since 2025.

Outline

  • Introduction & Setup (10 mins)
    • The I/O Bottleneck Problem.
    • Core Concepts: What are meta-compressors, chunks, and blocks?
    • Tutorial environment setup (Jupyter notebooks).
  • Part 1: Compression Fundamentals (20 mins)
    • Hands-on: Using blosc2.compress() and blosc2.decompress().
    • Exploring codecs (lz4, zstd), compression levels, and filters (shuffle, bitshuffle).
    • Exercise: Compressing a sample dataset and analyzing the trade-offs between speed and ratio.
  • Part 2: The NDArray - Computing on Compressed Data (35 mins)
    • Hands-on: Creating NDArray objects from scratch and from NumPy arrays.
    • Storing arrays on-disk vs. in-memory.
    • Exercise: Slicing and accessing data from an on-disk NDArray.
    • Performing mathematical operations (arr * 2 + 1) and reductions (arr.sum()) on compressed data.
    • Exercise: Analyzing a dataset larger than RAM.
  • Part 3: Advanced Features & Integration (20 mins)
    • Hands-on: Using two-level partitioning (meta-chunks) for faster slicing.
    • Brief overview of Caterva2 for sharing compressed data via an API.
    • Recap and Q&A.

Reflections:

  1. Can we add code snippets from the tutorial here?
  2. Why Blosc2 and not, say, Zarr or other solutions?

Citation

BibTeX citation:
@online{bochman2025,
  author = {Bochman, Oren},
  title = {Hands-on with {Blosc2:} {Accelerating} {Your} {Python} {Data}
    {Workflows}},
  date = {2025-12-10},
  url = {https://orenbochman.github.io/posts/2025/2025-12-10-pydata-blosc2/},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2025. “Hands-on with Blosc2: Accelerating Your Python Data Workflows.” December 10, 2025. https://orenbochman.github.io/posts/2025/2025-12-10-pydata-blosc2/.