Lessons learnt in optimizing a large-scale pandas application using Polars, FireDucks and cuDF: Go Smart and Save More!

PyData Global 2025 Recap

A detailed exploration of optimizing large-scale pandas applications using high-performance alternatives like Polars, FireDucks, and cuDF, highlighting best practices and performance improvements.
Author

Oren Bochman

Published

Tuesday, December 9, 2025

Keywords

PyData, Pandas, Data Science, Performance Optimization, Polars, cuDF, FireDucks

pydata global

Lecture Overview

Data scientists typically spend significant effort transforming raw data into a more digestible format before training an AI model or creating visualisations. Traditional tools such as pandas have long been the linchpin of this process, offering powerful capabilities but not without limitations. Because pandas offers many ways to express the same operation, users often end up choosing an inefficient one, which leads to large computational costs as the data grows. We introduce a few frequently occurring, subtle performance issues in pandas and what we have learnt from solving them with popular high-performance pandas alternatives: Polars, FireDucks and cuDF. The talk highlights one best practice for large-scale data analysis (breaking out of loops) and demonstrates the key advantages of each high-performance alternative in different scenarios.

What You’ll Learn:
  • How the choice and execution order of API calls in a data-related application affect its performance.
  • How to stop thinking in loops and design algorithms with DataFrame APIs instead.
  • How the internal query optimizers in libraries such as Polars and FireDucks can bring SQL-like optimizations to the Python level.
  • Whether to pay a large migration cost to optimize an existing pandas-based application, or to go smart with a few minor modifications and save more on operational cost.
Prerequisites:
  • Basic Python and PyTorch
  • Some familiarity with neural networks (e.g., feedforward, softmax)
  • No need for prior experience in building models from scratch

Tools and Frameworks:

We will introduce you to certain modern frameworks in the workshop, but the emphasis will be on first principles and on using vanilla Python and LLM calls to build AI-powered systems.

workshop repo

Speakers:

Sourav Saha

Sourav has 12+ years of professional experience at NEC Corporation across High-Performance Computing, Distributed Programming, Compiler Design, and Data Science. His team at NEC R&D Lab, Japan, currently researches data-processing algorithms. Combining techniques from compiler frameworks, high-performance computing, and multi-threaded programming, they have developed FireDucks, a Python library with highly pandas-compatible APIs for DataFrame operations. Previously, he worked on the research and development of performance-critical AI and Big Data solutions, and on optimizing legacy C++ and Fortran applications for weather prediction, earthquake simulation, and similar domains. He has spoken at several meetups and technical conferences on HPC and Data Science.

repo

Outline

Title

Quick Introduction!

Who is this talk for?

Overview of the Application

Pandas

Exploring High-performance Pandas Alternatives

Comparison among Chosen Libraries

Bottleneck Analysis

Bottleneck Analysis

Exploring Type-1 Bottlenecks (Loop-based implementation)

Query 01: Problem Statement
  • Fill missing values in the “Description” column with the most frequent description for the same “StockCode”.

Query 01: Implementation using iterrows
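The slide's code is not reproduced in this recap. As a rough sketch of what a loop-based solution to Query 01 tends to look like, the snippet below walks the frame with `iterrows` and re-scans it for every missing value. The toy data and the helper name `fill_description_iterrows` are assumptions made for illustration, not the speaker's code.

```python
import pandas as pd

# Toy data (made up for illustration); the column names follow the problem statement.
df = pd.DataFrame({
    "StockCode": ["A", "A", "A", "B", "B", "C"],
    "Description": ["mug", "mug", None, "lamp", None, None],
})


def fill_description_iterrows(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical loop-based fill: one full scan of the frame per missing row."""
    out = frame.copy()
    for idx, row in frame.iterrows():
        if pd.isna(row["Description"]):
            # Collect every known description for this StockCode.
            same_code = frame["StockCode"] == row["StockCode"]
            candidates = frame.loc[same_code, "Description"].dropna()
            if not candidates.empty:
                # mode() returns the most frequent value(s); take the first.
                out.at[idx, "Description"] = candidates.mode().iloc[0]
    return out


filled_loop = fill_description_iterrows(df)
print(filled_loop)
```

Each missing row triggers a boolean scan over the whole frame, so the cost grows roughly with (missing rows × total rows). A StockCode with no known description, such as "C" here, stays missing.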

Query 01: Implementation using vectorized APIs
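One possible vectorized formulation of Query 01, offered as a sketch rather than the speaker's implementation: count each (StockCode, Description) pair once, keep the most frequent description per StockCode, and map it onto the missing rows. The toy data and the intermediate names (`counts`, `most_frequent`) are illustrative assumptions.

```python
import pandas as pd

# Toy data (made up for illustration); the column names follow the problem statement.
df = pd.DataFrame({
    "StockCode": ["A", "A", "A", "B", "B", "C"],
    "Description": ["mug", "mug", None, "lamp", None, None],
})

# Count each (StockCode, Description) pair; groupby drops missing keys by default.
counts = (
    df.groupby(["StockCode", "Description"])
    .size()
    .reset_index(name="n")
)

# Keep the highest-count description per StockCode as a lookup Series.
most_frequent = (
    counts.sort_values(["StockCode", "n"], ascending=[True, False])
    .drop_duplicates("StockCode")
    .set_index("StockCode")["Description"]
)

# Fill only the missing descriptions, aligned row by row on the index.
filled_vec = df.copy()
filled_vec["Description"] = filled_vec["Description"].fillna(
    filled_vec["StockCode"].map(most_frequent)
)
print(filled_vec)
```

The frame is grouped once and no Python-level function runs per row, which is the point of "breaking out of the loop". Ties between equally frequent descriptions are broken by sort order in this sketch.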

Query 02: Problem Statement

  • Find the number of transactions a user performed within N days (e.g., 90) of the current transaction.

Query 02: implementation using row-wise apply
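A minimal sketch of a row-wise `apply` solution to Query 02, not taken from the slides. It assumes the window is the N days up to and including the current transaction, and that the columns are named `CustomerID` and `InvoiceDate`. Both are assumptions, as is the helper `count_recent`.

```python
import pandas as pd

# Toy transactions (made up); the column names are assumptions.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceDate": pd.to_datetime(
        ["2025-01-01", "2025-02-15", "2025-06-01", "2025-03-01", "2025-03-20"]
    ),
})
N_DAYS = 90


def count_recent(row: pd.Series, frame: pd.DataFrame, n_days: int) -> int:
    """Hypothetical helper: count the same customer's transactions in the
    n_days up to and including this row's date (assumed window definition)."""
    window_start = row["InvoiceDate"] - pd.Timedelta(days=n_days)
    mask = (
        (frame["CustomerID"] == row["CustomerID"])
        & (frame["InvoiceDate"] >= window_start)
        & (frame["InvoiceDate"] <= row["InvoiceDate"])
    )
    return int(mask.sum())


# axis=1 calls count_recent once per row, scanning the whole frame each time.
n_recent_apply = tx.apply(count_recent, axis=1, args=(tx, N_DAYS))
print(tx.assign(n_recent=n_recent_apply))
```

As with `iterrows`, the per-row Python call plus a full-frame scan makes the cost grow roughly quadratically with the number of transactions.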

Query 02: implementation using merge+filter
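The merge+filter idea can be sketched as a self-join on the customer key followed by a date-window filter and a group count. This is an illustrative reconstruction under the same assumptions as before (window covers the N days up to and including the current transaction; columns `CustomerID` and `InvoiceDate`); the helper column `tx_id` is introduced here only to identify rows.

```python
import pandas as pd

# Toy transactions (made up); the column names are assumptions.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceDate": pd.to_datetime(
        ["2025-01-01", "2025-02-15", "2025-06-01", "2025-03-01", "2025-03-20"]
    ),
})
N_DAYS = 90

# Give every transaction an id so counts can be mapped back to rows.
left = tx.reset_index().rename(columns={"index": "tx_id"})

# Self-join: pair each transaction with all transactions of the same customer.
pairs = left.merge(tx, on="CustomerID", suffixes=("", "_other"))

# Keep only the pairs whose "other" date falls inside the look-back window.
in_window = (
    (pairs["InvoiceDate_other"] <= pairs["InvoiceDate"])
    & (pairs["InvoiceDate_other"] >= pairs["InvoiceDate"] - pd.Timedelta(days=N_DAYS))
)

# One vectorized count per original transaction.
counts = pairs.loc[in_window].groupby("tx_id").size()
n_recent_merge = counts.reindex(tx.index, fill_value=0)
print(tx.assign(n_recent=n_recent_merge))
```

The join materializes every same-customer pair, so it trades memory for speed: all the work happens inside the library's join, filter, and group-by kernels rather than in a Python-level function per row.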

Query 03: Problem Statement
  • Calculate total sales per Invoice for each Customer

Query 03: apply-based vs vectorized implementation
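A small side-by-side sketch of the two styles for Query 03. It assumes sales are `Quantity * UnitPrice` and that the key columns are named `CustomerID` and `InvoiceNo`; these names and the toy data are assumptions, not taken from the slides.

```python
import pandas as pd

# Toy line items (made up); the column names are assumptions.
sales = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceNo": ["I1", "I1", "I2", "I3", "I3"],
    "Quantity": [2, 1, 5, 3, 1],
    "UnitPrice": [10.0, 4.0, 1.0, 2.0, 6.0],
})
keys = ["CustomerID", "InvoiceNo"]

# apply-based: a Python lambda runs once per (customer, invoice) group.
total_apply = (
    sales.groupby(keys)[["Quantity", "UnitPrice"]]
    .apply(lambda g: (g["Quantity"] * g["UnitPrice"]).sum())
)

# Vectorized: one column-wise multiply, then a built-in group-by sum.
total_vec = (
    sales.assign(Sales=sales["Quantity"] * sales["UnitPrice"])
    .groupby(keys)["Sales"]
    .sum()
)

print(total_apply)
print(total_vec)
```

Both produce the same totals; the difference is that the vectorized form keeps the multiplication and the aggregation inside the library, which also gives a query optimizer or a GPU backend something it can work with.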

Exploring Type-2 Bottlenecks (Vectorized implementation without optimized data flow)
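To make "data flow" concrete, here is a hypothetical example (not the talk's Query 04, whose statement is not included in this recap). Both versions are fully vectorized and return the same result, but one joins everything before filtering while the other filters rows and drops unused columns first. All table and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical tables, invented for illustration only.
orders = pd.DataFrame({
    "CustomerID": [1, 2, 3, 1, 2],
    "Country": ["JP", "IN", "JP", "JP", "IN"],
    "Amount": [10.0, 20.0, 5.0, 7.0, 1.0],
    "Note": ["a", "b", "c", "d", "e"],  # a column the query never uses
})
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Name": ["Aiko", "Ravi", "Ken"],
})

# Unoptimized flow: join all rows and columns, then filter and aggregate.
joined_first = (
    orders.merge(customers, on="CustomerID")
    .query("Country == 'JP'")
    .groupby("Name")["Amount"]
    .sum()
)

# Optimized flow: filter rows and select needed columns before the join,
# so the join touches less data.
jp_orders = orders.loc[orders["Country"] == "JP", ["CustomerID", "Amount"]]
filtered_first = (
    jp_orders.merge(customers, on="CustomerID")
    .groupby("Name")["Amount"]
    .sum()
)

print(joined_first)
print(filtered_first)
```

In eager pandas the user has to do this reordering by hand. Pushing filters and column selections ahead of joins is the kind of SQL-like rewrite that a lazy query optimizer can, in principle, apply automatically.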

Query 04: Problem Statement

Query 04: Vectorized implementation (unoptimized data flow)

Query 04: Vectorized implementation (optimized data flow)

Exploring Type-3 Bottlenecks (Vectorized implementation with optimized data flow)

Query 05: Problem Statement

Query 05: Vectorized implementation (optimized data flow)

Learning Summary

Learning #1: Breaking out of the loop

Learning #2: Single-node processing might be enough

Learning #3: FireDucks might be the one you are looking for!

Thank you

https://github.com/qsourav/PyData-Global-2025

Citation

BibTeX citation:
@online{bochman2025,
  author = {Bochman, Oren},
  title = {Lessons Learnt in Optimizing a Large-Scale Pandas Application
    Using {Polars,} {FireDucks} and {cuDF:} {Go} {Smart} and {Save}
    {More!}},
  date = {2025-12-09},
  url = {https://orenbochman.github.io/posts/2025/2025-12-09-pydata-optimizing-pandas-using-polars-cuDf-and-FireDucks/},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2025. “Lessons Learnt in Optimizing a Large-Scale Pandas Application Using Polars, FireDucks and cuDF: Go Smart and Save More!” December 9, 2025. https://orenbochman.github.io/posts/2025/2025-12-09-pydata-optimizing-pandas-using-polars-cuDf-and-FireDucks/.