AI a bag of tricks

My bag of tricks for the AI age

A collection of tools and techniques I use in the AI age.
AI
Tools
Techniques
Bag of Tricks
Author

Oren Bochman

Published

Thursday, September 4, 2025

Keywords

AI, Tools, Techniques, Data Science, Machine Learning, Python, R, VS Code, Parquet, Quarto

Note: TL;DR - Tools and Tricks for the AI age

In Jazz music, a “bag of tricks” refers to a collection of techniques, styles, and improvisational methods that musicians pick up to enhance their performances. These tricks can include various scales, chord voicings, rhythmic patterns, and expressive techniques that help create unique and engaging music.

Here is my bag of tricks for … the AI age.

  • One problem with tools is that they have a long tail: after a while you may forget them.
  • Another problem is that they can become outdated as better tools come along. At what point should we switch?
  • This is a quick summary. Ideally many of these entries can be expanded into full-blown articles.
Fire Scraper Web Scraping Python
A commercial tool for extracting articles from web pages.
Faker Data Generation Python FOSS
A Python library for generating fake data such as names, addresses, and phone numbers.
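A minimal sketch of generating synthetic contact records with Faker; the field choices are illustrative.

```python
from faker import Faker

fake = Faker()
records = [
    {"name": fake.name(), "address": fake.address(), "phone": fake.phone_number()}
    for _ in range(5)
]
print(records[0])
```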
Augmentation
Data augmentation is a technique used to increase the diversity of a dataset by applying various transformations to the existing data. This can include techniques such as rotation, flipping, scaling, and cropping for images, or synonym replacement, random insertion, and back-translation for text data. The goal of data augmentation is to improve the generalization of machine learning models by providing them with more varied training examples.
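A minimal image-augmentation sketch using torchvision (assumed to be installed); the transform choices and sizes are illustrative.

```python
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),   # flipping
    transforms.RandomRotation(degrees=15),    # rotation
    transforms.RandomResizedCrop(size=224),   # scaling + cropping
])

img = Image.new("RGB", (256, 256), color="gray")  # stand-in for a real training image
augmented = augment(img)
```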
Data Widening
Data widening is a technique used to increase the number of features in a dataset by transforming existing features or creating new ones. This can include techniques such as one-hot encoding, polynomial feature expansion, and feature crossing. The goal of data widening is to provide machine learning models with more information to learn from, which can improve their performance on tasks such as classification and regression.
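A small widening sketch with pandas and scikit-learn: one-hot encoding a categorical column and expanding a numeric one with polynomial features. The toy data is made up.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({"color": ["red", "blue", "red"], "x": [1.0, 2.0, 3.0]})

# One-hot encoding: a categorical column becomes one column per category.
wide = pd.get_dummies(df, columns=["color"])

# Polynomial expansion: x -> x, x^2 (feature crossing works the same way with several columns).
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(df[["x"]])
print(wide.shape, expanded.shape)
```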
Record Reconciling
Reconciliation is the process of matching your dataset with that of an external source. Also known as record linkage, or data matching, it is an overall process that often involves other tasks such as entity resolution, data field matching (property matching), and duplicate record detection. Datasets for comparison might be produced by libraries, archives, museums, academic organizations, scientific institutions, non-profits, or interest groups.

Note: the Indian buffet process seems to be a model that could support record reconciliation for a Wikidata-type project, where there is a potentially unbounded number of records and features.
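A toy sketch of the core idea, duplicate detection via fuzzy string similarity, using only the standard library; the Wikidata IDs and the 0.8 threshold are illustrative. Real reconciliation pipelines (OpenRefine, or the recordlinkage package) add blocking, per-field comparison, and entity resolution on top of this.

```python
from difflib import SequenceMatcher

ours = [{"id": 1, "name": "Ada Lovelace"}, {"id": 2, "name": "Alan M. Turing"}]
external = [{"qid": "Q7259", "label": "Ada Lovelace"}, {"qid": "Q7251", "label": "Alan Turing"}]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for rec in ours:
    best = max(external, key=lambda ext: similarity(rec["name"], ext["label"]))
    if similarity(rec["name"], best["label"]) > 0.8:   # naive acceptance threshold
        print(rec["id"], "->", best["qid"])
```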

OpenRefine FOSS
A FOSS tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data, including reconciliation.
Scrapy Web Scraping Python FOSS
is an open-source and collaborative web crawling framework for Python.
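A minimal spider, following the pattern from the Scrapy tutorial; quotes.toscrape.com is the tutorial's demo site. Run it with `scrapy runspider quotes_spider.py -o quotes.json`.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each quote block yields one structured item.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```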
word2vec Embeddings Python FOSS
A group of related models that are used to produce word embeddings.
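A minimal sketch of training word embeddings with gensim's Word2Vec (gensim 4.x parameter names); the toy corpus and hyperparameters are only for illustration.

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "cat", "purred"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv.most_similar("cat", topn=2))
```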
FastText Embeddings Python FOSS
A library for efficient learning of word representations and sentence classification.
GloVe Embeddings Python FOSS
An unsupervised learning algorithm for obtaining word vector representations, trained on aggregated global word-word co-occurrence statistics.
spacy NLP Python FOSS
An industrial-strength NLP library for Python, covering tokenization, part-of-speech tagging, dependency parsing, and named entity recognition.
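A tiny spaCy sketch; it assumes the small English model has already been downloaded (`python -m spacy download en_core_web_sm`).

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # named entities and their labels
```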

Prodi.gy NLP Data Annotation Python Commercial
A modern annotation tool for creating training data for machine learning models.

PySpark Big Data Python FOSS
The Python API for Apache Spark, an open-source distributed computing framework that generalizes the Map-Reduce paradigm.
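A minimal PySpark sketch: start a local session, read a Parquet file, and aggregate. The file name and column are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bag-of-tricks").getOrCreate()
df = spark.read.parquet("events.parquet")                 # hypothetical input file
daily = df.groupBy("date").agg(F.count("*").alias("events"))
daily.show()
```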
Data Version Control (DVC)
is an open-source version control system for machine learning projects. It helps data scientists and machine learning engineers manage and track changes to datasets, models, and code in a collaborative environment. DVC integrates with Git, allowing users to version control large files and datasets efficiently. It also provides features for reproducibility, experiment tracking, and pipeline management, making it easier to share and collaborate on machine learning projects.
Weights & Biases
is a platform for experiment tracking, model management, and collaboration in machine learning projects. It provides tools for logging and visualizing experiments, comparing model performance, and sharing results with team members. Weights & Biases integrates with popular machine learning frameworks and libraries, making it easier for data scientists and machine learning engineers to manage their workflows and collaborate effectively.
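A minimal logging sketch with the wandb client (assumes `wandb login` has been run); the project name and metrics are made up.

```python
import wandb

run = wandb.init(project="bag-of-tricks", config={"lr": 1e-3, "epochs": 3})
for epoch in range(run.config.epochs):
    loss = 1.0 / (epoch + 1)            # stand-in for a real training loss
    wandb.log({"epoch": epoch, "loss": loss})
run.finish()
```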
Data Heroes
DataHeroes uses coresets, compact geometric representations that preserve key properties of the full dataset, as a stand-in for the full training data.
Outlines LLM Structured Data
Outlines guarantees structured outputs during generation — directly from any LLM.
Dynamic Linear Models Time Series R Bayesian
The R library for the eponymous Bayesian time-series analysis framework, based on the Kalman filter and the superposition of state-space models.
Stumpy Time Series Python
STUMPY is a library for time series data mining, focusing on matrix profile algorithms.
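A minimal STUMPY sketch: compute the matrix profile of a toy series and locate its most anomalous subsequence (the profile's maximum). The window length is arbitrary.

```python
import numpy as np
import stumpy

ts = np.random.rand(1_000)
m = 50                                       # subsequence (window) length
mp = stumpy.stump(ts, m)                     # column 0 holds the matrix profile distances
print("discord starts at index", np.argmax(mp[:, 0].astype(float)))
```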
Pydantic Data Validation Python
Pydantic is a data validation and settings management library using Python type annotations.
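A minimal Pydantic sketch: declare a schema with type annotations and validate raw input, with coercion where possible.

```python
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    name: str
    age: int
    email: str | None = None

user = User(name="Ada", age="36")        # "36" is coerced to int
print(user.age)

try:
    User(name="Bob", age="not a number")
except ValidationError as err:
    print(err)                            # clear, field-level error message
```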
Data Wrangler for VS Code
A smart-ETL, code-centric data viewing and cleaning tool integrated into VS Code and VS Code Jupyter Notebooks.
Parquet
Parquet is a columnar storage file format optimized for use with big data processing frameworks.
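A minimal Parquet round-trip with pandas (it uses pyarrow or fastparquet under the hood); the toy frame is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"user": ["a", "b"], "score": [0.9, 0.4]})
df.to_parquet("scores.parquet", index=False)

# Columnar layout means you can read back only the columns you need.
scores = pd.read_parquet("scores.parquet", columns=["score"])
print(scores.head())
```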
VS Code
VS Code is a lightweight but powerful source code editor which runs on your desktop and is available for Windows, macOS, and Linux.
Quarto Blogging Data Science FOSS
Quarto is a markdown-based document authoring system, powered by Pandoc, designed for data science and technical writing.
Data Science at the Command Line
This is a collection of command-line tools and techniques for data science, enabling efficient data manipulation, analysis, and visualization directly from the terminal.
Deepchecks Model Evaluation Python FOSS
Deepchecks is a library for testing and validating machine learning models.
Mixture of Experts
is a technique in machine learning where multiple models (experts) are trained to specialize in different parts of the input space. During inference, a gating mechanism decides which expert(s) to use for a given input, allowing the model to leverage the strengths of each expert for better overall performance.
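A toy mixture-of-experts sketch in NumPy: a softmax gate weights the outputs of two linear "experts" for each input. Sizes and weights are arbitrary, just to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                               # 4 inputs, 3 features

experts = [rng.normal(size=(3, 1)) for _ in range(2)]     # two linear "experts"
gate_w = rng.normal(size=(3, 2))                          # gating network weights

logits = x @ gate_w
gates = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # softmax over experts

expert_out = np.stack([x @ w for w in experts], axis=1)   # (batch, n_experts, 1)
y = (gates[..., None] * expert_out).sum(axis=1)           # gate-weighted combination
print(y.shape)                                            # (4, 1)
```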
Bagging and Boosting
are ensemble learning techniques used to improve the performance of machine learning models by combining multiple weak learners into a stronger one. Bagging (Bootstrap Aggregating) involves training multiple models on different subsets of the data and averaging their predictions, while Boosting sequentially trains models, with each new model focusing on correcting the errors of the previous ones.
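A quick sketch contrasting the two with scikit-learn on a synthetic dataset; the estimator counts are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

bagging = BaggingClassifier(n_estimators=50, random_state=0)             # bootstrap-sampled models trained independently
boosting = GradientBoostingClassifier(n_estimators=50, random_state=0)   # models trained sequentially to correct errors

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```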
GAN
Generative Adversarial Networks (GANs) consist of two neural networks, a generator and a discriminator, trained simultaneously through competition. The generator creates fake data samples, while the discriminator evaluates them against real data samples, providing feedback the generator uses to improve its output. This is essentially generate-and-test with feedback from real data.
VAE
Variational Autoencoders (VAEs) are generative models that learn to encode input data into a lower-dimensional latent space and then decode it back to reconstruct the original data. They consist of an encoder network that maps the input to a latent representation and a decoder network that reconstructs the input from this representation. VAEs are trained using a combination of reconstruction loss and a regularization term that encourages the latent space to follow a specific distribution, typically a Gaussian.
Data Fusion
is the process of integrating data from multiple sources to create a unified and consistent dataset. This involves combining data with different formats, structures, and semantics, often requiring techniques such as data cleaning, transformation, and alignment to ensure compatibility and coherence. The goal of data fusion is to enhance the quality and completeness of the information available for analysis and decision-making.
SLAM
Simultaneous Localization and Mapping (SLAM) is a computational problem in robotics and computer vision where a device or robot must build a map of an unknown environment while simultaneously keeping track of its own location within that environment. SLAM algorithms use sensor data (e.g., from cameras, LiDAR, or IMUs) to estimate the robot’s position and orientation while incrementally constructing a map of the surroundings. This is particularly useful for autonomous navigation in dynamic and unstructured environments.
Kalman Filter
is an algorithm that uses a series of measurements observed over time, containing statistical noise and other inaccuracies, and produces estimates of unknown variables that tend to be more precise than those based on a single measurement alone. It operates recursively, updating estimates as new data becomes available, and is widely used in applications such as navigation, tracking, and control systems.
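A minimal 1-D Kalman filter sketch: recursively estimate a constant signal from noisy readings. The noise variances are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = 5.0
measurements = true_value + rng.normal(scale=1.0, size=50)   # noisy sensor readings

x, P = 0.0, 1.0       # state estimate and its variance
Q, R = 1e-5, 1.0      # process and measurement noise variances

for z in measurements:
    P = P + Q                      # predict: variance grows by process noise
    K = P / (P + R)                # Kalman gain
    x = x + K * (z - x)            # update the estimate with the new measurement
    P = (1 - K) * P                # update the variance

print(round(x, 2))                 # close to 5.0
```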
Dynamic Linear Models (DLMs) R Time Series Bayesian
is a class of statistical models used for time series analysis that allow for the modeling of time-varying relationships between variables. DLMs consist of two main components: a state equation that describes how the underlying state of the system evolves over time, and an observation equation that relates the observed data to the underlying state. DLMs can incorporate various types of dynamics, such as trends, seasonality, and cycles, and can be estimated using techniques like the Kalman filter. They are particularly useful for forecasting and smoothing time series data.
Snowflake
is a cloud-based data warehousing platform that provides scalable and flexible storage and computing capabilities for big data analytics. It allows users to store and analyze large volumes of structured and semi-structured data using SQL and other programming languages. Snowflake’s architecture separates storage and compute, enabling users to scale resources independently based on their needs. It also offers features such as data sharing, data cloning, and time travel, making it a popular choice for modern data warehousing and analytics.
EDA
Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often using visual methods. It involves techniques such as plotting distributions, identifying patterns, detecting outliers, and examining relationships between variables. EDA helps data scientists understand the data’s structure, quality, and underlying trends, guiding further analysis and modeling efforts. Common tools for EDA include histograms, scatter plots, box plots, and correlation matrices.
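A minimal EDA pass with pandas; the input file is hypothetical, and the histogram call requires matplotlib.

```python
import pandas as pd

df = pd.read_csv("data.csv")           # hypothetical dataset

print(df.info())                       # column types and missing values
print(df.describe())                   # summary statistics
print(df.corr(numeric_only=True))      # correlation matrix

df.hist(figsize=(10, 8))               # per-column histograms
```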
Extract, Transform, Load (ETL)
ETL stands for Extract, Transform, Load. It is a data integration process that involves extracting data from various sources, transforming it into a suitable format or structure for analysis, and then loading it into a target database or data warehouse. ETL processes are commonly used in data warehousing and business intelligence to consolidate data from multiple systems, ensuring data quality and consistency for reporting and analysis. Common tools include Talend, Apache NiFi, Apache Airflow, Pentaho, and DLT.
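A toy ETL sketch with pandas and SQLite; the source file, column names, and target table are all hypothetical.

```python
import sqlite3
import pandas as pd

# Extract
raw = pd.read_csv("orders.csv")                           # hypothetical source file

# Transform
raw["order_date"] = pd.to_datetime(raw["order_date"])     # normalize types
clean = raw.dropna(subset=["customer_id"])                # drop unusable rows

# Load
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders", conn, if_exists="replace", index=False)
```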
Talend ETL Data Integration FOSS
Talend is an open-source data integration platform that provides tools for ETL (Extract, Transform, Load) processes, data quality, and data management. It offers a graphical interface for designing data workflows and supports a wide range of data sources and formats. Talend enables users to automate data integration tasks, ensuring efficient and reliable data processing for analytics and business intelligence.
DLT ETL Data Pipelines FOSS
A Python library that provides a framework for building and managing data pipelines. It is designed to simplify creating, scheduling, and monitoring data workflows, letting users focus on the logic of their data processing tasks rather than the underlying infrastructure. DLT supports various data sources and formats, making it a versatile tool for data engineering and ETL processes.
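A minimal dlt pipeline sketch, loosely following the dlt quickstart; it assumes `dlt[duckdb]` is installed, and the pipeline, dataset, and table names are illustrative.

```python
import dlt

data = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Alan"}]

pipeline = dlt.pipeline(
    pipeline_name="bag_of_tricks",
    destination="duckdb",
    dataset_name="demo",
)
info = pipeline.run(data, table_name="users")   # load the records into the destination
print(info)
```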

Citation

BibTeX citation:
@online{bochman2025,
  author = {Bochman, Oren},
  title = {AI a Bag of Tricks},
  date = {2025-09-04},
  url = {https://orenbochman.github.io/posts/2025/2025-09-04-Bag-Of-Tricks/},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2025. “AI a Bag of Tricks.” September 4, 2025. https://orenbochman.github.io/posts/2025/2025-09-04-Bag-Of-Tricks/.