Over the years I’ve come across a number of tools for working with data. Each one is great in its own way, but each capability is siloed inside its own tool. It would be better if they were integrated into a single API. Another strength of these tools is their great UI, and it would be even better if we could get the UI as well.
Getting the UI as well means we need to abstract it so that it can be introduced into an IDE or a Jupyter notebook as a widget or plugin.
Wouldn’t it be great if we had a tool in Python, with both a UI and an API, that lets us work with data?
Regex - I really like regex.[1] Regexes are nice, but even better is a code assistant that can craft a regex from a few examples.
import re

data = """Disaster struck on 12th of May 2021 when the Titanic sank in the Atlantic Ocean. And once again on 14th of May 2021 when the Hindenburg exploded in New Jersey. And again on 3rd of March 1948 when the Lusitania was torpedoed by a German submarine."""

# extract all dates from the data above
# expected: '12th of May 2021', '14th of May 2021', '3rd of March 1948'
dates = re.findall(r'\d{1,2}\w{2} of \w{3,} \d{4}', data)
print(dates)

['12th of May 2021', '14th of May 2021', '3rd of March 1948']
This worked after giving GitHub Copilot a single example.
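A regex suggested by an assistant is only as trustworthy as the examples it was checked against. The following is a minimal sketch of such a check, not a feature of Copilot or of any tool mentioned here: `check_pattern` and the negative examples are hypothetical names and inputs chosen for illustration, and only the standard `re` module is used.

```python
import re

def check_pattern(pattern, should_match, should_not_match=()):
    """Return a list of failure messages; an empty list means the pattern passed."""
    compiled = re.compile(pattern)
    failures = []
    for text in should_match:
        # fullmatch requires the whole example to match, not just a substring
        if compiled.fullmatch(text) is None:
            failures.append(f"expected a match: {text!r}")
    for text in should_not_match:
        if compiled.fullmatch(text) is not None:
            failures.append(f"unexpected match: {text!r}")
    return failures

# The pattern from the example above.
DATE_PATTERN = r"\d{1,2}\w{2} of \w{3,} \d{4}"

# Positive examples come from the text; the negatives are made up for this sketch.
positives = ["12th of May 2021", "14th of May 2021", "3rd of March 1948"]
negatives = ["May 2021", "12 of May 2021", "12th of May 21"]

failures = check_pattern(DATE_PATTERN, positives, negatives)
print(failures or "pattern passed")
```

Adding a harder negative such as "12ab of Xyz 2021" makes the check fail, because `\w{2}` accepts any two word characters rather than only ordinal suffixes. That is the kind of looseness a few extra examples can surface before the pattern is reused.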
Extract a representative sample of the data that is small enough to work with quickly but large enough to represent the full dataset
Data wrangling like Trifacta Wrangler
OpenRefine - faceting, clustering, reconciliation, and working with a schema like Wikibase
EDA like Voyager
Visualization using a grammar of graphics
Capture explanations to create classifiers for entities etc., like Snorkel AI
Extract minimal datasets via coresets, like DataHeroes
FlashText - extract entities from text, Flash++, FlashRelate
Give you Python code to reproduce the steps
snuba is
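The first wish in the list above, a sample that is small but still representative, can be approximated with stratified sampling. This is a minimal sketch under stated assumptions: rows are plain dicts, one categorical field (the hypothetical `kind`) is what must stay proportionally represented, and `stratified_sample` with its parameters is an illustrative name rather than part of any tool mentioned above.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample roughly `fraction` of the rows from every group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        # keep at least one row per group so rare groups are not lost
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical toy dataset: 90 rows of kind "a" and 10 rows of kind "b".
rows = [{"id": i, "kind": "a" if i < 90 else "b"} for i in range(100)]
sample = stratified_sample(rows, key=lambda r: r["kind"], fraction=0.1)
print(len(sample))
```

Sampling within each group, rather than across the whole dataset, keeps rare categories in the sample, which is usually what "representative" needs to mean when exploring data interactively.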
Footnotes
[1] But when you work with regex, it feels like you are learning regex for the very first time.
Reuse
CC SA BY-NC-ND
Citation
BibTeX citation:
@online{bochman2025,
author = {Bochman, Oren},
title = {Data {Wrangling} Mastery!},
date = {2025-01-25},
url = {https://orenbochman.github.io/notes-islr/posts/wrangler/},
langid = {en}
}