Data Wrangling Mastery!

Author

Oren Bochman

Published

Saturday, January 25, 2025

Over the years I’ve come across a number of tools for working with data.

Each one is great in its own way, but each is siloed inside its own tool. It would be better if they were not siloed, i.e. if they were integrated into a single API. Another strength of these tools is their great UI, so it would be even better if we could get the UI as well.

Getting the UI as well means that we need to abstract it, so that it can be introduced into an IDE or a Jupyter notebook as a widget or plugin.
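One way to picture this abstraction is an object that exposes plain methods for the API and also knows how to render itself in a notebook. The sketch below is only an illustration: `SampleView` and its methods are hypothetical names, not part of any existing tool. It relies on the `_repr_html_` hook that Jupyter calls to display rich output.

```python
import html


class SampleView:
    """Hypothetical wrapper: one object serves both the API and the notebook UI."""

    def __init__(self, rows):
        # rows is a list of dicts that share the same keys
        self.rows = list(rows)

    def head(self, n=5):
        """API side: return the first n rows as plain Python data."""
        return self.rows[:n]

    def _repr_html_(self):
        """UI side: Jupyter calls this hook to render the object as HTML."""
        if not self.rows:
            return "<p>(empty)</p>"
        columns = list(self.rows[0])
        header = "".join(f"<th>{html.escape(str(c))}</th>" for c in columns)
        body = "".join(
            "<tr>"
            + "".join(f"<td>{html.escape(str(row.get(c, '')))}</td>" for c in columns)
            + "</tr>"
            for row in self.head()
        )
        return f"<table><tr>{header}</tr>{body}</table>"


view = SampleView([{"name": "a<b", "n": 1}, {"name": "c", "n": 2}])
print(view.head(1))
```

In a notebook, leaving `view` as the last expression of a cell would show the table, while scripts keep using `head()`.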

Wouldn’t it be great if we had a tool in Python, with both a UI and an API, that lets us work with data in the following ways?

  1. Regex - I really like regex.1 Regexes are nice, but even better is a code assistant that can craft a regex from a few examples.
import re

text = """Disaster struck on 12th of May 2021 when the Titanic sank in the Atlantic Ocean.
        And once again on 14th of May 2021 when the Hindenburg exploded in New Jersey.
        And again on 3rd of March 1948 when the Lusitania was torpedoed by a German submarine.
"""

# Extract all dates from the text above.
# Expected: '12th of May 2021', '14th of May 2021', '3rd of March 1948'
# Pattern: day number, two-character ordinal suffix, " of ", month name, four-digit year.
dates = re.findall(r'\d{1,2}\w{2} of \w{3,} \d{4}', text)

print(dates)

['12th of May 2021', '14th of May 2021', '3rd of March 1948']

So GitHub Copilot produced a working regex here from a single example.
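A pattern proposed by an assistant is easier to trust when it is checked against the examples it was built from. The following is a minimal sketch of such a check; `check_pattern` is a hypothetical helper written for this post, not part of `re` or of Copilot.

```python
import re


def check_pattern(pattern, text, expected):
    """Compare a candidate regex's matches with the matches we expect.

    Returns (ok, missing, unexpected).
    """
    found = re.findall(pattern, text)
    missing = [e for e in expected if e not in found]
    unexpected = [f for f in found if f not in expected]
    return (not missing and not unexpected), missing, unexpected


text = (
    "Disaster struck on 12th of May 2021 when the Titanic sank in the Atlantic Ocean. "
    "And once again on 14th of May 2021 when the Hindenburg exploded in New Jersey. "
    "And again on 3rd of March 1948 when the Lusitania was torpedoed by a German submarine."
)
expected = ["12th of May 2021", "14th of May 2021", "3rd of March 1948"]

# A candidate that insists on two-digit days misses '3rd of March 1948'.
narrow_result = check_pattern(r"\d{2}\w{2} of \w{3,} \d{4}", text, expected)

# Allowing one or two digits for the day covers all three dates.
general_result = check_pattern(r"\d{1,2}\w{2} of \w{3,} \d{4}", text, expected)

print(narrow_result)
print(general_result)
```

The same helper can be rerun whenever a new example is added, which turns the handful of examples into a small regression test for the pattern.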

  2. Extract a representative sample of the data, small enough to work with quickly but large enough to represent the full dataset.
  3. Do data wrangling like Trifacta Wrangler.
  4. OpenRefine - faceting, clustering, reconciliation, and working with a schema like Wikibase.
  5. EDA like Voyager.
  6. Visualization using a grammar of graphics.
  7. Capture explanations to create classifiers for entities etc., like Snorkel AI.
  8. Extract minimal datasets via coresets, like DataHeroes.
  9. FlashText - extract entities from text; also Flash++ and FlashRelate.
  10. Give you Python code to reproduce the steps.
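Two of the items above, sampling and OpenRefine-style clustering, can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not how any of the named tools is implemented: `stratified_sample`, `fingerprint` and `cluster` are hypothetical names, and the fingerprint is a simplified take on the key-collision idea (normalise case, accents, punctuation and word order, then group values that share a key).

```python
import random
import string
import unicodedata
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample roughly `fraction` of each group so small groups stay represented."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # keep at least one row per group
        sample.extend(rng.sample(members, k))
    return sample


def fingerprint(value):
    """Simplified key-collision fingerprint for spotting near-duplicate strings."""
    # Decompose accented characters and drop whatever is not ASCII.
    text = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # Trim, lowercase, and remove punctuation.
    text = text.strip().lower().translate(str.maketrans("", "", string.punctuation))
    # Sort the unique tokens so word order does not matter.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values whose fingerprints collide; keep only groups with variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


rows = [{"label": "a"}] * 90 + [{"label": "b"}] * 10
sample = stratified_sample(rows, key=lambda r: r["label"], fraction=0.1)
clusters = cluster(["Acme Inc.", "acme inc", "Inc. Acme", "Globex"])
print(len(sample), clusters)
```

Sampling per group rather than over the whole table is a design choice: a plain random sample of the same size could miss a rare label entirely.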

snuba is

Footnotes

  1. But every time you work with regex, it feels like you are learning it for the very first time.

Reuse

CC SA BY-NC-ND

Citation

BibTeX citation:
@online{bochman2025,
  author = {Bochman, Oren},
  title = {Data {Wrangling} Mastery!},
  date = {2025-01-25},
  url = {https://orenbochman.github.io/notes-islr/posts/wrangler/},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2025. “Data Wrangling Mastery!” January 25, 2025. https://orenbochman.github.io/notes-islr/posts/wrangler/.