PDF extraction hacks

A collection of hacks and tips for extracting content from PDF files, including converting PDFs to images using the pdf2image library in Python, and other techniques for handling PDF data.
pdf
Author

Oren Bochman

Published

Sunday, April 10, 2022

Modified

Monday, May 18, 2026

Keywords

pdf, pdf extraction, pdf2image, python, data extraction

#| label: pdf-2-png
#| fig-cap: "convert pdf to png"
from pdf2image import convert_from_path

pdf_path = 'in.pdf'

# Render every page of the PDF as a PIL image (pdf2image needs poppler installed)
images = convert_from_path(pdf_path)

# Save each page as page0.png, page1.png, ...
for i, image in enumerate(images):
    image.save(f'page{i}.png', 'PNG')
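
The snippet above writes `page0.png`, `page1.png`, and so on into the current directory, so `page10.png` sorts before `page2.png` in most file listings. A minimal sketch of one way around that, wrapping the same `convert_from_path` call in a function and zero-padding the page numbers, is shown below. The names `page_filename` and `pdf_to_pngs`, the `out_dir` default, and the `dpi=200` value are illustrative assumptions, not part of the original post; `dpi` is a real `convert_from_path` parameter.

```python
from pathlib import Path


def page_filename(stem, index, total):
    """Build a zero-padded PNG name so files sort in page order.

    Hypothetical helper: pads to at least two digits, more if the
    document has 100+ pages.
    """
    width = max(2, len(str(total)))
    return f"{stem}-page{index:0{width}d}.png"


def pdf_to_pngs(pdf_path, out_dir="pages", dpi=200):
    """Render each page of a PDF to a PNG file and return the saved paths.

    Assumes pdf2image and its poppler dependency are installed.
    """
    # Imported here so the naming helper can be used without pdf2image.
    from pdf2image import convert_from_path

    pdf_path = Path(pdf_path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # One PIL image per page; a higher dpi gives larger, sharper images.
    images = convert_from_path(str(pdf_path), dpi=dpi)

    saved = []
    for number, image in enumerate(images, start=1):  # 1-based page numbers
        target = out / page_filename(pdf_path.stem, number, len(images))
        image.save(target, "PNG")
        saved.append(target)
    return saved
```

Calling `pdf_to_pngs('in.pdf')` on a 12-page file would be expected to produce `pages/in-page01.png` through `pages/in-page12.png`. Pages are numbered from 1 here to match what a PDF viewer shows, which differs from the 0-based names in the original loop.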

Citation

BibTeX citation:
@online{bochman2022,
  author = {Bochman, Oren},
  title = {PDF Extraction Hacks},
  date = {2022-04-10},
  url = {https://orenbochman.github.io/posts/2020/04-10-pdf-extraction/},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2022. “PDF Extraction Hacks.” April 10. https://orenbochman.github.io/posts/2020/04-10-pdf-extraction/.