import fitz
= fitz.open('pdf_test.pdf')
doc = doc[0] # get first page
page = fitz.Rect(0, 0, 600, page.rect.width) # define your rectangle here
rect = page.get_textbox(rect) # get text from rectangle
text = ' '.join(text.split())
clean_text
print(clean_text)
TODO:
split into:
- [] PDF blocks
- [] Page gen blocks - where we generate input images with known text to recognize
- capture different layouts
- capture different language/scripts
- capture different content
- capture different languages
- use RL and Generate & Test to approximate some image (needs a loss)
- [] OCR
- [] Font manifolds
text image –> preprocessing –> segmentation –> feature-extraction –> recognition –> postprocessing
Aquisition
- render pages from pdf -> ok for unsupervised learning.
- generate from text -> better for supervised learning.
remove text from pdf
Sometimes we should discard the OCRd text in the pdf.
In this case we want a pdf that was scanned and we want the image we don’t want to extract the images as they may have been split into layers or and also intto chunks which is not very usefull for OCR.
gs -o no-more-texts.pdf -sDEVICE=pdfwrite -dFILTERTEXT ocr-doc.pdf
render pdf page to png
we can skip the previous step is the text is ok! this generates 2 page
pdftocairo -png ./no-more-texts.pdf ./img/ -f 20 -l 22
pdftocairo -png ./no-more-texts.pdf ./img/ -f 20 -l 22 -gray
some extra flags to crop a box starting at pdftocairo -png ./no-more-texts.pdf ./img/ -f 20 -l 22 -gray -x X -y Y -W W -H H
we may then want to segement and extract regions from the page. when we segment we probably want to … use a sub rectage
A smart generator has the property of not repeating itself. Idealy we would like to generate a corpus that representitive of what we want to OCR without containing more data than needed. This could mean one thing for training and onther thing for testint. One idea to minimize the data set wrt a loss fucntion is using coresets. To use coresets we need to decide on a loss function. Since there are many steps in OCR we may need to combine many losses and this can This may make the coresets approch not viable.
Generation
- convert text to image
- segment scorer -
preprocessing
skew correction
import sys
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image as im
from scipy.ndimage import interpolation as inter
#input_file = sys.argv[1]
#input_file = sys.argv[1]
= im.open(input_file)
img # convert to binary
= img.size
wd, ht = np.array(img.convert('1').getdata(), np.uint8)
pix = 1 - (pix.reshape((ht, wd)) / 255.0)
bin_img ='gray')
plt.imshow(bin_img, cmap'binary.png')
plt.savefig(def find_score(arr, angle):
= inter.rotate(arr, angle, reshape=False, order=0)
data = np.sum(data, axis=1)
hist = np.sum((hist[1:] - hist[:-1]) ** 2)
score return hist, score
= 1
delta = 5
limit = np.arange(-limit, limit+delta, delta)
angles = []
scores for angle in angles:
= find_score(bin_img, angle)
hist, score
scores.append(score)= max(scores)
best_score = angles[scores.index(best_score)]
best_angle print('Best angle: {}'.formate(best_angle))
# correct skew
= inter.rotate(bin_img, best_angle, reshape=False, order=0)
data = im.fromarray((255 * data).astype("uint8")).convert("RGB")
img 'skew_corrected.png') img.save(
biniariation
- adaptive thresholding
- otsu biniratation
- local maximan and minima
c(i,j) = \frac{I_{max}-I_{min}}{I_{max}-I_{mi}+\epsilon}
- noise removal
import numpy as np
import cv2
from matplotlib import pyplot as plt
# Reading image from folder where it is stored
= cv2.imread('bear.png')
img # denoising of image saving it into dst image
= cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 15)
dst # Plotting of source and destination image
121), plt.imshow(img)
plt.subplot(122), plt.imshow(dst)
plt.subplot( plt.show()
- thining and skeletonization
sementation - line level - word level - character level
classification
identify the segment
post processing
spelling correction !?
Binarization
global
if (current)
Refernces
- https://towardsdatascience.com/pre-processing-in-ocr-fc231c6035a7
- https://towardsdatascience.com/image-filters-in-python-26ee938e57d2
- https://github.com/arthurflor23/text-segmentation
- https://pdf.wondershare.com/pdf-knowledge/extract-images-from-pdf-linux.html
- https://askubuntu.com/questions/150100/extracting-embedded-images-from-a-pdf
- https://stackoverflow.com/questions/24322338/remove-all-text-from-pdf-file
Citation
@online{bochman2024,
author = {Bochman, Oren},
title = {OCR Building Blocks},
date = {2024-03-28},
url = {https://orenbochman.github.io/posts/2024/2024-02-28-ocr/28-02-2024-ocr.html},
langid = {en}
}