import fitz
doc = fitz.open('pdf_test.pdf')
page = doc[0] # get first page
rect = fitz.Rect(0, 0, page.rect.width, 600) # (x0, y0, x1, y1): full page width, top 600 points; define your rectangle here
text = page.get_textbox(rect) # get text from rectangle
clean_text = ' '.join(text.split())
print(clean_text)
TODO: split into:
- [] PDF blocks
- [] Page gen blocks - where we generate input images with known text to recognize
- capture different layouts
- capture different languages/scripts
- capture different content
- use RL and Generate & Test to approximate some image (needs a loss)
- [] OCR
- [] Font manifolds
text image –> preprocessing –> segmentation –> feature-extraction –> recognition –> postprocessing
Acquisition
- render pages from pdf -> ok for unsupervised learning.
- generate from text -> better for supervised learning.
remove text from pdf
Sometimes we should discard the OCR'd text layer in the PDF.
This applies when the PDF was scanned and we want the page image. We don't want to extract the embedded images directly, as they may have been split into layers and also into chunks, which is not very useful for OCR.
gs -o no-more-texts.pdf -sDEVICE=pdfwrite -dFILTERTEXT ocr-doc.pdf
render pdf page to png
We can skip the previous step if the text is OK. This renders pages 20 through 22 (three pages).
pdftocairo -png ./no-more-texts.pdf ./img/ -f 20 -l 22
pdftocairo -png ./no-more-texts.pdf ./img/ -f 20 -l 22 -gray
Some extra flags crop a box of width W and height H whose top-left corner is at (X, Y):
pdftocairo -png ./no-more-texts.pdf ./img/ -f 20 -l 22 -gray -x X -y Y -W W -H H
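To script the rendering commands above, a small wrapper can assemble the argument list. This is only a sketch: `pdftocairo_cmd`, `render_pages`, and the `img/page` output prefix are hypothetical names of mine, and it assumes poppler-utils is installed. Only the flags already shown above are used.

```python
import subprocess
from pathlib import Path


def pdftocairo_cmd(pdf, out_root, first, last, gray=False, crop=None):
    """Build the pdftocairo argument list for rendering pages first..last to PNG.

    crop is an optional (x, y, w, h) tuple giving the crop box in pixels.
    """
    cmd = ["pdftocairo", "-png", "-f", str(first), "-l", str(last)]
    if gray:
        cmd.append("-gray")
    if crop is not None:
        x, y, w, h = crop
        cmd += ["-x", str(x), "-y", str(y), "-W", str(w), "-H", str(h)]
    cmd += [str(pdf), str(out_root)]
    return cmd


def render_pages(pdf, out_root, first, last, **kwargs):
    """Run pdftocairo; raises if the tool is missing or exits with an error."""
    Path(out_root).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftocairo_cmd(pdf, out_root, first, last, **kwargs), check=True)


# Inspect the command without running it.
print(pdftocairo_cmd("no-more-texts.pdf", "img/page", 20, 22, gray=True))
```

Keeping the command builder separate from the `subprocess.run` call makes it easy to check the flags before touching any files.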
We may then want to segment and extract regions from the page. When we segment we probably want to … use a sub-rectangle.
A smart generator has the property of not repeating itself. Ideally we would like to generate a corpus that is representative of what we want to OCR without containing more data than needed. This could mean one thing for training and another thing for testing. One idea for minimizing the dataset with respect to a loss function is to use coresets. To use coresets we need to decide on a loss function. Since there are many steps in OCR we may need to combine many losses, and this may make the coreset approach not viable.
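As a rough illustration of shrinking a corpus without a task loss, here is k-center greedy selection, a simple heuristic from the coreset literature. It is not a loss-based coreset, and it assumes each sample has already been mapped to a feature vector. The function name and the toy data are made up for the example.

```python
import numpy as np


def k_center_greedy(features, k, seed=0):
    """Pick up to k row indices so that every sample is near a picked one."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    chosen = [int(rng.integers(n))]
    # Distance from every sample to its nearest chosen sample so far.
    dist = np.linalg.norm(features - features[chosen[0]], axis=1)
    while len(chosen) < min(k, n):
        nxt = int(np.argmax(dist))  # sample farthest from the current subset
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(features - features[nxt], axis=1))
    return chosen


# Toy "corpus": three tight clusters of near-duplicate samples, 50 each.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
samples = np.vstack([c + 0.1 * rng.standard_normal((50, 2)) for c in centers])
picked = k_center_greedy(samples, k=3)
print(sorted(i // 50 for i in picked))  # cluster id of each picked sample
```

With three picks the selection lands once in each cluster, which is the "no repetition" behaviour we want from the generator.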
Generation
- convert text to image
- segment scorer -
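For the "convert text to image" step, a minimal sketch with Pillow could look like this. It uses Pillow's built-in default font. `render_text`, the image size, and the margin are hypothetical choices, not something these notes fix.

```python
from PIL import Image, ImageDraw, ImageFont


def render_text(text, size=(400, 60), margin=10):
    """Render text as a black-on-white grayscale image; return (image, label)."""
    img = Image.new("L", size, color=255)  # white canvas
    draw = ImageDraw.Draw(img)
    # Default bitmap font; use ImageFont.truetype(path, size) to vary fonts.
    font = ImageFont.load_default()
    draw.text((margin, margin), text, fill=0, font=font)
    return img, text


img, label = render_text("The quick brown fox")
print(img.size, label)
```

Returning the label alongside the image keeps the known ground-truth text attached to each generated sample, which is what makes this route suitable for supervised learning.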
preprocessing
skew correction
import sys
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image as im
from scipy import ndimage
# path of the scanned page image, taken from the command line
input_file = sys.argv[1]
img = im.open(input_file)
# convert to binary: text pixels become 1, background 0
wd, ht = img.size
pix = np.array(img.convert('1').getdata(), np.uint8)
bin_img = 1 - (pix.reshape((ht, wd)) / 255.0)
plt.imshow(bin_img, cmap='gray')
plt.savefig('binary.png')
def find_score(arr, angle):
    # rotate, then score how sharply the row sums change between adjacent rows
    data = ndimage.rotate(arr, angle, reshape=False, order=0)
    hist = np.sum(data, axis=1)
    score = np.sum((hist[1:] - hist[:-1]) ** 2)
    return hist, score
# try every angle from -limit to +limit degrees in steps of delta
delta = 1
limit = 5
angles = np.arange(-limit, limit+delta, delta)
scores = []
for angle in angles:
    hist, score = find_score(bin_img, angle)
    scores.append(score)
best_score = max(scores)
best_angle = angles[scores.index(best_score)]
print('Best angle: {}'.format(best_angle))
# correct skew
data = ndimage.rotate(bin_img, best_angle, reshape=False, order=0)
img = im.fromarray((255 * data).astype("uint8")).convert("RGB")
img.save('skew_corrected.png')
binarization
- adaptive thresholding
- Otsu binarization
- local maxima and minima
c(i,j) = \frac{I_{max}-I_{min}}{I_{max}+I_{min}+\epsilon}
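The contrast map c(i,j) can be computed with sliding max/min filters. In this sketch I take the denominator to be the sum I_max + I_min + ε, the usual normalized form, so treat that as my assumption. The 3x3 window, the ε value, and the function name are also just illustrative.

```python
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter


def local_contrast(gray, window=3, eps=1e-6):
    """Per-pixel contrast (I_max - I_min) / (I_max + I_min + eps) over a window."""
    g = gray.astype(np.float64)
    i_max = maximum_filter(g, size=window)  # local maximum intensity
    i_min = minimum_filter(g, size=window)  # local minimum intensity
    return (i_max - i_min) / (i_max + i_min + eps)


# Toy image: dark left part, bright right part, one vertical edge.
gray = np.zeros((5, 5), dtype=np.uint8)
gray[:, 3:] = 200
c = local_contrast(gray)
print(np.round(c[2], 2))  # contrast along the middle row
```

Flat regions score 0 and pixels whose window straddles the edge score close to 1, so thresholding this map picks out stroke boundaries rather than absolute brightness.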
- noise removal
import numpy as np
import cv2
from matplotlib import pyplot as plt
# Read the image from the folder where it is stored (OpenCV loads it as BGR)
img = cv2.imread('bear.png')
# Denoise the image and save the result into dst
dst = cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 15)
# Plot source and denoised images, converting BGR to RGB for matplotlib
plt.subplot(121), plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
plt.subplot(122), plt.imshow(cv2.cvtColor(dst, cv2.COLOR_BGR2RGB))
plt.show()
- thinning and skeletonization
segmentation
- line level
- word level
- character level
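Line-level segmentation can be sketched with a horizontal projection profile: sum the ink in each row and cut wherever the sum drops to zero. This assumes a deskewed binary image with text pixels equal to 1. `segment_lines` and `min_ink` are hypothetical names.

```python
import numpy as np


def segment_lines(bin_img, min_ink=1):
    """Return (top, bottom) row ranges of text lines; bottom is exclusive."""
    has_ink = bin_img.sum(axis=1) >= min_ink
    lines, start = [], None
    for row, ink in enumerate(has_ink):
        if ink and start is None:
            start = row                 # a line begins
        elif not ink and start is not None:
            lines.append((start, row))  # a line ends at the first blank row
            start = None
    if start is not None:               # line runs to the bottom edge
        lines.append((start, len(has_ink)))
    return lines


# Toy page with two "text lines" of ink.
page = np.zeros((12, 20), dtype=np.uint8)
page[2:4, 3:15] = 1
page[7:10, 2:18] = 1
print(segment_lines(page))  # → [(2, 4), (7, 10)]
```

Applying the same function to the transpose of a cropped line (`segment_lines(line.T)`) gives column ranges, which is a crude starting point for the word and character levels.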
classification
identify the segment
post processing
spelling correction !?
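One cheap way to try spelling correction is to snap each recognized word to the closest entry in a known vocabulary. The sketch below uses `difflib` from the standard library. The vocabulary, the 0.8 cutoff, and `correct_word` are assumptions for illustration, and a real system would likely want a language model instead.

```python
import difflib


def correct_word(word, vocabulary, cutoff=0.8):
    """Return the closest vocabulary word, or the input if nothing is close."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


vocab = ["recognition", "segmentation", "binarization", "threshold"]
ocr_words = ["recogniti0n", "segmentatlon", "xyz"]
print([correct_word(w, vocab) for w in ocr_words])
# → ['recognition', 'segmentation', 'xyz']
```

Words with no sufficiently close match are left alone, so out-of-vocabulary text such as names is not forced onto a wrong dictionary entry.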
Binarization
global
if (current)
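For the global case, a single threshold is applied to the whole image. Otsu's method, mentioned in the binarization list earlier, chooses the threshold that maximizes the between-class variance of the gray-level histogram. Below is a NumPy sketch assuming an 8-bit grayscale input. The function name and the toy image are mine.

```python
import numpy as np


def otsu_threshold(gray):
    """Return t in 0..255 maximizing between-class variance (class 0: pixels <= t)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                # probability of class 0 for each t
    mu = np.cumsum(prob * np.arange(256))  # cumulative mean up to t
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / denom
    sigma_b[denom == 0] = 0.0              # thresholds that leave one class empty
    return int(np.argmax(sigma_b))


# Toy image: light background with a dark block of "ink".
gray = np.full((10, 10), 200, dtype=np.uint8)
gray[3:7, 2:8] = 50
t = otsu_threshold(gray)
ink = gray <= t  # boolean mask of text pixels
print(t, int(ink.sum()))
```

A global threshold like this works when lighting is even across the page; unevenly lit scans are where the adaptive and local-contrast variants come in.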
References
- https://towardsdatascience.com/pre-processing-in-ocr-fc231c6035a7
- https://towardsdatascience.com/image-filters-in-python-26ee938e57d2
- https://github.com/arthurflor23/text-segmentation
- https://pdf.wondershare.com/pdf-knowledge/extract-images-from-pdf-linux.html
- https://askubuntu.com/questions/150100/extracting-embedded-images-from-a-pdf
- https://stackoverflow.com/questions/24322338/remove-all-text-from-pdf-file
Citation
@online{bochman2024,
author = {Bochman, Oren},
title = {OCR Building Blocks},
date = {2024-03-28},
url = {https://orenbochman.github.io/posts/2024/2024-02-28-ocr/},
langid = {en}
}