OCR building blocks – Oren Bochman’s Blog

TODO:

split into:

[] PDF blocks
[] Page gen blocks - where we generate input images with known text to recognize

capture different layouts
capture different language/scripts
capture different content
capture different languages
use RL and Generate & Test to approximate some image (needs a loss)

[] OCR
[] Font manifolds

text image –> preprocessing –> segmentation –> feature-extraction –> recognition –> postprocessing

Aquisition

render pages from pdf -> ok for unsupervised learning.
generate from text -> better for supervised learning.

remove text from pdf

Sometimes we should discard the OCRd text in the pdf.

In this case we want a pdf that was scanned and we want the image we don’t want to extract the images as they may have been split into layers or and also intto chunks which is not very usefull for OCR.

gs -o no-more-texts.pdf -sDEVICE=pdfwrite -dFILTERTEXT ocr-doc.pdf

render pdf page to png

we can skip the previous step is the text is ok! this generates 2 page

pdftocairo -png ./no-more-texts.pdf ./img/ -f 20 -l 22

pdftocairo -png ./no-more-texts.pdf ./img/ -f 20 -l 22 -gray

some extra flags to crop a box starting at pdftocairo -png ./no-more-texts.pdf ./img/ -f 20 -l 22 -gray -x X -y Y -W W -H H

we may then want to segement and extract regions from the page. when we segment we probably want to … use a sub rectage

import fitz

doc = fitz.open('pdf_test.pdf')
page = doc[0]  # get first page
rect = fitz.Rect(0, 0, 600, page.rect.width)  # define your rectangle here
text = page.get_textbox(rect)  # get text from rectangle
clean_text = ' '.join(text.split())

print(clean_text)

A smart generator has the property of not repeating itself. Idealy we would like to generate a corpus that representitive of what we want to OCR without containing more data than needed. This could mean one thing for training and onther thing for testint. One idea to minimize the data set wrt a loss fucntion is using coresets. To use coresets we need to decide on a loss function. Since there are many steps in OCR we may need to combine many losses and this can This may make the coresets approch not viable.

Generation

convert text to image
segment scorer -

preprocessing

skew correction

import sys
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image as im
from scipy.ndimage import interpolation as inter

#input_file = sys.argv[1]
#input_file = sys.argv[1]
img = im.open(input_file)
# convert to binary
wd, ht = img.size
pix = np.array(img.convert('1').getdata(), np.uint8)
bin_img = 1 - (pix.reshape((ht, wd)) / 255.0)
plt.imshow(bin_img, cmap='gray')
plt.savefig('binary.png')
def find_score(arr, angle):
    data = inter.rotate(arr, angle, reshape=False, order=0)
    hist = np.sum(data, axis=1)
    score = np.sum((hist[1:] - hist[:-1]) ** 2)
    return hist, score
delta = 1
limit = 5
angles = np.arange(-limit, limit+delta, delta)
scores = []
for angle in angles:
    hist, score = find_score(bin_img, angle)
    scores.append(score)
best_score = max(scores)
best_angle = angles[scores.index(best_score)]
print('Best angle: {}'.formate(best_angle))
# correct skew
data = inter.rotate(bin_img, best_angle, reshape=False, order=0)
img = im.fromarray((255 * data).astype("uint8")).convert("RGB")
img.save('skew_corrected.png')

biniariation

adaptive thresholding
otsu biniratation
local maximan and minima

c(i,j) = \frac{I_{max}-I_{min}}{I_{max}-I_{mi}+\epsilon}

noise removal

import numpy as np 
import cv2 
from matplotlib import pyplot as plt 
# Reading image from folder where it is stored 
img = cv2.imread('bear.png') 
# denoising of image saving it into dst image 
dst = cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 15) 
# Plotting of source and destination image 
plt.subplot(121), plt.imshow(img) 
plt.subplot(122), plt.imshow(dst) 
plt.show()

thining and skeletonization

sementation - line level - word level - character level

classification

identify the segment

post processing

spelling correction !?

Binarization

global

if (current)

Refernces

https://towardsdatascience.com/pre-processing-in-ocr-fc231c6035a7
https://towardsdatascience.com/image-filters-in-python-26ee938e57d2
https://github.com/arthurflor23/text-segmentation
https://pdf.wondershare.com/pdf-knowledge/extract-images-from-pdf-linux.html
https://askubuntu.com/questions/150100/extracting-embedded-images-from-a-pdf
https://stackoverflow.com/questions/24322338/remove-all-text-from-pdf-file

Citation

BibTeX citation:

@online{bochman2024,
  author = {Bochman, Oren},
  title = {OCR Building Blocks},
  date = {2024-03-28},
  url = {https://orenbochman.github.io/posts/2024/2024-02-28-ocr/28-02-2024-ocr.html},
  langid = {en}
}

For attribution, please cite this work as:

Bochman, Oren. 2024. “OCR Building Blocks.” March 28, 2024. https://orenbochman.github.io/posts/2024/2024-02-28-ocr/28-02-2024-ocr.html.