Optical Character Recognition with imagerExtra

Shota Ochi

2018-07-23

We can perform optical character recognition (OCR) with the R package tesseract.

The ocr function of tesseract works best on images with high contrast, little noise, and horizontal text.

It performs poorly on degraded images, as shown below.

library(imagerExtra)
plot(papers, main = "Original")

OCR(papers) %>% cat
OCR_data(papers)
# A tibble: 0 x 3
# ... with 3 variables: word <chr>, confidence <dbl>, bbox <chr>

The OCR and OCR_data functions are shortcuts to the ocr and ocr_data functions of tesseract.

We can see that both OCR and OCR_data failed to extract the text “Hello”: the first printed nothing and the second returned an empty tibble.
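Because OCR and OCR_data are shortcuts, the same check can be run against tesseract directly. The following is a minimal sketch, not the package's actual implementation: it assumes that writing the image to a temporary PNG file and passing the file path to tesseract is an acceptable equivalent, and that the English engine is the one wanted. The names tmp, eng, text and words are illustrative.

```r
library(imagerExtra)  # provides the papers image used in this vignette
library(tesseract)

# tesseract reads image files, so write the image to a temporary PNG first.
# imager::save.image is called with its namespace to avoid base::save.image.
tmp <- tempfile(fileext = ".png")
imager::save.image(papers, tmp)

# Assumption: the English engine is the one we want.
eng <- tesseract("eng")

text <- ocr(tmp, engine = eng)        # counterpart of OCR(papers)
words <- ocr_data(tmp, engine = eng)  # counterpart of OCR_data(papers)

unlink(tmp)  # remove the temporary file
```

Calling tesseract this way is mainly useful when we need options that the shortcuts do not expose.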

We need to clean the image before using the OCR function.

Local adaptive thresholding binarizes this kind of image well.

binarized <- ThresholdAdaptive(papers, 0, range = c(0,1))
plot(binarized, main = "Local Adaptive threshold")

OCR(binarized) %>% cat
ﬁel’ie

We can see that the OCR function failed again, this time because of the small blobs left after thresholding.

These blobs have to be removed.

A straightforward way is to label the blobs, measure the size of each one, and remove the small ones.
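The labeling approach can be sketched with the label function of imager. This is only an illustration under stated assumptions: remove_small_blobs is a hypothetical helper, not part of imagerExtra; the text is assumed to be dark on a bright background, so text pixels are FALSE in the thresholded pixel set; and the cutoff of 30 pixels is an arbitrary guess that would need tuning.

```r
library(imagerExtra)  # also attaches imager, which provides label() and as.cimg()

# Hypothetical helper: turn dark connected regions smaller than min_size
# pixels into background. px is a pixel set with TRUE = bright background.
remove_small_blobs <- function(px, min_size) {
  fg <- 1 - as.cimg(px)                # 1 = dark pixel (text or blob)
  lab <- as.vector(label(fg))          # one integer id per connected region
  is_fg <- as.vector(fg) == 1
  sizes <- table(lab[is_fg])           # pixel count of each dark region
  small <- as.numeric(names(sizes)[sizes < min_size])
  is_fg[lab %in% small] <- FALSE       # erase the small dark regions
  # Rebuild an image: 1 = background, 0 = remaining dark pixels.
  as.cimg(array(as.numeric(!is_fg), dim = dim(fg)))
}

binarized <- ThresholdAdaptive(papers, 0, range = c(0, 1))
cleaned <- remove_small_blobs(binarized, min_size = 30)  # 30 is an assumed cutoff
plot(cleaned, main = "Small blobs removed")
```

A cutoff that is too large would also erase thin strokes or dots that belong to the text, which is one reason this approach takes more care than it first seems.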

However, denoising before binarization is enough in this case.

hello <- DenoiseDCT(papers, 0.01) %>% ThresholdAdaptive(., 0.1, range = c(0,1))
plot(hello, main = "Hello")

OCR(hello) %>% cat
Hello
OCR_data(hello)
# A tibble: 1 x 3
  word  confidence bbox      
  <chr>      <dbl> <chr>     
1 Hello       69.1 8,9,118,54