We can perform optical character recognition (OCR) with the R package tesseract.
The ocr function of tesseract works best on images with high contrast, little noise, and horizontal text.
It performs poorly on degraded images, as shown below.
# A tibble: 0 x 3
# ... with 3 variables: word <chr>, confidence <dbl>, bbox <chr>
The OCR and OCR_data functions are shortcuts to the ocr and ocr_data functions of tesseract.
Both failed to extract the text “Hello”: the result is an empty tibble.
We need to clean the image before running OCR.
Local adaptive thresholding binarizes this image well.
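To give a rough idea of what a local adaptive threshold does, here is a minimal base R sketch of one common scheme (Sauvola's rule), where each pixel is compared against a threshold computed from the mean and standard deviation of its neighbourhood. This is an illustration under stated assumptions, not the package's implementation: the function name `sauvola_threshold`, the window size, and the normalisation constant `R = 0.5` (chosen for pixel values in [0, 1]) are all hypothetical, and whether ThresholdAdaptive uses exactly this rule is an assumption.

```r
# Sketch of Sauvola-style local thresholding on a numeric matrix in [0, 1].
# Hypothetical helper, not part of imagerExtra or tesseract.
sauvola_threshold <- function(img, k = 0.1, window = 7, R = 0.5) {
  stopifnot(is.matrix(img), is.numeric(img), window %% 2 == 1)
  half <- (window - 1) / 2
  nr <- nrow(img)
  nc <- ncol(img)
  out <- matrix(FALSE, nr, nc)
  for (i in seq_len(nr)) {
    for (j in seq_len(nc)) {
      # Neighbourhood clipped at the image border
      rows <- max(1, i - half):min(nr, i + half)
      cols <- max(1, j - half):min(nc, j + half)
      patch <- img[rows, cols]
      m <- mean(patch)
      s <- sqrt(mean((patch - m)^2))  # population standard deviation
      # Local threshold: lower than the local mean in flat regions when k > 0
      out[i, j] <- img[i, j] > m * (1 + k * (s / R - 1))
    }
  }
  out  # TRUE = bright (background), FALSE = dark (ink)
}

# Synthetic test image: uneven illumination plus a dark vertical stroke
img <- matrix(rep(seq(0.5, 1, length.out = 20), each = 20), 20, 20)
img[8:12, 10] <- 0.05
bin <- sauvola_threshold(img, k = 0.1, window = 7)
```

In this sketch, `k = 0` makes the threshold equal to the local mean, so any background pixel sitting slightly below its local mean is marked dark. A small positive `k` lowers the threshold in flat regions and tolerates such fluctuations.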
binarized <- ThresholdAdaptive(papers, 0, range = c(0,1))
plot(binarized, main = "Local Adaptive threshold")
ﬁel’ie
OCR failed again, this time because of the small blobs left in the binarized image.
We need to clean the image further.
A straightforward way is to label the blobs, count the size of each one, and remove the small ones.
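The label-count-remove idea can be sketched in plain base R on a logical matrix, without relying on any imaging package. The helper name `remove_small_blobs`, the `min_size` parameter, the use of 4-connectivity, and the convention that TRUE marks blob pixels are all assumptions made for this illustration. Invert the matrix first if your blobs are the FALSE pixels.

```r
# Sketch: remove connected components smaller than min_size pixels.
# Hypothetical helper; fg is a logical matrix where TRUE marks blob pixels.
remove_small_blobs <- function(fg, min_size = 4) {
  stopifnot(is.matrix(fg), is.logical(fg))
  nr <- nrow(fg)
  nc <- ncol(fg)
  labels <- matrix(0L, nr, nc)
  current <- 0L
  for (start in which(fg)) {
    if (labels[start] != 0L) next        # already part of a labelled blob
    current <- current + 1L
    labels[start] <- current
    stack <- start
    while (length(stack) > 0) {          # flood fill from the seed pixel
      p <- stack[length(stack)]
      stack <- stack[-length(stack)]
      ri <- (p - 1L) %% nr + 1L          # row of linear index p
      ci <- (p - 1L) %/% nr + 1L         # column of linear index p
      # 4-connected neighbours (up, down, left, right)
      nb <- rbind(c(ri - 1L, ci), c(ri + 1L, ci),
                  c(ri, ci - 1L), c(ri, ci + 1L))
      ok <- nb[, 1] >= 1 & nb[, 1] <= nr & nb[, 2] >= 1 & nb[, 2] <= nc
      idx <- (nb[ok, 2] - 1L) * nr + nb[ok, 1]
      idx <- idx[fg[idx] & labels[idx] == 0L]
      labels[idx] <- current
      stack <- c(stack, idx)
    }
  }
  if (current == 0L) return(fg)          # nothing to label
  sizes <- tabulate(labels[labels > 0L], nbins = current)
  keep <- which(sizes >= min_size)
  out <- fg
  out[!(labels %in% keep)] <- FALSE      # drop pixels of small blobs
  out
}

# Toy example: one 16-pixel blob, one 2-pixel blob, one single-pixel speck
fg <- matrix(FALSE, 10, 10)
fg[3:6, 3:6] <- TRUE
fg[1, 8:9] <- TRUE
fg[9, 9] <- TRUE
cleaned <- remove_small_blobs(fg, min_size = 4)
```

Only the 4 x 4 block survives in `cleaned`. The right `min_size` depends on the image resolution and the stroke width of the text, so it usually needs tuning per image.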
In this case, however, denoising the image before binarization is enough.
hello <- DenoiseDCT(papers, 0.01) %>% ThresholdAdaptive(., 0.1, range = c(0,1))
plot(hello, main = "Hello")
Hello
# A tibble: 1 x 3
  word  confidence bbox
  <chr>      <dbl> <chr>
1 Hello       69.1 8,9,118,54