June 30, 2021
OCR algorithms: a complete guide
Independent AI expert
Optical character recognition (OCR) technologies deal with the extraction of editable text content from text that appears inside images (for example, in a photo of a road sign, or a scanned document). Though the earliest implementations date back to 1914, the ongoing need to 'de-rasterize' text has made OCR one of the central planks of the big data revolution of the last fifteen years, as well as one of the driving forces of digitization and modernization of company inventory and IP acumen.
In this article we'll take a brief look at how OCR works, consider it as a potential component in computer vision software, and see if the primary FOSS and commercial packages available could be a good fit for your project.
Optical character recognition works by dividing up the image of a text character into sections and distinguishing between empty and non-empty regions. Depending on the font or script used for the letter, the checksum of the resulting matrix is subsequently labeled (initially, by a person) as corresponding to the character in the image.
This 'identify and encode' approach is little-changed since the GISMO apparatus developed in the early 1950s for the forerunner of the NSA. The main differences are that pixel coordinates have replaced the arbitrary grids of earlier systems; preprocessing has become more automated; and machine learning solutions can greatly speed up the process.
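As a rough illustration of the 'identify and encode' idea, the Python sketch below divides a small character bitmap into a grid of zones, records which zones contain ink, and looks the resulting pattern up in a hand-labeled table. The bitmaps, the 3x3 grid and the names `zone_signature`, `LABELED_SIGNATURES` and `classify` are assumptions made for this example, not part of any real OCR engine.

```python
from typing import Dict, List, Tuple

Bitmap = List[List[int]]  # 1 = ink, 0 = empty


def parse(rows: List[str]) -> Bitmap:
    """Turn rows of '#' (ink) and '.' (empty) into a 0/1 matrix."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def zone_signature(bitmap: Bitmap, rows: int, cols: int) -> Tuple[int, ...]:
    """Split the bitmap into rows x cols zones; a zone is 1 if it holds any ink."""
    height, width = len(bitmap), len(bitmap[0])
    signature = []
    for r in range(rows):
        for c in range(cols):
            top, bottom = r * height // rows, (r + 1) * height // rows
            left, right = c * width // cols, (c + 1) * width // cols
            has_ink = any(
                bitmap[y][x] for y in range(top, bottom) for x in range(left, right)
            )
            signature.append(1 if has_ink else 0)
    return tuple(signature)


# Hypothetical reference glyphs, labeled by a person.
LETTER_L = parse(["#.....", "#.....", "#.....", "#.....", "#.....", "######"])
LETTER_T = parse(["######", "..##..", "..##..", "..##..", "..##..", "..##.."])

LABELED_SIGNATURES: Dict[Tuple[int, ...], str] = {
    zone_signature(LETTER_L, 3, 3): "L",
    zone_signature(LETTER_T, 3, 3): "T",
}


def classify(bitmap: Bitmap) -> str:
    """Return the label whose stored signature matches, or '?' if none does."""
    return LABELED_SIGNATURES.get(zone_signature(bitmap, 3, 3), "?")


# A thicker, slightly different rendering of the same letter.
THICK_L = parse(["##....", "##....", "#.....", "#.....", "#####.", "######"])
```

Here `classify(THICK_L)` returns `"L"`, because the thicker glyph fills the same zones as the reference letter. A coarse grid tolerates small variations in stroke width, but it will also confuse characters whose ink falls in the same zones, which is one reason practical systems moved on to richer features.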
A modern OCR training workflow follows a number of steps:
1: Acquisition
Obtaining non-editable text content from source imagery of all types, from flatbed scans of corporate archival material through to live surveillance footage and mobile imaging data.
2: Pre-processing
Cleaning up the source imagery at an aggregate level so that the text is easier to discern, and noise is reduced or eliminated.
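One common clean-up step is binarization, which separates ink from background before any characters are sought. The sketch below uses Otsu's method, a standard technique chosen here purely for illustration; the article does not say which methods a given pipeline uses. It assumes a grayscale image stored as nested lists of 0-255 values, with dark text on a light background.

```python
from typing import List

GrayImage = List[List[int]]  # pixel values from 0 (black) to 255 (white)


def otsu_threshold(image: GrayImage) -> int:
    """Return the gray level that maximizes between-class variance (Otsu's method)."""
    histogram = [0] * 256
    for row in image:
        for value in row:
            histogram[value] += 1

    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, 0.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += level * histogram[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(image: GrayImage) -> List[List[int]]:
    """Map pixels at or below the Otsu threshold to 1 (ink), the rest to 0."""
    threshold = otsu_threshold(image)
    return [[1 if value <= threshold else 0 for value in row] for row in image]


# Hypothetical 2x4 scan: three dark "ink" pixels on a light page.
SAMPLE = [[205, 210, 20, 200], [200, 15, 10, 210]]
```

A global threshold like this tends to struggle with uneven lighting, which is why real pipelines often combine it with de-skewing, local thresholding or denoising.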
3: Segmentation and feature extraction
Scanning of the image content for groups of pixels that are likely to constitute single characters, and assignment of each of them to their own class. The machine learning framework will then attempt to derive features for the recurring pixel groups that it finds, based on generalized OCR templates or prior models. However, human verification will be needed later.
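The search for pixel groups can be sketched with classical connected-component labeling, shown here as a simple stand-in for whatever segmentation a particular framework performs. The function name `find_character_boxes` and the tiny bitmap are assumptions made for the example.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def find_character_boxes(bitmap: List[List[int]]) -> List[Box]:
    """Group touching ink pixels (8-connectivity) and return one bounding box per group."""
    height, width = len(bitmap), len(bitmap[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not bitmap[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and bitmap[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)  # left-to-right order, adequate for a single text line


# Two separate glyph-like blobs on one line.
LINE = [
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 1],
]
```

This naive grouping splits characters made of several parts (the dot of an "i", for instance) and merges characters that touch, which is part of why human verification is still needed later.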
4: Training
Once all features are defined, the data can be processed in a neural network training session, where a model will attempt to develop a generalized image-to-text mapping for the data.
5: Verification and re-training
After processing, humans evaluate the results, with corrections fed back into subsequent training sessions. At this point, data quality may need to be reviewed. Data cleaning is time-consuming and expensive, and while initial training runs will perform de-skewing, high contrast processing, and other helpful methods to obtain a good algorithm with minimal pre-processing, further arduous refinement of the data may be necessary.
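One way to quantify what the human reviewers find is the character error rate (CER): the edit distance between the corrected text and the OCR output, divided by the length of the corrected text. The sketch below is a minimal version of that metric; the helper `select_for_retraining` and its 5% cut-off are hypothetical, added only to show how corrections might be routed back into training.

```python
from typing import List, Tuple


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the human-corrected reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def select_for_retraining(
    samples: List[Tuple[str, str]], max_cer: float = 0.05
) -> List[Tuple[str, str]]:
    """Keep (corrected_text, ocr_output) pairs whose error rate exceeds max_cer."""
    return [
        (reference, hypothesis)
        for reference, hypothesis in samples
        if character_error_rate(reference, hypothesis) > max_cer
    ]
```

For a road sign reading "STOP" that the model returns as "ST0P", one substitution over four characters gives a CER of 0.25, so that sample would be flagged for the next training session.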
For offline processing, where the user waits for an OCR algorithm to extract text from an image (as opposed to the 'live', zero-latency OCR used in the Google Translate mobile app), existing commercial and open-source solutions have arguably matured to the point of solving the challenge.
A range of FOSS repositories and libraries can be incorporated into a dedicated local OCR framework for automated data collection, though many of them are also leveraged by SaaS OCR providers (see 'Commercial OCR APIs', later).
The Tesseract OCR engine rose from its 1980s roots as a proprietary C/C++ Hewlett-Packard algorithm to become open-sourced in 2005 under the ongoing patronage of Google, following a decade of neglect.
Considered one of the most accurate OCR frameworks, Tesseract's capabilities were widely lauded in the FOSS community, and its associated software, datasets and secondary modules are now effectively perceived as a collective Google initiative.
The core OCR engine is available as a CLI offering on Windows and Linux, though it has less extensive support on the Mac platform.
Tesseract supports 116 languages by default, though others can be adapted to it. In 2020, the Internet Archive, possibly the largest OCR project of the last twenty years, switched to a Tesseract/OCRopus (see below) workflow, and described Tesseract as having made a 'major step forward' in accuracy in the preceding years.
Version 4 of Tesseract added a long short-term memory (LSTM) recurrent neural network (RNN) architecture and automatic language recognition. The current maintainers on GitHub note that character images may need a fair bit of cleaning prior to training—a long-time caveat for Tesseract.
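For readers who want to try the LSTM engine from a script, the sketch below wraps the Tesseract command line using only the Python standard library. It assumes Tesseract 4 or later is installed and on the system path; the function names and the file name in the usage note are hypothetical.

```python
import subprocess
from typing import List


def build_tesseract_command(
    image_path: str, lang: str = "eng", oem: int = 1, psm: int = 3
) -> List[str]:
    """Assemble a Tesseract CLI call that writes recognized text to standard output.

    --oem 1 selects the LSTM engine; --psm 3 is fully automatic page segmentation.
    """
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "--oem", str(oem),
        "--psm", str(psm),
    ]


def run_tesseract(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract on one image and return the extracted text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract reports a failure
    )
    return result.stdout
```

Calling `run_tesseract("scan.png")` would return the text of that (hypothetical) file. Given the maintainers' caveat about image quality, it is usually worth cleaning the image first rather than relying on the engine alone.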
Over the last 15 years, a wide range of FOSS and proprietary interfaces and GUIs have emerged to make use of this popular and capable framework, including:
These are just a few of the available interfaces. Depending on the framework in hand and the licensing terms, it's possible to incorporate many of these FOSS packages into a dedicated OCR workflow.
Another Google-supported project, OCRopus is a collection of document analysis systems that incorporates OCR and now uses GPU-capable text line recognizers and deep learning layout evaluation tools. The project was originally written in Python; a separate C++ version, CLSTM, now has its own fork.
Though these divergent frameworks are not interchangeable, the developers advise that retaining simplified and unified data formatting and storage is the best solution for using either version on the same data. The split from Python to C++ has somewhat fragmented the original clarity of the project, since it leaves the Python version (popular in machine learning ecosystems) with a dimmer future than the faster but more rarefied C++ iteration.
In either case, OCRopus solves a major shortcoming of many FOSS OCR solutions, since layout analysis is usually an essential part of the OCR pipeline and is integrated in this case.
OCRopus has been famously leveraged as the OCR engine for Google's ReCaptcha algorithm, though its performance has been subject to occasional criticism in this regard.
There have been a few refugees from the splintered OCRopus project. Among them is Kraken, a CUDA-supported turnkey OCR framework that runs on Linux and macOS and requires a number of external libraries in order to run. It can be installed via pip or Anaconda, and must load recognition models from external sources. Though the project features a public model repository, it currently only contains the generalized English language model and a model for Syriac text.
Another OCRopus dissident, Python 3-based Calamari OCR is a CLI-only framework also derived from Kraken. It offers a model repository that emphasizes historical rather than contemporary textual sources, with French as the primary alternative language to English.
The Python-based deep learning API Keras offers a convolutional recurrent neural network (CRNN) for text recognition which has been utilized in several modular FOSS repositories, including Simple digit OCR (for tf.keras 2.1) and keras-ocr, which is easier to integrate into a new framework and leverages the PyTorch Character-Region Awareness For Text detection (CRAFT) text detector.
EasyOCR is a well-maintained repository that supports more than 80 languages and all popular script types, including Latin, Cyrillic, Chinese and Arabic, and offers a demo site. It installs natively via pip on Linux, runs via PyTorch on Windows, can be deployed via Docker, and supports CUDA.
API-based FAANG and mid-level corporate OCR offerings are likely to outperform most FOSS solutions out of the box, because:
However, their inevitable attendant cost is also accompanied by uncertainty around future pricing policies, possible issues with governance, and the need to commit to hybrid or cloud-based OCR framework models, or else accept that an on-premises model that hooks into cloud-based commercial APIs will be left with some risky external dependencies.
Nonetheless, for network-based projects (such as mobile app development) with manageable and controllable API call volumes, a commercial API may be the ideal solution. Admittedly, FAANG-level connectivity, latency and uptime can be hard to match.
Alternatively, a short API subscription can be useful as an easy and low-effort proof-of-concept proxy service that can eventually be replaced by dedicated proprietary infrastructure.
First-tier OCR API services are idiosyncratic, with fragmented use cases and multiple factors that make like-for-like evaluation problematic. To boot, the market leaders in cognitive automation not only offer different products for different types of OCR scenario (such as for signs and for documents, see below), but vary among themselves in terms of architecture, features, available template datasets, modularity, and processing pipeline capabilities.
Added to this, the major OCR providers update their offerings frequently, which makes accurate long-term comparison a challenge. Periodically, new tests come online to compare factors such as text prediction error count and accuracy rates across SaaS OCR services from a small section of the largest providers.
These sporadic surveys rarely encompass a wide enough range of SaaS offerings and frequently include commercial standalone software that is difficult or costly to incorporate into a pipeline, such as ABBYY FineReader.
Since the customer use-case and data will be particular, and SaaS OCR test rankings are constantly in flux, the best approach is to take advantage of initial free credits and trial periods and to develop a modular OCR framework that can switch relatively easily between APIs to accommodate an exploratory phase for the project.
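A modular framework of this kind can be as simple as a registry that hides each provider behind the same call signature. The sketch below is one possible shape under that assumption; `OcrRouter` and the two stand-in backends are hypothetical, and a real adapter would call the provider's SDK or REST API where the stand-ins return placeholder strings.

```python
from typing import Callable, Dict, List

# A backend is any callable that takes raw image bytes and returns text.
OcrBackend = Callable[[bytes], str]


class OcrRouter:
    """Registry that lets a pipeline swap OCR providers by name."""

    def __init__(self) -> None:
        self._backends: Dict[str, OcrBackend] = {}

    def register(self, name: str, backend: OcrBackend) -> None:
        self._backends[name] = backend

    def available(self) -> List[str]:
        return sorted(self._backends)

    def recognize(self, name: str, image_bytes: bytes) -> str:
        if name not in self._backends:
            raise KeyError(f"No OCR backend registered as {name!r}")
        return self._backends[name](image_bytes)


# Stand-in backends for illustration only; they do not perform any OCR.
def fake_cloud_backend(image_bytes: bytes) -> str:
    return f"cloud result for {len(image_bytes)} bytes"


def fake_local_backend(image_bytes: bytes) -> str:
    return f"local result for {len(image_bytes)} bytes"


router = OcrRouter()
router.register("cloud_trial", fake_cloud_backend)
router.register("local_foss", fake_local_backend)
```

With this arrangement, moving from a trial API to a FOSS engine during the exploratory phase means changing the name passed to `router.recognize`, while the rest of the pipeline stays untouched.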
Google Cloud Vision (GCV) offers two types of text detection as API calls: Text Detection and Document Text Detection. The first is aimed at sparse amounts of text in images (such as images of signs for AR/VR or navigation products), and the second is a more traditional document OCR functionality.
GCV can make use of AutoML Vision, a proprietary model-training framework designed to ease the creation of datasets and training data.
As with Amazon's OCR services (see below), GCV OCR is a potential bottomless pit, depending on your needs, with a pricing list that is in itself a vast and bewildering document. However, GCV OCR currently comes with an initial $300 of free credits applicable to up to 20 free product registrations.
As with Google Cloud Vision, Amazon offers two distinct OCR APIs: Amazon Rekognition, for individuating small text amounts in the wild; and Amazon Textract, for a traditional document-based OCR pipeline.
The fragmentation doesn't end there: Textract itself is subdivided into the Detect Document Text (DDT) API (vanilla OCR) and the Analyze Document (AD) API (key-value pair and content extraction, including OCR).
AWS pricing is, famously, a potential minefield. Textract is billed on a per-page basis, with DDT currently costing $0.0015 per page for the first million pages and $0.0006 per page thereafter. AD uses the same two-tier structure but at higher rates, with the charge dropping to $0.01 per page after a million pages.
The AWS Free Tier applies, with new AWS customers getting 1,000 pages free per month using the Detect Document Text API, and up to 100 pages per month using the Analyze Document API. It should be noted that AWS pricing varies greatly in real terms between geographic regions.
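To see how the two-tier DDT rate plays out, the sketch below computes a bill from the per-page figures quoted above. It assumes those figures still hold, that the one-million-page tier is counted within a single billing period, and that Free Tier pages and regional differences are ignored; check the current AWS price list before relying on any of it.

```python
from decimal import Decimal

# Rates as quoted in this article; they may be out of date.
TIER_LIMIT_PAGES = 1_000_000
RATE_FIRST_TIER = Decimal("0.0015")   # USD per page, first million pages
RATE_BEYOND_TIER = Decimal("0.0006")  # USD per page thereafter


def detect_document_text_cost(pages: int) -> Decimal:
    """Estimate the DDT charge in USD for a number of pages in one billing period."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    first_tier_pages = min(pages, TIER_LIMIT_PAGES)
    remaining_pages = max(pages - TIER_LIMIT_PAGES, 0)
    return first_tier_pages * RATE_FIRST_TIER + remaining_pages * RATE_BEYOND_TIER
```

Under these assumptions, 1.5 million pages would cost $1,500 for the first million plus $300 for the remaining half million, or $1,800 in total. `Decimal` is used rather than floating point so that the small per-page rates do not accumulate rounding error.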
Rekognition's pricing model is even more granular, split between image, video and custom labels analysis. The company provides a complex online calculator to help estimate potential Rekognition API costs. We don't have room in this article to list all the explicit per-unit pricing that Amazon currently makes available.
To access Microsoft's OCR functionality, you'll need an Azure subscription and a current .NET Core install or Visual Studio IDE. You will then need to create a Computer Vision resource in the Azure portal to obtain a key and an instance endpoint.
After that, you'll be subject to generic Microsoft Computer Vision pricing, where available transactions per second rise with the price, and where the prices themselves are split into 15 features across four categories. Though we cannot list all these prices here, the 'Read' section currently varies from $1.50 per 1,000 transactions (for up to a million total transactions) to $0.60 per 1,000 transactions (for more than a million transactions).
A broader range of mid-level commercial OCR APIs are available, including:
While SaaS solutions may help to kick-start an on-premises OCR workflow and can be useful in developing its base architecture, there are at least three strong reasons to consider FOSS OCR engines in a custom text extraction pipeline. Firstly, the pain barrier of pre-processing character images and training models is not much ameliorated in the otherwise glossy FAANG OCR world, because most data is quite idiosyncratic and uncleaned; secondly, the best of the FOSS solutions—such as Tesseract—represent stable software maintained by active and industry-engaged contributors.
Finally, beyond the initial effort of adapting the software to the company's needs, the future costs of a FOSS library are known—a fortunate situation that SaaS APIs can't replicate.