Optical character recognition (OCR) technologies extract editable text content from text that appears inside images (for example, in a photo of a road sign, or a scanned document). Though the earliest implementations date back to 1914, the ongoing need to 'de-rasterize' text has made OCR one of the central planks of the big data revolution of the last fifteen years, as well as one of the driving forces behind the digitization and modernization of company inventories and intellectual property assets.
In this article we'll take a brief look at how OCR works, consider it as a potential component in computer vision software, and see if the primary FOSS and commercial packages available could be a good fit for your project.
How OCR algorithms work
Optical character recognition works by dividing the image of a text character into sections and distinguishing between empty and non-empty regions. The pattern of the resulting matrix, which varies with the font or script used for the letter, is then labeled (initially, by a person) as corresponding to the character in the image.
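As a toy illustration of this matrix approach (the glyph bitmaps and labels below are hypothetical, hand-made examples, not a real engine or dataset), the sketch divides a small character bitmap into zones, counts the filled pixels in each zone, and matches the resulting feature vector against human-labeled templates:

```python
# Toy zoning OCR: divide a glyph bitmap into a coarse grid of zones,
# count filled pixels per zone, and match against labeled templates.

def zone_features(bitmap, zones=2):
    """Split a square bitmap into zones x zones regions and count set pixels in each."""
    size = len(bitmap)
    step = size // zones
    features = []
    for zr in range(zones):
        for zc in range(zones):
            count = sum(
                bitmap[r][c]
                for r in range(zr * step, (zr + 1) * step)
                for c in range(zc * step, (zc + 1) * step)
            )
            features.append(count)
    return features

def classify(bitmap, templates):
    """Return the label whose template feature vector is nearest (L1 distance)."""
    feats = zone_features(bitmap)
    return min(
        templates,
        key=lambda label: sum(abs(a - b) for a, b in zip(feats, templates[label])),
    )

# Hand-labeled 4x4 'templates' (in a real system these come from human annotation).
TEMPLATES = {
    "I": zone_features([[0, 1, 1, 0]] * 4),
    "L": zone_features([[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1]]),
}

sample = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
print(classify(sample, TEMPLATES))  # L
```

Real engines operate on pixel coordinates and far richer features, but the principle (encode regions, then match against labeled patterns) is the same.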
This 'identify and encode' approach has changed little since the GISMO apparatus developed in the early 1950s for the forerunner of the NSA (see image below). The main differences are that pixel coordinates have replaced the arbitrary grids of earlier systems and that preprocessing has become more automated.
The OCR pipeline
A modern OCR training workflow follows a number of steps:
1: Data acquisition
Obtaining non-editable text content from scanned documents of all types, from flatbed scans of corporate archival material through to live surveillance footage and mobile imaging data.
2: Clean-up and pre-processing
Cleaning up the source imagery at an aggregate level so that the text is easier to discern and noise is reduced or eliminated.
3: Segmentation and feature extraction
Scanning of the image content for groups of pixels that are likely to constitute single characters, and assignment of each of them to their own class. The machine learning framework will then attempt to derive features for the recurring pixel groups that it finds, based on generalized OCR templates or prior models. However, human verification will be needed later.
4: Training
Once all features are defined, the data can be processed in a neural network training session, where a model attempts to develop a generalized image-to-text mapping for the data.
5: Verification and re-training
After processing, humans evaluate the results, with corrections fed back into subsequent training sessions. At this point, data quality may need to be reviewed. Data cleaning is time-consuming and expensive; while initial training runs will perform de-skewing, contrast enhancement, and other helpful processing to obtain a good algorithm with minimal pre-processing, further arduous refinement of the data may still be necessary.
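The segmentation step above can be sketched in miniature. The code below (a toy connected-components pass over a hand-made binary image, not a production algorithm) groups adjacent filled pixels into candidate character regions:

```python
# Toy segmentation: group 4-connected filled pixels into candidate
# character regions via flood fill.

def segment(binary):
    """Return a list of pixel groups (each a set of (row, col) coordinates)."""
    rows, cols = len(binary), len(binary[0])
    seen = set()
    groups = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] and (r, c) not in seen:
                # Flood-fill one connected component starting from this pixel.
                stack, group = [(r, c)], set()
                while stack:
                    y, x = stack.pop()
                    if (y, x) in seen:
                        continue
                    seen.add((y, x))
                    group.add((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols and binary[ny][nx]:
                            stack.append((ny, nx))
                groups.append(group)
    return groups

# Two separate 'characters' in a tiny binary image.
image = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
print(len(segment(image)))  # 2
```

Each resulting group would then be handed to feature extraction and, ultimately, classification.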
OCR as an essential computing utility
In terms of offline processing (rather than the 'live', zero-latency OCR used in the Google Translate mobile app in the image below), where the user waits for an OCR algorithm to extract text from an image, existing commercial and open-source solutions have arguably matured to entirely solve this challenge:
- OCR is a free native feature of Google Drive and Dropbox, converting PDF, JPEG, PNG and GIF files to editable text.
- OCR is a native element in the Windows 10+ Universal Windows Platform, making it effectively a core utility that developers can hook into for free.
- OCR is very well-represented in the FOSS community, with a variety of available engines (see below).
A range of FOSS repositories and libraries can be incorporated into a dedicated local OCR framework for automated data collection, though many of them are also leveraged by SaaS OCR providers (see 'Commercial OCR APIs', later).
The Tesseract OCR engine began life in the 1980s as a proprietary C/C++ Hewlett-Packard algorithm and, following a decade of neglect, was open-sourced in 2005 under the ongoing patronage of Google.
Considered one of the most accurate OCR frameworks, Tesseract's capabilities were widely lauded in the FOSS community, and its associated software, datasets and secondary modules are now effectively perceived as a collective Google initiative.
The core OCR engine is available as a CLI offering on Windows and Linux, though it has less extensive support on the Mac platform.
Tesseract supports 116 languages by default, and can be trained for others. In 2020, the Internet Archive (home to possibly the largest OCR project of the last twenty years) switched to a Tesseract/OCRopus (see below) workflow, and described Tesseract as having made a 'major step forward' in accuracy in the preceding years.
Version 4 of Tesseract added a long short-term memory (LSTM) recurrent neural network (RNN) architecture and automatic language recognition. The current maintainers on GitHub note that character images may need a fair bit of cleaning prior to training—a long-time caveat for Tesseract.
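A common way to use the core engine programmatically is to shell out to the `tesseract` CLI. The sketch below (assuming the binary is on the PATH; the file names are hypothetical) builds a typical invocation using the language and page-segmentation-mode flags documented by the project:

```python
import subprocess

def build_tesseract_cmd(image_path, output_base, lang="eng", psm=3):
    """Assemble a tesseract CLI invocation: tesseract INPUT OUTPUT -l LANG --psm N."""
    return ["tesseract", image_path, output_base, "-l", lang, "--psm", str(psm)]

cmd = build_tesseract_cmd("scan.png", "out", lang="eng", psm=6)
print(" ".join(cmd))  # tesseract scan.png out -l eng --psm 6

# In a real pipeline (with tesseract installed), this would write out.txt:
# subprocess.run(cmd, check=True)
```

Wrappers such as pytesseract package this pattern up, but the underlying CLI contract is the same.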
Interfaces for Tesseract
Over the last 15 years, a wide range of FOSS and proprietary interfaces and GUIs have emerged to make use of this popular and capable framework, including:
- gImageReader, a Gtk/Qt front-end
- YAGF, a graphical front-end that also accommodates Cuneiform
- OCRFeeder, a document layout analysis system that can develop sophisticated extraction routines and leverages Tesseract for the text conversion portion of these.
- Free-Ocr-Windows-Desktop, a local Windows application with a straightforward installer that uses Tesseract as the conversion engine.
- Lime-OCR, a Google-hosted project that uses ImageMagick functionality to act as a front-end for Tesseract, despite its creators describing it as 'not a front-end for Tesseract' (a description borrowed from Tesseract-GUI).
- Tesseract-GUI, which offers similar functionality.
These are just a few of the available interfaces. Depending on the framework in hand and the licensing terms, it's possible to incorporate many of these FOSS packages into a dedicated OCR workflow.
Another Google-supported project, OCRopus, is a collection of document analysis systems that incorporates OCR and now uses GPU-capable text line recognizers and deep learning layout evaluation tools. Originally written in Python, a separate C++ CLSTM version now has its own fork.
Though these divergent frameworks are not interchangeable, the developers advise that keeping data formatting and storage simple and unified is the best way to use either version on the same data. The split from Python to C++ has somewhat fragmented the original clarity of the project, since it leaves the Python version (popular in machine learning ecosystems) with a dimmer future than the faster but more rarefied C++ iteration.
In either case, OCRopus solves a major shortcoming of many FOSS OCR solutions, since layout analysis is usually an essential part of the OCR pipeline and is integrated in this case.
OCRopus has been famously leveraged as the OCR engine for Google's ReCaptcha algorithm, though its performance has been subject to occasional criticism in this regard.
There have been a few refugees from the splintered OCRopus project. Among them is Kraken, a CUDA-supported turnkey OCR framework that runs on Linux and macOS and requires a number of external libraries in order to run. It can be installed via pip or Anaconda, and must load recognition models from external sources. Though the project features a public model repository, it currently contains only a generalized English-language model and a model for Syriac text.
Another OCRopus dissident, the Python 3-based Calamari OCR, is a CLI-only framework also derived from Kraken. It offers a model repository with an accent on historical rather than contemporary textual sources, in which French is the primary alternative language to English.
The Python-based deep learning API Keras offers a convolutional recurrent neural network (CRNN) for text recognition that has been utilized in several modular FOSS repositories, including Simple digit OCR (for tf.keras 2.1) and keras-ocr, which is easier to implement in a new framework and leverages the Character-Region Awareness For Text detection (CRAFT) text detector.
EasyOCR is a well-maintained repository that supports more than 80 languages and all popular script types, including Latin, Cyrillic, Chinese and Arabic, and offers a demo site. With native pip-based installation on Linux, EasyOCR runs via PyTorch on Windows, can be deployed via Docker, and supports CUDA.
Commercial OCR APIs
API-based FAANG and mid-level corporate OCR offerings are likely to outperform most FOSS solutions out of the box, because:
- With sales models predicated on ease of adoption, they'll bend over backwards to provide alluringly easy automation pipelines.
- They've already traversed the adoption pain barrier for you, in terms of implementing many of the aforementioned FOSS packages and recognition models into a functional OCR pipeline, which may make the FOSS option seem like 'reinventing the wheel'.
- They're constantly feeding high-volume customer experience back into the core offering.
However, their inevitable attendant cost is also accompanied by uncertainty around future pricing policies, possible issues with governance, and the need to commit to hybrid or cloud-based OCR framework models—or else accept that an on-premises model that hooks into cloud-based commercial APIs will be left with some risky external dependencies.
Nonetheless, for network-based projects (such as mobile app development) with manageable and controllable API call volumes, a commercial API may be the ideal solution. Admittedly, FAANG-level connectivity, latency and uptime can be hard to match.
Alternatively, a short API subscription can be useful as an easy and low-effort proof-of-concept proxy service that can eventually be replaced by dedicated proprietary infrastructure.
The challenge of comparing SaaS OCR offerings
First-tier OCR API services are idiosyncratic, with fragmented use cases and multiple factors that make like-for-like evaluation problematic. Moreover, the market leaders in cognitive automation not only offer different products for different types of OCR scenarios (such as for signs and for documents, see below), but vary among themselves in terms of architecture, features, available template datasets, modularity, and processing pipeline capabilities.
Added to this, the major OCR providers update their offerings frequently, which makes accurate long-term comparison a challenge. Periodically, new tests come online to compare factors such as text prediction error counts and accuracy rates across SaaS OCR services from a small selection of the largest providers.
These sporadic surveys rarely encompass a wide enough range of SaaS offerings and frequently include commercial standalone software that is difficult or costly to incorporate into a pipeline, such as ABBYY Fine Reader.
Since the customer use-case and data will be particular, and SaaS OCR test rankings are constantly in flux, the best approach is to take advantage of initial free credits and trial periods and to develop a modular OCR framework that can switch relatively easily between APIs to accommodate an exploratory phase for the project.
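One way to keep such a framework modular is to hide each backend (FOSS or SaaS) behind a common adapter, so that switching providers during the exploratory phase is a one-line change. The sketch below is minimal and uses a dummy engine; the interface names are our own, not any vendor's API:

```python
# Minimal engine-swapping sketch: each OCR backend is wrapped in an
# adapter exposing one method, so the pipeline can switch providers freely.

class OCREngine:
    """Common interface every backend adapter implements."""
    def extract_text(self, image_bytes: bytes) -> str:
        raise NotImplementedError

class DummyEngine(OCREngine):
    """Stand-in backend, useful for tests and free-trial comparisons."""
    def __init__(self, canned: str):
        self.canned = canned

    def extract_text(self, image_bytes: bytes) -> str:
        return self.canned

class OCRPipeline:
    """Pipeline code depends only on the OCREngine interface, never a vendor SDK."""
    def __init__(self, engine: OCREngine):
        self.engine = engine

    def run(self, image_bytes: bytes) -> str:
        return self.engine.extract_text(image_bytes).strip()

# Swapping providers is a one-line change at construction time.
pipeline = OCRPipeline(DummyEngine(" hello world \n"))
print(pipeline.run(b"fake-image-bytes"))  # hello world
```

A real adapter for each provider would translate between this interface and the vendor's own request/response shapes, keeping the rest of the pipeline untouched.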
Google Cloud Vision (GCV) API
The search giant offers two types of text detection as API calls: Text Detection and Document Text Detection. The first is aimed at sparse amounts of text in images (such as images of signs for AR/VR or navigation products), and the second is a more traditional document OCR functionality.
GCV can make use of AutoML Vision, a proprietary model-training framework designed to ease the creation of datasets and training data.
As with Amazon's OCR services (see below), GCV OCR is a potential bottomless pit, depending on your needs, with a pricing list that is in itself a vast and bewildering document. However, GCV OCR currently comes with an initial $300 of free credits applicable to up to 20 free product registrations.
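The two detection modes correspond to feature types in the Vision API's `images:annotate` request body. The sketch below builds such a payload as a plain dictionary (the image content is a placeholder; actually sending the request would require authentication via Google's client libraries or REST endpoint):

```python
import base64
import json

def vision_annotate_request(image_bytes: bytes, document: bool = False) -> dict:
    """Build an images:annotate payload for GCV text detection.

    document=False -> TEXT_DETECTION (sparse text 'in the wild');
    document=True  -> DOCUMENT_TEXT_DETECTION (dense document OCR).
    """
    feature = "DOCUMENT_TEXT_DETECTION" if document else "TEXT_DETECTION"
    return {
        "requests": [
            {
                # The API expects base64-encoded image bytes.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": feature}],
            }
        ]
    }

payload = vision_annotate_request(b"placeholder-image-bytes", document=True)
print(json.dumps(payload, indent=2)[:80])
```

The response carries the recognized text in `textAnnotations` (and, for document mode, a structured `fullTextAnnotation`), which a modular pipeline would normalize into its own format.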
Amazon Textract / Rekognition
As with Google Cloud Vision, Amazon offers two distinct OCR APIs: Amazon Rekognition, for individuating small text amounts in the wild; and Amazon Textract, for a traditional document-based OCR pipeline.
The fragmentation doesn't end there: Textract itself is subdivided into the Detect Document Text (DDT) API (vanilla OCR) and the Analyze Document (AD) API (key-pair and content extraction, including OCR).
AWS pricing is, famously, a potential minefield. Textract is billed on a per-page basis, with DDT currently costing $0.0015 per page for the first million pages, and $0.0006 per page thereafter, and AD priced likewise, except that the charge drops to
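Using only the Detect Document Text figures quoted above ($0.0015 per page for the first million pages, $0.0006 per page thereafter; prices change, so treat these as illustrative), a tiered bill can be estimated as follows:

```python
def ddt_cost(pages: int, tier1_rate=0.0015, tier2_rate=0.0006,
             tier1_cap=1_000_000) -> float:
    """Estimate Textract Detect Document Text cost under the quoted two-tier pricing."""
    tier1_pages = min(pages, tier1_cap)          # pages billed at the first-tier rate
    tier2_pages = max(pages - tier1_cap, 0)      # overflow billed at the cheaper rate
    return tier1_pages * tier1_rate + tier2_pages * tier2_rate

print(f"${ddt_cost(500_000):,.2f}")    # $750.00 for half a million pages
print(f"${ddt_cost(2_000_000):,.2f}")  # $2,100.00: $1,500 tier 1 + $600 tier 2
```

Modeling costs this way, per provider and per tier, makes the trial-phase comparisons discussed earlier considerably easier to reason about.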