OCR algorithms: a complete guide

June 30, 2021

Independent AI expert

Optical character recognition (OCR) technologies extract editable text from text that appears inside images (for example, in a photo of a road sign or a scanned document). Though the earliest implementations date back to 1914, the ongoing need to 'de-rasterize' text has made OCR one of the central planks of the big data revolution of the last fifteen years, as well as a driving force behind the digitization and modernization of company inventory and intellectual property.

In this article we'll take a brief look at how OCR works, consider it as a potential component in computer vision software, and see if the primary FOSS and commercial packages available could be a good fit for your project.

How OCR algorithms work

Optical character recognition works by dividing the image of a text character into sections and distinguishing between empty and non-empty regions. The resulting matrix varies with the font or script used for the letter, so a checksum of it is subsequently labeled (initially by a person) as corresponding to the character in the image.
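As a rough illustration of the grid idea described above, here is a minimal Python sketch. It assumes a character arrives as a tiny bitmap of '#' (ink) and '.' (blank) characters, and that labeled templates already exist; the bitmaps, the 3x3 grid size and the function names are all invented for this example rather than taken from any real OCR engine.

```python
from typing import Dict, List, Tuple

Grid = Tuple[Tuple[int, ...], ...]


def to_grid(bitmap: List[str], rows: int, cols: int) -> Grid:
    """Divide a bitmap into rows x cols cells; a cell is 1 if it contains any ink."""
    height, width = len(bitmap), len(bitmap[0])
    grid = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        row = []
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            inked = any(bitmap[y][x] == "#" for y in range(y0, y1) for x in range(x0, x1))
            row.append(1 if inked else 0)
        grid.append(tuple(row))
    return tuple(grid)


def classify(grid: Grid, templates: Dict[str, Grid]) -> str:
    """Return the label of the template that differs from the grid in the fewest cells."""
    def distance(a: Grid, b: Grid) -> int:
        return sum(x != y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    return min(templates, key=lambda label: distance(grid, templates[label]))


# Hand-labeled reference characters (illustrative data only).
LETTER_I = ["..##.."] * 6
LETTER_L = ["##....", "##....", "##....", "##....", "######", "######"]
TEMPLATES = {"I": to_grid(LETTER_I, 3, 3), "L": to_grid(LETTER_L, 3, 3)}

# A slightly noisy 'L': one stray speck of ink and one missing pixel.
NOISY_L = ["##....", "##....", "##..#.", "##....", "######", "#####."]
print(classify(to_grid(NOISY_L, 3, 3), TEMPLATES))
```

Nearest-template matching is only the simplest reading of 'identify and encode'; modern engines learn their features rather than comparing fixed grids.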

This 'identify and encode' approach has changed little since the GISMO apparatus was developed in the early 1950s for the forerunner of the NSA. The main differences are that pixel coordinates have replaced the arbitrary grids of earlier systems; preprocessing has become more automated; and machine learning solutions can greatly speed up the process.

The OCR pipeline

A modern OCR training workflow follows a number of steps:

1: Acquisition
Obtaining non-editable text content from imagery of all types, from flatbed scans of corporate archival material through to live surveillance footage and mobile imaging data.

2: Preprocessing
Cleaning up the source imagery at an aggregate level so that the text is easier to discern, and noise is reduced or eliminated.

3: Segmentation and feature extraction
Scanning of the image content for groups of pixels that are likely to constitute single characters, and assignment of each of them to their own class. The machine learning framework will then attempt to derive features for the recurring pixel groups that it finds, based on generalized OCR templates or prior models. However, human verification will be needed later.

4: Training
Once all features are defined, the data can be processed in a neural network training session, where a model will attempt to develop a generalized image-to-text mapping for the data.

5: Verification and re-training
After processing, humans evaluate the results, with corrections fed back into subsequent training sessions. At this point, data quality may need to be reviewed. Data cleaning is time-consuming and expensive: initial training runs can apply de-skewing, contrast enhancement, and similar methods to obtain a good model with minimal pre-processing, but further laborious refinement of the data may still be necessary.
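Steps 2 and 3 above can be sketched in a few lines of Python. This is a deliberately crude illustration: it assumes a grayscale image supplied as nested lists of 0-255 values, thresholds at the image mean, and splits characters wherever a column contains no ink. Production pipelines typically use adaptive thresholding and connected-component analysis instead; the function names and the sample data are invented for this example.

```python
from typing import List, Tuple


def binarize(gray: List[List[int]]) -> List[List[int]]:
    """Preprocessing: mark a pixel as ink (1) if it is darker than the image mean."""
    pixels = [p for row in gray for p in row]
    threshold = sum(pixels) / len(pixels)
    return [[1 if p < threshold else 0 for p in row] for row in gray]


def segment_columns(binary: List[List[int]]) -> List[Tuple[int, int]]:
    """Segmentation: return (start, end) column spans (end exclusive) that contain ink."""
    width = len(binary[0])
    has_ink = [any(row[x] for row in binary) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a new character candidate begins
        elif not ink and start is not None:
            spans.append((start, x))       # an ink-free column closes the candidate
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


# Three identical rows: light background (250) with two dark strokes (20).
PAGE = [[250, 20, 20, 250, 250, 20, 250] for _ in range(3)]
print(segment_columns(binarize(PAGE)))
```

Each span returned here would then be cropped and passed on to feature extraction and classification.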

OCR as an essential computing utility

For offline processing, where the user waits for an OCR algorithm to extract text from an image (as opposed to the 'live', low-latency OCR used in the Google Translate mobile app), existing commercial and open-source solutions have arguably matured to the point of solving this challenge:

  • OCR is a free native feature of Google Drive and Dropbox, converting PDF, JPEG, PNG and GIF files to editable text.
  • OCR is a native element in the Windows 10+ Universal Windows Platform, making it effectively a core utility that developers can hook into for free.
  • OCR is very well-represented in the FOSS community, with a variety of available engines (see below).

Open-source OCR

A range of FOSS repositories and libraries can be incorporated into a dedicated local OCR framework for automated data collection, though many of them are also leveraged by SaaS OCR providers (see 'Commercial OCR APIs', later).

Tesseract

The Tesseract OCR engine began in the 1980s as a proprietary C/C++ Hewlett-Packard project and, following a decade of neglect, was open-sourced in 2005, since when it has developed under the patronage of Google.

Considered one of the most accurate OCR frameworks, Tesseract's capabilities were widely lauded in the FOSS community, and its associated software, datasets and secondary modules are now effectively perceived as a collective Google initiative.

The core OCR engine is available as a CLI offering on Windows and Linux, though it has less extensive support on the Mac platform.

    Tesseract supports 116 languages by default, though others can be adapted to it. In 2020, the Internet Archive, possibly the largest OCR project of the last twenty years, switched to a Tesseract/OCRopus (see below) workflow, and described Tesseract as having made a 'major step forward' in accuracy in the preceding years.

    Version 4 of Tesseract added a long short-term memory (LSTM) recurrent neural network (RNN) architecture and automatic language recognition. The current maintainers on GitHub note that character images may need a fair bit of cleaning prior to training—a long-time caveat for Tesseract.
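For readers who want to try the engine, the following Python sketch wraps the Tesseract command line. It assumes the `tesseract` binary is installed and on the PATH; the `-l` (language) and `--psm` (page segmentation mode) options are part of the Tesseract CLI, while the helper function names are invented for this example.

```python
import subprocess
from typing import List


def build_tesseract_command(image_path: str, output_base: str = "stdout",
                            lang: str = "eng", psm: int = 3) -> List[str]:
    """Assemble the argument list for a Tesseract CLI call.

    Using 'stdout' as the output base makes Tesseract print the text
    instead of writing a .txt file.
    """
    return ["tesseract", image_path, output_base, "-l", lang, "--psm", str(psm)]


def run_tesseract(image_path: str, lang: str = "eng", psm: int = 3) -> str:
    """Run Tesseract on an image and return the recognized text."""
    command = build_tesseract_command(image_path, "stdout", lang, psm)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


# Page segmentation mode 6 treats the image as a single uniform block of text.
print(build_tesseract_command("scan.png", lang="deu", psm=6))
```

Keeping the command construction separate from the subprocess call makes it easy to log or inspect exactly what is being run before any cleaning or training experiments begin.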

    Interfaces for Tesseract

    Over the last 15 years, a wide range of FOSS and proprietary interfaces and GUIs have emerged to make use of this popular and capable framework, including:

    • gImageReader, a Gtk/Qt front-end
    • YAGF, a graphical front-end that also accommodates Cuneiform
    • OCRFeeder, a document layout analysis system that can develop complex, sophisticated extraction routines and leverages Tesseract for the text conversion portion of these.
    • Free-Ocr-Windows-Desktop, a local Windows application with a straightforward installer that uses Tesseract as the conversion engine.
    • Lime-OCR, a Google-hosted project that explicitly states it is not a front-end for Tesseract, but rather uses ImageMagick functionality to act as a front-end for Tesseract; the creators borrowed this description from Tesseract-GUI.
    • Tesseract-GUI, which offers similar functionality.

    These are just a few of the available interfaces. Depending on the framework in hand and the licensing terms, it's possible to incorporate many of these FOSS packages into a dedicated OCR workflow.

    OCRopus

    Another Google-supported project, OCRopus is a collection of document analysis systems that incorporates OCR and now uses GPU-capable text line recognizers and deep learning layout evaluation tools. OCRopus was originally written in Python, though a separate C++ version, CLSTM, now has its own fork.

    Though these divergent frameworks are not interchangeable, the developers advise that retaining simplified and unified data formatting and storage is the best solution for using either version on the same data. The split between Python and C++ has somewhat fragmented the original clarity of the project, since it leaves the Python version (popular in machine learning ecosystems) with a dimmer future than the faster but more rarefied C++ iteration.

    In either case, OCRopus solves a major shortcoming of many FOSS OCR solutions, since layout analysis is usually an essential part of the OCR pipeline and is integrated in this case.

    OCRopus has been famously leveraged as the OCR engine for Google's ReCaptcha algorithm, though its performance has been subject to occasional criticism in this regard.

    Kraken

    There have been a few refugees from the splintered OCRopus project. Among them is Kraken, a CUDA-supported turnkey OCR framework that runs on Linux and macOS and depends on a number of external libraries. It can be installed via pip or Anaconda, and must load recognition models from external sources. Though the project features a public model repository, it currently only contains the generalized English language model and a model for Syriac text.

    Calamari OCR

    Another OCRopus dissident, Python 3-based Calamari OCR is a CLI-only framework also derived from Kraken. It offers a model repository with an emphasis on historical rather than contemporary textual sources, in which French is the primary alternative language to English.

    Keras OCR

    The Python-based deep learning API Keras offers a convolutional recurrent neural network (CRNN) for text recognition which has been utilized in several modular FOSS repositories, including Simple digit OCR (for tf.keras 2.1) and keras-ocr, which is easier to integrate into a new framework and leverages the Character-Region Awareness For Text detection (CRAFT) text detector, originally released as a PyTorch project.

    EasyOCR

    EasyOCR is a well-maintained repository that supports more than 80 languages, offers a demo site, and covers all popular script types, including Latin, Cyrillic, Chinese and Arabic. It installs via pip on Linux, runs via PyTorch on Windows, can be deployed via Docker, and supports CUDA.
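A minimal sketch of calling EasyOCR from Python is shown below. `easyocr.Reader` and `Reader.readtext` are the library's documented entry points, and `readtext` returns a list of (bounding box, text, confidence) tuples. The 0.5 confidence cut-off, the helper names and the sample detections are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Each detection is (bounding box, recognized text, confidence between 0 and 1).
Detection = Tuple[Sequence, str, float]


def confident_text(detections: List[Detection], min_confidence: float = 0.5) -> List[str]:
    """Keep only the text of detections at or above the confidence threshold."""
    return [text for _box, text, confidence in detections if confidence >= min_confidence]


def read_image(image_path: str, languages: Sequence[str] = ("en",)) -> List[Detection]:
    """Run EasyOCR on an image file (requires `pip install easyocr`)."""
    import easyocr  # imported lazily so the helper above works without the package
    reader = easyocr.Reader(list(languages))  # downloads model weights on first use
    return reader.readtext(image_path)


# Illustrative detections in the shape EasyOCR returns, not real output.
SAMPLE = [
    ([[0, 0], [40, 0], [40, 12], [0, 12]], "STOP", 0.98),
    ([[0, 20], [40, 20], [40, 32], [0, 32]], "5T0P", 0.21),
]
print(confident_text(SAMPLE))
```

In practice the threshold would be tuned against verified samples from the project's own data.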

    Commercial OCR APIs

    API-based FAANG and mid-level corporate OCR offerings are likely to outperform most FOSS solutions out of the box, because:

    • With sales models predicated on ease of adoption, they'll bend over backwards to provide alluringly facile automation pipelines.
    • They've already traversed the adoption pain barrier for you, in terms of implementing many of the aforementioned FOSS packages and recognition models into a functional OCR pipeline, which may make the FOSS option seem like 'reinventing the wheel'.
    • They're constantly feeding high-volume customer experience back into the core offering.

    However, their inevitable attendant cost is also accompanied by uncertainty around future pricing policies, possible issues with governance, and the need to commit to hybrid or cloud-based OCR framework models, or else accept that an on-premises model that hooks into cloud-based commercial APIs will be left with some risky external dependencies.

    Nonetheless, for network-based projects (such as mobile app development) with manageable and controllable API call volumes, a commercial API may be the ideal solution. Admittedly, FAANG-level connectivity, latency and uptime can be hard to match.

    Alternatively, a short API subscription can be useful as an easy and low-effort proof-of-concept proxy service that can eventually be replaced by dedicated proprietary infrastructure.

    The challenge of comparing SaaS OCR offerings

    First-tier OCR API services are idiosyncratic, with fragmented use cases and multiple factors that make like-for-like evaluation problematic. To boot, the market leaders in cognitive automation not only offer different products for different types of OCR scenario (such as for signs and for documents, see below), but vary among themselves in terms of architecture, features, available template datasets, modularity, and processing pipeline capabilities.

    Added to this, the major OCR providers update their offerings frequently, which makes accurate long-term comparison a challenge. Periodically, new tests come online to compare factors such as text prediction error count and accuracy rates across SaaS OCR services from a small section of the largest providers.
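One common way to put a number on 'error count' and 'accuracy' is the character error rate (CER): the edit distance between the OCR output and a verified transcription, divided by the length of the transcription. The sketch below is a plain-Python version under that assumption; the comparisons mentioned above may use different metrics, and the function names are invented for this example.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the verified reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substitution (o -> 0) and one dropped letter (i): 2 errors over 9 characters.
print(round(character_error_rate("road sign", "r0ad sgn"), 3))
```

Running the same verified sample set through each candidate service and comparing CER gives a like-for-like figure on your own data, which matters more than a published ranking.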

    These sporadic surveys rarely encompass a wide enough range of SaaS offerings and frequently include commercial standalone software that is difficult or costly to incorporate into a pipeline, such as ABBYY FineReader.

    Since the customer use-case and data will be particular, and SaaS OCR test rankings are constantly in flux, the best approach is to take advantage of initial free credits and trial periods and to develop a modular OCR framework that can switch relatively easily between APIs to accommodate an exploratory phase for the project.
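One possible shape for such a modular framework is a small registry that hides each provider behind the same function signature, so that swapping APIs during the exploratory phase is a one-line change. The sketch below is an assumption about how this could be organized, not a prescribed design; the two 'vendor' functions are hypothetical stand-ins for real API wrappers.

```python
from typing import Callable, Dict

# A backend is any callable that takes raw image bytes and returns text.
OcrBackend = Callable[[bytes], str]


class OcrRouter:
    """Registry that lets calling code pick an OCR backend by name."""

    def __init__(self) -> None:
        self._backends: Dict[str, OcrBackend] = {}

    def register(self, name: str, backend: OcrBackend) -> None:
        self._backends[name] = backend

    def extract(self, name: str, image_bytes: bytes) -> str:
        if name not in self._backends:
            raise KeyError(f"No OCR backend registered under {name!r}")
        return self._backends[name](image_bytes)


# Hypothetical stand-ins for real API wrappers, used only to show the wiring.
def fake_vendor_a(image_bytes: bytes) -> str:
    return f"vendor A saw {len(image_bytes)} bytes"


def fake_vendor_b(image_bytes: bytes) -> str:
    return f"vendor B saw {len(image_bytes)} bytes"


router = OcrRouter()
router.register("vendor_a", fake_vendor_a)
router.register("vendor_b", fake_vendor_b)
print(router.extract("vendor_b", b"\x89PNG"))
```

Because the rest of the pipeline only ever calls `router.extract`, a trial subscription can later be replaced by a local FOSS engine without touching downstream code.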

    Google Cloud Vision (GCV) API

    The search giant offers two types of text detection as API calls: Text Detection and Document Text Detection. The first is aimed at sparse amounts of text in images (such as images of signs for AR/VR or navigation products), and the second is a more traditional document OCR functionality.
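The two modes map onto two feature types in the Cloud Vision REST API, `TEXT_DETECTION` and `DOCUMENT_TEXT_DETECTION`, both sent to the `images:annotate` endpoint. The Python sketch below only builds the JSON request body; authentication and the HTTP call are deliberately left out, and the helper name is invented for this example.

```python
import base64
import json

ANNOTATE_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_annotate_request(image_bytes: bytes, dense_document: bool) -> dict:
    """Build an images:annotate request body for one image.

    dense_document=False selects sparse text detection (signs, labels);
    dense_document=True selects document-oriented OCR.
    """
    feature = "DOCUMENT_TEXT_DETECTION" if dense_document else "TEXT_DETECTION"
    return {
        "requests": [
            {
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": feature}],
            }
        ]
    }


# The body would be POSTed to ANNOTATE_URL with valid credentials (not shown).
body = build_annotate_request(b"abc", dense_document=True)
print(json.dumps(body, indent=2))
```

Switching between the two modes is therefore a single field in the request, which makes it cheap to test both against the same sample images.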

    GCV can make use of AutoML Vision, a proprietary model-training framework designed to ease the creation of datasets and training data.

    As with Amazon's OCR services (see below), GCV OCR is a potential bottomless pit, depending on your needs, with a pricing list that is in itself a vast and bewildering document. However, GCV OCR currently comes with an initial $300 of free credits applicable to up to 20 free product registrations.

    Amazon Textract / Rekognition

    As with Google Cloud Vision, Amazon offers two distinct OCR APIs: Amazon Rekognition, for individuating small text amounts in the wild; and Amazon Textract, for a traditional document-based OCR pipeline.

    The fragmentation doesn't end there: Textract itself is subdivided into the Detect Document Text (DDT) API (vanilla OCR) and the Analyze Document (AD) API (key-value pair and content extraction, including OCR).
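In Python, both Textract operations are exposed through the boto3 `textract` client (`detect_document_text` and `analyze_document`), and both return a list of `Blocks` whose `BlockType` distinguishes pages, lines and words. The sketch below pulls the recognized lines out of such a response. The sample response is hand-written for illustration, and the helper names are invented for this example.

```python
from typing import List


def extract_lines(response: dict) -> List[str]:
    """Collect the text of every LINE block in a Textract response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def detect_text(image_bytes: bytes) -> dict:
    """Call the vanilla OCR operation (requires boto3 and configured AWS credentials)."""
    import boto3  # imported lazily so extract_lines works without the SDK
    client = boto3.client("textract")
    return client.detect_document_text(Document={"Bytes": image_bytes})


# Hand-written response fragment in the shape Textract returns, not real output.
SAMPLE_RESPONSE = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice 42"},
        {"BlockType": "WORD", "Text": "Invoice"},
        {"BlockType": "WORD", "Text": "42"},
        {"BlockType": "LINE", "Text": "Total due"},
    ]
}
print(extract_lines(SAMPLE_RESPONSE))
```

The Analyze Document operation additionally takes a `FeatureTypes` list (for example `["TABLES", "FORMS"]`) and adds further block types for the structured content it finds.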

    AWS pricing is, famously, a potential minefield. Textract is billed on a per-page basis, with DDT currently costing $0.0015 per page for the first million pages and $0.0006 per page thereafter. AD is tiered in the same way, at a higher per-page rate that drops to $0.01 per page after a million pages.
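To see what the tiered DDT rates mean in practice, the following Python sketch applies the two per-page prices quoted above. It assumes the million-page tier is counted per billing period and ignores free-tier allowances and regional price differences; the function name is invented for this example.

```python
from decimal import Decimal

FIRST_TIER_PAGES = 1_000_000  # pages billed at the higher rate


def detect_document_text_cost(pages: int,
                              first_rate: Decimal = Decimal("0.0015"),
                              later_rate: Decimal = Decimal("0.0006")) -> Decimal:
    """Estimate the DDT charge in dollars for a number of pages."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    first_tier = min(pages, FIRST_TIER_PAGES)
    later_tier = max(pages - FIRST_TIER_PAGES, 0)
    return first_tier * first_rate + later_tier * later_rate


# 1.5 million pages: 1,000,000 x $0.0015 + 500,000 x $0.0006 = $1,800
print(detect_document_text_cost(1_500_000))
```

Decimal arithmetic is used here so that fractions of a cent do not pick up floating-point rounding noise when volumes are large.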

    The AWS Free Tier applies, with new AWS customers getting 1,000 pages free per month using the Detect Document Text API, and up to 100 pages per month using the Analyze Document API. It should be noted that AWS pricing varies greatly in real terms between geographic regions.

    Rekognition's pricing model is even more granular, split between image, video and custom labels analysis. The company provides a complex online calculator to help estimate potential Rekognition API costs. We don't have room in this article to list all the explicit per-unit pricing that Amazon currently makes available.

    Microsoft Azure Computer Vision (CV) API

    As an example of the near-impossibility of comparing prices across FAANG OCR APIs, Microsoft's text recognition services (which are just as region-volatile as AWS in terms of cost) are just one aspect of Microsoft Computer Vision, with pure OCR actually a well-hidden feature of the Read client library (C#, Python, Java, JavaScript or Go) or REST API.

    To access this functionality, you'll need an Azure subscription and (for the C# client) a current .NET Core install or the Visual Studio IDE. You then create a Computer Vision resource in the Azure portal to obtain a key and instance endpoint.

    After that, you'll be subject to generic Microsoft Computer Vision pricing, where available transactions per second rise with the price, and where the prices themselves are split into 15 features across four categories. Though we cannot list all these prices here, the 'Read' section currently varies from $1.50 per 1,000 transactions (for up to a million total transactions) to $0.60 per 1,000 transactions (for more than a million transactions).

    Other Commercial OCR APIs

    A broader range of mid-level commercial OCR APIs are available, including:

    • Cloudmersive Optical Character Recognition API: OCR features among Cloudmersive's range of APIs, with support for 90 languages and automatic segmentation and preprocessing. A complex hierarchy of pricing ranges from 'SME' to 'Government'.
    • Free OCR API: Free OCR API defies its name by providing Enterprise tiers in its OCR offering, which boost the permitted page length from a fairly useless (watermarked) three pages to 999+ pages. No intermediate tiers are available, and the Enterprise plan is currently set at $299 per month.
    • Mathpix API: Mathpix OCR offers an API aimed at STEM companies, with exceptional support for the extraction of mathematical formulae, which are translated into a proprietary markdown format. Pricing is quite well-hidden in a little-publicized 'Pro' tier for customers exceeding 50 API calls per month, with the OCR functionality apparently a facet of the company's wider Snip offering, which offers unlimited Snips for $4.99 per month.

    Conclusion

    While SaaS solutions may help to kick-start an on-premises OCR workflow and can be useful in developing its base architecture, there are at least three strong reasons to consider FOSS OCR engines in a custom text extraction pipeline. Firstly, the pain barrier of pre-processing character images and training models is not much ameliorated in the otherwise glossy FAANG OCR world, because most data is quite idiosyncratic and uncleaned; secondly, the best of the FOSS solutions—such as Tesseract—represent stable software maintained by active and industry-engaged contributors.

    Finally, beyond the initial effort of adapting the software to the company's needs, the future costs of a FOSS library are known—a fortunate situation that SaaS APIs can't replicate.