Photo by — @jromeo

Document Analysis and Recognition with ML

Summary of the preface and introduction of the book by Simone Marinai and Hiromichi Fujisawa

It is incredible what resources can be found online. There is a lot of free material out there, on YouTube of course, but there are also plenty of books. One of them is Machine Learning in Document Analysis and Recognition by Simone Marinai and Hiromichi Fujisawa, and I thought it would be interesting to go through it for some of my articles. It must be said that the book is from 2008, so much of the information will likely be heavily outdated. Still, with that in mind, perhaps it can be viewed more as a historical document. We will see. Today I will start with the preface and parts of the introduction.

What is the objective of Document Analysis and Recognition (DAR)?

According to the authors it is:

“To recognise the text and graphical components of a document and to extract information.”

They date the history of the field back to papers from the 1960s.

Apparently Optical Character Recognition (OCR) engines are some of the most widely recognised products of the research in this field.

A well-known problem is the recognition of handwritten characters. It was used in the past as a benchmark for evaluating machine learning algorithms, especially supervised classifiers.

A great deal of emphasis has been placed on character recognition and word recognition.

Yet there are other tasks, such as pre-processing, layout analysis, character segmentation, and signature verification, that have benefited greatly from machine learning algorithms.

The book is a collection of research papers and reviews of what was seen as the state of the art at the time (and it includes contributions from researchers widely acknowledged to have made the breakthroughs in deep learning in the 2010s).

They attempt to identify:

  1. Good practices for the use of learning strategies in DAR.
  2. DAR tasks more appropriate for these techniques.
  3. New learning algorithms that may be successfully applied to DAR.

The first chapter contains an in-depth introduction to the field as it was at the time.

Among other things, their goals were that the book should:

  • “…contribute to stimulate new ideas, new collaborations and new research activities in this research arena.”
  • “…link together the DAR research with the machine learning one.”

“Document Analysis and Recognition (DAR) aims at the automatic extraction of information presented on paper and initially addressed to human comprehension. The desired output of DAR systems is usually in a suitable symbolic representation that can subsequently be processed by computers.”

Paper documents have been a principal instrument for permanent progress.

In those days (2008), most information was still recorded, stored and distributed in paper format. It was noticed then, and is perhaps even clearer now, that:

“The widespread use of computers for document editing, with the introduction of PCs and word-processors in the late 1980’s, had the effect of increasing, instead of reducing, the amount of information held on paper.”

They noted too that “…the use of paper as a media for information exchange is still increasing.”

They mentioned that the most widely used applications of DAR were the processing of office documents (such as invoices, bank documents, business letters, and checks) and automatic mail sorting.

Scanning combined with powerful computers and OCR packages made it possible to solve many simple recognition tasks for most users.

The development of a DAR system requires integrating several areas of expertise in computer science, among them:

  • Image processing.
  • Pattern recognition.
  • Natural language processing.
  • Artificial intelligence.
  • Database systems.

DAR applications are particularly suitable for the incorporation of machine learning techniques for two reasons:

  1. Classification algorithms are used at several processing levels, from image pre-processing to character classification.
  2. Large collections of manually annotated document images are available and can be used for automatic training of classifiers.
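To make these two points a bit more concrete, here is a minimal sketch of my own (it is not from the book). It assumes a tiny, made-up set of manually labelled 3x3 binary "glyphs" and trains a nearest-centroid classifier on them. The data, the function names, and the choice of classifier are all my assumptions, picked only because they fit in a few lines. Real DAR systems work with far larger annotated collections and much stronger models.

```python
from collections import defaultdict

# Hypothetical toy data: 3x3 binary glyphs flattened to 9 pixels,
# each paired with a manually assigned label.
TRAINING = [
    ((1, 1, 1, 1, 0, 1, 1, 1, 1), "O"),
    ((1, 1, 1, 1, 0, 1, 1, 1, 0), "O"),
    ((0, 1, 0, 0, 1, 0, 0, 1, 0), "I"),
    ((0, 1, 0, 0, 1, 0, 1, 1, 0), "I"),
]


def train(samples):
    """Average the pixel vectors per label to get one centroid per class."""
    sums = {}
    counts = defaultdict(int)
    for pixels, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(pixels)
        for i, value in enumerate(pixels):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [total / counts[label] for total in vector]
        for label, vector in sums.items()
    }


def classify(centroids, pixels):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def distance(label):
        return sum((c - p) ** 2 for c, p in zip(centroids[label], pixels))
    return min(centroids, key=distance)


centroids = train(TRAINING)

# An "O" with one corner pixel missing, not seen during training.
noisy_o = (1, 1, 1, 1, 0, 1, 0, 1, 1)
print(classify(centroids, noisy_o))
```

The same pattern, labelled examples in and a trained classifier out, is what the authors have in mind at every level from pre-processing to character classification, just at a much larger scale.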

With their introductory chapter they provided a guide to the DAR field at the time.

DAR Applications

The authors proposed splitting DAR applications into two broad categories:

  • Business-oriented. “Office documents reach a total of more than 85% of the amount of new original information stored on paper in the world. It is therefore not surprising that business-oriented applications received a great interest.”
  • User-centered. “…software tools, such as OCR software for general purpose PCs, that can be used to process personal information originated in paper form.”

Other applications included tools aimed at improving access to the objects in digital libraries and the processing of historical documents. Large collections of digitised documents had become available on the Internet, to both scholars and the general public.

“In most DAR applications the document content is conceptually described by means of: the (1) physical and the (2) logical structures.

  1. The physical structure describes the visual aspect of the document by representing the basic objects and their mutual positions.
  2. The logical structure assigns to each object a suitable meaning.” [bold and numbers added]
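One way to picture the difference between the two structures is as a small data structure. The sketch below is my own illustration, not something from the book: the class names, the fields, and the toy labelling rule (the topmost region is the title) are all assumptions made to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Physical structure: a basic object and its position on the page."""
    x: int
    y: int
    width: int
    height: int
    text: str


@dataclass
class LabeledRegion:
    """Logical structure: the same object plus the meaning assigned to it."""
    region: Region
    role: str


def label_regions(regions):
    """Toy rule (an assumption): the topmost region is the title, the rest are paragraphs."""
    ordered = sorted(regions, key=lambda r: r.y)
    return [
        LabeledRegion(region, "title" if index == 0 else "paragraph")
        for index, region in enumerate(ordered)
    ]


# Hypothetical page with two text regions, listed in arbitrary order.
page = [
    Region(x=50, y=200, width=500, height=300, text="Dear customer, ..."),
    Region(x=50, y=40, width=500, height=60, text="Invoice"),
]

for item in label_regions(page):
    print(item.role, "->", item.region.text)
```

In a real system the labelling step is where a learned classifier would typically replace a hand-written rule like this one.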

That’s what I got for today! Hope that was useful or made you interested to read more.

This is #500daysofAI and you are reading article 404. I am writing one new article about or related to artificial intelligence every day for 500 days.

AI Policy and Ethics at www.nora.ai. Student at University of Copenhagen MSc in Social Data Science. All views are my own. twitter.com/AlexMoltzau