Sunday, February 7, 2010

Character Recognition and Google Tesseract

Introduction
Optical character recognition is usually abbreviated as OCR. It is a type of computer software designed to translate images of handwritten or typewritten text, usually captured by a scanner, into machine-editable text. OCR is a field of research in pattern recognition, artificial intelligence and machine vision. Optical character recognition simply involves reading text from paper and translating the images into a form that the computer can manipulate.
The ability to recognize the characters is a basic to interpreting the printed language. For the computer the character on a page is just like an image or the object to be recognized. Humans recognize the visual objects with so little effort but for the computer, it is performed as task called OCR for reading the typewritten or handwritten characters.

Types of Character Recognition Systems:
A character recognition deal with the problem of reading handwritten/typewritten character offline i.e. at some point in time after it has been written. However recognition of unconstrained handwritten text can be very difficult because characters cannot be reliably isolated especially when the text is cursive handwriting. They are classified as the following two types
  • Offline CR.
  • Online CR.
A brief description is given.

1) Online Character Recognition:
In case of online character recognition there is real time recognition of characters. Online systems have better information for doing recognition since they have timing information and since they avoid the initial search step of locating the character rather then the offline character recognition. Online systems obtain the position of the pen as a function of time directly from the interface.
Offline recognition of characters is a challenging problem because of the complex character shapes and great variation of character symbols written in different styles.

2) Offline Character Recognition:
In case of offline character recognition the typewritten/handwritten character is typically scanned in form of a paper document and made available in the form of a binary or gray scale image to the recognition algorithm. Scanning and binarization may cause the additional challenges to the algorithm. Therefore offline character recognition is considered more challenging task then online character recognition.
There are two major steps involved in the character recognition after the scanning of the document.
  • Document Analysis:
The process of extraction of text from the document is called as document analysis. It depends largely on the quality of the scanned image.
  • Character Recognition:
The character recognition consists of two essential parts:
Feature extractor: it derives the set of features that the character possesses.
Classifier: The derived features are input to the character classifier for the classification.

3. Analysis:

Problem Analysis: 
Optical character recognition is always a field of interest for the computer scientists. Initially the goal was to identify the characters that are usually of single font and size. Later on the problem was extended by recognizing the different fonts and style characters, also the hand printed text, which is still a great challenge. Humans do not print the characters consistently. The basic problem is to identify the features of the characters that are used to classify the characters. If a good set of features is found, then the classification problem becomes simpler.
The general problem of recognizing a character that was printed by hand is made difficult by the inconsistency of characters. Humans do not print the characters consistently that’s a big problem of today’s OCR systems.
There is no way to deal with all types of fonts and styles. However various methods have been used to identify the unique features of a particular character.
The character recognition comprises of two main steps:
· Feature extractor
· Classifier
Feature extractor: It is used to extract the set of features corresponding to the character that it possesses.

Classifier: The derived features are then given to the classifier for the recognition.
Basic problem is to identify the features in the character images that can be used to classify the image. If a good set of features for the character can be found then the problem can be easily solved. We will take a bird eye view of some of the methods of identifying the features of the character that it possess in order to explore the OCR systems
As the problem is to identify the individual character from an image, we can make the classes of characters based on their unique characteristics, which allow us for the further analysis of the characters.

Analysis of features:
We will take an overview of some of the methods of identifying the features of the character that it possesses for the purpose of feature analysis of the characters to classify them.

1) Counting of holes:
A simple feature of the characters is the number of holes a character contain. This demarcation divides the set of all characters into three classes. One with one hole such as A, D, O, P, Q, a, b, 6, 9, 0 etc. second class having two holes such as B and 8. the third class with no holes such as C,E,G,F,m,n,l,2,3 etc.

2) Identification of end points:
Another feature is the number of end points of the character along the corners. Such as A have two end points along bottom. X has 4 end points along 4 corners.

3) Density:
The density of an image is the ratio of number of black pixels to the total area of the region. Some characters are denser than the others.
Another feature vector can be manipulated by dividing the image into grid of equal sized cells say 3x3. And counting the number of pixel in each grid and then dividing by the area of the region will give the density of the character image.

4) Perimeter:
The normalized perimeter can also be feature of an image, which is simply the perimeter divided by the area of the box bounding the image of the character.

5) Projections:
A set of projections can also define the characteristic vector. A projection is simply the number of pixels in a particular direction. Such as the horizontal projection is the sum of all pixels in a specified row. The horizontal and vertical projections are sometimes very useful.

6) Characteristic Loci:
One of the older and interesting ideas is the use of the characteristic loci. In this method each white (background) pixel is examined to see on which side the object resides. All the pixels in the same row are examined and the white to black transitions is counted. If there is no such transition the object must be below. If the object pixels are in all directions then the object must be in the hole such as 0. The loci would be computed by looking up, down, left and right the background pixels and identifying where the object resides.

7) Stroke identification:
Another approach is to identify the strokes in the character. A linear stroke is simply the longest set of line segments in the character. Any 1-connected pixel can be the start of a stroke. Any such pixels is located and then find the longest line that begins at the selected starting pixel, passes only through the black pixels and ends at the skeletal pixel. The only restriction is that the line should not cross any of the background pixels. The equation of the line that passes through two end points in the normal form is:
a x + b y + c = 0
The coordinates of end points define in this way called the stroke. Now the pixel belonging to the stroke should be removed in order to find the next stroke. The pair of coordinates making the stroke should be saved if it is not too much short. The stroke can be used to construct the graph of the character. The position and orientation of each stroke can be used to make the feature vector for the character.
These features perform the initial analysis on the characters in order to categorize the characters in the separate groups or classes. These features are grouped into a set of vectors called feature vectors and these feature vectors are used for the identification of the characters during the process of template matching, which is an older technique to recognize the characters.


7) Optical Character Recognition Techniques:
There are various methods for the character recognition that are used in general. We will discuss many of them. Our approach to recognize the character would be to use the template matching process which is discussed in following section.

7.1) Template Matching:
Template Matching or matrix matching is one of the most common classification methods. The general problem of reading the text can be handled by the process of template matching. If a good image is available and the printing quality of the image is good enough, then many of the character images from the data set can be saved as the templates. Consider the text image that has been thresholded and is therefore a bi-level image. How can the template matching be used here? A scanned image is given to input of the system. Various processing have been done before bringing the image to the input as specified above.
The image features are extracted from the given input character. The individual image pixels are used as features. Classification is performed by comparing an input character with a set of templates from each character class. Each comparison results in a similarity measure between the input characters with a set of templates. One measure increases the amount of similarity when a pixel in the observed character is identical to the same pixel in the template image. If the pixels differ the measure of similarity may be decreased. After all templates have been compared with the observed character image, the character’s identity is assigned the identity of the most similar template. Template matching is a trainable process as template characters can be changed.
A simple way to setup the template library is to create an array of template lists, one list for each character. The listed can be sorted according to their ASCII codes. The purpose of the list is to store the different fonts and style images of the same character. There may always be the need of new of new templates; the lists would make it easy to add the templates later.
The database of characters or libraries has the ability to learn a new font or size of characters. The simple approach is just to ask the user of the program about the new templates. As the characters are extracted, they are matched against the templates. If a match for the certain character is not found, it may be asked from the user that what the character it is. The system then stores this character template in the database. The next time the system encounters this image it will be recognized successfully.


7.1.1) Characteristic Vector:
The process of template matching requires a unique set of characteristics that uniquely identifies that character. What this feature vector or characteristic vector contains? Every character has a unique set of geometrical properties associated with it, which differentiates it from the rest of the alphabets. One particular characteristic may not be sufficient to enable recognition but the set as a whole would make every alphabet unique.
A character can be quantified in the form of set. We call this a vector rather as the order of the elements is significant. This set is called as the characteristic vector of the character. A database is created to store the characteristic vectors corresponding to characters. Then the methods can be defined to compare and find the closest match of the characteristic vector from the database. The alphabet that is found to be the closest match is considered to be the recognized character for the unknown character.

8) Methods:
Various methods have been developed that use the template matching. Some of them are discussed below that uses the template matching.

8.1) OCR on Simple Perfect Images:
Consider that an image that has been thresholded called the bi-level image is given to the system for recognition. In character recognition problem, there would be no more than 128 characters that need to be identified. The system will check an incoming glyph against all the possible characters and then classify the input as the best match.
In this case we assume that the orientation of the image is perfectly horizontal. First thing is to discover the position and extent of the lines of the text in the image. It can be done by searching for the horizontal projection and then search for the minima. The projection is simply the sum of all the pixel values in the specified direction, so a horizontal projection is the sum of the pixels in each row.
The row of the pixels that begin with the new line will be the one in which the some of the pixels are black and last row will be the one having any black pixels. The start and end columns for the line are found by searching the rows that belong to that line, from column zero through to the first column having a set pixel. The same idea can be used to find the last pixel in the line but in reverse direction.
The bounding box for the text lines is known. The height, width and spacing for individual character can be measured and saved in a database for this particular font. This type of database takes a very simply from having an array of containing the list at each array index that contains he characters with different fonts and styles.
This is the simplest from of the database for character recognition: template matching. In this case the pixel by pixel match is done between the pixels of the input image and the saved character template.
The spacing between the characters can be determined by any character having a distance greater than the mean plus the width of the space as assumed by the training image.
There are some other problems for some characters using this method. The quotes can be misinterpreted as quotes, but this could be resolved by noting that the comma appears at the bottom on the line while the quotes on the above of the line. The font should be the same for the training image and the input image.

8.2) OCR on Scanned Images:
Now we consider the case when the scanned image is input to the system of optical character recognition. Most scanners produce the gray level image, so image must be threshold.
Another issue is the resolution of the printer which is not finite. This means that a
Document when scanned will affect the pixel values over the entire image and the characters will have the have a little different pixel values with respect to horizontal and vertical position on the page so will have a number of possible templates after threshold.
After threshold the characters may touch each other and seems like a single glyph. The problem is that where to split the glyph in order to get the individual character glyphs. This problem can be solved upto some extent by splitting the glyph as some documents uses the fonts in which the spacing between the character is the function of the character. Another problem is that the scanning process could add the noise to the image. And when threshold it would produce the random black pixels on the image. So we should first reduce or remove the noise.

8.2.1) Noise Removal:
There are some ways to reduce the noise from the image. These include:
First way is to get the multiple images of the same page and averaging the gray levels of each pixel across the al samples will reduce the noise. For example averaging the four samples will cut the noise into half.
Another way is to take the multiple images of the same page, threshold them and take only those pixel values that are black in all the samples.
A simpler way to reduce the image noise is to use the median filter. But it will also reduce the contrast of the edges that might make the small gaps between the pixels more close. Also its drawback is that the median filter is slow because it sorts all the pixel values in the window through a single pass of the image.
Another technique is to use the mean filter as it is easier to calculate instead of the median filter. It would produce a blurring effect in the pixels of the image because it replaces the pixel with the mean of its neighbors. But it can be tolerated.

KFill Noise Removal Method:
Another issue can be to reduce the noise of the image after it had been threshold. The problem of removing the noise takes the simple form of just filling the small holes and removing the small isolated black regions. A method named KFill (O’ Gorman 1992) filter is used. This method uses a square K×K pixel window passed over the image multiple times. The outer two rows and columns are called the neighborhood and the center portion is called the core. The first pass of the window will set all the core pixels to white (background level) if certain parameters allow this depending on the neighborhood pixels. The second pass will set the core pixel to black again depending on the neighborhood.
For example take k=5; the outer two rows and columns are the neighborhood and the only inner pixel is the core pixel. On first pass of the 5×5 window over the image, we look for the black core pixels. And measure the values:
  • The total number of white pixels in the neighborhood.
  • The number of white corner pixels called r.
  • The number of distinct connected white regions encountered in neighborhood called c.
If c has any values except one then left the window and process the next one.
Assume that c = 1, then the core pixels is set to white if:
k k
n > 3 k – V n = 3 k - ^ (r = 2)
3 3
After the first pass the isolated black pixel are removed. And in the second pas the isolated white pixels are removed.

8.2.2) Isolating the Individual Glyphs:
The connected glyphs are one of the problems in the preprocessing. An example is the letter m which can be split into two good matches of r and n. Template matching will give the poor results if one or more characters are connected to each other due to the noise or under sampling.
Its solution is to locate the minima in the vertical projection and then segment the image at those places. But this has some problems as well. It would split the characters such as h and n into two glyphs as both have the local minima in their vertical projection.

Break Cost Method:
One of its related techniques is to compute the break cost (Tsujimoto 1991), which uses the two adjacent columns. It is defined as the number of black pixels in a column that also have black neighbor in the same row of the previous column and the columns in which the break cost is minimum are the points at which to separate the glyphs. But this would also not give the perfect result in all the cases. For example if the letters “th” in the figure were connected by more than one pixel than this algorithm would fail to split these two characters.
The final strategy would be to check the characters word by word by comparing them from the dictionary, and correcting the spelling accordingly will tell us about the accuracy of segmentation.

8.2.3) Orientation:
When using the scanner to digitize a document, orientation could also generate the problem. The page lines of the text should be horizontal. The orientation of the lines of text is called skew angle and its measurement is called the skew detection.
We can perform the skew detection by using the following methods:

Skew detection:
If the skew angle is known the horizontal projection can be used. It is used to identify the position of the lines of the text. If the skew angle is large, then there will be no parts of the projection that makes a completely white band between the lines.
Another method could be used to measure the skew angle. This algorithm was proposed by Baird in 1987. He suggests that all the characters that do not have the descended, characters other than g, j, p, q, and y, the bottom of the characters in each line are collinear. Then a bounding box of the glyph can be found, the center point of the bottom edge of the boxes should be collinear for each line of text.
The steps of this algorithm are:
1. Identify all the connected regions, assuming that each represents a character.
2. Find the bounding box of each region or character and locate the center of the bottom edge of the bounding box.
3. For a given angle Θ, compute the projection of the points obtained in step 2, which gives a one dimensional projection array P (Θ).
4. maximize the function:
n
A (Θ) = Σ P 2 i (Θ)
i=1
Where n is the number of bins. The angle that gives the maximum is the correct skew angle.

8.2.4) Normalized Match Index:
The template matching work well on the perfect images having no noise and separated glyphs but will cause errors for the scanned images. Reducing these factors would allow making the good character recognition possible. It is important to use the wide variety of templates for different fonts and styles and save them in the database for the variants of the same character.
A general formula can be used to measure the accuracy of the character recognition called the Normalized Match Index. It is found by counting the number of pixels that match and the pixels that do not match.
Number of pixels that match = M +
Number of pixels that match = M
It is computed as:
NMI = M + - M
M + + M
This value has the minimum of -1 and a maximum of +1.

9) Statistical Recognition:
In statistical approach many features are combined into a large feature vector. A feature is simply a measurement made o a glyph and combining them will make a vector containing these features. A successful statistical classifier depends on the smart collection of features (measurements or properties). Usually large vectors are not used because as the size of the feature vector grows the execution time also grows.
There are many features, such as the ratio of the area of the glyph to its perimeter can characterize the shape of the glyph. As for example, area of the circle is computed by
a = ∏ r 2
and the perimeter by
p = 2 ∏ r.
The ratio of p2 /a is 4 ∏ r 2 / ∏ r 2. Hence the expression:
C = p2 / 4 ∏ a
Will be unity or nearly for a circular object and will increase in value as the shape of the region deviates from the circle. This measure called the circularity can be one of the features. Another feature can be the aspect ratio of the object which is defined as the ratio of the object height’s H 
to its width W.

9.1) Multiple Features:
Some features consist of the collection of values and are themselves vectors. For example a glyph can be resample to small, constant size. For example to resample the 12 × 12 glyph to a size of 3 × 3 glyph, simply group the pixels into three groups of four in each direction. The value of the new pixel is the average of the pixels in each of the nine regions. The nine pixels in the resulting 3 × 3 image can be stored in consecutive locations in feature vector. This is called a multiple feature.
Other examples of the multiple features include slope histogram, profiles in any direction and signatures. A slope histogram is the histogram of direction of lines comprising the boundary of the object. A profile is the sum of the pixel values in some specified direction.

10) Use of Edges to Recognize a Character:
A character or glyph can be seen as a connected black region. The character can be represented by its basic shape and topological properties that are present in the edge representation. The edges of the character are detected using any of the edge detection algorithms, then the edges are linked and then classification of the character using the edges would be done.
Edge linking is the process of collecting the pixels into line segments. This is also called vectorization.
It could be done in the following steps:
  1. The object boundary is identified using the edge detection method and all the non- boundary pixels are set to the background level.
  2. Starting from any pixel on the outside boundary, a list of pixels comprising the outline is created.
  3. Starting at the first pixel in the list, add the pixels in the list that contains the pixels that belong to the next line in the boundary.
  4. Check the distance between all the pixels in the set and the real straight line between the first and the last pixel in the set.
  5. If the distance found is less than some threshold, then continue from step 3 using the next pixel in the list.
  6. If the distance is greater than threshold then stop adding the pixels to the current line. Skip the first and last pixel in the set as the line end points. And make the next pixel in the list as the starting point of the next line. Repeat the process from step 3.
This is a classical vectorization strategy, and works well for the smaller images like glyphs or single characters.
When the boundary of the input glyph is measured in vector form then it could be used for the classification by converting the vectorized edges into the from
(Xc, Yc, L, Θ). Where (Xc, Yc) is the center of the line, L is its length and Θ is the angle that the line segment makes with X-axis. This vector is compared with the template vectors, and the best match is output.

Popular OCR libraries:

Asprise:
Asprise OCR is a high performance OCR (optical character recognition) engine, which offers API for Java, VB.NET,CSharp.NET, VC++, VB6.0, C, C++, Delphi on Windows, Mac, Linux and Solaris. Asprise OCR enables you to equip your applications with OCR ability easily.
Features of Asprise OCR include: Highest Level of Accuracy; Excellent Format Retention; High Speed; Ease of Use; Barcode Recognition; Flexible Licensing Scheme. But this library is not free.

GOCR:
GOCR is an OCR (Optical Character Recognition) program, developed under the GNU Public License. It converts scanned images of text back to text files. Joerg Schulenburg started the program, and now leads a team of developers.
GOCR can be used with different front-ends, which makes it very easy to port to different OSes and architectures. It can open many different image formats, and its quality have been improving in a daily basis.

FreeOCR:
FreeOCR is a Windows OCR program including the Windows compiled Tesseract free ocr engine. It includes a Windows installer and It is very simple to use and supports multi-page tiff's, fax documents as well as most image types including compressed Tiff's which the Tesseract engine on its own cannot read .It now has Twain scanning. It cannot open PDF's at the moment but we are working on that. We may include Ghostscript to render PDF documents and our ultimate goal would be to create a searchable PDF. Free OCR uses the Tesseract OCR engine

Tesseract:
The Tesseract free OCR engine is an open source product released by Google. It was developed at Hewlett Packard Laboratories between 1985 and 1995. In 1995 it was one of the top 3 performers at the OCR accuracy contest organized by University of Nevada in Las Vegas. The Tesseract engine source code is now maintained by Google.

Usage:
1. Add a reference of assembly Tessnet.dll to your .NET project
2. Download Tessdata.rar and extract it in c:\temp
3. Read the program.cs for sample application
And that’s it.

Code:
Bitmap image = new Bitmap("c:\\temp\\Graphic1.jpg");
tessnet2.Tesseract ocr = new tessnet2.Tesseract();
//ocr.SetVariable("tessedit_char_whitelist", "0123456789"); // If digit only
ocr.Init(@"c:\temp", "eng", false); // for english language
List result = ocr.DoOCR(image, Rectangle.Empty);
foreach (tessnet2.Word word in result)
Console.WriteLine("{0}", word.Text);

Note:
Tesseract gives much better accuracy on 300+ dpi.

Input: ( After scanning )


Output:
Data Structures By timmag TopCoder Member Even though computers can perform literally millions of mathematical computations per second, when a problem gets large and complicated, performance can nonetheless be an important consideration. One ofthe most crucial aspects to how quickly a problem can be solved is how the data is stored in memory. To illustrate this point, consider going to the local library to find a book about a specific subject matter. Most likely, you will be able to use some kind of electronic reference or, in the worst case, a card catalog, to determine the title and author of the book you want. Since the books are typically shelved by category, and within each category sorted by author's name, it is a fairly straightforward and painless process to then physically select your book from the shelves. Now, suppose instead you came to the library in search of a particular book, but instead of organized shelves, were greeted with large garbage bags lining both sides of the room, each arbitrarily Hlled with books that may or may not have anything to do with one another. It would take hours, or even days, to find the book you needed, a comparative eternity. This is how software runs when data is not stored in an efficient format appropriate to the application. Simple Data Structures The simplest data structures are primitive variables. They hold a single value, and beyond that, are of limited use. When many related values need to be stored, an array is used. It is assumed that the reader of this article has a solid understanding of variables and arrays. V A somewhat more difficult concept, though equally primitive, are pointers. Pointers, instead of holding an actual value, simply hold a memory address that, in theory, contains some useful piece of data. Most seasoned C++ coders have a solid understanding of how to use pointers, and many of the caveats, while fledgling programmers may find themselves a bit spoiled by more modern "managed" languages which, for better or worse, handle pointers implicitly. Either way, it should suffice to know that pointers "point" somewhere in memory, and do not actually store data themselves. A less abstract way to think about pointers is in how the human mind remembers (or cannot remember) certain things. Many times, a good engineer may not necessarily know a particular formula/constant/equation, but when asked, they could tell you exactly which reference to check. Arrays Arrays are a very simple data structure, and may be thought of as a list of a fixed length. Arrays are nice because of their simplicity, and are well suited for situations where the number of data items is known (or can be programmatically determined). Suppose you need a piece of code to calculate the average of several numbers. An array is a perfect data structure to hold the individual values, since they have no specific order, and the required computations do not require any special handling other than to iterate through all of the values. The other big strength of arrays is that they can be accessed randomly, by index. For instance, if you have an array containing a list of names of students seated in a classroom, where each seat is numbered 1 through n, then studentName[i] is a trivial way to read or store the name of the student in seat i. An array might also be thought of as a pre—bound pad of paper. lt has a Hxed number of pages, each page holds information, and is in a predefined location that never changes. Linked Lists, A linked list isfa data structure that can hold an arbitrary number of data items, and can easily change size to add or remove items. A linked list, at its simplest, is a pointer to a data node. Each data node is then composed of data (possibly a record with several data values), and a pointer to the next node. At the end of the list, the pointer is set to null. By nature of its design, a linked list is great for storing data when the number of items is either unknown, or subject to change. However, it provides no way to access an arbitrary item from the list, short of starting at the beginning and 94
Input:


Output:
Statement of Purpose Logical problem solving and learning technology has always appealed to me and this explains my interests in mathematics, programming and computing in general. The decision to study computer science was therefore a simple one. My desire for solving challenging problems led to my choice of studying mathematics and computer at intermediate level. These have been my favorite subjects for the past year and I hope that what I continue to learn will help me to succeed my future. At university, I hope to deepen and expand my knowledge of computing into areas which I have no formal experience of studying. For example, I am eager to leam about artificial intelligence and the mathematical foundations of computing. A computer science degree will provide me with a broad range of exciting career opportunities and I am particularly interested in using the skills gained at university in either research or industry. 

1. Download source code
2. Download  tesseract data files