FIELD OF THE INVENTION
[0001] The invention generally relates to a method and apparatus for automatically identifying
character segments for character recognition. More specifically, the invention relates
to a method and apparatus for training a classifier to automatically identify character
segments for character recognition based on one or more of a word level and a line
level ground truth.
BACKGROUND OF THE INVENTION
[0002] Automatic conversion of scanned documents into editable and searchable text requires
use of accurate and robust Optical Character Recognition (OCR) systems. OCR systems
involve recognition of text from scanned images by segmenting an input image of the
text into characters. To recognize text from scanned images, an OCR system is initially
trained with sample images of characters and their corresponding ground truths. Upon
continuous training of an OCR system to recognize the text in a script, the OCR system
learns to identify different characters in the text.
[0003] OCR systems for non-cursive scripts, such as English text, have reached a high
level of accuracy. One of the main reasons for this high level of accuracy is the
ability to automatically preprocess non-cursive scripts down to isolated characters
to provide as input to the OCR systems. Each character in a non-cursive script can
be isolated due to the inherent characteristic of non-cursive scripts to be non-touching.
Once each character is isolated, a corresponding character level ground truth may
be provided in order to train the OCR system.
[0004] However, with cursive scripts such as an Arabic script, isolating individual characters
in order to train an OCR engine is complex. This is due to the touching nature of
characters written in Arabic script. Additionally, Arabic text may include diacritics,
such as dots and accent marks placed above or below a letter to indicate the pronunciation
of the letter. This inhibits known preprocessing techniques used by OCR systems designed
for recognizing non-cursive text from accurately processing the Arabic text. Further,
many Arabic letters include three or four shapes depending on whether the letter is
placed at the beginning of a word, at the middle of the word, at the end of the word,
or as a standalone letter. These characteristics of Arabic text make it difficult
to automatically segment Arabic text into individual characters.
[0005] Currently, to train an OCR engine for recognizing Arabic text, individual characters
of a word in the Arabic text need to be manually demarcated and the corresponding
ground truths entered for each demarcated character. When a large set of documents
is used to train an OCR engine, the manual demarcation of the characters in a word
and the subsequent entering of the ground truth for each character is tedious and
error prone.
[0006] Therefore, there is a need for a method and apparatus for automatically identifying
character segments for character recognition based on one or more of a word level
and a line level ground truth.
BRIEF DESCRIPTION OF THE FIGURES
[0007] The accompanying figures, where like reference numerals refer to identical or functionally
similar elements throughout the separate views and which together with the detailed
description below are incorporated in and form part of the specification, serve to
further illustrate various embodiments and to explain various principles and advantages
all in accordance with the present invention.
[0008] FIG. 1 illustrates a flow diagram of a method of automatically identifying character
segments for character recognition in accordance with an embodiment of the invention.
[0009] FIG. 2a to FIG. 2c illustrate a schematic diagram depicting segmentation of a word
in Arabic script into character segments across multiple iterations, in accordance
with an embodiment of the invention.
[0010] FIG. 3 illustrates an apparatus for automatically identifying character segments
for character recognition in accordance with an embodiment of the invention.
[0011] Skilled artisans will appreciate that elements in the figures are illustrated for
simplicity and clarity and have not necessarily been drawn to scale. For example,
the dimensions of some of the elements in the figures may be exaggerated relative
to other elements to help to improve understanding of embodiments of the invention.
DETAILED DESCRIPTION OF THE INVENTION
[0012] Before describing in detail embodiments that are in accordance with the invention,
it should be observed that the embodiments reside primarily in combinations of method
steps and apparatus components related to automatically identifying character segments
for character recognition. Accordingly, the apparatus components and method steps
have been represented where appropriate by conventional symbols in the drawings, showing
only those specific details that are pertinent to understanding the embodiments of
the invention so as not to obscure the disclosure with details that will be readily
apparent to those of ordinary skill in the art having the benefit of the description
herein.
[0013] In this document, relational terms such as first and second, top and bottom, and
the like may be used solely to distinguish one entity or action from another entity
or action without necessarily requiring or implying any actual such relationship or
order between such entities or actions. The terms "comprises," "comprising," or any
other variation thereof, are intended to cover a non-exclusive inclusion, such that
a process, method, article, or apparatus that comprises a list of elements does not
include only those elements but may include other elements not expressly listed or
inherent to such process, method, article, or apparatus. An element preceded by "comprises
... a" does not, without more constraints, preclude the existence of additional identical
elements in the process, method, article, or apparatus that comprises the element.
[0014] It will be appreciated that embodiments of the invention described herein may be
comprised of one or more conventional transaction-clients and unique stored program
instructions that control the one or more transaction-clients to implement, in conjunction
with certain non-transaction-client circuits, some, most, or all of the functions
of a method and apparatus for automatically identifying character segments for character
recognition. The non-transaction-client circuits may include, but are not limited
to, a radio receiver, a radio transmitter, signal drivers, clock circuits, power source
circuits, and user input devices. As such, these functions may be interpreted as steps
of methods for segmenting an image for recognizing text in the image. Alternatively,
some or all functions could be implemented by a state machine that has no stored program
instructions, or in one or more application specific integrated circuits (ASICs),
in which each function or some combinations of certain of the functions are implemented
as custom logic. Of course, a combination of the two approaches could be used. Thus,
methods and means for these functions have been described herein. Further, it is expected
that one of ordinary skill, notwithstanding possibly significant effort and many design
choices motivated by, for example, available time, current technology, and economic
considerations, when guided by the concepts and principles disclosed herein will be
readily capable of generating such software instructions and programs and ICs with
minimal experimentation.
[0015] Generally speaking, pursuant to various embodiments, the invention provides a method
and apparatus for automatically identifying character segments for character recognition.
The method involves receiving a plurality of words and a ground truth corresponding
to each word of the plurality of words. Each word of the plurality of words is segmented
into one or more character segments based on the ground truth corresponding to each
word. Thereafter, the segmentation of each word is refined by iteratively re-segmenting
each word based on one or more similar character segments.
[0016] FIG. 1 illustrates a flow diagram of a method of automatically identifying character
segments for character recognition in accordance with an embodiment of the invention.
The method involves receiving a plurality of words and a ground truth corresponding
to each word of the plurality of words at step 102. In an embodiment, the plurality
of words may correspond to a line of text and a ground truth for the entire line of
text may be received. It will be apparent to a person skilled in the art that the
plurality of words may correspond to a paragraph, a zone in a page, a page, or multiple
pages without deviating from the scope of the invention. The plurality of words may
be in a cursive script. For example, the plurality of words may be in Arabic script,
Farsi script, Kurdish script, etc.
[0017] At step 104, each word of the plurality of words is automatically segmented into
one or more character segments (hereinafter referred to as "character segments") based
on the ground truth corresponding to each word. A word is segmented into character
segments based on the number of characters indicated by the ground truth of the word.
For example, if a ground truth of a word indicates that there are four characters
in the word, then the word is divided into four segments. The segmentation of a word
into the character segments represents boundaries for the character segments within
the word. In an embodiment, a word may be segmented by randomly dividing the word
into character segments based on the number of characters indicated by the ground
truth of the word. In another embodiment, a word may be segmented by dividing the
word into character segments based on an average character width associated with each
character in the word. An average character width for a particular character may be
determined by analyzing a document corpus and averaging the width of all occurrences
of the particular character. It will be apparent to a person skilled in the art that
other methods to determine character width may also be employed without deviating
from the scope of the invention.
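By way of illustration only, the two initial segmentation options described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the word is reduced to a horizontal pixel width, the ground truth is a string with one entry per character, and the function names and the avg_width lookup table are hypothetical.

```python
import random


def segment_by_average_width(word_width, ground_truth, avg_width):
    """Split the span [0, word_width) into one segment per ground-truth
    character, sized in proportion to each character's average width.

    avg_width is an assumed lookup table (character -> average width),
    for example one derived from a document corpus."""
    weights = [avg_width[ch] for ch in ground_truth]
    total = sum(weights)
    boundaries = []
    left = 0.0
    for w in weights:
        right = left + word_width * w / total
        boundaries.append((left, right))
        left = right
    return boundaries


def segment_randomly(word_width, ground_truth, rng=None):
    """Split [0, word_width) at randomly chosen, distinct pixel columns so
    that the number of segments equals the number of ground-truth characters.
    Assumes word_width is an integer larger than the character count."""
    rng = rng or random.Random()
    n = len(ground_truth)
    cuts = sorted(rng.sample(range(1, word_width), n - 1))
    edges = [0] + cuts + [word_width]
    return list(zip(edges[:-1], edges[1:]))
```

In both cases the number of segments is fixed by the ground truth alone; only the placement of the boundaries differs, and those boundaries are treated as a starting point to be refined.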
[0018] The segmentation of each word into its constituent character segments randomly or
based on average character width may not be accurate. This may be due to noise associated
with each character segment. The noise may correspond to parts of one or more adjacent
characters that fall within the segment associated with a given character. The noise
may also include foreign segments within the character segment. These foreign segments
may result from quantization noise from imaging light sensors, dirt on imaging device
optics, ink spatters, or toner smudges.
[0019] To minimize noise associated with the character segments of a word, the segmentation
of each word is automatically refined at step 106 by iteratively re-segmenting each
word by comparing the character segments with one or more similar character segments
(hereinafter referred to as "similar character segments"). To determine similar character
segments, the ground truths of the character segments of a word are compared with
ground truths of other character segments in the plurality of words and in a set of
pre-saved character segments. If two ground truths are identical, then the character
segments associated with the two ground truths are considered to be similar character
segments. On comparing similar character segments, the segmentation of each character
segment in the word is refined. Refining the segmentation of each word of the plurality
of words includes determining a plurality of horizontal boundaries and a plurality
of vertical boundaries for the character segments of each word. The plurality of horizontal
boundaries and the plurality of vertical boundaries of the character segments of each
word are then iteratively modified by comparing the character segments of each word
with similar character segments over multiple iterations. Refining the character segments
over multiple iterations eliminates noise associated with character segments as will
be explained in conjunction with FIG. 2a to FIG. 2c.
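The selection of similar character segments by identical ground truth can be sketched as below. This is an illustrative sketch only: the CharSegment record, its fields, and the function names are hypothetical, and each segment is assumed to carry the single ground-truth character assigned to it during the initial segmentation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CharSegment:
    """Hypothetical record for one character segment."""
    word_id: int   # which word the segment came from
    index: int     # position of the segment within that word
    label: str     # ground-truth character for the segment
    left: int      # left boundary (pixel column)
    right: int     # right boundary (pixel column)


def index_by_ground_truth(segments):
    """Group segments by ground-truth character."""
    by_label = defaultdict(list)
    for seg in segments:
        by_label[seg.label].append(seg)
    return by_label


def similar_segments(target, current_segments, pre_saved):
    """Return every other segment, from the current plurality of words or
    from the pre-saved set, whose ground truth is identical to the target's."""
    index = index_by_ground_truth(list(current_segments) + list(pre_saved))
    return [s for s in index[target.label] if s != target]
```

Because similarity is decided purely on the ground truth, no image comparison is needed to select the candidates; the comparison of the segments themselves happens afterwards, during refinement.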
[0020] Further, the character segments associated with the plurality of words along with
the similar character segments associated with the plurality of words may be stored
and added to the set of pre-saved character segments. The character segments along
with the similar character segments may be used for subsequent iterations for refining
the segmentation of each word of the plurality of words. Here, the set of pre-saved
character segments is a dynamically growing set of character segments as character
segments along with similar character segments are added to the set of pre-saved character
segments after each iteration.
[0021] FIG. 2a to FIG. 2c illustrate, by way of example, a word 200 in Arabic script iteratively
segmented into one or more character segments. Initially, word 200 is segmented randomly
or based on average character width into a number of segments based on the number of characters
indicated by a ground truth of word 200. In FIG. 2a, if the ground truth of word 200
indicates word 200 is made up of seven characters, then word 200 is segmented randomly
or based on average character width into seven segments such as a segment 202-1, a
segment 202-2, a segment 202-3, a segment 202-4, a segment 202-5, a segment 202-6,
and a segment 202-7 (hereinafter referred to as segments 202-1 to 202-7). Segments
202-1 to 202-7 define boundaries for a character segment 204, a character segment
206, a character segment 208, a character segment 210, a character segment 212, a
character segment 214, and a character segment 216 (hereinafter referred to as character
segments 204-216) respectively.
[0022] Thereafter, the segmentation of character segments 204-216 is refined by iteratively
comparing character segments 204-216 with one or more similar character segments (hereinafter
referred to as similar character segments). In this case, segments 202-1 to 202-7
are refined to a segment 218-1, a segment 218-2, a segment 218-3, a segment 218-4,
a segment 218-5, a segment 218-6, and a segment 218-7 (hereinafter referred to as
segments 218-1 to 218-7) respectively based on the comparison as indicated in FIG.
2b. In order to determine similar character segments, ground truths of character segments
204-216 are compared with ground truths of other character segments, from the plurality
of words and the set of pre-saved character segments. If two ground truths are identical,
then the character segments associated with the two ground truths are considered to
be similar character segments.
[0023] Upon comparing character segments 204-216 with the similar character segments, a
plurality of horizontal boundaries and a plurality of vertical boundaries are determined
for each of character segments 204-216. The plurality of horizontal boundaries and
the plurality of vertical boundaries for each of character segments 204-216 are indicated
as segments 218-1 to 218-7 in FIG. 2b. Therefore, by refining the segmentation of word
200 based on the comparison, character segments 204-216 are refined to a character
segment 220, a character segment 222, a character segment 224, a character segment
226, a character segment 228, a character segment 230, and a character segment 232
(hereinafter referred to as character segments 220-232) respectively. Refining character
segments 204-216 to character segments 220-232 eliminates parts of one or more adjacent
characters in character segments 204-216, thereby reducing noise present in character
segments 204-216.
[0024] Thereafter, character segments 204-216 along with the similar character segments
corresponding to character segments 204-216 may be stored in the set of pre-saved
character segments. The stored character segments along with the similar character
segments may then be used for subsequent iterations for refining segmentation of the
plurality of words.
[0025] Similarly, each word of the plurality of words is compared with similar character
segments over multiple iterations and the segmentation of each word is refined at
each iteration. Referring now to FIG. 2c, an nth iteration indicating the segmentation
of word 200 is illustrated. In this case, segments
218-1 to 218-7 of FIG. 2b are refined to a segment n-1, a segment n-2, a segment n-3,
a segment n-4, a segment n-5, a segment n-6, and a segment n-7 (hereinafter referred
to as segments n-1 to n-7) based on the comparison as indicated in FIG. 2c. On refining
the segmentation of word 200 based on the comparison, in the nth iteration character
segments 220-232 are refined to a character segment 234, a character
segment 236, a character segment 238, a character segment 240, a character segment
242, a character segment 244, and a character segment 246 (hereinafter referred to
as character segments 234-246) respectively.
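The iterative refinement across FIG. 2a to FIG. 2c can be pictured with the following sketch. The specification does not state the rule by which a character segment is compared with its similar character segments, so the rule used here is an assumption chosen for brevity: at each iteration the mean width of all segments sharing a ground truth is recomputed, and each word is re-segmented in proportion to those means. Only the boundaries between adjacent segments along the width of the word are modelled, and all function and parameter names are hypothetical.

```python
from collections import defaultdict


def mean_widths(segmentations, pre_saved_widths):
    """Average segment width per ground-truth character, pooled over the
    current segmentations and an assumed set of pre-saved segment widths."""
    widths = defaultdict(list)
    for label, saved in pre_saved_widths.items():
        widths[label].extend(saved)
    for truth, bounds in segmentations:
        for ch, (left, right) in zip(truth, bounds):
            widths[ch].append(right - left)
    return {ch: sum(ws) / len(ws) for ch, ws in widths.items()}


def resegment(word_width, truth, avg):
    """Re-place boundaries in proportion to the current mean widths."""
    total = sum(avg[ch] for ch in truth)
    bounds, left = [], 0.0
    for ch in truth:
        right = left + word_width * avg[ch] / total
        bounds.append((left, right))
        left = right
    return bounds


def refine(words, iterations=10, pre_saved_widths=None):
    """words is a list of (word_width, ground_truth) pairs.
    Returns a list of (ground_truth, boundaries) after refinement."""
    pre_saved_widths = pre_saved_widths or {}
    # Initial segmentation: equal-width segments (assumed starting point).
    segs = [(t, resegment(w, t, {ch: 1.0 for ch in t})) for w, t in words]
    for _ in range(iterations):
        avg = mean_widths(segs, pre_saved_widths)
        segs = [(t, resegment(w, t, avg)) for w, t in words]
    return segs
```

As a usage example, calling refine([(30, "ab"), (20, "aa")], iterations=50) starts with the boundary inside "ab" at the midpoint of the word and moves it at each iteration toward the width implied by the occurrences of "a" in the second word. A practical system would compare the segment images themselves rather than widths alone.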
[0026] FIG. 3 illustrates an apparatus 300 for automatically identifying character segments
for character recognition in accordance with an embodiment of the invention. Apparatus
300 includes a memory 302 and a processor 304 coupled to memory 302.
[0027] Processor 304 is configured to receive a plurality of words and a ground truth corresponding
to each word of the plurality of words. In an embodiment, the plurality of words may
correspond to a line of text and a ground truth for the entire line of text may be
received. It will be apparent to a person skilled in the art that the plurality of
words may correspond to a paragraph, a zone in a page, a page, or multiple pages without
deviating from the scope of the invention. The plurality of words may be in a cursive
script.
[0028] Further, processor 304 is configured to segment each word of the plurality of words
into one or more character segments (hereinafter referred to as "character segments")
based on the number of characters indicated by the ground truth of the word. The segmentation
of the word into the character segments represents boundaries for the character segments.
In an embodiment, processor 304 is configured to segment a word by randomly dividing
the word into character segments based on the number of characters indicated by the
ground truth of the word. In another embodiment, processor 304 is configured to segment
a word by dividing the word into character segments based on an average character
width associated with each character in the word. The segmentation of each word of
the plurality of words is explained in conjunction with FIG. 1.
[0029] Processor 304 is further configured to refine the segmentation of each word by iteratively
re-segmenting each word by comparing character segments of each word in the plurality
of words with one or more similar character segments (hereinafter referred to as "similar
character segments"). The similar character segments may be selected by processor
304 from one or more of the plurality of words and a set of pre-saved character segments.
To determine similar character segments, the ground truths of the character segments
are compared with ground truths of other character segments in the plurality of words
and in the set of pre-saved character segments. If two ground truths are identical,
then the character segments associated with the two ground truths are considered to
be similar character segments.
[0030] On comparing similar character segments, processor 304 refines segmentation of each
character segment in each word of the plurality of words. To refine the segmentation
of each word, processor 304 is configured to determine a plurality of horizontal boundaries
and a plurality of vertical boundaries for the character segments of each word. The
plurality of horizontal boundaries and the plurality of vertical boundaries are then
iteratively modified by comparing the character segments with the similar character
segments as explained in conjunction with FIG. 1 and FIG. 2a to FIG. 2c. In addition
to defining horizontal boundaries and vertical boundaries, processor 304 is further
configured to remove parts of one or more adjacent characters of the character segments
to refine the segmentation of each word.
[0031] Further, the character segments associated with the plurality of words along with
the similar character segments associated with the plurality of words may be stored
in memory 302 and added to the set of pre-saved character segments. The set of pre-saved
character segments may also be stored in memory 302. The character segments along
with the similar character segments may be used for subsequent iterations for refining
the segmentation of each word of the plurality of words.
[0032] Various embodiments of the invention provide methods and apparatuses for automatically
identifying character segments for character recognition. The method and apparatus
enable efficient segmentation of words that are in cursive script, such as words
in Arabic script. The method enables automatically segmenting each word of a plurality
of words into one or more character segments based on a word level or a line level
ground truth. The segmentation of each word by iterative comparison eliminates the
need for manually demarcating the segmentation of each word and hence reduces the
error rate and time required for segmenting a word into character segments. Since
the ground truths are provided at a word level or a line level, manually providing
ground truths for each character segment is also avoided. This reduces the time required
for providing ground truths to train a classifier.
[0033] Those skilled in the art will realize that the above recognized advantages and other
advantages described herein are merely exemplary and are not meant to be a complete
rendering of all of the advantages of the various embodiments of the present invention.
[0034] In the foregoing specification, specific embodiments of the present invention have
been described. However, one of ordinary skill in the art appreciates that various
modifications and changes can be made without departing from the scope of the present
invention as set forth in the claims below. Accordingly, the specification and figures
are to be regarded in an illustrative rather than a restrictive sense, and all such
modifications are intended to be included within the scope of the present invention.
The benefits, advantages, solutions to problems, and any element(s) that may cause
any benefit, advantage, or solution to occur or become more pronounced are not to
be construed as critical, required, or essential features or elements of any or
all the claims. The present invention is defined solely by the appended claims including
any amendments made during the pendency of this application and all equivalents of
those claims as issued.
CLAIMS
1. A method of automatically identifying character segments for character recognition,
the method comprising:
receiving a plurality of words and a ground truth corresponding to each word of the
plurality of words;
segmenting each word into at least one character segment based on the ground truth
corresponding to each word; and
refining the segmentation of each word by iteratively re-segmenting each word based
on at least one similar character segment, wherein the at least one similar character
segment is similar to the at least one character segment associated with each word.
2. The method of claim 1, wherein the plurality of words is a line of text.
3. The method of claim 1 or 2, wherein the plurality of words is in a cursive script.
4. The method of any one of the preceding claims, wherein a word is segmented into at
least one character segment based on a number of characters associated with a ground
truth of the word, in particular wherein segmenting the word comprises at least one
of: dividing the word into the at least one character segment randomly; and dividing
the word into the at least one character segment based on a predefined average character
width associated with each character of the plurality of words.
5. The method of any one of the preceding claims, wherein refining the segmentation of
each word comprises determining horizontal and vertical boundaries for the at least
one character segment.
6. The method of any one of claims 1 to 4, wherein refining the segmentation of each word
comprises comparing a character segment of the at least one character segment with
the at least one similar character segment, wherein a ground truth of the character
segment and a ground truth of the at least one similar character segment are identical,
in particular wherein the at least one similar character segment is selected from
at least one of: the plurality of words; and a set of pre-saved character segments.
7. The method of claim 6 further comprising storing at least one of the at least one
character segment and the at least one similar character segment.
8. The method of claim 1, wherein refining the segmentation of each word comprises removing
at least one part of at least one adjacent character.
9. An apparatus for automatically identifying character segments for character recognition,
the apparatus comprising:
a memory; and
a processor coupled to the memory, the processor configured to:
receive a plurality of words and a ground truth corresponding to each word of the
plurality of words;
segment each word into at least one character segment based on the ground truth corresponding
to each word; and
refine the segmentation of each word by iteratively re-segmenting each word based
on at least one similar character segment, wherein the at least one similar character
segment is similar to the at least one character segment associated with each word.
10. The apparatus of claim 9, wherein the plurality of words is a line of text, in particular
wherein the plurality of words is in a cursive script.
11. The apparatus of claim 9 or 10, wherein the processor is further configured to segment
a word into at least one character segment based on a number of characters associated
with the ground truth of the word, in particular wherein to segment the word into
the at least one character segment, the processor is further configured to perform
at least one of: dividing the word into the at least one character segment randomly;
and dividing the word into the at least one character segment based on a predefined
average character width associated with each character of the plurality of words.
12. The apparatus of any one of claims 9 to 11, wherein the processor is further configured
to refine the segmentation of each word by determining horizontal and vertical boundaries
for the at least one character segment.
13. The apparatus of any one of claims 9 to 12, wherein the processor is further configured
to refine the segmentation of each word by comparing a character segment of the at
least one character segment with the at least one similar character segment, wherein
a ground truth of the character segment and a ground truth of the at least one similar
character segment are identical, in particular wherein the processor is further configured
to select the at least one similar character segment from at least one of: the plurality
of words; and a set of pre-saved character segments.
14. The apparatus of claim 13, wherein at least one of the at least one character segment,
the at least one similar character segment, and the set of pre-saved character segments
is stored in the memory.
15. The apparatus of claim 9, wherein the processor is further configured to remove at
least one part of at least one adjacent character to refine the segmentation of each
word.