FIELD OF THE INVENTION
[0001] The present invention relates to the automatic identification of a document format.
BACKGROUND OF THE INVENTION
[0002] In a field where forms in various formats are handled, an apparatus for automatically
identifying forms by format has been proposed. This type of automatic identification
is made based on the similarity between formats of the forms. A method of determining
an origin position of a form greatly affects the result of the calculation of the
similarity between form formats. If an upper left corner of an image read through
a scanner is used as the origin of a form as is, a displacement of the form placed
on the scanner displaces the position of the form origin, preventing the form from
being properly recognized. Therefore, form format data is generated in order to correct
the form origin position. This method will be described below. When a scanner that
reads an image against a black background (hereinafter called a "black back scanner")
is used to read an image, the outer rim of a form in the read image appears in black.
Therefore, a process (black rim correction) for deleting the black rim is performed
to correctly recognize the shape of the form. Means for generating form format data
uses as the origin the upper left corner of the image that has undergone the black
rim correction to generate the format data (FIG. 2A).
[0003] When a scanner that reads an image against a white background (hereinafter called
a "white back scanner") is used, the outer rim of a form in the read image appears in white.
Thus the color of the rim in many cases is the same as that of the form itself. Therefore
the black rim correction cannot be applied to it. Because no colors appear on the
outer rim of the form except white, which is the background color, features of the
image are extracted to determine the positions of a table block and a text block to
decide its origin. For example, the top, left most position of a rectangle encompassing
a whole block is used as the origin to generate the format data. Although the upper
left corner of the form cannot be used as the origin in this method, the origin of
forms in the same format can be determined uniquely when the background color is white
(FIG. 2B).
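The white back origin described above can be illustrated with a minimal sketch. It assumes that each extracted block is represented by its left and top position plus its width and height; the `Block` type and the function name `white_back_origin` are hypothetical and do not appear in the original description.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical representation of one extracted table or text block."""
    left: int
    top: int
    width: int
    height: int


def white_back_origin(blocks: List[Block]) -> Tuple[int, int]:
    """Return the top, left most corner of the rectangle enclosing all blocks."""
    if not blocks:
        raise ValueError("at least one block is required to determine an origin")
    return (min(b.left for b in blocks), min(b.top for b in blocks))


# Two blocks: the origin takes the smallest left and the smallest top value.
blocks = [Block(120, 80, 400, 150), Block(100, 260, 300, 90)]
origin = white_back_origin(blocks)
```

Because the origin depends only on the block positions, the same form yields the same origin even when it is displaced on the scanner, provided the blocks are extracted consistently.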
[0004] However, the method of determining the origin in the black back scanner differs from
the method used in the white back scanner. Therefore, in an environment where various
types of scanners are used, different methods are used to calculate features, preventing
document formats from being correctly identified. In addition, an apparatus for identifying
document formats is often used in a relatively large client-server-based system environment.
When the conventional automatic identification method described above is used in such
a system in which a number of clients are used, a single type of scanner must be
used in those clients or some other restrictions must be introduced.
SUMMARY OF THE INVENTION
[0005] The present invention has been achieved in view of these problems and it is an object
of the present invention to facilitate the correct identification of various document
formats in an environment where a plurality of different features introduced by variations
in type of image input apparatuses are mixed.
[0006] Other features and advantages of the present invention will be apparent from the
following description taken in conjunction with the accompanying drawings, in which
like reference characters designate the same or similar parts throughout the figures
thereof.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007]
FIG. 1 is a block diagram of a general configuration of a form identification apparatus
according to an embodiment of the present invention;
FIGS. 2A and 2B show sample images for illustrating a difference in form origins between
a black back scanner and a white back scanner;
FIG. 3 shows examples of a page format and a table format in format data according
to the present invention;
FIG. 4 shows a process performed by a combination of scanners in an environment where
a black back scanner and white back scanner are used;
FIG. 5 is a flowchart showing an outline of a process according to the present invention;
and
FIG. 6 is a flowchart of a similarity calculation process according to the present
invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0008] Embodiments of the present invention will be described below with reference to the
accompanying drawings.
[0009] FIG. 1 is a block diagram showing a general configuration of an automatic form identification
apparatus according to an embodiment of the present invention.
[0010] Reference numeral 11 denotes a scanner for optically reading a form image and outputting
form image data. Reference numeral 12 denotes a processor which functions as image
feature extraction means 12a, format data generation means 12b, and similarity calculation
means 12c by executing a control program 15d stored in memory 15. The image read by
the scanner 11 is stored in the memory 15 as a form image 15c. The form image 15c
is binarized and sent to the image feature extraction means 12a, where its regions
are classified by attribute into blocks such as table, text, and picture blocks by means of
procedures such as a black-dot histogram method. For the table block, a detailed construction
of the table is determined by a method such as ruled-line tracing. The text block
is subjected to a process such as character code conversion.
[0011] From information obtained in this way, the page format and the table format of the
form as shown in FIG. 3 are generated by the format data generation means 12b and
stored in the memory 15 and a disk 14. FIG. 3 shows a sample form 31 extracted by
the image feature extraction means 12a. Three table blocks (311 through 313) and one
picture block (314) are extracted. Format data 32 of the form is stored in layers
of the page format 321 and the table format 322. The page format 321 has a header
section 321a containing the type of the scanner, the resolution of the original form,
a difference from the white back origin, and the page width and page height of the
form. The type of the scanner indicates whether the white back scanner or the black
back scanner is used; it is preset during the installation of a scanner application.
The difference from the white back
origin is a distance from the origin (hereinafter called the "black back origin")
read by a black back scanner to the origin (hereinafter called the "white back origin")
read by a white back scanner. Therefore the difference from the white back origin
of a form read by a white back scanner is always zero.
[0012] A data section 322a contains various items of information for each block. For example,
if the block attribute is "table," it contains the left most position and top position
of the block as position information and the width and height of the block as size
information. It also contains a distance from the page origin used for picking up
a form to be compared, and the area of the relevant block divided by the area of all
the table blocks, which is used for calculating similarity. In addition, it contains
a table ID for linking to detailed table information. The table format 322 indicates
the detailed construction of cells in the table associated with this table ID. It
contains the number of cells in the table, and the position and size of cells.
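The layered format data described above might be modelled as in the following sketch. All class and field names are hypothetical. The sketch also assumes that the distance from the page origin is the Euclidean distance to the top left corner of a block, which the description does not specify.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PageHeader:
    """Header section of the page format (field names are assumptions)."""
    scanner_type: str                            # "white_back" or "black_back"
    resolution_dpi: int
    diff_from_white_back_origin: Tuple[int, int]
    page_width: int
    page_height: int


@dataclass
class BlockEntry:
    """One entry of the data section."""
    attribute: str                               # e.g. "table", "text", "picture"
    left: int
    top: int
    width: int
    height: int
    distance_from_origin: float = 0.0
    area_ratio: float = 0.0
    table_id: Optional[int] = None               # link to detailed table information


def fill_derived_fields(entries: List[BlockEntry]) -> None:
    """Compute the distance from the page origin and the table area ratio."""
    table_area = sum(e.width * e.height for e in entries if e.attribute == "table")
    for e in entries:
        # Assumption: Euclidean distance from the page origin to the block corner.
        e.distance_from_origin = math.hypot(e.left, e.top)
        if e.attribute == "table" and table_area > 0:
            e.area_ratio = (e.width * e.height) / table_area


header = PageHeader("black_back", 200, (40, 25), 1000, 1400)
entries = [
    BlockEntry("table", 30, 40, 100, 50, table_id=1),
    BlockEntry("table", 0, 200, 100, 150, table_id=2),
]
fill_derived_fields(entries)
```

Storing the difference from the white back origin in the header keeps the per-block data unchanged; a conversion is applied only when two forms from different scanner types are compared.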
[0013] When a command for registering a form, identifying a form, or other operations is
input into the form identification apparatus through a keyboard, the processor 12
performs a process corresponding to the command by using the format data 32 described
above. Then the result of identification is displayed on a display 16.
[0014] The operation of various control processes performed by the form identification apparatus,
in particular the processor 12, according to this embodiment will be described below
with reference to FIGS. 3, 4, 5, and 6.
[0015] FIG. 5 shows a general process flow in the form identification apparatus. At step
S101, a form to be identified is read by a scanner. At step S103, feature data such
as the coordinates of table and text blocks is extracted. At step S105, the feature
data is converted into format data for calculating similarity. Forms that can have
the same format data as that of the form to be identified are selected from registered
forms at step S107 on the basis of this format data. At step S109, similarities of
the formats of all the selected forms are calculated. A predetermined number of registered
forms having the highest calculated similarities are selected as candidates for a form
similar to the form to be identified, and their identification codes and similarities
are output.
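The final selection step can be sketched as follows. The similarity values are assumed to have been calculated already at step S109; the function name `top_candidates`, the form identifiers, and the scores are illustrative only.

```python
from typing import Dict, List, Tuple


def top_candidates(similarities: Dict[str, float], limit: int) -> List[Tuple[str, float]]:
    """Return up to `limit` (identification code, similarity) pairs, best first."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Hypothetical similarities of three registered forms to the form to be identified.
scores = {"invoice_A": 0.62, "order_B": 0.91, "receipt_C": 0.78}
candidates = top_candidates(scores, 2)
```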
[0016] An issue in performing similarity calculation on the registered forms and the form
to be identified regardless of the type of the scanner (white back or black back)
will be clarified with reference to FIG. 4. If the form to be identified is one read
by a white back scanner and the registered form is one read by a black back scanner,
their format data differ because their origin positions indicate mutually
different points. Similarly, if the form to be identified is one read by a black back
scanner and the registered form is one read by a white back scanner, their format
data are mutually different. To solve this problem, the scanner types are checked to see
whether they differ before performing the similarity
calculation. This check will be described with reference to FIG. 6.
[0017] If the scanner used for reading the form to be identified is a white back scanner
and the scanner used for reading the registered form is of a black back type, coordinate
transformation is performed on the registered form (steps S601, S602, S603). As described
earlier, the difference from the white back origin is contained in the header section
321a of the page format 321 for the registered form. The format data read by the black
back scanner can be converted into the format data that would have been obtained had the form been read by the
white back scanner, by subtracting the difference from the width and height of the
form page, the left most and top positions of each block, and the distance from the
page origin. By calculating the similarity between the converted format data and the
format data of the form to be identified, a result equivalent to that of a regular
similarity calculation can be obtained.
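The conversion described above can be sketched as follows. The `Block` and `FormatData` types are hypothetical, and the difference is assumed to be stored as an (x, y) pair. The distance from the page origin is omitted because the description does not state how it is measured.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Block:
    left: int
    top: int
    width: int
    height: int


@dataclass(frozen=True)
class FormatData:
    scanner_type: str                            # "white_back" or "black_back"
    diff_from_white_back_origin: Tuple[int, int]
    page_width: int
    page_height: int
    blocks: Tuple[Block, ...]


def to_white_back(data: FormatData) -> FormatData:
    """Re-express black back format data relative to the white back origin."""
    dx, dy = data.diff_from_white_back_origin
    # Shift every block position by the stored difference; sizes are unchanged.
    shifted = tuple(replace(b, left=b.left - dx, top=b.top - dy) for b in data.blocks)
    return replace(
        data,
        scanner_type="white_back",
        diff_from_white_back_origin=(0, 0),
        page_width=data.page_width - dx,
        page_height=data.page_height - dy,
        blocks=shifted,
    )


registered = FormatData(
    "black_back", (40, 25), 1000, 1400,
    (Block(40, 25, 300, 100), Block(60, 200, 500, 300)),
)
converted = to_white_back(registered)
```

For format data that already has a zero difference, the function returns equivalent data, which matches the statement that the difference for a form read by a white back scanner is always zero.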
[0018] If the form to be identified is read by a black back scanner and the registered form
is read by a white back scanner, a coordinate transformation is performed on the form
to be identified (steps S601, S604, S605). The difference from the white back origin
is contained in the header section 321a of the page format 321 for the form to be
identified. The format data of the form to be identified can be converted into the format data that would
have been obtained had the form been read by the white back scanner, in the same way as described above, and
a result equivalent to that of a regular similarity calculation can be obtained.
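The check that precedes the similarity calculation can be summarized in a short sketch. The scanner type labels and the function name `side_to_convert` are hypothetical, and the exact content of each step in FIG. 6 is not reproduced here; the sketch only captures which of the two forms undergoes the coordinate transformation.

```python
from typing import Optional

SCANNER_TYPES = {"white_back", "black_back"}


def side_to_convert(target_type: str, registered_type: str) -> Optional[str]:
    """Decide which format data must be converted before comparison.

    Returns "registered", "target", or None when no conversion is needed.
    """
    for scanner_type in (target_type, registered_type):
        if scanner_type not in SCANNER_TYPES:
            raise ValueError(f"unknown scanner type: {scanner_type!r}")
    if target_type == registered_type:
        return None  # same origin definition on both sides
    # The types differ, so the data read by the black back scanner is converted.
    return "registered" if registered_type == "black_back" else "target"
```

In both mixed cases the data read by the black back scanner is converted, because only that data carries a non-zero difference from the white back origin.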
[0019] As described above, a form can be identified regardless of whether it was read by
a black back scanner or a white back scanner.
[0020] Therefore a user can use a form without concern for what kind of scanner was used
to read the form in an environment where white back scanners and black back scanners
are used.
[0021] Some forms are prone to errors in their origin position because the white
back origin is determined by using a table block or text block. According to the present
invention, whenever a black back origin can be used, a form is identified by using
the black back origin; therefore a form can be identified more accurately than by form
identification using only a white back origin.
[Other embodiments]
[0022] While only the origin of a form has been considered in the foregoing description
of the embodiment, the width and height of a form page can also be used as factors
in the similarity calculation. In this case, the coordinates of the lower right corner
of the form should also be defined. For a black back scanner, the coordinates of the
lower right corner of an image are the coordinates of the lower right corner of the
form because black rim correction is performed. For a white back scanner, the coordinates
of the right most and bottom positions of a rectangle encompassing all blocks are
determined as the coordinates of the lower right corner of the form. A difference
(hereinafter called a "difference from the white back lower right coordinates") between
the lower right corner coordinates (hereinafter called the "white back lower right coordinates")
of a form read by a white back scanner and the lower right corner coordinates (hereinafter
called the "black back lower right coordinates") of the form read by a black back scanner
is stored beforehand in the header section 321a of a page format. In the form read
by a white back scanner, the difference from the white back lower right coordinates
is always zero.
[0023] One example of a method of recording the lower right coordinates in format data will
be provided below. For a white back scanner, the right most position minus the white back
origin is recorded as the page width of the form, and the bottom position minus the white
back origin is recorded as the page height of the form.
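As a worked illustration of this recording method, the following sketch computes the page width and height for a form read by a white back scanner. Blocks are assumed to be given as (left, top, width, height) tuples, and the function name is hypothetical.

```python
from typing import List, Tuple

BlockTuple = Tuple[int, int, int, int]  # (left, top, width, height)


def white_back_page_size(blocks: List[BlockTuple]) -> Tuple[int, int]:
    """Return (page width, page height) measured from the white back origin."""
    if not blocks:
        raise ValueError("at least one block is required")
    origin_x = min(left for left, _, _, _ in blocks)
    origin_y = min(top for _, top, _, _ in blocks)
    right_most = max(left + width for left, _, width, _ in blocks)
    bottom = max(top + height for _, top, _, height in blocks)
    # Width and height of the rectangle encompassing all blocks.
    return (right_most - origin_x, bottom - origin_y)


blocks = [(100, 80, 400, 150), (120, 260, 300, 90)]
page_size = white_back_page_size(blocks)
```

Here the enclosing rectangle spans x from 100 to 500 and y from 80 to 350, so the recorded page width is 400 and the page height is 270.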
[0024] According to this embodiment, if a form to be identified is read by a black back
scanner and a registered form is read by a white back scanner, the page width and
height that would be obtained with the white back scanner can be calculated in the coordinate
transformation as follows. The difference from the white back lower right coordinates is
subtracted from the lower right corner coordinates of the form to be identified, which
yields the right most and bottom positions of a rectangle encompassing all blocks.
The recorded white back origin position is then subtracted from these positions, and
the result is used as format data.
[0025] While the format of a form is identified in the embodiments described above, the
formats of other documents can also be identified.
[0026] The object of the present invention can also be attained by providing a storage medium
or signal carrying a software program code for implementing functions of the embodiments
described above to a system or an apparatus to cause a computer (or CPU or MPU) of
the system or apparatus to read and execute the program code.
[0027] In such a case, the program code read from the storage medium or signal implements
the functions of the embodiments described above and the storage medium or signal
containing the program code constitutes an embodiment of the present invention.
[0028] The storage medium for providing the program code may be a floppy disk, hard disk,
optical disk, magneto-optical disk, CD-ROM, CD-R, magnetic tape, nonvolatile memory
card, or ROM.
[0029] The present invention also includes, besides the functions of the above-described
embodiments being implemented by the computer executing the read program code, an
implementation in which software such as an OS (operating system) running on the computer
performs all or some of actual processing based on the instructions of the program
code to implement the functions of the above-described embodiments.
[0030] The present invention also includes an implementation in which the program code read
from a storage medium or signal is loaded into memory provided in an expansion board
inserted in the computer or in an expansion unit attached to the computer, then a
component such as a CPU provided in the expansion board or the expansion unit performs
all or some of actual processing based on the instructions of the program code to
implement the functions of the above-described embodiments.
[0031] As many apparently widely different embodiments of the present invention can be made
without departing from the spirit and scope thereof, it is to be understood that the
invention is not limited to the specific embodiments thereof except as defined in
the appended claims.
1. A document format identification apparatus, comprising:
extraction means (12a) for extracting a feature from an image (15c) of a document
read through an image input apparatus (11);
generation means (12b) for generating, based on the feature extracted by said extraction
means (12a), document format data (32) containing identification data (322a) for identifying
a document format and correction information (321a) for indicating a type of said
image input apparatus (11) and correcting a feature difference produced by a difference
in type of the image input apparatus (11); and
storage means (14) for storing said document format data (32).
2. The document format identification apparatus according to claim 1, further comprising
identification means (12c) for using said extraction means (12a) and said generation
means (12b) to obtain the document format data for the image of the document the format
of which is to be identified, comparing the document format data with said document
format data stored in said storage means (14), calculating the similarity between
the document formats to identify the document format of said document to be identified,
wherein:
at the time of comparing the document format data stored in said storage means (14)
with the document format data generated by said generation means (12b) about said
document to be identified, said identification means (12c) determines based on the
type of the image input apparatus reading the document whether any of the identification
data stored in said storage means (14) and identification data generated in said generation
means (12b) for identifying the document to be identified should be corrected based
on said correction information (321a), and if said identification means (12c) determines
that correction is required, corrects any of said identification data based on said
correction information (321a), then compares the identification data with each other
to calculate the similarity between the identification data.
3. The document format identification apparatus according to claim 2, further comprising
output means (16) for using the results obtained through said similarity calculation
to output the identification codes and similarities of a predetermined number of document
formats in descending order of similarity.
4. The document format identification apparatus according to claim 2,
wherein said correction information (321a) indicates a difference in origin position
coordinates, said difference being caused because different types of said image input
apparatus use different origin position extraction methods in feature extraction;
and said identification means (12c) performs coordinate transformation on the identification
data stored in said storage means (14) or the identification data generated by said
generation means (12b) about the document to be identified by using said difference
if said identification means (12c) determines that correction is required, then compares
the identification data with each other to calculate the similarity between the identification
data.
5. The document format identification apparatus according to claim 4,
wherein the types of said image input apparatus include a black back scanner for
reading a document image against a black background and a white back scanner for reading
the document image against a white background;
said generation means (12b) sets the origin on a border between the background and
the document and provides as said correction information (321a) a difference between
a point obtained based on a block area in said document and said origin if said document
is read by said black back scanner, or sets the origin based on the block area in
said document and provides said correction information (321a) without containing a
difference if said document is read by said white back scanner; and
said identification means (12c) corrects the document format data read by the black
back scanner by using said correction information (321a), then compares said document
format data with each other to calculate the similarity between the document format
data if said types of said image input apparatus do not match when comparing the document
format data stored in said storage means (14) with the document format data generated
by said generation means (12b) about said document to be identified.
6. A document format identification apparatus, comprising:
extraction means (12a) for extracting a feature from an image (15c) read through an
image input apparatus (11) for a document the format of which is to be identified;
generation means (12b) for generating, based on the feature extracted by said extraction
means (12a), document format data (32) containing identification data (322a) for identifying
a document format and correction information (321a) for indicating a type of said
image input apparatus (11) and correcting a feature difference produced by a difference
in type of the image input apparatus (11); and
identification means (12c), for document format data containing identification data
about a plurality of documents for identifying data formats stored in storage means
(14) and correction information and document format data generated by said generation
means (12b), for identifying the document format of said document to be identified
by comparing the document format data after correcting any of said identification
data based on said correction information (321a) if the type of the image input apparatus
used for reading said document to be identified is different from the type of the
image input apparatus used for reading the document for identifying said document
format.
7. The document format identification apparatus according to claim 6,
wherein the types of said image input apparatus (11) include a black back scanner
for reading a document image against a black background and a white back scanner for
reading the document image against a white background; and
said generation means (12b) sets an origin in a predetermined position on the border
between the background and the document and provides as said correction information
(321a) a difference between a point determined based on a feature of said document
and said origin if said document is read by said black back scanner, or provides as
the origin a point determined based on the feature of said document if said document
is read by said white back scanner; and
said identification means (12c) identifies the format of said document to be identified
by comparing the document format data stored in said storage means (14) with the document
format data generated by said generation means (12b) about said document to be identified
after correcting the document format data read by the black back scanner by using
said correction information (321a) if the types of said image input apparatus (11)
do not match.
8. A document format identification method, comprising the steps of:
extracting a feature from the image of a document read through an image input apparatus;
generating, based on the feature extracted by said extraction step, document format
data containing identification data for identifying a document format and correction
information for indicating the type of said image input apparatus and correcting a
feature difference produced by a difference in type of the input apparatus; and
storing said document format data.
9. The document format identification method according to claim 8, further comprising
the steps of using said extraction step and said generation step to obtain the document
format data for the image of the document the format of which is to be identified,
comparing the document format data with said document format data stored in said storage
step, and calculating the similarity between the document formats to identify the
document format of said document to be identified,
wherein, at the time of comparing the document format data stored in said storage
step with the document format data generated by said generation step about said document
to be identified, said identification step determines based on the type of the image
input apparatus reading the document whether any of the identification data stored
in said storage step and identification data generated in said generation step of
identifying the document to be identified should be corrected based on said correction
information, and if said identification step determines that correction is required,
corrects any of said identification data based on said correction information, then
compares the identification data with each other to calculate the similarity between
the identification data.
10. The document format identification method according to claim 9, further comprising
the step of using the results obtained through said similarity calculation to output
the identification codes and similarities of a predetermined number of document formats
in descending order of similarity.
11. The document format identification method according to claim 9,
wherein said correction information indicates a difference in origin position coordinates,
said difference being caused because different types of said image input apparatus
use different origin position extraction methods in feature extraction; and
said identification step performs coordinate transformation on the identification
data stored in said storage step or the identification data generated by said generation
step about the document to be identified by using said difference if said identification
step determines that correction is required, then compares the identification data
with each other to calculate the similarity between the identification data.
12. The document format identification method according to claim 11,
wherein the types of said image input apparatus include a black back scanner for
reading a document image against a black background and a white back scanner for reading
the document image against a white background; and
said generation step sets the origin on a border between the background and the document
and provides as said correction information a difference between a point obtained
based on a block area in said document and said origin if said document is read by
said black back scanner, or sets the origin based on the block area in said document
and provides said correction information without containing a difference if said document
is read by said white back scanner; and
said identification step corrects the document format data read by the black back
scanner by using said correction information, then compares said document format data
with each other to calculate the similarity between the document format data if said
types of said image input apparatus do not match when comparing the document format
data stored in said storage step with the document format data generated by said generation
step about said document to be identified.
13. A document format identification method, comprising the steps of:
extracting a feature from an image read through an image input apparatus for a document
the format of which is to be identified;
generating, based on the feature extracted by said extraction step, document format
data containing identification data for identifying a document format and correction
information for indicating the type of said image input apparatus and correcting a
feature difference produced by a difference in type of the image input apparatus;
and
for document format data containing identification data about a plurality of documents
for identifying data formats stored in storage step and correction information and
document format data generated by said generation step, identifying the document format
of said document to be identified by comparing the document format data after correcting
any of said identification data based on said correction information if the type of
the image input apparatus used for reading said document to be identified is different
from the type of the image input apparatus used for reading the document for identifying
said document format.
14. The document format identification method according to claim 13,
wherein the types of said image input apparatus include a black back scanner for
reading a document image against a black background and a white back scanner for reading
the document image against a white background; and
said generation step sets an origin in a predetermined position on the border between
the background and the document and provides as said correction information a difference
between a point determined based on a feature of said document and said origin if
said document is read by said black back scanner, or provides as the origin a point
determined based on the feature of said document if said document is read by said
white back scanner; and
said identification step identifies the format of said document to be identified by
comparing the document format data stored in said storage step with the document format
data generated by said generation step about said document to be identified after
correcting the document format data read by the black back scanner by using said correction
information if the types of said image input apparatus do not match.
15. A storage medium (15) storing a control program for causing a computer to implement
the document format identification method according to any of claims 8 to 14.
16. A control program (15d) for causing a computer to implement the document format identification
method according to any of claims 8 to 14.
17. A signal carrying instructions for causing a programmable processing apparatus to
become operable to perform a method as set out in at least one of claims 8 to 14.