Copyright Notice and Permission
[0001] A portion of this patent document contains material subject to copyright protection.
The copyright owner has no objection to the facsimile reproduction by anyone of the
patent document or the patent disclosure, as it appears in the Patent and Trademark
Office patent files or records, but otherwise reserves all copyrights whatsoever.
The following notice applies to this document: Copyright Ⓒ 2003, Thomson Global Resources
AG.
Technical Field
[0002] Various embodiments of the present invention concern information-retrieval systems,
such as those that provide legal documents or other related content.
Background
[0003] In recent years, the fantastic growth of the Internet and other computer networks
has fueled an equally fantastic growth in the data accessible via these networks.
One of the seminal modes for interacting with this data is through the use of hyperlinks
within electronic documents.
[0004] More recently, there has been interest in hyperlinking documents to other documents
based on the names of people in the documents. For example, to facilitate legal research,
West Publishing Company of St. Paul, Minnesota (doing business as Thomson West) provides
thousands of electronic judicial opinions that hyperlink the names of attorneys and
judges to their online biographical entries in the West Legal Directory, a proprietary
directory of approximately 1,000,000 U.S. attorneys and 20,000 judges. These hyperlinks
allow users accessing judicial opinions to quickly obtain contact and other specific
information about lawyers and judges named in the opinions.
[0005] The hyperlinks in these judicial opinions are generated automatically, using a system
that extracts first, middle, and last names; law firm name, city, and state; and court
information from the text of the opinions and uses them as clues to determine whether
to link the named attorneys and judges to their corresponding entries in the professional
directory. See
Christopher Dozier and Robert Haschart, "Automatic Extraction and Linking of Person
Names in Legal Text" (Proceedings of RIAO 2000: Content Based Multimedia Information
Access. Paris, France. pp. 1305-1321. April 2000). An improvement to this system is described in Christopher Dozier, System, Methods
And Software For Automatic Hyperlinking Of Persons' Names In Documents To Professional
Directories,
WO 2003/060767A3 July 24, 2003.
[0006] WO 03/060767 describes a method and system for adding hyperlinks to names in documents. The names
in the documents are identified and compared with names in directories to form the hyperlinks.
[0007] The present inventors have recognized a still further need for improvement in these
and other systems that generate automatic links.
[0008] According to one aspect, the present invention provides a system comprising: means
for extracting an entity reference record from each of a plurality of documents; means
for forming at least one entity profile record by merging at least one of the entity
reference records with at least one other entity reference record; means for categorizing
at least one of the entity profile records based on a taxonomy; and means for defining
links between at least one of the entity profile records and other documents or data
sets.
[0009] According to a second aspect, the present invention provides a method comprising:
extracting an entity reference record from each of a plurality of documents; forming
at least one entity reference profile by merging at least one of the entity reference
records with at least one other entity reference record; automatically categorizing
at least one of the entity profile records based on an expertise taxonomy; and defining
links between at least one of the entity profile records and other documents or data
sets.
Brief Description of Drawings
[0010]
- Figure 1
- is a diagram of an exemplary information-retrieval system 100 corresponding to one
or more embodiments of the invention;
- Figure 2
- is a flowchart corresponding to one or more exemplary methods of operating system
100 and one or more embodiments of the invention;
- Figures 3-8
- are facsimiles of exemplary user interfaces, each corresponding to one or more embodiments
of the invention.
- Figure 9
- is a flow chart corresponding to one or more embodiments of the invention.
- Figure 10
- is a flow chart corresponding to one or more additional embodiments of the invention.
Detailed Description of Exemplary Embodiments
[0011] This description, which references and incorporates the above-identified Figures,
describes one or more specific embodiments of an invention. These embodiments, offered
not to limit but only to exemplify and teach the invention, are shown and described
in sufficient detail to enable those skilled in the art to implement or practice the
invention. Thus, where appropriate to avoid obscuring the invention, the description
may omit certain information known to those of skill in the art.
Exemplary Information-Retrieval System
[0012] Figure 1 shows an exemplary online information-retrieval system 100. System 100 includes
one or more databases 110, one or more servers 120, and one or more access devices
130.
[0013] Databases 110 include a set of one or more databases. In the exemplary embodiment,
the set includes a caselaw database 111, an expert witness directory 112, professional
directories or licensing databases 113, a verdict and settlement database 114, an
articles database 115, a court-filings database 116, and other databases 117.
[0014] Caselaw database 111 generally includes electronic text and image copies of judicial
opinions for decided cases for one or more local, state, federal, or international
jurisdictions. Expert witness directory 112, which is defined in accord with one or
more aspects of the present invention, includes one or more records or database structures,
such as structure 1121. Structure 1121 includes an expert identifier portion 1121A,
which is logically associated with one or more directory documents or entries 1121B,
one or more verdict documents or entries 1121C, and one or more articles 1121D. Some
embodiments logically associate the expert identifier with court-filing documents,
such as briefs and expert reports, and/or other documents.
[0015] Professional directories or licensing databases 113 include professional licensing
data from one or more state, federal, or international licensing authorities. In the
exemplary embodiment, this includes legal, medical, engineering, and scientific licensing
or credentialing authorities. Verdict and settlement database 114 includes electronic
text and image copies of documents related to the determined verdict, assessed damages,
or negotiated settlement of legal disputes associated with cases within caselaw database
111. Articles database 115 includes articles from technical, medical, professional, scientific,
or other scholarly or authoritative journals and authoritative trade publications. Some
examples include patent publications. Court-filings database 116 includes electronic
text and image copies of court filings related to one or more subsets of the judicial
opinions in caselaw database 111. Exemplary court-filing documents include briefs, motions,
complaints, pleadings, and discovery matter. Other databases 117 include one or more other
databases containing documents regarding news stories, business and finance, science
and technology, medicine and bioinformatics, and intellectual property information.
In some examples, the logical relationships across documents are determined manually
or using automatic discovery processes that leverage information such as litigant
identities, dates, jurisdictions, attorney identities, court dockets, and so forth
to determine the existence or likelihood of a relationship between any pair of documents.
[0016] Databases 110, which take the exemplary form of one or more electronic, magnetic,
or optical data-storage devices, include or are otherwise associated with respective
indices (not shown). Each of the indices includes terms and/or phrases in association
with corresponding document addresses, identifiers, and other information for facilitating
the functionality described below. Databases 112, 114, and 116 are coupled or couplable
via a wireless or wireline communications network, such as a local-, wide-, private-,
or virtual-private network, to server 120.
[0017] Server 120 is generally representative of one or more servers for serving
data in the form of web pages or other markup language forms with associated applets,
ActiveX controls, remote-invocation objects, or other related software and data structures
to service clients of various "thicknesses." More particularly, server 120 includes
a processor 121, a memory 122, a subscription database 123, one or more search engines
124, and an interface module 125.
[0018] Processor 121, which is generally representative of one or more local or distributed
processors or virtual machines, is coupled to memory 122. Memory 122, which takes
the exemplary form of one or more electronic, magnetic, or optical data-storage devices,
stores subscription database 123, search engines 124, and interface module 125.
[0019] Subscription database 123 includes subscriber-related data for controlling, administering,
and managing pay-as-you-go or subscription-based access of databases 110.
[0020] Search engines 124 provide Boolean or natural-language search capabilities for databases
110.
[0021] Interface module 125, among other things, defines one or more portions of a graphical
user interface that helps users define searches for databases 110. Interface module 125 includes
one or more browser-compatible applets, webpage templates, user-interface elements,
objects or control features or other programmatic objects or structures. More specifically,
interface module 125 includes a search interface 1251 and a results interface 1252.
[0022] Server 120 is communicatively coupled or couplable via a wireless or wireline communications
network, such as a local-, wide-, private-, or virtual-private network, to one or
more access devices, such as access device 130.
[0023] Access device 130 is not only communicatively coupled or couplable to server 120,
but also generally representative of one or more access devices. In the exemplary
embodiment, access device 130 takes the form of a personal computer, workstation,
personal digital assistant, mobile telephone, or any other device capable of providing
an effective user interface with a server or database.
[0024] Specifically, access device 130 includes one or more processors (or processing circuits)
131, a memory 132, a display 133, a keyboard 134, and a graphical pointer or selector
135. Memory 132 stores code (machine-readable or executable instructions) for an operating
system 136, a browser 137, and a graphical user interface (GUI) 138. In the exemplary
embodiment, operating system 136 takes the form of a version of the Microsoft Windows
operating system, and browser 137 takes the form of a version of Microsoft Internet
Explorer. Operating system 136 and browser 137 not only receive inputs from keyboard
134 and selector (or mouse) 135, but also support rendering of GUI 138 on display
133. Upon rendering, GUI 138 presents data in association with one or more interactive
control features (or user-interface elements).
[0025] (The exemplary embodiment defines one or more portions of interface 138 using applets
or other programmatic objects or structures from server 120.)
[0026] Specifically, graphical user interface 138 defines or provides one or more display
control regions, such as a query region 1381, and a results region 1382. Each region
(or page in some embodiments) is respectively defined in memory to display data from
databases 110 and/or server 120 in combination with one or more interactive control
features (elements or widgets). In the exemplary embodiment, each of these control
features takes the form of a hyperlink or other browser-compatible command input.
[0027] More specifically, query region 1381 includes interactive control features, such
as a query input portion 1381A for receiving user input at least partially defining
a profile query and a query submission button 1381B for submitting the profile query
to server 120 for data from, for example, experts database 112.
[0028] Results region 1382, which displays search results for a submitted query, includes
a results listing portion 1382A and a document display portion 1382B. Listing portion
1382A includes control features 2A1 and 2A2 for accessing or retrieving one or more
corresponding search result documents, such as professional profile data and related
documents, from one or more of databases 110, such as expert database 112, via server
120. Each control feature includes a respective document identifier or label, such
as EXP 1 or EXP 2, identifying respective name and/or city, state, and subject-matter
expertise data for the corresponding expert or professional.
[0029] Display portion 1382B displays at least a portion of the full text of a first displayed
or user-selected one of the profiles identified within listing portion 1382A, EXP
2 in the illustration. (Some embodiments present regions 1382A and 1382B as selectable
tabbed regions.) Portion 1382B also includes features 2B1, 2B2, 2B3, and 2B4. User
selection of feature 2B1 initiates retrieval and display of the profile text for
the selected expert, EXP 2; selection of feature 2B2 initiates retrieval and display
of licensing data for any licenses or other credentials held by the selected expert
or professional; selection of feature 2B3 initiates retrieval and display of verdict
data related to the expert or professional; and selection of feature 2B4 initiates
retrieval and display of articles (from database 115) that are related to, for example
authored by, the expert or professional. Other embodiments include additional control
features for accessing court-filing documents, such as briefs, and/or expert reports authored
by the expert or professional, or even deposition and trial transcripts where the
expert or professional was a participant. Still other examples provide control features
for initiating an Internet search based on the selected expert and other data and
for filtering the results of such a search based on the profile of the expert or professional.
Exemplary Methods of Operation
[0030] Figure 2 shows a flow chart 200 of one or more exemplary methods of operating an
information-management system, such as system 100. Flow chart 200 includes blocks
210-290, which are arranged and described in a serial execution sequence in the exemplary
embodiment. However, other examples execute two or more blocks in parallel using multiple
processors or processor-like devices or a single processor organized as two or more
virtual machines or sub processors. Other examples also alter the process sequence
or provide different functional partitions to achieve analogous results. For example,
some examples may alter the client-server allocation of functions, such that functions
shown and described on the server side are implemented in whole or in part on the
client side, and vice versa. Moreover, still other examples implement the blocks as
two or more interconnected hardware modules with related control and data signals
communicated between and through the modules. Thus, this exemplary process flow (and
others in this description) applies to software, hardware, and firmware implementations.
[0031] Block 210 entails presenting a search interface to a user. In the exemplary embodiment,
this entails a user directing a browser in a client access device to an internet-protocol
(IP) address for an online information-retrieval system, such as the Westlaw system,
and then logging onto the system. Successful login results in a web-based search interface,
such as interface 138 in Figure 1 (or one or more portions thereof), being output from
server 120, stored in memory 132, and displayed by client access device 130.
Execution then advances to block 220.
[0032] Block 220 entails receipt of a query. In the exemplary embodiment, the query defines
one or more attributes of an entity, such as a person or professional. In some embodiments,
the query string includes a set of terms and/or connectors, and in other embodiments
it includes a natural-language string. Also, in some examples the set of target databases
is defined automatically or by default based on the form of the system or search interface.
Figures 3 and 4 show alternative search interfaces 300 and 400 which one or more embodiments
use in place of interface 138 in Figure 1. Execution continues at block 230.
[0033] Block 230 entails presenting search results to the user via a graphical user interface.
In the exemplary embodiment, this entails the server, or components under server control
or command, executing the query against one or more of databases 110, for example,
expert database 112, and identifying documents, such as professional profiles, that
satisfy the query criteria. A listing of results is then presented or rendered as
part of a web-based interface, such as interface 138 in Figure 1 or interface 500
in Figure 5. Execution proceeds to block 240.
[0034] Block 240 entails presenting additional information regarding one or more
of the listed professionals. In the exemplary embodiment, this entails receiving
a request in the form of a user selection of one or more of the professional profiles
listed in the search results. These additional results may be displayed as shown in
interface 138 in Figure 1 or respective interfaces 600, 700, and 800 in Figures 6,
7, and 8. Interface 600 shows a listing of links 610 and 620 for additional information
related to the selected professional. As shown in Figure 7, selection of link 610
initiates retrieval and display of a verdict document (or in some cases a list of associated
verdict documents) in interface 700. And, as shown in Figure 8, selection of link
620 initiates retrieval and display of an article (or in some cases a list of articles)
in interface 800.
Exemplary Method of Building Expert Directory
[0035] In Figure 9, flow chart 900 shows an exemplary method of building an expert directory
or database such as used in system 100. Flow chart 900 includes blocks 910-960.
[0036] At block 910, the exemplary method begins with extraction of entity reference records
from text documents. In the exemplary embodiment, this entails extracting entity references
from approximately 300,000 jury verdict settlement (JVS) documents using finite state
transducers. JVS documents have a consistent structure that includes an expert witness
section or paragraph, such as that exemplified in Table 1.
Table 1: Expert Witness Section of Jury Verdicts and Settlements (JVS) Document
| EXPERTS: |
| Plaintiff: |
| Neal Benowitz MD, pharmacologist, UCSF Medical Center, San Francisco. David M. Burns,
pulmonologist, UC San Diego, Div. of Pulmonary and Critical Care Medicine, La Jolla. |
| Defendant: |
| Jerry Whidby PhD., chemist, Philip Morris Co., Richmond, VA. |
[0037] The exemplary embodiment uses a parsing program to locate expert-witness paragraphs
and find lexical elements (that is, terms used in this particular subject area) pertaining
to an individual. These lexical elements include name, degree, area of expertise,
organization, city, and state. Parsing a paragraph entails separating it into sentences,
and then parsing each element using a separate or specific finite state transducer.
The following example displays regular expressions from the finite state transducer
used for the organization element. (Variables are prefixed by $.)
$ORG = ($UNIVERSITY | $COMPANY|$FIRM...)
$UNIVERSITY = ($UNIVERSITY1 | $UNIVERSITY2)
$UNIVERSITY1= (University|College..) (of) [A-Z][a-z]+
$UNIVERSITY2= ([A-Z][a-z]+ $SPACE)+ (University|College..)
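The organization patterns above can be approximated with ordinary regular expressions. The following is a minimal sketch only, not the transducer used in the exemplary embodiment: the two-word vocabulary in HEAD stands in for the elided list in the source, and the names HEAD, WORD, and find_universities are hypothetical.

```python
import re

# Assumption: only "University" and "College" are known; the source elides the rest.
HEAD = r"(?:University|College)"
WORD = r"[A-Z][a-z]+"

# Counterpart of $UNIVERSITY1, e.g. "University of Minnesota"
UNIVERSITY1 = rf"{HEAD} of {WORD}"
# Counterpart of $UNIVERSITY2, e.g. "Boston College"
UNIVERSITY2 = rf"(?:{WORD} )+{HEAD}"
UNIVERSITY = re.compile(rf"\b(?:{UNIVERSITY1}|{UNIVERSITY2})\b")


def find_universities(sentence: str) -> list[str]:
    """Return every substring of the sentence that matches either pattern."""
    return [m.group(0) for m in UNIVERSITY.finditer(sentence)]
```

A sentence from an expert-witness paragraph would be passed to find_universities, and any match would populate the organization field of the reference record.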
[0038] Typically one expert is listed in a sentence along with his or her area of expertise
and other information. If more than one expert is mentioned in a sentence, area of
expertise and other elements closest to the name are typically associated with that
name. Each JVS document generally lists only one expert witness; however, some expert
witnesses are referenced in more than one JVS document. Table 2 shows an example of
an entity reference record.
Table 2: Structured Expert-Witness Reference Record Created by Regular Expression
Parsers
| fname | ARTHUR |
| mname | |
| lname | ABLIN |
| suffix | |
| degree | MD |
| org | |
| Expertise | Pediatric hematology/oncology |
| city | SAN FRANCISCO |
| state | CA |
[0039] Once the entity reference records are defined, execution continues at block 920.
[0040] Block 920 entails defining profile records from the entity reference records. In
the exemplary embodiment, defining the profile records entails merging expert-witness
reference records that refer to the same person to create a unique expert-witness
profile record for the expert. To this end, the exemplary embodiment sorts the reference
records by last name to define a number of last-name groups. Records within each last-name
group are then processed by selecting an unmerged expert reference record and creating
a new expert profile record from this selected record. The new expert profile record
is then marked as unmerged and compared to each unmerged reference record in the group
using Bayesian matching to compute the probability that the expert in the profile
record is the same individual referenced in the reference record. If the computed match
probability exceeds a match threshold, the reference record is marked as "merged." If unmerged
records remain in the group, the cycle is repeated.
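The grouping and merging cycle can be sketched as follows. This is an illustration under stated assumptions, not the exemplary implementation: match_probability here is a simple field-agreement stand-in for the Bayesian matcher, the 0.8 threshold is arbitrary, and the rule that a merged record fills empty profile fields is a hypothetical choice that the description does not specify.

```python
from collections import defaultdict


def match_probability(profile: dict, ref: dict) -> float:
    """Stand-in scorer (NOT Bayesian): share of fields present in both records that agree."""
    keys = [k for k in profile if profile[k] and ref.get(k)]
    if not keys:
        return 0.0
    return sum(profile[k] == ref[k] for k in keys) / len(keys)


def build_profiles(refs: list[dict], threshold: float = 0.8) -> list[dict]:
    # Group the reference records by last name.
    groups = defaultdict(list)
    for ref in refs:
        groups[ref["lname"]].append(ref)

    profiles = []
    for lname in sorted(groups):
        unmerged = groups[lname]
        while unmerged:
            # Seed a new profile record from one unmerged reference record.
            profile = dict(unmerged.pop(0))
            remaining = []
            for ref in unmerged:
                if match_probability(profile, ref) > threshold:
                    # "Merged": copy values into fields the profile lacks (assumed rule).
                    for key, value in ref.items():
                        if value and not profile.get(key):
                            profile[key] = value
                else:
                    remaining.append(ref)
            unmerged = remaining  # repeat while unmerged records remain
            profiles.append(profile)
    return profiles
```

Grouping by last name keeps the number of pairwise comparisons small, which is also why a misspelled last name can leave duplicate profiles behind.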
[0041] Note that it is still possible for duplicate records to reside in the profile file
if two or more reference records pertain to one individual (for example, because of
a misspelled last name). To address this possibility, a final pass is made over the
merged profile file, and record pairs are flagged for manual review. Table 3 shows
an exemplary expert profile record created from expert reference records.
Table 3: Expert Profile Record Created from Expert Reference Records
| fname | ARTHUR |
| mname | |
| lname | ABLIN |
| suffix | |
| degree | MD |
| org | |
| expertise | Pediatric hematology/oncology |
| Subcat 1 | |
| Subcat 2 | |
| Subcat 3 | |
| category | |
| address | |
| city | SAN FRANCISCO |
| state | CA |
[0042] Block 930 entails adding additional information to the expert reference records.
In the exemplary embodiment, this entails harvesting information from other databases
and sources, such as from professional licensing authorities, telephone directories,
and so forth. References to experts in JVS documents, the original entity record source
in this embodiment, often have little or no location information for experts, whereas
professional license records typically include the expert's full name, and the full
current home and/or business address, making them a promising source for additional
data.
[0043] One exemplary licensing authority is the Drug Enforcement Administration, which licenses
health-care professionals to prescribe drugs.
[0044] In determining whether a harvested license record (analogous to a reference record)
and an expert record refer to the same person, the exemplary embodiment computes a Bayesian
match probability based on first name, middle name, last name, name suffix, city-state
information, area of expertise, and name rarity. If the match probability meets or
exceeds a threshold probability, one or more elements of information from the harvested
license record are incorporated into the expert reference record. If the threshold
criterion is not met, the harvested license record is stored in a database for merger
consideration with later added or harvested records.
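One common way to compute such a probability is to multiply prior odds by a likelihood ratio for each piece of evidence, treating the pieces as independent. The sketch below assumes that formulation; the description does not give the exemplary network's structure or parameters. Every number in EVIDENCE, the prior, and the 0.9 threshold are hypothetical, and name rarity, middle name, suffix, and expertise are omitted for brevity.

```python
# Hypothetical likelihoods per field:
# (P(values agree | same person), P(values agree | different people))
EVIDENCE = {
    "fname": (0.95, 0.05),
    "lname": (0.98, 0.01),
    "city": (0.80, 0.02),
    "state": (0.90, 0.10),
}


def match_probability(a: dict, b: dict, prior: float = 0.01) -> float:
    """Posterior probability that records a and b describe the same person."""
    odds = prior / (1 - prior)
    for field, (p_same, p_diff) in EVIDENCE.items():
        x, y = a.get(field), b.get(field)
        if not x or not y:
            continue  # a missing value contributes no evidence
        if x == y:
            odds *= p_same / p_diff
        else:
            odds *= (1 - p_same) / (1 - p_diff)
    return odds / (1 + odds)


def enrich(profile: dict, license_rec: dict, threshold: float = 0.9) -> bool:
    """Copy missing fields from the license record when the match meets the threshold."""
    if match_probability(profile, license_rec) < threshold:
        return False  # caller may store the license record for later consideration
    for key, value in license_rec.items():
        if value and not profile.get(key):
            profile[key] = value
    return True
```

With these illustrative numbers, agreement on all four fields yields a probability well above the threshold, while disagreement on first name and city pulls it far below.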
[0045] (Some embodiments perform an extraction procedure on the supplemental data similar
to that described at block 910 to define reference records, which are then sent as
a set for merger processing as in block 920 with the expert reference records.)
Table 4: Expert-Profile Record in which Middle Name, Address, and ZIP-code Fields
Filled or Harvested from Professional License Record
| fname | ARTHUR |
| mname | R |
| lname | ABLIN |
| suffix | |
| degree | MD |
| org | |
| Expertise | Pediatric hematology/oncology |
| Subcat 1 | pediatrics |
| Subcat 2 | Blood & plasma |
| Subcat 3 | oncology |
| category | Medical & surgical |
| address | 43 Culloden Pk Road |
| city | SAN FRANCISCO |
| state | CA |
| Zip | 94901 |
[0046] Block 940 entails categorizing expert profiles by area of expertise. In the exemplary
embodiment, each expert witness record is assigned one or more classification categories
in an expertise taxonomy. Categorization of the entity records allows users to browse
and search expert witness (or other professional) profiles by area of expertise. To
map an expert profile record to an expertise subcategory, the exemplary embodiment
uses an expertise categorizer and a taxonomy that contains top-level categories and
subcategories.
[0047] The exemplary taxonomy includes the following top-level categories: Accident & Injury;
Accounting & Economics; Computers & Electronics; Construction & Architecture; Criminal,
Fraud and Personal Identity; Employment & Vocational; Engineering & Science; Environmental;
Family & Child Custody; Legal & Insurance; Medical & Surgical; Property & Real Estate;
Psychiatry & Psychology; Vehicles, Transportation, Equipment & Machines. Each category
includes one or more subcategories. For example, the "Accident & Injury" category
has the following subcategories: Aerobics, Animals, Apparel, Asbestos, Boating, Bombing,
Burn/Thermal, Child Care, Child Safety, Construction, Coroner, Cosmetologists/Beauticians/Barbers/Tattoos,
Dog Bites, Entertainment, and Exercise.
[0048] Assignment of subject-matter categories to an expert profile record entails using
a function that maps a professional descriptor associated with the expert to a leaf
node in the expertise taxonomy. This function is represented with the following equation:
T = f(S)
where T denotes a set of taxonomy nodes, and S is the professional descriptor. The
exemplary function f uses a lexicon of 500 four-character sets that map professional
descriptors to expertise areas. For example, experts having the "onco" professional
descriptor are categorized
to the oncology specialist, oncologist, and pediatric oncologist subcategories. Other
taxonomies are also feasible. The exemplary embodiment allows descriptors to map to
more than one expertise area (that is, category or subcategory) in the taxonomy. For
example, "pediatric surgeon" can be mapped to both the "pediatrics" node and "surgery"
nodes. Table 5 shows an example of an expert profile record in which the expertise
field has been mapped to the category "Medical & Surgical" and to the subcategories
"pediatrics," "blood & plasma," and "oncology."
Table 5: Expert Profile Record with Expertise Area Mapped to "Medical & Surgical"
| fname | ARTHUR |
| mname | |
| lname | ABLIN |
| suffix | |
| degree | MD |
| org | |
| Expertise | Pediatric hematology/oncology |
| Subcat 1 | pediatrics |
| Subcat 2 | Blood & plasma |
| Subcat 3 | oncology |
| category | Medical & surgical |
| address | |
| city | SAN FRANCISCO |
| state | CA |
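A lexicon-driven categorizer of the kind described for block 940 might look like the sketch below. It assumes that each four-character key is matched as a substring of the lower-cased descriptor; the four LEXICON entries are illustrative stand-ins for the 500-entry lexicon, and the names LEXICON and categorize are hypothetical.

```python
# Illustrative entries only: four-character key -> set of (category, subcategory) nodes.
LEXICON = {
    "onco": {("Medical & Surgical", "oncology")},
    "pedi": {("Medical & Surgical", "pediatrics")},
    "hema": {("Medical & Surgical", "blood & plasma")},
    "surg": {("Medical & Surgical", "surgery")},
}


def categorize(descriptor: str) -> set[tuple[str, str]]:
    """Return every taxonomy node whose key occurs in the professional descriptor."""
    text = descriptor.lower()
    nodes: set[tuple[str, str]] = set()
    for key, targets in LEXICON.items():
        if key in text:  # assumed matching rule: plain substring test
            nodes |= targets
    return nodes
```

Returning a set rather than a single node reflects the point that one descriptor can map to more than one expertise area.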
[0049] Block 950 entails associating one or more text documents and/or additional data sets
with one or more of the professional profiles. To this end, the exemplary embodiment
logically associates or links one or more JVS documents and/or Medline articles to
expert-witness profile records using Bayesian-based record matching. Table 6 shows
a sample Medline article.
Table 6: Sample Text from Medline Article
| TITLE: Functional and clinical outcomes of limb-sparing therapy for pediatric extremity
sarcomas. |
| AUTHORS: Bertucio C S; Wara W M; Matthay K K; Ablin A R; Johnston J O; O'Donnell R J; Weinberg V; Haas-Kogan D A |
| Department of Radiation Oncology, University of California-San Francisco, 505 Parnassus
Avenue, San Francisco, CA 94143-0226. USA. |
| JOURNAL: International journal of radiation oncology, biology, physics (United States) |
| DATE: Mar 1 2001. |
[0050] To link JVS documents and Medline abstracts to expert profile records, expert-reference
records are extracted from the articles using one or more suitable parsers
and matched to profile records using a Bayesian inference network similar
to the profile-matching technology described previously. For JVS documents, the Bayesian
network computes match probabilities using seven pieces of match evidence: last name,
first name, middle name, name suffix, location, organization, and area of expertise.
For Medline articles, the match probability is based additionally on name rarity,
as described in the previously mentioned Dozier patent application.
[0051] Figure 10 shows a flow chart 1000 of an exemplary method of growing and maintaining
one or more entity directories, such as the expert database that is used in system
100. Flow chart 1000 includes process blocks 1010-1050.
[0052] At block 1010, the exemplary method begins with receipt of a document. In the exemplary
embodiment, this entails receipt of an unmarked document, such as a judicial opinion
or brief. However, other embodiments receive and process other types of documents.
Execution then advances to block 1020.
[0053] Block 1020 entails determining the type of document. The exemplary embodiment uses
one or more methods for determining document type, for example, looking for particular
document format and syntax and/or keywords to differentiate among a set of types.
In some embodiments, type can be inferred from the source of the document. Incoming
content types, such as case law, jury verdicts, law reviews, briefs, etc., have a
variety of grammar, syntax, and structural differences. After type (or document description)
is determined, execution continues at block 1030.
[0054] Block 1030 entails extracting one or more entity reference records from the received
document based on the determined type of the document. In the exemplary embodiment,
four types of entity records are extracted: personal names, such as attorneys, judges,
expert witnesses; organizational names, such as firms and companies; product names,
such as drugs and chemicals; and fact profiles ("vernacular" of subject area). Specialized
or configurable parsers (finite state transducers), which are selected or configured
on the basis of the determined document type and the entity record being built, identify
and extract entity information for each type of entity.
[0055] Parsers extract information by specifically searching for a named entity (person,
address, company, etc.) or by relationships between entities. Parser text-extraction
is based on the data's input criteria. For example, the more structured (tagged) data
enables a "tighter" set of rules to be built within a parser. This set of rules allows
more specific information to be extracted about a particular entity. A more "free"
data collection, such as a web site, is not as conducive to rule-based parsers. A
collection could also include a combination of structured, semi-structured, and free
data. More specifically, parsers are developed through "regular-expression" methods.
The regular expressions serve as "rules" for parsers to find entity types and categories
of information.
[0056] Block 1040 attempts to link or logically associate each extracted entity reference
record with one or more existing authority directories. In the exemplary embodiment,
this entails computing a Bayesian match probability for each extracted entity reference
and one or more corresponding candidate records in corresponding directories (or databases)
that have been designated as authoritative in terms of accepted accuracy. If the match
probability satisfies the match criteria, the records are merged or associated and the
input document is marked accordingly. Execution then proceeds to block 1050.
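A Bayesian match probability of this kind can be sketched with Bayes' rule over field-by-field agreement. The example below is an assumption-laden illustration, not the formula of the exemplary embodiment: it treats fields as conditionally independent, and the field names, likelihood values, and prior are hypothetical.

```python
from typing import Dict, Tuple

# For each field: (P(values agree | same entity), P(values agree | different entities)).
# These numbers are illustrative assumptions only.
FIELD_LIKELIHOODS: Dict[str, Tuple[float, float]] = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "city": (0.80, 0.10),
    "state": (0.90, 0.20),
}


def match_probability(
    reference: Dict[str, str], candidate: Dict[str, str], prior: float = 0.01
) -> float:
    """Posterior probability that two records describe the same entity."""
    p_match = prior
    p_nonmatch = 1.0 - prior
    for field, (agree_if_match, agree_if_nonmatch) in FIELD_LIKELIHOODS.items():
        a, b = reference.get(field), candidate.get(field)
        if not a or not b:
            continue  # a missing field contributes no evidence either way
        if a.strip().lower() == b.strip().lower():
            p_match *= agree_if_match
            p_nonmatch *= agree_if_nonmatch
        else:
            p_match *= 1.0 - agree_if_match
            p_nonmatch *= 1.0 - agree_if_nonmatch
    # Bayes' rule: normalize over the two hypotheses (match, non-match).
    return p_match / (p_match + p_nonmatch)
```

Under these assumed values, records agreeing on all four fields score above 0.99, while a disagreement on last name alone pulls the probability below 0.5; the result would then be compared against whatever match criteria the system applies.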
[0057] Block 1050 entails enriching unmatched entity reference records using a matching
process. In the exemplary embodiment, this enriching process entails operating specific
types of data harvesters on the web, other databases, and other directories or lists,
to assemble a cache of new relevant profile information for databases, such as expert
database 112 in Figure 1. The unmatched or unmarked entity records are then matched
against the harvested entity records using Bayesian matching. Those that satisfy the
match criteria are referred to a quality control process for verification or confirmation
prior to addition to the relevant entity directory. The quality control process may
be manual, semi-automatic, or fully automatic. For example, some embodiments base
the type of quality control on the degree to which the match criteria are exceeded.
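Basing the type of quality control on the degree to which the match criteria are exceeded could look like the following sketch. The threshold and margin values, and the idea of expressing the criteria as a single probability threshold, are assumptions made for illustration.

```python
def quality_control_mode(
    match_probability: float,
    threshold: float = 0.90,
    auto_margin: float = 0.08,
    semi_margin: float = 0.03,
) -> str:
    """Pick a quality-control mode from how far a match exceeds the threshold.

    All numeric defaults are hypothetical; a real system would tune them.
    """
    if match_probability < threshold:
        return "unmatched"  # match criteria not satisfied, nothing to verify
    margin = match_probability - threshold
    if margin >= auto_margin:
        return "automatic"  # very strong match: accept without human review
    if margin >= semi_margin:
        return "semi-automatic"  # moderate match: spot check or assisted review
    return "manual"  # borderline match: full human verification
```

The weaker the margin, the more human involvement is requested before a record is added to the relevant entity directory.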
[0058] In some examples, block 1050 operates in parallel with blocks 1010-1040, continually
retrieving new entity-related data using any number of web crawlers, relational databases,
or CDs, and attempting to build new entity records.
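The parallel operation described above can be pictured as a harvesting thread that fills a cache while document processing proceeds. This is a minimal sketch, assuming in-memory lists in place of web crawlers and databases and a trivial stand-in for the work of blocks 1010-1040; all function names are hypothetical.

```python
import queue
import threading
from typing import Iterable, List, Tuple

_DONE = object()  # sentinel marking the end of harvested data


def harvest(sources: Iterable[str], out: "queue.Queue") -> None:
    """Stand-in for block 1050: push harvested records onto a shared queue."""
    for record in sources:
        out.put(record)
    out.put(_DONE)


def run_in_parallel(
    documents: List[str], sources: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Process documents while a background thread harvests new records."""
    harvested: "queue.Queue" = queue.Queue()
    worker = threading.Thread(target=harvest, args=(sources, harvested))
    worker.start()

    # Stand-in for blocks 1010-1040 (receive, classify, extract, link).
    processed = [doc.strip() for doc in documents]

    # Drain the harvested records into a cache for later matching.
    cache: List[str] = []
    while True:
        item = harvested.get()
        if item is _DONE:
            break
        cache.append(item)
    worker.join()
    return processed, cache
```

A long-running deployment would keep the harvester alive rather than ending it with a sentinel; the sentinel simply keeps this example finite.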
Claims
1. A system comprising:
means (910) for extracting entity reference data for at least one person from each
of a plurality of documents to form entity reference records;
means (920) for forming at least one entity profile record by merging at least one
of the entity reference records for a person with at least one other entity reference
record for the same person by:
sorting the entity reference records by last name;
selecting an unmerged entity reference record and creating an entity profile record
from the selected unmerged entity reference record; and
analyzing the unmerged entity reference record for determining a probability that
a person in an entity profile record is the same person as referenced in the selected
unmerged entity reference record;
means (940) for categorizing at least one of the entity profile records based on a
taxonomy; and
means (950) for defining links between at least one of the entity profile records
and other documents or data sets.
2. The system of claim 1, further comprising:
graphical user interface means (138) for defining a query related to an entity, for
viewing at least one document resulting from the query, for selecting at least one
of the defined links within a legal, financial, healthcare, scientific, or educational
document, and for causing retrieval and display of at least a portion of the one of
the entity profile records.
3. The system of claim 1 or claim 2, wherein at least one of the recited means includes
one or more processors, computer-readable medium, display devices, and network communications,
with the computer-readable medium including coded instructions and data structures.
4. The system of any preceding claim:
wherein the at least one other entity reference record is contained in a database
(100);
wherein the means for forming at least one entity profile record may fail to merge
at least one of the entity reference records with at least one other entity reference
record in the database; and
wherein the system further comprises:
means, responsive to a failure to merge at least one of the entity reference records
with at least one of the other entity reference records, for attempting to match each
of the at least one entity reference record to a set of harvested entity reference
records outside the database; and
means, responsive to a match of at least one of the entity reference records to at
least one of the harvested entity reference records, for merging the records and adding
them to the database.
5. The system of any preceding claim, wherein the documents comprise jury verdict settlement
documents.
6. The system of claim 5, wherein the means for extracting entity records comprises finite
state transducers.
7. The system of any preceding claim, wherein the means for extracting at least one of
the entity reference records includes means for identifying name, educational degree,
area of expertise, organization, city, and state.
8. The system of claim 4, wherein the means for attempting to match at least one of the
entity reference records to at least one of the harvested entity reference records
includes means for computing a Bayesian match probability.
9. The system of any preceding claim:
wherein each of the entity reference records references a person; and
wherein the means for categorizing at least one of the defined entity records based
on a taxonomy is adapted to automatically categorize each entity reference record
to an expertise taxonomy.
10. The system of any preceding claim, wherein the means for automatically extracting entity reference
records is adapted to perform extraction based on document type.
11. A method comprising:
extracting (910) entity reference data for at least one person from each of a plurality
of documents to form entity reference records;
forming (920) at least one entity reference profile by merging at least one of the
entity reference records for a person with at least one other entity reference record
for the same person by:
sorting the entity reference records by last name;
selecting an unmerged entity reference record and creating an entity profile record
from the selected unmerged entity reference record; and
analyzing the unmerged entity reference record for determining a probability that
a person in an entity profile record is the same person as referenced in the selected
unmerged entity reference record;
automatically categorizing (940) at least one of the entity profile records based
on an expertise taxonomy; and
defining links (950) between at least one of the entity profile records and other
documents or data sets.
12. The method of claim 11, further comprising:
receiving a query (210) related to an entity, displaying (230) one or more documents
resulting from the query, receiving a selection of one or more of the defined links
within a legal, financial, healthcare, scientific, or educational document; and retrieving
and displaying (240) at least a portion of the at least one entity profile record.
13. The method of claim 11 or claim 12,
wherein the at least one other entity reference record is contained in a database (100); wherein
at least one of the entity reference records may not be merged with at
least one other entity reference record in the database; and
wherein the method further comprises:
in response to a failure to merge at least one of the entity reference records with
at least one of the other entity reference records, attempting to match each of the
at least one entity reference record to a set of harvested entity reference records
outside the database; and
in response to a match of at least one of the entity reference records to at least one
of the harvested entity reference records, merging the matched records and adding
them to the database.
14. A carrier medium carrying computer readable code for controlling a computer to carry
out the method of any one of claims 11 to 13.