CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to Chinese Patent Application No.
201410513189.4, filed with the Chinese Patent Office on September 29, 2014 and entitled "METHOD
AND DEVICE FOR PARSING QUESTION IN KNOWLEDGE BASE", which is incorporated herein by
reference in its entirety.
TECHNICAL FIELD
[0002] Embodiments of the present invention relate to the field of communications, and more
specifically, to a method and a device for parsing a question in a knowledge base.
BACKGROUND
[0003] A knowledge base (Knowledge Base, KB) is a structured, organized, and comprehensive knowledge cluster that is easy to operate and easy to use in knowledge engineering. It is a set of interlinked knowledge fragments that are stored, organized, managed, and used in computer storage in one or more knowledge representation forms, according to the question answering requirements of one or more fields.
[0004] Currently, a large quantity of knowledge resources and knowledge communities have
emerged on the Internet, for example, Wikipedia (Wikipedia),
http://baike.baidu.com/, and
http://www.baike.com/. From these knowledge resources, large-scale knowledge bases centering on entities
and entity relations have been mined through research. In addition, there are also
knowledge bases in some fields, for example, a weather knowledge base and a catering
knowledge base.
[0005] Knowledge base building has evolved from manual addition, by experts or by collective intelligence, to automatic acquisition oriented to the entire Internet by using machine learning and information extraction technologies. Earlier knowledge bases were built manually by experts, for example, WordNet, CYC, CCD, HowNet, and Encyclopedia of China. However, with the development of information technologies, the disadvantages of conventional manually built knowledge bases, such as small scale, a small amount of knowledge, and slow updates, have gradually been exposed. In addition, a deterministic knowledge framework built by experts cannot satisfy the requirements of large-scale computing in the noisy environment of the Internet. This is also one of the reasons why the CYC project ultimately failed. With the fast development of Web 2.0, a large quantity of collective intelligence-based web knowledge bases, including Wikipedia, http://baike.baidu.com/, and http://www.baike.com/, have emerged. Based on these network resources, many automatic and semi-automatic knowledge base building methods have been used to build large-scale available knowledge bases, such as YAGO, DBpedia, and Freebase.
[0006] Based on these knowledge bases, a knowledge base-based question answering (Knowledge-base-based Question Answering) system may be built. Compared with a retrieval-based question answering system, a knowledge base-based question answering system may have lower question coverage due to the limited scale of the knowledge base, but it has an inference capability, and in limited fields it may achieve higher accuracy. Therefore, knowledge base-based question answering systems have emerged as the times require; some have become independent applications, and some are used as enhanced functions of existing products, for example, Siri from Apple and Knowledge Graph from Google.
[0007] A question answering (Question Answering) system does not require a user to break a question down into keywords. Instead, the question is raised directly in natural language form; after the question answering system processes the question, an answer corresponding to the question is quickly searched out from a knowledge base or the Internet, and the answer itself, rather than a list of related web pages, is returned directly to the user. Therefore, the question answering system greatly reduces the difficulty of use for the user, and it is more convenient and efficient than conventional search engines based on keyword retrieval and semantic search technologies.
[0008] Evaluation campaigns of question answering over linked data (Question Answering over Linked Data, QALD) have promoted the development of question answering systems. An objective of QALD is to convert a natural language question into a structured query in the Simple Protocol and RDF (Resource Description Framework) Query Language (SPARQL) over large-scale structured linked data, thereby establishing a friendly natural language query interface. Converting a natural language question into structured SPARQL depends on a conversion rule for the knowledge base. However, in current question answering systems, all conversion rules are configured manually, which not only consumes a huge amount of labor, but also provides poor field extensibility.
SUMMARY
[0009] Embodiments of the present invention provide a method for parsing a question based
on a knowledge base, where the method is field-independent, and it is unnecessary
to manually configure a conversion rule.
[0010] According to a first aspect, a method for parsing a question in a knowledge base
is provided and includes:
receiving a question entered by a user;
performing phrase detection on the question to determine first candidate phrases;
mapping the first candidate phrases to first resource items in the knowledge base,
where the first resource items have consistent semantic meanings with the first candidate
phrases;
determining values of observed predicates and possible question parse spaces according
to the first candidate phrases and the first resource items, where the observed predicates
are used to indicate features of the first candidate phrases, features of the first
resource items, and a relationship between the first candidate phrases and the first
resource items, points in the possible question parse spaces are proposition sets,
and truth or falsity of propositions in the proposition sets is represented by values
of hidden predicates;
performing uncertain inference on each proposition set in the possible question parse
spaces according to the values of the observed predicates and the values of the hidden
predicates, and calculating confidence of each proposition set;
acquiring a combination of true propositions in a proposition set whose confidence
satisfies a preset condition, where the true propositions are used to indicate search
phrases selected from the first candidate phrases, search resource items selected
from the first resource items, and features of the search resource items; and
generating a formal query statement according to the combination of true propositions.
[0011] With reference to the first aspect, in a first possible implementation manner of
the first aspect, the uncertain inference is based on a Markov logic network MLN,
where the MLN includes a predefined first-order formula and a weight of the first-order
formula.
[0012] With reference to the first aspect or the first possible implementation manner of
the first aspect, in a second possible implementation manner of the first aspect,
before the receiving a question entered by a user, the method further includes:
acquiring multiple natural language questions from the knowledge base;
performing phrase detection on the multiple natural language questions to determine
second candidate phrases of the multiple natural language questions;
mapping the second candidate phrases to second resource items in the knowledge base,
where the second resource items have consistent semantic meanings with the second
candidate phrases;
determining, according to the second candidate phrases and the second resource items,
values of observed predicates corresponding to the multiple natural language questions;
acquiring hand-labeled values of hidden predicates corresponding to the multiple natural
language questions; and
creating an undirected graph according to the values of the observed predicates corresponding
to the multiple natural language questions, the values of the hidden predicates corresponding
to the multiple natural language questions, and the first-order formula, and determining
the weight of the first-order formula through training.
[0013] With reference to the second possible implementation manner of the first aspect,
in a third possible implementation manner of the first aspect, the first-order formula
includes a Boolean formula and a weighted formula, a weight of the Boolean formula
is +∞, a weight of the weighted formula is a weighted formula weight, and the hand-labeled
values of the hidden predicates corresponding to the multiple natural language questions
satisfy the Boolean formula; and
the creating an undirected graph according to the values of the observed predicates
corresponding to the multiple natural language questions, the values of the hidden
predicates corresponding to the multiple natural language questions, and the first-order
formula, and determining the weight of the first-order formula through training includes:
creating the undirected graph according to the values of the observed predicates corresponding
to the multiple natural language questions, the values of the hidden predicates corresponding
to the multiple natural language questions, and the first-order formula, and determining
the weight of the weighted formula through training.
[0014] With reference to the second possible implementation manner of the first aspect,
in a fourth possible implementation manner of the first aspect, the creating an undirected
graph according to the values of the observed predicates corresponding to the multiple
natural language questions, the values of the hidden predicates corresponding to the
multiple natural language questions, and the first-order formula, and determining
the weight of the first-order formula through training includes:
creating the undirected graph according to the values of the observed predicates corresponding
to the multiple natural language questions, the values of the hidden predicates corresponding
to the multiple natural language questions, and the first-order formula, and determining
the weight of the first-order formula by using a margin infused relaxed algorithm
MIRA.
[0015] With reference to any possible implementation manner of the first aspect, in a fifth
possible implementation manner of the first aspect, the MLN is indicated by
M, the first-order formula is indicated by φ_i, the weight of the first-order formula is indicated by w_i, and the proposition set is indicated by y; and
the performing uncertain inference on each proposition set in the possible question parse spaces
according to the values of the observed predicates and the values of the hidden predicates,
and calculating confidence of each proposition set includes:
calculating the confidence of each proposition set according to

p(y) = (1/Z) · exp( Σ_i Σ_{c ∈ C_n^{φ_i}} w_i · f_c^{φ_i}(y) ),

where Z is a normalization constant, C_n^{φ_i} is a sub-formula set corresponding to the first-order formula φ_i, c is a sub-formula in the sub-formula set C_n^{φ_i}, f_c^{φ_i}(y) is a binary feature function, and f_c^{φ_i}(y) indicates truth or falsity of the first-order formula in the proposition set y.
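For illustration only, the following Python sketch shows how such a confidence can be computed once the ground formulas and their weights are available; the encoding of a proposition set as a dictionary of hidden-predicate groundings, and the example rule, are assumptions rather than part of the claimed method.

import math

# A proposition set y is encoded as a dict from ground hidden predicates
# to 0/1, for example {("hasphrase", 11): 1}.
def unnormalized_score(y, ground_formulas):
    # exp( sum over formulas i and sub-formulas c of w_i * f_c(y) )
    return math.exp(sum(w * f(y) for (w, f) in ground_formulas))

def confidence(candidates, ground_formulas):
    # Z normalizes over all proposition sets in the parse space.
    scores = [unnormalized_score(y, ground_formulas) for y in candidates]
    z = sum(scores)
    return [s / z for s in scores]

# Toy usage: two candidate parses and one weighted ground formula that
# rewards selecting candidate phrase 11.
y0 = {("hasphrase", 11): 0}
y1 = {("hasphrase", 11): 1}
rules = [(1.5, lambda y: y[("hasphrase", 11)])]
print(confidence([y0, y1], rules))  # the second parse gets higher confidence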
[0016] With reference to the first aspect or any possible implementation manner of the first
aspect, in a sixth possible implementation manner of the first aspect, the acquiring
a combination of true propositions in a proposition set whose confidence satisfies
a preset condition includes:
determining a proposition set whose confidence value is largest, and acquiring a combination
of true propositions in the proposition set whose confidence value is largest.
[0017] With reference to the first aspect or any possible implementation manner of the first
aspect, in a seventh possible implementation manner of the first aspect,
the features of the first candidate phrases include positions of the first candidate
phrases in the question, parts of speech of head words of the first candidate phrases,
and tags on a dependency path between every two of the first candidate phrases;
the features of the first resource items include types of the first resource items,
a correlation value between every two of the first resource items, and a parameter
matching relationship between every two of the first resource items;
the relationship between the first candidate phrases and the first resource items
includes prior matching scores between the first candidate phrases and the first resource
items; and
the determining values of observed predicates according to the first candidate phrases
and the first resource items includes:
determining the positions of the first candidate phrases in the question;
determining the parts of speech of the head words of the first candidate phrases by using a Stanford part-of-speech tagging tool;
determining the tags on the dependency path between every two of the first candidate phrases by using a Stanford dependency syntax parser tool;
determining the types of the first resource items from the knowledge base, where the
types are entity or class or relation;
determining the parameter matching relationship between every two of the first resource
items from the knowledge base;
using a similarity coefficient between every two of the first resource items as the
correlation value between every two of the first resource items; and
calculating the prior matching scores between the first candidate phrases and the
first resource items, where the prior matching scores are used to indicate probabilities
that the first candidate phrases are mapped to the first resource items.
[0018] With reference to the first aspect or any possible implementation manner of the first
aspect, in an eighth possible implementation manner of the first aspect, the formal
query statement is a Simple Protocol and Resource Description Framework Query Language
SPARQL.
[0019] With reference to the eighth possible implementation manner of the first aspect,
in a ninth possible implementation manner of the first aspect, the generating a formal
query statement according to the combination of true propositions includes:
generating the SPARQL according to the combination of true propositions by using a
SPARQL template.
[0020] With reference to the ninth possible implementation manner of the first aspect, in
a tenth possible implementation manner of the first aspect, the SPARQL template includes
an ASK WHERE template, a SELECT COUNT(?url) WHERE template, and a SELECT ?url WHERE
template; and
the generating the SPARQL according to the combination of true propositions by using
a SPARQL template includes:
when the question is a Yes/No question, generating the SPARQL according to the combination
of true propositions by using the ASK WHERE template;
when the question is a Normal question, generating the SPARQL according to the combination
of true propositions by using the SELECT ?url WHERE template; and
when the question is a Number question, generating the SPARQL according to the combination
of true propositions by using the SELECT ?url WHERE template, or when a numeric answer
cannot be obtained for the SPARQL generated by using the SELECT ?url WHERE template,
generating the SPARQL by using the SELECT COUNT(?url) WHERE template.
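For illustration only, the following Python sketch mirrors this template selection; the question-type labels, the triple patterns, and the has_numeric_answer check are hypothetical stand-ins for components described elsewhere in this application.

def generate_sparql(question_type, triples, has_numeric_answer=lambda q: True):
    body = " . ".join(triples)
    if question_type == "YesNo":
        return "ASK WHERE { %s }" % body
    if question_type == "Normal":
        return "SELECT ?url WHERE { %s }" % body
    if question_type == "Number":
        query = "SELECT ?url WHERE { %s }" % body
        if has_numeric_answer(query):
            return query
        # Fall back to counting when no numeric answer is obtained.
        return "SELECT COUNT(?url) WHERE { %s }" % body
    raise ValueError("unknown question type")

print(generate_sparql("Normal",
                      ["?url rdf:type dbo:Actor",
                       "?url dbo:birthPlace dbr:Berlin"]))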
[0021] With reference to the first aspect or any possible implementation manner of the first
aspect, in an eleventh possible implementation manner of the first aspect, the performing
phrase detection on the question to determine first candidate phrases includes: using
word sequences in the question as the first candidate phrases, where the word sequences
satisfy:
all consecutive non-stop words in the word sequence begin with a capital letter, or
if all consecutive non-stop words in the word sequence do not begin with a capital
letter, a length of the word sequence is less than four;
a part of speech of a head word of the word sequence is jj or nn or rb or vb, where
jj is an adjective, nn is a noun, rb is an adverb, and vb is a verb; and
all words included in the word sequence are not stop words.
[0022] According to a second aspect, a device for parsing a question is provided and includes:
a receiving unit, configured to receive a question entered by a user;
a phrase detection unit, configured to perform phrase detection on the question received
by the receiving unit to determine first candidate phrases;
a mapping unit, configured to map the first candidate phrases determined by the phrase
detection unit to first resource items in the knowledge base, where the first resource
items have consistent semantic meanings with the first candidate phrases;
a first determining unit, configured to determine values of observed predicates and
possible question parse spaces according to the first candidate phrases and the first
resource items, where the observed predicates are used to indicate features of the
first candidate phrases, features of the first resource items, and a relationship
between the first candidate phrases and the first resource items, points in the possible
question parse spaces are proposition sets, and truth or falsity of propositions in
the proposition sets is represented by values of hidden predicates;
a second determining unit, configured to: perform uncertain inference on each proposition
set in the possible question parse spaces according to the values that are of the
observed predicates and are determined by the first determining unit and the values
of the hidden predicates, and calculate confidence of each proposition set;
an acquiring unit, configured to acquire a combination of true propositions in a proposition
set that is determined by the second determining unit and whose confidence satisfies
a preset condition, where the true propositions are used to indicate search phrases
selected from the first candidate phrases, search resource items selected from the
first resource items, and features of the search resource items; and
a generating unit, configured to generate a formal query statement according to the
combination of true propositions.
[0023] With reference to the second aspect, in a first possible implementation manner of
the second aspect, the uncertain inference is based on a Markov logic network MLN,
where the MLN includes a predefined first-order formula and a weight of the first-order
formula.
[0024] With reference to the second aspect or the first possible implementation manner of
the second aspect, in a second possible implementation manner of the second aspect,
the acquiring unit is further configured to acquire multiple natural language questions
from the knowledge base;
the phrase detection unit is further configured to perform phrase detection on the multiple natural language questions acquired by the acquiring unit to determine second candidate phrases of the multiple natural language questions;
the mapping unit is further configured to map the second candidate phrases to second
resource items in the knowledge base, where the second resource items have consistent
semantic meanings with the second candidate phrases;
the first determining unit is further configured to determine, according to the second
candidate phrases and the second resource items, values of observed predicates corresponding
to the multiple natural language questions;
the acquiring unit is further configured to acquire hand-labeled values of hidden
predicates corresponding to the multiple natural language questions; and
the second determining unit is further configured to create an undirected graph according
to the values of the observed predicates corresponding to the multiple natural language
questions, the values of the hidden predicates corresponding to the multiple natural
language questions, and the first-order formula, and determine the weight of the first-order
formula through training.
[0025] With reference to the second possible implementation manner of the second aspect,
in a third possible implementation manner of the second aspect, the first-order formula
includes a Boolean formula and a weighted formula, a weight of the Boolean formula
is +∞, a weight of the weighted formula is a weighted formula weight, and the hand-labeled
values of the hidden predicates corresponding to the multiple natural language questions
satisfy the Boolean formula; and
the second determining unit is specifically configured to: create the undirected graph
according to the values of the observed predicates corresponding to the multiple natural
language questions, the values of the hidden predicates corresponding to the multiple
natural language questions, and the first-order formula, and determine the weight
of the weighted formula through training.
[0026] With reference to the second possible implementation manner of the second aspect,
in a fourth possible implementation manner of the second aspect, the second determining
unit is specifically configured to:
create the undirected graph according to the values of the observed predicates corresponding
to the multiple natural language questions, the values of the hidden predicates corresponding
to the multiple natural language questions, and the first-order formula, and determine
the weight of the first-order formula by using a margin infused relaxed algorithm
MIRA.
[0027] With reference to any possible implementation manner of the second aspect, in a fifth
possible implementation manner of the second aspect, the MLN is indicated by
M, the first-order formula is indicated by φ_i, the weight of the first-order formula is indicated by w_i, and the proposition set is indicated by y; and
the second determining unit is specifically configured to:
create a possible world according to the values of the observed predicates and the hidden predicates, where the possible world is indicated by y; and
calculate the confidence of each proposition set according to

p(y) = (1/Z) · exp( Σ_i Σ_{c ∈ C_n^{φ_i}} w_i · f_c^{φ_i}(y) ),

where Z is a normalization constant, C_n^{φ_i} is a sub-formula set corresponding to the first-order formula φ_i, c is a sub-formula in the sub-formula set C_n^{φ_i}, f_c^{φ_i}(y) is a binary feature function, and f_c^{φ_i}(y) indicates truth or falsity of the first-order formula in the proposition set y.
[0028] With reference to the second aspect or any possible implementation manner of the
second aspect, in a sixth possible implementation manner of the second aspect, the
acquiring unit is specifically configured to:
determine a proposition set whose confidence value is largest, and acquire a combination
of true propositions in the proposition set whose confidence value is largest.
[0029] With reference to the second aspect or any possible implementation manner of the
second aspect, in a seventh possible implementation manner of the second aspect,
the features of the first candidate phrases include positions of the first candidate
phrases in the question, parts of speech of head words of the first candidate phrases,
and tags on a dependency path between every two of the first candidate phrases;
the features of the first resource items include types of the first resource items,
a correlation value between every two of the first resource items, and a parameter
matching relationship between every two of the first resource items;
the relationship between the first candidate phrases and the first resource items
includes prior matching scores between the first candidate phrases and the first resource
items; and
the first determining unit is specifically configured to:
determine the positions of the first candidate phrases in the question;
determine the parts of speech of the head words of the first candidate phrases by using a Stanford part-of-speech tagging tool;
determine the tags on the dependency path between every two of the first candidate phrases by using a Stanford dependency syntax parser tool;
determine the types of the first resource items from the knowledge base, where the
types are entity or class or relation;
determine the parameter matching relationship between every two of the first resource
items from the knowledge base;
use a similarity coefficient between every two of the first resource items as the
correlation value between every two of the first resource items; and
calculate the prior matching scores between the first candidate phrases and the first
resource items, where the prior matching scores are used to indicate probabilities
that the first candidate phrases are mapped to the first resource items.
[0030] With reference to the second aspect or any possible implementation manner of the
second aspect, in an eighth possible implementation manner of the second aspect, the
formal query statement is a Simple Protocol and Resource Description Framework Query
Language SPARQL.
[0031] With reference to the eighth possible implementation manner of the second aspect,
in a ninth possible implementation manner of the second aspect, the generating unit
is specifically configured to:
generate the SPARQL according to the combination of true propositions by using a SPARQL
template.
[0032] With reference to the ninth possible implementation manner of the second aspect,
in a tenth possible implementation manner of the second aspect, the SPARQL template
includes an ASK WHERE template, a SELECT COUNT(?url) WHERE template, and a SELECT
?url WHERE template; and
the generating unit is specifically configured to:
when the question is a Yes/No question, generate the SPARQL according to the combination
of true propositions by using the ASK WHERE template;
when the question is a Normal question, generate the SPARQL according to the combination
of true propositions by using the SELECT ?url WHERE template; and
when the question is a Number question, generate the SPARQL according to the combination
of true propositions by using the SELECT ?url WHERE template, or when a numeric answer
cannot be obtained for the SPARQL generated by using the SELECT ?url WHERE template,
generate the SPARQL by using the SELECT COUNT(?url) WHERE template.
[0033] With reference to the second aspect or any possible implementation manner of the
second aspect, in an eleventh possible implementation manner of the second aspect,
the phrase detection unit is specifically configured to:
use word sequences in the question as the first candidate phrases, where the word
sequences satisfy:
all consecutive non-stop words in the word sequence begin with a capital letter, or
if all consecutive non-stop words in the word sequence do not begin with a capital
letter, a length of the word sequence is less than four;
a part of speech of a head word of the word sequence is jj or nn or rb or vb, where
jj is an adjective, nn is a noun, rb is an adverb, and vb is a verb; and
all words included in the word sequence are not stop words.
[0034] The embodiments of the present invention are based on a predefined uncertain inference
network, and can be used for converting a natural language question entered by a user
into a structured SPARQL. In the embodiments of the present invention, the predefined
uncertain inference network can be applied to a knowledge base in any field, and has
field extensibility. Therefore, it is unnecessary to manually configure a conversion
rule for a knowledge base.
BRIEF DESCRIPTION OF DRAWINGS
[0035] To describe the technical solutions in the embodiments of the present invention more
clearly, the following briefly introduces the accompanying drawings required for describing
the embodiments or the prior art. Apparently, the accompanying drawings in the following
description show merely some embodiments of the present invention, and a person of
ordinary skill in the art may still derive other drawings from these accompanying
drawings without creative efforts.
FIG. 1 is a flowchart of a method for parsing a question in a knowledge base according
to an embodiment of the present invention;
FIG. 2 is an example of a dependency parse tree according to an embodiment of the
present invention;
FIG. 3 is a schematic diagram of a method for parsing a question in a knowledge base
according to another embodiment of the present invention;
FIG. 4 is another example of a resource items query graph according to an embodiment
of the present invention;
FIG. 5 is a flowchart of a method for determining a weight of a weighted formula according
to an embodiment of the present invention;
FIG. 6 is a block diagram of a device for parsing a question according to an embodiment
of the present invention; and
FIG. 7 is a block diagram of a device for parsing a question according to another
embodiment of the present invention.
DESCRIPTION OF EMBODIMENTS
[0036] The following clearly and completely describes the technical solutions in the embodiments
of the present invention with reference to the accompanying drawings in the embodiments
of the present invention. Apparently, the described embodiments are some but not all
of the embodiments of the present invention. All other embodiments obtained by a person
of ordinary skill in the art based on the embodiments of the present invention without
creative efforts shall fall within the protection scope of the present invention.
[0037] In a knowledge base-based question answering system, a natural language question
(natural language question) needs to be converted into a formal query statement. For example, the formal query statement is a structured query language (Structured Query Language, SQL) statement or a SPARQL query. Generally, the SPARQL is expressed in a subject-property-object (subject-property-object, SPO) triple format (triple format).
[0038] For example, a SPARQL corresponding to a natural language question "Which software
has been developed by organization founded in California, USA?" is:
?url_answer rdf:type dbo:Software
?url_answer db:developer ?x1
?x1 rdf:type dbo:Company
?x1 dbo:foundationPlace dbr:California
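For illustration only, such a query could be executed against a public SPARQL endpoint, for example with the SPARQLWrapper Python package as sketched below; the endpoint URL and prefix declarations are assumptions about the deployment, and the db:developer property above is written as dbo:developer here.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?url_answer WHERE {
        ?url_answer rdf:type dbo:Software .
        ?url_answer dbo:developer ?x1 .
        ?x1 rdf:type dbo:Company .
        ?x1 dbo:foundationPlace dbr:California .
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["url_answer"]["value"])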
[0039] Converting a natural language question into a formal query statement depends on conversion rules for the knowledge base; that is, different knowledge bases have different conversion rules. However, in a current question answering system, a conversion rule must be configured manually for each knowledge base: for a given knowledge base, some questions are collected manually, answers to the questions are determined, and rules obtained through manual summarization of these questions are used as conversion rules. Such manually configured conversion rules have no field extensibility, and a conversion rule configured for one knowledge base cannot be used for another knowledge base. In addition, because many ambiguities exist in natural language questions, the manually configured conversion rules lack robustness.
[0040] Natural language processing (Natural Language Processing, NLP) is a discipline, spanning computer science, artificial intelligence, and linguistics, that describes the relationship between machine language and natural language. The NLP involves human-machine interactions. Tasks (tasks) of the NLP may include: automatic summarization (Automatic summarization), coreference resolution (Coreference resolution), discourse analysis
(Discourse analysis), machine translation (Machine translation), morphological segmentation
(Morphological segmentation), named entity recognition (Named entity recognition,
NER), natural language generation (Natural language generation), natural language
understanding (Natural language understanding), optical character recognition (Optical
character recognition, OCR), part-of-speech tagging (Part-of-speech tagging), syntax
parsing (Parsing), question answering system (Question answering), relationship extraction
(Relationship extraction), sentence breaking (Sentence breaking), sentiment analysis
(Sentiment analysis), speech recognition (Speech recognition), speech segmentation
(Speech segmentation), topic segmentation and recognition (Topic segmentation and
recognition), word segmentation (Word segmentation), word sense disambiguation (Word
sense disambiguation), information retrieval (Information retrieval, IR), information
extraction (Information extraction, IE), speech processing (Speech processing), and
the like.
[0041] Specifically, a Stanford (Stanford) natural language processing (Natural Language
Processing, NLP) tool is designed for different tasks of the NLP. The Stanford NLP
tool is used in the embodiments of the present invention. For example, a part-of-speech
tagging tool in the Stanford NLP tool may be used to determine a part of speech (Part-of-speech)
of each word (word) in a question.
[0042] Uncertain inference generally refers to inference of various questions except exact
inference, including inference of incomplete and inaccurate knowledge, inference of
vague knowledge, non-monotonic inference, and the like.
[0043] An uncertain inference process is actually a thinking process that starts from uncertain original evidence and finally infers, by using uncertain knowledge, a conclusion that is uncertain but reasonable or basically reasonable.
[0044] Uncertain inference types include a numeric method and a nonnumeric method, where
the numeric method includes a probability-based method. Specifically, the probability-based
method is a method developed on a basis of a related theory of a probability theory,
such as a confidence method, a subjective Bayes (Bayes) method, and a theory of evidence.
[0045] A Markov logic network is a common one of uncertain inference networks.
[0046] The Markov logic network (Markov Logic Network, MLN) is a framework combining first-order
logic (First-Order Logic, FOL) and statistical relational learning (Statistical Relational
Learning) that is of a Markov network (Markov Network). A difference between the Markov
logic network and the conventional first-order logic is: the conventional first-order
logic requires that no conflict should be allowed among all rules, and if one proposition
cannot satisfy all rules simultaneously, the proposition is false; however, in the
Markov logic network, each rule has a weight, and a proposition is true according
to a probability.
[0047] The first-order logic (First-Order Logic, FOL) may also be referred to as predicate
logic or first-order predicate logic. It is formed by several first-order predicate
rules. A first-order predicate rule is formed by symbols of four types, that is, constant,
variable, function, and predicate. The constant refers to a simple object in a domain.
The variable may refer to several objects in the domain. The function indicates a
mapping from a group of objects to one object. The predicate refers to a relationship
between several objects in the domain or a property of an object. The variable and
constant may have types. A variable of a type can have a value only from an object
set that defines the type. A term may be any expression that indicates an object. An atom is a predicate applied to a group of terms. A constant term refers to a term without a variable. A ground atom (ground atom) or a ground predicate (ground predicate) refers to an atom or a predicate whose parameters are all constant terms. Generally, a rule is built recursively from atoms by using connectors (such as implication and equivalence) and quantifiers (such as universal and existential quantifiers). In first-order logic, a rule is generally expressed in clause form. A possible world (possible world) refers to a world in which truth values are assigned to all ground atoms that may occur. First-order logic may be considered as a set of hard rules over the set of possible worlds, that is, if a world violates one of the rules, the existential probability of that world is zero.
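For illustration only, a possible world can be pictured as a truth assignment over ground atoms, as in the following Python fragment; the predicate and constants are hypothetical.

# A possible world assigns a truth value to every ground atom; here the
# predicate Friends is grounded over the constants Anna and Bob.
possible_world = {("Friends", "Anna", "Bob"): True,
                  ("Friends", "Bob", "Anna"): False}
# The hard rule "Friends is symmetric" is violated by this assignment, so
# in pure first-order logic this world has probability zero; a Markov
# logic network (below) instead only lowers its probability in proportion
# to the weight of the violated rule.
violates_symmetry = any(truth != possible_world[(pred, b, a)]
                        for (pred, a, b), truth in possible_world.items())
print(violates_symmetry)  # True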
[0048] A basic idea of the MLN is to relax those hard rules, that is, when a world violates
one of the rules, an existential possibility of the world is reduced, but it does
not mean that existence of the world is impossible. If a world violates fewer rules,
an existential possibility of the world is higher. Therefore, a specific weight is
added to each rule, and the weight reflects a constraint on a possible world that
satisfies the rule. If the weight of a rule is greater, a difference between a world
that satisfies the rule and a world that does not satisfy the rule is greater.
[0049] In this manner, by designing different first-order logic formulas (high-order rule
templates), the Markov logic network can be properly combined with language features
and knowledge base constraints. Constraints of soft rules can be modeled by using
a logic formula in the probability framework. In the Markov logic (Markov Logic),
a group of weighted formulas is referred to as a Markov logic network.
[0050] Specifically, the MLN may include a first-order formula and a penalty (penalty).
A penalty may be applied if a ground atom violates a corresponding first-order formula.
[0051] The first-order formula includes first-order predicates, logical connectors (logical
connectors), and variables.
[0052] FIG. 1 is a flowchart of a method for parsing a question in a knowledge base according
to an embodiment of the present invention. The method shown in FIG. 1 includes:
101. Receive a question entered by a user.
102. Perform phrase detection on the question to determine first candidate phrases.
103. Map the first candidate phrases to first resource items in the knowledge base,
where the first resource items have consistent semantic meanings with the first candidate
phrases.
104. Determine values of observed predicates and possible question parse spaces according to the first candidate phrases and the first resource items, where the observed predicates are used to indicate features of the first candidate phrases, features of the first resource items, and a relationship between the first candidate phrases and the first resource items, points in the possible question parse spaces are proposition sets, and truth or falsity of propositions in the proposition sets is represented by values of hidden predicates.
105. Perform uncertain inference on each proposition set in the possible question
parse spaces according to the values of the observed predicates and the values of
the hidden predicates, and calculate confidence of each proposition set.
106. Acquire a combination of true propositions in a proposition set whose confidence
satisfies a preset condition, where the true propositions are used to indicate search
phrases selected from the first candidate phrases, search resource items selected
from the first resource items, and features of the search resource items.
107. Generate a formal query statement according to the combination of true propositions.
[0053] In this embodiment of the present invention, uncertain inference is performed by
using observed predicates and hidden predicates, and a natural language question can
be converted into a formal query statement. In addition, in this embodiment of the
present invention, an uncertain inference method can be applied to a knowledge base
in any field, and has field extensibility. Therefore, it is unnecessary to manually
configure a conversion rule for a knowledge base.
[0054] It may be understood that in this embodiment of the present invention, the question
entered by the user in step 101 is a natural language question (natural language question).
[0055] For example, the natural language question is "Give me all actors who were born in
Berlin."
[0056] Further, in step 102, word (token) sequences may be recognized through phrase detection
(phrase detection). Optionally, the word sequences in the question may be used as
the first candidate phrases. A word sequence, also referred to as a multi-word sequence, a word item, an n-gram word sequence, or n-gram(s), refers to a sequence formed by n consecutive words.
[0057] It may be understood that multiple first candidate phrases may be determined in step
102.
[0058] Optionally, in step 102, a word sequence that satisfies the following constraint
may be used as a first candidate phrase:
- (1) all consecutive non-stop words in the word sequence begin with a capital letter,
or if all consecutive non-stop words in the word sequence do not begin with a capital
letter, a length of the word sequence is less than four;
- (2) a part of speech of a head word (head word) of the word sequence is jj or nn or
rb or vb, where jj is an adjective, nn is a noun, rb is an adverb, and vb is a verb;
and
- (3) all words included in the word sequence are not stop words.
[0059] In addition, all consecutive non-stop words beginning with a capital letter must
be in a same word sequence.
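For illustration only, the following Python sketch applies the three constraints above to all word sequences of a question; the tokenization, the part-of-speech tags, the stop-word list, and the simplification of taking the last word of a sequence as its head word are all assumptions.

def candidate_phrases(tokens, pos_tags, stop_words):
    cands = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            seq, tags = tokens[i:j], pos_tags[i:j]
            if any(w.lower() in stop_words for w in seq):
                continue  # (3) no word in the sequence may be a stop word
            if not all(w[0].isupper() for w in seq) and len(seq) >= 4:
                continue  # (1) non-capitalized sequences must be shorter than four
            if tags[-1][:2].lower() not in ("jj", "nn", "rb", "vb"):
                continue  # (2) head-word part of speech (head = last word here)
            cands.append(" ".join(seq))
    return cands

tokens = ["Give", "me", "all", "actors", "who", "were", "born", "in", "Berlin"]
tags = ["vb", "prp", "dt", "nns", "wp", "vbd", "vbn", "in", "nnp"]
stops = {"give", "me", "all", "were", "the"}
print(candidate_phrases(tokens, tags, stops))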
[0060] It may be understood that in this embodiment of the present invention, a head word
may also be referred to as an important word or a main word or the like, and a symbol
indicating a part of speech may be acquired from a part-of-speech tagging set.
[0061] For example, all consecutive non-stop words in "United States Court of Appeals for
the District of Columbia Circuit" begin with a capital letter, and are a candidate
phrase. It may be understood that a word sequence in which all consecutive non-stop
words begin with a capital letter is generally a proper noun.
[0062] A length of a word sequence refers to a quantity of words included in the word sequence.
For example, a length of a word sequence "born in" is 2.
A part of speech of each word may be determined by using a Stanford part-of-speech tagging tool.
[0064] For example, English stop words (stop words) include "a", "an", "the", "that", and the like; Chinese likewise has its own stop words.
[0065] For example, in the question "Give me all actors who were born in Berlin", the determined
first candidate phrases include: actors, who, born in, in, and Berlin.
[0066] Specifically, the first candidate phrases may be expressed in a form of Table 1,
where the first column in Table 1 indicates phrase identifiers of the first candidate
phrases.
Table 1
| 11 | actors |
| 12 | who |
| 13 | born in |
| 14 | in |
| 15 | Berlin |
[0067] In this embodiment of the present invention, it may be understood that step 103 is
to map each first candidate phrase to a first resource item in the knowledge base.
In this embodiment of the present invention, step 103 may also be referred to as phrase
mapping (phrase mapping). Specifically, one first candidate phrase may be mapped to
multiple first resource items. A type of the first resource item may be entity (Entity)
or class (Class) or relation (Relation).
[0068] For example, it is assumed that the knowledge base is DBpedia. Step 103 then specifically maps the first candidate phrases to entities (Entities). Considering that entities in DBpedia come from entity pages in Wikipedia, first, the anchor texts (anchor text), redirection pages, and disambiguation pages in Wikipedia are collected, and a correspondence dictionary between candidate phrases and entities is created from them; then, when a first candidate phrase matches a mention (mention) phrase of an entity, the entity is a first resource item that has a consistent semantic meaning with the first candidate phrase.
[0069] The first candidate phrase is also mapped to a class (Class). This accounts for variations of words, in particular synonyms; for example, the phrases film, movie, and show may all be mapped to the class dbo:Film. First, all words in the first candidate phrase are converted into vector form by using the word2vec tool, where the vector form of a class in the knowledge base is the vector form of its label (corresponding to the rdfs:label relation); then the cosine similarity between the first candidate phrase and each class is calculated over the vectors; and finally, the N classes with the largest cosine similarity values are used as first resource items that have consistent semantic meanings with the first candidate phrase.
[0070] The word2vec tool is a tool for converting a word (word) into a vector (vector). For example, it may be the open-source code developed and provided by Google (google). For details, reference may be made to http://code.google.com/p/word2vec/.
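For illustration only, the cosine-similarity ranking described above can be sketched in Python as follows; the toy two-dimensional vectors stand in for a trained word2vec model, and averaging word vectors to obtain a phrase vector is a simplifying assumption.

import numpy as np

def phrase_vector(phrase, vectors):
    # Average the word vectors of the phrase (a common simplification).
    return np.mean([vectors[w] for w in phrase.split()], axis=0)

def top_n_classes(phrase, class_labels, vectors, n=3):
    v = phrase_vector(phrase, vectors)
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(label, cos(v, phrase_vector(label, vectors)))
              for label in class_labels]
    return sorted(scored, key=lambda x: -x[1])[:n]

# Toy vectors, illustrative only.
vectors = {"movie": np.array([1.0, 0.1]),
           "film": np.array([0.9, 0.2]),
           "actor": np.array([0.1, 1.0])}
print(top_n_classes("movie", ["film", "actor"], vectors, n=1))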
[0071] The first candidate phrase is also mapped to a relation (Relation), using the relation patterns (relation patterns) defined by PATTY and ReVerb as resources. First, alignments between relations in DBpedia and the relation patterns defined by PATTY and ReVerb are calculated over instances, that is, statistics are collected on the instance pairs in DBpedia that satisfy both a relation and a relation pattern. Then, if the first candidate phrase matches a relation pattern, a relation satisfying the relation pattern is used as a first resource item that has a consistent semantic meaning with the first candidate phrase.
[0073] In this manner, by performing step 103, the first candidate phrases may be mapped
to the first resource items. Specifically, each first candidate phrase is mapped to
at least one first resource item. In addition, the first candidate phrases and the
first resource items having a mapping relationship have consistent semantic meanings.
[0074] If one first candidate phrase is mapped to multiple first resource items, it indicates
that the first candidate phrase is ambiguous.
[0075] For example, in step 103, it may be determined that the first candidate phrases actors,
who, born in, in, and Berlin in the question "Give me all actors who were born in
Berlin" are mapped to the first resource items shown in Table 2. In Table 2, the first
column indicates the first candidate phrases, the second column indicates the first
resource items, and the third column indicates identifiers of the first resource items.
In addition, a first candidate phrase "in" is mapped to five first resource items.
Table 2
| actors | dbo:Actor | 21 |
| who | dbo:Person | 22 |
| born in | dbo:birthPlace | 23 |
| in | dbo:headquarter | 24 |
| in | dbo:league | 25 |
| in | dbo:location | 26 |
| in | dbo:ground | 27 |
| in | dbo:locationCity | 28 |
| Berlin | dbr:Berlin | 29 |
[0076] In this embodiment of the present invention, step 104 may be understood as a feature
extraction (feature extraction) process.
[0077] Specifically, the hidden predicate (hidden predicates) is defined in this embodiment
of the present invention. The hidden predicate may include the following forms:
hasphrase(p) indicates that a candidate phrase p is selected.
hasResource(p,r) indicates that a resource item r is selected, and that the candidate
phrase p is mapped to the resource item r.
hasRelation(p,r,rr) indicates that a parameter matching relationship rr between a
resource item p and the resource item r is selected.
[0078] It may be understood that p may be a phrase identifier of a candidate phrase, and
that p and r may be identifiers of resource items. The parameter matching relationship
rr may be one of the following: 1_1, 1_2, 2_1, and 2_2.
[0079] Specifically, in this embodiment of the present invention, the parameter matching relationship rr may be one of the following: 1_1, 1_2, 2_1, and 2_2. That the parameter matching relationship between the resource item p and the resource item r is m1_m2 indicates that the m1-th parameter of the resource item p is aligned with the m2-th parameter of the resource item r, where m1 is 1 or 2, and m2 is 1 or 2.
[0080] Table 3 shows a specific example of the foregoing parameter matching relationship.
The third column in Table 3 provides a question to explain a parameter matching relationship
in the second column.
Table 3
| 1_1 | dbo:height 1_1 dbr:Michael Jordan | How tall is Michael Jordan? |
| 1_2 | dbo:River 1_2 dbo:crosses | Which river does the Brooklyn Bridge cross? |
| 2_1 | dbo:creator 2_1 dbr:Walt Disney | Which television shows were created by Walt Disney? |
| 2_2 | dbo:birthplace 2_2 dbo:capital | Which actors were born in the capital of America? |

"dbo:height 1_1 dbr:Michael Jordan" indicates that a parameter matching relationship between a resource item dbo:height and a resource item dbr:Michael Jordan is 1_1. That is, the first parameter of the resource item dbo:height is aligned with the first parameter of the resource item dbr:Michael Jordan.
[0081] It may be understood that a value 1 of a hidden predicate indicates that a corresponding
candidate phrase and resource item and a parameter matching relationship between resource
items are selected, and that a value 0 of the hidden predicate indicates that a corresponding
candidate phrase and resource item and a parameter matching relationship between resource
items are not selected. In other words, the value 1 of the hidden predicate indicates
that a corresponding proposition is true, and the value 0 of the hidden predicate
indicates that the corresponding proposition is false.
[0082] For example, with reference to Table 1, hasphrase(11)=1 indicates that the proposition "candidate phrase actors is selected" is true, and hasphrase(11)=0 indicates that the proposition "candidate phrase actors is selected" is false.
[0083] In this manner, for the first candidate phrases and the first resource items that
are determined in steps 102 and 103, possible question parse spaces (possible question
parse spaces) may be created based on the hidden predicates. Specifically, one point
in a possible question parse space indicates one proposition set. A proposition set
includes a group of propositions, and the group of propositions is represented by
values of a group of hidden predicates. It may be understood that truth or falsity of a group of propositions in a proposition set is represented by values of corresponding hidden predicates.
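For illustration only, one point in a possible question parse space can be encoded as follows, using the identifiers of Table 1 and Table 2; the particular selections shown are arbitrary.

# A proposition set: hidden-predicate groundings mapped to 0 (false) or 1 (true).
proposition_set = {
    ("hasphrase", 11): 1,                 # "actors" is selected
    ("hasphrase", 14): 0,                 # "in" is not selected
    ("hasResource", 11, 21): 1,           # "actors" maps to dbo:Actor
    ("hasRelation", 21, 23, "1_1"): 1,    # dbo:Actor 1_1 dbo:birthPlace
}
# Uncertain inference scores every such assignment and keeps the proposition
# set whose confidence satisfies the preset condition.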
[0084] Specifically, in this embodiment of the present invention, observed predicates (observed
predicates) are further defined to indicate features of the first candidate phrases,
features of the first resource items, and a relationship between the first candidate
phrases and the first resource items.
[0085] The features of the first candidate phrases include positions of the first candidate
phrases in the question, parts of speech of head words of the first candidate phrases,
tags on a dependency path between every two of the first candidate phrases, and the
like.
[0086] The features of the first resource items include types of the first resource items,
a correlation value between every two of the first resource items, a parameter matching
relationship between every two of the first resource items, and the like.
[0087] The relationship between the first candidate phrases and the first resource items
includes prior matching scores between the first candidate phrases and the first resource
items.
[0088] Then, it may be understood that determining the values of the observed predicates
in step 104 includes: determining the positions of the first candidate phrases in
the question; determining the parts of speech of the head words of the first candidate
phrases by using a Stanford part-of-speech tagging tool; determining the tags on the dependency path between every two of the first candidate phrases by using a Stanford dependency syntax parser tool; determining the types of the first resource items from
the knowledge base, where the types are entity or class or relation; determining the
parameter matching relationship between every two of the first resource items from
the knowledge base, where the parameter matching relationship is one of the following:
1_1, 1_2, 2_1, and 2_2; using a similarity coefficient between every two of the first
resource items as the correlation value between every two of the first resource items;
and calculating the prior matching scores between the first candidate phrases and
the first resource items, where the prior matching scores are used to indicate probabilities
that the first candidate phrases are mapped to the first resource items.
[0089] Specifically, the determining the parameter matching relationship between every two of the first resource items from the knowledge base includes: determining a parameter matching relationship m1_m2 between a first resource item r1 and a first resource item r2 from the knowledge base, for indicating that the m1-th parameter of the first resource item r1 is aligned with the m2-th parameter of the first resource item r2. The first resource items include the first resource item r1 and the first resource item r2, where m1 is 1 or 2, and m2 is 1 or 2.
[0090] Specifically, the observed predicates may include the following forms:
phraseIndex(p,i,j) indicates a start position i and an end position j of a candidate
phrase p in a question.
phrasePosTag(p,pt) indicates a part of speech pt of a head word (head word) of the
candidate phrase p.
[0091] Specifically, a Stanford part-of-speech tagging tool may be used to determine the part of speech of the head word.
[0092] phraseDepTag(p,q,dt) indicates a tag dt on a dependency path between the candidate
phrase p and a candidate phrase q.
Specifically, a Stanford dependency parser (Stanford dependency parser) tool may be used to create a dependency parse tree (dependency parse trees) of a question,
and feature extraction is performed according to the dependency parse tree to determine
tags on the dependency path between two candidate phrases.
[0094] For example, a dependency parse tree of the question "Give me all actors who were
born in Berlin." is shown in FIG. 2.
[0095] phraseDepOne(p,q) indicates that when there is only one tag on the dependency path
between the candidate phrase p and the candidate phrase q, the predicate is true,
or otherwise, the predicate is false.
[0096] It may be understood that the predicate phraseDepOne(p,q) in the observed predicates
includes only a predicate whose result is true.
[0097] hasMeanWord(p,q) indicates that when words on the dependency path between the candidate
phrase p and the candidate phrase q are all stop words or their parts of speech are
"dt", "in", "wdt", "to", "cc", "ex", "pos", or "wp", hasMeanWord(p,q) is false, or
otherwise, hasMeanWord(p,q) is true.
[0098] "dt" is a determiner, "in" is a preposition "in", "wdt" is an interrogative word
beginning with "w", "to" is a preposition "to", "cc" is a connector, "ex" is an existential
word "there", "pos" is a word ending with a possessive case, and "wp" is an interrogative
pronoun. Interrogative words beginning with "w" include "what", "which", and the like,
and the connectors include "and", "but", "or", and the like. Specifically, a symbol
indicating the foregoing parts of speech may be acquired from a part-of-speech tagging
set.
[0099] It may be understood that the predicate hasMeanWord(p,q) in the observed predicates
includes only a predicate whose result is true.
[0100] resourceType(r,rt) indicates that a type of the resource item r is rt. rt is E or
C or R. E indicates an entity (Entity), C indicates a class (Class), and R indicates
a relation (Relation).
[0101] priorMatchScore(p,r,s) indicates a prior matching score s between the candidate phrase
p and the resource item r.
[0102] For example, it is assumed that the knowledge base is DBpedia.
[0103] Specifically, if the type of the resource item r is E, first, the anchor texts, redirection pages, and disambiguation pages in Wikipedia are collected; then the candidate phrase p is matched against the mention phrases of the resource item r, and the corresponding frequency may be used as the prior matching score. The corresponding frequency refers to the number of times the candidate phrase p is linked to the resource item r divided by the total number of times the candidate phrase p is linked.
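For illustration only, this frequency can be computed as follows in Python; the anchor-text link counts are hypothetical.

link_counts = {("berlin", "dbr:Berlin"): 9500,
               ("berlin", "dbr:Berlin_(band)"): 500}

def prior_match_score(phrase, resource, counts):
    # Times the phrase links to this resource, over all links of the phrase.
    total = sum(c for (p, _), c in counts.items() if p == phrase)
    return counts.get((phrase, resource), 0) / total if total else 0.0

print(prior_match_score("berlin", "dbr:Berlin", link_counts))  # 0.95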
[0105] Specifically, if the type of the resource item r is R, the prior matching score between the candidate phrase p and the resource item r may be α·s1 + β·s2 + (1−α−β)·s3, where α and β are any values between 0 and 1 and α+β<1, for example, α=0.3 and β=0.3; s1 is a Levenshtein distance between a label of the resource item r and the candidate phrase p; s2 is a measurement value of cosine similarity between a vector of the candidate phrase p and a vector of the resource item r; and s3 is a Jaccard coefficient of a matching set of the resource item r and a relation pattern, where the relation pattern is the relation pattern defined by PATTY and ReVerb. For the calculation of s3, reference may be made to "Natural language questions for the web of data", published in EMNLP by Yahya et al. in 2012.
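For illustration only, the following Python sketch combines the three components; s1 and s2 are assumed to be precomputed and normalized so that larger values indicate a better match, and only the Jaccard coefficient s3 is computed explicitly.

def jaccard(set_a, set_b):
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def relation_prior_score(s1, s2, s3, alpha=0.3, beta=0.3):
    # alpha * s1 + beta * s2 + (1 - alpha - beta) * s3
    return alpha * s1 + beta * s2 + (1 - alpha - beta) * s3

s3 = jaccard({"born in", "was born in"}, {"born in", "native of"})
print(relation_prior_score(s1=0.8, s2=0.7, s3=s3))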
[0106] hasRelatedness(p,q,s) indicates a correlation value s between the resource item p
and the resource item q. A value interval of the correlation value s is 0 to 1. Specifically,
the correlation value s may be a similarity coefficient between the resource item
p and the resource item q. Optionally, the similarity coefficient may also be referred
to as a Jaccard similarity coefficient or a Jaccard coefficient or a similarity evaluation
coefficient.
[0108] isTypeCompatible(p,q,rr) indicates a parameter matching relationship rr between the
resource item p and the resource item q.
[0109] Specifically, in this embodiment of the present invention, the parameter matching
relationship rr may be one of the following: 1_1, 1_2, 2_1, and 2_2. The parameter
matching relationship is not further described herein to avoid repetition. For details,
reference may be made to the foregoing description.
[0110] hasQueryResult(p,q,o,rr1,rr2) indicates that a query formed by the resource item p,
the resource item q, and a resource item o has a query result, where a parameter
matching relationship rr1 exists between the resource item p and the resource item
q, and a parameter matching relationship rr2 exists between the resource item q and
the resource item o.
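For illustration, whether such a triple has a query result could be checked with an ASK query against a SPARQL endpoint, for example as in the following sketch; the endpoint URL and the use of the SPARQLWrapper package are assumptions, not part of the embodiment:
# Hypothetical sketch: test whether the triple (p, q, o) has a query result.
from SPARQLWrapper import SPARQLWrapper, JSON

def has_query_result(subject_class, relation, obj):
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")  # assumed endpoint
    sparql.setQuery("""
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX dbr: <http://dbpedia.org/resource/>
        ASK WHERE { ?x rdf:type %s . ?x %s %s . }
    """ % (subject_class, relation, obj))
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["boolean"]

# e.g. has_query_result("dbo:Actor", "dbo:birthPlace", "dbr:Berlin")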
[0111] It may be understood that in the observed predicates described above, phraseIndex(p,i,j),
phrasePosTag(p,pt), phraseDepTag(p,q,dt), phraseDepOne(p,q), and hasMeanWord(p,q)
are used to indicate features of the candidate phrases. resourceType(r,rt), hasRelatedness(p,q,s),
isTypeCompatible(p,q,rr), and hasQueryResult(p,q,o,rr1,rr2) are used to indicate features
of the resource items. priorMatchScore(p,r,s) is used to indicate the relationship
between the candidate phrases and the resource items.
[0112] p and q may be phrase identifiers of candidate phrases, and p, q, r, and o may be
identifiers of resource items.
[0113] In this manner, the values of the corresponding predicates can be determined based
on the first candidate phrases and the first resource items that are determined in
steps 102 and 103.
[0114] For example, for the question "Give me all actors who were born in Berlin", on the
basis of Table 1 and Table 2, the values of the observed predicates may be calculated
in step 104. Specifically, expressions in which values of observed predicates are
1 include:
phraseIndex(11,3,3)
phraseIndex(12,4,4)
phraseIndex(13,6,7)
phraseIndex(14,7,7)
phraseIndex(15,8,8)
phrasePosTag(11,nn)
phrasePosTag(12,wp)
phrasePosTag(13,vb)
phrasePosTag(14,in)
phrasePosTag(15,nn)
phraseDepTag(11,13,rcmod)
phraseDepTag(12,13,nsubjpass)
phraseDepTag(12,14,nsubjpass)
phraseDepTag(13,15,pobj)
phraseDepTag(14,15,pobj)
phraseDepOne(11,13)
phraseDepOne(12,13)
phraseDepOne(12,14)
phraseDepOne(13,15)
phraseDepOne(14,15)
hasMeanWord(12,14)
resourceType(21,E)
resourceType(22,E)
resourceType(23,R)
resourceType(24,R)
resourceType(25,R)
resourceType(26,R)
resourceType(27,R)
resourceType(28,R)
resourceType(29,E)
priorMatchScore(11,21,1.000000)
priorMatchScore(12,22,1.000000)
priorMatchScore(13,23,1.000000)
priorMatchScore(14,24,1.000000)
priorMatchScore(14,25,1.000000)
priorMatchScore(14,26,1.000000)
priorMatchScore(14,27,1.000000)
priorMatchScore(14,28,1.000000)
priorMatchScore(15,29,1.000000)
hasRelatedness(21,23,1.000000)
hasRelatedness(22,23,1.000000)
hasRelatedness(22,24,0.440524)
hasRelatedness(22,25,0.425840)
hasRelatedness(22,26,0.226393)
hasRelatedness(22,27,0.263207)
hasRelatedness(23,29,0.854583)
hasRelatedness(24,29,0.816012)
hasRelatedness(26,29,0.532818)
hasRelatedness(27,29,0.569732)
hasRelatedness(28,29,0.713400)
isTypeCompatible(21,23, 1_1)
isTypeCompatible(22,23, 1_1)
isTypeCompatible(22,23, 1_2)
isTypeCompatible(22,24, 1_2)
isTypeCompatible(22,25, 1_1)
isTypeCompatible(22,26, 1_1)
isTypeCompatible(22,26, 1_2)
isTypeCompatible(22,27, 1_2)
isTypeCompatible(23,29, 2_1)
isTypeCompatible(24,29, 2_1)
isTypeCompatible(26,29, 2_1)
isTypeCompatible(27,29, 2_1)
isTypeCompatible(28,29, 2_1)
hasQueryResult(21,23,29, 1_1, 2_1)
hasQueryResult(22,23,29, 1_1, 2_1)
hasQueryResult(22,26,29, 1_1, 2_1)
[0115] It may be understood that a value 1 of an observed predicate indicates that a corresponding
proposition is true.
[0116] For example, a value of phraseIndex(11, 3, 3) is 1, which indicates that the proposition
"a start position i and an end position j of a first candidate phrase actors in the
question are both 3" is true. 11 is a phrase identifier of the candidate phrase "actors",
as shown in Table 1.
[0117] A value of phrasePosTag(13, vb) is 1, which indicates that the proposition "a head
word of the first candidate phrase born in is born, and a part of speech thereof is
vb" is true. 13 is a phrase identifier of the candidate phrase "born in", as shown
in Table 1.
[0118] A value of phraseDepTag(13,15, pobj) is 1, which indicates that the proposition "a
tag on a dependency path between the first candidate phrase born in and the first
candidate phrase Berlin is pobj" is true. 13 is a phrase identifier of the candidate
phrase "born in", and 15 is a phrase identifier of the candidate phrase "Berlin",
as shown in Table 1.
[0119] For meanings of other expressions in which values of observed predicates are 1, reference
may be made to the foregoing explanation. To avoid repetition, details are not described
herein again.
[0120] It may be understood that expressions in which values of observed predicates are
0 may also be included. For brevity, such expressions are not further listed herein.
[0121] Optionally, in this embodiment of the present invention, a predicate resource may
also be used to indicate an identifier of a resource item.
[0122] For example, it can be learned from Table 2 that values of the following predicates are
1:
resource(21,dbo:Actor)
resource(22,dbo:Person)
resource(23,dbo:birthPlace)
resource(24,dbo:headquarter)
resource(25,dbo:league)
resource(26,dbo:location)
resource(27,dbo:ground)
resource(28,dbo:locationCity)
resource(29,dbr:Berlin)
[0123] It may be understood that in this embodiment of the present invention, the first
candidate phrases and the first resource items that are determined in steps 102 and
103 are ambiguous. In this embodiment of the present invention, the ambiguities of
the first candidate phrases and the first resource items are eliminated through uncertain
inference.
[0124] The uncertain inference is to perform inference and make a decision according to
uncertainty information. An uncertain inference network may process an incomplete
data set with noise, use a probability measurement weight to describe a correlation
between data, and aim at solving inconsistency and uncertainty of data.
[0125] In this embodiment of the present invention, a model used for the uncertain inference
in step 105 may be any one of the following: a Bayesian network (Bayesian Network),
a probabilistic relational model (Probabilistic relational models), a Bayesian logic
program model (Bayesian logic programs), a relational Markov network (Relational Markov
Network), a Markov logic network (Markov Logic Network), and probabilistic soft logic
(Probabilistic Soft Logic). The present invention is not limited thereto.
[0126] Optionally, in this embodiment of the present invention, the uncertain inference
in step 105 is based on the Markov logic network (Markov Logic Network, MLN), where
the MLN includes a predefined first-order formula and a weight of the first-order
formula. That is, a model used for the uncertain inference is the MLN.
[0127] Optionally, in this embodiment of the present invention, the first-order formula
may include a Boolean formula (Boolean formulas) and a weighted formula (weighted
formulas). A weight of the Boolean formula is +∞. The Boolean formula may be understood
as a first-order logic formula indicating a hard rule (hard constraints); it may also
be referred to as a hard formula (hard formulas, hf) and is a constraint that all
ground atoms must satisfy. A weight of the weighted formula is a weighted formula
weight. The weighted formula indicates a soft rule (soft constraints) and may also
be referred to as a soft formula (soft formulas, sf); a penalty may be applied if a
ground atom violates the rule.
[0128] The first-order formula is formed by a first-order predicate, a logical connector,
and a variable. The first-order predicate may include the foregoing observed predicate
and hidden predicate.
[0129] It should be noted that in this embodiment of the present invention, the MLN may
also include a second-order formula, a first-order formula, a weight of the second-order
formula, and a weight of the first-order formula. Alternatively, the MLN may also
include a higher-order formula and a weight, which is not limited in the present invention.
[0130] Specifically, Boolean formulas are shown in Table 4, where a symbol "_" indicates
any constant in a logical variable, and |·| indicates a quantity of true ground atoms
in the formula.
Table 4
| hf1 | hasPhrase(p) ⇒ hasResource(p,_) |
| hf2 | hasResource(p,_) ⇒ hasPhrase(p) |
| hf3 | |hasResource(p,_)| ≤ 1 |
| hf4 | !hasPhrase(p) ⇒ !hasResource(p,r) |
| hf5 | hasResource(_,r) ⇒ hasRelation(r,_,_) ∨ hasRelation(_,r,_) |
| hf6 | |hasRelation(r1,r2,_)| ≤ 1 |
| hf7 | hasRelation(r1,r2,_) ⇒ hasResource(_,r1) ∧ hasResource(_,r2) |
| hf8 | phraseIndex(p1,s1,e1) ∧ phraseIndex(p2,s2,e2) ∧ overlap(s1,e1,s2,e2) ∧ hasPhrase(p1) ⇒ !hasPhrase(p2) |
| hf9 | resourceType(r,E) ⇒ !hasRelation(r,_,2_1) ∧ !hasRelation(r,_,2_2) |
| hf10 | resourceType(r,E) ⇒ !hasRelation(_,r,1_2) ∧ !hasRelation(_,r,2_2) |
| hf11 | resourceType(r,C) ⇒ !hasRelation(r,_,2_1) ∧ !hasRelation(r,_,2_2) |
| hf12 | resourceType(r,C) ⇒ !hasRelation(_,r,1_2) ∧ !hasRelation(_,r,2_2) |
| hf13 | !isTypeCompatible(r1,r2,rr) ⇒ !hasRelation(r1,r2,rr) |
Specifically, meanings in Table 4 are as follows:
hf1: indicates that if a phrase p is selected, the phrase p is mapped to at least
one resource item.
hf2: indicates that if a mapping of a phrase p to a resource item is selected, the
phrase p must be selected.
hf3: indicates that a phrase p can be mapped to only one resource item.
hf4: indicates that if a phrase p is not selected, no mapping of the phrase p to a
resource item is selected.
hf5: indicates that if a mapping of a phrase to a resource item r is selected, the
resource item r is related to at least one other resource item.
hf6: indicates that there is only one parameter matching relationship between two
resource items r1 and r2.
hf7: indicates that if two resource items r1 and r2 have a parameter matching relationship,
at least one mapping of a phrase to the resource item r1 is selected and at least
one mapping of a phrase to the resource item r2 is selected.
hf8: indicates that any two selected phrases do not overlap, where the overlap is
determined by the positions of the phrases in the question.
hf9, hf10, hf11, and hf12: indicate that if a type of a resource item r is entity
or class, the resource item r cannot have a second parameter that is aligned with
other resource items.
hf13: indicates that the parameter matching relationship between two resource items
r1 and r2 must be consistent with their type compatibility.
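As a small illustration of how a hard rule constrains a proposition set, the overlap test of hf8 could be implemented as in the following sketch; the data layout and names are assumptions:
# Hypothetical sketch: hf8 forbids selecting two overlapping phrases.
def overlap(s1, e1, s2, e2):
    # positions are the start and end word indexes of a phrase in the question
    return not (e1 < s2 or e2 < s1)

def violates_hf8(selected_phrases, phrase_index):
    """selected_phrases: phrase ids with hasPhrase(p) = 1;
    phrase_index: maps a phrase id to its (start, end) positions."""
    items = list(selected_phrases)
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if overlap(*phrase_index[items[i]], *phrase_index[items[j]]):
                return True
    return False

# Example from Table 1: phrases 13 ("born in", 6-7) and 14 ("in", 7-7)
# overlap at position 7, so they cannot both be selected.
phrase_index = {13: (6, 7), 14: (7, 7)}
print(violates_hf8({13, 14}, phrase_index))  # True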
[0131] It may be understood that in Table 4, a logical connector "∧" indicates and (and),
a logical connector "∨" indicates or (or), and a logical connector "!" indicates not
(not).
[0132] Specifically, weighted formulas are shown in Table 5, where a symbol "+" indicates
that a weight must be set for each constant of a logical variable.
Table 5
| sf1 | priorMatchScore(p,r,s) ⇒ hasPhrase(p) |
| sf2 | priorMatchScore(p,r,s) ⇒ hasResource(p,r) |
| sf3 | phrasePosTag(p,pt+) ∧ resourceType(r,rt+) ⇒ hasResource(p,r) |
| sf4 | phraseDepTag(p1,p2,dp+) ∧ hasResource(p1,r1) ∧ hasResource(p2,r2) ⇒ hasRelation(r1,r2,rr+) |
| sf5 | phraseDepTag(p1,p2,dp+) ∧ hasResource(p1,r1) ∧ hasResource(p2,r2) ∧ !hasMeanWord(p1,p2) ⇒ hasRelation(r1,r2,rr+) |
| sf6 | phraseDepTag(p1,p2,dp+) ∧ hasResource(p1,r1) ∧ hasResource(p2,r2) ∧ phraseDepOne(p1,p2) ⇒ hasRelation(r1,r2,rr+) |
| sf7 | hasRelatedness(r1,r2,s) ∧ hasResource(_,r1) ∧ hasResource(_,r2) ⇒ hasRelation(r1,r2,_) |
| sf8 | hasQueryResult(r1,r2,r3,rr1,rr2) ⇒ hasRelation(r1,r2,rr1) ∧ hasRelation(r2,r3,rr2) |
Specifically, meanings in Table 5 are as follows:
sf1 and sf2: indicate that a greater prior matching score s of a phrase p mapped to
a resource item r implies a higher probability that the phrase p and the resource
item r are selected.
sf3: indicates that a part of speech of a head word of the phrase p and a type of
the resource item r to which the phrase p is mapped have a relationship.
sf4, sf5, and sf6: indicate that a tag on a dependency path between two phrases p1
and p2 and a parameter matching relationship between two resource items r1 and r2
have a relationship, where the phrase p1 is mapped to the resource item r1 and the
phrase p2 is mapped to the resource item r2.
sf7: indicates that a greater correlation value between two resource items r1 and
r2 implies a higher probability that the two resource items r1 and r2 have a parameter
matching relationship.
sf8: indicates that if a resource item triple has a query result, the three resource
items should have a corresponding parameter matching relationship.
[0133] It should be noted that in this embodiment of the present invention, the weighted
formula weight may be set manually. For example, the weight may be an empirical value
preset by an administrator or an expert of a knowledge base.
[0134] In this embodiment of the present invention, the weighted formula weight may also
be obtained through training by using a learning method.
[0135] It may be understood that weighted formula weights are generally different for different
knowledge bases. In this embodiment of the present invention, the Boolean formulas
shown in Table 4 may be understood as general rules that all knowledge bases satisfy.
The weighted formulas shown in Table 5 may be understood as particular rules for which
weighted formula weights are different for different knowledge bases.
[0136] In this embodiment of the present invention, the Boolean formula and the weighted
formula may be collectively referred to as "meta rule". That is, the "meta rule" is
a rule that is applicable to knowledge bases in different fields.
[0137] In this embodiment of the present invention, step 105 may also be referred to as
inference (Inference) or joint inference (joint inference) or joint disambiguation
(joint disambiguation). Specifically, the thebeast tool may be used to perform joint
inference. Optionally, for each proposition set in the question parse spaces, confidence
of each proposition set may be calculated according to the values of the observed
predicates and the values of the hidden predicates by using a cutting plane method
(cutting plane method or cutting plane approach). Specifically, for the thebeast tool,
reference may be made to
https://code.google.com/p/thebeast/.
[0138] It may be understood that confidence may also be referred to as a confidence level. In addition,
the confidence of each proposition set may be calculated by means of maximum-likelihood
estimation of an undirected graph model.
[0139] Optionally, the MLN is indicated by M, the first-order formula is indicated by φi,
the weight of the first-order formula is indicated by wi, and the proposition set
is indicated by y; then, step 105 may be:
calculating the confidence of each proposition set according to

p(y) = (1/Z) · exp( Σ_{(φi, wi) ∈ M} wi · Σ_{c ∈ Cnφi} fc^φi(y) )

where Z is a normalization constant, Cnφi is a sub-formula set corresponding to the
first-order formula φi, c is a sub-formula in the sub-formula set Cnφi, fc^φi is a
binary feature function, and fc^φi(y) indicates truth or falsity of the sub-formula
c of the first-order formula φi in the proposition set y.
[0140] A value of the binary feature function (binary feature function) fc^φi(y) is 1 or 0.
Specifically, in the proposition set y, when the sub-formula c is true, fc^φi(y) is
1, or otherwise, fc^φi(y) is 0.
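Purely for illustration, the unnormalized part of this computation could look like the following sketch, where each weighted formula is represented by its weight and a list of ground sub-formulas evaluated on the proposition set; all names and the data layout are assumptions:
# Hypothetical sketch: unnormalized MLN confidence of a proposition set y.
# Each formula is (weight, groundings), where groundings is a list of
# callables c(y) -> bool implementing the binary feature function.
import math

def confidence(y, formulas):
    score = 0.0
    for weight, groundings in formulas:
        # sum of binary feature functions: 1 per true ground sub-formula
        score += weight * sum(1 for c in groundings if c(y))
    return math.exp(score)  # divide by the constant Z to normalize

# Example: one grounding of a soft rule that is true when hasResource(11,21)
# holds in the proposition set y:
formulas = [(0.8, [lambda y: ("11", "21") in y["hasResource"]])]
y = {"hasResource": {("11", "21")}}
print(confidence(y, formulas))  # exp(0.8), before normalization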
[0141] Optionally, a maximum number of iterations may be set in step 105. For example, the
maximum number of iterations is 100.
[0142] In this manner, after the confidence of each proposition set is calculated in step
105, a confidence set corresponding to a possible question parse space may be obtained,
and each confidence in the confidence set is corresponding to a proposition set.
[0143] Further, in step 106, one or several proposition sets may be selected from multiple
proposition sets of the possible question parse spaces, and confidence of the selected
one or several proposition sets satisfies a preset condition.
[0144] Optionally, in step 106, a proposition set whose confidence value is largest may
be determined, and a combination of true propositions in the proposition set whose
confidence value is largest is acquired.
[0145] Optionally, in step 106, multiple proposition sets whose confidence values are largest
may be determined, and a combination of true propositions in the multiple proposition
sets whose confidence values are largest is acquired. The present invention is not
limited thereto.
[0146] Because truth or falsity of propositions in the proposition sets are represented
by values of hidden predicates, it may be understood that the acquiring a combination
of true propositions in step 106 is acquiring a combination of hidden predicates whose
values are 1. In addition, the true propositions are used to indicate search phrases
selected from the first candidate phrases, search resource items selected from the
first resource items, and features of the search resource items.
[0147] For example, for the question "Give me all actors who were born in Berlin.", expressions
in which determined values of hidden predicates are 1 are as follows:
hasPhrase(11)
hasPhrase(13)
hasPhrase(15)
hasResource(11,21)
hasResource(13,23)
hasResource(15,29)
hasRelation(21,23,1_1)
hasRelation(23,29,2_1)
[0148] Further, a formal query statement may be generated in step 107. Optionally, the formal
query statement may be an SQL. Alternatively, in this embodiment of the present invention,
the formal query statement may be a SPARQL; correspondingly, step 107 may also be
referred to as a SPARQL generation (SPARQL Generation) process.
[0149] Optionally, step 107 may be: generating the SPARQL according to the combination of
true propositions by using a SPARQL template.
[0150] Specifically, a triple of the SPARQL may be created by using the combination of true
propositions, and further, the SPARQL is generated by using the SPARQL template.
[0151] Specifically, natural language questions may be classified into three types: Yes/No,
Number, and Normal. Correspondingly, the SPARQL template also includes an ASK WHERE
template, a SELECT COUNT(?url) WHERE template, and a SELECT ?url WHERE template.
[0152] Then, when the question is a Yes/No question, the SPARQL is generated according to
the combination of true propositions by using the ASK WHERE template.
[0153] When the question is a Normal question, the SPARQL is generated according to the
combination of true propositions by using the SELECT ?url WHERE template.
[0154] When the question is a Number question, the SPARQL is generated according to the
combination of true propositions by using the SELECT ?url WHERE template, or when
a numeric answer cannot be obtained for the SPARQL generated by using the SELECT ?url
WHERE template, the SPARQL is generated by using the SELECT COUNT(?url) WHERE template.
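The template selection described in paragraphs [0152] to [0154] can be summarized by the following sketch; question-type detection itself is outside this description and is assumed:
# Hypothetical sketch: choose a SPARQL template by question type.
def choose_template(question_type, numeric_answer_found=True):
    if question_type == "Yes/No":
        return "ASK WHERE { %s }"
    if question_type == "Number" and not numeric_answer_found:
        return "SELECT COUNT(?url) WHERE { %s }"
    # Normal questions, and Number questions answered directly:
    return "SELECT ?url WHERE { %s }"

print(choose_template("Normal")
      % "?url rdf:type dbo:Actor . ?url dbo:birthPlace dbr:Berlin .")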
[0155] For example, the question "Give me all actors who were born in Berlin." is a Normal
question, and the generated SPARQL is:
SELECT ?url WHERE{
?url rdf:type dbo:Actor .
?url dbo:birthPlace dbr:Berlin .
}
[0156] Optionally, step 107 may include: generating a resource query graph according to
the combination of true propositions, where the resource query graph includes vertexes
and edges, where the vertexes include the search phrases and the search resource items,
and a search phrase in each vertex is mapped to a search resource item in the vertex.
The edge indicates a parameter matching relationship between two search resource items
in two connected vertexes, and further the SPARQL is generated according to the resource
query graph.
[0157] Specifically, three interconnected search resource items in the resource query graph
may be used as the triple of the SPARQL, where a type of a middle search resource
item in the three interconnected search resource items is relation.
[0158] In this manner, in this embodiment of the present invention, the natural language
question may be converted into the SPARQL. In addition, the used predefined first-order
formula is field-independent, that is, the predefined Boolean formula and weighted
formula may be applied to all knowledge bases and have extensibility. That is, by
using the method provided in this embodiment of the present invention, it is unnecessary
to manually set a conversion rule.
[0159] For example, FIG. 3 shows an example of question parsing according to the present
invention.
[0160] 301. Receive a question entered by a user. It is assumed that the question is a natural
language question "Which software has been developed by organization founded in California,
USA?"
[0161] 302. Perform phrase detection (phrase detection) on the question entered in step
301 to determine first candidate phrases.
[0162] For a detailed description of step 302, reference may be made to step 102 in the
foregoing embodiment, and to avoid repetition, details are not described herein again.
[0163] For example, the determined first candidate phrases include: software, developed,
developed by, organizations, founded in, founded, California, and USA.
[0164] 303. Perform phrase mapping (phrase mapping) on the first candidate phrases determined
in step 302, and map the first candidate phrases to first resource items.
[0165] For a detailed description of step 303, reference may be made to step 103 in the
foregoing embodiment, and to avoid repetition, details are not described herein again.
[0166] For example, the first candidate phrase software is mapped to dbo:Software, dbr:Software,
and the like, which are not further listed herein.
[0167] 304. Determine values of observed predicates and create possible question parse spaces
through feature extraction (feature extraction).
[0168] For a detailed description of step 304, reference may be made to step 104 in the
foregoing embodiment, and to avoid repetition, details are not described herein again.
[0169] It should be noted that details are not further listed herein.
[0170] 305. Calculate confidence of each proposition set through joint inference (Inference),
and acquire a combination of true propositions in a proposition set whose confidence
satisfies a preset condition.
[0171] For a detailed description of step 305, reference may be made to steps 105 and 106
in the foregoing embodiment, and to avoid repetition, details are not described herein
again.
[0172] The combination of true propositions is a combination of hidden predicates whose
values are 1.
[0173] For example, expressions in which determined values of hidden predicates are 1 are:
hasPhrase(software),
hasPhrase(developed by),
hasPhrase(organizations),
hasPhrase(founded in),
hasPhrase(California);
hasResource(software, dbo:Software),
hasResource(developed by, dbo:developer),
hasResource(California, dbr:California),
hasResource(organizations, dbo:Company),
hasResource(founded in, dbo:foundationPlace);
hasRelation(dbo:Software, dbo:developer, 1_1),
hasRelation(dbo:developer, dbo:Company, 2_1),
hasRelation(dbo:Company, dbo:foundationPlace, 1_1),
hasRelation(dbo:foundationPlace, dbr:California, 2_1).
[0174] 306. Generate a resource items query graph.
[0175] Specifically, the resource items query graph may also be referred to as a semantic
items query graph (Semantic Items Query Graph).
[0176] Specifically, a vertex in the resource items query graph may include a search resource
item, a type of the search resource item, and a position of a search phrase that is
in the question and is mapped to the search resource item.
[0177] Specifically, an edge in the resource items query graph includes a parameter matching
relationship between two search resource items in two vertexes connected by the edge.
[0178] It should be noted that a relation between search resource items in the resource
items query graph is a binary relation.
[0179] Optionally, a vertex in the resource items query graph may include a search resource
item, a type of the search resource item, a search phrase mapped to the search resource
item, and a position of the search phrase in the question.
FIG. 4 is another example of the resource items query graph, including vertexes 311
to 315.
[0180] The vertex 311 includes a search resource item dbo:Software, a type Class of the
search resource item, and a search phrase Software and a position 1 1 of the search
phrase in the question. The search phrase Software is mapped to the search resource
item dbo:Software.
[0181] The vertex 312 includes a search resource item dbo:developer, a type Relation of
the search resource item, and a search phrase developed by and a position 4 5 of the
search phrase in the question. The search phrase developed by is mapped to the search
resource item dbo:developer.
[0182] The vertex 313 includes a search resource item dbo:Company, a type Class of the search
resource item, and a search phrase organizations and a position 6 6 of the search
phrase in the question. The search phrase organizations is mapped to the search resource
item dbo:Company.
[0183] The vertex 314 includes a search resource item dbo:foundationPlace, a type Relation
of the search resource item, and a search phrase founded in and a position 7 8 of
the search phrase in the question. The search phrase founded in is mapped to the search
resource item dbo:foundationPlace.
[0184] The vertex 315 includes a search resource item dbr:California, a type Entity of the
search resource item, and a search phrase California and a position 9 9 of the search
phrase in the question. The search phrase California is mapped to the search resource
item dbr:California.
[0185] An edge 1_1 between the vertex 311 and the vertex 312 indicates that a parameter
matching relationship between the search resource item dbo:Software and the search
resource item dbo:developer is 1_1.
[0186] An edge 2_1 between the vertex 312 and the vertex 313 indicates that a parameter
matching relationship between the search resource item dbo:developer and the search
resource item dbo:Company is 2_1.
[0187] An edge 1_1 between the vertex 313 and the vertex 314 indicates that a parameter
matching relationship between the search resource item dbo:Company and the search
resource item dbo:foundationPlace is 1_1.
[0188] An edge 1_2 between the vertex 315 and the vertex 314 indicates that a parameter
matching relationship between the search resource item dbr:California and the search
resource item dbo:foundationPlace is 1_2.
[0189] 307. Generate a SPARQL (SPARQL generation).
[0190] Specifically, a binary relation in the resource items query graph is converted into
a ternary relation.
[0191] That is, three interconnected search resource items in the resource items query graph
have a ternary relation, and a type of a middle search resource item in the three
interconnected search resource items is relation.
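As an illustration of converting the binary relations of the resource items query graph into ternary relations, consider the following sketch; the graph representation and function names are assumptions:
# Hypothetical sketch: turn paths of three interconnected resource items,
# whose middle item has type Relation, into SPARQL-style triples.
def graph_to_triples(edges, types):
    """edges: list of pairs of connected resource items;
    types: maps a resource item to 'E', 'C', or 'R'."""
    triples = []
    for mid, t in types.items():
        if t != "R":
            continue
        nbrs = [a if b == mid else b for (a, b) in edges if mid in (a, b)]
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                # the argument order would be fixed by the parameter matching
                # relationships on the two edges (omitted in this sketch)
                triples.append((nbrs[i], mid, nbrs[j]))
    return triples

edges = [("dbo:Software", "dbo:developer"), ("dbo:developer", "dbo:Company"),
         ("dbo:Company", "dbo:foundationPlace"), ("dbo:foundationPlace", "dbr:California")]
types = {"dbo:Software": "C", "dbo:developer": "R", "dbo:Company": "C",
         "dbo:foundationPlace": "R", "dbr:California": "E"}
print(graph_to_triples(edges, types))
# [('dbo:Software', 'dbo:developer', 'dbo:Company'),
#  ('dbo:Company', 'dbo:foundationPlace', 'dbr:California')]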
[0192] For example, the natural language question in step 301 is a Normal question, and
a SPARQL generated by using the SELECT ?url WHERE template is:
SELECT ?url WHERE{
?url rdf:type dbo:Software .
?url dbo:developer ?x1 .
?x1 rdf:type dbo:Company .
?x1 dbo:foundationPlace dbr:California .
}
[0193] In this manner, in this embodiment of the present invention, the natural language
question may be converted into the SPARQL. In addition, the used predefined first-order
formula is field-independent, that is, the predefined Boolean formula and weighted
formula may be applied to all knowledge bases and have extensibility. That is, by
using the method provided in this embodiment of the present invention, it is unnecessary
to manually set a conversion rule.
[0194] In addition, it may be understood that in this embodiment of the present invention,
the predefined Boolean formula and weighted formula are language-independent, that
is, have language extensibility. For example, the formulas may be used both in an
English knowledge base and a Chinese knowledge base.
[0195] As described above, in this embodiment of the present invention, the uncertain inference
in step 105 may be based on the MLN. The MLN includes the predefined first-order formula
and the weight of the first-order formula.
[0196] Optionally, the first-order formula may include a Boolean formula and a weighted
formula. A weight of the Boolean formula is +∞, and a weight of the weighted formula
is a weighted formula weight. The weighted formula weight may be obtained through
training by using a learning method. Then, it may be understood that before step 101,
as shown in FIG. 5, the method may further include:
401. Acquire multiple natural language questions from the knowledge base.
402. Perform phrase detection on the multiple natural language questions to determine
second candidate phrases of the multiple natural language questions.
403. Map the second candidate phrases to second resource items in the knowledge base,
where the second resource items have consistent semantic meanings with the second
candidate phrases.
404. Determine, according to the second candidate phrases and the second resource
items, values of observed predicates corresponding to the multiple natural language
questions.
405. Acquire hand-labeled values of hidden predicates corresponding to the multiple
natural language questions.
406. Create an undirected graph according to the values of the observed predicates
corresponding to the multiple natural language questions, the values of the hidden
predicates corresponding to the multiple natural language questions, and the first-order
formula, and determine the weight of the first-order formula through training.
[0197] In this manner, in this embodiment of the present invention, based on the predefined
first-order formula, by using the learning method, the weight of the first-order formula
for the knowledge base can be determined, and the first-order formula may be used
as a conversion rule for the knowledge base. In this manner, it is unnecessary to
manually set a conversion rule, and the predefined first-order formula of the Markov
logic network MLN has extensibility, and is applicable to any knowledge base.
[0198] Specifically, a knowledge base of a question answering system includes a question
base, where the question base includes multiple natural language questions. Then,
step 401 may be acquiring multiple natural language questions from the question base
of the knowledge base of the question answering system. In this embodiment of the
present invention, the quantity of the multiple natural language questions is not limited;
for example, 1000 natural language questions may be used.
[0199] For example, 110 natural language questions may be acquired from a training set (training
set) of a question base Q1 in a question answering over linked data (Question Answering
over Linked Data, QALD) system.
[0200] In this embodiment of the present invention, for the process of step 402, reference
may be made to the process of step 102 in the foregoing embodiment; for the process
of step 403, reference may be made to the process of step 103 in the foregoing embodiment;
and for the process of step 404, reference may be made to the process of step 104
in the foregoing embodiment. To avoid repetition, details are not described herein
again. In this manner, for the multiple natural language questions in step 401, the values
of the observed predicates corresponding to the multiple natural language questions can be
determined.
[0201] It may be understood that, before step 405, it is necessary to manually label values
of hidden predicates corresponding to each natural language question in the multiple
natural language questions, that is, the values that are of the hidden predicates
corresponding to the multiple natural language questions and are acquired in step
405 are hand-labeled (hand-labeled).
[0202] Optionally, the first-order formula includes a Boolean formula and a weighted formula.
A weight of the Boolean formula is +∞, and a weight of the weighted formula is a weighted
formula weight. The hand-labeled values of the hidden predicates in step 405 satisfy
the Boolean formula. Correspondingly, in step 406, the weight of the first-order formula
is determined through training, that is, the weight of the weighted formula is determined
through training. The undirected graph may include a Markov network (Markov Network,
MN).
[0203] Optionally, in step 406, the weight of the first-order formula may be determined
according to the values of the observed predicates corresponding to the multiple natural
language questions, the values of the hidden predicates corresponding to the multiple
natural language questions, and the first-order formula by using a margin infused
relaxed algorithm (Margin Infused Relaxed Algorithm, MIRA).
[0204] Specifically, in step 406, the thebeast tool may be used to learn the weighted formula
weight. In a parameter learning process, the weighted formula weight may be first
initialized to 0, and then the MIRA is used to update the weighted formula weight.
Optionally, in a training process, a maximum number of training iterations may be
further set, for example, the maximum number of training iterations is 10.
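A highly simplified sketch of one such update step is shown below; the feature values here stand for the numbers of true groundings of each weighted formula under the gold and predicted proposition sets, and the loss value and clip constant are assumptions for illustration:
# Hypothetical sketch: one MIRA update for weighted formula weights.
def mira_update(weights, feats_gold, feats_pred, loss, clip=1.0):
    """weights, feats_gold, feats_pred: lists of equal length; the features
    are counts of true groundings of each weighted formula."""
    diff = [g - p for g, p in zip(feats_gold, feats_pred)]
    sq_norm = sum(d * d for d in diff)
    if sq_norm == 0.0:
        return weights  # prediction already matches the gold labeling
    margin = sum(w * d for w, d in zip(weights, diff))
    tau = max(0.0, min(clip, (loss - margin) / sq_norm))
    return [w + tau * d for w, d in zip(weights, diff)]

# Training loop sketch: weights start at 0 and are updated per question.
weights = [0.0] * 8  # one weight per weighted formula sf1..sf8 (illustrative)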
[0205] For example, the weighted formula weight of sf3 in Table 5 may be shown in Table
6. It may be learned from Table 6 that when a part of speech of a head word of a candidate
phrase is nn, a probability that the candidate phrase is mapped to a resource item
of a type E is relatively high.
Table 6
| Part of speech of a head word of a candidate phrase | Type of a resource item to which the candidate phrase is mapped | Weighted formula weight |
| nn | E | 2.11 |
| nn | C | 0.243 |
| nn | R | 0.335 |
| vb | R | 0.517 |
| wp | C | 0.143 |
| wr | C | 0.025 |
[0206] In this manner, through this embodiment shown in FIG. 5, a weighted formula weight
of any knowledge base may be determined, and therefore, a conversion rule for any
knowledge base may be obtained.
[0207] It may be understood that in this embodiment of the present invention, the method for
determining a weight of a first-order formula is data-driven and may be applied
to different knowledge bases. Because manual labor is greatly reduced, efficiency
of parsing questions in a knowledge base may be improved.
[0208] It may be understood that in this embodiment of the present invention, structure
learning may also be performed according to the created undirected graph, and further
a second-order formula or even a higher-order formula may be learned; further, a new
undirected graph is created according to the learned second-order formula or even
the higher-order formula, and a weight corresponding to the second-order formula or
even the higher-order formula is learned. The present invention is not limited thereto.
[0209] FIG. 6 is a block diagram of a device for parsing a question according to an embodiment
of the present invention. A device 500 shown in FIG. 6 includes a receiving unit 501,
a phrase detection unit 502, a mapping unit 503, a first determining unit 504, a second
determining unit 505, an acquiring unit 506, and a generating unit 507.
[0210] The receiving unit 501 is configured to receive a question entered by a user.
[0211] The phrase detection unit 502 is configured to perform phrase detection on the question
received by the receiving unit 501 to determine first candidate phrases.
[0212] The mapping unit 503 is configured to map the first candidate phrases determined
by the phrase detection unit 502 to first resource items in a knowledge base, where
the first resource items have consistent semantic meanings with the first candidate
phrases.
[0213] The first determining unit 504 is configured to determine values of observed predicates
and possible question parse spaces according to the first candidate phrases and the
first resource items, where the observed predicates are used to indicate features
of the first candidate phrases, features of the first resource items, and a relationship
between the first candidate phrases and the first resource items, points in the possible
question parse spaces are proposition sets, and truth or falsity of propositions in
the proposition sets are represented by values of hidden predicates.
[0214] The second determining unit 505 is configured to: perform uncertain inference on
each proposition set in the possible question parse spaces according to the values
that are of the observed predicates and are determined by the first determining unit
504 and the values of the hidden predicates, and calculate confidence of each proposition
set.
[0215] The acquiring unit 506 is configured to acquire a combination of true propositions
in a proposition set whose confidence satisfies a preset condition, where the true
propositions are used to indicate search phrases selected from the first candidate
phrases, search resource items selected from the first resource items, and features
of the search resource items.
[0216] The generating unit 507 is configured to generate a formal query statement according
to the combination of true propositions that is acquired by the acquiring unit 506.
[0217] In this embodiment of the present invention, uncertain inference is performed by
using observed predicates and hidden predicates, and a natural language question can
be converted into a formal query statement. In addition, in this embodiment of the
present invention, an uncertain inference method can be applied to a knowledge base
in any field, and has field extensibility. Therefore, it is unnecessary to manually
configure a conversion rule for a knowledge base.
[0218] Optionally, in an embodiment, the uncertain inference is based on a Markov logic
network MLN, where the MLN includes a predefined first-order formula and a weight
of the first-order formula.
[0219] Optionally, in another embodiment,
the acquiring unit 506 is further configured to acquire multiple natural language
questions from the knowledge base;
the phrase detection unit 502 is further configured to perform phrase detection on
the multiple natural language questions acquired by the acquiring unit 506 to determine second candidate phrases;
the mapping unit 503 is further configured to map the second candidate phrases to
second resource items in the knowledge base, where the second resource items have
consistent semantic meanings with the second candidate phrases;
the first determining unit 504 is further configured to determine, according to the
second candidate phrases and the second resource items, values of observed predicates
corresponding to the multiple natural language questions;
the acquiring unit 506 is further configured to acquire hand-labeled values of hidden
predicates corresponding to the multiple natural language questions; and
the second determining unit 505 is further configured to: create an undirected graph
according to the values of the observed predicates corresponding to the multiple natural
language questions, the values of the hidden predicates corresponding to the multiple
natural language questions, and the first-order formula, and determine the weight
of the first-order formula through training.
[0220] Optionally, in another embodiment, the first-order formula includes a Boolean formula
and a weighted formula, a weight of the Boolean formula is +∞, a weight of the weighted
formula is a weighted formula weight, and the hand-labeled values of the hidden predicates
corresponding to the multiple natural language questions satisfy the Boolean formula;
and the second determining unit 505 is specifically configured to: create the undirected
graph according to the values of the observed predicates corresponding to the multiple
natural language questions, the values of the hidden predicates corresponding to the
multiple natural language questions, and the first-order formula, and determine the
weight of the weighted formula through training.
[0221] Optionally, in another embodiment, the second determining unit 505 is specifically
configured to: create the undirected graph according to the values of the observed
predicates corresponding to the multiple natural language questions, the values of
the hidden predicates corresponding to the multiple natural language questions, and
the first-order formula, and determine the weight of the first-order formula by using
a margin infused relaxed algorithm MIRA.
[0222] Optionally, in another embodiment, the MLN is indicated by
M, the first-order formula is indicated by φ
i, the weight of the first-order formula is indicated by
wi, and the proposition set is indicated by
y ; and the second determining unit 505 is specifically configured to:
calculate the confidence of each proposition set according to

p(y) = (1/Z) · exp( Σ_{(φi, wi) ∈ M} wi · Σ_{c ∈ Cnφi} fc^φi(y) )

where Z is a normalization constant, Cnφi is a sub-formula set corresponding to the
first-order formula φi, c is a sub-formula in the sub-formula set Cnφi, fc^φi is a
binary feature function, and fc^φi(y) indicates truth or falsity of the sub-formula
c of the first-order formula φi in the proposition set y.
[0223] Optionally, in another embodiment, the acquiring unit 506 is specifically configured
to: determine a proposition set whose confidence value is largest, and acquire a combination
of true propositions in the proposition set whose confidence value is largest.
[0224] Optionally, in another embodiment,
the features of the first candidate phrases include positions of the first candidate
phrases in the question, parts of speech of head words of the first candidate phrases,
and tags on a dependency path between every two of the first candidate phrases;
the features of the first resource items include types of the first resource items,
a correlation value between every two of the first resource items, and a parameter
matching relationship between every two of the first resource items;
the relationship between the first candidate phrases and the first resource items
includes prior matching scores between the first candidate phrases and the first resource
items; and
the first determining unit 504 is specifically configured to:
determine the positions of the first candidate phrases in the question;
determine the parts of speech of the head words of the first candidate phrases by
using a stanford part-of-speech tagging tool;
determine the tags on the dependency path between every two of the first candidate
phrases by using a stanford dependency syntax parser tool;
determine the types of the first resource items from the knowledge base, where the
types are entity or class or relation;
determine the parameter matching relationship between every two of the first resource
items from the knowledge base;
use a similarity coefficient between every two of the first resource items as the
correlation value between every two of the first resource items; and
calculate the prior matching scores between the first candidate phrases and the first
resource items, where the prior matching scores are used to indicate probabilities
that the first candidate phrases are mapped to the first resource items.
[0225] Optionally, in another embodiment, the formal query statement is a Simple Protocol
and Resource Description Framework Query Language SPARQL.
[0226] Optionally, in another embodiment, the generating unit 507 is specifically configured
to:
generate the SPARQL according to the combination of true propositions by using a SPARQL
template.
[0227] Optionally, in another embodiment, the SPARQL template includes an ASK WHERE template,
a SELECT COUNT(?url) WHERE template, and a SELECT ?url WHERE template; and
the generating unit 507 is specifically configured to:
when the question is a Yes/No question, generate the SPARQL according to the combination
of true propositions by using the ASK WHERE template;
when the question is a Normal question, generate the SPARQL according to the combination
of true propositions by using the SELECT ?url WHERE template; and
when the question is a Number question, generate the SPARQL according to the combination
of true propositions by using the SELECT ?url WHERE template, or when a numeric answer
cannot be obtained for the SPARQL generated by using the SELECT ?url WHERE template,
generate the SPARQL by using the SELECT COUNT(?url) WHERE template.
[0228] Optionally, in another embodiment, the phrase detection unit 502 is specifically
configured to:
use word sequences in the question as the first candidate phrases, where the word
sequences satisfy:
all consecutive non-stop words in the word sequence begin with a capital letter, or
if all consecutive non-stop words in the word sequence do not begin with a capital
letter, a length of the word sequence is less than four;
a part of speech of a head word of the word sequence is jj or nn or rb or vb, where
jj is an adjective, nn is a noun, rb is an adverb, and vb is a verb; and
all words included in the word sequence are not stop words.
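A rough sketch of this word-sequence filter follows; tokenization, part-of-speech tagging, and the stop-word list are assumptions outside this description:
# Hypothetical sketch: filter word sequences by the conditions listed above.
STOP_WORDS = {"the", "a", "an", "of", "who", "were", "give", "me", "all"}  # assumed
VALID_HEAD_TAGS = {"jj", "nn", "rb", "vb"}

def is_candidate(words, tags, head_index):
    if any(w.lower() in STOP_WORDS for w in words):
        return False  # condition: no word in the sequence is a stop word
    if not all(w[0].isupper() for w in words) and len(words) >= 4:
        return False  # capitalization / length condition
    return tags[head_index] in VALID_HEAD_TAGS  # head word is jj, nn, rb, or vb

print(is_candidate(["born", "in"], ["vb", "in"], 0))  # True, as in Table 1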
[0229] Optionally, in another embodiment, the device 500 may be a server of the knowledge
base.
[0230] The device 500 can implement each process implemented by a device in the embodiments
shown in FIG. 1 to FIG. 5. To avoid repetition, details are not described herein again.
[0231] FIG. 7 is a block diagram of a device for parsing a question according to another
embodiment of the present invention. A device 600 shown in FIG. 7 includes a processor
601, a receiver circuit 602, a transmitter circuit 603, and a memory 604.
[0232] The receiver circuit 602 is configured to receive a question entered by a user.
[0233] The processor 601 is configured to perform phrase detection on the question received
by the receiver circuit 602 to determine first candidate phrases.
[0234] The processor 601 is further configured to map the first candidate phrases to first
resource items in a knowledge base, where the first resource items have consistent
semantic meanings with the first candidate phrases.
[0235] The processor 601 is further configured to determine values of observed predicates
and possible question parse spaces according to the first candidate phrases and the
first resource items, where the observed predicates are used to indicate features
of the first candidate phrases, features of the first resource items, and a relationship
between the first candidate phrases and the first resource items, points in the possible
question parse spaces are proposition sets, and truth or falsity of propositions in
the proposition sets are represented by values of hidden predicates.
[0236] The processor 601 is further configured to: perform uncertain inference on each proposition
set in the possible question parse spaces according to the determined values of the
observed predicates and the values of the hidden predicates, and calculate confidence
of each proposition set.
[0237] The receiver circuit 602 is further configured to acquire a combination of true propositions
in a proposition set whose confidence satisfies a preset condition, where the true
propositions are used to indicate search phrases selected from the first candidate
phrases, search resource items selected from the first resource items, and features
of the search resource items.
[0238] The processor 601 is further configured to generate a formal query statement according
to the combination of true propositions.
[0239] In this embodiment of the present invention, uncertain inference is performed by
using observed predicates and hidden predicates, and a natural language question can
be converted into a formal query statement. In addition, in this embodiment of the
present invention, an uncertain inference method can be applied to a knowledge base
in any field, and has field extensibility. Therefore, it is unnecessary to manually
configure a conversion rule for a knowledge base.
[0240] Components in the device 600 are coupled together by using a bus system 605, where
the bus system 605 includes a power bus, a control bus, and a status signal bus in addition
to a data bus. However, for clear description, various buses in FIG. 7 are marked
as the bus system 605.
[0241] The foregoing method disclosed in this embodiment of the present invention may be
applied in the processor 601 or implemented by the processor 601. The processor 601
may be an integrated circuit chip, and has a signal processing capability. In an implementation
process, each step of the foregoing method may be completed by using an integrated
logic circuit of hardware in the processor 601 or an instruction in a form of software.
The processor 601 may be a general purpose processor, a digital signal processor
(Digital Signal Processor, DSP), an application-specific integrated circuit (Application
Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable
Gate Array, FPGA), or another programmable logical device, discrete gate or transistor
logic device, or discrete hardware component. The processor may implement or execute
each method, step, and logic block diagram disclosed in this embodiment of the present
invention. The general purpose processor may be a microprocessor or the processor
may be any conventional processor and the like. The steps of the method disclosed
with reference to this embodiment of the present invention may be directly executed
and completed by a hardware decoding processor, or executed and completed by a combination
of hardware and software modules in a decoding processor. The software module may
be located in a mature storage medium in the field, such as a random access memory,
a flash memory, a read-only memory, a programmable read-only memory, an electrically-erasable
programmable memory, or a register. The storage medium is located in the memory 604,
and the processor 601 reads information in the memory 604 and completes the steps
in the foregoing methods in combination with hardware of the processor.
[0242] It may be understood that the memory 604 in this embodiment of the present invention
may be a volatile memory or non-volatile memory, or may include both a volatile memory
and a non-volatile memory. The non-volatile memory may be a read-only memory (Read-Only
Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable
programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable
read-only memory (Electrically EPROM, EEPROM), or a flash memory. The volatile memory
may be a random access memory (Random Access Memory, RAM), which is used as an external
high-speed cache. By way of example rather than limitation, RAMs
in many forms may be used, for example, a static random access memory (Static RAM,
SRAM), a dynamic random access memory (Dynamic RAM, DRAM), a synchronous dynamic random
access memory (Synchronous DRAM, SDRAM), a double data rate synchronous dynamic random
access memory (Double Data Rate SDRAM, DDR SDRAM), an enhanced synchronous dynamic
random access memory (Enhanced SDRAM, ESDRAM), a synchronous link dynamic random access
memory (Synchlink DRAM, SLDRAM), and a direct memory bus random access memory (Direct
Rambus RAM, DR RAM). The memory 604 in the system and method described in this specification
is intended to include but is not limited to these and memories of any other appropriate
types.
[0243] It may be understood that these embodiments described in the specification may be
implemented by hardware, software, firmware, middleware, micro code, or a combination
thereof. For hardware implementation, a processing unit may be implemented in one
or more application specific integrated circuits (Application Specific Integrated
Circuits, ASIC), a digital signal processor (Digital Signal Processing, DSP), a digital
signal processing device (DSP Device, DSPD), a programmable logic device (Programmable
Logic Device, PLD), a field-programmable gate array (Field-Programmable Gate Array,
FPGA), a general purpose processor, a controller, a micro controller, other electronic
units used to execute functions described in this application, or a combination thereof.
[0244] When the embodiments are implemented in the software, firmware, middleware, micro
code, program code, or code segment, they may be stored, for example, in a machine-readable
medium of a storage component. The code segment may indicate any combination of a
process, a function, a subprogram, a program, a routine, a subroutine, a module, a
software group, a class, an instruction, a data structure, or a program statement.
The code segment may be coupled into another code segment or hardware circuit by transferring
and/or receiving information, data, independent variables, parameters, or content
of the memory. Any appropriate mode including memory sharing, message transfer, token
transfer, network transmission, or the like may be used to transfer, forward, or send
the information, independent variables, parameters, data, or the like.
[0245] For software implementation, the technology described in the specification may be
implemented by using modules (for example, processes, functions, and the like) that
execute the functions in the specification. The software code may be stored in a memory
unit and executed by the processor. The memory unit may be implemented in the processor
or outside the processor. In the latter case, the memory unit may be coupled into
the processor in a communication mode by various means known in the art.
[0246] Optionally, in an embodiment, the uncertain inference is based on a Markov logic
network MLN, where the MLN includes a predefined first-order formula and a weight
of the first-order formula.
[0247] In this embodiment of the present invention, the memory 604 may be configured to
store resource items, types of the resource items, and the like. The memory 604 may
be further configured to store the first-order formula. The memory 604 may be further
configured to store a SPARQL template.
[0248] Optionally, in another embodiment,
the receiver circuit 602 is further configured to acquire multiple natural language
questions from the knowledge base;
the processor 601 is further configured to perform phrase detection on the multiple
natural language questions to determine second candidate phrases;
the processor 601 is further configured to map the second candidate phrases to second
resource items in the knowledge base, where the second resource items have consistent
semantic meanings with the second candidate phrases;
the processor 601 is further configured to determine, according to the second candidate
phrases and the second resource items, values of observed predicates corresponding
to the multiple natural language questions;
the receiver circuit 602 is further configured to acquire hand-labeled values of hidden
predicates corresponding to the multiple natural language questions; and
the processor 601 is further configured to: create an undirected graph according to
the values of the observed predicates corresponding to the multiple natural language
questions, the values of the hidden predicates corresponding to the multiple natural
language questions, and the first-order formula, and determine the weight of the first-order
formula through training.
[0249] Optionally, in another embodiment, the first-order formula includes a Boolean formula
and a weighted formula, a weight of the Boolean formula is +∞, a weight of the weighted
formula is a weighted formula weight, and the hand-labeled values of the hidden predicates
corresponding to the multiple natural language questions satisfy the Boolean formula;
and
the processor 601 is specifically configured to: create the undirected graph according
to the values of the observed predicates corresponding to the multiple natural language
questions, the values of the hidden predicates corresponding to the multiple natural
language questions, and the first-order formula, and determine the weight of the weighted
formula through training.
[0250] Optionally, in another embodiment, the processor 601 is specifically configured to:
create the undirected graph according to the values of the observed predicates corresponding
to the multiple natural language questions, the values of the hidden predicates corresponding
to the multiple natural language questions, and the first-order formula, and determine
the weight of the first-order formula by using a margin infused relaxed algorithm
MIRA.
[0251] Optionally, in another embodiment, the MLN is indicated by
M, the first-order formula is indicated by φ
i, the weight of the first-order formula is indicated by
wi, and the proposition set is indicated by
y ; and the processor 601 is specifically configured to:
calculate the confidence of each proposition set according to

p(y) = (1/Z) · exp( Σ_{(φi, wi) ∈ M} wi · Σ_{c ∈ Cnφi} fc^φi(y) )

where Z is a normalization constant, Cnφi is a sub-formula set corresponding to the
first-order formula φi, c is a sub-formula in the sub-formula set Cnφi, fc^φi is a
binary feature function, and fc^φi(y) indicates truth or falsity of the sub-formula
c of the first-order formula φi in the proposition set y.
[0252] Optionally, in another embodiment, the receiver circuit 602 is specifically configured
to: determine a proposition set whose confidence value is the largest, and acquire the
combination of true propositions in the proposition set whose confidence value is the largest.
[0253] Optionally, in another embodiment,
the features of the first candidate phrases include positions of the first candidate
phrases in the question, parts of speech of head words of the first candidate phrases,
and tags on a dependency path between every two of the first candidate phrases;
the features of the first resource items include types of the first resource items,
a correlation value between every two of the first resource items, and a parameter
matching relationship between every two of the first resource items;
the relationship between the first candidate phrases and the first resource items
includes prior matching scores between the first candidate phrases and the first resource
items; and
the processor 601 is specifically configured to:
determine the positions of the first candidate phrases in the question;
determine the parts of speech of the head words of the first candidate phrases by
using a Stanford part-of-speech tagging tool;
determine the tags on the dependency path between every two of the first candidate
phrases by using a Stanford dependency syntax parser tool;
determine the types of the first resource items from the knowledge base, where each
type is entity, class, or relation;
determine the parameter matching relationship between every two of the first resource
items from the knowledge base;
use a similarity coefficient between every two of the first resource items as the
correlation value between every two of the first resource items; and
calculate the prior matching scores between the first candidate phrases and the first
resource items, where the prior matching scores are used to indicate probabilities
that the first candidate phrases are mapped to the first resource items.
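Two of the listed features admit a short illustration. The sketch below assumes, purely for concreteness, that the similarity coefficient is a Jaccard coefficient over context word sets and that the prior matching score is a relative frequency drawn from a mapping lexicon; neither definition is dictated by the text above.

```python
# Illustrative (assumed) definitions of the correlation value and the
# prior matching score.

def jaccard(context_a, context_b):
    """Similarity coefficient between two resource items, computed over
    their (assumed) context word sets."""
    a, b = set(context_a), set(context_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def prior_matching_score(phrase, item, lexicon_counts):
    """Estimated probability that `phrase` maps to `item`, from counts
    of (phrase, item) pairs in a hypothetical mapping lexicon."""
    total = sum(n for (p, _), n in lexicon_counts.items() if p == phrase)
    return lexicon_counts.get((phrase, item), 0) / total if total else 0.0

print(jaccard(["film", "director"], ["movie", "director"]))      # ~0.333
print(prior_matching_score("Inception", "dbr:Inception",
                           {("Inception", "dbr:Inception"): 9,
                            ("Inception", "dbr:Inception_(album)"): 1}))
```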
[0254] Optionally, in another embodiment, the formal query statement is a Simple Protocol
and Resource Description Framework Query Language (SPARQL) query statement.
[0255] Optionally, in another embodiment, the processor 601 is specifically configured to:
generate the SPARQL according to the combination of true propositions by using a SPARQL
template.
[0256] Optionally, in another embodiment, the SPARQL template includes an ASK WHERE template,
a SELECT COUNT(?url) WHERE template, and a SELECT ?url WHERE template; and
the processor 601 is specifically configured to:
when the question is a Yes/No question, generate the SPARQL according to the combination
of true propositions by using the ASK WHERE template;
when the question is a Normal question, generate the SPARQL according to the combination
of true propositions by using the SELECT ?url WHERE template; and
when the question is a Number question, generate the SPARQL according to the combination
of true propositions by using the SELECT ?url WHERE template; and if a numeric answer
cannot be obtained for the SPARQL generated by using the SELECT ?url WHERE template,
generate the SPARQL by using the SELECT COUNT(?url) WHERE template.
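A minimal sketch of this template selection follows, with the triple patterns passed in directly rather than derived from the combination of true propositions; the caller is assumed to check whether the answer retrieved for a Number question is numeric before falling back to the COUNT template.

```python
# Template-based SPARQL generation for the three question types.

def generate_sparql(question_type, triples):
    body = " . ".join(triples)
    if question_type == "YesNo":
        return f"ASK WHERE {{ {body} }}"
    # Normal questions, and the first attempt for Number questions:
    return f"SELECT ?url WHERE {{ {body} }}"

def count_fallback(triples):
    # Used for a Number question when SELECT ?url yields no numeric answer.
    body = " . ".join(triples)
    return f"SELECT COUNT(?url) WHERE {{ {body} }}"

print(generate_sparql(
    "YesNo", ["dbr:Inception dbo:director dbr:Christopher_Nolan"]))
```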
[0257] Optionally, in another embodiment, the processor 601 is specifically configured to:
use word sequences in the question as the first candidate phrases, where the word
sequences satisfy:
all consecutive non-stop words in the word sequence begin with a capital letter, or,
if not all consecutive non-stop words in the word sequence begin with a capital letter,
the length of the word sequence is less than four;
a part of speech of a head word of the word sequence is jj, nn, rb, or vb, where
jj indicates an adjective, nn indicates a noun, rb indicates an adverb, and vb indicates a verb; and
none of the words included in the word sequence is a stop word.
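The three conditions above translate directly into a small filter, as in the sketch below; the stop-word list and the part-of-speech tag of the head word are assumed to be available, and the tags jj/nn/rb/vb are matched by prefix.

```python
# Filter implementing the three word-sequence conditions above.

def is_candidate_phrase(words, head_tag, stop_words):
    if any(w.lower() in stop_words for w in words):
        return False                 # condition 3: no stop words at all
    if not all(w[:1].isupper() for w in words) and len(words) >= 4:
        return False                 # condition 1: lowercase => length < 4
    return head_tag.lower()[:2] in ("jj", "nn", "rb", "vb")  # condition 2

print(is_candidate_phrase(["Christopher", "Nolan"], "NNP",
                          {"the", "of"}))                     # True
print(is_candidate_phrase(["a", "very", "long", "answer", "here"], "NN",
                          {"a"}))                             # False
```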
[0258] Optionally, in another embodiment, the device 600 may be a server of the knowledge
base.
[0259] The device 600 can implement each process implemented by a device in the embodiments
shown in FIG. 1 to FIG. 5. To avoid repetition, details are not described herein again.
[0260] A person of ordinary skill in the art may be aware that, in combination with the
examples described in the embodiments disclosed in this specification, units and algorithm
steps may be implemented by electronic hardware or a combination of computer software
and electronic hardware. Whether the functions are performed by hardware or software
depends on particular applications and design constraint conditions of the technical
solutions. A person skilled in the art may use different methods to implement the
described functions for each particular application, but it should not be considered
that the implementation goes beyond the scope of the present invention.
[0261] It may be clearly understood by a person skilled in the art that, for the purpose
of convenient and brief description, for a detailed working process of the foregoing
system, apparatus, and unit, reference may be made to a corresponding process in the
foregoing method embodiments, and details are not described herein again.
[0262] In the several embodiments provided in the present application, it should be understood
that the disclosed system, apparatus, and method may be implemented in other manners.
For example, the described apparatus embodiment is merely exemplary. For example,
the unit division is merely logical function division and may be other division in
actual implementation. For example, a plurality of units or components may be combined
or integrated into another system, or some features may be ignored or not performed.
In addition, the displayed or discussed mutual couplings or direct couplings or communication
connections may be implemented by using some interfaces. The indirect couplings or
communication connections between the apparatuses or units may be implemented in electronic,
mechanical, or other forms.
[0263] The units described as separate parts may or may not be physically separate, and
parts displayed as units may or may not be physical units, may be located in one position,
or may be distributed on a plurality of network units. Some or all of the units may
be selected according to actual needs to achieve the objectives of the solutions of
the embodiments.
[0264] In addition, functional units in the embodiments of the present invention may be
integrated into one processing unit, or each of the units may exist alone physically,
or two or more units are integrated into one unit.
[0265] When the functions are implemented in the form of a software functional unit and
sold or used as an independent product, the functions may be stored in a computer-readable
storage medium. Based on such an understanding, the technical solutions of the present
invention essentially, or the part contributing to the prior art, or some of the technical
solutions may be implemented in a form of a software product. The computer software
product is stored in a storage medium, and includes several instructions for instructing
a computer device (which may be a personal computer, a server, or a network device)
to perform all or some of the steps of the methods described in the embodiments of
the present invention. The foregoing storage medium includes: any medium that can
store program code, such as a USB flash drive, a removable hard disk, a read-only
memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM),
a magnetic disk, or an optical disc.
[0266] The foregoing descriptions are merely specific implementation manners of the present
invention, but are not intended to limit the protection scope of the present invention.
Any variation or replacement readily figured out by a person skilled in the art within
the technical scope disclosed in the present invention shall fall within the protection
scope of the present invention. Therefore, the protection scope of the present invention
shall be subject to the protection scope of the claims.