(19)
(11)EP 3 648 099 A1

(12)EUROPEAN PATENT APPLICATION
published in accordance with Art. 153(4) EPC

(43)Date of publication:
06.05.2020 Bulletin 2020/19

(21)Application number: 18825077.3

(22)Date of filing:  28.05.2018
(51)Int. Cl.: 
G10L 15/183  (2013.01)
(86)International application number:
PCT/CN2018/088646
(87)International publication number:
WO 2019/001194 (03.01.2019 Gazette  2019/01)
(84)Designated Contracting States:
AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
Designated Extension States:
BA ME
Designated Validation States:
KH MA MD TN

(30)Priority: 29.06.2017 CN 201710517737

(71)Applicant: Tencent Technology (Shenzhen) Company Limited
Shenzhen, Guangdong 518057 (CN)

(72)Inventors:
  • ZHENG, Ping
    Shenzhen Guangdong 518057 (CN)
  • RAO, Feng
    Shenzhen Guangdong 518057 (CN)
  • LU, Li
    Shenzhen Guangdong 518057 (CN)
  • LI, Tao
    Shenzhen Guangdong 518057 (CN)

(74)Representative: EP&C 
P.O. Box 3241
2280 GE Rijswijk (NL)

  


(54)VOICE RECOGNITION METHOD, DEVICE, APPARATUS, AND STORAGE MEDIUM


(57) This application discloses a speech recognition method, apparatus, and device, and a storage medium, and belongs to the field of computers. The method includes: obtaining a speech signal; recognizing the speech signal according to a speech recognition algorithm, to obtain n candidate recognition results; determining a target result in the n candidate recognition results according to a selection rule whose execution sequence is j in m selection rules; and determining the target result in the n candidate recognition results according to a selection rule whose execution sequence is j+1 when the target result is not determined according to the selection rule whose execution sequence is j. This resolves the problem of poor real-time performance in selecting the target result from the plurality of candidate recognition results, caused by the long time taken to calculate a perplexity according to an RNN language model, and improves the real-time performance of selecting the target result from the n candidate recognition results.




Description

RELATED APPLICATION



[0001] This application claims priority to Chinese Patent Application No. 201710517737.4, entitled "SPEECH RECOGNITION METHOD AND APPARATUS" and filed with the China National Intellectual Property Administration on June 29, 2017, which is incorporated by reference in its entirety.

FIELD OF THE TECHNOLOGY



[0002] Embodiments of this application relate to the field of computers, and in particular, to a speech recognition method and apparatus, and a storage medium.

BACKGROUND OF THE DISCLOSURE



[0003] Speech recognition technology is a technology that recognizes speech information as text information through a speech recognition device. The speech recognition technology is widely applied to scenarios such as speech dialing, speech navigation, smart home control, speech search, and listen/write data input.

SUMMARY



[0004] One or more embodiments of this application provide a speech recognition method, apparatus, and device, and a storage medium, which may resolve the problem of poor real-time performance in selecting a target result from a plurality of candidate recognition results, caused by the long time taken by a speech recognition device to calculate a perplexity according to an RNN language model. The technical solutions are as follows:

[0005] According to one aspect of this application, a speech recognition method is provided. The method includes:

obtaining a speech signal;

recognizing the speech signal according to a speech recognition algorithm, to obtain n candidate recognition results, the candidate recognition results comprising text information corresponding to the speech signal, and n being an integer greater than 1;

determining a target result from among the n candidate recognition results according to a selection rule selected from among m selection rules, the selection rule having an execution sequence of j, the target result being a candidate recognition result that has a highest matching degree with the speech signal in the n candidate recognition results, m being an integer greater than 1, and an initial value of j being 1; and

determining the target result from among the n candidate recognition results according to a selection rule having an execution sequence of j+1 when the target result is not determined according to the selection rule having the execution sequence of j.



[0006] According to another aspect of this application, a candidate recognition result selection apparatus is provided. The apparatus includes:

a signal obtaining module, configured to obtain a speech signal;

a speech recognition module, configured to recognize, according to a speech recognition algorithm, the speech signal obtained by the signal obtaining module, to obtain n candidate recognition results, the candidate recognition results comprising text information corresponding to the speech signal, and n being an integer greater than 1; and

a determining module, configured to determine, according to a selection rule selected from among m selection rules, the selection rule having an execution sequence of j, a target result from among the n candidate recognition results that are obtained by recognition by the speech recognition module, the target result being a candidate recognition result that has a highest matching degree with the speech signal in the n candidate recognition results, m being an integer greater than 1, and an initial value of j being 1,

wherein the determining module is configured to determine the target result from among the n candidate recognition results according to a selection rule having an execution sequence of j+1 when the target result is not determined according to the selection rule having the execution sequence of j.



[0007] According to another aspect of this application, a speech recognition device is provided. The speech recognition device includes a processor and a memory, the memory storing at least one instruction, at least one program, and a code set or an instruction set, and the at least one instruction, the at least one program, and the code set or the instruction set being loaded and executed by the processor to implement the speech recognition method according to the first aspect.

[0008] According to another aspect of this application, a computer-readable storage medium is provided, the storage medium storing at least one instruction, at least one program, and a code set or an instruction set, and the at least one instruction, the at least one program, and the code set or the instruction set being loaded and executed by a processor to implement the speech recognition method according to the first aspect.

[0009] The technical solutions provided in the embodiments of this application have at least the following beneficial effects:

[0010] At least one of m selection rules is executed in sequence to select a target result from n candidate recognition results of speech recognition. The algorithm complexity degree of each selection rule is lower than that of calculating a perplexity according to an RNN language model, which resolves the problem of poor real-time performance in selecting the target result from the plurality of candidate recognition results, caused by the long time taken to calculate the perplexity according to the RNN language model. When the target result can be determined by executing only one selection rule, because the algorithm complexity degree of that selection rule is lower than the algorithm complexity degree of calculating the perplexity according to the RNN language model, the real-time performance of selecting the target result from the n candidate recognition results is improved.

BRIEF DESCRIPTION OF THE DRAWINGS



[0011] To describe the technical solutions in the embodiments of this application more clearly, the following briefly describes the accompanying drawings required for describing the embodiments. Apparently, the accompanying drawings in the following description show merely some embodiments of this application, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.

FIG. 1 is a schematic structural diagram of a speech recognition system according to an embodiment.

FIG. 2 is a flowchart of a speech recognition method according to an embodiment.

FIG. 3 is a flowchart of a speech recognition method according to another embodiment.

FIG. 4 is a schematic diagram of a first correspondence and a second correspondence according to an embodiment.

FIG. 5 is a flowchart of a speech recognition method according to another embodiment.

FIG. 6 is a flowchart of a speech recognition method according to another embodiment.

FIG. 7 is a block diagram of a speech recognition apparatus according to an embodiment.

FIG. 8 is a schematic structural diagram of a speech recognition device according to an embodiment.


DESCRIPTION OF EMBODIMENTS



[0012] To make objectives, technical solutions, and advantages of this application clearer, the following further describes in detail implementations of this application with reference to the accompanying drawings.

[0013] First, several terms described in the embodiments of this application are introduced.

[0014] A speech recognition device: an electronic device having a function of recognizing a speech signal as text information.

[0015] The speech recognition device may be a server on which a speech recognition engine is mounted. The speech recognition device recognizes a speech signal as text information through the speech recognition engine.

[0016] The speech signal received by the speech recognition device may be collected by the speech recognition device through an audio collection component, or may be collected by a speech receiving device through an audio collection component and sent to the speech recognition device. The speech receiving device may be an electronic device independent of the speech recognition device. For example, the speech receiving device may be a mobile phone, a tablet computer, a smart speaker, a smart television, an intelligent air cleaner, an intelligent air conditioner, an e-book reader, a Moving Picture Experts Group Audio Layer III (MP3) player, a Moving Picture Experts Group Audio Layer IV (MP4) player, a laptop portable computer, a desktop computer, or the like.

[0017] The speech recognition device may also be a mobile phone, a tablet computer, a smart speaker, a smart television, an intelligent air cleaner, an intelligent air conditioner, or the like. This is not limited in this embodiment.

[0018] The following description uses an example in which the speech recognition device is a server that receives the speech signal from the speech receiving device.

[0019] A candidate recognition result: for a speech signal, at least one piece of text information recognized by the speech recognition device.

[0020] When the speech recognition device obtains at least two candidate recognition results, a target result needs to be selected from the at least two candidate recognition results. The target result is a candidate recognition result that has a highest matching degree with the speech signal.

[0021] In a related technology, speech signals having the same pronunciation may correspond to a plurality of combinations of different words. For example, nihao corresponds to three different combinations of Chinese characters, each with the Chinese spelling ni hao. Therefore, the speech recognition device may recognize a plurality of candidate recognition results according to the speech signal. When the speech recognition device recognizes the plurality of candidate recognition results, how to select the candidate recognition result that has the highest matching degree with the speech signal becomes especially important.

[0022] A related technology provides a typical speech recognition method in which, after obtaining n candidate recognition results, a speech recognition device calculates a perplexity of each candidate recognition result according to a recurrent neural network (RNN) language model and determines the candidate recognition result corresponding to the smallest perplexity value as the target result. The RNN language model is obtained by training according to a general corpus. The perplexities indicate the similarity degrees between the candidate recognition results and the speech signal, and the perplexities and the similarity degrees are in a negative correlation. The target result is the candidate recognition result that has the highest matching degree with the actually received speech signal in the n candidate recognition results, n being an integer greater than 1.

[0023] Because it takes a long time to calculate the perplexities according to the RNN language model, the real-time performance of selecting the target result from the n candidate recognition results is poor.

[0024] Referring to FIG. 1, FIG. 1 is a schematic structural diagram of a speech recognition system according to an embodiment of this application. The system includes at least one speech receiving device 110 and a speech recognition device 120.

[0025] The speech receiving device 110 may be a mobile phone, a tablet computer, a smart speaker, a smart television, an intelligent air cleaner, an intelligent air conditioner, an e-book reader, an MP3 player, an MP4 player, a laptop portable computer, a desktop computer, or the like. Embodiments are not limited to these specific devices.

[0026] An audio collection component 111 is mounted in the speech receiving device 110. The audio collection component 111 is configured to collect a speech signal.

[0027] The speech receiving device 110 establishes a connection to the speech recognition device 120 through a wireless network or a wired network. After collecting the speech signal through the audio collection component 111, the speech receiving device 110 sends the speech signal to the speech recognition device 120 through the connection.

[0028] The speech recognition device 120 is configured to recognize the speech signal as text information (a candidate recognition result). There may be at least two pieces of text information.

[0029] The speech recognition device 120 is configured to select a target result from a plurality of candidate recognition results when recognizing the plurality of candidate recognition results.

[0030] The speech recognition device 120 may feed back the target result to the speech receiving device 110 after selecting the target result.

[0031] The speech recognition device 120 may be implemented as a server or a server cluster. This is not limited in this embodiment.

[0032] When the physical hardware of a mobile terminal, such as a mobile phone, a tablet computer, a smart speaker, a smart television, an intelligent air cleaner, an intelligent air conditioner, an e-book reader, an MP3 player, an MP4 player, or a laptop portable computer, supports running a complex algorithm, the speech recognition device 120 may be implemented as at least one of the foregoing mobile terminals. However, embodiments are not limited thereto.

[0033] The foregoing wireless network or wired network may use a standard communication technology and/or protocol. The network may usually be the Internet but may alternatively be any network, including but not limited to any combination of a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a mobile network, a wired network, a wireless network, a dedicated network, or a virtual dedicated network. In some embodiments, data exchanged over the network may be represented by using a technology and/or format such as the Hypertext Markup Language (HTML) and the Extensible Markup Language (XML). In addition, all or some of the links may be encrypted by using a conventional encryption technology such as the Secure Sockets Layer (SSL), Transport Layer Security (TLS), a Virtual Private Network (VPN), or Internet Protocol Security (IPsec). In some other embodiments, the foregoing data communication technology may be replaced or supplemented with a customized and/or dedicated data communication technology.

[0034] The embodiments of this application are described below by using an example in which the method is performed by the speech recognition device.

[0035] Referring to FIG. 2, FIG. 2 is a flowchart of a speech recognition method according to an embodiment of this application. This embodiment is described by using an example in which the method is applied to the speech recognition device. The method may include the following several steps:
Step 101: Obtain a speech signal.

[0036] The speech signal may be sent by the speech receiving device to the speech recognition device, may be collected by the speech recognition device, or may be input into the speech recognition device through a mobile storage apparatus.

[0037] Step 102: Recognize the speech signal according to a speech recognition algorithm, to obtain n candidate recognition results.

[0038] The candidate recognition result is text information corresponding to the speech signal, and n is an integer greater than 1.

[0039] The speech recognition algorithm is used for recognizing the speech signal as at least one piece of text information. The speech recognition algorithm may be a parallel algorithm obtained based on improvement to a Viterbi algorithm, may be a serial algorithm obtained based on improvement to a Viterbi algorithm, or may be a Tree-Trellis algorithm. However, embodiments are not limited thereto.

[0040] The speech recognition algorithm may have a function of preliminarily sorting the n candidate recognition results. In this case, the n candidate recognition results obtained by the speech recognition device have sequence identifiers. In this way, when selecting the target result, the speech recognition device sequentially detects, in the sequence indicated by the sequence identifiers, whether each of the n candidate recognition results is the target result.

[0041] It should be noted that the speech recognition device may recognize only one candidate recognition result. However, embodiments are not limited thereto.

[0042] Step 103: Determine a target result in the n candidate recognition results according to a selection rule whose execution sequence is j in m selection rules.

[0043] The target result is a candidate recognition result that has a highest matching degree with the speech signal in the n candidate recognition results, m being an integer greater than 1, and an initial value of j being 1. 1≤j≤m-1.

[0044] Execution sequences of the m selection rules are determined according to an algorithm complexity degree of each selection rule, and the algorithm complexity degrees and the execution sequences are in a positive correlation. That is, a smaller algorithm complexity degree indicates a smaller sequence number of an execution sequence, and the execution sequence is ranked nearer to the top. A larger algorithm complexity degree indicates a larger sequence number of an execution sequence, and the execution sequence is ranked nearer to the bottom.

[0045] The algorithm complexity degrees of the selection rules and speeds of selecting the target result are in a negative correlation. That is, a larger algorithm complexity degree indicates a slower speed of selecting the target result, and a smaller algorithm complexity degree indicates a faster speed of selecting the target result.

[0046] The algorithm complexity degree of each selection rule may be represented by a complexity degree identifier. For example, algorithm complexity degree identifiers are 1, 2, and 3, and a smaller value indicates a smaller algorithm complexity degree.

[0047] The execution sequences of the m selection rules may be appointed by a developer. Because the algorithm complexity degrees of the m selection rules are all lower than the algorithm complexity degree of calculating the perplexity according to the RNN language model, regardless of which selection rule is preferentially executed, the speech recognition device selects the target result faster than it would by calculating the perplexity according to the RNN language model.

[0048] In this case, the execution sequence may be represented by an execution sequence identifier. For example, the execution sequence identifier may be #1, #2, or #3. #1 indicates that the execution sequence is 1, #2 indicates that the execution sequence is 2, and #3 indicates that the execution sequence is 3.

[0049] The execution sequences of the m selection rules may be randomly selected.

[0050] Step 104: Determine the target result in the n candidate recognition results according to a selection rule whose execution sequence is j+1 when the target result is not determined according to the selection rule whose execution sequence is j.

[0051] The speech recognition device may not determine the target result according to the selection rule whose execution sequence is j. In this case, the speech recognition device continues determining the target result according to the selection rule whose execution sequence is j+1. The process continues until the target result in the n candidate recognition results is determined.
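The fall-through behaviour of steps 103 and 104 can be sketched as follows. The rule functions, the command word set, and the final fallback are illustrative assumptions; the application does not specify what happens when no rule determines a target result.

```python
def select_target(candidates, rules):
    # Steps 103-104: try the m selection rules in order of increasing
    # algorithm complexity; each rule returns a candidate or None, and
    # the first rule that determines a target result ends the process.
    for rule in rules:
        target = rule(candidates)
        if target is not None:
            return target
    return candidates[0]  # illustrative fallback; not specified in the text

# Illustrative rules, simplest first: an exact command match, then a
# stand-in for the more expensive language-model-based dialogue rule.
COMMAND_WORDS = {"pause", "play", "next", "previous"}

def command_rule(candidates):
    return next((c for c in candidates if c in COMMAND_WORDS), None)

def dialogue_rule(candidates):
    return candidates[0]

print(select_target(["paws", "pause"], [command_rule, dialogue_rule]))
# "pause" is selected by the cheap command rule; dialogue_rule never runs
```

Because the rules are tried cheapest first, most requests never reach the expensive language-model rule, which is the source of the real-time improvement claimed above.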

[0052] The speech recognition device may re-sort the n candidate recognition results: the target result of the n candidate recognition results is ranked first; the target result of the remaining n-1 candidate recognition results, excluding the first-ranked result, is ranked second; the target result of the remaining n-2 candidate recognition results, excluding the first- and second-ranked results, is ranked third; and so on.

[0053] In conclusion, in the speech recognition method provided in this application, at least one of m selection rules is executed in sequence to select a target result from n candidate recognition results of speech recognition. The algorithm complexity degree of each selection rule is lower than that of calculating a perplexity according to an RNN language model, which resolves the problem of poor real-time performance in selecting the target result from the plurality of candidate recognition results, caused by the long time taken to calculate the perplexity according to the RNN language model. When the target result can be determined by executing only one selection rule, because the algorithm complexity degree of that selection rule is lower than the algorithm complexity degree of calculating the perplexity according to the RNN language model, the real-time performance of selecting the target result from the n candidate recognition results is improved.

[0054] The m selection rules may be determined according to different use scenarios. The m selection rules include at least two of a command selection rule, a function selection rule, and a dialogue selection rule. In a command scenario (that is, the speech signal is a message in a command form), the target result can be recognized through the command selection rule in the m selection rules. In a function scenario (that is, the speech signal is a functional message), the target result can be recognized through the function selection rule in the m selection rules. In a dialogue scenario (that is, the speech signal is a message in a dialogue form), the target result can be recognized through the dialogue selection rule in the m selection rules.

[0055] The message in a command form is used for instructing the speech receiving device to execute a command. For example, when the speech receiving device is a smart speaker, the message in a command form may be a message such as last, next, pause, or play.

[0056] Usually, messages in a command form are irregular and have a limited quantity. For example, a message in a command form of last may change into previous, play last, play previous, switch to previous, switch to last, and the like. The foregoing various changes are irregular, and types of the changes are limited.

[0057] Because messages in a command form are irregular and have a limited quantity, in this embodiment, the speech recognition device presets a command lexicon. The command lexicon includes a plurality of command keywords. The command selection rule is used for instructing the speech recognition device to detect, depending on whether the command lexicon includes a command keyword matching an ith candidate recognition result, whether the ith candidate recognition result is the target result, 1≤i≤n.

[0058] The functional message is used for instructing the speech receiving device to execute a command according to at least one speech keyword. For example, the functional message is "play Jay Chou's songs".

[0059] Usually, the functional message has a function template in a fixed form and a variable speech keyword. For example, in "play Jay Chou's songs", the function template is "play ()'s songs", and the speech keyword is "Jay Chou".

[0060] Because usually, the functional message has a function template in a fixed form and a variable speech keyword, in this embodiment, the speech recognition device presets a function template library and a speech lexicon. The function selection rule is used for instructing the speech recognition device to detect, depending on whether the speech lexicon includes a lexicon keyword matching the speech keyword, whether the ith candidate recognition result is the target result, the speech keyword being at least one keyword in the ith candidate recognition result.
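As a sketch of the function selection rule, a candidate can be matched against a function template and the extracted speech keyword checked against the speech lexicon. The template, the lexicon entries, and the function name below are invented for illustration and are not part of the application.

```python
import re

# Hypothetical function template library and speech lexicon; the
# application does not specify concrete templates or lexicon entries.
TEMPLATES = [re.compile(r"^play (?P<kw>.+)'s songs$")]
SPEECH_LEXICON = {"Jay Chou", "Faye Wong"}

def function_rule(candidate):
    # Match the candidate against each function template, then check
    # whether the extracted speech keyword is in the speech lexicon.
    for template in TEMPLATES:
        m = template.match(candidate)
        if m and m.group("kw") in SPEECH_LEXICON:
            return True
    return False

assert function_rule("play Jay Chou's songs")
assert not function_rule("play Someone Else's songs")  # keyword not in lexicon
```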

[0061] The message in a dialogue form is a message that is irregular and whose quantity of variations is unknown. For example, dialogue messages include "what are you doing", "are you free today", "what a movie", and the like.

[0062] Because the message in a dialogue form is irregular and has an unknown quantity of changes, in this embodiment, the speech recognition device sets a pre-trained language model. The dialogue selection rule is used for instructing the speech recognition device to determine a similarity degree between each candidate recognition result and the speech signal according to a trained language model, to select the target result.
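A minimal stand-in for the trained language model of the dialogue selection rule is sketched below with a unigram model; the probabilities are invented for illustration, and the application's actual trained model is not specified here.

```python
import math

# Hypothetical unigram probabilities; invented for illustration only.
UNIGRAM = {"what": 0.05, "are": 0.06, "you": 0.07, "doing": 0.01,
           "ewe": 0.0001}

def score(words, floor=1e-6):
    # A higher log-probability indicates a higher similarity degree
    # between the candidate recognition result and natural dialogue.
    return sum(math.log(UNIGRAM.get(w, floor)) for w in words)

def dialogue_rule(candidates):
    # Select the candidate the language model finds most probable.
    return max(candidates, key=score)

best = dialogue_rule([["what", "are", "ewe", "doing"],
                      ["what", "are", "you", "doing"]])
assert best == ["what", "are", "you", "doing"]
```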

[0063] An algorithm complexity degree of the command selection rule may be lower than an algorithm complexity degree of the function selection rule, and the algorithm complexity degree of the function selection rule may be lower than an algorithm complexity degree of the dialogue selection rule. Correspondingly, the speech recognition device preferentially executes the command selection rule to select the target result, then executes the function selection rule to select the target result when the target result is not selected according to the command selection rule, and then executes the dialogue selection rule to select the target result when the target result is not selected according to the function selection rule.

[0064] The algorithm complexity degree of the command selection rule, the algorithm complexity degree of the function selection rule, and the algorithm complexity degree of the dialogue selection rule may all be far lower than the algorithm complexity degree of selecting the target result according to the RNN language model. Therefore, even if the speech recognition device sequentially executes the command selection rule, the function selection rule, and the dialogue selection rule to determine the target result, the total time taken by the speech recognition device is still less than the total time taken to select the target result according to the RNN language model.

[0065] Selecting the target result according to the command selection rule (referring to the embodiment shown in FIG. 3), selecting the target result according to the function selection rule (referring to the embodiment shown in FIG. 5), and selecting the target result according to the dialogue selection rule (referring to the embodiment shown in FIG. 6) are separately described below.

[0066] Referring to FIG. 3, FIG. 3 is a flowchart of a speech recognition method according to another embodiment of this application. This embodiment is described by using an example in which the speech recognition method is applied to the speech recognition device. The method may include the following steps:

[0067] Step 201: Detect whether a first correspondence of the command lexicon includes the command keyword matching the ith candidate recognition result.

[0068] The first correspondence includes a correspondence between index values and command keywords.

[0069] The first correspondence may be implemented through a forward table. The forward table includes at least one key value pair, a key in each key value pair is a hash value (index value), and a value in each key value pair is a command keyword.

[0070] In this embodiment, a quantity of key value pairs in the first correspondence is not limited. For example, the quantity of key value pairs in the first correspondence is 1000.

[0071] That the speech recognition device detects whether the first correspondence of the command lexicon includes the command keyword matching the ith candidate recognition result includes: calculating a hash value of the ith candidate recognition result; detecting whether a key equal to the hash value exists in the first correspondence; if yes, determining that the first correspondence includes the command keyword matching the ith candidate recognition result and performing step 202; and if not, setting i=i+1 and continuing to perform this step.
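The forward-table lookup of step 201 can be sketched with a dictionary keyed by hash values. The hash function and the command keywords below are illustrative assumptions, since the application fixes neither.

```python
import hashlib

def index_value(text):
    # Hypothetical hash function; the application does not specify one.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Forward table: key = hash value (index value), value = command keyword.
FORWARD_TABLE = {index_value(k): k
                 for k in ["pause", "play", "next", "previous"]}

def match_command(candidate):
    # Steps 201-202: look up the candidate's hash value; a hit means the
    # candidate exactly matches a stored command keyword.
    return FORWARD_TABLE.get(index_value(candidate))

assert match_command("pause") == "pause"
assert match_command("what a movie") is None  # fall through to step 203
```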

[0072] Alternatively, the first correspondence may include at least one command keyword. The speech recognition device matches the ith candidate recognition result against each command keyword, performs step 202 if the first correspondence includes a command keyword matching the ith candidate recognition result, and sets i=i+1 and continues to perform this step if the first correspondence does not include a command keyword matching the ith candidate recognition result.

[0073] Step 202: Determine that the ith candidate recognition result is the target result; the process ends.

[0074] When the first correspondence includes command keywords corresponding to at least two candidate recognition results, the speech recognition device may use the first candidate recognition result as the target result, or the speech recognition device performs step 203, and selects the target result from the at least two candidate recognition results again.

[0075] Step 203: Detect, when the first correspondence does not include a command keyword matching any candidate recognition result of the n candidate recognition results, whether a second correspondence of the command lexicon includes a keyword matching any word in the ith candidate recognition result.

[0076] The second correspondence includes a correspondence between index values and keywords, and the command keywords include the keywords.

[0077] The second correspondence may be implemented through an inverted table. The inverted table includes at least one key value pair, a key in each key value pair is a hash value of a keyword, and a value in each key value pair is at least one index value corresponding to the keyword in the first correspondence.

[0078] That the speech recognition device detects whether the second correspondence in the command lexicon includes a keyword matching any word in the ith candidate recognition result includes: calculating a hash value of each word in the ith candidate recognition result; detecting whether the second correspondence includes a key equal to the hash value of any word; if it does, determining that the second correspondence includes a keyword matching a word in the ith candidate recognition result and performing step 204; and if it does not, setting i=i+1 and continuing to perform this step.
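The inverted table of steps 203 and 204 can be sketched as follows, using plain words as keys instead of their hash values and whitespace segmentation for simplicity; the commands and segmentation are illustrative assumptions (the application hashes the keywords and does not specify how candidates are split into words).

```python
# Forward table: index value -> full command keyword (illustrative).
FORWARD_TABLE = {0: "play previous", 1: "switch to previous", 2: "pause"}

# Inverted table: word -> index values of command keywords containing it.
INVERTED_TABLE = {}
for idx, command in FORWARD_TABLE.items():
    for word in command.split():
        INVERTED_TABLE.setdefault(word, []).append(idx)

def candidate_commands(candidate_words):
    # Steps 203-204: any word that hits the inverted table yields the
    # index values of the possibly matching command keywords, which are
    # then resolved against the forward table.
    hits = set()
    for word in candidate_words:
        hits.update(INVERTED_TABLE.get(word, []))
    return [FORWARD_TABLE[i] for i in sorted(hits)]

print(candidate_commands(["previous", "song"]))
```

This illustrates the storage saving described in paragraph [0082]: only the shared keywords are indexed, yet every command keyword containing them can still be recovered.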

[0079] A key of each key value pair in the second correspondence may alternatively be a keyword.

[0080] Step 204: Search, according to an index value corresponding to the keyword in the second correspondence, the first correspondence for a command keyword corresponding to the index value.

[0081] Because a command keyword includes keywords, and different command keywords may include the same keyword, the quantity of command keywords found by the speech recognition device according to the index value corresponding to a keyword (that is, the value in the key value pair corresponding to the keyword in the second correspondence) is at least one.

[0082] In this embodiment, the command keyword matching the ith candidate recognition result is detected by combining the first correspondence and the second correspondence, so that the speech recognition device does not need to store all change forms of the command keyword but only needs to store keywords included in all change forms to determine the corresponding command keyword, thereby saving storage space of the speech recognition device.
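The two correspondences described above can be sketched as a pair of lookup tables: a forward table from index values to command keywords, and an inverted table from keyword hashes to index values. The table contents below are hypothetical placeholders, not data from the description.

```python
# First correspondence: index value -> command keyword (hypothetical entries).
FIRST_CORRESPONDENCE = {
    1: "pause playback",
    2: "resume playback",
    3: "next track",
}

# Second correspondence (inverted table): hash of a keyword -> index values of
# every command keyword in the first correspondence containing that keyword.
SECOND_CORRESPONDENCE = {
    hash("pause"): [1],
    hash("playback"): [1, 2],
    hash("track"): [3],
}

def find_command_keywords(candidate: str) -> list:
    """Look up every command keyword sharing a word with the candidate result."""
    found = []
    for word in candidate.split():
        # Detect whether the second correspondence includes a key equal to
        # the hash value of this word, then follow its index values into
        # the first correspondence.
        for index in SECOND_CORRESPONDENCE.get(hash(word), []):
            keyword = FIRST_CORRESPONDENCE[index]
            if keyword not in found:
                found.append(keyword)
    return found

print(find_command_keywords("please pause the music"))  # ['pause playback']
```

Because only the keywords (not every surface form of every command) are stored in the inverted table, the lookup stays compact, which is the storage saving the paragraph above describes.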

[0083] Step 205: Determine an edit distance between the ith candidate recognition result and the command keyword.

[0084] The edit distance (or referred to as a Levenshtein distance) is used for indicating a quantity of operations required for conversion of the ith candidate recognition result into the command keyword. The conversion operations include, but are not limited to: replacement, insertion, and deletion.

[0085] The speech recognition device may determine a plurality of command keywords. In this case, the edit distance between the ith candidate recognition result and each command keyword is determined.

[0086] For example, the ith candidate recognition result is the Chinese phrase "zai ting" (here and below, Chinese characters are represented by their pinyin), and the command keyword determined by the speech recognition device is "zan ting". The speech recognition device needs to only replace "zai" with "zan" to convert "zai ting" into "zan ting". The edit distance between the ith candidate recognition result and the command keyword is 1.

[0087] Step 206: Determine, when the edit distance is less than a preset value, that the ith candidate recognition result is the target result.

[0088] When the edit distance is less than the preset value, it indicates that a similarity degree between the ith candidate recognition result and the command keyword is high. In this case, it is determined that the ith candidate recognition result is the target result.

[0089] A value of the preset value is usually small, and the value of the preset value is not limited in this embodiment. For example, the preset value is 2.
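The edit-distance check in steps 205 and 206 can be sketched with a standard Levenshtein dynamic program; the preset value 2 below is the example threshold given above, and the pinyin strings stand in for the Chinese phrases.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of replacement, insertion, and
    deletion operations required to convert string a into string b."""
    # prev[j] holds the distance between a[:i-1] and b[:j] from the last row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement
        prev = curr
    return prev[-1]

PRESET_VALUE = 2  # example threshold from the description

candidate, command_keyword = "zai ting", "zan ting"
d = edit_distance(candidate, command_keyword)
print(d, d < PRESET_VALUE)  # 1 True -> the candidate is the target result
```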

[0090] Referring to a diagram of a first correspondence and a second correspondence shown in FIG. 4, the first correspondence includes three key value pairs, and each key value pair includes an index value and a command keyword; the second correspondence includes three key value pairs, and each key value pair includes a hash value and an index value.

[0091] If the speech recognition device recognizes four candidate recognition results, the four candidate recognition results are respectively the Chinese phrases "zai tian", "zai tian", "zai tian", and "zan ting" (represented by their pinyin). The speech recognition device calculates hash values of the four candidate recognition results: the hash value of the first "zai tian" is 1, the hash value of the second "zai tian" is 2, the hash value of the third "zai tian" is 3, and the hash value of "zan ting" is 4. A key in the first correspondence includes 4. Therefore, it is determined that "zan ting" is the target result.

[0092] If the speech recognition device recognizes four candidate recognition results, the four candidate recognition results are respectively the Chinese phrases "zai tian", "zai tian", "zai tian", and "zai ting". The speech recognition device calculates hash values of the four candidate recognition results: the hash value of the first "zai tian" is 1, the hash value of the second "zai tian" is 2, the hash value of the third "zai tian" is 3, and the hash value of "zai ting" is 4. In this case, the keys in the first correspondence do not include 1, 2, 3, or 4. Therefore, the speech recognition device calculates a hash value of each word in each candidate recognition result. For the candidate recognition result "zai ting", the hash value of "zai" is 11, the hash value of "ting" is 12, and the keys in the second correspondence include 12. The speech recognition device searches the first correspondence for the command keyword "zan ting" corresponding to the index value 4, according to the index value 4 corresponding to 12 in the second correspondence. The edit distance between "zai ting" and "zan ting" is 1 and is less than the preset value 2. Therefore, it is determined that "zai ting" is the target result.

[0093] When the edit distances between all the candidate recognition results and the command keyword are greater than or equal to the preset value, the target result cannot be selected according to the command selection rule. In this case, the speech recognition device continues selecting the target result according to another selection rule, determines that the first candidate recognition result is the target result, or does not select the target result; the process ends. The another selection rule is the function selection rule or the dialogue selection rule.

[0094] The speech recognition device may determine that a candidate recognition result having a smallest edit distance is the target result.

[0095] In conclusion, in the speech recognition method provided in this application, the target result in the n candidate recognition results is selected through the command selection rule. When the target result can be determined by executing only the command selection rule, because the algorithm complexity degree of the command selection rule is lower than the algorithm complexity degree of calculating the perplexity according to the RNN language model, real-time selection of the target result from the n candidate recognition results is improved.

[0096] In addition, the command keyword matching the ith candidate recognition result is detected by combining the first correspondence and the second correspondence, so that the speech recognition device does not need to store all change forms of the command keyword but only needs to store keywords included in all change forms to determine the corresponding command keyword, thereby saving storage space of the speech recognition device.

[0097] The speech recognition device may send the target result to the speech receiving device. The speech receiving device performs a corresponding operation according to a command corresponding to the target result. For example, the speech receiving device is a smart speaker, and the target result is pause. Therefore, after receiving the target result, the smart speaker pauses playing currently played audio information.

[0098] Referring to FIG. 5, FIG. 5 is a flowchart of a speech recognition method according to another embodiment. This embodiment is described by using an example in which the speech recognition method is applied to the speech recognition device. The method may include the following steps:

[0099] Step 401: Analyze a function template of the ith candidate recognition result, 1≤i≤n.

[0100] The speech recognition device may preset a function template library. The function template library includes at least one function template.

[0101] The function template may be represented by (or referred to as) a regular expression. For example, the function template is "a (.+)'s song". A quantity of function templates in the function template library is not limited in this embodiment. For example, the quantity of function templates in the function template library is 540.

[0102] The regular expression is used for retrieving and/or replacing text information satisfying a function template.

[0103] The speech recognition device analyzes the function template of the ith candidate recognition result by matching the ith candidate recognition result with each function template in the function template library.

[0104] Step 402: Detect whether the speech lexicon includes the lexicon keyword matching the speech keyword in the ith candidate recognition result.

[0105] The ith candidate recognition result includes the function template and at least one speech keyword. After analyzing the function template of the ith candidate recognition result, the speech recognition device uses remaining keywords in the ith candidate recognition result as the speech keyword.

[0106] The speech recognition device presets a speech lexicon, and the speech lexicon includes at least one lexicon keyword. A quantity of lexicon keywords in the speech lexicon is not limited in this embodiment. For example, the quantity of lexicon keywords in the speech lexicon is 1 million.

[0107] The speech recognition device matches the speech keyword in the ith candidate recognition result with at least one lexicon keyword in the speech lexicon one by one. When the speech lexicon includes the lexicon keyword matching the speech keyword in the ith candidate recognition result, perform step 403. When the speech lexicon does not include the lexicon keyword matching the speech keyword in the ith candidate recognition result, make i=i+1 and continue performing this step.

[0108] Step 403: Determine that the ith candidate recognition result is the target result; the process ends.
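Steps 401 to 403 above can be sketched as a template match followed by a lexicon lookup. The function templates and speech lexicon below are hypothetical placeholders standing in for the template library and lexicon the description presupposes.

```python
import re
from typing import List, Optional

# Hypothetical function template library (regular expressions) and speech
# lexicon; the entries are illustrative only.
FUNCTION_TEMPLATES = [
    re.compile(r"^play (.+)'s song$"),
    re.compile(r"^what is the weather in (.+)$"),
]
SPEECH_LEXICON = {"jay chou", "beijing"}

def select_by_function_rule(candidates: List[str]) -> Optional[str]:
    """Return the first candidate whose function template matches and whose
    speech keyword is found in the speech lexicon; None otherwise."""
    for candidate in candidates:
        for template in FUNCTION_TEMPLATES:
            # Step 401: analyze the function template of the candidate.
            match = template.match(candidate)
            # Step 402: the remaining keyword(s) become the speech keyword,
            # which is checked against the speech lexicon.
            if match and match.group(1) in SPEECH_LEXICON:
                return candidate  # step 403: candidate is the target result
    return None  # fall through to another selection rule

print(select_by_function_rule(["play jay chou's song"]))  # play jay chou's song
```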

[0109] When the target result is not selected according to the function selection rule, the speech recognition device may continue selecting the target result according to another selection rule, determine that the first candidate recognition result is the target result, or does not select the target result; the process ends. The another selection rule is the command selection rule or the dialogue selection rule.

[0110] That the target result is not selected according to the function selection rule includes, but is not limited to, the following several situations: the speech recognition device does not analyze function templates of the candidate recognition results, or the speech recognition device does not find lexicon keywords matching speech keywords in the candidate recognition results in the speech lexicon.

[0111] It is assumed that the speech recognition device obtains three candidate recognition results (Chinese phrases represented below by their pinyin): 1. "wo xiang ting tu an ge de ge", 2. "wo xiang ting tong an ge de lo", and 3. "wo xiang ting tong an ge de ge". The speech recognition device respectively matches the three candidate recognition results with the function templates in the function template library, to obtain that the function template of the first candidate recognition result is "wo xiang ting (.+) de ge", that the function template of the second candidate recognition result is "wo xiang ting (.+) de (.+)", and that the function template of the third candidate recognition result is "wo xiang ting (.+) de ge".

[0112] For the first candidate recognition result, the speech keyword is "tu an ge". For the second candidate recognition result, the speech recognition device uses the first keyword as the speech keyword, that is, the speech keyword is "tong an ge". For the third candidate recognition result, the speech keyword is "tong an ge".

[0113] The speech recognition device sequentially matches the speech keywords in the candidate recognition results with the lexicon keyword in the speech lexicon. When matching the speech keyword in the second candidate recognition result with the lexicon keyword, the speech recognition device can determine the lexicon keyword matching the speech keyword and determines that the second candidate recognition result is the target result.

[0114] For the second candidate recognition result, the speech recognition device may alternatively use all keywords as the speech keywords, that is, the speech keywords are "tong an ge" and "lo" (and similarly in the following). In this case, although the speech lexicon includes a lexicon keyword matching "tong an ge", the speech lexicon does not include a lexicon keyword matching "lo". In this case, the speech recognition device sequentially matches the speech keywords in the candidate recognition results with the lexicon keywords in the speech lexicon. When matching the speech keyword in the third candidate recognition result with the lexicon keywords, the speech recognition device can determine the lexicon keyword matching the speech keyword and determines that the third candidate recognition result is the target result.

[0115] In conclusion, in the speech recognition method provided in this application, the target result in the n candidate recognition results is selected through the function selection rule. When the target result can be determined by executing only the function selection rule, because the algorithm complexity degree of the function selection rule is lower than the algorithm complexity degree of calculating the perplexity according to the RNN language model, real-time selection of the target result from the n candidate recognition results is improved.

[0116] The speech recognition device sends the target result to the speech receiving device. The speech receiving device performs a corresponding operation according to the speech keyword in the target result. For example, the speech receiving device is a smart speaker, and the target result is playing Jay Chou's songs. Therefore, the smart speaker searches for Jay Chou's songs after receiving the target result and plays audio information corresponding to a searching result.

[0117] The speech recognition device may perform searching according to the speech keyword in the target result and send a searching result to the speech receiving device. The speech receiving device plays audio information corresponding to the searching result. For example, the speech receiving device is a smart speaker, and the target result is playing Jay Chou's songs. Therefore, the speech recognition device searches for Jay Chou's songs according to a speech keyword, Jay Chou, in the target result and sends a searching result to the smart speaker. The smart speaker plays audio information corresponding to the searching result.

[0118] Referring to FIG. 6, FIG. 6 is a flowchart of a speech recognition method according to another embodiment. This embodiment is described by using an example in which the speech recognition method is applied to the speech recognition system. The method may include the following steps:
Step 501: Calculate a perplexity of each candidate recognition result according to the language model.

[0119] The perplexity is used for indicating a similarity degree between the candidate recognition result and the speech signal. The perplexity and the similarity degree are in a negative correlation.

[0120] The language model is a mathematical model for describing an inherent law of natural languages.

[0121] The language model may be an N-gram language model that is generated according to a dedicated corpus corresponding to at least one field. The N-gram language model is used for determining an occurrence probability of a current word according to occurrence probabilities of N-1 words before the current word, N being a positive integer. A value of N is not limited in this embodiment. For example, N is 3, and a 3-gram language model is also referred to as a Tri-gram language model. For example, N is 2, and a 2-gram language model is also referred to as a Bi-gram language model.

[0122] The N-gram language model describes the properties of, and relationships among, basic natural-language units such as words, word groups, and sentences by using probabilities and distribution functions, and reflects the statistical rules governing the generation and processing of natural languages.

[0123] In this embodiment, descriptions are made by using an example in which the speech recognition device calculates a perplexity of each candidate recognition result according to the 3-gram language model or the 2-gram language model.

[0124] The 3-gram language model may be represented through the following formula:

        p(S) = p(w1) · p(w2|w1) · p(w3|w1,w2) · … · p(wn|wn-1,wn-2)

[0125] p(S) represents a probability of occurrence of a candidate recognition result, p(w1) represents a probability of occurrence of the first word in the candidate recognition result, p(w2|w1) represents a probability of occurrence of the second word in the candidate recognition result due to occurrence of the first word, p(w3|w1,w2) represents a probability of occurrence of the third word in the candidate recognition result due to occurrence of the first word and the second word, and p(wn|wn-1,wn-2) represents a probability of occurrence of the nth word in the candidate recognition result due to occurrence of a previous word (the (n-1)th word) and a previous but one word (the (n-2)th word).

[0126] The 2-gram language model may be represented through the following formula:

        p(S) = p(w1) · p(w2|w1) · p(w3|w2) · … · p(wn|wn-1)

[0127] p(S) represents a probability of occurrence of a candidate recognition result, p(w1) represents a probability of occurrence of the first word in the candidate recognition result, p(w2|w1) represents a probability of occurrence of the second word in the candidate recognition result due to occurrence of the first word, p(w3|w2) represents a probability of occurrence of the third word in the candidate recognition result due to occurrence of the second word, and p(wn|wn-1) represents a probability of occurrence of the nth word in the candidate recognition result due to occurrence of a previous word (the (n-1)th word).

[0128] At least one field includes, but is not limited to, the following ones: the weather field, the music field, the mathematics field, the sports field, the computer field, the home field, the geographical field, and the natural field.

[0129] Although not described, the at least one field may also include other fields.

[0130] The speech recognition device calculates the perplexity of each candidate recognition result through a preset formula according to the language model.

[0131] The perplexity may be regarded as a geometric mean of an occurrence probability of a candidate word after each word predicted by the language model. Usually, a probability of occurrence of the candidate recognition result and the perplexity are in a negative correlation. That is, a larger probability of occurrence of the candidate recognition result indicates a lower perplexity; a smaller probability of occurrence of the candidate recognition result indicates a higher perplexity.
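The relationship above can be sketched with a toy 2-gram model: p(S) is the chained product from the formula in [0126], and the perplexity is taken as p(S) raised to the power -1/n, so a larger p(S) yields a lower perplexity. The probabilities below are invented for illustration, not values from the description.

```python
# Toy 2-gram model (hypothetical probabilities).
UNIGRAM = {"pause": 0.5}
BIGRAM = {("pause", "playback"): 0.4}
UNSEEN = 1e-6  # small floor probability for unseen words/pairs

def sentence_probability(words: list) -> float:
    """p(S) = p(w1) * p(w2|w1) * ... * p(wn|wn-1) for a 2-gram model."""
    p = UNIGRAM.get(words[0], UNSEEN)
    for prev, curr in zip(words, words[1:]):
        p *= BIGRAM.get((prev, curr), UNSEEN)
    return p

def perplexity(words: list) -> float:
    """Perplexity as the inverse geometric mean of p(S): p(S) ** (-1/n).
    Higher sentence probability -> lower perplexity (negative correlation)."""
    return sentence_probability(words) ** (-1.0 / len(words))

print(perplexity(["pause", "playback"]))  # (0.5 * 0.4) ** -0.5 ~= 2.236
```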

[0132] When the speech recognition device calculates the perplexity of each candidate recognition result through a preset formula according to the language model, the speech recognition device may first calculate a cross entropy of each candidate recognition result and determine the perplexity of the candidate recognition result according to the cross entropy and the preset formula.

[0133] The cross entropy is used for indicating a difference between a model language determined by a language model and the candidate recognition result. A smaller cross entropy indicates a smaller difference between the model language and the candidate recognition result and a higher matching degree between the candidate recognition result and the speech signal. A larger cross entropy indicates a greater difference between the model language and the candidate recognition result and a lower matching degree between the candidate recognition result and the speech signal.

[0134] The language model may be of another type, such as a neural network language model. However, embodiments are not limited thereto.

[0135] Step 502: Determine a smallest value of the perplexities of the n candidate recognition results and determine that the ith candidate recognition result corresponding to the smallest value is the target result.

[0136] Because a smaller perplexity indicates a higher similarity degree between the candidate recognition result and the speech signal, it is determined that the ith candidate recognition result corresponding to the smallest value of the perplexities is the target result.

[0137] In conclusion, in the speech recognition method provided in this application, the target result in the n candidate recognition results is selected through the dialogue selection rule. When the target result can be determined by executing only the dialogue selection rule, because the algorithm complexity degree of the dialogue selection rule is lower than the algorithm complexity degree of calculating the perplexity according to the RNN language model, real-time selection of the target result from the n candidate recognition results is improved.

[0138] The speech recognition device may send the target result to the speech receiving device. The speech receiving device obtains dialogue information according to the target result. For example, the speech receiving device is a smart speaker, and the target result is "what are you doing". Therefore, after receiving the target result, the smart speaker generates dialogue information according to a dialogue model.

[0139] The speech recognition device may generate the dialogue information according to the target result and sends the dialogue information to the speech receiving device. The speech receiving device plays audio information corresponding to the dialogue information. For example, the speech receiving device is a smart speaker, and the target result is "what are you doing". Therefore, the speech recognition device generates the dialogue information according to the target result and sends the dialogue information to the smart speaker, and the smart speaker plays audio information corresponding to the dialogue information.

[0140] It should be noted that any two of the embodiment shown in FIG. 3, the embodiment shown in FIG. 5, and the embodiment shown in FIG. 6 may be combined to form a new embodiment, or the three embodiments may be combined to form a new embodiment. Using m=3 as an example, the command selection rule is the first selection rule, the function selection rule is the second selection rule, and the dialogue selection rule is the third selection rule.

[0141] The following is an apparatus embodiment, which can be used to execute the method embodiments. For details not disclosed in the apparatus embodiment, refer to the method embodiments.

[0142] Referring to FIG. 7, FIG. 7 is a block diagram of a speech recognition apparatus according to an embodiment. The apparatus has functions of performing the foregoing method examples. The functions may be implemented by using hardware, or may be implemented by hardware executing corresponding software. The apparatus may include a signal obtaining module 610, a speech recognition module 620, and a determining module 630.

[0143] The signal obtaining module 610 is configured to obtain a speech signal.

[0144] The speech recognition module 620 is configured to recognize, according to a speech recognition algorithm, the speech signal obtained by the signal obtaining module 610, to obtain n candidate recognition results, the candidate recognition results being text information corresponding to the speech signal, and n being an integer greater than 1.

[0145] The determining module 630 is configured to determine, according to a selection rule whose execution sequence is j in m selection rules, a target result in the n candidate recognition results that are obtained by recognition by the speech recognition module 620, the target result being a candidate recognition result that has a highest matching degree with the speech signal in the n candidate recognition results, m being an integer greater than 1, and an initial value of j being 1.

[0146] The determining module 630 is configured to determine the target result in the n candidate recognition results according to a selection rule whose execution sequence is j+1 when the target result is not determined according to the selection rule whose execution sequence is j.

[0147] Execution sequences of the m selection rules may be determined according to respective algorithm complexity degrees, and the execution sequences and the algorithm complexity degrees are in a positive correlation.

[0148] The m selection rules may include at least two of a command selection rule, a function selection rule, and a dialogue selection rule, an algorithm complexity degree of the command selection rule may be lower than an algorithm complexity degree of the function selection rule, and the algorithm complexity degree of the function selection rule may be lower than an algorithm complexity degree of the dialogue selection rule,
the command selection rule being used for instructing a speech recognition device to detect, depending on whether a command lexicon includes a command keyword matching an ith candidate recognition result, whether the ith candidate recognition result is the target result, 1≤i≤n;
the function selection rule being used for instructing the speech recognition device to detect, depending on whether a speech lexicon includes a lexicon keyword matching a speech keyword, whether the ith candidate recognition result is the target result, the speech keyword being at least one keyword in the ith candidate recognition result; and
the dialogue selection rule being used for instructing the speech recognition device to determine a similarity degree between each candidate recognition result and the speech signal according to a trained language model, to select the target result.
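The cascade described above, where the rule with execution sequence j+1 is tried only when the rule with sequence j fails to determine a target result, can be sketched as follows. The rule implementations here are hypothetical placeholders standing in for the command, function, and dialogue selection rules.

```python
from typing import Callable, List, Optional

def select_target(candidates: List[str],
                  rules: List[Callable[[List[str]], Optional[str]]]) -> Optional[str]:
    """Apply the m selection rules in order of increasing algorithm
    complexity; the first rule that yields a target result short-circuits
    the remaining, more expensive rules."""
    for rule in rules:  # execution sequence j = 1, 2, ..., m
        target = rule(candidates)
        if target is not None:
            return target
    return None  # no rule selected a target result

# Hypothetical stand-ins: a cheap command match and a default fallback.
command_rule = lambda cs: next((c for c in cs if c == "pause"), None)
fallback_rule = lambda cs: cs[0] if cs else None

print(select_target(["resume", "pause"], [command_rule, fallback_rule]))  # pause
```

Ordering the rules by algorithm complexity means the expensive perplexity computation only runs when the cheaper keyword-based rules fail, which is the real-time improvement this application claims.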

[0149] The determining module 630 may include a first detection unit and a first determining unit,
the first detection unit being configured to detect whether a first correspondence of the command lexicon includes the command keyword matching the ith candidate recognition result, 1≤i≤n; and
the first determining unit being configured to determine, when the first correspondence includes the command keyword matching the ith candidate recognition result, that the ith candidate recognition result is the target result,
the first correspondence including at least the command keyword.

[0150] The determining module 630 may further include a second detection unit, a keyword searching unit, a second determining unit, and a third determining unit,
the second detection unit being configured to detect, when the first correspondence does not include a command keyword matching any candidate recognition result of the n candidate recognition results, whether a second correspondence of the command lexicon includes a keyword matching any word in the ith candidate recognition result;
the keyword searching unit being configured to, when the second correspondence includes a keyword matching a word in the ith candidate recognition result, search, according to an index value corresponding to the keyword in the second correspondence, the first correspondence for a command keyword corresponding to the index value;
the second determining unit being configured to determine an edit distance between the ith candidate recognition result and the command keyword, the edit distance being used for indicating a quantity of operations required for conversion of the ith candidate recognition result into the command keyword; and
the third determining unit being configured to determine, when the edit distance is less than a preset value, that the ith candidate recognition result is the target result,
the first correspondence including a correspondence between the index value and the command keyword, and the second correspondence including a correspondence between the index value and the keyword.

[0151] The determining module 630 may include a template analysis unit, a third detection unit, and a fourth determining unit,
the template analysis unit being configured to analyze a function template of the ith candidate recognition result, 1≤i≤n;
the third detection unit being configured to detect whether the speech lexicon includes the lexicon keyword matching the speech keyword in the ith candidate recognition result; and
the fourth determining unit being configured to determine, when the speech lexicon includes the lexicon keyword matching the speech keyword in the ith candidate recognition result, that the ith candidate recognition result is the target result, the speech keyword being at least one keyword in the ith candidate recognition result,
the ith candidate recognition result including the function template and the speech keyword.

[0152] The determining module 630 may include a perplexity calculation unit and a fifth determining unit,
the perplexity calculation unit being configured to calculate a perplexity of each candidate recognition result according to the language model;
the fifth determining unit being configured to determine a smallest value of the perplexities of the n candidate recognition results and determine that the ith candidate recognition result corresponding to the smallest value is the target result,
the perplexities being used for indicating the similarity degrees between the candidate recognition results and the speech signal, the perplexities and the similarity degrees being in a negative correlation, the language model being an N-gram language model that is generated according to a dedicated corpus corresponding to at least one field, and the N-gram language model being used for determining an occurrence probability of a current word according to occurrence probabilities of N-1 words before the current word, N being a positive integer.

[0153] An embodiment further provides a computer-readable storage medium. The computer-readable storage medium may be a computer-readable storage medium included in the memory, or may be a computer-readable storage medium that exists alone and is not assembled into the speech recognition device. The computer-readable storage medium stores at least one instruction, at least one program, and a code set or an instruction set, and the at least one instruction, the at least one program, and the code set or the instruction set are loaded and executed by the processor to implement the speech recognition method according to the foregoing method embodiments.

[0154] FIG. 8 is a schematic structural diagram of a speech recognition device according to an embodiment. The speech recognition device 700 includes a Central Processing Unit (CPU) 701, a system memory 704 including a random access memory (RAM) 702 and a read-only memory (ROM) 703, and a system bus 705 connecting the system memory 704 and the CPU 701. The speech recognition device 700 further includes a basic input/output system (I/O system) 706 for transmitting information between components in a computer, and a mass storage device 707 used for storing an operating system 713, an application program 714, and another program module 715.

[0155] The basic I/O system 706 includes a display 708 configured to display information, and an input device 709 used by a user to input information, such as a mouse or a keyboard. The display 708 and the input device 709 are both connected to the CPU 701 by using an input/output controller 710 connected to the system bus 705. The basic I/O system 706 may further include the input/output controller 710, to receive and process inputs from multiple other devices, such as the keyboard, the mouse, or an electronic stylus. Similarly, the input/output controller 710 further provides an output to a display screen, a printer or another type of output device.

[0156] The mass storage device 707 is connected to the CPU 701 by using a mass storage controller (not shown) connected to the system bus 705. The mass storage device 707 and an associated computer-readable medium provide non-volatile storage for the speech recognition device 700. That is, the mass storage device 707 may include a computer-readable medium (not shown) such as a hard disk or a compact disc ROM (CD-ROM) drive.

[0157] The computer-readable medium may include a computer storage medium and a communication medium. The computer storage medium includes volatile and non-volatile media, and removable and non-removable media implemented by using any method or technology and configured to store information such as a computer-readable instruction, a data structure, a program module, or other data. The computer storage medium includes a RAM, a ROM, an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a flash memory or another solid-state memory technology, a CD-ROM, a digital versatile disc (DVD) or another optical memory, a tape cartridge, a magnetic cassette, a magnetic disk memory, or another magnetic storage device. Certainly, a person skilled in the art would appreciate that the computer storage medium is not limited to the foregoing types. The system memory 704 and the mass storage device 707 may be collectively referred to as a memory.

[0158] According to the embodiments, the speech recognition device 700 may further be connected, through a network such as the Internet, to a remote computer on the network. That is, the speech recognition device 700 may be connected to a network 712 by using a network interface unit 711 connected to the system bus 705, or may be connected to another type of network or a remote computer system (not shown) by using the network interface unit 711.

[0159] Specifically, in this embodiment, the speech recognition device 700 further includes a memory and one or more programs, where the one or more programs are stored in the memory, and are configured to be executed by one or more processors. The one or more programs include an instruction used for performing the foregoing speech recognition method.

[0160] According to an embodiment, there is provided a speech recognition system. The speech recognition system includes a smart speaker and a server. The smart speaker may be the speech collection device as shown in FIG. 1, and the server may be the speech recognition device shown in FIG. 1.

[0161] The smart speaker is configured to collect a speech signal and send the speech signal to the server.

[0162] The server is configured to: obtain a speech signal; recognize the speech signal according to a speech recognition algorithm, to obtain n candidate recognition results, the candidate recognition results being text information corresponding to the speech signal, and n being an integer greater than 1; determine a target result in the n candidate recognition results according to a selection rule whose execution sequence is j in m selection rules, the target result being a candidate recognition result that has a highest matching degree with the speech signal in the n candidate recognition results, m being an integer greater than 1, and an initial value of j being 1; and determine the target result in the n candidate recognition results according to a selection rule whose execution sequence is j+1 when the target result is not determined according to the selection rule whose execution sequence is j, and send the target result to the smart speaker. The server may recognize the target result according to the speech recognition method shown in any one of FIG. 3 to FIG. 6.
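The server-side selection flow above (try the m selection rules in ascending order of algorithm complexity; fall through from the rule with execution sequence j to the rule with execution sequence j+1 whenever rule j does not determine a target) can be sketched as follows. This is an illustrative sketch only: the rule functions, lexicon contents, and the first-candidate fallback in `dialogue_rule` are hypothetical placeholders standing in for the command, function, and dialogue selection rules described in this application.

```python
from typing import Callable, Optional

# A selection rule takes the n candidate recognition results and returns the
# target result, or None when it cannot determine one (illustrative signature).
SelectionRule = Callable[[list], Optional[str]]

def command_rule(candidates: list) -> Optional[str]:
    # Cheapest rule: exact match against a small command lexicon (placeholder data).
    command_lexicon = {"play", "pause", "previous", "next"}
    return next((c for c in candidates if c in command_lexicon), None)

def function_rule(candidates: list) -> Optional[str]:
    # Middle rule: keyword lookup in a speech lexicon (placeholder data).
    speech_lexicon = {"weather", "navigate", "translate"}
    return next((c for c in candidates
                 if any(word in speech_lexicon for word in c.split())), None)

def dialogue_rule(candidates: list) -> Optional[str]:
    # Most expensive rule: a language-model perplexity comparison would run here;
    # as a stand-in, this sketch simply returns the first candidate.
    return candidates[0] if candidates else None

def select_target(candidates: list) -> Optional[str]:
    # Rules are ordered by ascending algorithm complexity; the first rule that
    # determines a target short-circuits the rest, so the expensive dialogue
    # rule only runs when the cheaper rules fail.
    rules = [command_rule, function_rule, dialogue_rule]
    for rule in rules:
        target = rule(candidates)
        if target is not None:
            return target
    return None
```

For example, for the candidate list `["pause", "paws"]` the command rule already resolves the target, so the function and dialogue rules are never evaluated, which is the source of the real-time improvement claimed above.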

[0163] The smart speaker is further configured to make a response according to the target result. The response includes but is not limited to at least one of: executing a command according to the target result, making a function response according to the target result, and conducting a speech dialogue according to the target result.

[0164] For example, executing a command according to the target result includes at least one of the following commands: play, pause, previous, and next.

[0165] For example, making a function response according to the target result includes at least one of the following function responses: playing a song by a specified singer, with a specified song name, or in a specified style; playing a music program by a specified host, with a specified program name, or of a specified type; speech navigation; schedule reminders; and translation.

[0166] For example, conducting a speech dialogue according to the target result includes at least one of the following dialogue scenarios: weather questions and answers, knowledge questions and answers, entertainment chatting, and joke telling.

[0167] A person of ordinary skill in the art would understand that all or some of the steps of the foregoing embodiments may be implemented by using hardware, or may be implemented by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium. The storage medium may be a ROM, a magnetic disk, an optical disc, or the like.

[0168] The foregoing descriptions are merely preferred embodiments of this application, but are not intended to limit this application. Any modification, equivalent replacement, or improvement made within the spirit and principle of this application shall fall within the protection scope of this application.


Claims

1. A speech recognition method, comprising:

obtaining a speech signal;

recognizing the speech signal according to a speech recognition algorithm, to obtain n candidate recognition results, the candidate recognition results comprising text information corresponding to the speech signal, and n being an integer greater than 1;

determining a target result from among the n candidate recognition results according to a selection rule selected from among m selection rules, the selection rule having an execution sequence of j, the target result being a candidate recognition result that has a highest matching degree with the speech signal from among the n candidate recognition results, m being an integer greater than 1, and an initial value of j being 1; and

determining the target result from among the n candidate recognition results according to a selection rule having an execution sequence of j+1 when the target result is not determined according to the selection rule having the execution sequence of j.


 
2. The speech recognition method according to claim 1, wherein execution sequences of the m selection rules are determined according to respective algorithm complexity degrees, and wherein the execution sequences and the algorithm complexity degrees have a positive correlation.
 
3. The speech recognition method according to claim 1, wherein the m selection rules comprise at least two of a command selection rule, a function selection rule, and a dialogue selection rule, wherein an algorithm complexity degree of the command selection rule is lower than an algorithm complexity degree of the function selection rule, and the algorithm complexity degree of the function selection rule is lower than an algorithm complexity degree of the dialogue selection rule,
wherein the command selection rule is used for instructing a speech recognition device to detect, depending on whether a command lexicon comprises a command keyword matching an ith candidate recognition result, whether the ith candidate recognition result is the target result, 1≤i≤n;
wherein the function selection rule is used for instructing the speech recognition device to detect, depending on whether a speech lexicon comprises a lexicon keyword matching a speech keyword, whether the ith candidate recognition result is the target result, the speech keyword being at least one keyword in the ith candidate recognition result; and
wherein the dialogue selection rule is used for instructing the speech recognition device to determine a similarity degree between each candidate recognition result and the speech signal according to a trained language model, to select the target result.
 
4. The speech recognition method according to claim 3, wherein the selection rule having the execution sequence of j comprises the command selection rule, and the determining the target result from among the n candidate recognition results according to the selection rule having the execution sequence of j comprises:

detecting whether a first correspondence of the command lexicon comprises the command keyword matching the ith candidate recognition result, 1≤i≤n; and

determining, when the first correspondence comprises the command keyword matching the ith candidate recognition result, that the ith candidate recognition result is the target result,

wherein the first correspondence comprises at least the command keyword.


 
5. The speech recognition method according to claim 4, wherein after the detecting whether the first correspondence of the command lexicon comprises the command keyword matching the ith candidate recognition result, the method further comprises:

detecting, when the first correspondence does not comprise a command keyword matching any candidate recognition result of the n candidate recognition results, whether a second correspondence of the command lexicon comprises a keyword matching any word in the ith candidate recognition result;

when the second correspondence comprises a keyword matching a word in the ith candidate recognition result, searching, according to an index value corresponding to the keyword in the second correspondence, the first correspondence for a command keyword corresponding to the index value;

determining an edit distance between the ith candidate recognition result and the command keyword, the edit distance being used for indicating a quantity of operations required for conversion of the ith candidate recognition result into the command keyword; and

determining, when the edit distance is less than a preset value, that the ith candidate recognition result is the target result,

wherein the first correspondence comprises a correspondence between the index value and the command keyword, and the second correspondence comprises a correspondence between the index value and the keyword.


 
6. The speech recognition method according to claim 3, wherein the selection rule having the execution sequence of j comprises the function selection rule, and the determining the target result from among the n candidate recognition results according to the selection rule having the execution sequence of j comprises:

analyzing a function template of the ith candidate recognition result, 1≤i≤n;

detecting whether the speech lexicon comprises the lexicon keyword matching the speech keyword in the ith candidate recognition result; and

determining, when the speech lexicon comprises the lexicon keyword matching the speech keyword in the ith candidate recognition result, that the ith candidate recognition result is the target result, the speech keyword being at least one keyword in the ith candidate recognition result,

wherein the ith candidate recognition result comprises the function template and the speech keyword.


 
7. The speech recognition method according to claim 3, wherein the selection rule having the execution sequence of j comprises the dialogue selection rule, and the determining the target result from among the n candidate recognition results according to the selection rule having the execution sequence of j comprises:

calculating a perplexity of each candidate recognition result according to the language model; and

determining a smallest value of the perplexities in the n candidate recognition results and determining that the ith candidate recognition result corresponding to the smallest value is the target result,

wherein the perplexities are used for indicating the similarity degrees between the candidate recognition results and the speech signal, the perplexities and the similarity degrees have a negative correlation, the language model is an N-gram language model that is generated according to a dedicated corpus corresponding to at least one field, the N-gram language model is used for determining an occurrence probability of a current word according to occurrence probabilities of N-1 words before the current word, and N is a positive integer.


 
8. A speech recognition apparatus, comprising:

a signal obtaining module, configured to obtain a speech signal;

a speech recognition module, configured to recognize, according to a speech recognition algorithm, the speech signal obtained by the signal obtaining module, to obtain n candidate recognition results, the candidate recognition results comprising text information corresponding to the speech signal, and n being an integer greater than 1; and

a determining module, configured to determine, according to a selection rule selected from among m selection rules, the selection rule having an execution sequence of j, a target result from among the n candidate recognition results that are obtained by recognition by the speech recognition module, the target result being a candidate recognition result that has a highest matching degree with the speech signal in the n candidate recognition results, m being an integer greater than 1, and an initial value of j being 1,

wherein the determining module is further configured to determine the target result from among the n candidate recognition results according to a selection rule having an execution sequence of j+1 when the determining module does not determine the target result according to the selection rule having the execution sequence of j.


 
9. The speech recognition apparatus according to claim 8, wherein execution sequences of the m selection rules are determined according to respective algorithm complexity degrees, and wherein the execution sequences and the algorithm complexity degrees have a positive correlation.
 
10. The speech recognition apparatus according to claim 8, wherein the m selection rules comprise at least two selected from among a command selection rule, a function selection rule, and a dialogue selection rule,
wherein an algorithm complexity degree of the command selection rule is lower than an algorithm complexity degree of the function selection rule, and the algorithm complexity degree of the function selection rule is lower than an algorithm complexity degree of the dialogue selection rule,
wherein the command selection rule is used for instructing a speech recognition device to detect, depending on whether a command lexicon comprises a command keyword matching an ith candidate recognition result, whether the ith candidate recognition result is the target result, 1≤i≤n;
wherein the function selection rule is used for instructing the speech recognition device to detect, depending on whether a speech lexicon comprises a lexicon keyword matching a speech keyword, whether the ith candidate recognition result is the target result, the speech keyword being at least one keyword in the ith candidate recognition result; and
wherein the dialogue selection rule is used for instructing the speech recognition device to determine a similarity degree between each candidate recognition result and the speech signal according to a trained language model, to select the target result.
 
11. The speech recognition apparatus according to claim 10, wherein the determining module comprises a first detection unit and a first determining unit,
wherein the first detection unit is configured to detect whether a first correspondence of the command lexicon comprises the command keyword matching the ith candidate recognition result, 1≤i≤n; and
wherein the first determining unit is configured to determine, when the first correspondence comprises the command keyword matching the ith candidate recognition result, that the ith candidate recognition result is the target result,
wherein the first correspondence comprises at least the command keyword.
 
12. The speech recognition apparatus according to claim 11, wherein the determining module further comprises a second detection unit, a keyword searching unit, a second determining unit, and a third determining unit,
wherein the second detection unit is configured to detect, when the first correspondence does not comprise a command keyword matching any candidate recognition result of the n candidate recognition results, whether a second correspondence of the command lexicon comprises a keyword matching any word in the ith candidate recognition result;
wherein the keyword searching unit is configured to, when the second correspondence comprises a keyword matching a word in the ith candidate recognition result, search, according to an index value corresponding to the keyword in the second correspondence, the first correspondence for a command keyword corresponding to the index value;
wherein the second determining unit is configured to determine an edit distance between the ith candidate recognition result and the command keyword, the edit distance being used for indicating a quantity of operations required for conversion of the ith candidate recognition result into the command keyword; and
wherein the third determining unit is configured to determine, when the edit distance is less than a preset value, that the ith candidate recognition result is the target result,
wherein the first correspondence comprises a correspondence between the index value and the command keyword, and the second correspondence comprises a correspondence between the index value and the keyword.
 
13. The speech recognition apparatus according to claim 10, wherein the determining module comprises a template analysis unit, a third detection unit, and a fourth determining unit,
wherein the template analysis unit is configured to analyze a function template of the ith candidate recognition result, 1≤i≤n;
wherein the third detection unit is configured to detect whether the speech lexicon comprises the lexicon keyword matching the speech keyword in the ith candidate recognition result; and
wherein the fourth determining unit is configured to determine, when the speech lexicon comprises the lexicon keyword matching the speech keyword in the ith candidate recognition result, that the ith candidate recognition result is the target result, the speech keyword being at least one keyword in the ith candidate recognition result, and
wherein the ith candidate recognition result comprises the function template and the speech keyword.
 
14. The speech recognition apparatus according to claim 10, wherein the determining module comprises a perplexity calculation unit and a fifth determining unit,
wherein the perplexity calculation unit is configured to calculate a perplexity of each candidate recognition result according to the language model;
wherein the fifth determining unit is configured to determine a smallest value of the perplexities in the n candidate recognition results and determine that the ith candidate recognition result corresponding to the smallest value is the target result,
wherein the perplexities are used for indicating the similarity degrees between the candidate recognition results and the speech signal, the perplexities and the similarity degrees have a negative correlation, the language model is an N-gram language model that is generated according to a dedicated corpus corresponding to at least one field, the N-gram language model is used for determining an occurrence probability of a current word according to occurrence probabilities of N-1 words before the current word, and N is a positive integer.
 
15. A speech recognition method, comprising:

obtaining, by a speech recognition device, a speech signal;

recognizing, by the speech recognition device, the speech signal according to a speech recognition algorithm, to obtain n candidate recognition results, the candidate recognition results comprising text information corresponding to the speech signal, and n being an integer greater than 1;

determining, by the speech recognition device, a target result from among the n candidate recognition results according to a selection rule selected from among m selection rules, the selection rule having an execution sequence of j, the target result being a candidate recognition result that has a highest matching degree with the speech signal in the n candidate recognition results, m being an integer greater than 1, and an initial value of j being 1; and

determining, by the speech recognition device, the target result from among the n candidate recognition results according to a selection rule having an execution sequence of j+1 when the target result is not determined according to the selection rule having the execution sequence of j.


 
16. The method according to claim 15, wherein execution sequences of the m selection rules are determined according to respective algorithm complexity degrees, and the execution sequences and the algorithm complexity degrees have a positive correlation.
 
17. The method according to claim 15, wherein the m selection rules comprise at least two of a command selection rule, a function selection rule, and a dialogue selection rule, an algorithm complexity degree of the command selection rule is lower than an algorithm complexity degree of the function selection rule, and the algorithm complexity degree of the function selection rule is lower than an algorithm complexity degree of the dialogue selection rule,
wherein the command selection rule is used for instructing a speech recognition device to detect, depending on whether a command lexicon comprises a command keyword matching an ith candidate recognition result, whether the ith candidate recognition result is the target result, 1≤i≤n;
wherein the function selection rule is used for instructing the speech recognition device to detect, depending on whether a speech lexicon comprises a lexicon keyword matching a speech keyword, whether the ith candidate recognition result is the target result, the speech keyword being at least one keyword in the ith candidate recognition result; and
wherein the dialogue selection rule is used for instructing the speech recognition device to determine a similarity degree between each candidate recognition result and the speech signal according to a trained language model, to select the target result.
 
18. A speech recognition device, comprising a processor and a memory, the memory storing at least one instruction, at least one program, and a code set or an instruction set, and the at least one instruction, the at least one program, and the code set or the instruction set being loaded and executed by the processor to implement the speech recognition method according to any one of claims 1 to 7.
 
19. A computer-readable storage medium, the storage medium storing at least one instruction, at least one program, and a code set or an instruction set, and the at least one instruction, the at least one program, and the code set or the instruction set being loaded and executed by a processor to implement the speech recognition method according to any one of claims 1 to 7.
 
20. A speech recognition system, comprising a smart speaker and a server,
wherein the smart speaker is configured to collect a speech signal and send the speech signal to the server;
wherein the server is configured to: obtain a speech signal; recognize the speech signal according to a speech recognition algorithm, to obtain n candidate recognition results, the candidate recognition results being text information corresponding to the speech signal, and n being an integer greater than 1; determine a target result in the n candidate recognition results according to a selection rule whose execution sequence is j in m selection rules, the target result being a candidate recognition result that has a highest matching degree with the speech signal in the n candidate recognition results, m being an integer greater than 1, and an initial value of j being 1; and determine the target result in the n candidate recognition results according to a selection rule whose execution sequence is j+1 when the target result is not determined according to the selection rule whose execution sequence is j, and send the target result to the smart speaker; and
wherein the smart speaker is configured to make a response according to the target result.
 




Drawing