(19)
(11)EP 3 550 477 A1

(12)EUROPEAN PATENT APPLICATION
published in accordance with Art. 153(4) EPC

(43)Date of publication:
09.10.2019 Bulletin 2019/41

(21)Application number: 17875986.6

(22)Date of filing:  11.01.2017
(51)International Patent Classification (IPC): 
G06N 99/00(2019.01)
(86)International application number:
PCT/CN2017/070812
(87)International publication number:
WO 2018/098892 (07.06.2018 Gazette  2018/23)
(84)Designated Contracting States:
AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
Designated Extension States:
BA ME
Designated Validation States:
MA MD

(30)Priority: 29.11.2016 CN 201611070244

(71)Applicant: Iflytek Co., Ltd.
Hefei, Anhui 230088 (CN)

(72)Inventors:
  • PAN, Jia
    Hefei, Anhui 230088 (CN)
  • ZHANG, Shiliang
    Hefei, Anhui 230088 (CN)
  • XIONG, Shifu
    Hefei, Anhui 230088 (CN)
  • WEI, Si
    Hefei, Anhui 230088 (CN)
  • HU, Guoping
    Hefei, Anhui 230088 (CN)

(74)Representative: Epping - Hermann - Fischer 
Patentanwaltsgesellschaft mbH Schloßschmidstraße 5
80639 München (DE)

  


(54)END-TO-END MODELLING METHOD AND SYSTEM


(57) An end-to-end modelling method and system. The method comprises: determining a topological structure of a target-based end-to-end model, wherein the topological structure comprises an input layer, an encoding layer, an enhancement encoding layer, a filtering layer, a decoding layer and an output layer, with the enhancement encoding layer being used for adding target unit information to a feature sequence output by the encoding layer, and the filtering layer being used for performing information filtering on the feature sequence to which the target unit information has been added by the enhancement encoding layer; collecting a large amount of training data; determining a labelled object of the training data, and labelling a target unit in the labelled object; extracting a feature sequence of the training data; and training parameters of the target-based end-to-end model by using the feature sequence of the training data and the labelling information of its target units, so as to obtain the parameters of the target-based end-to-end model. By means of the present invention, the accuracy of modelling can be improved.




Description

FIELD



[0001] The present disclosure relates to the technical field of machine learning and in particular to a method and a system for end-to-end modeling.

BACKGROUND



[0002] End-to-end modeling refers to building a model based on a corresponding between a feature sequence of an input end and a feature sequence of an output end. End-to-end modeling is widely used in the field of pattern recognition or machine learning. For example, end-to-end modeling is usually used in an application system for speech recognition, image recognition, machine translation or the like. A corresponding relationship between the input end and the output end is established to accomplish requirements of the application system. Taking the speech recognition as an example, end-to-end modeling refers to building a model by combining an acoustic model with a language model, to output a recognition text directly. In Chinese language, a Chinese character or word is usually served as a modeling unit, i.e., a target labeling unit, a model is built by learning a corresponding relationship between an inputted speech signal sequence and an outputted Chinese character or word.

[0003] The conventional method for end-to-end modeling is usually realized based on an Encode-Decode model, and the method includes the following steps:
  1. (1) determining a topological structure of an Encode-Decode model;
  2. (2) collecting multiple pieces of training data, extracting a feature sequence of each piece of the training data, and determining target labeling information in the training data; and
  3. (3) training parameters of the model by using the feature sequences of the multiple pieces of the training data and the target labeling information in the training data.


[0004] A topological structure of the Encode-Decode model, as shown in Figure 1, mainly includes an input layer, an encoding layer, a decoding layer and an output layer. The encoding layer is configured to encode a feature sequence inputted from the input layer. The decoding layer is configured to decode the encoded feature sequence. The decoded feature sequence serves as the input of the output layer, and the output layer outputs a posterior probability of each target labeling unit.

[0005] It can be seen from Figure 1 that, in this model, only the inputted feature sequence is encoded to acquire encoding information, and the encoding information serves as the input of the decoding layer, which decodes it. As a result, the acquired encoded feature sequence differs greatly from the target labeling unit, an accurate relationship between the feature sequence of the input end and the feature sequence of the output end cannot be built, and the accuracy of modeling is thus lowered.
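
For illustration only, a minimal sketch of such a conventional Encode-Decode structure, assuming PyTorch; all layer sizes and names are hypothetical, as the description above does not prescribe an implementation:

```python
import torch.nn as nn

class EncodeDecodeModel(nn.Module):
    """Conventional Encode-Decode model: input -> encoding -> decoding -> output."""
    def __init__(self, feat_dim=40, hidden=256, num_targets=5000):
        super().__init__()
        # Encoding layer: encodes the feature sequence inputted from the input layer.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        # Decoding layer: decodes the encoded feature sequence.
        self.decoder = nn.LSTM(hidden, hidden, num_layers=1, batch_first=True)
        # Output layer: posterior probability of each target labeling unit.
        self.output = nn.Linear(hidden, num_targets)

    def forward(self, x):                    # x: (batch, T, feat_dim)
        encoded, _ = self.encoder(x)         # encoding information
        decoded, _ = self.decoder(encoded)   # decoded feature sequence
        return self.output(decoded).softmax(dim=-1)
```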

SUMMARY



[0006] A method and a system for end-to-end modeling are provided in embodiments of the present disclosure, to improve accuracy of modeling.

[0007] The following technical solutions are provided by the present disclosure.

[0008] A method for end-to-end modeling includes:

determining a topological structure of a target-based end-to-end model, where the topological structure includes an input layer, an encoding layer, a code enhancement layer, a filtering layer, a decoding layer and an output layer; where the code enhancement layer is configured to add information of a target unit to a feature sequence outputted by the encoding layer, and the filtering layer is configured to filter the feature sequence to which the information of the target unit has been added by the code enhancement layer;

collecting multiple pieces of training data;

determining a labeling object of each piece of the training data, and labeling a target unit in the labeling object;

extracting a feature sequence of each piece of the training data; and

training parameters of the target-based end-to-end model by using the feature sequences of the multiple pieces of the training data and labeling information of the target units in the multiple pieces of the training data, to acquire the parameters of the target-based end-to-end model.



[0009] Preferably, the number of encoding layers is one or more, and the number of nodes of each encoding layer is the same as the number of nodes of the input layer.

[0010] Preferably, each encoding layer is a Long Short Term Memory layer in a unidirectional or bidirectional Long Short Term Memory neural network, or is a convolutional layer in a convolutional neural network.

[0011] Preferably, the topological structure further includes a down sampling layer located between adjacent encoding layers.

[0012] Preferably, the number of down sampling layers is one or more.

[0013] Preferably, an input of each node of the down sampling layer is feature information of multiple adjacent nodes of the encoding layer previous to the down sampling layer.

[0014] Preferably, the information of the target unit is added to the code enhancement layer via an enhancement node, each target unit corresponds to one enhancement node, a feature vector of a target unit is inputted to the enhancement node corresponding to the target unit, and the number of code enhancement layers and the number of enhancement nodes are the same as the number of target units.

[0015] Preferably, each enhancement node is connected to all nodes of the code enhancement layer corresponding to the enhancement node; or each enhancement node is only connected to the first node of the code enhancement layer corresponding to the enhancement node.

[0016] Preferably, the number of filtering layers is the same as the number of the code enhancement layers, and each code enhancement layer is connected to one filtering layer directly.

[0017] Preferably, the filtering layer has a structure of a unidirectional or bidirectional Long Short Term Memory layer, the number of nodes of the filtering layer is the same as the number of nodes of the code enhancement layer, a feature outputted by each code enhancement layer serves as an input of the filtering layer connected to the code enhancement layer, and an output of the last node of the filtering layer serves as an output of the filtering layer; or
the filtering layer has a structure of a convolutional layer and a pooling layer in a convolutional neural network, each filtering layer includes one or more convolutional layers and one pooling layer, and an output of the pooling layer serves as an output of the filtering layer including the pooling layer.

[0018] Preferably, the training parameters of the target-based end-to-end model by using the feature sequences of the multiple pieces of the training data and labeling information of the target units in the multiple pieces of the training data includes:
training the parameters of the end-to-end model by using the feature sequences of the multiple pieces of the training data as an input of the end-to-end model and using the labeling information of the target units in the multiple pieces of the training data as an output of the end-to-end model, where the parameters of the end-to-end model are the weights, i.e., converting matrices, and biases of the connections among the layers of the end-to-end model.

[0019] A system for end-to-end modeling includes:

a topological structure determining module, configured to determine a topological structure of a target-based end-to-end model, where the topological structure includes an input layer, an encoding layer, a code enhancement layer, a filtering layer, a decoding layer and an output layer; where the code enhancement layer is configured to add information of a target unit to a feature sequence outputted by the encoding layer, and the filtering layer is configured to filter the feature sequence to which the information of the target unit has been added by the code enhancement layer;

a training data collecting module, configured to collect multiple pieces of training data;

a labeling module, configured to determine a labeling object of each piece of the training data, and to label a target unit in the labeling object;

a feature extracting module, configured to extract a feature sequence of each piece of the training data; and

a parameter training module, configured to train parameters of the target-based end-to-end model by using the feature sequences of the multiple pieces of the training data and labeling information of the target units in the multiple pieces of the training data, to acquire the parameters of the target-based end-to-end model.



[0020] Preferably, the number of encoding layers is one or more, and the number of nodes of each encoding layer is the same as the number of nodes of the input layer.

[0021] Preferably, each encoding layer is a Long Short Term Memory layer in a unidirectional or bidirectional Long Short Term Memory neural network, or is a convolutional layer in a convolutional neural network.

[0022] Preferably, the topological structure further includes a down sampling layer located between adjacent encoding layers.

[0023] Preferably, the number of down sampling layers is one or more.

[0024] Preferably, an input of each node of the down sampling layer is feature information of multiple adjacent nodes of the encoding layer previous to the down sampling layer.

[0025] Preferably, the information of the target unit is added to the code enhancement layer via an enhancement node, each target unit corresponds to one enhancement node, a feature vector of a target unit is inputted to the enhancement node corresponding to the target unit, and the number of code enhancement layers and the number of enhancement nodes are the same as the number of target units.

[0026] Preferably, each enhancement node is connected to all nodes of the code enhancement layer corresponding to the enhancement node; or each enhancement node is only connected to the first node of the code enhancement layer corresponding to the enhancement node.

[0027] Preferably, the number of filtering layers is the same as the number of the code enhancement layers, and each code enhancement layer is connected to one filtering layer directly.

[0028] Preferably, the filtering layer has a structure of a unidirectional or bidirectional Long Short Term Memory layer, the number of nodes of the filtering layer is the same as the number of nodes of the code enhancement layer, a feature outputted by each code enhancement layer serves as an input of the filtering layer connected to the code enhancement layer, and an output of the last node of the filtering layer serves as an output of the filtering layer; or the filtering layer has a structure of a convolutional layer and a pooling layer in a convolutional neural network, each filtering layer includes one or more convolutional layers and one pooling layer, and an output of the pooling layer serves as an output of the filtering layer including the pooling layer.

[0029] Preferably, the parameter training module is configured to: train the parameters of the end-to-end model by using the feature sequences of the multiple pieces of the training data as an input of the end-to-end model and using the labeling information of the target units in the multiple pieces of the training data as an output of the end-to-end model, where the parameters of the end-to-end model are the weights, i.e., converting matrices, and biases of the connections among the layers of the end-to-end model.

[0030] According to the method and system for end-to-end modeling provided by embodiments of the present disclosure, a code enhancement layer and a filtering layer are added to a topological structure of a target-based end-to-end model. The code enhancement layer is configured to add labeling information of a target unit to a feature sequence outputted by an encoding layer, so that the encoded feature sequence obtained by code enhancement includes more complete information and the difference between the encoded feature sequence and a target labeling unit is reduced effectively. The filtering layer is configured to filter the feature sequence to which the labeling information of the target unit has been added by the code enhancement layer, to eliminate redundant information after code enhancement. The decoding layer is configured to decode the filtered feature sequence. The decoded feature sequence serves as an input of the output layer, and a feature sequence normalized by the output layer is obtained; thus the accuracy of modeling from an input end to an output end is improved effectively.

BRIEF DESCRIPTION OF THE DRAWINGS



[0031] In order to more clearly illustrate technical solutions in embodiments of the present disclosure or in the conventional technology, drawings used in the description of the embodiments are introduced briefly hereinafter. Apparently, the drawings described in the following illustrate only some embodiments of the present disclosure, and other drawings may be obtained by those of ordinary skill in the art based on these drawings without any creative effort.

Figure 1 is a schematic diagram of a topological structure of an Encode-Decode model in the conventional art;

Figure 2 is a flow chart of a method for end-to-end modeling according to an embodiment of the present disclosure;

Figure 3 is a schematic diagram of a topological structure of a target-based end-to-end model according to an embodiment of the present disclosure;

Figure 4 is a schematic diagram of inserting a down sampling layer between encoding layers in the topological structure shown in Figure 3 according to an embodiment of the present disclosure;

Figure 5A is a schematic diagram of connections between enhancement nodes and nodes of a code enhancement layer according to an embodiment of the present disclosure;

Figure 5B is a schematic diagram of connections between enhancement nodes and nodes of a code enhancement layer according to another embodiment of the present disclosure;

Figure 6A is a schematic diagram of connections between a code enhancement layer and a filtering layer according to an embodiment of the present disclosure;

Figure 6B is a schematic diagram of connections between a code enhancement layer and a filtering layer according to another embodiment of the present disclosure; and

Figure 7 is a schematic structural diagram of a system for end-to-end modeling according to an embodiment of the present disclosure.


DETAILED DESCRIPTION OF EMBODIMENTS



[0032] In order to make those skilled in the art understand the technical solutions according to the embodiments of the present disclosure better, the embodiments of the present disclosure are described in detail below in conjunction with the drawings.

[0033] In order to address the above issues in the conventional method for end-to-end modeling, a method and a system for end-to-end modeling are provided in embodiments of the present disclosure. In the method and system according to the embodiments of the present disclosure, a code enhancement layer and a filtering layer are added to a topological structure of a target-based end-to-end model. That is, the topological structure of the target-based end-to-end model includes: an input layer, an encoding layer, a code enhancement layer, a filtering layer, a decoding layer and an output layer. The code enhancement layer is configured to add labeling information of a target unit to a feature sequence outputted by an encoding layer, so that the encoded feature sequence obtained by code enhancement includes more complete information and the difference between the encoded feature sequence and a target labeling unit is reduced effectively. The filtering layer is configured to filter the feature sequence to which the labeling information of the target unit has been added by the code enhancement layer, to eliminate redundant information after code enhancement. The decoding layer is configured to decode the filtered feature sequence. The decoded feature sequence serves as an input of an output layer, and a feature sequence normalized by the output layer is obtained; thus the accuracy of modeling from an input end to an output end is improved effectively.

[0034] Figure 2 illustrates a flow chart of a method for end-to-end modeling according to an embodiment of the present disclosure. The method includes the following steps 201 to 205.

[0035] In step 201, a topological structure of a target-based end-to-end model is determined.

[0036] Compared to a conventional Encode-Decode model, a code enhancement layer and a filtering layer are added to the topological structure of the target-based end-to-end model according to the embodiment of the disclosure. Specifically, the topological structure of the end-to-end model includes an input layer, an encoding layer, a code enhancement layer, a filtering layer, a decoding layer and an output layer. The code enhancement layer is configured to add information of a target unit to a feature sequence outputted by the encoding layer, so that the encoded feature sequence obtained by code enhancement includes more complete information and the difference between the encoded feature sequence and a target unit is reduced effectively. The filtering layer is configured to filter the feature sequence to which the information of the target unit has been added by the code enhancement layer, to eliminate redundant information after code enhancement. The decoding layer is configured to decode the filtered feature sequence. The decoded feature sequence serves as an input of the output layer, and a feature sequence normalized by the output layer is obtained. A specific structure of the target-based end-to-end model is described in detail hereinafter.

[0037] In step 202, multiple pieces of training data are collected.

[0038] The pieces of training data may be collected according to requirements of an application, for example, the training data may be speech data, image data, text data or the like.

[0039] In step 203, a labeling object of each piece of the training data is determined, and a target unit in the labeling object is labeled.

[0040] The target unit may be determined according to requirements of an application. Generally, the target unit is obtained by a domain expert labeling the labeling object of the piece of training data. The labeling object may also be the piece of training data itself.

[0041] It should be noted that, in practice, the target unit may be determined according to requirements of an application. For example, in a speech recognition application, the collected piece of training data is speech data, the labeling object may be a recognition text corresponding to the speech data, and a single character or a word in the recognition text may serve as a target unit. In an image recognition application, the collected piece of training data is image data, the labeling object may be a recognition text corresponding to the image data, that is, a recognition text obtained by image recognition, and a single character or a word in the recognition text serves as the target unit. In a machine translation application, the collected piece of training data is source-language text data, the labeling object may be target-language text data, and a single character or a word in the target-language text data serves as the target unit.

[0042] In step 204, a feature sequence of each piece of the training data is extracted.

[0043] A feature in the feature sequence may be determined according to requirements of an application. For example, in a speech recognition application, the feature may be acoustic information describing the speech data in each speech frame, such as a Filter Bank, MFCC or PLP feature. In an image recognition application, the feature may be the values of pixels in each image frame. In a machine translation application, the feature may be a word vector of each word in the source-language text data.
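
As a hedged illustration of this step for speech data, the following sketch extracts a per-frame MFCC feature sequence; it assumes the librosa library, and the sampling rate and feature dimensionality are example values, not values prescribed by the disclosure:

```python
import librosa

def extract_feature_sequence(wav_path, sr=16000, n_mfcc=13):
    """Extract an MFCC feature sequence, one feature vector per speech frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, T)
    return mfcc.T  # the feature sequence X = {x1, x2, ..., xT}, one row per frame
```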

[0044] In step 205, parameters of the target-based end-to-end model are trained by using the feature sequences of the multiple pieces of the training data and labeling information of the target units in the multiple pieces of the training data, to acquire the parameters of the target-based end-to-end model.

[0045] A target-based end-to-end model in the embodiments of the present disclosure is described in detail below in conjunction with Figures 3 to 6.

[0046] Reference is made to Figure 3, which is a schematic diagram of a topological structure of a target-based end-to-end model according to an embodiment of the present disclosure.

[0047] The topological structure of the target-based end-to-end model includes an input layer, an encoding layer, a code enhancement layer, a filtering layer, a decoding layer and an output layer. A detailed topological structure and feature transformation among layers are described as follows.

(1) Input layer



[0048] An input layer is used for inputting a feature sequence of a piece of training data, and the number of nodes of the input layer is determined based on the feature sequence of the piece of training data. For example, in a case that the training data is speech data, the feature sequence inputted to the input layer consists of the speech features of the frames of each speech. The number of nodes of the input layer is the number of frames of each speech, and the input is illustrated as X = {x1,x2,...,xt,...,xT}, where xt represents a feature vector of the t-th frame of the current training data, and T represents the number of frames of the current training data.

(2) Encoding layer



[0049] The feature sequence inputted to the input layer is encoded by an encoding layer. The number of encoding layers is one or more. The number of nodes of each encoding layer is the same as the number of nodes of the input layer. Each encoding layer is a Long Short Term Memory layer in a unidirectional or bidirectional Long Short Term Memory neural network, or is a convolutional layer in a convolutional neural network. The structure of the encoding layer is determined according to requirements of an application. For example, for a large-vocabulary speech recognition task with a large number of pieces of training data, the encoding layer may consist of three to five bidirectional Long Short Term Memory layers. For a speech recognition task in a limited domain with a small number of pieces of training data, the encoding layer may consist of one to three unidirectional Long Short Term Memory layers.
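
A sketch of these two encoder configurations, assuming PyTorch; the layer counts follow the examples in the preceding paragraph, and the dimensions are illustrative:

```python
import torch.nn as nn

def build_encoder(feat_dim, hidden, large_vocab=True):
    """Encoding layers: a stack of LSTM layers whose depth and direction
    depend on the task, per the examples above."""
    if large_vocab:
        # Large-vocabulary task: e.g. four bidirectional LSTM layers.
        return nn.LSTM(feat_dim, hidden, num_layers=4,
                       bidirectional=True, batch_first=True)
    # Limited-domain task: e.g. two unidirectional LSTM layers.
    return nn.LSTM(feat_dim, hidden, num_layers=2,
                   bidirectional=False, batch_first=True)
```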

[0050] Further, a down sampling layer may be inserted between encoding layers to improve the computation efficiency of the encoding layers. Specifically, one down sampling layer may be inserted between each two adjacent encoding layers, in which case multiple down sampling layers are inserted. Alternatively, one down sampling layer may be inserted between any two adjacent encoding layers, in which case only one down sampling layer is inserted. The number of nodes of an encoding layer that follows a down sampling layer is the same as the number of nodes of that down sampling layer, and the number of nodes of the last encoding layer is the same as the number of nodes of the last down sampling layer. For example, for a task in which the inputted feature sequences of multiple frames overlap, such as speech recognition or image recognition, a down sampling layer may be inserted between the encoding layers to improve the computation efficiency. For a task without overlapped inputted feature sequences, such as machine translation, the down sampling layer may not be inserted between the encoding layers.

[0051] Figure 4 is a schematic diagram of inserting a down sampling layer between the encoding layer 1 and the encoding layer 2. An input of each node of the down sampling layer is feature information of multiple adjacent nodes of the encoding layer previous to the down sampling layer. The feature information may be obtained by calculating a maximum value, a mean value or a p-norm of the features of multiple nodes of the encoding layer previous to the down sampling layer, to achieve down sampling. In Figure 4, an input of each node of the down sampling layer is feature information of two adjacent nodes of the encoding layer previous to the down sampling layer, where M represents the total number of the encoding layers.
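
A sketch of this down sampling, taking the maximum over each pair of adjacent encoder nodes (frames), assuming PyTorch; max pooling over the time axis is one assumed realization, and mean or p-norm pooling could be substituted:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool1d(kernel_size=2, stride=2)

def down_sample(h):
    """h: output of the previous encoding layer, shape (batch, T, D_l).
    Each down sampling node takes the maximum of two adjacent encoder nodes."""
    return pool(h.transpose(1, 2)).transpose(1, 2)  # shape (batch, T // 2, D_l)
```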

[0052] A feature transformation method of the encoding layer is determined based on the structure of the encoding layer. For example, in a case that the encoding layer is a unidirectional or bidirectional Long Short Term Memory layer, an output feature sequence of the l-th encoding layer is represented as H^l = {h^l_1, h^l_2, ..., h^l_t, ..., h^l_T}, where h^l_t represents an output feature vector of the t-th frame of the l-th encoding layer. The transformation is represented as h^l_t = f(h^(l-1)_t), where f is a unidirectional or bidirectional Long Short Term Memory transforming function, and the dimension of h^l_t is D_l, the number of dimensions of the feature vector in each node of the l-th encoding layer.

(3) Code enhancement layer



[0053] Information of a target unit is added by the code enhancement layer, so that the feature sequence outputted by the encoding layer is enhanced, and the enhanced feature sequence includes more complete information.

[0054] The information of the target unit is added to the code enhancement layer via an enhancement node. Each target unit corresponds to one enhancement node, and a feature vector of a target unit is inputted to the enhancement node corresponding to the target unit.

[0055] There may be multiple target units in each labeling object, in which case multiple code enhancement layers are required. Each code enhancement layer corresponds to one enhancement node.

[0056] The number of code enhancement layers and the number of enhancement nodes are the same as the number of target units. Each code enhancement layer is connected with the enhancement node corresponding to the target unit previous to the target unit corresponding to the code enhancement layer. As shown in Figures 5A and 5B, assuming that there are N target units in total, N code enhancement layers are required. A code enhancement layer 1 corresponds to an empty enhancement node, a code enhancement layer 2 corresponds to the first target unit, a code enhancement layer 3 corresponds to the second target unit, and so on; that is, a code enhancement layer N corresponds to the (N-1)-th target unit, and information of the first target unit to the (N-1)-th target unit is added layer by layer. Taking speech recognition as an example, with a word serving as a target unit, if the labeling information of the current speech data consists of four words, the number of target units is four, and four code enhancement layers and four enhancement nodes are needed to enhance the feature sequence outputted by the encoding layer. When enhancing the feature sequence outputted by the encoding layer, the code enhancement layer corresponding to the n-th word is connected to the enhancement node corresponding to the (n-1)-th word, and the first code enhancement layer is connected to an empty enhancement node.

[0057] Since the processes of enhancing the feature sequences outputted by the encoding layers with the information of the target units are the same, in practice the multiple code enhancement layers may be regarded as one code enhancement layer performing enhancement multiple times, once for each target unit of the labeling object.

[0058] It should be noted that, in practice, the enhancement nodes and the code enhancement layers may be connected in different ways. A first connection way is to connect each enhancement node to all nodes of the code enhancement layer corresponding to the enhancement node, which is shown in Figure 5A. A second connection way is to connect each enhancement node only to the first node of the code enhancement layer corresponding to the enhancement node, which is shown in Figure 5B. In Figures 5A and 5B, N is the number of target units. Figure 3 merely illustrates the first connection way shown in Figure 5A, that is, the way of connecting each enhancement node to all nodes of the corresponding code enhancement layer. The second connection way requires less computation than the first, while the first connection way yields a better enhancement effect.

[0059] The number of nodes of each code enhancement layer is the same as the number of nodes of the last encoding layer, and the connection way of nodes of the code enhancement layer is the same as the connection way of nodes of the encoding layer.

[0060] During feature transformation, for each node of the code enhancement layer, the product of the feature vector of the target unit inputted to the enhancement node and the connection weight between the enhancement node and the node is added to the feature vector of that node.
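
A sketch of this transformation under the first connection way (the enhancement node connected to all nodes of the layer): the target-unit feature vector, projected by the connection weights, is added to every node's feature vector. The class name and dimensions are illustrative assumptions in PyTorch:

```python
import torch
import torch.nn as nn

class CodeEnhancementLayer(nn.Module):
    """Adds information of the previous target unit to the encoded feature sequence."""
    def __init__(self, unit_dim, feat_dim):
        super().__init__()
        # Connection weights between the enhancement node and the layer's nodes.
        self.enhance = nn.Linear(unit_dim, feat_dim, bias=False)

    def forward(self, encoded, unit_vec):
        # encoded: (batch, T', feat_dim) - output of the last encoding layer.
        # unit_vec: (batch, unit_dim) - feature vector of the previous target
        # unit (all zeros for the empty enhancement node of layer 1).
        return encoded + self.enhance(unit_vec).unsqueeze(1)  # broadcast over nodes
```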

(4) Filtering layer



[0061] The filtering layer is configured to filter the feature sequence enhanced by the code enhancement layer. The number of filtering layers is the same as the number of the code enhancement layers, and each code enhancement layer is connected to one filtering layer directly.

[0062] In practice, the filtering layer may have one of two types of structure. One type is a structure of a unidirectional or bidirectional Long Short Term Memory layer, and the other is a structure of a convolutional layer and a pooling layer in a convolutional neural network.

[0063] Figure 6A illustrates a first connection manner for connecting a code enhancement layer to a filtering layer. When the first connection manner is used, the number of the filtering layers is the same as the number of the code enhancement layers, the number of nodes of the filtering layer is the same as the number of nodes of the code enhancement layer, a feature outputted by each code enhancement layer serves as an input of the filtering layer connected to the code enhancement layer, and an output of the last node of the filtering layer serves as the output of the filtering layer, i.e., the filtered enhanced encoding information.

[0064] Figure 6B illustrates a second connection manner for connecting the code enhancement layer to the filtering layer. When the second connection manner is used, the filtering layer consists of one or more convolutional layers followed by a pooling layer, and the output of the pooling layer serves as the filtered enhanced encoding information. In this manner, the enhanced encoding information is filtered and collected from each node by the convolutional layers and is finally converged to a single node. Compared with the first connection manner, in which only one layer is used for filtering, the second connection manner has a better filtering effect.
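
A sketch of this second filtering structure (convolutional layers followed by a pooling layer), assuming PyTorch; the kernel sizes and layer count are assumptions. The first structure would instead run an LSTM over the enhanced sequence and keep the last node's output:

```python
import torch
import torch.nn as nn

class ConvPoolFilteringLayer(nn.Module):
    """Filters the enhanced encoding information and converges it to one node."""
    def __init__(self, feat_dim):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.pool = nn.AdaptiveMaxPool1d(1)  # pooling layer: converge to one node

    def forward(self, enhanced):                   # enhanced: (batch, T', feat_dim)
        h = self.convs(enhanced.transpose(1, 2))   # convolve over the time axis
        return self.pool(h).squeeze(-1)            # (batch, feat_dim)
```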

[0065] It should be noted that, Figure 3 only illustrates the first connection manner shown in Figure 6A.

[0066] The feature transformation method of the filtering layer follows from the structure chosen for it in each connection manner, and is not described here for simplicity.

(5) Decoding layer



[0067] An input of the decoding layer is the filtered enhanced encoding information outputted by each filtering layer. The decoding layer usually has a structure of a unidirectional Long Short Term Memory layer. There may be one or more decoding layers; generally, one or two decoding layers are used. The number of nodes of each decoding layer is the same as the number of the filtering layers. The detailed decoding process is the same as that in the conventional technology, and is not described here.

(6) Output layer



[0068] An output feature sequence transformed by the decoding layer serves as an input of the output layer. The output layer normalizes the input feature sequence and outputs a vector sequence of the posterior probability of each target labeling unit. The detailed normalization method can be found in the conventional technology; for example, a normalization function such as the softmax function may be used.
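
Putting the last two stages together, the following sketch treats the N filtered outputs (one per filtering layer) as the decoder's input sequence and applies softmax in the output layer; this is an assumed reading of the topology in PyTorch, not code from the disclosure:

```python
import torch
import torch.nn as nn

class DecoderWithOutput(nn.Module):
    def __init__(self, feat_dim, hidden, num_targets):
        super().__init__()
        # Decoding layer: a unidirectional LSTM, here one layer.
        self.decoder = nn.LSTM(feat_dim, hidden, num_layers=1, batch_first=True)
        self.output = nn.Linear(hidden, num_targets)

    def forward(self, filtered):
        # filtered: (batch, N, feat_dim) - one vector per filtering layer.
        decoded, _ = self.decoder(filtered)
        # Output layer: normalized posteriors over the target labeling units.
        return torch.softmax(self.output(decoded), dim=-1)
```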

[0069] According to the topological structure of the end-to-end model, when training the model, the parameters of the end-to-end model are trained by using the feature sequences of the multiple pieces of the training data as an input of the end-to-end model and using the labeling information of the target units in the multiple pieces of the training data as an output of the end-to-end model, where the parameters of the model are the weights, i.e., converting matrices, and biases of the connections among the layers of the end-to-end model. The detailed process of training the parameters can be found in the conventional technology. For example, cross entropy may be used as the optimization indicator of the model, and the parameters of the model are updated repeatedly by using an error back propagation algorithm over multiple iterations. The iteration process is stopped once the parameters of the model reach a convergence target; the updating of the parameters is then completed, and the parameters of the end-to-end model are obtained.
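
A sketch of this training procedure, assuming a PyTorch model whose forward pass maps padded feature sequences to per-unit scores; cross entropy serves as the optimization indicator and back propagation updates the weights and biases over multiple iterations. The data loader, learning rate and epoch count are assumptions:

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=20, lr=1e-3):
    criterion = nn.CrossEntropyLoss()          # optimization indicator: cross entropy
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):                # multiple iterations
        for features, targets in loader:       # features: (batch, T, D); targets: (batch, N)
            scores = model(features)           # (batch, N, num_targets), unnormalized
            loss = criterion(scores.flatten(0, 1), targets.flatten())
            optimizer.zero_grad()
            loss.backward()                    # error back propagation
            optimizer.step()                   # update weights and biases
        # In practice, stop once the parameters reach the convergence target.
```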

[0070] According to the method for end-to-end modeling provided in embodiments of the present disclosure, a code enhancement layer and a filtering layer are added to a topological structure of a target-based end-to-end model. After the input feature sequence is encoded, a code enhancement layer is added for each target unit, and the information of the target unit previous to the target unit corresponding to the code enhancement layer is added to the encoding sequence by the code enhancement layer. Since the historical information of the target unit is considered, the encoded feature sequence after code enhancement includes more complete information, and the difference between the encoded feature sequence and a target unit is reduced effectively. Further, a filtering layer is added after each code enhancement layer to eliminate redundant information after code enhancement. The feature sequence after code enhancement is filtered, and the filtered feature sequence is decoded. The decoded feature sequence serves as an input of the output layer, to obtain the feature sequence normalized by the output layer; thus the accuracy of modeling from an input end to an output end is improved effectively.

[0071] It can be understood by those skilled in the art that all or part of the steps in the method according to the above embodiments may be performed by related hardware under the instruction of a program. The program may be stored in a computer readable storage medium, such as a ROM/RAM, a magnetic disc or an optical disc.

[0072] Correspondingly, a computer readable storage medium is provided by the present disclosure. The computer readable storage medium includes computer program code which, when executed by a processor, causes the processor to:

determine a topological structure of a target-based end-to-end model, where the topological structure includes an input layer, an encoding layer, a code enhancement layer, a filtering layer, a decoding layer and an output layer; where the code enhancement layer is configured to add information of a target unit to a feature sequence outputted by the encoding layer, and the filtering layer is configured to filter the feature sequence to which the information of the target unit has been added by the code enhancement layer;

collect multiple pieces of training data;

determine a labeling object of each piece of the training data, and label a target unit in the labeling object;

extract a feature sequence of each piece of the training data; and

train parameters of the target-based end-to-end model by using the feature sequences of the multiple pieces of the training data and labeling information of the target units in the multiple pieces of the training data, to acquire the parameters of the target-based end-to-end model.



[0073] For a detailed structure of the target-based end-to-end model, reference may be made to the descriptions stated above.

[0074] The processor trains the parameters of the end-to-end model by using the feature sequences of the multiple pieces of the training data as an input of the end-to-end model and using the labeling information of the target units in the multiple pieces of the training data as an output of the end-to-end model. The parameters of the model are the weights, i.e., converting matrices, and biases of the connections among the layers of the end-to-end model.

[0075] Correspondingly, a system for end-to-end modeling is provided in an embodiment of the present disclosure. Figure 7 is a schematic structural diagram of the system.

[0076] In the embodiment, the system includes a topological structure determining module 701, a training data collecting module 702, a labeling module 703, a feature extracting module 704 and a parameter training module 705.

[0077] The topological structure determining module 701 is configured to determine a topological structure of a target-based end-to-end model. The topological structure includes an input layer, an encoding layer, a code enhancement layer, a filtering layer, a decoding layer and an output layer. The code enhancement layer is configured to add information of a target unit to a feature sequence outputted by the encoding layer, and the filtering layer is configured to filter the feature sequence to which the information of the target unit has been added by the code enhancement layer.

[0078] The training data collecting module 702 is configured to collect multiple pieces of training data.

[0079] The labeling module 703 is configured to determine a labeling object of each piece of the training data, and to label a target unit in the labeling object.

[0080] The feature extracting module 704 is configured to extract a feature sequence of each piece of the training data.

[0081] The parameter training module 705 is configured to train parameters of the target-based end-to-end model by using the feature sequences of the multiple pieces of the training data and labeling information of the target units in the multiple pieces of the training data, to acquire the parameters of the target-based end-to-end model.

[0082] The topological structure of the target-based end-to-end model is described in detail in the above method embodiments of the disclosure, which is not repeated herein.

[0083] According to the topological structure of the target-based end-to-end model, when the parameter training module 705 trains the model, the parameters of the end-to-end model are trained by using the feature sequences of the multiple pieces of the training data as an input of the end-to-end model and using the labeling information of the target units in the multiple pieces of the training data as an output of the end-to-end model, where the parameters of the model are the weights, i.e., converting matrices, and biases of the connections among the layers of the end-to-end model. The detailed process of training the parameters can be found in the conventional technology. For example, cross entropy may be used as the optimization indicator of the model, and the parameters of the model are updated repeatedly by using an error back propagation algorithm over multiple iterations. The iteration process is stopped once the parameters of the model reach a convergence target; the updating of the parameters is then completed, and the parameters of the end-to-end model are obtained.

[0084] According to the system for end-to-end modeling provided in embodiments of the present disclosure, a code enhancement layer and a filtering layer are added to a topological structure of a target-based end-to-end model. After the input feature sequence is encoded, a code enhancement layer is added for each target unit, and the information of the target unit previous to the target unit corresponding to the code enhancement layer is added to the encoding sequence by the code enhancement layer. Since the historical information of the target unit is considered, the encoded feature sequence after code enhancement includes more complete information, and the difference between the encoded feature sequence and a target unit is reduced effectively. Further, a filtering layer is added after each code enhancement layer to eliminate redundant information after code enhancement. The feature sequence after code enhancement is filtered, and the filtered feature sequence is decoded. The decoded feature sequence serves as an input of the output layer, to obtain the feature sequence normalized by the output layer; thus the accuracy of modeling from an input end to an output end is improved effectively.

[0085] The modules in the system for end-to-end modeling in embodiments of the present disclosure may be implemented by a memory, a processor and other hardware. Each of the modules may be implemented by one or more independent pieces of hardware, or multiple modules may be integrated into one piece of hardware. The functions of some modules may also be implemented by software, which is not limited herein.

[0086] It should be noted that the method and system provided in embodiments of the present disclosure can be used for multiple kinds of application requirements in the pattern recognition or machine learning field, such as speech recognition, image recognition and machine translation. Taking speech recognition as an example, end-to-end modeling can build a model by combining an acoustic model with a language model, to output a recognition text directly. In the Chinese language, a Chinese character or word usually serves as the modeling unit, i.e., the target unit, and a model is built by learning the corresponding relationship between an inputted speech signal sequence and an outputted Chinese character or word.

[0087] The embodiments in this specification are described in a progressive manner. For the same or similar parts between the embodiments, one may refer to the description of the other embodiments. Each embodiment lays emphasis on its differences from the other embodiments. Since the system embodiment is similar to the method embodiment, the description of the system embodiment is relatively simple; for related parts, reference may be made to the description in the method embodiment. The system embodiments described above are merely illustrative, and units described as separate components may or may not be physically separated. The components shown as units may or may not be physical units, i.e., the units may be located at the same place or may be distributed onto multiple network units. All or a part of the modules may be selected based on actual needs to realize the objective of the solutions according to the embodiments. The solutions according to the embodiments can be understood and implemented by those skilled in the art without creative work.

[0088] The embodiments of the disclosure are described in detail above, and specific examples are used in this specification to illustrate the present disclosure. The above description of the embodiments is only intended to help understand the method and system of the present disclosure. For those skilled in the art, modifications may be made to the specific embodiments and application scopes based on the concept of the present disclosure. In view of the above, the contents of this specification should not be understood as limiting the present disclosure.


Claims

1. A method for end-to-end modeling, comprising:

determining a topological structure of a target-based end-to-end model, wherein the topological structure comprises an input layer, an encoding layer, a code enhancement layer, a filtering layer, a decoding layer and an output layer; wherein the code enhancement layer is configured to add information of a target unit to a feature sequence outputted by the encoding layer, and the filtering layer is configured to filter the feature sequence to which the information of the target unit has been added by the code enhancement layer;

collecting a plurality of pieces of training data;

determining a labeling object of each piece of the training data, and labeling a target unit in the labeling object;

extracting a feature sequence of each piece of the training data; and

training parameters of the target-based end-to-end model by using the feature sequences of the plurality of pieces of the training data and labeling information of the target units in the plurality of pieces of the training data, to acquire the parameters of the target-based end-to-end model.


 
2. The method according to claim 1, wherein the number of encoding layers is one or more, and the number of nodes of each encoding layer is the same as the number of nodes of the input layer.
 
3. The method according to claim 2, wherein each encoding layer is a Long Short Term Memory layer in a unidirectional or bidirectional Long Short Term Memory neural network, or is a convolutional layer in a convolutional neural network.
 
4. The method according to claim 1, wherein the topological structure further comprises a down sampling layer located between adjacent encoding layers.
 
5. The method according to claim 4, wherein the number of down sampling layers is one or more.
 
6. The method according to claim 4, wherein an input of each node of the down sampling layer is feature information of a plurality of adjacent nodes of the encoding layer previous to the down sampling layer.
 
7. The method according to claim 1, wherein the information of the target unit is added to the code enhancement layer via an enhancement node, each target unit corresponds to one enhancement node, a feature vector of a target unit is inputted to the enhancement node corresponding to the target unit, and the number of code enhancement layers and the number of enhancement nodes are the same as the number of target units.
 
8. The method according to claim 7, wherein
each enhancement node is connected to all nodes of the code enhancement layer corresponding to the enhancement node; or
each enhancement node is only connected to the first node of the code enhancement layer corresponding to the enhancement node.
 
9. The method according to claim 7, wherein the number of filtering layers is the same as the number of the code enhancement layers, and each code enhancement layer is connected to one filtering layer directly.
 
10. The method according to claim 9, wherein
the filtering layer has a structure of a unidirectional or bidirectional Long Short Term Memory layer, the number of nodes of the filtering layer is the same as the number of nodes of the code enhancement layer, a feature outputted by each code enhancement layer serves as an input of the filtering layer connected to the code enhancement layer, and an output of the last node of the filtering layer serves as an output of the filtering layer; or
the filtering layer has a structure of a convolutional layer and a pooling layer in a convolutional neural network, each filtering layer comprises one or more convolutional layers and one pooling layer, and an output of the pooling layer serves as an output of the filtering layer comprising the pooling layer.
 
11. The method according to any one of claims 1 to 10, wherein the training parameters of the target-based end-to-end model by using the feature sequences of the plurality of pieces of the training data and labeling information of the target units in the plurality of pieces of the training data comprises:
training the parameters of the end-to-end model by using the feature sequences of the plurality of pieces of the training data as an input of the end-to-end model and using the labeling information of the target units in the plurality of pieces of the training data as an output of the end-to-end model, wherein the parameters of the end-to-end model are the weights, i.e., converting matrices, and biases of the connections among the layers of the end-to-end model.
 
12. A system for end-to-end modeling, comprising:

a topological structure determining module, configured to determine a topological structure of a target-based end-to-end model, wherein the topological structure comprises an input layer, an encoding layer, a code enhancement layer, a filtering layer, a decoding layer and an output layer; wherein the code enhancement layer is configured to add information of a target unit to a feature sequence outputted by the encoding layer, and the filtering layer is configured to filter the feature sequence to which the information of the target unit has been added by the code enhancement layer;

a training data collecting module, configured to collect a plurality of pieces of training data;

a labeling module, configured to determine a labeling object of each piece of the training data, and to label a target unit in the labeling object;

a feature extracting module, configured to extract a feature sequence of each piece of the training data; and

a parameter training module, configured to train parameters of the target-based end-to-end model by using the feature sequences of the plurality of pieces of the training data and labeling information of the target units in the plurality of pieces of the training data, to acquire the parameters of the target-based end-to-end model.


 
13. The system according to claim 12, wherein the number of encoding layers is one or more, and the number of nodes of each encoding layer is the same as the number of nodes of the input layer.
 
14. The system according to claim 13, wherein each encoding layer is a Long Short Term Memory layer in a unidirectional or bidirectional Long Short Term Memory neural network, or is a convolutional layer in a convolutional neural network.
 
15. The system according to claim 12, wherein the topological structure further comprises a down sampling layer located between adjacent encoding layers.
 
16. The system according to claim 15, wherein the number of down sampling layers is one or more.
 
17. The system according to claim 15, wherein an input of each node of the down sampling layer is feature information of a plurality of adjacent nodes of the encoding layer previous to the down sampling layer.
 
18. The system according to claim 12, wherein the information of the target unit is added to the code enhancement layer via an enhancement node, each target unit corresponds to one enhancement node, a feature vector of a target unit is inputted to the enhancement node corresponding to the target unit, and the number of code enhancement layers and the number of enhancement nodes are the same as the number of target units.
 
19. The system according to claim 18, wherein
each enhancement node is connected to all nodes of the code enhancement layer corresponding to the enhancement node; or
each enhancement node is only connected to the first node of the code enhancement layer corresponding to the enhancement node.
 
20. The system according to claim 18, wherein the number of filtering layers is the same as the number of the code enhancement layers, and each code enhancement layer is connected to one filtering layer directly.
 
21. The system according to claim 20, wherein
the filtering layer has a structure of a unidirectional or bidirectional Long Short Term Memory layer, the number of nodes of the filtering layer is the same as the number of nodes of the code enhancement layer, a feature outputted by each code enhancement layer serves as an input of the filtering layer connected to the code enhancement layer, and an output of the last node of the filtering layer serves as an output of the filtering layer; or
the filtering layer has a structure of a convolutional layer and a pooling layer in a convolutional neural network, each filtering layer comprises one or more convolutional layers and one pooling layer, and an output of the pooling layer serves as an output of the filtering layer comprising the pooling layer.
 
22. The system according to any one of claims 12 to 21, wherein the parameter training module is configured to:
train the parameters of the end-to-end model by using the feature sequences of the plurality of pieces of the training data as an input of the end-to-end model and using the labeling information of the target units in the plurality of pieces of the training data as an output of the end-to-end model, wherein the parameters of the end-to-end model are the weights, i.e., converting matrices, and biases of the connections among the layers of the end-to-end model.
 




Drawing

Search report