(19)
(11)EP 3 654 249 A1

(12)EUROPEAN PATENT APPLICATION

(43)Date of publication:
20.05.2020 Bulletin 2020/21

(21)Application number: 18306501.0

(22)Date of filing:  15.11.2018
(51)Int. Cl.: 
G06N 3/04  (2006.01)
G10L 15/16  (2006.01)
G06N 3/08  (2006.01)
G10L 15/06  (2013.01)
(84)Designated Contracting States:
AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
Designated Extension States:
BA ME
Designated Validation States:
KH MA MD TN

(71)Applicant: SNIPS
75002 Paris (FR)

(72)Inventors:
  • COUCKE, Alice
    75002 Paris (FR)
  • CHLIEH, Mohammed
    75011 Paris (FR)
  • GISSELBRECHT, Thibault
    75004 Paris (FR)
  • LEROY, David
    75002 Paris (FR)
  • POUMEYROL, Mathieu
    75011 Paris (FR)
  • LAVRIL, Thibaut
    75002 Paris (FR)

(74)Representative: Verriest, Philippe et al
Cabinet Germain & Maureau 12, rue Boileau BP 6153
69466 Lyon Cedex 06 (FR)

  


(54)DILATED CONVOLUTIONS AND GATING FOR EFFICIENT KEYWORD SPOTTING


(57) Method for detection of a keyword in a continuous stream of audio signal, by using a dilated convolutional neural network (DCNN), implemented by one or more computers embedded on a device, the dilated convolutional network (DCNN) comprising a plurality of dilation layers (DL), including an input layer (IL) and an output layer (OL), each layer of the plurality of dilation layers (DL) comprising gated activation units, and skip-connections to the output layer (OL), the dilated convolutional network (DCNN) being configured to generate an output detection signal when a predetermined keyword is present in the continuous stream of audio signal, the generation of the output detection signal being based on a sequence (SSM) of successive measurements (SM) provided to the input layer (IL), each successive measurement (SM) of the sequence (SSM) being measured on a corresponding frame from a sequence of successive frames extracted from the continuous stream of audio signal, at a plurality of successive time steps.




Description

TECHNICAL FIELD OF THE INVENTION



[0001] This invention relates to the field of using neural networks to automatically recognize speech, and, more precisely, to automatically detect pre-defined keywords in a continuous stream of audio signal.

BACKGROUND



[0002] Traditional approaches to keyword spotting either require significant memory resources and fail to capture large patterns with reasonably small models, or require such substantial computational resources that they cannot be implemented on a low-resource device.

[0003] Therefore, there is a need for an effective on-device keyword spotting method, providing real-time response and high accuracy for good user experience, while limiting memory footprint and computational cost.

SUMMARY OF THE INVENTION



[0004] The present invention provides a method for detection of a keyword in a continuous stream of audio signal, by using a dilated convolutional neural network, implemented by one or more computers embedded on a device, the dilated convolutional network comprising a plurality of dilation layers, including an input layer and an output layer, each layer of the plurality of dilation layers comprising gated activation units, and skip-connections to the output layer, the dilated convolutional network being configured to generate an output detection signal when a predetermined keyword is present in the continuous stream of audio signal, the generation of the output detection signal being based on a sequence of successive measurements provided to the input layer, each successive measurement of the sequence being measured on a corresponding frame from a sequence of successive frames extracted from the continuous stream of audio signal, at a plurality of successive time steps.

[0005] According to these provisions, it is possible to embed on a low-power, performance-limited device the computation and memory resources necessary to implement a dilated convolutional network and to use it for keyword detection applications.

[0006] According to an embodiment, the invention comprises one or more of the following features, alone or in combination.

[0007] According to an embodiment, the dilated convolutional neural network comprises 24 layers.

[0008] According to an embodiment, the successive measurements are acoustic features measured on successive frames extracted from the audio stream every 10 ms, each frame having a 25 ms duration.

[0009] According to an embodiment, the acoustic features measured on successive frames are 20 dimensional log-Mel filterbank energies.

[0010] According to an embodiment, the dilated convolutional neural network is configured to compute, at a time step, a dilated convolution based on a convolution kernel for each dilation layer, and to put in a cache memory the result of the computation at the time step, so that, at a next time step, the result of the computation is used to compute a new dilated convolution based on a shifted convolution kernel for each dilation layer.

[0011] According to these provisions, reusing the result of the computation at one time step to compute the dilated convolution at the next time step reduces the amount of floating point operations per second to a level compatible with the requirement of embedding the computer implemented dilated convolutional neural network on a small device.

[0012] According to another aspect, the invention provides a computer implemented method for training a dilated convolutional neural network, the dilated convolutional neural network being implemented by one or more computers embedded on a device, for keyword detection in a continuous stream of audio signal, the method comprising a data set preparation phase followed by a training phase based on the result of the data set preparation phase, the data set preparation phase comprising a labelling step comprising a step of associating a first label to successive frames which occur inside a predetermined time period centred on a time step at which an end of the keyword occurs, and of associating a second label to frames occurring outside the predetermined time period and inside a positive audio sample containing a formulation of the keyword, the positive audio sample comprising a first sequence of frames, the frames of the first sequence of frames occurring at successive time steps in between the beginning of the positive audio sample and the end of the positive audio sample.

[0013] According to an embodiment, the invention comprises one or more of the following features, alone or in combination.

[0014] According to an embodiment, the labelling step further comprises a step of associating the second label to frames inside a negative audio sample not containing a formulation of the keyword, the negative audio sample comprising a second sequence of frames, the frames of the second sequence of frames occurring at successive time steps in between a beginning time step of the negative audio sample and an ending time step of the negative audio sample.

[0015] According to these provisions, it is possible to train a more accurate model, and therefore to obtain more accurate detection results when using the computer implemented dilated convolutional network (DCNN) for keyword detection.

[0016] According to an embodiment the first label is a 1, and the second label is a 0.

[0017] According to an embodiment, the end of the keyword is detected using a voice activity detection computer implemented algorithm.

[0018] According to an embodiment, a width of the predetermined time period is optimised during a further step of validation based on a set of validation data.

[0019] According to an embodiment, during the training phase, the training of the dilated convolutional neural network is configured to learn only from the frames included in the second sequence of frames and from the frames which are associated to the first label and which are included in the first sequence of frames, and not to learn from the frames which are included in the first sequence of frames and which are associated to the second label.

[0020] According to these provisions, the efficiency of the method is further improved, allowing even better accuracy in the model, and better accuracy in the detection results when using the computer implemented dilated convolutional network (DCNN) for keyword detection.

[0021] According to another aspect, the invention provides a method for detection of a keyword in a continuous stream of audio signal, by using a dilated convolutional neural network, implemented by one or more computers embedded on a device, the dilated convolutional network comprising a plurality of dilation layers, including an input layer and an output layer, each layer of the plurality of dilation layers comprising gated activation units, and skip-connections to the output layer, the dilated convolutional network being configured to generate an output detection signal when a predetermined keyword is present in the continuous stream of audio signal, the generation of the output detection signal being based on a sequence of successive measurements provided to the input layer, each successive measurement of the sequence being measured on a corresponding frame from a sequence of successive frames extracted from the continuous stream of audio signal, at a plurality of successive time steps, wherein the dilated convolutional network is trained according to the computer implemented method for training a dilated convolutional neural network, of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS



[0022] The foregoing and other purposes, features, aspects and advantages of the invention will become apparent from the following detailed description of embodiments, given by way of illustration and not limitation with reference to the accompanying drawings, in which the same references refer to similar elements or to elements having similar functions, and in which:
  • Figure 1 schematically represents a view of an embodiment of a dilated convolutional neural network at a given time step;
  • Figure 2 schematically represents a view of an embodiment of gated activation units and skip-connections for a dilated convolutional neural network;
  • Figures 3a and 3b illustrate an embodiment of a labelling method to prepare training data sets.

DETAILED DESCRIPTION OF THE INVENTION ACCORDING TO AN EMBODIMENT



[0023] An embodiment of a computer implemented method for keyword detection in a continuous stream of audio signal, using a computer implemented dilated convolutional network, will be described with reference to figures 1 and 2.

[0024] According to an embodiment illustrated in figure 1, a continuous audio-stream is fragmented into a sequence SSM of successive measurements SM, each successive measurement SM resulting from the measurement of one or more acoustic features on a frame extracted from the continuous audio stream. According to an embodiment, the acoustic features are 20 dimensional log-Mel filterbank energies measured on successive frames extracted from the audio stream every 10 ms, each frame having a 25 ms duration.
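As an illustration of the framing and feature extraction described in paragraph [0024], the following is a minimal sketch and not the applicant's implementation. The 16 kHz sampling rate, the 512-point FFT, the Hamming window and all function names are assumptions introduced for the example; only the 25 ms frame duration, the 10 ms step and the 20 log-Mel bands come from the text.

```python
import numpy as np

SAMPLE_RATE = 16000        # assumption: the text does not state a sampling rate
FRAME_MS, HOP_MS = 25, 10  # from the text: 25 ms frames extracted every 10 ms
N_MELS = 20                # from the text: 20-dimensional log-Mel energies
N_FFT = 512                # assumption: FFT size chosen for the example

def frame_signal(signal, sample_rate=SAMPLE_RATE):
    """Cut a 1-D signal into overlapping frames, one row per frame."""
    signal = np.asarray(signal, dtype=float)
    frame_len = sample_rate * FRAME_MS // 1000   # 400 samples at 16 kHz
    hop = sample_rate * HOP_MS // 1000           # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sample_rate=SAMPLE_RATE):
    """Triangular filters spaced evenly on the Mel scale, shape (n_mels, n_fft//2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):            # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):           # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank

def log_mel(signal):
    """Return one 20-dimensional measurement SM per frame, shape (n_frames, 20)."""
    frames = frame_signal(signal)
    frames = frames * np.hamming(frames.shape[1])            # assumption: Hamming window
    power = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1)) ** 2
    energies = power @ mel_filterbank().T
    return np.log(energies + 1e-10)                          # small floor avoids log(0)

# One second of a 440 Hz tone yields a sequence SSM of 98 measurements.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
features = log_mel(np.sin(2 * np.pi * 440.0 * t))
```

With these assumed values, consecutive frames share 15 ms of audio (25 ms minus the 10 ms step), and each row of `features` plays the role of one successive measurement SM.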

[0025] As illustrated in figure 1, the sequence SSM of successive measurements SM is provided as an input to a computer implemented dilated convolutional neural network DCNN.

[0026] Figure 1 illustrates the configuration at a given time step of the inference process for keyword detection, with a given number of successive measurements SM being provided as input to the dilated convolutional neural network DCNN. At a next time step of the inference process for keyword detection, a new successive measurement SM is introduced in the sequence provided as input, pushing the sequence in the direction opposite to the time direction T represented on figure 1, the time direction T being directed towards the future.

[0027] According to an embodiment illustrated in figure 1, as is well known to the person skilled in the art, the computer implemented dilated convolutional network comprises a plurality of dilation layers DL, including an input layer IL and an output layer OL, and it is configured to implement, at each successive time step of the process, a dilated convolution of the sequence SSM of successive measurements SM which is provided, at each given time step of the process, to the input layer IL of the computer implemented dilated convolutional network.
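The dilated convolution performed by a dilation layer can be sketched as follows. This is an illustrative example only: the text does not specify the kernel size, the dilation rates or whether the convolution is causal, so the causal form, the kernel size of 2, the dilation schedule and the function names used here are all assumptions.

```python
import numpy as np

def dilated_causal_conv(x, weights, dilation):
    """Dilated convolution along time.

    x:       array of shape (T, C_in), one row per time step
    weights: array of shape (K, C_in, C_out), K being the kernel size
    Returns an array of shape (T, C_out) where
        out[t] = sum over k of x[t - k * dilation] @ weights[k]
    and time steps before the start of the sequence are treated as zeros.
    """
    T = x.shape[0]
    K, _, c_out = weights.shape
    pad = (K - 1) * dilation
    xp = np.concatenate([np.zeros((pad, x.shape[1])), x], axis=0)
    out = np.zeros((T, c_out))
    for k in range(K):
        start = pad - k * dilation        # tap k looks k * dilation steps into the past
        out += xp[start:start + T] @ weights[k]
    return out

def receptive_field(dilations, kernel_size):
    """Number of input time steps that can influence one output of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# An impulse at t = 0 reappears at t = 4 when the dilation is 4.
x = np.zeros((10, 1))
x[0, 0] = 1.0
y = dilated_causal_conv(x, np.ones((2, 1, 1)), dilation=4)
```

Under these assumptions, stacking layers with growing dilation rates widens the receptive field quickly: `receptive_field([1, 2, 4, 8], kernel_size=2)` gives 16 time steps for only four layers.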

[0028] According to an embodiment illustrated in figure 2, as is well known to the person skilled in the art, each layer of the plurality of dilation layers DL further comprises gated activation units GAU, and skip-connections SC to the output layer OL.

[0029] According to an embodiment, the dilated convolutional network DCNN is configured to generate an output detection signal when a predetermined keyword is present in the continuous stream of audio signal, the generation of the output detection signal being based on the result of the dilated convolution of the sequence SSM of successive measurements SM provided to the input layer IL, the result of the dilated convolution being transformed by operation of gated activation units GAU and of skip-connections SC to contribute to the generation of the output detection signal. Skip connections are introduced to speed up convergence and to address the issue of vanishing gradients posed by the training of deeper models. Each layer yields two outputs: one is fed directly to the next layer as usual, while the second one skips it. The outputs of all skip-connections are then summed into the final output of the network. Without these bypassing strategies, one could not train deeper architectures, as required by the keyword detection application.

[0030] The gated activation units are a combination of tanh and sigmoid activations. The sigmoid activation filter acts as a gate for the tanh activation filter: its output, between 0 and 1, determines how much of the output of the tanh filter is passed on, depending on how important that output is.
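The gated activation units and skip-connections described in paragraphs [0029] and [0030] can be sketched as a small forward pass. This is a hedged illustration, not the network of the embodiment: the channel count, the kernel size of 2, the dilation rates, the 1x1 projections named "next" and "skip", and the final sigmoid read-out are assumptions made for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dilated_conv(x, w, d):
    """Causal dilated convolution: x is (T, C_in), w is (K, C_in, C_out)."""
    pad = (w.shape[0] - 1) * d
    xp = np.concatenate([np.zeros((pad, x.shape[1])), x])
    return sum(xp[pad - k * d: pad - k * d + len(x)] @ w[k] for k in range(w.shape[0]))

def gated_layer(x, params, dilation):
    """One dilation layer: returns (output to next layer, output to skip path)."""
    filt = np.tanh(dilated_conv(x, params["filter"], dilation))   # tanh filter
    gate = sigmoid(dilated_conv(x, params["gate"], dilation))     # sigmoid gate in (0, 1)
    z = filt * gate                       # the gate scales the tanh output elementwise
    to_next = z @ params["next"]          # first output: fed to the next layer
    to_skip = z @ params["skip"]          # second output: bypasses the later layers
    return to_next, to_skip

def forward(x, layers, dilations, w_out):
    """Sum all skip outputs, then map the sum to one score per time step."""
    skip_sum = 0.0
    h = x
    for params, d in zip(layers, dilations):
        h, skip = gated_layer(h, params, d)
        skip_sum = skip_sum + skip
    return sigmoid(skip_sum @ w_out)      # assumed read-out: keyword score per frame

rng = np.random.default_rng(0)
C = 4                                     # hypothetical number of channels
dilations = [1, 2, 4]                     # hypothetical dilation schedule
layers = [
    {
        "filter": rng.normal(size=(2, C, C)),
        "gate": rng.normal(size=(2, C, C)),
        "next": rng.normal(size=(C, C)),
        "skip": rng.normal(size=(C, C)),
    }
    for _ in dilations
]
w_out = rng.normal(size=(C, 1))
scores = forward(rng.normal(size=(30, C)), layers, dilations, w_out)
```

Because every layer contributes directly to `skip_sum`, the output depends on each layer through a short path, which is the property the text relies on for training deeper stacks.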

[0031] The computer implemented dilated convolutional network DCNN is configured to run in a streaming fashion during the inference process for keyword detection. When receiving a new input frame at a next time step, the result of the dilated convolution computation at a previous time step is used to compute a new dilated convolution based on a shifted convolution kernel for each dilation layer. This is possible because the convolution kernels of each dilation layer are shifted one time step at a time, or a few time steps at a time, but in any case the "stride", or the number of time steps by which the kernel is shifted at a time, is usually smaller than the kernel size, so that two subsequent convolution kernels overlap. This cached implementation reduces the amount of Floating Point Operations per Second (FLOPS), so that the level of computing resources required by the inference process for the keyword detection task is compatible with the technical constraints imposed by embedding the computer implemented dilated convolutional network DCNN on a low-power, performance-limited device. Indeed, using a dilated convolutional network architecture for keyword detection implies a deeper model with a larger number of parameters; it is therefore important for this specific application to be able to reduce the amount of FLOPS as indicated.
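The streaming operation described in paragraph [0031] can be illustrated with a small sketch in which each layer keeps a cache of the values it received at previous time steps, so that a new frame only requires one kernel evaluation per layer. The class name, the buffer layout, the kernel size of 2, the dilation rates and the tanh nonlinearity are assumptions made for the example; the text does not describe the cache structure in this detail.

```python
import numpy as np
from collections import deque

def offline_conv(x, w, d):
    """Reference: causal dilated convolution over a whole sequence x of shape (T, C)."""
    pad = (w.shape[0] - 1) * d
    xp = np.concatenate([np.zeros((pad, x.shape[1])), x])
    return sum(xp[pad - k * d: pad - k * d + len(x)] @ w[k] for k in range(w.shape[0]))

class StreamingLayer:
    """Dilation layer that processes one time step at a time using a cache."""

    def __init__(self, weights, dilation):
        self.w = weights                              # shape (K, C_in, C_out)
        self.d = dilation
        size = (weights.shape[0] - 1) * dilation + 1  # span covered by the kernel
        # Cache of the most recent inputs to this layer, initialised with zeros.
        self.cache = deque([np.zeros(weights.shape[1])] * size, maxlen=size)

    def step(self, x_t):
        self.cache.append(x_t)                        # the oldest entry is dropped
        # cache[-1] is the current input, cache[-1 - k*d] the input k*d steps earlier.
        return sum(self.cache[-1 - k * self.d] @ self.w[k] for k in range(self.w.shape[0]))

rng = np.random.default_rng(0)
dilations = [1, 2, 4, 8]                              # hypothetical dilation schedule
weights = [rng.normal(scale=0.5, size=(2, 3, 3)) for _ in dilations]
x = rng.normal(size=(50, 3))                          # 50 time steps, 3 features

# Offline: recompute every layer over the whole sequence.
h = x
for w, d in zip(weights, dilations):
    h = np.tanh(offline_conv(h, w, d))
offline_out = h

# Streaming: push one frame at a time through the cached layers.
stream_layers = [StreamingLayer(w, d) for w, d in zip(weights, dilations)]
stream_out = []
for x_t in x:
    h_t = x_t
    for layer in stream_layers:
        h_t = np.tanh(layer.step(h_t))
    stream_out.append(h_t)
stream_out = np.array(stream_out)
```

In this sketch the streaming outputs match the offline ones, while each new frame costs one kernel evaluation per layer instead of a recomputation over the whole receptive field.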

[0032] Before using the computer implemented dilated convolutional network DCNN in an inference mode for keyword detection, it is necessary to train the dilated convolutional network DCNN so that it builds an internal model adapted to the keyword(s) to be detected during the inference process.

[0033] According to an aspect, the invention also relates to a computer implemented method for training a dilated convolutional neural network (DCNN). The method comprises a data set preparation phase, followed by a training phase based on the result of the data set preparation phase, the data set preparation phase comprising the following steps:
  • collect two sets of training data comprising respectively two types of audio samples of varying duration. The first type, denoted positive audio samples, corresponds to the utterance by someone of the predetermined keyword(s); for example, an audio sample corresponding to someone saying the keyword to be detected, "Hey SNIPS" for example, as illustrated in figure 3a, with silence at the beginning and the end, will be denoted a "positive sample". The second type, denoted negative audio samples, corresponds to the utterance by someone of a random sentence, "Hello world" for example, as illustrated in figure 3b.
  • to be processed by the computer implemented dilated convolutional network, the audio samples are respectively divided into sequences of successive frames; according to an embodiment, the frames are of 25 ms duration and are extracted every 10 ms, so that each frame overlaps the previous and next frames. Each successive frame corresponds to a portion of the audio sample occurring respectively at one of a sequence of successive time steps.
  • in the sequence of successive frames corresponding to positive audio samples, automatically detect a frame, for example by using a voice activity detection algorithm, the detected frame corresponding to an end EK of the keyword, and associate a first label, 1 for example as illustrated in figure 3a, to all successive frames which occur, in the sequence of successive frames, inside a predetermined time period starting before and ending after the occurrence time step of the detected frame, and associate a second label, 0 for example, to each other frame of the sequence of successive frames, corresponding to positive audio samples, which occurs outside the predetermined time period.
  • associate the second label, 0 for example, to each frame of the sequences of successive frames corresponding to negative audio samples, as illustrated on figure 3b.
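The labelling steps listed above can be sketched as follows. The function names, the `half_width` parameter and the example frame indices are hypothetical; in particular, the index of the frame corresponding to the end EK of the keyword is taken as given here, whereas the text obtains it with, for example, a voice activity detection algorithm.

```python
import numpy as np

def label_positive(n_frames, end_frame, half_width):
    """Labels for a positive audio sample.

    Frames inside a window centred on `end_frame` (the frame at the end EK of
    the keyword) receive the first label, 1; all other frames receive the
    second label, 0. The window is clipped to the bounds of the sample.
    """
    labels = np.zeros(n_frames, dtype=int)
    lo = max(0, end_frame - half_width)
    hi = min(n_frames, end_frame + half_width + 1)
    labels[lo:hi] = 1
    return labels

def label_negative(n_frames):
    """Labels for a negative audio sample: every frame receives the second label, 0."""
    return np.zeros(n_frames, dtype=int)

# A 12-frame positive sample whose keyword ends at frame 8, window of 5 frames.
positive_labels = label_positive(n_frames=12, end_frame=8, half_width=2)
# -> [0 0 0 0 0 0 1 1 1 1 1 0]
negative_labels = label_negative(n_frames=12)
```

The window width (here `2 * half_width + 1` frames) is the quantity that the text proposes to tune on validation data.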


[0034] According to these provisions, instead of using an alignment algorithm to find the keyword window that is aligned with the spoken keyword, and to label 1, for example, the frames inside the window, and 0 the frames outside the window, according to the method of the invention, only the frames close to the end of the keyword are labelled 1. The end of the keyword can easily be detected by, for example, a voice activity detection algorithm. Thus, it is possible to train a more accurate model, and therefore to obtain more accurate detection results when using the computer implemented dilated convolutional network DCNN for keyword detection.

[0035] In the traditional approach, the model has a tendency to trigger as soon as the keyword starts, whether or not the sample contains only a fraction of the keyword. One advantage of our approach is that the network will trigger near the end EK of the keyword, once it has seen enough context.

[0036] According to an embodiment of the method, the predetermined time period is centered on the frame corresponding to the end EK of the keyword, the width of the predetermined time period being optimised during a further step of validation tests based on a set of validation data.

[0037] According to an embodiment of the method, during the training of the dilated convolutional neural network DCNN, the dilated convolutional neural network DCNN is configured to learn only from the successive frames of the negative audio samples, and from the successive frames of the positive audio samples which are associated to the first label, 1 for example, and not to learn from successive frames of the positive audio samples which are associated to the second label, 0 for example.
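The selective learning described in paragraph [0037] can be expressed as a mask applied to a per-frame loss. The sketch below assumes a binary cross-entropy loss and per-frame keyword probabilities, neither of which is specified in the text; the function name and its arguments are likewise hypothetical.

```python
import numpy as np

def masked_bce(probs, labels, from_positive_sample):
    """Per-frame binary cross-entropy that ignores some frames.

    probs:                predicted keyword probability per frame, in (0, 1)
    labels:               1 (first label) or 0 (second label) per frame
    from_positive_sample: True for frames belonging to a positive audio sample

    Frames of positive samples labelled 0 are masked out, so the network learns
    only from negative-sample frames and from positive-sample frames labelled 1.
    """
    probs = np.clip(np.asarray(probs, dtype=float), 1e-7, 1.0 - 1e-7)
    labels = np.asarray(labels)
    from_positive_sample = np.asarray(from_positive_sample, dtype=bool)
    mask = np.where(from_positive_sample & (labels == 0), 0.0, 1.0)
    bce = -(labels * np.log(probs) + (1 - labels) * np.log(1.0 - probs))
    return float((mask * bce).sum() / max(mask.sum(), 1.0))

# Positive sample: only the two frames labelled 1 contribute to the loss.
labels = np.array([0, 0, 1, 1, 0])
probs = np.array([0.9, 0.1, 0.8, 0.7, 0.5])
loss_positive = masked_bce(probs, labels, np.ones(5, dtype=bool))

# Negative sample: every frame contributes.
loss_negative = masked_bce(probs, np.zeros(5, dtype=int), np.zeros(5, dtype=bool))
```

With such a mask, predictions on the 0-labelled frames of a positive sample, such as the frames covering the beginning of the keyword, do not affect the loss.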

[0038] According to these provisions, the efficiency of the method is further improved, allowing even better accuracy in the model, and better accuracy in the detection results when using the computer implemented dilated convolutional network DCNN for keyword detection.


Claims

1. Method for detection of a keyword in a continuous stream of audio signal, by using a dilated convolutional neural network (DCNN), implemented by one or more computers embedded on a device, the dilated convolutional network (DCNN) comprising a plurality of dilation layers (DL), including an input layer (IL) and an output layer (OL), each layer of the plurality of dilation layers (DL) comprising gated activation units, and skip-connections to the output layer (OL), the dilated convolutional network (DCNN) being configured to generate an output detection signal when a predetermined keyword is present in the continuous stream of audio signal, the generation of the output detection signal being based on a sequence (SSM) of successive measurements (SM) provided to the input layer (IL), each successive measurement (SM) of the sequence (SSM) being measured on a corresponding frame from a sequence of successive frames extracted from the continuous stream of audio signal, at a plurality of successive time steps.
 
2. Method according to claim 1, wherein the dilated convolutional neural network (DCNN) is configured to compute, at a time step, a dilated convolution based on a convolution kernel for each dilation layer, and to put in a cache memory the result of the computation at the time step, so that, at a next time step, the result of the computation is used to compute a new dilated convolution based on a shifted convolution kernel for each dilation layer.
 
3. A computer implemented method for training a dilated convolutional neural network (DCNN), the dilated convolutional neural network (DCNN) being implemented by one or more computers embedded on a device, for keyword detection in a continuous stream of audio signal, the method comprising a data set preparation phase followed by a training phase based on the result of the data set preparation phase, the data set preparation phase comprising a labelling step comprising a step of associating a first label to successive frames which occur inside a predetermined time period centred on a time step at which an end (EK) of the keyword occurs, and in associating a second label to frames occurring outside the predetermined time period and inside a positive audio sample containing a formulation of the keyword, the positive audio samples comprising a first sequence of frames, the frames of the first sequence of frames occurring at successive time steps in between the beginning of the positive audio sample and the end of the positive audio sample.
 
4. A computer implemented method according to claim 3, wherein the labelling step further comprises a step of associating the second label to frames inside a negative audio sample not containing a formulation of the keyword, the negative audio sample comprising a second sequence of frames, the frames of the second sequence of frames occurring at successive time steps in between a beginning time step of the negative audio sample and an ending time step of the negative audio sample.
 
5. A computer implemented method according to any one of claims 3 or 4, wherein the width of the predetermined time period is optimised during a further step of validation based on a set of validation data.
 
6. A computer implemented method according to any one of claims 3, 4 or 5, wherein, during the training phase, the training of the dilated convolutional neural network (DCNN) is configured to learn only from the frames included in the second sequence of frames and from the frames which are associated to the first label and which are included in the first sequence of frames, and not to learn from the frames which are included in the first sequence of frames and which are associated to the second label.
 




Drawing