<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ep-patent-document PUBLIC "-//EPO//EP PATENT DOCUMENT 1.4//EN" "ep-patent-document-v1-4.dtd">
<ep-patent-document id="EP07405332A1" file="EP07405332NWA1.xml" lang="en" country="EP" doc-number="2063420" kind="A1" date-publ="20090527" status="n" dtd-version="ep-patent-document-v1-4">
<SDOBI lang="en"><B000><eptags><B001EP>ATBECHDEDKESFRGBGRITLILUNLSEMCPTIESILTLVFIROMKCYALTRBGCZEEHUPLSKBAHRIS..MT..RS..</B001EP><B005EP>J</B005EP><B007EP>DIM360 Ver 2.15 (14 Jul 2008) -  1100000/0</B007EP></eptags></B000><B100><B110>2063420</B110><B120><B121>EUROPEAN PATENT APPLICATION</B121></B120><B130>A1</B130><B140><date>20090527</date></B140><B190>EP</B190></B100><B200><B210>07405332.3</B210><B220><date>20071126</date></B220><B250>en</B250><B251EP>en</B251EP><B260>en</B260></B200><B400><B405><date>20090527</date><bnum>200922</bnum></B405><B430><date>20090527</date><bnum>200922</bnum></B430></B400><B500><B510EP><classification-ipcr sequence="1"><text>G10L  21/02        20060101AFI20080411BHEP        </text></classification-ipcr></B510EP><B540><B541>de</B541><B542>Verfahren und Baugruppe zur Erhöhung der Verständlichkeit von Sprache</B542><B541>en</B541><B542>Method and assembly to enhance the intelligibility of speech</B542><B541>fr</B541><B542>Procédé et assemblage pour améliorer l'intelligibilité de la parole</B542></B540><B590><B598>1</B598></B590></B500><B700><B710><B711><snm>EyeP Media S.A.</snm><iid>08279290</iid><irf>BR - 11105 EU</irf><adr><str>Avenue des Baumettes 15</str><city>1020 Renens</city><ctry>CH</ctry></adr></B711></B710><B720><B721><snm>Dubuis, Baptiste</snm><adr><str>Rue de la Borde 29</str><city>1018 Lausanne</city><ctry>CH</ctry></adr></B721><B721><snm>Zoia, Giorgio</snm><adr><str>Chemin Croix Rouges 10</str><city>1007 Lausanne</city><ctry>CH</ctry></adr></B721></B720><B740><B741><snm>Nithardt, Roland</snm><iid>00026316</iid><adr><str>Cabinet Roland Nithardt 
Conseils en Propriété Industrielle S.A. 
Y-Parc / Rue Galilée 9</str><city>1400 Yverdon-les-Bains</city><ctry>CH</ctry></adr></B741></B740></B700><B800><B840><ctry>AT</ctry><ctry>BE</ctry><ctry>BG</ctry><ctry>CH</ctry><ctry>CY</ctry><ctry>CZ</ctry><ctry>DE</ctry><ctry>DK</ctry><ctry>EE</ctry><ctry>ES</ctry><ctry>FI</ctry><ctry>FR</ctry><ctry>GB</ctry><ctry>GR</ctry><ctry>HU</ctry><ctry>IE</ctry><ctry>IS</ctry><ctry>IT</ctry><ctry>LI</ctry><ctry>LT</ctry><ctry>LU</ctry><ctry>LV</ctry><ctry>MC</ctry><ctry>MT</ctry><ctry>NL</ctry><ctry>PL</ctry><ctry>PT</ctry><ctry>RO</ctry><ctry>SE</ctry><ctry>SI</ctry><ctry>SK</ctry><ctry>TR</ctry></B840><B844EP><B845EP><ctry>AL</ctry></B845EP><B845EP><ctry>BA</ctry></B845EP><B845EP><ctry>HR</ctry></B845EP><B845EP><ctry>MK</ctry></B845EP><B845EP><ctry>RS</ctry></B845EP></B844EP></B800></SDOBI>
<abstract id="abst" lang="en">
<p id="pa01" num="0001">The present invention concerns a method and an assembly designed to enhance the intelligibility of speech produced by a sound device in a noisy environment. The assembly comprises a microphone, or a telecommunication unit, which supplies remote speech to a data processing module. The data processing module is designed to combine specific algorithms offering a perceptual improvement of the produced speech by increasing intelligibility, by preserving adequate signal quality and by limiting the overall power consumption as far as possible. The enhanced speech produced by the data processing module is then played through a speaker.
<img id="iaf01" file="imgaf001.tif" wi="165" he="84" img-content="drawing" img-format="tif"/></p>
</abstract><!-- EPO <DP n="1"> -->
<description id="desc" lang="en">
<heading id="h0001"><b>Technical field</b></heading>
<p id="p0001" num="0001">The present invention concerns a method to enhance the intelligibility of speech produced by a sound device in a noisy environment.</p>
<p id="p0002" num="0002">The present invention also concerns an assembly for implementing this method to enhance the intelligibility of speech produced by a sound device in a noisy environment.</p>
<heading id="h0002"><b>Background Art</b></heading>
<p id="p0003" num="0003">Over the last decade, the communication device market has experienced spectacular growth in terms of research, technology and user attention, especially for mobile or portable devices such as mobile phones, personal digital assistants or hearing aids.</p>
<p id="p0004" num="0004">The need to solve the problems of noise control and speech quality is critical when dealing with small, low-power devices.</p>
<p id="p0005" num="0005">In the history of communication, noise, in particular stationary background noise, has always been a problem. Every signal traveling from one point to another is prone to be corrupted by noise. Noise can arise from various sources: from surrounding acoustic sources, such as traffic, babble, reverberation or acoustic echo paths, or from electric/electronic sources such as thermal noise. Background noise, also known as environmental noise, can seriously affect perceptual aspects of speech such as quality or intelligibility. Considerable effort has therefore been made over recent decades to overcome this problem.</p>
<p id="p0006" num="0006">A solution to speech enhancement in the presence of local background noise is fundamental to the user experience. This issue is compounded by possible usage in unfavorable environments and by rapid<!-- EPO <DP n="2"> --> changes in background conditions. "Rapid" means that those conditions may vary one or several times during a normal conversation, even though this is a rather slow change in comparison to signal and noise frequencies, so that noise can mainly be approximated as stationary. Automatic adaptation of perceptual aspects such as quality and especially intelligibility is then of the utmost importance to make conversation and device use as seamless as possible.</p>
<p id="p0007" num="0007">A classic noise reduction problem consists of reducing the level of stationary noise superimposed on a local voice (or sound in general) signal that is captured by the same recording device in the same time interval. A remote voice signal, on the other hand, arrives at a sound device more or less disturbed by remote background noise and local device noise, but local background noise is added to it only along the acoustic path from the device speaker to one ear, and it is further disturbed by local background noise possibly reaching the other ear. This kind of noise cannot be reduced for the local user by signal processing in the digital domain; with the classic scheme this can be achieved only for the remote user. So, the only possible solution is to enhance the remote voice signal locally, in order to improve its perception when immersed in the local noisy condition.</p>
<p id="p0008" num="0008">While classic noise reduction constitutes a well-known branch of research, and signal processing tools are mature enough to handle it consistently in many cases, far-end speech enhancement in noisy conditions is a relatively new issue. It is also trickier, as it requires comparing signals and surrounding noise that cannot be captured by the very same device due to a dual-channel problem, and that are therefore not so easy to compare in an objective manner.</p>
<p id="p0009" num="0009">One possible solution is to change the volume, which is in fact not usable in every situation or place. Another solution is to use isolating headset devices. This solution is invasive and cannot be used everywhere. A<!-- EPO <DP n="3"> --> conventional solution consists of changing location, but this reduces mobility and is not applicable in every case. A further solution consists of using noise canceling headsets. The drawback of such a solution is that it is invasive, needs extra battery power and is costly.</p>
<heading id="h0003"><b>Disclosure of the Invention</b></heading>
<p id="p0010" num="0010">To overcome the above drawbacks of the prior art, an object of the present invention is to provide a method as defined in the preamble and characterized by a combination of specific algorithms offering a perceptual improvement of the produced speech by increasing intelligibility, by preserving adequate signal quality and by limiting the overall power consumption as far as possible.</p>
<p id="p0011" num="0011">The method primarily adapts to non-personally invasive devices but it also operates on invasive devices.</p>
<p id="p0012" num="0012">The method applies especially when no direct or indirect control is possible on the source of background noise. It applies when the microphones of the device capture the background noise but not necessarily the source of speech, which may be local as well as remote, received through a communication link and rendered through the device speaker(s).</p>
<p id="p0013" num="0013">The field of use especially includes telecommunication devices, hearing aids devices and multimedia devices.</p>
<p id="p0014" num="0014">According to a preferred form of realisation, at least one algorithm is used for identifying signal segments as silence, voiced or unvoiced segments (SUV).</p>
<p id="p0015" num="0015">The unvoiced segments are processed by applying a constant amplification, given the reduced bandwidth of the voice signal and the corresponding high bandwidth of these unvoiced segments.</p>
<p id="p0016" num="0016">Advantageously, the silence segments are simply ignored.<!-- EPO <DP n="4"> --></p>
<p id="p0017" num="0017">According to an attractive form of the present invention, a band energy adaptation is especially conceived to avoid increases in the overall power of the long voiced segments. To this purpose, the overall power is redistributed where noise is less masking, with consequent reduction in the energy, instead of increasing it where noise is more intense.</p>
<p id="p0018" num="0018">Preferably, a certain amount of signal distortion is accepted to permit as advantage an increase in intelligibility in particular environmental conditions.</p>
<p id="p0019" num="0019">Specific approximations to theoretical algorithms are made in SUV segmentation, thresholds and band gain adjustments to reduce computation, allowing execution in real-time on portable devices and with consequent reduction in both CPU load and battery load of the sound device.</p>
<p id="p0020" num="0020">The object of the present invention is also achieved by an assembly for implementing this method as defined in the preamble and characterized in that said assembly comprises at least one microphone, one speaker, and a data processing module designed to combine specific algorithms offering a perceptual improvement of the produced speech by increasing intelligibility, by preserving adequate signal quality and by limiting the overall power consumption as far as possible.</p>
<p id="p0021" num="0021">Advantageously, the data processing module comprises means designed to identify signal segments as silence, voiced and unvoiced segments. Preferably, this means is at least one algorithm.</p>
<p id="p0022" num="0022">For simplifying processing of unvoiced segments, the data processing module also comprises means designed to apply a constant amplification to said unvoiced segments, given the reduced bandwidth of the voice signal.<!-- EPO <DP n="5"> --></p>
<p id="p0023" num="0023">Furthermore the data processing module of the assembly may also comprise means designed to ignore the silence segments, and means designed to provide a band energy adaptation especially conceived to avoid increases in the overall power of the long voiced segment.</p>
<p id="p0024" num="0024">In a preferred embodiment of the assembly, the data processing module may comprise means designed to redistribute the overall power where noise is less masking instead of increasing it where noise is more intense, with consequent reduction in the energy consumed.</p>
<p id="p0025" num="0025">In order to reduce computation, with consequent reduction in both CPU load and battery load of the sound device, the assembly according to the present invention may comprise means designed to make specific approximations in SUV segmentation, thresholds and band gain adjustments.</p>
<heading id="h0004"><b>Brief Description of the Drawings</b></heading>
<p id="p0026" num="0026">The present invention and its advantages will best appear in the following description of a mode of embodiment given as a non-limiting example and referring to the appended drawings, in which:
<ul id="ul0001" list-style="none">
<li><figref idref="f0001">Figure 1</figref> represents a block diagram for the overall speech enhancement method according to the present invention,</li>
<li><figref idref="f0002">Figure 2</figref> represents a block diagram for the SUV decision algorithm according to the method of the present invention,</li>
<li><figref idref="f0003">Figure 3</figref> represents a Bark filter bank usable for both noise and speech analysis according to the method of the present invention, and</li>
<li><figref idref="f0004">Figure 4</figref> represents a block diagram for the assembly according to the present invention.</li>
</ul><!-- EPO <DP n="6"> --></p>
<heading id="h0005"><b>Best Mode for Carrying Out the Invention</b></heading>
<p id="p0027" num="0027">The following subsections give a behavioral description of the different processing blocks of the block diagram for the overall speech enhancement method according to the present invention, as illustrated by <figref idref="f0001">Figure 1</figref>, whereas the next section describes the implementation of each block in detail. The noise estimation part is described in less detail, as it constitutes a better known algorithm and is not relevant to the actual novelty of the proposal.</p>
<heading id="h0006"><u style="single">DC remove block 21</u></heading>
<p id="p0028" num="0028">Voice signals captured through a microphone may contain a DC (continuous) component. Since signal processing modules are often based on energy estimation, it is important to remove this DC component in order to avoid needlessly high offsets, especially with a limited numerical format (16-bit integer). The DC remove block implements a simple IIR filter that removes the DC component within the telephone narrow- and wide-band range while limiting the loss at other low frequencies as far as possible.</p>
<heading id="h0007"><u style="single">SUV Detection block 22</u></heading>
<p id="p0029" num="0029">A voice-only signal is typically composed of speech periods that are separated by Silence intervals. Moreover, speech periods can be subdivided into two classes, Unvoiced and Voiced sounds.</p>
<p id="p0030" num="0030">Speech periods are those when the talker is active. Roughly speaking, a speech sound can be considered as voiced if it is produced by the vibration of the vocal cords. Vowels are voiced sounds by definition. When a sound is instead pronounced so that it does not require the vocal cords to vibrate, it is called unvoiced. Only consonants can be unvoiced, but not all of them are. Silence normally refers to a period in the signal of interest where the talker is not speaking. Although it contains no speech, the signal corresponding to "silence" regions is, most of the time, rather different from zero, as it can contain many kinds of interfering signals, such as background noise, reverberation or echo.<!-- EPO <DP n="7"> --></p>
<p id="p0031" num="0031">The SUV detection block 22 separates the signal into silence, unvoiced and voiced periods. This is normally obtained by calculating a number of selected signal features, which are then weighted and fed to a suitable decision algorithm. As the whole algorithm works on a frame-by-frame basis, as is common in signal processing for computational efficiency, this block outputs signal frames, each frame being windowed before processing (frames are then overlapped at the end).</p>
<heading id="h0008"><u style="single">Speech Signal Simple Boost block 23</u></heading>
<p id="p0032" num="0032">In terms of speech intelligibility, consonants, and therefore unvoiced sounds, often convey more important information than vowels do. Furthermore, unvoiced sounds are weaker than voiced sounds and are therefore more prone to be masked by noise.</p>
<p id="p0033" num="0033">Unvoiced signals cover nearly the entire speech band, which in most cases is approximately 3.5 or 7 kHz wide (8 or 16 kHz sampling rate). This allows unvoiced portions to be boosted in a simple manner, keeping the required processing power to a minimum. The enhancement is obtained by applying a gain in the time domain to each sample so as to increase unvoiced speech power to a level at least equal to that of the background noise power. This has the effect of increasing the power of consonants relative to vowels.</p>
<heading id="h0009"><u style="single">Frequency Transform and Band Grouping block 24</u></heading>
<p id="p0034" num="0034">The processing of the voiced part is the most expensive from a computational point of view: it requires analysis in the frequency domain. The frequency coefficients are preferably calculated by applying a Short-Time Fourier Transform (STFT) to the voiced speech signal. Once the coefficients are computed, they are grouped into frequency bands whose relative importance reflects the nonlinear behavior of human hearing. In fact, from a psychoacoustic point of view, critical bands increase in width as frequency increases. Grouping is preferably obtained using a Bark-like scale. The number of critical bands has<!-- EPO <DP n="8"> --> preferably been chosen to be twenty-four, which provides sufficient frequency resolution for the purposes of noise estimation, noise reduction and speech enhancement.</p>
<heading id="h0010"><u style="single">Band Gain Adjustment 25</u></heading>
<p id="p0035" num="0035">After the signal has been transformed to the frequency domain and grouped into psycho-acoustically relevant critical bands (as is done in the noise analysis branch), the gain of each critical band is adjusted according to criteria that can improve the overall intelligibility of voiced periods of speech over noise. In particular, gain is increased inversely to the noise distribution in critical bands: the signal is increased more where noise has less energy, with the aim of reinforcing the SNR in bands that require a lower energy increase. The signal may even be reduced where noise is very strong, to preserve the energy level as far as possible.</p>
<p id="p0036" num="0036">Improvement of intelligibility is often detrimental to speech quality (perceived quality in absence of background noise). To preserve good quality a number of thresholds are used to avoid:
<ul id="ul0002" list-style="dash" compact="compact">
<li>too much signal distortion when the signal-to-noise ratio is low,</li>
<li>too much needless distortion when the overall noise is low, and</li>
<li>too much distortion after repartition of energy among critical bands.<br/>
These thresholds aim at preserving main timbre features, so that recognition of speaker is not compromised.</li>
</ul></p>
<heading id="h0011"><u style="single">Frame Gain Normalization block 26</u></heading>
<p id="p0037" num="0037">After the application of gains to each critical band of a signal frame, the frame gains are normalized depending on the power of the noise frame. If the original power of the speech frame was greater than or equal to the power of the noise frame, then the energy of the signal is kept unchanged. If the power of the noise frame was greater, masking may occur, so the speech frame power is boosted to match the noise power, taking care to avoid values high enough to saturate the signal.<!-- EPO <DP n="9"> --></p>
<p id="p0038" num="0038">After this normalization, signal is transformed back to the time domain and overlap-and-add is applied to frames to recreate a complete signal (with silence, unvoiced and voiced parts all together again).</p>
<heading id="h0012"><u style="single">Background Noise Estimation and Features Extraction block 27</u></heading>
<p id="p0039" num="0039">Background Noise Estimation consists of separating the background noise captured locally by the device microphone from noise + speech periods. Many algorithms exist for this kind of separation. A voice activity detector (VAD) is preferably used here to isolate pure noise segments, and the noise features are extracted as explained above by frequency transform and grouping into critical bands. Noise energy for each critical band is used by the enhancement algorithm outlined above.</p>
<heading id="h0013"><u style="single">Parametric Spectral Subtraction block 28</u></heading>
<p id="p0040" num="0040">Parametric Spectral Subtraction is the core of the noise reduction algorithm that can be applied to the local speech signal before transmission to the remote peer. This part has no influence on the remote speech enhancement. In any case, gains are calculated according to an Ephraim-Malah algorithm.</p>
<p id="p0041" num="0041">The proposed application preferably targets mobile device implementations. As such, the device and CPU impose important limitations in comparison to theoretical solutions, and many approximations may be necessary to reduce the computational complexity while preserving the accuracy of the result.</p>
<p id="p0042" num="0042">The following paragraphs describe examples of approximations which are preferably made in SUV segmentation, thresholds and band gain adjustments to reduce computation, with consequent reduction in both CPU load and battery load.<!-- EPO <DP n="10"> --></p>
<heading id="h0014"><u style="single">Fixed-point proposed implementation example</u></heading>
<p id="p0043" num="0043">The proposed implementation example runs completely in fixed-point arithmetic. Signals are signed short integers (16-bit dynamic range), whereas internal coefficients for frequency transforms and other analyses are 32-bit fixed-point numbers. Precision of fixed-point numbers will be detailed later in this document when important.</p>
<p id="p0044" num="0044">In terms of numerical operations, solutions are also proposed to avoid division and modulo operators, at least on a sample-by-sample basis, since these functions are often not available in device instruction sets and are consequently realized in software using hundreds or thousands of CPU cycles.</p>
<p id="p0045" num="0045">The following paragraphs replicate the structure of the overall process description and contain detail about the specific fixed-point arithmetic algorithm implementation and specific filter and formula aspects.</p>
<p id="p0046" num="0046">The DC Remove filter block 21 is applied to the audio signal frames before processing. In order to save CPU resources, and since microphone characteristics are often poor at low frequencies in mobile devices, a simple high-pass, fixed-point IIR filter is used. Cutoff frequency is approximately 200 Hz in narrowband, 60 Hz in wider bands.</p>
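<p id="p0046a">The kind of filter described above can be illustrated by the following minimal sketch. It is not the filter of the proposed implementation: the first-order DC-blocker structure, the mapping from cutoff frequency to pole radius, the Q15 coefficient format and the names <i>dc_remove</i>, <i>cutoff_hz</i> and <i>sample_rate_hz</i> are all assumptions chosen for illustration.</p>

```python
import math

def dc_remove(samples, cutoff_hz=200.0, sample_rate_hz=8000.0):
    """First-order DC blocker: y[n] = x[n] - x[n-1] + r * y[n-1]."""
    # Pole radius derived from the desired cutoff; this mapping is an
    # illustrative approximation, not taken from the source text.
    r = 1.0 - 2.0 * math.pi * cutoff_hz / sample_rate_hz
    r_q15 = int(round(r * 32768))          # pole stored as a Q15 constant
    out = []
    prev_x = 0
    prev_y = 0
    for x in samples:
        # Q15 multiply followed by a right shift; no division is needed.
        y = x - prev_x + ((r_q15 * prev_y) >> 15)
        y = max(-32768, min(32767, y))     # saturate to the 16-bit range
        out.append(y)
        prev_x, prev_y = x, y
    return out
```

<p id="p0046b">With a constant positive input, the output of this sketch starts at the input value and decays to zero, which is the expected behavior of a DC-removing high-pass filter.</p>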
<heading id="h0015"><u style="single">SUV segmentation</u></heading>
<p id="p0047" num="0047">To segment the audio signal into silence, unvoiced or voiced portions, three different features are considered: the log-energy, the normalized autocorrelation coefficient and the zero-crossing count.</p>
<heading id="h0016">The log-energy is computed as:</heading>
<p id="p0048" num="0048"><maths id="math0001" num=""><math display="block"><msub><mi>E</mi><mi>s</mi></msub><mo>=</mo><mn>10</mn><mo>×</mo><msub><mi>log</mi><mn>10</mn></msub><mo>⁢</mo><mfenced separators=""><mi>ε</mi><mo>+</mo><mfrac><mn>1</mn><mi>N</mi></mfrac><mstyle displaystyle="true"><munderover><mo>∑</mo><mrow><mi>n</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover></mstyle><msup><mi>s</mi><mn>2</mn></msup><mfenced><mi>n</mi></mfenced></mfenced></math><img id="ib0001" file="imgb0001.tif" wi="77" he="18" img-content="math" img-format="tif"/></maths><br/>
<!-- EPO <DP n="11"> -->where <i>s(n)</i> is the signal sample, N is the number of samples per frame (a 20 ms frame, for example) and ε is a small constant that avoids taking the logarithm of 0. After the logarithm is calculated, log-energy values may be stored as signed 7b/8b (16-bit) numbers.</p>
<p id="p0049" num="0049">The normalized autocorrelation coefficient at unit sample delay is approximated as: <maths id="math0002" num=""><math display="block"><msub><mi>C</mi><mn>1</mn></msub><mo>=</mo><mfrac><mstyle displaystyle="true"><munderover><mo>∑</mo><mrow><mi>n</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover><mi>s</mi><mfenced><mi>n</mi></mfenced><mi>s</mi><mfenced separators=""><mi>n</mi><mo>-</mo><mn>1</mn></mfenced></mstyle><mrow><mstyle displaystyle="true"><munderover><mo>∑</mo><mrow><mi>n</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover></mstyle><msup><mi>s</mi><mn>2</mn></msup><mfenced><mi>n</mi></mfenced></mrow></mfrac></math><img id="ib0002" file="imgb0002.tif" wi="45" he="26" img-content="math" img-format="tif"/></maths><br/>
Voiced sounds are more concentrated at low frequencies, so the normalized autocorrelation tends to be higher (close to 1) for voiced than for unvoiced segments. The sum in the denominator is an approximation of the exact formula that avoids calculating a square root. The range is of course -1 to 1 (signed 0b/15b representation).</p>
<p id="p0050" num="0050">The number of zero-crossings for a frame is computed as: <maths id="math0003" num=""><math display="block"><msub><mi>N</mi><mi>z</mi></msub><mo>=</mo><mstyle displaystyle="true"><munderover><mo>∑</mo><mrow><mi>n</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover></mstyle><mfenced open="|" close="|" separators=""><mi>sgn</mi><mfenced separators=""><mi>s</mi><mfenced><mi>n</mi></mfenced></mfenced><mo>-</mo><mi>sgn</mi><mo>⁢</mo><mfenced separators=""><mi>s</mi><mo>⁢</mo><mfenced separators=""><mi>n</mi><mo>-</mo><mn>1</mn></mfenced></mfenced></mfenced></math><img id="ib0003" file="imgb0003.tif" wi="74" he="15" img-content="math" img-format="tif"/></maths><br/>
where sgn is the sign operator. The number of zero-crossings is an integer value (15b/0b representation).</p>
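<p id="p0050a">The three features above can be sketched in floating point as follows. This is only an illustration under stated assumptions: the value of ε, the treatment of the sample preceding the frame (taken as absent) and the function names are not specified in the text, and the fixed-point formats mentioned above are not reproduced.</p>

```python
import math

def log_energy(frame, eps=1e-5):
    """E_s = 10 * log10(eps + mean of the squared samples)."""
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(eps + mean_square)

def autocorr_c1(frame):
    """Unit-delay autocorrelation divided by the frame energy."""
    energy = sum(s * s for s in frame)
    if energy == 0:
        return 0.0                      # all-zero frame: C1 defined as 0 here
    # The product for the first sample is skipped (no history assumed).
    cross = sum(frame[n] * frame[n - 1] for n in range(1, len(frame)))
    return cross / energy

def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (0 > x)

def zero_cross_count(frame):
    """N_z = sum of |sgn(s(n)) - sgn(s(n-1))|; each sign flip adds 2."""
    return sum(abs(sign(frame[n]) - sign(frame[n - 1]))
               for n in range(1, len(frame)))
```

<p id="p0050b">Note that, with the formula as written, a transition between a positive and a negative sample contributes 2 to the sum rather than 1.</p>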
<p id="p0051" num="0051"><figref idref="f0002">Figure 2</figref> represents the block diagram for the SUV decision algorithm. To decide to which class (S, U or V) the segment belongs, a distance is computed between the actual feature vector and each of the three classes. This is done by assuming that the features for each class follow a multidimensional Gaussian distribution with known mean vector and covariance matrix W<sub>i</sub>. The index i is 1, 2 or 3, corresponding respectively to the voiced, unvoiced and silence classes.</p>
<p id="p0052" num="0052">Mean vectors and covariance matrices for the three classes are obtained (trained) from a given database of speech utterances. The data is segmented<!-- EPO <DP n="12"> --> manually into silence, voiced and unvoiced, and then for each of these segments the three features above are calculated.</p>
<p id="p0053" num="0053">Once mean vectors and covariance matrices are available, the decision is taken according to the scheme of <figref idref="f0002">Figure 2</figref>. The quantity <i>d<sub>i</sub></i> is the error to be minimized in the classical minimum probability-of-error decision rule: <maths id="math0004" num=""><math display="block"><msub><mi>d</mi><mi>i</mi></msub><mo>=</mo><msup><mfenced separators=""><msub><mover><mi>x</mi><mo>→</mo></mover><mi>i</mi></msub><mo>-</mo><msub><mover><mi>m</mi><mo>→</mo></mover><mi>i</mi></msub></mfenced><mi>t</mi></msup><mo>⋅</mo><msubsup><mi>W</mi><mi>i</mi><mrow><mo>-</mo><mn>1</mn></mrow></msubsup><mo>⋅</mo><mfenced separators=""><msub><mover><mi>x</mi><mo>→</mo></mover><mi>i</mi></msub><mo>-</mo><msub><mover><mi>m</mi><mo>→</mo></mover><mi>i</mi></msub></mfenced></math><img id="ib0004" file="imgb0004.tif" wi="67" he="13" img-content="math" img-format="tif"/></maths><br/>
where <i>x</i> is the feature vector, <i>m</i> the mean vector and <i>W</i> the covariance matrix.</p>
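<p id="p0053a">As an illustration, the quadratic form above can be evaluated without explicitly inverting the covariance matrix, by solving a small linear system instead. The sketch below is written in floating point with hypothetical helper names (<i>solve</i>, <i>class_distance</i>); it assumes a non-singular covariance matrix and does not reflect the fixed-point approximations of the proposed implementation.</p>

```python
def solve(matrix, rhs):
    """Solve matrix * y = rhs by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Pick the row with the largest pivot to limit rounding errors.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def class_distance(x, mean, cov):
    """d = (x - m)^t * W^-1 * (x - m) for one class."""
    diff = [a - b for a, b in zip(x, mean)]
    weighted = solve(cov, diff)          # equals W^-1 * (x - m)
    return sum(d * w for d, w in zip(diff, weighted))
```

<p id="p0053b">With an identity covariance matrix the result reduces to the squared Euclidean distance between the feature vector and the class mean.</p>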
<p id="p0054" num="0054">Instead of using all features with the same weight to discriminate among classes, the following procedure is used, as shown in the block diagram. First the segment is tested for the Voiced class using the log-energy and the zero-crossing count. If the resulting distance <i>d<sub>1</sub></i> is minimal among the three distances, and if the log-energy is higher than a given threshold, then Voiced is decided. If the log-energy is lower than the threshold, then Silence is decided. The threshold has to be determined empirically. The actual value of the threshold is preferably 3900, relative to the 7b/8b format described above for log-energy precision.</p>
<p id="p0055" num="0055">If <i>d<sub>1</sub></i> is not minimal, then the distance <i>d<sub>3</sub></i> from the silence class with the autocorrelation feature only is calculated. If it is minimal, then Silence is decided, otherwise Unvoiced is decided.</p>
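<p id="p0055a">The two-stage decision described in the two preceding paragraphs might be sketched as follows. Several points are assumptions, since they depend on Figure 2: that the second stage compares the silence distance with the unvoiced distance, both computed from the autocorrelation feature only, and that a log-energy exactly equal to the threshold is treated as Silence. The function and parameter names are hypothetical.</p>

```python
LOG_ENERGY_THRESHOLD = 3900   # 7b/8b fixed point, i.e. about 15.2 dB

def classify_frame(log_energy_q8, d_stage1, d_stage2):
    """Return 'V', 'U' or 'S' for one frame.

    d_stage1: (d1, d2, d3) computed from log-energy and zero-crossing count.
    d_stage2: (d2, d3) computed from the autocorrelation feature (assumed).
    """
    d1, d2, d3 = d_stage1
    if min(d1, d2, d3) == d1:
        # Voiced candidate: confirm with the log-energy threshold.
        return 'V' if log_energy_q8 > LOG_ENERGY_THRESHOLD else 'S'
    d2_ac, d3_ac = d_stage2
    # Silence if its distance is the smallest, otherwise Unvoiced.
    return 'S' if d2_ac >= d3_ac else 'U'
```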
<heading id="h0017"><u style="single">Speech Signal Simple Boost</u></heading>
<p id="p0056" num="0056">Calling the power of the speech signal P<sub>s</sub>, the power of the noise signal P<sub>w</sub> and the signal-to-noise ratio SNR = P<sub>s</sub>/P<sub>w</sub>, the enhancement of unvoiced segments is obtained simply by applying a gain in the time domain to each sample, so as to increase the signal power to a level at least equal to that of the noise power.</p>
<p id="p0057" num="0057">The simple boost can be described for each sample as follows:<!-- EPO <DP n="13"> --> <maths id="math0005" num=""><math display="block"><msub><mi>s</mi><mi mathvariant="italic">enhanced</mi></msub><mfenced><mi>n</mi></mfenced><mo>=</mo><mrow><mo>{</mo><mtable><mtr><mtd><mi>s</mi><mfenced><mi>n</mi></mfenced><mo>,</mo><mi mathvariant="italic">SNR</mi><mo>≥</mo><mn>1</mn></mtd></mtr><mtr><mtd><mi>min</mi><mfenced separators=""><msub><mi>T</mi><mi mathvariant="italic">unvoiced</mi></msub><mo>⁢</mo><mfrac><mn>1</mn><msqrt><mi mathvariant="italic">SNR</mi></msqrt></mfrac></mfenced><mo>⋅</mo><mi>s</mi><mfenced><mi>n</mi></mfenced><mo>,</mo><mi mathvariant="italic">SNR</mi><mo>&lt;</mo><mn>1</mn></mtd></mtr></mtable></mrow></math><img id="ib0005" file="imgb0005.tif" wi="97" he="23" img-content="math" img-format="tif"/></maths></p>
<p id="p0058" num="0058">The parameter <i>T<sub>unvoiced</sub></i> is an adaptive threshold that avoids saturation. For each frame, the threshold is calculated as the maximum given by the chosen representation (32-bit) over the actual frame energy.</p>
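<p id="p0058a">A floating-point sketch of this simple boost is given below. It reads the second case of the formula as a gain equal to the smaller of <i>T<sub>unvoiced</sub></i> and 1/√SNR, which is an interpretation of the formula as rendered here. The threshold is passed in as a parameter rather than derived from the frame, an all-zero frame is left unchanged to avoid a division by zero, and the names are hypothetical.</p>

```python
import math

def simple_boost(frame, noise_power, t_unvoiced):
    """Raise an unvoiced frame to the noise power, capped by t_unvoiced."""
    signal_power = sum(s * s for s in frame) / len(frame)
    if signal_power == 0 or signal_power >= noise_power:
        return list(frame)               # SNR of at least 1, or empty frame
    snr = signal_power / noise_power     # here SNR is strictly below 1
    gain = min(t_unvoiced, 1.0 / math.sqrt(snr))
    return [gain * s for s in frame]
```

<p id="p0058b">When the cap is not reached, the amplitude gain 1/√SNR multiplies the frame power by 1/SNR, so that the boosted frame has the same power as the noise.</p>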
<p id="p0059" num="0059">After the STFT, frequencies are grouped into frequency bands (according to human hearing) using a Bark-like scale as represented by the <figref idref="f0003">Figure 3</figref>. The following formula is used for the single frequencies: <maths id="math0006" num=""><math display="block"><mi mathvariant="italic">Bark</mi><mo>=</mo><mn>13.1</mn><mo>⋅</mo><mi>arctan</mi><mfenced separators=""><mn>0.00074</mn><mo>⋅</mo><mi>f</mi></mfenced><mo>+</mo><mn>2.24</mn><mo>⋅</mo><mi>arctan</mi><mfenced separators=""><mn>1.85</mn><mo>⋅</mo><msup><mn>10</mn><mrow><mo>-</mo><mn>8</mn></mrow></msup><mo>⋅</mo><msup><mi>f</mi><mn>2</mn></msup></mfenced><mo>+</mo><msup><mn>10</mn><mrow><mo>-</mo><mn>4</mn></mrow></msup><mo>⋅</mo><mi>f</mi></math><img id="ib0006" file="imgb0006.tif" wi="165" he="20" img-content="math" img-format="tif"/></maths></p>
<p id="p0060" num="0060">The number of band-pass filters, and therefore the number of critical bands, is twenty-four; the result is shown in <figref idref="f0003">Figure 3</figref>.</p>
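<p id="p0060a">The Bark mapping and a possible grouping of frequency bins can be sketched as follows. The formula is the one given above; the grouping rule, which divides the Bark axis between 0 Hz and the Nyquist frequency into bands of equal Bark width, is only an assumption, as the exact band edges of Figure 3 are not stated in the text. The function names are hypothetical.</p>

```python
import math

def hz_to_bark(f):
    """Bark-like value of a frequency f in Hz, per the formula above."""
    return (13.1 * math.atan(0.00074 * f)
            + 2.24 * math.atan(1.85e-8 * f * f)
            + 1e-4 * f)

def group_bins(n_bins, sample_rate_hz, n_bands=24):
    """Assign each bin (0 Hz to Nyquist, n_bins of at least 2) to a band."""
    nyquist = sample_rate_hz / 2.0
    top = hz_to_bark(nyquist)
    bands = []
    for k in range(n_bins):
        f = k * nyquist / (n_bins - 1)
        idx = int(n_bands * hz_to_bark(f) / top)
        bands.append(min(n_bands - 1, idx))   # Nyquist bin goes in last band
    return bands
```

<p id="p0060b">Because the mapping grows more slowly at high frequencies, the bands obtained in this way contain more bins as frequency increases, in line with the widening of critical bands.</p>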
<heading id="h0018"><u style="single">Band Gain Adjustment</u></heading>
<p id="p0061" num="0061">The signal-to-noise ratio (SNR) in the frequency domain is defined as: <maths id="math0007" num=""><math display="block"><mi mathvariant="italic">SNR</mi><mo>=</mo><mfrac><mrow><mstyle displaystyle="true"><munderover><mo>∑</mo><mrow><mi>m</mi><mo>=</mo><mn>1</mn></mrow><mi>M</mi></munderover></mstyle><msup><mfenced open="|" close="|" separators=""><mi>S</mi><mfenced><mi>m</mi></mfenced></mfenced><mn>2</mn></msup></mrow><mrow><mstyle displaystyle="true"><munderover><mo>∑</mo><mrow><mi>m</mi><mo>=</mo><mn>1</mn></mrow><mi>M</mi></munderover></mstyle><msup><mfenced open="|" close="|" separators=""><mi>W</mi><mfenced><mi>m</mi></mfenced></mfenced><mn>2</mn></msup></mrow></mfrac><mo>=</mo><mfrac><msub><mi>P</mi><mi>s</mi></msub><msub><mi>P</mi><mi>w</mi></msub></mfrac></math><img id="ib0007" file="imgb0007.tif" wi="73" he="26" img-content="math" img-format="tif"/></maths><br/>
where <i>S</i> and <i>W</i> are the STFTs of the signal and of the noise respectively. To avoid unnecessary computation, the power in the frequency domain is taken directly from the power in the time domain by means of the well-known Parseval theorem: <maths id="math0008" num=""><math display="block"><mstyle displaystyle="true"><munderover><mo>∑</mo><mrow><mi>n</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover></mstyle><msup><mfenced open="|" close="|" separators=""><mi>s</mi><mfenced><mi>n</mi></mfenced></mfenced><mn>2</mn></msup><mo>=</mo><mfrac><mn>1</mn><mi>N</mi></mfrac><mstyle displaystyle="true"><munderover><mo>∑</mo><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover></mstyle><msup><mfenced open="|" close="|" separators=""><mi>S</mi><mfenced><mi>k</mi></mfenced></mfenced><mn>2</mn></msup></math><img id="ib0008" file="imgb0008.tif" wi="63" he="17" img-content="math" img-format="tif"/></maths></p>
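<p num="">As a minimal sketch of this shortcut, the frame SNR can be computed from time-domain samples alone. Because the speech and noise frames are assumed here to have the same length N, the 1/N factor of the theorem cancels in the ratio. The function names are hypothetical, and returning DBL_MAX for a noise frame of zero power is an assumption of this sketch, not something stated in the description.</p>
<p num=""><![CDATA[
```c
#include <float.h>
#include <stddef.h>

/* Sum of squared samples of one frame (time-domain power, unscaled). */
double frame_power(const double *x, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += x[i] * x[i];
    }
    return acc;
}

/* SNR = Ps / Pw for two frames of equal length n. By the theorem
 * above this equals the ratio of the frequency-domain powers.
 * Assumption: a noise frame with zero power yields DBL_MAX. */
double frame_snr(const double *speech, const double *noise, size_t n)
{
    double ps = frame_power(speech, n);
    double pw = frame_power(noise, n);
    if (pw == 0.0) {
        return DBL_MAX;
    }
    return ps / pw;
}
```
]]></p>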
<p id="p0062" num="0062">Furthermore, given the twenty-four critical bands B<sub>i</sub>, the Noise Repartition Ratio (NRR) for the i<sup>th</sup> band is calculated by the following formula: <maths id="math0009" num=""><math display="block"><msub><mi mathvariant="italic">NRR</mi><mi>i</mi></msub><mo>=</mo><mfrac><mstyle displaystyle="false"><mstyle displaystyle="true"><munder><mo>∑</mo><mrow><mi>b</mi><mo>∈</mo><msub><mi>B</mi><mi>i</mi></msub></mrow></munder></mstyle><msup><mfenced open="|" close="|" separators=""><mi>W</mi><mfenced><mi>b</mi></mfenced></mfenced><mn>2</mn></msup></mstyle><mrow><mstyle displaystyle="true"><munderover><mo>∑</mo><mrow><mi>m</mi><mo>=</mo><mn>1</mn></mrow><mi>M</mi></munderover></mstyle><msup><mfenced open="|" close="|" separators=""><mi>W</mi><mfenced><mi>m</mi></mfenced></mfenced><mn>2</mn></msup></mrow></mfrac></math><img id="ib0009" file="imgb0009.tif" wi="42" he="33" img-content="math" img-format="tif"/></maths><!-- EPO <DP n="14"> --></p>
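<p num="">The Noise Repartition Ratio can be illustrated with the following sketch, which accumulates the noise power of each bin into its critical band and divides by the total noise power. The function name, the array of squared magnitudes and the bin-to-band lookup table are hypothetical inputs assumed for this example.</p>
<p num=""><![CDATA[
```c
#include <stddef.h>

#define NUM_CRITICAL_BANDS 24

/* noise_power[m] holds |W(m)|^2 for each of the num_bins STFT bins.
 * band_of_bin[m] holds the critical band (0..23) of bin m; this
 * lookup table is an assumption of the sketch.
 * On return, nrr[i] is the share of the total noise power in band i.
 * If the total noise power is zero, all ratios are left at zero. */
void noise_repartition_ratio(const double *noise_power,
                             const int *band_of_bin,
                             size_t num_bins,
                             double nrr[NUM_CRITICAL_BANDS])
{
    double total = 0.0;

    for (int i = 0; i < NUM_CRITICAL_BANDS; ++i) {
        nrr[i] = 0.0;
    }
    for (size_t m = 0; m < num_bins; ++m) {
        int band = band_of_bin[m];
        total += noise_power[m];
        if (band >= 0 && band < NUM_CRITICAL_BANDS) {
            nrr[band] += noise_power[m];
        }
    }
    if (total > 0.0) {
        for (int i = 0; i < NUM_CRITICAL_BANDS; ++i) {
            nrr[i] /= total;
        }
    }
}
```
]]></p>
<p num="">When every bin belongs to one of the twenty-four bands, the ratios sum to one, so a band that holds little of the noise has a small ratio and a large inverse.</p>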
<p id="p0063" num="0063">The adjustment gain for each speech band is calculated as follows: <maths id="math0010" num=""><math display="block"><mi mathvariant="italic">G</mi><mfenced><mi mathvariant="italic">i</mi></mfenced><mo>=</mo><mi mathvariant="normal">α</mi><mo>+</mo><mi mathvariant="normal">β</mi><mo>⋅</mo><mi mathvariant="italic">SNR</mi><mo>+</mo><mi mathvariant="normal">γ</mi><mo>⋅</mo><mi>min</mi><mfenced separators=""><mfrac><mn>1</mn><msub><mi mathvariant="italic">NRR</mi><mi mathvariant="italic">i</mi></msub></mfrac><mo>,</mo><mi mathvariant="italic">T</mi></mfenced></math><img id="ib0010" file="imgb0010.tif" wi="77" he="15" img-content="math" img-format="tif"/></maths><br/>
where the timbre variation bias α has a value of 0.5, the SNR reference factor β has a value of 3, the noise factor γ has a value of 12, and <i>T</i> is a threshold that bounds the inverse of the NRR.</p>
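<p num="">A minimal sketch of the gain computation follows. It reads the min term as the smaller of 1/NRR and the threshold T, uses the value T = 256 given later in the description, and takes the SNR as a plain power ratio. These readings, the function name and the handling of a band with zero noise (treated as saturating at T) are assumptions of this example.</p>
<p num=""><![CDATA[
```c
/* Constants quoted in the description. */
static const double TIMBRE_BIAS_ALPHA = 0.5;
static const double SNR_FACTOR_BETA = 3.0;
static const double NOISE_FACTOR_GAMMA = 12.0;
static const double INVERSE_NRR_THRESHOLD_T = 256.0;

/* Gain for one critical band: alpha + beta * SNR + gamma * min(1/NRR_i, T).
 * Assumption: a ratio of zero (no noise in the band) saturates at T,
 * which also avoids a division by zero. */
double band_gain(double snr, double nrr_i)
{
    double inverse_nrr = INVERSE_NRR_THRESHOLD_T;

    if (nrr_i > 0.0) {
        inverse_nrr = 1.0 / nrr_i;
        if (inverse_nrr > INVERSE_NRR_THRESHOLD_T) {
            inverse_nrr = INVERSE_NRR_THRESHOLD_T;
        }
    }
    return TIMBRE_BIAS_ALPHA
         + SNR_FACTOR_BETA * snr
         + NOISE_FACTOR_GAMMA * inverse_nrr;
}
```
]]></p>
<p num="">For example, with an SNR of 1 and a band holding half of the noise power, the sketch gives 0.5 + 3 + 12 · 2 = 27.5.</p>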
<p id="p0064" num="0064">This last formula is in theory one of the most critical parts of the algorithm, since computing the inverse of the NRR can be very costly: it requires one integer division per critical band, which is expensive on mobile devices. In practice, the direct division is therefore replaced by a solution based on a property of logarithms: <maths id="math0011" num=""><math display="block"><msub><mi>log</mi><mi>b</mi></msub><mfenced><mfrac><mi>x</mi><mi>y</mi></mfrac></mfenced><mo>=</mo><msub><mi>log</mi><mi>b</mi></msub><mfenced><mi>x</mi></mfenced><mo>-</mo><msub><mi>log</mi><mi>b</mi></msub><mfenced><mi>y</mi></mfenced></math><img id="ib0011" file="imgb0011.tif" wi="70" he="17" img-content="math" img-format="tif"/></maths><br/>
so that: <maths id="math0012" num=""><math display="block"><mfrac><mn>1</mn><mi>x</mi></mfrac><mo>=</mo><msup><mn>2</mn><mrow><msub><mi>log</mi><mn>2</mn></msub><mfenced><mn>1</mn></mfenced><mo>-</mo><msub><mi>log</mi><mn>2</mn></msub><mfenced><mi>x</mi></mfenced></mrow></msup></math><img id="ib0012" file="imgb0012.tif" wi="45" he="14" img-content="math" img-format="tif"/></maths></p>
<p id="p0065" num="0065">Base 2 is chosen for efficiency on simple instruction sets (such as those of portable devices). The exponential can be obtained by a left shift by the necessary number of positions (since the binary format is used), whereas log<sub>2</sub> can be approximated by the following pseudo-code:
<tables id="tabl0001" num="0001">
<table frame="all">
<tgroup cols="2" colsep="0" rowsep="0">
<colspec colnum="1" colname="col1" colwidth="11mm" colsep="1"/>
<colspec colnum="2" colname="col2" colwidth="23mm"/>
<tbody>
<row>
<entry colsep="0">r=0;</entry>
<entry/></row>
<row>
<entry colsep="0"/>
<entry>if (x&gt;=65536)</entry></row>
<row>
<entry colsep="0"/>
<entry>{</entry></row>
<row>
<entry colsep="0"/>
<entry>x&gt;&gt;=16;</entry></row>
<row>
<entry colsep="0"/>
<entry>r += 16;</entry></row>
<row>
<entry colsep="0"/>
<entry>}</entry></row>
<row>
<entry colsep="0"/>
<entry>if(x&gt;=256)</entry></row>
<row>
<entry colsep="0"/>
<entry>{</entry></row>
<row rowsep="1">
<entry colsep="0"/>
<entry>x &gt;&gt;= 8;</entry></row><!-- EPO <DP n="15"> -->
<row>
<entry colsep="0"/>
<entry>r+=8;</entry></row>
<row>
<entry colsep="0"/>
<entry>}</entry></row>
<row>
<entry colsep="0"/>
<entry>if (x&gt;=16)</entry></row>
<row>
<entry colsep="0"/>
<entry>{</entry></row>
<row>
<entry colsep="0"/>
<entry>x &gt;&gt;= 4;</entry></row>
<row>
<entry colsep="0"/>
<entry>r+=4;</entry></row>
<row>
<entry colsep="0"/>
<entry>}</entry></row>
<row>
<entry colsep="0"/>
<entry>if (x&gt;=4)</entry></row>
<row>
<entry colsep="0"/>
<entry>{</entry></row>
<row>
<entry colsep="0"/>
<entry>x&gt;&gt;=2;</entry></row>
<row>
<entry colsep="0"/>
<entry>r+=2;</entry></row>
<row>
<entry colsep="0"/>
<entry>}</entry></row>
<row>
<entry colsep="0"/>
<entry>if (x&gt;=2)</entry></row>
<row>
<entry colsep="0"/>
<entry>{</entry></row>
<row>
<entry colsep="0"/>
<entry>r += 1;</entry></row>
<row>
<entry colsep="0"/>
<entry>}</entry></row>
<row rowsep="1">
<entry colsep="0"/>
<entry>Result = r;</entry></row></tbody></tgroup>
</table>
</tables></p>
<p id="p0066" num="0066">The result is approximate, but the computation is reduced by a factor of 15. Using this algorithm, the threshold T has an actual value of 256.</p>
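<p num="">The pseudo-code above can be written as a runnable C function and combined with a shift to approximate a reciprocal. This is a sketch under stated assumptions: the 32-bit unsigned input, the Q16 fixed-point output format and both function names are hypothetical choices made for illustration and are not specified in the description.</p>
<p num=""><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Integer part of log2(x) for x >= 1, following the pseudo-code:
 * a binary search over the position of the highest set bit. */
uint32_t ilog2_floor(uint32_t x)
{
    uint32_t r = 0;
    if (x >= 65536u) { x >>= 16; r += 16; }
    if (x >= 256u)   { x >>= 8;  r += 8; }
    if (x >= 16u)    { x >>= 4;  r += 4; }
    if (x >= 4u)     { x >>= 2;  r += 2; }
    if (x >= 2u)     { r += 1; }
    return r;
}

/* Approximate 1/x in Q16 fixed point (65536 represents 1.0) as
 * 2^(log2(1) - log2(x)) = 2^(-r), obtained with a single left shift
 * instead of a division. Exact when x is a power of two; otherwise
 * the result overestimates 1/x by a factor below 2.
 * Assumption: inputs above 2^16 underflow the Q16 range and give 0. */
uint32_t reciprocal_q16_approx(uint32_t x)
{
    assert(x != 0u);
    uint32_t r = ilog2_floor(x);
    if (r > 16u) {
        return 0u;
    }
    return (uint32_t)1u << (16u - r);
}
```
]]></p>
<p num="">For instance, an input of 4 gives 16384 in Q16, that is exactly 0.25, whereas an input of 5 also gives 16384 because only the integer part of the logarithm is used.</p>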
<heading id="h0019"><u style="single">Frame Gain Normalization</u></heading>
<p id="p0067" num="0067">Gains are normalized using the following equation: <maths id="math0013" num=""><math display="block"><mi mathvariant="italic">Gʹ</mi><mfenced><mi>i</mi></mfenced><mo>=</mo><mrow><mo>{</mo><mtable><mtr><mtd><mi>G</mi><mfenced><mi>i</mi></mfenced><mo>,</mo><msub><mi>P</mi><mi>s</mi></msub><mo>≥</mo><msub><mi>P</mi><mi>w</mi></msub></mtd></mtr><mtr><mtd><mi>min</mi><mfenced separators=""><msub><mi>T</mi><mi mathvariant="italic">voiced</mi></msub><mo>,</mo><msqrt><mfrac><msub><mi>P</mi><mi>w</mi></msub><msubsup><mi>P</mi><mi>s</mi><mi>ʹ</mi></msubsup></mfrac></msqrt><mo>⋅</mo><mi>G</mi><mfenced><mi>i</mi></mfenced></mfenced><mo>,</mo><msub><mi>P</mi><mi>s</mi></msub><mo>&lt;</mo><msub><mi>P</mi><mi>w</mi></msub></mtd></mtr></mtable></mrow></math><img id="ib0013" file="imgb0013.tif" wi="83" he="21" img-content="math" img-format="tif"/></maths><br/>
If the power of the noise frame was originally greater than that of the signal, masking is more likely to occur. It is then necessary to boost the speech frame so that it has the same power as the noise. A threshold <i>T<sub>voiced</sub></i> is set based on the initial<!-- EPO <DP n="16"> --> power of the signal to avoid saturation (in the same way as <i>T<sub>unvoiced</sub></i> is estimated above).</p>
<heading id="h0020"><u style="single">Background Noise Estimation and Features Extraction</u></heading>
<p id="p0068" num="0068">Background noise is analyzed in the same way as the remote signal: the STFT is calculated and the noise power is calculated for each critical band as explained above. The twenty-four noise coefficients are then passed to the enhancement algorithm for SNR calculation and gain modification of unvoiced and voiced segments.</p>
<p id="p0069" num="0069"><figref idref="f0004">Figure 4</figref> is a block diagram of the assembly 10 according to the present invention and shows how the different elements are connected. The source of the voice can be either a local microphone 11 or, optionally, a telecommunication unit 12 that supplies the voice of a remote speaker to a data processing module 13. The data processing module 13 combines specific algorithms that offer a perceptual improvement of the produced speech by increasing intelligibility while preserving adequate signal quality and limiting the overall power consumption as far as possible. The enhanced speech produced by the data processing module 13 is played through a speaker 14. The optional telecommunication unit 12 is able to connect to a remote system that is a source of speech, in particular a telecommunication device.</p>
</description><!-- EPO <DP n="17"> -->
<claims id="claims01" lang="en">
<claim id="c-en-0001" num="0001">
<claim-text>Method to enhance the intelligibility of speech produced by a sound device in a noisy environment, <b>characterized by</b> a combination of specific algorithms offering a perceptual improvement of the produced speech by increasing intelligibility, by saving an adequate signal quality and by saving as far as possible the overall power consumption.</claim-text></claim>
<claim id="c-en-0002" num="0002">
<claim-text>Method according to claim 1, <b>characterized in that</b> at least one algorithm is used for identifying signal segments as silence, voiced or unvoiced segments.</claim-text></claim>
<claim id="c-en-0003" num="0003">
<claim-text>Method according to claim 2, <b>characterized in that</b> the processing of unvoiced segments is simplified by applying a constant amplification to said unvoiced segments, given the reduced bandwidth of the voice signal.</claim-text></claim>
<claim id="c-en-0004" num="0004">
<claim-text>Method according to claim 2, <b>characterized in that</b> the silence segments are ignored.</claim-text></claim>
<claim id="c-en-0005" num="0005">
<claim-text>Method according to claim 1, <b>characterized in that</b> a band energy adaptation is especially conceived to avoid increases in the overall power of the long voiced segment.</claim-text></claim>
<claim id="c-en-0006" num="0006">
<claim-text>Method according to claim 5, <b>characterized in that</b> the overall power is redistributed where noise is less masking instead of increasing it where noise is more intense, with consequent reduction in the energy consumed.</claim-text></claim>
<claim id="c-en-0007" num="0007">
<claim-text>Method according to claim 1, <b>characterized in that</b> a certain amount of distortion is accepted to permit an increase in intelligibility in particular environmental conditions.</claim-text></claim>
<claim id="c-en-0008" num="0008">
<claim-text>Method according to claim 1, <b>characterized in that</b> specific approximations are made in SUV segmentation, thresholds and band gain adjustments to<!-- EPO <DP n="18"> --> reduce computation, with consequent reduction in both CPU load and battery load of the sound device.</claim-text></claim>
<claim id="c-en-0009" num="0009">
<claim-text>Assembly to enhance the intelligibility of speech produced by a sound device in a noisy environment, this assembly being designed for implementing the method according to claims 1 to 8, <b>characterized in that</b> said assembly (10) comprises at least one microphone (11), one speaker (14), and a data processing module (13) designed to combine specific algorithms offering a perceptual improvement of the produced speech by increasing intelligibility, by saving an adequate signal quality and by saving as far as possible the overall power consumption.</claim-text></claim>
<claim id="c-en-0010" num="0010">
<claim-text>Assembly according to claim 9, <b>characterized in that</b> the data processing module (13) comprises means designed to identify signal segments as silence, voiced and unvoiced segments.</claim-text></claim>
<claim id="c-en-0011" num="0011">
<claim-text>Assembly according to claim 10, <b>characterized in that</b> the means designed to identify signal segments as silence, voiced and unvoiced segments is at least one algorithm.</claim-text></claim>
<claim id="c-en-0012" num="0012">
<claim-text>Assembly according to claim 9, <b>characterized in that</b>, for simplifying the processing of the unvoiced segments, the data processing module (13) comprises means designed to apply a constant amplification to said unvoiced segments, given the reduced bandwidth of the voice signal.</claim-text></claim>
<claim id="c-en-0013" num="0013">
<claim-text>Assembly according to claim 9, <b>characterized in that</b> the data processing module (13) comprises means designed to ignore the silence segments.</claim-text></claim>
<claim id="c-en-0014" num="0014">
<claim-text>Assembly according to claim 9, <b>characterized in that</b> the data processing module (13) further comprises means designed to provide a band energy adaptation especially conceived to avoid increases in the overall power of the long voiced segment.<!-- EPO <DP n="19"> --></claim-text></claim>
<claim id="c-en-0015" num="0015">
<claim-text>Assembly according to claim 14, <b>characterized in that</b> the data processing module (13) comprises means designed to redistribute the overall power where noise is less masking instead of increasing it where noise is more intense, with consequent reduction in the energy consumed.</claim-text></claim>
<claim id="c-en-0016" num="0016">
<claim-text>Assembly according to claim 9, <b>characterized in that</b> the data processing module (13) comprises means designed to make specific approximations in SUV segmentation, thresholds and band gain adjustments to reduce computation, with consequent reduction in both CPU load and battery load of the sound device.</claim-text></claim>
</claims><!-- EPO <DP n="20"> -->
<drawings id="draw" lang="en">
<figure id="f0001" num="1"><img id="if0001" file="imgf0001.tif" wi="142" he="233" img-content="drawing" img-format="tif"/></figure><!-- EPO <DP n="21"> -->
<figure id="f0002" num="2"><img id="if0002" file="imgf0002.tif" wi="137" he="184" img-content="drawing" img-format="tif"/></figure><!-- EPO <DP n="22"> -->
<figure id="f0003" num="3"><img id="if0003" file="imgf0003.tif" wi="147" he="208" img-content="drawing" img-format="tif"/></figure><!-- EPO <DP n="23"> -->
<figure id="f0004" num="4"><img id="if0004" file="imgf0004.tif" wi="165" he="189" img-content="drawing" img-format="tif"/></figure>
</drawings>
<search-report-data id="srep" lang="en" srep-office="EP" date-produced=""><doc-page id="srep0001" file="srep0001.tif" wi="157" he="233" type="tif"/><doc-page id="srep0002" file="srep0002.tif" wi="158" he="233" type="tif"/></search-report-data>
</ep-patent-document>
