FIELD OF THE INVENTION
[0001] The present invention relates to speech coding technologies, and in particular, to
a method and apparatus for classifying sound signals.
BACKGROUND
[0002] In speech communication, only about 40% of the signal contains speech; the rest is
silence or background noise. To save transmission bandwidth, a Voice Activity
Detection (VAD) technique is applied in speech coding in the speech signal processing
field. With VAD, the coder may encode the background noise and the active speech at different
rates. That is, the coder encodes the background noise at a lower rate and encodes
the active speech at a higher rate, thus reducing the average code rate and greatly
advancing variable-rate speech coding technology.
[0003] The VAD in the related art is developed for speech signals only, and categorizes
input audio signals into only two types: noise and non-noise. Later coders such as
AMR-WB+ and SMV also cover detection of music signals, which serves as a correction and
supplement to the VAD decision. The AMR-WB+ coder is characterized in that, after VAD,
the coding mode depends on whether the input audio signal is a speech signal or a
music signal, thus minimizing the code rate while ensuring the coding quality.
[0004] The two coding modes in the AMR-WB+ are the Algebraic Code Excited Linear
Prediction (ACELP)-based coding algorithm and the Transform Coded eXcitation (TCX)-based
coding algorithm. The ACELP builds on a speech production model, exploits the
characteristics of speech, and is highly efficient in encoding speech signals. Moreover,
the ACELP technology is mature enough that it may be added to a general-purpose audio
coder to improve the speech coding quality substantially. Likewise, the TCX may be added
to a low-bit-rate speech coder to improve the quality of encoding broadband music.
[0005] Depending on complexity, the algorithm for selecting between the ACELP mode and the
TCX mode in the AMR-WB+ comes in two types: open-loop selection and closed-loop selection.
Closed-loop selection has high complexity and is the default option. It is a traversal
search selection mode based on a perceptually weighted Signal-to-Noise Ratio (SNR). Such a
selection method is rather accurate, but involves rather complicated operations and a
large amount of code.
[0006] The open-loop selection includes the following steps.
[0007] In step 101, the VAD module judges whether the signal is a non-useful signal or a
useful signal according to the Tone_flag and the sub-band energy parameter (Level[n]).
[0008] In step 102, primary mode selection (EC) is performed.
[0009] In step 103, the mode primarily determined in step 102 is corrected, and refined
mode selection is performed to determine the coding mode to be selected. Specifically,
this step is performed based on open loop pitch parameters and Immittance Spectral
Frequency (ISF) parameters.
[0010] In step 104, TCXS processing is performed. That is, when the speech signal coding
mode has been selected fewer than three consecutive times, a small-sized closed-loop
traversal search is performed to make the final determination of the coding mode, where
the speech signal coding mode is ACELP and the music signal coding mode is TCX.
[0011] In the process of implementing the present invention, the inventor finds that the
AMR-WB+ speech signal selection algorithm in the related art involves the following
defects:
- 1. The VAD module in the related art underperforms in identifying noise and some music
signals in the process of classifying signals, thus reducing accuracy of classifying
sound signals.
- 2. Calculation of the open loop pitch parameters is necessary to the ACELP coding mode,
but unnecessary to the TCX coding mode. According to the AMR-WB+ structure design,
the VAD and the open-loop mode selection algorithm involve use of the open loop pitch
parameters. Therefore, the open loop pitch needs to be calculated for all frames.
However, for non-ACELP coding modes (such as TCX), the calculation of such
parameters adds redundant complexity, increases the calculation load of coding mode
selection, and reduces efficiency.
- 3. Although the VAD algorithm is superior in speech detection and noise immunity among
the coders currently available, it may mistake music signals for noise at the hangover
of some special music signals, thus truncating the music hangover and making the
music sound unnatural.
- 4. The AMR-WB+ mode selection algorithm disregards the Signal-to-Noise Ratio (SNR)
environment of the signal, and its performance in distinguishing speech from music
deteriorates further in the case of a low SNR.
SUMMARY
[0012] A method and apparatus for classifying sound signals are provided in an embodiment
of the present invention to improve accuracy of sound signal classification.
[0013] A method for classifying and detecting sound signals in an embodiment of the present
invention includes: receiving sound signals, and determining the update rate of background
noise according to spectral distribution parameters of the background noise and spectral
distribution parameters of the sound signals; and updating the noise parameters according
to the update rate, and classifying the sound signals according to sub-band energy
parameters and updated noise parameters.
[0014] An apparatus for classifying sound signals in an embodiment of the present invention
includes: a background noise parameter updating module, configured to: determine the
update rate of background noise according to spectral distribution parameters of the
background noise and spectral distribution parameters of the current sound signals;
and send the determined update rate; and a Primary Signal Classification (PSC) module,
configured to: receive the update rate from the background noise parameter updating
module, update the noise parameters, classify the current sound signals according
to the sub-band energy parameters and updated noise parameters, and send the sound
signal type determined through classification.
[0015] In the embodiments of the present invention, the update rate of the background noise
is determined, the noise parameters are updated according to the update rate, the
signals are classified primarily according to the sub-band energy parameters and the
updated noise parameters, and the non-useful signals and the useful signals in the
received speech signals are determined, thus reducing the probability of mistaking
useful signals for noise signals and improving accuracy of classifying sound signals.
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] Figure 1 shows open loop selection of AMR-WB+ coding algorithm in the related art;
[0017] Figure 2 is a general flowchart of a method for classifying and detecting sound signals
in an embodiment of the present invention;
[0018] Figure 3 is a schematic diagram showing an apparatus for classifying sound signals
in an embodiment of the present invention;
[0019] Figure 4 is a schematic diagram showing a system in an embodiment of the present
invention;
[0020] Figure 5 is a flowchart of calculating various parameters on a coder parameter extracting
module in an embodiment of the present invention;
[0021] Figure 6 is a flowchart of calculating various parameters on another coder parameter
extracting module in an embodiment of the present invention;
[0022] Figure 7 shows composition of a PSC module in an embodiment of the present invention;
[0023] Figure 8 shows how a signal type judging module determines characteristic parameters
in an embodiment of the present invention;
[0024] Figure 9 shows how a signal type judging module performs speech judgment in an embodiment
of the present invention;
[0025] Figure 10 shows how a signal type judging module performs music judgment in an embodiment
of the present invention;
[0026] Figure 11 shows how a signal type judging module corrects a primary judgment result
in an embodiment of the present invention;
[0027] Figure 12 shows how a signal type judging module performs primary type correction
for uncertain signals in an embodiment of the present invention;
[0028] Figure 13 shows how a signal type judging module performs final type correction for
signals in an embodiment of the present invention; and
[0029] Figure 14 shows how a signal type judging module performs parameter update in an
embodiment of the present invention.
DETAILED DESCRIPTION
[0030] In order to make the technical solution, objectives and merits of the present invention
clearer, a detailed description of the present invention is given below by reference
to the accompanying drawings and preferred embodiments.
[0031] In the embodiments of the present invention, the update rate of the background noise
is determined according to the spectral distribution parameters of the current sound
signal and the background noise, and the noise parameters are updated according to
the update rate. Therefore, the useful signals and the non-useful signals in the received
speech signals are determined according to the updated noise parameters, thus improving
the accuracy of the noise parameters in determining the useful signals and non-useful
signals, and improving the accuracy of classifying sound signals.
[0032] Figure 2 shows a method for classifying and detecting sound signals in an embodiment
of the present invention, including the following process:
[0033] Block 201: Sound signals are received, and the update rate of background noise is
determined according to the spectral distribution parameters of the background noise
and the sound signals.
[0034] Block 202: The noise parameters are updated according to the update rate, and the
sound signals are classified according to sub-band energy parameters and updated noise
parameters.
[0035] In block 202, the sound signals are classified into two types: useful signals, and
non-useful signals. Afterward, the useful signals may be subdivided into speech signals
and music signals, depending on whether the noise converges. The subdividing may be
based on open loop pitch parameters, ISF parameters, and sub-band energy parameters,
or based on ISF parameters and sub-band energy parameters.
[0036] In addition, to avoid mistaking music signal hangovers for non-useful signals, which
would degrade the sound quality, the determined useful signal type is obtained in an
embodiment of the present invention. The signal hangover length is determined according to
the useful signal type, and the useful signals and the non-useful signals in the received
speech signals are further determined according to the signal hangover length. Here
the music signal hangover may be set to a relatively large value to improve the sound
quality of the music signal.
[0037] In the process of determining a useful signal to be a speech signal or a music signal,
a signal that cannot be identified accurately may first be set to an uncertain type,
the uncertain type may then be corrected according to other parameters, and the type
of the useful signal is finally determined.
[0038] Calculation of ISF parameters is not necessarily involved in the coding mode of non-useful
signals. Therefore, no ISF parameters are calculated for the determined noise signals
if the corresponding coding mode needs no calculation of ISF parameters, with a view
to reducing the calculation load in the classification process and improving the classification
efficiency.
As shown in Figure 3, an apparatus for classifying sound signals in an embodiment
of the present invention includes: a background noise parameter updating module, configured
to: determine the update rate of background noise according to the spectral distribution
parameters of the background noise and the current sound signals, and send the determined
update rate to a PSC module; and a PSC module, configured to: update the noise parameters
according to the update rate received from the background noise parameter updating
module, perform primary classification for the signals according to the sub-band energy
parameters and updated noise parameters, and determine the received speech signal
to be a useful signal or non-useful signal.
[0039] The apparatus for classifying sound signals may further include a signal type judging
module. The PSC module transfers the determined signal type to the signal type judging
module. The signal type judging module determines the type of a useful signal based
on the open loop pitch parameters, ISF parameters, and sub-band energy parameters,
or based on ISF parameters and sub-band energy parameters, where the type of the useful
signal includes speech and music.
[0040] The apparatus for classifying sound signals may further include a classification
parameter extracting module. The PSC module transfers the determined signal type to
the signal type judging module through the classification parameter extracting module.
The classification parameter extracting module is further configured to: obtain ISF
parameters and sub-band energy parameters, or further obtain open loop pitch parameters,
process the obtained parameters into signal type characteristic parameters, and send
the parameters to the signal type judging module; and process the obtained parameters
into spectral distribution parameters of sound signals and background noise, and transfer
the spectral distribution parameters to the background noise parameter updating module.
Therefore, the signal type judging module determines the type of useful signals according
to the foregoing signal type characteristic parameter and the signal type determined
by the PSC module, where the type of useful signals includes speech and music.
[0041] The PSC module may be further configured to transfer the sound signal SNR calculated
in the process of determining the signal type to the signal type judging module. The
signal type judging module determines the useful signal to be a speech signal or music
signal according to the SNR.
[0042] The apparatus for classifying sound signals may further include a coder mode and
rate selecting module. The signal type judging module transfers the determined signal
type to the coder mode and rate selecting module, and the coder mode and rate selecting
module determines the coding mode and rate of sound signals according to the received
signal type.
[0043] The apparatus for classifying sound signals may further include a coder parameter
extracting module, which is configured to extract ISF parameters and sub-band energy
parameters or additionally open loop pitch parameters, transfer the extracted parameters
to the classification parameter extracting module, and transfer the extracted sub-band
energy parameters to the PSC module.
[0044] The method for classifying and detecting sound signals and the apparatus for classifying
sound signals in an embodiment of the present invention are detailed below.
[0045] Figure 4 is a schematic diagram showing a system in an embodiment of the present
invention. The system includes a Sound Activity Detector (SAD). As required by the
coder, the SAD sorts the audio digital signals into three types: non-useful signal,
speech, and music, thus forming a basis for the coder to select the coding mode and
rate.
[0046] As shown in Figure 4, the SAD module includes: a background noise estimation control
module, a PSC module, a classification parameter extracting module, and a signal type
judging module. As a signal classifier used inside the coder, the SAD makes the most
of the parameters of the coder in order to reduce resource occupation and calculation
complexity. Therefore, the coder parameter extracting module in the coder is used
to calculate the sub-band energy parameters and coder parameters, and provide the
calculated parameters for the SAD module. Moreover, the SAD module finally outputs
a determined signal type (namely, non-useful signal, speech, or music), and provides
the determined signal type for the coder mode and rate selecting module to select
the coder mode and rate.
[0047] The SAD-related modules in the coder, sub-modules in the SAD, and the interaction
processes between the sub-modules are detailed below.
[0048] The coder parameter extracting module in the coder calculates the sub-band energy
parameters and coder parameters, and provides the calculated parameters for the SAD
module. The sub-band energy parameters may be calculated through filtering with a filter
bank. The specific quantity of sub-bands (for example, 12 sub-bands in this embodiment)
is determined according to the calculation complexity requirement and classification
accuracy requirement.
[0049] Figure 5 or Figure 6 shows how a coder parameter extracting module calculates various
parameters required by the SAD module in this embodiment.
[0050] The process shown in Figure 5 includes the following blocks:
[0051] Block 501: The coder parameter extracting module calculates the sub-band energy parameters
first.
[0052] Block 502: The coder parameter extracting module decides whether it is necessary
to perform ISF calculation according to the primary signal judgment result (Vad_flag)
received from the PSC module, and performs block 503 if necessary; or performs block
504 if not necessary.
[0053] The decision about whether to perform ISF calculation in this block is made as
follows. If the current frame is composed of non-useful signals, the mechanism of the coder
applies: if ISF parameters are required when the coder encodes non-useful signals, the
ISF calculation needs to be performed; otherwise,
the operation of the coder parameter extracting module is finished. If the current
frame is composed of useful signals, the ISF calculation needs to be performed. Most
coding modes require calculation of ISF parameters for useful signals. Therefore,
the calculation brings no redundant complexity to the coder. The technical solution
to calculation of ISF parameters is detailed in the documentation of the coders,
and is not repeated here.
[0054] Block 503: The coder parameter extracting module calculates the ISF parameters and
then performs block 504.
[0055] Block 504: The coder parameter extracting module calculates the open loop pitch parameters.
[0056] The sub-band energy parameters calculated through the process in Figure 5 are provided
for the PSC module and the classification parameter extracting module in the SAD,
and other parameters are provided for the classification parameter extracting module
in the SAD.
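By way of a non-limiting illustration, the decision made in block 502 may be sketched in Python as follows. The function name and the coder_needs_isf_for_noise switch are hypothetical and are not taken from any coder implementation; the sketch assumes that Vad_flag is 1 for a frame of useful signals and 0 for a frame of non-useful signals.

```python
def should_compute_isf(vad_flag: int, coder_needs_isf_for_noise: bool) -> bool:
    """Decide whether the ISF calculation is needed for the current frame.

    vad_flag: assumed to be 1 for useful signals and 0 for non-useful signals.
    coder_needs_isf_for_noise: hypothetical switch stating whether the coder
    requires ISF parameters when it encodes non-useful signals.
    """
    if vad_flag == 1:
        # Useful signals: the ISF calculation is always performed.
        return True
    # Non-useful signals: the mechanism of the coder applies.
    return coder_needs_isf_for_noise
```

With this form, a coder whose noise coding mode does not use ISF parameters skips the ISF calculation for every frame judged to be non-useful.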
[0057] In the process shown in Figure 6, a block is added on the basis of the process in
Figure 5, where the added block decides whether to calculate the open loop pitch
parameters depending on whether the primary noise converges. Blocks 601-603 are basically
identical to blocks 501-503 in Figure 5. In block 604, a judgment is made about whether
the primary noise parameter (namely, noise estimation) converges. If the primary noise
parameter has not yet converged, the open loop pitch parameters are calculated in block 60;
otherwise, no open loop pitch parameter is calculated.
[0058] The calculation of open loop pitch parameters is redundant for some coding modes such
as TCX. After the noise estimation converges, it is basically certain that the coding
mode corresponding to the signal does not need open loop pitch parameters. Therefore,
to simplify calculation, the open loop pitch parameters are no longer calculated after
convergence.
[0059] Before convergence of the noise estimation, the open loop pitch parameters need to
be calculated in order to ensure convergence of the noise estimation and the convergence
speed. However, such calculation occurs only at the startup stage, and its complexity
is negligible. The technical solution to calculation of open loop pitch
parameters is detailed in the documentation of ACELP-based coding, and is not repeated
here. The basis for judging whether the noise estimation converges may
be that the number of consecutive frames determined to be noise exceeds the noise
convergence threshold (THR1). In an example in this embodiment, the value of THR1 is 20.
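As a non-limiting illustration, the convergence criterion and its use for gating the open loop pitch calculation may be sketched in Python as follows. The class and function names are hypothetical. The sketch further assumes that Vad_flag is 0 for a noise frame and that the converged state latches once it has been reached; neither assumption is stated in this embodiment.

```python
THR1 = 20  # noise convergence threshold from the example in this embodiment


class NoiseConvergenceTracker:
    """Counts consecutive noise frames (hypothetical helper)."""

    def __init__(self, threshold: int = THR1) -> None:
        self.threshold = threshold
        self.noise_run = 0      # current run of consecutive noise frames
        self.converged = False  # assumed to latch once set

    def update(self, vad_flag: int) -> bool:
        """Feed one frame decision and return the convergence state."""
        if vad_flag == 0:
            self.noise_run += 1
        else:
            self.noise_run = 0
        if self.noise_run > self.threshold:
            self.converged = True
        return self.converged


def need_open_loop_pitch(converged: bool) -> bool:
    """Open loop pitch is calculated only before the noise estimate converges."""
    return not converged
```

With THR1 = 20, the tracker reports convergence on the 21st consecutive noise frame, after which the open loop pitch calculation is skipped.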
[0060] The foregoing extracted sub-band energy parameter is: level[i], where i represents
a member index of the vector, and its value falls within 1...12 in this embodiment,
corresponding to 0-200 Hz, 200-400 Hz, 400-600 Hz, 600-800 Hz, 800-1200 Hz, 1200-1600
Hz, 1600-2000 Hz, 2000-2400 Hz, 2400-3200 Hz, 3200-4000 Hz, 4000-4800 Hz, and 4800-6400
Hz, respectively.
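For illustration only, the sub-band layout may be held as a table of band edges, from which the 1-based sub-band index of a frequency is looked up. The names in this Python sketch are hypothetical, and the tenth band is taken as 3200-4000 Hz so that the twelve bands are contiguous.

```python
from bisect import bisect_right

# Band edges in Hz; sub-band i (1-based) spans BAND_EDGES_HZ[i-1] to BAND_EDGES_HZ[i].
BAND_EDGES_HZ = [0, 200, 400, 600, 800, 1200, 1600,
                 2000, 2400, 3200, 4000, 4800, 6400]


def sub_band_index(freq_hz: float) -> int:
    """Return the 1-based sub-band index for a frequency in [0, 6400) Hz."""
    if not 0 <= freq_hz < BAND_EDGES_HZ[-1]:
        raise ValueError("frequency outside the 0-6400 Hz range")
    # bisect_right counts the edges that are <= freq_hz, which is the band index.
    return bisect_right(BAND_EDGES_HZ, freq_hz)
```

The bands widen with frequency (200 Hz wide at the bottom, 1600 Hz wide at the top), so low frequencies are resolved more finely than high frequencies.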
[0061] The foregoing extracted ISF parameter is Isf_n[i], where n represents a frame index,
and the value of i falls within 1...16, representing a member index in the vector.
[0062] The foregoing extracted open loop pitch parameters include: open_loop pitch gain
(ol_gain), open_loop pitch lag (ol_lag), and tone_flag. If the value of ol_gain is
greater than the value of tone threshold (TONE_THR), the tone_flag is set to 1.
[0063] The PSC module may be implemented through various VAD algorithms in the related art,
and includes: background noise estimating sub-module, SNR calculating sub-module,
useful signal estimating sub-module, judgment threshold adjusting sub-module, comparing
sub-module, and hangover protective useful signal sub-module. In this embodiment,
as shown in Figure 7, the implementation of the PSC module may differ from the VAD
algorithm module in the related art in the following aspects:
[0064] I. The SNR calculating sub-module calculates the SNR according to the sub-band energy
estimation parameters of background noise and the sub-band energy parameters. The
calculated SNR parameter is not only applied inside the PSC module, but also transferred
to the signal type judging module so that the signal type judging module identifies
speech and music more accurately in the case of a low SNR.
[0065] II. The VAD in the related art underperforms in identifying noise and some types
of music, and is improved in this embodiment as follows. First, the calculation
of the background noise parameter is controlled by the update rate (ACC) provided
by the background noise parameter updating module. The background noise estimating
sub-module receives the update rate from the background noise parameter updating module,
updates the noise parameter, and transfers the sub-band energy estimation parameters
of background noise calculated according to the updated noise parameter to the
SNR calculating sub-module. The calculation of the update rate is detailed in the
description of the background noise parameter updating module hereinafter. In an
example of this embodiment, the update rate comes in 4 levels: acc1, acc2, acc3, and
acc4. For different update rates, different upward update parameters (update_up) and
downward update parameters (update_down) are determined, where update_up corresponds
to the upward update rate of background noise, and update_down corresponds to the
downward update rate of background noise.
[0066] Afterwards, the noise parameter may be updated by using the solution in the
AMR-WB+:

Therefore, the formula for updating noise estimation is:

Therefore, the formula for updating the spectral distribution parameter vector of
noise is:

where,
m: frame index
n: sub-band index
i: element index of spectral distribution parameter vector, i = 1,2,3,4
bckr_est: sub-band energy of background noise estimation
p̃: estimation of spectral distribution parameter vector of background noise
p : spectral distribution parameter vector of the current signal
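As a hedged illustration of how update_up and update_down may act on bckr_est, the following Python sketch assumes a first-order recursive smoothing form, in which the estimate moves toward the current sub-band level by a fraction equal to the update parameter. This form, the function name, and the rule for choosing between update_up and update_down are assumptions made for illustration and are not asserted to be the formula of this embodiment.

```python
def update_noise_estimate(bckr_est, level, update_up, update_down):
    """Return the next per-sub-band background noise estimate.

    Assumed form: est <- (1 - a) * est + a * level, where a = update_up when
    the sub-band level is above the current estimate (upward update) and
    a = update_down otherwise (downward update).
    """
    updated = []
    for est, lev in zip(bckr_est, level):
        alpha = update_up if lev > est else update_down
        updated.append((1.0 - alpha) * est + alpha * lev)
    return updated
```

Under the same assumption, each element of the spectral distribution parameter vector estimate could be smoothed toward the corresponding element of the current vector p in the same manner. A larger update parameter makes the estimate follow the signal faster, which is why different update rate levels map to different update_up and update_down values.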
[0067] III. In the VAD in the related art, hangover is used to prevent useful signals from
being mistaken for noise. The hangover length should be a trade-off between signal
protection and transmission efficiency. For traditional speech coders, the hangover length
may be a constant obtained through training. A multi-rate coder also handles audio signals
such as music. Such signals tend to have a long low-energy hangover. It is difficult for
a conventional VAD to detect such a hangover. Therefore, a relatively long hangover
is required for protection. In this embodiment, the hangover length in the hangover
protective useful signal sub-module is designed to be adaptive according to the SAD
signal judgment result. If the judgment result is a music signal (SAD_flag = MUSIC),
a long hangover (hang_len = HANG_LONG) is set; if the judgment result is a speech
signal (SAD_flag = SPEECH), a short hangover (hang_len = HANG_SHORT) is set. The detailed
setting mode is as follows:
if (SAD_flag == MUSIC)
    hang_len = HANG_LONG
else if (SAD_flag == SPEECH)
    hang_len = HANG_SHORT
else
    hang_len = 0
where,
SAD_flag: SAD judgment flag
hang_len: protective hangover length
[0068] In an example of this embodiment, HANG_LONG = 100, and HANG_SHORT = 20, which may
be measured in frames.
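For illustration, the adaptive hangover setting and one possible way of applying it to a sequence of frame decisions may be sketched in Python as follows. The apply_hangover function is a hypothetical helper: it assumes that the useful decision is simply held for hang_len frames after the raw decision drops, which is an assumption and not a statement of how the hangover protective useful signal sub-module is implemented.

```python
HANG_LONG = 100   # frames, from the example in this embodiment
HANG_SHORT = 20   # frames, from the example in this embodiment


def hangover_length(sad_flag: str) -> int:
    """Select the protective hangover length from the SAD judgment flag."""
    if sad_flag == "MUSIC":
        return HANG_LONG
    if sad_flag == "SPEECH":
        return HANG_SHORT
    return 0


def apply_hangover(raw_flags, hang_len):
    """Hold the useful decision for hang_len frames after it drops (assumed behavior)."""
    remaining = 0
    protected = []
    for flag in raw_flags:
        if flag:
            remaining = hang_len  # re-arm the hangover on every useful frame
            protected.append(1)
        elif remaining > 0:
            remaining -= 1        # spend one frame of hangover protection
            protected.append(1)
        else:
            protected.append(0)
    return protected
```

With these values, the low-energy tail of a music signal is protected five times longer than the tail of a speech signal.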
[0069] The classification parameter extracting module is configured to: calculate the parameters
required by the signal type judging module and the background noise parameter updating
module according to the Vad_flag parameter determined by the PSC module and the sub-band
energy parameters, ISF parameters, and open loop pitch parameters provided by the
coder parameter extracting module; and provide the sub-band energy parameters, ISF
parameters, open loop pitch parameters, and calculated parameters for the signal type
judging module and the background noise parameter updating module. The parameters
calculated by the classification parameter extracting module include:
1. Pitch parameter
[0070] The differences between consecutive open loop pitch lags are compared. If the
increment of the open loop pitch lag is less than a set threshold, the lag count is
incremented; if the sum of the lag counts of two consecutive frames is large enough, the
pitch is set to 1; otherwise, the pitch is set to 0. The formula for calculating the open
loop pitch lag is specified in the AMR-WB+/AMR-WB standard document.
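The pitch parameter described above may be sketched in Python as follows. The function names and both thresholds (lag_diff_thr and lag_count_thr) are hypothetical, since no values are given in this embodiment, and the increment of the lag is assumed to mean the absolute difference between consecutive lags.

```python
def lag_count(lags, prev_lag, lag_diff_thr):
    """Count lags in one frame that differ from their predecessor by less than lag_diff_thr."""
    count = 0
    for lag in lags:
        if abs(lag - prev_lag) < lag_diff_thr:
            count += 1
        prev_lag = lag
    return count


def pitch_flag(count_prev_frame, count_cur_frame, lag_count_thr):
    """Set pitch to 1 when the lag counts of two consecutive frames sum high enough."""
    return 1 if count_prev_frame + count_cur_frame >= lag_count_thr else 0
```

A stable pitch lag over two consecutive frames therefore yields pitch = 1, while erratic lags, as are typical of noise, yield pitch = 0.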
2. Longtime signal correlation value parameter (meangain)
[0071] The meangain is a moving average of tones of three adjacent frames, where tone =
1000*tone_flg. The definition of tone_flg is the same as that in the AMR-WB+.
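As an illustration, the moving average may be kept with a three-frame window as in the following Python sketch. The class name is hypothetical, and the sketch assumes that, during the first two frames, the average is taken over the frames available so far.

```python
from collections import deque


class MeanGain:
    """Moving average of tone = 1000 * tone_flg over adjacent frames."""

    def __init__(self, window: int = 3) -> None:
        self.tones = deque(maxlen=window)  # oldest frame drops out automatically

    def update(self, tone_flg: int) -> float:
        """Add the tone flag of the current frame and return meangain."""
        self.tones.append(1000 * tone_flg)
        return sum(self.tones) / len(self.tones)
```

Because tone_flg is 0 or 1, meangain takes values between 0 and 1000.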
[0072] 3. Zero Cross Rate (zcr)
where II{A} is 1 when A is true, and is 0 when A is false.
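For illustration only, a zero cross rate may be computed with such an indicator function as in the following Python sketch. The sketch counts sign changes between adjacent samples and normalizes by the number of adjacent pairs; this normalization and the function name are assumptions and are not asserted to match the formula of this embodiment.

```python
def zero_cross_rate(samples):
    """Fraction of adjacent sample pairs whose product is negative."""
    if len(samples) < 2:
        return 0.0
    # The comparison a * b < 0 plays the role of the indicator function.
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)
    return crossings / (len(samples) - 1)
```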
[0073] 4. Time domain fluctuation of sub-band energy (t_flux)

where short_mean_level_energy represents short-time average energy.
[0074] 5. Ratio of high sub-band energy to low sub-band energy (ra)

Given below is an instance of the present invention:
sublevel_high_energy = level[10] + level[11];
sublevel_low_energy = level[0] + level[1] + level[2] + level[3] + level[4] + level[5] +
level[6] + level[7] + level[8] + level[9];
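A minimal Python sketch of this instance follows. It assumes that ra is the quotient of sublevel_high_energy and sublevel_low_energy, as the name of the parameter suggests; the function name and the handling of a zero denominator are hypothetical.

```python
def high_low_energy_ratio(level):
    """Ratio of high sub-band energy to low sub-band energy for 12 sub-bands."""
    if len(level) != 12:
        raise ValueError("expected 12 sub-band energies")
    sublevel_high_energy = level[10] + level[11]
    sublevel_low_energy = sum(level[0:10])
    if sublevel_low_energy <= 0:
        return float("inf")  # assumed convention for an empty low band
    return sublevel_high_energy / sublevel_low_energy
```

Here level is indexed from 0 to 11, so level[10] and level[11] are the two sub-bands above 4000 Hz.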
[0075] 6. Frequency domain fluctuation of sub-band energy (f_flux)

[0076] 7. ISF mean short-time distance (isf_meanSD): average of ISF distance (Isf_SD) of
five adjacent frames, where

[0077] 8. Sub-band energy standard deviation mean (level_meanSD) parameter: average of the
sub-band energy standard deviation (level_SD) of two adjacent frames, where the calculation
method of the level_SD parameter is similar to the calculation method of the Isf_SD
described above.
[0078] Among the foregoing 8 parameters, the parameters provided for the background noise
parameter updating module include: zcr, ra, f_flux, and t_flux; the parameters provided for
the signal type judging module include: pitch, meangain, isf_meanSD, and level_meanSD.
[0079] The signal type judging module is configured to sort the signals into non-useful (such
as noise), speech, and music according to the snr and Vad_flag parameters received
from the PSC module and the sub-band energy parameter, pitch, meangain, isf_meanSD,
and level_meanSD parameters received from the classification parameter extracting
module. The signal type judging module may include:
a parameter updating sub-module, configured to: update the threshold in the signal
type judgment process according to the SNR, and provide the updated threshold for
a judging sub-module; and
a judging sub-module, configured to: receive the sound signal type from the PSC module,
determine the type of the useful signals in the sound signals based on the open loop
pitch parameter, ISF parameter, sub-band energy parameter, and updated threshold,
or based on the ISF parameter and sub-band energy parameter and the updated threshold,
and send the determined type of the useful signals to the coder mode and rate selecting
module.
[0080] The process of determining a useful signal to be a speech signal or music signal
includes:
firstly, setting both the speech flag bit and the music flag bit to 0, primarily sorting
the signals into speech, music, and uncertain signals according to the pitch
parameter flag, longtime signal correlation value, isf_meanSD, and level_meanSD, and
modifying the value of the speech flag bit or music flag bit according to the primarily
determined speech or music;
secondly, correcting the primarily determined speech, music, and uncertain signals
according to: sub-band energy, longtime signal correlation value, level_meanSD, speech_flag,
music_flag, whether the number of continuous frames whose pitch value is 1 exceeds
the preset hangover frame threshold, number of continuous music frames, number of
continuous speech frames, and type of the previous frame; and determining the type
of useful signals, where the type of a useful signal includes speech signal and music
signal.
[0081] The process of determining a useful signal to be a speech signal or music signal
is detailed below.
[0082] In order to ensure stability of judging signals and avoid frequent conversion of
judgment results, this embodiment provides a parameter flag hangover mechanism. The
characteristic parameter values such as pitch_flag, level_meanSD_high_flag, ISF_meanSD_high_flag,
ISF_meanSD_low_flag, level_meanSD_low_flag, and meangain_flag are determined according
to the hangover mechanism, as shown in Figure 8.
[0083] In Figure 8, the length of the hangover period is determined according to the hangover
parameter flag value. This embodiment provides two types of hangover settings (namely,
two solutions to determining the hangover parameter flag value).
[0084] In the first hangover setting solution, when the parameter value is higher or lower
than a threshold, the corresponding parameter hangover counter value increases by
one; otherwise, the corresponding parameter hangover counter value is set to 0, and
different parameter hangover flags are set according to the value of the parameter
hangover counter. If the value of the parameter hangover counter is higher, the parameter
hangover flag value is greater. The specific value is determined as required at the
time of setting the parameter hangover flag value according to the parameter counter,
and is not described here any further.
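As a non-limiting illustration of the first hangover setting solution, the following Python sketch keeps one counter per parameter and derives the flag value from the counter. The class name and the steps tuple, which lists the counter values at which the flag grows, are hypothetical, because the specific mapping from counter to flag is left open in this embodiment.

```python
class ParamHangoverCounter:
    """Parameter hangover counter with a flag that grows with the counter."""

    def __init__(self, threshold, above=True, steps=(1, 3, 5)):
        self.threshold = threshold
        self.above = above    # True: count values above the threshold; False: below
        self.steps = steps    # hypothetical counter values at which the flag increases
        self.counter = 0

    def update(self, value):
        """Feed one parameter value and return the parameter hangover flag."""
        hit = value > self.threshold if self.above else value < self.threshold
        # Increase the counter by one on a hit; otherwise set it to 0.
        self.counter = self.counter + 1 if hit else 0
        # A higher counter gives a greater flag value.
        return sum(1 for step in self.steps if self.counter >= step)
```

A single frame that falls on the other side of the threshold resets the counter, so only sustained behavior of a parameter produces a large flag value.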
[0085] In the second hangover setting solution, the hangover length is controlled according
to the Error Rate (ER) of the internal nodes of the decision tree corresponding to
the training parameter. If the ER is lower, the hangover is shorter; if the ER is
higher, the hangover is longer.
[0086] Afterwards, if the current signal is classified as a useful signal, the signal is
primarily sorted into either speech or music:
[0087] Firstly, primary speech judgment is performed. As shown in Figure 9, in block 901,
the speech flag bit is set to 0, and then in block 902, a judgment is made about whether
the Isf_meanSD is greater than the first ISF speech threshold (such as 1500). If the
Isf_meanSD is greater than the first ISF speech threshold, the speech flag bit is
set to 1; otherwise,
in block 903, a judgment is made about whether the pitch value is 1 and the pitch
lag value (t_top_mean) obtained by switching the pitch search on and off is
less than the pitch speech threshold (such as 40). If yes, the speech flag bit is
set to 1; otherwise,
in block 904, a judgment is made about whether the number of continuous frames whose
pitch value is 1 exceeds the preset threshold of the number of hangover frames (such
as 2 frames). If yes, the speech flag bit is set to 1; otherwise,
in block 905, a judgment is made about whether the meangain exceeds the preset threshold
of the longtime correlation speech (such as 8000). If yes, the speech flag bit is
set to 1; otherwise,
in block 906, a judgment is made about whether either or both of the level_meanSD_high_flag
value and the ISF_meanSD_high_flag value are 1. If yes, the speech flag bit is set
to 1; otherwise, the value of the speech flag bit remains unchanged.
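The cascade of blocks 901 to 906 can be sketched in Python as follows. The `FrameFeatures` container, its field names, and the function name are hypothetical, and the default thresholds are the example values quoted above.

```python
from dataclasses import dataclass


@dataclass
class FrameFeatures:
    isf_mean_sd: float            # ISF_meanSD
    pitch: int                    # pitch value (0 or 1)
    t_top_mean: float             # pitch lag value
    pitch_run: int                # continuous frames whose pitch value is 1
    meangain: float               # long-time correlation parameter
    level_mean_sd_high_flag: int
    isf_mean_sd_high_flag: int


def primary_speech_flag(f, isf_th=1500, pitch_th=40, hangover_frames=2, gain_th=8000):
    """Return 1 as soon as one speech condition of blocks 902 to 906 holds, else 0."""
    if f.isf_mean_sd > isf_th:                                  # block 902
        return 1
    if f.pitch == 1 and f.t_top_mean < pitch_th:                # block 903
        return 1
    if f.pitch_run > hangover_frames:                           # block 904
        return 1
    if f.meangain > gain_th:                                    # block 905
        return 1
    if f.level_mean_sd_high_flag == 1 or f.isf_mean_sd_high_flag == 1:  # block 906
        return 1
    return 0                                                    # flag keeps its initial value from block 901
```

Because every block can only raise the flag, the conditions behave like a logical OR evaluated in a fixed order.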
[0088] Afterwards, primary music judgment is performed, as shown in Figure 10:
[0089] In block 1001, the music flag bit is set to 0 first, and then in block 1002, a judgment
is made about whether the signal fulfills both ISF_meanSD_low_flag = 1 and level_meanSD_low_flag
= 1. If yes, the music flag bit (music_flag) is set to 1; otherwise, the value of the
music flag bit remains unchanged.
[0090] Afterwards, as shown in Figure 11, the primary judgment result is corrected:
[0091] In block 1101, a judgment is made about whether the instant energy of the sub-band
is less than the sub-band energy threshold (such as 5000). If yes, the process proceeds
to block 1102; otherwise, the signal is determined to be of the uncertain type.
[0092] In block 1102, a judgment is made about whether meangain_flag is 1 and the count of
continuous music frames is less than the speech judgment threshold for that count (such
as 3). If yes, the signal is determined to be a speech signal; otherwise,
in block 1103, a judgment is made about whether the ISF_meanSD value exceeds the preset
second ISF speech threshold (such as 2000). If yes, the signal is determined
to be a speech signal; otherwise,
in block 1104, a judgment is made about whether the level_energy is less than 10000
and more than five frames are previously determined to be noise. If yes, the current
signal type is set to the uncertain type, with a view to reducing the probability
of mistaking noise for music; otherwise,
in block 1105, a judgment is made about whether both the music flag bit and the speech
flag bit are 1s. If yes, the current signal type is determined to be the uncertain
type; otherwise,
in block 1106, a judgment is made about whether both the music flag bit and the speech
flag bit are 0s. If yes, the current signal type is determined to be the uncertain
type; otherwise,
in block 1107, a judgment is made about whether the music flag bit is 0 and the speech
flag bit is 1. If yes, the current signal type is determined to be the speech type;
otherwise,
in block 1108, because the music flag bit is 1 and the speech flag bit is 0, the current
signal type is determined to be the music type.
[0093] After the signal is determined to be of the uncertain type in the foregoing blocks
1104, 1105 and 1106, block 1109 is performed to judge whether pitch_flag is 1, the
ISF_meanSD is less than the ISF music threshold (such as 900), and the number of continuous
speech frames is less than 3. If yes, the signal is determined to be of the music
type; otherwise, the signal is still determined to be of the uncertain type.
[0094] After the signal is determined to be of the speech type in the foregoing blocks 1103
and 1107, block 1110 is performed to judge whether the number of continuous music
frames is greater than 3 and the ISF_meanSD is less than the ISF music threshold.
If yes, the signal is determined to be a music signal; otherwise, the signal is determined
to be a speech signal.
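The correction flow of blocks 1101 to 1110 can be sketched in Python as shown below. The function and argument names are hypothetical, the thresholds are the example values from the description, and the branch structure follows the text as written (only the outcomes of blocks 1104 to 1106 go through block 1109, and only those of blocks 1103 and 1107 go through block 1110).

```python
def correct_primary_judgment(
    speech_flag, music_flag, *, subband_energy, level_energy, isf_mean_sd,
    meangain_flag, pitch_flag, music_count, speech_count, noise_count,
):
    """Return "speech", "music" or "uncertain" for one frame."""
    isf_music_th = 900

    def recheck_uncertain():                      # block 1109
        if pitch_flag == 1 and isf_mean_sd < isf_music_th and speech_count < 3:
            return "music"
        return "uncertain"

    def recheck_speech():                         # block 1110
        if music_count > 3 and isf_mean_sd < isf_music_th:
            return "music"
        return "speech"

    if not subband_energy < 5000:                 # block 1101
        return "uncertain"
    if meangain_flag == 1 and music_count < 3:    # block 1102
        return "speech"
    if isf_mean_sd > 2000:                        # block 1103
        return recheck_speech()
    if level_energy < 10000 and noise_count > 5:  # block 1104
        return recheck_uncertain()
    if speech_flag == music_flag:                 # blocks 1105 and 1106
        return recheck_uncertain()
    if speech_flag == 1:                          # block 1107
        return recheck_speech()
    return "music"                                # block 1108


# Hypothetical frame measurements used to exercise the branches.
base = dict(subband_energy=1000, level_energy=20000, isf_mean_sd=1000,
            meangain_flag=0, pitch_flag=0, music_count=0, speech_count=0,
            noise_count=0)
result = correct_primary_judgment(1, 0, **base)
```

Blocks 1105 and 1106 are merged into one comparison here because both lead to the same recheck; a faithful port of the flowchart could keep them separate.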
[0095] After the speech signals and music signals are determined through the foregoing process,
the signals of the uncertain type undergo the primary corrective classification process
shown in Figure 12, including:
[0096] In block 1201, a judgment is made about whether the level_energy is less than the
sub-band energy threshold for the uncertain type (such as 5000). If yes, the signal
type is still determined to be the uncertain class; otherwise,
in block 1202, a judgment is made about whether the number of continuous music frames
is greater than 1 and ISF_meanSD is less than the ISF music threshold. If yes, the
signal is determined to be of the music class; otherwise,
the speech and music hangover flags are cleared. If the signals before this frame
are continuous speech signals and the continuity is strong, the speech is judged according
to the characteristic parameters of the speech. If the speech conditions are fulfilled,
the speech_hangover_flag is set to 1, as illustrated in blocks 1203 to 1206 in Figure
12. If the signals before this frame are continuous music signals and the continuity
is strong, the music is judged according to the characteristic parameters of the music.
If the music conditions are fulfilled, the music_hangover_flag is set to 1, as illustrated
in blocks 1207 to 1210 in Figure 12.
[0097] Afterwards, as illustrated in blocks 1211 to 1216 in Figure 12, if the speech hangover
flag is 1 and the music hangover flag is 0, the current signal type is set to the
speech class. If the music hangover flag is 1 and the speech hangover flag is 0, the
current signal type is set to the music class. If both the music hangover flag and
the speech hangover flag are 1 or both are 0, the signal type is set to the uncertain
class. In this case, if more than 20 previous music frames are continuous, the signal
is determined to be of the music class; if more than 20 previous speech frames are
continuous, the signal is determined to be of the speech class.
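The resolution of the two hangover flags in blocks 1211 to 1216 can be sketched in Python as follows. The function and argument names are hypothetical, and the order in which the two long-run checks are tried is an assumption, since the text does not rank them.

```python
def resolve_hangover_flags(speech_hangover_flag, music_hangover_flag,
                           music_run, speech_run, long_run=20):
    """Return "speech", "music" or "uncertain" from the two hangover flags.

    music_run / speech_run: number of continuous previous music / speech frames.
    """
    if speech_hangover_flag == 1 and music_hangover_flag == 0:
        return "speech"
    if music_hangover_flag == 1 and speech_hangover_flag == 0:
        return "music"
    # Both flags agree (1/1 or 0/0): uncertain unless a long preceding run decides.
    if music_run > long_run:
        return "music"
    if speech_run > long_run:
        return "speech"
    return "uncertain"
```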
[0098] After the foregoing primary correction is performed, the useful signal type is finally
corrected as shown in Figure 13. The type is further corrected according to the current context.
In block 1301, if the current context is music and the continuity is longer than 3
seconds, namely, the current continuous music frames are more than 150 frames, mandatory
correction may be performed according to the ISF_meanSD value to determine the music
signal. In block 1302, if the current context is speech and the continuity is longer
than 3 seconds, namely, the current continuous speech frames are more than 150 frames,
mandatory correction may be performed according to the ISF_meanSD value to determine
the speech signal class. Afterwards, if the signal type is still uncertain, the signal
type is corrected according to the previous context in block 1303, namely, the current
uncertain signal type is sorted into the previous signal type.
[0099] After the type of useful signals is determined in the foregoing process, the three
type counters and the threshold values in the signal type judging module need to be
updated. For the three type counters, if the current type is music (signal_sort =
music), the music counter (music_continue_counter) increases by 1; otherwise, the
music counter is cleared. Other type counters are processed similarly as shown in
Figure 14, and are not detailed here any further. The threshold values are updated
according to the SNR output by the PSC module. The threshold examples given in the
embodiments herein are the values learned in the case that the SNR is 20 dB.
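The counter update described above can be sketched in Python as follows. Only `music_continue_counter` is named in the text; the dictionary layout and the `"speech"` and `"noise"` labels used for the other two counters are assumptions.

```python
def update_type_counters(counters, signal_sort):
    """Increment the counter of the current type and clear all other counters."""
    return {kind: (count + 1 if kind == signal_sort else 0)
            for kind, count in counters.items()}


# Example: four consecutive frame decisions (made-up sequence).
counters = {"music": 0, "speech": 0, "noise": 0}
for signal_sort in ["music", "music", "speech", "music"]:
    counters = update_type_counters(counters, signal_sort)
```

Each counter therefore holds the length of the current run of its type, which is what the continuity checks in the correction steps rely on.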
[0100] The background noise parameter updating module uses some spectral distribution parameters
calculated in the classification parameter extracting module in the SAD to control
the update rate of the background noise. In the actual application environment, the
energy level of the background noise may surge abruptly. In this case, it is probable
that the background noise estimation remains non-updated because the signals are continuously
determined to be useful signals. Such a problem is solved by the background noise
parameter updating module.
[0101] The background noise parameter updating module calculates the vector of relevant
spectral distribution parameters according to the parameters received from the classification
parameter extracting module. The vector includes the following elements:
zero cross rate short-time mean (zcr_mean)
short-time mean of ratio of high sub-band energy to low sub-band energy (RA)
short-time mean of frequency domain fluctuation (f_flux) of sub-band energy
short-time mean of time domain fluctuation (t_flux) of sub-band energy
where the zcr_mean is calculated in the following way, and other elements are calculated
similarly:

where ALPHA = 0.96 and m represents a frame index.
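The expression for zcr_mean is not reproduced in this text. One form that is consistent with a smoothing constant ALPHA and a frame index m is a first-order recursive average, sketched below in Python. This form, the function name, and the argument names are assumptions, not the embodiment's confirmed formula.

```python
ALPHA = 0.96  # smoothing constant quoted in the description


def short_time_mean(previous_mean, current_value, alpha=ALPHA):
    """Assumed recursive average: mean(m) = alpha * mean(m - 1) + (1 - alpha) * value(m)."""
    return alpha * previous_mean + (1.0 - alpha) * current_value


# Example: smooth a made-up sequence of per-frame zero cross rates.
zcr_mean = 0.0
for zcr in [0.30, 0.32, 0.29, 0.31]:
    zcr_mean = short_time_mean(zcr_mean, zcr)
```

Under this assumption, a value of ALPHA close to 1 makes the mean follow slow trends and damps frame-to-frame jitter.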
[0102] This embodiment makes use of the stable spectral features of the background noise.
The elements of the spectral distribution parameter vector are not limited to the
4 elements listed above. The update rate of the current background noise is controlled
by a difference (dcb) between the current spectral distribution parameter and the spectral
distribution parameter estimation of the background noise. The difference may be implemented
through algorithms such as the Euclidean distance or the Manhattan distance. An instance of
the present invention adopts the Manhattan distance (a distance calculation method similar
to the Euclidean distance):

\[ d_{cb} = \sum_{i} \left| p_i - \tilde{p}_i \right| \]

where p is the spectral distribution parameter vector of the current signal, p̃ is the
spectral distribution parameter vector estimation of the background noise, and i indexes
the elements of the vectors.
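As a small worked illustration, the Manhattan distance between the two parameter vectors can be computed in Python as follows. The function name and the numeric vectors are made up for the example; the element order (zcr_mean, RA, f_flux, t_flux) simply mirrors the list given earlier.

```python
def manhattan_distance(p, p_noise):
    """Sum of absolute element-wise differences between two equal-length vectors."""
    if len(p) != len(p_noise):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(p, p_noise))


# Hypothetical values for (zcr_mean, RA, f_flux, t_flux).
current = [0.2, 1.5, 0.1, 0.4]
noise_estimate = [0.1, 1.0, 0.1, 0.1]
d_cb = manhattan_distance(current, noise_estimate)
```

A small distance means the current frame resembles the background noise estimate in its spectral distribution.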
[0103] In an example of this embodiment, if dcb < TH1, the module outputs an update rate acc1,
which represents the fastest update rate; otherwise, if dcb < TH2, the module outputs an
update rate acc2; otherwise, if dcb < TH3, the module outputs an update rate acc3; otherwise,
the module outputs an update rate acc4. TH1, TH2, TH3 and TH4 are update thresholds, and the
specific threshold values depend on the actual environment conditions.
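The threshold ladder can be sketched in Python as follows. The function name is hypothetical, and the threshold and rate values in the example are placeholders, since the description leaves them to the actual environment.

```python
def select_update_rate(d_cb, thresholds, rates):
    """Pick the update rate for a distance d_cb.

    thresholds: ascending (TH1, TH2, TH3); rates: (acc1, acc2, acc3, acc4),
    with acc1 the fastest. Falls through to the last rate when no threshold matches.
    """
    if len(rates) != len(thresholds) + 1:
        raise ValueError("expected one more rate than thresholds")
    for threshold, rate in zip(thresholds, rates):
        if d_cb < threshold:
            return rate
    return rates[-1]


# Placeholder thresholds and rates for illustration only.
THRESHOLDS = (0.1, 0.5, 1.0)
RATES = (0.5, 0.25, 0.1, 0.02)
rate = select_update_rate(0.3, THRESHOLDS, RATES)
```

The closer the current spectral distribution is to the noise estimate, the faster the background noise is allowed to update.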
[0104] In the embodiments of the present invention, the update rate of the background noise
is determined, the noise parameters are updated according to the update rate, the
signals are classified primarily according to the sub-band energy parameters and the
updated noise parameters, and the non-useful signals and the useful signals in the
received speech signals are determined, thus reducing the probability of mistaking
useful signals for noise signals and improving accuracy of classifying sound signals.
[0105] It is understandable to those skilled in the art that the embodiments of the present
invention may be implemented through software running on a general-purpose hardware
platform, or through hardware only. In most cases, however, software running on a
general-purpose hardware platform is preferred. Therefore, the technical solution under
the present invention, or its contributions to the related art, may be embodied by a software
product. The software product is stored in a storage medium and incorporates several
instructions so that a computer device (for example, PC, server, or network device)
may execute the method in each embodiment of the present invention.
[0106] Described above are preferred embodiments of the present invention. In practice,
those skilled in the art may make modifications to the method under the present invention
to meet the specific requirements. Although the invention has been described through
some exemplary embodiments, the invention is not limited to such embodiments.
1. A method for classifying sound signals, comprising:
(a) receiving the sound signals, and determining an update rate of background noise
according to spectral distribution parameters of the background noise and spectral
distribution parameters of the sound signals; and
(b) updating noise parameters according to the update rate, and classifying the sound
signals according to sub-band energy parameters and the updated noise parameters.
2. The method of claim 1, wherein after (b), the method further comprises:
(c) determining the type of useful signals obtained through classification based on
an open loop pitch parameter, an Immittance Spectral Frequency (ISF) parameter, and
a sub-band energy parameter, wherein the type of the useful signals comprises speech
and music.
3. The method of claim 2, wherein before (c), the method further comprises:
(c0) detecting whether noise estimation converges; if the noise estimation converges,
performing c1; otherwise, performing c; and
(c1) determining the type of the useful signals obtained through the classification
based on the ISF parameter and the sub-band energy parameter, wherein the type of
the useful signals comprises the speech and the music.
4. The method of claim 3, wherein the process of detecting whether the noise estimation
converges in c0 is:
judging whether the number of continuous noise frames before a received sound signal
exceeds a preset noise convergence threshold; if the number of continuous noise frames
exceeds the preset noise convergence threshold, determining that the noise estimation
converges; otherwise, determining that the noise estimation does not converge.
5. The method of claim 2, wherein (b) further comprises:
obtaining the determined type of the useful signals, determining a signal hangover
length according to the type of the useful signals, and classifying the sound signals
according to the signal hangover length.
6. The method of claim 2, wherein (c) further comprises:
initializing a speech flag bit and a music flag bit; determining the type of the useful
signals primarily according to a pitch parameter flag, a longtime signal correlation
parameter, an isf_meanSD parameter, a level_meanSD parameter, and corresponding thresholds,
wherein the type is speech, music, or uncertain; and modifying the speech flag bit
and the music flag bit according to the primarily determined speech and music;
correcting the primarily determined speech, music, and uncertain signals according
to: sub-band energy, the longtime signal correlation parameter, the level_meanSD parameter,
the speech flag bit, the music flag bit, whether a count of continuous frames whose
pitch parameter flag value is 1 exceeds a preset hangover frame threshold, a count
of continuous music frames, a count of continuous speech frames, and the type of a
previous frame and corresponding thresholds; and correcting the primarily determined
speech, music or uncertain signals; and finally determining the type of the useful
signals, where the type of the useful signals comprises speech and music.
7. The method of claim 6, wherein the threshold is adjusted according to a Signal-to-Noise
Ratio (SNR) of the sound signals.
8. The method of claim 1, wherein after (b), the method further comprises:
(d) determining a coding mode corresponding to non-useful signals obtained through
the classification, and determining whether it is necessary to calculate an Immittance
Spectral Frequency (ISF) parameter according to the determined coding mode.
9. The method of claim 1, wherein the noise parameters in (b) comprise: a noise estimation
parameter, and a noise spectral distribution parameter.
10. The method of claim 1 or claim 9, wherein (a) comprises:
calculating a difference between the spectral distribution parameter of the sound
signals and the spectral distribution parameter of the background noise, and determining
the update rate according to the difference.
11. The method of claim 10, wherein the spectral distribution parameters involved in calculation
of the difference comprise:
Zero Cross Rate (ZCR) short-time mean, short-time mean of ratio of high sub-band energy
to low sub-band energy, short-time mean of sub-band energy frequency domain fluctuation,
and short-time mean of sub-band energy time domain fluctuation.
12. An apparatus for classifying sound signals, comprising:
a background noise parameter updating module, configured to: determine an update rate
of background noise according to a spectral distribution parameter of the background
noise and spectral distribution parameters of current sound signals, and send the
determined update rate; and
a Primary Signal Classification (PSC) module, configured to: receive the update rate
from the background noise parameter updating module, update noise parameters, classify
the current sound signals according to a sub-band energy parameter and the updated
noise parameters, and send a sound signal type determined through classification.
13. The apparatus of claim 12, further comprising a signal type judging module, configured
to:
receive the sound signal type from the PSC module;
determine the type of useful signals in the sound signals based on an open loop pitch
parameter, an Immittance Spectral Frequency (ISF) parameter, and a sub-band energy
parameter, or based on the ISF parameter and the sub-band energy parameter, wherein
the type of the useful signals comprises speech and music; and
send the determined type of the useful signals.
14. The apparatus of claim 13, further comprising a classification parameter extracting
module, configured to:
receive the sound signal type from the PSC module, and transfer the sound signal type
to the signal type judging module; and
obtain the ISF parameter and the sub-band energy parameter, or further obtain the
open loop pitch parameter, process the obtained parameters into signal type characteristic
parameters, and send the parameters to the signal type judging module; and
process the obtained parameters into the spectral distribution parameter of the sound
signals and the spectral distribution parameter of the background noise, and transfer
the spectral distribution parameters to the background noise parameter updating module;
and
the signal type judging module determines the type of the useful signals according
to the signal type characteristic parameter and the sound signal type determined by
the PSC module, wherein the type of the useful signals comprises speech and music.
15. The apparatus of claim 13 or claim 14, wherein the PSC module comprises:
a background noise estimating sub-module, a Signal-to-Noise Ratio (SNR) calculating
sub-module, a useful signal estimating sub-module, a judgment threshold adjusting
sub-module, a comparing sub-module, and a hangover protective useful signal sub-module;
wherein
the background noise estimating sub-module is configured to: receive the update rate
from the background noise parameter updating module, update the noise parameters,
and transfer the sub-band energy estimation parameter of the background noise calculated
according to the updated noise parameters to the SNR calculating sub-module;
the SNR calculating sub-module is configured to: receive the sub-band energy estimation
parameter of the background noise, calculate an SNR according to this parameter and
the sub-band energy parameter, and transfer the SNR to the signal type judging module;
the signal type judging module comprises a parameter updating sub-module and a judging
sub-module, wherein the parameter updating sub-module is configured to update thresholds
in a signal type judgment process according to the SNR and provide the updated threshold
to the judging sub-module; and
the judging sub-module is configured to: receive the sound signal type from the PSC
module, determine the type of the useful signals in the sound signals based on the
open loop pitch parameter, ISF parameter, sub-band energy parameter, and updated thresholds,
or based on the ISF parameter and sub-band energy parameter and the updated threshold,
and send the determined type of the useful signals.
16. The apparatus of claim 13, further comprising:
a coder mode and rate selecting module, configured to: receive the type of the useful
signals from the signal type judging module, and determine a coding mode and rate
of the sound signals according to the received type of the useful signals.
17. The apparatus of claim 14, further comprising:
a coder parameter extracting module, configured to: extract the ISF parameter and
the sub-band energy parameter or additionally the open loop pitch parameter, transfer
the extracted parameters to the classification parameter extracting module, and transfer
the extracted sub-band energy parameter to the PSC module.