FIELD OF THE TECHNOLOGY
[0002] The present disclosure relates to computer technologies, and in particular, to audio
signal processing.
BACKGROUND OF THE DISCLOSURE
[0003] With the development of technologies related to mobile terminals, more smart devices
are used in people's daily life, and making a voice call through a smart device is
particularly common. However, during a call process, a microphone at a local terminal
may collect an audio signal originated from a speaker at an opposite terminal, and
the audio signal loops in the voice call between the local terminal and the opposite
terminal. Especially during a short-range voice call process, the audio signal becomes
a howling due to continual loop gain.
[0004] Generally, a frequency shifter or a phase shifter for processing an input audio at
the local terminal may be used to remove audio components having the same phase as the
howling, thereby suppressing the howling.
SUMMARY
[0005] In view of the previous description, the present disclosure provides an audio signal
processing method which can locate a howling point effectively and suppress howling,
thereby improving accuracy of audio signal processing.
[0006] According to one aspect, an embodiment of the present disclosure provides an audio
signal processing method, which may be applied to a system or a program including
an audio signal processing function in a terminal device, and specifically includes:
obtaining a first audio input signal;
inputting the first audio input signal into a machine learning model to obtain a first
howling point; obtaining a first gain value according to the first howling point,
the first howling point being used for indicating a howling point in a frequency band
corresponding to an effective audio signal in the first audio input signal, and the
first gain value being used for indicating a suppression parameter for the first howling
point; and processing the first audio input signal according to the first gain value
to obtain a second audio input signal;
checking the second audio input signal to obtain a second howling point, and obtaining
a second gain value according to the second howling point, the second howling point
being used for indicating a howling point in a frequency band corresponding to an
ineffective audio signal in the second audio input signal; and
processing the second audio input signal according to the second gain value to obtain
an audio output signal.
[0007] According to another aspect, an embodiment of the present disclosure provides an
audio signal processing apparatus, including:
an obtaining unit, configured to obtain a first audio input signal;
an input unit, configured to input the first audio input signal into a machine learning
model to obtain a first howling point; to obtain a first gain value according to the
first howling point, the first howling point being used for indicating a howling point
in a frequency band corresponding to an effective audio signal in the first audio
input signal, and the first gain value being used for indicating a suppression parameter
for the first howling point; and to process the first audio input signal according
to the first gain value to obtain a second audio input signal;
a detection unit, configured to check the second audio input signal to obtain a second
howling point, and to obtain a second gain value according to the second howling point,
the second howling point being used for indicating a howling point in a frequency
band corresponding to an ineffective audio signal in the second audio input signal;
and
a processing unit, configured to process the second audio input signal according to
the second gain value to obtain an audio output signal.
[0008] According to another aspect, an embodiment of the present disclosure provides a method
for training a machine learning model, including: acquiring a reference signal and
a voice sample signal, the reference signal being a howling signal determined based
on at least two variable elements, the variable elements including a program category,
a program running period, or a program running location, and the acquired signal being
used for indicating an effective voice in a call;
generating a feature training set according to the reference signal and the acquired
signal; and
inputting the feature training set into the machine learning model for performing
at least one round of training to obtain a trained machine learning model, the trained
machine learning model being used for determining a corresponding howling point and
gain value according to an audio input signal.
[0009] A fourth aspect of the present disclosure provides an apparatus for training a machine
learning model, including: an acquisition unit, configured to acquire a reference
signal and a voice sample signal, the reference signal being a howling signal determined
based on at least two variable elements, the variable elements including a program
category, a program running period, or a program running location, and the acquired
signal being used for indicating an effective voice in a call process;
a generation unit, configured to generate a feature training set according to the
reference signal and the acquired signal; and
a training unit, configured to input the feature training set into the machine learning
model for performing at least one round of training to obtain a trained machine learning
model, the trained machine learning model being used for determining a corresponding
howling point and gain value according to an audio input signal.
[0010] A fifth aspect of the present disclosure provides a computer device, including: a
memory, a processor, and a bus system, the memory being configured to store program
code; and the processor being configured to perform, according to instructions in
the program code, the audio signal processing method according to the above aspect,
or the method for training the machine learning model according to the above aspect.
[0011] A sixth aspect of the present disclosure provides a computer-readable storage medium,
the storage medium being configured to store a computer program, and the computer
program being configured to implement the audio signal processing method according
to the above aspect, or the method for training the machine learning model according
to the above aspect.
[0012] According to still another aspect, an embodiment of the present disclosure provides
a computer program product including instructions, the computer program product, when
running on a computer, causing the computer to perform the audio signal processing
method according to the aspect above, or the method for training the machine learning
model according to the above aspect.
[0013] As can be seen from the technical solutions above, the embodiments of the present
disclosure have the following advantages:
[0014] A first audio input signal is obtained; then the first audio input signal is inputted
into a machine learning model to obtain a first gain value for processing a frequency
band in which an effective audio signal is located; the first audio input signal is
processed according to the first gain value to obtain a second audio input signal;
next the second audio input signal is checked to obtain a second howling point, the
second howling point being used for indicating a howling point in a frequency band
corresponding to an ineffective audio signal in the second audio input signal; and
the second audio input signal is processed according to a second gain value to obtain
an audio output signal, the second gain value being used for indicating a suppression
parameter for the second howling point. In this way, the howling is suppressed for
the audio input signal at an initial stage and therefore will not loop with gains.
Due to the indication of howling points in the machine learning model, the convenience
of model calculation, and further processing on the unprocessed howling point by using
the second gain value, a rapid and thorough process of the howling suppression is
implemented without affecting the effective audio signal, so that the accuracy and
efficiency in audio processing are improved.
BRIEF DESCRIPTION OF THE DRAWINGS
[0015]
FIG. 1 is a diagram of network architecture for running an audio signal processing
system.
FIG. 2 is a process architecture diagram of audio signal processing according to an
embodiment of the present disclosure.
FIG. 3 is a flowchart of an audio signal processing method according to an embodiment
of the present disclosure.
FIG. 4 is a flowchart of another audio signal processing method according to an embodiment
of the present disclosure.
FIG. 5 is a schematic diagram of a scenario for audio signal processing according
to an embodiment of the present disclosure.
FIG. 6 is a schematic diagram of another scenario for audio signal processing according
to an embodiment of the present disclosure.
FIG. 7 is a flowchart of another audio signal processing method according to an embodiment
of the present disclosure.
FIG. 8 is a comparison diagram of audio signal processing according to an embodiment
of the present disclosure.
FIG. 9 is a flowchart of another audio signal processing method according to an embodiment
of the present disclosure.
FIG. 10 is a schematic diagram of an interface of an audio signal processing method
according to an embodiment of the present disclosure.
FIG. 11 is a schematic diagram of an interface of another audio signal processing
method according to an embodiment of the present disclosure.
FIG. 12 is a flowchart of a machine learning model training method according to an
embodiment of the present disclosure.
FIG. 13 is a schematic flowchart of training of a machine learning model according
to an embodiment of the present disclosure.
FIG. 14 is a schematic structural diagram of an audio signal processing apparatus
according to an embodiment of the present disclosure.
FIG. 15 is a schematic structural diagram of a machine learning model training apparatus
according to an embodiment of the present disclosure.
FIG. 16 is a schematic structural diagram of a terminal device according to an embodiment
of the present disclosure.
FIG. 17 is a schematic structural diagram of a server according to an embodiment of
the present disclosure.
DESCRIPTION OF EMBODIMENTS
[0016] The embodiments of the present disclosure provide an audio signal processing method
and a related apparatus, which may be applied to a system or a program including an
audio signal processing function in a terminal device. A first audio input signal
is obtained; then the first audio input signal is inputted into a machine learning
model to obtain a first gain value for processing a frequency band in which an effective
audio signal is located; the first audio input signal is processed according to the
first gain value to obtain a second audio input signal; next the second audio input
signal is checked to obtain a second howling point, the second howling point being
used for indicating a howling point in a frequency band corresponding to an ineffective
audio signal; and the second audio input signal is processed according to a second
gain value to obtain an audio output signal, the second gain value being used for
indicating a suppression parameter for the second howling point. In this way, the
howling is suppressed for the audio input signal at an initial stage and therefore
will not loop with gains. Due to indication of howling points in the machine learning
model, the convenience of model calculation, and further processing on the unprocessed
howling point by using the second gain value, a rapid and thorough process of the
howling suppression is implemented without affecting the effective audio signal, so
that the accuracy and efficiency in audio processing are improved.
[0017] First, some terms to be used in the embodiments of the present disclosure are described
as follows.
[0018] Howling: in a case that a sound signal acquired by a microphone is amplified by a
speaker and then picked up again by the microphone, the signal is continuously superimposed
and amplified in a feedback loop, and positive feedback generates an oscillation cycle,
so that howling occurs.
[0019] Howling point: a frequency point in an audio signal at which loop gain is greater
than or equal to 1.
[0020] Effective audio signal: a target audio in the audio signal, such as a voice signal
in a voice call.
[0021] Ineffective audio signal: an interference audio in the audio signal, such as background
noise or an echo.
[0022] Gain value: a degree of change in processing an audio signal at a specified frequency
band, used for indicating a suppression factor applied on an audio signal corresponding
to a howling point in a howling suppression scenario.
[0023] Pitch period: a time period from opening to closing of vocal cords in human vocalization,
a parameter that may be used for indicating the effective audio signal.
[0024] Machine learning model: a model used for adjusting a parameter based on a given sample,
so that an output has characteristics similar to the given sample.
[0025] Power spectrum: the change of signal power as frequency varies, namely, the distribution
of signal power in frequency domain.
[0026] Analog-to-digital converter (ADC): an electronic component configured to convert
an analog signal into a digital signal.
[0027] Recurrent neural network (RNN) model: a recursive neural network that takes a data
sequence as an input and recurs in an evolution direction of the sequence, in which
all nodes (recurrent units) are connected in a chain.
[0028] Convolutional neural network (CNN) model: a neural network that has a representation
learning capability and can classify input information in a translation-invariant
manner according to a hierarchical structure thereof.
[0029] It is to be understood that the audio signal processing method provided by the present
disclosure may be applied to a system or a program including an audio signal processing
function in a terminal device, for example, as a voice plug-in for a game. Specifically,
an audio signal processing system may be run in network architecture as shown in FIG.
1. FIG. 1 is a diagram of network architecture for running an audio signal processing
system. As shown, the audio signal processing system can perform audio signal processing
on a plurality of information sources. A terminal is connected to a server through
a network to receive an audio signal sent by another terminal, and perform the audio
signal processing method provided by the present disclosure on the received signal
to suppress howling and obtain an audio output, so that an audio interaction between
multiple terminals is implemented. It can be understood that although a variety of
terminal devices are shown in FIG. 1, more or fewer types of terminal devices may
be involved in the audio signal processing in actual scenarios, and the specific number
and types depend on the actual scenarios, which is not limited herein. In addition,
one server is shown in FIG. 1, but in actual scenarios multiple servers may be involved,
especially in a scenario of multi-content application interaction, and the specific
number of servers depends on the actual scenario.
[0030] The audio signal processing method provided in this embodiment may be performed offline.
In other words, no server is involved in the audio signal processing. In this case,
the terminal conducts local audio signal interaction with another terminal, and then
performs audio signal processing between the terminals.
[0031] It can be understood that the audio signal processing system may be run on a personal
mobile terminal, for example, as an application such as a voice plug-in for a game,
or may be run on a server, or may be run as a third-party device to perform audio
signal processing, to obtain an audio signal processing result corresponding to an
information source. The audio signal processing system may be run in the foregoing
devices as a program, or be run in the foregoing devices as a system component, or
may be used as a type of cloud service program. The specific operation mode depends
on actual scenarios, which is not limited herein.
[0032] With the development of technologies related to mobile terminals, more smart devices
are used in people's daily life, and making a voice call through a smart device is
particularly common. However, during a call process, a microphone at a local terminal
may collect an audio signal originated from a speaker at an opposite terminal, and the audio
signal loops in the voice call between the local terminal and the opposite terminal.
Especially during a short-range voice call, the audio signal becomes a howling due
to continual loop gain.
[0033] Generally, a frequency shifter or a phase shifter for processing an input audio at
the local terminal may be used to remove audio components having the same phase as the
howling, thereby suppressing the howling.
[0034] However, the scheme of using the frequency shifter or the phase shifter requires
a long processing time and is not suitable for real-time processing of voice call.
Moreover, the phase removal fixed at the howling point causes damage to the sound
quality of the effective audio and affects the accuracy in audio processing.
[0035] To resolve the foregoing problems, the present disclosure provides an audio signal
processing method, which is applied to a process framework of audio signal processing
shown in FIG. 2. FIG. 2 is a process architecture diagram of audio signal processing
according to an embodiment of the present disclosure. First, a terminal device acquires
voice of a user and converts the voice into an audio signal. The audio signal is input
into a trained machine learning model to determine and suppress a howling point; gain
control is then performed on any howling point left unprocessed, so as to obtain a
howling-suppressed audio signal as an output.
[0036] It can be understood that the method provided by the present disclosure may be written
as a program, which is used as a processing logic in a hardware system, or as an audio
signal processing apparatus, to implement the processing logic in an integrated or
externally connected manner. As an implementation, the audio signal processing apparatus
obtains a first audio input signal; then inputs the first audio input signal into
a machine learning model to obtain a first gain value for processing a frequency band
in which an effective audio signal is located; processes the first audio input signal
according to the first gain value to obtain a second audio input signal; checks the
second audio input signal to obtain a second howling point, the second howling point
being used for indicating a howling point in a frequency band corresponding to an
ineffective audio signal; and processes the second audio input signal according to
a second gain value to obtain an audio output signal, the second gain value being
used for indicating a suppression parameter for the second howling point. In this
way, the howling is suppressed for the audio input signal at an initial stage and
therefore will not loop with gains. Due to indication of a howling point in the machine
learning model, the convenience of model calculation, and further processing on the
unprocessed howling point by using the second gain value, a rapid and thorough process
of the howling suppression is implemented without affecting the effective audio signal,
so that the accuracy and efficiency in audio processing are improved.
[0037] With reference to the process architecture, the audio signal processing method in
the present disclosure is described below. FIG. 3 is a flowchart of an audio signal
processing method according to an embodiment of the present disclosure. The method
includes at least the following steps.
[0038] In 301: a first audio input signal is obtained.
[0039] In this embodiment, the first audio input signal may be an audio signal initially
obtained when a voice call is started, or may be an audio signal obtained after a
call lasts for a period of time. Specifically, howling is generated when an audio
signal is continuously amplified in a feedback loop, i.e., due to accumulated feedback gains.
The feedback loop includes a microphone at a local terminal and a speaker at an opposite
terminal. Therefore, an audio signal at a different time period may have different
accumulated feedback gain. The audio signal processing method provided by the present
disclosure may be invoked immediately, or may be invoked after the feedback gain becomes
greater than or equal to 1 since the generation of howling requires a feedback gain
of the audio signal in the loop to be greater than or equal to 1.
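As a non-limiting illustration of why a feedback gain greater than or equal to 1 is required for howling, the following minimal sketch models a single frequency component that is multiplied by the same loop gain on every pass through the feedback loop. The function names and the constant per-pass gain are assumptions made for this example only.

```python
def amplitude_after_passes(initial_amplitude, loop_gain, passes):
    """Amplitude of one frequency component after `passes` trips around the loop.

    Assumes (for illustration) that every trip multiplies the amplitude by the
    same real-valued loop gain.
    """
    return initial_amplitude * loop_gain ** passes


def can_howl(loop_gain):
    """A component can only build up (or sustain itself) when the gain is >= 1."""
    return loop_gain >= 1.0
```

Under this simplified model, a component with a loop gain below 1 decays on every pass, whereas a component with a loop gain of at least 1 does not decay and may grow into howling.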
[0040] Optionally, the obtained first audio input signal may be an amplified signal. Specifically,
an acquired signal, which may be acquired by a microphone or another acquisition device,
is obtained; the acquired signal is converted into a digital signal, for example,
by an ADC; the digital signal is inputted to an amplifier to obtain the first audio
input signal. Because the first audio input signal is amplified, it is convenient
for a user to listen and convenient for a subsequent howling point filtering process
in the present disclosure.
[0041] In addition, the acquired signal from the acquisition device may contain obvious
noise, such as a signal with a frequency far beyond voice range. In this case, noise
filtering may be performed. Specifically, the digital signal is inputted into an amplifier
to obtain an amplified signal; then the amplified signal is processed according to
a filter parameter to obtain a filtered amplified signal; and the filtered amplified
signal is transformed into frequency domain by Fourier transformation to obtain the
first audio input signal. The filter parameter may be a fixed value, or may be set
according to a frequency band corresponding to common noise in historical records.
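The amplification, filtering, and Fourier transformation described above can be sketched as follows. This is a minimal example and not the implementation of the embodiment: the moving-average filter is a hypothetical stand-in for the filter controlled by the filter parameter, the naive discrete Fourier transform is used only to keep the example self-contained, and all function names and default values are assumptions.

```python
import cmath
import math


def amplify(samples, gain):
    """Scale every digital sample by a fixed amplifier gain."""
    return [gain * s for s in samples]


def moving_average_filter(samples, width):
    """Hypothetical stand-in for the noise filter; `width` plays the role of
    the filter parameter (a larger width removes more high-frequency content)."""
    out = []
    for i in range(len(samples)):
        lo = max(0, i - width + 1)
        window = samples[lo:i + 1]
        out.append(sum(window) / len(window))
    return out


def dft(samples):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def preprocess(samples, gain=2.0, width=2):
    """Amplify, filter, then transform into the frequency domain."""
    return dft(moving_average_filter(amplify(samples, gain), width))
```

In practice a fast Fourier transform and a filter designed for the expected noise band would replace the two simplified helpers.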
[0042] In 302: the first audio input signal is input into a machine learning model to obtain
a first howling point, and a first gain value is obtained according to the first howling
point.
[0043] In this embodiment, the first howling point is used for indicating a howling point
in a frequency band corresponding to an effective audio signal in the first audio
input signal; and the first gain value is used for indicating a suppression parameter
for the first howling point. The first audio input signal is processed according to
the first gain value to obtain a second audio input signal. In addition, the machine
learning model is trained with a plurality of training signals, the training signals
include a plurality of howling point samples, and the first howling point is used
for indicating the howling point in the frequency band corresponding to the effective
audio signal.
[0044] Because howling points differ from effective audio in frequency band distribution
or energy features, feature extraction may be performed by determining a plurality
of features in the audio input signal. These features may be selected from features
of the effective audio signal, for example, frequency band distribution, a position
of a pitch period, a signal fluctuation frequency, or the like. Then the extracted
features are inputted into the machine learning model to determine a corresponding
first howling point. Accordingly, a first gain value corresponding to the first howling
point is determined.
[0045] Specifically, the feature extraction may be performed based on the effective audio
signal. According to a first aspect, the feature extraction is performed based on
a parameter feature of the effective audio signal, for example, Mel-frequency
cepstral coefficients (MFCCs) of the effective audio signal, or a mathematical transformation
of the coefficients. According to a second aspect, the feature extraction is performed
based on a biological feature of the effective audio signal, for example, a pitch period
of the effective audio signal, because an audio signal of the human voice below 500
Hz has a pitch period while a howling signal does not. According to a third aspect,
the feature extraction is performed based on a waveform feature of the effective audio
signal, for example, the fluctuation of the effective audio signal in a specific frequency
band, because the effective audio signal has a feature of short-term stability. Through
the feature extraction, the effective audio signal may be distinguished from the signal
corresponding to the howling point, so that the machine learning model can learn the
features of the effective audio signal and can have improved accuracy in extracting
the effective audio signal.
[0046] The features listed above are exemplary. A feature may be indicative of the effective
audio signal, or indicative of the howling signal, or indicative of a difference between
the effective audio signal and the howling signal, which is not limited herein.
[0047] Optionally, considering different frequencies for acquiring the input signal in different
scenarios, the first audio input signal may be adjusted to a target frequency for
transformation into the frequency domain. For example, a sampling rate of 16 kHz is
commonly used for voice calls on mobile phones, and therefore the target frequency is
set to 16 kHz; then a plurality of sampling points in the first audio input signal
transformed into the frequency domain are determined; and a plurality of audio features
are extracted based on the sampling points. In this way, the input signal is processed
in multiple threads and the efficiency for audio processing is improved.
[0048] In addition, in the process of transforming the first audio input signal from a time
domain into the frequency domain, for better periodicity of the time domain signal
required in the Fourier transform and less signal omission, the input signal may be
divided with a window function. In other words, the first audio input signal transformed
into the frequency domain is divided based on the window function to obtain a plurality
of sub-bands; and then a plurality of sampling points in the sub-bands are determined.
The window function may be a rectangular window, a Gaussian window, a Kaiser window,
or the like, and the specific form of the function depends on an actual scenario.
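One possible reading of the windowing described above is sketched below: the time-domain signal is cut into overlapping frames and each frame is weighted by a window before being transformed. The Gaussian window is one of the window types named above; the `sigma` value, the frame length, the hop size, and the function names are assumptions for illustration only.

```python
import math


def gaussian_window(length, sigma=0.4):
    """Gaussian window of `length` points (length must be at least 2).

    `sigma` is expressed as a fraction of half the window length; the value
    0.4 is an arbitrary choice for this example.
    """
    if length < 2:
        raise ValueError("window length must be at least 2")
    center = (length - 1) / 2
    return [
        math.exp(-0.5 * ((n - center) / (sigma * center)) ** 2)
        for n in range(length)
    ]


def split_into_windowed_frames(samples, frame_len, hop):
    """Cut `samples` into frames of `frame_len` points spaced `hop` apart,
    multiplying each frame by the window to reduce spectral leakage."""
    window = gaussian_window(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames
```

A rectangular window corresponds to leaving every frame unweighted; tapered windows such as the Gaussian or Kaiser window trade frequency resolution for lower leakage between neighbouring sub-bands.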
[0049] In 303: a second audio input signal is checked to obtain a second howling point,
and a second gain value is obtained according to the second howling point.
[0050] In this embodiment, the first gain values correspond to a plurality of howling points
in the first audio input signal, each howling point corresponds to a plurality of
frequency bands, and a set of these frequency bands is called a sub-band. Therefore,
the first gain values may include a plurality of gain values for howling suppression,
each being a floating-point number within the range from 0 to 1. The first gain values
are applied to the first audio input signal, that is, each sub-band is multiplied by
the attenuation gain for howling suppression corresponding to that sub-band, so as to
obtain the second audio input signal, i.e., the result of the howling suppression
learned by the machine learning model.
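The per-sub-band multiplication can be sketched as follows. This is a minimal example assuming that the frequency-domain signal is a list of bins and that each sub-band is given as a half-open range of bin indices; the function name and the data layout are assumptions, and the gains would in practice be the output of the machine learning model.

```python
def apply_subband_gains(spectrum, band_edges, gains):
    """Multiply every bin of each sub-band by that sub-band's gain.

    spectrum   -- list of frequency-domain bins (real or complex)
    band_edges -- list of (start, stop) bin ranges, `stop` exclusive
    gains      -- one attenuation gain in [0, 1] per sub-band
    """
    if len(band_edges) != len(gains):
        raise ValueError("one gain is required per sub-band")
    out = list(spectrum)
    for (start, stop), gain in zip(band_edges, gains):
        if not 0.0 <= gain <= 1.0:
            raise ValueError("gain must lie in the range from 0 to 1")
        for k in range(start, stop):
            out[k] = out[k] * gain
    return out
```

A gain of 1 leaves a sub-band unchanged, and a gain close to 0 suppresses it almost completely.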
[0051] It can be understood that the second howling point is used for indicating a howling
point in a frequency band corresponding to an ineffective audio signal in the second
audio input signal, that is, the howling point that does not belong to the frequency
band corresponding to the effective audio signal. Because the machine learning model may
leave unprocessed a howling point in a frequency band corresponding to the ineffective
audio signal, secondary gain processing is performed, which starts with detecting the
second howling point.
[0052] Specifically, the second howling point may be detected by: obtaining a power spectrum
corresponding to the second audio input signal; detecting, in the power spectrum,
an extreme value such as a maximum power value in the power spectrum, or a value range
set based on the maximum power value; determining a corresponding candidate frequency
point, which may be a howling point, according to the extreme value; and determining
the second howling point according to the candidate frequency point. That is, phase
and feedback gain of the candidate frequency point are detected, and in a case that
the phase is consistent and the feedback gain is greater than or equal to 1, the candidate
frequency point is determined as the second howling point. Through the determining
of the extreme value in the power spectrum, the change of gains of frequency points
can be visually determined because the power corresponding to a howling point is usually
greater than the power of a general frequency point, so that the accuracy in howling
point recognition is improved.
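The detection based on the extreme value of the power spectrum can be sketched as follows. This is a minimal example in which the candidate is simply the bin with the maximum power; the loop gain and loop phase of the candidate are assumed to be measured elsewhere, "consistent phase" is interpreted here as a loop phase close to a multiple of 2π, and the tolerance `phase_tol` and all function names are assumptions.

```python
import math


def power_spectrum(spectrum):
    """Power of each frequency-domain bin."""
    return [abs(x) ** 2 for x in spectrum]


def candidate_bin(power):
    """Index of the bin with the maximum power (the extreme value)."""
    return max(range(len(power)), key=lambda k: power[k])


def is_howling_point(loop_gain, loop_phase, phase_tol=0.1):
    """Confirm a candidate: feedback gain >= 1 and a consistent loop phase.

    `loop_phase` is in radians; it is treated as consistent when it lies
    within `phase_tol` of a multiple of 2*pi (an assumption of this sketch).
    """
    wrapped = math.remainder(loop_phase, 2 * math.pi)
    return loop_gain >= 1.0 and abs(wrapped) <= phase_tol
```

A value range set based on the maximum power value, as mentioned above, could be handled by returning every bin whose power exceeds a chosen fraction of the maximum instead of a single index.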
[0053] Optionally, the second howling point may be determined based on a peak-to-average
ratio, that is, a plurality of frequency points adjacent to the candidate frequency
point are obtained to determine a candidate range; then an average frequency value
of the frequency points in the candidate range is determined to obtain the peak-to-average
ratio; and in a case that the peak-to-average ratio is greater than a howling threshold,
the candidate frequency point is determined as the second howling point. To avoid
the impact of extreme values caused by accidental events on the recognition, the howling
points may be determined by the peak-to-average ratio, so that a range of reference
is expanded and the accuracy in howling point recognition is further improved.
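The peak-to-average check can be sketched as follows. This is a minimal example under two assumptions that are not fixed by the description above: the average is taken over the power values of the adjacent frequency points, and the candidate frequency point itself is excluded from the average. The function names, `half_width`, and `threshold` are hypothetical.

```python
import math


def peak_to_average_ratio(power, k, half_width):
    """Ratio of the power at bin `k` to the mean power of its neighbours.

    The candidate range covers `half_width` bins on each side of `k`,
    clipped to the ends of the spectrum.
    """
    lo = max(0, k - half_width)
    hi = min(len(power), k + half_width + 1)
    neighbours = [power[i] for i in range(lo, hi) if i != k]
    if not neighbours:
        raise ValueError("candidate range contains no adjacent points")
    mean = sum(neighbours) / len(neighbours)
    if mean == 0:
        return math.inf
    return power[k] / mean


def is_howling_by_ratio(power, k, half_width, threshold):
    """The candidate is a howling point when the ratio exceeds the threshold."""
    return peak_to_average_ratio(power, k, half_width) > threshold
```

Because the ratio compares the candidate with its surroundings, an isolated narrow peak yields a high ratio, while a broadband increase in level raises the average as well and yields a low ratio.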
[0054] Optionally, because the howling points may appear periodically, the howling points
may also be determined by statistical analysis of a historical record. For example,
in a possible scenario, the howling points tend to be concentrated in a frequency
band above 2 kHz, and the energy of the voice signal is mainly concentrated in a frequency
band below 2 kHz. Then the howling points are determined based on the peak-to-average
ratio; or a further check may be performed based on locations where the howling points
appear in the historical record. For example, in a case that the howling points are
concentrated in 2 kHz to 3 kHz in the historical record, secondary detection is performed
on this range in the next howling point recognition. For the specific manner of check,
reference may be made to the recognition based on the extreme value of the power spectrum
or the peak-to-average ratio as described above.
[0055] It can be understood that, the frequency band where the howling points are concentrated
depends on a specific scenario, that is, in different scenarios the howling points
may be concentrated in higher or lower frequency band. The description herein is only
to illustrate the method for obtaining the howling points based on analysis of the
historical record, and the present disclosure is not limited thereto.
[0056] In 304, the second audio input signal is processed according to the second gain value
to obtain an audio output signal.
[0057] In this embodiment, the second gain value is used for indicating a suppression parameter
for the second howling point, that is, a suppression factor applied on a frequency
band corresponding to the second howling point. The secondary filtering of howling
point ensures the accuracy and the significant effect of howling suppression. The
second gain value may be empirically set to a floating-point value ranging from 0
to 1, or may be calculated based on the energy of upper and lower adjacent sub-bands.
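One way the second gain value might be calculated from the energy of the upper and lower adjacent sub-bands is sketched below. The formula is an assumption made for illustration and is not specified above: the gain scales the amplitude of the howling sub-band so that its energy matches the mean energy of its neighbours, and is clipped so that it never exceeds 1. The function name is hypothetical.

```python
import math


def second_gain_from_neighbours(band_energies, index):
    """Attenuation gain in [0, 1] for the sub-band at `index`.

    The target energy is the mean energy of the adjacent sub-bands; the
    square root converts the energy ratio into an amplitude gain.
    """
    neighbours = []
    if index > 0:
        neighbours.append(band_energies[index - 1])
    if index < len(band_energies) - 1:
        neighbours.append(band_energies[index + 1])
    if not neighbours or band_energies[index] <= 0:
        return 1.0  # nothing to compare against, leave the band unchanged
    target = sum(neighbours) / len(neighbours)
    return min(1.0, math.sqrt(target / band_energies[index]))
```

With this choice, a sub-band that is no louder than its neighbours receives a gain of 1 and is left untouched, so only sub-bands that stand out are attenuated.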
[0058] In addition, after the second audio input signal is processed according to the second
gain value, the processed signal may be transformed into the time domain and subjected
to notch filtering to further eliminate the howling points.
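The time-domain notch processing can be sketched with a second-order (biquad) notch filter. This is a minimal example and not the filter of the embodiment: the coefficient formulas follow the widely used audio biquad design, and the notch frequency `f0`, the sampling rate `fs`, the quality factor `q`, and the function names are assumptions.

```python
import math


def notch_coefficients(f0, fs, q):
    """Normalized biquad notch coefficients (b, a) centred on f0 hertz."""
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b = [1 / a0, -2 * math.cos(w0) / a0, 1 / a0]
    a = [1.0, -2 * math.cos(w0) / a0, (1 - alpha) / a0]
    return b, a


def biquad(samples, b, a):
    """Apply the filter sample by sample (direct form I)."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out
```

For example, `biquad(signal, *notch_coefficients(2500.0, 16000.0, 10.0))` would attenuate a narrow band around 2.5 kHz in a signal sampled at 16 kHz; a higher `q` narrows the notch and therefore affects less of the neighbouring effective audio.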
[0059] In the foregoing embodiments, a first audio input signal is obtained; then the first
audio input signal is inputted into a machine learning model to obtain a first gain
value for processing a frequency band in which an effective audio signal is located;
the first audio input signal is processed according to the first gain value to obtain
a second audio input signal; next the second audio input signal is checked to obtain
a second howling point, the second howling point being used for indicating a howling
point in a frequency band corresponding to an ineffective audio signal in the second
audio input signal; and the second audio input signal is processed according to a
second gain value to obtain an audio output signal, the second gain value being used
for indicating a suppression parameter for the second howling point. In this way,
the howling is suppressed for the audio input signal at an initial stage and therefore
will not loop with gains. Due to the indication of a howling point in the machine
learning model, the convenience of model calculation, and further processing on the
unprocessed howling point by using the second gain value, a rapid and thorough process
of the howling suppression is implemented without affecting the effective audio signal,
so that the accuracy and efficiency in audio processing are improved.
[0060] In the foregoing embodiments, the process of audio signal processing is described.
However, the effective audio signal may be affected in the secondary howling suppression.
To avoid such a situation, reference is made to FIG. 4. FIG. 4 is a flowchart of another
audio signal processing method according to an embodiment of the present disclosure,
and the method includes at least the following steps.
[0061] In 401, a first audio input signal is obtained.
[0062] In 402, the first audio input signal is inputted into a machine learning model to
obtain a first gain value.
[0063] In 403, the first audio input signal is processed according to the first gain value
to obtain a second audio input signal.
[0064] In 404, the second audio input signal is checked to obtain a second howling point.
[0065] In this embodiment, steps 401 to 404 are similar to steps 301 to 304 in the embodiment
of FIG. 3, and for related features and description, reference may be made to the
description above, which are not repeated herein.
[0066] In 405, the second audio input signal is checked for voice protection.
[0067] In this embodiment, the voice protection is to ensure the integrity of the effective
audio signal. Specifically, feature information in the effective audio signal is first
obtained, the feature information being determined based on waveform features indicated
by the effective audio signal. For example, for the waveform feature indicated by
the effective audio signal, a voiced sound has formants and an unvoiced sound has
large high-frequency energy and a stable energy slope with respect to frequency axis.
Then the corresponding effective audio signal in the second audio input signal is
detected according to the feature information. A lock operation is further performed
on the effective audio signal, the lock operation being used for indicating an object
on which the second gain value does not act. In other words, a processing frequency
band corresponding to the second gain value may include the effective audio signal
which is locked in this step so as not to be subjected to the gain processing.
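A minimal Python sketch of the lock operation is given below. It assumes that the detected effective audio signal is represented as a set of frequency point indices; the function name and this representation are hypothetical and are used only for illustration.

```python
def apply_gain_with_lock(spectrum, gains, locked_bins):
    """Apply per-bin gains, skipping bins locked as effective voice.

    spectrum    -- per-bin values of the second audio input signal
    gains       -- per-bin second gain values (same length as spectrum)
    locked_bins -- indices on which the second gain value must not act
    """
    locked = set(locked_bins)
    return [x if k in locked else x * g
            for k, (x, g) in enumerate(zip(spectrum, gains))]
```

For example, with gains of 0.5 on every frequency point and frequency point 1 locked, only frequency points 0 and 2 are attenuated.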
[0068] Optionally, the voice protection for the effective audio signal may also be performed
based on a historical record indicating a voice frequency band. In other words, frequency
band distribution of the effective audio signal is recorded, and frequency bands with
a large distribution weight are detected and filtered one by one.
[0069] In 406, the second audio input signal is processed according to a second gain value.
[0070] In 407, a frame processed according to the second gain value is smoothed.
[0071] In this embodiment, to prevent a harsh audio output signal caused by a large difference
of suppression gains between frames, that is, an abrupt audio change, gained frames
associated with howling points corresponding to the second gain value may be smoothed.
Specifically, a gained frame and adjacent frames may be processed with the following
formula:

where α is a smoothing factor ranging from 0 to 1; g̃ain2(m-1, k) is the suppression gain
of the previous frame; m is a frame index; and k is a frequency point index. With the
formula above, the gain difference between adjacent frames is reduced, so that the gains
of the adjacent frames vary more nearly linearly, the abrupt audio change is reduced,
the audio output signal sounds smoother, and user experience is improved.
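The smoothing formula itself is not reproduced in this text. One form consistent with the symbols defined above is first-order recursive smoothing, g̃ain2(m, k) = α · g̃ain2(m-1, k) + (1 - α) · gain2(m, k), where gain2(m, k) denotes the unsmoothed gain of the current frame. This form, the function name, and the default values in the following Python sketch are assumptions for illustration, not the formula of the disclosure.

```python
def smooth_gains(raw_gains, alpha=0.7, initial=1.0):
    """Recursively smooth per-bin gains across frames (assumed form).

    raw_gains -- raw_gains[m][k] is the unsmoothed gain of frame m, bin k
    alpha     -- smoothing factor in [0, 1]; larger values follow the
                 previous frame more closely
    initial   -- gain assumed for the frame preceding the first frame
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    smoothed = []
    prev = None
    for frame in raw_gains:
        if prev is None:
            prev = [initial] * len(frame)
        cur = [alpha * p + (1.0 - alpha) * g for p, g in zip(prev, frame)]
        smoothed.append(cur)
        prev = cur
    return smoothed
```

With alpha set to 0.5, a gain that drops abruptly from 1 to 0 is eased to 0.5 in the first frame and 0.25 in the second, instead of switching at once.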
[0072] In 408, an audio output signal is obtained.
[0073] In this embodiment, the audio output signal is obtained by multiplying the gain parameter
g̃ain2(m, k) from step 407 by the value of the corresponding frequency point.
[0074] In the foregoing embodiments, the accuracy and resolution of the audio output signal
are improved by detecting and locking the effective audio signal. In addition, by
smoothing the gain parameter of adjacent gained frames, the gains of the adjacent
frames vary more nearly linearly, the abrupt audio change is
reduced, the audio output signal sounds smoother, and the user experience is improved.
[0075] In the foregoing embodiments, the audio processing method for howling suppression
is described. With reference to specific scenarios, the audio processing method for
howling suppression is described below. FIG. 5 is a schematic diagram of a scenario
for audio signal processing according to an embodiment of the present disclosure.
A scenario, in which a microphone acquires and amplifies a voice signal for playback,
is shown. Because a sound source (microphone) is too close to an amplification device
(speaker), the sound signal acquired by the microphone is amplified by the speaker
and then picked up again by the microphone. Hence the signal is continuously superimposed
and amplified in a feedback loop and an oscillation cycle is generated by a positive
feedback, which results in howling. A function of the oscillation generated in the
positive feedback may be:

[0076] Correspondingly, the generation of howling requires that a phase of an input signal
acquired by the microphone in the feedback loop is the same as a phase of a sound
wave signal fed back to the speaker, that is:

and a feedback loop gain is greater than or equal to 1, that is:

where G(s) is the input signal acquired by the microphone; F(s) is the sound wave
signal fed back to the speaker; G(w0) is the phase of the input signal acquired by
the microphone; F(w0) is the phase of the sound wave signal fed back to the speaker;
and n is an integer parameter.
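The formulas referred to above are not reproduced in this text. In the textbook treatment of a positive feedback loop, the closed-loop response is G(s) / (1 - G(s)F(s)), and sustained oscillation at a frequency w0 requires a loop phase of 2πn together with a loop gain magnitude |G(w0)F(w0)| of at least 1. This textbook form is given here as an assumption about what the omitted formulas express. The following Python sketch checks both conditions at a single frequency; the function name and the phase tolerance are hypothetical.

```python
import cmath

def may_howl(g, f, phase_tol=1e-3):
    """Check the two oscillation conditions at one frequency (assumed form).

    g, f      -- complex responses of the forward path and the feedback
                 path at the frequency of interest
    phase_tol -- tolerance in radians for treating the loop as in phase
    """
    loop = g * f
    # cmath.phase returns a value in [-pi, pi], so a loop phase of 2*pi*n
    # appears here as a phase close to 0.
    in_phase = abs(cmath.phase(loop)) <= phase_tol
    return in_phase and abs(loop) >= 1.0
```

A loop with gain 1.2 and zero phase satisfies both conditions, whereas the same gain with a phase of π (negative feedback) does not.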
[0077] In this scenario, the audio signal processing method provided by the present disclosure
may be performed in an amplifier. In other words, after an audio signal acquired by
the microphone is transmitted to the amplifier, the audio signal processing process
in the embodiment of FIG. 3 or FIG. 4 is immediately performed, and then an output
signal is transmitted to the speaker for playback. As this cycle repeats, the howling
can be suppressed.
[0078] Another possible scenario is shown in FIG. 6. FIG. 6 is a schematic diagram of another
scenario for audio signal processing according to an embodiment of the present disclosure,
showing a loop in a scenario in which the terminal outputs sound loudly. In a case
that two terminals are close to each other, the sound from the speaker of the terminal
on the right is picked up by the microphone of the terminal on the left. After pre-processing
and signal conversion, the sound is sent via a network to the terminal on the right and is played
via its speaker. Then the sound is picked up again by the microphone of the terminal
on the left. As this cycle repeats, when the loop gain is greater than or equal to 1
at a certain frequency point and the feedback at that frequency point is in phase
(positive feedback), a howling point is formed at this point.
[0079] The howling suppression is described below with reference to a specific example.
FIG. 7 is a flowchart of another audio signal processing method according to an embodiment
of the present disclosure, and the method includes at least the following steps.
[0080] In 701, an audio signal of a target frequency is inputted, and the audio signal is
divided into frames of 20 milliseconds each.
[0081] In this embodiment, considering that a mobile phone voice call is generally processed
at a sampling rate of 16 KHz, the target frequency may be set to 16 KHz.
[0082] In 702, transformation into frequency domain is performed.
[0083] In this embodiment, the audio signal is windowed and transformed into frequency domain
by Fourier transformation. The window function may be a rectangular window, a Gaussian
window, a Kaiser window, or the like, and the specific form of the function depends
on an actual scenario.
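Steps 701 and 702 may be sketched in Python as follows, using only the standard library. The sketch assumes non-overlapping frames, a Gaussian window with an arbitrarily chosen width parameter, and a direct (non-fast) discrete Fourier transform; all names and parameter values are illustrative assumptions rather than details of the disclosure.

```python
import cmath
import math

SAMPLE_RATE = 16000                          # 16 KHz, as in step 701
FRAME_MS = 20                                # frame duration in milliseconds
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # 320 samples per frame

def split_frames(samples, frame_len=FRAME_LEN):
    """Split the signal into non-overlapping frames; drop a short tail."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def gaussian_window(n, sigma=0.4):
    """Gaussian window of length n (n > 1); sigma is an assumed width."""
    half = (n - 1) / 2.0
    return [math.exp(-0.5 * ((i - half) / (sigma * half)) ** 2)
            for i in range(n)]

def dft(frame):
    """Direct DFT of a real frame, returning bins 0 .. n/2."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n // 2 + 1)]

def frame_spectra(samples):
    """Window each 20 ms frame and transform it into the frequency domain."""
    window = gaussian_window(FRAME_LEN)
    return [dft([s * w for s, w in zip(frame, window)])
            for frame in split_frames(samples)]
```

With 320 samples per frame at 16 KHz, adjacent frequency points are 50 Hz apart, so a 1 KHz tone falls on frequency point 20. A practical implementation would normally use a fast Fourier transform instead of the direct sum shown here.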
[0084] In 703, 42 feature values are extracted.
[0085] In this embodiment, the feature values may include 22 Mel-scale frequency cepstral
coefficients, which are parameters also used in a voice recognition process and which
characterize the effective audio signal; the feature values may further include
first-order or second-order derivatives of the first 6 coefficients used for indicating
voice features; the feature values may further include a pitch period, because a voiced
sound of a voice signal has a pitch period while a howling signal does not at a frequency
below 500 Hz; and the feature values may further include a non-stationarity feature
value, because voice is stable only over a short term.
[0086] In 704, a first gain value is calculated with an RNN model.
[0087] In this embodiment, the machine learning model is an RNN model, which models a time series
rather than considering only individual input and output frames. The specific process of
obtaining the first gain value is similar to step 302 in the embodiment of FIG. 3,
which is not repeated herein.
[0088] In 705, howling suppression is performed with the first gain value at a first howling
point.
[0089] In 706, a second howling point is detected, and a second gain value is obtained.
[0090] In 707, howling suppression is performed with the second gain value.
[0091] In 708, transformation into time domain is performed and an audio signal is outputted.
[0092] In this embodiment, steps 705-708 are similar to steps 303-305 in the embodiment
of FIG. 3, and for related features and descriptions, reference may be made to the
above, which are not repeated herein.
[0093] With reference to the foregoing embodiments, a howling suppression result shown in
FIG. 8 can be obtained. FIG. 8 shows a comparison of audio signal processing according
to an embodiment of the present disclosure, where an upper diagram is a spectrogram
of the input signal before the howling suppression, and a lower diagram is a spectrogram
of the signal after the howling suppression. It can be seen from the comparison that
the clutter around the peak of the sample is significantly reduced, that is, howling
has been suppressed by the audio processing method provided in the present disclosure
before the howling is generated.
[0094] In the foregoing embodiments, the audio signal processing process is described. The
following description is made in a specific scenario of game application. FIG. 9 is
a flowchart of another audio signal processing method according to an embodiment of
the present disclosure, and the method includes at least the following steps.
[0095] In 901, a start instruction of a game is obtained.
[0096] In this embodiment, the start instruction of the game may be a triggering of starting
the game or starting a thread of a certain scene in the game, for example, entering
of a battle scene.
[0097] In 902, howling suppression is performed once a feature element is triggered.
[0098] In this embodiment, the feature element is a physical or virtual button for activating
a voice call function. FIG. 10 is a schematic diagram of an interface of an audio
signal processing method according to an embodiment of the present disclosure. A feature
element A1 in the game interface is shown. Once any of the buttons is triggered, the
audio processing method in the embodiment of FIG. 3 or FIG. 4 is invoked.
[0099] In addition, the audio processing method in the present disclosure is not only applicable
in a voice call between two users, but may also be applied in a voice call among a
plurality of users. FIG. 11 is a schematic diagram of an interface in another audio
signal processing method according to an embodiment of the present disclosure, in
which users are in a public voice scene B2. At this time, once a feature element B1
is triggered, the audio processing method in the embodiment of FIG. 3 or FIG. 4
is invoked.
[0100] In 903, a processed audio signal is outputted.
[0101] In this embodiment, the audio signal subjected to the howling suppression is utilized
to realize a clear voice call between two or more users.
[0102] By virtue of the howling suppression performed on the audio signal between users
in the game, the users can make a clearer voice call without the communication being
disrupted by howling, so that the user experience and the accuracy of a voice call are ensured
in a scenario such as the game that requires high-efficiency and high-quality voice.
[0103] In the foregoing embodiments, a machine learning model is applied. The machine learning
model is pre-trained. The method may be performed by an audio processing device, which
may be a terminal device or a server. The trained machine learning model may be applied
to the foregoing audio signal processing solution. The scenario is described below.
FIG. 12 is a flowchart of a machine learning model training method according to an
embodiment of the present disclosure, and the method includes at least the following
steps.
[0104] In 1201, a reference signal and a voice sample signal are acquired.
[0105] In this embodiment, the reference signal is a howling signal determined based on
at least two variable elements, the variable elements including a program category,
a program running period, or a program running location; and the voice sample signal
is used for indicating an effective voice in a voice call process.
[0106] It can be understood that the program category in the variable elements may correspond to training
samples acquired in different game scenarios of different games, such as Honor of Kings or
Game for Peace. The program running period indicates a time period in which the training
samples are acquired. For example, a voice call function is usually activated in a
game during a time period from 8 pm to 9 pm, and voice calls are relatively intense in this period,
which may be additionally marked to generate the training samples. In addition, the
program running location is geographic information of the voice acquisition. For example,
the training samples are acquired at different geographic locations such as a market,
a classroom, or a bedroom.
[0107] The training samples are acquired under various conditions, and the howling points
in the training samples are marked, so that a diversity of the training samples is
ensured. Because the acquired signal is used as a voice sample, the machine learning
model performs well in recognizing howling points in the voice frequency band.
[0108] In 1202, a feature training set is generated according to the reference signal and
the acquired signal.
[0109] In this embodiment, the signals acquired under the different factors are assigned
with labels and are classified, and corresponding howling points are marked to generate
the feature training set.
[0110] In 1203, the feature training set is inputted into a machine learning model to perform
at least one round of training, to obtain a trained machine learning model.
[0111] In this embodiment, the trained machine learning model is used for determining a
howling point and a gain value according to the audio input signal. Specifically,
FIG. 13 is a schematic flowchart of training of a machine learning model according
to an embodiment of the present disclosure. An RNN model including three layers of gated
recurrent units (GRUs) is shown. Compared with a simple recurrent unit, the GRU has two additional
gates: a reset gate determines whether the current state is memorized for calculating
a new state, and an update gate determines how much the current state is changed according
to a new input. In a case that the update gate is closed, the GRU can remember training
information for a long time. The GRU at the first layer has a 42-dimensional
input, and outputs a 24-dimensional feature and a voice activity detection (VAD) flag.
The GRU at the second layer takes as input the initial 42-dimensional feature and the 24-dimensional
feature outputted by the first layer, and outputs a 48-dimensional feature for estimating
a howling signal. The third layer takes as input the initial 42-dimensional feature and the
48-dimensional feature outputted by the second layer to obtain an output, and the output is
adjusted according to a gain value in a training sample to update a model parameter,
thereby implementing training of the RNN model.
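For illustration, a single GRU step may be sketched in Python as follows. The sketch assumes the common formulation in which an update gate value near 0 (a closed gate) retains the previous state; the parameter names (Wz, Uz, and so on) and the pure-Python vector helpers are hypothetical and are not taken from the disclosure, and the layer sizes described above are not fixed by the sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def gate(w, u, b, x, h, activation):
    """activation(W x + U h + b), evaluated element by element."""
    return [activation(wx + uh + bi)
            for wx, uh, bi in zip(matvec(w, x), matvec(u, h), b)]

def gru_step(x, h, p):
    """One GRU step: x is the input, h the previous state, p the weights."""
    z = gate(p["Wz"], p["Uz"], p["bz"], x, h, sigmoid)    # update gate
    r = gate(p["Wr"], p["Ur"], p["br"], x, h, sigmoid)    # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h)]           # state seen by candidate
    cand = gate(p["Wh"], p["Uh"], p["bh"], x, reset_h, math.tanh)
    # z near 0 keeps the old state; z near 1 replaces it with the candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
```

When the update gate bias is strongly negative, the gate stays closed and the state passes through the step essentially unchanged, which corresponds to the long-term memory behavior described above.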
[0112] The training process in the present disclosure may also be applied to a deep neural
network model or a convolutional neural network model, which is not repeated herein.
[0113] Through the process of training the machine learning model, the howling point distribution
and a corresponding first gain value can be obtained after the audio signal is inputted
into the machine learning model, and the accuracy of howling suppression in the voice
frequency band is ensured.
[0114] To practice the foregoing solutions in the embodiments of the present disclosure,
related apparatuses for implementing the foregoing solutions are further provided
below. FIG. 14 is a schematic structural diagram of an audio signal processing apparatus
according to an embodiment of the present disclosure, and the audio signal processing
apparatus 1400 includes:
an obtaining unit 1401 configured to obtain a first audio input signal;
an input unit 1402 configured to input the first audio input signal into a machine
learning model to obtain a first gain value, the first gain value being used for indicating
a suppression parameter for a first howling point in the first audio input signal,
and the first howling point being used for indicating a howling point in a frequency
band corresponding to an effective audio signal;
a detection unit 1403 configured to process the first audio input signal according
to the first gain value to obtain a second audio input signal; and
a processing unit 1404 configured to process the second audio input signal according
to a second gain value to obtain an audio output signal, the second gain value being
used for indicating a suppression parameter for a second howling point, and the second
howling point being used for indicating a howling point in a frequency band corresponding
to an ineffective audio signal.
[0115] Optionally, in some possible implementations of the present disclosure, the input
unit 1402 is specifically configured to transform the audio input signal into frequency
domain, to extract a plurality of audio features, the audio features being determined
based on features of the effective audio signal or a howling sample.
[0116] The input unit 1402 is specifically configured to input the audio features into the
machine learning model to determine the first howling point.
[0117] The input unit 1402 is specifically configured to determine the first gain value
according to the first howling point.
[0118] The input unit 1402 is specifically configured to process the first audio input signal
according to the first gain value to obtain the second audio input signal.
[0119] The detection unit 1403 is configured to check the second audio input signal to obtain
the second howling point, and obtain the second gain value according to the second
howling point, the second howling point being used for indicating the howling point
in the frequency band corresponding to the ineffective audio signal in the second
audio input signal.
[0120] The processing unit 1404 is configured to process the second audio input signal according
to the second gain value to obtain the audio output signal.
[0121] Optionally, the input unit 1402 is specifically configured to transform the audio
input signal into frequency domain, to extract the plurality of audio features, the
audio features being determined based on features of the effective audio signal or
the howling sample.
[0122] The input unit 1402 is specifically configured to input the audio features into the
machine learning model to determine the first howling point.
[0123] The input unit 1402 is specifically configured to obtain the first gain value according
to the first howling point.
[0124] Optionally, in some possible implementations of the present disclosure, the input
unit 1402 is specifically configured to adjust the first audio input signal to a target
frequency for transformation into the frequency domain.
[0125] The input unit 1402 is specifically configured to determine a plurality of sampling
points in the first audio input signal transformed into the frequency domain.
[0126] The input unit 1402 is specifically configured to extract the plurality of audio
features based on the sampling points.
[0127] Optionally, in some possible implementations of the present disclosure, the input
unit 1402 is specifically configured to divide, based on a window function, the first
audio input signal transformed into the frequency domain to obtain a plurality of
sub-bands.
[0128] The input unit 1402 is specifically configured to determine the plurality of sampling
points in the sub-bands.
[0129] Optionally, in some possible implementations of the present disclosure, the detection
unit 1403 is specifically configured to obtain a power spectrum corresponding to the
second audio input signal.
[0130] The detection unit 1403 is specifically configured to detect an extreme value in
the power spectrum and determine a corresponding candidate frequency point.
[0131] The detection unit 1403 is specifically configured to determine the second howling
point according to the candidate frequency point.
[0132] The detection unit 1403 is specifically configured to process the second howling
point according to the second gain value to obtain the audio output signal.
[0133] Optionally, in some possible implementations of the present disclosure, the detection
unit 1403 is specifically configured to obtain a plurality of frequency points adjacent
to the candidate frequency point to determine a candidate range.
[0134] The detection unit 1403 is specifically configured to determine an average value
of the frequency points in the candidate range to obtain a peak-to-average ratio;
and
[0135] The detection unit 1403 is specifically configured to determine the candidate frequency
point as the second howling point in a case that the peak-to-average ratio is greater
than a howling threshold.
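The detection described in the three preceding paragraphs may be sketched in Python as follows. The sketch assumes that the candidate range includes the candidate frequency point itself, that the ratio is compared on a linear scale, and that the range half-width and the threshold take the illustrative values shown; the function name is hypothetical.

```python
def detect_howling_points(power, half_width=5, threshold=3.0):
    """Return indices of local power-spectrum peaks whose peak-to-average
    ratio exceeds the howling threshold (assumed linear ratio).

    power      -- per-bin power values of the second audio input signal
    half_width -- bins taken on each side of a candidate (assumed value)
    threshold  -- howling threshold on the linear ratio (assumed value)
    """
    points = []
    for k in range(1, len(power) - 1):
        # Candidate frequency point: a local extreme value (maximum).
        if not (power[k] > power[k - 1] and power[k] >= power[k + 1]):
            continue
        lo = max(0, k - half_width)
        hi = min(len(power), k + half_width + 1)
        average = sum(power[lo:hi]) / (hi - lo)
        if average > 0.0 and power[k] / average > threshold:
            points.append(k)
    return points
```

Because the candidate itself is included in the average, the ratio cannot exceed the number of frequency points in the candidate range, so the threshold must be chosen below that number.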
[0136] Optionally, in some possible implementations of the present disclosure, the detection
unit 1403 is further configured to obtain feature information in the effective audio
signal, the feature information being determined based on a waveform feature indicated
by the effective audio signal, and the effective audio signal being used for indicating
a voice sample.
[0137] The detection unit 1403 is specifically configured to detect the effective audio
signal in the second audio input signal according to the feature information.
[0138] The detection unit 1403 is specifically configured to perform a locking operation
on the effective audio signal, the locking operation being used for indicating an
object on which the second gain value does not act.
[0139] Optionally, in some possible implementations of the present disclosure, the processing
unit 1404 is specifically configured to determine a plurality of gained frames corresponding
to the second howling point.
The processing unit 1404 is specifically configured to process the gained frames according
to a smoothing formula to update the audio output signal.
[0140] Optionally, in some possible implementations of the present disclosure, the obtaining
unit 1401 is specifically configured to obtain an acquired signal;
the obtaining unit 1401 is specifically configured to convert the acquired signal
into a digital signal; and
the obtaining unit 1401 is specifically configured to input the digital signal into
an amplifier to obtain the first audio input signal.
[0141] Optionally, in some possible implementations of the present disclosure, the obtaining
unit 1401 is specifically configured to input the digital signal into the amplifier
to obtain an amplified signal;
the obtaining unit 1401 is specifically configured to process the amplified signal
according to a filter parameter to obtain a filtered amplified signal; and
the obtaining unit 1401 is specifically configured to Fourier transform the filtered
amplified signal into frequency domain to obtain the first audio input signal.
[0142] Optionally, in some possible implementations of the present disclosure, the audio
signal processing method is applied to a voice call in a game, and the obtaining unit
1401 is specifically configured to detect triggering of a feature element, the feature
element being an element in a game interface; and
the obtaining unit 1401 is specifically configured to obtain the first audio input
signal in a case that the feature element is triggered.
[0143] A first audio input signal is obtained; then the first audio input signal is inputted
into a machine learning model to obtain a first gain value for processing a frequency
band in which an effective audio signal is located; the first audio input signal is
processed according to the first gain value to obtain a second audio input signal;
next the second audio input signal is checked to obtain a second howling point, the
second howling point being used for indicating a howling point in a frequency band
corresponding to an ineffective audio signal in the second audio input signal; and
the second audio input signal is processed according to a second gain value to obtain
an audio output signal, the second gain value being used for indicating a suppression
parameter for the second howling point. In this way, the howling is suppressed for
the audio input signal at an initial stage and therefore will not loop with gains.
Due to indication of howling points in the machine learning model, the convenience
of model calculation, and further processing on the unprocessed howling point by using
the second gain value, a rapid and thorough process of the howling suppression is
implemented without affecting the effective audio signal, so that the accuracy and
efficiency in audio processing are improved.
[0144] The present disclosure further provides an apparatus 1500 for training a machine learning
model. FIG. 15 is a schematic structural diagram of an apparatus for training a machine
learning model according to an embodiment of the present disclosure. The apparatus
includes: an acquisition unit 1501, configured to acquire a reference signal and a
voice sample signal, the reference signal being a howling signal determined based
on at least two variable elements, the variable elements including a program category,
a program running period, or a program running location, and the acquired signal being
used for indicating an effective voice in a call process;
a generation unit 1502, configured to generate a feature training set according to
the reference signal and the acquired signal; and
a training unit 1503, configured to input the feature training set into a machine
learning model for at least one round of training, to obtain a trained machine learning
model, the trained machine learning model being used for determining a howling point
and gain value according to an audio input signal.
[0145] The embodiments of the present disclosure further provide a terminal device. FIG.
16 is a schematic structural diagram of another terminal device according to an embodiment
of the present disclosure. To facilitate the description, only parts related to the
embodiments of the present disclosure are shown. For specific technical details that
are not described, reference may be made to the method in the embodiments of the present
disclosure. The terminal may be any terminal device such as a mobile phone, a tablet
computer, a personal digital assistant (PDA), a point of sales (POS), an on-board
computer or the like. The terminal being a mobile phone is used as an example.
[0146] FIG. 16 is a partial structural diagram of a mobile phone related to a terminal according
to an embodiment of the present disclosure. Referring to FIG. 16, the mobile phone
includes: a radio frequency (RF) circuit 1610, a memory 1620, an input unit 1630,
a display unit 1640, a sensor 1650, an audio circuit 1660, a wireless fidelity (WiFi)
module 1670, a processor 1680, and a power supply 1690. A person skilled in the art
can understand that the structure of the mobile phone shown in FIG. 16 does not constitute
a limitation on the mobile phone, and the mobile phone may include more or fewer components
than those shown in the figure, or some components may be combined, or a different
component deployment may be used.
[0147] The following describes the components of the mobile phone in detail with reference
to FIG. 16.
[0148] The RF circuit 1610 may be configured to receive and send signals during an information
receiving and sending process or a call process. Specifically, the RF circuit 1610
receives downlink information from a base station, then delivers the downlink information
to the processor 1680 for processing, and sends designed uplink data to the base station.
Usually, the RF circuit 1610 includes, but is not limited to, an antenna, at least one
amplifier, a transceiver, a coupler, a low noise amplifier (LNA), and a duplexer.
In addition, the RF circuit 1610 may communicate with a network and another device
through wireless communication.
[0149] The memory 1620 may be configured to store software programs and modules. The processor
1680 runs the software programs and the modules stored in the memory 1620, to implement
various functional applications and data processing of the mobile phone.
[0150] The input unit 1630 may be configured to receive input digit or character information,
and generate a key signal input related to the user setting and function control of
the mobile phone. Specifically, the input unit 1630 may include a touch panel 1631
and another input device 1632.
[0151] The display unit 1640 may be configured to display information inputted by the user
or information provided for the user, and various menus of the mobile phone.
[0152] The mobile phone may further include at least one sensor 1650 such as an optical
sensor, a motion sensor, and other sensors.
[0153] The audio circuit 1660, a speaker 1661, and a microphone 1662 may provide audio interfaces
between the user and the mobile phone. The audio circuit 1660 may convert received
audio data into an electrical signal and transmit the electrical signal to the speaker
1661. The speaker 1661 converts the electrical signal into a sound signal for output.
On the other hand, the microphone 1662 converts an acquired sound signal into an electrical
signal. The audio circuit 1660 receives the electrical signal, converts the electrical
signal into audio data, and outputs the audio data to the processor 1680 for processing.
Then, the processor transmits the audio data to, for example, another mobile phone
through the RF circuit 1610, or outputs the audio data to the memory 1620 for further
processing.
[0154] WiFi is a short-range wireless transmission technology. Although FIG. 16 shows the
WiFi module 1670, it can be understood that the WiFi module is not a necessary component
of the mobile phone, and the WiFi module may be omitted as required without changing
the essence of the present disclosure.
[0155] The processor 1680 is a control center of the mobile phone, and is connected to various
parts of the mobile phone through various interfaces and lines. By running or executing
software programs and/or modules stored in the memory 1620, and invoking data stored
in the memory 1620, the processor 1680 performs various functions of the mobile phone and
performs data processing, thereby monitoring the mobile phone as a whole.
[0156] In the embodiments of the present disclosure, the processor 1680 included in the
terminal further has a function of performing each step of the above audio signal
processing method or the model training method.
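As an illustration of the kind of step the processor 1680 might execute, the check for the second howling point (detecting an extreme value in a power spectrum, forming a candidate range from adjacent frequency points, and comparing a peak-to-average ratio with a howling threshold) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiments: the function names, the number of adjacent points, the threshold value, the exclusion of the peak itself from the average, and the fixed attenuation used as a gain value are all hypothetical.

```python
import cmath
import math


def power_spectrum(frame):
    """Power of DFT bins 0..N/2 of a real frame (naive O(N^2) DFT, for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def find_howling_bins(spectrum, neighbors=4, howling_threshold=10.0):
    """Return indices of local maxima whose peak-to-average ratio exceeds the threshold."""
    howling = []
    for k in range(1, len(spectrum) - 1):
        peak = spectrum[k]
        # Candidate frequency point: a strict local maximum (extreme value).
        if not (peak > spectrum[k - 1] and peak > spectrum[k + 1]):
            continue
        # Candidate range (assumption): up to `neighbors` bins per side, peak excluded.
        lo = max(0, k - neighbors)
        hi = min(len(spectrum), k + neighbors + 1)
        adjacent = spectrum[lo:k] + spectrum[k + 1:hi]
        average = sum(adjacent) / len(adjacent)
        if average == 0 or peak / average > howling_threshold:
            howling.append(k)
    return howling


def bin_gains(num_bins, howling_bins, attenuation_db=20.0):
    """Per-bin linear gains: unity everywhere except a fixed cut at howling bins."""
    gains = [1.0] * num_bins
    for k in howling_bins:
        gains[k] = 10 ** (-attenuation_db / 20)  # hypothetical fixed suppression
    return gains
```

For example, `find_howling_bins(power_spectrum(frame))` would return the candidate bins of one frame, and `bin_gains` would turn them into multipliers for the corresponding spectral bins. A practical system would likely use a fast Fourier transform and a tuned threshold.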
[0157] The embodiments of the present disclosure further provide a server. FIG. 17 is a
schematic structural diagram of a server according to an embodiment of the present
disclosure. The server 1700 may vary greatly due to different configurations or performance,
and may include one or more central processing units (CPUs) 1722 and a memory 1732,
and one or more storage media 1730 that store application programs 1742 or data 1744.
[0158] The server 1700 may further include one or more power supplies 1726, one or more
wired or wireless network interfaces 1750, one or more input/output interfaces 1758,
and/or one or more operating systems 1741.
[0159] The steps performed by the model training apparatus in the foregoing embodiments
may be based on the server structure shown in FIG. 17.
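One way such a server might generate a feature training set from a howling reference signal and a voice sample signal is sketched below. It is only an illustrative assumption: the disclosure does not fix the feature type or the training target here, so the log band energies used as features, the per-band gain targets, and the names `band_energies` and `make_training_example` are hypothetical.

```python
import cmath
import math


def band_energies(frame, num_bands=4):
    """Split DFT bins 0..N/2-1 into equal-width bands and sum the power in each band."""
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        power.append(abs(acc) ** 2 / n)
    width = half // num_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(num_bands)]


def make_training_example(voice, howling, num_bands=4, eps=1e-12):
    """Pair features of the mixed frame with per-band target gains in [0, 1]."""
    # Mix the effective voice with the howling reference (assumed additive).
    mixed = [v + h for v, h in zip(voice, howling)]
    mixed_energy = band_energies(mixed, num_bands)
    voice_energy = band_energies(voice, num_bands)
    # Features: log energy of each band of the mixed signal.
    features = [math.log(e + eps) for e in mixed_energy]
    # Targets: gain that would scale each mixed band back to the voice-only level.
    targets = [min(1.0, math.sqrt(v / (m + eps)))
               for v, m in zip(voice_energy, mixed_energy)]
    return features, targets
```

Under these assumptions, a band dominated by voice receives a target gain near 1, while a band dominated by the howling reference receives a target gain near 0. A model trained on such pairs would therefore learn to suppress howling bands and leave voice bands largely untouched.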
[0160] The embodiments of the present disclosure further provide a computer-readable storage
medium having a computer program stored thereon, the computer program being configured
to implement the steps performed by the audio signal processing apparatus in the foregoing
method described in the embodiments of FIG. 2 to FIG. 13.
[0161] The embodiments of the present disclosure further provide a computer program product
including audio signal processing instructions, the computer program product, when
running on a computer, causing the computer to perform the steps performed by the
audio signal processing apparatus in the foregoing method described in the embodiments
of FIG. 2 to FIG. 13.
[0162] The embodiments of the present disclosure further provide an audio signal processing
system. The audio signal processing system may include the audio signal processing
apparatus described in the embodiment of FIG. 14 or the terminal device in FIG. 16.
[0163] A person skilled in the art can clearly understand that for convenience and conciseness
of description, for specific working processes of the foregoing systems, devices and
units, reference may be made to the corresponding processes in the foregoing method
embodiments, and details are not described herein again.
[0164] It should be understood that the system, apparatus, and method disclosed in several
embodiments of the present disclosure may be implemented in other manners.
[0165] When the integrated unit is implemented in the form of a software functional unit
and sold or used as an independent product, the integrated unit may be stored in a
computer-readable storage medium. Based on such an understanding, the technical solutions
of the present disclosure essentially, or the part contributing to the related art, or all
or some of the technical solutions, may be implemented in the form of a software product. The
computer software product is stored in a storage medium and includes several instructions
for instructing a computer device (which may be a personal computer, an audio signal
processing apparatus, a network device, or the like) to perform all or some of the
steps of the methods described in the embodiments of the present disclosure.
[0166] The foregoing embodiments are merely intended for describing the technical solutions
of the present disclosure, but not for limiting the present disclosure. Although the
present disclosure is described in detail with reference to the foregoing embodiments,
persons of ordinary skill in the art should understand that they may still make modifications
to the technical solutions described in the foregoing embodiments or make equivalent
replacements to some technical features thereof, without departing from the spirit
and scope of the technical solutions of the embodiments of the present disclosure.
CLAIMS
1. An audio signal processing method, performed by a terminal device, the method comprising:
obtaining a first audio input signal;
inputting the first audio input signal into a machine learning model to obtain a first
howling point, and obtaining a first gain value according to the first howling point,
the first howling point being used for indicating a howling point in a frequency band
corresponding to an effective audio signal in the first audio input signal, and the
first gain value being used for indicating a suppression parameter for the first howling
point;
processing the first audio input signal according to the first gain value to obtain
a second audio input signal;
checking the second audio input signal to obtain a second howling point, and obtaining
a second gain value according to the second howling point, the second howling point
being used for indicating a howling point in a frequency band corresponding to an
ineffective audio signal in the second audio input signal; and
processing the second audio input signal according to the second gain value to obtain
an audio output signal.
2. The method according to claim 1, the inputting the first audio input signal into a
machine learning model to obtain a first howling point, and obtaining a first gain
value according to the first howling point comprising:
transforming the first audio input signal into frequency domain to extract a plurality
of audio features, the audio features being determined based on features of the effective
audio signal;
inputting a first audio feature into the machine learning model to determine the first
howling point; and
obtaining the first gain value according to the first howling point.
3. The method according to claim 2, the transforming the first audio input signal into
frequency domain to extract a plurality of audio features comprising:
adjusting the first audio input signal to a target frequency for transformation into
the frequency domain;
determining a plurality of sampling points in the first audio input signal transformed
into the frequency domain; and
extracting the plurality of audio features based on the sampling points.
4. The method according to claim 3, the determining a plurality of sampling points in
the first audio input signal transformed into the frequency domain comprising:
dividing, based on a window function, the first audio input signal transformed into
the frequency domain to obtain a plurality of sub-bands; and
determining the plurality of sampling points in the sub-bands.
5. The method according to claim 1, the processing the second audio input signal according
to the second gain value to obtain an audio output signal comprising:
obtaining a power spectrum corresponding to the second audio input signal;
detecting an extreme value in the power spectrum, and determining a candidate frequency
point corresponding to the extreme value;
determining the second howling point according to the candidate frequency point; and
processing the second howling point according to the second gain value to obtain the
audio output signal.
6. The method according to claim 5, the determining the second howling point according
to the candidate frequency point comprising:
obtaining a plurality of frequency points adjacent to the candidate frequency point
to determine a candidate range;
determining an average frequency value of the frequency points in the candidate range
to obtain a peak-to-average ratio; and
determining the candidate frequency point as the second howling point in a case that
the peak-to-average ratio is greater than a howling threshold.
7. The method according to claim 5, after the determining the second howling point according
to the candidate frequency point, further comprising:
obtaining feature information in the effective audio signal, the feature information
being determined based on a waveform feature indicated by the effective audio signal;
detecting the corresponding effective audio signal in the second audio input signal
according to the feature information; and
performing a lock operation on the effective audio signal, the lock operation being
used for indicating an object on which the second gain value does not act.
8. The method according to claim 5, further comprising:
determining a plurality of gained frames corresponding to the second howling point;
and
processing the gained frames according to a smoothing formula to update the audio
output signal.
9. The method according to any one of claims 1 to 8, the audio signal processing method
being applied to a voice call in a game, and the obtaining a first audio input signal
comprising:
detecting triggering of a feature element, the feature element being an element in
an interface of the game; and
obtaining the first audio input signal in a case that the feature element is triggered.
10. The method according to any one of claims 1 to 8, the machine learning model being
a recurrent neural network model, and the first audio input signal and the audio output
signal being applied in a voice call of the terminal device.
11. A method for training a machine learning model, performed by an audio processing device,
the method comprising:
acquiring a reference signal and a voice sample signal, the reference signal being
a howling signal determined based on at least two variable elements, the variable
elements comprising a program category, a program running period, or a program running
location, and the voice sample signal being used for indicating an effective voice in
a call;
generating a feature training set according to the reference signal and the voice sample
signal; and
inputting the feature training set into the machine learning model for performing
at least one round of training to obtain a trained machine learning model, the trained
machine learning model being used for determining a corresponding howling point and
gain value according to an audio input signal.
12. An audio signal processing apparatus comprising:
an obtaining unit configured to obtain a first audio input signal;
an input unit configured to input the first audio input signal into a machine learning
model to obtain a first howling point; to obtain a first gain value according to the
first howling point, the first howling point being used for indicating a howling point
in a frequency band corresponding to an effective audio signal in the first audio
input signal, and the first gain value being used for indicating a suppression parameter
for the first howling point; and to process the first audio input signal according
to the first gain value to obtain a second audio input signal;
a detection unit configured to check the second audio input signal to obtain a second
howling point, and to obtain a second gain value according to the second howling point,
the second howling point being used for indicating a howling point in a frequency
band corresponding to an ineffective audio signal in the second audio input signal;
and
a processing unit, configured to process the second audio input signal according to
the second gain value to obtain an audio output signal.
13. An apparatus for training a machine learning model, the apparatus comprising:
an acquisition unit configured to acquire a reference signal and a voice sample signal,
the reference signal being a howling signal determined based on at least two variable
elements, the variable elements comprising a program category, a program running period,
or a program running location, and the voice sample signal being used for indicating an
effective voice in a call;
a generation unit configured to generate a feature training set according to the reference
signal and the voice sample signal; and
a training unit configured to input the feature training set into the machine learning
model for performing at least one round of training to obtain a trained machine learning
model, the trained machine learning model being used for determining a corresponding
howling point and gain value according to an audio input signal.
14. A computer device comprising a processor and a memory,
the memory being configured to store program codes; and the processor being configured
to perform, according to instructions in the program codes, the audio signal processing
method according to any one of claims 1 to 10, or the method for training the machine
learning model according to claim 11.
15. A computer-readable storage medium storing a computer program, the computer program
being configured to implement the audio signal processing method according to any
one of claims 1 to 10, or the method for training the machine learning model according
to claim 11.
16. A computer program product comprising instructions, the computer program product,
when running on a computer, causing the computer to perform the audio signal processing
method according to any one of claims 1 to 10, or the method for training the machine
learning model according to claim 11.