TECHNICAL FIELD
[0001] The disclosure relates to the field of communication technologies, and in particular to
an audio signal encoding method, an audio signal encoding apparatus, an electronic
device and a storage medium.
BACKGROUND
[0002] In the related art, after an audio signal is acquired, the audio signal is subjected
to a uniform encoding process. This uniform encoding process does not take different
encoding rates into account, even though the number of bits available for each audio
channel differs across encoding rates. As a result, the number of bits available for
each audio channel may exceed or fall short of the number of bits necessary for encoding,
which leads either to a waste of bits or to an inability to provide remote users with
an audio service that matches the encoding rate. This is an urgent problem to be solved.
SUMMARY
[0003] Embodiments of the disclosure provide an audio signal encoding method, an audio signal
encoding apparatus, an electronic device and a storage medium, to encode the audio
signal according to a number of audio channels and an encoding rate, which may make
full use of bits available during the encoding process, avoid a waste of bits, and
provide audio services that match the encoding rate for remote users.
[0004] According to a first aspect of embodiments of the disclosure, an audio signal encoding
method is provided. The method includes: obtaining a scene-based audio signal; determining
a number of audio channels of the audio signal and an encoding rate; and generating
an encoded codestream by encoding the audio signal according to the number of audio
channels and the encoding rate.
[0005] In this technical solution, a scene-based audio signal is obtained, and a number
of audio channels of the audio signal and an encoding rate are determined, and then
an encoded codestream is generated by encoding the audio signal according to the number
of audio channels and the encoding rate. The audio signal is thus encoded according
to the number of audio channels and the encoding rate, and during the encoding process,
the bits available may be fully utilized, avoiding a waste of bits, and providing
audio services that match the encoding rate for remote users.
[0006] In some embodiments, generating the encoded codestream by encoding the audio signal
according to the number of audio channels and the encoding rate, includes: performing
a down-mixed processing on the audio signal according to the number of audio channels
and the encoding rate to generate a down-mixed parameter and a down-mixed audio channel
signal; encoding the down-mixed audio channel signal to generate an encoding parameter;
and generating the encoded codestream by performing codestream multiplexing on the
down-mixed parameter and the encoding parameter.
[0007] In some embodiments, performing the down-mixed processing on the audio signal according
to the number of audio channels and the encoding rate to generate the down-mixed parameter
and the down-mixed audio channel signal, includes: determining a target control parameter
for the audio signal according to the number of audio channels and the encoding rate;
determining a down-mixed processing algorithm according to the target control parameter;
and performing the down-mixed processing on the audio signal according to the down-mixed
processing algorithm to generate the down-mixed parameter and the down-mixed audio
channel signal.
[0008] In some embodiments, determining the target control parameter for the audio signal
according to the number of audio channels and the encoding rate, includes: calculating
an initial average rate of each channel according to the number of audio channels
and the encoding rate; determining a target average rate according to the initial
average rate and a preset average rate threshold; and determining the target control
parameter for the audio signal according to the initial average rate and the target
average rate.
[0009] In some embodiments, before encoding the audio signal, the method further includes:
performing a pre-emphasis preprocessing and/or a high-pass filtering preprocessing
on the audio signal.
[0010] According to a second aspect of embodiments of the disclosure, an audio signal encoding
apparatus is provided. The apparatus includes: a signal obtaining unit, configured
to obtain a scene-based audio signal; an information determining unit, configured
to determine a number of audio channels of the audio signal and an encoding rate;
and an encoding processing unit, configured to generate an encoded codestream by encoding
the audio signal according to the number of audio channels and the encoding rate.
[0011] In some embodiments, the encoding processing unit includes: a down-mixed processing
module, configured to perform a down-mixed processing on the audio signal according
to the number of audio channels and the encoding rate to generate a down-mixed parameter
and a down-mixed audio channel signal; a parameter generating module, configured to
encode the down-mixed audio channel signal to generate an encoding parameter; and
a codestream generating module, configured to generate the encoded codestream by performing
codestream multiplexing on the down-mixed parameter and the encoding parameter.
[0012] In some embodiments, the down-mixed processing module includes: a parameter determining
sub-module, configured to determine a target control parameter for the audio signal
according to the number of audio channels and the encoding rate; an algorithm determining
sub-module, configured to determine a down-mixed processing algorithm according to
the target control parameter; and a down-mixed processing sub-module, configured to
perform the down-mixed processing on the audio signal according to the down-mixed
processing algorithm to generate the down-mixed parameter and the down-mixed audio
channel signal.
[0013] In some embodiments, the parameter determining sub-module is further configured to:
calculate an initial average rate of each channel according to the number of audio
channels and the encoding rate; determine a target average rate according to the initial
average rate and a preset average rate threshold; and determine the target control
parameter for the audio signal according to the initial average rate and the target
average rate.
[0014] In some embodiments, the apparatus further includes: a preprocessing unit, configured
to perform a pre-emphasis preprocessing and/or a high-pass filtering preprocessing
on the audio signal.
[0015] According to a third aspect of embodiments of the disclosure, an electronic device
is provided. The electronic device includes at least one processor, and a memory communicatively
connected to the at least one processor. The memory stores instructions executable
by the at least one processor, and the instructions are executed by the at least one
processor to cause the at least one processor to execute the method described in the
first aspect.
[0016] According to a fourth aspect of embodiments of the disclosure, a non-transitory computer-readable
storage medium having computer instructions stored thereon is provided. The computer
instructions are configured to cause a computer to execute the method described in
the first aspect.
[0017] According to a fifth aspect of embodiments of the disclosure, a computer program
product including computer instructions is provided. When the computer instructions
are executed by a processor, the method described in the first aspect is implemented.
[0018] It is understood that both the foregoing general description and the following detailed
description are exemplary and explanatory only and do not limit the disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] In order to clearly illustrate technical solutions of embodiments of the disclosure
or background technologies, a description of drawings used in the embodiments or the
background technologies is given below.
FIG. 1 is a flowchart of an audio signal encoding method according to an embodiment
of the disclosure.
FIG. 2 is a schematic diagram of the coordinate system of an audio signal in a First-Order
Ambisonics (FOA) format according to an embodiment of the disclosure.
FIG. 3 is a flowchart of another audio signal encoding method according to an embodiment
of the disclosure.
FIG. 4 is a flowchart of an audio signal encoding method in the related art according
to an embodiment of the disclosure.
FIG. 5 is a flowchart of yet another audio signal encoding method according to an
embodiment of the disclosure.
FIG. 6 is a flowchart of substeps of step S30 in the audio signal encoding method
according to an embodiment of the disclosure.
FIG. 7 is a flowchart of substeps of step S301 in the audio signal encoding method
according to an embodiment of the disclosure.
FIG. 8 is a structural diagram of an audio signal encoding apparatus according to
an embodiment of the disclosure.
FIG. 9 is a structural diagram of an encoding processing unit in an audio signal encoding
apparatus according to an embodiment of the disclosure.
FIG. 10 is a structural diagram of a down-mixed processing module in an audio signal
encoding apparatus according to an embodiment of the disclosure.
FIG. 11 is a structural diagram of another audio signal encoding apparatus according
to an embodiment of the disclosure.
FIG. 12 is a structural diagram of an electronic device according to an embodiment
of the disclosure.
DETAILED DESCRIPTION
[0020] In order to make those skilled in the art better understand the technical solutions
of this disclosure, the technical solutions in the embodiments of this disclosure will
be described clearly and completely in combination with the accompanying drawings.
[0021] Unless the context indicates otherwise, throughout the specification and the claims,
the term "comprise" is interpreted as open, inclusive, i.e., "includes, but is not
limited to". In the description of the specification, "some embodiments" is intended
to indicate that certain features, structures, materials or characteristics related
to the embodiments or examples are included in at least one embodiment or example
of the disclosure. The schematic representation of the above term does not necessarily
refer to the same embodiment or example. Furthermore, the features, structures, materials,
or characteristics described above may be included in any one or more embodiments
or examples in any suitable manner.
[0022] It should be noted that the terms "first" and "second" in the specification and the
claims of this disclosure and the drawings are used to distinguish similar objects,
and are not necessarily used to describe a specific order or sequence. The terms "first"
and "second" are only used for descriptive purposes, and cannot be understood as indicating
or implying relative importance or implicitly indicating the number of indicated technical
features. Therefore, the features defined with the terms "first" and "second" may
explicitly or implicitly include one or more of these features. It should be understood
that data so used may be interchanged under appropriate circumstances, so that the
embodiments of the disclosure described herein may be implemented in other orders
than those illustrated or described herein. The implementations described in the following
exemplary embodiments do not represent all implementations consistent with the disclosure.
Rather, they are merely examples of devices and methods consistent with some aspects
of the disclosure as detailed in the appended claims.
[0023] The term "at least one" in the disclosure may also be described as one or more, and
the term "multiple" may be two, three, four, or more, which is not limited in the
disclosure. In the embodiments of the disclosure, for a type of technical features,
"first", "second", and "third", and "A", "B", "C" and "D" are used to distinguish
different technical features of the type, the technical features described using the
"first", "second", and "third", and "A", "B", "C" and "D" do not indicate any order
of precedence or magnitude.
[0024] The correspondences shown in the tables in this disclosure may be configured or may
be predefined. The values of information in the tables are merely examples and may
be configured to other values, which are not limited by the disclosure. In configuring
the correspondence between the information and the parameter, it is not necessarily
required that all the correspondences illustrated in the tables must be configured.
For example, the correspondences illustrated in certain rows in the tables in this
disclosure may not be configured. For another example, the above tables may be adjusted
appropriately, such as splitting, combining, and the like. The names of the parameters
shown in the titles of the above tables may be other names that may be understood
by the communication device, and the values or representations of the parameters may
be other values or representations that may be understood by the communication device.
Each of the above tables may also be implemented with other data structures, such
as, arrays, queues, containers, stacks, linear tables, pointers, chained lists, trees,
graphs, structures, classes, heaps, and Hash tables.
[0025] Those skilled in the art may realize that the units and algorithmic steps of the
various examples described in combination with the embodiments disclosed herein are
capable of being implemented in the form of electronic hardware, or a combination
of computer software and electronic hardware. Whether these functions are performed
in the form of hardware or software depends on the specific application and design
constraints of the technical solution. Those skilled in the art may use different
methods to implement the described functions for each particular application, but
such implementations should not be considered as beyond the scope of the disclosure.
[0026] The first generation (1G) mobile communication technology is the first generation
wireless cellular technology, which belongs to an analog mobile communication network.
When 1G is upgraded to 2G, a mobile phone switches from an analog communication to
a digital communication, and a global system for mobile communication (GSM) network
standard is adopted. The voice encoder adopts an adaptive multi-rate (AMR) narrowband
speech codec, an enhanced full rate (EFR) codec, a full rate (FR) codec, and a half
rate (HR) codec, and the communication provides a single-channel narrowband voice service. A 3G mobile
communication system was proposed by the International Telecommunication Union (ITU)
for international mobile communications in 2000, which can adopt Time Division-Synchronous
Code Division Multiple Access (TD-SCDMA), Code Division Multiple Access 2000 (CDMA2000),
or Wideband Code Division Multiple Access (WCDMA), and the voice encoder of which
adopts an adaptive multi-rate wideband (AMR-WB) to provide a single-channel broadband
voice service. 4G is improved based on the 3G technology. Both data and voice are
transmitted in an all-IP manner, providing a real-time high definition (HD) voice
service. The enhanced voice service (EVS) codec adopted by 4G can balance high-quality
compression of both voice and audio.
[0027] The voice and audio communication services provided above have expanded from narrowband
signals to ultra-wideband or even full-band services, but they are all single-audio-channel
services. With the increasing demand for high-quality audio, stereo audio, compared
with single-audio-channel audio, provides a sense of orientation and distribution for
each sound source, and can improve clarity.
[0028] With an increase of transmission bandwidth, an upgrade of a terminal device signal
collection device, an improvement of performance of a signal processor, and an upgrade
of a terminal playback device, three signal formats, namely audio channel-based multi-channel
audio signals, object-based audio signals, and scene-based audio signals, can provide
three-dimensional audio services. An immersive voice and audio service (IVAS) codec
that is being standardized by the 3rd Generation Partnership Project (3GPP) SA4 can
support encoding and decoding requirements of the above three signal formats. Terminal
devices that can support 3D audio services include a mobile phone, a computer, a Pad,
a conference system device, an augmented reality/virtual reality (AR/VR) device, a
vehicle, etc.
[0029] A First-Order Ambisonics/High-Order Ambisonics (FOA/HOA) signal is a main scene-based
audio signal. The FOA/HOA signal represents audio information collected at a certain
position in an audio scene and is an immersive audio format whose audio quality gradually
gets better with the increase of order. Different Ambisonics orders represent different
numbers of audio signal components. That is, for an N-order Ambisonics signal, the
number of Ambisonics coefficients is (N+1)*(N+1).
Table 1: the relationship between the Ambisonics signal order and the Ambisonics coefficient

Ambisonics order | Ambisonics coefficient/number of audio channels
0 | 1
1 | 4
2 | 9
3 | 16
4 | 25
5 | 36
6 | 49
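For illustration, the (N+1)*(N+1) relationship of Table 1 can be sketched in a few lines of Python; the function name below is illustrative and not part of the disclosure:

```python
# Number of Ambisonics coefficients (audio channels) for an N-order signal,
# following the (N+1)*(N+1) relationship described above.
def ambisonics_channels(order: int) -> int:
    return (order + 1) * (order + 1)

# Reproduces Table 1: orders 0..6 give 1, 4, 9, 16, 25, 36, 49 channels.
for order in range(7):
    print(order, ambisonics_channels(order))
```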
[0030] As shown in Table 1, the number of audio channels of Ambisonics increases rapidly
with the increase of order. Correspondingly, an amount of encoded data also increases
rapidly, as well as an encoding complexity. Meanwhile, due to the limitation of encoding
rate, an encoding performance is greatly reduced. In order to reduce the encoding
complexity, it is necessary to perform a down-mixed processing on an input initial
audio channel. After the down-mixed process, the number of audio channels decreases,
and the encoding complexity is reduced, so as to achieve a balance between the encoding
complexity and the encoding performance.
[0031] In response to the problem of the waste of bits or the inability to provide the audio
services that match the encoding rate for a remote user in the related art, the embodiment
of the disclosure provides an audio signal encoding method and an audio signal encoding
apparatus to solve the problems existing in the related art at least to some extent,
so as to make full use of the available bits, provide the audio services that match
the encoding rate for the remote user, and improve the user experience.
[0032] As illustrated in FIG. 1, FIG. 1 is a flowchart of an audio signal encoding method
according to an embodiment of the disclosure.
[0033] As illustrated in FIG. 1, the method includes but is not limited to the following
steps.
[0034] At step S1, a scene-based audio signal is obtained.
[0035] It is understood that when a local user establishes a voice communication with any
remote user, the local user can establish the voice communication with a terminal device
of that remote user through a terminal device of the local user. The terminal device
of the local user may obtain sound information of the environment where the local user
is located in real time and obtain the scene-based audio signal.
[0036] The sound information of the environment where the local user is located includes
sound information made by the local user and sound information of surrounding things.
The sound information of surrounding things may be, for example, sound information
of vehicle driving, sound information of birds, sound information of wind, and sound
information of other users around the local user, and so on.
[0037] It should be noted that the terminal device is an entity on a user side for receiving
or transmitting signals. For example, the terminal device may be a mobile phone, a
computer, a Pad, a watch, an interphone, a conference system device, an augmented
reality/virtual reality (AR/VR) device, a vehicle, etc. The terminal device may also
be referred to as a user equipment (UE), a mobile station (MS), a mobile terminal
(MT), and the like. The terminal device may be a vehicle with communication functions,
a smart vehicle, a mobile phone, a wearable device, a Pad, a computer with wireless
transceiver functions, a VR terminal device, an AR terminal device, a wireless terminal
device in industrial control, a wireless terminal device in self-driving, a wireless
terminal device in remote medical surgery, a wireless terminal device in smart grid,
a wireless terminal device in transportation safety, a wireless terminal device in
smart city, a wireless terminal device in smart home, etc. The specific technology
and specific device form adopted by the terminal device are not limited in embodiments
of the disclosure.
[0038] In the embodiment of the disclosure, when acquiring the scene-based audio signal,
the terminal device of the local user can acquire the sound information of the environment
where the local user is located via a recording apparatus, such as a microphone, arranged
in the terminal device or cooperating with the terminal device, and then generate
the scene-based audio signal.
[0039] In the embodiment of the disclosure, the scene-based audio signal may be an audio
signal in a FOA format or an audio signal in a HOA format.
[0040] At step S2, a number of audio channels of the audio signal and an encoding rate are
determined.
[0041] In the embodiment of the disclosure, after obtaining the scene-based audio signal,
the number of audio channels of the audio signal and the encoding rate are determined.
[0042] For example, as illustrated in FIG. 2, in a case where the scene-based audio signal
is an audio signal in the FOA format, it is determined that the number of audio channels
of the audio signal is 4, which may be represented by W, X, Y and Z, in which W represents
a component containing all sounds in all directions in a sound field superimposed
with the same gain and phase, X represents a component in a front-back direction in
the sound field, Y represents a component in a left-right direction in the sound field,
and Z represents a component in an up-down direction in the sound field. It is further
determined that the selected encoding rate is 96kbps.
[0043] At step S3, an encoded codestream is generated by encoding the audio signal according
to the number of audio channels and the encoding rate.
[0044] In the embodiment of the disclosure, the scene-based audio signal is obtained, the
number of audio channels of the audio signal and the encoding rate are determined,
and the encoded codestream is generated by encoding the audio signal according to
the number of audio channels and the encoding rate.
[0045] When encoding the audio signal according to the number of audio channels and the
encoding rate, the encoding rate of each audio channel may be determined according
to the number of audio channels and the encoding rate. For example, an average encoding
rate of each audio channel, the maximum encoding rate of each audio channel, or the
encoding rate of each audio channel may be determined. The average encoding rate of
each audio channel may be determined by dividing the encoding rate by the number of
audio channels, the maximum encoding rate of each audio channel is equal to the encoding
rate, and the encoding rate of each audio channel is the encoding rate.
[0046] On the basis of determining the encoding rate of each audio channel, the number of bits
available for each audio channel at different encoding rates may be taken into account,
so that the bits available are able to be fully utilized during the encoding process,
to avoid the waste of bits and provide the audio services matching the encoding rate
for the remote user. The generated encoded codestream is able to provide clear, stable
and understandable audio services when the encoding rate is low, and is able to provide
high-definition, stable and immersive audio services when the encoding rate is high.
In this way, the remote user can be provided with the audio services matching the encoding
rate, thus improving the user experience.
[0047] In some embodiments, before encoding the audio signal, the method further includes:
performing a pre-emphasis preprocessing and/or a high-pass filtering preprocessing
on the audio signal.
[0048] In the embodiment of the disclosure, in a case where the scene-based audio signal
is obtained and the number of audio channels of the audio signal and the encoding
rate are determined, the pre-emphasis preprocessing may be performed on the audio
signal, which may enhance a high-frequency portion of the audio information and increase
a high-frequency resolution of the audio information.
[0049] In the embodiment of the disclosure, in a case where the scene-based audio signal
is obtained and the number of audio channels of the audio signal and the encoding
rate are determined, the high-pass filtering preprocessing may be performed on the
audio signal, to filter signal components in the audio signal lower than a certain
frequency threshold. A starting frequency in the high-pass filtering processing may
be set as required, for example, the starting frequency may be set as 20Hz.
[0050] After performing the high-pass filtering preprocessing on the audio signal, an audio
signal component of the required encoding frequency band may be obtained. When the
audio signal is encoded, an influence of an ultra-low frequency signal on encoding
processing effects may be avoided.
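As a minimal sketch of this preprocessing, the following Python example applies a first-order pre-emphasis filter and a Butterworth high-pass filter with the 20Hz starting frequency mentioned above; the pre-emphasis coefficient (0.97) and the filter order (4) are common illustrative choices, not values specified by the disclosure:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def pre_emphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    # First-order pre-emphasis y[n] = x[n] - alpha * x[n-1], boosting the
    # high-frequency portion of the signal; alpha = 0.97 is an assumption.
    return np.append(x[0], x[1:] - alpha * x[:-1])

def high_pass(x: np.ndarray, fs: float, cutoff: float = 20.0) -> np.ndarray:
    # Butterworth high-pass with the 20 Hz starting frequency from the text;
    # the filter order (4) is an illustrative assumption.
    sos = butter(4, cutoff, btype="highpass", fs=fs, output="sos")
    return sosfilt(sos, x)

# Preprocess each audio channel of a (channels, samples) FOA frame.
fs = 48000
frame = np.random.randn(4, fs // 50)  # 4 channels, one 20 ms frame
frame = np.stack([high_pass(pre_emphasis(ch), fs) for ch in frame])
```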
[0051] By implementing the embodiments of the disclosure, the scene-based audio signal is
obtained, the number of audio channels of the audio signal and the encoding rate are
determined, and the encoded codestream is generated by encoding the audio signal according
to the number of audio channels and the encoding rate. In this way, the audio signal
is encoded according to the number of audio channels and the encoding rate, and the
bits available are able to be fully utilized during the encoding process, so that
the waste of bits may be avoided, and the audio services that match the encoding rate
may be provided for the remote user.
[0052] FIG. 3 is a flowchart of another audio signal encoding method according to an embodiment
of the disclosure.
[0053] As illustrated in FIG. 3, the method includes but is not limited to the following
steps.
[0054] At step S10, a scene-based audio signal is obtained.
[0055] At step S20, a number of audio channels of the audio signal and an encoding rate
are determined.
[0056] In the embodiment of the disclosure, the related descriptions of steps S10 and S20
can be referred to the related descriptions in the above embodiments, and the same
contents will not be repeated here.
[0057] At step S30, a down-mixed processing is performed on the audio signal according to
the number of audio channels and the encoding rate, to generate a down-mixed parameter
and a down-mixed audio channel signal.
[0058] At step S40, the down-mixed audio channel signal is encoded to generate an encoding
parameter.
[0059] At step S50, the encoded codestream is generated by performing codestream multiplexing
on the down-mixed parameter and the encoding parameter.
[0060] In the embodiment of the disclosure, the scene-based audio signal is obtained, the
number of audio channels of the audio signal and the encoding rate are determined,
and the encoded codestream is generated by encoding the audio signal according to
the number of audio channels and the encoding rate. Encoding the audio signal according
to the number of audio channels and the encoding rate may include performing the down-mixed
processing on the audio signal according to the number of audio channels and the encoding
rate to generate the down-mixed parameter and the down-mixed audio channel signal.
The down-mixed audio channel signal is then encoded to generate the encoding parameter.
The encoded codestream is generated by codestream multiplexing according to the down-mixed
parameter and the encoding parameter.
[0061] As illustrated in FIG. 4, in the related art, after an audio signal (an audio signal
in the FOA format or an audio signal in the HOA format) is acquired, the audio signal
is subjected to a uniform down-mixed process, and the number of audio channels after
the down-mixed process is less than the initial number of audio channels. All the remaining
channels are encoded by a core encoder, and down-mixed parameters generated by the down-mixed
processing and output parameters of the core encoder are subjected to codestream multiplexing
to output the encoded codestream.
[0062] The uniform down-mixed processing of the audio signal does not consider that the
number of bits available for each audio channel is different under different encoding
rates, resulting in a mismatch between the number of audio channels after the down-mixed
processing and the number of audio channels that the core encoder is able to encode.
Therefore, when the number of audio channels after the down-mixed processing is much
less than the number of input audio channels, better audio services cannot be provided
to remote users at a high encoding rate (because the number of bits available for each
audio channel exceeds the number of bits necessary for encoding, which may lead to the
waste of bits). When the number of audio channels after the down-mixed processing differs
only slightly from the number of input audio channels, the remote users cannot be provided
with audio services that match the encoding rate at a low encoding rate (because the
number of bits available for each audio channel is much less than the number of bits
necessary for encoding, which may lead to a poor encoding quality of each audio channel).
[0063] However, as illustrated in FIG. 5, in the embodiment of the disclosure, a scene-based
audio signal (an audio signal in the FOA format or an audio signal in the HOA format)
is input to an encoder end, the encoder end may determine the number of audio channels
of the audio signal and the encoding rate and input the encoding rate, the number
of audio channels and the audio signal to a pattern analysis module, or the encoder
end may perform a high-pass filtering preprocessing on the audio signal and then input
the preprocessed audio signal into the pattern analysis module.
[0064] The pattern analysis module may output a control parameter according to the selected
encoding rate and the number of audio channels, and use the control parameter to guide
a down-mixed processing module to select a corresponding down-mixed processing algorithm.
The down-mixed processing module outputs a down-mixed parameter and a down-mixed audio
channel signal after processing the audio signal. An encoding parameter is output
after encoding the down-mixed audio channel signal by the core encoder. The encoding
parameter and the down-mixed parameter are input to a codestream multiplexer to output
an encoded codestream.
[0065] In the embodiment of the disclosure, when the input scene-based audio signal is the
audio signal in the FOA format/the audio signal in the HOA format, a matching down-mixed
processing algorithm is adaptively selected according to the number of audio channels
of the input audio signal and the number of bits available, so that the number of
audio channels after the down-mixed processing matches the number of audio channels
that may be encoded by the core encoder at this encoding rate, and a full (optimal)
utilization of bits available may be achieved. That is, at a low rate, it may ensure
the provision of clear, stable and understandable audio services, and at a high rate,
it may ensure the provision of high-definition, stable immersive audio services, which
may improve the user experience.
[0066] In the embodiment of the disclosure, after the encoder outputs the encoded codestream,
the encoded codestream may be sent to a decoder end for decoding, so that the remote
terminals may obtain sound information transmitted by the local terminal.
[0067] As illustrated in FIG. 6, in some embodiments, step S30 of performing the down-mixed
processing on the audio signal according to the number of audio channels and the encoding
rate to generate the down-mixed parameter and the down-mixed audio channel signal,
includes the following steps.
[0068] At step S301, a target control parameter for the audio signal is determined according
to the number of audio channels and the encoding rate.
[0069] In the embodiment of the disclosure, when performing the down-mixed processing on
the audio signal according to the number of audio channels and the encoding rate,
the target control parameter for the audio signal may be determined according to the
number of audio channels and the encoding rate.
[0070] When determining the target control parameter for the audio signal according to the
number of audio channels and the encoding rate, the encoding rate of each audio channel
may be determined according to the number of audio channels and the encoding rate.
For example, an average encoding rate of each audio channel, the maximum encoding
rate of each audio channel, or the encoding rate of each audio channel may be determined.
The average encoding rate of each audio channel is determined by dividing the encoding
rate by the number of audio channels, the maximum encoding rate of each audio channel
is equal to the encoding rate, and the encoding rate of each channel is the encoding
rate.
[0071] In the embodiment of the disclosure, on the basis of determining the encoding rate
of each audio channel according to the number of audio channels and the encoding rate,
the target control parameter for the audio signal is determined according to the encoding
rate of each audio channel.
[0072] Certainly, when determining the target control parameter for the audio signal according
to the number of audio channels and the encoding rate, corresponding relationships among
the number of audio channels, the encoding rate and the control parameter may be preset,
so that once the number of audio channels of the audio signal and the encoding rate
are determined, the target control parameter for the audio signal may be determined.
[0073] Alternatively, a target number of audio channels may be determined according to the
number of audio channels and the encoding rate, and then the target control parameter
for the audio signal may be determined according to the target number of audio channels.
[0074] The target number of audio channels is determined according to the number of audio
channels and the encoding rate. For example, N thresholds of the average encoding
rate are preset, where N is a positive integer, and N+1 threshold ranges are determined
by the N thresholds. Different threshold ranges are set to correspond to different
numbers of audio channels after the down-mixed process. On this basis, an initial average
encoding rate is calculated according to the number of audio channels and the encoding
rate, and the target number of audio channels may be determined according to the threshold
range to which the initial average rate belongs, and then the target control parameter
for the audio signal is determined according to the target number of audio channels.
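A minimal sketch of this selection is given below, assuming the boundary conventions of the FOA example described later (inclusive at the lowest threshold and in the top range); the function names are illustrative:

```python
def range_index(avg_rate_kbps: float, thresholds_kbps: list) -> int:
    # N ascending thresholds define N+1 average rate ranges.
    if avg_rate_kbps <= thresholds_kbps[0]:
        return 0
    for i, threshold in enumerate(thresholds_kbps[1:], start=1):
        if avg_rate_kbps < threshold:
            return i
    return len(thresholds_kbps)

def target_channel_count(num_channels: int, rate_kbps: float,
                         thresholds_kbps: list,
                         channels_per_range: list) -> int:
    # channels_per_range holds one preset output channel count per range.
    initial_average = rate_kbps / num_channels
    return channels_per_range[range_index(initial_average, thresholds_kbps)]
```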
[0075] It is understood that in a case where the encoding rate and the number of audio channels
after the down-mixed processing are known, an average rate that is able to be allocated
to each audio channel after the down-mixed processing may be obtained, and the target
control parameter for the audio signal may be determined according to the target number
of audio channels and/or the average rate that is able to be allocated to each audio
channel after the down-mixed processing.
[0076] When determining the target control parameter of the audio signal according to the
target number of audio channels and/or the average rate that is able to be allocated
to each audio channel after the down-mixed processing, corresponding relationships
between the target number of audio channels and/or the average rate that is able to
be allocated to each audio channel after the down-mixed processing and the control
parameter may be preset, and the target control parameter for the audio signal may
be determined according to the target number of audio channels and/or the average
rate that is able to be allocated to each audio channel after the down-mixed processing.
[0077] At step S302, a down-mixed processing algorithm is determined according to the target
control parameter.
[0078] In the embodiment of the disclosure, in a case where the target control parameter
for the audio signal is determined according to the number of audio channels and the
encoding rate, the down-mixed processing algorithm may be determined according to
the target control parameter. Determining the down-mixed processing algorithm may
be determining the down-mixed processing algorithm corresponding to each audio channel,
and the determined down-mixed processing algorithms for different channels may be
the same or different.
[0079] At step S303, the down-mixed processing is performed on the audio signal according
to the down-mixed processing algorithm, to generate the down-mixed parameter and the
down-mixed audio channel signal.
[0080] In the embodiment of the disclosure, in a case where the down-mixed processing algorithm
corresponding to each audio channel is determined, the down-mixed processing may be
performed on the audio signal according to the down-mixed processing algorithm to generate
the down-mixed parameter and the down-mixed audio channel signal.
[0081] As illustrated in FIG. 7, in some embodiments, step S301 of determining the target
control parameter for the audio signal according to the number of audio channels and
the encoding rate, includes the following steps.
[0082] At step S3011, an initial average rate of each audio channel is calculated according
to the number of audio channels and the encoding rate.
[0083] At step S3012, a target average rate is determined according to the initial average
rate and a preset average rate threshold.
[0084] At step S3013, the target control parameter for the audio signal is determined according
to the initial average rate and the target average rate.
[0085] According to the number of audio channels and the encoding rate, the initial average
rate of each audio channel may be calculated by dividing the encoding rate by the
number of audio channels. For example, if the number of audio channels is 4 and the
encoding rate is 96kbps, the initial average rate of each audio channel is calculated
to be 24kbps according to the number of audio channels and the encoding rate.
[0086] In the embodiment of the disclosure, in a case where the initial average rate of
each audio channel is calculated, the target average rate may be determined according
to the initial average rate and the preset average rate threshold.
[0087] The preset average rate threshold may be set according to the scene-based audio signal.
For example, a first average rate threshold Thres1 is set to 13.2kbps, and a second
average rate threshold Thres2 is set to 32kbps. According to the above two average
rate thresholds, ranges corresponding to the average rate is divided into three average
rate ranges, as follows,
an average rate range 1: less than or equal to 13.2kbps;
an average rate range 2: greater than 13.2 kbps and less than 32 kbps; and
an average rate range 3: greater than or equal to 32 kbps.
[0088] In the embodiment of the disclosure, the target average rate is determined according
to the initial average rate and the preset average rate threshold. After the average
rate threshold ranges are determined according to the average rate thresholds, a corresponding
number of output audio channels is set for each average rate threshold range, so that
the corresponding target number of output audio channels may be determined according
to the average rate threshold range to which the initial average rate belongs.
[0089] On this basis, in a case where the target number of output audio channels is determined,
the target average rate may be calculated according to the target number of output
audio channels and the encoding rate.
[0090] For example, the number of output audio channels corresponding to the average rate
range 1 is 2, the number of output audio channels corresponding to the average rate
range 2 is 3, and the number of output audio channels corresponding to the average
rate range 3 is 4. If the initial average rate is 24kbps and belongs to the average
rate range 2, it is determined that the target number of output audio channels is
3, and the target average rate may be calculated to be 96kbps/3=32kbps. It may be
seen that the target average rate in the average rate range 2 is increased compared
with the initial average rate, so that the appropriate target control parameter may
be determined when determining the target control parameter for the audio signal in
subsequent processes, and the down-mixed processing algorithm may be determined according
to the target control parameter. Therefore, the number of output audio channels after
the down-mixed processing matches the number of audio channels that may be encoded
by the core encoder at this encoding rate, and the optimal use of available bits may
be achieved. That is, at a low rate, it may ensure the provision of clear, stable
and understandable audio services, and at a high rate, it may ensure the provision
of high-definition, stable immersive audio services, which may improve the user experience.
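The worked example above can be summarized in the following sketch, which hard-codes the example thresholds (13.2kbps and 32kbps) and the example output channel counts (2/3/4) for a FOA input; these values are taken from the example and are not normative:

```python
def pattern_analysis_foa(rate_kbps: float) -> tuple:
    # FOA input: 4 audio channels (W, X, Y, Z).
    num_channels = 4
    initial_average = rate_kbps / num_channels
    if initial_average <= 13.2:        # average rate range 1
        target_channels = 2
    elif initial_average < 32.0:       # average rate range 2
        target_channels = 3
    else:                              # average rate range 3
        target_channels = 4
    target_average = rate_kbps / target_channels
    return target_channels, target_average

# 96 kbps / 4 channels = 24 kbps (range 2) -> 3 channels at 96/3 = 32 kbps.
print(pattern_analysis_foa(96.0))      # (3, 32.0)
```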
[0091] In the embodiment of the disclosure, for the three average rate ranges, three different
types of down-mixed processing algorithms may be selected for scene-based audio signals.
For the average rate range 1 and the average rate range 2, the average rate available
for each audio channel is increased after the selected down-mixed processing. For the
average rate range 3, no down-mixed processing is performed because the encoding rate
is sufficient, that is, an input signal is directly used as an output signal of the
down-mixed processing, which means that the average rate available for each audio channel
after the down-mixed processing remains unchanged.
[0092] For example, Table 2 shows some kinds of scene-based audio signals, initial average
rates (average rates that may be allocated to each audio channel initially), preset
average rate thresholds, as well as corresponding numbers of output audio channels
(numbers of audio channels after the down-mixed processing) and determined target
average rates (average rates that may be allocated to each audio channel after the
down-mixed processing).
[0093] As can be seen from Table 2 below, the average rate that may be allocated to each
audio channel after the down-mixed processing is greater than or equal to an average
number of bits available for each audio channel, which may make full use of the available
bits, avoid the waste of bits, and provide audio services that match the encoding
rate for the remote users.
Table 2

Scene-based audio signal | Number of audio channels | Coding rate (kbps) | Initial average rate (kbps) that may be allocated to each audio channel | Number of audio channels after the down-mixed processing | Average rate (kbps) that may be allocated to each audio channel after the down-mixed processing
FOA | 4 | less than or equal to 52.8 | less than or equal to 13.2 | 2 | less than or equal to 26.4
FOA | 4 | greater than 52.8 and less than 128 | greater than 13.2 and less than 32 | 3 | greater than 52.8/3 and less than 128/3
FOA | 4 | greater than or equal to 128 | greater than or equal to 32 | 4 | greater than or equal to 32
HOA2 | 9 | less than or equal to 118.8 | less than or equal to 13.2 | 5 | less than or equal to 118.8/5
HOA2 | 9 | greater than 118.8 and less than 288 | greater than 13.2 and less than 32 | 7 | greater than 118.8/7 and less than 288/7
HOA2 | 9 | greater than or equal to 288 | greater than or equal to 32 | 9 | greater than or equal to 32
HOA3 | 16 | less than or equal to 211.2 | less than or equal to 13.2 | 8 | less than or equal to 26.4
HOA3 | 16 | greater than 211.2 and less than 512 | greater than 13.2 and less than 32 | 12 | greater than 211.2/12 and less than 512/12
HOA3 | 16 | greater than or equal to 512 | greater than or equal to 32 | 16 | greater than or equal to 32
HOA4 | 25 | less than or equal to 330 | less than or equal to 13.2 | 14 | less than or equal to 330/14
HOA4 | 25 | greater than 330 and less than 800 | greater than 13.2 and less than 32 | 20 | greater than 330/20 and less than 800/20
HOA4 | 25 | greater than or equal to 800 | greater than or equal to 32 | 25 | greater than or equal to 32
[0094] It is understood that each element in Table 2 exists independently, and these elements
are listed in the same table by way of example, but it does not mean that all the
elements in the table must exist simultaneously as shown in the table. The value of
each element is independent of any other element value in Table 2. Therefore, it is
understood by those skilled in the art that the value of each element in Table 2 is
an independent embodiment.
[0095] In the embodiment of the disclosure, a target average rate is determined according
to an initial average rate and a preset average rate threshold. In addition to the
above-mentioned exemplary method, an average rate threshold closest to the initial
average rate may be determined as the target average rate, or the initial average
rate may be directly determined as the target average rate, or an average rate threshold,
among average rate thresholds greater than the initial average rate, closest to the
initial average rate may be determined as the target average rate, which is not specifically
limited in the embodiment of the disclosure.
[0096] In the embodiment of the disclosure, after the target average rate is determined,
when determining a target control parameter for an audio signal according to the initial
average rate and the target average rate, corresponding relationships between the
initial average rate and the target average rate and the control parameter may be
preset. For example, corresponding relationships between the initial average rate
and the target average rate and the control parameter are set, or corresponding relationships
between the control parameter and a difference between the initial average rate and
the target average rate are set, or corresponding relationships between the control
parameter and an absolute value of a difference between the initial average rate and
the target average rate are set, or corresponding relationships between the control
parameter and a sum of the initial average rate and the target average rate are set, etc.,
which is not specifically limited in the embodiment of the disclosure.
[0097] A down-mixed processing algorithm designs a down-mixed conversion matrix according
to a target number of output audio channels and the number of audio channels of the
acquired scene-based audio signal. For example, if the number of audio channels is N
and the target number of output audio channels is M, the conversion matrix is M*N,
where both N and M are positive integers, and M is less than or equal to N.
[0098] The conversion matrix M*N satisfies the following equation:

[M*1] = [M*N] × [N*1]

where [M*1] represents a matrix of M times 1, [M*N] represents a matrix of M times
N, and [N*1] represents a matrix of N times 1.
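In code, the equation amounts to a matrix multiplication applied column by column; the sketch below assumes the signal is stored as an N-row array of samples, and the function name is illustrative:

```python
import numpy as np

def down_mix(frames: np.ndarray, conversion: np.ndarray) -> np.ndarray:
    # frames: (N, num_samples) input signal; conversion: (M, N) matrix.
    # Each column of frames is an [N*1] vector mapped to an [M*1] vector.
    M, N = conversion.shape
    assert frames.shape[0] == N and M <= N
    return conversion @ frames  # shape (M, num_samples)
```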
[0099] For convenience of understanding, the embodiment of the disclosure provides an exemplary
embodiment.
[0100] In an exemplary embodiment, the scene-based audio signal obtained is an audio signal
in a FOA format, the number of audio channels is 4, namely, W, X, Y, Z, and the selected
encoding rate is 96kbps. After the down-mixed processing, the target number of output
audio channels is 3, where W represents a component containing all sounds in all directions
in a sound field superimposed with the same gain and phase, X represents a component
in a front-back direction in the sound field, Y represents a component in a left-right
direction in the sound field, and Z represents a component in an up-down direction
in the sound field. The schematic diagram of the coordinate system is shown in FIG. 2.
[0101] If the target number of audio channels is 3 after the down-mixed processing, the
component Z in the up-down direction is omitted, and only three channel components,
W, X and Y are reserved. This strategy takes two aspects into consideration. Firstly,
when reconstructing the sound field, the listener at a playback end is sensitive to
the component in the front-back direction and the component in the left-right direction,
but less sensitive to the component in the up-down direction. Secondly, there are
fewer sound sources for the component in the up-down direction in the sound field
of a general audio scene. After the down-mixed processing, the number of audio channels
is 3, and the average encoding rate that may be allocated to each audio channel is
96kbps/3=32kbps. The encoding core may encode and reconstruct high-quality audio signals
at this average encoding rate, and provide high-definition, stable and immersive
audio services to remote users.
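Under the strategy just described, the 3*4 conversion matrix simply selects the W, X and Y components and drops Z; the matrix below is an illustrative instance of that strategy, not the only possible down-mixed conversion matrix:

```python
import numpy as np

# Illustrative 3*4 conversion matrix: keep W, X, Y and omit the up-down
# component Z, with the channel order (W, X, Y, Z) used in the text.
D = np.array([
    [1.0, 0.0, 0.0, 0.0],  # W: omnidirectional component
    [0.0, 1.0, 0.0, 0.0],  # X: front-back component
    [0.0, 0.0, 1.0, 0.0],  # Y: left-right component
])

foa_frame = np.random.randn(4, 960)  # one 20 ms FOA frame at 48 kHz
down_mixed = D @ foa_frame           # shape (3, 960): W, X, Y only
```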
[0102] FIG. 8 is a structural diagram of an audio signal encoding apparatus provided by
an embodiment of the disclosure.
[0103] As illustrated in FIG. 8, the audio signal encoding apparatus 1 includes: a signal
obtaining unit 11, an information determining unit 12, and an encoding processing
unit 13.
[0104] The signal obtaining unit 11 is configured to obtain a scene-based audio signal.
[0105] The information determining unit 12 is configured to determine a number of audio
channels of the audio signal and an encoding rate.
[0106] The encoding processing unit 13 is configured to generate an encoded codestream by
encoding the audio signal according to the number of audio channels and the encoding
rate.
[0107] By implementing the embodiment of the disclosure, the signal obtaining unit 11 obtains
a scene-based audio signal, the information determining unit 12 determines a number
of audio channels of the audio signal and an encoding rate, and the encoding processing
unit 13 generates an encoded codestream by encoding the audio signal according to the
number of audio channels and the encoding rate. In this way, the audio signal is encoded
according to the number of audio channels and the encoding rate, and the bits available
may be fully utilized during the encoding process, so that the waste of bits may be
avoided, and audio services that match the encoding rate may be provided for remote
users.
[0108] As illustrated in FIG. 9, in some embodiments, the encoding processing unit 13 includes:
a down-mixed processing module 131, a parameter generating module 132, and a codestream
generating module 133.
[0109] The down-mixed processing module 131 is configured to perform a down-mixed processing
on the audio signal according to the number of audio channels and the encoding rate
to generate a down-mixed parameter and a down-mixed audio channel signal.
[0110] The parameter generating module 132 is configured to encode the down-mixed audio
channel signal to generate an encoding parameter.
[0111] The codestream generating module 133 is configured to generate the encoded codestream
by performing codestream multiplexing on the down-mixed parameter and the encoding
parameter.
[0112] As illustrated in FIG. 10, in some embodiments, the down-mixed processing module
131 includes: a parameter determining sub-module 1311, an algorithm determining sub-module
1312, and a down-mixed processing sub-module 1313.
[0113] The parameter determining sub-module 1311 is configured to determine a target control
parameter for the audio signal according to the number of audio channels and the encoding
rate.
[0114] The algorithm determining sub-module 1312 is configured to determine a down-mixed
processing algorithm according to the target control parameter.
[0115] The down-mixed processing sub-module 1313 is configured to perform the down-mixed
processing on the audio signal according to the down-mixed processing algorithm to
generate the down-mixed parameter and the down-mixed audio channel signal.
[0116] In some embodiments, the parameter determining sub-module 1311 is further configured
to:
calculate an initial average rate of each channel according to the number of audio
channels and the encoding rate;
determine a target average rate according to the initial average rate and a preset
average rate threshold; and
determine the target control parameter for the audio signal according to the initial
average rate and the target average rate.
[0117] As illustrated in FIG. 11, in some embodiments, the audio signal encoding apparatus
1 further includes: a preprocessing unit 14.
[0118] The preprocessing unit 14 is configured to perform a pre-emphasis preprocessing and/or
a high-pass filtering preprocessing on the audio signal.
[0119] With regard to the apparatus in the above embodiments, the specific way in which
each module performs operations has been described in detail in the embodiments of
the method, and will not be described in detail here.
[0120] The audio signal encoding apparatus according to the embodiment of the disclosure
may execute the audio signal encoding methods as described in some of the above embodiments,
and its beneficial effects are the same as those of the above audio signal encoding
methods, which are not repeated here.
[0121] FIG. 12 is a structural diagram of an electronic device 100 for performing an audio
signal encoding method illustrated by an exemplary embodiment.
[0122] For example, the electronic device 100 may be a mobile phone, a computer, a digital
broadcasting terminal, a message transceiver device, a game console, a tablet device,
a medical device, a fitness device or a personal digital assistant.
[0123] As illustrated in FIG. 12, the electronic device 100 may include one or more of the
following components: a processing component 101, a memory 102, a power component
103, a multimedia component 104, an audio component 105, an input/output (I/O) interface
106, a sensor component 107, and a communication component 108.
[0124] The processing component 101 typically controls overall operations of the electronic
device 100, such as the operations associated with display, telephone calls, data
communications, camera operations, and recording operations. The processing component
101 may include one or more processors 1011 to perform all or part of the steps in
the above described methods. Moreover, the processing component 101 may include one
or more modules which facilitate the interaction between the processing component
101 and other components. For example, the processing component 101 may include a
multimedia module to facilitate the interaction between the multimedia component 104
and the processing component 101.
[0125] The memory 102 is configured to store various types of data to support the operation
of the electronic device 100. Examples of such data include instructions for any applications
or methods operated on the electronic device 100, contact data, phonebook data, messages,
pictures, video, etc. The memory 102 may be implemented using any type of volatile
or non-volatile memory devices, or a combination thereof, such as a Static Random-Access
Memory (SRAM), an Electrically-Erasable Programmable Read Only Memory (EEPROM), an
Erasable Programmable Read Only Memory (EPROM), a Programmable Read Only Memory (PROM),
a Read Only Memory (ROM), a magnetic memory, a flash memory, a magnetic or optical
disk.
[0126] The power component 103 provides power to various components of the electronic device
100. The power component 103 may include a power management system, one or more power
sources, and any other components associated with the generation, management, and
distribution of power in the electronic device 100.
[0127] The multimedia component 104 includes a touch screen providing an output interface
between the electronic device 100 and the user. In some embodiments, the screen may
include a Liquid Crystal Display (LCD) and a Touch Panel (TP). The touch panel includes
one or more touch sensors to sense touches, swipes, and gestures on the touch panel.
The touch sensor may not only sense a boundary of a touch or swipe action, but also
sense a duration and a pressure associated with the touch or swipe action.
In some embodiments, the multimedia component 104 includes a front-facing camera and/or
a rear-facing camera. When the electronic device 100 is in an operating mode, such
as a shooting mode or a video mode, the front-facing camera and/or the rear-facing
camera can receive external multimedia data. Each front-facing camera and rear-facing
camera may be a fixed optical lens system or have focal length and optical zoom capability.
[0128] The audio component 105 is configured to output and/or input audio signals. For example,
the audio component 105 includes a microphone (MIC) configured to receive an external
audio signal when the electronic device 100 is in an operation mode, such as a call
mode, a recording mode, and a voice recognition mode. The received audio signal may
be further stored in the memory 102 or transmitted via the communication component
108. In some embodiments, the audio component 105 further includes a speaker to output
audio signals.
[0129] The I/O interface 106 provides an interface between the processing component 101
and peripheral interface modules, such as a keyboard, a click wheel, buttons, and
the like. The buttons may include, but are not limited to, a home button, a volume
button, a starting button, and a locking button.
[0130] The sensor component 107 includes one or more sensors to provide status assessments
of various aspects of the electronic device 100. For instance, the sensor component
107 may detect an open/closed status of the electronic device 100, relative positioning
of components, e.g., the display and the keypad, of the electronic device 100, a change
in position of the electronic device 100 or a component of the electronic device 100,
a presence or absence of user contact with the electronic device 100, an orientation
or an acceleration/deceleration of the electronic device 100, and a change in temperature
of the electronic device 100. The sensor component 107 may include a proximity sensor
configured to detect the presence of nearby objects without any physical contact.
The sensor component 107 may also include a light sensor, such as a Complementary
Metal Oxide Semiconductor (CMOS) or Charge-Coupled Device (CCD) image sensor, for
use in imaging applications. In some embodiments, the sensor component 107 may also
include an accelerometer sensor, a gyroscope sensor, a magnetic sensor, a pressure
sensor, or a temperature sensor.
[0131] The communication component 108 is configured to facilitate communication, wired
or wirelessly, between the electronic device 100 and other devices. The electronic
device 100 can access a wireless network based on a communication standard, such as
Wi-Fi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication
component 108 receives a broadcast signal or broadcast associated information from
an external broadcast management system via a broadcast channel. In an exemplary embodiment,
the communication component 108 further includes a Near Field Communication (NFC)
module to facilitate short-range communication. For example, the NFC module may be
implemented based on a Radio Frequency Identification (RFID) technology, an Infrared
Data Association (IrDA) technology, an Ultra-Wide Band (UWB) technology, a Bluetooth
(BT) technology, and other technologies.
[0132] In some exemplary embodiments, the electronic device 100 may be implemented with
one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors
(DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs),
Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors
or other electronic components, for performing the above described methods. It should
be noted that the implementation process and technical principle of the electronic
device in this embodiment can be referred to the above explanation of the audio signal
encoding method in the embodiment of the disclosure, and will not be repeated here.
[0133] The electronic device 100 provided by the embodiment of the disclosure may execute
the audio signal encoding method as described in some of the above embodiments, and
its beneficial effects are the same as those of the above audio signal encoding method,
which will not be repeated here.
[0134] In order to realize the above embodiments, the disclosure also provides a storage
medium.
[0135] When the instructions stored in the storage medium are executed by the processor
of the electronic device, the electronic device is caused to perform the audio signal
encoding method described above. For example, the storage medium may be a ROM, a Random
Access Memory (RAM), Compact Disc-ROM (CD-ROM), a magnetic tape, a floppy disk, an
optical data storage device, etc.
[0136] In order to realize the above embodiments, the disclosure also provides a computer
program product. When the computer program is executed by the processor of the electronic
device, the electronic device is caused to perform the audio signal encoding method
as described above.
[0137] Other embodiments of the disclosure will be apparent to those skilled in the art
from consideration of the specification and practice of the disclosure disclosed here.
This application is intended to cover any variations, uses, or adaptations of the
disclosure following the general principles thereof and including such departures
from the disclosure as come within known or customary practice in the art. It is intended
that the specification and examples are considered as illustrative only, with a true
scope and spirit of the disclosure being indicated by the following claims.
[0138] It is clearly understood by those skilled in the art that for the convenience and
conciseness of description, the specific working processes of the systems, devices
and units described above can be referred to the corresponding processes in the aforementioned
method embodiments, and will not be repeated here.
[0139] The above descriptions are specific implementations of the disclosure, but the protection
scope of the disclosure is not limited thereto. Any technician familiar with the technical
field can easily think of changes or substitutions within the technical scope disclosed
in the disclosure, which should be included in the protection scope of the disclosure.
Therefore, the protection scope of the disclosure should be based on the protection
scope of the attached claims.