Field
[0001] The present application relates to apparatus and methods for encoding multi-source
inputs for discontinuous transmission operation, but not exclusively to encoding multi-source
inputs for discontinuous transmission operation within an immersive or spatial audio codec.
Background
[0002] Immersive audio codecs are being implemented supporting a multitude of operating
points ranging from a low bit rate operation to transparency. An example of such a
codec is the Immersive Voice and Audio Services (IVAS) codec which is being designed
to be suitable for use over a communications network such as a 3GPP 4G/5G network
including use in such immersive services as for example immersive voice and audio
for virtual reality (VR). This audio codec is expected to handle the encoding, decoding
and rendering of speech, music and generic audio. It is furthermore expected to support
channel-based audio and scene-based audio inputs including spatial information about
the sound field and sound sources. The codec is also expected to operate with low
latency to enable conversational services as well as support high error robustness
under various transmission conditions.
[0003] Voice Activity Detection (VAD), also known as speech activity detection or more generally
as signal activity detection, is a technique used in various speech processing algorithms,
most notably speech codecs, for detecting the presence or absence of human speech.
It can be generalized to the detection of an active signal, i.e., a sound source other than
background noise. Based on a VAD decision, it is possible to utilize, e.g., a certain
encoding mode in a speech encoder.
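As an illustration of the VAD principle, a minimal energy-threshold detector might look as follows (a sketch under simplified assumptions; the function name, the 6 dB margin and the externally tracked noise floor are illustrative, and practical codec VADs add spectral features, tonality measures and hangover logic):

```python
import numpy as np

def vad_decision(frame: np.ndarray, noise_floor: float, margin_db: float = 6.0) -> bool:
    """Return True when the frame is judged active (speech or other sound source).

    Minimal energy-threshold VAD: the frame is active when its mean energy
    exceeds the tracked background noise floor by `margin_db` decibels.
    """
    frame_energy = float(np.mean(frame ** 2)) + 1e-12
    threshold = noise_floor * (10.0 ** (margin_db / 10.0))
    return frame_energy > threshold
```

Based on such a per-frame decision, an encoder can then select, e.g., a dedicated inactive-signal encoding mode.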
[0004] Discontinuous Transmission (DTX) is a technique utilizing VAD intended to temporarily
shut off parts of active signal processing (such as speech coding according to certain
modes) and the frame-by-frame transmission of encoded audio. For example, rather than
transmitting normal encoded frames, infrequent simplified update frames are sent to
drive a comfort noise generator (CNG) at the decoder. The use of DTX can help with
reducing interference and/or preserving/reallocating capacity in a practical mobile
network. Furthermore, the use of DTX can also help to extend the battery life of the
device, e.g., by turning off the radio when not transmitting.
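A simple per-frame DTX transmission decision built on the VAD output can be sketched as follows (illustrative only; the frame-type names and the fixed 8-frame SID interval are assumptions, not the behaviour of any particular codec):

```python
def dtx_frame_type(active: bool, frames_since_sid: int, sid_interval: int = 8):
    """Classify what to transmit for one frame under a simple DTX scheme.

    Active frames are fully coded; during inactivity a SID update is sent
    every `sid_interval` frames and nothing (NO_DATA) otherwise.
    Returns (frame_type, updated_frames_since_sid).
    """
    if active:
        # Arrange for a SID update on the first inactive frame after a talk spurt.
        return "SPEECH", sid_interval
    if frames_since_sid >= sid_interval:
        return "SID_UPDATE", 1
    return "NO_DATA", frames_since_sid + 1
```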
[0005] Comfort Noise Generation (CNG) is a technique for creating a synthetic background
noise at the decoder to fill silence periods that would otherwise be observed. For
example comfort noise generation can be implemented under a DTX operation.
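The decoder-side comfort noise can, in its simplest form, be sketched as white noise scaled to the level signalled by the encoder (the function and its parameters are illustrative; practical CNG additionally shapes the noise spectrum, e.g. using parameters carried in the SID frame):

```python
import numpy as np

def comfort_noise(frame_len: int, noise_level: float, rng=None) -> np.ndarray:
    """Generate one frame of synthetic comfort noise at a target mean energy.

    White Gaussian noise is scaled so that mean(frame**2) matches the
    `noise_level` signalled in the most recent SID frame.
    """
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(frame_len)
    noise *= np.sqrt(noise_level / (np.mean(noise ** 2) + 1e-12))
    return noise
```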
[0006] Silence Descriptor (SID) frames can be sent during speech inactivity to keep the
receiver CNG reasonably well aligned with the background noise level at the sender side.
This can be of particular importance at the onset of each new talk spurt. Thus, SID
frames should not be too old when speech starts again. Commonly SID frames are sent
regularly, e.g. every 8th frame, but some codecs also allow variable-rate SID updates.
SID frames are typically quite small, e.g. a 2.4 kbit/s SID bitrate equals 48 bits
per 20 ms frame.
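The SID frame size follows directly from the SID bitrate and the frame duration; assuming the 20 ms frames common in speech codecs, the arithmetic can be sketched as (function name illustrative):

```python
def sid_bits_per_frame(sid_bitrate_bps: float, frame_ms: float = 20.0) -> int:
    """Bits available for one SID frame at a given SID bitrate.

    For example, 2400 bit/s with 20 ms frames gives 48 bits per SID frame.
    """
    return int(round(sid_bitrate_bps * frame_ms / 1000.0))
```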
Summary
[0007] There is provided according to a first aspect an apparatus comprising means configured
to: obtain two or more separate audio signal inputs; determine signal activity within
the two or more separate audio signal inputs; determine a coding mode from three or
more coding modes, wherein at least one of the three or more coding modes comprises
at least one adaptive discontinuous transmission coding mode and the coding mode is
determined based on the signal activity within the two or more separate audio signal
inputs; and encode the two or more audio signal inputs based on the determined coding
mode.
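One possible mapping from per-source signal activity to a choice among three such coding modes could be sketched as follows; the mode names and the decision rule are purely illustrative assumptions, not the claimed decision logic:

```python
def select_coding_mode(activities: list) -> str:
    """Pick a coding mode from per-source VAD decisions (hypothetical mapping).

    All sources active   -> "DTX_OFF"      (DTX assistance gains nothing)
    All sources inactive -> "DTX_ON"       (all sources coded as SID updates)
    Mixed activity       -> "DTX_ADAPTIVE" (per-source adaptive DTX)
    """
    if all(activities):
        return "DTX_OFF"
    if not any(activities):
        return "DTX_ON"
    return "DTX_ADAPTIVE"
```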
[0008] The at least one adaptive discontinuous transmission coding mode may comprise at
least one externally visible adaptive discontinuous transmission coding mode, and
the means configured to encode the two or more audio signal inputs based on the determined
coding mode when the coding mode is an externally visible adaptive discontinuous transmission
coding mode may be configured to adaptively encode the two or more audio signal inputs
with discontinuous transmission encoding assistance, the discontinuous transmission
encoding assistance configured to encode inactive signal activity within the two or
more separate audio signal inputs as silence descriptor elements.
[0009] The means configured to encode the two or more audio signal inputs based on the determined
coding mode when the coding mode is an externally visible adaptive discontinuous transmission
coding mode may be configured to adaptively encode one of the two or more audio signal
inputs based on signal activity within the one of the two or more separate audio signal
inputs and at least one of: signal activity within another one of the two or more
separate audio signal inputs; a determined output bit rate and an encoding rate of
the others of the two or more separate audio signal inputs, such that a combined bit
rate for encoding the two or more audio signal inputs is kept constant; a determined
output bit rate and an encoding rate of the others of the two or more separate audio
signal inputs, such that a combined bit rate for encoding the two or more audio signal
inputs is variable; and a number of encoded channels to be output from the encoded
others of the two or more audio signal inputs.
[0010] The means configured to encode the two or more audio signal inputs based on the determined
coding mode when the coding mode is an externally visible adaptive discontinuous transmission
coding mode may be configured to control at least one of: a number of encoded channels
to be output from the encoded one of the two or more audio signal inputs; and a bit
rate of the encoded one of the two or more audio signal inputs.
[0011] The means configured to control the number of encoded channels to be output from
the encoded one of the two or more audio signal inputs may be configured to control
the number of encoded channels to be output from the encoded one of the two or more
audio signal inputs such that a total number of channels output from encoding all
of the two or more audio signal inputs is constant.
[0012] The at least one adaptive discontinuous transmission coding mode may comprise at
least one externally invisible adaptive discontinuous transmission coding mode, and
the means configured to encode the two or more audio signal inputs based on the determined
coding mode when the coding mode is an externally invisible adaptive discontinuous
transmission coding mode may be configured to adaptively encode the two or more audio
signal inputs with discontinuous transmission encoding assistance, the discontinuous
transmission encoding assistance may be configured to encode inactive signal activity
within the two or more separate audio signal inputs as silence descriptor elements,
but maintain a constant number of output channels and/or a constant output bitrate.
[0013] The means configured to encode the two or more audio signal inputs based on the determined
coding mode when the coding mode is an externally invisible adaptive discontinuous
transmission coding mode may be configured to adaptively encode one of the two or
more audio signal inputs based on signal activity within the one of the two or more
separate audio signal inputs and at least one of: signal activity within another one
of the two or more separate audio signal inputs; a determined output bit rate and
an encoding rate of the others of the two or more separate audio signal inputs, such
that a combined bit rate for encoding the two or more audio signal inputs is kept
constant; a determined output bit rate and an encoding rate of the others of the two
or more separate audio signal inputs, such that a combined bit rate for encoding the
two or more audio signal inputs is variable; and a number of encoded channels to be
output from the encoded others of the two or more audio signal inputs.
[0014] The means configured to encode the two or more audio signal inputs based on the determined
coding mode when the discontinuous transmission coding mode is an externally invisible
adaptive discontinuous transmission coding mode may be configured to control at least
one of: a number of encoded channels to be output from the encoded one of the two
or more audio signal inputs to maintain a constant number of output channels; and
a bit rate of the encoded one of the two or more audio signal inputs to maintain a
constant output bitrate.
[0015] The means configured to control the bit rate of the encoded one of the two or more
audio signal inputs to maintain a constant output bitrate may be configured to apply
zero padding to the encoded audio signal inputs.
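The zero-padding option can be sketched as padding each encoded frame up to a fixed size, so that the output bitrate seen externally never changes (a sketch with an assumed byte-aligned frame; the function name is illustrative):

```python
def pad_to_constant_rate(payload: bytes, target_bits: int) -> bytes:
    """Zero-pad an encoded frame so the externally visible bitrate stays constant.

    When a source is coded as a small SID element, the remaining bits of
    the fixed-size frame are filled with zeros, keeping the total frame
    size (and hence the output bitrate) unchanged.
    """
    target_bytes = target_bits // 8
    if len(payload) > target_bytes:
        raise ValueError("payload exceeds target frame size")
    return payload + bytes(target_bytes - len(payload))
```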
[0016] The means configured to control the bit rate of the encoded one of the two or more
audio signal inputs to maintain a constant output bitrate may be configured to apply
an adaptive discontinuous transmission coding mode for the one of the two or more
audio signal inputs as a single source, resulting in a maximum of a single encoded
audio signal input implementing discontinuous transmission encoding assistance.
[0017] The means configured to control the bit rate of the encoded one of the two or more
audio signal inputs to maintain a constant output bitrate may be configured to apply
an adaptive discontinuous transmission coding mode for bit rate allocation and transport
signal selection for the encoded one of the two or more audio signal inputs.
[0018] The three or more coding modes may further comprise an off mode wherein the means
configured to encode the two or more audio signal inputs based on the determined coding
mode may be configured to encode the two or more audio signal inputs without any discontinuous
transmission encoding assistance.
[0019] The three or more coding modes may further comprise an on mode wherein the means
configured to encode the two or more audio signal inputs based on the determined coding
mode may be configured to encode the two or more audio signal inputs with discontinuous
transmission encoding assistance, the discontinuous transmission encoding assistance
configured to encode inactive signal activity within the two or more separate audio
signal inputs as silence descriptor elements.
[0020] The three or more coding modes may further comprise an on mode wherein the means
configured to encode the two or more audio signal inputs based on the determined coding
mode may be configured to encode the two or more audio signal inputs with discontinuous
transmission encoding assistance, the discontinuous transmission encoding assistance
configured to adaptively individually encode one or more of the two or more separate
audio signal inputs with discontinuous transmission encoding assistance, the discontinuous
transmission encoding assistance configured to encode inactive signal activity within
the one or more of the two or more separate audio signal inputs as silence descriptor
elements and others of the two or more audio signal inputs without any discontinuous
transmission encoding assistance.
[0021] According to a second aspect there is provided a method comprising: obtaining two or
more separate audio signal inputs; determining signal activity within the two or more
separate audio signal inputs; determining a coding mode from three or more coding
modes, wherein at least one of the three or more coding modes comprises at least one
adaptive discontinuous transmission coding mode and the coding mode is determined
based on the signal activity within the two or more separate audio signal inputs;
and encoding the two or more audio signal inputs based on the determined coding mode.
[0022] The at least one adaptive discontinuous transmission coding mode may comprise at
least one externally visible adaptive discontinuous transmission coding mode, and
encoding the two or more audio signal inputs based on the determined coding mode when
the coding mode is an externally visible adaptive discontinuous transmission coding
mode may comprise adaptively encoding the two or more audio signal inputs with discontinuous
transmission encoding assistance, the discontinuous transmission encoding assistance
configured to encode inactive signal activity within the two or more separate audio
signal inputs as silence descriptor elements.
[0023] Encoding the two or more audio signal inputs based on the determined coding mode
when the coding mode is an externally visible adaptive discontinuous transmission
coding mode may comprise adaptively encoding one of the two or more audio signal inputs
based on signal activity within the one of the two or more separate audio signal inputs
and at least one of: signal activity within another one of the two or more separate
audio signal inputs; a determined output bit rate and an encoding rate of the others
of the two or more separate audio signal inputs, such that a combined bit rate for
encoding the two or more audio signal inputs is kept constant; a determined output
bit rate and an encoding rate of the others of the two or more separate audio signal
inputs, such that a combined bit rate for encoding the two or more audio signal inputs
is variable; and a number of encoded channels to be output from the encoded others
of the two or more audio signal inputs.
[0024] Encoding the two or more audio signal inputs based on the determined coding mode
when the coding mode is an externally visible adaptive discontinuous transmission
coding mode may comprise controlling at least one of: a number of encoded channels
to be output from the encoded one of the two or more audio signal inputs; and a bit
rate of the encoded one of the two or more audio signal inputs.
[0025] Controlling the number of encoded channels to be output from the encoded one of the
two or more audio signal inputs may comprise controlling the number of encoded channels
to be output from the encoded one of the two or more audio signal inputs such that
a total number of channels output from encoding all of the two or more audio signal
inputs is constant.
[0026] The at least one adaptive discontinuous transmission coding mode may comprise at
least one externally invisible adaptive discontinuous transmission coding mode, and
encoding the two or more audio signal inputs based on the determined coding mode when
the coding mode is an externally invisible adaptive discontinuous transmission coding
mode may comprise adaptively encoding the two or more audio signal inputs with discontinuous
transmission encoding assistance, the discontinuous transmission encoding assistance
may be configured to encode inactive signal activity within the two or more separate
audio signal inputs as silence descriptor elements, but maintain a constant number
of output channels and/or a constant output bitrate.
[0027] Encoding the two or more audio signal inputs based on the determined coding mode
when the coding mode is an externally invisible adaptive discontinuous transmission
coding mode may comprise adaptively encoding one of the two or more audio signal inputs
based on signal activity within the one of the two or more separate audio signal inputs
and at least one of: signal activity within another one of the two or more separate
audio signal inputs; a determined output bit rate and an encoding rate of the others
of the two or more separate audio signal inputs, such that a combined bit rate for
encoding the two or more audio signal inputs is kept constant; a determined output
bit rate and an encoding rate of the others of the two or more separate audio signal
inputs, such that a combined bit rate for encoding the two or more audio signal inputs
is variable; and a number of encoded channels to be output from the encoded others
of the two or more audio signal inputs.
[0028] Encoding the two or more audio signal inputs based on the determined coding mode
when the discontinuous transmission coding mode is an externally invisible adaptive
discontinuous transmission coding mode may comprise controlling at least one of: a
number of encoded channels to be output from the encoded one of the two or more audio
signal inputs to maintain a constant number of output channels; and a bit rate of
the encoded one of the two or more audio signal inputs to maintain a constant output
bitrate.
[0029] Controlling the bit rate of the encoded one of the two or more audio signal inputs
to maintain a constant output bitrate may comprise applying zero padding to the encoded
audio signal inputs.
[0030] Controlling the bit rate of the encoded one of the two or more audio signal inputs
to maintain a constant output bitrate may comprise applying an adaptive discontinuous
transmission coding mode for the one of the two or more audio signal inputs as a
single source, resulting in a maximum of a single encoded audio signal input implementing
discontinuous transmission encoding assistance.
[0031] Controlling the bit rate of the encoded one of the two or more audio signal inputs
to maintain a constant output bitrate may comprise applying an adaptive discontinuous
transmission coding mode for bit rate allocation and transport signal selection for
the encoded one of the two or more audio signal inputs.
[0032] The three or more coding modes may further comprise an off mode wherein encoding
the two or more audio signal inputs based on the determined coding mode may comprise
encoding the two or more audio signal inputs without any discontinuous transmission
encoding assistance.
[0033] The three or more coding modes may further comprise an on mode wherein encoding
the two or more audio signal inputs based on the determined coding mode may comprise
encoding the two or more audio signal inputs with discontinuous transmission encoding
assistance, the discontinuous transmission encoding assistance configured to encode
inactive signal activity within the two or more separate audio signal inputs as silence
descriptor elements.
[0034] The three or more coding modes may further comprise an on mode wherein encoding the
two or more audio signal inputs based on the determined coding mode may comprise encoding
the two or more audio signal inputs with discontinuous transmission encoding assistance,
the discontinuous transmission encoding assistance configured to adaptively individually
encode one or more of the two or more separate audio signal inputs with discontinuous
transmission encoding assistance, the discontinuous transmission encoding assistance
configured to encode inactive signal activity within the one or more of the two or
more separate audio signal inputs as silence descriptor elements and others of the
two or more audio signal inputs without any discontinuous transmission encoding assistance.
[0035] According to a third aspect there is provided an apparatus comprising at least one
processor and at least one memory including a computer program code, the at least
one memory and the computer program code configured to, with the at least one processor,
cause the apparatus at least to: obtain two or more separate audio signal inputs;
determine signal activity within the two or more separate audio signal inputs; determine
a coding mode from three or more coding modes, wherein at least one of the three or
more coding modes comprises at least one adaptive discontinuous transmission coding
mode and the coding mode is determined based on the signal activity within the two
or more separate audio signal inputs; and encode the two or more audio signal inputs
based on the determined coding mode.
[0036] The at least one adaptive discontinuous transmission coding mode may comprise at
least one externally visible adaptive discontinuous transmission coding mode, and
the apparatus caused to encode the two or more audio signal inputs based on the determined
coding mode when the coding mode is an externally visible adaptive discontinuous transmission
coding mode may be caused to adaptively encode the two or more audio signal inputs
with discontinuous transmission encoding assistance, the discontinuous transmission
encoding assistance configured to encode inactive signal activity within the two or
more separate audio signal inputs as silence descriptor elements.
[0037] The apparatus caused to encode the two or more audio signal inputs based on the determined
coding mode when the coding mode is an externally visible adaptive discontinuous transmission
coding mode may be caused to adaptively encode one of the two or more audio signal
inputs based on signal activity within the one of the two or more separate audio signal
inputs and at least one of: signal activity within another one of the two or more
separate audio signal inputs; a determined output bit rate and an encoding rate of
the others of the two or more separate audio signal inputs, such that a combined bit
rate for encoding the two or more audio signal inputs is kept constant; a determined
output bit rate and an encoding rate of the others of the two or more separate audio
signal inputs, such that a combined bit rate for encoding the two or more audio signal
inputs is variable; and a number of encoded channels to be output from the encoded
others of the two or more audio signal inputs.
[0038] The apparatus caused to encode the two or more audio signal inputs based on the determined
coding mode when the coding mode is an externally visible adaptive discontinuous transmission
coding mode may be caused to control at least one of: a number of encoded channels
to be output from the encoded one of the two or more audio signal inputs; and a bit
rate of the encoded one of the two or more audio signal inputs.
[0039] The apparatus caused to control the number of encoded channels to be output from
the encoded one of the two or more audio signal inputs may be caused to control the
number of encoded channels to be output from the encoded one of the two or more audio
signal inputs such that a total number of channels output from encoding all of the
two or more audio signal inputs is constant.
[0040] The at least one adaptive discontinuous transmission coding mode may comprise at
least one externally invisible adaptive discontinuous transmission coding mode, and
the apparatus caused to encode the two or more audio signal inputs based on the determined
coding mode when the coding mode is an externally invisible adaptive discontinuous
transmission coding mode may be caused to adaptively encode the two or more audio
signal inputs with discontinuous transmission encoding assistance, the discontinuous
transmission encoding assistance may be caused to encode inactive signal activity
within the two or more separate audio signal inputs as silence descriptor elements,
but maintain a constant number of output channels and/or a constant output bitrate.
[0041] The apparatus caused to encode the two or more audio signal inputs based on the determined
coding mode when the coding mode is an externally invisible adaptive discontinuous
transmission coding mode may be caused to adaptively encode one of the two or more
audio signal inputs based on signal activity within the one of the two or more separate
audio signal inputs and at least one of: signal activity within another one of the
two or more separate audio signal inputs; a determined output bit rate and an encoding
rate of the others of the two or more separate audio signal inputs, such that a combined
bit rate for encoding the two or more audio signal inputs is kept constant; a determined
output bit rate and an encoding rate of the others of the two or more separate audio
signal inputs, such that a combined bit rate for encoding the two or more audio signal
inputs is variable; and a number of encoded channels to be output from the encoded
others of the two or more audio signal inputs.
[0042] The apparatus caused to encode the two or more audio signal inputs based on the determined
coding mode when the discontinuous transmission coding mode is an externally invisible
adaptive discontinuous transmission coding mode may be caused to control at least
one of: a number of encoded channels to be output from the encoded one of the two
or more audio signal inputs to maintain a constant number of output channels; and
a bit rate of the encoded one of the two or more audio signal inputs to maintain a
constant output bitrate.
[0043] The apparatus caused to control the bit rate of the encoded one of the two or more
audio signal inputs to maintain a constant output bitrate may be caused to apply zero
padding to the encoded audio signal inputs.
[0044] The apparatus caused to control the bit rate of the encoded one of the two or more
audio signal inputs to maintain a constant output bitrate may be caused to apply an
adaptive discontinuous transmission coding mode for the one of the two or more audio
signal inputs as a single source, resulting in a maximum of a single encoded audio
signal input implementing discontinuous transmission encoding assistance.
[0045] The apparatus caused to control the bit rate of the encoded one of the two or more
audio signal inputs to maintain a constant output bitrate may be caused to apply an
adaptive discontinuous transmission coding mode for bit rate allocation and transport
signal selection for the encoded one of the two or more audio signal inputs.
[0046] The three or more coding modes may further comprise an off mode wherein the apparatus
caused to encode the two or more audio signal inputs based on the determined coding
mode may be caused to encode the two or more audio signal inputs without any discontinuous
transmission encoding assistance.
[0047] The three or more coding modes may further comprise an on mode wherein the apparatus
caused to encode the two or more audio signal inputs based on the determined coding
mode may be caused to encode the two or more audio signal inputs with discontinuous
transmission encoding assistance, the discontinuous transmission encoding assistance
configured to encode inactive signal activity within the two or more separate audio
signal inputs as silence descriptor elements.
[0048] The three or more coding modes may further comprise an on mode wherein the apparatus
caused to encode the two or more audio signal inputs based on the determined coding
mode may be caused to encode the two or more audio signal inputs with discontinuous
transmission encoding assistance, the discontinuous transmission encoding assistance
configured to adaptively individually encode one or more of the two or more separate
audio signal inputs with discontinuous transmission encoding assistance, the discontinuous
transmission encoding assistance configured to encode inactive signal activity within
the one or more of the two or more separate audio signal inputs as silence descriptor
elements and others of the two or more audio signal inputs without any discontinuous
transmission encoding assistance.
[0049] According to a fourth aspect there is provided an apparatus comprising: obtaining
circuitry configured to obtain two or more separate audio signal inputs; determining
circuitry configured to determine signal activity within the two or more separate
audio signal inputs; determining circuitry configured to determine a coding mode from
three or more coding modes, wherein at least one of the three or more coding modes
comprises at least one adaptive discontinuous transmission coding mode and the coding
mode is determined based on the signal activity within the two or more separate audio
signal inputs; and encoding circuitry configured to encode the two or more audio signal
inputs based on the determined coding mode.
[0050] According to a fifth aspect there is provided a computer program comprising instructions
[or a computer readable medium comprising program instructions] for causing an apparatus
to perform at least the following: obtain two or more separate audio signal inputs;
determine signal activity within the two or more separate audio signal inputs; determine
a coding mode from three or more coding modes, wherein at least one of the three or
more coding modes comprises at least one adaptive discontinuous transmission coding
mode and the coding mode is determined based on the signal activity within the two
or more separate audio signal inputs; and encode the two or more audio signal inputs
based on the determined coding mode.
[0051] According to a sixth aspect there is provided a non-transitory computer readable
medium comprising program instructions for causing an apparatus to perform at least
the following: obtain two or more separate audio signal inputs; determine signal activity
within the two or more separate audio signal inputs; determine a coding mode from
three or more coding modes, wherein at least one of the three or more coding modes
comprises at least one adaptive discontinuous transmission coding mode and the coding
mode is determined based on the signal activity within the two or more separate audio
signal inputs; and encode the two or more audio signal inputs based on the determined
coding mode.
[0052] According to a seventh aspect there is provided an apparatus comprising: means for
obtaining two or more separate audio signal inputs; means for determining signal activity
within the two or more separate audio signal inputs; means for determining a coding
mode from three or more coding modes, wherein at least one of the three or more coding
modes comprises at least one adaptive discontinuous transmission coding mode and the
coding mode is determined based on the signal activity within the two or more separate
audio signal inputs; and means for encoding the two or more audio signal inputs based
on the determined coding mode.
[0053] According to an eighth aspect there is provided a computer readable medium comprising
program instructions for causing an apparatus to perform at least the following: obtain
two or more separate audio signal inputs; determine signal activity within the two
or more separate audio signal inputs; determine a coding mode from three or more coding
modes, wherein at least one of the three or more coding modes comprises at least one
adaptive discontinuous transmission coding mode and the coding mode is determined
based on the signal activity within the two or more separate audio signal inputs;
and encode the two or more audio signal inputs based on the determined coding mode.
[0054] An apparatus comprising means for performing the actions of the method as described
above.
[0055] An apparatus configured to perform the actions of the method as described above.
[0056] A computer program comprising program instructions for causing a computer to perform
the method as described above.
[0057] A computer program product stored on a medium may cause an apparatus to perform the
method as described herein.
[0058] An electronic device may comprise apparatus as described herein.
[0059] A chipset may comprise apparatus as described herein.
[0060] Embodiments of the present application aim to address problems associated with the
state of the art.
Summary of the Figures
[0061] For a better understanding of the present application, reference will now be made
by way of example to the accompanying drawings in which:
Figure 1 shows schematically an example spatial audio communications system suitable
for implementing some embodiments;
Figures 2, 3 and 4 show schematically differences in audio activity in spatial communications
based on multi-source encoder input;
Figure 5 shows schematically an encoder system of apparatus suitable for implementing
some embodiments;
Figure 6 shows schematically an example encoder configured with mode selection for
multi-source inputs according to some embodiments;
Figure 7 shows schematically an example flow diagram implementing an adaptive DTX
operation;
Figure 8 shows schematically an example flow diagram implementing an internal adaptive
DTX operation; and
Figure 9 shows schematically an example device suitable for implementing the apparatus
shown.
Embodiments of the Application
[0062] The concept as discussed in the embodiments of the invention relates to speech and
audio codecs and in particular immersive audio codecs supporting a multitude of operating
points ranging from a low bit rate operation to transparency as well as a range of
service capabilities, e.g., from mono to stereo to fully immersive audio encoding/decoding/rendering.
An example of such a codec is the 3GPP IVAS codec discussed above.
[0063] The input signals are presented to the IVAS encoder in one of the supported formats
(and in some allowed combinations of the formats). Similarly, it is expected that
the decoder can output the audio in a number of supported formats. A pass-through
mode has been proposed, where the audio could be provided in its original format after
transmission (encoding/decoding), which can, e.g., allow for external rendering of
the transmitted audio.
[0064] For example a mono audio signal (without metadata) may be encoded using an Enhanced
Voice Service (EVS) encoder. Other input formats may utilize new IVAS encoding tools.
One input format proposed for IVAS is the Metadata-assisted spatial audio (MASA) format,
where the encoder may utilize, e.g., a combination of mono and stereo encoding tools
and metadata encoding tools for efficient transmission of the format. MASA is a parametric
spatial audio format suitable for spatial audio processing. Parametric spatial audio
processing is a field of audio signal processing where the spatial aspect of the sound
(or sound scene) is described using a set of parameters. For example, in parametric
spatial audio capture from microphone arrays, it is a typical and an effective choice
to estimate from the microphone array signals a set of parameters such as directions
of the sound in frequency bands, and the relative energies of the directional and
non-directional parts of the captured sound in frequency bands, expressed for example
as a direct-to-total ratio or an ambient-to-total energy ratio in frequency bands.
These parameters are known to well describe the perceptual spatial properties of the
captured sound at the position of the microphone array. These parameters can be utilized
in synthesis of the spatial sound accordingly, for headphones binaurally, for loudspeakers,
or to other formats, such as Ambisonics.
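As a concrete illustration of such a ratio parameter, a direct-to-total energy ratio for one frequency band could be sketched from two microphone channel spectra as follows. This is a minimal sketch under assumed conditions (two microphones, one band of aligned complex FFT bins); the function name and the coherence-based estimator are illustrative only and not part of any codec specification:

```python
import math

def direct_to_total_ratio(x_band, y_band):
    """Rough per-band direct-to-total energy ratio estimate from two
    microphone spectra (lists of complex FFT bins for one frequency
    band): the magnitude of the normalized cross-spectrum is used as a
    proxy for the coherent (directional) part of the captured sound.
    Illustrative only; practical analysers use more robust estimators."""
    cross = sum(a * b.conjugate() for a, b in zip(x_band, y_band))
    energy = math.sqrt(sum(abs(a) ** 2 for a in x_band) *
                       sum(abs(b) ** 2 for b in y_band))
    return abs(cross) / energy if energy > 0 else 0.0
```

A coherent source reaching both microphones yields a ratio near 1, while uncorrelated (ambient) content pushes the ratio towards 0; an ambient-to-total ratio then follows as one minus this value.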
[0065] For example, there can be two channels (stereo) of audio signals and spatial metadata.
The spatial metadata may furthermore define parameters such as: Direction index, describing
a direction of arrival of the sound at a time-frequency parameter interval; level/phase
differences; Direct-to-total energy ratio, describing an energy ratio for the direction
index; Diffuseness; Coherences such as Spread coherence describing a spread of energy
for the direction index; Diffuse-to-total energy ratio, describing an energy ratio
of non-directional sound over surrounding directions; Surround coherence describing
a coherence of the non-directional sound over the surrounding directions; Remainder-to-total
energy ratio, describing an energy ratio of the remainder (such as microphone noise)
sound energy to fulfil the requirement that the sum of energy ratios is 1; Distance, describing
a distance of the sound originating from the direction index in meters on a logarithmic
scale; covariance matrices related to a multi-channel loudspeaker signal, or any data
related to these covariance matrices; other parameters for guiding or controlling
a specific decoder, e.g., VAD/DTX/CNG/SID parameters. Any of these parameters can
be determined in frequency bands.
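The energy-ratio constraint mentioned above (direct, diffuse and remainder ratios summing to 1) can be illustrated with a minimal sketch; the container class and field names are assumptions for illustration only, not the normative MASA parameter layout:

```python
from dataclasses import dataclass

@dataclass
class BandParameters:
    """Illustrative per-band parameter set (names are assumptions)."""
    direction_index: int        # quantized direction of arrival
    direct_to_total: float      # energy ratio for the direction index
    diffuse_to_total: float     # non-directional energy ratio
    remainder_to_total: float   # e.g., microphone noise energy ratio
    spread_coherence: float
    surround_coherence: float

    def ratios_are_consistent(self, tol: float = 1e-6) -> bool:
        """Check the constraint that the energy ratios sum to 1."""
        total = (self.direct_to_total + self.diffuse_to_total
                 + self.remainder_to_total)
        return abs(total - 1.0) <= tol

band = BandParameters(direction_index=17, direct_to_total=0.6,
                      diffuse_to_total=0.35, remainder_to_total=0.05,
                      spread_coherence=0.1, surround_coherence=0.2)
assert band.ratios_are_consistent()
```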
[0066] As discussed above Voice Activity Detection (VAD) may be employed in such a codec
to control Discontinuous Transmission (DTX), Comfort Noise Generation (CNG) and Silence
Descriptor (SID) frames.
[0067] Furthermore as discussed above CNG is a technique for creating a synthetic background
noise to fill silence periods that would otherwise be observed, e.g., under the DTX
operation. However a complete silence can be confusing or annoying to a receiving
user. For example, the listener could judge that the transmission may have been lost
and then unnecessarily say "hello, are you still there?" to confirm or simply hang
up. On the other hand, sudden changes in sound level (from total silence to active
background and speech or vice versa) could also be very annoying. Thus, CNG is applied
to prevent a sudden silence or sudden change. Typically, the CNG audio signal output
is based on a highly simplified transmission of noise parameters.
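A minimal sketch of this idea, synthesizing a comfort noise frame from a highly simplified set of transmitted parameters, is given below; the band model, gains and frame length are illustrative assumptions, not the actual EVS/IVAS CNG algorithm:

```python
import math
import random

def generate_comfort_noise(band_gains, frame_len=160, seed=None):
    """Generate one frame of comfort noise by summing random-phase
    sinusoids, one per band, scaled by the coarse band gains decoded
    from SID updates. Purely illustrative; real CNG shapes a noise
    spectrum according to the codec's filter/energy parameters."""
    rng = random.Random(seed)
    frame = [0.0] * frame_len
    for band, gain in enumerate(band_gains, start=1):
        phase = rng.uniform(0.0, 2.0 * math.pi)
        freq = band / (2.0 * len(band_gains))  # normalized band centre
        for n in range(frame_len):
            frame[n] += gain * math.sin(math.pi * freq * n + phase)
    return frame
```

Because only coarse band gains are transmitted, the synthesized noise changes smoothly between SID updates rather than tracking the original background exactly.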
[0068] There are currently no proposed spatial audio DTX, CNG and SID implementations. In
particular, the implementation of DTX operation for spatial audio is likely to change
the "feel" and level of the background noise that will be observed by the user. These
changes in the spatialization of the background noise may be perceived by the listener
as being annoying or confusing. For example in some embodiments as discussed herein
the background noise is provided such that it is experienced as coming from the same
direction(s) both during active and inactive speech periods.
[0069] For example, in one scenario, a user is talking in front of a spatial capture device
with a busy road behind the device. The spatial audio capture then has a constant traffic
hum (that is somewhat diffuse), specific traffic noises (e.g., car horns) coming
mainly from behind, and of course the talker's voice coming from the front. When DTX
is active and the user is not talking, both the N audio channels and the spatial metadata
transmission can be shut off to save transmission bandwidth. In the absence of regular
spatial audio coding, CNG provides a static background hum that is not too different
from the original captured background noise. The embodiments as discussed herein attempt
to generate spatial metadata during inactive periods. This avoids the repeating and
annoying "stuck" spatial image that would result from reusing the most recently received values.
[0070] Crude background noise description (SID) updates may be transmitted during inactive
periods (with EVS this is mono and with IVAS, e.g., mono or stereo SID/CNG) to keep
the signal properties (spectrum and energy) aligned between encoder and decoder. The
embodiments as discussed herein attempt to define how to transmit spatial image SID
updates.
[0071] Furthermore, in the example above, upon the local VAD indicating a speech onset, the
user voice returns, and so do the traffic hum and other traffic noises; these then receive
regular updates, and the spatial metadata is again sent at the normal bitrate. The listener thus
hears a significant change in the spatial reproduction. The embodiments as described
herein are configured to consider the spatial dimension of background noise during
the CNG periods and SID updates. Thus in such embodiments the DTX operation is made
as transparent and pleasant to the user as possible.
[0072] As such the embodiments as described herein attempt to provide an optimal DTX / CNG
/ SID system for parametric spatial audio such as MASA. Additionally the embodiments
as described herein are configured to provide an optimal CNG system for parametric
spatial audio such as MASA based on a mono or stereo DTX system.
[0073] Thus, some embodiments comprise an IVAS apparatus or system configured to implement
a DTX / CNG system where the parametric spatial audio CNG is based either on a spatial
audio DTX or a mono/stereo DTX. This means that spatial audio parameters are updated
substantially synchronously with the core audio DTX.
[0074] For example as shown in Figure 1 there is shown an example spatial audio communications
system. The system comprises apparatus or device 1 121, which may be any suitable
spatial audio capture apparatus being configured to capture audio within the environment
E1 100 including the user U1 101. The system further comprises apparatus or device
2 141, which may be a further suitable spatial audio capture apparatus being configured
to capture audio within the environment E2 110 including the user U2 111. The spatial
audio communications are shown as link 123 from apparatus or device 1 121 to apparatus
or device 2 141 and link 143 from apparatus or device 2 141 to apparatus or device
1 121 both of which are via a suitable network 131. In this example there are shown
two devices but it would be understood that the communication may be between more
than two parties in some embodiments.
[0075] By spatial audio communications it is meant that at least one upstream is spatial
audio, i.e., more than mono audio. In some embodiments the (two) upstreams (link 123
and link 143) may have different configurations.
[0076] For example Figure 2 shows differing audio signal content types being obtained, encoded
and then transmitted from the apparatus 1 121 to apparatus 2 141 and obtained, encoded
and then transmitted from apparatus 2 141 to apparatus 1 121. In this example, the
apparatus 1 121 audio input is a mono voice audio signal 201 and spatial ambience
audio signal 203 (illustrated as a stereo audio signal), while the apparatus 2 141
audio input is shown as a stereo voice audio signal.
[0077] Additionally Figure 2 shows an important aspect relating to DTX operation. There
are significant silent periods in both upstreams when considering the user voice signals.
[0078] This for example is shown in Figure 3. Figure 3 shows the apparatus 1 mono voice
audio signal 201 and a voice activity detection (VAD) indication or signal 301. Here
it can be seen that within the VAD indication/signal there are two periods of activity,
a first early period 311 and a second late period 312. Additionally as shown in Figure
3 there is the apparatus 2 stereo voice audio signal 205 and a voice activity detection
(VAD) indication or signal 303. The VAD indication/signal associated with the apparatus
2 stereo voice audio signal 205 shows two periods of activity, a first early-mid period
321 and a second late-mid period 322.
[0079] In the example apparatus input audio signals shown in Figure 2 there are significant
periods of inactivity which provide the justification for implementing DTX in communications
systems. However, the spatial ambience can be very active during the silent periods.
This for example is shown in Figure 4. Figure 4 shows the apparatus 1 mono voice audio
signal 201 and voice activity detection (VAD) indication or signal 301, with the two
periods of activity, a first early period 311 and a second late period 312. Figure
4 also shows the apparatus 1 spatial ambience audio signal 203 (illustrated as a stereo
audio signal) and associated voice activity detection (VAD) indication or signal 401.
The VAD indication/signal 401 associated with the apparatus 1 spatial ambience audio
signal 203 also shows two long periods of activity. The first long period is a first
early-to-mid period 411 which significantly overlaps with the first early period 311
and the second long period is a second mid-to-late period 412 which overlaps the second
late period 312. It would be appreciated that in some examples and embodiments there need
not be such an overlap.
[0080] The ambience/background signals can in many use cases be almost as important as the
voice signal or, indeed, include voice signals themselves. For example, in a multi-user
audio capture scenario, one of the audio signals could be the primary user voice (for
example the user 1 voice as shown in Figure 2 and Figure 3), while the spatial background
signal can carry the voice of one or more other local participants (not shown in Figure
2/Figure 3).
[0081] As shown by the above examples the multi-source audio input presents a problem for
the discontinuous transmission (DTX) operation. The multiple audio inputs can thus
show very different activity characteristics and can be independent of each other.
[0082] The concept as discussed in the embodiments herein is to enable discontinuous transmission
in multi-source input encoding; in other words, to improve the efficiency of audio encoding
for multi-source inputs based on the session-level VAD/DTX properties.
[0083] The embodiments therefore describe a method for efficient multi-source input encoding
utilizing an (internal) adaptive DTX operation. The adaptive DTX operation is introduced
as an extension of the well-known DTX operation.
[0084] In some embodiments, instead of the binary DTX selection, there can be utilized a
multi-step model. The model can in some embodiments be visible externally or be provided
for operation as a codec-internal operation, and is based on multi-source input
handling and encoding.
[0085] In some embodiments the method is configured to maintain the number of transport
signals in constant bit rate operation, where inactive sources are internally handled
using DTX.
[0086] The following examples are shown with respect to a 3GPP IVAS encoder. Specifically,
IVAS is foreseen to require support for multi-source input encoding. For example,
a combination of one audio object input and one MASA audio input can be input to the
encoder.
[0087] The embodiments as discussed herein provide apparatus configured to
implement DTX-capability derived adaptation of the encoding for efficient conversational
multi-input voice and audio.
[0088] Some embodiments may also be implemented in other multi-input audio codecs.
Some embodiments may be useful in low to mid bit-rate operations such as conversational
spatial codecs. Particularly, in some embodiments it may be possible to optimize constant
bit rate codec operation.
[0089] The embodiments therefore, in summary, as shown in the examples herein, relate to
immersive voice and audio codecs and specifically to efficient encoding of multi-source
immersive voice and audio inputs under a) DTX conditions and/or b) constant bit rate
transmission conditions. However, they may be applied without need of further inventive
effort to other codecs with similar functionalities.
[0090] Furthermore while the EVS codec provides extensive bit rate switching capabilities,
all current commercial EVS service launches utilize only a single bit rate, typically
13.2 kbps or 24.4 kbps. A similar approach is fairly likely for at least the initial
launches of IVAS services. Therefore, as shown in the embodiments herein, optimizing
constant bit rate (CBR) operation is a desired goal.
[0091] Figure 5 presents a high-level overview of a suitable system or apparatus for IVAS
coding and decoding which is suitable for implementing embodiments as described herein.
The system 500 is shown comprising an (IVAS) input 511. The IVAS input 511 can comprise
one or more of any suitable input format. For example as shown in Figure 5 there is
shown a mono audio signal input 512. The mono audio signal input 512 may in some embodiments
be passed to the encoder 521 and specifically to an Enhanced Voice Services (EVS)
encoder 523. Furthermore, there is shown a stereo and binaural audio signal input 513. The
stereo and binaural audio signal input 513 in some embodiments is passed to the encoder
521 and specifically to the (IVAS) spatial audio encoder 525. Figure 5 also shows
a Metadata-Assisted Spatial Audio (MASA) signal input 514. The Metadata-Assisted Spatial
Audio (MASA) signal input in some embodiments is passed to the encoder 521. Specifically
the audio component of the MASA input is passed to the (IVAS) spatial audio encoder
525 and the metadata component passed to a metadata quantizer/encoder 527. Another
input format shown in Figure 5 is an ambisonic audio signal, which may comprise first
order Ambisonics (FOA) and/or higher order ambisonics (HOA) audio signal 515. The
first order Ambisonics (FOA) and/or higher order ambisonics (HOA) audio signal 515
in some embodiments is passed to the encoder 521 and specifically to the (IVAS) spatial
audio encoder 525. Furthermore, shown in Figure 5 is a channel based audio signal
input 516. This may be any suitable input audio channel format, for example 5.1 channel
format, 7.1 channel format etc. The channel based audio signal input 516 in some embodiments
is passed to the encoder 521 and specifically to the (IVAS) spatial audio encoder
525. The final example input shown in Figure 5 is an object (or audio object) signal
input 517. The object signal input in some embodiments is passed to the encoder 521.
Specifically the audio component of the object signal input is passed to the (IVAS)
spatial audio encoder 525 and the metadata component passed to a metadata quantizer/encoder
527. In some embodiments, the mono audio signal corresponding to the audio object
signal input 517 can be passed to the EVS encoder 523, while the corresponding metadata
component is passed to a metadata quantizer/encoder 527.
[0092] Figure 5 furthermore shows an (IVAS) encoder 521. The (IVAS) encoder 521 is configured
to receive the audio signal from the input and encode it to produce a suitable format
encoded bitstream 531. The (IVAS) encoder 521 in some embodiments as shown in Figure
5 comprises an EVS encoder 523 configured to receive any mono audio signals 512 and
encode them according to an EVS codec definition.
[0093] Furthermore the (IVAS) encoder 521 is shown comprising an (IVAS) spatial audio encoder
525. The (IVAS) spatial audio encoder 525 is configured to receive the audio signals
or audio signal components and encode the audio signals based on a suitable definition
or coding mechanism. In some embodiments the spatial audio encoder 525 is configured
to reduce the number of audio signals being encoded before the signals are encoded.
For example in some embodiments the spatial audio encoder is configured to combine
or otherwise downmix the input audio signals. Such audio signal reduction can in some
embodiments include, e.g., a spatial audio analysis, which can result in metadata
in addition to the reduced number of audio signals. For example, such spatial audio
analysis can transform, e.g., a channel based audio input 516 into an internal MASA
representation, which can be encoded similarly as a MASA input 514.
[0094] In some embodiments, for example when the input type is MASA signals, the spatial
audio encoder is configured to encode the audio signals as a mono or stereo signal.
[0095] The spatial audio encoder 525 may comprise an audio encoder core which is configured
to receive the downmix or the audio signals directly and generate a suitable encoding
of these audio signals. The encoder 525 can in some embodiments be a computer (running
suitable software stored on memory and on at least one processor), or alternatively
a specific device utilizing, for example, FPGAs or ASICs.
[0096] In some embodiments the encoder 521 comprises a metadata quantizer/encoder 527. The
metadata quantizer/encoder 527 is configured to receive the metadata, for example
from the MASA input or the objects and generate a suitable quantized and/or encoded
metadata bitstream suitable for (being combined with or associated with the encoded
audio signal bitstream) and being output as part of the (IVAS) bitstream 531.
[0097] Furthermore as shown in Figure 5 there is shown a (IVAS) decoder 541. The decoder
541 in some embodiments comprises a metadata dequantizer/decoder 547. The metadata
dequantizer/decoder 547 is configured to receive the encoded metadata, for example
from the IVAS bitstream 531 and generate a metadata bitstream suitable for rendering
the audio signals within the stereo and spatial audio decoder 545.
[0098] Figure 5 furthermore shows the (IVAS) decoder 541 comprising an EVS decoder 543.
The EVS decoder 543 is configured to receive the EVS encoded mono audio signals as
part of the IVAS bitstream 531 and decode them to generate a suitable mono audio signal
which can be passed to an internal renderer (for example the stereo and spatial decoder)
or a suitable external renderer.
[0099] Additionally in some embodiments the (IVAS) decoder 541 comprises a stereo and spatial
audio signal decoder 545. The stereo and spatial audio signal decoder 545 in some
embodiments is configured to receive the encoded audio signals and generate a suitable
decoded spatial audio signal which can be rendered internally (for example by the
stereo and spatial audio signal decoder) or by a suitable external renderer.
[0100] Therefore in summary first the system is configured to receive a suitable audio signal
format or any combination of suitable audio signal formats. In some embodiments the
system is configured to generate transport audio signals (for example a downmix) from
the input audio signals. The system is then configured to encode for storage/transmission
the audio signals. After this the system may store/transmit the encoded audio signals
and metadata. The system may retrieve/receive the encoded audio signals and metadata.
Then the system is configured to extract the audio signals and metadata from encoded
audio signals and metadata parameters, for example demultiplex and decode the encoded
audio signals and metadata parameters.
[0101] The system may furthermore be configured to synthesize an output multi-channel audio
signal based on the extracted audio signals and metadata.
[0102] With respect to Figure 6, the encoder 521 is shown in further detail according to
some embodiments.
[0103] In Figure 6, the upper part shows that the inputs are for example a MASA input format
514 and an object input format 517. These can be passed to the (IVAS) encoder 521
for encoding. It would be understood that any suitable input format may be used in
some embodiments and these are examples of input formats used to demonstrate the embodiments
herein only.
[0104] The lower part of Figure 6 shows the encoder shown in the upper part in further detail.
Thus the MASA input format 514 is shown comprising a stereo audio signal 601 and associated
MASA metadata 603. The stereo audio signal 601 is an example of a suitable audio signal
as part of a MASA input format only, and other numbers of channels or time-frequency
domain input audio signals may be employed in some embodiments.
[0105] The stereo audio signal 601 is passed to the encoder 521 and in some embodiments
is passed to a first signal activity detector 621 within the encoder 521. The associated
MASA metadata 603 is passed to the encoder 521 and in some embodiments to a metadata
quantizer/encoder 653 within the encoder 521.
[0106] The object input format 517 is shown comprising a mono audio signal 611 and associated
metadata 613. The mono audio signal 611 is passed to the encoder 521 and in some embodiments
is passed to a second signal activity detector 623 within the encoder 521. The associated
metadata 613 is passed to the encoder 521 and in some embodiments to the metadata
quantizer/encoder 653 within the encoder 521.
[0107] The encoder 521 can in some embodiments comprise a first signal activity detector
(SAD) 621 configured to determine whether the stereo audio signals 601 comprise activity.
This information (with the stereo audio signals 601) can be passed to a coding mode
selector 631.
[0108] Furthermore the encoder 521 in some embodiments comprises a second signal activity
detector (SAD) 623 configured to determine whether the mono audio signal 611 comprises
activity. This information (with the mono audio signal 611) can also be passed to
a coding mode selector 631.
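A signal activity decision of the kind produced by the detectors 621 and 623 can be sketched minimally as an energy threshold test; the threshold and scaling are illustrative assumptions, as a practical SAD/VAD uses adaptive noise-floor estimates, hangover logic and the like:

```python
import math

def signal_activity(frame, threshold_db=-50.0):
    """Minimal energy-based signal activity detector: a frame of samples
    in [-1, 1] is declared active when its RMS level exceeds a fixed
    threshold in dBFS. The threshold value is an assumption for
    illustration; real detectors adapt it to the background noise."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0
    level_db = 20.0 * math.log10(rms) if rms > 0 else float("-inf")
    return level_db > threshold_db
```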
[0109] The encoder 521 in some embodiments comprises a coding mode selector 631 configured
to receive the SAD information associated with the MASA or stereo audio signal input
(and it may also receive the stereo audio signals 601) and the SAD information associated
with the object or mono audio signal input (and it may also receive the mono audio signal
611), and configured to determine a coding mode for coding the stereo audio signals,
the mono audio signal and the metadata associated with each.
[0110] The mode selection or determination can be passed to a stereo encoder 641, a mono
audio encoder 643 and the metadata quantizer/encoder 653.
[0111] In some embodiments the encoder 521 comprises a stereo encoder 641. The stereo encoder
641 is configured to receive the stereo audio signals 601 and the determined or selected
coding mode and encode the stereo audio signals 601 based on the coding mode. The
encoded stereo audio signals can then be output.
[0112] Furthermore in some embodiments the encoder 521 comprises a mono encoder 643. The
mono encoder 643 is configured to receive the mono audio signal 611 and the determined
or selected coding mode and encode the mono audio signals 611 based on the coding
mode. The encoded mono audio signal can then be output.
[0113] Furthermore in some embodiments the encoder 521 comprises a metadata quantizer/encoder
653. The metadata quantizer/encoder 653 is configured to receive the metadata (the
metadata 603 associated with the MASA input and the metadata 613 associated with the
object input) and the determined or selected coding mode and quantize and/or encode
the metadata based on the coding mode. The quantized and/or encoded metadata can then
be output.
[0114] As shown herein, in some embodiments multi-input encoding can be implemented in such
a manner as to encode the inputs separately. This for example may be implemented for
bit rates at least outside the very lowest bit rates. In some embodiments where the
very lowest bit rates are required, the input formats may be transformed or suitable
downmixing implemented to simplify the audio signals (e.g., by reducing the number of
audio signals) to be encoded.
[0115] Separate encoding may be implemented for the input formats because it enables
different core encoding algorithms to be used for different input types. The multiple
input types can be from multiple sources and may be uncorrelated. As such, the use of
separate encoders forgoes only the coding gains that correlated audio signals would
otherwise provide.
[0116] In some embodiments the implementation of a joint encoder may require format conversion
and rely on correlated signals in order to produce significant improvements (the downmixed
scene for joint encoding is generally not fully separable). A joint encoder may also
have to determine whether the joint encoding has added correlation artefacts to the
component signals.
[0117] In some embodiments the encoders do not operate entirely separately. For example,
in order to maintain constant bit rate (CBR) operation, it is beneficial to employ
common bit rate allocation, in other words, vary the bit rate for encoding of each
component signal. In some embodiments this may employ a Discontinuous transmission
(DTX) operation based on the output of the detector (VAD/SAD) analysis of the individual
inputs. The encoders such as shown in Figure 6 by the stereo encoder 641 and mono
encoder 643 may implement a Discontinuous transmission (DTX) operation.
[0118] Conventionally a DTX operation is characterized as being either on or off. This means
that DTX operation is either used to limit the upstream transmission activity and
bandwidth (bit rate) or the codec is operated without DTX operation such that it provides
a constant or variable bit rate. Thus, in DTX operation, the bit rate varies, and
the transmission frequency varies. In non-DTX operation, the bit rate may vary (if
VBR), but the transmission frequency is constant (i.e., all frames are transmitted).
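The difference between non-DTX and conventional DTX operation described above can be sketched as a per-frame scheduling decision; the SID update interval and the return labels are illustrative assumptions (EVS, for comparison, by default sends a SID update every 8th inactive frame):

```python
def schedule_frame(dtx_on, frame_active, frames_since_sid, sid_interval=8):
    """Decide what is sent for one 20 ms frame. Illustrative sketch:
    in non-DTX operation every frame is transmitted; in DTX operation
    inactive frames are replaced by infrequent SID updates or nothing."""
    if not dtx_on:
        return "ACTIVE_FRAME"      # non-DTX: all frames transmitted
    if frame_active:
        return "ACTIVE_FRAME"      # active signal: normal coding
    if frames_since_sid >= sid_interval:
        return "SID_UPDATE"        # infrequent background-noise update
    return "NO_DATA"               # nothing transmitted this frame
```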
[0119] In the embodiments discussed herein the encoders, such as the stereo encoder 641
and the mono encoder 643 are configured to implement a multi-step DTX operation in
multi-input encoding. The additional capability can be explicit to the outside of
the codec, or the DTX property can, e.g., for the sake of simplicity, remain on/off for
codec negotiation and interfacing purposes.
[0120] For example, in some embodiments the coding mode selector 631 is configured to determine
one of the following modes or steps and control the separate encoders in the following
manner:
- 1. Off - No DTX operation, characterized in minimum by all frames being transmitted
- 2. Adaptive - At least "internal DTX operation" is activated, all frames may be transmitted
- 3. On - DTX operation is active, inactive frames are being transmitted as SID updates
[0121] In some embodiments the coding mode selector 631 is configured to determine one of
the following alternative modes:
- 1. Off - No DTX operation
- 2. Adaptive internal - DTX operation not visible externally, all frames transmitted
- 3. Adaptive external - DTX operation is visible externally, limited use of SID updates,
activity decision of one stream influences encoding of another stream(s), constant
or variable rate encoding for transmitted frames
- 4. On - DTX operation used
- a. Single stream, limited use of SID updates, variable rate encoding for transmitted
frames
- b. Multi-stream, each multi-input source transmitted in their own stream, varying
SID characteristics per stream
[0122] The adaptive DTX is employed for multi-input encoding as follows.
[0123] In some embodiments the MASA audio input comprises spatial ambience audio signals and
the object audio input comprises a voice audio signal. The coding mode selector can in some embodiments
be configured to determine a coding mode selection based (at least) on individual
audio input VAD/SAD decisions.
[0124] For example as shown in Figure 6 a first SAD decision is derived at the first signal
activity detector (SAD) 621 from the stereo audio signal input (part of the MASA input)
and a second SAD decision is derived at the second signal activity detector (SAD)
623 from the mono audio input (part of the object input). The coding mode selector
631 then may be configured to allocate audio signals and bit rates for stereo encoding,
mono encoding, as well as bit rate for quantization of the associated input metadata
(e.g., MASA metadata and object metadata).
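The allocation performed by the coding mode selector 631 can be sketched as splitting a constant per-frame bit budget according to the two SAD decisions; the split ratios and dictionary keys are illustrative assumptions only, not values from any codec specification:

```python
def allocate_bits(total_bits, masa_active, object_active):
    """Split a constant per-frame bit budget between the mono (object)
    core, the stereo (MASA) core and the metadata, driven by the
    per-input SAD decisions. Split ratios are illustrative assumptions."""
    if object_active and masa_active:
        # prioritize the object (voice); MASA reduced towards mono + metadata
        obj = int(0.5 * total_bits)
        masa = int(0.3 * total_bits)
        return {"object": obj, "masa_core": masa,
                "metadata": total_bits - obj - masa}
    if object_active:
        obj = int(0.8 * total_bits)
        return {"object": obj, "masa_core": 0, "metadata": total_bits - obj}
    if masa_active:
        masa = int(0.7 * total_bits)
        return {"object": 0, "masa_core": masa, "metadata": total_bits - masa}
    # both inactive: no regular budget, only occasional SID updates
    return {"object": 0, "masa_core": 0, "metadata": 0}
```

Note that in the active cases the three shares always sum to the full budget, preserving constant bit rate operation as viewed externally.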
[0125] Thus, when the coding mode selector 631 determines that the mode is DTX off, the
encoder can be configured to implement a suitable (regular) encoding operation that may
utilize a fixed bit rate allocation for each component or some form of variable-rate
coding over the components.
[0126] In addition, when the coding mode selector 631 determines that the mode is DTX on,
it can be configured to control the encoder to implement separate DTX decisions or a
combined DTX decision for the individual components.
[0127] In some embodiments the coding mode selector 631 can be configured to determine a
coding mode on a frame-by-frame basis such that the encoders (an internal encoding)
are configured to maintain a constant bit rate as viewed externally or limit certain
parameters (for example a number of transport signals) in a way that reduces the computational
complexity in the encoder and/or decoder. This may for example in some embodiments
be achieved by a sequential combination of the individual decisions.
[0128] An example is now considered for the object + MASA combination input as shown in
Figure 6. In this example, we specifically consider MASA that is stereo-based or includes
both mono and stereo in the input (in other words a 3-channel MASA configuration with
mono and stereo and metadata). Stereo-based MASA can be seen as the most common MASA
configuration due to its excellent properties of having two naturally incoherent signals
in addition to the spatial metadata providing the "description of the 3D audio image".
A mono-based MASA can be trivially upmixed to stereo-based MASA by duplicating the
mono channel. Thus, mono-based MASA can utilize stereo-based MASA encoding if needed.
[0129] In some embodiments the encoding is performed in an adaptive DTX mode. Separate SAD
decisions are derived for the inputs. In some embodiments, the SAD decision for the
object is utilized not only to drive, in part, the bit rate allocation between the
inputs but also the transport signal transmission and the corresponding encoding modes.
This is achieved, in some embodiments, based on the following logic:

[0130] The above mode selection results in the following transmission configuration for a mono
object + stereo-based MASA input combination:
|
MASA active |
MASA inactive |
Object active |
1 object + mono MASA |
1 object (+ MASA DTX) |
Object inactive |
Stereo MASA (+ object DTX) |
(Object DTX + MASA DTX) |
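The transmission configuration in the table above can be expressed as a simple lookup. The following is an illustrative sketch only (the function name is an assumption, not the codec's actual logic):

```python
def select_transmission(object_active: bool, masa_active: bool) -> str:
    """Transmission configuration for a mono object + stereo-based MASA
    input combination, per the table above; at most two active channels
    are encoded in every case."""
    if object_active and masa_active:
        return "1 object + mono MASA"
    if object_active:
        return "1 object (+ MASA DTX)"
    if masa_active:
        return "stereo MASA (+ object DTX)"
    return "object DTX + MASA DTX"
```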
[0131] Thus in this example at most two active channels are encoded. This behaviour may
be preferable for efficient spatial audio encoding and transmission due to the following
properties:
- a constant external bit rate can be maintained, and quality optimized for it (when
no DTX is used);
- quality can be optimized for a constant maximum bit rate (when DTX is used);
- the decoding load is maintained at a maximum of two active channels, reducing peak and
average complexity and thus saving battery life;
- the object is prioritized (highest importance, as it generally carries voice in this
combination);
- MASA spatial quality is enhanced when the object is inactive (stereo-based spatialization
is better than mono-based spatialization due to the availability of two naturally incoherent
prototypes).
[0132] Figure 7 shows an example flow diagram of the operation of the encoder (as shown
in Figure 6) according to some embodiments.
[0133] The audio object input is received as shown in Figure 7 by step 701.
[0134] A SAD determination on the audio object input is shown in Figure 7 by step 703.
[0135] The check of whether the SAD determination is active is shown in Figure 7 by step
705.
[0136] Where the audio object SAD determination indicates that it is not active then the
encoder can be configured to implement a DTX encoding for the audio object audio signals
(and the associated metadata) as shown in Figure 7 by step 707.
[0137] Where the audio object SAD determination indicates that it is active then the encoder
can be configured to implement an audio object encoding for the audio object audio
signals (and the associated metadata) as shown in Figure 7 by step 715.
[0138] The MASA input is received as shown in Figure 7 by step 702.
[0139] A SAD determination on the MASA input is shown in Figure 7 by step 704.
[0140] The check of whether the SAD determination is active is shown in Figure 7 by step
706.
[0141] Where the MASA SAD determination indicates that it is not active then the encoder
can be configured to implement a DTX encoding for the MASA audio signals (and the
associated metadata) as shown in Figure 7 by step 713.
[0142] Where the MASA SAD determination indicates that it is active and furthermore when
the decision is made from the audio object SAD determination (that there is activity
with respect to the audio object) then the encoder can be configured to implement
a mono MASA encoding for the MASA audio signals (and the associated metadata) as shown
in Figure 7 by step 717.
[0143] Where the MASA SAD determination indicates that it is active and furthermore when
the decision is made from the audio object SAD determination (that there is no activity
with respect to the audio object) then the encoder can be configured to implement
a stereo MASA encoding for the MASA audio signals (and the associated metadata) as
shown in Figure 7 by step 719.
[0144] Then the encoder may be configured to determine a bitstream output based on at most
two active channels and corresponding metadata as shown in Figure 7 by step 721.
[0145] In such embodiments the above processing results in an externally visible DTX operation.
This is acceptable when DTX operation activation is desirable. It may however be beneficial
to implement a DTX operation which is not externally visible, and specifically a constant
bit rate operation. This can be achieved according to the adaptive DTX operation scheme
in some embodiments by implementing zero padding to maintain a desired total bit rate.
In these embodiments internal adaptive DTX operation is carried out according to the above
examples, yet all frames are then zero padded such that the transmitted frame is of a
fixed size. (This may be implemented without significant processing and results in the
same audio quality characteristics as regular DTX operation with a bandwidth advantage.)
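The zero padding described above can be sketched as follows; the frame size and names here are illustrative assumptions, not values from the embodiments:

```python
FRAME_SIZE_BYTES = 80  # assumed fixed transport frame size, for illustration only

def zero_pad_frame(payload: bytes, frame_size: int = FRAME_SIZE_BYTES) -> bytes:
    """Pad an internally DTX-coded frame (e.g. a short update frame) with
    zeros so that every transmitted frame has the same fixed size, making
    the internal DTX operation externally invisible."""
    if len(payload) > frame_size:
        raise ValueError("encoded payload exceeds the fixed frame size")
    return payload + bytes(frame_size - len(payload))
```

Because every frame leaves the encoder at the same size, an external observer sees a constant bit rate even when the internal encoding is operating in DTX.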
[0146] In some embodiments internal adaptive DTX operations may implement adaptive DTX operation
only for a single source. In some embodiments an internal adaptive DTX operation is
carried out according to a process described below, resulting in a maximum of one source
under DTX.
[0147] In some embodiments SAD is used for bit rate allocation and transport signal selection
only. In these embodiments a process is implemented in a manner similar to those
described above; however, no internal DTX operation as such is used, and active noise
encoding is used in place of any frame-internal DTX.
[0148] Figure 8 shows an example flow diagram of an encoder such as shown in Figure 6 where
no externally visible DTX happens. In other words the encoder is configured to transmit
every frame. In this example there is constant bit rate encoding. In this example
a DTX operation is included (as shown by the "object DTX encoding and MASA DTX encoding"
step). The example operates on the combined audio object + MASA input, however, in
various examples a similar approach can be utilized for other input format combinations.
[0149] The audio inputs (the audio object and MASA input) are received or obtained as shown
in Figure 8 by step 801.
[0150] Signal activity detection is then performed for both of the inputs as shown in Figure
8 by step 803.
[0151] The encoder is then configured to update encoding allocation states based on the
current and previous signal activities (and thus selected states) as shown in Figure
8 by step 805.
[0152] Based on the state update an encoding mode is then selected. The mode selection step
is shown in Figure 8 by step 807.
[0153] In some embodiments the states and mode selection may be the following: internal adaptive
DTX with a constant number of transport channels. In such embodiments two
transport channels are always sent for a combination of audio object input and MASA
input.
- In this state the possible modes are
- Object encoding and Mono MASA encoding as shown in Figure 8 by step 809
- Object encoding and Mono MASA active noise encoding as shown in Figure 8 by step 811
- Object DTX encoding and Stereo MASA encoding as shown in Figure 8 by step 815
- Object DTX encoding and MASA active noise encoding as shown in Figure 8 by step 817
- Object active noise encoding and MASA active noise encoding as shown in Figure 8 by
step 819
- In some embodiments the coding mode selection may be determined based on the following
pseudo code:


This may be summarized as:
If the object input is active, it is actively encoded, and MASA encoding enters mono MASA
encoding for either active signal or active noise signal encoding depending on the MASA
input signal activity. In other words, implement encoding according to either step
809 or 811.
[0154] If the object input is inactive, it can enter DTX as long as MASA is active and stereo-based.
In this case MASA is encoded according to stereo MASA encoding. In other words, implement
encoding according to step 815.
[0155] If the object input is inactive and MASA is active and mono-based, active noise signal
encoding is used for the object input. MASA is encoded according to mono MASA encoding.
In other words, implement encoding according to step 819.
[0156] If MASA however is also inactive, the algorithm checks the previous states; if the object
input was previously inactive and encoded using DTX, the MASA channel configuration
may be used to determine the need for updating the object coding mode. It will thus either
remain in DTX or switch to active noise encoding. MASA noise encoding will be stereo-based
or mono-based, accordingly. In other words, implement encoding according to either
step 817 or 819. Otherwise, active noise encoding is used for both the object input and the
MASA input. In other words, implement encoding according to step 819.
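The mode selection summarized in paragraphs [0153] to [0156] can be sketched as follows. This is a hypothetical sketch, not the codec's actual pseudo code; the step numbers refer to Figure 8, and the argument names are assumptions:

```python
def select_mode(obj_active: bool, masa_active: bool,
                masa_stereo: bool, obj_was_dtx: bool) -> str:
    """Constant-transport-channel mode selection (two channels always sent).
    Returns the Figure 8 step number of the selected encoding mode."""
    if obj_active:
        # Object actively encoded; MASA forced to mono, coded either as an
        # active signal or as active noise depending on its own activity.
        return "809" if masa_active else "811"
    if masa_active:
        # Object silent: spend both channels on stereo MASA when stereo-based,
        # otherwise active-noise encode the object alongside mono MASA.
        return "815" if masa_stereo else "819"
    # Both inactive: the object remains in DTX only if it was already in DTX
    # and the MASA channel configuration is stereo-based; otherwise active
    # noise encoding is used for both inputs.
    return "817" if (obj_was_dtx and masa_stereo) else "819"
```

For example, an active object with inactive MASA selects step 811 (object encoding plus mono MASA active noise encoding), so two transport channels are still produced.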
[0157] In some embodiments the states and mode selection may be an internal adaptive DTX
with no constraints on number of transport channels.
[0158] In these embodiments all of the encoding modes shown in Figure 8 may be considered.
- Object encoding and Mono MASA encoding as shown in Figure 8 by step 809
- Object encoding and Mono MASA active noise encoding as shown in Figure 8 by step 811
- Object encoding and MASA DTX encoding as shown in Figure 8 by step 813
- Object DTX encoding and Stereo MASA encoding as shown in Figure 8 by step 815
- Object DTX encoding and MASA active noise encoding as shown in Figure 8 by step 817
- Object active noise encoding and MASA active noise encoding as shown in Figure 8 by
step 819
- Object DTX encoding and MASA DTX encoding as shown in Figure 8 by step 821.
[0159] In some embodiments the coding mode selection operation is similar to that of the above
example. However, in these embodiments the transmission can now be limited to one active
transport channel. This choice can be implementation specific and depend, for example,
on the bit rate or signal type.
[0160] In some embodiments where both (or all) audio inputs are inactive the apparatus
can be configured to select elements or data for transmission. The selection can in
some embodiments be implementation specific. In some embodiments, as it can generally
be beneficial to send active noise modelling for the background signal rather than for
the voice object (which is silent), the background signal modelling information
is transmitted.
[0161] In some embodiments, when the second / last active input becomes inactive while the
first / other inputs stay in an inactive state, the DTX operation is maintained for any
source that was already in DTX, and active noise encoding is selected for the second
/ last input.
[0162] In some embodiments DTX may be implemented using an adaptive DTX operation. In such
embodiments this is shown by the object DTX encoding and MASA DTX encoding of
Figure 8, step 821, and may be similar to that shown in Figure 7, where SID-only
transmission is allowed.
[0163] Thus the previously presented table may be modified as follows, when no external
DTX is performed and a transport channel constraint is in place:
|                 | MASA active                | MASA inactive                                                              |
| Object active   | 1 object + mono MASA       | 1 object + mono MASA (noise)                                               |
| Object inactive | Stereo MASA (+ object DTX) | Stereo MASA (noise) (+ object DTX) OR 1 object (noise) + mono MASA (noise) |
[0164] With respect to Figure 9 an example electronic device which may be used as the analysis
or synthesis device is shown. The device may be any suitable electronics device or
apparatus. For example in some embodiments the device 1400 is a mobile device, user
equipment, tablet computer, computer, audio playback apparatus, etc.
[0165] In some embodiments the device 1400 comprises at least one processor or central processing
unit 1407. The processor 1407 can be configured to execute various program codes such
as the methods such as described herein.
[0166] In some embodiments the device 1400 comprises a memory 1411. In some embodiments
the at least one processor 1407 is coupled to the memory 1411. The memory 1411 can
be any suitable storage means. In some embodiments the memory 1411 comprises a program
code section for storing program codes implementable upon the processor 1407. Furthermore
in some embodiments the memory 1411 can further comprise a stored data section for
storing data, for example data that has been processed or to be processed in accordance
with the embodiments as described herein. The implemented program code stored within
the program code section and the data stored within the stored data section can be
retrieved by the processor 1407 whenever needed via the memory-processor coupling.
[0167] In some embodiments the device 1400 comprises a user interface 1405. The user interface
1405 can be coupled in some embodiments to the processor 1407. In some embodiments
the processor 1407 can control the operation of the user interface 1405 and receive
inputs from the user interface 1405. In some embodiments the user interface 1405 can
enable a user to input commands to the device 1400, for example via a keypad. In some
embodiments the user interface 1405 can enable the user to obtain information from
the device 1400. For example the user interface 1405 may comprise a display configured
to display information from the device 1400 to the user. The user interface 1405 can
in some embodiments comprise a touch screen or touch interface capable of both enabling
information to be entered to the device 1400 and further displaying information to
the user of the device 1400. In some embodiments the user interface 1405 may be the
user interface for communicating with the position determiner as described herein.
[0168] In some embodiments the device 1400 comprises an input/output port 1409. The input/output
port 1409 in some embodiments comprises a transceiver. The transceiver in such embodiments
can be coupled to the processor 1407 and configured to enable a communication with
other apparatus or electronic devices, for example via a wireless communications network.
The transceiver or any suitable transceiver or transmitter and/or receiver means can
in some embodiments be configured to communicate with other electronic devices or
apparatus via a wire or wired coupling.
[0169] The transceiver can communicate with further apparatus by any suitable known communications
protocol. For example in some embodiments the transceiver can use a suitable universal
mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN)
protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication
protocol such as Bluetooth, or an infrared data communication pathway (IrDA).
[0170] The transceiver input/output port 1409 may be configured to receive the signals and
in some embodiments determine the parameters as described herein by using the processor
1407 executing suitable code.
[0171] In general, the various embodiments of the invention may be implemented in hardware
or special purpose circuits, software, logic or any combination thereof. For example,
some aspects may be implemented in hardware, while other aspects may be implemented
in firmware or software which may be executed by a controller, microprocessor or other
computing device, although the invention is not limited thereto. While various aspects
of the invention may be illustrated and described as block diagrams, flow charts,
or using some other pictorial representation, it is well understood that these blocks,
apparatus, systems, techniques or methods described herein may be implemented in,
as non-limiting examples, hardware, software, firmware, special purpose circuits or
logic, general purpose hardware or controller or other computing devices, or some
combination thereof.
[0172] The embodiments of this invention may be implemented by computer software executable
by a data processor of the mobile device, such as in the processor entity, or by hardware,
or by a combination of software and hardware. Further in this regard it should be
noted that any blocks of the logic flow as in the Figures may represent program steps,
or interconnected logic circuits, blocks and functions, or a combination of program
steps and logic circuits, blocks and functions. The software may be stored on such
physical media as memory chips, or memory blocks implemented within the processor,
magnetic media such as hard disk or floppy disks, and optical media such as for example
DVD and the data variants thereof, CD.
[0173] The memory may be of any type suitable to the local technical environment and may
be implemented using any suitable data storage technology, such as semiconductor-based
memory devices, magnetic memory devices and systems, optical memory devices and systems,
fixed memory and removable memory. The data processors may be of any type suitable
to the local technical environment, and may include one or more of general purpose
computers, special purpose computers, microprocessors, digital signal processors (DSPs),
application specific integrated circuits (ASIC), gate level circuits and processors
based on multi-core processor architecture, as non-limiting examples.
[0174] Embodiments of the inventions may be practiced in various components such as integrated
circuit modules. The design of integrated circuits is by and large a highly automated
process. Complex and powerful software tools are available for converting a logic
level design into a semiconductor circuit design ready to be etched and formed on
a semiconductor substrate.
[0175] Programs, such as those provided by Synopsys, Inc. of Mountain View, California and
Cadence Design, of San Jose, California automatically route conductors and locate
components on a semiconductor chip using well established rules of design as well
as libraries of pre-stored design modules. Once the design for a semiconductor circuit
has been completed, the resultant design, in a standardized electronic format (e.g.,
Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility
or "fab" for fabrication.
[0176] The foregoing description has provided by way of exemplary and non-limiting examples
a full and informative description of the exemplary embodiment of this invention.
However, various modifications and adaptations may become apparent to those skilled
in the relevant arts in view of the foregoing description, when read in conjunction
with the accompanying drawings and the appended claims. However, all such and similar
modifications of the teachings of this invention will still fall within the scope
of this invention as defined in the appended claims.
1. An apparatus comprising means configured to:
obtain two or more separate audio signal inputs;
determine signal activity within the two or more separate audio signal inputs;
determine a coding mode from three or more coding modes, wherein at least one of the
three or more coding modes comprises at least one adaptive discontinuous transmission
coding mode and the coding mode is determined based on the signal activity within
the two or more separate audio signal inputs; and
encode the two or more audio signal inputs based on the determined coding mode.
2. The apparatus as claimed in claim 1, wherein the at least one adaptive discontinuous
transmission coding mode comprises at least one externally visible adaptive discontinuous
transmission coding mode, and the means configured to encode the two or more audio
signal inputs based on the determined coding mode when the coding mode is an externally
visible adaptive discontinuous transmission coding mode is configured to adaptively
encode the two or more audio signal inputs with discontinuous transmission encoding
assistance, the discontinuous transmission encoding assistance configured to encode
inactive signal activity within the two or more separate audio signal inputs as silence
descriptor elements.
3. The apparatus as claimed in claim 2, wherein the means configured to encode the two
or more audio signal inputs based on the determined coding mode when the coding mode
is an externally visible adaptive discontinuous transmission coding mode is configured
to adaptively encode one of the two or more audio signal inputs based on signal activity
within the one of the two or more separate audio signal inputs and at least one of:
signal activity within another one of the two or more separate audio signal inputs;
a determined output bit rate and an encoding rate of the others of the two or more
separate audio signal inputs, such that a combined bit rate for encoding the two or
more audio signal inputs is kept constant;
a determined output bit rate and an encoding rate of the others of the two or more
separate audio signal inputs, such that a combined bit rate for encoding the two or
more audio signal inputs is variable; and
a number of encoded channels to be output from the encoded others of the two or more
audio signal inputs.
4. The apparatus as claimed in any of claims 2 or 3, wherein the means configured to
encode the two or more audio signal inputs based on the determined coding mode when
the coding mode is an externally visible adaptive discontinuous transmission coding
mode is configured to control at least one of:
a number of encoded channels to be output from the encoded one of the two or more
audio signal inputs;
a bit rate of the encoded one of the two or more audio signal inputs.
5. The apparatus as claimed in claim 4, wherein the means configured to control the number
of encoded channels to be output from the encoded one of the two or more audio signal
inputs is configured to control the number of encoded channels to be output from the
encoded one of the two or more audio signal inputs such that a total number of channels
output from encoding all of the two or more audio signal inputs is constant.
6. The apparatus as claimed in claim 1, wherein the at least one adaptive discontinuous
transmission coding mode comprises at least one externally invisible adaptive discontinuous
transmission coding mode, and the means configured to encode the two or more audio
signal inputs based on the determined coding mode when the coding mode is an externally
invisible adaptive discontinuous transmission coding mode is configured to adaptively
encode the two or more audio signal inputs with discontinuous transmission encoding
assistance, the discontinuous transmission encoding assistance configured to encode
inactive signal activity within the two or more separate audio signal inputs as silence
descriptor elements, but maintain a constant number of output channels and/or a constant
output bitrate.
7. The apparatus as claimed in claim 6, wherein the means configured to encode the two
or more audio signal inputs based on the determined coding mode when the coding mode
is an externally invisible adaptive discontinuous transmission coding mode is configured
to adaptively encode one of the two or more audio signal inputs based on signal activity
within the one of the two or more separate audio signal inputs and at least one of:
signal activity within another one of the two or more separate audio signal inputs;
a determined output bit rate and an encoding rate of the others of the two or more
separate audio signal inputs, such that a combined bit rate for encoding the two or
more audio signal inputs is kept constant;
a determined output bit rate and an encoding rate of the others of the two or more
separate audio signal inputs, such that a combined bit rate for encoding the two or
more audio signal inputs is variable; and
a number of encoded channels to be output from the encoded others of the two or more
audio signal inputs.
8. The apparatus as claimed in any of claims 6 or 7, wherein the means configured to
encode the two or more audio signal inputs based on the determined coding mode when
the discontinuous transmission coding mode is an externally invisible adaptive discontinuous
transmission coding mode is configured to control at least one of:
a number of encoded channels to be output from the encoded one of the two or more
audio signal inputs to maintain a constant number of output channels; and
a bit rate of the encoded one of the two or more audio signal inputs to maintain a
constant output bitrate.
9. The apparatus as claimed in claim 8, wherein the means configured to control the bit
rate of the encoded one of the two or more audio signal inputs to maintain a constant
output bitrate is configured to apply zero padding to the encoded audio signal inputs.
10. The apparatus as claimed in claim 8, wherein the means configured to control the bit
rate of the encoded one of the two or more audio signal inputs to maintain a constant
output bitrate is configured to apply an adaptive discontinuous transmission coding
mode for the one of the two or more audio signal inputs as a single source, resulting in
a maximum of a single encoded audio signal input implementing discontinuous transmission
encoding assistance.
11. The apparatus as claimed in claim 8, wherein the means configured to control the bit
rate of the encoded one of the two or more audio signal inputs to maintain a constant
output bitrate is configured to apply an adaptive discontinuous transmission coding
mode for bit rate allocation and transport signal selection for the encoded one of
the two or more audio signal inputs.
12. The apparatus as claimed in any of claims 1 to 11, wherein the three or more coding
modes further comprises an off mode wherein the means configured to encode the two
or more audio signal inputs based on the determined coding mode is configured to encode
the two or more audio signal inputs without any discontinuous transmission encoding
assistance.
13. The apparatus as claimed in any of claims 1 to 12, wherein the three or more coding
modes further comprise an on mode wherein the means configured to encode the two or
more audio signal inputs based on the determined coding mode is configured to encode
the two or more audio signal inputs with discontinuous transmission encoding assistance,
the discontinuous transmission encoding assistance configured to encode inactive signal
activity within the two or more separate audio signal inputs as silence descriptor
elements.
14. The apparatus as claimed in any of claims 1 to 12, wherein the three or more coding
modes further comprise an on mode wherein the means configured to encode the two or
more audio signal inputs based on the determined coding mode is configured to encode
the two or more audio signal inputs with discontinuous transmission encoding assistance,
the discontinuous transmission encoding assistance configured to adaptively individually
encode one or more of the two or more separate audio signal inputs with discontinuous
transmission encoding assistance, the discontinuous transmission encoding assistance
configured to encode inactive signal activity within the one or more of the two or
more separate audio signal inputs as silence descriptor elements and others of the
two or more audio signal inputs without any discontinuous transmission encoding assistance.
15. A method for an apparatus, the method comprising:
obtaining two or more separate audio signal inputs;
determining signal activity within the two or more separate audio signal inputs;
determining a coding mode from three or more coding modes, wherein at least one of
the three or more coding modes comprises at least one adaptive discontinuous transmission
coding mode and the coding mode is determined based on the signal activity within
the two or more separate audio signal inputs; and
encoding the two or more audio signal inputs based on the determined coding mode.