[0001] The invention relates to a method and to an apparatus for generating and for decoding
sound field data including Ambisonics sound field data of an order higher than three,
wherein for encoding and for decoding different processing paths can be used.
Background
[0002] Traditional audio data signal transport streams for 2D presentation are channel oriented.
2D presentations include formats like stereo or surround sound, and are based on audio
container formats like WAV and BWF (Broadcast Wave Format). The wave format WAV is
described in Microsoft, "Multiple Channel Audio Data and WAVE Files", updated March 7, 2007, http://www.microsoft.com/whdc/device/audio/multichaud.mspx, and in http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html, last update 19 June 2006.
[0003] Improved surround systems require an increasing number of loudspeakers or audio channels,
which leads to an extension of these audio container formats.
[0004] Due to the upcoming 3D video activities in cinema and broadcasting, spatial or 3D
audio becomes more and more attractive. Nevertheless, descriptions of spatial audio
scenes are significantly more complex than in existing 2D surround sound systems.
Well-known descriptions are based on Wave Field Synthesis (WFS, cf. WO 2004/047485 A1) as well as on Ambisonics, which was already developed in the early 1970s: http://en.wikipedia.org/wiki/Ambisonics.
[0005] WFS combines a high number of spherical sound sources for emulating plane waves from
different directions. Therefore, a lot of loudspeakers or audio channels are required.
A description contains a number of source signals as well as their specific positions.
[0006] Ambisonics, however, uses specific coefficients based on spherical harmonics for
providing a sound field description that is independent from any specific loudspeaker
set-up. This leads to a description which does not require information about loudspeaker
positions during sound field recording or generation of synthetic scenes. The reproduction
accuracy in an Ambisonics system can be modified by its order N. The 'higher-order Ambisonics' (HOA) description considers orders of more than one, and the focus in this application is on HOA.
[0007] From that order, the number of required audio information channels can be determined for a 2D or a 3D system, because it depends on the number of spherical harmonic basis functions. The number O of channels is O = 2N + 1 for 2D and O = (N + 1)^2 for 3D. Besides the true 2D and 3D cases, 'mixed orders' use different orders in 2D (x-y plane only) and in 3D (additionally the z axis).
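The relationship between order and channel count can be illustrated with a small Python sketch; the function name is arbitrary and only serves the example:

```python
def num_ambisonics_channels(order: int, dimension: str = "3D") -> int:
    """Number of Ambisonics channels O for order N: 2N+1 (2D) or (N+1)^2 (3D)."""
    if dimension == "2D":
        return 2 * order + 1
    if dimension == "3D":
        return (order + 1) ** 2
    raise ValueError("dimension must be '2D' or '3D'")

# First-order B-format: 3 channels (2D) / 4 channels (3D); order 3: 16 channels
assert num_ambisonics_channels(1, "2D") == 3
assert num_ambisonics_channels(1, "3D") == 4
assert num_ambisonics_channels(3, "3D") == 16
assert num_ambisonics_channels(255, "3D") == 65536
```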
[0008] The first-order B-format uses three channels for 2D and four channels for 3D, and has been extended to a higher-order B-format. Depending on O, a horizontal (2D), a full-sphere (3D), or a mixed sound field type description can be generated. By ignoring the appropriate channels, this B-format is backward compatible, i.e. a 2D Ambisonics receiver is able to decode the 2D components from a 3D Ambisonics sound field. The extended B-format for HOA considers only orders up to three, which corresponds to a maximum of 16 channels.
[0009] The older UHJ-format was introduced to enable mono and stereo compatibility. The
G-format was introduced to reproduce sound scenarios in 5.1 environments.
[0010] However, none of these existing formats considers orders of more than three.
[0011] The WAVE_FORMAT_EXTENSIBLE format is an extension of the above-mentioned WAV format. One application is the use of the Ambisonics B-format in the WAVEX description: "Wave Format Extensible and the .amb suffix or WAVEX and Ambisonics", http://mchapman.com/amb/wavex.
Invention
[0012] As mentioned above, known Ambisonics formats do not consider orders of more than
three.
[0013] Wave-based audio format descriptions are used in different applications. An environment which is very important today and will become even more important in the future is that of internet applications based on Ethernet transmission protocols. However, a data structure for Ambisonics transmission that is able to use the above-mentioned B-format as well as additional features, like the Ambisonics order and the bit lengths of its coefficients, in an efficient manner is not yet known to the applicant.
[0014] Another aspect is that the B-format always assumes plane waves for the sound sources. For a higher quality of the acoustic wave field reproduction, a more realistic model should emulate the sound sources as spherical waves. However, spherical waves introduce more complex frequency dependencies than plane waves.
[0015] Furthermore, a transmission of video content is in many cases combined with audio
content transmission. Existing streaming data structures, e.g. for cinema applications,
consider 2D surround sound only, for example WAV or AIFF (Audio Interchange File Format).
[0017] This payload header extends the RTP header of Fig. 1 by a 2-octet extended sequence number and a 2-octet extended time stamp. Furthermore, one octet for flags and a reserved field, followed by a 3-octet SMPTE time stamp and a 4-octet offset value, is proposed therein. The 32-bit aligned payload data follow the header data.
[0018] A problem to be solved by the invention is to provide a data structure (i.e. a protocol
layer) for 3D higher-order Ambisonics sound field description formats, which can be
used for real-time transmission over Ethernet. This problem is solved by the encoding
method disclosed in claim 1 and the decoding method disclosed in claim 3. Apparatuses
which utilise these methods are disclosed in claims 2 and 4, respectively.
[0019] The data structures described below facilitate real-time transmission of 3D sound
field descriptions over Ethernet. From the content of additional metadata the transmitted
3D sound field can be adapted at receiver side to the available headphones or the
number and positions of loudspeakers, for regular as well as for irregular set-ups.
Unlike in WFS, no regular loudspeaker set-up with a large number of loudspeakers is required.
[0020] Advantageously, in the inventive transmission data structure the sound quality level
can be adapted to the available sound reproduction system, e.g. by mapping a 3D Ambisonics
sound field description onto a 2D loudspeaker set-up. Advantageously, the inventive format enables Ambisonics orders up to N = 255, whereas known Ambisonics formats allow orders up to N = 3 only.
[0021] Further, the inventive data structure considers single microphones or microphone
arrays as well as virtual acoustical sources with different accuracies and sample
rates. Advantageously, moving sources (i.e. sources with time-dependent spatial positions)
are considered in the Ambisonics descriptions inherently.
[0022] The Ambisonics header information level is adaptable between a simple and an encoder
related mode. The latter one enables fast decoder modifications. This is useful especially
for real-time applications.
[0023] The proposed data structure is extendable for classical audio scene descriptions,
i.e. sound sources and their positions.
[0024] Generally, the inventive Ambisonics processing is based on linear operators, i.e. the Ambisonics channel data can be packed and transmitted singly or in an assembled manner as a matrix.
[0025] In principle, the inventive encoding method is suited for generating sound field
data including Ambisonics sound field data of an order higher than three, said method
including the steps:
- receiving S input signals x(k) from a microphone array including M microphones, and/or from one or more virtual sound sources;
- multiplying said input signals x(k) with a matrix Ψ,
Ψ = ( Y_n^m(Ω_s) ),  rows indexed by degree n and order m, columns indexed by the directions Ω_s, s = 0,...,S-1,     (1)
wherein the matrix elements Y_n^m(Ω_s) represent the spherical harmonics of all currently used directions Ω0,...,ΩS-1, index m denotes the order, index n denotes the degree of a spherical harmonic, N represents the Ambisonics order, n = 0,...,N, and m = -n,...,+n, so as to get coefficients vector data d(k) representing coded directional information of N Ambisonics signals for every sample time instant k;
- processing said coefficients vector data d(k), value N and parameter Norm in one or two or more of the following four paths:
- a) combining said coefficients vector data d(k), said value N and said parameter Norm with radii data RS representing the distances of the sources of said S input signals x(k);
- b) based on spherical waves, array response filtering said coefficients vector data
d(k) in dependency from said Ambisonics order N and radii Rm values, said radii Rm values representing individual microphone radii in a microphone array, so as to compensate
for non-linear frequency dependency, followed by normalising for spherical waves data,
so as to provide filtered coefficients A(k), said parameter Norm and said order N value;
- c) based on spherical waves, array response filtering said coefficients vector data
d(k) in dependency from said Ambisonics order N, said radii Rm values and a radius Rref value, said radius Rref value representing a mean radius of loudspeakers arranged at decoder side, so as
to compensate for non-linear frequency dependency, followed by normalising for spherical
waves data, so as to provide filtered coefficients A(k), said parameter Norm, said order N value, and said radius Rref value;
- d) based on plane waves, array response filtering said coefficients vector data d(k) in dependency from said Ambisonics order N, said radii RM values and a Plane Wave parameter, so as to compensate for non-linear frequency dependency,
followed by normalising for plane waves data, so as to provide filtered coefficients
A(k), said parameter Norm, said order N value, and said Plane Wave parameter;
- in case a processing took place in two or more of said paths, multiplexing the corresponding
data;
- output of data frames including said provided data and values.
[0026] In principle the inventive encoder apparatus is suited for generating sound field
data including Ambisonics sound field data of an order higher than three, said apparatus
including:
- means being adapted for multiplying S input signals x(k), which are received from a microphone array including M microphones and/or from one or more virtual sound sources, with a matrix Ψ, wherein the matrix elements Y_n^m(Ω_s) represent the spherical harmonics of all currently used directions Ω0,...,ΩS-1, index m denotes the order, index n denotes the degree of a spherical harmonic, N represents the Ambisonics order, n = 0,...,N, and m = -n,...,+n, so as to get coefficients vector data d(k) representing coded directional information of N Ambisonics signals for every sample time instant k;
- means being adapted for processing said coefficients vector data d(k), value N and parameter Norm in one or two or more of the following four paths:
- a) combining said coefficients vector data d(k), said value N and said parameter Norm with radii data RS representing the distances of the sources of said S input signals x(k);
- b) based on spherical waves, array response filtering said coefficients vector data
d(k) in dependency from said Ambisonics order N and radii Rm values, said radii Rm values representing individual microphone radii in a microphone array, so as to compensate
for non-linear frequency dependency, followed by normalising for spherical waves data,
so as to provide filtered coefficients A(k), said parameter Norm and said order N value;
- c) based on spherical waves, array response filtering said coefficients vector data
d(k) in dependency from said Ambisonics order N, said radii RM values and a radius Rref value, said radius Rref value representing a mean radius of loudspeakers arranged at decoder side, so as
to compensate for non-linear frequency dependency, followed by normalising for spherical
waves data, so as to provide filtered coefficients A(k), said parameter Norm, said order N value, and said radius Rref value;
- d) based on plane waves, array response filtering said coefficients vector data d(k) in dependency from said Ambisonics order N, said radii RM values and a Plane Wave parameter, so as to compensate for non-linear frequency dependency,
followed by normalising for plane waves data, so as to provide filtered coefficients
A(k), said parameter Norm, said order N value, and said Plane Wave parameter;
- a multiplexer means for multiplexing the corresponding data in case a processing took
place in two or more of said paths, which multiplexer means provide data frames including
said provided data and values.
[0027] In principle, the inventive decoding method is suited for decoding sound field data
that were encoded according to the above encoding method using one or two or more
of said paths, said method including the steps:
- parsing the incoming encoded data, determining the type or types a) to d) of said
paths used for said encoding and providing the further data required for a decoding
according to the encoding path type or types;
- performing a corresponding decoding processing for one or two or more of the paths
a) to d):
- a) based on spherical waves, filtering the received coefficients vector data d(k) in dependency from said radii data RS so as to provide filtered coefficients A(k),
and distance coding said filtered coefficients A(k) in dependency from said order value N and for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A'(k) together with loudspeaker direction values Ωl, value N and parameter Norm;
- b) based on spherical waves, distance coding said filtered coefficients A(k) in dependency from said order value N and for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A'(k) together with loudspeaker direction values Ωl, value N and parameter Norm;
- c) based on spherical waves, distance coding said filtered coefficients A(k) in dependency from said order value N and said radius value Rref for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A'(k) together with loudspeaker direction values Ωl, value N and parameter Norm;
- d) based on plane waves, providing said filtered coefficients A(k), order value N, parameter Norm and a flag for Plane Waves;
- in case a processing took place in two or more of said paths, multiplexing the corresponding
data, wherein the selected path or paths are determined based on parameter Norm, order value N and said Plane Waves flag;
- decoding said distance encoded filtered coefficients A'(k) or said filtered coefficients A(k), respectively, in dependency from said parameter Norm, said order value N and said loudspeaker direction values Ωl, so as to provide loudspeaker signals for a loudspeaker array.
[0028] In principle the inventive decoder apparatus is suited for decoding sound field data
that were encoded according to the above encoding method using one or two or more
of said paths, said apparatus including:
- means being adapted for parsing the incoming encoded data, and for determining the
type or types a) to d) of said paths used for said encoding and for providing the
further data required for a decoding according to the encoding path type or types;
- means being adapted for performing a corresponding decoding processing for one or
two or more of the paths a) to d):
- a) based on spherical waves, filtering the received coefficients vector data d(k) in dependency from said radii data RS so as to provide filtered coefficients A(k),
and distance coding said filtered coefficients A(k) in dependency from said order value N and for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A'(k) together with loudspeaker direction values Ωl, value N and parameter Norm;
- b) based on spherical waves, distance coding said filtered coefficients A(k) in dependency from said order value N and for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A'(k) together with loudspeaker direction values Ωl, value N and parameter Norm;
- c) based on spherical waves, distance coding said filtered coefficients A(k) in dependency from said order value N and said radius value Rref for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A'(k) together with loudspeaker direction values Ωl, value N and parameter Norm;
- d) based on plane waves, providing said filtered coefficients A(k), order value N, parameter Norm and a flag for Plane Waves;
- multiplexing means which, in case a processing took place in two or more of said paths,
select the corresponding data to be combined, based on parameter Norm, order value N and said Plane Waves flag;
- decoding means which decode said distance encoded filtered coefficients A'(k) or said filtered coefficients A(k), respectively, in dependency from said parameter Norm, said order value N and said loudspeaker direction values Ωl, so as to provide loudspeaker signals for a loudspeaker array.
[0029] Advantageous additional embodiments of the invention are disclosed in the respective
dependent claims.
Drawings
[0030] Exemplary embodiments of the invention are described with reference to the accompanying
drawings, which show in:
- Fig. 1
- Known RTP header format;
- Fig. 2
- Known extended RTP header format encapsulating DPX data, audio data or metadata;
- Fig. 3
- Ambisonics encoder facilitating different applications at production side before Ambisonics
coefficients and metadata are transmitted;
- Fig. 4
- Ambisonics decoder facilitating different applications at reproduction side following
reception of Ambisonics coefficients and metadata;
- Fig. 5
- RTP payload header extension for Ambisonics data according to the invention;
- Fig. 6
- General Ambisonics data header;
- Fig. 7
- Individual Ambisonics data header;
- Fig. 8
- Ambisonics metadata;
- Fig. 9
- Ambisonics receiver parser.
Exemplary embodiments
[0031] At first, different scenarios for sound recording or production as well as for reproduction
are considered in order to derive the inventive Ethernet/IP based streaming data format.
The description of these scenarios is based at production side on an Ambisonics encoder
(AE) and at reproduction side on an Ambisonics decoder (AD).
[0032] In an Ambisonics encoder as shown in Fig. 3 there are two different kinds of possible
input signals:
- a microphone array 31 including m microphones, i.e. real sound sources;
- v virtual sources 32, i.e. synthetic sounds.
[0033] For an HOA description of a source, not only the time-dependent source signal s(t) is required but also its position, which may move around and hence is time-dependent, too. The source position can be described by its spherical coordinates, i.e. the radius rS from the origin to the source and the angles (ΘS, ΦS) = ΩS, where ΘS denotes the inclination and ΦS denotes the azimuth angle in the x-y plane.
[0034] In a first step or multiplier 33, all S source signals x(k) at each sample time kT, i.e. virtual single sources as well as microphone array sources, are multiplied with a matrix Ψ defined in Eq.(1).
[0035] Matrix Ψ with O rows and S columns performs a direction coding, because Ψ contains the spherical harmonics Y_n^m(Ω_s) of all currently used directions ΩS, wherein the superscript index m denotes the order and the subscript index n denotes the degree of a spherical harmonic (note: in connection with microphones, the index m refers to the running number of a microphone). If N represents the Ambisonics order, the index n has values in the range 0,...,N, and for each degree n the values of m run from -n to +n. More details regarding indices n and m (m for order) are explained below in connection with Table 1. Instead of this specific format of matrix Ψ, any other equivalent representation of that matrix can be used.
[0036] Matrix Ψ is used to output a vector d(k) of N Ambisonics signals for every sample time instant k, as defined in Eq.(2) and Eq.(3):
x(k) = [x0(k), x1(k), ..., xS-1(k)]^T     (2)
d(k) = Ψ · x(k)     (3)
These signals represent the complete sound field description that has to be transmitted to the reproduction side. Vector d(k) contains the directional information only. However, the distances of all sources over a specific frequency range are to be considered, too, and this frequency behaviour or dependency is non-linear. Therefore, additional filters 341, 342 and 343 are required, which can be implemented at encoder or at decoder side. Especially for HOA, a plane wave processing is sometimes not sufficient because it does not consider frequency dependencies. Therefore, a more general processing considers sources and sinks not only with plane waves but also with spherical waves. Both wave forms require additional steps or stages that use different factors depending on the radius r and the wave number kω, where kω = ω/c = 2π·f/c, with f denoting the frequency and c the speed of sound.
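The directional encoding step d(k) = Ψ·x(k) can be sketched as follows. SciPy's complex spherical harmonics are used as one possible convention; the concrete normalisation (cf. the ANSF field below), the function names and the angle values are assumptions for the example only:

```python
import numpy as np
from scipy.special import sph_harm  # complex Y_n^m; call: sph_harm(m, n, azimuth, inclination)

def encoding_matrix(order, inclinations, azimuths):
    """Mode matrix Psi with O = (order+1)**2 rows and S = len(inclinations) columns."""
    rows = []
    for n in range(order + 1):               # degree n = 0,...,N
        for m in range(-n, n + 1):           # order  m = -n,...,+n
            rows.append([sph_harm(m, n, az, inc)
                         for inc, az in zip(inclinations, azimuths)])
    return np.array(rows)                    # shape (O, S)

# Two example source directions: straight ahead and 45 degrees to the left
order = 3
inclinations = np.array([np.pi / 2, np.pi / 2])   # Theta_S
azimuths     = np.array([0.0, np.pi / 4])         # Phi_S
psi = encoding_matrix(order, inclinations, azimuths)

x_k = np.array([0.7, -0.2])      # S input samples x(k) at sample time k
d_k = psi @ x_k                  # O = 16 Ambisonics coefficients d(k) for N = 3
print(d_k.shape)                 # (16,)
```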
[0037] The pressure of a sound field p(r,Θ,Φ,kω) can be calculated as follows:
p(r,Θ,Φ,kω) = Σn=0..∞ Σm=-n..+n A_n^m(kω) · jn(kω·r) · Y_n^m(Θ,Φ),
where jn(kω·r) describes the spherical Bessel function of the first kind, which depends on the product of wave number kω and radius r. The coefficient A_n^m, or Ambisonics signal, can be calculated in the case of plane waves from every direction ΩS independently of the frequency:
A_n^m = x · 4π · i^n · [Y_n^m(ΩS)]*,
where x denotes the source signal. This is not the case for spherical waves. Here, the coefficients A_n^m(kω) depend on the frequency:
A_n^m(kω) = x · i · kω · hn(kω·rS) · [Y_n^m(ΩS)]*,
where hn(kω·r) describes the spherical Hankel function of the first kind.
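The frequency dependence introduced by spherical waves can be illustrated with SciPy's spherical Bessel functions; the radius and frequencies below are arbitrary example values, and the exact coefficient scaling of the equations above is not reproduced:

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn

def spherical_hankel1(n, z):
    """Spherical Hankel function of the first kind: h_n(z) = j_n(z) + i*y_n(z)."""
    return spherical_jn(n, z) + 1j * spherical_yn(n, z)

c = 340.0        # speed of sound in m/s (value used in the text)
r = 1.5          # example source radius in m (assumed)
for f in (100.0, 1000.0, 10000.0):
    k_w = 2.0 * np.pi * f / c                 # wave number k_omega
    for n in (0, 1, 4):
        print(f"f={f:7.0f} Hz  n={n}  j_n={spherical_jn(n, k_w * r):+.4f}  "
              f"|h_n|={abs(spherical_hankel1(n, k_w * r)):.4f}")
```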
[0038] All these dependencies lead to the following four cases that are to be considered for an extended transmission of Ambisonics coefficients based on RTP. Fig. 3 shows a block diagram of an Ambisonics encoder for these four cases at production side. The required functions are represented by corresponding steps or stages in front of the transmission. All processing steps are clocked by a frequency that is made synchronous in stage 38 with the sample frequency 1/T. A controller 37 receives a mode selection signal and the value of order N, and controls an optional multiplexer 36 that receives the filter responses and the output signal of multiplier 33, and outputs the inventive data structure frames 39. Multiplier 33 represents a directional encoder providing corresponding coefficients, and outputs the unfiltered vector data d(k), the order N value and parameter Norm.
Case 1:
[0039] An array response filter 42 ('Filter 1' in Fig. 4), acting only on the microphone sources data, can be arranged at decoder side. The unfiltered vector data d(k), the order N value and parameter Norm are assembled in a combiner 340 with the radii data RS(t), and are fed to an optional multiplexer 36. The radii data RS(t) represent the distances of the audio sources of the S input signals x(k), and refer to microphones as well as to artificially generated virtual sound sources.
Case 2:
[0040] The coefficients vector data d(k) pass through an array response filter 341 for the microphone sources (filter 2). The filtering compensates the microphone-array response and is based on Bessel or Hankel functions. Basically, the signals from the output vectors d(k) are filtered; the other inputs serve as parameters for the filter, e.g. parameter R is used in the term k·r. The filtering is relevant only for microphones that have the individual radius Rm. Such radii are taken into consideration in the term k·r of the Bessel or Hankel functions. Normally, the amplitude response of the filter starts with a low-pass characteristic but increases for higher frequencies. The filtering is performed in dependency from the Ambisonics order N, the order n and the radii Rm values, so as to compensate for the non-linear frequency dependency. A subsequent normalisation step or stage 351 for spherical waves data provides filtered coefficients A(k). It is assumed that there is also a corresponding filter at reproduction side (filter 431 in Fig. 4). The filtered and normalised coefficients A(k), parameter Norm and the order N value are fed to multiplexer 36.
Case 3:
[0041] The coefficients vector data d(k) pass through an array response filter 342 for the microphone sources (filter 3). The filtering is performed in dependency from said Ambisonics order N, said order n, the radii Rm values and a radius Rref value representing the average radius Rref of the loudspeakers at decoder side, as described in the section "Radius Rref (RREF)" below, so as to compensate for the non-linear frequency dependency. In case microphone signals are used, a filter for spherical waves data is also arranged at reproduction side. Then the average radius Rref of the loudspeakers has to be considered already in filter 342. A subsequent normalisation step or stage 352 for spherical waves data provides filtered coefficients A(k). Step/stage 352 can include a distance coding like that described in connection with Fig. 4. The filtered coefficients A(k) from step/stage 352, parameter Norm, the order N value and radius value Rref are fed to multiplexer 36.
Case 4:
[0042] The coefficients vector data d(k) pass through an array response filter 343 for the microphone sources (filter 4). The filtering is performed in dependency from the Ambisonics order N, the radii Rm values and a Plane Wave parameter. A subsequent normalisation step or stage 353 for plane waves data provides the filtered coefficients A(k), parameter Norm, the order N value and a flag for Plane Wave to multiplexer 36.
[0043] The Ambisonics encoder can code the output signals 361 in any one of these paths, in any two of these paths, or in more than two of these paths. The normalisation steps or stages 351 to 353 can use a normalisation or scaling as described below in the section "Ambisonics Normalisation/Scaling Format (ANSF)".
[0044] Following transmission of the values mentioned above, e.g. via an Ethernet connection, at reproduction side the Ambisonics decoder depicted in Fig. 4 parses the incoming data structures in a parser 41 in order to detect the case type and to provide the data for performing the appropriate functions. An example for such a parser is disclosed in WO 2009/106637 A1.
Case 1:
[0045] Unfiltered vector data d(k), order value N, parameter Norm and all radii data RS(t) are parsed. These values pass through an array response filter 42 (Filter 1) which filters (a filtering as described for Fig. 3) the received d(k) data under consideration of all radii RS(t). The resulting filtered coefficients A(k) are distance coded (DC) in a distance coding step or stage 431 for all loudspeaker radii RLS and order N, and pass thereafter, together with loudspeaker direction values Ωl (representing the directions of the LS loudspeakers 46), value N and parameter Norm, through an optional multiplexer 44 to a panning or pseudo-inverse step or stage 45. Distance coding means taking into account Bessel or Hankel functions with the radii parameter in the term k·r for plane or spherical waves. Examples of distance coding are published in M.A. Poletti, "Three-Dimensional Surround Sound Systems Based on Spherical Harmonics", J. Audio Eng. Soc., vol. 53, no. 11, November 2005, e.g. in equations (31) and (32), and in J. Daniel, "Spatial Sound Encoding Including Near Field Effect: Introducing Distance Coding Filters and a Viable, New Ambisonic Format", AES 23rd Intl. Conf., Copenhagen, Denmark, 23-25 May 2003.
Case 2:
[0046] Filtered coefficients A(k), parameter Norm and order value N are parsed. The filtered coefficients A(k) are distance coded (DC) in a distance coding step or stage 432 for all loudspeaker radii RLS and order N, and pass thereafter, together with loudspeaker direction values Ωl, value N and parameter Norm, through multiplexer 44 to the panning or pseudo-inverse step or stage 45. Spherical waves on AE and AD sides are assumed.
Case 3:
[0047] Filtered coefficients A(k), order value N, parameter Norm and radius value Rref are parsed. The filtered coefficients A(k) are distance coded (DC) in a distance coding step or stage 433 for all loudspeaker radii RLS and order N under consideration of radius Rref, and pass thereafter, together with loudspeaker direction values Ωl, value N and parameter Norm, through multiplexer 44 to the panning or pseudo-inverse step or stage 45. Spherical waves on AE and AD sides are assumed.
Case 4:
[0048] Filtered coefficients A(k), order value N, parameter Norm and a flag for Plane Waves are parsed. The filtered coefficients A(k), together with loudspeaker direction values Ωl, value N and parameter Norm, pass through multiplexer 44 to the panning or pseudo-inverse step or stage 45. Plane waves on AE and AD sides are assumed.
[0049] Based on parameter Norm, order value N and the Plane Waves flag, a mode selector 47 selects in multiplexer 44 the corresponding path or paths a) to d) which was or were used at encoder side. Decoder 45, which represents a panning or a mode-matching operation including a pseudo-inverse, inverts the matrix Ψ operation of the Ambisonics encoder in Fig. 3 and applies this operation to the filtered coefficients A(k) or the filtered and distance-coded coefficients A'(k), respectively, in dependency from the parameter Norm, order value N and the loudspeaker direction values Ωl, and provides the loudspeaker signals for a loudspeaker array 46. The matrix Ψ operation is inverted for Cases 1-3 by wl(k) = D·A'(k), and for Case 4 by wl(k) = D·A(k). Parser 41 also provides synchronisation information that is used for the re-synchronisation of a clock 48.
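The decoding operation wl(k) = D·A'(k) can be sketched as follows; the pseudo-inverse of the loudspeaker mode matrix is one common way to obtain D, and the stand-in matrices below are random placeholders rather than a real loudspeaker set-up:

```python
import numpy as np

def decode(a_prime_k, psi_loudspeakers):
    """Return the loudspeaker samples w(k) = D * A'(k), with D = pinv(Psi_LS).

    psi_loudspeakers: mode matrix (O x L) built from the L loudspeaker
    directions Omega_l in the same way as the encoder matrix Psi.
    """
    decoder = np.linalg.pinv(psi_loudspeakers)   # shape (L, O)
    return decoder @ a_prime_k

rng = np.random.default_rng(0)
psi_ls  = rng.standard_normal((16, 8))   # placeholder mode matrix: O = 16, L = 8
a_prime = rng.standard_normal(16)        # placeholder distance-coded coefficients A'(k)
w = decode(a_prime, psi_ls)
print(w.shape)                           # (8,) -> one sample per loudspeaker
```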
[0050] The invention specifies a packet-based streaming format for encapsulating spatial
sound field descriptions based on Ambisonics into an extended real-time transport
protocol, in particular RTP, for real-time streaming of spatial audio scenes. The
focus is on a standalone spatial (2D/3D) audio real-time application, e.g. a transmission
of a live concert or a live sport event via IP. This requires a specific spatial audio
layer including time stamps and possibly synchronisation information. The Ambisonics
real-time stream can be used together with an RTP layer. In addition, alternative
RTP layers with or without extended headers are described below.
[0051] In general, for a spatial audio transmission a sound field description in Ambisonics
can be used in which possible sound source positions are inherently encoded. An alternative
is the transmission of the source signals together with their time-dependent or time-independent
positions. A switching possibility between these two alternatives is provided, too,
but the directly following section will focus on Ambisonics.
Extended Ambisonics streaming format (EASF)
[0052] Ethernet transmissions (e.g. via the internet) are performed in data packets with a typical maximum packet length, called 'path MTU', of up to 1500 or 9000 bytes. In case Ambisonics sound fields are to be transmitted via Ethernet, such relatively small data packets are not large enough. Therefore, several packets can be combined into larger containers named 'frames'. Such a frame represents a dedicated time interval within which a typical number of packets is transmitted. For example, in 1080p video mode a frame contains 1080 data packets, each of which describes one line of a complete video frame. Especially for real-time applications, even for audio (where low latency and low packet loss are important), a transmission should be frame based.
[0053] Because Ambisonics supports a sound field description that is independent of positions but has an adaptable quality, different amounts of data per packet or frame are possible. However, the number of octets in a data packet shall always be the same within a frame, except for the last packet. In principle, the RTP sequence number is to be incremented with each packet.
[0054] With regard to Fig. 3 and Fig. 4, Case 1 requires a transmission of each time-dependent radius RS(t). This is an option if the filter processing is to be performed in the decoder. However, in the following section the focus is on Cases 2-4, in which the filtered coefficients A(k) are transmitted. This allows a higher bandwidth because the transmission remains independent from all source positions, i.e. it is better suited for Ambisonics.
[0055] For standalone audio transmission, the protocol contains the following header data structure. A standard RTP header (cf. Fig. 1) contains the following bit fields:
Version (V) - 2 bit
RTP Version (default is V=2)
Padding (P) - 1 bit
If set, a data packet will contain several additional padding bytes. These are always
located at the end following the payload. The last padding byte contains a count of
how many padding bytes are to be ignored.
Extension (X) - 1 bit
If set, the fixed header is followed by exactly one header extension.
CSRC count (CC) - 4 bit
The number of contributing source identifiers, following the fixed header.
Marker (M) - 1 bit
In general, the marker bit can be defined by a profile. Here, it signalises the end
of a frame, i.e. it is set for the last data packet. For other packets it must be
cleared.
Payload Type (PT) - 7 bits
The payload type is defined for an Audio standalone transmission as EASF. For a combined
transmission with uncompressed video the film format is chosen, e.g. DPX.
Sequence Number - 16 bits
The LSB bits for the sequence number. It increments by one for each RTP data packet
sent, and may be used by the receiver for detecting packet loss and for restoring
the packet sequence. The initial value of the sequence number is random (i.e. unpredictable)
in order to make known-plaintext attacks on encryption more difficult. The standard
16-bit sequence number is augmented with another 16 bits in the payload header in
order to avoid problems due to wrap-around when operating at high data rates.
Timestamp - 32 bits
The timestamp denotes the sampling instant of the frame to which the RTP packet belongs.
Packets belonging to the same frame must have the same timestamp.
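The standard RTP fixed header listed above occupies 12 bytes, including the SSRC field defined by standard RTP (RFC 3550), which is not repeated in the list. The following sketch packs these fields; the payload type and SSRC values are arbitrary example values:

```python
import struct

def pack_rtp_header(version=2, padding=0, extension=1, cc=0,
                    marker=0, payload_type=98, seq=0, timestamp=0,
                    ssrc=0x11223344):
    """Pack the 12-byte RTP fixed header (V, P, X, CC, M, PT, seq, timestamp, SSRC)."""
    byte0 = (version << 6) | (padding << 5) | (extension << 4) | (cc & 0x0F)
    byte1 = (marker << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

header = pack_rtp_header(marker=1, seq=4711, timestamp=123456)
assert len(header) == 12
```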
RTP payload header extension
According to the invention, the fields of the known RTP header keep their usual meaning,
but that header is amended as follows:
RTP Payload Frame Status (PLFS) - 2 bit
The frame status describes which type of data will follow the extended RTP header
in the payload block:
PLFS code | Payload type
00 | Ambisonics coefficients
01 | Frame end (+ Ambisonics coefficients)
10 | Frame begin (+ Metadata)
11 | Metadata
I.e., in the first packet of a frame, instead of audio data, additional metadata can
be transmitted. In case of Ambisonics transmission, the metadata contains source and
Ambisonics encoder related information (production side information) required for
the decoding process.
Time Code/Sync Frequency (TCSF) - 30 bit unsigned integer
[0056] The following SMPTE time code or the synchronisation is based on a specific clock frequency, the Time Code/Sync Frequency TCSF. In order to support a large range of frequencies, TCSF is defined as a 30 bit integer field. The value is represented in Hz and leads to a frequency range from 0 to 1073.741824 MHz, wherein a value of 0 Hz signals that no time code is available.
Audio Source Type (AST) - 2 bit
[0057] The transmission of audio content is possible in different modes: in the form of Ambisonics sound field descriptions, or as sampled audio sources including their positions. The following table shows the AST values and their meaning.
AST code | Possible sources
00 | Sound field
01 | Sound sources + fixed positions
10 | Sound sources + time-dependent positions
11 | Reserved
[0058] The selection in data field AST facilitates not only a separation within Ambisonics (cf. the example provided below in connection with Fig. 9) but also the parallel transmission of differently encoded audio source signals (Ambisonics and/or PCM data + position data), i.e. the inventive protocol can be complemented, e.g. for PCM data. The below-described SMPTE Time Code/Clock Sync Info (STCSI) facilitates the temporally correct assignment of the audio signal sources.
Audio Dimension (ADIM) - 1 bit
[0059] The dimension in case of existing and extendable formats is described as follows:
ADIM code | Dimension
0 | 2D
1 | 3D
Extended Ambisonics Header (XAH) - 1 bit
[0060] If XAH is cleared, the general Ambisonics header is transmitted only in the first
data packet of a frame and the individual Ambisonics header is transmitted in all
other data packets.
[0061] If XAH is set, the general Ambisonics header shall also be available in every data
packet in front of the individual Ambisonics header. This mode enables a modification
of the parameters in each data packet, i.e. in real-time. It can be useful for real-time
applications where no or only small buffers are available. However, this mode decreases
the available bandwidth.
[0062] Different sources can generate audio signals at the same time. Known protocols are based on a separate transmission of the sound sources, i.e. every data frame refers to a single temporal section in which, depending on the sampling frequency, several samples can be contained. Therefore, in known protocols, different source signals occurring at the same time instant will use the same time stamp and the same frame number. This poses no problem for an offline processing, i.e. non-real-time processing, where the transmitted data are buffered and assembled later on. However, this does not work for real-time processing, in which a small latency is demanded. In the inventive protocol, the data field XAH facilitates carrying the header along continuously, and the parser 41 in Fig. 4 can switch back and forth block-by-block (or Ethernet packet-by-packet, or frame-by-frame) between different audio source types.
[0063] Distinguishing between general header and individual header facilitates a real-time
adaptation.
Selector Time Code or Sync (STS) - 1 bit
[0064] If STS is cleared, the value in the 24 bit field STCSI (see below) represents the
SMPTE time code. If STS is set, field STCSI contains user-specific synchronisation
information.
Rsvrd - 3 bit
[0065] Reserved bits for future applications concerning the SMPTE time code or clock synchronisation.
SMPTE Time Code/Clock Sync Info (STCSI) - 24 bit
[0066] Identifies the SMPTE time code (hh:mm:ss:frfr = 6:6:6:6 bit), or synchronisation information for the local clocks of each source and sink. The format of that synchronisation information is user-dependent. It appears that this kind of synchronisation has not been used before for Ambisonics and video synchronisation.
Packet Offset (PAO) - 64 bit
[0067] Within the current frame, the packet offset describes the distance in bytes from the first payload octet of the first data packet of the frame to the first payload octet of the current data packet. PAO(HIGH) represents the 32 MSBs and PAO(LOW) represents the 32 LSBs.
[0068] The above known and extended RTP header data are depicted in Fig. 5. PAO(LOW) is
followed by the Ambisonics payload data.
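One possible way to pack the payload header extension fields described above into 32-bit aligned words is sketched below. The exact bit positions are given by Fig. 5, which is not reproduced here, so this layout (PLFS+TCSF, AST+ADIM+XAH+STS+Rsvrd+STCSI, PAO high/low) is an assumption:

```python
import struct

def pack_ambisonics_extension(plfs, tcsf_hz, ast, adim, xah, sts, stcsi, pao):
    """Pack PLFS(2)+TCSF(30), AST(2)+ADIM(1)+XAH(1)+STS(1)+Rsvrd(3)+STCSI(24), PAO(64)."""
    word0 = ((plfs & 0x3) << 30) | (tcsf_hz & 0x3FFFFFFF)
    word1 = ((ast & 0x3) << 30) | ((adim & 0x1) << 29) | ((xah & 0x1) << 28) \
            | ((sts & 0x1) << 27) | (stcsi & 0xFFFFFF)   # the 3 reserved bits stay 0
    return struct.pack("!IIQ", word0, word1, pao & 0xFFFFFFFFFFFFFFFF)

# Frame-begin packet (PLFS = 10), 48 kHz sync clock, sound field source, 3D
ext = pack_ambisonics_extension(plfs=0b10, tcsf_hz=48000, ast=0b00,
                                adim=1, xah=0, sts=0, stcsi=0, pao=0)
assert len(ext) == 16
```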
Ambisonics payload layer
[0069] Ambisonics payload data and Ambisonics header data shall be fragmented such that the resulting RTP data packet is smaller than the 'path MTU' mentioned above. In case of 10GE transmission the path MTU is a 'jumbo frame' of e.g. 9000 bytes. There are two types of Ambisonics headers. A small individual Ambisonics header is sent in front of each data packet. A general header contains source and encoder related information that can be useful for the Ambisonics decoder. It contains information that is valid for all data packets within a frame, and for small frames and/or data packets it can be sent once at the beginning of a frame. Especially for real-time applications where the packet information changes frequently, it can be advantageous to send the general header with each data packet.
General Ambisonics header (only in the first data packet if XAH = 0)
Ambisonics Endianness (AEN): 1 bit
[0070] The endianness used for the transmitted Ambisonics data.
AEN code | Endianness
0 | Big Endian
1 | Little Endian
Ambisonics Header Length (AHL) - 8 bit
[0071] Identifies the length of the complete header in bytes.
Ambisonics Wave Type (AWT) - 1 bit
[0072] Traditionally, Ambisonics assumes that all audio sources and loudspeakers provide
plane waves for modelling the sound field. A typical example is the B-format. However,
an extended Ambisonics sound field description with higher quality requires also a
modelling with spherical waves. Therefore, the AWT field considers both possibilities.
AWT code | Wave type
0 | Plane wave
1 | Spherical wave
Ambisonics Order Type (AOT) - 2 bit
[0073] Identifies the sequence in which the Ambisonics coefficients are transmitted. Up to 4 order types can be addressed. The different formats depend on the order and indexing in Eq. (1), i.e. on how the spherical harmonics are ordered in a column of Ψ. The existing Ambisonics B-format uses a specific sequence of Ambisonics coefficients according to Table 1, wherein K to Z denote the known B-format channels. In case of 3D the coefficients are transmitted from top to bottom in Table 1. E.g. for degrees up to n = 2, the sequence will be WXYZRSTUV.
AOT code | Format
00 | B-Format order
01 | numerical upward
10 | numerical downward
11 | Reserved
Table 1
Degree n | Order m | Channel
0 | 0 | W
1 | 1 | X
1 | -1 | Y
1 | 0 | Z
2 | 0 | R
2 | 1 | S
2 | -1 | T
2 | 2 | U
2 | -2 | V
3 | 0 | K
3 | 1 | L
3 | -1 | M
3 | 2 | N
3 | -2 | O
3 | 3 | P
3 | -3 | Q
[0074] As an alternative, the sequence of each matrix column in Eq.(1) from top to bottom represents the numerical upward order type. The degree value always starts with 0 and runs up to the Ambisonics order N. For each degree n, the sequence starts with the lowest order -n and runs up to the order +n. The downward type uses the reversed order of m for each degree.
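The three order types of the AOT field can be sketched as follows, assuming the per-degree order range -n,...,+n noted above; the Table 1 channel letters are reproduced verbatim:

```python
# Table 1 channel sequence (B-format order), top to bottom, up to degree 3
B_FORMAT_SEQUENCE = ["W", "X", "Y", "Z", "R", "S", "T", "U", "V",
                     "K", "L", "M", "N", "O", "P", "Q"]

def numerical_upward(order_n):
    """(degree n, order m) pairs: n = 0..N, and for each n, m runs from -n to +n."""
    return [(n, m) for n in range(order_n + 1) for m in range(-n, n + 1)]

def numerical_downward(order_n):
    """Same degrees, but with the order m reversed for each degree."""
    return [(n, m) for n in range(order_n + 1) for m in range(n, -n - 1, -1)]

print(numerical_upward(1))    # [(0, 0), (1, -1), (1, 0), (1, 1)]
print(numerical_downward(1))  # [(0, 0), (1, 1), (1, 0), (1, -1)]
```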
Ambisonics Horizontal Order (AHO) - 8 bit
Ambisonics Vertical Order (AVO) - 8 bit
[0075] The Ambisonics order describes the quality of the Ambisonics encoding and decoding via Ψ. An order of up to 255 should be sufficient. According to the audio dimension, the order is distinguished into a horizontal and a vertical component. In case of 2D, only AHO has a value greater than '0'. A mixed order can have different AHO and AVO values.
RSVRD (Order) - 2x2 bit
[0076] For possible extension of order related issues, these reserved bits are considered
in front of AHO and AVO.
Ambisonics Normalisation/Scaling Format (ANSF) - 3 bit
[0077] Identifies different normalisation formats typically used for Ambisonics. The normalisation corresponds to the orthogonality relationship between the spherical harmonics Y_n^m and Y_n'^m'. Furthermore, there are additional normalisation principles, e.g. Furse-Malham. The Furse-Malham formulation facilitates a normalisation of the coefficients to maximum values of ±1, which yields an optimal dynamic range. In case of dedicated scaling, the scaling factors are fixed over one frame. The scaling factors are transmitted only once, in front of the Ambisonics coefficients.
ANSF code | Format
000 | Orthonormal
001 | Schmidt semi-normalised
010 | 4π normalised
011 | Unnormalised
100 | Furse-Malham
101 | Dedicated scaling
11x | Reserved
Radius Rref (RREF) - 16 bit
[0078] The reference radius Rref value of the loudspeakers, in mm, is required in case of spherical waves. The maximal radius depends on the acoustic wave length λ, which can be calculated from the audible frequencies f (fLOW = 20 Hz to fHI = 20 kHz) and the speed of sound c = 340 m/s. Thus for the radius Rref, values from 17000 mm down to 17 mm are required, and a word length of 16 bit is sufficient for that.
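The word-length argument for RREF can be checked numerically; c and the frequency limits are the values given in the text:

```python
c = 340.0                      # speed of sound in m/s
for f in (20.0, 20000.0):      # fLOW = 20 Hz, fHI = 20 kHz
    wavelength_mm = c / f * 1000.0
    print(f"f = {f:7.0f} Hz -> wavelength = {wavelength_mm:7.0f} mm")
# -> 17000 mm and 17 mm; both fit into a 16 bit field (maximum 65535 mm)
assert 17000 <= 2**16 - 1
```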
Ambisonics Sample Format (ASF) - 4 bit
[0079] This code defines the word length as well as the format (integer/floating point) of the transmitted Ambisonics coefficients A(k). The sample format enables an adaptation to different value ranges. In the following table, nine sample formats are predefined:
ASF code | Format
0000 | Unsigned integer 8 bit
0001 | Signed integer 8 bit
0010 | Signed integer 16 bit
0011 | Signed integer 24 bit
0100 | Signed integer 32 bit
0101 | Signed integer 64 bit
0110 | Float 32 bit (binary single prec.)
0111 | Float 64 bit (binary double prec.)
1000 | Float 128 bit (binary quad prec.)
1001-1111 | Reserved
Ambisonics Invalid Bits (AIB) - 5 bit
[0080] If ASF is specified as an integer format, the number AIB of invalid bits can mask
the lowest bits within the ASF integer. AIB is coded as 5 bit unsigned integer value,
so that up to 31 bits can be marked as invalid. Valid bits start at MSB. Note that
the word length of AIB is less than the ASF integer word length.
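The AIB masking rule can be sketched as follows; the sample value is an arbitrary example, and the 24-bit word length corresponds to ASF code 0011:

```python
def mask_invalid_bits(sample: int, word_length: int, aib: int) -> int:
    """Clear the 'aib' least significant bits of a 'word_length'-bit integer sample."""
    if not 0 <= aib < word_length:
        raise ValueError("AIB must be smaller than the ASF word length")
    mask = ((1 << word_length) - 1) & ~((1 << aib) - 1)
    return sample & mask

print(hex(mask_invalid_bits(0x123457, word_length=24, aib=2)))  # 0x123454
```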
Sample Rate (SR) - 32 bit
[0081] The rate at which the input data xi(k) are sampled. The value in Hz is coded as an unsigned integer.
Frame Size Mode (FSM) - 1 bit
[0082] If FSM is cleared, the following 31 bits for FS represent the frame size in bytes. If FSM is set, FS represents the total number of data packets in the actual frame.
Frame Size (FS) - 31 bit
[0083] The frame size number FS is to be interpreted in view of the FSM flag's value. Depending
on the application, the frame size can vary from frame to frame.
[0084] As mentioned above, a frame represents a unit of several data packets. It is assumed that for uncompressed data all packets except the last one have the same length. Then the frame size in bytes can be calculated as: #bytes per frame = (FS - 1) * packet size + last packet size.
[0085] Basic Ethernet applications normally use MTU sizes of 1500 bytes. Modern 10 Gigabit Ethernet applications consider larger MTUs (e.g. 'jumbo frames' with 9000 to 16000 bytes). To enable data sets larger than 2^32 bytes (4 GB), the frame size should be specified as a number of data packets. I.e., if a data packet contains 9000 bytes, the maximum frame size is about 19 Tbyte.
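The frame-size bookkeeping of FSM/FS and the resulting maximum frame size can be sketched as follows; the packet sizes are example values:

```python
def frame_size_bytes(fs, fsm, packet_size, last_packet_size=None):
    """Frame size in bytes: FS directly (FSM = 0) or from the packet count (FSM = 1)."""
    if fsm == 0:
        return fs
    last = packet_size if last_packet_size is None else last_packet_size
    return (fs - 1) * packet_size + last

print(frame_size_bytes(fs=1080, fsm=1, packet_size=9000, last_packet_size=4200))
# Maximum with a 31-bit packet count and 9000-byte jumbo frames: about 19 Tbyte
print(frame_size_bytes(fs=2**31 - 1, fsm=1, packet_size=9000) / 1e12, "Tbyte")
```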
[0086] The general Ambisonics data header in the Ambisonics payload data is depicted in
Fig. 6. A 'frame' can contain several equal-length packets, wherein the last packet
can have a different length that is described in the individual Ambisonics header.
Every packet may use such a header for describing, at the end of a frame, length values that differ from the prior packet lengths.
Individual Ambisonics header
Reserved (RSRVD) - 16 bit
[0087] The bits in front of APL are reserved. This enables an extension of the individual
header, e.g. by packet related flags, and a 32 bit alignment for the following Ambisonics
coefficients.
Ambisonics Packet Length (APL) - 16 bit
[0088] Defines the MTU length for each individual data packet in bytes. The maximum length
is 65535.
[0089] This individual Ambisonics header is depicted in Fig. 7. If applied, the two data
fields RSRVD and APL will follow data field FS in Fig. 6. APL contains the length
of the following Ethernet packet which contains payload data (Ambisonics components).
Ambisonics payload data
[0090] As mentioned above, the payload data type is defined in the data field PLFS (RTP Payload Frame Status), cf. Fig. 5. Following the individual Ambisonics header, and possibly the preceding general Ambisonics header, 'pure' Ambisonics data or 'pure' metadata can be arranged.
Ambisonics coefficients
[0091] Due to the time dependency of the input samples x(kT) = x(k) and of the directions and radii RS(t), it is important to perform the Ambisonics encoding and decoding with regard to the specific sample time kT, or even simpler at k.
[0092] However, when considering a protocol based transmission, the transmission processing
operates in a sequential manner, i.e. at each transmission clock step (which is totally
different from the sampling rate) only 32 or 64 bits of a data packet can be dealt
with. The number of considered Ambisonics samples in one data packet is related to
one concatenated sample time or to a group of concatenated sample times.
[0093] Normally, all Ambisonics coefficients have the same length across all data packets
in a frame. However, if the general Ambisonics header is inserted in a normal data
packet, the data parameters can be modified within a frame.
[0094] The following examples of payload data show different dimensions, orders, and Ambisonics coefficients based on the encoder/decoder Cases 2 to 4 of Fig. 3. The first index x of A(x, y) describes the sequence number for a specific order, whereas the second index y stands for the sample time k in a data packet.
Example 1: ADIM=1, AHO=AVO=3, ASF=2

Example 2: ADIM=1, AHO=AVO=2, ASF=3, AIB=2

Example 3: ADIM=1, AHO=AVO=2, ASF=4, AIB=7

Example 4: ADIM=1, AHO=AVO=1, ASF=4

Example 5: ADIM=1, AHO=AVO=1, ASF=7

Ambisonics metadata
[0095] If PLFS is set to the binary value '10' (decimal 2), metadata are transmitted instead of Ambisonics coefficients. Different metadata formats exist, of which some are considered below. Thus, in front of the concrete metadata content, a metadata type field defines the specific format, as depicted in Fig. 8. The first two data fields RSRVD and APL are like in Fig. 7.
Ambisonics Metadata Type (AMT) - 16 bit
[0096] The types SMPTE MXF and XML are pre-defined.
AMT code | Format
0x00 | SMPTE MXF
0x80 | XML
0x01-0x7F | Reserved
0x81-0xFF | Reserved
Rsrvd - 16 bit
[0097] Reserved bits for future applications concerning metadata.
[0098] This data field is followed by the specific metadata. If possible, the metadata descriptions should be kept simple in order to get only one metadata packet in the 'begin packet' of a frame. However, the packet length in bytes is the same as for Ambisonics coefficients. If the amount of metadata exceeds this packet length, the metadata has to be fragmented into several packets, which shall be inserted between packets with Ambisonics coefficients. If the metadata amount in bytes in one packet is less than the regular packet length, the remaining packet bytes are to be padded with '0' or stuffing bits.
[0099] For channel coding purposes the encapsulated CRC word at the end of each Ethernet
packet should be used.
[0100] At the production side as shown in Fig. 3, three of the different Cases are covered by the above-mentioned data structure, namely the three cases in which A(k) data are transmitted (in the remaining case, d(k) data are transmitted). The question is how to detect the Ambisonics encoding/decoding mode at reproduction or receiver side. The Case chosen at production side can be derived in parser 41 in Fig. 4 from the bit fields RREF and AWT. The following table shows the values for RREF and AWT and their meaning:
Mode | Payload data
2 | filtered A(k), RREF = 0, AWT = Spherical wave
3 | filtered A(k), RREF ≠ 0, AWT = Spherical wave
4 | filtered A(k), AWT = B-format/Plane wave; RREF is obsolete
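A receiver-side sketch of this mode decision could look as follows; the function name and the boolean wave-type argument are illustrative assumptions:

```python
def detect_encoding_case(rref_mm: int, spherical_wave: bool) -> int:
    """Return the encoder Case (2, 3 or 4) for transmitted A(k) payload data."""
    if not spherical_wave:
        return 4                 # plane wave / B-format; RREF is ignored
    return 3 if rref_mm != 0 else 2

assert detect_encoding_case(0, True) == 2
assert detect_encoding_case(1700, True) == 3
assert detect_encoding_case(0, False) == 4
```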
Regarding the specific structure in Figs. 3 and 4, the parser 41 of the Ambisonics decoder of Fig. 4 is shown in more detail in Fig. 9. For collecting corresponding data items from an Ambisonics data stream ADSTR, the parser can use registers REG and content addressable memories CAM. The content addressable memories CAM detect all protocol data which lead to a decision about how the received data are to be processed in the following steps or stages, and the registers REG store information about the length or the payload data. The parser evaluates the header data in a hierarchical manner and can be implemented in hardware or software, according to any real-time requirements.
Example:
[0101] Several audio signals are generated and transmitted as spherical waves SPW or plane waves PW, e.g. in the worldwide live broadcast of a concert in 3D format, wherein all receiving units are arranged in cinemas. In such a case the individual signals are to be transmitted separately so that a correct presentation can be facilitated. By a corresponding arrangement of the protocol (Ambisonics Wave Type AWT described above) the parser can distinguish these signals and supply two separate 'distance coding' units with the corresponding data items. The inventive Ambisonics decoder depicted in Fig. 4 can process all these signals, whereas in the prior art several decoders would be required. I.e., considering the Ambisonics wave type facilitates the advantages described above.
1. Method for generating sound field data including Ambisonics sound field data of an
order higher than three, said method including the steps:
- receiving s input signals x(k) from a microphone array (31) including m microphones, and/or from one or more virtual sound sources (32);
- multiplying (33) said input signals x(k) with a matrix Ψ, wherein the matrix elements Y_n^m(Ω_s) represent the spherical harmonics of all currently used directions Ω0,...,ΩS-1, index m denotes the order, index n denotes the degree of a spherical harmonic, N represents the Ambisonics order, n = 0,...,N, and m = -n,...,+n,
so as to get coefficients vector data d(k) representing coded directional information of N Ambisonics signals for every sample time instant k;
- processing said coefficients vector data d(k), value N and parameter Norm in one or two or more of the following four paths:
a) combining (340) said coefficients vector data d(k), said value N and said parameter Norm with radii data RS representing the distances of the sources of said input signals x(k);
b) based on spherical waves, array response filtering (341) said coefficients vector
data d(k) in dependency from said Ambisonics order N and radii Rm values, said radii Rm values representing individual microphone radii in a microphone array, so as to compensate
for non-linear frequency dependency, followed by normalising (351) for spherical waves
data, so as to provide filtered coefficients A(k), said parameter Norm and said order N value;
c) based on spherical waves, array response filtering (342) said coefficients vector
data d(k) in dependency from said Ambisonics order N, said radii Rm values and a radius Rref value, said radius Rref value representing a mean radius of loudspeakers arranged at decoder side, so as
to compensate for non-linear frequency dependency, followed by normalising (352) for
spherical waves data, so as to provide filtered coefficients A(k), said parameter Norm, said order N value, and said radius Rref value;
d) based on plane waves, array response filtering (343) said coefficients vector data
d(k) in dependency from said Ambisonics order N, said radii Rm values and a Plane Wave parameter, so as to compensate for non-linear frequency dependency,
followed by normalising (353) for plane waves data, so as to provide filtered coefficients
A(k), said parameter Norm, said order N value, and said Plane Wave parameter;
- in case a processing took place in two or more of said paths, multiplexing (36)
the corresponding data;
- output (361) of data frames (39) including said provided data and values.
2. Apparatus for generating sound field data including Ambisonics sound field data of
an order higher than three, said apparatus including:
- means (33) being adapted for multiplying S input signals x(k), which are received from a microphone array (31) including m microphones and/or from one or more virtual sound sources (32), with a matrix Ψ, wherein the matrix elements Y_n^m(Ω_s) represent the spherical harmonics of all currently used directions Ω0,...,ΩS-1, index m denotes the order, index n denotes the degree of a spherical harmonic, N represents the Ambisonics order, n = 0,...,N, and m = -n,...,+n, so as to get coefficients vector data d(k) representing coded directional information of N Ambisonics signals for every sample time instant k;
- means (340,341,351,342,352,343,353) being adapted for processing said coefficients
vector data d(k), value N and parameter Norm in one or two or more of the following four paths:
a) combining said coefficients vector data d(k), said value N and said parameter Norm with radii data RS representing the distances of the sources of said S input signals x(k);
b) based on spherical waves, array response filtering said coefficients vector data
d(k) in dependency from said Ambisonics order N and radii Rm values, said radii Rm values representing individual microphone radii in a microphone array, so as to compensate
for non-linear frequency dependency, followed by normalising for spherical waves data,
so as to provide filtered coefficients A(k), said parameter Norm and said order N value;
c) based on spherical waves, array response filtering said coefficients vector data
d(k) in dependency from said Ambisonics order N, said radii RM values and a radius Rref value, said radius Rref value representing a mean radius of loudspeakers arranged at decoder side, so as
to compensate for non-linear frequency dependency, followed by normalising for spherical
waves data, so as to provide filtered coefficients A(k), said parameter Norm, said order N value, and said radius Rref value;
d) based on plane waves, array response filtering said coefficients vector data d(k) in dependency from said Ambisonics order N, said radii RM values and a Plane Wave parameter, so as to compensate for non-linear frequency dependency,
followed by normalising for plane waves data, so as to provide filtered coefficients
A(k), said parameter Norm, said order N value, and said Plane Wave parameter;
- a multiplexer means (36) for multiplexing the corresponding data in case a processing
took place in two or more of said paths, which multiplexer means provide data frames
(39) including said provided data and values.
3. Method for decoding sound field data that were encoded according to claim 1 using
one or two or more of said paths, said method including the steps:
- parsing (41) the incoming encoded data, determining the type or types a) to d) of
said paths used for said encoding and providing the further data required for a decoding
according to the encoding path type or types;
- performing a corresponding decoding processing for one or two or more of the paths
a) to d):
a) based on spherical waves, filtering (42) the received coefficients vector data
d(k) in dependency from said radii data RS so as to provide filtered coefficients A(k),
and distance coding (431) said filtered coefficients A(k) in dependency from said order value N and for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A'(k) together with loudspeaker direction values Ωl, value N and parameter Norm;
b) based on spherical waves, distance coding (432) said filtered coefficients A(k) in dependency from said order value N and for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A'(k) together with loudspeaker direction values Ωl, value N and parameter Norm;
c) based on spherical waves, distance coding (433) said filtered coefficients A(k) in dependency from said order value N and said radius value Rref for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A'(k) together with loudspeaker direction values Ωl, value N and parameter Norm;
d) based on plane waves, providing said filtered coefficients A(k), order value N, parameter Norm and a flag for Plane Waves;
- in case a processing took place in two or more of said paths, multiplexing (44)
the corresponding data, wherein the selected (47) path or paths are determined based
on parameter Norm, order value N and said Plane Waves flag;
- decoding (45) said distance encoded filtered coefficients A'(k) or said filtered coefficients A(k), respectively, in dependency from said parameter Norm, said order value N and said loudspeaker direction values Ωl, so as to provide loudspeaker signals for a loudspeaker array (46).
4. Apparatus for decoding sound field data that were encoded according to claim 1 using
one or two or more of said paths, said apparatus including:
- means (41) being adapted for parsing the incoming encoded data, and for determining
the type or types a) to d) of said paths used for said encoding and for providing
the further data required for a decoding according to the encoding path type or types;
- means (42,431,432,433) being adapted for performing a corresponding decoding processing
for one or two or more of the paths a) to d) :
a) based on spherical waves, filtering the received coefficients vector data d(k) in dependency from said radii data RS so as to provide filtered coefficients A(k), and distance coding said filtered coefficients A(k) in dependency from said order value N and for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A'(k) together with loudspeaker direction values Ωl, value N and parameter Norm;
b) based on spherical waves, distance coding said filtered coefficients A(k) in dependency from said order value N and for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A'(k) together with loudspeaker direction values Ωl, value N and parameter Norm;
c) based on spherical waves, distance coding said filtered coefficients A(k) in dependency from said order value N and said radius value Rref for all radii Rl of loudspeakers to be used for a presentation of the decoded signals, and providing
the distance encoded filtered coefficients A'(k) together with loudspeaker direction values Ωl, value N and parameter Norm;
d) based on plane waves, providing said filtered coefficients A(k), order value N, parameter Norm and a flag for Plane Waves;
- multiplexing means (44) which, in case a processing took place in two or more of
said paths, select the corresponding data to be combined, based on parameter Norm, order value N and said Plane Waves flag;
- decoding means (45) which decode said distance encoded filtered coefficients A'(k) or said filtered coefficients A(k), respectively, in dependency from said parameter
Norm, said order value N and said loudspeaker direction values Ωl, so as to provide loudspeaker signals for a loudspeaker array (46).
5. Method according to claim 3, or apparatus according to claim 4, wherein said parser
(41) includes registers (REG) and content addressable memories (CAM) for collecting
data items from the decoder input data by evaluating header data in a hierarchical
manner, and wherein said content addressable memories (CAM) detect all protocol data
which will lead to a decision about how the received data are to be processed in the
decoding, and said registers (REG) store data item length information and/or information
about payload data.
6. Method according to claim 5, or apparatus according to claim 5, wherein said parser
(41) provides data for two or more individual audio signals by distinguishing Ambisonics
plane wave and spherical wave types (AWT).
7. Method according to one of claims 1, 3 and 5, or apparatus according to one of claims 2, 4 and 5, wherein said Ambisonics sound field data are transferred using Ethernet or internet or a protocol network.
8. Data structure for Ambisonics audio signal data which can be encoded according to
claim 1, said data structure including:
- a data field determining plane wave and spherical wave Ambisonics;
- a data field determining the Ambisonics order types B-Format order, numerical upward
order, numerical downward order;
- a data field determining the channel in dependency from the degree n and the order m;
- a data field determining horizontal or vertical order of the coefficients in the
Ambisonics matrix.