Technical Field
[0001] Example embodiments relate to audio processing, for example audio processing which
modifies respective perceived spatial positions of audio sources in a spatial audio
scene to counter movement of a user position.
Background
[0002] Spatial audio refers to audio which, when output to a user device such as a pair
of earphones, enables a user to perceive one or more audio sources as coming from
respective directions with respect to the user's position. For example, one audio
source may be perceived as coming from a position in front of the user whereas other
audio sources may be perceived as coming from positions to the left- and right-hand
sides of the user. In spatial audio, the user may perceive such audio sources as coming
from positions external to the user's position, in contrast to, for example, stereophonic
audio in which audio is effectively perceived within the user's head and where an
audio source may be panned between the ears or, in some cases, played back to one
ear only. Spatial audio may therefore provide a more life-like and immersive user
experience.
Summary
[0003] The scope of protection sought for various embodiments of the invention is set out
by the independent claims. The embodiments and features, if any, described in this
specification that do not fall under the scope of the independent claims are to be
interpreted as examples useful for understanding various embodiments of the invention.
[0004] According to a first aspect, there is described an apparatus comprising means for:
tracking a user movement in a real-world space, wherein the user consumes, via an
audio output device, data representing a spatial audio scene comprised of a plurality
of audio sources perceived from respective spatial positions with respect to a first
orientation of the user in the real-world space; identifying a first subset of the
audio sources having respective perceived spatial positions within a first region
of the real-world space defined with respect to the first orientation and a second
subset of the audio sources having respective perceived spatial positions outside
of the first region; and responsive to a tracked movement of the user position from
a first position to a second position within the real-world space, the movement being
within a predefined range of movement, modifying the respective perceived spatial
positions of the first subset of the audio sources in the spatial audio scene so as
to counter the movement of the user position without modifying the respective perceived
spatial positions of the second subset of the audio sources in the spatial audio scene.
[0005] It is to be noted that, in the context of the first subset and the second subset
of the audio sources, a subset may comprise one audio source.
[0006] The first orientation may correspond to the orientation of the user's head and the
first region may correspond to a region to the front of the user's head.
[0007] The first region may comprise a sector to the front of the user's head having a central
angle of less than 180 degrees. The central angle may be between 30 and 60 degrees.
[0008] The first position may correspond to the first orientation and the second position
may correspond to a second orientation, wherein the tracking means may be configured
to track angular movement between the first and second orientations and the modifying
means may be configured to modify the respective perceived spatial positions of the
first subset of the audio sources if the tracked angular movement is within a predefined
angular range of movement.
[0009] The predefined angular range may correspond with the central angle of the first
region sector.
[0010] The first and second positions may correspond to respective first and second spatial
positions of the real-world space, wherein the tracking means may be configured to
track translational movement between the first and second spatial positions and the
modifying means may be configured to modify the respective perceived spatial positions
of the first subset of the audio sources if the tracked translational movement is
within a predefined translational range of movement.
[0011] The modifying means may be further configured, responsive to the tracked movement
going beyond the predefined range of movement, to disable further modification of
the respective perceived spatial positions of the first subset of the audio sources.
[0012] The apparatus may further comprise means for: determining that, subsequent to movement
of the user position from the first position to the second position within the real-world
space, the tracked user movement over a predetermined time period is below a predetermined
movement amount; and updating, in response to said determination, the first region
of the real-world space such that it is defined with respect to the orientation of
the user at the second position.
[0013] The apparatus may further comprise means for updating the respective perceived spatial
positions of the first subset of the audio sources such that they are returned to
their previous respective spatial positions in the spatial audio scene with respect
to the second subset of the audio sources.
[0014] The apparatus may further comprise means for: identifying that the first subset of
audio sources, comprising a plurality of audio sources, are of interest to the user
based on one or more tracked movement characteristics; and responsive to said identification,
modifying the respective perceived spatial positions of the first subset of audio
sources such that they are perceived as more spatially spread apart, at least temporarily.
[0015] The identifying means may be configured to identify the one or more tracked movement
characteristics as movements that cycle between limits of the predefined range of
movements.
[0016] The amount of modification to spatially spread apart the respective perceived spatial
positions may be based on the distance of at least one of the first subset of the
audio sources from the position of the user.
[0017] Limits of the first region and/or the permissible range of movement may be dynamically
changeable based on the distance of at least one of the first subset of the audio
sources from the position of the user.
[0018] The data representing the spatial audio scene may be representative of a musical
performance.
[0019] The audio output device may comprise a set of earphones.
[0020] The modifying means may be configured to perform said modification in response to
detecting that the data representing the spatial audio scene is representative
of a particular type of spatial audio scene and/or has associated metadata indicative
that said modification is to be performed for the spatial audio scene.
[0021] According to a second aspect, there is described a method comprising: tracking a
user movement in a real-world space, wherein the user consumes, via an audio output
device, data representing a spatial audio scene comprised of a plurality of audio
sources perceived from respective spatial positions with respect to a first orientation
of the user in the real-world space; identifying a first subset of the audio sources
having respective perceived spatial positions within a first region of the real-world
space defined with respect to the first orientation and a second subset of the audio
sources having respective perceived spatial positions outside of the first region;
and responsive to a tracked movement of the user position from a first position to
a second position within the real-world space, the movement being within a predefined
range of movement, modifying the respective perceived spatial positions of the first
subset of the audio sources in the spatial audio scene so as to counter the movement
of the user position without modifying the respective perceived spatial positions
of the second subset of the audio sources in the spatial audio scene.
[0022] The first orientation may correspond to the orientation of the user's head and the
first region may correspond to a region to the front of the user's head.
[0023] The first region may comprise a sector to the front of the user's head having a central
angle of less than 180 degrees. The central angle may be between 30 and 60 degrees.
[0024] The first position may correspond to the first orientation and the second position
may correspond to a second orientation, wherein the tracking may comprise tracking
angular movement between the first and second orientations and the modifying may comprise
modifying the respective perceived spatial positions of the first subset of the audio
sources if the tracked angular movement is within a predefined angular range of movement.
[0025] The predefined angular range may correspond with the central angle of the first
region sector.
[0026] The first and second positions may correspond to respective first and second spatial
positions of the real-world space, wherein the tracking may comprise tracking translational
movement between the first and second spatial positions and the modifying may comprise
modifying the respective perceived spatial positions of the first subset of the audio
sources if the tracked translational movement is within a predefined translational
range of movement. The modifying may further comprise, responsive to the tracked movement
going beyond the predefined range of movement, disabling further modification of the
respective perceived spatial positions of the first subset of the audio sources.
[0027] The method may further comprise: determining that, subsequent to movement of the
user position from the first position to the second position within the real-world
space, the tracked user movement over a predetermined time period is below a predetermined
movement amount; and updating, in response to said determination, the first region
of the real-world space such that it is defined with respect to the orientation of
the user at the second position.
[0028] The method may further comprise updating the respective perceived spatial positions
of the first subset of the audio sources such that they are returned to their previous
respective spatial positions in the spatial audio scene with respect to the second
subset of the audio sources.
[0029] The method may further comprise: identifying that the first subset of audio sources,
comprising a plurality of audio sources, are of interest to the user based on one
or more tracked movement characteristics; and responsive to said identification, modifying
the respective perceived spatial positions of the first subset of audio sources such
that they are perceived as more spatially spread apart, at least temporarily.
[0030] The identified one or more tracked movement characteristics may be movements that
cycle between limits of the predefined range of movements.
[0031] The amount of modification to spatially spread apart the respective perceived spatial
positions may be based on the distance of at least one of the first subset of the
audio sources from the position of the user.
[0032] Limits of the first region and/or the permissible range of movement may be dynamically
changeable based on the distance of at least one of the first subset of the audio
sources from the position of the user.
[0033] The data representing the spatial audio scene may be representative of a musical
performance.
[0034] The audio output device may comprise a set of earphones.
[0035] Modifying may comprise performing said modification in response to detecting that
the data representing the spatial audio scene is representative of a particular
type of spatial audio scene and/or has associated metadata indicative that said modification
is to be performed for the spatial audio scene.
[0036] According to a third aspect, there is provided a computer program product comprising
a set of instructions which, when executed on an apparatus, is configured to cause
the apparatus to carry out the method of any preceding method definition.
[0037] According to a fourth aspect, there is provided a non-transitory computer readable
medium comprising program instructions stored thereon for performing a method, comprising:
tracking a user movement in a real-world space, wherein the user consumes, via an
audio output device, data representing a spatial audio scene comprised of a plurality
of audio sources perceived from respective spatial positions with respect to a first
orientation of the user in the real-world space; identifying a first subset of the
audio sources having respective perceived spatial positions within a first region
of the real-world space defined with respect to the first orientation and a second
subset of the audio sources having respective perceived spatial positions outside
of the first region; and responsive to a tracked movement of the user position from
a first position to a second position within the real-world space, the movement being
within a predefined range of movement, modifying the respective perceived spatial
positions of the first subset of the audio sources in the spatial audio scene so as
to counter the movement of the user position without modifying the respective perceived
spatial positions of the second subset of the audio sources in the spatial audio scene.
[0038] The program instructions of the fourth aspect may also perform operations according
to any preceding method definition of the second aspect.
[0039] According to a fifth aspect, there is provided an apparatus comprising: at least
one processor; and at least one memory including computer program code which, when
executed by the at least one processor, causes the apparatus to: track a user movement
in a real-world space, wherein the user consumes, via an audio output device, data
representing a spatial audio scene comprised of a plurality of audio sources perceived
from respective spatial positions with respect to a first orientation of the user
in the real-world space; identify a first subset of the audio sources having respective
perceived spatial positions within a first region of the real-world space defined
with respect to the first orientation and a second subset of the audio sources having
respective perceived spatial positions outside of the first region; and responsive
to a tracked movement of the user position from a first position to a second position
within the real-world space, the movement being within a predefined range of movement,
modify the respective perceived spatial positions of the first subset of the audio
sources in the spatial audio scene so as to counter the movement of the user position
without modifying the respective perceived spatial positions of the second subset
of the audio sources in the spatial audio scene.
[0040] The computer program code of the fifth aspect may also perform operations according
to any preceding method definition of the second aspect.
Brief Description of the Drawings
[0041] Example embodiments will now be described with reference to the accompanying drawings,
in which:
FIG. 1 is a block diagram of a system 100 which may be useful for understanding example
embodiments;
FIGs. 2A - 2C are front views of a user wearing the set of earphones which may be
useful for understanding example embodiments;
FIG. 3 is a flow diagram showing processing operations according to example embodiments;
FIGs. 4A - 4C are top-plan views of a user including an indication of a spatial audio
scene comprising a plurality of audio sources and how respective spatial positions
of the audio sources may be modified according to some example embodiments;
FIGs. 5A - 5B are top-plan views of the user in the FIGs. 4A - 4C spatial audio scene
for indicating a reset operation according to some example embodiments;
FIGs. 6A - 6C are top-plan views of the user in the FIGs. 4A - 4C spatial audio scene
for indicating an audio source spreading effect according to some example embodiments;
FIGs. 7A and 7B are top-plan views of the user in the FIGs. 4A - 4C spatial audio scene
for indicating another type of audio source spreading effect according to some example
embodiments;
FIGs. 8A - 8C are front views of a user for indicating how respective spatial positions
of the audio sources may be modified due to rotational movement according to some
example embodiments;
FIGs. 9A - 9C are front views of a user for indicating how respective spatial positions
of the audio sources may be modified due to translational movement according to some
example embodiments;
FIG. 10 is a schematic view of an apparatus in which example embodiments may be embodied;
and
FIG. 11 is a plan view of a non-transitory medium which may store computer-readable
code for causing an apparatus, such as the FIG. 10 apparatus, to perform operations
according to example embodiments.
Detailed Description
[0042] In the description and drawings, like reference numerals refer to like elements throughout.
[0043] Example embodiments relate to an apparatus, method and computer program for audio
processing, for example audio processing which modifies respective perceived spatial
positions of one or more audio sources in a spatial audio scene to counter movement
of a user position. More generally, example embodiments relate to audio processing
in the field of spatial audio in which data, which may be referred to as spatial audio
data, encodes a so-called spatial audio scene. When the spatial audio data is rendered
and output as audio to a user device such as a pair of earphones or similar, it enables
a user to perceive one or more audio sources as coming from respective positions,
e.g. directions, with respect to the user's position. In spatial audio, the user may
perceive such audio sources as coming from positions external to the user's position,
in contrast to stereophonic audio in which audio is effectively perceived within the
user's head. Spatial audio may therefore provide a more life-like and immersive user
experience.
[0044] Example formats for spatial audio data may include, but are not limited to, multi-channel
mixes such as 5.1 or 7.1+4, Ambisonics, parametric spatial audio (e.g., metadata-assisted
spatial audio (MASA)), object-based audio, or any combination thereof.
[0045] Spatial audio rendering may also be characterized in terms of so-called degrees-of-freedom
(DoF). For example, if a user's head rotation affects rendering and therefore output
of the spatial audio, this may be referred to as 3DoF audio rendering. If a change
in the user's spatial position also affects rendering, this may be referred to as
6DoF audio rendering. Sometimes, the term 3DoF+ rendering may be used to indicate
a limited effect of a user's change in spatial position, for example to account for
a limited amount of translational movement of the user's head when the user is otherwise
stationary. Taking into account this type of movement is known to improve, e.g., externalization
of spatial audio sources.
[0046] A user's position may be determined in real-time or near real-time, i.e. tracked,
using one or more known methods. For example, a user's position may correspond to
the position of the user's head. In this sense, the term "position" may refer to orientation,
i.e. a first orientation of the user's head is a first position and a second, different
orientation of the user's head is a second position. The term may also refer to spatial
position within a real-world space to account for translational movements of the user's
head which may or may not accompany translational movements of the user's body.
[0047] The position of the user's head may be determined using one or more known head-tracking
methods, such as by use of one or more cameras which identify facial features in real-time,
by use of inertial sensors (gyroscopes/accelerometers) within a head-worn device,
such as a set of earphones used to output the spatial audio data, or by use of satellite
positioning systems such as the Global Navigation Satellite System (GNSS), to give
but some examples of position determination means.
[0048] The spatial audio data may be output to a head-worn device such as a pair of earphones,
which term is intended to cover devices such as a pair of earbuds, on-ear or over-ear
headphones, and also speakers within a worn headset such as an extended reality (XR)
headset. In this context, an XR headset may incorporate one or more display screens
for presenting video data which may represent part of a virtual video scene. For example,
the video data when rendered may present one or more visual objects corresponding
to one or more audio sources in the audio scene.
[0049] In use, a user will be located in a real-world space which may be an indoor space,
for example a room or hall, or possibly an outdoor space. The real-world space is
to be distinguished over a virtual space that is output to the user device (such as
the above-mentioned head-worn device) and therefore perceived by the user.
[0050] In some example embodiments, a spatial audio scene is a form of virtual space comprising
one or more audio sources coming from respective positions with respect to a user's
current position in the real-world space. The spatial audio scene may, for example,
represent a musical performance and the one or more audio sources may represent different
performers and/or instruments that can be perceived from respective spatial positions
with respect to the user's current spatial position and orientation in the real-world
space. The spatial audio scene is not, however, limited to musical performances.
[0051] As a user moves within the real-world space, a so-called head-tracking effect may
be applied by processing of spatial audio data that represents the spatial audio scene.
Head tracking may involve processing and rendering spatial audio data such that respective
perceived spatial positions of audio sources are modified to counter user movements.
In other words, rather than the respective perceived positions of audio sources remaining
fixed in relation to the position of the user's head as it moves (as in the case of
stereoscopic audio) the positions are modified in a counter, or opposite, manner;
this gives the perception that the audio sources remain in their original position
even though the user has changed position, whether via rotation or translation.
[0052] For example, if the user rotates their head clockwise by α degrees, the spatial
audio data may be modified so that the respective perceived spatial positions of audio
sources are moved counter-clockwise, for example also by α degrees. Alternatively,
or additionally, if the user moves their head in translation, e.g. five centimetres
to the left, the spatial audio data may be modified so that the respective perceived
spatial positions of audio sources are moved to the right, for example also by five
centimetres. The amount of counter movement is not necessarily the same as the amount
of user movement.
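Purely for illustration, the following Python sketch shows one way such a counter movement might be computed for the rotational case; the function name, azimuth convention and gain parameter are illustrative assumptions rather than part of any embodiment:

```python
def counter_rotate(source_azimuths_deg, head_yaw_deg, gain=1.0):
    """Counter a tracked head rotation by rotating each perceived
    source azimuth the opposite way.

    Azimuths are in degrees, measured clockwise relative to the head;
    head_yaw_deg is the clockwise head rotation. gain=1.0 counters the
    rotation exactly; other values reflect that the counter movement
    need not equal the user movement.
    """
    return [(az - gain * head_yaw_deg) % 360.0 for az in source_azimuths_deg]


# A source straight ahead (0 degrees) appears at 340 degrees, i.e.
# 20 degrees counter-clockwise, after the head turns 20 degrees
# clockwise, so it is perceived as static in the real-world space.
print(counter_rotate([0.0, 90.0, 180.0, 270.0], 20.0))
```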
[0053] This may provide several benefits for the user experience. For example, the user
may be able to determine more precisely the direction of a given audio source in the
spatial audio scene. The user may better experience externalization and immersion.
[0054] However, in certain scenarios such as where the artistic or creative intent for the
spatial audio scene is for significant audio sources to be heard from a particular
position with respect to the user, e.g. a music performance where significant vocals
and/or instruments are to be heard generally to the front of the user, head tracking
effects may confuse the experience.
[0055] Example embodiments are directed at mitigating or avoiding such issues to provide
a more optimal user experience when consuming spatial audio.
[0056] FIG. 1 is a block diagram of a system 100 which may be useful for understanding example
embodiments.
[0057] The system 100 may comprise a server 110, a media player 120, a network 130 and a
set of earphones 140.
[0058] The server 110 may be connected to the media player 120 by means of the network 130
for sending data, e.g., spatial audio data, to the media player 120. The server 110
may for example send the data to the media player 120 responsive to one or more data
requests sent by the media player 120. For example, the media player 120 may transmit
to the server 110 an indication of a position associated with a user of the media
player 120, and the server may process and transmit back to the media player 120 spatial
audio data responsive to the received position, which may be in real-time or near
real-time. This may be by means of any suitable streaming data protocol. Alternatively,
or additionally, the server 110 may provide one or more files representing spatial
audio data to the media player 120 for storage and processing thereat. At the media
player 120, the spatial audio data may be processed, rendered and output to the set
of earphones 140. In example embodiments, the set of earphones 140 may comprise head
tracking sensors for indicating to the media player 120, using any suitable method,
a current position of the user, e.g., one or both of the orientation and spatial position
of the user's head, in order to determine how the spatial audio data is to be rendered
and output.
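By way of a non-limiting sketch, the position-driven exchange between the media player 120 and the server 110 might resemble the following Python fragment; the endpoint, payload fields and use of HTTP are illustrative assumptions, since the embodiments permit any suitable streaming data protocol:

```python
import json
import urllib.request


def request_spatial_audio(server_url, yaw_deg, position_m):
    """Send the user's tracked position to the server and receive
    spatial audio data processed for that position (illustrative
    request format only)."""
    payload = json.dumps({"yaw_deg": yaw_deg,
                          "position_m": list(position_m)}).encode("utf-8")
    request = urllib.request.Request(
        server_url, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return response.read()  # encoded spatial audio data for rendering
```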
[0059] In some embodiments, the media player 120 may comprise part of the set of earphones
140.
[0060] The network may be any suitable data communications network including, for example,
one or more of a radio access network (RAN) whereby communication is via one or more
base stations, a WiFi network whereby communication is via one or more access points,
or a short-range network such as one using the Bluetooth or Zigbee protocol.
[0061] FIGs. 2A - 2C are representational drawings of a user 210 wearing the set of earphones
140 which may also be useful for understanding example embodiments. Referring to FIG.
2A, the user 210 is shown listening to a rendered audio field comprised of first to
fourth audio sources (collectively indicated by reference numeral 220), e.g., corresponding
to distinct respective sounds labelled "1", "2", "3" and "4."
[0062] Referring to FIG. 2B, the respective perceived spatial positions of the first to
fourth audio sources 220 are indicated with respect to the user's head for the case
that the audio data is spatial audio data but with no head tracking effect implemented.
It will be seen, for example, that counter-clockwise rotation of the user's head results
in no modification of the spatial audio data because the respective perceived spatial
positions of the first to fourth audio sources 220 follow the user's movement.
[0063] Referring to FIG. 2C, the respective perceived spatial positions of the first to
fourth audio sources 220 are indicated with respect to the user's head for the case
that the audio data is spatial audio data and the above-mentioned head tracking effect
is implemented. It will be seen, in this case, that counter-clockwise rotation of
the user's head results in modification of the spatial audio data because the respective
perceived spatial positions of the first to fourth audio sources 220 are relatively
static in the audio field even though the user is moving.
[0064] FIG. 3 is a flow diagram showing processing operations, indicated generally by reference
numeral 300, according to example embodiments. The processing operations 300 may be
performed in hardware, software, firmware, or a combination thereof. For example,
the processing operations may be performed by the media player 120 shown in FIG. 1.
[0065] A first operation 302 may comprise tracking a user movement in a real-world space,
wherein the user consumes, via an audio output device, data representing a spatial
audio scene comprised of a plurality of audio sources perceived from respective spatial
positions with respect to a first orientation of the user in the real-world space.
[0066] A second operation 304 may comprise identifying a first subset of the audio sources
having respective perceived spatial positions within a first region of the real-world
space defined with respect to the first orientation and a second subset of the audio
sources having respective perceived spatial positions outside of the first region.
[0067] It is to be noted that, in the context of the first subset and the second subset
of the audio sources, a subset may comprise one audio source.
[0068] A third operation 306 may comprise, responsive to a tracked movement of the user
position from a first position to a second position within the real-world space, the
movement being within a predefined range of movement, modifying the respective perceived
spatial positions of the first subset of the audio sources in the spatial audio scene
so as to counter the movement of the user position without modifying the respective
perceived spatial positions of the second subset of the audio sources in the spatial
audio scene.
[0069] Example embodiments may therefore provide a form of partial head tracking in the
sense that, when a user position moves, e.g. from the first position to the second
position within the real-world space, the head tracking effect is applied to the first
subset of audio sources and not the second subset of audio sources, as will be explained
below with the help of visual examples.
[0070] For example, spatial audio data representing a musical performance with clear
left-right balance and separation, with important content in the centre or front, may
have this partial head tracking applied to audio sources to the front of the user
to allow a degree of immersion and localisation whilst maintaining artistic intent
and avoiding confusing effects.
[0071] In some example embodiments, the first orientation may correspond to the orientation
of the user's head and the first region may correspond to a region to the front of
the user's head.
[0072] For example, the first region may comprise a sector to the front of the user's head
having a central angle of less than 180 degrees. For example, the central angle may
be between 30 and 60 degrees but may vary depending on application.
[0073] In some example embodiments, the movement of the user position from the first position
to the second position may be an angular movement.
[0074] For example, the first position may correspond to the first orientation and the second
position may correspond to a second orientation. Angular movement may be tracked between
the first and second orientations and the respective perceived spatial positions of
the first subset of the audio sources may be modified if the tracked angular movement
is within a predefined angular range of movement.
[0075] The predefined angular range may correspond with the central angle of the first region
sector, e.g. both may be 30 degrees, but they need not be the same.
[0076] To illustrate, FIGs. 4A - 4C are plan views of a user 402 in real-world space together
with an indication of a spatial audio scene that the user perceives through a pair
of earphones 404. Referring to FIG. 4A, the spatial audio scene may comprise first
to fourth audio sources 410, 412, 414, 416 perceived from respective spatial positions
with respect to the user's shown orientation 406.
[0077] A first region is defined in the form of a sector 408 having in this example a central
angle of 30 degrees and this angle may also correspond with a predefined angular range
of movement through which partial head tracking may be performed. This equates to
15 degrees angular movement either side of the user's current orientation 406.
[0078] It will be seen that the first audio source 410 corresponds to the sector 408 and
hence head tracking may be performed in respect of this audio source and not performed
in respect of the second to fourth audio sources 412, 414, 416.
[0079] Referring to FIG. 4B, as the tracking operation tracks the user's orientation as
their head moves clockwise by 15 degrees, the perceived spatial position of the first
audio source 410 is modified so that it is static in the spatial audio scene whereas
the respective perceived spatial positions of the second to fourth audio sources 412,
414, 416, outside of the sector 408, are unmodified (in this sense) and move with
the user's change in orientation. The user is thus able to better localise the position
of the first audio source 410.
[0080] In the modifying operation, responsive to the tracked movement going beyond the predefined
range of movement, which in this example is 15 degrees to one side of the FIG. 4A
orientation 406, further modification of the first subset of the audio sources may
be disabled. To illustrate, and with reference to FIG. 4C, as the tracking operation
tracks a change in the user's orientation by a further 15 degrees, the perceived spatial
position of the first audio source 410 moves 15 degrees with the user's change in
orientation, as do those of the second to fourth audio sources 412, 414, 416 and hence
the spatial relationship between the first to fourth audio sources remains fixed.
The permitted 15-degree range of movement is indicated by region 420.
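A simplified Python sketch of the partial head tracking behaviour described with reference to FIGs. 4A - 4C is given below, assuming a 30-degree front sector and azimuths expressed in degrees relative to the first orientation; the names and data structures are illustrative only:

```python
def in_first_region(azimuth_deg, central_angle_deg=30.0):
    """True if a source azimuth (0 = straight ahead along the first
    orientation, positive clockwise) lies within the front sector."""
    half = central_angle_deg / 2.0
    wrapped = (azimuth_deg + 180.0) % 360.0 - 180.0  # map into [-180, 180)
    return -half <= wrapped <= half


def partial_head_tracking(sources, head_yaw_deg, central_angle_deg=30.0):
    """Return rendering azimuths relative to the current head orientation.

    Sources in the first region are countered so they appear static in
    the real-world space; once the tracked rotation exceeds the
    predefined range (here 15 degrees either side), the counter is
    frozen and no further modification is applied, as in FIG. 4C.
    'sources' maps source ids to azimuths relative to the first
    orientation.
    """
    half = central_angle_deg / 2.0
    counter = max(-half, min(half, head_yaw_deg))  # clamp to the range
    rendered = {}
    for source_id, azimuth in sources.items():
        if in_first_region(azimuth, central_angle_deg):
            rendered[source_id] = azimuth - counter   # first subset
        else:
            rendered[source_id] = azimuth             # second subset
    return rendered
```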
[0081] In some example embodiments, after tracked movement of the user position from the
first position to the second position, the tracked user movement may be below a predetermined
movement amount for a predetermined time period. For example, the user may remain
relatively still in the second position, or move by an amount within a predetermined
movement threshold, and this may be the case for greater than, say, 10 seconds.
[0082] In response to identifying such relative stability, the first region of the real-world
space may be updated such that it is defined with respect to the orientation of the
user at the second position. This may be termed a first region reset.
[0083] To illustrate, FIG. 5A corresponds with FIG. 4C described above. If the user 402
remains at the shown orientation 502 for a predetermined time period, or moves very
little from the shown orientation such that movement is within a predetermined movement
amount (e.g. less than 2 degrees) for said predetermined time period (e.g. 10 seconds),
a first region reset may be initiated. Referring to FIG. 5B, this may involve moving
the perceived spatial position of the first audio source 410 based on the second position,
in this case to become aligned with the shown orientation 502. The first region is
then updated such that, in this case, it becomes the new front sector 508 having a
central angle of 30 degrees. In general, the respective perceived spatial positions
of the first subset of the audio sources 410 may be returned to their previous respective
spatial positions in the spatial audio scene with respect to the second subset of
the audio sources, comprising the second to fourth audio sources 412, 414, 416 such
that the scene orientation corresponds to that shown in FIG. 4A, assuming the audio
source positions themselves have not dynamically changed. The new front sector
508 then defines the first subset of the audio sources and the second subset of the
audio sources for future partial head tracking operations. The dashed circle 510 indicates
the previous perceived spatial position of the first audio source 410.
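One possible, purely illustrative way to detect such stability and trigger a first region reset is sketched below in Python, using the example threshold of 2 degrees and hold time of 10 seconds; everything else is an assumption of the sketch:

```python
import time


class FirstRegionResetDetector:
    """Trigger a first region reset when tracked movement stays below
    a movement threshold for a hold period."""

    def __init__(self, movement_threshold_deg=2.0, hold_time_s=10.0):
        self.movement_threshold_deg = movement_threshold_deg
        self.hold_time_s = hold_time_s
        self._reference_yaw = None
        self._stable_since = None

    def update(self, head_yaw_deg, now_s=None):
        """Feed one tracked orientation sample; return True when the
        first region should be re-centred on the current orientation."""
        now_s = time.monotonic() if now_s is None else now_s
        moved = (self._reference_yaw is None or
                 abs(head_yaw_deg - self._reference_yaw) >
                 self.movement_threshold_deg)
        if moved:
            # Significant movement: restart the stability timer.
            self._reference_yaw = head_yaw_deg
            self._stable_since = now_s
            return False
        return (now_s - self._stable_since) >= self.hold_time_s
```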
[0084] In some example embodiments, further operations may comprise identifying that the
first subset of audio sources, when comprising a plurality of audio sources, are of
interest to the user based on one or more tracked movement characteristics, and responsive
to said identification, modifying the respective perceived spatial positions of the
first subset of audio sources such that they are perceived as more spatially spread
apart. In this way, the user may perceive the audio sources better because they are
spatially spread apart.
[0085] In some example embodiments, identifying the first subset of the audio sources as
being of interest to the user may comprise identifying one or more tracked movement
characteristics as movements that cycle between limits of the predefined range of
movement.
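A heuristic Python sketch of such identification is given below; the margin and cycle-count parameters are illustrative assumptions rather than values taken from the embodiments:

```python
def shows_cyclical_interest(yaw_samples_deg, half_range_deg=15.0,
                            margin_deg=3.0, min_cycles=2):
    """Flag the first subset as being of interest when the tracked yaw
    alternates between the two limits of the predefined range of
    movement at least min_cycles times."""
    touches = []
    for yaw in yaw_samples_deg:
        if yaw >= half_range_deg - margin_deg:
            side = +1            # near the clockwise limit
        elif yaw <= -half_range_deg + margin_deg:
            side = -1            # near the counter-clockwise limit
        else:
            continue
        if not touches or touches[-1] != side:
            touches.append(side)
    swings = max(0, len(touches) - 1)  # limit-to-opposite-limit swings
    return swings // 2 >= min_cycles   # two swings make one full cycle
```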
[0086] To illustrate, FIGs. 6A - 6C are plan views of a user 602 in a real-world space together
with an indication of a spatial audio scene that the user perceives through a pair
of earphones 604. Referring to FIG. 6A, the spatial audio scene may comprise first
to sixth audio sources 610, 612, 614, 616, 618, 620 perceived from respective spatial
positions with respect to the user's shown orientation 606.
[0087] A first region is defined in the form of a sector 608 having a central angle of 30
degrees, as before, and this may also correspond with a predefined angular range of
movement through which partial head tracking may be performed. This may equate to
15 degrees angular movement either side of the user's current orientation.
[0088] It will be seen that the first audio source 610 and the second audio source 612 correspond
to the sector 608 and hence head tracking may be performed in respect of these audio
sources and not performed in respect of the third to sixth audio sources 614, 616,
618, 620.
[0089] Referring to FIG. 6B, it will be seen that the user 602 may change orientation in
a cyclical manner, i.e. first counter-clockwise and then clockwise, within, and possibly
slightly beyond, the predefined range of movement.
[0090] Referring then to FIG. 6C, this may trigger the first audio source 610 and the second
audio source 612 to be spatially separated so that they may be perceived better by
the user 602. This separation may occur temporarily, e.g. for a predetermined time
period. This separation may also allow the predetermined range of movements over which
head tracking may be performed to be relaxed, at least temporarily, and in respect
of at least one of the first audio source 610 and the second audio source 612.
[0091] In some embodiments, the amount of modification applied in order to spatially spread
apart the respective perceived spatial positions of the first sector audio sources
may be based on the distance of at least one of the audio sources 610, 612 from the
position of the user 602.
[0092] For example, FIG. 7A shows a partial plan view of a user 702 in real-world space
together with an indication of part of a spatial audio scene, perceived through a
pair of earphones 704. The spatial audio scene may include first, second and third
audio sources 710, 712, 714 within the first sector 708. Referring to FIG. 7B, responsive
to identifying that the first to third audio sources 710, 712, 714 are of interest,
it will be seen that the first audio source 710, which is the closest to the user
702, remains relatively static, whereas the second and third audio sources 712, 714
are spread apart from the first audio source by different amounts based on their respective
distance from the user 702.
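For illustration only, a distance-weighted spreading of the kind shown in FIG. 7B might be sketched in Python as follows, where the degrees-per-metre gain is an assumed constant and azimuths are taken relative to the sector centre:

```python
def spread_first_subset(front_sources, degrees_per_metre=5.0):
    """Spread apart the perceived azimuths of the first-subset sources,
    moving more distant sources by a larger angle so that the closest
    source remains roughly static, as in FIG. 7B.

    'front_sources' is a list of (azimuth_deg, distance_m) pairs."""
    if not front_sources:
        return []
    nearest_m = min(distance for _, distance in front_sources)
    spread = []
    for azimuth, distance in front_sources:
        # Push each source away from the sector centre by an amount
        # growing with its extra distance beyond the nearest source.
        direction = 1.0 if azimuth >= 0.0 else -1.0
        offset = degrees_per_metre * (distance - nearest_m)
        spread.append((azimuth + direction * offset, distance))
    return spread
```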
[0093] In some embodiments, identification that the first subset of audio sources are of
interest to the user may be an alternative or additional trigger to perform a first
region reset as described previously with reference to FIGs. 5A and 5B.
[0094] In some embodiments, the limits of the first region and the predefined range of movement,
may be fixed. In some embodiments, the limits of one or both may change, by user input
and/or adaptively.
[0095] For example, a device implementing example embodiments may analyse a complexity and/or
spatial closeness of the first subset of audio sources and the second subset of audio
sources and then update the limits (and therefore the size) of the first region and/or
the predefined range of movements based on such analysis.
[0096] Additionally, or alternatively, such limits may be based on values that are part
of a description associated with the spatial audio data, e.g., in metadata. This may
be based, in the case of a musical performance, on an artist recommendation or service
preference settings. The metadata can be time-varying metadata, wherein the limits
may vary during the course of the music performance's runtime. The metadata may comprise
context-dependent modifications. For example, a detected user environment or activity
can be taken into account when determining the limits to be used. For example, if
a user is considered to be focusing on certain audio sources, more freedom for head-tracking
may be allowed by widening the first region and/or the predefined range of movements.
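As a non-limiting illustration, time-varying limits carried in metadata might be looked up as sketched below; the timeline format is an assumption of the sketch, as the embodiments do not prescribe any particular metadata syntax:

```python
def limits_at(metadata_timeline, t_s, default=(30.0, 30.0)):
    """Look up the first-region central angle and predefined range of
    movement (both in degrees) applying at playback time t_s.

    'metadata_timeline' is assumed to be a list of
    (start_time_s, central_angle_deg, range_deg) entries sorted by
    start time."""
    central, range_deg = default
    for start, c, r in metadata_timeline:
        if start <= t_s:
            central, range_deg = c, r
        else:
            break
    return central, range_deg


# Example: widen the limits for a passage between 60 s and 90 s.
timeline = [(0.0, 30.0, 30.0), (60.0, 60.0, 60.0), (90.0, 30.0, 30.0)]
assert limits_at(timeline, 75.0) == (60.0, 60.0)
```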
[0097] User preferences may also be taken into account. For example, a user may select whether
to use, at a given time, no head tracking, full head tracking, or partial head tracking
as described herein.
[0098] Example embodiments have so far focussed on 3DoF head tracking. Example embodiments
may be extended to account for 3DoF+ in which some amount of translational movement
may be additionally or alternatively tracked.
[0099] To generalise, the first and second positions referred to with regard to the operations
in FIG. 3 may correspond to respective first and second spatial positions of the real-world
space. Translational movement may be tracked between the first and second spatial
positions and the respective perceived spatial positions of the first subset of the
audio sources may be modified if the tracked translational movement is within a predefined
translational range of movement. For example, the predefined translational range of
movement may be up to 10 centimetres but it may be greater.
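A corresponding Python sketch for the translational case, applied only to the positions of the first subset, is given below; the two-dimensional position convention and the 10-centimetre per-axis clamp are illustrative assumptions:

```python
def counter_translate(source_positions_m, head_offset_m, max_offset_m=0.10):
    """Counter a tracked head translation for the first subset by
    shifting perceived source positions the opposite way, freezing the
    counter once the predefined translational range is exceeded.

    Positions are (x, y) metres relative to the first position;
    head_offset_m is the head's (dx, dy) displacement."""
    dx, dy = head_offset_m
    # Clamp each axis to the predefined translational range.
    cx = max(-max_offset_m, min(max_offset_m, dx))
    cy = max(-max_offset_m, min(max_offset_m, dy))
    # Returned positions are relative to the moved head, so countered
    # sources appear static in the real-world space.
    return [(x - cx, y - cy) for (x, y) in source_positions_m]
```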
[0100] In such a case, it will also be appreciated that adaptation of the limits of the
first region and/or the predefined range of movement may be based on the distance
of at least one of the first subset of the audio sources from the position of the
user. Distance information may form part of the metadata within spatial audio data,
i.e. indicating respective distances or depths associated with audio sources in the
audio scene.
[0101] FIGs. 8A - 8C show front views of a user 802 when consuming spatial audio using a
pair of earphones 804, in which the represented audio scene comprises first to fourth
audio sources 810, 812, 814, 816.
[0102] More specifically, FIG. 8A shows the user 802 in a first orientation and it may be
assumed that, in accordance with example embodiments, the first and fourth audio sources
810, 816 correspond to a first sector region (within the meaning of the FIG. 3 operations)
and therefore comprise the first subset of the audio sources. The second and third
audio sources 812, 814 comprise the second subset of the audio sources. As shown in
FIG. 8B, with clockwise movement of the user 802 within a predetermined range of movement,
e.g. up to 15 degrees to one side, the first and fourth audio sources 810, 816 are
perceived as static, because head tracking is applied, whereas the second and third
audio sources 812, 814 move with the user. Referring to FIG. 8C, the same happens
for counter-clockwise movement of the user 802. The user 802 is better able to localise
the first and fourth audio sources 810, 816.
[0103] For completeness, a translational movement case will now be shown with reference
to FIGs. 9A - 9C. FIG. 9A is the same as FIG. 8A and it may be assumed that, in accordance
with example embodiments, the first and fourth audio sources 810, 816 correspond to
a first sector region (within the meaning of the FIG. 3 operations) and therefore
comprise the first subset of the audio sources. The second and third audio sources
812, 814 comprise the second subset of the audio sources. As shown in FIG. 9B, with
rightwards translational movement of the user 802 within a predetermined range of
movement, e.g. up to 10 centimetres to one side, the first and fourth audio sources
810, 816 are perceived as static, whereas the second and third audio sources 812,
814 move rightwards with the user. Referring to FIG. 9C, the same happens for leftwards
translational movement of the user 802. The user 802 is again better able to localise
the first and fourth audio sources 810, 816. In FIGs. 8A - 8C and 9A - 9C,
the dashed lines indicate the respective spatial positions of audio sources that would
result without head tracking.
[0104] In some example embodiments, such partial head tracking may be performed in response
to detecting one or more predetermined conditions associated with spatial audio data.
For example, spatial audio data may be processed and rendered by default with no head
tracking effect or a full head tracking effect unless the one or more predetermined
conditions are met, in which case a partial head tracking effect is performed. Examples
of the one or more predetermined conditions include, but are not limited to, detecting
that the spatial audio scene (a) is representative of a particular type of spatial
audio scene or content, e.g. a musical performance, (b) is received from a particular
source of data, e.g. a music streaming service, and/or (c) has associated metadata
indicative that partial head tracking is to be performed.
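Purely by way of example, selection between full and partial head tracking based on such conditions might be sketched as follows; the condition keys and identifiers are illustrative assumptions, not a prescribed interface:

```python
def select_tracking_mode(content_type, data_source_id, metadata):
    """Choose the head-tracking mode for a spatial audio stream based
    on the example conditions (a)-(c) above."""
    if metadata.get("partial_head_tracking", False):
        return "partial"                        # condition (c): metadata flag
    if content_type == "musical_performance":   # condition (a): content type
        return "partial"
    if data_source_id == "music_streaming_service":  # condition (b): source
        return "partial"
    return "full"                               # assumed default behaviour
```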
Example Apparatus
[0105] FIG. 10 shows an apparatus according to some example embodiments. The apparatus may
be configured to perform the operations described herein, for example operations described
with reference to any disclosed process. The apparatus comprises at least one processor
1000 and at least one memory 1001 directly or closely connected to the processor.
The memory 1001 includes at least one random access memory (RAM) 1001a and at least
one read-only memory (ROM) 1001b. Computer program code (software) 1005 is stored
in the ROM 1001b. The apparatus may be connected to a transmitter (TX) and a receiver
(RX). The apparatus may, optionally, be connected with a user interface (UI) for instructing
the apparatus and/or for outputting data. The at least one processor 1000, with the
at least one memory 1001 and the computer program code 1005, is arranged to cause
the apparatus at least to perform the method according to any preceding process, for
example as disclosed in relation to the flow diagram of FIG. 3.
[0106] FIG. 11 shows a non-transitory media 1100 according to some embodiments. The non-transitory
media 1100 is a computer readable storage medium. It may be e.g. a CD, a DVD, a USB
stick, a Blu-ray disk, etc. The non-transitory media 1100 stores computer program
code, causing an apparatus to perform the method of any preceding process, for example
as disclosed in relation to the flow diagram of FIG. 3, and related features thereof.
[0107] Names of network elements, protocols, and methods are based on current standards.
In other versions or other technologies, the names of these network elements and/or
protocols and/ or methods may be different, as long as they provide a corresponding
functionality. For example, embodiments may be deployed in 2G/3G/4G/5G networks and
further generations of 3GPP but also in non-3GPP radio networks such as WiFi.
[0108] A memory may be volatile or non-volatile. It may be, e.g., a RAM, an SRAM, a flash
memory, an FPGA block RAM, a DVD, a CD, a USB stick, or a Blu-ray disk.
[0109] If not otherwise stated or otherwise made clear from the context, the statement that
two entities are different means that they perform different functions. It does not
necessarily mean that they are based on different hardware. That is, each of the entities
described in the present description may be based on a different hardware, or some
or all of the entities may be based on the same hardware. It does not necessarily
mean that they are based on different software. That is, each of the entities described
in the present description may be based on different software, or some or all of the
entities may be based on the same software. Each of the entities described in the
present description may be embodied in the cloud. Implementations of any of the above
described blocks, apparatuses, systems, techniques or methods include, as non-limiting
examples, implementations as hardware, software, firmware, special purpose circuits
or logic, general purpose hardware or controller or other computing devices, or some
combination thereof. Some embodiments may be implemented in the cloud.
[0110] It is to be understood that what is described above is what is presently considered
the preferred embodiments. However, it should be noted that the description of the
preferred embodiments is given by way of example only and that various modifications
may be made without departing from the scope as defined by the appended claims.
1. An apparatus comprising means for:
tracking a user movement in a real-world space, wherein the user consumes, via an
audio output device, data representing a spatial audio scene comprised of a plurality
of audio sources perceived from respective spatial positions with respect to a first
orientation of the user in the real-world space;
identifying a first subset of the audio sources having respective perceived spatial
positions within a first region of the real-world space defined with respect to the
first orientation and a second subset of the audio sources having respective perceived
spatial positions outside of the first region; and
responsive to a tracked movement of the user position from a first position to a second
position within the real-world space, the movement being within a predefined range
of movement, modifying the respective perceived spatial positions of the first subset
of the audio sources in the spatial audio scene so as to counter the movement of the
user position without modifying the respective perceived spatial positions of the
second subset of the audio sources in the spatial audio scene.
2. The apparatus of claim 1, wherein the first orientation corresponds to the orientation
of the user's head and the first region corresponds to a region to the front of the
user's head.
3. The apparatus of claim 2, wherein the first region comprises a sector to the front
of the user's head having a central angle of less than 180 degrees.
4. The apparatus of claim 3, wherein the central angle is between 30 and 60 degrees.
5. The apparatus of any preceding claim, wherein the first position corresponds to the
first orientation and the second position corresponds to a second orientation, wherein
the tracking means is configured to track angular movement between the first and second
orientations and the modifying means is configured to modify the respective perceived
spatial positions of the first subset of the audio sources if the tracked angular
movement is within a predefined angular range of movement.
6. The apparatus of claim 5 when dependent on claim 3 or claim 4, wherein the predefined
angular range corresponds with the central angle of the first region sector.
7. The apparatus of any of claims 1 to 4, wherein the first and second positions correspond
to respective first and second spatial positions of the real-world space, wherein
the tracking means is configured to track translational movement between the first
and second spatial positions and the modifying means is configured to modify the respective
perceived spatial positions of the first subset of the audio sources if the tracked
translational movement is within a predefined translational range of movement.
8. The apparatus of any preceding claim, wherein the modifying means is further configured,
responsive to the tracked movement going beyond the predefined range of movement,
to disable further modification of the respective perceived spatial positions of the
first subset of the audio sources.
9. The apparatus of any preceding claim, further comprising means for:
determining that, subsequent to movement of the user position from the first position
to the second position within the real-world space, the tracked user movement over
a predetermined time period is below a predetermined movement amount; and
updating, in response to said determination, the first region of the real-world space
such that it is defined with respect to the orientation of the user at the second
position.
10. The apparatus of claim 9, further comprising means for updating the respective perceived
spatial positions of the first subset of the audio sources such that they are returned
to their previous respective spatial positions in the spatial audio scene with respect
to the second subset of the audio sources.
11. The apparatus of any preceding claim, further comprising means for:
identifying that the first subset of audio sources, comprising a plurality of audio
sources, are of interest to the user based on one or more tracked movement characteristics;
and
responsive to said identification, modifying the respective perceived spatial positions
of the first subset of audio sources such that they are perceived as more spatially
spread apart, at least temporarily.
12. The apparatus of any preceding claim, wherein limits of the first region and/or the
permissible range of movement are dynamically changeable based on the distance of
at least one of the first subset of the audio sources from the position of the user.
13. The apparatus of any preceding claim, wherein the audio output device comprises a
set of earphones.
14. The apparatus of any preceding claim, wherein the modifying means is configured to
perform said modification in response to detecting that the data representing the
spatial audio data scene is representative of a particular type of spatial audio scene
and/or has associated metadata indicative that said modification is to be performed
for the spatial audio scene.
15. A method, the method comprising:
tracking a user movement in a real-world space, wherein the user consumes, via an
audio output device, data representing a spatial audio scene comprised of a plurality
of audio sources perceived from respective spatial positions with respect to a first
orientation of the user in the real-world space;
identifying a first subset of the audio sources having respective perceived spatial
positions within a first region of the real-world space defined with respect to the
first orientation and a second subset of the audio sources having respective perceived
spatial positions outside of the first region; and
responsive to a tracked movement of the user position from a first position to a second
position within the real-world space, the movement being within a predefined range
of movement, modifying the respective perceived spatial positions of the first subset
of the audio sources in the spatial audio scene so as to counter the movement of the
user position without modifying the respective perceived spatial positions of the
second subset of the audio sources in the spatial audio scene.