(19)
(11) EP 3 595 337 A1

(12) EUROPEAN PATENT APPLICATION

(43) Date of publication:
15.01.2020 Bulletin 2020/03

(21) Application number: 18182376.6

(22) Date of filing: 09.07.2018
(51) International Patent Classification (IPC): 
H04S 7/00 (2006.01)
H04R 27/00 (2006.01)
H04S 1/00 (2006.01)
(84) Designated Contracting States:
AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
Designated Extension States:
BA ME
Designated Validation States:
KH MA MD TN

(71) Applicant: Koninklijke Philips N.V.
5656 AE Eindhoven (NL)

(72) Inventor:
  • KOPPENS, Jeroen Gerardus Henricus
    5656 AE Eindhoven (NL)

(74) Representative: de Haan, Poul Erik et al
Philips International B.V.
Philips Intellectual Property & Standards
High Tech Campus 5
5656 AE Eindhoven (NL)

   


(54) AUDIO APPARATUS AND METHOD OF AUDIO PROCESSING


(57) An audio apparatus, e.g. for a virtual reality client, comprises a receiver (201) for receiving a set of input audio signals from a first source; a receiver (203) for receiving acoustic environment data from a second source; and a receiver (205) for receiving binaural transfer function data from a third source where the binaural transfer function data is indicative of a set of binaural transfer functions. A renderer (207) generates output audio signals from the set of input audio signals. A reverberator generates a reverberation component of the output audio signals by applying reverberation processing to the set of input audio signals in response to the acoustic environment data, and an adapter (211) adapts a first property of the reverberation processing in response to a second property of the set of binaural transfer functions. The acoustic environment data and binaural transfer function data may be received from different sources.




Description

FIELD OF THE INVENTION



[0001] The invention relates to an audio apparatus and method of audio processing, and in particular, but not exclusively, to the use of such an apparatus and method to support an Augmented/ Virtual Reality conference application.

BACKGROUND OF THE INVENTION



[0002] The variety and range of experiences based on audiovisual content have increased substantially in recent years with new services and ways of utilizing and consuming such content continuously being developed and introduced. In particular, many spatial and interactive services, applications and experiences are being developed to give users a more involved and immersive experience.

[0003] Examples of such applications are Virtual Reality (VR) and Augmented Reality (AR) applications which are rapidly becoming mainstream, with a number of solutions being aimed at the consumer market. A number of standards are also under development by a number of standardization bodies. Such standardization activities are actively developing standards for the various aspects of VR/AR systems including e.g. streaming, broadcasting, rendering, etc.

[0004] VR applications tend to provide user experiences corresponding to the user being in a different world/ environment/ scene whereas AR (including Mixed Reality MR) applications tend to provide user experiences corresponding to the user being in the current environment but with additional information or virtual objects being added. Thus, VR applications tend to provide a fully immersive synthetically generated world/ scene whereas AR applications tend to provide a partially synthetic world/ scene which is overlaid on the real scene in which the user is physically present. However, the terms are often used interchangeably and have a high degree of overlap. In the following, the term Virtual Reality/ VR will be used to denote both Virtual Reality and Augmented Reality.

[0005] As an example, a service being increasingly popular is the provision of images and audio in such a way that a user is able to actively and dynamically interact with the system to change parameters of the rendering such that this will adapt to movement and changes in the user's position and orientation. A very appealing feature in many applications is the ability to change the effective viewing position and viewing direction of the viewer, such as for example allowing the viewer to move and "look around" in the scene being presented.

[0006] Such a feature can specifically allow a virtual reality experience to be provided to a user. This may allow the user to (relatively) freely move about in a virtual environment and dynamically change his position and where he is looking. Typically, such virtual reality applications are based on a three-dimensional model of the scene with the model being dynamically evaluated to provide the specific requested view. This approach is well known from e.g. game applications, such as in the category of first person shooters, for computers and consoles.

[0007] It is also desirable, in particular for virtual reality applications, that the image being presented is a three-dimensional image. Indeed, in order to optimize immersion of the viewer, it is typically preferred for the user to experience the presented scene as a three-dimensional scene. Indeed, a virtual reality experience should preferably allow a user to select his/her own position, camera viewpoint, and moment in time relative to a virtual world.

[0008] Typically, virtual reality applications are inherently limited in being based on a predetermined model of the scene, and typically on an artificial model of a virtual world. In some applications, a virtual reality experience may be provided based on real-world capture. In many cases such an approach tends to be based on a virtual model of the real-world being built from the real-world captures. The virtual reality experience is then generated by evaluating this model.

[0009] Many current approaches tend to be suboptimal, often having a high computational or communication resource requirement and/or providing a suboptimal user experience with e.g. reduced quality or restricted freedom.

[0010] As an example of an application, virtual reality glasses have entered the market which allow viewers to experience captured 360° (panoramic) or 180° video. These 360° videos are often pre-captured using camera rigs where individual images are stitched together into a single spherical mapping. Common stereo formats for 180° or 360° video are top/bottom and left/right. Similar to non-panoramic stereo video, the left-eye and right-eye pictures are compressed, e.g. as part of a single H.264 video stream.

[0011] In addition to the visual rendering, most VR/AR applications further provide a corresponding audio experience. In many applications, the audio preferably provides a spatial audio experience where audio sources are perceived to arrive from positions that correspond to the positions of the corresponding objects in the visual scene. Thus, the audio and video scenes are preferably perceived to be consistent and with both providing a full spatial experience.

[0012] For audio, the focus has until now mostly been on headphone reproduction using binaural audio rendering technology. In many scenarios, headphone reproduction enables a highly immersive, personalized experience to the user. Using headtracking, the rendering can be made responsive to the user's head movements, which highly increases the sense of immersion.

[0013] Recently, both in the market and in standards discussions, use cases are starting to be proposed that involve a "social" or "shared" aspect of VR (and AR), i.e. the possibility to share an experience together with other people. These can be people at different locations, but also people in the same location (or a combination of both). For example, several people in the same room may share the same VR experience with a projection (audio and video) of each participant being present in the VR content/ scene. For example, in a game where multiple people participate, each player may have a different location in the game-scene and consequently a different projection of the audio and video scene.

[0014] As a specific example, MPEG attempts to standardize a bit stream and decoder for realistic, immersive AR/VR experiences with six degrees of freedom. Social VR is an important feature and allows users to interact in a shared environment (gaming, conference calls, online shopping, etc.). The concept of social VR also facilitates making a VR experience a more social activity for users physically in the same location but where e.g. a head mounted display or other VR headset provides a perceptional isolation from the physical surroundings.

[0015] Audio rendering in VR applications is a complex problem. It is typically desired to provide an audio experience which is as natural as possible, but this is particularly difficult in a dynamic VR application with a high degree of freedom. The desired perceived audio does not merely depend on the virtual sound sources or their positions; in order to get a realistic experience it is also desired that the perceived audio reflects the audio characteristics of the virtual environment. For example, the audio should sound different when the audio source and virtual user are in a tiled bathroom than e.g. when in a sitting room with furniture, carpet, curtains etc. attenuating reflections. When the simulation of an acoustic environment does not match the user's (subconscious) expectations from the visual information, the realism of the acoustic rendering will quickly degrade and may lead to a perception of increased reverberation and/or a lack of externalization.

[0016] In addition, it is desired for the audio experience to be adapted to the individual user. Indeed, as different people have different physiognomic characteristics, the same audio source will tend to be perceived differently. For example, the pinnae are unique and the resulting impact on incoming sound will vary for different people. Further, the effect depends on the directional incidence of the incoming soundwave, and accordingly localization of sources is subject dependent and the specific features to localize sources are learned by each person from early childhood. Therefore, any mismatch between a person's actual pinnae and those of a reference used to generate virtual audio will potentially result in degraded audio perception, and specifically degraded spatial perception. Providing optimized audio perception is accordingly a challenging problem.

[0017] Further, the current trend is towards providing a high degree of flexibility and choice to the end user and VR client. This includes VR standards being developed that allow the VR client to select the rendering algorithm and approach. As a result, the audio description data provided by a VR server must be sufficiently generic and rendering agnostic to allow for different rendering algorithms at the VR client.

[0018] Hence, an improved approach to audio processing, in particular for a virtual/ augmented/ mixed reality experience/ application, would be advantageous. In particular, an approach that allows improved operation, increased flexibility, reduced complexity, facilitated implementation, an improved audio experience, a more consistent perception of an audio and visual scene, improved customization, improved personalization, an improved virtual reality experience, and/or improved performance and/or operation would be advantageous.

SUMMARY OF THE INVENTION



[0019] Accordingly, the invention seeks to preferably mitigate, alleviate or eliminate one or more of the above mentioned disadvantages singly or in any combination.

[0020] According to an aspect of the invention there is provided an audio apparatus comprising: a first receiver for receiving a set of input audio signals from a first source; a second receiver for receiving acoustic environment data from a second source; a third receiver for receiving binaural transfer function data from a third source, the binaural transfer function data being indicative of a set of binaural transfer functions; a renderer for generating output audio signals from the set of input audio signals; the renderer comprising a reverberator arranged to generate a reverberation component of the output audio signals by applying reverberation processing to the set of input audio signals in response to the acoustic environment data; and an adapter for adapting a first property of the reverberation processing in response to a second property of the set of binaural transfer functions.

[0021] The invention may provide an improved user experience in many embodiments and may specifically provide improved audio rendering in many applications, such as specifically virtual/ augmented/ mixed reality applications. The approach may provide improved audio perception and may in many embodiments provide an improved and more natural perception of an audio scene. The approach may in many embodiments provide improved performance while maintaining low complexity and resource usage.

[0022] The approach may provide improved flexibility and may for example allow, facilitate, or improve audio processing where the acoustic environment data and the binaural transfer function data are received from different sources and e.g. generated by different and independent sources. For example, audio data describing the set of input audio signals and the acoustic environment data may be provided by one source, such as specifically a remote server, whereas the binaural transfer function data may be provided by a different source, such as a local source. The approach may for example allow a central server to provide data representing a virtual experience without having to consider specific aspects of the individual user while allowing a local client comprising the audio apparatus to efficiently adapt and customize the rendered reverberation to the individual user, e.g. such that the reverberation reflects anthropometric properties of the individual user or at least anthropometric properties corresponding to the anthropometric properties reflected by the binaural transfer function data.

[0023] The approach may for example provide efficient support for systems in which a central server does not specifically know the audio processing that will be performed at the individual client. It may support a flexible choice of rendering algorithm at the client.

[0024] The input audio signals may be encoded audio data and may correspond to e.g. audio channel signals, audio components, audio objects etc. The first receiver may receive position data for the input audio signals and the renderer may generate the output audio signals in response to the position data. Specifically, the output audio signals may comprise audio components corresponding to the input audio signals with positions determined in response to the position data. An audio component of the output audio signals corresponding to a given input audio signal may be rendered with positional cues corresponding to a position indicated for the input audio signal by the position data.

[0025] The output audio signals may specifically be a binaural stereo signal, or may e.g. be a set of signals derived from a binaural stereo signal.

[0026] The acoustic environment data may comprise room acoustic characteristics. The acoustic environment data may for example describe acoustic or physical properties of the environment (e.g. reverberation time or dimensions) or may e.g. directly describe rendering parameters indicative of how the reverberation processing should be performed (e.g. filter coefficients).
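
By way of a purely illustrative sketch (the field names are assumptions and do not correspond to any standardized bit stream syntax), such acoustic environment data might be represented as:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AcousticEnvironmentData:
    """Hypothetical container for acoustic environment data (names are illustrative only)."""
    # Physical/acoustic description of the environment.
    reverberation_time_s: Optional[float] = None     # e.g. broadband T60 in seconds
    room_dimensions_m: Optional[List[float]] = None   # e.g. [width, depth, height]
    surface_absorption: Optional[List[float]] = None  # per-surface absorption coefficients
    # Alternatively, direct rendering parameters for the reverberation processing.
    reverb_filter_coefficients: Optional[List[float]] = None

# A server might e.g. signal a small tiled bathroom as:
bathroom = AcousticEnvironmentData(reverberation_time_s=1.2,
                                   room_dimensions_m=[2.5, 2.0, 2.4],
                                   surface_absorption=[0.02] * 6)
```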

[0027] In accordance with an optional feature of the invention, the adapter is arranged to adapt the reverberation processing such that a characteristic of the reverberation component matches a corresponding characteristic of the set of binaural transfer functions.

[0028] This may provide a rendering of the input audio signals which is particularly advantageous in many embodiments, and which may provide an improved perceived audio quality. Often an improved spatial perception and/or a more naturally sounding sound stage/ audio scene can be achieved.

[0029] In some embodiments, the adapter may be arranged to adapt the reverberation processing such that characteristics of the reverberation component match an anthropometrically dependent characteristic of the set of binaural transfer functions.

[0030] In accordance with an optional feature of the invention, the second property is a frequency response characteristic for the set of binaural transfer functions, and the adapter is arranged to adapt a frequency response of the reverberation processing in response to the frequency response characteristic.

[0031] This may provide particularly advantageous operation in many embodiments, and may in many scenarios provide a more naturally perceived audio scene. The approach may adapt the coloration of the output audio signals to match the coloration provided by the binaural transfer functions.
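
By way of a non-limiting sketch, one possible way to derive such a frequency response characteristic and impose it on the reverberation component is shown below. It assumes that the set of binaural transfer functions is available as time-domain HRIR pairs; the RMS averaging, FFT size and zero-phase weighting are illustrative choices rather than requirements of the described apparatus.

```python
import numpy as np

def diffuse_field_response(hrirs, n_fft=1024):
    """Average magnitude response over all HRIR pairs and both ears.

    hrirs: array of shape (num_positions, 2, ir_length) holding left/right HRIRs.
    Returns the RMS-averaged magnitude spectrum (length n_fft // 2 + 1), which can
    serve as a coloration target for the reverberation processing.
    """
    spectra = np.abs(np.fft.rfft(hrirs, n=n_fft, axis=-1))     # (pos, 2, bins)
    return np.sqrt(np.mean(spectra ** 2, axis=(0, 1)))         # power-average, then RMS

def color_reverb(reverb_stereo, target_mag, n_fft=1024):
    """Impose the target magnitude response on a (2, num_samples) reverberation signal
    via simple zero-phase frequency-domain weighting (block processing is ignored here)."""
    n = max(n_fft, reverb_stereo.shape[-1])
    spec = np.fft.rfft(reverb_stereo, n=n, axis=-1)
    # Interpolate the target onto the signal's frequency grid.
    grid = np.linspace(0.0, 1.0, spec.shape[-1])
    src = np.linspace(0.0, 1.0, target_mag.shape[-1])
    weights = np.interp(grid, src, target_mag)
    return np.fft.irfft(spec * weights, n=n, axis=-1)[..., :reverb_stereo.shape[-1]]
```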

[0032] In accordance with an optional feature of the invention, the reverberator comprises an adaptable reverberator adaptive to acoustic environment data and a filter having a frequency response dependent on the frequency response characteristic.

[0033] This may provide a particularly efficient and high-performance implementation in many embodiments. The adaptive reverberator and the filter may be cascade coupled. The order of the adaptive reverberator and the filter may be different in different embodiments and other functional blocks may be coupled in-between.

[0034] In accordance with an optional feature of the invention, the reverberator comprises a synthetic reverberator and the adapter is arranged to adapt a processing parameter of the synthetic reverberator in response to the frequency response characteristic.

[0035] This may provide a particularly efficient and high-performance implementation in many embodiments.

[0036] In accordance with an optional feature of the invention, the second property comprises an inter-ear correlation property for the set of binaural transfer functions and the first property comprises an inter-ear correlation property for the reverberation processing.

[0037] This may provide particularly advantageous operation in many embodiments, and may in many scenarios provide a more naturally perceived audio scene. An inter-ear correlation property may be indicative of a correlation between a left ear signal and a right ear signal. The inter-ear correlation property may for example include a coherence property or correlation coefficient for two signals corresponding to the two ears of the user. For the reverberation processing, the inter-ear correlation property may be a correlation measure between the two signals of a binaural output audio signal generated by the reverberation processing. For the set of binaural transfer functions, the inter-ear correlation property may be a correlation measure for right ear and left ear transfer functions of the binaural transfer functions.
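
Purely as an illustration, and under the assumption that the set of binaural transfer functions is available as HRIR pairs, a frequency dependent inter-ear coherence could be estimated along the following lines (the estimator, FFT size and averaging over positions are illustrative choices):

```python
import numpy as np

def interaural_coherence(hrirs, n_fft=1024):
    """Estimate a frequency dependent inter-ear coherence from a set of HRIR pairs.

    hrirs: array of shape (num_positions, 2, ir_length).
    Returns per-bin coherence values in [0, 1], averaged over the positions of the set,
    which the adapter could use as a target correlation for the reverberation output.
    """
    H = np.fft.rfft(hrirs, n=n_fft, axis=-1)                   # (pos, 2, bins)
    left, right = H[:, 0, :], H[:, 1, :]
    cross = np.mean(left * np.conj(right), axis=0)             # averaged cross-spectrum
    p_left = np.mean(np.abs(left) ** 2, axis=0)
    p_right = np.mean(np.abs(right) ** 2, axis=0)
    return np.abs(cross) / np.sqrt(p_left * p_right + 1e-12)   # avoid division by zero
```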

[0038] In accordance with an optional feature of the invention, the inter-ear correlation property for the set of binaural transfer functions and the inter-ear correlation property for the reverberation processing are frequency dependent.

[0039] This may provide an improved audio output in many scenarios and embodiments.

[0040] In accordance with an optional feature of the invention, the reverberator is arranged to generate a pair of partially correlated signals from a pair of substantially uncorrelated signals generated from the set of input audio signals, and to generate the output audio signals from the partially correlated signals, and the adapter is arranged to adapt a correlation between the output audio signals in response to the inter-ear correlation property for the set of binaural transfer functions.

[0041] This may provide a particularly efficient and high-performance implementation in many embodiments. The correlation between the substantially uncorrelated signals is very low. In many embodiments, the correlation coefficient may be no more than e.g. 0.05 or 0.10.
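
A minimal sketch of such a mixing operation is given below for a broadband target correlation coefficient rho; the same mixing rule may be applied per frequency band when the target correlation is frequency dependent. The assumption that the two uncorrelated inputs have equal power is an assumption of the sketch, not of the described apparatus.

```python
import numpy as np

def mix_to_correlation(x1, x2, rho):
    """Mix two substantially uncorrelated, equal-power signals into a pair whose
    normalized correlation coefficient is approximately rho (-1 <= rho <= 1)."""
    a = np.sqrt((1.0 + rho) / 2.0)   # weight of the common component
    b = np.sqrt((1.0 - rho) / 2.0)   # weight of the differing component
    left = a * x1 + b * x2
    right = a * x1 - b * x2
    return left, right
```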

[0042] In accordance with an optional feature of the invention, the reverberator comprises a decorrelator for generating a substantially decorrelated signal from a first signal derived from the set of input audio signals; and the reverberator is arranged to generate a pair of partially correlated signals from the decorrelated signal and the first signal and to generate the output audio signals from the partially correlated signals, the adapter being arranged to adapt a correlation between the output audio signals in response to the inter-ear correlation property for the set of binaural transfer functions.

[0043] This may provide a particularly efficient and high-performance implementation in many embodiments. The substantially decorrelated signal may have a very low correlation relative to the first signal. In many embodiments, the correlation coefficient may be no more than e.g. 0.05, 0.08, or 0.10.
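
One common way to obtain such a substantially decorrelated signal is to filter the first signal with an (approximately) all-pass filter having randomized phase. The sketch below builds such a filter; the filter length and the frequency-domain construction are illustrative choices, not requirements of the described apparatus.

```python
import numpy as np

def decorrelation_fir(length=512, seed=0):
    """Build a short, approximately all-pass FIR with randomized phase. Convolving a
    signal with this filter yields a copy that is substantially decorrelated from the
    original while keeping a similar magnitude spectrum (a sketch, not a tuned design)."""
    rng = np.random.default_rng(seed)
    n_bins = length // 2 + 1
    phases = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=n_bins))
    phases[0] = 1.0     # DC bin must be real
    phases[-1] = 1.0    # Nyquist bin must be real (even length)
    return np.fft.irfft(phases, n=length)

# Usage: x is a mono signal derived from the set of input audio signals.
# decorrelated = np.convolve(x, decorrelation_fir())[:len(x)]
```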

[0044] In accordance with an optional feature of the invention, the adapter is arranged to determine the second property of the set of binaural transfer functions in response to a combination of properties for a plurality of binaural transfer functions of the set of binaural transfer functions for different positions.

[0045] This may provide an improved performance in many scenarios and embodiments.

[0046] In accordance with an optional feature of the invention, the second receiver is arranged to receive dynamically changing acoustic environment data and the third receiver is arranged to receive static binaural transfer function data.

[0047] In accordance with an optional feature of the invention, the second source is different from the third source.

[0048] In accordance with an optional feature of the invention, the renderer further comprises a binaural processor for generating an early reflection component of the output audio signals in response to at least one of the set of binaural transfer functions.

[0049] According to an aspect of the invention there is provided a method of audio processing comprising: receiving a set of input audio signals from a first source; receiving acoustic environment data from a second source; receiving binaural transfer function data from a third source, the binaural transfer function data being indicative of a set of binaural transfer functions; generating output audio signals from the set of input audio signals, the generating comprising generating a reverberation component of the output audio signals by applying reverberation processing to the set of input audio signals in response to the acoustic environment data; and adapting a first property of the reverberation processing in response to a second property of the set of binaural transfer functions.

[0050] These and other aspects, features and advantages of the invention will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.

BRIEF DESCRIPTION OF THE DRAWINGS



[0051] Embodiments of the invention will be described, by way of example only, with reference to the drawings, in which

FIG. 1 illustrates an example of an audio distribution system;

FIG. 2 illustrates an example of elements of an audio apparatus in accordance with some embodiments of the invention;

FIG. 3 illustrates an example of elements of an audio apparatus in accordance with some embodiments of the invention;

FIG. 4 illustrates an example of a room impulse response;

FIG. 5 illustrates an example of elements of a reverberator for an audio apparatus in accordance with some embodiments of the invention;

FIG. 6 illustrates an example of elements of a reverberator for an audio apparatus in accordance with some embodiments of the invention;

FIG. 7 illustrates an example of elements of a reverberator for an audio apparatus in accordance with some embodiments of the invention;

FIG. 8 illustrates an example of elements of a reverberator for an audio apparatus in accordance with some embodiments of the invention;

FIG. 9 illustrates an example of elements of a reverberator for an audio apparatus in accordance with some embodiments of the invention;

FIG. 10 illustrates an example of elements of a reverberator for an audio apparatus in accordance with some embodiments of the invention; and

FIG. 11 illustrates an example of elements of a reverberator for an audio apparatus in accordance with some embodiments of the invention.


DETAILED DESCRIPTION OF SOME EMBODIMENTS OF THE INVENTION



[0052] Virtual (including augmented) experiences allowing a user to move around in a virtual or augmented world are becoming increasingly popular and services are being developed to satisfy such demands. In many such approaches, visual and audio data may dynamically be generated to reflect a user's (or viewer's) current pose.

[0053] In the field, the terms placement and pose are used as a common term for position and/or direction/ orientation. The combination of the position and direction/ orientation of e.g. an object, a camera, a head, or a view may be referred to as a pose or placement. Thus, a placement or pose indication may comprise up to six values/ components/ degrees of freedom with each value/ component typically describing an individual property of the position/ location or the orientation/ direction of the corresponding object. Of course, in many situations, a placement or pose may be represented by fewer components, for example if one or more components is considered fixed or irrelevant (e.g. if all objects are considered to be at the same height and have a horizontal orientation, four components may provide a full representation of the pose of an object). In the following, the term pose is used to refer to a position and/or orientation which may be represented by one to six values (corresponding to the maximum possible degrees of freedom).

[0054] Many VR applications are based on a pose having the maximum degrees of freedom, i.e. three degrees of freedom of each of the position and the orientation resulting in a total of six degrees of freedom. A pose may thus be represented by a set or vector of six values representing the six degrees of freedom and thus a pose vector may provide a three-dimensional position and/or a three-dimensional direction indication. However, it will be appreciated that in other embodiments, the pose may be represented by fewer values.
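
For illustration only (the field names and units are assumptions, not part of any standard), such a six degree of freedom pose could be represented as:

```python
from dataclasses import dataclass

@dataclass
class Pose:
    """Six degrees of freedom: three position and three orientation components (illustrative)."""
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    yaw: float = 0.0    # rotation about the vertical axis, in radians
    pitch: float = 0.0
    roll: float = 0.0
```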

[0055] A system or entity based on providing the maximum degree of freedom for the viewer is typically referred to as having 6 Degrees of Freedom (6DoF). Many systems and entities provide only an orientation or position and these are typically known as having 3 Degrees of Freedom (3DoF).

[0056] Typically, the virtual reality application generates a three-dimensional output in the form of separate view images for the left and the right eyes. These may then be presented to the user by suitable means, such as typically individual left and right eye displays of a VR headset. In other embodiments, one or more view images may e.g. be presented on an autostereoscopic display, or indeed in some embodiments only a single two-dimensional image may be generated (e.g. using a conventional two-dimensional display).

[0057] Similarly, for a given viewer/ user/ listener pose, an audio representation of the scene may be provided. The audio scene is typically rendered to provide a spatial experience where audio sources are perceived to originate from desired positions. As audio sources may be static in the scene, changes in the user pose will result in a change in the relative position of the audio source with respect to the user's pose. Accordingly, the spatial perception of the audio source should change to reflect the new position relative to the user. The audio rendering may accordingly be adapted depending on the user pose.

[0058] In many embodiments, the audio rendering is a binaural rendering using Head Related Transfer Functions (HRTFs) or Binaural Room Impulse Responses (BRIRs) (or similar) to provide the desired spatial effect for a user wearing a headphone. However, it will be appreciated that in some systems, the audio may instead be rendered using a loudspeaker system and the signals for each loudspeaker may be rendered such that the overall effect at the user corresponds to the desired spatial experience.

[0059] The viewer or user pose input may be determined in different ways in different applications. In many embodiments, the physical movement of a user may be tracked directly. For example, a camera surveying a user area may detect and track the user's head (or even eyes (eye-tracking)). In many embodiments, the user may wear a VR headset which can be tracked by external and/or internal means. For example, the headset may comprise accelerometers and gyroscopes providing information on the movement and rotation of the headset and thus the head. In some examples, the VR headset may transmit signals or comprise (e.g. visual) identifiers that enable an external sensor to determine the position of the VR headset.

[0060] In some systems, the viewer pose may be provided by manual means, e.g. by the user manually controlling a joystick or similar manual input. For example, the user may manually move the virtual viewer around in the virtual scene by controlling a first analog joystick with one hand and manually controlling the direction in which the virtual viewer is looking by manually moving a second analog joystick with the other hand.

[0061] In some applications a combination of manual and automated approaches may be used to generate the input viewer pose. For example, a headset may track the orientation of the head and the movement/ position of the viewer in the scene may be controlled by the user using a joystick.

[0062] In some systems, the VR application may be provided locally to a viewer by e.g. a standalone device that does not use, or even have any access to, any remote VR data or processing. For example, a device such as a games console may comprise a store for storing the scene data, input for receiving/ generating the viewer pose, and a processor for generating the corresponding images from the scene data.

[0063] In other systems, the VR application may be implemented and performed remote from the viewer. For example, a device local to the user may detect/ receive movement/ pose data which is transmitted to a remote device that processes the data to generate the viewer pose. The remote device may then generate suitable view images for the viewer pose based on scene data describing the scene. The view images are then transmitted to the device local to the viewer where they are presented. For example, the remote device may directly generate a video stream (typically a stereo/ 3D video stream) which is directly presented by the local device. Similarly, the remote device may generate an audio scene reflecting the virtual audio environment. This may in many embodiments be done by generating audio signals that correspond to the relative position of different audio sources in the virtual audio environment, e.g. by applying binaural processing to the individual audio components corresponding to the current position of these relative to the head pose. Thus, in such an example, the local device may not perform any VR processing except for transmitting movement data and presenting received video and audio data.

[0064] Similarly, the remote VR device may generate audio data representing an audio scene and may transmit audio components/ objects corresponding to different audio sources in the audio scene together with position information indicative of the position of these (which may e.g. dynamically change for moving objects). The local VR device may then render such signals appropriately, e.g. by applying appropriate binaural processing reflecting the relative position of the audio sources for the audio components.

[0065] For the audio side, a central server may accordingly in some embodiments generate a spatial audio mix that can be rendered directly by the remote client device. For example, the central server may generate spatial audio as a number of audio channels for direct rendering by a surround sound loudspeaker setup. However, more commonly, the central server may generate a mix by binaurally processing all audio signals in the scene to be rendered and then combining these into a binaural stereo signal which can be rendered directly at the client side using a set of headphones.

[0066] In many applications, the central server may instead provide a number of audio objects or components with each of these corresponding typically to a single audio source. The client can then process such objects/ components to generate the desired audio scene. Specifically, it may binaurally process each audio object based on the desired position and combine the results.

[0067] In such systems, audio data transmitted to a remote client may include data for a plurality of audio components or objects. The audio may for example be represented as encoded audio for a given audio component which is to be rendered. The audio data may further comprise position data which indicates a position of the source of the audio component. The positional data may for example include absolute position data defining a position of the audio source in the scene. The local apparatus may in such an embodiment determine a relative position of the audio source relative to the current user pose. Thus, the received position data may be independent of the user's movements and a relative position for audio sources may be determined locally to reflect the position of the audio source with respect to the user. Such a relative position may indicate the relative position of where the user should perceive the audio source to originate from, and it will accordingly vary depending on the user's head movements. In other embodiments, the audio data may comprise position data which directly describes the relative position.
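
As a minimal illustration of such a local computation, the sketch below converts an absolute source position into a listener-relative position, assuming Cartesian scene coordinates and a head orientation given as a single yaw angle (a full three-axis rotation would be handled analogously; the function and parameter names are assumptions):

```python
import numpy as np

def relative_position(source_pos, listener_pos, listener_yaw):
    """Return the source position in listener-relative coordinates.

    source_pos, listener_pos: length-3 arrays in scene coordinates (metres).
    listener_yaw: head rotation about the vertical axis in radians.
    The result rotates with the listener's head, as required for binaural rendering.
    """
    offset = np.asarray(source_pos, dtype=float) - np.asarray(listener_pos, dtype=float)
    c, s = np.cos(-listener_yaw), np.sin(-listener_yaw)
    rotation = np.array([[c, -s, 0.0],
                         [s,  c, 0.0],
                         [0.0, 0.0, 1.0]])   # rotate the world by the inverse head rotation
    return rotation @ offset
```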

[0068] In addition to audio data describing the audio and position of the different audio sources, the VR server may further provide data describing the acoustic environment of the user and/or the audio sources. For example, the VR server may provide data to reflect whether the virtual user is e.g. in a small room, a large concert hall, outside etc. Additionally, the acoustic environment data may include information about the reflectiveness of the boundaries (walls, ceiling, floor) and/or objects in the environment. The perceived audio in such environments varies substantially and in order to perceive more naturally sounding audio it is therefore highly desirable to adapt the rendered audio to reflect such characteristics.

[0069] FIG. 1 illustrates an example of a VR system in which a central server 101 liaises with a number of remote clients 103 e.g. via a network 105, such as the Internet. The central server 101 may be arranged to simultaneously support a potentially large number of remote clients 103.

[0070] Such an approach may in many scenarios provide an improved trade-off e.g. between complexity and resource demands for different devices, communication requirements etc. For example, the viewer pose and corresponding scene data may be transmitted with larger intervals with the local device processing the viewer pose and received scene data locally to provide a real time low lag experience. This may for example substantially reduce the required communication bandwidth while providing a low latency experience and while allowing the scene data to be centrally stored, generated, and maintained. It may for example be suitable for applications where a VR experience is provided to a plurality of remote devices.

[0071] FIG. 2 illustrates elements of an audio apparatus which may provide an improved audio rendering in many applications and scenarios. In particular, the audio apparatus may provide improved rendering for many VR applications, and the audio apparatus may specifically be arranged to perform the audio processing and rendering for a VR client of FIG. 1.

[0072] The audio apparatus of FIG. 2 generates output audio signals corresponding to an audio scene described by audio data received from the central server 101. Accordingly, the audio apparatus comprises a first receiver 201 which is arranged to receive a set of input audio signals from a first source which in the specific example is the central server 101.

[0073] The audio signals may be encoded audio signals and thus be represented by encoded audio data. Further, the audio input signals may be different types of audio signals and components and indeed in many embodiments the first receiver 201 may receive audio data which defines a combination of different types of audio signals. For example, the audio data may include audio represented by audio channel signals, individual audio objects, higher order ambisonics etc.

[0074] The audio apparatus further comprises a second receiver 203 which is arranged to receive acoustic environment data from a second source. The acoustic environment data may describe one or more acoustic properties or characteristics of an acoustic environment that is desired to be reproduced by the audio presented to the listener. For example, the acoustic environment data may describe an acoustic environment for the listener and/or the audio sources/ signals/ components. Typically, the acoustic environment data will be a description of an intended acoustic environment in which both the audio sources and the listener are present. For a VR application the acoustic environment data may describe the virtual acoustic environment of the listener/ user and/or one or more objects in the virtual scene and/or one or more of the audio sources in the virtual scene.

[0075] Typically, the acoustic environment data will reflect acoustic characteristics of a virtual room in which the virtual user is present. It is desired that the audio apparatus renders the input audio signals with acoustic characteristics that match those of the virtual room, and the acoustic environment data may correspond to room acoustics data. The following description will focus on this scenario and use terms such as room data or room acoustics data, but it will be appreciated that this is not intended as a limitation and that the acoustic environment data may equally apply to e.g. virtual outdoor environments.

[0076] The second source may typically be the same source as the first source but could be a different source in some embodiments. In the following, the second source will be considered to be the same as the first source, namely specifically the central server 101. The audio apparatus may specifically be arranged to receive a single data stream comprising visual data, audio data, and acoustic environment data from the central server 101. The first receiver 201 and the second receiver 203 may be considered to correspond to elements of a common receiver or network interface extracting the different parts of data.

[0077] The audio apparatus further comprises a third receiver 205 which is arranged to receive binaural transfer function data from a third source. The binaural transfer function data specifically comprises data describing a set of binaural transfer functions.

[0078] The third source may in some embodiments correspond to the first and/or second source but will in many embodiments be a different source. Specifically, in many embodiments, the third source may be a local source. For example, the audio apparatus may comprise a local store which stores the set of binaural transfer functions.

[0079] Binaural rendering is a technology, typically (but not exclusively) aimed at consumption over headphones. Binaural processing seeks to create the perception that there are sound sources surrounding the listener that are not physically present. As a result, the sound will not only be heard 'inside' one's head, as is the case with listening over headphones without binaural rendering, but can be brought outside one's head, as is the case for natural listening. Apart from a more realistic experience another upside is that virtual surround has a positive effect on listener fatigue.

[0080] Binaural processing is known to be used to provide a spatial experience by virtual positioning of sound sources using individual signals for the listener's ears. Virtual surround is a method of rendering the sound such that audio sources are perceived as originating from a specific direction, thereby creating the illusion of listening to a physical surround sound setup (e.g. 5.1 speakers) or immersive environment (concert) or e.g. by directly positioning audio sources at their appropriate position in the sound stage. With an appropriate binaural rendering processing, the signals required at the eardrums in order for the listener to perceive sound from any desired direction can be calculated, and the signals can be rendered such that they provide the desired effect. These signals are then recreated at the eardrum using either headphones or a crosstalk cancelation method (suitable for rendering over closely spaced speakers). Binaural rendering can be considered to be an approach for generating signals for the ears of a listener resulting in tricking the human auditory system into thinking that a sound is coming from the desired positions.

[0081] It should be appreciated that in many embodiments the binaural rendering includes a compensation for such headphone or speaker playback, for example a compensation for the frequency response of the speakers or the transfer function from the speaker feed signals to the ears or eardrums. Such compensation filtering may be included in the rendering, or applied as a pre- or post-processing step. It may also, or alternatively, be combined with any filtering related to the invention.

[0082] The binaural rendering is based on binaural transfer functions such as head related binaural transfer functions which vary from person to person due to the acoustic properties of the head, ears and reflective surfaces, such as the shoulders. For example, binaural filters can be used to create a binaural recording simulating multiple sources at various locations. This can be realized by convolving each sound source with the pair of Head Related Impulse Responses (HRIRs) that correspond to the position of the sound source.

[0083] By measuring e.g. the responses from a sound source at a specific location in 2D or 3D space to microphones placed in or near the human ears, the appropriate binaural filters can be determined. Typically, such measurements are made e.g. using models of human heads, MRI scans or indeed in some cases the measurements may be made by attaching microphones close to the eardrums of a person. The binaural filters can be used to create a binaural recording simulating multiple sources at various locations. This can be realized e.g. by convolving each sound source with the pair of measured impulse responses for a desired position of the sound source. In order to create the illusion that a sound source is moved around the listener, a large number of binaural filters is typically required with adequate spatial resolution, e.g. 10 degrees.

[0084] The head related binaural transfer functions may be represented e.g. as Head Related Impulse Responses (HRIRs), or equivalently as Head Related Transfer Functions (HRTFs), Binaural Room Impulse Responses (BRIRs), or Binaural Room Transfer Functions (BRTFs). The (e.g. estimated or assumed) transfer function from a given position to the listener's ears (or eardrums) may for example be given in the frequency domain in which case it is typically referred to as an HRTF or BRTF, or in the time domain in which case it is typically referred to as an HRIR or BRIR. In some scenarios, the head related binaural transfer functions are determined to include aspects or properties of the acoustic environment and specifically of the room in which the measurements are made, whereas in other examples only the user characteristics are considered. Examples of the first type of functions are the BRIRs and BRTFs.

[0085] A well-known method to determine binaural transfer functions is binaural recording. It is a method of recording sound that uses a dedicated microphone arrangement and is intended for replay using headphones. The recording is made by either placing microphones in the ear canal of a subject or using a dummy head with built-in microphones, a bust that includes pinnae (outer ears). The use of such a dummy head including pinnae provides a very similar spatial impression as if the person listening to the recordings had been present during the recording.

[0086] The set of binaural transfer functions may accordingly comprise binaural transfer functions for a, typically high, number of different positions with each binaural transfer function providing information of how an audio signal should be processed/ filtered in order to be perceived to originate from that position. Individually applying binaural processing to a plurality of audio signals/ sources and combining the result may be used to generate an audio scene with a number of audio sources positioned at appropriate positions in the sound stage.

[0087] The audio apparatus further comprises a renderer 207 which is coupled to the first receiver 201, the second receiver 203, and the third receiver 205, and which receives the input audio signals/ data, the acoustic environment data (room data), and the set of binaural transfer functions. The renderer 207 is arranged to generate output audio signals from the set of input audio signals in response to the acoustic environment data and the set of binaural transfer functions.

[0088] Thus, the renderer 207 generates output audio signals that represent the desired audio scene when presented to a user. The output audio signals specifically form a binaural stereo output signal which can be provided to a user via headphones, resulting in the perception of the audio stage.

[0089] The renderer 207 of FIG. 2 includes a reverberator 209 which generates a reverberation component for the output audio signals by applying reverberation processing to the set of input audio signals. Thus, the renderer 207 includes dedicated processing which specifically generates a reverberation component for the audio being presented to the user. The reverberation component may be combined with other audio components, such as typically audio components that do not reflect reverberation (although possibly these audio components may also include some reverberation characteristics).

[0090] Specifically, a second audio component may be generated to represent the direct audio component as well as possibly early reflections of the audio. An example of such a renderer 207 is illustrated in FIG. 3.

[0091] In the example of FIG. 3, the renderer 207 comprises a binaural processor 301 which is arranged to apply binaural processing to the received input audio signals. The binaural processor 301 processes the input audio signals based on the set of binaural transfer functions to generate binaural signals that correspond to the desired position of the audio sources of the input audio signals. The appropriate binaural transfer function to use is dependent on the desired position for the input audio signal being processed and may accordingly be selected based on position data received with the input audio signals.

[0092] For example, for a given input audio signal being e.g. an audio object for an audio source at a specific position (indicated by position data), the binaural processor 301 may retrieve the binaural transfer function closest to the source position (or e.g. interpolate between the closest binaural transfer functions). The input audio signal may then be convolved with the retrieved binaural transfer function to provide a binaural audio component (it will be appreciated that a binaural transfer function for a given position will typically include a transfer function for the right ear and a transfer function for the left ear).

[0093] The convolution may also be a frequency dependent filtering with a gain and phase offset to generate a desired signal level (relative to other frequencies), a level difference and a phase difference between the two signals for each frequency. At higher frequencies, e.g. above 2.5 kHz, the phase difference may be omitted to reduce computational complexity.

[0094] The binaural processor 301 may proceed to perform binaural processing for all input audio signals and add these together to generate a binaural output signal. The binaural processor 301 thus provides a binaural audio signal corresponding to the sound stage with the individual audio sources (perceived to be) positioned at the desired positions.
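
A rough sketch of such a rendering loop is given below. It assumes the set of binaural transfer functions is stored as HRIR pairs indexed by measurement direction and uses simple nearest-neighbour selection (interpolation between neighbouring HRIRs, as mentioned above, is a common refinement); it is illustrative only and not a description of the binaural processor 301 itself.

```python
import numpy as np

def render_binaural(sources, hrir_directions, hrirs):
    """Binaurally render a list of (signal, direction) pairs.

    sources: list of (mono_signal, unit_direction_vector) tuples.
    hrir_directions: array (num_positions, 3) of unit vectors for the measured HRIRs.
    hrirs: array (num_positions, 2, ir_length) of left/right HRIRs.
    Returns a (2, num_samples) binaural signal: each source is convolved with the
    HRIR pair closest to its direction and the results are summed.
    """
    length = max(len(sig) + hrirs.shape[-1] - 1 for sig, _ in sources)
    out = np.zeros((2, length))
    for signal, direction in sources:
        index = int(np.argmax(hrir_directions @ np.asarray(direction)))  # nearest direction
        for ear in (0, 1):
            rendered = np.convolve(signal, hrirs[index, ear])
            out[ear, :len(rendered)] += rendered
    return out
```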

[0095] Typically, the binaural transfer functions will reflect the direct sound and possibly some early reflections. However, in most embodiments, the binaural transfer functions will not include a reverberation component or room specific information. In particular, the binaural transfer functions may be anechoic binaural transfer functions. The binaural transfer functions may specifically be HRIRs or HRTFs which reflect user anthropometric properties but not e.g. reverberation or room/ acoustics environment dependent properties. The following description will accordingly often specifically refer to HRTFs but it will be appreciated that this does not imply that the invention, or indeed even the embodiments, are limited to HRTFs.

[0096] In parallel to the processing by the binaural processor 301, the reverberator 209 may process the input audio signals to generate the reverberation component. This processing is dependent on the acoustic environment data and the input audio signals will be processed to generate an output stereo signal that corresponds to reverberant audio for an acoustic environment described by the acoustic environment data.

[0097] The reverberator 209 may for example include reverberation filters or synthetic reverberators, such as a Jot reverberator.
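
As an illustration of a synthetic reverberator, the sketch below implements a very small feedback delay network of the kind a Jot reverberator is built on. The delay lengths, feedback matrix and decay time are arbitrary example values which in practice would be derived from the acoustic environment data; the sketch produces a single mono reverberation signal and omits the output filtering of a full design.

```python
import numpy as np

def fdn_reverb(x, fs=48000, delays=(1031, 1327, 1523, 1871), t60=1.2):
    """Tiny 4-line feedback delay network producing a mono reverberation tail.

    x: mono input signal, fs: sample rate, delays: delay-line lengths in samples,
    t60: desired reverberation time in seconds (sets the per-line feedback gains).
    """
    delays = np.asarray(delays)
    # Per-line gains giving roughly a T60 decay: g = 10^(-3 * delay / (t60 * fs)).
    gains = 10.0 ** (-3.0 * delays / (t60 * fs))
    # Orthogonal (Hadamard) feedback matrix keeps energy circulating without coloration.
    A = 0.5 * np.array([[1, 1, 1, 1],
                        [1, -1, 1, -1],
                        [1, 1, -1, -1],
                        [1, -1, -1, 1]], dtype=float)
    lines = [np.zeros(d) for d in delays]
    ptr = np.zeros(len(delays), dtype=int)
    out = np.zeros(len(x))
    for n in range(len(x)):
        reads = np.array([lines[i][ptr[i]] for i in range(len(delays))])
        out[n] = reads.sum()
        feedback = A @ (gains * reads) + x[n]   # feed the input into every line
        for i in range(len(delays)):
            lines[i][ptr[i]] = feedback[i]
            ptr[i] = (ptr[i] + 1) % delays[i]
    return out
```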

[0098] The binaural processor 301 may accordingly generate audio components corresponding to the direct sound propagation such that this is perceived to arrive from the audio source position. In many embodiments, the binaural processor 301 may also be arranged to generate audio components corresponding to paths from the source to the user that include one or a few reflections. Such audio will be perceived to arrive from a position that does not correspond to the position of the audio source but rather to the position of the last reflection before reaching the user. In some embodiments, such a position of the last reflection may be determined, and an audio component corresponding to this early reflection will be generated using e.g. binaural transfer functions/ HRTFs corresponding to this position.

[0099] For example, the audio apparatus may use the positions of the audio sources and user/listener in the room/environment and determine reflections from surfaces such as walls, ceiling and floor towards the user. Typically, this may be done by mirroring audio source positions with respect to the reflecting surfaces to easily find the distance and direction of incidence of the reflection of that source on the corresponding surface. Second order reflections can be determined by also mirroring a second surface with respect to the first surface and mirroring the mirrored audio source with respect to the mirrored second surface. These reflections can then be rendered as additional audio sources with the HRTFs corresponding to their direction of incidence.

[0100] Additionally, a (frequency dependent) attenuation may be used to model the reflectivity of the one or more surfaces related to the reflection and the distance from the source via the reflections to the user/listener.
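
A minimal sketch of this mirroring step is given below for an axis-aligned (shoebox) room. It returns first-order image-source positions together with a simple broadband attenuation, while frequency dependent reflectivity and higher-order reflections are omitted for brevity; the shoebox model and parameter names are assumptions of the sketch.

```python
import numpy as np

def first_order_reflections(source, listener, room_dims, reflectivity=0.8):
    """Return (image_position, attenuation) pairs for the six first-order reflections
    in a shoebox room with walls at 0 and room_dims along each axis.

    The image position directly gives the direction of incidence of the reflection at
    the listener; attenuation combines wall reflectivity and 1/distance spreading.
    """
    source = np.asarray(source, dtype=float)
    listener = np.asarray(listener, dtype=float)
    reflections = []
    for axis in range(3):
        for wall in (0.0, room_dims[axis]):
            image = source.copy()
            image[axis] = 2.0 * wall - source[axis]        # mirror the source in the wall
            distance = np.linalg.norm(image - listener)
            reflections.append((image, reflectivity / max(distance, 1e-6)))
    return reflections
```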

[0101] It will be appreciated that the generation of audio components corresponding to the early reflections could be performed in the binaural processor 301, the reverberator 209, or indeed in a separate functional block or blocks.

[0102] The output signals of the binaural processor 301 and the reverberator 209 (including the early reflection components) are fed to a combiner 303 which combines the stereo output signals to generate an output binaural audio signal. The combiner 303 may for example perform a weighted summation with predetermined weights.

[0103] Thus, the renderer 207 of FIG. 3 generates an output stereo signal which comprises a binaural component generated by binaural processing of the input audio signals based on the binaural transfer functions and a reverberation component generated by processing the input audio signals based on the room characteristics. Further, in the example, the positional information is provided by the binaural processor 301 and the binaural component whereas the reverberation properties of the acoustic environment data are represented by the reverberation component. In the example, the binaural processor 301 is using the binaural transfer functions and the reverberator 209 is using the acoustic environment data.

[0104] However, in the audio apparatus of FIGs. 2 and 3, the reverberation processing is further adapted based on a property (or properties) of the set of binaural transfer functions. Thus, the set of binaural transfer functions is not merely used for the binaural processing and the positioning of audio sources but is also used to adapt the reverberation processing, despite the binaural transfer functions typically not comprising any reverberation information or specific acoustic environment information, and indeed even if anechoic binaural transfer functions/ HRTFs are used.

[0105] Specifically, in many embodiments, the set of binaural transfer functions is permanent or semi-permanent and may specifically be static for the entire VR session. Indeed, in many embodiments, the binaural transfer functions may be fixed binaural transfer functions stored in local memory and not dependent on the specific application. In contrast, the acoustic environment data may be dynamically varying and indeed the central server 101 may continuously transmit acoustic environment data indicative of the current acoustic properties. For example, as a user moves, the central server 101 may continuously transmit acoustic environment data that characterizes the (virtual) acoustic environment, e.g. reflecting when the user moves from a small room to a large room etc.

[0106] Thus, in many embodiments, the binaural transfer functions may not be dependent on the current acoustic environment. The binaural transfer functions may specifically not reflect any reverberation properties or acoustic environment specific information. The binaural transfer functions may for example be HRTFs/HRIRs rather than BRTFs/BRIRs.

[0107] The acoustic environment data may be independent of positions of audio sources of the input audio signals. The acoustic environment data may comprise no data which is used to position individual audio sources. The acoustic environment data may describe the common environment for the user and the audio sources of the input audio signals.

[0108] Further, the set of binaural transfer functions may be static for a given session/ user experience. Any update rate for changes in the set of binaural transfer functions may be substantially lower than an update rate for changes in the acoustic environment data. E.g. it may typically be at least 10 times or 100 times lower. In many embodiments, the set of binaural transfer functions may be static whereas the acoustic environment data is dynamically varying for a user session.

[0109] However, despite the binaural transfer functions not comprising any specific or dynamic information relating to the current acoustic environment or any reverberation information, the audio apparatus of FIGs. 2 and 3 further comprises an adapter 211 that is arranged to adapt the reverberation processing based on the set of binaural transfer functions. Specifically, the adapter 211 may adapt a property of the reverberation processing in response to a property of the set of binaural transfer functions.

[0110] The Inventor has realized that such an adaptation may be advantageous in many scenarios. Specifically, the Inventor has realized that the approach may provide a personalization or adaptation of the reverberation to the individual user or adaptation to the binaural transfer functions used by the user and that this improves audio perception despite the acoustic environment data referring to the acoustic environment and not individual properties of the user. The Inventor has further realized that the personalization can be performed efficiently and with high quality by considering properties of the set of binaural transfer functions and matching the reverberation performance accordingly. The interworking between the seemingly separate data sets and processing has been found to provide substantially improved audio perception with a perceived improved audio quality and feeling of naturalness.

[0111] Further, the approach allows for an efficient system and distribution of functionality. It provides a high degree of flexibility and interworking. For example, it allows a central server to simply provide audio data and acoustic environment data without having to consider how this is processed at the client. Specifically, it can distribute this information without considering or having any knowledge of the binaural transfer functions used at the individual client, and without considering or having any knowledge of the audio processing and algorithms used. It thus allows the acoustic environment data, the audio data and the binaural transfer function data to be received from completely different and independent sources.

[0112] As previously mentioned, by measuring e.g. the impulse responses from a sound source at a specific location in a three dimensional space to the microphones in the human ears for each individual, binaural transfer functions, such as HRIRs and HRTFs, can be determined.

[0113] However, because each person's pinnae are unique and the filtering they impose on sound depends on the directional incidence of the incoming soundwave, localization of sources is subject dependent and the specific features to localize sources are learned by each person from early childhood. Therefore, any mismatch between pinnae used during recording and those of the listener may lead to a degraded perception, and erroneous spatial impressions.

[0114] When HRIRs also include a room effect, they are referred to as BRIRs. As illustrated in FIG. 4, room impulse responses (and also BRIRs) consist of an anechoic portion (or direct sound) that only depends on the subject's anthropometric attributes (such as head size, ear shape, etc.), followed by a reverberant portion that characterizes the combination of the room and the anthropometric properties.

[0115] The reverberant portion contains two temporal regions, usually overlapping. The first region contains so-called early reflections, which are isolated reflections of the sound source on walls or obstacles inside the room before reaching the ear-drum (or measurement microphone). As the time lag increases, the number of reflections present in a fixed time interval increases, now also containing higher-order reflections. The second region in the reverberant portion is the part where these reflections are not isolated anymore. This region is called the diffuse- or late reverberation tail.

[0116] The reverberant portion contains cues that give the auditory system information about the distance of the source and the size and acoustical properties of the room. The energy of the reverberant portion in relation to that of the anechoic portion largely determines the perceived distance of the sound source. The density of the (early) reflections contributes to the perceived size of the room. The reverberation time, typically indicated by T60, is the time that it takes for reflections to drop 60 dB in energy level. The reverberation time gives information on the acoustical properties of the room: whether its walls are very reflective (e.g. bathroom) or there is much absorption of sound (e.g. bedroom with furniture, carpet and curtains).

[0117] Where the T60 assumes a linear decay of the reverberation (in the dB domain), the Energy Decay Curve (EDC) and Energy Decay Relief (EDR) describe the profile of the reverberation's decay in more detail, where the EDR is a frequency dependent variant of the EDC. This information can be extracted from a BRIR, h(t), for example as follows:

$$EDC(t) = \int_t^{\infty} h^2(\tau)\,d\tau$$

$$EDR(t,\omega) = \int_t^{\infty} \left|H(\tau,\omega)\right|^2 d\tau$$

where H(t,ω) is a time-frequency representation of the BRIR.

[0118] From the EDC or EDR, the T60 time can be derived. However, since the decay may not be linear, the Early Decay Time (EDT) is also used in direct comparison with T60. EDT corresponds to the time taken for the decay from 0 to -10 dB, multiplied by 6 (to make it comparable with T60). Early decay development is a cue for distinguishing spaces, and is highly correlated with the dissimilarity between outside and inside reverberation [5].
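Purely as an illustration of how such measures could be derived in practice, the following sketch estimates the EDC, T60 and EDT of a discrete impulse response via Schroeder backward integration; the function names, the linear-regression fit and the synthetic test response are assumptions for this example and are not part of the described apparatus.

    import numpy as np

    def energy_decay_curve_db(h):
        # Schroeder backward-integrated energy decay curve, normalized to 0 dB at t = 0.
        edc = np.cumsum((h ** 2)[::-1])[::-1]
        edc = edc / edc[0]
        return 10.0 * np.log10(edc + 1e-12)

    def decay_time_s(edc_db, fs, upper_db, lower_db):
        # Fit a line to the EDC between upper_db and lower_db and extrapolate to -60 dB.
        idx = np.where((edc_db <= upper_db) & (edc_db >= lower_db))[0]
        slope, intercept = np.polyfit(idx / fs, edc_db[idx], 1)  # dB per second
        return (-60.0 - intercept) / slope

    fs = 48000
    t = np.arange(int(0.8 * fs)) / fs
    h = np.random.default_rng(0).standard_normal(t.size) * np.exp(-t / 0.1)  # toy decaying tail
    edc_db = energy_decay_curve_db(h)
    t60 = decay_time_s(edc_db, fs, upper_db=-5.0, lower_db=-35.0)   # T30-based T60 estimate
    edt = decay_time_s(edc_db, fs, upper_db=0.0, lower_db=-10.0)    # EDT: 0 to -10 dB, times 6
    print(f"T60 ~ {t60:.2f} s, EDT ~ {edt:.2f} s")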

[0119] The clarity index (expressed in dB) is a measure of the early-to-late reverberation energy ratio:

$$C_{t_c} = 10\log_{10}\left(\frac{\int_0^{t_c} h^2(t)\,dt}{\int_{t_c}^{\infty} h^2(t)\,dt}\right)$$

where h(t) is the impulse response of the room (i.e. t=0 is right after the direct response) and tc is the chosen boundary between early reflections and reverb. Typical values for tc are between 50 and 100 ms.
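A minimal sketch of how the clarity index could be computed from a discrete impulse response; the function name and the 80 ms boundary are assumptions for the example.

    import numpy as np

    def clarity_index_db(h, fs, tc=0.08):
        # Ratio of the energy before and after the early/late boundary tc (here 80 ms), in dB.
        n = int(round(tc * fs))
        early = np.sum(h[:n] ** 2)
        late = np.sum(h[n:] ** 2)
        return 10.0 * np.log10(early / (late + 1e-12))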

[0120] Further properties of the reverberation may include:
  • Room modes - Resonance frequencies of a room.
  • Reflection density - Number of reflections per second.
  • Modal density - Average number of resonances / modes per Hz.


[0121] In the current approach, acoustic environment data may be provided which is indicative of these reverberation characteristics for the room/ environment. For example, acoustic environment data comprising values for the T60, reflection density and/or modal density may be provided. These properties may, as previously mentioned, be described independently of the individual user, e.g. they may be based on a nominal or reference measurement.
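Purely by way of illustration, such acoustic environment data might be represented at the client as a simple structure; all field names and values below are hypothetical and do not correspond to any standardized bitstream syntax.

    from dataclasses import dataclass
    from typing import Sequence

    @dataclass
    class AcousticEnvironmentData:
        # Hypothetical container for the kind of metadata described above.
        band_centers_hz: Sequence[float]
        t60_per_band_s: Sequence[float]        # frequency dependent reverberation time
        reflection_density_per_s: float        # reflections per second
        modal_density_per_hz: float            # average resonances/modes per Hz

    room = AcousticEnvironmentData(
        band_centers_hz=[250.0, 1000.0, 4000.0, 8000.0],
        t60_per_band_s=[0.9, 0.7, 0.5, 0.35],
        reflection_density_per_s=8000.0,
        modal_density_per_hz=2.5,
    )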

[0122] The acoustic environment data may be provided from the central server 101 together with the audio data describing the audio signals. However, the personalization and customization, including the rendering algorithm selection and the retrieval of personalized binaural transfer functions, is performed at the individual client.

[0123] An example of such an approach is the MPEG-I (as in: MPEG Immersive) standard for AR/VR (augmented reality, virtual reality, also considered to include 'mixed reality') currently under development. In MPEG-I there is a high-level architecture with an audio renderer that takes care of rendering an immersive audio scene corresponding to the virtual environment. For AR, typically, virtual sound elements are rendered as if they were present in the physical environment of the user.

[0124] Though MPEG will standardize a default renderer that can model an acoustic environment based on metadata (either taken directly from the bitstream, analyzed from indirect data and metadata, or derived from analyzing the physical environment of the user), there will be an API to replace the default renderer. HRTFs are highly personal and are therefore not fixed either; they may be personalized or optimized by selecting from a large default set or by providing a personal set.

[0125] In the context of MPEG-I audio, current attempts seek to standardize a bitstream and decoder for AR/VR experiences with six degrees of freedom. However, it is hugely impractical to record or generate, manage and transmit sound scenes including acoustic effects at all positions for all combinations of alternative positions of the sound sources. Therefore, the standard will provide a renderer that takes audio objects, higher order ambisonics and/or audio channels together with scene description metadata and generates the acoustic effects at the decoder side.

[0126] The renderer can consist of several components, two of which will typically be a binaural renderer and a reverberation renderer module. The binaural renderer takes in audio objects, channels and/or HOA signal sets and a set of HRIRs/ HRTFs and renders these. Thus, this may correspond to the binaural processor 301 of FIG. 3. Typically, these HRTFs are anechoic or at least non-reverberant (i.e. they do not include late reverberation, or even early reflections). Especially for realistic rendering of audio sources in a 6DoF VR or AR context, the HRTFs should preferably be anechoic, with the early reflections modelled according to room properties and possibly other reflecting surfaces. The reverberation characteristics can then be modelled separately based on acoustic environment data, such as by the reverberator 209 of FIG. 3.

[0127] This setup may use a generic late reverberation simulation/rendering that allows manipulation of acoustic properties. Since the HRTF set and reverb rendering may each be replaced by external versions, the acoustic environment data can only rely on generic representations.

[0128] Therefore, the virtual room's acoustical properties will typically be described by means of room properties (e.g. dimensions and/or reverberation properties like T60) or a generic RIR rather than by configuration parameters for a specific synthetic reverberator, mainly because the properties of the user and the client-based processing are not known when the scene is designed.

[0129] Currently, for binaural rendering applications that include late reverb rendering, the entire rendering is under control of the application and the reverb is provided together with the HRTFs, representing specific acoustic properties. In these cases, the reverberation processing tool generally determines in what representation the late reverb metadata must be provided (e.g. FIR coefficients, reverb model parameters) and the reverberation and binaural processing is provided as an integrated and consistent audio representation.

[0130] However, this is not feasible in the use cases described above, e.g. with reference to MPEG-I, where the data provided by the central server 101 must be independent of the user and audio processing (including the specific binaural transfer functions used).

[0131] Indeed, in the specific context:
  • AR/VR scene description metadata from the central server 101 provides guidance for room acoustic rendering (the acoustic environment data).
  • The client/ user chooses a set of (anechoic) HRTFs (the binaural transfer functions).
  • The client application uses a late reverberation renderer to model a late reverb.


[0132] Although such an approach of locally selecting/ providing/ retrieving anechoic binaural transfer functions and supplementing this with reverberation generated in response to acoustic environment data provided from a central server may provide suitable audio rendering in many scenarios, the Inventor has realized that improved performance can be achieved by further personalizing/ adapting the reverberation component. He has further realized that this can advantageously and efficiently be done by adapting the reverberation processing based on the anechoic binaural transfer functions.

[0133] The approach of FIGs. 2 and 3 allows for a combination of the binaural transfer functions/ HRTFs with the reverb model, allowing the reverb to be matched to the specific set of HRTFs. This may provide for a personalization of the reverberation effect to match characteristics of the user as reflected by the HRTFs.

[0134] Thus, even if the HRTFs and reverberation are not provided together (or indeed have any relationship), the reverb characteristics are personalized to fit the HRTF set. It has been found that this provides a substantially improved perceived audio quality. The approach may be highly advantageous for systems such as those of MPEG-I by providing high performance while allowing acoustic environment data and binaural transfer function data to be retrieved from and generated by separate and independent sources.

[0135] It is particularly advantageous for VR applications where the acoustic environment may change frequently and dynamically while the set of binaural transfer functions are static (or semi-static).

[0136] Indeed, in a VR audio rendering context, such as that of an MPEG-I system, the reverberation may in this way be made to depend on both scene-dependent acoustic properties and on personal, anthropometric-acoustic properties, where the scene-dependent acoustic properties are provided from a different source than the HRTFs. It supports a scenario where the scene-dependent acoustic properties are time-varying, as the user moves through a scene or when a scene change occurs.

[0137] The acoustic environment data may in different embodiments comprise different parameters and values, and it will be appreciated that the specific choice of acoustic environment data will depend on the preferences and requirements of the individual embodiment. The acoustic environment data may comprise metadata that can directly or indirectly be used to configure the reverberator 209. For example, T60 time, frequency dependent T60 time, room modes, modal density, Energy Decay Curve (EDC), Energy Decay Relief (EDR), Early Decay Time, frequency dependent inter-aural coherence/ correlation, room dimensions, room shape descriptions, frequency-dependent reflective properties of walls, Room Impulse Response (RIR), etc. may be provided.

[0138] In some cases, the scene description may not provide room acoustic parameters but a RIR or BRIRs, for example as FIR or IIR filter descriptions. The reverberator 209 may then simply apply the filters. However, even in these cases, the filters or the processing are customized to the set of binaural transfer functions that are used for the direct part and early reflections.

[0139] The reverberation processing can be done in multiple ways, and this to some extent depends on which data is provided. The specific reverberation processing will depend on the preferences and requirements of the individual embodiment.

[0140] If, for example, a room impulse response (RIR) is provided as a description of the room acoustics, it may be analyzed to derive T60, EDR, modal density, etc. to (indirectly) configure a synthetic reverberator, or used as a basis for two or more uncorrelated filters.

[0141] Similarly, de-personalized BRIRs, where provided, may be used directly, or after personalization processing, as filters generating late reverberation signals.

[0142] Alternatively, the metadata may provide parameters descriptive of acoustic properties that can be used to indirectly configure a synthetic reverberator.

[0143] Early reflections up to a certain order (i.e. the number of reflections before reaching the ear) are typically modelled using techniques like ray tracing or the image source method (ISM) to obtain attenuated versions of the sound sources reaching the ear from the direction of the last reflection. Therefore, using such approaches, the HRTFs can be used directly to render the early reflections accurately, by treating each reflection as an additional source at the location of the last reflection before it reaches the ear.

[0144] With late reverberation the reflections are not isolated anymore, and the impulse response becomes diffuse, with many reflections reaching the user from all directions. Therefore, this part of the reverberation can be modelled statistically, instead of using measured impulse responses incorporating a certain acoustic environment.

[0145] Synthetic reverberation algorithms are often employed, because of the ability to modify certain properties of the acoustic simulation, and because of their relatively low computational complexity in generating a binaural rendering.

[0146] It has been shown that reverberation can be simulated by models incorporating a number of aspects that appear important for realistic reverberation. By changing parameters of these models, it is relatively easy to represent a wide range of reverberation effects. By providing means for manipulating reflection density, reverberation time and overall energy, it is possible to model different rooms. Since the models can be used to reproduce the perception of a measured BRIR, they include sufficient configuration possibilities to cover personal aspects of the late reverb.

[0147] Examples of suitable synthetic reverberators may e.g. be found in:
  [1] Jot, J. et al. (1991). Digital delay networks for designing artificial reverberators. 90th AES Convention, paper number 3030.
  [2] Jot, J. (1992). An analysis/synthesis approach to real-time artificial reverberation. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. 221-224.
  [3] Menzer, F. et al. (2009). Binaural reverberation using a modified Jot reverberator with frequency dependent interaural coherence matching. 126th AES Convention, paper number 7765.


[0148] The adaptation and personalization of the reverberation processing may be performed in many different ways. The approach may for example depend on the specific reverberation processing that has been implemented.

[0149] In many embodiments, the adapter 211 may be arranged to adapt the reverberation processing such that characteristics of the reverberation component match a corresponding characteristic of the set of binaural transfer functions.

[0150] The adaptation may be such that the reverberation processing will have the same characteristic as the binaural transfer function processing. The adapter 211 may adapt the reverberation processing such that a property matches the corresponding property of the binaural transfer functions, and specifically an average or combined property of a plurality of the set of binaural transfer functions.

[0151] The adapter 211 may specifically be arranged to adapt a property of the reverberation processing such that an identifying characteristic of the reverberation component corresponds to (or matches) an identifying characteristic of the set of binaural transfer functions, where the identifying characteristic may be a property of the binaural transfer function that depends on a given anthropometric property or a combination of anthropometric properties. The identifying characteristic may e.g. be a property of the binaural transfer function having a dependency on the widths, lengths, protrusions, etc. of the ears or head.

[0152] In some embodiments, the adapter 211 may be arranged to adapt the reverberation processing such that characteristics of the reverberation component match an anthropometrically dependent characteristic of the set of binaural transfer functions.

[0153] In many embodiments, a particularly advantageous parameter to adapt is a frequency response of the reverberation processing. Thus, in many embodiments, the property of the set of binaural transfer functions being considered is a frequency response characteristic and the adapter 211 is arranged to adapt a frequency response of the reverberation processing in response to the frequency response characteristic of the set of binaural transfer functions. The frequency response of a processing is often also referred to as a coloration of the signal being processed.

[0154] Thus, in many embodiments the ear-dependent coloration may be a main contributor to a personalized late reverberation. This may be considered a simplification that can be made because the human auditory system cannot distinguish between the many individual reflections that occur in very short time intervals in the case of late reverberation. Therefore, inter-aural phase- or level differences are not traceable to specific sound sources. The late reverberation causes a diffuse sound-field.

[0155] However, the coloration of each incoming reflection due to the filtering by the ears, head and torso does have an effect on the overall perceived coloration of the late reverberation.

[0156] Therefore, coloration may be a major component to personalize a late reverberation signal. Indeed, acoustics also affect the coloration of the reverb, but this is completely independent of the coloration by the human anthropometric properties. Any coloration by the acoustics can be represented in the acoustics metadata as a result from frequency dependent T60 or EDR, and reflectivity properties of walls and other object surfaces. The personal part of the coloration can be applied in addition to this acoustics-imposed coloration.

[0157] In order to achieve this, there are various possibilities for methods that match the reverberation's coloration to the HRTF set.

[0158] In many embodiments, the reverberator 209 may comprise an adaptable reverberator which is adaptive to the acoustic environment data but not to the binaural transfer functions. Thus, the adaptable reverberator may only consider the acoustic environment data but not the personalization data derived from the binaural transfer functions. The adaptable reverberator may not include any personalization but simply generate a reverberation based on the acoustic environment data. This may e.g. allow a low complexity implementation of the adaptable reverberator with potential for reuse of existing reverberator circuitry and algorithms.

[0159] The adaptable reverberator may be supplemented by a filter which has a frequency response that is dependent on the set of binaural transfer functions and specifically on the frequency response characteristic determined from these (e.g. the average frequency response for a plurality of binaural transfer functions). The filter may be coupled in series with the adaptable reverberator and may adapt the frequency response of the reverberation component.

[0160] The HRTF-based coloration could be performed after the reverb processing by a filtering of the generated reverberation signals as illustrated in FIG. 5.
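A minimal sketch of this arrangement, assuming the adaptable reverberator can be represented for illustration by a simple convolution with an environment-derived impulse response, and that per-ear coloration impulse responses have already been derived from the HRTF set; all names are illustrative.

    import numpy as np

    def colorize(signal, coloration_ir):
        # Apply an HRTF-derived coloration impulse response to a reverberation signal (FFT convolution).
        n = signal.size + coloration_ir.size - 1
        out = np.fft.irfft(np.fft.rfft(signal, n) * np.fft.rfft(coloration_ir, n), n)
        return out[:signal.size]

    def reverb_then_colorize(x, reverb_ir, col_left_ir, col_right_ir):
        # Environment-only reverberation followed by per-ear coloration (FIG. 5 style ordering).
        rev = np.convolve(x, reverb_ir)[:x.size]
        return colorize(rev, col_left_ir), colorize(rev, col_right_ir)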

[0161] In some embodiments it may be beneficial to perform the coloration before the late reverberation generation by the adaptable reverberator. This approach is typically beneficial in terms of computational complexity when the number of audio signals going into the adaptable reverberator is lower than the number of audio signals coming out. This approach is shown in FIG 6 and can typically be applied when the late reverberation generation is a predominantly linear process.

[0162] Similarly, the coloration processing may instead be moved to be the first processing stage of the reverberator 209, as shown in FIG. 7.

[0163] In some different embodiments, the reverberator 209 may apply a filter internally and this may allow the reverberation generation to be combined with the HRTF-based coloration. An example is shown in FIG. 8, where the late reverberation generation comprises filtering with FIR filters, for example when the acoustics metadata provides (de-personalized) BRIRs.

[0164] The FIR modification may be a modification of the magnitude spectrum of the filter. Let AL and AR be the two sets of FIR filter coefficients corresponding to a left ear BRIR filter and a right ear BRIR filter respectively. Further, let Cp,L and Cp,R be coloration FIR filters (for the left and right ear respectively) representing the HRTF metadata.

[0165] The FIR modification may, for example, be expressed by the following two equations:

$$A'_L = \mathrm{IFFT}\left(\hat{C}_{p,L}\cdot \mathrm{FFT}\left(A_L\right)\right)$$

$$A'_R = \mathrm{IFFT}\left(\hat{C}_{p,R}\cdot \mathrm{FFT}\left(A_R\right)\right)$$

where

$$\hat{C}_{p,L} = \mathrm{FFT}\left(C_{p,L}\right), \qquad \hat{C}_{p,R} = \mathrm{FFT}\left(C_{p,R}\right)$$
[0166] Note that in some embodiments, the HRTF metadata may provide one coloration filter for coloration of both ears' BRIR filters. This may be beneficial for symmetric HRTFs.

[0167] In some embodiments, the FIR modification excludes the IFFT operation and provides the FFT coefficients to the FIR filtering block for filtering in the FFT domain.
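One possible realization of such an FFT-domain modification is sketched below; whether the original BRIR phase is preserved (as assumed here) or the coloration is applied as a full complex multiplication is an implementation choice that is not prescribed by the description.

    import numpy as np

    def color_brir(a, c_p, n_fft=None):
        # Shape the magnitude spectrum of a BRIR FIR filter 'a' by a coloration filter 'c_p',
        # keeping the original phase of 'a' (an assumption of this sketch).
        n_fft = n_fft or a.size
        A = np.fft.rfft(a, n_fft)
        c_mag = np.abs(np.fft.rfft(c_p, n_fft))
        return np.fft.irfft(np.abs(A) * c_mag * np.exp(1j * np.angle(A)), n_fft)

    # a_left_mod  = color_brir(a_left,  c_p_left)    # A_L modified by C_p,L
    # a_right_mod = color_brir(a_right, c_p_right)   # A_R modified by C_p,R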

[0168] In some embodiments, the reverberator 209 may comprise a synthetic reverberator and the adapter 211 may be arranged to adapt a processing parameter of the synthetic reverberator in response to the frequency response characteristic. A synthetic reverberator is specifically a reverberator that models the many reflections of an audio source in a room by a combination of simple processing blocks, such as gains, low order filters, delays and feedback loops. The feedback loops typically include delays that cause repetitions of an input signal with those delays. Gains (0 < g < 1) or filters in the feedback loops cause the repetitions to dissipate as the signal component spends more time in the loop, similar to the decaying energy of an audio wave component as it progresses through a room and reflects off walls. These filters control the simulation of a (frequency dependent) T60. By having multiple feedback loops with varying delays and distributing signals from each loop to the other loops, the periodicity due to the fixed delays is broken up, creating a chaotic, diffuse reverb tail as would occur in a real environment. Further filtering steps, typically before or after the feedback loops, may impose coloration due to a combination of reflectivity properties.
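To illustrate the kind of structure described, the following is a minimal feedback-delay-network sketch with four delay lines, per-loop gains set from a broadband T60, and an orthogonal feedback matrix; the delay lengths and matrix are arbitrary example choices, and a practical synthetic reverberator would use frequency dependent absorption and coloration filters as discussed in the surrounding paragraphs.

    import numpy as np

    def fdn_reverb(x, fs, t60=0.8, delays_ms=(29.7, 37.1, 41.1, 43.7)):
        delays = [int(d * fs / 1000) for d in delays_ms]
        gains = [10.0 ** (-3.0 * d / (fs * t60)) for d in delays]   # per-loop decay giving ~T60
        A = 0.5 * np.array([[1, 1, 1, 1],
                            [1, -1, 1, -1],
                            [1, 1, -1, -1],
                            [1, -1, -1, 1]], dtype=float)           # orthogonal feedback matrix
        buffers = [np.zeros(d) for d in delays]
        idx = [0, 0, 0, 0]
        y = np.zeros(x.size)
        for n in range(x.size):
            outs = np.array([g * buf[i] for g, buf, i in zip(gains, buffers, idx)])
            y[n] = outs.sum()
            fb = A @ outs
            for k in range(4):
                buffers[k][idx[k]] = x[n] + fb[k]
                idx[k] = (idx[k] + 1) % delays[k]
        return y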

[0169] In embodiments where a synthetic reverberator is used for the late reverberation generation, the coloration may be integrated into an operation of the synthetic reverberator. For example, the Jot reverberator as shown in FIG. 9 already has filtering to control the reverberation signals' coloration per ear (filters tL and tR in FIG. 9). A detailed discussion on the Jot reverberator and the synthesis of its filters can be found in the references previously given.

[0170] Let Cp,L(z) and Cp,R(z) be coloration filters representing the HRTF metadata; the synthetic reverberation's coloration filters can then be updated to impose the personalized coloration:

$$t'_L(z) = t_L(z)\,C_{p,L}(z)$$

$$t'_R(z) = t_R(z)\,C_{p,R}(z)$$
[0171] Also, here, a single Cp(z) may be provided for coloration of both ears to represent symmetry in an HRTF set.

[0172] In many embodiments, a particularly advantageous parameter to adapt is an inter-ear correlation property of the reverberation processing. Thus, in many embodiments, the property of the set of binaural transfer functions being considered is an inter-ear correlation property and the adapter 211 is arranged to adapt an inter-ear correlation property of the reverberation processing in response to the inter-ear correlation property of the set of binaural transfer functions.

[0173] The inter-ear correlation property may reflect a coherency or cross-correlation between signals/ processing corresponding to the right and left ears of the listener. The inter-ear correlation property for a plurality of binaural transfer functions may be determined by comparing the binaural transfer functions for the left and right ears for a plurality of positions, and the reverberation processing may then be set to provide the same inter-ear correlation property. The determination and setting of the inter-ear coherency may typically be frequency dependent and the inter-ear coherency may typically be independently determined and set in different frequency bands/ bins.

[0174] It should be noted that the term correlation property comprises all measures indicative of correlation between signals including correlation coefficients and coherence and specifically all properties that can be derived from cross-correlation measures.

[0175] The inter-ear correlation property can be derived from the (normalized) complex cross-correlation, for example:

$$\rho(\omega) = \frac{H_L(\omega)\,H_R^*(\omega)}{\sqrt{\left|H_L(\omega)\right|^2\,\left|H_R(\omega)\right|^2}}$$

where HL and HR may represent a left and right ear spectrum, corresponding to an HRTF or BRIR.

[0176] The correlation coefficient (usually denoted as the correlation) is the real value of the complex cross-correlation (for no lag), and the coherence is the magnitude of the complex cross-correlation. In this text, the term correlation property includes both the correlation coefficient and coherence. Typical embodiments may use one or both of these and the following references to coherence may be considered to mutatis mutandis apply to correlation coefficients or indeed any other correlation property.

[0177] Inter-aural coherence is an important contributor to reverb personalization. Further, it is advantageous to control inter-aural coherence in a frequency dependent manner.

[0178] Synthetic reverberators like the Jot reverberator shown in FIG. 9 have means to control the Frequency Dependent Inter-aural Coherence (FDIC). More details on how to configure the filters c1(z) and c2(z) can be found in reference [3] indicated above.

[0179] Assuming reverberation causes a fully diffuse sound field, any correlation between the left and right ear is introduced by the HRTFs. Therefore, in many embodiments c1(z) and c2(z) would be fully defined by the HRTF metadata and not by the acoustic environment data.

[0180] Alternatively, the approach used to control the FDIC by the Jot reverberator can be applied to any source of two (at least partially) decorrelated reverb signals, where decorrelated means the signals have a very low cross-correlation yet have highly similar temporal and spectral profiles. For example, late reverberation processing blocks may have one or more pairs of decorrelated late reverberation impulse responses, or the acoustics metadata may provide such decorrelated late reverberation impulse responses.

[0181] By applying filters c1 and c2 on a pair of decorrelated signals and combining the resulting output signals as shown in FIG. 10, the FDIC is controlled by the filters, for example with

$$c_1(\omega) = \sqrt{\frac{1+\Phi(\omega)}{2}}, \qquad c_2(\omega) = \sqrt{\frac{1-\Phi(\omega)}{2}}$$

where Φ(ω) is the desired frequency dependent coherence between the output channels.

[0182] FIG. 10 illustrates an example of how two decorrelated signals can be combined to control the coherence of the output signals. In some embodiments, the reverberator 209 may comprise a combiner which, e.g. in this way, generates a pair of partially correlated signals from a pair of uncorrelated signals generated from the set of input audio signals. The resulting partially correlated signals may then be used to generate the output audio signals, or may indeed directly be used as the reverberation components (e.g. if the combiner is the last processing block of the reverberator 209).

[0183] The combiner may accordingly adapt the correlation as shown above to provide the desired inter-ear correlation property.
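A sketch of this combining step, assuming two already decorrelated reverberation signals of equal length and a target coherence Φ given per rfft bin; processing the whole signal with a single FFT (rather than an STFT) is a simplification made only for illustration.

    import numpy as np

    def mix_to_target_coherence(s1, s2, phi):
        # s1, s2: decorrelated reverb signals; phi: target coherence per rfft bin, values in [0, 1].
        n = s1.size
        S1, S2 = np.fft.rfft(s1), np.fft.rfft(s2)
        c1 = np.sqrt((1.0 + phi) / 2.0)
        c2 = np.sqrt((1.0 - phi) / 2.0)
        y_left = np.fft.irfft(c1 * S1 + c2 * S2, n)
        y_right = np.fft.irfft(c1 * S1 - c2 * S2, n)
        return y_left, y_right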

[0184] In other embodiments, a single reverb signal may be produced by e.g. a synthetic reverberator only considering the acoustic environment data. A decorrelated signal may then be generated from this reverberation signal resulting in the generation of a stereo reverberation signal with the two signals being decorrelated. A pair of partially correlated signals may then be generated from these two signals, e.g. by a mixer performing a weighted summation for each of the partially correlated signals. The weights may be determined to provide the desired correlation as determined from the binaural transfer functions, i.e. such that the correlation of the resulting reverberation component matches that of the set of binaural transfer functions.

[0185] The binaural transfer functions may be analyzed directly in order to determine a suitable property (or properties) for adapting and personalizing the reverberator 209. In other embodiments, the property may be determined in response to e.g. metadata describing parameters of the binaural transfer functions.

[0186] Specifically, the HRTF metadata input may comprise metadata that can directly or indirectly be used for matching the late reverberation to the HRTFs used for the rendering of the direct path and/or early reflections. For example, any subset of the HRTF set, parameters or metadata provided along with the HRTFs, or parameters or metadata extracted from analyzing the HRTFs may be used.

[0187] In some embodiments, the HRTFs may be directly analyzed to derive personalization information, which could be provided as the HRTF metadata. Alternatively, the HRTF set (or a subset) is provided as HRTF metadata and the analysis is performed inside the late reverberation processing block.

[0188] Since late reverberation sound arrives from all directions, the coloration and coherence will typically be derived by combining information from individual HRTF pairs representing all directions, for example by averaging. In a fully diffuse sound field (like the late reverberation) the power arriving from each direction is equal. Therefore, no direction-dependent weighting needs to be applied.

[0189] Thus, in many embodiments, the audio apparatus may be arranged to determine a property for adapting the reverberation processing in response to a combination of properties for a plurality of binaural transfer functions of the set of binaural transfer functions for different positions.

[0190] As a specific example, for coloration the following analysis may be used in many embodiments:

$$C_{p,L}(\omega) = \sqrt{\frac{1}{|M|}\sum_{i\in M} H_{i,L}(\omega)\,H_{i,L}^*(\omega)}, \qquad C_{p,R}(\omega) = \sqrt{\frac{1}{|M|}\sum_{i\in M} H_{i,R}(\omega)\,H_{i,R}^*(\omega)}$$

with Cp,L and Cp,R the coloration spectra for the left and right ear respectively, and where Hi,L and Hi,R may represent a left and right ear spectrum respectively corresponding to the HRTF with index i, and H* represents the complex conjugate of H.

[0191] M represents the set of HRTFs involved in the analysis. This may be a subset of a larger set of HRTFs, for example to reduce computational complexity. It may also be a subset that is (approximately) equally distributed over a sphere's surface. In some embodiments the subset M may be different for the left and right ear, for example to include only the ipsilateral responses. In some embodiments the subset may be influenced by the location in a room, or by room properties; for example, when the user is very close to a wall, or when a wall is missing, HRTFs corresponding to that direction may be excluded.
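A possible implementation of this analysis, assuming the HRTF set is available as an array of complex spectra of shape (number of directions, 2 ears, number of frequency bins); the averaging and the optional shared spectrum correspond to the equations above, while the array layout and names are assumptions of this sketch.

    import numpy as np

    def coloration_spectra(hrtf_set, subset=None):
        # hrtf_set: complex array, shape (num_directions, 2, num_bins); subset: indices forming M.
        H = hrtf_set if subset is None else hrtf_set[subset]
        power = np.mean(np.abs(H) ** 2, axis=0)            # mean of H * conj(H) over directions
        c_left, c_right = np.sqrt(power[0]), np.sqrt(power[1])
        c_single = np.sqrt(0.5 * (power[0] + power[1]))    # shared spectrum for symmetric sets
        return c_left, c_right, c_single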

[0192] If HRTFs are symmetric (generally the case when a person's head and ears are more or less symmetric), the values for Cp,L and Cp,R will not differ much; in such cases it may be advantageous to calculate a single coloration, for example by choosing one of the above, or according to:

$$C_{p}(\omega) = \sqrt{\frac{1}{2\,|M|}\sum_{i\in M}\left( H_{i,L}(\omega)\,H_{i,L}^*(\omega) + H_{i,R}(\omega)\,H_{i,R}^*(\omega)\right)}$$
[0193] It is advantageous to have each direction equally represented. Therefore, the above equations work best if the subset of HRTFs, M, contains HRTFs more or less equally distributed on a sphere. If this is not the case, this can be compensated for with a weighting factor. For example:

$$C_{p,L}(\omega) = \sqrt{\frac{\sum_{i\in M} w_i\,H_{i,L}(\omega)\,H_{i,L}^*(\omega)}{\sum_{i\in M} w_i}}, \qquad C_{p,R}(\omega) = \sqrt{\frac{\sum_{i\in M} w_i\,H_{i,R}(\omega)\,H_{i,R}^*(\omega)}{\sum_{i\in M} w_i}}$$
[0194] In some embodiments, coloration may be derived from a single HRTF pair. This will typically be suboptimal, but may be beneficial in scenarios where the analysis must have very low computational complexity. In such cases it is typically best to choose the HRTF pair from the median plane.

[0195] Assuming reverberation causes a fully diffuse sound field, any correlation between the left and right ear is introduced by the HRTFs. Therefore, the personalized frequency-dependent inter-aural coherence for late reverberation may be derived from a set of M HRTFs equally spaced on a sphere according to:

$$\Phi(\omega) = \frac{\left|\sum_{i\in M} H_{i,L}(\omega)\,H_{i,R}^*(\omega)\right|}{\sqrt{\sum_{i\in M}\left|H_{i,L}(\omega)\right|^2\;\sum_{i\in M}\left|H_{i,R}(\omega)\right|^2}}$$

where Hi,L and Hi,R may represent a left and right ear spectrum respectively corresponding to the HRTF with index i. Similar to the coloration, the M HRTFs may be a subset of a larger set of HRTFs.

[0196] For HRTF sets not equally distributed over a sphere the calculation could, for example, be done according to:

$$\Phi(\omega) = \frac{\left|\sum_{i\in M} w_i\,H_{i,L}(\omega)\,H_{i,R}^*(\omega)\right|}{\sqrt{\sum_{i\in M} w_i\left|H_{i,L}(\omega)\right|^2\;\sum_{i\in M} w_i\left|H_{i,R}(\omega)\right|^2}}$$

with wi a weighting factor to compensate for the non-equal spacing and/or other aspects, for example when certain HRTF pairs were measured at different distances than others.
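The corresponding (optionally weighted) coherence analysis might look as follows, under the same assumed array layout as in the earlier coloration sketch; the weights compensate for a non-uniform distribution of measurement directions.

    import numpy as np

    def fdic_from_hrtfs(hrtf_set, weights=None, eps=1e-12):
        # hrtf_set: complex array, shape (num_directions, 2, num_bins); returns coherence per bin.
        H_L, H_R = hrtf_set[:, 0, :], hrtf_set[:, 1, :]
        w = (np.ones(H_L.shape[0]) if weights is None else np.asarray(weights))[:, None]
        cross = np.sum(w * H_L * np.conj(H_R), axis=0)
        p_left = np.sum(w * np.abs(H_L) ** 2, axis=0)
        p_right = np.sum(w * np.abs(H_R) ** 2, axis=0)
        return np.abs(cross) / (np.sqrt(p_left * p_right) + eps)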

[0197] For both coloration and coherence further processing may be applied such as smoothing, dynamic range modification on the (coloration) spectrum, biasing of the inter-aural coherence to lower or higher values, etc.

[0198] In case the HRTF metadata for late reverberation personalization is sent as part of the HRTF representation, or when BRIRs or late reverberation impulse responses are available, such embodiments may analyze the BRIR or late reverberation directly, rather than through the analysis of a set of HRTFs. In such cases it may be necessary to eliminate, or compensate for, any influence of room acoustics; this is especially true for the coloration analysis.

[0199] An alternative approach is to average all HRTF impulse responses and derive the coloration spectrum and coherence from the mix.



where b is a frequency band index (for example following the ERB scale) whose start bin is given by the function (or vector) s(b).

[0200] It will be appreciated that the above description for clarity has described embodiments of the invention with reference to different functional circuits, units and processors. However, it will be apparent that any suitable distribution of functionality between different functional circuits, units or processors may be used without detracting from the invention. For example, functionality illustrated to be performed by separate processors or controllers may be performed by the same processor or controllers. Hence, references to specific functional units or circuits are only to be seen as references to suitable means for providing the described functionality rather than indicative of a strict logical or physical structure or organization.

[0201] The invention can be implemented in any suitable form including hardware, software, firmware or any combination of these. The invention may optionally be implemented at least partly as computer software running on one or more data processors and/or digital signal processors. The elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. Indeed the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit or may be physically and functionally distributed between different units, circuits and processors.

[0202] Although the present invention has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present invention is limited only by the accompanying claims. Additionally, although a feature may appear to be described in connection with particular embodiments, one skilled in the art would recognize that various features of the described embodiments may be combined in accordance with the invention. In the claims, the term comprising does not exclude the presence of other elements or steps.

[0203] Furthermore, although individually listed, a plurality of means, elements, circuits or method steps may be implemented by e.g. a single circuit, unit or processor. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. Also, the inclusion of a feature in one category of claims does not imply a limitation to this category but rather indicates that the feature is equally applicable to other claim categories as appropriate. Furthermore, the order of features in the claims does not imply any specific order in which the features must be worked and in particular the order of individual steps in a method claim does not imply that the steps must be performed in this order. Rather, the steps may be performed in any suitable order. In addition, singular references do not exclude a plurality. Thus references to "a", "an", "first", "second" etc. do not preclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.


Claims

1. An audio apparatus comprising:

a first receiver (201) for receiving a set of input audio signals from a first source;

a second receiver (203) for receiving acoustic environment data from a second source;

a third receiver (205) for receiving binaural transfer function data from a third source, the binaural transfer function data being indicative of a set of binaural transfer functions;

a renderer (207) for generating output audio signals from the set of input audio signals;
the renderer comprising a reverberator arranged to generate a reverberation component of the output audio signals by applying reverberation processing to the set of input audio signals in response to the acoustic environment data; and

an adapter (211) for adapting a first property of the reverberation processing in response to a second property of the set of binaural transfer functions.


 
2. The audio apparatus of claim 1 wherein the adapter (211) is arranged to adapt the reverberation processing such that a characteristic of the reverberation component matches a corresponding characteristic of the set of binaural transfer functions.
 
3. The audio apparatus of any previous claim where the second property is a frequency response characteristic for the set of binaural transfer functions, and the adapter (211) is arranged to adapt a frequency response of the reverberation processing in response to the frequency response characteristic.
 
4. The audio apparatus of claim 3 wherein the reverberator (209) comprises an adaptable reverberator adaptive to acoustic environment data and a filter having a frequency response dependent on the frequency response characteristic.
 
5. The audio apparatus of claim 3 wherein the reverberator (209) comprises a synthetic reverberator and the adapter (211) is arranged to adapt a processing parameter of the synthetic reverberator in response to the frequency response characteristic.
 
6. The audio apparatus of any previous claim wherein the second property comprises an inter-ear correlation property for the set of binaural transfer functions and the first property comprises an inter-ear correlation property for the reverberation processing.
 
7. The audio apparatus of claim 6 wherein the inter-ear correlation property for the set of binaural transfer functions and the first inter-ear correlation property for the reverberation processing are frequency dependent.
 
8. The audio apparatus of claim 6 or 7 wherein the reverberator (209) is arranged to generate a pair of partially correlated signals from a pair of substantially uncorrelated signals generated from the set of input audio signals, and to generate the output audio signals from the partially correlated signals, and the adapter (211) is arranged to adapt a correlation between the output audio signals in response to the inter-ear correlation property for the set of binaural transfer functions.
 
9. The audio apparatus of claim 6 or 7 wherein the reverberator (209) comprises a decorrelator for generating a substantially decorrelated signal from a first signal derived from the set of input audio signals; and the reverberator (209) is arranged to generate a pair of partially correlated signals from the decorrelated signal and the first signal and to generate the output audio signals from the partially correlated signals, the adapter (211) being arranged to adapt a correlation between the output audio signals in response to the inter-ear correlation property for the set of binaural transfer functions.
 
10. The audio apparatus of any previous claim wherein the adapter (211) is arranged to determine the second property of the set of binaural transfer functions in response to a combination of properties for a plurality of binaural transfer functions of the set of binaural transfer functions for different positions.
 
11. The audio apparatus of any previous claim wherein the second receiver (203) is arranged to receive dynamically changing acoustic environment data and the third receiver (205) is arranged to receive static binaural transfer function data.
 
12. The audio apparatus of any previous claim wherein the second source is different from the third source.
 
13. The audio apparatus of any previous claim wherein the renderer (207) further comprises a binaural processor (301) for generating an early reflection component of the output audio signals in response to at least one of the set of binaural transfer functions.
 
14. A method of audio processing comprising:

receiving a set of input audio signals from a first source;

receiving acoustic environment data from a second source;

receiving binaural transfer function data from a third source, the binaural transfer function data being indicative of a set of binaural transfer functions;

generating output audio signals from the set of input audio signals, the generating comprising generating a reverberation component of the output audio signals by applying reverberation processing to the set of input audio signals in response to the acoustic environment data; and

adapting a first property of the reverberation processing in response to a second property of the set of binaural transfer functions.


 
15. A computer program product comprising computer program code means adapted to perform all the steps of claim 14 when said program is run on a computer.
 




Drawing
Search report

Cited references

REFERENCES CITED IN THE DESCRIPTION



This list of references cited by the applicant is for the reader's convenience only. It does not form part of the European patent document. Even though great care has been taken in compiling the references, errors or omissions cannot be excluded and the EPO disclaims all liability in this regard.

Non-patent literature cited in the description