FIELD OF THE INVENTION
[0001] The invention relates to the field of audio signal processing. More specifically,
the invention provides a processor and a method for converting a multi-channel audio
signal, such as a sound field signal, into another type of multi-channel audio signal
suited for playback via headphones or loudspeakers, while preserving spatial information
in the original signal.
BACKGROUND OF THE INVENTION
[0002] The use of B-format measurements, recordings and playback in the provision of more
ideal acoustic reproductions which capture part of the spatial characteristics of
an audio reproduction is well known.
[0003] In the case of conversion of B-format signals to multiple loudspeakers in a loudspeaker
array, there is a well recognized problem due to the spreading of individual virtual
sound sources over a large number of playback speaker elements. In the case of binaural
playback of B-format signals, the approximations inherent in the B-format sound field
can lead to less precise localization of sound sources, and a loss of the out-of-head
sensation that is an important part of the binaural playback experience.
[0004] In the prior art, filter banks are used to split each component of the spatial sound
field set into a set of frequency bands. The short-term correlation between the W
(omni) channel and each of the three other channels is then used to estimate the direction
of arrival of sound within each frequency band. The input signal is split into two parts:
one consisting of the frequency bands where a clear direction of arrival was detected,
and one consisting of the remainder of the frequency bands. The first part of the
signal is processed through head-related transfer functions corresponding to the estimated
direction of arrival in each frequency band. The second part is processed through
a linear decoding matrix and a fixed set of head-related transfer functions corresponding
to a virtual loudspeaker array.
[0005] US 6,628,787 by Lake Technology Ltd. describes a specific method for creating a multi-channel
or binaural signal from a B-format sound-field signal. The sound-field signal is split
into frequency bands, and in each band a direction factor is determined. Based on
the direction factor, speaker drive signals are computed for each band by panning
the signals to drive the nearest speaker. In addition, residual signal components
are apportioned to the speaker signals by means of known decoding techniques.
[0006] There are several problems with the known methods. Firstly, the direction estimate
is generally missing or incorrect in the case where more than a single sound source
emits sound at the same time and within the same frequency band. This leads to imprecise
or incorrect localization when there is more than one sound source present and when
echoes interfere with the direct sound from a single source. Secondly, the use of
head-related transfer functions from different directions in different frequency bands
leads to a phase mismatch which increases proportionally with frequency. This in turn
leads to two problems: Firstly, the group delay of high-frequency input signals does
not correspond to the group delay encoded in the head-related transfer functions.
This gives the wrong inter-aural time difference and therefore inaccurate localization.
Secondly, the temporal evolution of the input signal is distorted, leading to poor
reproduction of such transient sounds as applause and percussion instruments.
SUMMARY OF THE INVENTION
[0007] In view of the above, it may be seen as an object of the present invention to provide
a processor and a method for converting a multi-channel audio input, such as a B-format
sound field input, into an audio output suited for playback over headphones or via
loudspeakers, while still preserving the substantial spatial information contained
in the original multi-channel input.
[0008] In a first aspect, the invention provides an audio processor arranged to convert
a multi-channel audio input signal comprising at least two channels, such as a B-format
Sound Field signal, into a set of audio output signals, such as a set of two audio
output signals arranged for headphone reproduction, the audio processor comprising
- a filter bank arranged to separate the input signal into a plurality of frequency
bands, such as partially overlapping frequency bands,
- a sound source separation unit arranged, for at least a part of the plurality of frequency
bands, to
- perform a plane wave expansion computation on the multi-channel audio input signal
so as to determine at least one dominant direction corresponding to a direction of
a dominant sound source in the audio input signal,
- determine an array of at least two, such as four, virtual loudspeaker positions selected
such that one or more of the virtual loudspeaker positions at least substantially
coincides, such as precisely coincides, with the at least one dominant direction,
and
- decode the audio input signal into virtual loudspeaker signals corresponding to each
of the virtual loudspeaker positions, and
- a summation unit arranged to sum the virtual loudspeaker signals for the at least
part of the plurality of frequency bands to arrive at the set of audio output signals.
[0009] Such an audio processor provides an advantageous conversion of the multi-channel input
signal due to the combination of plane wave expansion extraction of directions for
dominant sound sources for each frequency band and the selection of at least one virtual
loudspeaker position coinciding with a direction for at least one dominant sound source.
For example, this provides a virtual loudspeaker signal highly suited for generation
of a binaural output signal by applying Head-Related Transfer Functions to the virtual
loudspeaker signals. The reason is that it is ensured that a dominant sound source
is represented in the virtual loudspeaker signal by its direction, whereas prior art
systems with a fixed set of virtual loudspeaker positions will in general split such
dominant sound source between the nearest fixed virtual loudspeaker positions. When
applying Head-Related Transfer Functions, this means that the dominant sound source
will be reproduced through two sets of Head-Related Transfer Functions corresponding
to the two fixed virtual loudspeaker positions which results in a rather blurred spatial
image of the dominant sound source. According to the invention, the dominant sound
source will be reproduced through one set of Head-Related Transfer Functions corresponding
to its actual direction, thereby resulting in an optimal reproduction of the 3D spatial
information contained in the original input signal.
[0010] Thus, in a preferred embodiment, the audio processor is arranged to generate the
set of audio output signals such that it is arranged for playback over headphones,
e.g. by applying Head-Related Transfer Functions, or other known ways of creating
a spatial effect based on a single input signal and its direction.
[0011] The filter bank may comprise at least 500, such as 1000 to 5000, preferably partially
overlapping filters covering the frequency range of 0 Hz to 22 kHz. For example,
an FFT analysis with a window length of 2048 to 8192 samples, i.e. 1024-4096 bands
covering 0-22050 Hz, may be used. However, it is appreciated that the invention may
also be performed with fewer filters if a reduced performance is accepted.
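For illustration only, the following minimal sketch (Python/NumPy) shows how such an FFT-based filter bank could be realized; the function name, the Hann window and the default window length are illustrative assumptions and not part of the invention:

```python
import numpy as np

def fft_filter_bank(b_format_block, window_length=4096):
    """Split a block of a 4-channel B-format signal (shape 4 x window_length)
    into frequency bands by a windowed real FFT per channel.

    Returns complex values of shape 4 x (window_length // 2 + 1), i.e. one
    value per channel and frequency band (about 2049 bands for a 4096-sample
    window, covering 0-22050 Hz at a 44.1 kHz sampling rate)."""
    window = np.hanning(window_length)
    return np.fft.rfft(b_format_block * window, axis=1)
```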
[0012] The sound source separation unit preferably determines the at least one dominant
direction in each frequency band for each time frame, such as a time frame having
a size of 2,000 to 10,000 samples, e.g. 2048-8192, as mentioned. However, it is to
be understood that a lower update rate of the dominant direction may be used if a
reduced performance is accepted.
[0013] For audio input signals where a panning technique is used to position sound sources,
such as stereo recordings or surround sound recordings, less spatial information is
present, in comparison with a B-format input. To compensate, the filter outputs from
two consecutive time frames should be sent to the sound source separation unit. This
is preferably achieved with a plurality of delay elements.
[0014] The virtual loudspeaker positions may be selected by a rotation of a set of at least
two positions in a fixed spatial interrelation. Especially, the set of positions in
a fixed spatial interrelation comprises four positions, such as four positions arranged
in a tetrahedron, which has been found to provide an accurate localization of the
sound sources without increasing the noise level of the recorded signal.
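As a sketch of one possible realization of such a rotation (assuming a regular tetrahedron and Rodrigues' rotation formula; the construction and the names are illustrative, not taken from the invention):

```python
import numpy as np

# Four virtual loudspeaker positions in a fixed spatial interrelation
# (vertices of a regular tetrahedron, expressed as unit vectors).
TETRAHEDRON = np.array([[ 1.0,  1.0,  1.0],
                        [ 1.0, -1.0, -1.0],
                        [-1.0,  1.0, -1.0],
                        [-1.0, -1.0,  1.0]]) / np.sqrt(3.0)

def rotate_to_dominant_direction(dominant_dir):
    """Rotate the fixed set so that its first position coincides with the
    estimated dominant direction, using Rodrigues' rotation formula."""
    d = dominant_dir / np.linalg.norm(dominant_dir)
    v = TETRAHEDRON[0]
    axis = np.cross(v, d)
    s, c = np.linalg.norm(axis), float(np.dot(v, d))
    if s < 1e-12:                       # already aligned (or exactly opposite)
        return TETRAHEDRON if c > 0 else -TETRAHEDRON
    k = axis / s
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    R = np.eye(3) + s * K + (1.0 - c) * (K @ K)
    return TETRAHEDRON @ R.T            # rotated virtual loudspeaker positions
```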
[0015] The plane wave expansion may determine two dominant directions, in which case the array
of at least two virtual loudspeaker positions is selected such that two of the virtual
loudspeaker positions at least substantially coincide, such as precisely coincide,
with the two dominant directions. Hereby it is ensured that the two most dominant
sound sources in a frequency band are as precisely spatially represented as possible,
thus leading to the best possible spatial reproduction of audio material with two
dominant sound sources spatially distributed, e.g. two singers or two musical instruments
playing at the same time.
[0016] In order to generate a binaural two-channel output signal, the audio processor may
comprise a binaural synthesizer unit arranged to generate first and second audio output
signals by applying Head-Related Transfer Functions to each of the virtual loudspeaker
signals. Especially, such audio processor may be implemented by a decoding matrix
corresponding to the determined virtual loudspeaker positions and a transfer function
matrix corresponding to the Head-Related Transfer Functions being combined into an
output transfer matrix prior to being applied to the audio input signals. Hereby a
smoothing may be performed on transfer functions of such output transfer matrix prior
to being applied to the input signals, which will serve to improve reproduction of
transient sounds.
[0017] In a preferred embodiment, the phase of the Head-Related Transfer Functions is differentiated
with respect to frequency, and after combining components of Head-Related Transfer
Functions corresponding to different directions, the phase of the combined transfer
functions is integrated with respect to frequency. This serves to preserve the group
delay of the Head-Related Transfer Functions. Even more specifically, the phase of
the Head-Related Transfer Functions may be differentiated with respect to frequency
at frequencies above a frequency limit only, such as above 1.8 kHz, and after combining
components of Head-Related Transfer Functions corresponding to different directions,
the phase of the combined transfer functions is integrated with respect to frequency
at frequencies above the frequency limit. Hereby, only the phase at higher frequencies
is manipulated, and thus at lower frequencies where the interaural phase difference
is significant, the phase is left unchanged. In a more specific embodiment, the phase
of the Head-Related Transfer Functions may be left unaltered below a first frequency
limit, such as below 1.6 kHz, and differentiated with respect to frequency at frequencies
above a second frequency limit with a higher frequency than the first frequency limit,
such as 2.0 kHz, and with a gradual transition in between, and after combining components
of HRTFs corresponding to different directions, the inverse operation is applied to
the combined function.
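A minimal sketch of this phase manipulation is given below (Python/NumPy). For brevity it uses a hard frequency limit of 1.8 kHz instead of the gradual transition described above, and all names are illustrative assumptions:

```python
import numpy as np

def differentiate_phase(hrtf, freqs, f_limit=1800.0):
    """Replace the phase of the HRTF bins above f_limit by the band-to-band
    phase difference (a discrete derivative with respect to frequency),
    leaving amplitude and low-frequency phase unchanged."""
    amp, phase = np.abs(hrtf), np.unwrap(np.angle(hrtf))
    hi = freqs >= f_limit
    out = phase.copy()
    out[hi] = np.diff(phase[hi], prepend=0.0)
    return amp * np.exp(1j * out)

def integrate_phase(hrtf, freqs, f_limit=1800.0):
    """Inverse operation, applied after HRTF components corresponding to
    different directions have been combined: re-integrate the
    high-frequency phase by a cumulative sum over the bands."""
    amp, phase = np.abs(hrtf), np.angle(hrtf)
    hi = freqs >= f_limit
    out = phase.copy()
    out[hi] = np.cumsum(phase[hi])
    return amp * np.exp(1j * out)
```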
[0018] The audio input signal is preferably a multi-channel audio signal arranged for decomposition
into plane wave components. Especially, the input signal may be one of: a B-format
sound field signal, a higher-order ambisonics recording, a stereo recording, and a
surround sound recording.
[0019] In a second aspect, the invention provides a device comprising an audio processor
according to the first aspect. Especially, the device may be one of: a device for
recording sound or video signals, a device for playback of sound or video signals,
a portable device, a computer device, a video game device, a hi-fi device, an audio
converter device, and a headphone unit.
[0020] In a third aspect, the invention provides a method for converting a multi-channel
audio input signal comprising at least two channels, such as a B-format Sound Field
signal, into a set of audio output signals, such as a set of two audio output signals
arranged for headphone reproduction, the method comprising
- separating the input signal into a plurality of frequency bands, such as partially
overlapping frequency bands,
- performing a sound source separation for at least a part of the plurality of frequency
bands, comprising
- performing a plane wave expansion computation on the multi-channel audio input signal
so as to determine at least one dominant direction corresponding to a direction of
a dominant sound source in the audio input signal,
- determining an array of at least two, such as four, virtual loudspeaker positions
selected such that one or more of the virtual loudspeaker positions at least substantially
coincides, such as precisely coincides, with the at least one dominant direction,
and
- decoding the audio input signal into virtual loudspeaker signals corresponding to
each of the virtual loudspeaker positions, and
- summing the virtual loudspeaker signals for the at least part of the plurality of
frequency bands to arrive at the set of audio output signals.
[0021] The method may be implemented in pure software, e.g. in the form of a generic code
or in the form of a processor specific executable code. Alternatively, the method
may be implemented partly in specific analog and/or digital electronic components
and partly in software. Still alternatively, the method may be implemented in a single
dedicated chip.
[0022] It is appreciated that two or more of the mentioned embodiments can advantageously
be combined. It is also appreciated that embodiments and advantages mentioned for
the first aspect apply as well to the second and third aspects.
BRIEF DESCRIPTION OF THE DRAWING
[0023] Embodiments of the invention will be described, by way of example only, with reference
to the drawings.
Fig. 1 illustrates basic components of one embodiment of the audio processor,
Fig. 2 illustrates details of an embodiment for converting a B-format sound field
signal into a binaural signal,
Fig. 3 illustrates a possible implementation of the transfer matrix generator referred
to in Fig. 2, and
Fig. 4 illustrates an audio device with an audio processor according to the invention.
DESCRIPTION OF EMBODIMENTS
[0024] Fig. 1 shows the basic components of an audio processor according to the
invention. Input to the audio processor is a multi-channel audio signal. This signal
is split into a plurality of frequency bands in a filter bank, e.g. in the form of
an FFT analysis performed on each of the plurality of channels. A sound source separation
unit SSS then operates on the frequency-separated signal. First, a plane wave
expansion calculation PWE is performed on each frequency band in order to determine
one or two dominant sound source directions. The one or two dominant sound source
directions are then applied to a virtual loudspeaker position calculation algorithm
VLP serving to select a set of virtual sound source or virtual loudspeaker directions,
e.g. by rotation of a fixed set of virtual loudspeaker directions, such that the one
or two dominant sound source directions coincide with respective
virtual loudspeaker directions. Then, the input signal is transferred or decoded DEC
according to a decoding matrix corresponding to the selected virtual loudspeaker directions,
and optionally Head-Related Transfer Functions corresponding to the virtual loudspeaker
directions are applied before the frequency components are finally combined in a summation
unit SU to form a set of output signals, e.g. two output signals in case of a binaural
implementation, or such as four, five, six, seven or even more output signals in case
of conversion to a format suitable for reproduction through a surround sound set-up
of loudspeakers.
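As an aside, a strongly simplified direction estimate for a single frequency band is sketched below (Python/NumPy). It uses the active acoustic intensity vector of the B-format bin values and therefore yields only one direction per band, whereas the plane wave expansion PWE of the invention can resolve two simultaneous plane waves; the sketch is illustrative only:

```python
import numpy as np

def dominant_direction(w, x, y, z):
    """Estimate one dominant sound source direction for a single frequency
    band from the complex B-format bin values (w, x, y, z), using the
    active intensity vector Re{conj(W) * (X, Y, Z)}."""
    intensity = np.real(np.conj(w) * np.array([x, y, z]))
    norm = np.linalg.norm(intensity)
    if norm < 1e-12:
        return np.array([1.0, 0.0, 0.0])   # arbitrary fallback direction
    # Under the usual B-format sign convention this unit vector points
    # towards the dominant source.
    return intensity / norm
```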
[0025] The audio processor can be implemented in various ways, e.g. in the form of a processor
forming part of a device, wherein the processor is provided with executable code to
perform the invention.
[0026] Figs. 2 and 3 illustrate components of a preferred embodiment suited to convert an
input signal which has three-dimensional characteristics and is in an "ambisonic B-format".
The ambisonic B-format system is a very high quality sound positioning system which
operates by breaking down the directionality of the sound into spherical harmonic
components termed W, X, Y and Z. The ambisonic system is then designed to utilize
a plurality of output speakers to cooperatively recreate the original directional
components. For a description of the B-format system, reference is made to: http://en.wikipedia.org/wiki/Ambisonics.
[0027] Referring to Fig. 2, the preferred embodiment is directed at providing an improved
spatialization of input audio signals. A B-format signal is input having X, Y, Z and
W components. Each component of the B-format input set is processed through a corresponding
filter bank
1-4, each of which divides the input into a number of output frequency bands (the number
of bands being implementation dependent, typically in the range of 1024 to 4096).
[0028] Elements 5, 6, 7, 8 and 10 are replicated once for each frequency band, although only
one of each is shown in Fig. 2. For each frequency band, the four signals (one from each
filter bank 1-4) are processed by a plane wave expansion element 5, which determines the
smallest number of plane waves necessary to recreate the local sound field encoded in the
four signals. The plane wave expansion element also calculates the direction, phase and
amplitude of these waves. The input signal is denoted w, x, y, z, with subscripts r and i.
The local sound field can in most cases be recreated by two plane waves, as expressed
in the following equations:

[0030] Equation 5 gives zero, one or two real values for cos 2ϕ, corresponding to zero, one
or two solutions to the equations. Each value for cos 2ϕ corresponds to several possible
values of ϕ, one in each quadrant, or the values 0 and π. Only one of these is correct.
The correct quadrant can be determined from equation 9 and the requirement that w1 and w2
should be positive.

[0033] As before, the quadrant of ϕ can be determined based on another equation (18) and
the requirement that w'1 and w'2 should be positive.

[0034] The values of w0 and ϕ0 are not used in subsequent steps.
[0035] The output of 5 consists of the two vectors <x1, y1, z1> and <x2, y2, z2>. This output
is connected to an element 6 which sorts these two vectors according to the value of their
y element. In an alternative embodiment of the invention, only one of the two vectors is
passed on from element 6. The choice can be that of the longest vector or the one with the
highest degree of similarity with neighbouring vectors. The output of 6 is connected to a
smoothing element 7 which suppresses rapid changes in the direction estimates. The output
of 7 is connected to an element 8 which generates suitable transfer functions from each of
the input signals to each of the output signals, a total of eight transfer functions. Each
of these transfer functions is passed through a smoothing element 9. This element suppresses
large differences in phase and amplitude between neighbouring frequency bands and also
suppresses rapid changes in phase and amplitude. The output of 9 is passed to a matrix
multiplier 10 which applies the transfer functions to the input signals and creates two
output signals. Elements 11 and 12 sum each of the output signals from 10 across all filter
bands to produce a binaural signal.
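For illustration, the per-band matrix multiplication 10 and the band summation 11, 12 could be sketched as follows (Python/NumPy, assuming an FFT-based filter bank so that the summation over bands is realized by an inverse FFT; names and shapes are illustrative):

```python
import numpy as np

def apply_and_sum(band_signals, transfer_matrices, window_length=4096):
    """band_signals:      complex array of shape (4, n_bands), one column of
                          (W, X, Y, Z) values per frequency band
       transfer_matrices: complex array of shape (n_bands, 2, 4), the eight
                          smoothed transfer functions of each band
       Returns a two-channel (left/right) time-domain block."""
    n_bands = band_signals.shape[1]
    binaural = np.zeros((2, n_bands), dtype=complex)
    for band in range(n_bands):                       # element 10, per band
        binaural[:, band] = transfer_matrices[band] @ band_signals[:, band]
    return np.fft.irfft(binaural, n=window_length, axis=1)   # elements 11, 12
```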
[0036] Referring to Fig. 3, there is illustrated schematically the preferred embodiment of the
transfer matrix generator referenced in Fig. 2. While one transfer matrix generator is
provided for each frequency band, elements 3, 4 and 5 are shared across all frequency bands.
An element 1 generates two new vectors whose directions are chosen so as to maximize the
angles between the four resulting vectors. In an alternative embodiment of the invention,
only one vector is passed into the transfer matrix generator. In this case, element 1 must
generate three new vectors, preferably such that the resulting four vectors point towards
the vertices of a regular tetrahedron. This alternative approach is also beneficial in cases
where the two input vectors are collinear or nearly collinear.
[0037] The four vectors are used to represent the directions to four virtual loudspeakers which
will be used to play back the input signals. An element 6 calculates a decoding matrix by
inverting the following matrix:

where

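Since the matrix itself is not reproduced here, the following sketch (Python/NumPy) merely illustrates the principle: an encoding matrix is built from the four virtual loudspeaker directions and inverted. The W gain of 1/sqrt(2) is the conventional B-format weighting and is an assumption of the sketch, not a statement about the matrix of the preferred embodiment:

```python
import numpy as np

def decoding_matrix(speaker_dirs):
    """speaker_dirs: array of shape (4, 3) holding the unit direction
    vectors (x, y, z) of the four virtual loudspeakers."""
    # Column i encodes a unit plane wave arriving from virtual loudspeaker i
    # into B-format (W, X, Y, Z); W is weighted by 1/sqrt(2) by convention.
    encoding = np.vstack([np.full(4, 1.0 / np.sqrt(2.0)), speaker_dirs.T])
    # Inverting the encoding matrix yields the decoding matrix that maps the
    # per-band (W, X, Y, Z) values to the four virtual loudspeaker signals.
    return np.linalg.inv(encoding)
```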
[0038] An element 5 stores a set of head-related transfer functions. An element 3 alters the
phase of these transfer functions, leaving the amplitude unchanged. The reason for this
transformation is that any uncertainty in the direction estimates translates to an
uncertainty in the absolute phase of the head-related transfer functions which increases
proportionally with frequency. Without this alteration to the transfer functions, too much
noise would be added to the phase of high-frequency components in the signal, resulting in
poor reproduction of transient sounds. The transformation performed by element 3 is:

[0039] This transformation has no effect at low frequencies. At high frequencies, the phase
of the transfer function is differentiated with respect to frequency. The transition happens
around f = fc, in a transition region of approximate width f0. In this equation, Δf is the
band-to-band frequency difference. Since the human ability to perceive inter-aural phase
difference is limited to frequencies below approx. 1200-1600 Hz, reasonable values for fc
and f0 are 1800 Hz and 200 Hz, respectively. Above this transition frequency, humans are
still sensitive to inter-aural group delay, which will be restored after performing the
inverse transformation, as is done in element 4:
[0040] Element 2 uses the virtual loudspeaker directions to select and interpolate between the
modified head-related transfer functions closest to the direction of each virtual loudspeaker.
For each virtual loudspeaker, there are two head-related transfer functions; one for each ear,
providing a total of eight transfer functions which are passed to element 4.
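A minimal sketch of the selection step of element 2 is given below (Python/NumPy); interpolation between neighbouring measurement directions is omitted for brevity, and all names are illustrative:

```python
import numpy as np

def nearest_hrtf_pair(direction, hrtf_directions, hrtf_left, hrtf_right):
    """Select the stored left/right HRTF pair whose measurement direction is
    closest (largest cosine similarity) to a virtual loudspeaker direction.

    hrtf_directions: (n, 3) unit vectors of the measurement directions
    hrtf_left/right: (n, n_bands) complex transfer functions per direction"""
    index = int(np.argmax(hrtf_directions @ direction))
    return hrtf_left[index], hrtf_right[index]
```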
[0041] The outputs of elements 4 and 6 are multiplied in a matrix multiplication 7 to produce
the suitable transfer matrix.
[0042] In the arrangement shown in Fig. 3, the decoding matrix is multiplied with the transfer
function matrix before their product is multiplied with the input signals. In an alternative
embodiment of the invention, the input signals are first multiplied with the decoding
matrix and their product subsequently multiplied with the transfer function matrix.
However, this would preclude the possibility of smoothing of the overall transfer
functions. Such smoothing is advantageous for the reproduction of transient sounds.
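The following sketch (Python/NumPy) illustrates this ordering: the 2x4 HRTF matrix and the 4x4 decoding matrix are combined into a 2x4 output transfer matrix per band, after which a simple first-order smoothing across neighbouring bands can be applied; the smoothing filter shown is an assumption, since the embodiment does not specify it:

```python
import numpy as np

def output_transfer_matrix(hrtf_matrix, decode_matrix):
    """Combine the 2x4 HRTF matrix (left/right pair per virtual loudspeaker)
    with the 4x4 decoding matrix into one 2x4 transfer matrix that is later
    applied directly to the (W, X, Y, Z) values of a band."""
    return hrtf_matrix @ decode_matrix

def smooth_across_bands(transfer_matrices, weight=0.25):
    """Suppress large band-to-band changes of the combined transfer functions
    by a simple first-order recursive smoothing over the band index.
    transfer_matrices: array of shape (n_bands, 2, 4)."""
    smoothed = transfer_matrices.astype(complex)
    for band in range(1, smoothed.shape[0]):
        smoothed[band] = (1 - weight) * transfer_matrices[band] + weight * smoothed[band - 1]
    return smoothed
```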
[0043] The overall effect of the arrangement shown in Figs. 2 and 3 is to decompose the
local sound field into a large number of plane waves and to pass these plane waves
through corresponding head-related transfer functions in order to produce a binaural
signal suited for headphone reproduction.
[0044] Fig. 4 illustrates a block diagram of an audio device with an audio processor according
to the invention, e.g. the one illustrated in Figs. 2 and 3. The device may be a dedicated
headphone unit, a general audio device offering the conversion of a multi-channel
input signal to another output format as an option, or the device may be a general
computer with a sound card provided with software suited to perform the conversion
method according to the invention.
[0045] The device may be able to perform on-line conversion of the input signal, e.g. by
receiving the multi-channel input audio signal in the form of a digital bit stream.
Alternatively, e.g. if the device is a computer, the device may generate the output
signal in the form of an audio output file based on an audio file as input.
[0046] To sum up, the invention provides an audio processor for converting a multi-channel
audio input signal X, Y, Z, W, such as a B-format Sound Field signal, into a set of
audio output signals L, R, such as a set of two audio output signals L, R arranged
for headphone reproduction. A filter bank splits the input signal X, Y, Z, W into
frequency bands. A sound source separation unit uses plane wave expansion on the input signal
X, Y, Z, W to determine one or two dominant sound source directions. These are used
to determine virtual loudspeaker positions selected such that one or both of the virtual
loudspeaker positions coincide with one or both of the dominant directions. The input
signal X, Y, Z, W is then decoded into virtual loudspeaker signals corresponding to
each of the virtual loudspeaker positions, and finally the frequency components are
combined in a summation unit to arrive at the set of audio output signals L, R. E.g.
Head-Related Transfer Functions (HRTFs) are applied to arrive at a binaural signal
suited for headphone reproduction. A high spatial fidelity is obtained due to the
coincidence of virtual loudspeaker positions and the determined dominant sound source
direction(s).
[0047] Improved performance can be obtained by differentiating the phase of a high-frequency
part of the HRTFs with respect to frequency before combining HRTF components corresponding
to different directions, followed by a corresponding integration of this part with respect
to frequency after the combination.
[0048] In the claims, the term "comprising" does not exclude the presence of other elements
or steps. Additionally, although individual features may be included in different
claims, these may possibly be advantageously combined, and the inclusion in different
claims does not imply that a combination of features is not feasible and/or advantageous.
In addition, singular references do not exclude a plurality. Thus, references to "a",
"an", "first", "second" etc. do not preclude a plurality. Reference signs are included
in the claims however the inclusion of the reference signs is only for clarity reasons
and should not be construed as limiting the scope of the claims.
1. An audio processor arranged to convert a multi-channel audio input signal (X, Y, Z,
W) comprising at least two channels, such as a B-format Sound Field signal, into a
set of audio output signals (L, R), such as a set of two audio output signals (L,
R) arranged for headphone reproduction, the audio processor comprising
- a filter bank arranged to separate the input signal (X, Y, Z, W) into a plurality
of frequency bands, such as partially overlapping frequency bands,
- a sound source separation unit arranged, for at least a part of the plurality of
frequency bands, to
- perform a plane wave expansion computation on the multi-channel audio input signal
(X, Y, Z, W) so as to determine at least one dominant direction corresponding to a
direction of a dominant sound source in the audio input signal (X, Y, Z, W),
- determine an array of at least two, such as four, virtual loudspeaker positions
selected such that one or more of the virtual loudspeaker positions at least substantially
coincides, such as precisely coincides, with the at least one dominant direction,
and
- decode the audio input signal (X, Y, Z, W) into virtual loudspeaker signals corresponding
to each of the virtual loudspeaker positions, and
- a summation unit arranged to sum the virtual loudspeaker signals for the at least
part of the plurality of frequency bands to arrive at the set of audio output signals
(L, R).
2. Audio processor according to claim 1, wherein the filter bank comprises at least 500,
such as 1000 to 5000, partially overlapping filters covering a frequency range of
0 Hz to 22 kHz.
3. Audio processor according to claim 1 or 2, wherein the virtual loudspeaker positions
are selected by a rotation of a set of at least three positions in a fixed spatial
interrelation.
4. Audio processor according to claim 3, wherein the set of positions in a fixed spatial
interrelation comprises four positions, such as four positions arranged in a tetrahedron.
5. Audio processor according to any of the preceding claims, wherein the wave expansion
determines two dominant directions, and wherein the array of at least two virtual
loudspeaker positions is selected such that two of the virtual loudspeaker positions
at least substantially coincides, such as precisely coincides, with the two dominant
directions.
6. Audio processor according to any of the preceding claims, comprising a binaural synthesizer
unit arranged to generate first and second audio output signals (L, R) by applying
Head-Related Transfer Functions (HRTF) to each of the virtual loudspeaker signals.
7. Audio processor according to claim 6, wherein a decoding matrix corresponding to the
determined virtual loudspeaker positions and a transfer function matrix corresponding
to the Head-Related Transfer Functions (HRTF) are combined into an output transfer
matrix prior to being applied to the audio input signals (X, Y, Z, W).
8. Audio processor according to claim 7, wherein a smoothing is performed on transfer
functions of the output transfer matrix prior to being applied to the input signals
(X, Y, Z, W).
9. Audio processor according to any of claims 6-8, wherein the phase of the Head-Related
Transfer Functions (HRTF) is differentiated with respect to frequency, and after combining
components of Head-Related Transfer Functions (HRTF) corresponding to different directions,
the phase of the combined transfer functions is integrated with respect to frequency.
10. Audio processor according to claim 9, wherein the phase of the Head-Related
Transfer Functions (HRTF) is left unaltered below a first frequency limit, such as
below 1.6 kHz, and differentiated with respect to frequency at frequencies above a
second frequency limit with a higher frequency than the first frequency limit, such
as 2.0 kHz, and with a gradual transition in between, and after combining components
of Head-Related Transfer Functions (HRTF) corresponding to different directions, the
inverse operation is applied to the combined function.
11. Audio processor according to any of the preceding claims, wherein the audio input
signal is a multi-channel audio signal arranged for decomposition into plane wave
components, such as one of: a B-format sound field signal, a higher-order ambisonics
recording, a stereo recording, and a surround sound recording.
12. Audio processor according to any of the preceding claims, wherein the sound source
separation unit determines the at least one dominant direction in each frequency band
for each time frame, wherein a time frame has a size of 2,000 to 10,000 samples.
13. Audio processor according to any of the preceding claims, wherein the set of audio
output signals (L, R) is arranged for playback over headphones.
14. Device comprising an audio processor according to any of claims 1-13, such as the
device being one of: a device for recording sound or video signals, a device for playback
of sound or video signals, a portable device, a computer device, a video game device,
a hi-fi device, an audio converter device, and a headphone unit.
15. Method for converting a multi-channel audio input signal (X, Y, Z, W) comprising at
least two channels, such as a B-format Sound Field signal, into a set of audio output
signals (L, R), such as a set of two audio output signals (L, R) arranged for headphone
reproduction, the method comprising
- separating the input signal (X, Y, Z, W) into a plurality of frequency bands, such
as partially overlapping frequency bands,
- performing a sound source separation for at least a part of the plurality of frequency
bands, comprising
- performing a plane wave expansion computation on the multi-channel audio input signal
(X, Y, Z, W) so as to determine at least one dominant direction corresponding to a
direction of a dominant sound source in the audio input signal (X, Y, Z, W),
- determining an array of at least two, such as four, virtual loudspeaker positions
selected such that one or more of the virtual loudspeaker positions at least substantially
coincides, such as precisely coincides, with the at least one dominant direction,
and
- decoding the audio input signal (X, Y, Z, W) into virtual loudspeaker signals corresponding
to each of the virtual loudspeaker positions, and
- summing the virtual loudspeaker signals for the at least part of the plurality of
frequency bands to arrive at the set of audio output signals (L, R).