
Synthesizing Three-Dimensional Sound Scenes in Audio or Multimedia Production and Interactive Human-Computer Interfaces

Jean-Marc Jot

5th International Conference: Interface to Real & Virtual Worlds, Montpellier, France, May 1996
Copyright © Ircam - Centre Georges-Pompidou 1996


This paper reviews the principles of the synthesis of virtual sound environments for application to the fields of music and multimedia production and to the design of advanced multimodal human-computer interfaces (e.g. virtual reality). A real-time spatial processor developed by Espaces Nouveaux and Ircam, the Spatialisateur, is presented. It makes it possible to reproduce and control the localization of sound sources in three dimensions as well as the projection (reverberation) of sounds in an existing or virtual space. A distinctive feature of the Spatialisateur project is its concern for providing direct control over the perceptual spatial attributes associated with each sound source and with the virtual space, while allowing reproduction via various systems or formats over loudspeakers or headphones. The operational advantages of this approach are illustrated in practical contexts.

1. Introduction

The reproduction of complex sound scenes including multiple sources in an existing or imaginary space has been a major concern in professional recording and production of music and soundtracks. More recently, the evolution of technologies and computers has led to the development of systems aiming at immersing an individual in an artificial scene through the reconstruction of multisensory cues (particularly auditory, visual and haptic cues).

From an auditory point of view, the spatial cues to be reproduced can be divided into two categories: the auditory localization of sound sources (desirably in three dimensions), and the room effect resulting from indirect sound paths (reflections and reverberation from walls and obstacles). The benefits of spatial sound reproduction over mono reproduction are significant in a wide range of industrial, research, artistic and entertainment applications, including teleconferencing, simulation and virtual reality or telerobotics, professional audio and computer music, and advanced machine interfaces for data representation or for visually impaired users. Spatial sound processing is a key factor for improving the legibility and naturalness of a virtual scene, restoring the capacity of our perception to exploit spatial auditory cues in order to segregate sounds emanating from different directions [1-3]. It further allows manipulation of the spatial attributes of sound events for creative purposes or 'augmented reality' [3]. In all applications, the coherence of auditory cues with those conveyed by cognition and other modes of perception (visual, haptic, etc.), or the absence of these cues, must be taken into account.

Spatial sound reproduction requires the use of an electro-acoustic apparatus (headphones or loudspeaker system), along with a technique or format for encoding directional localization cues on several audio channels for transmission or storage. This encoding can be achieved in several ways:
a) Recording of an existing sound scene with coincident or closely-spaced microphones (stereo microphone pair, dummy head, Ambisonic microphone [4]) allowing simultaneous reproduction of several sound sources and their relative positions in space. This approach of course considerably limits the possibilities of future spatial manipulations and adaptation to various reproduction contexts or interactive applications.
b) Synthesis of a virtual sound scene: a signal processing system reconstructs the localization of each sound source and the room effect, starting from individual sound signals and parameters describing the sound scene (position, orientation, directivity of each source and acoustic characterization of the room or space). An example of this approach, in the field of professional audio, is multitrack recording and post-processing using a stereo mixing console and artificial reverberators.
c) A combination of approaches (a) and (b), such as a live multitrack recording in which a stereo signal, captured by a main microphone pair, is mixed with signals captured by several spot microphones located near the individual instruments.

In the context of interactive applications, where elements of the sound scene can be dynamically modified by the user's or performer's actions (e.g. movements of sound sources), a real-time spatial synthesis technique is necessary. This requires local signal processing hardware in each display interface (videogame console, teleconference interface, concert sound system, etc.) and involves a processing and transmission cost which increases linearly with the number of sound events to be synthesized simultaneously.

From a general point of view, the spatial synthesis parameters can be provided either by the analysis of an existing scene (through position trackers, cameras, adaptive acoustic arrays, etc.), or by the user's actions (man-to-machine interface: mixing desk, graphic or gestural interfaces, etc.), or even by a stand-alone process (videogames, simulators). For instance, whenever a head-tracking system is used for updating a synthetic image on a head-mounted display according to the movements of the spectator, the position coordinates it provides can be exploited simultaneously for updating the synthetic sound scene reproduced over headphones (this is necessary, in headphone reproduction, for ensuring that the perceived positions or movements of sound sources in the virtual space are independent of the movements of the listener).

In the next section of this paper, the general principles and limitations of current spatial sound processing and room simulation technologies will be briefly reviewed. In the third section, a perceptually-based processor (the Spatialisateur) will be introduced. In conclusion, the advantages of this approach will be illustrated in practical applications.

2. Overview of current spatial sound processing techniques

Multi-channel panpots and artificial reverberators

In a natural situation, directional localization cues (azimuth and elevation of the sound source) are typically conveyed by the direct sound path from the source to the listener. However, the intensity of this direct sound is not a reliable distance cue in the absence of a room effect, especially in an electro-acoustically reproduced sound scene [1, 2]. Thus the typical mixing structure shown in Figure 1 defines the minimum signal processing system for conveying three-dimensional localization cues simultaneously for N sound sources over P loudspeakers.

Figure 1: Typical mixing architecture (shown here assuming 4-channel loudspeaker reproduction) combining a mixing console for synthesizing directional effects and an external reverberation unit for synthesizing temporal effects (recent digital mixing consoles include a tunable delay line in each channel). With some spatial encoding methods, an additional decoding stage is necessary for delivering the processed signal to the loudspeakers.

Each channel of the mixing console receives a mono recorded signal (devoid of room effect) from an individual sound source, and contains a panning module which synthesizes directional localization over the P loudspeakers (this module is usually called a panoramic potentiometer, or 'panpot', in stereo mixing consoles). All source signals feed an artificial reverberator which delivers different output signals to the loudspeakers, creating a diffuse (non-localized) room effect to which each sound source can contribute a different intensity. The relative values of the gains d and r can be used in each channel for controlling the perceived distance of the corresponding sound source. This basic principle of current mixing architectures, generally designed for two output channels, can be extended to any number of loudspeakers in two or three dimensions, by designing an appropriate 'panpot'. This extension was proposed by Chowning [5], who designed a spatial processing system for computer music, allowing two-dimensional control of the localization and movements of virtual sound sources over four loudspeakers (including the simulation of the Doppler effect). The localization was controlled in polar coordinates (azimuth and distance) referenced to a listener placed at the center of the loudspeaker system, using a gestural control interface.
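As an illustration of how the gains d and r might be tied to a single distance control, the following minimal sketch uses an assumed law that is not specified in the text: the direct gain falls as 1/distance and the reverberation send as 1/sqrt(distance), so that the direct-to-reverberant ratio decreases as the source moves away. The function names and the reference distance are hypothetical.

```python
import math

def distance_gains(distance, ref_distance=1.0):
    """Return (d, r): direct gain and reverberation send gain for one channel.

    Assumed law (illustrative only): d ~ 1/distance, r ~ 1/sqrt(distance).
    Distances below ref_distance are clamped so that gains never exceed 1.
    """
    dist = max(distance, ref_distance)
    d = ref_distance / dist
    r = math.sqrt(ref_distance / dist)
    return d, r

def direct_to_reverb_db(distance):
    """Direct-to-reverberant gain ratio in dB for the assumed law."""
    d, r = distance_gains(distance)
    return 20.0 * math.log10(d / r)
```

Under this assumption, each doubling of the distance lowers the direct-to-reverberant ratio by about 3 dB.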

Many systems used today in computer music are based on Chowning's process. For the directional encoding of the direct sound over P loudspeakers, Chowning used a pairwise intensity panning principle (sometimes called 'discrete surround'), derived from the conventional stereo panpot. Alternatively, it is always possible to design a panning module which emulates the encoding achieved in free field by any spatial sound pickup technique using several microphones. An attractive approach is the three-dimensional 4-channel Ambisonic 'B-format', for which decoders have been designed to accommodate various multi-channel loudspeaker setups [4]. However, there is currently no multi-channel reproduction technique allowing accurate 3D reproduction over a wide listening area, while using a reasonable number of loudspeakers. Current techniques either assume a centrally located listener (such as in Ambisonics) or assume a frontal bias for primary sound sources in the reproduced scene. The latter approach is found in formats initially developed for movie theaters, which have recently evolved into standards for HDTV, multimedia and domestic entertainment [6]. In these formats, the side or rear loudspeakers are not intended for reproducing lateral or rear sound sources, but only for diffuse ambiance and reverberation.
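The pairwise intensity panning principle can be sketched as follows for a horizontal ring of loudspeakers. This is a minimal illustration, not Chowning's actual implementation: the sine/cosine (constant-power) gain law and the function name `pairwise_pan` are assumptions, and at least two distinct loudspeaker azimuths are required.

```python
import math

def pairwise_pan(azimuth_deg, speaker_az_deg):
    """Constant-power gains for a source at azimuth_deg over a ring of speakers.

    speaker_az_deg: speaker azimuths in degrees, sorted ascending in [0, 360).
    Only the two speakers adjacent to the source direction receive signal.
    """
    p = len(speaker_az_deg)
    az = azimuth_deg % 360.0
    gains = [0.0] * p
    for i in range(p):
        a = speaker_az_deg[i]
        b = speaker_az_deg[(i + 1) % p]
        span = (b - a) % 360.0      # angular width of this speaker pair
        offset = (az - a) % 360.0   # source position inside the pair
        if offset <= span:
            frac = offset / span
            gains[i] = math.cos(frac * math.pi / 2)
            gains[(i + 1) % p] = math.sin(frac * math.pi / 2)
            break
    return gains

# Four loudspeakers in a square; a source at 90 degrees sits between the first two.
square = [45.0, 135.0, 225.0, 315.0]
gains_side = pairwise_pan(90.0, square)
```

With this gain law the sum of squared gains stays equal to 1 for any source direction, so the total radiated power does not depend on the azimuth.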

Binaural and transaural processors

Another attractive approach for designing a three-dimensional panning module is to emulate a dummy-head microphone recording. This method could be expected to provide exact reproduction over headphones, since it directly synthesizes the pressure signals at the two ears [1]. The panning module can be implemented in the digital domain as shown in Figure 2. Using a dummy head and a loudspeaker in an anechoic room, a set of 'head-related transfer functions' (HRTFs) can be measured in order to simulate any particular direction of incidence of a sound wave in free field (i.e. without reflections) [1, 2].
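A static binaural panning module of this kind amounts to filtering the mono source signal with the pair of impulse responses (HRIRs) measured for the desired direction. The sketch below is illustrative only: the two "HRIRs" are crude placeholders (a pure interaural delay and level difference), not measured data, and the names are hypothetical.

```python
import numpy as np

def binaural_pan(mono, hrir_left, hrir_right):
    """Filter a mono signal with one HRIR pair; returns an array of shape (2, N)."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])

# Placeholder impulse responses for a source on the listener's left:
# the right ear receives a delayed, attenuated copy (illustrative values).
fs = 48000
itd_samples = 30                      # about 0.6 ms at 48 kHz
hrir_l = np.zeros(64)
hrir_l[0] = 1.0
hrir_r = np.zeros(64)
hrir_r[itd_samples] = 0.5
```

Changing the simulated direction means loading a different HRIR pair from the measured database; interpolation between pairs for moving sources is not covered by this sketch.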

Figure 2: Principle of binaural synthesis, which simulates a free-field listening situation over headphones. The azimuth and elevation of the virtual sound source are controlled by loading two sets of filter coefficients from a database of 'Head-Related Transfer Functions', according to a specified direction.

Early binaural processors and binaural mixing consoles, developed in the late 1980s [7], used powerful signal processors in order to accurately implement the HRTF filters in real time. Further research on the modeling of HRTFs has led to efficient implementations of time-varying HRTF filters which make it possible to simulate dynamic movements of sound sources [2, 8, 9]. A dynamic implementation requires about twice the computational power necessary for a static implementation [9]. Thus, with off-the-shelf programmable digital signal processors such as the Motorola DSP56002, a straightforward implementation of a variable binaural panpot using 200-tap convolution filters would require the signal processing capacity of two DSPs at a sample rate of 48 kHz. This cost can be reduced to less than 150 multiply-accumulates per sample period (30% of the capacity of one DSP) using an implementation based on minimum-phase pole-zero filters and variable delay lines [9].
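The figures quoted above can be cross-checked with a short calculation. The per-DSP capacity used here is not stated directly in the text; it is inferred from the statement that 150 multiply-accumulates (MACs) represent 30% of one DSP, and the variable names are illustrative.

```python
import math

taps_per_ear = 200        # length of each convolution filter (from the text)
ears = 2                  # one filter per ear
dynamic_factor = 2        # dynamic implementation costs about twice a static one

# Capacity inferred from "150 MACs per sample period = 30% of one DSP".
dsp_capacity = 150 * 100 / 30          # MACs per sample period at 48 kHz

static_cost = taps_per_ear * ears               # MACs per sample period
dynamic_cost = static_cost * dynamic_factor     # MACs per sample period
dsps_needed = math.ceil(dynamic_cost / dsp_capacity)

print(static_cost, dynamic_cost, dsps_needed)
```

Under these assumptions the dynamic 200-tap panpot costs 800 MACs per sample period against an inferred capacity of 500, which is consistent with the two DSPs mentioned in the text.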

Because the HRTFs actually encode diffraction effects which depend on the shape of the head and pinnae, the applications of binaural technology in broadcasting and recording are limited by the individual nature of the HRTFs. To achieve perfect reproduction over headphones, it would be necessary, instead of using an HRTF database measured on a dummy head, to use measurements made specifically for each listener (with microphones inserted in his/her ears). A typical consequence of using non-individual HRTFs is the difficulty of reproducing virtual sound sources localized in the frontal sector over headphones (these will often be heard above or behind, near or even inside the head) [2]. An additional constraint of headphone reproduction is that it calls for the use of a head-tracking system allowing real-time compensation of the listener's movements in the binaural synthesis process. Under this condition, in real-time interactive applications such as virtual reality, binaural synthesis offers an attractive solution because the dynamic localization cues conveyed by head-tracking largely compensate for the ambiguities resulting from the use of non-individual HRTFs.

In order to preserve the three-dimensional localization cues in reproduction over loudspeakers, a binaural signal must be decoded by a 2 x 2 inverse transfer-function matrix which cancels the cross-talk from each loudspeaker to the opposite ear [10]. This technique imposes a strong constraint on the position and orientation of the listener and loudspeakers during playback, which must be more strictly enforced than in conventional two-channel stereophony if a convincing reproduction of lateral or rear sound sources is desired. Nevertheless, in broadcasting and recording applications over two channels, transaural stereophony offers a viable solution for overcoming the limits of conventional stereophony. Our implementation generally produces reliable localization cues on a carefully installed loudspeaker pair, except in the rear sector. Current research toward improved transaural reproduction includes the implementation of head-tracking [11] or multichannel extensions involving least-squares inversion over multiple listening positions [12].
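The 2 x 2 inversion can be sketched in the frequency domain as follows. This is a minimal illustration under stated assumptions, not the implementation referred to in the text: the loudspeaker-to-ear transfer functions are replaced by a toy symmetric model (unit ipsilateral path, attenuated and delayed contralateral path), and no regularization is applied.

```python
import numpy as np

def crosstalk_canceller(H):
    """Invert the acoustic transfer matrix at each frequency.

    H: complex array of shape (F, 2, 2), H[f, ear, speaker].
    Returns C with the same shape such that H[f] @ C[f] is the identity.
    """
    return np.linalg.inv(H)

# Toy symmetric model (assumed values): the contralateral path is the
# ipsilateral path attenuated by g and delayed by tau seconds.
freqs = np.linspace(0.0, 24000.0, 257)
g, tau = 0.7, 0.25e-3
cross = g * np.exp(-2j * np.pi * freqs * tau)
H = np.empty((freqs.size, 2, 2), dtype=complex)
H[:, 0, 0] = H[:, 1, 1] = 1.0
H[:, 0, 1] = H[:, 1, 0] = cross
C = crosstalk_canceller(H)
```

Feeding the loudspeakers with the binaural signal filtered by C delivers, in this idealized model, each binaural channel to its own ear only. With measured transfer functions the matrix can be ill-conditioned at some frequencies, which this sketch does not address.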

Real-time room simulation

The system proposed by Chowning, corresponding to the processing architecture of Figure 1, entailed some limitations. Although it was appropriate for conveying the impression that all virtual sources were situated in the same room, it did not allow faithful reproduction of the perception of distance or direction of a sound source as experienced in a natural situation, because the temporal and directional distribution of early reflections could not be controlled specifically for each virtual sound source.

Early digital reverberation algorithms based on digital delay lines with feedback, following Schroeder's pioneering studies [13], evolved into more sophisticated designs during the 1980s, making it possible to shape the early reflection pattern and simulate the later diffuse reverberation more naturally [14, 18]. An artificial reverberation algorithm based on feedback delay networks, such as shown in Figure 3, can mimic the reverberation characteristics of an existing room and deliver several channels of artificial reverberation, while using only a fraction of the processing capacity of a programmable digital signal processor such as the Motorola DSP56002 [19, 9].

When combining commercial reverberation units with a conventional mixing console, however, the musician or sound engineer still faces a non-ideal user interface: the perceived distance of sound sources cannot be controlled effectively using only the gain controls d and r in each channel of the mixing console, because it also depends on the settings of the reverberator's controls. This heterogeneity of the user interface limits the possibilities of continuous variation of the perceived distance of sound sources. Furthermore, in most current reverberation units, intuitive adjustments of the room effect are typically limited to the modification of the decay time or the size of a factory-preset virtual room, and the signal processing structure is usually not designed for reproduction formats other than conventional stereophony. These limitations make traditional mixing architectures inadequate for interactive and immersive applications, as well as broadcasting and production of recordings in new formats such as 3/2-stereo.

Figure 3: Typical echogram for a source and a receiver in a room and cost-efficient real-time binaural room simulation algorithm using feedback delay networks. The delay lengths t_i and gains b_i allow controlling the time, amplitude and lateralization of each early reflection. Each delay line τ_i includes an attenuation filter allowing accurate control of the reverberation decay time vs frequency. The feedback matrix A is a unitary (energy preserving) matrix. A typical implementation of this algorithm involves 8 feedback channels.
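The late-reverberation part of such a feedback delay network can be sketched as below. This is a simplified illustration under several assumptions, not the algorithm of the figure: the attenuation filters are reduced to frequency-independent gains, the early-reflection taps are omitted, the output is mono, the feedback matrix is a Householder reflection (one possible orthogonal matrix), and the delay lengths are arbitrary illustrative values.

```python
import numpy as np

def fdn_reverb(x, fs=48000, t60=1.5,
               delays=(1021, 1327, 1559, 1787, 2053, 2311, 2593, 2903)):
    """Mono feedback delay network with 8 delay lines and a decay time of t60 seconds."""
    n = len(delays)
    # Householder reflection: an orthogonal (energy-preserving) feedback matrix.
    A = np.eye(n) - (2.0 / n) * np.ones((n, n))
    # Gain per delay line so that the loss is 60 dB per t60 seconds of delay.
    g = 10.0 ** (-3.0 * np.array(delays) / (fs * t60))
    lines = [np.zeros(m) for m in delays]   # circular buffers
    idx = [0] * n
    y = np.zeros(len(x))
    for t in range(len(x)):
        # Oldest sample of each buffer, i.e. the input delayed by delays[i].
        outs = np.array([lines[i][idx[i]] for i in range(n)])
        y[t] = outs.sum()
        fb = A @ (g * outs)                 # attenuate, then mix through A
        for i in range(n):
            lines[i][idx[i]] = x[t] + fb[i]
            idx[i] = (idx[i] + 1) % delays[i]
    return y
```

Because the feedback matrix preserves energy, the decay rate is set entirely by the per-line gains; replacing each gain by a low-pass filter would make the decay time frequency-dependent, as described in the caption.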

The minimum requirement for overcoming these limitations appears to be providing not only a panpot in each channel of the mixing architecture, but also a module for controlling the distribution of the early reflections specifically for each sound source, as shown in Figure 4 [19]. Moore [20] described a signal processing structure of this type, which controls the times and amplitudes of the first reflections, for each source signal and each output channel, according to: The general processing model proposed by Moore for concert performances is that of a polygonal 'listening' room delimited by the loudspeaker positions and containing the audience, placed in a larger room (the 'virtual' room, containing the virtual sound sources). The signals delivered to the loudspeakers reconstruct an approximation of the signals that would be captured by P virtual microphones placed along the exterior perimeter of the 'listening' room, at the positions of the P loudspeakers. This directional encoding emulates a multi-microphone recording using non-coincident microphones (with the microphones being much more distant than in conventional recording techniques).

Figure 4: Improved mixing architecture for reproducing several virtual sound sources located in the same virtual room while controlling the early reflection pattern associated with each individual source.

The identification of indirect sound paths from each source to each virtual microphone is based on a geometrical model of sound propagation, assuming specular reflections of sound waves on the walls of the virtual room (image source model). The arrival time and frequency-dependent attenuation of each early reflection can be computed by simulating all physical phenomena along the corresponding path as a cascade of elementary filters: radiation by the source, propagation in the air, absorption by walls and capture by the microphone.
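For a rectangular room, the image source model can be sketched for first-order reflections as follows. This is a deliberately reduced illustration: the cascade of elementary filters is collapsed into a single frequency-independent wall gain and 1/r spreading, air absorption and source or microphone directivity are ignored, and the function name and default values are assumptions.

```python
import math

def first_order_reflections(src, mic, room, c=343.0, wall_gain=0.8):
    """Direct path plus the six first-order specular reflections.

    src, mic: (x, y, z) positions in metres; room: (Lx, Ly, Lz), with walls
    at 0 and L on each axis. Returns (delay_seconds, gain) pairs sorted by delay.
    """
    paths = [(math.dist(src, mic), 1.0)]            # direct sound
    for axis in range(3):
        for wall in (0.0, room[axis]):
            image = list(src)
            image[axis] = 2.0 * wall - src[axis]    # mirror the source in the wall
            paths.append((math.dist(image, mic), wall_gain))
    return sorted((r / c, g / r) for r, g in paths)

# Illustrative geometry: a 5 x 4 x 3 m room.
echoes = first_order_reflections((1.0, 1.0, 1.0), (3.0, 1.0, 1.0), (5.0, 4.0, 3.0))
```

Higher-order reflections are obtained by mirroring the image sources again, which is why the number of paths, and the cost of updating them when the source or listener moves, grows quickly with reflection order.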

For application to headphone reproduction, Moore proposes reducing the size of the 'listening room' to the size of a head, and placing the two microphones on the sides. His directional encoding model then becomes equivalent to a rough implementation of HRTF filtering. From a signal processing point of view, this approach to room simulation is equivalent to using a digital mixing console as in Figure 1, where the same source signal is used in several channels of the console and each additional channel reproduces an early reflection. The delay and gain can be set in each channel to control the arrival time and amplitude of the reflection (as captured by an omnidirectional microphone placed at the center of the 'head' or 'listening room'). From this signal, the panning module then derives the P microphone signals to be delivered to the loudspeakers or headphones, according to the direction of incidence of the reflection, following the principle of Figure 2 (which can be extended to more than 2 channels).

Very recently, systems performing binaural processing of both the direct sound and the early reflections in real time have been proposed, in which room reflections are computed according to the same physical and geometrical model as above [7, 8]. These systems involve a heavy real-time signal processing effort: a binaural panning module must be implemented for each early reflection and for each virtual sound source, which is impractical for most real-world applications. Fortunately, this signal processing cost can be further reduced by introducing perceptually relevant simplifications in the spectral and binaural processing of early reflections [9]. In addition to this signal processing task, a considerable processing effort is necessary for updating all parameters whenever a sound source or the listener moves [20, 8]. As Moore noted, the dynamic variation of the delay times of the direct sound and reflections will produce the expected Doppler effects naturally.
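The way a time-varying propagation delay produces a Doppler shift can be illustrated with a short sketch. The assumptions are labeled here and in the code: the delay is evaluated from the source distance at reception time, the fractional delay uses linear interpolation, and the function name is hypothetical. For a source approaching at speed v, this model scales frequencies by 1 + v/c.

```python
import numpy as np

def moving_source_delay(x, distance_m, fs=48000, c=343.0):
    """Delay x by distance/c, with one distance value (metres) per output sample.

    Assumption: the distance is taken at reception time, and fractional
    delays are rendered by linear interpolation between input samples.
    """
    n = np.arange(len(x))
    read_pos = n - np.asarray(distance_m) * fs / c   # fractional read position
    return np.interp(read_pos, n, x, left=0.0, right=0.0)

# A 1 kHz tone from a source approaching at 34.3 m/s (one tenth of c).
fs = 48000
n = np.arange(fs)
tone = np.sin(2 * np.pi * 1000.0 * n / fs)
distance = 50.0 - 34.3 * n / fs
received = moving_source_delay(tone, distance, fs)
```

In this example the received tone is shifted to about 1100 Hz, without any explicit pitch processing: the shift follows from the changing delay alone.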

Another approach to real-time artificial reverberation was proposed recently, based on hybrid convolution in the time and frequency domain [21, 22]. Unlike earlier convolution algorithms, these hybrid algorithms make it possible to implement a very long convolution filter with no input-output delay for an affordable computational cost. Unlike reverberation algorithms based on feedback delay networks, convolution methods allow exact reproduction of reverberation derived from an impulse response measured in an existing room or derived from a computer model. However, it is impractical to dynamically update the lengthy impulse response in a convolution processor in order to tune the artificial reverberation effect or simulate moving sound sources and Doppler effects. In most interactive applications, this global convolution approach must be restricted to the rendering of the late reverberation, which can be synthesized more efficiently by a feedback delay network such as shown in Figure 3.

3. Perceptually-based spatial sound processing

In many applications requiring real-time spatial sound processing, the algorithms used for synthesizing the room effect need not imitate the exact response of a given room in a given situation, or the physical propagation of sound in rooms. Reference to the physics of sound propagation and reverberation in rooms need not be a consequence of the control strategy, but may be necessary only for ensuring the naturalness of the simulated reverberation [19]. Rather, it is often desirable to have a spatial reverberation processor with the following features:

Tunability, in real time, through perceptually relevant control parameters.
These control parameters should include the azimuth and elevation of each virtual sound source, as well as descriptors of the room effect, separately for each source. The perceptual effect of each control parameter should be predictable and independent of the setting of other parameters. A measurement and analysis procedure should make it possible to automatically derive the settings of all control parameters from an existing situation.

Configurability according to the reproduction setup and context.
Since there is no single reproduction format that can satisfy all applications, it should be possible, having specified the desired localization and reverberation effects, to configure the signal processor in order to allow reproduction in various different formats over headphones or loudspeakers. This should include corrections (such as spectral equalization) in order to preserve the perceived effect, as much as possible, between different setups and different listening rooms.

Computational efficiency.
The processor should make optimal use of the available computational resources. It should be possible, considering a particular application where the user or the designer can accept a loss of flexibility or independence between some control parameters, to further reduce the overall complexity and cost of the system, by introducing relevant simplifications in the signal processing and control architecture. One illustration is the system of Figure 4, where the late reverberation algorithm is shared between several sources, assuming that these are located in the same virtual room, while an early reflection module is associated with each individual sound source.

The Spatialisateur

Espaces Nouveaux and Ircam have developed a virtual acoustics processor, the Spatialisateur, which is built by association of software modules for real-time signal processing and control. The synthesis of localization and room effect cues can be integrated in a single compact processing module, for each source signal. This processor can be configured for various electroacoustic reproduction systems: three-dimensional encoding on two channels for individual listening over headphones or loudspeakers, as well as various multichannel formats intended for small or medium-sized listening rooms and auditoria. Several processors can be associated in parallel in order to process several source signals simultaneously.

The design approach adopted in the Spatialisateur project focuses on giving the user the possibility of specifying the desired effect from the point of view of the listener, rather than from the point of view of the technological apparatus or physical process which generates that effect. A higher-level user interface controls the different signal processing sub-modules simultaneously, and allows the user to specify the reproduced effect, for one source signal, through a set of control parameters whose definitions do not depend on the chosen reproduction system or setup (Figure 5). These parameters include the azimuth and elevation of the virtual sound source, as well as descriptors of the room acoustical quality (or room effect) associated with this sound source.

The room acoustical quality is not controlled through a model of the virtual room's geometry and wall materials, but through a formalism directly related to the perception of the virtual sound source by the listener, described by a small number of mutually independent 'perceptual factors':

The definition of these perceptual factors is derived from psycho-acoustical research carried out at Ircam on the perceptual characterization of room acoustical quality in concert halls, opera houses and auditoria [23, 24]. In the graphic interface shown in Figure 5, each slider is scaled according to the average sensitivity of listeners with respect to the perceptual factor it controls. As apparent from Figure 5 and reported above, each perceptual factor is related to a measurable acoustical index characterizing the sound transformation. These relations are implemented in the Spatialisateur's perceptual control module in order to map the perceptual representation of room acoustical quality into low-level signal processing parameters. Some of these acoustical indexes are classical criteria known in the field of architectural acoustics for characterizing concert hall acoustics (but not implemented in current artificial reverberators), such as the envelopment or the early decay time. These indexes can be derived from an impulse response measured in an existing room, which makes it possible to set the Spatialisateur's controls in order to mimic a real situation. Consequently, virtual and measured acoustical qualities can be manipulated within a unified framework.
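As an example of deriving one such index from a measured impulse response, the early decay time can be estimated by backward integration of the squared response followed by a line fit over the first 10 dB of decay. This sketch follows the usual room-acoustics definition and is not taken from the Spatialisateur's analysis procedure; function names are illustrative.

```python
import numpy as np

def energy_decay_curve_db(h):
    """Backward-integrated energy of an impulse response, in dB re. total energy."""
    energy = np.cumsum(h[::-1] ** 2)[::-1]
    return 10.0 * np.log10(energy / energy[0])

def early_decay_time(h, fs):
    """Decay time (to -60 dB) extrapolated from the 0 to -10 dB part of the curve."""
    edc = energy_decay_curve_db(h)
    t = np.arange(len(h)) / fs
    mask = edc >= -10.0
    slope, _ = np.polyfit(t[mask], edc[mask], 1)   # dB per second
    return -60.0 / slope

# Synthetic check: an exponential envelope that loses 60 dB in 0.5 s.
fs = 8000
t = np.arange(fs) / fs
h = 10.0 ** (-3.0 * t / 0.5)
edt = early_decay_time(h, fs)
```

For this purely exponential test response the estimate is close to 0.5 s; on measured responses the early decay time generally differs from the late decay time, which is what makes it a separate index.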

User interface: physical vs perceptual approach

The synthesis of a virtual sound scene relies on a description of the positions and orientations of the sound sources and the acoustical characteristics of the space, which is then translated into parameters of a signal processing algorithm such as shown in Figure 3. From a general point of view, the space can be described by a physical and geometrical description, or by attributes describing the perceived acoustical quality associated with each sound source [25]. The first approach typically suggests a three-dimensional graphic user interface representing the geometry of the room and the positions of the sources and listener, and relies on a computer algorithm simulating the propagation of sound in rooms (such as the image source model mentioned earlier). The second approach relies on knowledge of the perception of room acoustical quality, and suggests a graphic interface such as shown in Figure 5, or various types of multidimensional interfaces (as described further in section 4).

Figure 5: Higher-level user interface and general structure of the Spatialisateur (shown for one source signal). The user interface contains perceptual attributes for tuning the desired effect, as well as configuration parameters which are set at the beginning of a performance or work session, according to the reproduction format and setup.

A physically-based user interface according to the first approach will not allow direct and effective control of the sensation perceived by the listener [25]. Although localization can be directly controlled by a geometric interface, many aspects of room acoustical quality (such as envelopment or reverberance) will be affected by a change in the position of the source or the listener, in ways that are not easily predictable and depend on room geometry and wall absorption characteristics. On the other hand, adjustments of the room acoustical quality can only be achieved by modifying these geometry and absorption parameters, and the effects of such adjustments are often unpredictable or nonexistent. Additionally, a physically-based user interface will only allow the reproduction of physically realizable situations: source positions will be constrained by the geometry of the space and, even if the modelled room is imaginary, the laws of physics will limit the range of realizable acoustical qualities. For instance, in a room of a given shape, modifying wall absorption coefficients in order to obtain a longer reverberation decay will cause an increase in the reverberation level at the same time.
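The coupling between decay time and reverberation level in this last example can be made explicit with classical diffuse-field (Sabine) theory, which is brought in here as an assumption and is not part of the text. For a room of volume V, equivalent absorption area A (the sum of wall surfaces weighted by their absorption coefficients), a source of acoustic power P and sound speed c:

```latex
T_{60} \approx \frac{0.161\, V}{A},
\qquad
w_{\mathrm{rev}} \approx \frac{4P}{cA}
\quad\Longrightarrow\quad
w_{\mathrm{rev}} \approx \frac{4P}{0.161\, c\, V}\; T_{60}
```

Here T_60 is the reverberation time in seconds (V in cubic metres, A in square metres) and w_rev is the steady-state reverberant energy density. Within this approximation, and at fixed volume, reducing the absorption lengthens the decay and raises the reverberant level in the same proportion, so the two cannot be set independently through a physical model.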

In contrast to a physical approach, a perceptual approach leads to a more intuitive and effective user interface because the control parameters are directly related to audible sensations. It also leads to a more efficient implementation of the control process which dynamically updates low-level signal processing parameters according to higher-level user interface settings. In a physical approach, the room simulation process which is necessary for updating the arrival time, spectrum and direction of each early reflection whenever the source or the listener moves is computationally heavy, unless the room model is restricted to very simple geometries such as rectangular rooms [20, 8].

Finally, a perceptual approach will allow a more efficient implementation of the digital signal processing algorithm itself, i.e. the 'number-crunching' which must be cyclically performed for each input sound sample merely in order to produce the output sound signals. This 'number-crunching' must be performed whether or not sources move or room parameters are modified, which differs from image synthesis, where computations are necessary only if light sources or reflective objects change in position, shape or color. For spatial sound processing, focusing on the control of perceptual attributes rather than mimicking physical phenomena will allow drastic improvements in the computational efficiency of the signal processing algorithms.

Signal processing modules

The Spatialisateur was developed in the Max/FTS object-oriented signal processing software environment [26], and is implemented as a Max signal-processing object (named Spat) running in real time on the Ircam Music Workstation. Spat can also be considered as a library of elementary modules for real-time spatial processing of sounds (panpots, artificial reverberators, parametric equalizers...). This modularity allows one to configure a spatial processor for different applications or with different computational costs, depending on the reproduction format or setup, the desired flexibility in controlling the room effect, and the available digital signal processing resources. As shown in Figure 5, the Spat processor is formed by a cascade of four configurable sub-modules, namely Source, Room, Pan and Out. A Spat module is configured in a straightforward way by substituting one version of a sub-module for another.
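The cascade structure and the substitution of sub-module versions can be sketched in a few lines. This is an illustrative sketch only, not the Max/FTS implementation: the `cascade` helper and all the toy stand-in functions below are hypothetical, and each one performs a deliberately trivial operation in place of the real Source, Room, Pan and Out processing.

```python
from typing import Callable, List, Sequence

Signal = List[List[float]]            # one list of samples per channel
Module = Callable[[Signal], Signal]   # every sub-module maps a signal to a signal

def cascade(modules: Sequence[Module]) -> Module:
    """Chain sub-modules so that each one feeds the next."""
    def process(signal: Signal) -> Signal:
        for module in modules:
            signal = module(signal)
        return signal
    return process

# Toy stand-ins (hypothetical, for illustration only).
def source_gain(signal: Signal) -> Signal:
    return [[0.5 * x for x in channel] for channel in signal]

def room_passthrough(signal: Signal) -> Signal:
    return signal

def pan_stereo(signal: Signal) -> Signal:
    return [list(signal[0]) for _ in range(2)]

def pan_quad(signal: Signal) -> Signal:
    return [list(signal[0]) for _ in range(4)]

def out_identity(signal: Signal) -> Signal:
    return signal

# Two configurations that differ only in the version of the Pan stage.
spat_stereo = cascade([source_gain, room_passthrough, pan_stereo, out_identity])
spat_quad = cascade([source_gain, room_passthrough, pan_quad, out_identity])
```

Because every stage shares the same signal-in, signal-out shape, replacing one version of a stage by another leaves the rest of the chain untouched.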

The Room module is a computationally efficient multi-channel reverberator based on multi-channel feedback delay networks and designed to ensure the necessary naturalness and accuracy for professional audio or virtual reality applications [19, 9]. The input signal (assumed devoid of reverberation) can be pre-processed by the Source module, which can include a low-pass filter and a variable delay line to reproduce the air absorption and the Doppler effect, as well as spectral equalizers allowing additional corrections according to the nature of the input signal. The Room module can be broken down into elementary reverberation modules (e.g. an early reflection module or a late reverberation module), which allows various processing structures to be built, such as those of Figure 1 or Figure 4. The reverberation modules exist in several versions with different numbers of feedback channels, so that computational efficiency can be traded off against the time or frequency density of the artificial reverberation [18, 19].
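For readers unfamiliar with feedback delay networks, the principle can be sketched as follows. This is a minimal generic example under stated assumptions, not the Room module itself: the function name, the four delay lengths, the sample rate and the choice of a Householder feedback matrix are all hypothetical, and the attenuation is frequency-independent, whereas a practical reverberator would use absorbent filters.

```python
def fdn_impulse_response(delays, rt60_s, sample_rate, num_samples):
    """Impulse response of a small feedback delay network (mono in, mono out)."""
    n = len(delays)
    # Attenuation per delay line, proportional (in dB) to its length, so that
    # the recirculating signal decays by 60 dB in rt60_s seconds.
    gains = [10.0 ** (-3.0 * d / (rt60_s * sample_rate)) for d in delays]
    buffers = [[0.0] * d for d in delays]   # circular delay lines
    pos = [0] * n
    out = []
    for t in range(num_samples):
        x = 1.0 if t == 0 else 0.0          # unit impulse input
        # Read the oldest sample of each line and attenuate it.
        taps = [gains[i] * buffers[i][pos[i]] for i in range(n)]
        out.append(sum(taps))
        # Lossless Householder feedback matrix: A = I - (2/n) * ones(n, n).
        shared = 2.0 * sum(taps) / n
        for i in range(n):
            buffers[i][pos[i]] = x + taps[i] - shared
            pos[i] = (pos[i] + 1) % delays[i]
    return out

# Hypothetical settings: four mutually prime delays, 0.5 s decay, 8 kHz rate.
ir = fdn_impulse_response([149, 211, 263, 293], rt60_s=0.5,
                          sample_rate=8000, num_samples=8000)
```

Using more delay lines increases the echo and modal density at a higher cost per sample, which is the kind of trade-off referred to above.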

The multichannel output of the Room module is directly compatible with reproduction of frontal sounds in the 3/2-stereo format. The directional distribution module Pan can then convert this multi-channel output to various reproduction formats or setups, while providing control over the perceived direction of the sound event (currently limited to the horizontal plane and the upper hemisphere). It can be configured for two-channel formats, including three-dimensional stereophony (binaural or transaural) over headphones or over a pair of loudspeakers, and the simulation of coincident or non-coincident stereo microphone recordings [9]. Multi-channel formats, appropriate for studios or concert auditoria, allow reproduction over 4 to 8 loudspeakers in various 2-D or 3-D arrangements (the structure of the Pan module can easily be extended to a higher number of channels if necessary).

The reproduced effect is specified perceptually in the higher-level control interface irrespective of the reproduction context, and is, as much as possible, preserved from one reproduction mode or listening room to another. Generally speaking, the Out module can be used as a 'decoder' for adapting the multi-channel output of the Pan module to the geometry or acoustical response of the loudspeaker system or headphones, including spectral and time delay correction of each output channel. In a typical multichannel reproduction setup, this can be used to equalize the direct paths from all loudspeakers to a reference listening position. However, in addition to these direct paths, the listening room will provide its own reflections and reverberation, which will also affect the listener's perception of the sounds delivered to the loudspeakers. To correct the temporal effects of the listening room reverberation, the high-level control processing module includes a new algorithm that corrects the reverberation synthesized by the Room module, so that the perceived effect at a reference listening position is as close as possible to the specification defined by the higher-level user interface. Under certain restrictive conditions, this compensation process makes it possible, for instance, to simulate the acoustics of a given room in another one, with recorded or live sound sources.

4. Applications and future work

Signal processing architectures for professional audio or interactive multimodal interfaces

Even in the most computationally demanding reproduction modes, such as binaural and transaural stereophony, a 'high-fidelity' implementation of Spat requires less than 400 multiply-accumulates per sample period at a rate of 48 kHz. This corresponds to less than 20 million operations per second, which can be handled by a single programmable digital signal processor (such as the Motorola DSP56002 or Texas Instruments TMS320C40). It is thus economically feasible to insert a full spatial processor (including both directional panning and reverberation) in each channel of a digital mixing console, by devoting one DSP to each source channel. The mix can be produced in traditional as well as currently developing formats, including conventional stereo, 3/2-stereo, or three-dimensional two-channel stereo (transaural stereo). This increased processing capacity might call for a new generation of user interfaces for studio recording and computer music applications: providing a reduced set of independent perceptual attributes for each virtual source, as discussed in this paper, is a promising approach from the point of view of ergonomics.
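The processing load quoted above follows from a simple product, reproduced here as a short check. The constant names are chosen for illustration; the two figures are the ones given in the text.

```python
MACS_PER_SAMPLE = 400      # upper bound quoted in the text
SAMPLE_RATE_HZ = 48_000    # sample rate quoted in the text

# Multiply-accumulates needed per second of audio for one source channel.
macs_per_second = MACS_PER_SAMPLE * SAMPLE_RATE_HZ
print(f"{macs_per_second / 1e6:.1f} million multiply-accumulates per second")
```

The result, 19.2 million, is the bound behind the "less than 20 million operations per second" figure.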

Spatial sound processors for virtual reality and multimedia applications (video games, simulation, teleconference, etc.) also rely on a real-time mixing architecture and can benefit substantially from the reproduction of a natural-sounding room effect allowing effective control of the subjective distance of sound events. Many of these applications involve the simulation of several sources located in the same virtual room, which makes it possible to reduce the overall signal and control processing cost by using a single late reverberation module (Figure 4). It is possible to further reduce this cost in applications which can accommodate a less refined reproduction and control of the room effect (e.g. in video games or 'augmented reality' applications where an artificial sensation of distance must be controlled, but controlling the room signature is of lesser importance). Binaural reproduction over headphones is particularly well suited to such applications, and can be combined with image synthesis in order to immerse a spectator in a virtual environment. The Spatialisateur is designed to allow remote control through pointing or tracking devices and ensure a high degree of interactivity, with localization update rates of up to 33 Hz (fast enough for video synchronization or operation with a head-tracker). An alternative reproduction environment for simulators is a booth equipped with a multichannel loudspeaker system (such as the 'Audiosphere' designed by Espaces Nouveaux). Future directions for research include modeling of individual differences in HRTFs and individual equalization of binaural recordings, as well as improved methods for multichannel reproduction over a wide listening area.

Live computer music performances and architectural acoustics

The perceptual approach adopted in the Spatialisateur project allows the composer to immediately take spatial effects into account at the early stages of the composition process, without referring to a particular electro-acoustical apparatus or performing space. Executing the spatial processing in real time during the concert performance allows specific corrections according to a particular reproduction setup and context. Localization effects, now traditionally manipulated in electro-acoustic music, can thus be more reliably preserved from one performance to another. Spatial reverberation processing allows more convincing illusions of remotely located virtual sound sources and helps conceal the acoustic signature of the loudspeakers, for a wider listening area. It thus makes it possible to achieve a better continuity between live sources and synthetic or pre-recorded signals, which is a significant issue, e.g. in the field of computer music [27, 25].

Consequently, a computer music work need not be written for a specific number of loudspeakers in a specific geometrical arrangement. As an illustration, consider an electroacoustic music piece composed in a personal studio equipped with four loudspeakers. Rather than producing a four-channel mix to be used in all subsequent concert performances, a score describing all spatial effects applied to each sound source can be stored using music sequencer software. By reconfiguring the signal processing structure (i.e. substituting adequate versions of the Pan and Out modules), an adequate mix can then be produced for a concert performance over 8 channels, or for a transaural stereo recording preserving three-dimensional effects in domestic playback over two loudspeakers.

The Spatialisateur can be used for designing an electro-acoustic system that modifies the acoustical quality of an existing room, for sound reinforcement or reverberation enhancement, with live sources or pre-recorded signals. For relatively large audience areas (e.g. large concert halls or multipurpose halls), the signal processing structure can be re-configured specifically for a particular situation (by connecting sub-modules of Spat), according to a division of the audience and/or the stage area into adjacent zones, in order to ensure effective control of the perceptual attributes related to the direct sound and the early reflections, for all listeners.

Musical and multidimensional interfaces

Spatial attributes of sounds can thus be manipulated as natural extensions of the musical language. The availability of perceptually relevant attributes for describing the room effect can encourage the composer to manipulate room acoustical quality, in addition to the localization of sound events, as a musical parameter [28, 25]. In one approach (initiated by Georges Bloch in 1993 using an early Spatialisateur prototype), the spatial processor's score can be recorded in successive "passes". During each pass, manipulations of spatial attributes of one or several sound sources can be added to the score and monitored in real-time simultaneously with previously stored effects.

This is similar to operating an automation system in a mixing console, with the added ability to manipulate room acoustical parameters that are not available in traditional mixing console automation systems. In this approach, it is important that the control parameters be mutually independent, i.e. that the manipulation of a spatial attribute must not destroy or modify the perceived effect of previously stored manipulations of other spatial attributes (except possibly in extreme or obviously predictable cases: for instance, suppressing the room presence will make adjustments of the late reverberance imperceptible). For operational efficiency, it is also important that the perceived effect of each parameter be predictable, particularly when it is desired to edit the score or write it directly without real-time monitoring. As discussed in this paper, such modes of operation would be much more difficult in a physically-based approach, or with current mixing architectures and reverberation units.

Besides a sequencing or automation process, another approach for creating simultaneous variations of several spatial attributes for one or several virtual sound sources consists of defining a mapping from a graphic or gestural interface to the multidimensional representation defined by a set of perceptually relevant control parameters: azimuth and/or elevation, together with a set of perceptual factors of the room acoustical quality. A simple example was included in the higher-level interface of the Spatialisateur in order to allow straightforward connection to a two-dimensional localization control interface delivering polar coordinates (azimuth and distance): the 'distance' control is mapped logarithmically to the 'source presence' perceptual factor (the 'drop' parameter defines the drop of presence in dB for a doubling of the distance, so that setting 'drop' to 6 dB emulates the natural attenuation of a sound with distance). This provides a simple and effective way of connecting the Spatialisateur to graphic or gestural interfaces, in order to create three-dimensional sound trajectories on various reproduction formats or setups, or draw a map of a virtual sound scene with several sources at different positions in the horizontal plane around the listener.
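The logarithmic mapping from distance to source presence can be written as a one-line formula. The sketch below is an interpretation of the description above, not the Spatialisateur's actual code: the function name, the reference distance at which the presence offset is 0 dB, and the sample trajectory are assumptions made for illustration.

```python
import math

def presence_db(distance, drop_db=6.0, ref_distance=1.0):
    """Source presence offset in dB: falls by drop_db for each doubling of distance.

    ref_distance (hypothetical) is the distance at which the offset is 0 dB.
    """
    if distance <= 0.0:
        raise ValueError("distance must be positive")
    return -drop_db * math.log2(distance / ref_distance)

# A trajectory given in polar coordinates (azimuth in degrees, distance),
# converted to (azimuth, source presence) control pairs.
trajectory = [(0.0, 1.0), (45.0, 2.0), (90.0, 4.0)]
controls = [(azimuth, presence_db(distance)) for azimuth, distance in trajectory]
```

With 'drop' set to 6 dB, the three points above yield presence offsets of 0, -6 and -12 dB. Other values of 'drop' make the perceived distance change faster or slower than in natural free-field attenuation.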

This mapping principle can of course be implemented in many ways, with various types of multidimensional interfaces, allowing a wide variety of creative effects. Because of the nature of the multidimensional scaling analysis process which led to the definition of the perceptual factors [23, 24], these can be considered as coordinates in an orthonormal basis, which makes it possible to define a norm (in the mathematical sense) for measuring the perceptual dissimilarity between several acoustical qualities. It follows that linear weighting along one perceptual factor or a set of perceptual factors provides a general and perceptually relevant method for interpolating among different acoustical qualities [25]. For instance, it allows implementing a gradual and natural-sounding transition from the sensation of listening to a singer from 20 meters away on the balcony of an opera house to the sensation of being 3 meters behind the singer in a cathedral (based possibly on acoustical impulse response measurements made in two existing spaces), without having to implement a debatable geometrical and physical 'morphing' process between the two situations.
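The interpolation and the dissimilarity measure described above can be sketched as follows. This is an illustration under stated assumptions: the factor values are invented, only three factors named in this paper are used, the Euclidean norm is taken as the norm, and the function names are hypothetical.

```python
import math

def interpolate(quality_a, quality_b, t):
    """Linear weighting between two acoustical qualities, with t in [0, 1]."""
    return {k: (1.0 - t) * quality_a[k] + t * quality_b[k] for k in quality_a}

def dissimilarity(quality_a, quality_b):
    """Euclidean distance between two qualities in the perceptual factor space."""
    return math.sqrt(sum((quality_a[k] - quality_b[k]) ** 2 for k in quality_a))

# Invented factor values for two situations (arbitrary perceptual units).
opera_balcony = {"source_presence": -6.0, "room_presence": 2.0, "late_reverberance": 1.0}
cathedral = {"source_presence": 0.0, "room_presence": 10.0, "late_reverberance": 1.0}

# Eleven equally spaced steps from one situation to the other.
steps = [interpolate(opera_balcony, cathedral, i / 10.0) for i in range(11)]
halfway = steps[5]
```

If the factors do behave as orthonormal coordinates, equal steps in `t` correspond to equal perceptual distances along the transition, which is what makes the linear weighting perceptually meaningful.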

This opens the way to research on new multidimensional interfaces for music and audio components of virtual reality in various fields. An additional direction of research is the extension of the perceptual control formalism to spaces such as small rooms, chambers, corridors or outdoor spaces. In the current implementation of the Spatialisateur, such spaces can be dealt with by manipulating, in addition to the higher-level perceptual factors, some lower-level processing parameters available in the user interface of the Room module.


Spatialisateur technology is covered by issued and pending patents. The perceptual approach adopted in this project results from research on the characterization of room acoustical quality directed by Jean-Pascal Jullien and Olivier Warusfel. Research on binaural synthesis was carried out in collaboration with France Télécom (Centre National d'Etudes des Télécommunications), and includes contributions by Martine Marin and Véronique Larcher. Musical / graphical user interfaces were developed in collaboration with Georges Bloch, Tom Mays and Gerhard Eckel.


[1] J. Blauert: Spatial Hearing: the Psychophysics of Human Sound Localization; Cambridge MIT Press, 1983.

[2] D. Begault: 3-D Sound for virtual reality and multimedia; Academic Press, 1994.

[3] M. Cohen, E. Wenzel: The design of multidimensional sound interfaces; Technical Report 95-1-004, Human Interface Laboratory, Univ. of Aizu, 1995.

[4] M. Gerzon: Ambisonics in multichannel broadcasting and video; J. Audio Engineering Society, vol. 33, no. 11, 1985.

[5] J. Chowning: The simulation of moving sound sources; J. Audio Engineering Society, vol. 19, no. 1, 1971.

[6] G. Thiele: The new sound format '3/2-stereo'; Proc. 94th Conv. Audio Engineering Society, preprint 3550a, 1993.

[7] A. Persterer: A very high performance digital audio processing system; Proc. 13th International Conf. on Acoustics (Belgrade), 1989.

[8] S. Foster, E. M. Wenzel, R. M. Taylor: Real-time synthesis of complex acoustic environments; Proc. IEEE Workshop on Applications of Digital Signal Processing to Audio and Acoustics, 1991.

[9] J.-M. Jot, V. Larcher, O. Warusfel: Digital signal processing issues in the context of binaural and transaural stereophony; Proc. 98th Conv. Audio Engineering Society, preprint 3980, 1995.

[10] D. H. Cooper, J. L. Bauck: Prospects for transaural recording; J. Audio Engineering Society, vol. 37, no. 1/2, 1989.

[11] M. A. Casey, W. G. Gardner, S. Basu: Vision steered beam-forming and transaural rendering for the artificial life interactive video environment (ALIVE); Proc. 99th Conv. Audio Engineering Society, preprint 4052, 1995.

[12] J. L. Bauck, D. H. Cooper: Generalized transaural stereo; Proc. 93rd Conv. Audio Engineering Society, preprint 3401, 1992.

[13] M. R. Schroeder: Natural-sounding artificial reverberation; J. Audio Engineering Society, vol. 10, no. 3, 1962.

[14] J. A. Moorer: About this reverberation business; Computer Music Journal, vol. 3, no. 2, 1979.

[15] J. Stautner, M. Puckette: Designing multi-channel reverberators; Computer Music Journal, vol. 6, no. 1, 1982.

[16] G. Kendall, W. Martens, D. Freed, D. Ludwig, R. Karstens: Image-model reverberation from recirculating delays; Proc. 81st Conv. Audio Engineering Society, preprint 2408, 1986.

[17] D. Griesinger: Practical processors and programs for digital reverberation; Proc. 7th Audio Engineering Society International Conf., 1989.

[18] J.-M. Jot, A. Chaigne: Digital delay networks for designing artificial reverberators; Proc. 90th Conv. Audio Engineering Society, preprint 3030, 1991.

[19] J.-M. Jot: Etude et réalisation d'un spatialisateur de sons par modèles physiques et perceptifs; Doctoral dissertation, Télécom Paris, 1992.

[20] F. R. Moore: A general model for spatial processing of sounds; Computer Music Journal, vol. 7, no. 6, 1983.

[21] W. G. Gardner: Efficient convolution without input-output delay; J. Audio Engineering Society, vol. 43, no. 3, 1995.

[22] A. Reilly, D. McGrath: Convolution processing for realistic reverberation; Proc. 98th Conv. Audio Engineering Society, preprint 3977, 1995.

[23] J.-P. Jullien, E. Kahle, S. Winsberg, O. Warusfel: Some results on the objective and perceptual characterization of room acoustical quality in both laboratory and real environments; Proc. Institute of Acoustics, vol. XIV, no. 2, 1992.

[24] J.-P. Jullien: Structured model for the representation and the control of room acoustical quality; Proc. 15th International Conf. on Acoustics, 1995.

[25] J.-P. Jullien, O. Warusfel: Technologies et perception auditive de l'espace; Les Cahiers de l'Ircam, vol. 5 "L'Espace", 1994.

[26] M. Puckette: Combining event and signal processing in the Max graphical programming environment; Computer Music Journal, vol. 15, no. 3, 1991.

[27] O. Warusfel: Etude des paramètres liés à la prise de son pour les applications d'acoustique virtuelle; Proc. 1st French Congress on Acoustics, 1990.

[28] G. Bloch, G. Assayag, O. Warusfel, J.-P. Jullien: Spatializer: from room acoustics to virtual acoustics; Proc. International Computer Music Conf., 1992.
