Natural auditory scene statistics shapes human spatial hearing

Cesare V. Parise, Katharina Knorre, and Marc O. Ernst
Edited by Dale Purves, Duke University, Durham, NC, and approved March 10, 2014 (received for review December 5, 2013)
Human perception, cognition, and action are laced with seemingly arbitrary mappings. In particular, sound has a strong spatial connotation: Sounds are high and low, melodies rise and fall, and pitch systematically biases perceived sound elevation. The origins of such mappings are unknown. Are they the result of physiological constraints, do they reflect natural environmental statistics, or are they truly arbitrary? We recorded natural sounds from the environment, analyzed the elevation-dependent filtering of the outer ear, and measured frequency-dependent biases in human sound localization. We find that auditory scene statistics reveals a clear mapping between frequency and elevation. Perhaps more interestingly, this natural statistical mapping is tightly mirrored in both ear-filtering properties and perceived sound location. This suggests that both sound localization behavior and ear anatomy are fine-tuned to the statistics of natural auditory scenes, likely providing the basis for the spatial connotation of human hearing.

frequency–elevation mapping | head-related transfer function | Bayesian modeling | cross-modal correspondence

Significance

Auditory pitch has an intrinsic spatial connotation: Sounds are high or low, melodies rise and fall, and pitch can ascend and descend. In a wide range of cognitive, perceptual, attentional, and linguistic functions, humans consistently display a positive, sometimes absolute, correspondence between sound frequency and perceived spatial elevation, whereby high frequency is mapped to high elevation. In this paper we show that pitch borrows its spatial connotation from the statistics of natural auditory scenes. This suggests that such diverse phenomena as the convoluted shape of the outer ear, the universal use of spatial terms for describing pitch, and the convention of representing high notes higher in musical notation ultimately reflect adaptation to the statistics of natural auditory scenes.

Author contributions: C.V.P. and M.O.E. designed research; C.V.P. and K.K. performed research; C.V.P. analyzed data; C.V.P. and M.O.E. performed statistical modeling; and C.V.P. and M.O.E. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

¹To whom correspondence may be addressed. E-mail: [email protected] or [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1322705111/-/DCSupplemental.
The spatial connotation of auditory pitch is a universal hallmark of human cognition. High pitch is consistently mapped to high positions in space in a wide range of cognitive (1–3), perceptual (4–6), attentional (7–12), and linguistic functions (13), and the same mapping has been consistently found in infants as young as 4 mo of age (14). In spatial hearing, the perceived spatial elevation of pure tones is almost fully determined by frequency, rather than physical location, in a very systematic fashion [i.e., the Pratt effect (4, 5)]. Likewise, most natural languages use the same spatial attributes, high and low, to describe pitch (13), and throughout the history of musical notation high notes have been represented high on the staff. However, a comprehensive account of the origins of the spatial connotation of auditory pitch is still missing. More than a century ago, Stumpf (13) suggested that it might stem from the statistics of natural auditory scenes, but this hypothesis has never been tested. This is a major omission, as the frequency–elevation mapping often leads to remarkable inaccuracies in sound localization (4, 5) and can even trigger visual illusions (6), but it can also lead to benefits such as reduced reaction times or improved detection performance (7–12).
Results

To trace the origins of the mapping between auditory frequency and perceived vertical elevation, we first measured whether this mapping is already present in the statistics of natural auditory signals. When trying to characterize the statistical properties of incoming signals, it is critical to distinguish between distal stimuli, the signals as they are generated in the environment, and proximal stimuli, the signals that reach the transducers (i.e., the middle and inner ear). In the case of auditory stimuli this distinction is especially important, because the head and the outer ear operate as frequency- and elevation-dependent filters (15), which modulate the spectra of the sounds reaching the middle ear as a function of the elevation of the sound source relative to the observer (the head-related transfer function, HRTF). Notably, the structure of the peaks and notches produced by the HRTF on the spectra of the incoming signals is known to provide reliable cues for auditory localization in the medial plane (16). We therefore looked for the existence of a frequency–elevation mapping (FEM) in the statistics of natural auditory scenes and in the filtering properties of the outer ear. Hence, we effectively measured the mapping between frequency and elevation in both the distal and the proximal stimuli.

To look for the existence of an FEM in the natural acoustic environment, we recorded a large sample of environmental sounds (∼50,000 recordings, 1 s each) by means of two directional microphones mounted on the head of a human freely moving indoors and outdoors in urban and rural areas (around Bielefeld, Germany). Overall, the recordings revealed a consistent mapping between the frequency of sounds and the average elevation of their sources in the external space [F(5, 57,859) = 35.8, P < 0.0001; Methods], which was particularly evident in the middle range of the spectrum, between 1 and 6 kHz (Fig. 1C, Upper). That is, high-frequency sounds tend to originate from elevated sources in natural auditory scenes. We can only speculate about the origins of this mapping: it could be that more high-frequency energy is generated at higher elevations (e.g., leaves on trees rustle in a higher frequency range than footsteps on the floor), or that the absorption of the ground is frequency dependent in a way that filters out more of the high-frequency spectrum.

To look for the existence of an FEM in the filtering properties of the ear, we analyzed a set of 45 HRTFs [the CIPIC database (17); Methods and Fig. S1], and again found a clear mapping between frequency and elevation [F(5, 264) = 216.6, P < 0.0001; Fig. 1C, Lower]. That is, due to the filtering properties of the outer ear, sounds coming from high (head-centered) elevations have more energy at high frequencies.
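The band-wise statistic above reduces to a simple computation. The following is a minimal sketch, assuming a table with one dominant-band label and one source-elevation estimate per recording; the function and variable names are hypothetical, as the raw recordings are not reproduced here.

```python
import numpy as np
from scipy.stats import f_oneway

# Frequency bands used in the paper (kHz)
BANDS = ["<0.8", "0.8-1.4", "1.4-2.5", "2.5-4.5", "4.5-8", ">8"]

def band_elevation_mapping(band_idx, elevation_deg):
    """Mean source elevation (and SEM) per frequency band, plus a one-way
    ANOVA over bands, analogous to the reported F(5, 57,859) test.
    band_idx: int array (0-5), one entry per recording;
    elevation_deg: float array of source elevations, same length."""
    groups = [elevation_deg[band_idx == b] for b in range(len(BANDS))]
    means = np.array([g.mean() for g in groups])
    sems = np.array([g.std(ddof=1) / np.sqrt(len(g)) for g in groups])
    F, p = f_oneway(*groups)  # one-way ANOVA across the six bands
    return means, sems, F, p
```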
[Fig. 1 graphics. Panels A and B plot localization responses and biases against band-pass frequency (kHz; bands <0.8, 0.8–1.4, 1.4–2.5, 2.5–4.5, 4.5–8, >8, and white noise). Panel C, "Statistical frequency-elevation mapping," shows the environmental and HRTF (ear filtering, proximal) mappings; panel D, "Estimated frequency-elevation priors"; panel E, "Priors for elevation," with world-centred and head-centred reference frames; panel F reports correlations (ρ) between estimated priors and measured mappings: environment vs. world-centred prior .84 (.73–.93), environment vs. head-centred prior .68 (.59–.77), HRTF vs. world-centred prior .76 (.61–.84), HRTF vs. head-centred prior .90 (.84–.94).]
Fig. 1. (A) Average endpoint of pointing responses for the various frequency bands (column) and body tilts (row). The filled points correspond to the average
responses; the thin gray grid represents the actual position of the stimuli. Colors represent tilt (green = 0°, brown = 45°, red = 90°). (B) Frequency-dependent
bias (±SEM) in sound localization in head-centered elevation (Upper) and azimuth (Lower). The magnitude of the frequency-dependent elevation biases was
only mildly affected by body tilt, reflecting the contribution of a frequency–elevation mapping encoded in head-centered coordinates. The frequency-
dependent azimuth biases increase in magnitude with increasing body-tilt angle reflecting the contribution of a frequency–elevation mapping encoded in
world-centered coordinates. (C) Statistical mapping (±SEM) between frequency and elevation recorded in the environment (Upper) and measured from the
HRTFs (Lower). The dashed lines represent the frequency–elevation mapping using nonbinned data. (D) Shapes of the estimated priors coding for the fre-
quency–elevation mapping in world-centered (Upper) and head-centered coordinates (Lower). Lightness within the panels represents the equal loudness
contour (International Organization for Standardization 226:2003): lighter gray represents higher sensitivity. (E) Schematic 1D representation of the model
illustrating the head- (magenta) and world-centered priors (cyan). (F) Correlation (and 95% confidence intervals) between the estimated priors and the
frequency–elevation mapping measured from the environment and the HRTFs. (C–F) Colors indicate the reference frame (magenta = head-centered; cyan =
world-centered).
These results demonstrate that an FEM is consistently present in the statistics of both proximal and distal stimuli. This suggests that the perceptual FEM might ultimately reflect a tuning of the human auditory system to the statistics of natural sounds.

Finally, we determined the correlation between the FEM measured in proximal and distal stimuli, and found a strong similarity between the two mappings (ρ = 0.79, interquartile range = 0.72–0.84). That is, the filtering properties of the external ear accentuate the FEM that is present in natural auditory scenes. One possible reason for this similarity is that the elevation-dependent filtering of the outer ear is set to maximize the transfer of naturally available information. This result parallels previous findings in human vision showing a high degree of similarity between the spectra of natural images and the optical transfer function of the eye (18). This might suggest that human spatial hearing is so finely tuned to the environment that even the filtering properties of the outer ear, and hence its convoluted anatomy, evolved to mirror the statistics of natural auditory scenes.

To investigate the relation between human performance and the FEM in proximal and distal stimuli, we asked participants to localize on a 2D plane (19) a set of narrowband (∼1.8-octave) auditory noises with different central frequencies (Movie S1). Sounds were played from a set of 16 speakers hidden behind a sound-transparent projection screen, arranged on a 4 × 4 grid subtending an angle of ∼30 × 30°. Participants were asked to indicate the perceived location of each sound with a cursor while their bodies were upright or tilted sideways (Methods). For the band-pass stimuli, elevation responses were virtually independent from the actual sound source location, and the reported elevation was almost entirely determined, in a very consistent way, by the frequency of the signals (Fig. 1A, Center). Notably, such biases showed a clear mapping between frequency and elevation (Fig. 1B), which was evident in both head- and world-centered coordinates (see also refs. 5, 11). Importantly, such localization biases were significantly correlated with the FEM present in proximal and distal stimuli (ρ = 0.76 for world-centered biases with the distal stimulus and ρ = 0.78 for head-centered biases with the proximal stimulus; see SI Text). Consistent with previous studies (21, 22), we also found moderate but consistent frequency-dependent biases in horizontal sound localization. These results demonstrate the existence of striking frequency- and body-orientation-dependent perceptual biases in sound localization. They also demonstrate the dependence of such biases on the statistics of natural auditory scenes and on the filtering properties of the outer ear.
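As a rough illustration of how such correlations can be computed, here is a minimal sketch that correlates two frequency–elevation mappings and bootstraps a 95% interval. Pairing the mappings by frequency bin, and the use of Spearman's ρ, are assumptions on my part; the paper does not spell out the exact procedure.

```python
import numpy as np
from scipy.stats import spearmanr

def fem_correlation(fem_a, fem_b, n_boot=10_000, seed=0):
    """Rank correlation between two frequency-elevation mappings sampled at
    matching frequency bins, with a bootstrap percentile interval."""
    rng = np.random.default_rng(seed)
    rho, _ = spearmanr(fem_a, fem_b)
    boot = []
    for _ in range(n_boot):
        i = rng.integers(0, len(fem_a), size=len(fem_a))  # resample bins
        r, _ = spearmanr(fem_a[i], fem_b[i])
        boot.append(r)
    return rho, np.percentile(boot, [2.5, 97.5])
```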
However, it is not immediately obvious why there is such a high degree of correspondence between the behavioral biases found in sound localization and the statistical mappings found both in the environment and in the filtering properties of the ear. To better understand this close correspondence we need a generative model. Recently, the Bayesian approach has been successfully used for developing such generative models, and in particular for describing the effects of stimulus statistics on perceptual judgments (23–27). In Bayesian terms, the frequency dependency of sound source location can be modeled as a prior distribution $p_f(s)$ representing the probability of a sound source $s$ of a given frequency $f$ occurring at some given 2D spatial location $s = (s_x, s_y)$. Based on the measured statistics of natural auditory scenes, the filtering properties of the ear, and the biases in sound localization, we postulated the existence of two distinct mappings between frequency and elevation, respectively coding the expected elevation of sounds as a function of the frequency spectrum in either head- or world-centered coordinates. Therefore, we modeled two frequency-dependent priors for elevation, one head-centered and the other world-centered. This model would involve a mechanism dedicated to the extraction and combination of relevant spectral cues from the proximal stimulus (such as the frequencies with more energy), and mapping the result to certain head- and world-centered elevations. For simplicity, such priors were modeled as Gaussian distributions, whose means represent the expected elevation given the spectrum of the incoming signal (Fig. 1E). Given that participants had to localize the same sounds under different body tilts (thereby dissociating head- and world-centered coordinates), we can look for similarities (i.e., correlation) between the shapes of such perceptual mappings and the ones that we measured from both the statistics of the acoustic environment and from the HRTFs. Notably, both estimated priors significantly correlated with the statistical mappings present in proximal and distal stimuli (i.e., the maximum of the frequency spectra against spatial elevation) (Fig. 1F). However, the head-centered prior was more correlated with the FEM measured from the filtering properties of the outer ear, whereas the world-centered prior was more correlated with the FEM present in environmental sounds. These results demonstrate that the perceptual FEM in humans jointly depends on the statistics of natural auditory scenes and on the filtering properties of the outer ear.
Discussion

Previous studies have already hypothesized the grounding of cross-dimensional sensory correspondences in the statistics of incoming stimuli (13, 28). None of them, however, directly measured how such mappings relate to the statistical properties of the stimuli. Our results demonstrate that an FEM is already present in the statistics of both the proximal and the distal stimuli. Moreover, we demonstrate that the perceptual FEM is in fact a twofold mapping, which separately encodes the statistics of natural auditory scenes and the filtering properties of the outer ear in different frames of reference. Interestingly, this finding provides further support for the role of vestibular and proprioceptive information in sound localization (29). These results also highlight the possibility of using sound spectral frequency to simulate the vertical elevation of sound sources.

The pervasiveness of the FEM in the statistics of the stimuli readily explains why previous research found this mapping to be absolute (5, 30) (i.e., each frequency is related to exactly one elevation), universal (3, 13) (cross-cultural and language independent), and already present in early infancy (14); and it argues against interpretations of cross-dimensional sensory correspondences in terms of "weak synesthesia" (9). The mapping between pitch and elevation, also reflected in musical notation and in the lexicon of most natural languages (13), has often been considered a metaphorical mapping (6, 31), and cross-sensory correspondences have been theorized to be the basis for language development (32). The present findings demonstrate that, at least in the case of the FEM, such a metaphorical mapping is indeed embodied and based on the statistics of the environment, hence raising the intriguing hypothesis that language itself might be rooted in the statistics of the natural environment.
Methods

Analysis of the HRTF. The CIPIC HRTF database (17) includes the transfer functions produced by the outer ears of 45 humans for 71 frequency channels (linearly spaced between 0.66 and 16.1 kHz), recorded from 50 elevations (range −45° to 230°). The elevation mapped to each frequency channel was calculated from each individual HRTF as the elevation with the highest transfer value (dB) for that particular frequency channel (28), for sounds coming from the midsagittal plane (Fig. S1).
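A minimal sketch of that computation, assuming the midsagittal transfer functions have already been loaded into a NumPy array (the CIPIC database itself ships as MATLAB files, so the array names and shapes here are assumptions):

```python
import numpy as np

def fem_from_hrtf(transfer_db, elevations_deg):
    """transfer_db: (n_subjects, n_elevations, n_freq_channels) gains in dB
    for midsagittal sources; elevations_deg: (n_elevations,) in [-45, 230].
    Returns the across-subject mean (and SEM) of the elevation with maximum
    transfer for each frequency channel."""
    # For each subject and frequency channel, pick the elevation of max gain.
    best = elevations_deg[np.argmax(transfer_db, axis=1)]  # (n_subjects, n_freqs)
    sem = best.std(axis=0, ddof=1) / np.sqrt(best.shape[0])
    return best.mean(axis=0), sem
```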
Psychophysical Task. Ten healthy observers with normal audition and normal or corrected-to-normal vision took part in the experiment (six females; mean age 25 y, range 21–33). All of them were students or employees at the University of Bielefeld and provided written informed consent before participating. The study was conducted in accordance with the Declaration of Helsinki and had ethical approval from the ethics committee of the University of Tübingen.
Each observer's head was fixed 130 cm away from a sound-transparent projection screen (220 × 164 cm) mounted in front of a set of 16 speakers (Fig. S2). On each trial, one of the speakers played a 300-ms band-pass noise (band-pass, kHz: <0.8; 0.8–1.4; 1.4–2.5; 2.5–4.5; 4.5–8; >8; or white noise; Movie S1). Participants were instructed to indicate where they had heard the stimulus come from, using a cursor projected on the screen in front of the speakers. Localization was visually guided (closed loop) and temporally unconstrained. When participants were satisfied with the position of the cursor, they pressed a touchpad to submit their response. In different blocks (in a counterbalanced order), we tilted participants' bodies (0°, 45°, or 90° counterclockwise; Fig. S2) with respect to the gravitational vertical using custom-built chairs that kept the line of sight aligned with the center of the screen, without covering the ears. Each combination of stimulus frequency, tilt, and spatial location was repeated 4 times (1,344 trials per participant).
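For illustration, a minimal sketch of generating such a stimulus; the filter order and the onset/offset ramps are assumptions, as the paper does not specify them:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def bandpass_noise(lo_hz, hi_hz, fs=44100, dur=0.3, ramp=0.01, order=4, seed=0):
    """300 ms of band-pass Gaussian noise with raised-cosine on/off ramps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(int(fs * dur))                    # white noise
    sos = butter(order, [lo_hz, hi_hz], btype="bandpass", fs=fs, output="sos")
    y = sosfilt(sos, x)                                       # band-limit it
    n = int(fs * ramp)
    env = 0.5 * (1 - np.cos(np.linspace(0, np.pi, n)))        # cosine ramp
    y[:n] *= env
    y[-n:] *= env[::-1]
    return y / np.max(np.abs(y))                              # peak-normalize

stim = bandpass_noise(2500, 4500)  # e.g., the 2.5-4.5 kHz band
```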
Before the localization task, the auditory stimuli were perceptually equalized in loudness using the method of adjustment, to prevent perceived loudness differences from affecting our results. That is, we used the white-noise stimulus as the standard, and participants adjusted the intensity of each band-pass stimulus until its loudness perceptually matched the standard. Each band-pass stimulus was adjusted six times, and we repeated the procedure with four participants. The gain factor used to equalize each stimulus was determined as the median value of all adjustments.
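The gain bookkeeping itself is minimal; a sketch, assuming the adjustments are collected in a bands × adjustments array (hypothetical names; six repeats × four participants gives 24 values per band in this design):

```python
import numpy as np

def equalization_gains(adjustments_db):
    """adjustments_db: (n_bands, n_adjustments) intensity gains, in dB, set by
    listeners to match each band-pass noise to the white-noise standard.
    Returns the per-band median gain applied in the main experiment."""
    return np.median(adjustments_db, axis=1)
```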
The experiment was conducted in a dark anechoic chamber and controlled by custom-built software based on the Psychtoolbox (34). Participants were tested in three sessions taking place on three consecutive days. Different body-tilt conditions were tested in separate blocks (4 blocks/d), with the order of the blocks counterbalanced within and across participants. Within each block, sounds with different frequencies and positions were presented in a pseudorandom fashion.
For each orientation and frequency, the localization bias was calculated separately for head-centered elevation and azimuth as the grand mean of the responses for each participant. The elevation bias (Fig. 1B, Upper) showed a main effect of frequency [F(5,45) = 11.564; P < 0.001], without significant effects of tilt [F(2,18) = 1.157, P = 0.337] or interactions [F(10,90) = 1.313; P = 0.235].
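A sketch of this repeated-measures analysis with statsmodels, assuming the per-participant cell means are stored in a long-format table with hypothetical column names:

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

def elevation_bias_anova(df: pd.DataFrame):
    """df columns: 'subject' (10 participants), 'frequency' (6 bands),
    'tilt' (0/45/90 deg), 'bias' (mean elevation bias per cell, in degrees).
    Returns the frequency x tilt repeated-measures ANOVA table, analogous to
    the F tests reported above."""
    return AnovaRM(df, depvar="bias", subject="subject",
                   within=["frequency", "tilt"]).fit()
```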
Modeling. The likelihood function, i.e., the distribution of the sensory estimate $\hat{s}$ given a sound source at location $s = (s_x, s_y)$, was modeled as a 2D Gaussian:

$$p_f(\hat{s} \mid s) = \mathcal{N}\!\left(s_\theta,\, \Sigma_{f,\theta}\right),$$

with mean $s_\theta = (s_x, s_y) \cdot R_\theta$ and covariance matrix $\Sigma_{f,\theta} = \begin{pmatrix} \sigma^2_{f,x} & 0 \\ 0 & \sigma^2_{f,y} \end{pmatrix} \cdot R_\theta$ (Fig. S3, Left). Assuming the likelihood to be encoded in head-centered coordinates, $R_\theta$ is a rotation matrix that rotates the axes according to the orientation of the body with respect to the gravitational vertical ($\theta$).

The expected elevation of a sound source of a given frequency spectrum can be modeled as a Gaussian a priori probability distribution, whose mean represents the expected location given the maximum of the frequency spectrum, and whose variance represents the uncertainty of the mapping. Given that we empirically measured an FEM in the filtering properties of the outer ear and in the statistics of natural auditory scenes, we assumed the existence of two independent priors encoding, respectively, the FEM in head- and world-centered coordinates.

In head-centered coordinates, the prior distribution $p_{hc,f}(s)$ for the location $s_{hc,f}$ of a sound with frequency $f$ is defined as a 2D Gaussian:

$$p_{hc,f}(s) = \mathcal{N}\!\left(s_{hc,f,\theta},\, \Sigma_{hc,f,\theta}\right),$$

with mean $s_{hc,f,\theta} = (0, s_{hc,f,y}) \cdot R_\theta$ and covariance matrix $\Sigma_{hc,f,\theta} = \begin{pmatrix} \infty & 0 \\ 0 & \sigma^2_{hc,f,y} \end{pmatrix} \cdot R_\theta$ (Fig. S3, second column). The mean $s_{hc,f,y}$ represents the expected spatial elevation and the variance $\sigma^2_{hc,f,y}$ the mapping uncertainty. For simplicity, we assumed no mapping between frequency and the head-centered left–right location of a sound source; therefore, the prior had a mean azimuth of zero and infinite variance (i.e., the prior is uninformative with respect to the head-centered azimuth).

In a similar fashion, the world-centered prior distribution $p_{wc,f}(s)$ for the location $s_{wc,f}$ of a sound with frequency $f$ is defined as a 2D Gaussian:

$$p_{wc,f}(s) = \mathcal{N}\!\left(s_{wc,f},\, \Sigma_{wc,f}\right),$$

with mean $s_{wc,f} = (0, s_{wc,f,y})$ and covariance matrix $\Sigma_{wc,f} = \begin{pmatrix} \infty & 0 \\ 0 & \sigma^2_{wc,f,y} \end{pmatrix}$ (Fig. S3, third column). The mean $s_{wc,f,y}$ represents the expected spatial elevation and the variance $\sigma^2_{wc,f,y}$ the mapping uncertainty. Again, the prior was made uninformative as to the world-centered azimuth location of the sound source.

The statistically optimal way to combine noisy sensory information with prior knowledge is described by Bayes' theorem, according to which the posterior $p_f(s \mid \hat{s})$ (Fig. S3, Right), on which the percept is based, is proportional to the product of the likelihood (i.e., the sensory information) and the priors (here, the FEMs):

$$p_f(s \mid \hat{s}) \propto p_{hc,f}(s) \cdot p_{wc,f}(s) \cdot p_f(\hat{s} \mid s).$$

Assuming all of the noise in the data to be due to sensory (as opposed to response-motor) noise (19), participants' responses would represent random samples of the posterior distribution $p_f(s \mid \hat{s})$. Therefore, given the psychophysical data it is possible to estimate the parameters of the model and, eventually, the shape of the internal FEMs. Using a maximum-likelihood procedure, we fitted these parameters to the pointing responses.
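To make the model concrete, here is a minimal numerical sketch of the posterior on a 2D grid. The infinite azimuth variances are implemented by simply omitting the corresponding terms from the log density, and all parameter values are illustrative placeholders rather than the fitted ones.

```python
import numpy as np

def posterior_map(s_true, theta_deg, sig_lik=(5.0, 8.0),
                  mu_hc=10.0, sig_hc=12.0, mu_wc=-5.0, sig_wc=12.0,
                  half_width=40.0, n=401):
    """Posterior over world-centered location for one stimulus and body tilt.
    s_true: (x, y) source location in degrees; theta_deg: body tilt."""
    th = np.deg2rad(theta_deg)
    x, y = np.meshgrid(np.linspace(-half_width, half_width, n),
                       np.linspace(-half_width, half_width, n))
    # World -> head coordinates: rotate the axes by the body tilt theta.
    xh = np.cos(th) * x + np.sin(th) * y
    yh = -np.sin(th) * x + np.cos(th) * y
    sxh = np.cos(th) * s_true[0] + np.sin(th) * s_true[1]
    syh = -np.sin(th) * s_true[0] + np.cos(th) * s_true[1]
    # Likelihood: Gaussian around the true location, in head coordinates.
    log_lik = -0.5 * (((xh - sxh) / sig_lik[0]) ** 2 +
                      ((yh - syh) / sig_lik[1]) ** 2)
    # Elevation-only priors (uninformative in azimuth, hence no x terms).
    log_hc = -0.5 * ((yh - mu_hc) / sig_hc) ** 2   # head-centered prior
    log_wc = -0.5 * ((y - mu_wc) / sig_wc) ** 2    # world-centered prior
    post = np.exp(log_lik + log_hc + log_wc)
    post /= post.sum()
    i = np.unravel_index(post.argmax(), post.shape)
    return post, (x[i], y[i])  # full map and maximum a posteriori location
```

Multiplying the three terms and reading off the maximum a posteriori reproduces the qualitative pattern in Fig. S3: the perceived location is pulled away from the true location toward both priors, by an amount that depends on body orientation.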
For simplicity, we included only priors for elevation in our model. However, previous studies also demonstrated the existence of frequency-dependent biases for azimuth (22), and such biases have also been related to the filtering properties of the outer ear (21). That said, biases in azimuth had a much smaller magnitude in the present study (∼2°; Fig. 1B, Lower, green line) compared with elevation biases (∼15°; Fig. 1B, Upper, green line), and they were almost frequency independent. The reason why the azimuth biases here were so small compared with Butler (22), and thus could be safely neglected in the modeling, might be that our task involved binaural hearing, with interaural time and loudness differences as the main cues to azimuth, whereas Butler (22) determined azimuth biases for monaural hearing only.

Notably, the FEM measured from the statistics of the distal stimulus is much smaller in magnitude than all of the other mappings (Fig. 1 C and D). Something similar has been found in human vision, where the filtering properties of the eye seem to exaggerate the statistics of natural visual scenes (18). It remains a matter of future research to understand why the brain and the filtering of the outer ear encode the same FEM present in the environment on a different scale.
Comparison Between the Estimated Priors and the Statistics of the Natural Sounds and Filtering Properties of the Outer Ear. To calculate the relation between the priors and the statistics of the proximal and distal stimuli, we computed the correlation (with 95% confidence intervals) between the estimated priors and the FEMs measured from the environment and from the HRTFs (Fig. 1F).

ACKNOWLEDGMENTS. The authors would like to thank the Cognitive Neuroscience research team in Bielefeld for precious support throughout this study, and J. Burge and J.M. Ache for insightful comments on a previous version of the manuscript. C.V.P. and M.O.E. were supported by the 7th Framework Programme European Projects "The Hand Embodied" (248587) and "Wearhap" (601165). This study is part of the research program of the Bernstein Center for Computational Neuroscience, Tübingen, funded by the German Federal Ministry of Education and Research (BMBF; Förderkennzeichen: 01GQ1002).
1. Douglas KM, Bilkey DK (2007) Amusia is associated with deficits in spatial processing. Nat Neurosci 10(7):915–921.
2. Rusconi E, Kwan B, Giordano BL, Umiltà C, Butterworth B (2006) Spatial representation of pitch height: The SMARC effect. Cognition 99(2):113–129.
3. Dolscheid S, Shayan S, Majid A, Casasanto D (2013) The thickness of musical pitch: Psychophysical evidence for linguistic relativity. Psychol Sci 24(5):613–621.
4. Pratt CC (1930) The spatial character of high and low tones. J Exp Psychol 13(3):278–285.
5. Roffler SK, Butler RA (1968) Factors that influence the localization of sound in the vertical plane. J Acoust Soc Am 43(6):1255–1259.
6. Maeda F, Kanai R, Shimojo S (2004) Changing pitch induced visual motion illusion. Curr Biol 14(23):R990–R991.
7. Chiou R, Rich AN (2012) Cross-modality correspondence between pitch and spatial location modulates attentional orienting. Perception 41(3):339–353.
8. Melara RD, O'Brien TP (1990) Effects of cuing on cross-modal congruity. J Mem Lang 29(6):655–686.
9. Melara RD, O'Brien TP (1987) Interaction between synesthetically corresponding dimensions. J Exp Psychol Gen 116(4):323–336.
10. Bernstein IH, Edelstein BA (1971) Effects of some variations in auditory input upon visual choice reaction time. J Exp Psychol 87(2):241–247.
11. Mossbridge JA, Grabowecky M, Suzuki S (2011) Changes in auditory frequency guide visual-spatial attention. Cognition 121(1):133–139.
12. Evans KK, Treisman A (2010) Natural cross-modal mappings between visual and auditory features. J Vis 10(1):1–12.
13. Stumpf K (1883) Tonpsychologie (Hirzel, Leipzig, Germany).
14. Walker P, et al. (2010) Preverbal infants' sensitivity to synaesthetic cross-modality correspondences. Psychol Sci 21(1):21–25.
15. Batteau DW (1967) The role of the pinna in human localization. Proc R Soc Lond B Biol Sci 168(11):158–180.
16. Iida K, Itoh M, Itagaki A, Morimoto M (2007) Median plane localization using a parametric model of the head-related transfer function based on spectral cues. Appl Acoust 68(8):835–850.
17. Algazi VR, Duda RO, Thompson DM, Avendano C (2001) The CIPIC HRTF database. 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (IEEE, New Paltz, NY), pp 99–102.
18. Burge J, Geisler WS (2011) Optimal defocus estimation in individual natural images. Proc Natl Acad Sci USA 108(40):16849–16854.
19. Parise CV, Spence C, Ernst MO (2012) When correlation implies causation in multisensory integration. Curr Biol 22(1):46–49.
20. Suzuki Y, Takeshima H (2004) Equal-loudness-level contours for pure tones. J Acoust Soc Am 116(2):918–933.
21. Carlile S, Pralong D (1994) The location-dependent nature of perceptually salient features of the human head-related transfer functions. J Acoust Soc Am 95(6):3445–3459.
22. Butler RA (1987) An analysis of the monaural displacement of sound in space. Percept Psychophys 41(1):1–7.
23. Adams WJ, Graf EW, Ernst MO (2004) Experience can change the 'light-from-above' prior. Nat Neurosci 7(10):1057–1058.
24. Weiss Y, Simoncelli EP, Adelson EH (2002) Motion illusions as optimal percepts. Nat Neurosci 5(6):598–604.
25. Tassinari H, Hudson TE, Landy MS (2006) Combining priors and noisy visual cues in a rapid pointing task. J Neurosci 26(40):10154–10163.
26. Zhang R, Kwon O-S, Tadin D (2013) Illusory movement of stationary stimuli in the visual periphery: Evidence for a strong centrifugal prior in motion processing. J Neurosci 33(10):4415–4423.
27. Girshick AR, Landy MS, Simoncelli EP (2011) Cardinal rules: Visual orientation perception reflects knowledge of environmental statistics. Nat Neurosci 14(7):926–932.
28. Rogers ME, Butler RA (1992) The linkage between stimulus frequency and covert peak areas as it relates to monaural localization. Percept Psychophys 52(5):536–546.
29. Goossens HH, van Opstal AJ (1999) Influence of head position on the spatial representation of acoustic targets. J Neurophysiol 81(6):2720–2736.
30. Cabrera D, Ferguson S, Tilley S, Morimoto M (2005) Recent studies on the effect of signal frequency on auditory vertical localization. Proceedings of International Conference on Auditory Display (ICAD, Limerick, Ireland).
31. Sadaghiani S, Maier JX, Noppeney U (2009) Natural, metaphoric, and linguistic auditory direction signals have distinct influences on visual motion processing. J Neurosci 29(20):6490–6499.
32. Ramachandran V, Hubbard E (2001) Synaesthesia: A window into perception, thought and language. J Conscious Stud 8(12):3–34.
33. Parise C, Spence C (2013) Audiovisual cross-modal correspondences in the general population. Oxford Handbook of Synaesthesia, eds Simner J, Hubbard EM (Oxford Univ Press, Oxford, UK).
34. Kleiner M, Brainard D, Pelli D (2007) What's new in Psychtoolbox-3. Perception 36(14):1–16.
[Fig. S1 graphic: transfer (dB; color scale −20 to 20) as a function of frequency (kHz; 1–10) and elevation (°; −45 to 225).]
Fig. S1. Average HRTF, obtained by averaging all 45 HRTFs of the CIPIC database. The black dots, representing the elevation with maximum transfer for each
frequency, show a clear mapping between frequency and elevation. The FEM reported in Fig. 1C (Lower) was obtained by calculating for each individual HRTF
the elevation with maximum transfer for each frequency (i.e., the black dots here), and then averaging the results across the 45 HRTFs.
[Fig. S2 graphics: (A) setup schematic with a 164-cm-tall screen at 130 cm viewing distance; (B) body orientations 0°, 45°, and 90°.]
Fig. S2. (A) Schematic representation of the experimental setup. (B) Representation of the three different body orientations. The vertical arrows represent the
world-centered elevation; the tilted arrows represent the head-centered elevation. When the body of the participant is not tilted (Left), head- and world-
centered elevation overlap, whereas when the body is tilted by 90° (Right), the head- and world-centered elevations are orthogonal. The gray grids represent
the physical position of the speakers.
[Fig. S3 graphic: rows show body orientations 0°, 45°, and 90°; spatial axes are world-centered x and y.]
Fig. S3. Schematic illustration of the Bayesian model. The icons on the left represent the different orientations of the observers (in rows). The left column
represents the likelihood function (the sensory information). The red dots represent the physical position of the stimulus $s = (s_x, s_y)$. The second and the third
columns represent the frequency-dependent priors on elevation in head- and world-centered coordinates, respectively. The last column on the right represents
the posterior distribution; the red dot represents the physical position of the stimuli, whereas the green dot represents the maximum a posteriori, that is, the
perceived position of the stimuli. Note how the perceived position is shifted away from the actual position as a function of both the frequency-dependent
priors and body orientation. Colors indicate the reference frame of the priors (magenta = head-centered; cyan = world-centered). This figure represents the
case of the localization of a 4.5–8-kHz band-pass auditory stimulus coming from the bottom-right speaker ($s_{wc} = [15°, −15°]$). Frequency-independent distortions
of perceived auditory space (Modeling) are not represented.
Movie S1. Auditory stimuli used in the sound localization experiment. To better appreciate how perceived spatial elevation changes as a function of the
spectra of the stimuli, we recommend playing the sounds using loudspeakers (not headphones), and listening with the eyes closed.