


Natural auditory scene statistics shapes human spatial hearing
Cesare V. Parise (a,b,1), Katharina Knorre (b), and Marc O. Ernst (a,b,1)

(a) Max Planck Institute for Biological Cybernetics and Bernstein Center for Computational Neuroscience, 72076 Tübingen, Germany; and (b) Cognitive Neuroscience Department and Cognitive Interaction Technology-Center of Excellence, Bielefeld University, 33615 Bielefeld, Germany

Edited by Dale Purves, Duke University, Durham, NC, and approved March 10, 2014 (received for review December 5, 2013)

Human perception, cognition, and action are laced with seemingly arbitrary mappings. In particular, sound has a strong spatial connotation: Sounds are high and low, melodies rise and fall, and pitch systematically biases perceived sound elevation. The origins of such mappings are unknown. Are they the result of physiological constraints, do they reflect natural environmental statistics, or are they truly arbitrary? We recorded natural sounds from the environment, analyzed the elevation-dependent filtering of the outer ear, and measured frequency-dependent biases in human sound localization. We find that auditory scene statistics reveals a clear mapping between frequency and elevation. Perhaps more interestingly, this natural statistical mapping is tightly mirrored in both ear-filtering properties and in perceived sound location. This suggests that both sound localization behavior and ear anatomy are fine-tuned to the statistics of natural auditory scenes, likely providing the basis for the spatial connotation of human hearing.

frequency–elevation mapping | head-related transfer function | Bayesian modeling | cross-modal correspondence

Significance

Auditory pitch has an intrinsic spatial connotation: Sounds are high or low, melodies rise and fall, and pitch can ascend and descend. In a wide range of cognitive, perceptual, attentional, and linguistic functions, humans consistently display a positive, sometimes absolute, correspondence between sound frequency and perceived spatial elevation, whereby high frequency is mapped to high elevation. In this paper we show that pitch borrows its spatial connotation from the statistics of natural auditory scenes. This suggests that phenomena as diverse as the convoluted shape of the outer ear, the universal use of spatial terms for describing pitch, and the placement of high notes higher in musical notation ultimately reflect adaptation to the statistics of natural auditory scenes.

Author contributions: C.V.P. and M.O.E. designed research; C.V.P. and K.K. performed research; C.V.P. analyzed data; C.V.P. and M.O.E. performed statistical modeling; and C.V.P. and M.O.E. wrote the paper.

The authors declare no conflict of interest. This article is a PNAS Direct Submission.

1 To whom correspondence may be addressed. E-mail: [email protected] or [email protected]. This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1322705111/-/DCSupplemental.

The spatial connotation of auditory pitch is a universal hallmark of human cognition. High pitch is consistently mapped to high positions in space in a wide range of cognitive (1–3), perceptual (4–6), attentional (7–12), and linguistic functions (13), and the same mapping has been consistently found in infants as young as 4 mo of age (14). In spatial hearing, the perceived spatial elevation of pure tones is almost fully determined by frequency, rather than physical location, in a very systematic fashion [i.e., the Pratt effect (4, 5)]. Likewise, most natural languages use the same spatial attributes, high and low, to describe pitch (13), and throughout the history of musical notation high notes have been represented high on the staff. However, a comprehensive account of the origins of the spatial connotation of auditory pitch is still missing. More than a century ago, Stumpf (13) suggested that it might stem from the statistics of natural auditory scenes, but this hypothesis has never been tested. This is a major omission, as the frequency–elevation mapping often leads to remarkable inaccuracies in sound localization (4, 5) and can even trigger visual illusions (6), but it can also lead to benefits such as reduced reaction times or improved detection performance (7–12).

Results

To trace the origins of the mapping between auditory frequency and perceived vertical elevation, we first measured whether this mapping is already present in the statistics of natural auditory signals. When trying to characterize the statistical properties of incoming signals, it is critical to distinguish between distal stimuli, the signals as they are generated in the environment, and proximal stimuli, the signals that reach the transducers (i.e., the middle and inner ear). In the case of auditory stimuli this distinction is especially important, because the head and the outer ear operate as frequency- and elevation-dependent filters (15), which modulate the spectra of the sounds reaching the middle ear as a function of the elevation of the sound source relative to the observer (the head-related transfer function, HRTF). Notably, the structure of the peaks and notches produced by the HRTF on the spectra of the incoming signals is known to provide reliable cues for auditory localization in the medial plane (16). We therefore looked for the existence of a frequency–elevation mapping (FEM) in the statistics of natural auditory scenes and in the filtering properties of the outer ear. Hence, we effectively measured the mapping between frequency and elevation in both the distal and the proximal stimuli.

To look for the existence of an FEM in the natural acoustic environment, we recorded a large sample of environmental sounds (∼50,000 recordings, 1 s each) by means of two directional microphones mounted on the head of a human freely moving indoors and outdoors in urban and rural areas (around Bielefeld, Germany). Overall, the recordings revealed a consistent mapping between the frequency of sounds and the average elevation of their sources in external space [F(5, 57,859) = 35.8, P < 0.0001; Methods], which was particularly evident in the middle range of the spectrum, between 1 and 6 kHz (Fig. 1C, Upper). That is, high-frequency sounds have a tendency to originate from elevated sources in natural auditory scenes. We can only speculate about the origins of this mapping: it could be that more energy is generated at high frequencies at higher elevations (e.g., leaves on trees rustle in a higher frequency range than footsteps on the floor), or that the absorption of the ground is frequency dependent in a way that filters out more of the high-frequency spectrum.

To look for the existence of an FEM in the filtering properties of the ear, we analyzed a set of 45 HRTFs [the CIPIC database (17); Methods and Fig. S1] and again found a clear mapping between frequency and elevation [F(5, 264) = 216.6, P < 0.0001; Fig. 1C, Lower]. That is, due to the filtering properties of the outer ear, sounds coming from high (head-centered) elevations have more energy at high frequencies.



[Fig. 1 graphics, panels A–F, are not reproduced in this text version; see caption below. Recoverable values from panel F (correlations between estimated priors and statistical mappings, with 95% confidence intervals): environment vs. world-centered prior, .84 (.73–.93); environment vs. head-centered prior, .68 (.59–.77); HRTF vs. world-centered prior, .76 (.61–.84); HRTF vs. head-centered prior, .90 (.84–.94).]
Fig. 1. (A) Average endpoint of pointing responses for the various frequency bands (column) and body tilts (row). The filled points correspond to the average
responses; the thin gray grid represents the actual position of the stimuli. Colors represent tilt (green = 0°, brown = 45°, red = 90°). (B) Frequency-dependent
bias (±SEM) in sound localization in head-centered elevation (Upper) and azimuth (Lower). The magnitude of the frequency-dependent elevation biases was
only mildly affected by body tilt, reflecting the contribution of a frequency–elevation mapping encoded in head-centered coordinates. The frequency-
dependent azimuth biases increase in magnitude with increasing body-tilt angle reflecting the contribution of a frequency–elevation mapping encoded in
world-centered coordinates. (C) Statistical mapping (±SEM) between frequency and elevation recorded in the environment (Upper) and measured from the
HRTFs (Lower). The dashed lines represent the frequency–elevation mapping using nonbinned data. (D) Shapes of the estimated priors coding for the fre-
quency–elevation mapping in world-centered (Upper) and head-centered coordinates (Lower). Lightness within the panels represents the equal loudness
contour (International Organization for Standardization 226:2003): lighter gray represents higher sensitivity. (E) Schematic 1D representation of the model
illustrating the head- (magenta) and world-centered priors (cyan). (F) Correlation (and 95% confidence intervals) between the estimated priors and the
frequency–elevation mapping measured from the environment and the HRTFs. (C–F) Colors indicate the reference frame (magenta = head-centered; cyan =
world-centered).

These results demonstrate that an FEM is consistently present in the statistics of both proximal and distal stimuli. This suggests that the perceptual FEM might ultimately reflect a tuning of the human auditory system to the statistics of natural sounds.

Finally, we determined the correlation between the FEM measured in proximal and distal stimuli, and found a strong similarity between the two mappings (ρ = 0.79, interquartile range = 0.72–0.84). That is, the filtering properties of the external ear accentuate the FEM that is present in natural auditory scenes. One possible reason for this similarity is that the elevation-dependent filtering of the outer ear is set to maximize the transfer of naturally available information. This result parallels previous findings in human vision showing a high degree of similarity between the spectra of natural images and the optical transfer function of the eye (18). This might suggest that human spatial hearing is so finely tuned to the environment that even the filtering properties of the outer ear, and hence its convoluted anatomy, evolved to mirror the statistics of natural auditory scenes.

To investigate the relation between human performance and the FEM in proximal and distal stimuli, we asked participants to localize on a 2D plane (19) a set of narrowband (∼1.8-octave) auditory noises with different central frequencies (Movie S1). Sounds were played from a set of 16 speakers hidden behind a sound-transparent projection screen, arranged on a 4 × 4 grid subtending an angle of ∼30 × 30°. Participants were asked to



point toward the sound source, while pointing direction was measured (Methods). Participants performed the sound localization experiment in three conditions in which we tilted their whole body (0°, 45°, and 90°) to dissociate head- from world-centered elevation (Fig. S2). Given that the FEM in the proximal and distal stimuli comes in different reference frames (the first being head-centered, the second world-centered), tilting participants allows one to separately estimate the relationship between sound localization biases and the FEM measured in the proximal and distal stimuli. In the extreme case, when the participant lay horizontally on the side (tilt = 90°), head- and world-centered elevations were orthogonal, and as a result vertical sound localization biases in each reference frame were independent.

When participants had to localize white noise (which includes all spectral frequencies), performance was quite accurate and the orientation-dependent spatial distortions were minor (Fig. 1A, Right). Conversely, sound source localization was strongly biased when the stimulus consisted of narrowband noise (Fig. 1A). Such biases depended both on the spectra of the stimuli and on the orientation of the observers. The bias was especially strong for those frequencies at which hearing sensitivity, as measured by equal loudness contours, is at its maximum (20): In the three frequency bands between 1.4 and 8 kHz the localization responses were virtually independent of the actual sound source location, and the reported elevation was almost entirely determined, in a very consistent way, by the frequency of the signals (Fig. 1A, Center). Notably, such biases showed a clear mapping between frequency and elevation (Fig. 1B), which was evident in both head- and world-centered coordinates (see also refs. 5, 11). Importantly, such localization biases were significantly correlated with the FEM present in proximal and distal stimuli (ρ = 0.76 for world-centered biases with the distal stimulus and ρ = 0.78 for head-centered biases with the proximal stimulus; see SI Text). Consistent with previous studies (21, 22), we also found moderate but consistent frequency-dependent biases in horizontal sound localization. These results demonstrate the existence of striking frequency- and body-orientation-dependent perceptual biases in sound localization. The results also demonstrate the dependence of such biases on the statistics of natural auditory scenes and on the filtering properties of the outer ear.

However, it is not immediately obvious why there is such a high degree of correspondence between the behavioral biases found in sound localization and the statistical mappings found in both the environment and the filtering properties of the ear. To better understand this close correspondence we need a generative model. Recently, the Bayesian approach has been successfully used for developing such generative models, and in particular for describing the effects of stimulus statistics on perceptual judgments (23–27). In Bayesian terms, the frequency dependency of sound source location can be modeled as a prior distribution $p_f(s)$ representing the probability of a sound source of a given frequency $f$ occurring at some given 2D spatial location $s = (s_x, s_y)$. Based on the measured statistics of natural auditory scenes, the filtering properties of the ear, and the biases in sound localization, we postulated the existence of two distinct mappings between frequency and elevation, respectively coding the expected elevation of sounds as a function of the frequency spectrum in either head- or world-centered coordinates. Therefore, we modeled two frequency-dependent priors for elevation, one head-centered and the other world-centered. This model would involve a mechanism dedicated to the extraction and combination of relevant spectral cues from the proximal stimulus (such as the frequencies with more energy), and the mapping of the result to certain head- and world-centered elevations. For simplicity, such priors were modeled as Gaussian distributions whose means represent the expected elevation given the spectrum of the incoming signal (Fig. 1E). Given that participants had to localize the auditory stimuli on a 2D plane, the Bayesian ideal observer model was also framed in 2D space (Methods and Fig. S3). In a similar fashion, we also modeled the incoming sensory information in terms of Gaussian probability distributions over spatial locations: the likelihood function. According to Bayesian decision theory, prior expectations and incoming sensory information are combined to determine the final percept. This model predicts that as soon as the sensory information from the peaks and notches of the HRTF (16) becomes unreliable, such as when sounds have a narrow spectrum as in the present experiment, the perceived elevation will be mainly determined by the prior. Given this generative model, we can use the responses from the sound localization task to estimate the expected head- and world-centered elevation of a sound given its frequency, that is, the shape of the internal FEMs.
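To make this prediction concrete, here is a minimal one-dimensional sketch of the cue-prior combination (an editorial illustration, not the authors' code; the original analyses were implemented in Matlab, and every number below is invented). For Gaussians, the maximum a posteriori (MAP) estimate is the precision-weighted average of the sensory cue and the prior mean, so as the cue's variance grows the estimate migrates toward the prior:

```python
import numpy as np

def map_elevation(cue_mean, cue_var, prior_mean, prior_var):
    """MAP estimate for a Gaussian likelihood combined with a Gaussian prior.

    The product of two Gaussians is Gaussian, with a mean equal to the
    precision-weighted average of the two means.
    """
    w_cue = 1.0 / cue_var      # precision of the spectral elevation cue
    w_pri = 1.0 / prior_var    # precision of the frequency-elevation prior
    post_var = 1.0 / (w_cue + w_pri)
    post_mean = post_var * (w_cue * cue_mean + w_pri * prior_mean)
    return post_mean, post_var

# Hypothetical numbers: a source at 0 deg elevation and a high-frequency
# prior expecting +10 deg. A broadband sound yields a reliable spectral
# cue that dominates; a narrowband sound yields an unreliable cue, and
# perceived elevation is pulled toward the prior.
for label, cue_var in [("broadband (reliable cue)", 1.0),
                       ("narrowband (unreliable cue)", 100.0)]:
    m, _ = map_elevation(cue_mean=0.0, cue_var=cue_var,
                         prior_mean=10.0, prior_var=4.0)
    print(f"{label}: perceived elevation = {m:.1f} deg")
```

Running this gives roughly 2° for the broadband case and almost 10° for the narrowband case, which is the qualitative pattern the model predicts for the narrowband stimuli used in the experiment.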
The shapes of the estimated frequency-dependent priors on vertical sound location (Fig. 1D) reveal a strong similarity with the frequency-dependent biases measured from the responses of the participants (Fig. 1B, red lines). Given that such biases are supposedly the outcome of the estimated frequency-dependent priors, this is an expected finding that further validates the current modeling approach. Having empirically determined the shapes of the internal FEM (in both head- and world-centered coordinates), we can look for similarities (i.e., correlations) between the shapes of such perceptual mappings and the ones we measured from both the statistics of the acoustic environment and the HRTFs. Notably, both estimated priors significantly correlated with the statistical mappings present in proximal and distal stimuli (i.e., the maximum of the frequency spectra against spatial elevation) (Fig. 1F). However, the head-centered prior was more correlated with the FEM measured from the filtering properties of the outer ear, whereas the world-centered prior was more correlated with the FEM present in environmental sounds. These results demonstrate that the perceptual FEM in humans jointly depends on the statistics of natural auditory scenes and on the filtering properties of the outer ear.

Discussion

Previous studies have already hypothesized the grounding of cross-dimensional sensory correspondences in the statistics of incoming stimuli (13, 28). None of them, however, directly measured how such mappings relate to the statistical properties of the stimuli. Our results demonstrate that an FEM is already present in the statistics of both the proximal and the distal stimuli. Moreover, we demonstrate that the perceptual FEM is in fact a twofold mapping, which separately encodes the statistics of natural auditory scenes and the filtering properties of the outer ear in different frames of reference. Interestingly, this finding provides further support for the role of vestibular and proprioceptive information in sound localization (29). These results also highlight the possibility of using sound spectral frequency to simulate the vertical elevation of sound sources.

The pervasiveness of the FEM in the statistics of the stimuli readily explains why previous research found this mapping to be absolute (5, 30) (i.e., each frequency is related to exactly one elevation), universal (3, 13) (cross-cultural and language independent), and already present in early infancy (14); and it argues against interpretations of cross-dimensional sensory correspondences in terms of "weak synesthesia" (9). The mapping between pitch and elevation, also reflected in musical notation and in the lexicon of most natural languages (13), has often been considered a metaphorical mapping (6, 31), and cross-sensory correspondences have been theorized to be the basis for language development (32). The present findings demonstrate that, at least in the case of the FEM, such a metaphorical mapping is indeed embodied and based on the statistics of the environment, hence raising the intriguing hypothesis that language itself might



have been influenced by a set of statistical mappings between the sensory signals. Even more, besides the FEM, human perception, cognition, and action are laced with seemingly arbitrary correspondences (33), such as, for example, that yellow-reddish colors are associated with a warm temperature, or that sour foods taste sharp. We may speculate here that many of these mappings are in fact the reflection of natural scene statistics.

Methods

Recordings from the Environment. The recordings were taken by two microphones (Sennheiser ME105) mounted one above the other on the side of a baseball cap, and pointing ±25° from the horizontal midline. The distance between the microphones was 4 cm, and the experimenter kept the head in a natural upright position throughout the recording session. We did not constrain naturally occurring head movements while recording the sounds, because our goal was to measure the natural soundscape of a listener with ordinary postures. The recordings had a sampling frequency of 44,100 Hz and a depth of 16 bits. Each recording was filtered with a pool of 71 band-pass filters (constant log-frequency width, overall range = 0.5–16 kHz), and the elevations of the resulting signals were measured from the lag that maximized the cross-correlation between the two microphones (if the cross-correlation was <0.5, elevation was not calculated). The elevation mapped to each frequency was calculated as the average elevation across recordings.
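As a rough illustration of this step, the sketch below band-pass filters the two channels of one recording and converts the lag of the cross-correlation peak into an elevation angle. The sampling rate, the 4-cm microphone separation, and the 0.5 correlation threshold come from the paragraph above; the arcsine lag-to-angle conversion, the speed of sound, the filter order, and all names are our assumptions, since the paper does not spell out the conversion:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 44100        # sampling rate of the recordings (Hz)
MIC_DIST = 0.04   # vertical separation of the two microphones (m)
C = 343.0         # assumed speed of sound (m/s)

def band_elevation(top, bottom, f_lo, f_hi, min_corr=0.5):
    """Elevation estimate for one frequency band of one 1-s recording.

    `top` and `bottom` are the two microphone channels. The lag of the
    cross-correlation peak gives the inter-microphone delay; the arcsine
    conversion to an angle is a plausible choice, not taken from the paper.
    """
    sos = butter(4, [f_lo, f_hi], btype="band", fs=FS, output="sos")
    a = sosfiltfilt(sos, top)
    b = sosfiltfilt(sos, bottom)
    # Normalized cross-correlation over the physically possible lags only.
    max_lag = int(np.ceil(MIC_DIST / C * FS))   # about 6 samples here
    lags = np.arange(-max_lag, max_lag + 1)
    xc = np.array([np.corrcoef(a[max_lag + l: len(a) - max_lag + l],
                               b[max_lag: len(b) - max_lag])[0, 1]
                   for l in lags])
    if xc.max() < min_corr:
        return None          # unreliable band: skipped, as in the Methods
    lag = lags[xc.argmax()]
    # Path difference d*sin(theta) = lag * c / fs; the sign convention
    # (positive = source above the midline) is our choice.
    return np.degrees(np.arcsin(np.clip(lag * C / (FS * MIC_DIST), -1, 1)))
```

Averaging the returned angles per band across all recordings would yield a distal FEM of the kind shown in Fig. 1C (Upper).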

Analysis of the HRTF. The CIPIC HRTF database (17) includes the transfer functions produced by the outer ears of 45 humans for 71 frequency channels (linearly spaced between 0.66 and 16.1 kHz), recorded from 50 elevations (range −45° to 230°). The elevation mapped to each frequency channel was calculated from each individual HRTF as the elevation with the highest transfer value (dB) for that particular frequency channel (28) for sounds coming from the midsagittal plane (Fig. S1).
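In code, this analysis reduces to an argmax over elevations for each frequency channel. A minimal sketch (loading the CIPIC data is omitted, and the array layout is assumed):

```python
import numpy as np

def hrtf_fem(transfer_db, elevations_deg):
    """Frequency-elevation mapping of a single HRTF.

    transfer_db: (n_elevations, n_channels) array of median-plane transfer
        values in dB, e.g., 50 elevations x 71 channels for a CIPIC subject.
    Returns, for each frequency channel, the elevation with the highest
    transfer, as described above.
    """
    peak_idx = np.argmax(transfer_db, axis=0)  # peak elevation per channel
    return np.asarray(elevations_deg)[peak_idx]

# Averaging hrtf_fem(...) across the 45 subjects reproduces the black dots
# of Fig. S1, i.e., the proximal FEM of Fig. 1C (Lower).
```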
Psychophysical Task. Ten healthy observers with normal audition and normal or corrected-to-normal vision took part in the experiment (six females; mean age 25 y, range 21–33 y). All of them were students or employees at the University of Bielefeld and provided written informed consent before participating. The study was conducted in accordance with the Declaration of Helsinki and had ethical approval from the ethics committee of the University of Tübingen.

The observer's head was fixed 130 cm away from a sound-transparent projection screen (220 × 164 cm) mounted in front of a set of 16 speakers (Fig. S2). On each trial, one of the speakers played a 300-ms band-pass noise (band-pass kHz: <0.8; 0.8–1.4; 1.4–2.5; 2.5–4.5; 4.5–8; >8; or white noise; Movie S1). Participants were instructed to indicate where they had heard the stimulus come from, using a cursor projected on the screen in front of the speakers. Localization was visually guided (closed loop) and temporally unconstrained. When participants were happy with the position of the cursor, they pressed on the touchpad to submit their response. In different blocks (in a counterbalanced order), we tilted participants' bodies (0°, 45°, or 90° counterclockwise; Fig. S2) with respect to the gravitational vertical, using custom-built chairs that maintained the line of sight aligned with the center of the screen without covering the ears. Each combination of stimulus frequency, tilt, and spatial location was repeated 4 times (1,344 trials per participant).

Before running the localization task, the auditory stimuli were perceptually equalized in loudness using the method of adjustment, to prevent perceived loudness differences from affecting our results. That is, we used the white noise stimulus as the standard, and participants adjusted the intensity of each band-pass stimulus until its loudness perceptually matched the standard. Each band-pass stimulus was adjusted six times, and we repeated the procedure with four participants. The gain factor used to equalize each stimulus was determined as the median value of all adjustments.

The experiment was conducted in a dark anechoic chamber and was controlled by custom-built software based on the Psychtoolbox (34). Participants were tested in three sessions taking place on three consecutive days. Different body-tilt conditions were tested in separate blocks (4 blocks/d), with the order of the blocks counterbalanced within and across participants. Within each block, sounds with different frequencies and positions were presented in a pseudorandom fashion.

For each orientation and frequency, the localization bias was calculated separately for head-centered elevation and azimuth as the grand mean of the responses for each participant. The elevation bias (Fig. 1B, Upper) showed a main effect of frequency [F(5,45) = 11.564; P < 0.001], without significant effects of tilt [F(2,18) = 1.157, P = 0.337] or interactions [F(10,90) = 1.313; P = 0.235]. The azimuth bias (Fig. 1B, Lower) showed a main effect of frequency [F(5,45) = 4.074; P = 0.004], a main effect of tilt [F(2,18) = 43.474, P < 0.001], and a significant interaction [F(10,90) = 8.11; P < 0.001].

To engage participants with the experiment, the whole task was presented as a shooting video game (19): A bullet-hole graphic effect (spatially aligned with the pointing response) and the sound of a gunshot accompanied each response, closely followed by the sound of a loading gun. The sound effects came from an additional speaker placed in the proximity of the participants' heads. To avoid these effects interfering with the experimental stimuli, a temporal interval randomized between 2 and 3 s separated consecutive trials. To further motivate the participants, they were told that they could earn points as a function of their performance. Every 16 trials, a fake high-score list was presented, in which participants on average ranked third out of 10.

Modeling. In the present experiment, participants were presented with physical stimuli coming from a source $s = (s_x, s_y)$. Using both binaural cues and the structure of the peaks and notches in the frequency spectrum, the auditory system can estimate, respectively, the azimuth and the elevation of the sound source. Assuming that the sensory estimate $\hat{s} = (\hat{s}_x, \hat{s}_y)$ derived from the physical source of a sound with frequency $f$ is unbiased but noisy, with some Gaussian noise $\sigma = (\sigma_{f,x}, \sigma_{f,y})$ added independently to each spatial dimension $i$ ($\hat{s}_i = s_i + \sigma_{f,i}$), the likelihood distribution $p_f(\hat{s} \mid s)$ for the spatial location of the sound source is a 2D Gaussian:

$$p_f(\hat{s} \mid s) = \mathcal{N}\left(s_\theta, \Sigma_{f,\theta}\right),$$

with mean $s_\theta = (s_x, s_y) \cdot R_\theta$ and covariance matrix

$$\Sigma_{f,\theta} = \begin{pmatrix} \sigma_{f,x}^2 & 0 \\ 0 & \sigma_{f,y}^2 \end{pmatrix} \cdot R_\theta$$

(Fig. S3, Left). Assuming the likelihood to be encoded in head-centered coordinates, $R_\theta$ is a rotation matrix that rotates the axes according to the orientation of the body with respect to the gravitational vertical ($\theta$).

The expected elevation of a sound source of a given frequency spectrum can be modeled as a Gaussian a priori probability distribution, whose mean represents the expected location given the maximum of the frequency spectrum, and whose variance represents the uncertainty of the mapping. Given that we empirically measured an FEM in the filtering properties of the outer ear and in the statistics of natural auditory scenes, we assumed the existence of two independent priors encoding the FEM in head- and world-centered coordinates, respectively.

In head-centered coordinates, the prior distribution $p_{hc,f}(s)$ for the location $s_{hc,f}$ of a sound with frequency $f$ is defined as a 2D Gaussian:

$$p_{hc,f}(s) = \mathcal{N}\left(s_{hc,f,\theta}, \Sigma_{hc,f,\theta}\right),$$

with mean $s_{hc,f,\theta} = (0, s_{hc,f,y}) \cdot R_\theta$ and covariance matrix

$$\Sigma_{hc,f,\theta} = \begin{pmatrix} \infty & 0 \\ 0 & \sigma_{hc,f,y}^2 \end{pmatrix} \cdot R_\theta$$

(Fig. S3, second column). The mean $s_{hc,f,y}$ represents the expected spatial elevation, and the variance $\sigma_{hc,f,y}^2$ the mapping uncertainty. For simplicity, we assumed no mapping between frequency and the head-centered left–right location of a sound source; therefore, the prior had a mean azimuth of zero and infinite variance (i.e., the prior is uninformative with respect to the head-centered azimuth).

In a similar fashion, the world-centered prior distribution $p_{wc,f}(s)$ for the location $s_{wc,f}$ of a sound with frequency $f$ is defined as a 2D Gaussian:

$$p_{wc,f}(s) = \mathcal{N}\left(s_{wc,f}, \Sigma_{wc,f}\right),$$

with mean $s_{wc,f} = (0, s_{wc,f,y})$ and covariance matrix

$$\Sigma_{wc,f} = \begin{pmatrix} \infty & 0 \\ 0 & \sigma_{wc,f,y}^2 \end{pmatrix}$$

(Fig. S3, third column). The mean $s_{wc,f,y}$ represents the expected spatial elevation, and the variance $\sigma_{wc,f,y}^2$ the mapping uncertainty. Again, the prior was made uninformative as to the world-centered azimuth location of the sound source.

The statistically optimal way to combine noisy sensory information with prior knowledge is described by Bayes' theorem, according to which the posterior $p_f(s \mid \hat{s})$ (Fig. S3, Right), on which the percept is based, is proportional to the product of the likelihood (i.e., the sensory information) and the priors (here, the FEMs):

$$p_f(s \mid \hat{s}) \propto p_{hc,f}(s) \cdot p_{wc,f}(s) \cdot p_f(\hat{s} \mid s).$$
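The sketch below instantiates these equations numerically, under one self-consistent reading of the notation: covariances are rotated in the standard form $R \Sigma R^\top$ (the article writes the rotation more compactly), and working with precision (inverse-covariance) matrices lets the priors' infinite azimuth variances enter simply as zero precisions. The MAP location is then the solution of a 2 × 2 linear system. The function name `map_estimate` and all numeric values in the example call are made up for illustration; they are not the fitted parameters:

```python
import numpy as np

def rot(theta_deg):
    """Rotation matrix R_theta for a body tilt of theta degrees."""
    t = np.radians(theta_deg)
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

def map_estimate(s, theta_deg, sig_lik, mu_hc_y, sig_hc_y, mu_wc_y, sig_wc_y):
    """MAP location under likelihood x head-centered x world-centered prior."""
    R = rot(theta_deg)
    # Likelihood: anisotropic noise defined along head axes, expressed in
    # world coordinates; mean at the physical source location.
    L_lik = R @ np.diag(1.0 / np.square(sig_lik)) @ R.T
    mu_lik = np.asarray(s, dtype=float)
    # Head-centered prior: informative only along the head's "up" axis
    # (zero precision = infinite variance along head-centered azimuth).
    L_hc = R @ np.diag([0.0, 1.0 / sig_hc_y**2]) @ R.T
    mu_hc = R @ np.array([0.0, mu_hc_y])
    # World-centered prior: informative only along gravitational "up".
    L_wc = np.diag([0.0, 1.0 / sig_wc_y**2])
    mu_wc = np.array([0.0, mu_wc_y])
    # Product of Gaussians: precisions add, and the posterior mean solves
    # (sum of precisions) x = sum of precision-weighted means.
    L_post = L_lik + L_hc + L_wc
    b = L_lik @ mu_lik + L_hc @ mu_hc + L_wc @ mu_wc
    return np.linalg.solve(L_post, b)

# Illustrative call: a sound from (15, -15) deg with the body tilted 90 deg.
print(map_estimate(s=(15.0, -15.0), theta_deg=90.0, sig_lik=(3.0, 12.0),
                   mu_hc_y=10.0, sig_hc_y=8.0, mu_wc_y=8.0, sig_wc_y=8.0))
```

With the elevation cue made noisy (as for a narrowband sound), the estimate is pulled toward both prior means, each along its own axis, which is the joint head- and world-centered bias pattern described in Results.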



Assuming all of the noise in the data to be due to sensory (as opposed to response-motor) noise (19), participants' responses would represent random samples of the posterior distribution $p_f(s \mid \hat{s})$. Therefore, given the psychophysical data, it is possible to estimate the parameters of the model and eventually estimate the shape of the internal FEMs. Using a maximum-likelihood procedure, we fitted the means of the priors, $s_{hc,f,y}$ and $s_{wc,f,y}$, for each frequency band that we tested and, assuming for simplicity that the strength of the FEM is independent of frequency, we fitted the two mapping uncertainties, $\sigma_{hc,f,y}^2$ and $\sigma_{wc,f,y}^2$. We also fitted the covariance matrix $\Sigma_{f,\theta}$ of the likelihood function (given that sound frequency is known to affect sensitivity to the elevation of a sound source, we fitted a different variance $\sigma_{f,y}^2$ for each frequency band tested). Overall, the model had 21 free parameters fitted over 13,440 trials, that is, 640 trials per parameter.

Additionally, we used the responses in the white noise condition to estimate further frequency-independent distortions of perceived space. This was modeled by shifting the mean of the posterior, for each position and orientation, by the bias calculated from the white noise (i.e., the discrepancy between physical and perceived position in the white noise condition).

The parameters were fitted over the mean pointing response for each condition (i.e., frequency, tilt, and spatial location) across participants (i.e., the dots in Fig. 1A). The fitting was based on an unconstrained nonlinear optimization procedure (fminsearch, Matlab). Parameters were fitted using a leave-one-out jackknife procedure, consisting of iteratively estimating the parameters of the pointing responses while excluding one participant at a time. The results in Fig. 1D represent the mean of the 10 iterations. To minimize the effect of the starting parameter values, we repeated each fitting procedure 30 times using random starting values and selected the set of parameters that provided the best fit.
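A compact sketch of this fitting loop, using SciPy's Nelder-Mead simplex (the algorithm underlying Matlab's fminsearch) with a least-squares stand-in for the paper's maximum-likelihood objective. The `predict` callback stands in for the generative model above (e.g., a parameterized version of the `map_estimate` sketch), and the data containers are placeholders; the 30 random restarts follow the text, while scales and seeding are arbitrary:

```python
import numpy as np
from scipy.optimize import minimize

def fit_parameters(conditions, mean_responses, predict, n_params, seed=0):
    """Least-squares fit of a model to mean pointing responses.

    `predict(params, condition)` must return the model's predicted 2D
    location for one condition; its signature is our invention.
    """
    rng = np.random.default_rng(seed)

    def loss(params):
        errs = [predict(params, c) - r
                for c, r in zip(conditions, mean_responses)]
        return float(np.sum(np.square(errs)))

    best = None
    for _ in range(30):                 # 30 random restarts, as in the text
        x0 = rng.normal(scale=10.0, size=n_params)
        res = minimize(loss, x0, method="Nelder-Mead")
        if best is None or res.fun < best.fun:
            best = res
    return best.x

# The leave-one-out jackknife then simply reruns fit_parameters ten times,
# each time recomputing mean_responses without one participant, and
# averages the ten parameter sets.
```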
Given that in this study we were especially interested in the effects of frequency on perceived elevation, we included only frequency-dependent priors for elevation in our model. However, previous studies have also demonstrated the existence of frequency-dependent biases for azimuth (22), and such biases have also been related to the filtering properties of the outer ear (21). That said, azimuth biases had a much smaller magnitude in the present study (∼2°; Fig. 1B, Lower, green line) compared with elevation biases (∼15°; Fig. 1B, Upper, green line), and they were almost frequency independent. The reason why these azimuth biases were so small compared with Butler (22), and thus could be safely neglected in the modeling, might be that our task involved binaural hearing, with interaural time and loudness differences serving as the main cues to azimuth, whereas Butler (22) determined azimuth biases for monaural hearing only.

Comparison Between the Estimated Priors and the Statistics of the Natural Sounds and Filtering Properties of the Outer Ear. To calculate the relation between the priors and the statistics of the proximal and distal stimuli, we first divided the spectra of the HRTFs and the recordings into the same six frequency bands that we used for the experiment. The elevation mapped to each frequency band corresponded to the mean of the elevations within the frequency range. This procedure was carried out individually for each recording and each HRTF, and the results were used for statistical inference on the existence of an FEM in the proximal and distal stimuli (see Results) and for the correlation between the statistics of the stimulus and human performance (estimated priors and biases). The similarity between the shapes of the FEM measured from the psychophysical task and from the statistics of the stimulus was measured in terms of Pearson correlation (Fig. 1F). A correlation of 1 means that the mappings are identical in shape, irrespective of potential shifts and scaling factors, whereas a correlation of 0 means that the two mappings are statistically independent. The correlation was calculated only for the frequency bands between 0.8 and 8 kHz, because above and below these frequencies the estimated priors and the measurements from the statistics of the signals were estimated over different frequency ranges. To estimate the mean and the confidence interval of the correlation, we used a resampling procedure, whereby the correlation was iteratively calculated from the mean of a subset of one-fifth of all the recordings (n = 9,962), one-fifth of the HRTFs (n = 9), and one-fifth of the 10 estimated parameter sets (n = 2). This procedure was repeated 1,000 times.
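A sketch of this resampling scheme (an editorial illustration; the input layout and all names are ours). Each iteration subsamples one-fifth of the items without replacement, averages the per-band elevations, and correlates the two resulting curves:

```python
import numpy as np
from scipy.stats import pearsonr

def resampled_correlation(fem_a, fem_b, n_iter=1000, seed=0):
    """Mean and 95% interval of the correlation between two FEMs.

    fem_a, fem_b: (n_items, n_bands) arrays of per-band elevations, e.g.,
    one row per recording and one row per HRTF, respectively.
    """
    rng = np.random.default_rng(seed)
    rhos = np.empty(n_iter)
    for i in range(n_iter):
        ia = rng.choice(len(fem_a), len(fem_a) // 5, replace=False)
        ib = rng.choice(len(fem_b), len(fem_b) // 5, replace=False)
        rhos[i] = pearsonr(fem_a[ia].mean(axis=0),
                           fem_b[ib].mean(axis=0))[0]
    return rhos.mean(), np.percentile(rhos, [2.5, 97.5])
```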
The results of these analyses are reported in Fig. 1F. Note that despite the strong similarities between the shapes of the FEM in the statistics of the natural stimuli and in the estimated priors, the scale of the FEM in the statistics of the distal stimulus is much smaller than that of all of the other mappings (Fig. 1 C and D). Something similar has been found in human vision, where the filtering properties of the eye seem to exaggerate the statistics of natural visual scenes (18). It remains a matter for future research to understand why the brain and the filtering of the outer ear encode the FEM present in the environment on a different scale.

ACKNOWLEDGMENTS. The authors would like to thank the Cognitive Neuroscience research team in Bielefeld for precious support throughout this study, and J. Burge and J.M. Ache for insightful comments on a previous version of the manuscript. C.V.P. and M.O.E. were supported by the 7th Framework Programme European Projects "The Hand Embodied" (248587) and "Wearhap" (601165). This study is part of the research program of the Bernstein Center for Computational Neuroscience, Tübingen, funded by the German Federal Ministry of Education and Research (Förderkennzeichen: 01GQ1002).

1. Douglas KM, Bilkey DK (2007) Amusia is associated with deficits in spatial processing. Nat Neurosci 10(7):915–921.
2. Rusconi E, Kwan B, Giordano BL, Umiltà C, Butterworth B (2006) Spatial representation of pitch height: The SMARC effect. Cognition 99(2):113–129.
3. Dolscheid S, Shayan S, Majid A, Casasanto D (2013) The thickness of musical pitch: Psychophysical evidence for linguistic relativity. Psychol Sci 24(5):613–621.
4. Pratt CC (1930) The spatial character of high and low tones. J Exp Psychol 13(3):278–285.
5. Roffler SK, Butler RA (1968) Factors that influence the localization of sound in the vertical plane. J Acoust Soc Am 43(6):1255–1259.
6. Maeda F, Kanai R, Shimojo S (2004) Changing pitch induced visual motion illusion. Curr Biol 14(23):R990–R991.
7. Chiou R, Rich AN (2012) Cross-modality correspondence between pitch and spatial location modulates attentional orienting. Perception 41(3):339–353.
8. Melara RD, O'Brien TP (1990) Effects of cuing on cross-modal congruity. J Mem Lang 29(6):655–686.
9. Melara RD, O'Brien TP (1987) Interaction between synesthetically corresponding dimensions. J Exp Psychol Gen 116(4):323–336.
10. Bernstein IH, Edelstein BA (1971) Effects of some variations in auditory input upon visual choice reaction time. J Exp Psychol 87(2):241–247.
11. Mossbridge JA, Grabowecky M, Suzuki S (2011) Changes in auditory frequency guide visual-spatial attention. Cognition 121(1):133–139.
12. Evans KK, Treisman A (2010) Natural cross-modal mappings between visual and auditory features. J Vis 10(1):1–12.
13. Stumpf K (1883) Tonpsychologie (Hirzel, Leipzig, Germany).
14. Walker P, et al. (2010) Preverbal infants' sensitivity to synaesthetic cross-modality correspondences. Psychol Sci 21(1):21–25.
15. Batteau DW (1967) The role of the pinna in human localization. Proc R Soc Lond B Biol Sci 168(11):158–180.
16. Iida K, Itoh M, Itagaki A, Morimoto M (2007) Median plane localization using a parametric model of the head-related transfer function based on spectral cues. Appl Acoust 68(8):835–850.
17. Algazi VR, Duda RO, Thompson DM, Avendano C (2001) The CIPIC HRTF database. 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (IEEE, New Paltz, NY), pp 99–102.
18. Burge J, Geisler WS (2011) Optimal defocus estimation in individual natural images. Proc Natl Acad Sci USA 108(40):16849–16854.
19. Parise CV, Spence C, Ernst MO (2012) When correlation implies causation in multisensory integration. Curr Biol 22(1):46–49.
20. Suzuki Y, Takeshima H (2004) Equal-loudness-level contours for pure tones. J Acoust Soc Am 116(2):918–933.
21. Carlile S, Pralong D (1994) The location-dependent nature of perceptually salient features of the human head-related transfer functions. J Acoust Soc Am 95(6):3445–3459.
22. Butler RA (1987) An analysis of the monaural displacement of sound in space. Percept Psychophys 41(1):1–7.
23. Adams WJ, Graf EW, Ernst MO (2004) Experience can change the 'light-from-above' prior. Nat Neurosci 7(10):1057–1058.
24. Weiss Y, Simoncelli EP, Adelson EH (2002) Motion illusions as optimal percepts. Nat Neurosci 5(6):598–604.
25. Tassinari H, Hudson TE, Landy MS (2006) Combining priors and noisy visual cues in a rapid pointing task. J Neurosci 26(40):10154–10163.
26. Zhang R, Kwon O-S, Tadin D (2013) Illusory movement of stationary stimuli in the visual periphery: Evidence for a strong centrifugal prior in motion processing. J Neurosci 33(10):4415–4423.
27. Girshick AR, Landy MS, Simoncelli EP (2011) Cardinal rules: Visual orientation perception reflects knowledge of environmental statistics. Nat Neurosci 14(7):926–932.
28. Rogers ME, Butler RA (1992) The linkage between stimulus frequency and covert peak areas as it relates to monaural localization. Percept Psychophys 52(5):536–546.
29. Goossens HH, van Opstal AJ (1999) Influence of head position on the spatial representation of acoustic targets. J Neurophysiol 81(6):2720–2736.
30. Cabrera D, Ferguson S, Tilley S, Morimoto M (2005) Recent studies on the effect of signal frequency on auditory vertical localization. Proceedings of the International Conference on Auditory Display (ICAD, Limerick, Ireland).
31. Sadaghiani S, Maier JX, Noppeney U (2009) Natural, metaphoric, and linguistic auditory direction signals have distinct influences on visual motion processing. J Neurosci 29(20):6490–6499.
32. Ramachandran V, Hubbard E (2001) Synaesthesia: A window into perception, thought and language. J Conscious Stud 8(12):3–34.
33. Parise C, Spence C. Audiovisual cross-modal correspondences in the general population. Oxford Handbook of Synaesthesia, eds Simner J, Hubbard EM (Oxford Univ Press, Oxford, UK).
34. Kleiner M, Brainard D, Pelli D (2007) What's new in Psychtoolbox-3. Perception 36(14):1–16.



Supporting Information
Parise et al. 10.1073/pnas.1322705111
SI Text

Comparison Between the Localization Biases and the Statistics of Natural Sounds and Filtering Properties of the Outer Ear

Strictly speaking, the perceptual biases measured in the psychophysical task do not represent the internal mappings between frequency and elevation but the outcome of such mappings, which we estimated with a Bayesian model. Nevertheless, to further prove the link between the measured statistics and the observed behavior without relying on a model (which is necessarily based on a set of assumptions, which might as such be wrong), we also directly measured the correlation between the perceptual biases and the frequency–elevation mapping (FEM) in the environment and in the filtering properties of the outer ear. To do so, we used the frequency-dependent bias observed when participants were tilted by 90° (Fig. 1B, red lines), that is, when the head- and world-centered FEMs were made orthogonal, so that the elevation bias in head-centered coordinates should reflect the contribution of the head-centered FEM, whereas the azimuth bias (again in head-centered coordinates) should reflect the contribution of the world-centered FEM.

As in the previous section, the similarity between the shapes of the FEM measured from the psychophysical task (Fig. 1B, red lines) and from the statistics of the stimulus (Fig. 1C) was measured in terms of Pearson correlation. The mean and confidence interval of the correlation were calculated by an iterative resampling procedure, whereby the measured biases were correlated with the FEM computed from the mean of a subset of one-fifth of all the recordings (n = 9,962), one-fifth of the head-related transfer functions (HRTFs) (n = 9), and one-fifth of the observers (n = 2). The procedure was repeated 1,000 times.

The correlation between the azimuth bias and the statistics of the environment was 0.76 [95% confidence interval (c.i.) = 0.56–0.87], and the correlation between the azimuth bias and the FEM in the proximal stimulus was 0.89 (95% c.i. = 0.72–0.95). The correlation between the elevation bias and the statistics of the environment was 0.90 (95% c.i. = 0.83–0.96), and the correlation between the elevation bias and the FEM in the proximal stimulus was 0.78 (95% c.i. = 0.65–0.86). Notably, this pattern of correlations by and large confirms the findings based on the priors estimated using the Bayesian model and provides further converging evidence supporting the conclusion that the perceptual FEMs reflect the statistics of the proximal and distal stimuli.

[Fig. S1 graphic (transfer in dB, −20 to 20, as a function of frequency in kHz and elevation from −45° to 225°) not reproduced in this text version; see caption below.]
Fig. S1. Average HRTF, obtained by averaging all 45 HRTFs of the CIPIC database. The black dots, representing the elevation with maximum transfer for each
frequency, show a clear mapping between frequency and elevation. The FEM reported in Fig. 1C (Lower) was obtained by calculating for each individual HRTF
the elevation with maximum transfer for each frequency (i.e., the black dots here), and then averaging the results across the 45 HRTFs.



[Fig. S2 graphic not reproduced in this text version; see caption below. Panel A shows the 220 × 164 cm screen at a 130-cm viewing distance; panel B shows the 0°, 45°, and 90° body orientations.]

Fig. S2. (A) Schematic representation of the experimental setup. (B) Representation of the three different body orientations. The vertical arrows represent the
world-centered elevation; the tilted arrows represent the head-centered elevation. When the body of the participant is not tilted (Left), head- and world-
centered elevation overlap, whereas when the body is tilted by 90° (Right), the head- and world-centered elevations are orthogonal. The gray grids represent
the physical position of the speakers.

[Fig. S3 graphic (likelihood, head-centered prior, world-centered prior, and posterior, for body orientations 0°, 45°, and 90°) not reproduced in this text version; see caption below.]

Fig. S3. Schematic illustration of the Bayesian model. The icons on the left represent the different orientations of the observers (in rows). The left column
represents the likelihood function (the sensory information). The red dots represent the physical position of the stimulus s = ðsx ,sy Þ. The second and the third
columns represent the frequency-dependent priors on elevation in head- and world-centered coordinates, respectively. The last column on the right represents
the posterior distribution; the red dot represents the physical position of the stimuli, whereas the green dot represents the maximum a posteriori, that is, the
perceived position of the stimuli. Note how the perceived position is shifted away from the actual position as a function of both the frequency-dependent
priors and body orientation. Colors indicate the reference frame of the priors (magenta = head-centered; cyan = world-centered). This figure represents the
case of the localization of a 4.5–8-kHz band-pass auditory stimulus coming from the bottom-right speaker (swc = [15, −15]°). Frequency-independent distortions
of perceived auditory space (Modeling) are not represented.




Movie S1. Auditory stimuli used in the sound localization experiment. To better appreciate how perceived spatial elevation changes as a function of the
spectra of the stimuli, we recommend playing the sounds using loudspeakers (not headphones), and listening with the eyes closed.



