Award Abstract # 1248047
EAGER: Collaborative Research: Towards Modeling Human Speech Confusions in Noise

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: INTERNATIONAL COMPUTER SCIENCE INSTITUTE
Initial Amendment Date: August 8, 2012
Latest Amendment Date: August 8, 2012
Award Number: 1248047
Award Instrument: Standard Grant
Program Manager: Tatiana Korelsky
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: August 1, 2012
End Date: July 31, 2015 (Estimated)
Total Intended Award Amount: $100,000.00
Total Awarded Amount to Date: $100,000.00
Funds Obligated to Date: FY 2012 = $100,000.00
History of Investigator:
  • Nelson Morgan (Principal Investigator)
    morgan@icsi.berkeley.edu
Recipient Sponsored Research Office: International Computer Science Institute
2150 SHATTUCK AVE
BERKELEY
CA  US  94704-1345
(510)666-2900
Sponsor Congressional District: 12
Primary Place of Performance: International Computer Science Institute
CA  US  94704-1198
Primary Place of Performance Congressional District: 12
Unique Entity Identifier (UEI): GSRMP1QCXU74
Parent UEI:
NSF Program(s): Robust Intelligence
Primary Program Source: 01001213DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7495, 7916
Program Element Code(s): 749500
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

This EArly-concept Grant for Exploratory Research (EAGER) supports an exploratory study to evaluate model components for predicting human speech recognition in the presence of noise. Such a model has the potential to predict confusions between fine phonetic distinctions at different levels of background noise and at different speaking rates. The study takes advantage of modern physiological results indicating that the primary auditory cortex performs spectro-temporal filtering; that is, that there are cells sensitive to particular spectro-temporal modulations at each auditory frequency. In this project, perceptual experiments on a database of CVC syllables recorded at two different speaking rates, presented in both stationary and non-stationary additive noise at different signal-to-noise ratios, yield confusion statistics. These statistics are then compared to those produced by an auditory model enhanced with elements incorporating these spectro-temporal filters.
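
The abstract does not spell out the model components themselves, but as a rough illustration of spectro-temporal modulation filtering, the following sketch (in Python, with the hypothetical helper names gabor_strf and modulation_features, and arbitrarily chosen modulation rates and scales) convolves a log-mel-style spectrogram with a small bank of 2D Gabor kernels, each tuned to one temporal modulation rate and one spectral modulation scale.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_strf(rate_hz, scale_cyc_per_chan, frame_rate_hz=100.0, size=(15, 25)):
    """One 2D Gabor kernel (freq x time) tuned to a single spectro-temporal
    modulation: rate_hz cycles/second in time and scale_cyc_per_chan
    cycles/channel in frequency.  All parameter values here are illustrative,
    not the project's actual settings."""
    n_f, n_t = size
    t = (np.arange(n_t) - n_t // 2) / frame_rate_hz            # seconds
    f = np.arange(n_f) - n_f // 2                               # mel-channel offsets
    carrier = np.cos(2.0 * np.pi * (rate_hz * t[None, :] +
                                    scale_cyc_per_chan * f[:, None]))
    envelope = np.outer(np.hanning(n_f), np.hanning(n_t))       # smooth taper
    kernel = envelope * carrier
    return kernel - kernel.mean()                                # remove DC response

def modulation_features(log_mel, rates=(2.0, 8.0, 16.0), scales=(0.1, 0.25)):
    """Filter a (channels x frames) log-mel spectrogram with a small Gabor bank,
    returning one filtered copy per (rate, scale) pair."""
    return np.stack([fftconvolve(log_mel, gabor_strf(r, s), mode="same")
                     for r in rates for s in scales])

# Toy usage on a random 40-channel, 300-frame "spectrogram".
feats = modulation_features(np.random.randn(40, 300))
print(feats.shape)   # (6, 40, 300): 3 rates x 2 scales
```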

Successful results from this study will suggest enhancements to current hearing models and ultimately, after a broader study for which this EAGER is a pilot, advance the understanding of human speech perception. Background noise presents a challenging problem for a variety of speech and hearing devices, including hearing aids and automatic speech recognition (ASR) systems. Since normal-hearing human listeners are extremely adept at perceiving speech in noise, this improved understanding of human perception could lead to better artificial systems for speech processing. The databases and tools developed for this study will be disseminated to the research community.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

A collaboration between the Speech Processing and Auditory Perception Laboratory at UCLA and the Speech Group at ICSI focused on refining the simple models used in Automatic Speech Recognition (ASR) with representations developed from observations of mammalian auditory physiology. While no animal experiments were performed in this study, earlier work reported by researchers at the University of Maryland suggested that auditory systems are especially sensitive to particular ranges of modulations over time and frequency. Processing techniques based on these observations have proved useful in many previous ASR experiments. In this study, though, the goal was to see whether incorporating these insights into ASR approaches would yield a pattern of errors (for consonants in Consonant-Vowel-Consonant (CVC) syllables) more similar to what would be observed for human perception.

More specifically, we wanted to see whether correlations with the pattern of human perceptual errors for CVC syllables in noisy and rapid speech were improved by using modulation-based features in a modern ASR system. To explore this, the UCLA team recorded a corpus of spoken CVC syllables, then conducted listening tests and analyzed the results to determine the pattern of errors made by the listeners. These results were then passed to the ICSI team, who developed the ASR systems and compared the machine error patterns with the perceptual ones.
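
The report does not spell out how the correlation between machine and human error patterns was computed; the sketch below shows one plausible formulation, assuming a Pearson correlation computed over the off-diagonal cells of row-normalized consonant confusion matrices. The function name and the three-class toy matrices are illustrative only.

```python
import numpy as np

def confusion_correlation(human_counts, machine_counts):
    """Pearson correlation between two confusion matrices (rows = presented
    consonant, columns = response), compared over off-diagonal cells only."""
    def to_proportions(m):
        m = np.asarray(m, dtype=float)
        return m / m.sum(axis=1, keepdims=True)        # row-normalize counts
    h = to_proportions(human_counts)
    a = to_proportions(machine_counts)
    off_diag = ~np.eye(h.shape[0], dtype=bool)         # ignore correct responses
    return np.corrcoef(h[off_diag], a[off_diag])[0, 1]

# Toy example with three consonant classes.
human_cm   = [[80, 15,  5], [10, 85,  5], [ 4,  6, 90]]
machine_cm = [[70, 20, 10], [12, 80,  8], [ 5, 10, 85]]
print(round(confusion_correlation(human_cm, machine_cm), 2))
```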

One result in particular was quite notable: namely, that using only the higher temporal modulations (i.e., components of the speech where the spectral content was varying quickly) was sometimes more helpful than including information from the entire range of temporal modulations. The most prominent of these results occurred for rapid speech, which could be expected since the spectral content varies more quickly in that case; however, a significant improvement in correlation with human consonant perception was really only observed for noisy speech. For speech recognition researchers, in an era when the efficacy of machine learning approaches is the focus of much attention, it should be interesting that restricting the observed features to higher modulations gives improved performance for noisy and rapid speech.
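
The report does not say how the feature set was restricted to higher temporal modulations; one simple way to approximate such a restriction, assuming a per-channel high-pass filter applied along the time axis of a log-mel spectrogram (with the cutoff frequency, filter order, and function name chosen purely for illustration), is sketched below.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def keep_high_temporal_modulations(log_mel, cutoff_hz=8.0,
                                   frame_rate_hz=100.0, order=4):
    """Suppress slow temporal modulations of a (channels x frames) log-mel
    spectrogram by high-pass filtering each channel's trajectory over time.
    cutoff_hz and order are illustrative choices, not the project's settings."""
    b, a = butter(order, cutoff_hz / (frame_rate_hz / 2.0), btype="high")
    return filtfilt(b, a, log_mel, axis=1)     # zero-phase filtering over frames

# Toy usage: slow drift is largely removed, fast fluctuations survive.
slow = np.cumsum(np.random.randn(40, 400), axis=1) * 0.02   # slowly varying part
fast = np.random.randn(40, 400)                              # rapidly varying part
filtered = keep_high_temporal_modulations(slow + fast)
print(filtered.shape)   # (40, 400)
```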

The table below summarizes these results.

 

                            Machine correlation with perception
Testing condition        Baseline model   All modulation features   High modulations only
Clean, all speech             0.96                 0.93                     0.95
Clean, rapid speech           0.94                 0.92                     0.94
Clean, slow speech            0.94                 0.92                     0.95
Noisy, all speech             0.24                 0.22                     0.28
Noisy, rapid speech           0.20                 0.26                     0.31
Noisy, slow speech
