AVICAR: Audiovisual Speech Recognition in a Car

Audio-Visual Speech Recognition: Audio Noise, Video Noise, and Pronunciation Variability
Mark Hasegawa-Johnson, Electrical and Computer Engineering

Outline
1) Video Noise
   1) Graphical Methods: Manifold Estimation
   2) Local Graph Discriminant Features
2) Audio Noise
   1) Beamform, Post-Filter, and Low-SNR VAD
3) Pronunciation Variability
   1) Graphical Methods: Dynamic Bayesian Network
   2) An Articulatory-Feature Model for Audio-Visual Speech Recognition

I. Video Noise

AVICAR Database
AVICAR = Audio-Visual In a CAR
- 100 talkers
- 4 cameras, 7 microphones
- 5 noise conditions: engine idling, 35 mph, 35 mph with windows open, 55 mph, 55 mph with windows open
- Three types of utterances:
  - Digits and phone numbers, for training and testing phone-number recognizers
  - TIMIT sentences, for training and testing large-vocabulary speech recognition
  - Isolated letters, to test the use of video for an acoustically hard recognition problem

AVICAR Recording Hardware (Lee, Hasegawa-Johnson et al., ICSLP 2004)
- 8 mics, pre-amps, wooden baffle; best placement = sun visor
- 4 cameras, glare shields, adjustable mounting; best placement = dashboard
- The system is not permanently installed; mounting requires 10 minutes.

AVICAR Video Noise
- Lighting: many different angles, many types of weather
- Interlace: 30 fps NTSC encoding used to transmit data from camera to digital video tape
- Facial features: hair, skin, clothing, obstructions
[AVICAR noisy image examples]

Related Problem: Dimensionality
- Dimension of the raw grayscale lip rectangle: 30x200 = 6000 pixels
- Dimension of the DCT of the lip rectangle: 30x200 = 6000 dimensions
- Smallest truncated DCT that allows a human viewer to recognize lip shapes (Hasegawa-Johnson, informal experiments): 25x25 = 625 dimensions
- Truncated DCT typically used in AVSR: 4x4 = 16 dimensions (a sketch follows below)
- Dimension of geometric lip features that allow high-accuracy AVSR (e.g., Chu and Huang, 2000): 3 dimensions (lip height, lip width, vertical asymmetry)
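The truncated-DCT feature described above is easy to sketch. A minimal version follows, assuming the grayscale lip rectangle is available as a NumPy array; the 4x4 truncation, crop size, and lack of normalization are illustrative choices, not the exact AVICAR pipeline.

```python
# Minimal sketch: truncated 2-D DCT lip features (illustrative sizes, not the
# exact AVICAR front end). Assumes NumPy/SciPy are available.
import numpy as np
from scipy.fft import dct

def lip_dct_features(lip_rect: np.ndarray, keep: int = 4) -> np.ndarray:
    """Return the top-left `keep` x `keep` block of the 2-D DCT of a
    grayscale lip rectangle, flattened into a feature vector."""
    # 2-D type-II DCT = DCT along rows, then along columns.
    coeffs = dct(dct(lip_rect, axis=0, norm="ortho"), axis=1, norm="ortho")
    # Low-order coefficients carry the coarse lip shape; discard the rest.
    return coeffs[:keep, :keep].ravel()

# Example: a 30 x 200 lip rectangle -> 16-dimensional feature vector.
frame = np.random.rand(30, 200)
print(lip_dct_features(frame).shape)   # (16,)
```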

Dimensionality Reduction: The Classics
- Principal Components Analysis (PCA): project onto eigenvectors of the total covariance matrix; the projection includes noise.
- Linear Discriminant Analysis (LDA): project onto v = W^{-1}(m_1 - m_2), where W is the within-class covariance and m_1, m_2 are the class means; the projection reduces noise.
[Scatter plots contrasting the PCA and LDA projection directions on two-class data.]

Manifold Estimation (e.g., Roweis and Saul, Science 2000)
- Neighborhood graph: each node is a data point; edges connect each data point to its K nearest neighbors.
- The K nearest neighbors of each data point define the local (K-1)-dimensional tangent space of a manifold.
[Scatter plot of a neighborhood graph over the same data.]

Local Discriminant Graph (Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007)
Maximize the local inter-manifold interpolation error, subject to a constant same-class interpolation error (a sketch follows below):
- Find P to maximize D = Σ_i ||P^T (x_i − Σ_k c_k y_k)||^2, where y_k ∈ KNN(x_i) are drawn from other classes
- Subject to S = constant, S = Σ_i ||P^T (x_i − Σ_j c_j x_j)||^2, where x_j ∈ KNN(x_i) are drawn from the same class
[Scatter plot illustrating same-class and other-class nearest-neighbor interpolation.]
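A minimal sketch of a local-discriminant-graph style projection follows. It assumes uniform interpolation weights over the K nearest neighbors (the ICIP 2007 paper solves a constrained least-squares problem for the weights c, and its exact graph construction may differ); the projection P is then obtained from a generalized eigenvalue problem between the two error-scatter matrices.

```python
# Sketch of an LDG-style projection with uniform interpolation weights
# (the reconstruction-weight solve from the paper is omitted).
import numpy as np
from scipy.linalg import eigh

def ldg_projection(X, y, K=5, out_dim=16, ridge=1e-6):
    """X: (N, d) feature matrix; y: (N,) label array. Returns P of shape (d, out_dim)."""
    N, d = X.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    between = np.zeros((d, d))   # scatter of inter-manifold interpolation errors
    within = np.zeros((d, d))    # scatter of same-class interpolation errors
    for i in range(N):
        same = np.flatnonzero((y == y[i]) & (np.arange(N) != i))
        other = np.flatnonzero(y != y[i])
        nn_same = same[np.argsort(dist[i, same])[:K]]     # K same-class neighbors
        nn_other = other[np.argsort(dist[i, other])[:K]]  # K other-class neighbors
        e_b = X[i] - X[nn_other].mean(axis=0)   # uniform-weight interpolation error
        e_w = X[i] - X[nn_same].mean(axis=0)
        between += np.outer(e_b, e_b)
        within += np.outer(e_w, e_w)
    # Maximize the between-class error subject to a constant same-class error:
    # take the leading generalized eigenvectors of (between, within).
    _, vecs = eigh(between, within + ridge * np.eye(d))
    return vecs[:, ::-1][:, :out_dim]

# Example: project 16-dim DCT lip features down to 3 discriminant dimensions.
X = np.random.randn(200, 16); y = np.repeat(np.arange(10), 20)
print(ldg_projection(X, y, K=5, out_dim=3).shape)   # (16, 3)
```

In this form the trade-off "maximize D subject to constant S" reduces to a generalized eigenproblem on the two scatter matrices, in the same way that LDA trades off between-class and within-class scatter.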

PCA, LDA, LDG: Experimental Test (Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007)
Lip feature extraction methods compared: DCT = discrete cosine transform; PCA = principal components analysis; LDA = linear discriminant analysis; LEA = local eigenvector analysis; LDG = local discriminant graph.

Lip Reading Results, Digits (Fu, Zhou, Liu, Hasegawa-Johnson and Huang, ICIP 2007)
[Bar chart: word error rate (%) vs. Gaussians per state (2, 4, 8, 32) for DCT, PCA, LDA, LEA, and LDG lip features; WER axis spans roughly 61-64.5%.]

II. Audio Noise

Audio Noise Compensation
- Beamforming: filter-and-sum (MVDR) vs. delay-and-sum (a delay-and-sum sketch follows below)
- Post-filter: MMSE log spectral amplitude estimator (Ephraim and Malah, 1984) vs. spectral subtraction
- Voice activity detection: likelihood ratio method (Sohn and Sung, ICASSP 1998), with three noise estimates:
  - fixed noise
  - time-varying noise (autoregressive estimator)
  - high-variance noise (backoff estimator)
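For reference, here is a minimal delay-and-sum sketch. It assumes the per-channel delays toward the talker are already known as non-negative integer sample counts (in practice they would be estimated from the array geometry or by cross-correlation); the MVDR filter-and-sum beamformer is not shown.

```python
# Minimal delay-and-sum beamformer sketch (assumes known, non-negative integer
# sample delays per microphone; not the AVICAR MVDR filter-and-sum system).
import numpy as np

def delay_and_sum(channels, delays):
    """channels: (n_mics, n_samples) array; delays: non-negative integer sample
    delays that time-align each microphone to the talker. Returns one channel."""
    n_mics, n_samples = channels.shape
    aligned = np.zeros_like(channels, dtype=float)
    for m, d in enumerate(delays):
        # Advance each channel by its delay so the speech components line up.
        aligned[m, : n_samples - d] = channels[m, d:]
    # Speech adds coherently across microphones; diffuse car noise partly cancels.
    return aligned.mean(axis=0)
```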

MVDR Beamformer + MMSE-logSA Postfilter
- MVDR = minimum variance distortionless response
- MMSE-logSA = MMSE log spectral amplitude estimator
- Proof of optimality: Balan and Rosca, ICASSP 2002
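Of the two post-filters compared in this work, spectral subtraction is the simpler to sketch. The version below assumes a single-channel input (e.g., the beamformer output) and estimates the noise magnitude from the first few frames; the Ephraim-Malah MMSE-logSA estimator involves a-priori SNR tracking and is not reproduced here.

```python
# Minimal magnitude-domain spectral-subtraction post-filter sketch.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs, noise_frames=10, floor=0.05):
    """Spectral subtraction with a small spectral floor to limit musical noise."""
    _, _, X = stft(x, fs=fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    # Noise magnitude estimated from the first few (assumed speech-free) frames.
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise_mag, floor * noise_mag)
    _, y = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return y
```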

Word Error Rate: Beamformers
Ten-digit phone numbers, trained and tested with a 50/50 mix of quiet (idle) and noisy (55 mph, windows open) data. DS = delay-and-sum; MVDR = minimum variance distortionless response.

Word Error Rate: Postfilters

Voice Activity Detection
Most errors at low SNR occur because noise is misrecognized as speech. An effective solution is voice activity detection (VAD).

Likelihood-ratio VAD (Sohn and Sung, ICASSP 1998):
Λ_t = log [ p(X_t | speech: X_t = S_t + N_t) / p(X_t | non-speech: X_t = N_t) ]
where X_t is the measured power spectrum and S_t, N_t are exponentially distributed speech and noise.
Λ_t > threshold → speech present; Λ_t < threshold → speech absent.

VAD: Noise Estimators (a sketch of the backoff estimator follows below)
- Fixed estimate: N_0 = average of the first 10 frames
- Autoregressive estimator (Sohn and Sung): N_t = α_t X_t + (1 - α_t) N_{t-1}, where α_t is a function of X_t and N_0
- Backoff estimator (Lee and Hasegawa-Johnson, DSP for In-Vehicle and Mobile Systems, 2007): N_t = α_t X_t + (1 - α_t) N_0
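The following is a simplified sketch in the spirit of the likelihood-ratio VAD with the backoff noise estimator. The frame score is a mean log a-posteriori SNR rather than the exact Sohn-Sung likelihood ratio, and the smoothing weight α_t is an assumed function of the frame SNR, not the schedule from the paper.

```python
# Simplified VAD sketch: log-SNR frame score + backoff noise estimation.
import numpy as np

def vad_backoff(power_spec, n_init=10, threshold=1.0):
    """power_spec: (n_frames, n_bins) power spectrogram.
    Returns one speech/non-speech decision per frame."""
    N0 = power_spec[:n_init].mean(axis=0)      # fixed estimate from leading frames
    Nt = N0.copy()
    decisions = []
    for X_t in power_spec:
        snr = X_t / np.maximum(Nt, 1e-12)      # a-posteriori SNR per frequency bin
        score = np.mean(np.log(np.maximum(snr, 1e-12)))
        decisions.append(score > threshold)
        # Backoff estimator: blend the current frame with the *fixed* initial
        # estimate N0 (rather than the previous frame's estimate), which bounds
        # how far the running noise estimate can drift when the noise variance
        # is high. The weight schedule below is an assumption, not the paper's.
        alpha = 1.0 / (1.0 + np.mean(snr))
        Nt = alpha * X_t + (1.0 - alpha) * N0
    return np.array(decisions)
```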

Word Error Rate: Digits
[Bar chart: word error rate (%) vs. noise condition (idle, 35U, 35D, 55U, 55D) for backoff, autoregressive, and fixed noise estimation; y-axis roughly 0-10%.]

III. Pronunciation Variability

Graphical Methods: Dynamic Bayesian Network
- Bayesian network (BN) = a graph in which nodes are random variables (RVs) and edges represent dependence.
- Dynamic Bayesian network (DBN) = a BN in which the RVs are repeated once per time step.
- Example: an HMM is a DBN.
- Most important RV: the phonestate variable q_t, typically q_t ∈ {phones} x {1, 2, 3}.
- Acoustic features x_t and video features y_t depend on q_t.

Example: an HMM is a DBN (a bookkeeping sketch follows below)
[DBN diagram, frames t-1 and t: word w_t, word-increment flag winc_t, word position ν_t, phonestate q_t, phone-increment flag qinc_t, audio observation x_t, video observation y_t.]
- q_t is the phonestate, e.g., q_t ∈ {/w/1, /w/2, /w/3, /n/1, /n/2, ...}
- w_t is the word label at time t, e.g., w_t ∈ {one, two, ...}
- ν_t is the position of phone q_t within word w_t: ν_t ∈ {1st, 2nd, 3rd, ...}
- qinc_t ∈ {0, 1} specifies whether ν_{t+1} = ν_t or ν_{t+1} = ν_t + 1
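To make the DBN bookkeeping concrete, here is a minimal sketch: the phonestate is read from a pronunciation dictionary at position ν_t of word w_t, and the increment flags decide how ν_t and w_t evolve. The dictionary entries and phonestate names are hypothetical, and inference and training are not shown.

```python
# Sketch of the deterministic bookkeeping in the HMM-as-a-DBN above.
# The tiny pronunciation dictionary and phonestate names are hypothetical.
PRON = {
    "one": ["w1", "w2", "w3", "ah1", "ah2", "ah3", "n1", "n2", "n3"],
    "two": ["t1", "t2", "t3", "uw1", "uw2", "uw3"],
}

def step(word, nu, qinc, winc, next_word):
    """Advance (w_t, nu_t) to (w_{t+1}, nu_{t+1}) given the increment flags."""
    if winc:                      # word transition: start the next word
        return next_word, 0
    if qinc:                      # phone transition within the same word
        return word, min(nu + 1, len(PRON[word]) - 1)
    return word, nu               # stay in the same phonestate

# One frame of the generative story: observations x_t, y_t depend on q_t.
word, nu = "one", 0
q_t = PRON[word][nu]
word, nu = step(word, nu, qinc=1, winc=0, next_word="two")
print(q_t, "->", PRON[word][nu])  # w1 -> w2
```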

Pronunciation Variability
- Even when reading phone numbers, talkers blend articulations. For example, "seven eight" /sɛvən eɪt/ may be produced as [sɛvn̩ eɪʔ].
- As speech gets less formal, pronunciation variability gets worse: worse in a car than in the lab, and worse in conversation than in read speech.

A Related Problem: Asynchrony
- Audio and video information are not synchronous.
- For example, the "th" (/θ/) in "three" is visible, but not yet audible, because the audio is still silent. Should the HMM be in q_t = silence, or q_t = /θ/?

A Solution: Two State Variables (Chu and Huang, ICASSP 2000)
[Coupled-HMM diagram, frames t-1 and t: an audio chain (w_t, winc_t, ν_t, q_t, qinc_t, x_t) coupled to a video chain (τ_t, v_t, vinc_t, y_t).]

Coupled HMM (CHMM): two parallel HMMs (a composite-state sketch follows below)
- q_t: audio state (x_t: audio observation)
- v_t: video state (y_t: video observation)
- Asynchrony = the difference between the audio-chain and video-chain positions (ν_t - τ_t), capped at |ν_t - τ_t| < 3
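To see what the asynchrony cap means for decoding, the sketch below enumerates the composite (audio position, video position) states for one word and their legal successors; the state counts are illustrative and the probability model is omitted.

```python
# Sketch of the coupled-HMM state space: the effective decoding state is the
# pair (audio position, video position), with asynchrony capped at < 3 states.
from itertools import product

def chmm_states(n_audio, n_video, max_async=2):
    """Enumerate legal composite CHMM states for one word."""
    return [(a, v) for a, v in product(range(n_audio), range(n_video))
            if abs(a - v) <= max_async]

def chmm_successors(state, n_audio, n_video, max_async=2):
    """Each chain may stay or advance by one; keep only states within the cap."""
    a, v = state
    nexts = {(a + da, v + dv) for da in (0, 1) for dv in (0, 1)}
    legal = set(chmm_states(n_audio, n_video, max_async))
    return sorted(nexts & legal)

print(len(chmm_states(5, 5)))        # composite states for a 5x5-state word
print(chmm_successors((2, 0), 5, 5)) # advancing audio alone would exceed the cap
```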

Asynchrony in Articulatory Phonology (Livescu and Glass, 2004)
It is not really the AUDIO and VIDEO that are asynchronous; it is the LIPS, TONGUE, and GLOTTIS that are asynchronous.
[DBN diagram: for each articulatory stream i, an index variable ind_i, an underlying state U_i, and a surface state S_i, with synchrony variables sync_{1,2} and sync_{2,3} linking adjacent streams, all conditioned on the word.]

"three," dictionary form: the tongue moves dental /θ/ → retroflex /r/ → palatal /i/ while the glottis moves silent → unvoiced → voiced, in step.
"three," casual speech: the glottis lags the tongue, so the tongue is already in the dental /θ/ position while the glottis is still silent; the /θ/ is visible before it is audible.

Asynchrony in Articulatory Phonology
The same mechanism represents pronunciation variability:
- "seven": /sɛvən/ → [sɛvn̩] if the tongue closes before the lips open
- "eight": /eɪt/ → [eɪʔ] if the glottis closes before the tongue tip closes

"seven," dictionary form /sɛvən/: the lips form the fricative /v/ while the tongue moves fricative /s/ → wide /ɛ/ → neutral /ə/ → closed /n/.
"seven," casual speech [sɛvn̩]: the tongue closure for /n/ overlaps the labial fricative /v/, so the neutral /ə/ disappears.

An Articulatory Feature Model (Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)
[DBN diagram, frames t-1 and t: the word w_t and its increment/position variables feed three parallel articulatory streams, each with its own state and increment variables: lips (l_t, linc_t), tongue (t_t, tinc_t), and glottis (g_t, ginc_t).]
- There is no phonestate variable. Instead, we use a vector q_t = [l_t, t_t, g_t]:
  - lipstate variable l_t
  - tonguestate variable t_t
  - glotstate variable g_t
(A factored-state sketch follows below.)
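As an illustration of the factored state (not the actual state inventory of the ICPhS 2007 model), the sketch below represents q_t as a lips/tongue/glottis triple and contrasts dictionary-form "three" with a casual production in which the glottis lags the tongue.

```python
# Illustrative factored articulatory state: the hidden state is a vector of
# lip, tongue, and glottis states rather than a single phonestate. The feature
# values below are made up for illustration; asynchrony constraints are omitted.
from dataclasses import dataclass

@dataclass(frozen=True)
class ArticulatoryState:
    lips: str      # e.g., "open", "fricative", "closed"
    tongue: str    # e.g., "dental", "retroflex", "palatal"
    glottis: str   # e.g., "silent", "unvoiced", "voiced"

# Dictionary-form "three": tongue and glottis change together.
dictionary_three = [
    ArticulatoryState("open", "dental", "unvoiced"),
    ArticulatoryState("open", "retroflex", "unvoiced"),
    ArticulatoryState("open", "palatal", "voiced"),
]
# Casual speech: the tongue reaches "dental" while the glottis is still "silent".
casual_three = [ArticulatoryState("open", "dental", "silent")] + dictionary_three
print(casual_three[0])
```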

Experimental Test (Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)
Training and test data: CUAVE corpus (Patterson, Gurbuz, Tufekci and Gowdy, ICASSP 2002)
- 169 utterances used, 10 digits each, silence between words
- Recorded without audio or video noise (studio lighting, silent background)
- Audio prepared by Kate Saenko at MIT: NOISEX speech babble added at various SNRs; MFCC+d+dd feature vectors, 10 ms frames
- Video prepared by Amar Subramanya at UW: feature vector = DCT of the lip rectangle, upsampled from 33 ms frames to 10 ms frames

Experimental Condition: Train-Test Mismatch
- Training on clean data
- Audio/video weights tuned on noise-specific dev sets
- Language model: uniform (all words equally probable), constrained to have the right number of words per utterance

Experimental Questions (Hasegawa-Johnson, Livescu, Lal and Saenko, ICPhS 2007)
1) Does video reduce word error rate?
2) Does audio-video asynchrony reduce word error rate?
3) Should asynchrony be represented as audio-video asynchrony (CHMM) or as lips-tongue-glottis asynchrony (AFM)?
4) Is it better to use only CHMM, only AFM, or a combination of both methods?

Results, part 1: Should we use video? Answer: YES. Audio-visual WER < single-stream WER.
[Bar chart: WER (%) for audio-only, video-only, and audiovisual systems at clean, 12 dB, 10 dB, 6 dB, 4 dB, and -4 dB SNR; y-axis roughly 0-90%.]

Results, part 2: Are audio and video asynchronous? Answer: YES. Asynchronous WER < synchronous WER.
[Bar chart: WER (%) for no asynchrony, 1-state, 2-state, and unlimited asynchrony at clean, 12 dB, 10 dB, 6 dB, 4 dB, and -4 dB SNR; y-axis roughly 0-70%.]

Results, part 3: Should we use CHMM or AFM? Answer: IT DOESN'T MATTER! The WERs are equal.
[Bar chart: WER (%) for the phone-viseme (CHMM) system vs. the articulatory-feature system at clean, 12 dB, 10 dB, 6 dB, 4 dB, and -4 dB SNR; y-axis roughly 0-80%.]

Results, part 4: Should we combine systems? Answer: YES. The best result is the AFM+CH1+CH2 ROVER combination (A = AFM; C1, C2 = CHMM systems).
[Bar chart: WER (%) roughly 17-23% for the A+C1+C2 ROVER vs. the CU+C1+C2 ROVER combination.]
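ROVER combination produced the best result above. Below is a minimal sketch of the voting step, assuming the three systems' word hypotheses are already aligned one-for-one; the actual ROVER procedure first aligns hypotheses into a word transition network by dynamic programming and can weight votes by confidence. The digit strings are made up.

```python
# Minimal ROVER-style voting sketch: position-wise majority vote over
# pre-aligned word hypotheses from several recognizers.
from collections import Counter

def rover_vote(hypotheses):
    """hypotheses: one word list per system, all of the same length."""
    return [Counter(words).most_common(1)[0][0] for words in zip(*hypotheses)]

afm   = ["two", "five", "seven", "oh"]
chmm1 = ["two", "nine", "seven", "oh"]
chmm2 = ["two", "five", "seven", "four"]
print(rover_vote([afm, chmm1, chmm2]))   # ['two', 'five', 'seven', 'oh']
```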

Conclusions
- Video feature extraction: a local manifold discriminant is better than a global discriminant.
- Audio feature extraction:
  - Beamformer: delay-and-sum beats filter-and-sum.
  - Postfilter: spectral subtraction gives the best WER (though MMSE-logSA sounds best).
  - VAD: backoff noise estimation works best in this corpus.
- Audio-video fusion:
  - Video reduces WER in train-test mismatch conditions.
  - Audio and video are asynchronous (CHMM); lips, tongue, and glottis are asynchronous (AFM).
  - It doesn't matter whether you use CHMM or AFM, but the best result combines both representations.
