Pattern Theory: the Mathematics of Perception Prof. David Mumford Division of Applied Mathematics Brown University International Congress of Mathematics Beijing, 2002 Outline of talk I. Background: history, motivation, basic definitions II. A basic example Hidden Markov Models and speech; and extensions

III. The natural degree of generality Markov Random Fields; and vision applications IV. Continuous models: image processing via PDEs, self-similarity of images and random diffeomorphisms URL: www.dam.brown.edu/people/mumford/Papers /ICM02powerpoint.pdf or /ICM02proceedings.pdf Some History Is there a mathematical theory underlying intelligence? 40s Control theory (Wiener-Pontrjagin), the output side: driving a motor with noisy feedback in a noisy world to achieve a given state 70s ARPA speech recognition program 60s-80s AI, esp. medical expert systems, modal,

temporal, default and fuzzy logics and finally statistics 80s-90s Computer vision, autonomous land vehicle Statistics vs. Logic Plato: If Theodorus, or any other geometer, were prepared to rely on plausibility when he was doing geometry, he'd be worth absolutely nothing. Graunt counting corpses in medieval London Gauss Gaussian distributions, least squares relocating lost Ceres from noisy incomplete data Control theory the Kalman-Wiener-Bucy filter AI Enhanced logics < Bayesian belief networks Vision Boolean combinations of features < Markov random fields

What you perceive is not what you hear: 1. 2. 3. 4. ACTUAL SOUND The ?eel is on the shoe The ?eel is on the car The ?eel is on the table The ?eel is on the orange 1.

2. 3. 4. PERCEIVED WORDS The heel is on the shoe The wheel is on the car The meal is on the table The peel is on the orange (Warren & Warren, 1970) Statistical inference is being used! Why is this old man recognizable from a cursory glance?

His outline is lost in clutter, shadows and wrinkles; except for one ear, his face is invisible. No known algorithm will find him. The Bayesian Setup, I VARIABLES: xo observable variables, xh hidden (not directly observable) variables, parameters in model MODEL: Pr( xo | xh , ).Pr( xh | ) or Pr( xh , ) ML PARAMETER ESTIMATION:

arg max Pr( x ( ) | ) or arg max Pr( xo( ) , xh | ) xh The Bayesian Setup, II BAYESS RULE: Pr( xo | xh , ).Pr( xh | ) Pr( xh | xo , ) Pr( xo | )

Pr( xo | xh , ).Pr( xh | ) This is called the posterior distribution on xh Sampling Pr(x ,x |), synthesis is the acid test o h of the model The central problem of Statistical learning theory: complexity of the model and the Bias-Variance dilemma The * Minimum Description LengthMDL,

* Vapniks VC dimension A basic example: HMMs and speech recognition I. Setup sk = sound in window around time kt are the observables xk = part of phoneme being spoken at time kt are the hidden vars. Pr( x , s ) p1 ( xk | xk 1 ). p2 ( sk | xk ) k Sk-1 sk Sk+1

xk+1 xk-1 xk s = logs of table of values of p1, p2, so 1 k k . Ek ( x , s ) Pr( x , s ) e Z A basic example: HMMs and speech recognition II. Inference by dynamic programming: (a) p (x

1 Pr( xk | sk ) k | xk 1 ). p2 ( sk | xk ).Pr( xk 1 | sk 1 ) xk 1 numerator xk (b) Call maxp( xk ) max xk 1 Pr( xk , xk 1 , sk ), then maxp( xk ) max xk 1 p1 ( xk | xk 1 ). p2 ( sk | xk ).maxp( xk 1 )

(c) Optimizing the s done by EM algorithm, valid for any exponential model Continuous and discrete variables in perception Perception locks on to discrete labels, and the world is made up of discrete objects/events Make an empirical histogram of changes x and compute the kurtosis: =Exp(( x x ) 4 ) / 4 High kurtosis is natures universal signal of discrete events/objects in space-time. Stochastic process with i.i.d. increments has jumps iff the kurtosis of its increments is > 3.

A typical stochastic process with jumps Xt stochastic process with independent increments, then z X t 1 X t e z2 / 2 2 X t Brownian, hence continuous z X t 1 X t e z X t has jumps: Ex.: daily log-price changes in a sample of stocks

Note fat power law tails N.B. vertical axis is log of probability Particle filtering Compiling full conditional probability tables is usually impractical. Use a weighted sample: Pr( xk | sk ) weak w i,k

i xi ,k ( xk ) Bootstrap particle filtering: (a) Sample with replacement (b) Diffuse via p1 (c) Reweight via p2 Tracking application (from A.Blake, M.Isard): Estimating the posterior distribution on

optical flow in a movie (from M.Black) Horizontal flow (follow window in red) Horizontal flow Horizontal flow Horizontal flow Horizontal flow Horizontal flow

No process is truly Markov Speech has longer range patterns than phonemes: triphones, words, sentences, speech acts, PCFGs = probabilistic context free grammars = almost surely finite, labeled, random branching processes: Forest of random trees Tn, labels xv on vertices, leaves in 1:1 corresp with observations sm, prob. p1(xvk|xv) on children, p2(sm|xm) on observations.). Unfortunate fact: nature is not so obliging, longer range constraints force context-sensitive grammars. But how to make these stochastic?? Grammar in the parsed speech of Helen, a 2 year old

Grammar in images (G. Kanisza): contour completion Markov Random Fields: the natural degree of generality Time linear structure of dependencies space/space-time/abstract situations general graphical structure of dependencies G (V , E ), a graph,

xv vV , the variables, 1 C EC ( xC ) Pr( xV ) e , Z C the cliques of G The Markov property: xv, xw are conditionally independent, given xS , if S separates v,w in G. Hammersley-Clifford: the converse. A simple MRF: the Ising model 1 Pr( x , s ) e ZT

c. ( x s ) 2 sk+1,l-1 sk,l-1 sk-1,l-1

, sk+1,l sk,l sk-1,l xk,l-1 sk+1,l+1 sk,l+1

xk+1,l xk,l xk-1,l , x 1, s sk-1,l+1 xk+1,l-1 xk-1,l-1

x x T adj. xk+1,l+1 xk,l+1 xk-1,l+1 The Ising model and image segmentation A state-of-the-art image segmentation algorithm (S.-C. Zhu)

Input Segmentation Synthesis from model I ~ p( I | W*) Hidden variables describe segments and their texture, allowing both slow and abrupt intensity and texture changes (See also Shi-Malik) Texture synthesis via MRFs On left: a cheetah hide; In middle, a sample from the Gaussian model with identical second order statistics;

On right, a sample from exponential model reproducing 7 filter marginals using: k (( Fk I )( x )) 1 Pr( I ) e k Z Monte Carlo Markov Chains Basic idea: use artificial thermal dynamics to find minimum energy (=maximum probability) states Bayesian belief propagation and the Bethe approximation Can find modes of MRFs on trees using dynamic programming

When the graph G is not a tree, use its universal covering graph G The Bethe approximation = the 1(G)-G)-invariant MRF on G closest to the given MRF, closest = minimizing Kullback-Liebler information distance Bayesian belief propagation = finding the modes of the Bethe approximation with dynamic programming Continuous models I: deblurring and denoising Observe noisy, blurred image I, seek to remove noise, enhance edges simultaneously! min c1 ( I J ) 2 dxdy c2 J D , D

D p dxdy c3 p=2 is Mumford-Shah model; p=1, c3=0 is Osher-Rudin TV model J J div c J

t I J p p=2 is Perona-Malik equation; c=0, p=1 is TV-gradient descent An example: Bela Bartok enhanced via the Nitzberg-Shiota filter Continuous models II: images and scaling

The statistics of images of natural scenes appear to be a fixed point under blockaveraging renormalization, i.e. Assume NN images of natural scenes have a certain probability distribution; form N/2N/2 images by a window or by 22 averages get the same marginal distribution! Scale invariance has many implications: Power law for the spectrum: 2 Exp I( , ) c 2 2

In the continuous limit, images are not locally integrable functions but generalized functions in: H Intuitively, this is what we call clutter the mathematical explanation of why vision is hard Three axioms for natural images 1. Scale invariance 2. For all zero-mean filters F, the scalar random

variables F I (i, j ) have kurtosis > 3 (D.Field, J.Huang). 3. Local image patches are dominated by preferred geometries: edges, bars, blobs as well as blue sky blank patches (D.Marr, B.Julesz, A.Lee). It is not known if these axioms can be exactly satisfied! Empirical data on image filter responses Probability distributions of 1 and 2 filters, estimated from natural image data. a) Top plot is for values of horizontal first difference of pixel values; middle plot is for random 0-mean 8x8 filters.

Vertical axis in top 2 plots is log(G)-prob.density). b) Bottom plot shows level curves of Joint prob.density of vert.differences at two horizontally adjacent pixels. All are highly non-Gaussian! Mathematical models for random images I ( x, y ) k (e rk x xk , e rk y yk ), k ( xk , yk , rk ) a Poisson process in 3 , k independent samples from prob.space of normalized wavelets.

(Random wavelet model) I ( x, y ) k ( x , y ) (e rk ( x , y ) x xk ( x , y ) , e rk ( x , y ) y yk ( x , y ) ),

k ( x, y ) min k (e rk x xk , e rk y yk ) supp( k ) (Dead leaves model) Continuous models III: random diffeomorphisms The patterns of the world include shapes, structures which recur with distortions: e.g. alphanumeric characters, faces, anatomy Thus the hidden variables must include (i) clusters of similar shapes, (ii) warpings between shapes in a cluster Mathematically: need a metric on (i) the space of diffeomorphisms Gk of k, or (ii) the space of shapes Sk in k (open subsets with smooth bdry) Can use diffusion to define a probability measure on Gk .

Metrics on Gk, I V.Arnold: Introduce a Riemannian metric in SGk, the group of volume preserving diffeos. For any path {t}, let length of path k t 1

2 vt ( x ) dx dt , vt ( x) t ( x) t Then he proved geodesics are solutions of Eulers equation: vt (vt .)vt p t Metrics on Gk, II

Christensen, Rabbitt, Miller on Gk, use stronger metric: m Lv , v dx , L ( I )

k t t vt = velocity, ut = Lvt = momentum in this metric Geodesics now are solutions to a regularized compressible form of Eulers equation: ut (vt .)ut div(vt )ut (ut )i ((vt )i ) t i Note: linear in u, so u can be a generalized function! Geodesics in the quotient space S2

Get geodesics {Ht} on Sk by taking singular momentum: ut ( x ) a (t , x )nx , Ht Ht ( x ) S2 has remarkable structure: (A geodesic, F.Beg) 1. Weak Hilbert manifold 2. Medial axis gives it a cell decomposition 3. Geometric heat equation defines a deformation retraction

4. Diffusion defines probability measure (Dupuis-Grenander-Miller, Yip) Geodesics in the quotient space of landmark points gives a classical mechanical system (Younes) Consider geodesics whose momentum has finite support: N ut ( x ) ui (t ) Pi (t ) ( x ) i 1 This gives the ODE in which particles traveling in

the same (resp. opposite) direction attract (resp. repel): dPi 2 K Pi Pj u j dt j dui Pi K Pi Pj .(ui .u j ) dt j

K the Greens function of L, a Bessel function. Outlook for Pattern Theory Finding a rich class of stochastic models adequate for duplicating human perception yet tractable (vision remains a major challenge)

Finding algorithms fast enough to make inferences with these models (Monte Carlo? BBP ? competing hypothesis particles?) Underpinnings for a better biological theory of neural functioning e.g. incorporating particle filtering? grammar? warping? feedback? URL: www.dam.brown.edu/people/mumford/Papers /ICM02powerpoint.pdf or /ICM02proceedings.pdf A sample of Graunts data