A relative-error CUR Decomposition for Matrices and its Data Applications Michael W. Mahoney Yahoo! Research http://www.cs.yale.edu/homes/mmahoney (Joint work with P. Drineas, S. Muthukrishnan, and others.) Data sets can be modeled as matrices Matrices arise, e.g., since n objects (documents, genomes, images, web pages), each with m features, may be represented by an m x n matrix A.

DNA genomic microarray and SNP data Image analysis and recognition Time/frequency-resolved image analysis Term-document analysis Recommendation system analysis Many, many other applications! In many cases: analysis methods identify and extract low-rank or linear structure. we know what the rows/columns mean from the application area. rows/columns are very sparse. SVD of a matrix Any m x n matrix A can be decomposed as: U (V): orthogonal matrix containing the left (right) singular vectors of A. : diagonal matrix containing the singular values of A, ordered nonincreasingly. rank of A, the number of non-zero singular values. Exact computation of the SVD takes O(min{mn2 , m2n}) time.

SVD and low-rank approximations Truncate the SVD by keeping k terms: Uk (Vk): orthogonal matrix containing the top k left (right) singular vectors of A. k: diagonal matrix containing the top k singular values of A. Ak is the best matrix among all rank-k matrices. This optimality property of very useful in, e.g., Principal Component Analysis (PCA). The rows of Uk (= UA,k) are NOT orthogonal and are NOT unit length. The lengths/norms of the rows of Uk capture a notion of information dispersal. Problems with SVD/Eigen-Analysis Problems arise since structure in the data is not respected by mathematical operations on the data:

Reification - maximum variance directions are just that. Interpretability - what does a linear combination of 6000 genes mean. Sparsity - is destroyed by orthogonalization. Non-negativity - is a convex and not linear algebraic notion. The SVD gives two bases to diagonalize the matrix. Truncating gives a low-rank matrix approximation with a very particular structure. Think: rotation-with-truncation; rescaling; rotation-back-up. Question: Do there exist better low-rank matrix approximations. better structural properties for certain applications. better at respecting relevant structure. better for interpretability and informing intuition. CUR and CX matrix decompositions Recall: Matrices are about their rows and columns. Def: A CX matrix decomposition is a low-rank approximation explicitly expressed in terms of a small number of columns of the original matrix A. Def: A CUR matrix decomposition is a low-rank approximation explicitly expressed in terms of a small number of rows and columns of the original

matrix A. Carefully chosen U O(1) rows O(1) columns Q1: Do low-rank CUR decompositions even exist? Q2: If so, can we compute a CUR decomposition efficiently? Q3: Even if we can do both of these, who cares? CUR to extract structure from data Exploit structural properties of CUR to analyze, e.g., genomic data: m genes n arrays (The data sets are not very large: poly(m,n) time algorithms are

acceptable.) In Biological/Chemical/Medical/Etc applications of PCA and SVD: Explain the singular vectors by mapping them to meaningful biological processes. This is a challenging task - think reification in statistics! What CUR does: Expresses the data in terms of actual genes and actual arrays, not eigen-genes and eigenarrays. CUR is a low-rank decomposition in terms of the data, and thus it is much more understandable. Problem formulation (1 of 3) Never mind columns and rows - just deal with columns (for now) of the matrix A. Could ask to find the best k of n columns of A. Combinatorial problem - trivial algorithm takes nk time. Probably NP-hard if k is not fixed. Lets ask a different question. Fix a rank parameter k. Lets over-sample columns by a little (e.g., k+3, 10k, k2, etc.). Try to get close (additive error or relative error) to the best rank-k approximation..

Problem formulation (2 of 3) Ques: Do there exist O(k), or O(k2), or , columns s.t.: ||A-CC+A||2,F < ||A-Ak||2,F + ||A||F Ans: Yes - and can find them in O(m+n) space and time after two passes over the data! (DFKVV99,DKM04) Ques: Do there exist O(k), or O(k2), or , columns such that ||A-CC+A||2,F < (1+)-1||A-Ak||2,F + t||A||F Ans: Yes - and can find them in O(m+n) space and time after t passes over the data! (RVW05,DM05) Ques: Do there exist O(k), or O(k2), or , columns such that ||A-CC+A||F < (1+)||A-Ak||F Ans: Yes - existential proof - no non-exhaustive algorithm given! (RVW05,DRVW06) Ans: Yes - lots of them, and can find them in O(SVD) space and time! (DMM06) Problem formulation (3 of 3) Ques: Do there exist O(k), or O(k2), or , columns and rows such that ||A-CUR||2,F < ||A-Ak||2,F + ||A||F Ans: Yes - lots of them, and can find them in O(m+n) space and time after two passes over the data! (DK03,DKM04)

Note: lots of them since these are randomized Monte Carlo algorithms! Ques: Do there exist O(k), or O(k2), or , columns and rows such that ||A-CUR||F < (1+)||A-Ak||F Ans: Regression problems We seek sampling-based algorithms for solving l2 regression. We are interested in overconstrained problems, n >> d. Typically, there is no x such that Ax = b. There is work by K. Clarkson in SODA 2005 on sampling-based algorithms for l1 regression ( = 1 ) for overconstrained problems.

Induced regression problems sampled rows of A sampled rows of b scaling to account for undersamplin g Exact solution Projection of b on the subspace spanned by the columns of A Pseudoinverse of A

Computing the SVD takes O(d2n) time. The pseudoinverse of A is Questions Can sampling methods provide accurate estimates for l2 regression? Is it possible to approximate the optimal vector and the optimal value value Z2 by only looking at a small sample from the input? (Even if it takes some sophisticated oracles to actually perform the sampling ) Equivalently, is there an induced subproblem of the full regression problem, whose optimal solution and its value Z2,s approximates the optimal solution and its value Z2? Creating an induced subproblem Algorithm Fix a set of probabilities pi, i=1

n, summing up to 1. Pick r indices from {1n} in r i.i.d. trials, with respect to the pis. For each sampled index j, keep the j-th row of A and the j-th element of b; rescale both by (1/ rpj)1/2. Our results If the pi satisfy certain conditions, then with probability at least 1-, A): condition number of A

The sampling complexity is Back to induced subproblems sampled rows of A, rescaled sampled elements of b, rescaled The relevant information for l2 regression if n >> d is contained in an induced subproblem of size O(d2)-by-d. (New improvement: we can reduce the sampling complexity to r = O(d).) Conditions on the probabilities, SVD U (V): orthogonal matrix containing the left (right) singular vectors of A.

: diagonal matrix containing the singular values of A. rank of A. Let U(i) denote the i-th row of U. Let U? 2 Rn (n -) denote the orthogonal complement of U. Conditions on the probabilities, interpretation What do the lengths of the rows of the n x d matrix U = U A mean? Consider possible n x d matrices U of d left singular vectors: In|k = k columns from the identity row lengths = 0 or 1 In|k x -> x Hn|k = k columns from the n x n Hadamard (real Fourier) matrix row lengths all equal Hn|k x -> maximally dispersed Uk = k columns from any orthogonal matrix row lengths between 0 and 1 Lengths of the rows of U = UA correspond to a notion of information

dispersal. Conditions for the probabilities The conditions that the pi must satisfy, for some 1, 2, 3 2 (0,1]: lengths of rows of matrix of left singular vectors of A Component of b not in the span of the columns of A The sampling complexity is: Small i ) more sampling

Computing good probabilities In O(nd2) = O(SVD(A)) = O(SVD(Ad)) time we can easily compute pis that satisfy all three conditions, with 1 = 2 = 3 = 1/3. (Too expensive in practice for this problem!) (But, NOT too expensive when applied to CX and CUR matrix problems!!) Open question: can we compute good probabilities faster, in a pass efficient manner? Some assumptions might be acceptable (e.g., bounded condition number of A, etc.) Critical observation sample & rescale sample & rescale Critical observation, contd

sample & rescale only U sample & rescale Critical observation, contd Important observation: Us is almost orthogonal, and we can bound the spectral and the Frobenius norm of UsT Us I. (FKV98, DK01, DKM04, RV04) Application: CUR-type decompositions Carefully chosen U Create an

approximation to A, using rows and columns of A O(1) columns O(1) rows Goal: provide (good) bounds for some norm of the error matrix A CUR 1. How do we draw the rows and columns of A to include in C and R? 2. How do we construct U? L2 Regression and CUR Approximation Extended L2 Regression Algorithm:

Input: m x n matrix A, m x p matrix B, and a rank parameter k. Output: n x p matrix X approximately solving min X ||AkX - B ||F. Algrthm: Randomly sample r=O(d2) or r=O(d) rows from Ak and B Solve the induced sub-problem. Xopt = Ak+B (SAk)+SB Cmplxty: O(SVD(Ak)) time and space Corollary 1: Approximately solve: minX ||ATkX - AT||F to get columns C such that: ||A-CC+A||F (1+)||A-Ak||F. Corollary 2: Approximately solve: minX ||CX - A||F to get rows R such that: ||A-CUR||F (1+) ||A-CC+A||F.

Theorem: (relative error) CUR Fix any k, , . Then, there exists a Monte Carlo algorithm that uses O(SVD(Ak)) time to find C and R and construct U s.t.: holds with probability at least 1-, by picking c = O( k2 log(1/) / 2 ) columns, and r = O( k4 log2(1/) / 6 ) rows. (Current theory work: we can improve the sampling complexity to c,r=O(k poly(1/, 1/)).) (Current empirical work: we can usually choose c,r k+4.) (Dont worry about : choose =1 if you want!) Subsequent relative-error algorithms November 2005: Drineas, Mahoney, and Muthukrishnan The first relative-error low-rank matrix approximation algorithm. O(SVD(Ak)) time and O(k2) columns for both CX and CUR decompositions.

January 2006: Har-Peled Used -nets and VC-dimension arguments on optimal k-flats. O(mn k2 log(k)) - linear in mn time to get 1+ approximation. March 2006: Despande and Vempala Used a volume sampling - adaptive sampling procedure of RVW05, DRVW06. O(Mk/) O(SVD(Ak)) time and O(k log(k)) columns for CX-like decomposition. April 2006: Drineas, Mahoney, and Muthukrishnan

Improved the DMM November 2005 result to O(k) columns. Previous CUR-type decompositions Goreinov, Tyrtyshnikov, & Zamarashkin (LAA 97, ) C: columns that span max volume U: W+ R: rows that span max volume Existential result Error bounds depend on ||W+||2 M. Berry, G.W. Stewart, & S. Pulatova (Num. Math. 99, TR 04, ) C: variant of the QR algorithm R: variant of the QR algorithm

U: minimizes ||A-CUR||F No a priori bounds A must be known to construct U. Solid experimental performance C: uniformly at random U: W+ R: uniformly at random Experimental evaluation A is assumed PSD Connections to Nystrom method C: w.r.t. column lengths U: in linear/constant time R: w.r.t. row lengths Sketching massive matrices Provable, a priori, bounds Explicit dependency on A Ak

C: depends on singular vectors of A. U: (almost) W+ R: depends on singular vectors of C (1+) approximation to A Ak Williams & Seeger (NIPS 01, ) D., M., & Kannan (SODA 03, TR 04, ) D., M., & Muthukrishnan (TR 05) Spectral norm bounds! Computable in low polynomial

time (Suffices to compute SVD(Ak)) (For details see Drineas & Mahoney, A Randomized Algorithm for a Tensor-Based Generalization of the SVD, CUR data application: DNA tagging-SNPs (data from K. Kidds lab at Yale University, joint work with Dr. Paschou at Yale University) Single Nucleotide Polymorphisms: the most common type of genetic variation in the genome across different individuals. They are known locations at the human genome where two alternate nucleotide bases (alleles) are observed (out of A, C, G, T). individual s SNPs AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA

GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA There are 10 million SNPs in the human genome, so this table could have ~10 million GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT columns. GG AA Recall the human genome Human genome 3 billion base pairs

30,000 40,000 genes The functionality of 97% of the genome is unknown. BUT: individual differences (polymorphic variation) at 1 b.p. per thousand. individual s SNPs AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG

GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA TT TT GGquite TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TTthe AG CT tracking CG CG CG AT CT CT CT AG GG TT SNPs GG occur frequently within the genome allowing

ofAGdisease GG AA genes and population histories. GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA Thus, SNPs are effective markers for genomic research. Focus at a specific locus and assay the observed alleles. C T

SNP: exactly two alternate alleles appear. Two copies of a chromosome (father, mother) An individual could be: - Heterozygotic (in our study, CT = TC) individual s SNPs AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG

GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA Focus at a specific locus and assay the observed alleles. C C SNP: exactly two alternate alleles appear. Two copies of a chromosome (father, mother) An individual could be: - Heterozygotic (in our study, CT = TC) - Homozygotic at the first allele, e.g., C

individual s SNPs AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA Focus at a specific locus and assay the observed alleles.

T T SNP: exactly two alternate alleles appear. Two copies of a chromosome (father, mother) An individual could be: - Heterozygotic (in our study, CT = TC) - Homozygotic at the first allele, e.g., C - Homozygotic at the second allele, e.g., T individual s SNPs AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG

GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA Why are SNPs important? Genetic Association Studies: Locate causative genes for common complex disorders (e.g., diabetes, heart disease, etc.) by identifying association between affection status and known SNPs. No prior knowledge about the function of the gene(s) or the etiology of the disorder is necessary.

Biology and Association Studies: The subsequent investigation of candidate genes that are in physical proximity with the associated SNPs is the first step towards understanding the etiological pathway of a disorder and designing a drug. Data Analysis and Association Studies: Susceptibility alleles (and genotypes carrying them) should be more common in the patient population. SNPs carry redundant information Key observation: non-random relationship between SNPs. Human genome is organized into block-like structure. Strong intra-block correlations.

We can focus only on tagSNPs. Among different populations (eg., European, Asian, African, etc.), different patterns of SNP allele frequencies or SNP correlations are often observed. Understanding such differences is crucial in order to develop the next generation of drugs that will be population specific (eventually genome specific) and not just disease specific. Funding Mapping the whole genome sequence of a single individual is very expensive. Mapping all the SNPs is also quite expensive, but the costs are dropping fast. HapMap project (~$100,000,000 funding from NIH and other sources): Map all 10,000,000 SNPs for 270 individuals from 4 different populations (YRI, CEU, CHB, JPT), in order to create a genetic map to be used by researchers. Also, funding from pharmaceutical companies, NSF, the Department of Justice*, etc.

* Is it possible to identify the ethnicity of a suspect from his DNA? Research directions Research questions (working within a population): (i) Are different SNPs correlated, within or across populations? (ii) Find a good set of tagging-SNPs capturing the diversity of a chromosomal region of the human genome. (iii) Find a set of individuals that capture the diversity of a chromosomal region. Why? - Understand structural properties of the human genome. - Save time/money by assaying only the tSNPs and predicting the rest.

- Save time/money by running (drug) tests only on the cell lines of the selected individuals. (iii) Is extrapolation feasible? Existing literature Pairwise metrics of SNP correlation, called LD (linkage disequilibrium) distance, based on nucleotide frequencies and co-occurrences. Almost no metrics exist for measuring correlation between more than 2 SNPs and LD is very difficult to generalize. Exhaustive and semi-exhaustive algorithms in order to pick good ht-SNPs that have small LD distance with all other SNPs. Using Linear Algebra: an SVD based algorithm was proposed by Lin & Altman, Am. J. Hum. Gen. 2004. The DNA SNP data Samples from 38 different populations. Average size 50 subjects/population. For each subject 63 SNPs were assayed, from a region in chromosome 17 called SORCS3, 900,000 bases long. We are in the process of analyzing HapMap data as well as more 3 regions assayed by Kidds lab (with Asif Javed).

N > 50 N: 25 ~ 50 Finns Kom Zyrian Yakut Khanty Irish European, Mixed Danes Chuvash Russians

African Americans Jews, Ashkenazi Adygei Druze Samaritans Pima, Arizona Cheyenne Chinese, Hakka Japanese Han Cambodians Jews, Yemenite Ibo

Pima, Mexico Maya Atayal Hausa Yoruba Chinese, Taiwan Ami Biaka Jews, Ethiopian Mbuti Ticuna

Micronesians Chagga Nasioi Surui Karitiana Africa Europe NW Siberia NE Siberia SW Asia E Asia

N America S America Oceania Encoding the data SNPs individual s 0 0 0 1 0 -1 1 1 1 0 0 0 0 0 1 0 1 -1 -1 1 -1 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 How? -1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 0 1 -1 -1 1 -1 1 -1 1 1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1 -1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 1 0 0 1 -1 -1 1 0 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 1 -1 1 1 1 0 -1 1 0 1 1 0 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

-1 -1 -1 1 -1 -1 1 1 1 -1 1 -1 -1 -1 1 0 1 -1 -1 0 -1 1 1 0 0 1 1 1 -1 -1 -1 1 0 0 0 0 0 0 0 0 0 1 -1 -1 1 -1 -1 -1 1 -1 -1 1 0 1 0 0 0 0 0 1 0 1 -1 -1 0 -1 0 1 -1 0 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 -1 -1 1 -1 -1 -1 1two -1 -1 1 nucleotides 1 1 -1 1 0 0 0 1 -1 (out 1 -1 -1 1 of 0 0 A,G,C,T) 0 1 1 1 0 1 -1appear -1 1 -1 -1 -1 -1in -1 -1each 1 -1 -1 -1column. 0 -1 -1 Exactly 1

Thus, the two alleles might be both equal to the first one (encode by +1), both equal to the second one (encode by -1), or different (encode by 0). Notes Order of the alleles is irrelevant, so TG is the same as GT. Encoding, e.g., GG to +1 and TT to -1 is not any different (for our purposes) from encoding GG to -1 and TT to +1. (Flipping the signs of the columns of a matrix does not affect our techniques.) Evaluating (linear) structure For each population We ran SVD to determine the optimal number k of eigenSNPs covering 90% of the variance. If we pick the top k left singular vectors we can express every column (i.e, SNP) of A as a linear combination of the left singular vectors loosing 10% of the data.

We ran CUR to pick a small number (e.g., k+2) of columns of A and express every column (i.e., SNP) of A as a linear combination of the picked columns, loosing 10% of the data. SNPs individual s 0 0 0 1 0 -1 1 1 1 0 0 0 0 0 1 0 1 -1 -1 1 -1 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 0 1 -1 -1 1 -1 1 -1 1 1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 1 -1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 1 0 0 1 -1 -1 1 0 0 0 0 1 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 -1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 1 -1 1 1 1 0 -1 1 0 1 1 0 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 -1 -1 1 -1 -1 1 1 1 -1 1 -1 -1 -1 1 0 1 -1 -1 0 -1 1 1 0 0 1 1 1 -1 -1 -1 1 0 0 0 0 0 0 0 0 0 1 -1 -1 1 -1 -1 -1 1 -1 -1 1 0 1 0 0 0 0 0 1 0 1 -1 -1 0 -1 0 1 -1 0 1 1 1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 -1 -1 1 -1 -1 -1 1 -1 -1 1 1 1 -1 1 0 0 0 1 -1 1 -1 -1 1 0 0 0 1 1 1 0 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 0 -1 -1 1 Predicting SNPs within a population Split the individuals in two sets: training and test. Given a small number of SNPs for all individuals (tagging-SNPs), and all

SNPs for individuals in the training set, predict the unassayed SNPs. Tagging-SNPs are selected using only the training set. SNPs Training individuals, chosen uniformly at random individuals (for a few subjects, we are given all SNPs) SNP sample (for all subjects, we are given a small number of SNPs) Predicting SNPs across populations Given all SNPs for all individuals in population X, and a small number of tagging-SNPs for population Y, predict all unassayed SNPs for all individuals of Y. Tagging-SNPs are selected using only the training set. Training set: individuals in X.

Test set: individuals in Y. A: contains all individuals in both X and Y. SNPs All individuals in population X. individuals SNP sample (for all subjects, we are given a small number of SNPs) Select tSNPs OUT: set of tSNPs Assay tSNPs in populatio nB

SNPs individuals IN: population A individuals AG ?? GG ?? ?? ?? CC ?? ?? ?? AG ?? ?? ?? AA ?? GG ?? ?? ?? CG ?? ?? ?? GG ?? ?? ?? Reconstruct SNPs AG CT GT GG CT CC CG AG AG AC AG CT AG CT IN: population A & assayed tSNPs in B GG TT TT GG TT CC GG AG AA AC AG CT GG CT

AG CC GG GT CT CT CC GG AG CC GG CC AG CT Population A AG ?? GG ?? ?? ?? CC ?? ?? ?? AA ?? ?? ?? Population B SNPs AA CT GT GG TT TT CC GG GG AA GG CT AG CC AA ?? GT ?? ?? ?? CG ?? ?? ?? AA ?? ?? ?? OUT: unassayed SNPs in B : tSNP

Transferability of tagging SNPs individuals SNPs AA TT GT TT CC CT CG AG GG CC AA CC AA TT AG CT GG TT TT CT CC GG AA AA AA CC AA TT AG CC GG GT CT CC CC AG AA AC AG CT AA CT Population B AA CC GG GT CT TT CG AA AG CC GG CT AG CC FIG. 6 Keeping both SNPs and individuals

Given a small number of SNPs for all individuals, and all SNPs for some judiciously chosen individuals, predict the values of the remaining SNPs. SNPs Basis individuals JUDICIOUCLY CHOSEN individuals (for a few subjects, we are given all SNPs) SNP sample (for all subjects, we are given a small number of SNPs) Conclusions CUR matrix decompositions. Provides a low-rank approximation in terms of the actual columns and rows of the matrix.

Relative error CUR and CX uses information from SVD of A in an essential way. Approximate least squares fitting to chosen columns/rows. (Less expensive) additive error CUR and CX uses information only from A. Data applications motivating CUR matrix decomposition theory. DNA genomic microarray and SNP data Image analysis and recognition Time/frequency-resolved image analysis Term-document analysis Recommendation system analysis Many, many other applications! Workshop on Algorithms for Modern Massive Data Sets

(http://www.stanford.edu/group/mmds/) @ Stanford University and Yahoo! Research, June 21-24, 2006 Objective: - Explore novel techniques for modeling and analyzing massive, high-dimensional, and nonlinear-structured data. - Bring together computer scientists, computational and applied mathematicians, statisticians, and practitioners to promote cross-fertilization of ideas. Organizers: G. Golub, M. W. Mahoney, L-H. Lim, and P. Drineas. Sponsors: NSF, Yahoo! Research, Ask!.