Sequence evolution and homology identification Xuhua Xia [email protected] http://dambe.bio.uottawa.ca Nucleotide substitution - usually too slow to monitor directly spontaneous mutation rates? p. 35-37 for mammalian nuclear DNA (regions not under functional constraint) ~ 4 x 10 -9 nt sub per site per year ... much higher for viruses eg. 10 -6 to 10 -3 nt sub per site per generation so use comparative analysis of 2 sequences which share a common ancestor - determine number and nature of nt substitutions that have occurred (ie measure degree of divergence) Xuhua Xia Slide 2 Potential pitfalls 1. Are all evolutionary changes being monitored? - if closely-related, high probability only one change at any given site but if distant, may have been multiple substitutions (hits) at a site - can use algorithms to correct for this 2. If indels between two sequences, can they be aligned

with confidence? - algorithms with gap penalties Xuhua Xia Slide 3 Ancestral sequence 13 substitutions occurred but only 3 discovered by sequence comparison Present day sequences Xuhua Xia Fig. 3.6 Slide 4 Multiple hits Homoplasy: same nt, but not directly inherited from ancestral sequence (If comparing long stretches, highly unlikely they would have converged to the same sequence) Xuhua Xia Page & Holmes Fig. 5.9 Slide 5

Uncertainty in codon substitution Pathway I: CCC(Pro)CCA(Pro) CAA(Gln) Pathway II: CCC(Pro)CAC(His) CAA(Gln) Is one pathway more likely than another? p.82 Xuhua Xia Slide 6 The purpose of sequence alignment Identification of sequence homology and homologous sites Homology: similarity that is the result of inheritance from a common ancestor (identification and analysis of homologies is central to phylogenetics). An Alignment is an hypothesis of positional homology between bases/Amino Acids. Xuhua Xia Slide 7 Normal and Thalassemia HBb

10 20 30 40 50 60 ----|----|----|----|----|----|----|----|----|----|----|----|-Normal AUGGUGCACCUGACUCCUGAGGAGAAGUCUGCCGUUACUGCCCUGUGGGGCAAGGUGAACGU Thalass. AUGGUGCACCUGACUCCUGAGGAGAAGUCUGCCGUUACUGCCCUGUGGGGCAAGGUGAACGU ************************************************************** 70 80 90 100 110 120 --|----|----|----|----|----|----|----|----|----|----|----|---Normal

GGAUGAAGUUGGUGGU-GAGGCCCUGGGCAGGUUGGUAUCAAGGUUACAAGACAGG...... Thalass. GGAUGAAGUUGGUGGUUGAGGCCCUGGGCAGGUUGGUAUCAAGGUUACAAGACAGG...... **************** *************************************** Are the two genes homologous? What evolutionary change can you infer from the alignment? What is the consequence of the evolutionary change? Xuhua Xia Slide 8 Janeka, JE et al. 2007 Science 318:792 Xuhua Xia Aligned Sequences 10 20 30 40 50 60 ----|----|----|----|----|----|----|----|----|----|----|----|-mouse MMASYPEPEDTAGTLLAPESGRAVKEAEA-SPPSPG------KGGGTTPEKPDPAQKPPYSY human MMASYPEPEDAAGALLAPETGRTVKEPEG-PPPSPGK--GGGGGGGTAPEKPDPAQKPPYSY rabbit MMASYPEPEEAAGALLAPESGRAAKEPEA-PPPSPGKGGGGGGGGGSAAEKPDPAQKPPYSY

Fugu MMATYQNPEDDAMALMVHDTNTTKEKERPKEEPVQ----------DKVEEKPDPSQKPPYSY Tetraodon MMATYQNPEDDAMALMIHDTNTTKEKERPKEEPVQ----------DKVEEKPDPSQKPPYSY zebrafish MMATYPGHEDNGMILMD-TTSSSAEKDRTKDEAPP----------EKGPDKSDPTQKPPYSY * ** ******* 70 80 90 100 110 120 --|----|----|----|----|----|----|----|----|----|----|----|---mouse VALIAMAIRESAEKRLTLSGIYQYIIAKFPFYEKNKKGWQNSIRHNLSLNECFIKVPREGGG human VALIAMAIRESAEKRLTLSGIYQYIIAKFPFYEKNKKGWQNSIRHNLSLNECFIKVPREGGG rabbit VALIAMAIRESAEKRLTLSGIYQYIIAKFPFYEKNKKGWQNSIRHNLSLNECFIKVPREGGG Fugu VALIAMAIRESSEKRLTLSGIYQYIISKFPFYEKNKKGWQNSIRHNLSLNECFIKVPREGGG Tetraodon VALIAMAIRESSEKRLTLSGIYQYIISKFPFYEKNKKGWQNSIRHNLSLNECFIKVPREGGG zebrafish VALIAMAIRESSEKRLTLSGIYQYIISKFPFYEKNKKGWQNSIRHNLSLNECFIKVPREGGG *********** ************** *********************************** 130 140 150 160 170 180 |----|----|----|----|----|----|----|----|----|----|----|----|mouse ERKGNYWTLDPACEDMFEKGNYRRRRRMKRPFRPPPAHFQPGKGLFGSGGAAGGCGVPGAGA human

ERKGNYWTLDPACEDMFEKGNYRRRRRMKRPFRPPPAHFQPGKGLFGAGGAAGGCGVAGAGA rabbit ERKGNYWTLDPACEDMFEKGNYRRRRRMKRPFRPPPAHFQPGKGLFGAAGAAGACGVAGAGA Fugu ERKGNYWTLDPACEDMFEKGNYRRRRRMKRPFRPPPTHFQPGKSLFG--G-----------Tetraodon ERKGNYWTLDPACEDMFEKGNYRRRRRMKRPFRPPPTHFQPGKSLFG--G-----------zebrafish ERKGNYWTLDPACEDMFEKGNYRRRRRMKRPFRPPPTHFQPGKSLFG--G-----------************************************ ****** *** * Xuhua Xia Slide 10 What is an optimal alignment? Alignment1: FavoriteFavourite Alignment2: ---Favorite Favourite- Alignment3: Favorite----------Favourite Alignment4: Favo-rite Favourite An optimal alignment One with maximum number of matches and minimum number of mismatches and gaps Operational definition: one with highest alignment score given a particular scoring scheme (e.g., match: 2, mismatch: -1, gap: -2) Which of the 4 alignments above is the optimal alignment? Changing the scoring scheme may change the optimal alignment Xuhua Xia Slide 11 Importance of scoring schemes Two alternative alignments:

Alignment 1: ACCCAGGGCTTA ACCCGGGCTTAG Alignment 2: ACCCAGGGCTTAACCC-GGGCTTAG Scoring scheme 1 Scoring scheme 2 Alignment 1 2 Match 1 1 Mismatch -1 -2 Gap -5 -3 NMatch 7 11 NMismatch 5 0 NGap 0

2 Score 1 2 1 Score 2 -3 5 Alignment 1 better than Alignment 2 with scoring scheme 1 Alignment 2 better than Alignment 1 with scoring scheme 2 Importance of biological input Xuhua Xia Slide 12 Table 1-2. PAM30 matrix (extrapolated from PAM1) of which a detailed derivation is presented in the chapter on sequence alignment. A R N D C Q E G H I L K

M F P S T W Y V 6 -7 8 10 20 30 40 50 60 -4 -6 8 ----|----|----|----|----|----|----|----|----|----|----|----|--3 -10 2 8 S1 RWFFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGALLGDDQIYNVIVTAHAFVMI -6 -8 -11 -14 10 S2 RWLFSTNHKDIGTLYLLFGAWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTAHAFVMI -4 -2 -3 -2 -14 8 -2 -9 -2 2 -14 1 8 -2 -9 -3 -3 -9 -7 -4 6 -7 -2 0 -4 -7

1 -5 -9 9 -5 -5 -5 -7 -6 -8 -5 -11 -9 8 -6 -8 -7 -12 -15 -5 -9 -10 -6 -1 7 -7 0 -1 -4 -14 -3 -4 -7 -6 -6 -8 7 -5 -4 -9 -11 -13 -4 -7 -8 -10 -1 1 -2 11 -8 -9 -9 -15 -13 -13 -14 -9 -6 -2 -3 -14 -4 9 -2 -4 -6 -8 -8 -3 -5 -6 -4 -8 -7 -6 -8 -10 8 0 -3 0 -4 -3 -5 -4 -2 -6 -7 -8 -4 -5 -6 -2 6 -1 -6 -2 -5 -8 -5 -6 -6 -7 -2 -7 -3 -4 -9 -4 0 7 -13 -2 -8 -15 -15 -13 -17 -15 -7 -14 -6 -12 -13 -4 -14 -5 -13 13 -8 -10 -4 -11 -4 -12 -8 -14 -3 -6 -7 -9 -11 2 -13 -7 -6 -5 10 -2 -8 -8 -8 -6 -7 -6 -5 -6 2 -2 -9 -1 -8 -6 -6 -3 -15 -7 7 A R N D C Q E G H I L K M F P S T W Y V From Xia, X. 2018 Bioinformatics and the cell. Springer Amino acid substitution matrices 10

20 30 40 50 60 ----|----|----|----|----|----|----|----|----|----|----|----|-S1 RWFFSTNHKDIGTLYLVFGAWAGMVGTALSLLIRAELSQPGALLGDDQIYNVIVTAHAFVMI S2 RWLFSTNHKDIGTLYLLFGAWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTAHAFVMI BLOSUM = BLOcks Substitution Matrix a substitution matrix used for sequence alignment of proteins (to score alignments between evolutionarily divergent protein sequences). Xuhua Xia Slide 14 Dynamic Programming A C G T

0 -2 -4 -6 -8 A -2 2 0 -2 -4 C -4 0 4 2 0 G -6

-2 2 6 4 G -8 -4 0 4 5 C -10 -6 -2 2 3 T -12

-8 -4 0 4 Constant gap penalty: Scoring scheme: Match (M): 2 Mismatch (MM): -1 Gap (G): -2 For each cell, compute three values: Upleft value + IF(Match, M, MM) Left value + G Up value + G (a) (b) 654321 ACG--T ACGGCT 654321 AC-G-T ACGGCT Xuhua Xia

Slide 15 Alignment with secondary structure Sequences: Seq1: CACGACCAATCTCGTG Seq2: CACGGCCAATCCGTG Seq2: CACGA ||||| GUGCU Seq2: CACGG ||||| GUGCC CCAAUCCCAAU CCAAU Seq1: CACGA ||||| GUGCU Deletion Missing link Correlated substitution Conventional alignment: Seq1: CACGACCAATCTCGTG Seq2: CACGGCCAATC-CGTG Correct alignment: Seq1: CACGACCAATCTCGTG Seq2: CACGGCCAAT-CCGTG

Xuhua Xia Hickson et al., 2000; Kjer, 1995; Notredame et al., 1997 Slide 16 Multiple Alignment: Guide Tree S1 S2 S3 ... S2 6 S3 3 2 S4 2 3 7 ... Matrix of pairwise alignment scores Seq16 Seq15 Seq14 Seq13 Seq12 Seq11 Seq10 Seq9 Seq8 Seq7 Seq6 Seq5 Seq4 Seq3 Seq2 Seq1

ATTCCAAG... ATTCCAAG... ATTTCCAAG... ATTCCCAAG... ATCGGAAG... ATCCGAAG... ATCCAAAG... ATCCAAAG... AATTCCAAG... AATTCCAAG... AATTTCCAAG... AATTCCCAAG... AAGTCCAAG... AAGTCCAAG... AAGTCCAAG... AAGTCAAG... ATT-CC-AAG... ATT-CC-AAG... ATTTCC-AAG... ATTCCC-AAG... AT--CGGAAG... AT-CCG-AAG... AT-CCA-AAG... AT-CCA-AAG... AATTCC-AAG... AATTCC-AAG... AATTTCCAAG... AATTCCCAAG... AAGTCC-AAG...

AAGTCC-AAG... AAGTCC-AAG... AAGTC--AAG... CLUSTAL CLUSTAL W (1.81) Multiple Sequence Alignments Sequence 1: ArabidopsisAAG52143 798 aa Sequence 2: ArabidopsisAAC26676 845 aa Sequence 3: yeast 664 aa ArabAAG52143 ArabAAC26676 yeast FIVDEADLLLDLGFRRDVEKIIDCLPRQR-------QSLLFSATIPKEVRRVS-QLVLKR 539 FIVDEADLLLDLGFKRDVEKIIDCLPRQR-------QSLLFSATIPKEVRRVS-QLVLKR 586 -VLDEADRLLEIGFRDDLETISGILNEKNSKSADNIKTLLFSATLDDKVQKLANNIMNKK 323 ::**** **::**: *:*.* . * .:. ::******: .:*:::: ::: *: Symbols: '*' identical, ':' not identical but similar, '.' less similar : (if Gonnet PAM 250) with score > 0.5; (If BLOSUM) 0 . (if Gonnet PAM 250) with score > 0 but 0.5; (If BLOSUM) -1 No strict rules. Xuhua Xia Slide 18

BLAST math: Distributions Toss a fair coin once. What is the probability of having a head (H)? Toss a fair coin twice. What is the probability of having HH, HT or TT (T for tail)? (0.5+0.5)2 Toss a fair coin 10 times and record the outcome, e.g., HTHHTHHTTT. What is the expected number of HH occurring in the string? E = n p = (10 2 +1)0.52 What is the probability of 0, 1, 2, ..., 9 matches? n! n! n 1 ( p q) p p q ... p n x q x ... q n ( n 1)!1! ( n x)! x ! n n Biased coin: p = 0.25, q = 0.75 Xuhua Xia Slide 19 Basic stats in string matching Given PA, PC, PG, PT in a target (database) sequence, the probability of a query sequence, say, ATTGCC, having a

perfect match of the target sequence is: prob = PA (PC)2 PG (PT)2 Let M be the target sequence length and N be the query sequence length, the matching operation can be performed (M N +1) times, e.g., Query: ATG Target CGATTGCCCG The probability distribution of the number of matches follows a binomial distribution with p = prob and n = (M N +1) Xuhua Xia Slide 20 Basic stats in string matching Computation involving a binomial distribution requires taking the factorial of n. When n is large (e.g., > 1000), this becomes impractical or impossible for todays computers. Approximation is therefore necessary. When np > 50, the binomial distribution can be approximated by the normal distribution with the mean = np and variance = npq 1 P ( x) e 2 ( x )2 2 2 When np < 1 and n is very large (which implies that p is very

small given np < 1), binomial distribution can be approximated by the Poisson distribution with mean and variance equal to np (i.e., = 2 = np). e x P( x) x! Xuhua Xia Slide 21 From Binomial to Poisson ( p q) n p n n! n! n! p n 1q ... p n x q x ... p x q n x ... q n (n 1)!1! (n x)! x ! (n x)! x ! P (n x) n! p x q n x (n x)! x ! n!

p xq xqn ( n x)! x ! P (0) q n n(n 1)(n 2)...(n x 1) p n (1 p ) x! q x n x x np n x p x np (np ) x np pe e e e x! x! x! x! P (n) p n P (n 1) np n 1q n! p n x q x (n x)! x !

n! P( x) p x q n x (n x)! x ! Xuhua Xia P( x) x Slide 22 Matching two sequences without gap Assuming equal nucleotide frequencies, the probability of a nucleotide site in the query sequence matching a site in the target sequence is p = 0.25. The probability of finding an exact match of L letters is a = p L = 0.25L = 2-2L = 2-S, where S is called the bit score in BLAST. M: query length; N: target length The query can start at (M L +1) positions and the target can start at (N L +1) positions. These two values are called effective lengths of the two sequences and designated m and n, respectively. They are shorter than M and N, respectively. The expected number of matches with length L is mn2-S, which is called e-value in ungapped BLAST. S is calculated differently in the gapped BLAST Xuhua Xia Slide 23

Expected number of matches: No gap 123456789012345678901234 T: GGTTACACGAGTGCTG |||||||||||||||| Q:CTGAGGTTACACGAGTGCTGCTGA M = 24, L = 16 m = M L + 1 = 24-16+1 = 9 E = m2-S = 92-216 =2.09548E-09 1234567890123456789012345 T:AACCGGTTACACGAGTGCTGAATGC |||||||||||||||| Q:CTGAGGTTACACGAGTGCTGCTGA M = 25, N = 24, L = 16 m = M L + 1 = 25-16+1 = 10 n = N L + 1 = 24-16+1 = 9 E = mn2-S = 1092-216 =2.09548E-08 Xuhua Xia Slide 24 Gapped BLAST Adapted from Crane & Raymer 2003 Input sequence: AILVPTVIGCTVPT Algorithm: Break the query sequence into words: AILV, ILVP, LVPT, VPTV, PTVI, TVIG, VIGC, IGCT, GCTV, CTVP, TVPT Discard common words (i.e., words made entirely of common amino acids) or sequence of low complexity

Search for matches against database sequences, assess significance and decide whether to discard to continue with extension using dynamic programming: AILVPTVIGCTVPT MVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC Xuhua Xia Slide 25 Blast Output (Nuc. Seq.) BLASTN 2.2.4 [Aug-26-2002] ... Query= Seq1 38 Database: MgCDS 480 sequences; 526,317 total letters Sequences producing significant alignments: MG001 1095 bases Score = 34.2 bits (17), Expect = 7e-004 Identities = 35/40 (87%), Gaps = 2/40 (5%) Query: 1 Sbjct: 1 Equivalent Eqs. Score E (bits) Value 34 7e-004

atgaataacg--attatttccaacgacaaaacaaaaccac 38 |||||||||| ||||||||||| |||||| |||||||| atgaataacgttattatttccaataacaaaataaaaccac 40 Lambda K H 1.37 0.711 1.31 Matrix: blastn matrix:1 -3 Gap Penalties: Existence: 5, Extension: 2 effective length of query: 26 effective length of database: 520,557 Xuhua Xia e E ( E ) x p ( x) x! Matches: 35*1 = 35 Mismatches: 3*(-3) = -9 Gap Open: 1*5 = 5 Gap extension: 2*2 =4 R = 35 - 9 - 5 - 4 = 17 S = [R ln(K)]/ln(2) =[1.37*17-ln(0.711)]/ln(2) = 34R ln(K)]/ln(2) =[1.37*17-ln(0.711)]/ln(2) = 34 E = mn2-S = 26 * 520557 * 2-34 = 7.878E-04 x p(x)

0 0.999265217 1 0.000734513 2 0.000000270 3 0.000000000 Slide 26 Effective length in gapped BLAST We know that the effective length without gap is n=(N-L+1) and m=(M-L+1), where N and M are the query and target sequence lengths, respectively. Because N and M are typically quite large, we may have n N-L and m M-L. If we have 10 target sequences of lengths M1, M2, ..., M10, then . Alternatively, . If we know L, then n and m are easy to compute. With gapped BLAST, we do not know L. There are 1) a quick and dirty way, and 2) a BLAST (and empirically better) way to compute L. The simple and dirty way is to note that both the ungapped and gapped BLAST uses . In the ungapped BLAST, S = 2L, so we may just take L S/2. In the BLAST example, S = 34, so L 17. However, this makes m and n too small, leading to an underestimate of E based on empirical simulation. The BLAST way: Next slide. Xuhua Xia Slide 27 BLAST way of computing n,m In gapped BLAST with C target sequences, L is defined as, or more explicitly,

( ) ( ) 2 2 =1 This L is known as effective HSP (High-scoring Segment Pair). In the BLAST output, we get N = 38, C = 480, Mi = 526,317, so ( 526,317 480 ) ( 38 ) 2 2 =1 You may try different values for L, e.g., 11, 12, 13, etc., and check which value will make the left-hand-side closest to 1. You will find 12 to be the best. So n = 38 12 = 26 m = 526317 480*12 = 520557 Xuhua Xia Slide 28 Why is sequence complexity important when judging whether two sequences are homologous? or whether their similarity is by chance AAGAGGAG Pu-rich region #2 (which is not homologous to #1) Human DNA Chimp DNA Pu-rich region #1

Region of unbiased base composition where G=C=A=T AAGAGGAG How frequently is such an 8-nt seq (AAGAGGAG) expected to occur by chance in a DNA sequence? If within unbiased region? 1 / 48 ... once every 65.5 kb on average If within Pu-rich region? 1 / 28 ... once every 256 bp on average If both sequences are purine rich, then high % identity (or a small evalue) may not reflect shared evolutionary origin E-Value in BLAST The e-value is the expected number of random matches that is equally good or better than the reported match. It can be a number near zero or much larger than 1. It is NOT the probability of finding the reported match. Only when the e-value is extremely small can it be interpreted as the probability of finding 1 match that is as good as the reported one (see next slide).

Xuhua Xia Slide 30 E-value and P(1) e E ( E ) x p ( x) x! p (1) e E E E ( when E 0) 1 0.9 0.8 P(1) 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0.00 0.20 0.40 0.60

0.80 1.00 E-value Xuhua Xia Slide 31 Problem with BLAST Q: GGC GCG CCC AAG CUG UGC ...... T: GGT GCA CCT AAA CUA UGT ...... Alternatives: FASTA AA sequence Xuhua Xia Slide 32