I - Summary
- Introduction to protein domains
- Domain databases
- Domain hunting

http://www.sanger.ac.uk/Software/Pfam/

Protein Domains
- From a structural perspective, protein domains are discrete units.
- There is little interaction between them.

What is a Domain?
- Domains are discrete structural units.
- They are defined by structure.
- Domain boundaries can be inferred from careful sequence analysis.
- Domains are the currency of protein function.

Domains - size
- Domains can be 25 to 500 residues long.
- Most are less than 200 residues.
- Domains can be smaller than 50 residues, but these are stabilized by disulphide bonds or chelated metals.
[Figure: two small domains of 37 and 31 residues, labelled with N and C termini, a Zn2+ ion and an S-S bond]

Example domains
- The lipoxygenase domain is a giant at 500 residues long.

Leucine Rich repeats
- A single repeat is not stable.
- Multiple repeats are stable.
- Each repeat is represented separately.
- Unlimited number.

WD40 repeats
- 7 repeats
- Beta sheet per repeat
- Limited number (6-8)

Structural domains
- Domains are most easily defined in known structures.
- Several automatic programs are available.
- They don't always (or even often) agree!

Defining domains from sequence
- Has been done successfully hundreds of times.
- Cannot always be done.
- Usually requires the domain to be mobile.

Domains and structure determination
- It is hard to get the structure of a complete protein.
- Expressing smaller segments is easier.

[Figure: LysM domain]

The domain mapping problem
[Figure: protein schematics labelled N, C and ?]

Why important:
- Functional insights
- Improved database searching
- Fold recognition
- Localisation
- Guide experiments

Parsing a protein sequence into domains
- Look for internal duplications.
- Look for transmembrane segments.
- Look for low complexity segments.

Internal Duplications
- Internal duplications with sequence similarity can be detected with a dot plot.
- Use e.g. Dotter or Dotlet.
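The dot-plot idea can be illustrated with a minimal sketch. This is not how Dotter or Dotlet are implemented: it assumes exact identity scoring in a short window, and the function names, window size and toy sequence are all hypothetical. A sequence is compared with itself, and a long run of hits on one off-diagonal suggests an internal duplication at that offset.

```python
def dot_plot(seq, window=3, min_matches=3):
    """Self-comparison: return the set of (i, j), i < j, where the windows
    starting at i and j agree in at least min_matches positions."""
    n = len(seq) - window + 1
    hits = set()
    for i in range(n):
        for j in range(i + 1, n):
            matches = sum(a == b for a, b in zip(seq[i:i + window], seq[j:j + window]))
            if matches >= min_matches:
                hits.add((i, j))
    return hits


def diagonal_runs(hits, min_length=3):
    """Group hits by diagonal (offset = j - i) and return {offset: longest run}
    for diagonals whose longest consecutive run is at least min_length."""
    by_offset = {}
    for i, j in hits:
        by_offset.setdefault(j - i, []).append(i)
    result = {}
    for offset, starts in by_offset.items():
        starts.sort()
        best = run = 1
        for a, b in zip(starts, starts[1:]):
            run = run + 1 if b == a + 1 else 1
            best = max(best, run)
        if best >= min_length:
            result[offset] = best
    return result


# Hypothetical toy sequence: a 10-residue unit repeated after a 3-residue linker.
toy = "MKTAYIAKQR" + "GGS" + "MKTAYIAKQR"
runs = diagonal_runs(dot_plot(toy))
print(runs)  # the duplication shows up as one diagonal at offset 13
```

Real tools score windows with a substitution matrix rather than identity, which is what lets them pick up diverged repeats.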

Transmembrane segments
- Hydrophobic residues, typically 15-35 long.
- Most segments are easy to predict; topology, and finding all segments in multi-spanning proteins, are much harder.
- Some programs: PHD, TMHMM, TMpred.

Low complexity
- The amino acid composition is nonrandom.
- Typically occurs in non-compact folds, e.g. coiled-coils, rods, and flexible domain linkers.
- Detect by: complexity function (SEG program); small-pitch overlapping repeats (XNU).
- Default option in BLAST.

Example: SEG output for >104K_THEPA P15711 104 KD MICRONEME-RHOPTRY ANTIGEN.
Low-complexity segments (lower case):
  flillfnilclf
  kkskkk
  ghkgpskgsdsskegkkpgsgkkpgp
  sksprtasptrrpspklpqlsklpkstspr
  spppptrpssperpe
  tkiiktskppspkppfdpsfkekf
  etlpetpgtpfttprpvppkrprtpesp
  ppkdpdspstspsefftppesk
  rlerlrltttemet
  ddegteaddeet
  rrrrppkkpsksprpskpkkpkkp
Residue ranges shown: 1-2, 3-14, 15-487, 488-493, 494-529, 589-633, 634-634, 635-658, 659-687, 688-715, 716-717, 718-739, 740-801, 802-815, 816-856, 857-868
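A rough feel for what a complexity function does can be had from a sliding-window Shannon entropy. The sketch below is a simplified stand-in and is not the SEG algorithm; the window length, the threshold and the function names are assumptions chosen only for illustration. Residues covered by any low-entropy window are written in lower case, mimicking the display convention of the example above.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def low_complexity_mask(seq, window=12, threshold=2.2):
    """Lower-case every residue covered by a window whose entropy is at or
    below the threshold. Window and threshold are illustrative, not SEG defaults."""
    seq = seq.upper()
    masked = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) <= threshold:
            for k in range(start, start + window):
                masked[k] = True
    return "".join(c.lower() if m else c for c, m in zip(seq, masked))


# Hypothetical toy input: a poly-lysine run followed by a compositionally diverse stretch.
print(low_complexity_mask("K" * 12 + "ACDEFGHILMNP"))
```

Because whole windows are masked, a few residues flanking the biased run are lower-cased as well; SEG refines its segment boundaries in a second pass, which this sketch does not attempt.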


I - Summary
- Introduction to protein domains
- Domain databases
- Domain hunting

Domain databases
Many of the common domains have already been defined in domain databases.
Advantages:
- Pre-annotated domains
- Easy interpretation of domain structure
- Sensitivity can be higher
The most used databases are:
- Pfam
- Prints
- Prosite Profiles
- Blocks
- SMART
- ProDom

Pfam
- Good coverage
- No specific bias
- Good graphical views
- Structural data in alignments
- No hierarchy

SMART
- Domain collection by Ponting and Bork.
- Specialises in:
  - Signalling domains
  - Extracellular domains
  - Nuclear domains
- Excellent quality families.
- Really nice graphics.
- Coiled-coil, TM, low-complexity

Prints
- Does not specialise in protein domains.
- Has hierarchical classification for important families such as GPCRs.
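The Prosite pattern quoted in this section, N-{P}-[ST]-{P}, uses a small syntax that maps almost directly onto regular expressions. The sketch below handles only the elements needed here (single residues, [allowed] sets, {forbidden} sets, x, and repeat counts); the function names and the test sequence are hypothetical, and it is not a complete Prosite parser.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string.
    Supports: A, [ABC], {ABC} (anything but), x (any residue), and (n) or (n,m) repeats."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            regex = "."
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"
        else:
            regex = core  # a single residue or an [ABC] set is already valid regex
        if lo:
            regex += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(regex)
    return "".join(parts)


def find_sites(sequence, pattern):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [m.start() for m in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))           # N[^P][ST][^P]
print(find_sites("AANGSAANPSA", "N-{P}-[ST]-{P}"))  # [2]
```

A four-residue pattern like this matches by chance in a large fraction of proteins, which is one way to see why short patterns give many false positives.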

Prosite
Profiles
- Sensitive
- Low coverage (good for signalling)
Patterns
- e.g. N-{P}-[ST]-{P}
- Less sensitive
- Many false positives

Automatic domain DBs
- e.g. Prodom, DOMO and Pfam-B.
- Surprisingly good quality
- No annotation (some links)
- Good way to find if a region is found in different contexts

Comparison of protein family databases: an example
- Pfam, Prosite, Prints, Blocks, Smart (ProDom, PIRaln, ProClass, Systers, Picasso etc. not shown)
- Example: ENTK_HUMAN (Enteropeptidase precursor)

Interpro
- Interpro is a database that presents Prosite, Prints, Prodom and Pfam domains.

- Annotation is a strong point.

I - Summary
- Introduction to protein domains
- Domain databases
- Domain hunting

Domain Hunting: RNAi
[Figure: domain architectures; labels include Piwi, Archaea & Bacteria, and two RNAse 3 domains]
Cerutti, Mian & Bateman. Trends Biochem Sci. 25:481-482 (2000)

Domain Hunting: CBS domains
- Discovering new domains can reveal new biology.

J. Clin. Inv. 113:274-284.

Domain Hunting in Genomes
- A scan for repeats in the Streptomyces coelicolor genome identified a repeat in PknB-like kinases.
[Figure: PknB architecture with labels Kinase and four Repeat units]

The PASTA domain
[Figure: domain architectures of PASTA-containing proteins. Labels: PknB (Kinase, PASTA repeats); PBP type I (Transglycosylase, Transpeptidase, PASTA); PBP type II (Dimerisation, Transpeptidase, PASTA); Uncharacterised; MK0796; Pro-isomerase]

A function for PASTA domains
[Figure: PBP type II architecture (Dimerisation, Transpeptidase, PASTA, PASTA), with a PASTA domain labelled]

Discovery of PASTA:
- β-lactams may act on two sites in PBPs.
- PknB may be a novel site of β-lactam action.
- PknB kinases may transduce the cross-linking status of peptidoglycan.

Conclusions
- Domains are the common currency of protein function.
- Understanding the domain structure helps to understand the biology.
- Domain databases are key labour saving tools.

Useful URLs
Pfam:      www.sanger.ac.uk/Software/Pfam/
SMART:     smart.embl-heidelberg.de/
Prints:    www.bioinf.man.ac.uk/dbbrowser/PRINTS/
Prosite:   www.expasy.ch/prosite/
Interpro:  www.ebi.ac.uk/interpro/
Prodom:    www.toulouse.inra.fr/prodom.html
Dotlet:    www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
Other:     www.google.com/

II - Summary
- Introduction to Pfam
- Protein Interactions
- Language Modelling for Domain Discovery

Pfam: 7,000 families for the molecular biologist
Alex Bateman, Ewan Birney, Richard Durbin, Sean Eddy, Ajay Khanna, Rob Finn, Sam Griffiths-Jones, William Mifsud, Mhairi Marshall, Matthew Bashton, Michael Asman, Kevin Howe, David Studholme and Erik Sonnhammer.

Annotating genomes
- Bacteria: 5 Mb
- Worm: 100 Mb
- Human: 3000 Mb

Family Pages
[Figure: screenshots of Pfam family pages]

Pfam contains Alignments

Pfam contains:
- SEED alignment of representative members (manually curated)
- Profile-HMM (HMMer-2.0)
- Search database
- FULL alignment (automatically made)

The data deluge

Profiles, HMMs and PSSMs
- Complicated names, simple idea
[Figure: alignment fragment labelled RU1A_HUMAN]
MDSQRAI
|||||||
EAAEAAV

Pfam 12.0
- Pfam-A: 7,316 curated families with annotation.

- Pfam-B: 100,000 families derived from Prodom.

Pfam Sequence Coverage
[Figure: sequence coverage (%) plotted against number of families, with series for Pfam-A, Pfam-B and Other]
- Retire sometime between Sept 2012 and May 2033!

Family Pages

Taxonomy information
- Does your favourite thermophile have a member?

II - Summary
- Introduction to Pfam
- Protein Interactions
- Language Modelling for Domain Discovery

Protein Interactions

The Pfam Webserver

Complex Complexes
- ATP synthase
- Cytochrome bc1

II - Summary
- Introduction to Pfam
- Protein Interactions
- Language Modelling for Domain Discovery

CBS domains often come in pairs
- Is there a second CBS domain in cystathionine-beta synthase?
- Can we include this extra knowledge in our searches?

How much information in context?

Overview Of A Statistical Speech Recognizer
[Figure: speech recognition pipeline; labels: phonemes, Analogue to digital converter, A = a1a2a3, Language model, D = D1D2D3, Phonetic models]

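Language HMM Captures Domain Associations
[Figure: a language model linking per-domain HMMs; labels include states B and E, B(D1), E(D1), B(D2), B(D3), E(D3), "HMM of D1", "HMM of D3", and the example domain sequence D = D1 D1 D3]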

Mathematical Formulation
We seek to find the sequence of words which maximises the sentence odds score:

$$P(D \mid A, M_D) \sim \frac{P(A \mid D)}{P(A \mid R)}\, P(D \mid M_D) \sim \prod_i \frac{P(A_i \mid D_i)}{P(A_i \mid R)}\, P(D_i)\, \frac{P(D_i \mid D_{i-1}, D_{i-2}, \ldots, M_D)}{P(D_i)}$$

Taking logs, the score is a sum over domains of two parts:

$$\sum_i \left[ \log \frac{P(A_i \mid D_i)}{P(A_i \mid R)} - \log \frac{1}{P(D_i)} \right] + \sum_i \log \frac{P(D_i \mid D_{i-1}, D_{i-2}, \ldots, M_D)}{P(D_i)}$$

- The first sum is the HMMer score minus the threshold.
- The second sum is the domain transition/context score.

where
- D = D_1 D_2 D_3 ... is the sequence of domains
- A_i is the protein sequence corresponding to domain D_i
- R is a null model of independent pointwise emission of amino acids
- M_D is the language model

Searching The Model
- Prune the search space: consider only domains with a domain score above a threshold.
- This gives domains D_1, D_2, D_3, ... ordered by endpoint.
- Use a dynamic programming algorithm to search the combined language and domain model over all domain combinations.

- Define F_i to be the score of the best sentence ending in domain D_i.
- Use the following recursion relation to calculate F_i:

$$F_i = S_i + \max_{j < i} \left( F_j + T_{j \to i} \right)$$

where S_i is the domain score and T_{j \to i} is the domain transition score.

Model Trained From Pfam Database
- Both digram and trigram models were built and tested.
- Trigram performed worse than digram!
- Finally implemented a variable order model.
- Probabilities in the language model were assigned on the basis of observed transitions in Pfam families, with pseudocounts.
- Sentence start and end states were included.
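The recursion for F_i can be sketched as a small dynamic programme over candidate domain hits. This is an illustrative digram (first-order) version only, not the variable order model described above. The `Hit` class, the transition table and all scores are hypothetical, and the rule that consecutive domains in a sentence may not overlap is an added assumption that the slides do not state.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    domain: str
    start: int
    end: int
    score: float  # S_i: domain score, e.g. HMMer score minus threshold


def best_sentence(hits, transition, start_state="<START>", end_state="<END>"):
    """Return (score, path) of the best-scoring domain sentence.
    F[i] = S_i + max over allowed predecessors j of (F[j] + T(j -> i))."""
    hits = sorted(hits, key=lambda h: h.end)  # order by endpoint
    if not hits:
        return 0.0, []
    F, back = [], []
    for i, h in enumerate(hits):
        # Option 1: the sentence starts with this domain.
        best_prev, best_j = transition(start_state, h.domain), None
        # Option 2: extend the best sentence ending in an earlier domain.
        for j in range(i):
            if hits[j].end < h.start:  # assumption: no overlapping domains
                cand = F[j] + transition(hits[j].domain, h.domain)
                if cand > best_prev:
                    best_prev, best_j = cand, j
        F.append(h.score + best_prev)
        back.append(best_j)
    # Close the sentence with a transition into the end state.
    i = max(range(len(hits)), key=lambda k: F[k] + transition(hits[k].domain, end_state))
    total = F[i] + transition(hits[i].domain, end_state)
    path = []
    while i is not None:
        path.append(hits[i])
        i = back[i]
    return total, path[::-1]


# Hypothetical log-odds transition scores; unlisted pairs score 0.
toy_transitions = {("CBS", "CBS"): 3.0, ("CBS", "Pkinase"): -2.0}


def transition(prev, cur):
    return toy_transitions.get((prev, cur), 0.0)


hits = [
    Hit("CBS", 10, 60, 5.0),       # confident hit
    Hit("CBS", 70, 120, -1.0),     # below threshold on its own
    Hit("Pkinase", 65, 125, 0.5),  # weak competing hit
]
score, path = best_sentence(hits, transition)
print(score, [h.domain for h in path])
```

In this toy case the second CBS hit scores below threshold alone, but the favourable CBS-to-CBS transition lifts the two-domain sentence above either single-domain alternative, which is the effect of using domain context.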

Examples

Examples Of Context Domains: WD40

Method Finds Short Domains

Why is there a context signal?
- Domains in specific subcellular location
- Evolutionary history
- Functional biases (domain fusion)
- Structural necessity (repeats/partial domains)
- Interacting domains (intrachain)

Conclusions
- Majority of proteins have Pfam domains.
- Pfam helps to understand protein interactions.
- Domains have an underlying grammar.
