Bio-REGNET
Retrieval of Patent Documents from Heterogeneous Sources using Ontologies and Similarity Analysis
Siddharth Taduri, Gloria T. Lau, Kincho H. Law
Engineering Informatics Lab, Stanford University
Jay P. Kesan
School of Law, University of Illinois Urbana-Champaign
International Conference on Semantic Computing, 09/21/2011

Problem Statement
Document sources: Issued Patents and Applications; Court Cases; Regulations and Laws; File Wrappers; Technical Publications.
Patent validity and enforcement questions involve the analysis of documents in various domains: worldwide patents, PTO file wrappers, scientific publications and court documents.
The information is siloed into several diverse information sources.

Problem Statement
Document sources: Issued Patents and Applications; File Wrappers; Court Cases; Regulations and Laws; Technical Publications.
Knowledge Source 1: Patent System Ontology
Knowledge Source 2: Bio Ontology (specific technical domain)
Integration
The sources are diverse in structure, formats, semantics and syntax.
How to develop and retrieve comprehensive knowledge of patents in a particular technological space?

Patent System Ontology
Established semantics allow us to reason over the classes, properties and instances to infer new facts.
Documents can be connected to form a network similar to citation networks. The links now include not just citations but other metadata such as co-inventorship, technological classification and other cross-domain relevancy metrics between documents (e.g., patents occurring in court cases).
Rules can be developed to perform additional inferences over the knowledge.

Example Query
Return all the patent documents which contain the keyword erythropoietin in the Claims and are assigned to Amgen_Inc. What technology classes do these patent documents belong to?

SPARQL Query:
    SELECT DISTINCT ?patent ?inventor
    FROM <...>
    WHERE {
      ?patent a ont:Patent .
      ?patent ont:hasAbstract ?abs .
      ?abs ont:resourceVal ?val .
      ?val bif:contains "erythropoietin" .
      ?patent ont:hasAssignee ont:Amgen_Inc .
      ?patent ont:hasInventor ?inventor
    }
    LIMIT 10

Results:
    Patent     Inventor
    5856298    Strickland_Thomas_W
    5885574    Elliott_Steven_G
    7304150    Egrie_Joan_C
    7304150    Elliott_Steven_G
    7304150    Browne_Jeffrey_K
    7304150    Sitney_Karen_C
    7217689    Elliott_Steven_G
    7217689    Byrne_Thomas_E
    6319499    Elliott_Steven_G
    5756349    Lin_Fu-Kuen
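The example query can be parameterized so that the same pattern is reused for other keywords and assignees. The following is a minimal sketch only: the function name build_patent_query is hypothetical, the prefix declarations and the FROM clause are omitted because they are not preserved in the slide text, and the property names are copied from the query as shown.

```python
def build_patent_query(keyword: str, assignee: str, limit: int = 10) -> str:
    """Return SPARQL text that mirrors the slide's query for a keyword and assignee."""
    # Reject characters that would break out of the quoted literal.
    if '"' in keyword or "\\" in keyword:
        raise ValueError("keyword must not contain quotes or backslashes")
    # The assignee is used as a local name after the ont: prefix (e.g. Amgen_Inc).
    if not assignee.replace("_", "").isalnum():
        raise ValueError("assignee must be a simple local name such as Amgen_Inc")
    return "\n".join([
        "SELECT DISTINCT ?patent ?inventor",
        "WHERE {",
        "  ?patent a ont:Patent .",
        "  ?patent ont:hasAbstract ?abs .",
        "  ?abs ont:resourceVal ?val .",
        f'  ?val bif:contains "{keyword}" .',
        f"  ?patent ont:hasAssignee ont:{assignee} .",
        "  ?patent ont:hasInventor ?inventor",
        "}",
        f"LIMIT {limit}",
    ])


query = build_patent_query("erythropoietin", "Amgen_Inc")
print(query)
```

The returned text still needs the PREFIX declarations for ont: and bif: that the target endpoint expects before it can be submitted.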

Domain (Bio) Ontologies
Bio ontologies serve as terminological standards in the domain.

Expanded Query
Original Term: Erythropoietin
Synonyms: Erythropoietin, Recombinant Erythropoietin, erythropoietin receptor binding, Hematopoietin, Recombinant EPO, Erythrocyte Colony Stimulating Factor, Epoetin, EPO
Children: Darbepoetin Alfa, Epoetin Alfa, Epoetin Beta
Parents: Colony Stimulating Factors, cytokine receptor binding, recombinant hematopoietic growth factors
Grandparents: hematopoietic growth factor, receptor binding, recombinant growth factor
An appropriate ranking function must be applied to balance the more general terms. Heuristically, we assign a higher weight to synonyms and progressively lower weights to terms farther from the concept node.
Resulting Query: original term OR [synonyms]^weight OR [children]^weight OR ...
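The shape of the resulting query can be illustrated with a short sketch. It assumes a hand-built dictionary holding a few of the expansion terms from the slide and a Lucene-style boost syntax (group^weight). The numeric weights and the names EXPANSION, WEIGHTS and expand_query are illustrative assumptions, not values from the presentation.

```python
from typing import Dict, List

# Hypothetical, hand-built subset of the expansion terms listed on the slide.
EXPANSION: Dict[str, List[str]] = {
    "synonyms": ["Recombinant Erythropoietin", "Hematopoietin", "EPO"],
    "children": ["Epoetin Alfa", "Epoetin Beta"],
    "parents": ["Colony Stimulating Factors"],
    "grandparents": ["hematopoietic growth factor"],
}

# Assumed weights: highest for synonyms, lower for terms farther from the concept.
WEIGHTS: Dict[str, float] = {
    "synonyms": 0.9,
    "children": 0.6,
    "parents": 0.5,
    "grandparents": 0.2,
}


def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(original: str,
                 expansion: Dict[str, List[str]],
                 weights: Dict[str, float]) -> str:
    """Build: original OR (synonyms)^w OR (children)^w OR ..."""
    clauses = [quote(original)]
    for relation, terms in expansion.items():
        weight = weights.get(relation, 0.0)
        if not terms or weight <= 0:
            continue  # skip empty groups and groups that are switched off
        group = " OR ".join(quote(term) for term in terms)
        clauses.append(f"({group})^{weight}")
    return " OR ".join(clauses)


expanded = expand_query("erythropoietin", EXPANSION, WEIGHTS)
print(expanded)
```

Setting a relation's weight to zero removes that group from the query, which is one simple way to keep very general ancestor terms from dominating the results.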

Use Case: Erythropoietin
Current corpus: an experimental platform to test the overall effectiveness of the framework.
5 core patents: U.S. Patents 5,621,080, 5,756,349, 5,955,422, 5,547,933, 5,618,698.
135 directly related patents (through citations) form our gold standard for computing formal measures such as precision and recall.
Total patent corpus of 1150 patents.
Identified over 3000 related publications through citations. These are available on PubMed and can be accessed through Entrez, a tool that provides a search interface to the PubMed database.
Around 30 court cases: patent litigation involving major companies including Amgen, Hoechst Marion Roussel, Inc., and Transkaryotic Therapies, Inc.
BioPortal is a comprehensive source of domain knowledge.

Patent Ontology Stats
54 classes, 40 properties and over 15,000 individuals from 1150 patents, 30 court cases and one partially instantiated file wrapper.
Used Protégé-OWL to edit the ontology and the Protégé-OWL/Jena API to programmatically instantiate physical documents.
The ontology can be queried through any SPARQL endpoint, such as Protégé or Virtuoso's triple store.
SWRL is used to declare rules. We use the Jess rule execution engine.
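The use case mentions a gold standard for computing precision and recall. The sketch below shows how those two measures could be computed for one result list. The function name and the small sets of patent numbers are illustrative assumptions. They are not the 135-patent gold standard itself.

```python
from typing import Iterable, Set, Tuple


def precision_recall(retrieved: Iterable[str], gold: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of a retrieved list against a gold-standard set."""
    retrieved_set = set(retrieved)          # ignore duplicate hits
    hits = len(retrieved_set & gold)        # retrieved documents that are relevant
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold-standard subset and hypothetical search result.
gold = {"5621080", "5756349", "5955422"}
retrieved = ["5621080", "5756349", "7304150", "6319499"]

p, r = precision_recall(retrieved, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```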

Methodology
The cross-references between document types and the metadata of documents in the patent system are utilized through a rule-based system.
Structural dependencies between types of documents must be considered.
Bio-ontologies are applied differently to each type of document because the depth of technical terminology differs. This is controlled through the weighting vector.
Based upon an initial selection of documents by the user, we perform a similarity analysis between documents [User Relevancy Feedback].

Rules

The declarative representation of the patent system ontology can facilitate reasoning through rules.
Different users may be interested in different aspects of the document (users can apply their own heuristics).
The methodology allows users to select which rules apply during search.

Rules
Two patents share the same inventor:
IF hasInventor(?pat1, ?inv1) ^ hasInventor(?pat2, ?inv1) ^ owlDifferentFrom(?pat1, ?pat2) THEN hasSimilarDocument(?pat1, ?pat2)
Same court case cites two different patents:
IF patentsInvolved(?case, ?pat1) ^ patentsInvolved(?case, ?pat2) ^ owlDifferentFrom(?pat1, ?pat2) THEN hasSimilarDocument(?pat1, ?pat2)
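The effect of the two rules above can be sketched outside a rule engine. The presentation declares them in SWRL and runs them with Jess; the plain-Python function below is a swapped-in illustration only. Its name, the inventor facts and the final union step are assumptions, while the court case facts reuse patent numbers named elsewhere in the deck.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def similar_by_shared_value(facts: Iterable[Pair]) -> Set[Pair]:
    """Given (patent, value) facts, return unordered pairs of distinct patents
    that share a value, i.e. the inferred hasSimilarDocument(pat1, pat2) facts."""
    patents_by_value: Dict[str, Set[str]] = defaultdict(set)
    for patent, value in facts:
        patents_by_value[value].add(patent)
    pairs: Set[Pair] = set()
    for patents in patents_by_value.values():
        # combinations() never pairs a patent with itself (owlDifferentFrom).
        pairs.update(combinations(sorted(patents), 2))
    return pairs


# Illustrative hasInventor(patent, inventor) facts.
has_inventor = [
    ("7304150", "Elliott_Steven_G"),
    ("7217689", "Elliott_Steven_G"),
    ("5756349", "Lin_Fu-Kuen"),
]
# patentsInvolved(case, patent) facts, stored as (patent, case).
patents_involved = [
    ("5547933", "314_F3d_1313"),
    ("5618698", "314_F3d_1313"),
    ("5621080", "314_F3d_1313"),
]

same_inventor = similar_by_shared_value(has_inventor)
same_case = similar_by_shared_value(patents_involved)
# Assumption: a simple union stands in for whichever combination the user selects.
similar = same_inventor | same_case
print(sorted(similar))
```

Keeping one result set per rule makes it straightforward to let a user switch individual rules on or off before the results are combined.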

Rules are combined by using: [expression not recoverable from the slide text]

Text
1. Patent 5,547,933, Claims:
A non-naturally occurring erythropoietin glycoprotein product having the in vivo biological ... to increase production of reticulocytes and red blood cells and having glycosylation which differs from that of human urinary erythropoietin.
Terms extracted from the claim text: {non-naturally, erythropoietin, glycoprotein, biological, ..., reticulocytes, glycosylation}

NCI Thesaurus: properties for Recombinant Erythropoietin
    ID: Recombinant Erythropoietin
    rdfs:label: Recombinant Erythropoietin
    Synonym: Erythrocyte Colony Stimulating Factor
    Synonym: Erythropoietin
    Synonym: Hematopoietin
    Synonym: Recombinant EPO
    Synonym: EPO

Hierarchy (subClassOf property):
    Recombinant Growth Factor
      Recombinant Hematopoietic Growth Factor
        Recombinant Erythropoietin
          Pegzerepoietin Alfa
          Darbepoetin Alfa
          Epoetin Beta
          Epoetin Alfa

Expanded terms, M:
    {recombinant erythropoietin, epo, recombinant epo, hematopoietin, erythrocyte colony stimulating factor}
    {recombinant hematopoietic growth factor}
    {recombinant growth factor}
    {darbepoetin alfa, epoetin beta, epoetin alfa}

Weight vector for patents: WPat = [1, 1, 0.5, 0.5, 0.2]
Weight vector for cases: WCase = [0, 0.2, 0, 0.1, 0]

Generated query:
    QPatent = WPat^T * M
    QCase = WCase^T * M

Implementation
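As one possible way to realize the query generation Q = W^T * M from the Text slide, the sketch below pairs each group of terms in M with one entry of a weight vector and drops groups whose weight is zero. It is not the authors' implementation. The function name is hypothetical, and the ordering of the five groups (claim terms, synonyms, parent, grandparent, children) is an assumption, because the slide text does not preserve which weight belongs to which group.

```python
from typing import Dict, List, Sequence


def weighted_terms(groups: Sequence[List[str]],
                   weights: Sequence[float]) -> Dict[str, float]:
    """Apply one weight per term group and return {term: weight}.
    Groups with weight 0 are left out of the generated query."""
    if len(groups) != len(weights):
        raise ValueError("need exactly one weight per term group")
    query: Dict[str, float] = {}
    for terms, weight in zip(groups, weights):
        if weight == 0:
            continue
        for term in terms:
            # A term that occurs in several groups keeps its highest weight.
            query[term] = max(weight, query.get(term, 0.0))
    return query


# Assumed group order: claim terms, synonyms, parent, grandparent, children.
M = [
    ["erythropoietin", "glycoprotein", "reticulocytes", "glycosylation"],
    ["recombinant erythropoietin", "epo", "recombinant epo", "hematopoietin"],
    ["recombinant hematopoietic growth factor"],
    ["recombinant growth factor"],
    ["darbepoetin alfa", "epoetin beta", "epoetin alfa"],
]
W_PAT = [1, 1, 0.5, 0.5, 0.2]    # values from the slide, order assumed
W_CASE = [0, 0.2, 0, 0.1, 0]     # values from the slide, order assumed

q_patent = weighted_terms(M, W_PAT)
q_case = weighted_terms(M, W_CASE)
print(q_patent)
print(q_case)
```

Under this assumed ordering the case query keeps only the synonym and grandparent groups, which reflects the idea that court documents use less detailed technical terminology than patents.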

Result: Structural Dependency

Result: Combining Rules and Bio-Ontology

Future Work
Formal evaluation is hard because well-defined ground truths are unavailable, but it is necessary.
Include other information sources: publications, regulations, laws.
Experiment with more use cases outside of the biomedical domain.

Tool Snapshot

Acknowledgement
This research is partially supported by NSF Grant Number IIS-0811975 awarded to the University of Illinois at Urbana-Champaign and NSF Grant Number IIS-0811460 awarded to Stanford University. Any opinions and findings are those of the authors, and do not necessarily reflect the views of the National Science Foundation.

Thank You: Questions?
Engineering Informatics Lab
Contact: Siddharth Taduri, Gloria T. Lau, Kincho H. Law, Jay P. Kesan

BACK UP SLIDES

Patent Ontology, Document Diversity etc.

Patent Documents
Over 7 million U.S. patents.
In 2009, 485,312 patent applications were filed.
Information is contained in various sections of the documents; a full-text search alone is not sufficient, and other metrics such as classification and citations need to be considered.
Documents are available in HTML format and can be easily parsed.

Court Cases
927 F.2d 1200 (1991)
AMGEN, INC., Plaintiff/Cross-Appellant, v. CHUGAI PHARMACEUTICAL CO., LTD., and Genetics Institute, Inc., Defendants-Appellants.
Nos. 90-1273, 90-1275.
United States Court of Appeals, Federal Circuit.
March 5, 1991. Suggestion for Rehearing Declined May 20, 1991.
Before MARKEY, LOURIE and CLEVENGER, Circuit Judges.

THE PATENTS On June 30, 1987, the United States Patent and Trademark Office (PTO) issued to Dr. Rodney Hewick U.S. Patent 4,677,195, entitled "Method for the Purification of Erythropoietin and Erythropoietin Compositions" (the '195 patent). The patent claims both homogeneous EPO and compositions thereof and a method for purifying human EPO using reverse phase high performance liquid chromatography. The method claims are not before us. The relevant claims of the '195 patent are: 1. Homogeneous erythropoietin characterized by a molecular weight of about 34,000 daltons on SDS PAGE, movement as a single peak on reverse phase high performance liquid chromatography and a specific activity of at least 160,000 IU per absorbance unit at 280 nanometers. ****** 3. A pharmaceutical composition for the treatment of anemia comprising a therapeutically effective amount of the homogeneous erythropoietin of claim 1 in a pharmaceutically acceptable vehicle. 4. Homogeneous erythropoietin characterized by a molecular weight of about 34,000 daltons on SDS PAGE, movement as a single peak on reverse phase high performance liquid chromatography and a specific activity of at least about 160,000 IU per absorbance unit at 280 nanometers.

Court cases are not very well structured, which makes them comparatively more difficult to parse.
PACER, an electronic system for accessing U.S. court databases, requires one to know the party/assignee name, case number/type, etc., which may not be known.

Patent File Wrapper
[Figure labels: Events, Text]
File wrappers are folders which contain all documents exchanged between a patent applicant and the patent office.
Every file wrapper is different: there is no standardized ordering of events.
The relevant information is embedded within a large amount of irrelevant text.
File wrappers are available as images, requiring additional processing to extract text.

Cross-Referencing
There are many aspects of these documents which can be utilized, especially the cross-referencing between the documents.

COURT CASE
314 F.3d 1313 (2003) AMGEN INC., Plaintiff-Cross Appellant v. HOECHST MARION ROUSSEL, INC. (now known as Aventis Pharmaceuticals, Inc.) and Transkaryotic Therapies, Inc., Defendants-Appellants.
Plaintiff-Cross Appellant Amgen Inc. is the owner of numerous patents directed to the production of erythropoietin ("EPO"), alleging that TKT's Investigational New Drug Application ("INDA") infringed United States Patent Nos. 5,547,933; 5,618,698; and 5,621,080. The complaint was amended in October 1999 to include United States Patent Nos. 5,756,349 and 5,955,422, which issued after suit was filed.

BIOPORTAL: DOMAIN KNOWLEDGE

REGULATIONS: U.S. Code Title 35, C.F.R. Title 37, M.P.E.P.

Publication Database

PATENT
United States Patent 5,955,422, September 21, 1999
Production of erythropoietin
Abstract: Disclosed are novel polypeptides possessing part or all of the primary structural conformation and one or more of the biological properties of mammalian erythropoietin ("EPO")
Inventors: Lin; Fu-Kuen (Thousand Oaks, CA)
Assignee: Kirin-Amgen, Inc. (Thousand Oaks, CA)
Appl. No.: 08/100,197
Filed: August 2, 1993

FILE WRAPPER
U.S. Patent 5,955,422
Claims 61-63 are rejected under 35 U.S.C. 103 as being unpatentable over any one of Miyake et al., 1977 (R)
In accordance with the provisions of 37 C.F.R. 1.607, the present continuation is being filed for the purpose of

Current prototype framework

1. Use bio-ontologies to expand the user's query, covering broader terms and concepts
2. Search the document domain using the expanded query
3. Use the patent system ontology's properties to relate documents (from all document domains)
4. Support user feedback to ensure the search progresses in the right direction

Patent System Ontology

Class Hierarchy - I

Class Hierarchy - II

Class Hierarchy - III

Parsing the document to instantiate the Ontology
Documents are automatically parsed using a regular-expression-based script.
Separate scripts are needed for each document domain.
The ontology is automatically instantiated using the Protégé-OWL API.
Example: Case 1 hasPlaintiff Amgen ..; Case 1 hasDefendant Chugai ..
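The parsing step can be illustrated with a small regular-expression sketch that turns a court case caption into hasPlaintiff and hasDefendant statements. This is only an illustration under stated assumptions: the pattern is written for the Amgen v. Chugai caption quoted earlier and would need extending for other caption layouts, the identifier "Case1" and the function name are hypothetical, and the output is plain tuples, not individuals created through the Protégé-OWL API.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Caption text as quoted on the Court Cases slide.
CAPTION = (
    "AMGEN, INC., Plaintiff/Cross-Appellant, v. CHUGAI PHARMACEUTICAL CO., LTD., "
    "and Genetics Institute, Inc., Defendants-Appellants."
)

# Assumed caption layout: <plaintiff>, Plaintiff..., v. <defendants>, Defendant(s)...
CAPTION_RE = re.compile(
    r"^(?P<plaintiff>.+?),\s*Plaintiff[^,]*,\s*v\.\s*"
    r"(?P<defendants>.+?),\s*Defendants?[-\w/ ]*\.$"
)


def caption_to_triples(case_id: str, caption: str) -> List[Triple]:
    """Extract (case, property, party) triples from a case caption."""
    match = CAPTION_RE.match(caption.strip())
    if match is None:
        return []  # caption does not follow the assumed layout
    triples: List[Triple] = [(case_id, "hasPlaintiff", match.group("plaintiff").strip())]
    # Multiple defendants are separated by ", and" in this caption.
    for name in re.split(r",\s*and\s+", match.group("defendants")):
        triples.append((case_id, "hasDefendant", name.strip()))
    return triples


triples = caption_to_triples("Case1", CAPTION)
for triple in triples:
    print(triple)
```

Returning an empty list for captions that do not match keeps unparsed cases visible for manual review, which matters because court documents are the least structured source in the corpus.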

