Transcription

Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016)Verb lemmatization and semantic verb classes in a Middle English corpusMichael PercillierUniversität MannheimAnglistische Linguistik/DiachronieL13, 9, 68131 Mannheim, [email protected] syntactic structures need to be complementedby searches for specific verbs and semantic verbclasses. The focus on specific verbs allows for adetailed comparison of native and borrowed verbs,while the focus on semantic verb classes will makeit possible to follow the spread of syntactic structures from borrowed verbs to other verbs sharingsimilar meanings.The currently available resources, described inSection 2, are geared towards queries of syntacticstructures, but not specific verbs, let alone semantic verb classes. In order to fulfill the needs ofthe project, the existing resources have to be enhanced in two ways: (1) the extension of existingannotation with lemma information for verbs, and(2) a method for determining semantic classes ofME verbs. The implementation of both enhancements is described in this paper, as well as theirpossible application on a recent study of the Frenchborrowing please (Trips and Stein, accepted).The paper describes the creation of newresources and associated tools in the framework of the research project Borrowing ofArgument Structure in Contact Situations(BASICS), which investigates the borrowing of argument structures of verbs fromOld French (OF) to Middle English (ME).The first resource is a database of MEform-lemma correspondences, on which alemmatization process is based. This process also identifies French-based verbs andthus enables a first diachronic analysis oftheir prevalence in ME. The second itemdiscussed is a newly developed method forquerying ME verbs according to their semantic classes. The created resources andmethods are crucial in the continuation ofthe research project, and can be applied toannotate further ME corpora and train othertools for the treatment of ME data.12IntroductionThis paper is part of a research project1 that investigates grammatical change in the language contactsituation between Middle English (ME) and OldFrench (OF) that set in after the Norman Conquest(1066) and lasted until ca 1500. More specifically,the project focuses on the connection between thelexical borrowing of verbs and the transfer of theirargument structures (AS) from the source languageOF to the recipient language ME.One of the objectives of the project is to tracethe spread of AS from borrowed verbs, i.e. verbsoriginally from OF, to native verbs, i.e. verbs already part of the English lexicon prior to the language contact situation. To this end, corpus queries1 Borrowingof Argument Structure in Contact Situations:The Case of Medieval English under French influence (BASICS).209Currently available resourcesThe Oxford English Dictionary (Proffitt, editor,2015), abbreviated as OED, serves as a point ofreference for the project, not only because it isan authoritative resource on the English lexicon,but also because it contains a wealth of etymological information. Owing to a cooperation in theproject with the OED’s principal etymologist PhilipDurkin, we were able to obtain a list of 2,026 English verbs borrowed from French between 1066and 1500 based on an explicit query. The verbs insaid list constitute the starting point of the project,as they are the loan words whose AS is thus introduced to English and can thereafter extend to otherverbs.The ways in which these loan verbs were usedshould be verified empirically in a corpus. ForME, the Penn-Helsinki-Parsed-Corpus of MiddleEnglish (Kroch and Taylor, 2000), henceforthPPCME2, presents the advantage of being syntac-

Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016)tically annotated. The corpus consists of 55 texts,totaling ca 1.2 million words, and is divided intofour periods: M1 (1150–1250), M2 (1250–1350),M3 (1350–1420), and M4 (1420–1500).2 The annotation format used is Penn-Treebank, which canbe queried using the specialized software tool CorpusSearch (Randall, 2010). The format uses setsof parentheses to represent the clause hierarchy,as illustrated for Modern English in the examplebelow.3( (IP-MAT (ADVP-TMP (ADV Then))(NP-SBJ (D the)(N child))(VBD became)(ADJP (ADJR happier)(CONJ and)(ADJR happier))(E S .)) )At the lowest level of the tree hierarchy, eachform is assigned a part-of-speech (POS) tag. Consequently, the annotation format, in combinationwith CorpusSearch, makes it possible to search forspecific grammatical properties, such as past tenseverbs using the VBD tag, or specific forms such asbecame. However, due to frequent spelling variation in ME data and the existence of irregular verbparadigms, queries for all forms of a verb, suchas become, are not readily available by searchingfor verb stems in ME corpora. To remedy this,all lexical verb forms in the PPCME2 are to belemmatized, a process described in Section 3.For the definition of semantic verb classes, themodel proposed by Levin (1993), which groupslexical verbs on a semantic basis, can be used asa point of reference. The advantage over othersemantic resources such as WordNet (PrincetonUniversity, 2010) lies in the listing of possible syntactic alternations for each verb class. However,the model applies to Present Day English (PDE)and cannot be directly applied to ME for a numberof reasons: (1) semantic changes occurred fromME to PDE, so that the classification proposed byLevin (1993) may be inaccurate for certain MEverbs, (2) ME verbs that no longer exist in PDEare not included in Levin’s classification, so thata direct application of the model to ME would re2 Informationfrom ASE-4/description.html.3 Example adapted from 0sult in only partial coverage, (3) a number of PDEverbs did not yet exist in ME and are therefore irrelevant in the definition of ME verb classes, and(4) the potential of syntactic alternations cannotbe postulated on the basis of intuition for earlierperiods.In addition to the OED, the Middle English Dictionary (McSparran et al., 2001), henceforth MED,constitutes a further dictionary resource that is relevant for the lemmatization of a ME corpus andthe definition of ME semantic verb classes. TheMED uses unique numerical identifiers (henceforthMED-IDs) for each entry that can serve to disambiguate homonyms. Furthermore, entries in theMED and the OED are linked, so that using bothresources in tandem makes it possible to distinguish between native and borrowed ME verbs bychecking them against the list of verbs borrowedfrom French provided by the OED.3Lemmatization of a ME corpusAs previously stated, the lemmatization of a MEcorpus, in particular of its verbs, is a crucial stepfor any study in which queries of specific verbs orsemantic verb classes are to be undertaken. Giventhe absence of lemmatized ME corpora or any goldstandard for the lemmatization of ME data, thelemmatization process relies on the semi-manualassignment of graphemic verb forms to their respective lemmas. The process is divided into twomajor steps: (1) the creation of an inventory ofform-lemma correspondences linking forms in thePPCME2 to lemmas in the MED, and (2) the insertion of this lemma information into the corpus.3.1Assignment of form-lemmacorrespondencesVerb forms were extracted from the PPCME2, andeach verb form was paired with a lemma and thecorresponding ID extracted from the MED. This assignment of verb forms to lemmas was undertakenmanually by four trained research assistants andthe author using a spreadsheet application. Theyalso had the option of specifying multiple lemmas or marking their choices as doubtful. In total, 19,320 graphemic verb forms were assigned to2,979 lemmas as primary matches, alongside 4,973lemmas specified as additional possible matches.The resulting form-lemma links were exported tothe YAML (Evans, 2009) format, which was chosen so as to allow the data to be easily imported as

Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016)a hash/dictionary in any programming language.4START3.2 Insertion of lemma information into thecorpusRead verb formUsing the inventory of form-lemma correspondences just mentioned, the insertion of lemma information is performed. For every verb markedwith a POS tag beginning with V in the corpus,5the following instructions are carried out:The main approach is a lexical lookup in the inventory of form-lemma correspondences. Shouldthis not return any results, two fallbacks are used:(1) Spelling variants are generated and queried forcorresponding lemmas. The following graphemesubstitution rules are used: i e/y, e i, y i/g/ g, u v/ou, v u, th t/ d, t th, d th, g g/y, g g/y, ou u, ll l,nn n, and pp p.6 Further, forms containing hyphens or tildes are assigned spelling variants without these characters. (2) The form isstemmed and checked against all stemmed forms inthe form-lemma inventory. Stemming is achievedby removing the following ME inflectional suffixes: d, d d, t, t t, an, ande?, dd?, den?, e,e d, e t, ede?, enn?, e?st, et, in?d?e?, ingg?e?,ode, odest, oden, ten?, th, tt?, yde?, ynde?, ynn?,yngg?e?, and yst.7The lemma information is appended directly tothe form in the corpus, so as to still comply with thePenn-Treebank format and related software suchas CorpusSearch. Each piece of inserted information is demarcated by @ characters and specifiedby an attribute. Verb lemmas are specified by theattribute l, and MED-IDs by the attribute m (seeExample (1)). For verbs occurring in the list ofFrench-based verbs, an additional attribute e (foretymology) is defined as french (see Example (2)).The attribute w (for warning) indicates that thelemma was matched using either the spelling substitution or the stemming method (see Examples(2)/(5) and (3) respectively), or that the manualform-lemma match was deemed doubtful (see Example (4)). For verbs spelt as multiple words, theinformation is appended to the final element (seeExample (5)). Should no form-lemma correspon4 Forexample with the PyYAML (Simonov, 2014) modulein Python (Python Software Foundation, 2015).5 Lexical be, do, and have need not be lemmatized as theirtags (B*, D*, and H* respectively) already reveal their lemma.6 In the PCCME2, the character sequences d, g, and trepresent the graphemes D, Z, þ respectively.7 Question marks refer to the regular expression quantifierspecifying that the preceding character may or may not occur.211TRUEMatch withlemma?FALSEGeneratespellingvariantsTRUEMatch withlemma?FALSECompareverbstemsMatch withlemma?FALSEAssign NAvaluesAdd attributee frenchTRUEENDFrenchborrowing?TRUEAdd relevantw attributeAssign lemmaand MED-IDFALSEFigure 1: Lemma insertion process.dence have been found even after the stemmingmethod, the lemma and MED-ID are marked as NA(see Example (6)). The lemma insertion process issummarized in Figure 1.(1) (VAG [email protected] [email protected] [email protected])(2) (VAG [email protected] [email protected] [email protected] [email protected] [email protected])(3) (VB [email protected] [email protected] [email protected] [email protected])(4) (VBI [email protected] [email protected] [email protected] [email protected])(5) (VBP21 vnder)(VBP22 [email protected] [email protected] [email protected] [email protected])(6) (VAN [email protected] [email protected] [email protected])With this additional annotation, the PPCME2can be queried for syntactic structures as before,but also for specific verbs. Using CorpusSearch,this is achieved by specifying the lemma with theexists function, e.g. (*l [email protected]* exists).To distinguish between homonyms, the MED-IDcan also be used for unambiguous queries, e.g. (*m [email protected]* exists).3.3EvaluationThe lemmatization of verbs in the PPCME2 treated130,282 verbs in total. 110,116 verbs (84.52%)were directly assigned matching lemmas. Additionally, 5,868 verbs (4.5%) were assigned a lemma

Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016)in the previous step, and (3) a method for queryingthe corpus for multiple verbs simultaneously.Proportion of French verbs per text 30% 20% 4.1 10%0% m1m2m3m4PeriodFigure 2: Proportion of French-based verbs in subperiods of ME.M2M3M4M11.0e-078.2e-152.6e-13M210.97M31Table 1: Pairwise t-tests of French-based verbs perME sub-period, using Bonferroni correction.using spelling substitution, and 10,421 verbs (8%)using stem comparison. The total of lemmatizedverbs is thus 126,405 (97.02%), whereas 3,877verbs (2.98%) could not be assigned any lemma.Based on controls of random samples of 100 tokens,the spelling substitution and stem comparison fallbacks were estimated to be accurate to 86% and90% respectively.The estimation of French-based verbs and thedivision of ME into the sub-periods M1–M4 makeit possible to investigate the diachronic spread ofFrench-based verbs in ME (see Figure 2).8 Theanalysis suggests a strong increase in the usageof French-based verbs between M1 and M2, withonly little fluctuation thereafter. This is confirmedthrough pairwise t-tests (see Table 1) with Bonferroni correction (Baayen, 2008, 105–106).4Determining ME semantic verb classesIn order to identify ME semantic verb classes, theclassification proposed by Levin (1993) can serveas a point of reference, but cannot be applied directly to ME, as already mentioned in Section 2.The estimation of ME equivalents to the semanticverb classes proposed by Levin (1993) is undertaken in three steps: (1) the creation of a database ofsemantic classes and the verbs therein from whichverb lists can be extracted, (2) a method for findingME verbs synonymous to the PDE verbs extracted8 Figure generated in R (R Core Team, 2016) with ggplot2(Wickham, 2009) and scales (Wickham, 2016).212Creating an inventory of semantic verbclassesAn electronic index of Levin (1993) exists as aHTML file,9 but it only lists which verbs occur inwhich numbered section of the monograph, ther