Theses: "Ukrainian (language) – Automatic analysis (linguistics)"

1

Saint-Joanis, Olena. "Formalisation de la langue ukrainienne avec NooJ : préparation du module ukrainien". Electronic Thesis or Diss., Bourgogne Franche-Comté, 2024. http://www.theses.fr/2024UBFCC005.

Abstract:

This work focuses on the formalization of the Ukrainian language using the NooJ platform. Ukrainian is very poorly described in the Western world, even though it is an official language of a European country with more than 45 million inhabitants and is represented in several international institutions. Ukrainian is also studied at several European universities. A formalization of Ukrainian by means of a software tool can therefore find several practical applications: it will enable in-depth morphosyntactic and semantic analysis of corpora, contribute to the development of NLP applications (for example, named-entity extractors, terminology, machine translation, spell checkers, etc.), and support computer-assisted teaching. We have built a Ukrainian module for NooJ composed of a main dictionary, "Ukr_dictionary_V.1.3", and two secondary dictionaries, "Ukr_dictionary_Participle_V.1.3" and "Ukr_dictionary_Proper_lowercase_V.1.3". The main dictionary contains 157,534 entries and recognizes 3,184,522 inflected forms. It describes simple atomic linguistic units (ALUs), made up of a single graphic form, as well as locutions made up of two or more forms; it recognizes and analyzes ALUs with alternative spellings and spells out abbreviations. The inflected forms of variable entries are formalized through 303 inflectional paradigms. We have also formalized 114 derivational paradigms that link perfective verbs to imperfective verbs. We have described numerous derived forms and spelling variants absent from the dictionary by means of 19 morphological grammars. Finally, we have listed certain forms in the secondary dictionaries, notably participles and lowercase proper nouns.
The "Ukr_dictionary_Participle_V.1.3" dictionary contains 13,070 entries and complements the main dictionary when the morphological grammar that describes participles fails to recognize a participle in the text. The "Ukr_dictionary_Proper_lowercase_V.1.3" dictionary contains proper nouns written in lowercase; in combination with the grammar "Adjectives_Relatives_V.1.3.nom", it makes it possible to recognize relative adjectives derived from proper nouns. Thanks to these resources, 98.3% of the occurrences in the test corpus were recognized and annotated with their morphological information. We also built ten syntactic grammars that resolve a large number of ambiguities, reducing the number of annotations from 206,445 to 131,415 for a corpus of 108,137 occurrences.
Although interest in the Ukrainian language has increased greatly in recent years, it remains poorly described and schematized. The few Natural Language Processing (NLP) software applications available do not necessarily meet the needs of students or researchers. These tools have been developed using stochastic approaches and, therefore, do not have a solid linguistic basis. Consequently, their usefulness is questionable, as they produce too many errors. After studying these available NLP applications, we chose to use the NooJ linguistic platform to process Ukrainian because it provides us with the tools we need to develop linguistic resources in the form of dictionaries and orthographic, morphological, syntactic, and semantic grammars. Note that NooJ also provides users with tools to manage corpora and perform various statistical analyses, and is well suited to building pedagogical applications. We have built a Ukrainian module for NooJ that consists of a main dictionary, "Ukr_dictionary_V.1.3", and two secondary dictionaries, "Ukr_dictionary_Participle_V.1.3" and "Ukr_dictionary_Proper_lowercase_V.1.3". The main dictionary contains 157,534 entries and recognizes 3,184,522 inflected forms. It describes simple ALUs made up of a single graphic form, but also locutions made up of two or more forms; it recognizes and analyzes ALUs with alternative spellings, and makes abbreviations explicit. The inflected forms of variable entries are formalized through 303 inflectional paradigms. We have also formalized 114 derivational paradigms that link perfective verbs to imperfective verbs. The 19 morphological grammars describe numerous derived forms and spelling variants not found in the dictionary. Finally, we have listed certain forms in secondary dictionaries, notably participles and lowercase proper nouns.
The "Ukr_dictionary_Participle_V.1.3" dictionary contains 13,070 entries and complements the main dictionary when the morphological grammar describing participles does not allow the participle to be recognized in the text. Thanks to these resources, 98.3% of occurrences in the test corpus were recognized and annotated with their morphological information. We also built ten syntactic grammars, which removed many ambiguities, as we went from 206,445 annotations to 131,415 for a corpus of 108,137 occurrences. We have also outlined several avenues for future work to improve our module, namely the development of additional morphological and syntactic grammars that will remove the remaining ambiguities.
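
The figures quoted in the abstract can be put in proportion with a little arithmetic. The sketch below is illustrative only and is not part of the thesis or of NooJ: the function name `summarize` and the dictionary keys are hypothetical, and the derived ratios (forms per entry, annotations per occurrence, share of annotations removed) are simple quotients of the published counts, not metrics reported by the author.

```python
# Counts taken from the abstract; everything derived below is plain arithmetic.
ENTRIES = 157_534              # entries in the main dictionary
INFLECTED_FORMS = 3_184_522    # inflected forms recognized by the main dictionary
OCCURRENCES = 108_137          # occurrences in the corpus used for disambiguation
ANNOTATIONS_BEFORE = 206_445   # annotations before applying the syntactic grammars
ANNOTATIONS_AFTER = 131_415    # annotations after applying the syntactic grammars


def summarize():
    """Return derived ratios (hypothetical helper, not a NooJ function)."""
    removed = ANNOTATIONS_BEFORE - ANNOTATIONS_AFTER
    return {
        # average number of inflected forms generated per dictionary entry
        "forms_per_entry": INFLECTED_FORMS / ENTRIES,
        # average number of competing annotations per occurrence
        "annotations_per_occurrence_before": ANNOTATIONS_BEFORE / OCCURRENCES,
        "annotations_per_occurrence_after": ANNOTATIONS_AFTER / OCCURRENCES,
        # share of the initial annotations eliminated by the grammars
        "annotations_removed_pct": 100 * removed / ANNOTATIONS_BEFORE,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.2f}")
```

Read this way, the ten syntactic grammars remove a little over a third of the initial annotations, bringing the average from roughly two annotations per occurrence to a little more than one.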