Dissertations on the topic „Documentation automatique des langues“
Cite a source in APA, MLA, Chicago, Harvard, and other citation styles
Consult the top 50 dissertations for research on the topic "Documentation automatique des langues".
Next to every entry in the bibliography, an "Add to bibliography" option is available. Use it, and the bibliographic reference for the chosen work will be formatted automatically in the required citation style (APA, MLA, Harvard, Chicago, Vancouver, etc.).
You can also download the full text of the scientific publication in PDF format and read an online annotation of the work, if the relevant parameters are available in the metadata.
Browse dissertations from a wide range of specialist fields and compile your bibliography correctly.
Okabe, Shu. „Modèles faiblement supervisés pour la documentation automatique des langues“. Electronic Thesis or Diss., université Paris-Saclay, 2023. http://www.theses.fr/2023UPASG091.
In the wake of the threat of extinction of half of the languages spoken today by the end of the century, language documentation is a field of linguistics notably dedicated to the recording, annotation, and archiving of data. In this context, computational language documentation aims to devise tools for linguists to ease several documentation steps through natural language processing approaches. As part of the CLD2025 computational language documentation project, this thesis focuses mainly on two tasks: word segmentation, to identify word boundaries in an unsegmented transcription of a recorded sentence, and automatic interlinear glossing, to predict linguistic annotations for each sentence unit. For the first task, we improve the performance of the Bayesian non-parametric models used until now through weak supervision. For this purpose, we leverage resources realistically available during documentation, such as already-segmented sentences or dictionaries. Since we still observe an over-segmenting tendency in our models, we introduce a second segmentation level: the morphemes. Our experiments with various types of two-level segmentation models indicate a slight improvement in segmentation quality. However, we also face limitations in differentiating words from morphemes using statistical cues only. The second task concerns the generation of either grammatical or lexical glosses. As the latter cannot be predicted using training data alone, our statistical sequence-labelling model adapts the set of possible labels for each sentence and provides a competitive alternative to the most recent neural models.
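As an illustration of the word segmentation task described in this abstract (a minimal sketch, not the thesis's Bayesian non-parametric model), the toy Python snippet below enumerates the segmentations of an unsegmented transcription that are licensed by a small dictionary, the kind of resource the abstract mentions as weak supervision. The lexicon and the input string are invented examples.

# Sketch: enumerate dictionary-licensed segmentations of an unsegmented
# string by dynamic programming. Illustrative only; the thesis itself uses
# Bayesian non-parametric models with such dictionaries as weak supervision.
from functools import lru_cache

LEXICON = {"mo", "kala", "mokala", "na", "ndako"}  # hypothetical entries

def segmentations(s):
    """Return every split of s into words found in LEXICON."""
    @lru_cache(maxsize=None)
    def seg(i):
        if i == len(s):
            return [[]]
        results = []
        for j in range(i + 1, len(s) + 1):
            if s[i:j] in LEXICON:
                results += [[s[i:j]] + rest for rest in seg(j)]
        return results
    return seg(0)

print(segmentations("mokalanandako"))
# -> [['mo', 'kala', 'na', 'ndako'], ['mokala', 'na', 'ndako']]

The ambiguity between the two outputs is exactly the difficulty the abstract raises: statistical cues alone must decide whether "mokala" is one word or two morphemes.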
Godard, Pierre. „Unsupervised word discovery for computational language documentation“. Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLS062/document.
Language diversity is under considerable pressure: half of the world's languages could disappear by the end of this century. This realization has sparked many initiatives in documentary linguistics in the past two decades, and 2019 was proclaimed the International Year of Indigenous Languages by the United Nations, to raise public awareness of the issue and foster initiatives for language documentation and preservation. Yet documentation and preservation are time-consuming processes, and the supply of field linguists is limited. Consequently, the emerging field of computational language documentation (CLD) seeks to assist linguists by providing them with automatic processing tools. The Breaking the Unwritten Language Barrier (BULB) project, for instance, constitutes one of the efforts defining this new field, bringing together linguists and computer scientists. This thesis examines the particular problem of discovering words in an unsegmented stream of characters, or phonemes, transcribed from speech in a very-low-resource setting. This primarily involves a segmentation procedure, which can also be paired with an alignment procedure when a translation is available. Using two realistic Bantu corpora for language documentation, one in Mboshi (Republic of the Congo) and the other in Myene (Gabon), we benchmark various monolingual and bilingual unsupervised word discovery methods. We then show that using expert knowledge in the Adaptor Grammar framework can vastly improve segmentation results, and we indicate ways to use this framework as a decision tool for the linguist. We also propose a tonal variant for a strong non-parametric Bayesian segmentation algorithm, making use of a modified backoff scheme designed to capture tonal structure. To leverage the weak supervision given by a translation, we finally propose and extend an attention-based neural segmentation method, significantly improving the segmentation performance of an existing bilingual method.
Guinaudeau, Camille. „Structuration automatique de flux télévisuels“. Phd thesis, INSA de Rennes, 2011. http://tel.archives-ouvertes.fr/tel-00646522.
Pitzalis, Denis. „3D et sémantique : nouveaux outils pour la documentation et l'exploration du patrimoine culturel“. Paris 6, 2013. http://www.theses.fr/2013PA066642.
The role of museums and libraries is shifting from that of an institution which mainly collects and stores artefacts and works of art towards a more accessible place where visitors can experience heritage and find cultural knowledge in more engaging and interactive ways. Due to this shift, ICT have an important role to play, both in assisting in the documentation and preservation of information, by providing images and 3D models of historical artefacts and works of art, and in creating interactive ways to inform the general public of the significance that these objects have for humanity. The process of building a 3D collection draws on many different technologies and digital sources. From the perspective of the ICT professional, technologies such as photogrammetry, scanning, modelling, visualisation, and interaction techniques must be used jointly. Furthermore, data exchange formats become essential to ensure that the digital sources are seamlessly integrated. This PhD thesis aims to address the documentation of works of art by proposing a methodology for the acquisition, processing, and documentation of heritage objects and archaeological sites using 3D information. The main challenge is to convey the importance of a 3D model that is "fit for purpose" and created with a specific function in mind (i.e. very high definition and accurate models for academic studies, monitoring conservation conditions over time, and preliminary studies for restoration; medium resolution for on-line web catalogues). Hence, this PhD thesis investigates the integration of technologies for 3D capture, processing, integration between different sources, semantic organization of metadata, and preservation of data provenance.
Lima, Ronaldo. „Contribution au traitement automatique de textes médicaux en portugais : étude du syntagme verbal“. Nice, 1995. http://www.theses.fr/1995NICE2012.
The research work in question undertakes to analyse the verb phrase in Portuguese and thus contributes to the automatic processing of medical texts. It is a linguistic study with firstly a morpho-syntactic perspective, followed by a syntactico-semantic and then a conceptual one. The latter study culminates in the representation of key concepts or themes characterizing the disciplines of allergology and pneumology.
Borges, de Faveri Claudia. „Contribution au traitement automatique de textes médicaux en portugais : étude du syntagme nominal“. Nice, 1995. http://www.theses.fr/1995NICE2013.
This study is mainly concerned with the description of the Portuguese noun phrase, in the scientific and technical context provided by medical texts. Whilst being traditionally linguistic in nature, the analysis also aims to bring to light a certain number of linguistic resources which may ultimately serve in language processing activities, in particular document processing and machine-aided translation.
Gauthier, Elodie. „Collecter, Transcrire, Analyser : quand la machine assiste le linguiste dans son travail de terrain“. Thesis, Université Grenoble Alpes (ComUE), 2018. http://www.theses.fr/2018GREAM011/document.
In the last few decades, many scientists have been concerned with the fast extinction of languages. Faced with this alarming decline of the world's linguistic heritage, action is urgently needed to enable fieldwork linguists, at least, to document languages, by providing them with innovative collection tools and enabling them to describe these languages. Machine assistance might be valuable in such a task. This is what we propose in this work, focusing on three pillars of linguistic fieldwork: collection, transcription and analysis. Recordings are essential, since they are the source material, the starting point of the descriptive work. Speech recording is also a valuable object for the documentation of the language. The growing proliferation of smartphones and other interactive voice mobile devices offers new opportunities for fieldwork linguists and researchers in language documentation. Field recordings should also include ethnolinguistic material, which is particularly valuable for documenting traditions and ways of living. However, large data collections require well-organized repositories to access the content, with efficient file naming and metadata conventions. Thus, we have developed LIG-AIKUMA, a free Android app running on various mobile phones and tablets. The app aims to record speech for language documentation in an innovative way. It includes smart generation and handling of speaker metadata as well as respeaking and parallel audio data mapping. LIG-AIKUMA proposes a range of different speech collection modes (recording, respeaking, translation and elicitation) and offers the possibility of sharing recordings between users. Through these modes, parallel corpora are built, such as "under-resourced speech - well-resourced speech", "speech - image", "speech - video", which are also of great interest for speech technologies, especially for unsupervised learning. After the data collection step, the fieldwork linguist transcribes these data. Nonetheless, this cannot currently be done on the whole collection, since the task is tedious and time-consuming. We propose to use automatic techniques to help the fieldwork linguist take advantage of the whole speech collection. Along these lines, automatic speech recognition (ASR) is a way to produce transcripts of the recordings with a decent quality. Once the transcripts are obtained (and corrected), the linguist can analyze the data. In order to analyze the whole collection, we consider the use of forced alignment methods. We demonstrate that such techniques can lead to fine-grained evaluation of linguistic features. In return, we show that modeling specific features may lead to improvements in the ASR systems.
Francony, Jean Marc. „Modélisation du dialogue et représentation du contexte d'interaction dans une interface de dialogue multi-modes dont l'un des modes est dédié à la langue naturelle écrite“. Grenoble 2, 1993. http://www.theses.fr/1993GRE21038.
The problems posed by the representation of the interaction context in the dialogue system of a multi-modal man-machine interface are at the origin of this thesis, which is a study of a focusing mechanism. The emphasis is on the anchoring of the focusing mechanism in the intervention surface. In the model we propose, anchorage is expressed at each mode level in terms of a thematic model similar to the one we propose for natural language in this thesis. This thematic model is based on work by the Prague school of formal linguistics, whose hypotheses concerning the communicative function have been adopted. The thematic model allows an utterance to translate its communicative dynamism into a degree of activation of its knowledge representation. This model has been extended to discourse representation on the basis of a hypothesis concerning textual cohesion (which can be found, for instance, in anaphoric or elliptical inter-utterance relations). From this point of view, the synergy of modes can be expressed as the fusion of representations of modal segments. In the focusing model, cohesion relations are considered as pipes propagating activation. This work is at the origin of the context management system implemented in the MMI2 project (ESPRIT project 2474).
Moneimne, Walid. „TAO vers l'arabe : spécification d'une génération standard de l'arabe ; réalisation d'un prototype anglais-arabe à partir d'un analyseur existant“. Grenoble 1, 1989. http://www.theses.fr/1989GRE10061.
Pellegrini, Thomas. „Transcription automatique de langues peu dotées“. Phd thesis, Université Paris Sud - Paris XI, 2008. http://tel.archives-ouvertes.fr/tel-00619657.
Colin, Émilie. „Traitement automatique des langues et génération automatique d'exercices de grammaire“. Electronic Thesis or Diss., Université de Lorraine, 2020. http://www.theses.fr/2020LORR0059.
Our perspective is educational: to create grammar exercises for French. Paraphrasing is an operation of reformulation. Our work tends to attest that sequence-to-sequence models are not simple repeaters but can learn syntax. First, by combining various models, we have shown that representing information in multiple forms (using formal data (RDF), coupled with text to extend or reduce it, or text only) allows us to exploit a corpus from different angles, increasing the diversity of outputs and exploiting the syntactic levers put in place. We also addressed a recurrent problem, that of data quality, and obtained paraphrases with high syntactic adequacy (up to 98% coverage of the demand) and a very good linguistic level. We obtain up to 83.97 points of BLEU-4*, 78.41 more than our baseline average, without syntactic leverage. This rate indicates better control of the outputs, which are varied and of good quality in the absence of syntactic leverage. Our idea was to be able to work from raw text: to produce a representation of its meaning. The transition to French text was also an imperative for us. Working from plain text, by automating the procedures, allowed us to create a corpus of more than 450,000 sentence/representation pairs, thanks to which we learned to generate massively correct texts (92% on qualitative validation). Anonymizing everything that is not functional contributed significantly to the quality of the results (68.31 BLEU, i.e. +3.96 compared to the baseline, which was the generation of text from non-anonymized data). This second line of work can be applied to the integration of a syntactic lever guiding the outputs. What was our baseline at time 1 (generation without constraint) would then be combined with a constrained model. By applying an error search, this would allow the constitution of a silver base associating representations with texts. This base could then be multiplied by reapplying constrained generation, and thus achieve the applied objective of the thesis. The formal representation of information in a language-specific framework is a challenging task. This thesis offers some ideas on how to automate this operation. Moreover, we were only able to process relatively short sentences. The use of more recent neural models would likely improve the results. The use of appropriate output constraints would allow for extensive checks. *BLEU: a measure of text quality on a scale from 0 (worst) to 100 (best); Papineni et al. (2002).
Vasilescu, Ioana Gabriela. „Contribution à l'identification automatique des langues romanes“. Lyon 2, 2001. http://theses.univ-lyon2.fr/documents/lyon2/2001/vasilescu_ig.
This work deals with the automatic identification of Romance languages. The aim of our study is to provide linguistic patterns potentially robust for the discrimination of five languages of the Latin family (i.e., Spanish, French, Italian, Portuguese and Romanian). The Romance languages have the advantage of a centuries-old linguistic tradition and are official languages in several countries of the world; the study of the taxonomic approaches devoted to this linguistic family shows the special relevance of typological classification. More precisely, vocalic patterns provide relevant criteria for a division of the five idioms into two groups, according to the complexity of each Romance vocalic system: Italian, Spanish vs. Romanian, French, Portuguese. The first group includes languages with prototypical vocalic systems, whereas the second includes languages with complex vocalic systems in terms of the number of oppositions. In addition to the vocalic criteria, this hierarchy is supported by consonantal and prosodic particularities. We conducted two experimental paradigms to test the correspondence between the perceptual patterns used by naïve listeners to differentiate the Romance languages and the linguistic patterns employed by the typological classification. A first series of discrimination experiments on four groups of subjects, selected according to the criterion [+/- Romance native language] (i.e., French, Romanian vs. Japanese, American), showed different perceptual strategies related both to the native language and to familiarity with the Romance languages. The linguistic strategies lead to a macro-discrimination of the languages into two groups similar to those obtained via the typological taxonomy based on vocalic particularities (i.e., Spanish, Italian vs. Romanian, French, Portuguese). The second series of perceptual experiments, on two groups of subjects (French and American), consisted in the evaluation of the acoustic similarity of the five languages. The results confirmed the division of the Romance languages into the same two groups as the discrimination experiments. We concluded that vocalic patterns may be a robust cue for the discrimination of the Latin idioms into two major linguistic groups: Italian, Spanish vs. Romanian, French, Portuguese.
Gutiérrez, Celaya Jorge Arturo. „Fusion d'informations en identification automatique des langues“. Toulouse 3, 2005. http://www.theses.fr/2005TOU30098.
Fusing decision information coming from different experts is an important issue in Automatic Language Identification. In order to explore and compare different fusion strategies, the information behaviour is modelled by means of formal classification methods provided either by statistical theory, such as the Gaussian mixture model, neural networks and the discriminant classifier, or by recent research advances in possibility and evidential theories. As an alternative to empirical procedures, a formal fusion methodology within the Bayesian paradigm is proposed: evaluating expert performance by means of discriminant factor analysis provides us with confidence indices; aggregating expert decisions leads us to choose those fusion methods that provide us, directly or after transformation, with probability or likelihood values for the languages; and building and weighting new loss functions with the confidence indices leads us to make unique decisions by minimum risk.
Vasilescu, Ioana Gabriela, and Jean-Marie Hombert. „Contribution à l'identification automatique des langues romanes“. [S.l.] : [s.n.], 2001. http://demeter.univ-lyon2.fr:8080/sdx/theses/lyon2/2001/vasilescu_ig.
Tirilly, Pierre. „Traitement automatique des langues pour l'indexation d'images“. Phd thesis, Université Rennes 1, 2010. http://tel.archives-ouvertes.fr/tel-00516422.
Der volle Inhalt der QuelleTirilly, Pierre. „Traitement automatique des langues pour l'indexation d'images“. Phd thesis, Rennes 1, 2010. http://www.theses.fr/2010REN1S045.
In this thesis, we propose to integrate natural language processing (NLP) techniques into image indexing systems. We first address the issue of describing the visual content of images. We rely on the visual-word-based image description, which raises problems that are well known in the text indexing field. First, we study various NLP methods (weighting schemes and stop-lists) to automatically determine which visual words are relevant to describe the images. Then we use language models to take into account some geometrical relations between the visual words. We also address the issue of describing the semantic content of images: we propose an image annotation scheme that relies on extracting relevant named entities from the texts accompanying the images to be annotated.
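As a toy illustration of carrying text-indexing machinery over to visual words (a hedged sketch, not the thesis's actual experimental setup), the snippet below computes tf-idf weights over per-image histograms of quantised descriptor ids; the corpus is invented. Note how visual word 3, which occurs in every image, receives zero weight, which is the intuition behind the stop-lists the abstract mentions.

# Sketch: tf-idf weighting of "visual words" (quantised descriptor ids).
import math
from collections import Counter

images = {  # hypothetical bags of visual words, one per image
    "img1": [3, 3, 7, 12, 3],
    "img2": [7, 7, 9, 3],
    "img3": [12, 9, 9, 3],
}

def tfidf(images):
    n = len(images)
    df = Counter(w for bag in images.values() for w in set(bag))
    return {name: {w: (c / len(bag)) * math.log(n / df[w])
                   for w, c in Counter(bag).items()}
            for name, bag in images.items()}

for name, weights in tfidf(images).items():
    print(name, {w: round(x, 3) for w, x in weights.items()})
# visual word 3 appears in all images: idf = log(3/3) = 0, like a stop-word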
Pellegrino, François. „Une approche phonétique en identification automatique des langues“. Toulouse 3, 1998. http://www.theses.fr/1998TOU30294.
Perez, Laura Haide. „Génération automatique de phrases pour l'apprentissage des langues“. Thesis, Université de Lorraine, 2013. http://www.theses.fr/2013LORR0062/document.
In this work, we explore how Natural Language Generation (NLG) techniques can be used to address the task of (semi-)automatically generating language learning material and activities in Computer-Assisted Language Learning (CALL). In particular, we show how a grammar-based Surface Realiser (SR) can be usefully exploited for the automatic creation of grammar exercises. Our surface realiser uses a wide-coverage reversible grammar, namely SemTAG, which is a Feature-Based Tree Adjoining Grammar (FB-TAG) equipped with a unification-based compositional semantics. More precisely, the FB-TAG grammar integrates a flat and underspecified representation of First Order Logic (FOL) formulae. In the first part of the thesis, we study the task of surface realisation from flat semantic formulae and we propose an optimised FB-TAG-based realisation algorithm that supports the generation of longer sentences given a large-scale grammar and lexicon. The approach followed to optimise TAG-based surface realisation from flat semantics draws on the fact that an FB-TAG can be translated into a Feature-Based Regular Tree Grammar (FB-RTG) describing its derivation trees. The derivation tree language of TAG constitutes a simpler language than the derived tree language, and thus generation approaches based on derivation trees have already been proposed. Our approach departs from previous ones in that our FB-RTG encoding accounts for the feature structures present in the original FB-TAG, thus having important consequences regarding over-generation and preservation of the syntax-semantics interface. The concrete derivation tree generation algorithm that we propose is an Earley-style algorithm integrating a set of well-known optimisation techniques: tabulation, sharing-packing, and semantic-based indexing. In the second part of the thesis, we explore how our SemTAG-based surface realiser can be put to work for the (semi-)automatic generation of grammar exercises. Usually, teachers manually edit exercises and their solutions, and classify them according to the degree of difficulty or expected learner level. A strand of research in Natural Language Processing (NLP) for CALL addresses the (semi-)automatic generation of exercises. Mostly, this work draws on texts extracted from the Web and uses machine learning and text analysis techniques (e.g. parsing, POS tagging, etc.). These approaches expose the learner to sentences that have a potentially complex syntax and diverse vocabulary. In contrast, the approach we propose in this thesis addresses the (semi-)automatic generation of grammar exercises of the type found in grammar textbooks. In other words, it deals with the generation of exercises whose syntax and vocabulary are tailored to specific pedagogical goals and topics. Because the grammar-based generation approach associates natural language sentences with a rich linguistic description, it permits defining a syntactic and morpho-syntactic constraint specification language for the selection of stem sentences in compliance with a given pedagogical goal. Further, it allows for the post-processing of the generated stem sentences to build grammar exercise items. We show how fill-in-the-blank, shuffle and reformulation grammar exercises can be automatically produced. The approach has been integrated in the Interactive French Learning Game (I-FLEG) serious game for learning French and has been evaluated both through interactions with online players and in collaboration with a language teacher.
Dary, Franck. „Modèles incrémentaux pour le traitement automatique des langues“. Electronic Thesis or Diss., Aix-Marseille, 2022. http://www.theses.fr/2022AIXM0248.
This thesis is about natural language processing, and more specifically concerns the prediction of the syntactic-morphological structure of sentences. This involves segmenting a text into sentences and then into words, associating with each word a part of speech and morphological features, and then linking the words to make the syntactic structure explicit. The thesis proposes a predictive model that performs these tasks simultaneously and in an incremental fashion: the text is read character by character, and all linguistic predictions are updated with the information brought by each new character. The motivation for exploring this architecture is to draw inspiration from human reading, which imposes these two constraints. From an experimental point of view, we compute the correlation between eye-tracking variables measured on human subjects and complexity metrics specific to our model. Moreover, we propose a backtracking mechanism, inspired by the regressive saccades observed in humans. To this end, we use reinforcement learning, which allows the model to perform backtracking when it reaches a dead end.
Donzo, Bunza Yugia Jean-Pierre. „Langues bantoues de l'entre Congo-Ubangi, RD Congo: documentation, reconstruction, classification et contacts avec les langues oubanguiennes“. Doctoral thesis, Universite Libre de Bruxelles, 2015. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/209145.
A quantitative lexicostatistical study determines the degree of similarity between the Bantu languages under study, before establishing a phylogenetic classification that integrates these languages into a larger group totalling 401 Bantu languages, illustrated by Neighbor-Net and Neighbor-Joining trees.
The phonological description notes the presence of certain phonemes foreign to the Proto-Bantu system (implosives and labiovelars), functioning in several languages not as allophones but as phonemes distinct from their explosive and velar counterparts. The examination of these elements and of other particular linguistic features indicates that they are likely borrowings from the neighbouring Ubangian languages.
All in all, it appears that the current linguistic particularities of the Bantu languages of the Congo-Ubangi area, at the segmental, suprasegmental (not addressed here) and structural levels, are partly linked to contact, both past and present, with speakers of non-Bantu languages, notably Ubangian ones.
Lexical borrowings, for example, reveal borrowings both from Bantu into the Ubangian languages and from the Ubangian languages into Bantu.
Nevertheless, the historical and archaeological evidence on the date and nature of these contact relations is rather weak and calls for interdisciplinary studies in the future.
Doctorate in Languages and Letters
McCabe, Gragnic Julie. „Documentation et description du maya tenek“. Thesis, Paris 3, 2014. http://www.theses.fr/2014PA030166.
The principal objective of this thesis is to document and describe an endangered indigenous language of Mexico and, in parallel, to provide tools to its speakers for the teaching and transmission of said language, thereby contributing to efforts for its revitalisation. As documented within the thesis, Tének (sometimes written Teenek; also known by the name Huastec/Wastek) is a Mayan language spoken in the state of San Luis Potosí, Mexico, and although it is not officially recognised as being in any particular danger of extinction, its destiny is quite uncertain in the mid-term. This is duly demonstrated within the first part of the thesis, thereby questioning the classification of endangered languages and revealing the extent to which many more languages are at risk than is apparent. The Maya Tének are separated from the other Mayan language speakers by more than 700 km, but are in close contact with indigenous language speakers of other origins (namely Uto-Aztecan and Otomanguean). This configuration of isolation/contact creates, typologically speaking, a particularly interesting object of study. Its isolation from the other Mayan languages means that Tének is and has remained a conservative language displaying close links with the proto-language, yet this same situation of isolation, coupled with its contact with languages of other origins, has forced Tének to innovate and to evolve in other ways. One such example is the classification of nouns, which differs from other Mayan languages. Another Tének development is its morphological inverse system based on a hierarchy of person markers, which is unique within the Mayan family. The complex verb structure of Tének also presents some interesting features: it has both primary aspect markers (completive, incompletive, etc.) and secondary aspect markers (exhaustive, intensive, resultative, etc.), several antipassive markers (one of which is used to express reciprocity, which is in itself unusual for a Mayan language), and more than one way to express the passive as well as the middle voice. All of these features are examined in detail within the second part of this thesis, based on original materials collected in the field within the framework of this project, both via elicitation and via the collection and transcription of stories. The third and final part of the thesis is dedicated to the presentation of some of the original and creative documentation methods and tools used both for fieldwork and in organised workshop sessions, in order to collect data for this project as well as to provide means by which the speakers and/or teachers of Tének can fight against the loss of the language. Some of the results of the work accomplished via these methods are presented here too. This part of the thesis also takes a look at how bilingual and intercultural education in Mexico is shaped and the actions taken toward protecting Mexican native languages. This thesis was developed as an experimental project in documentary linguistics; this particular paradigm of linguistics is revealing itself to be more and more important as languages continually disappear, but remains as yet a little-explored domain within the field of linguistics in France.
Kuramoto, Hélio. „Proposition d'un système de recherche d'information assistée par ordinateur : avec application à la langue portugaise“. Lyon 2, 1999. http://theses.univ-lyon2.fr/documents/lyon2/1999/hkuramoto.
In this research, we propose a model to address problems typically faced by users of information indexing and retrieval systems (IRS) applied to full-text databases. Through discussion of these problems we arrive at a solution that had formerly been proposed by the SYDO group: the use of nominal phrases (or nominal groups) as descriptors, instead of the words generally used by traditional IRS. In order to verify the feasibility of this proposition, we have developed a prototype of an IRS with a full-text database…
Lavecchia, Caroline. „Les Triggers Inter-langues pour la Traduction Automatique Statistique“. Phd thesis, Université Nancy II, 2010. http://tel.archives-ouvertes.fr/tel-00545463.
Denoual, Etienne. „Méthodes en caractères pour le traitement automatique des langues“. Phd thesis, Université Joseph Fourier (Grenoble), 2006. http://tel.archives-ouvertes.fr/tel-00107056.
This work promotes the use of methods working at the level of the written signal: the character, a unit immediately accessible in any computerised language, makes it possible to do without word segmentation, a step that is currently unavoidable for languages such as Chinese or Japanese.
First, we transpose and apply to characters a well-established method for the objective evaluation of machine translation, BLEU.
The encouraging results allow us, in a second step, to tackle other linguistic data processing tasks: first, grammaticality filtering; then, the characterisation of the similarity and homogeneity of linguistic resources. In all these tasks, character-level processing obtains acceptable results, comparable to those obtained with words.
In a third step, we tackle tasks of linguistic data production: analogical computation on character strings enables the production of paraphrases as well as machine translation.
This work shows that a complete machine translation system requiring no segmentation can be built, a fortiori for processing languages without orthographic separators.
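To make the character-level transposition of BLEU mentioned above concrete, here is a minimal sketch assuming the standard BLEU ingredients (modified n-gram precision up to 4-grams and a brevity penalty); it is a toy reimplementation for illustration, not the thesis's actual code or exact formulation.

# Sketch: BLEU computed over characters instead of words, so no word
# segmentation is needed. Single reference, n-grams up to 4, brevity penalty.
import math
from collections import Counter

def ngrams(seq, n):
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def char_bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(char_bleu("la maison bleue", "la maison bleue"))  # 1.0
print(char_bleu("la maisons bleu", "la maison bleue"))  # < 1.0, yet rewards
# the large character overlap that word-level BLEU would partly miss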
Moreau, Fabienne. „Revisiter le couplage traitement automatique des langues et recherche d'information“. Phd thesis, Université Rennes 1, 2006. http://tel.archives-ouvertes.fr/tel-00524514.
Bardet, Adrien. „Architectures neuronales multilingues pour le traitement automatique des langues naturelles“. Thesis, Le Mans, 2021. http://www.theses.fr/2021LEMA1002.
The translation of languages has become an essential need for communication between humans in a world where the possibilities of communication are expanding. Machine translation is a response to this evolving need. More recently, neural machine translation has come to the fore with the great performance of neural systems, opening up a new area of machine learning. Neural systems use large amounts of data to learn how to perform a task automatically. In the context of machine translation, the sometimes large amounts of data needed to learn efficient systems are not always available for all languages. The use of multilingual systems is one solution to this problem. Multilingual machine translation systems make it possible to translate several languages within the same system. They allow languages with little data to be learned alongside languages with more data, thus improving the performance of the translation system. This thesis focuses on multilingual machine translation approaches to improve performance for languages with limited data. I have worked on several multilingual translation approaches based on different transfer techniques between languages. The different approaches proposed, as well as additional analyses, have revealed the impact of the relevant criteria for transfer. They also show the importance, sometimes neglected, of the balance of languages within multilingual approaches.
Lê, Viêt Bac. „Reconnaissance automatique de la parole pour des langues peu dotées“. Université Joseph Fourier (Grenoble), 2006. http://www.theses.fr/2006GRE10061.
Nowadays, computers are heavily used to communicate via text and speech. Text processing tools, electronic dictionaries, and even more advanced systems like text-to-speech or dictation are readily available for several languages. There are, however, more than 6,900 languages in the world, and only a small number possess the resources required for the implementation of Human Language Technologies (HLT). Thus, HLT are mostly concerned with languages for which large resources are available or which have suddenly become of interest because of the economic or political scene. On the contrary, languages from developing countries or minorities have received less attention in the past years. One way of bridging this "language divide" is to do more research on the portability of HLT for multilingual applications. Among HLT, we are particularly interested in Automatic Speech Recognition (ASR). Therefore, we are interested in new techniques and tools for the rapid development of ASR systems for under-resourced languages or π-languages, when only limited resources are available. These languages are typically spoken in developing countries, but can nevertheless have many speakers. In this work, we investigate Vietnamese and Khmer, which are respectively spoken by 67 million and 13 million people, but for which speech processing services do not exist at all. Firstly, given the statistical nature of the methods used in ASR, a large amount of resources (vocabularies, text corpora, transcribed speech corpora, phonetic dictionaries) is crucial for building an ASR system for a new language. Concerning text resources, a new methodology for fast text corpus acquisition for π-languages is proposed and applied to Vietnamese and Khmer. Some specific problems in text acquisition and text processing for π-languages, such as text normalization, text segmentation and text filtering, are resolved. For the fast development of text processing tools for a new π-language, an open-source generic toolkit named CLIPS-Text-Tk was developed during this thesis. Secondly, for acoustic modeling, we particularly address the use of acoustic-phonetic unit similarities for the portability of multilingual acoustic models to new languages. Notably, a method for estimating the similarity between two phonemes is first proposed. Based on these phoneme similarities, estimation methods for polyphone similarity and clustered polyphonic model similarity are investigated. For a new language, a source/target acoustic-phonetic unit mapping table can be constructed with these similarity measures. Then, clustered models in the target language are duplicated from the nearest clustered models in the source language and adapted to the target language with limited data. Results obtained for Vietnamese demonstrate the feasibility and efficiency of these methods. Grapheme-based acoustic modeling, which avoids building a pronunciation dictionary, is also investigated in our work. Finally, our whole methodology is applied to design a Khmer ASR system, which reaches 70% word accuracy and was developed in only five months.
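To illustrate the idea of a source/target phoneme mapping table mentioned in this abstract, here is a deliberately crude sketch that maps a target-language phoneme to its most similar source-language phoneme by Jaccard overlap of articulatory features. The feature sets are invented, and the thesis estimates similarities differently (from the acoustic models themselves); this is only meant to show what such a mapping table looks like.

# Sketch: nearest-source-phoneme mapping by articulatory-feature overlap.
FEATURES = {  # hypothetical source-language phoneme inventory
    "p": {"bilabial", "stop", "voiceless"},
    "b": {"bilabial", "stop", "voiced"},
    "t": {"alveolar", "stop", "voiceless"},
    "d": {"alveolar", "stop", "voiced"},
}
TARGET = {"ɗ": {"alveolar", "stop", "voiced", "implosive"}}  # target phoneme

def jaccard(a, b):
    return len(a & b) / len(a | b)

mapping = {t: max(FEATURES, key=lambda s: jaccard(feats, FEATURES[s]))
           for t, feats in TARGET.items()}
print(mapping)  # {'ɗ': 'd'} — the implosive borrows the plain voiced stop's model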
Moreau, Fabienne, and Pascale Sébillot. „Revisiter le couplage traitement automatique des langues et recherche d'information“. [S.l.] : [s.n.], 2006. ftp://ftp.irisa.fr/techreports/theses/2006/moreau.pdf.
Manad, Otman. „Nettoyage de corpus web pour le traitement automatique des langues“. Thesis, Paris 8, 2018. http://www.theses.fr/2018PA080011.
Corpora are the main material of computational linguistics and natural language processing. Few languages have corpora made from web resources (forums, blogs, etc.), even among those that do not have other resources. Web resources contain a lot of noise (menus, ads, etc.), and filtering boilerplate and repetitive data requires large-scale manual cleaning by the researcher. This thesis presents an automatic system that constructs web corpora with a low level of noise. It consists of three modules: (a) one for building corpora in any language and for any type of data, intended to be collaborative and to preserve corpus history; (b) one for crawling web forums and blogs; (c) one for extracting relevant data using clustering techniques with different distances, based on the structure of the web page. The system is evaluated in terms of the efficacy of noise filtering and of computing time. Our experiments, made on four languages, are evaluated using our own gold-standard corpus. To measure quality, we use recall, precision and F-measure. Feature distance and Jaro distance give the best results, but not in the same contexts, feature distance having the best average quality. We compare our method with three methods dealing with the same problem: Nutch, BootCat and JusText. The performance of our system is better as regards extraction quality, even if Nutch and BootCat dominate in computing time.
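Since the abstract singles out the Jaro distance among the best-performing measures, here is a self-contained sketch of the standard Jaro similarity for reference. It is shown on plain strings; applying it to representations of web-page structure, as the thesis does, is an assumption about context.

# Sketch: standard Jaro similarity between two strings (1.0 = identical).
def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    matched1, matched2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):  # count characters matching within the window
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, j = 0, 0  # count half-transpositions among matched characters
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                t += 1
            j += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

print(round(jaro("MARTHA", "MARHTA"), 4))  # 0.9444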
Mammadova, Nayiba. „Eléments de description et documentation du tat de l'Apshéron, langue iranienne d'Azerbaïdjan“. Thesis, Sorbonne Paris Cité, 2017. http://www.theses.fr/2017USPCF016/document.
This thesis is a descriptive grammar of Tat (an Iranian language of the South-Western branch) as spoken on the Absheron Peninsula, east of Baku in the Republic of Azerbaijan. It is the first description of a Muslim variety of Tat in a Western European language. After a detailed introduction outlining the sociolinguistic context and the phonology, the present study discusses the parts of speech, the marking of grammatical relations and the verbal morphology of Absheron Tat (verbal derivation, verb classes, complex predicates, formation and use of inflected verb forms). This is followed by a survey of complex sentences, viz. relative clauses, complement clauses, adverbial subordinates, as well as coordination. The present work adopts a typological point of view and is based on the analysis of texts originating from the author's fieldwork and tales translated from Azeri into Tat, in addition to the author's competence as a native speaker. The appendix presents samples of the text corpus (some of them also translated) and a glossary listing items that feature in the grammatical description and the texts.
Bouamor, Houda. „Etude de la paraphrase sous-phrastique en traitement automatique des langues“. Phd thesis, Université Paris Sud - Paris XI, 2012. http://tel.archives-ouvertes.fr/tel-00717702.
Bentes Pinto, Virginia. „La représentation des connaissances dans le contexte de la documentation technique : proposition d'un modèle d'indexation“. Grenoble 3, 1999. http://www.theses.fr/1999GRE39018.
Dimon, Pierre. „Un système multilingual d'interprétation automatique : étape du sous-logiciel "analyse" pour les langues germaniques“. Metz, 1994. http://docnum.univ-lorraine.fr/public/UPV-M/Theses/1994/Dimon.Pierre.LMZ945_1.pdf.
In part one of the thesis, the reader is first reminded of the language models underlying the grammars from which systems for the automatic processing of languages borrow, and second of the computing aids that make applications possible. A vast survey of the machine translation and computer-assisted translation systems built from the early beginnings up to 1991 illustrates the developments in connection with translating. In counterpart to the limits of the present systems, in part two of this thesis another path is laid down, whose basis is the following hypothesis: is it possible, with a minimum quality of the target text, for a reader (a specialist of the area who, however, is not familiar with the language of the source text) to recreate its meaning through implicit comprehension? Hyperanalysis applies to the whole of the text. The local hypersyntactic module explores everything that introduces an object, defines it, names it (…)
Filhol, Michael. „Modèle descriptif des signes pour un traitement automatique des langues des signes“. Phd thesis, Université Paris Sud - Paris XI, 2008. http://tel.archives-ouvertes.fr/tel-00300591.
Dégremont, Jean-François. „Ethnométhodologie et innovation technologique : le cas du traitement automatique des langues naturelles“. Paris 7, 1989. http://www.theses.fr/1989PA070043.
The thesis begins with a short historical overview of ethnomethodology, considered as a scientific field, from its very beginnings in the 1930s until the 1967 explosion in the US and Europe. The first part is an explication of the main concepts of ethnomethodology. They are developed from the theoretical point of view of the "pariseptist" school, which tries to combine the strongest refusal of induction with the indifference principle, mainly when natural languages, considered both as objects of study and as communication tools, are used. The second part of the thesis is devoted to the concrete application of these theoretical concepts in the field of the technological strategies which have been elaborated in France in the area of natural language processing. Three studies successively describe the ethnomethods and rational properties of practical activities which are used in an administrative team, the elaboration of a technology policy, and indexical descriptions of the language industry field. The conclusion tries to show how the concepts and methods developed by ethnomethodology can increase, in this field, the efficacy of strategic analysis and the quality of research and development programs.
Kim, Haksoo. „Structure syntaxique et structure informative (pour une analyse automatique des langues naturelles)“. Aix-Marseille 1, 1995. http://www.theses.fr/1996AIX10070.
This thesis aims at the automatic syntactic and informative analysis of French sentences by computer. This analysis must identify the syntactic functions and the informative structure of the message. To do this, one must specify the following elements of linguistic theory: immediate constituent analysis, the two types of structures (exocentric and endocentric), the rewriting rules, the pertinent outlines, as well as the formalisation of the informative and syntactic functions. These theoretical elements allow us to develop a system in Turbo Prolog, called "ASIA", which could serve as a basis for the global automatic processing of natural language.
Millour, Alice. „Myriadisation de ressources linguistiques pour le traitement automatique de langues non standardisées“. Thesis, Sorbonne université, 2020. http://www.theses.fr/2020SORUL126.
Citizen science, in particular voluntary crowdsourcing, represents a little-explored solution for producing language resources for languages which are still under-resourced despite the presence of sufficient speakers online. We present in this work the experiments we have conducted to enable the crowdsourcing of linguistic resources for the development of automatic part-of-speech annotation tools. We have applied the methodology to three non-standardised languages, namely Alsatian, Guadeloupean Creole and Mauritian Creole. For different historical reasons, multiple (ortho)graphic practices coexist for these three languages. The difficulties caused by the presence of this variation phenomenon led us to propose various crowdsourcing tasks that allow the collection of raw corpora, part-of-speech annotations, and graphic variants. The intrinsic and extrinsic analysis of these resources, used for the development of automatic annotation tools, shows the interest of using crowdsourcing in a non-standardised linguistic framework: the participants are not seen in this context as a uniform set of contributors whose cumulative efforts allow the completion of a particular task, but rather as a set of holders of complementary knowledge. The resources they collectively produce make possible the development of tools that embrace the variation. The platforms developed, the language resources, as well as the trained tagger models, are freely available.
Hamon, Olivier. „Vers une architecture générique et pérenne pour l'évaluation en traitement automatique des langues : spécifications, méthodologies et mesures“. Paris 13, 2010. http://www.theses.fr/2010PA132022.
The development of Natural Language Processing (NLP) systems requires determining the quality of their results. Whether aiming to compare several systems to each other or to identify both the strong and weak points of an isolated system, evaluation implies defining precisely, and for each particular context, a methodology, a protocol, language resources (data needed for both system training and testing) and even evaluation measures and metrics. Only under these conditions is system improvement possible, so as to obtain more reliable and easier-to-exploit results. The contribution of evaluation to NLP is important due to the creation of new language resources, the homogenisation of formats for the data used, and the promotion of a technology or a system. However, evaluation requires considerable manual work, whether to formulate human judgments or to manage the evaluation procedure. This compromises the evaluation's reliability, increases costs and makes it harder to reproduce. We have tried to reduce and delimit those manual interventions. To do so, we have supported our work by either conducting or participating in evaluation campaigns where systems are compared to each other, or where isolated systems are evaluated. The management of the evaluation procedure has been formalised in this work and its different phases have been listed so as to define a common evaluation framework, understandable by all. The main point of those evaluation phases regards quality measurement through the usage of metrics. Three consecutive studies have been carried out on human measures, on automatic measures and the automation of quality computation, and on the meta-evaluation of the measures so as to evaluate their reliability. Moreover, evaluation measures use language resources whose practical and administrative aspects must be taken into account: their creation, standardisation, validation, impact on the results, costs of production and usage, identification, and legal issues. In that context, the study of the similarities between the technologies and between their evaluations has allowed us to highlight their common features and classify them. This has helped us show that a small set of measures covers a wide range of applications for different technologies. Our final goal has been to define a generic evaluation architecture, adaptable to different NLP technologies and sustainable, namely allowing the reuse of language resources, measures or methods over time. Our proposal has been built on the conclusions drawn from the previous steps, with the objective of integrating the evaluation phases into our architecture and incorporating the evaluation measures, all while bearing in mind the place of language resource usage. The definition of this architecture has been done with the aim of fully automating the evaluation management work, regardless of whether this concerns an evaluation campaign or the evaluation of an isolated system. Following initial experiments, we have designed an evaluation architecture taking into account all the constraints found, as well as using Web services. The latter provide the means to interconnect architecture components and make them accessible through the Internet.
Moreau, Erwan. „Acquisition de grammaires lexicalisées pour les langues naturelles“. Phd thesis, Université de Nantes, 2006. http://tel.archives-ouvertes.fr/tel-00487042.
Namer, Fiammetta. „Pronominalisation et effacement du sujet en génération automatique de textes en langues romanes“. Paris 7, 1990. http://www.theses.fr/1990PA077249.
Der volle Inhalt der QuelleBourgeade, Tom. „Interprétabilité a priori et explicabilité a posteriori dans le traitement automatique des langues“. Thesis, Toulouse 3, 2022. http://www.theses.fr/2022TOU30063.
With the advent of Transformer architectures in Natural Language Processing a few years ago, we have observed unprecedented progress in various text classification and generation tasks. However, the explosion in the number of parameters and the complexity of these state-of-the-art black-box models is making ever more apparent the now urgent need for transparency in machine learning approaches. The ability to explain, interpret, and understand algorithmic decisions will become paramount as computer models become more and more present in our everyday lives. Using eXplainable AI (XAI) methods, we can for example diagnose dataset biases: spurious correlations which can ultimately taint the training process of models, leading them to learn undesirable shortcuts, which could lead to unfair, incomprehensible, or even risky algorithmic decisions. These failure modes of AI may ultimately erode the trust humans may have otherwise placed in beneficial applications. In this work, we more specifically explore two major aspects of XAI, in the context of Natural Language Processing tasks and models. In the first part, we approach the subject of intrinsic interpretability, which encompasses all methods which are inherently easy to produce explanations for. In particular, we focus on word embedding representations, which are an essential component of practically all NLP architectures, allowing these mathematical models to process human language in a more semantically rich way. Unfortunately, many of the models which generate these representations produce them in a way which is not interpretable by humans. To address this problem, we experiment with the construction and usage of Interpretable Word Embedding models, which attempt to correct this issue by using constraints which enforce interpretability on these representations. We then make use of these, in a simple but effective novel setup, to attempt to detect lexical correlations, spurious or otherwise, in some popular NLP datasets. In the second part, we explore post-hoc explainability methods, which can target already trained models and attempt to extract various forms of explanations of their decisions. These can range from diagnosing which parts of an input were the most relevant to a particular decision, to generating adversarial examples, which are carefully crafted to help reveal weaknesses in a model. We explore a novel type of approach, in part allowed by the highly performant but opaque recent Transformer architectures: instead of using a separate method to produce explanations of a model's decisions, we design and fine-tune an architecture which jointly learns both to perform its task and to produce free-form Natural Language Explanations of its own outputs. We evaluate our approach on a large-scale dataset annotated with human explanations, and qualitatively judge some of our approach's machine-generated explanations.
Mauger, Serge. „L'interpretation des messages enigmatiques. Essai de semantique et de traitement automatique des langues“. Caen, 1999. http://www.theses.fr/1999CAEN1255.
Oedipus, the character in Sophocles' tragedy, solves the Sphinx's enigma by "his own intelligence". This is the starting point of a general reflection on the linguistic status of language games, the practice of which can be seen throughout all periods and in all cultures. Oedipus's intelligence is based on a capacity for "calculating" the interpretation of the enigma by giving up inductive reasoning (by recurrence) so as to adopt analogical reasoning instead. In the second part, it is shown that the calculation of the meaning of polysemous messages enables us to propose a pattern of combinatory analysis which is a tool for automatic language processing, able to help calculate riddles and to interpret coded definitions of crosswords. This pattern is used as a touchstone for an analysis of the semantic structures underlying interpretations and shows which lexical items are concerned by isotopy. Isotopy is not in that case considered to be an element of the message but a process of the interpretation. The whole approach is thus based on interpretative semantics. The third part is the development of the reflection, including the treatment of enigmatic messages in the issues of man-machine dialogue (MMD), which enables us to deal with the ambiguities of some utterances and to understand "strange messages" on the basis of propositions of interpretation extrapolated from the pattern. Then, little by little, we analyse the calculation performed by the receiver of messages as an activity which consists in analysing graphematic and acoustic signs. Taking the signs into account is a confrontation with what is expected in the linguistic system, and it enables us to carry out a series of decisions leading to the identification of a coherent analysis. This coherence and this analysis are compared to the approach adopted when "reading" an anamorphosis (in painting) or when decoding the organisation rules of sequences of cards in the game of Eleusis. We find a similar approach when we have to interpret the "scriptio continua" of paleographic inscriptions, a technique which serves as a basis for some constrained literary experiments and for hidden puns.
Dubé, Martine. „Étude terminologique et analyse des modes de formation de 50 notions sur le traitement automatique des langues naturelles /“. Thèse, Québec : Université Laval, École des gradués, 1990. http://theses.uqac.ca.
Der volle Inhalt der Quelle
"Thesis presented for the degree of Master of Arts (M.A.) under an agreement between Université Laval and the Université du Québec à Chicoutimi." Bibliography: leaves 137-141. Also available electronically in PDF format.
Knyazeva, Elena. „Apprendre par imitation : applications à quelques problèmes d'apprentissage structuré en traitement des langues“. Thesis, Université Paris-Saclay (ComUE), 2018. http://www.theses.fr/2018SACLS134/document.
Der volle Inhalt der Quelle
Structured learning has become ubiquitous in Natural Language Processing; a multitude of applications, such as personal assistants, machine translation and speech recognition, to name just a few, rely on such techniques. The structured learning problems that must now be solved are becoming increasingly complex and require an increasing amount of information at different linguistic levels (morphological, syntactic, etc.). It is therefore crucial to find the best trade-off between the degree of modelling detail and the exactness of the inference algorithm. Imitation learning aims to perform approximate learning and inference in order to better exploit richer dependency structures. In this thesis, we explore the use of this specific learning setting, in particular using the SEARN algorithm, both from a theoretical perspective and in terms of practical applications to Natural Language Processing tasks, especially complex tasks such as machine translation. Concerning the theoretical aspects, we introduce a unified framework for different imitation learning algorithm families, allowing us to review and simplify the convergence properties of the algorithms. With regard to the more practical application of our work, we use imitation learning first to experiment with free-order sequence labelling and secondly to explore two-step decoding strategies for machine translation
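For readers unfamiliar with SEARN, here is a deliberately schematic imitation-learning loop for sequence labelling, not the thesis's implementation. Real SEARN builds cost-sensitive examples by rolling out every alternative action and interpolates policies stochastically; in this hypothetical Python sketch a lookup table stands in for the classifier, and the data format is an assumption.

import random

def train_searn_like(sentences, n_iterations=3, beta=0.8):
    # `sentences` is a list of (words, gold_tags) pairs; the expert
    # policy is simply the gold annotation. Keeps only the SEARN
    # skeleton: roll in, collect expert actions, retrain, interpolate.
    learned = {}                 # state -> action, stand-in classifier
    use_expert = 1.0             # probability of rolling in with the expert

    for _ in range(n_iterations):
        new_examples = {}
        for words, gold in sentences:
            prev = "<s>"
            for i, expert_action in enumerate(gold):
                state = (words[i], prev)          # minimal features
                new_examples[state] = expert_action
                if random.random() < use_expert:  # roll in with the mixture
                    prev = expert_action
                else:
                    prev = learned.get(state, expert_action)
        learned.update(new_examples)              # 'train' the classifier
        use_expert *= (1.0 - beta)                # hand control over to it
    return learned

data = [("the cat sleeps".split(), ["DET", "NOUN", "VERB"])]
tagger = train_searn_like(data)
print(tagger[("cat", "DET")])    # NOUN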
Papy, Fabrice. „Hypertextualisation automatique de documents techniques“. Paris 8, 1995. http://www.theses.fr/1995PA081014.
Der volle Inhalt der Quelle
Automatic hypertextualization, an empirical process leading to hypertext, uses sequential technical documents typed with word-processing software to create dynamically the nodes and links of hypertext networks. The node-extraction phase uses the physical structure to detect the logical entities within documents. Referential links (especially cross-references), whose syntax is defined by the author, are extracted by means of a parser relying on a generic definition of a cross-reference grammar. Automatic hypertextualization produces a hypertext meta-network in which document updates may corrupt the coherence of nodes and links. As relational database management systems have proved their efficiency at preserving data integrity, we propose a relational normalization of hypertextualized documents in order to manage the updating of referential links. The growth of the mass of information is another outcome of the automatic creation of hypertext networks, as it accentuates disorientation problems and cognitive overhead. One solution consists in coupling the hypertextualization process with an automatic indexing system, which would associate each node with a set of relevant terms representing its content. Readers would thus have not only structural navigation mechanisms but also semantic browsing capabilities
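As an illustration of the node and link construction described above, here is a minimal, hypothetical sketch in Python: a single regular expression stands in for the author-defined cross-reference grammar, and the `sections` mapping assumes the logical structure has already been recovered from the physical structure of the document.

import re

# Illustrative stand-in for a configurable cross-reference grammar.
XREF = re.compile(r"(?:see|cf\.)\s+section\s+(\d+(?:\.\d+)*)", re.IGNORECASE)

def hypertextualize(sections):
    # `sections` maps a section number (e.g. '2.1') to its text.
    nodes = dict(sections)
    links = []
    for num, text in sections.items():
        for match in XREF.finditer(text):
            target = match.group(1)
            if target in nodes:      # referential-integrity check
                links.append((num, target))
    return nodes, links

doc = {"1": "Introduction, see section 2.1 for the protocol.",
       "2.1": "Protocol details (cf. Section 1)."}
print(hypertextualize(doc)[1])       # [('1', '2.1'), ('2.1', '1')]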
Djematene, Ahmed. „Un système de lecture automatique de l'écriture Berbère“. Le Havre, 1998. http://www.theses.fr/1998LEHA0008.
Der volle Inhalt der Quelle
Stroppa, Nicolas. „Définitions et caractérisations de modèles à base d'analogies pour l'apprentissage automatique des langues naturelles“. PhD thesis, Télécom ParisTech, 2005. http://tel.archives-ouvertes.fr/tel-00145147.
Der volle Inhalt der Quelle
In the context of machine learning over linguistic data, alternative inferential models have been proposed that call into question the abstraction principle embodied in rules or probabilistic models. In this view, linguistic knowledge remains implicitly represented in the accumulated corpus. In Machine Learning, methods following the same principles are grouped under the heading of "lazy" learning. These methods generally rest on the following learning bias: if an object Y is "close" to an object X, then its analysis f(Y) is a good candidate for f(X). While this hypothesis is justified for the applications usually addressed in Machine Learning, the structured nature and the paradigmatic organisation of linguistic data suggest a slightly different approach. To account for this specificity, we study a model based on the notion of "analogical proportion". In this model, the analysis f(T) of a new object T is obtained by identifying an analogical proportion with already known objects X, Y and Z. The analogical hypothesis thus postulates that if X : Y :: Z : T, then f(X) : f(Y) :: f(Z) : f(T). To infer f(T) from the known f(X), f(Y) and f(Z), one solves the "analogical equation" with unknown I: f(X) : f(Y) :: f(Z) : I. (A toy illustration of this resolution step over strings is sketched after this abstract.)
In the first part of this work, we present a study of this analogical proportion model within a more general framework that we call "learning by analogy". This framework is instantiated in a number of contexts: in cognitive science, it corresponds to analogical reasoning, an essential faculty at the heart of many cognitive processes; in traditional linguistics, it underpins a number of mechanisms such as analogical creation, opposition and commutation; in machine learning, it corresponds to the family of lazy learning methods. This perspective sheds light on the nature of the model and its underlying mechanisms.
The second part of our work proposes a unified algebraic framework defining the notion of analogical proportion. Starting from a model of analogical proportion between strings of symbols, elements of a free monoid, we present an extension to the more general case of semigroups. This generalisation leads directly to a definition that is valid for every set deriving from a semigroup structure, thus allowing analogical proportions to be modelled between the usual representations of linguistic entities: strings of symbols, trees, feature structures and finite languages. Algorithms suited to processing analogical proportions between such structured objects are presented. We also suggest some directions for enriching the model so that it can be used in more complex cases.
The inferential model under study, motivated by the needs of Natural Language Processing, is then explicitly cast as a Machine Learning method. This formalisation highlights several of its characteristic features. A notable particularity of the model lies in its capacity to handle structured objects both as input and as output, whereas the classical classification task generally assumes an output space consisting of a finite set of classes. We then show how to express the learning bias of the method by introducing the notion of analogical extension. Finally, we conclude by presenting experimental results obtained by applying our model to several Natural Language Processing tasks: orthographic/phonetic transcription, inflectional analysis and derivational analysis.
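As a complement to the abstract above, here is a toy illustration of solving an analogical equation X : Y :: Z : I over strings. The thesis defines analogical proportions algebraically (free monoids, semigroups, trees, feature structures) and presents general algorithms; this naive Python sketch, an assumption-laden simplification, only handles a single shared affix alternation.

def solve_analogy(x, y, z):
    # Solve x : y :: z : i for strings, naively: take the longest common
    # prefix and suffix of x and y, treat the middle alternation as the
    # transformation, and apply it once to z. Returns None on failure.
    p = 0
    while p < min(len(x), len(y)) and x[p] == y[p]:
        p += 1
    s = 0
    while s < min(len(x), len(y)) - p and x[len(x)-1-s] == y[len(y)-1-s]:
        s += 1
    x_mid = x[p:len(x)-s]            # what x shows in the variable slot
    y_mid = y[p:len(y)-s]            # what y shows instead
    if not x_mid:                    # pure insertion, e.g. walk -> walked
        cut = min(p, len(z))
        return z[:cut] + y_mid + z[cut:]
    if x_mid in z:
        return z.replace(x_mid, y_mid, 1)
    return None

print(solve_analogy("walk", "walked", "jump"))    # jumped
print(solve_analogy("vider", "vidons", "lier"))   # lions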
Charnois, Thierry. „Accès à l'information : vers une hybridation fouille de données et traitement automatique des langues“. Habilitation à diriger des recherches, Université de Caen, 2011. http://tel.archives-ouvertes.fr/tel-00657919.
Der volle Inhalt der Quelle
Stroppa, Nicolas. „Définitions et caractérisations de modèles à base d'analogies pour l'apprentissage automatique des langues naturelles /“. Paris : École nationale supérieure des télécommunications, 2006. http://catalogue.bnf.fr/ark:/12148/cb40129220d.
Der volle Inhalt der QuelleDIMON, PIERRE David Jean. „UN SYSTEME MULTILINGUAL D'INTERPRETATION AUTOMATIQUE. ETAPE DU SOUS-LOGICIEL "ANALYSE" POUR LES LANGUES GERMANIQUES /“. [S.l.] : [s.n.], 1994. ftp://ftp.scd.univ-metz.fr/pub/Theses/1994/Dimon.Pierre.LMZ945_1.pdf.
Der volle Inhalt der Quelle