Dissertations / Theses on the topic 'Traitement du Langage Naturel (NLP)'
Moncla, Ludovic. "Automatic Reconstruction of Itineraries from Descriptive Texts." Thesis, Pau, 2015. http://www.theses.fr/2015PAUU3029/document.
This PhD thesis is part of the PERDIDO research project, which aims at extracting and retrieving displacements from textual documents. This work was conducted in collaboration with the LIUPPA laboratory of the University of Pau (France), the IAAA team of the University of Zaragoza (Spain) and the COGIT laboratory of IGN (France). The objective of this PhD is to propose a method for establishing a processing chain to support the geoparsing and geocoding of text documents describing events strongly linked with space. We propose an approach for the automatic geocoding of itineraries described in natural language. Our proposal is divided into two main tasks. The first task aims at identifying and extracting information describing the itinerary in texts, such as spatial named entities and expressions of displacement or perception. The second task deals with the reconstruction of the itinerary. Our proposal combines local information extracted using natural language processing and physical features extracted from external geographical sources such as gazetteers or datasets providing digital elevation models. The geoparsing part is a Natural Language Processing approach which combines part-of-speech and syntactico-semantic patterns (a cascade of transducers) for the annotation of spatial named entities and expressions of displacement or perception. The main contribution in the first task of our approach is toponym disambiguation, which represents an important issue in Geographical Information Retrieval (GIR). We propose an unsupervised geocoding algorithm that takes advantage of clustering techniques to disambiguate the toponyms found in gazetteers and, at the same time, to estimate the spatial footprint of fine-grained toponyms not found in gazetteers.
We propose a generic graph-based model for the automatic reconstruction of itineraries from texts, where each vertex represents a location and each edge represents a path between locations. Our model is original in that, in addition to taking into account the classic elements (paths and waypoints), it can represent the other elements describing an itinerary, such as features seen or mentioned as landmarks. To build this graph-based representation of the itinerary automatically, our approach computes an informed spanning tree on a weighted graph. Each edge of the initial graph is weighted using a multi-criteria analysis approach combining qualitative and quantitative criteria. Criteria are based on information extracted from the text and information extracted from geographical sources. For instance, we compare information given in the text, such as spatial relations describing orientation (e.g., going south), with the geographical coordinates of locations found in gazetteers. Finally, according to the definition of an itinerary and the information used in natural language to describe itineraries, we propose a markup language for encoding spatial and motion information based on the Text Encoding Initiative (TEI) guidelines, which define a standard for the representation of texts in digital form. Additionally, the rationale of the proposed approach has been verified with a set of experiments on a corpus of multilingual hiking descriptions (French, Spanish and Italian).
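The edge-weighting and spanning-tree step described in this abstract can be illustrated with a minimal Python sketch. It is not the thesis's informed spanning tree: it assumes a plain weighted sum of two criteria (distance and an orientation penalty) and uses Kruskal's minimum spanning tree as a stand-in. The place names, coordinates, the `SOUTH_HINTS` cue and the `alpha`/`beta` weights are all invented for illustration.

```python
from itertools import combinations
from math import hypot

# Hypothetical waypoints with planar (x, y) coordinates in km; not taken from the thesis.
PLACES = {"hut": (0.0, 0.0), "lake": (0.0, -2.0), "pass": (1.0, -5.0), "summit": (4.0, -5.0)}

# Hypothetical cue extracted from a text: "from the hut, go south to the lake".
SOUTH_HINTS = {("hut", "lake")}


def edge_weight(a, b, alpha=1.0, beta=3.0):
    """Combine a quantitative criterion (distance) with a qualitative one (orientation)."""
    (xa, ya), (xb, yb) = PLACES[a], PLACES[b]
    distance = hypot(xb - xa, yb - ya)
    # Penalize an edge when the text says "south" but the target is not further south.
    contradicts = ((a, b) in SOUTH_HINTS and yb >= ya) or ((b, a) in SOUTH_HINTS and ya >= yb)
    return alpha * distance + (beta if contradicts else 0.0)


def minimum_spanning_tree(nodes):
    """Kruskal's algorithm with a small union-find structure."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    tree = []
    edges = sorted(combinations(sorted(nodes), 2), key=lambda e: edge_weight(*e))
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # adding this edge does not create a cycle
            parent[root_a] = root_b
            tree.append((a, b))
    return tree


itinerary_edges = minimum_spanning_tree(PLACES)
```

With these toy values the tree links the hut to the lake, the lake to the pass and the pass to the summit. A real system would presumably draw the criteria from gazetteers and elevation data, as the abstract describes.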
Lauly, Stanislas. "Exploration des réseaux de neurones à base d'autoencodeur dans le cadre de la modélisation des données textuelles." Thèse, Université de Sherbrooke, 2016. http://hdl.handle.net/11143/9461.
Bourgeade, Tom. "Interprétabilité a priori et explicabilité a posteriori dans le traitement automatique des langues." Thesis, Toulouse 3, 2022. http://www.theses.fr/2022TOU30063.
With the advent of Transformer architectures in Natural Language Processing a few years ago, we have observed unprecedented progress in various text classification and generation tasks. However, the explosion in the number of parameters and the complexity of these state-of-the-art black-box models are making ever more apparent the now urgent need for transparency in machine learning approaches. The ability to explain, interpret, and understand algorithmic decisions will become paramount as computer models become more and more present in our everyday lives. Using eXplainable AI (XAI) methods, we can for example diagnose dataset biases and spurious correlations, which can ultimately taint the training process of models, leading them to learn undesirable shortcuts, which could lead to unfair, incomprehensible, or even risky algorithmic decisions. These failure modes of AI may ultimately erode the trust humans might otherwise have placed in beneficial applications. In this work, we more specifically explore two major aspects of XAI, in the context of Natural Language Processing tasks and models: in the first part, we approach the subject of intrinsic interpretability, which encompasses all methods for which explanations are inherently easy to produce. In particular, we focus on word embedding representations, which are an essential component of practically all NLP architectures, allowing these mathematical models to process human language in a more semantically rich way. Unfortunately, many of the models which generate these representations produce them in a way which is not interpretable by humans. To address this problem, we experiment with the construction and usage of Interpretable Word Embedding models, which attempt to correct this issue by using constraints which enforce interpretability on these representations.
We then make use of these, in a simple but effective novel setup, to attempt to detect lexical correlations, spurious or otherwise, in some popular NLP datasets. In the second part, we explore post-hoc explainability methods, which can target already trained models and attempt to extract various forms of explanations of their decisions. These can range from diagnosing which parts of an input were the most relevant to a particular decision, to generating adversarial examples, which are carefully crafted to help reveal weaknesses in a model. We explore a novel type of approach, made possible in part by the high-performing but opaque recent Transformer architectures: instead of using a separate method to produce explanations of a model's decisions, we design and fine-tune an architecture which jointly learns to perform its task and to produce free-form Natural Language Explanations of its own outputs. We evaluate our approach on a large-scale dataset annotated with human explanations, and qualitatively judge some of our approach's machine-generated explanations.
Michalon, Olivier. "Modèles statistiques pour la prédiction de cadres sémantiques." Thesis, Aix-Marseille, 2017. http://www.theses.fr/2017AIXM0221/document.
In natural language processing, each analysis step has improved the way in which language can be modeled by machines. One step of analysis that is still poorly mastered is semantic parsing. This type of analysis can provide information which would allow for many advances, such as better human-machine interactions or more reliable translations. There exist several types of meaning representation structures, such as PropBank, AMR and FrameNet. FrameNet corresponds to the frame semantic framework whose theory has been described by Charles Fillmore (1971). In this theory, each prototypical situation and each of the different elements involved are represented in such a way that two similar situations are represented by the same object, called a semantic frame. The work that we describe here follows the work already developed for machine prediction of frame semantic representations. We present four prediction systems, each of which allowed us to validate a further hypothesis on the properties necessary for effective prediction. We show that semantic parsing can also be improved by providing prediction models with refined information as input to the system: firstly, a syntactic analysis where deep links are made explicit and, secondly, vector representations of the vocabulary learned beforehand.
Cousot, Kévin. "Inférences et explications dans les réseaux lexico-sémantiques." Thesis, Montpellier, 2019. http://www.theses.fr/2019MONTS108.
Thanks to the democratization of new communication technologies, there is a growing quantity of textual resources, making Automatic Natural Language Processing (NLP) a discipline of crucial importance both scientifically and industrially. Easily available, these data offer unprecedented opportunities and, from opinion analysis to information retrieval and semantic text analysis, there are many applications. However, this textual data cannot be easily exploited in its raw state and, in order to carry out such tasks, it seems essential to have resources describing semantic knowledge, particularly in the form of lexico-semantic networks such as that of the JeuxDeMots project. However, the constitution and maintenance of such resources remain difficult operations, due to their large size but also because of problems of polysemy and semantic identification. Moreover, their use can be tricky because a significant part of the necessary information is not directly accessible in the resource but must be inferred from the data of the lexico-semantic network. Our work seeks to demonstrate that lexico-semantic networks are, by their connectionist nature, much more than a collection of raw facts, and that more complex structures such as interpretation paths contain more information and allow multiple inference operations to be performed. In particular, we show how to use a knowledge base to provide explanations for high-level facts. These explanations make it possible, at the very least, to validate and memorize new information. In doing so, we can assess the coverage and relevance of the database data and consolidate it. Similarly, the search for paths is useful for classification and disambiguation problems, as the paths are justifications for the calculated results. In the context of named entity recognition, they also make it possible to type entities and disambiguate them (is the occurrence of the term Paris a reference to the city, and which one, or to a starlet?)
by highlighting the density of connections between ambiguous entities, their context and their possible type. Finally, we propose to turn the large size of the JeuxDeMots network to our advantage to enrich the database with new facts from a large number of comparable examples and by an abduction process on the types of semantic relationships that can connect two given terms. Each inference is accompanied by explanations that can be validated or invalidated, thus providing a learning process.
Manishina, Elena. "Data-driven natural language generation using statistical machine translation and discriminative learning." Thesis, Avignon, 2016. http://www.theses.fr/2016AVIG0209/document.
Humanity has long been passionate about creating intelligent machines that can freely communicate with us in our language. Most modern systems communicating directly with the user share one common feature: they have a dialog system (DS) at their base. As of today, almost all DS components have embraced statistical methods and widely use them as their core models. Until recently, the Natural Language Generation (NLG) component of a dialog system primarily used hand-coded generation templates, which represented model phrases in a natural language mapped to a particular semantic content. Today data-driven models are making their way into the NLG domain. In this thesis, we follow this new line of research and present several novel data-driven approaches to natural language generation. In our work we focus on two important aspects of NLG system development: building an efficient generator and diversifying its output. The two key ideas that we defend here are the following: first, the task of NLG can be regarded as translation between a natural language and a formal meaning representation, and can therefore be performed using statistical machine translation techniques; and second, corpus extension and diversification, which traditionally involved manual paraphrasing and rule crafting, can be performed automatically using well-known and widely used synonym and paraphrase extraction methods. Concerning our first idea, we investigate the possibility of using the NGRAM translation framework and explore the potential of discriminative learning, notably Conditional Random Fields (CRF) models, as applied to NLG; we build a generation pipeline which allows for the inclusion and combination of different generation models (NGRAM and CRF) and which uses an efficient decoding framework (best-path search in finite-state transducers).
Regarding the second objective, namely corpus extension, we propose to enlarge the system's vocabulary and the set of available syntactic structures by integrating automatically obtained synonyms and paraphrases into the training corpus. To our knowledge, there have been no attempts to increase the size of the system vocabulary by incorporating synonyms. To date, most studies on corpus extension have focused on paraphrasing and resorted to crowd-sourcing in order to obtain paraphrases, which then required additional manual validation, often performed by system developers. We show that automatic corpus extension by means of paraphrase extraction and validation is just as effective as crowd-sourcing, while being less costly in terms of development time and resources. During intermediate experiments our generation models showed significantly better performance than the phrase-based baseline model and appeared to be more robust in handling unknown combinations of concepts than the current in-house rule-based generator. The final human evaluation confirmed that our data-driven NLG models are a viable alternative to rule-based generators.
Annouz, Hamid. "Traitement morphologique des unités linguistiques du kabyle à l’aide de logiciel NooJ : Construction d’une base de données." Thesis, Paris, INALCO, 2019. http://www.theses.fr/2019INAL0022.
This work introduces the Kabyle language to the field of Natural Language Processing by providing it with a database for the NooJ software that allows the automatic recognition of linguistic units in a written corpus. We have divided the work into four parts. The first part gives an overview of the history of formal linguistics and presents the field of NLP, the NooJ software and the linguistic units that have been treated. The second part is devoted to the description of the process that has been followed for the treatment and the integration of Kabyle verbs in NooJ. We have built a dictionary that contains 4508 entries and 8762 derived components, together with inflection models for each type, which have been linked with each entry. In the third part, we explain the processing of nouns and other units. For the nouns, we have built a dictionary (3508 entries, 501 derived components) linked to the inflection models, and for the other units a dictionary of 870 entries including adverbs, prepositions, conjunctions, interrogatives, personal pronouns, etc. The second and third parts are completed by examples of applications on a text; this procedure has allowed us to show the ambiguities with various sorts of annotations. The last part is devoted to ambiguities: after having identified a list of various types of amalgams, we have tried to show, with the help of some examples of syntactic grammars, some of the tools used by NooJ for disambiguation.
Neme, Alexis. "An arabic language resource for computational morphology based on the semitic model." Thesis, Paris Est, 2020. http://www.theses.fr/2020PESC2013.
We developed an original approach to traditional Arabic morphology, involving new concepts in Semitic lexicology, morphology, and grammar for standard written Arabic. This new methodology for handling the rich and complex Semitic languages is based on good practices in Finite-State technologies (FSA/FST) by using Unitex, a lexicon-based corpus processing suite. For verbs (Neme, 2011), I proposed an inflectional taxonomy that increases the readability of the lexicon and makes it easier for Arabic speakers and linguists to encode, correct, and update it. Traditional grammar defines inflectional verbal classes by using verbal pattern-classes and root-classes. In our taxonomy, traditional pattern-classes are reused, and root-classes are redefined into a simpler system. The lexicon of verbs covered more than 99% of an evaluation corpus. For nouns and adjectives (Neme, 2013), we went one step further in the adaptation of traditional morphology. First, while this tradition is based on derivational rules, we base our description on inflectional ones. Next, we keep the concepts of root and pattern, which are the backbone of the traditional Semitic model. Still, our breakthrough lies in the reversal of the traditional root-and-pattern Semitic model into a pattern-and-root model, which keeps the set of pattern classes and root sub-classes small and orderly. I elaborated a taxonomy for the broken plural containing 160 inflectional classes, which simplifies the encoding of broken plurals tenfold. Since then, I have elaborated comprehensive resources for Arabic. These resources are described in Neme and Paumier (2019). To take into account all aspects of the rich morphology of Arabic, I have completed our taxonomy with suffixal inflectional classes for regular plurals, adverbs, and other parts of speech (POS) to cover the whole lexicon.
In all, I identified around 1000 Semitic and suffixal inflectional classes implemented with concatenative and non-concatenative FST devices. From scratch, I created 76000 fully vowelized lemmas, each associated with an inflectional class. These lemmas are inflected by using these 1000 FSTs, producing a fully inflected lexicon with more than 6 million forms. I extended this fully inflected resource using agglutination grammars to identify words composed of up to 5 segments, agglutinated around a core inflected verb, noun, adjective, or particle. The agglutination grammars extend the recognition to more than 500 million valid delimited word forms, partially or fully vowelized. The flat file of 6 million forms is 340 megabytes in size (UTF-16). It is then compressed to 11 megabytes before being loaded into memory for fast retrieval. The generation, compression, and minimization of the full-form lexicon take less than one minute on a common Unix laptop. The lexical coverage rate is more than 99%. The tagger speed is 5000 words/second, and more than 200 000 words/second if the resources are preloaded and resident in RAM. The accuracy and speed of our tools result from our systematic linguistic approach and from our choice to embrace the best practices in mathematical and computational methods. The lookup procedure is fast because we use a Minimal Acyclic Deterministic Finite Automaton (Revuz, 1992) to compress the full-form dictionary, and because it has only constant strings and no embedded rules. The breakthrough of our linguistic approach lies principally in the reversal of the traditional root-and-pattern Semitic model into a pattern-and-root model. Nonetheless, our computational approach is based on good practices in Finite-State technologies (FSA/FST), as all the full forms were computed in advance for accurate identification and to get the best from the FSA compression for fast and efficient lookups.
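The idea of looking up precomputed full forms, with no rules applied at lookup time, can be sketched in Python as follows. This is a simplified illustration that uses a plain prefix tree, not the Minimal Acyclic Deterministic Finite Automaton cited in the abstract, which additionally merges equivalent suffix states to reach the reported compression. The transliterated forms and the tags are hypothetical sample entries, not taken from the resource.

```python
class Trie:
    """Prefix tree mapping full word forms to (lemma, tag) analyses."""

    def __init__(self):
        self.children = {}   # next character -> child node
        self.analyses = []   # analyses of the form ending at this node

    def insert(self, form, analysis):
        node = self
        for ch in form:
            node = node.children.setdefault(ch, Trie())
        node.analyses.append(analysis)

    def lookup(self, form):
        """Return all analyses stored for `form`; cost is linear in len(form)."""
        node = self
        for ch in form:
            node = node.children.get(ch)
            if node is None:
                return []
        return list(node.analyses)


# Hypothetical, transliterated sample entries: (full form, lemma, tag).
SAMPLE_FORMS = [
    ("kataba", "kataba", "V"),
    ("katabat", "kataba", "V"),
    ("kutub", "kitaab", "N:pl"),
]

lexicon = Trie()
for form, lemma, tag in SAMPLE_FORMS:
    lexicon.insert(form, (lemma, tag))

plural_analyses = lexicon.lookup("kutub")
unknown_analyses = lexicon.lookup("katab")  # a prefix only, not a stored form
```

Because every inflected form is stored explicitly, a lookup only follows one transition per character. Minimizing the automaton changes the memory footprint, not this lookup logic.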
Mars, Mourad. "Analyse morphologique robuste de l'arabe et applications pédagogiques." Thesis, Grenoble, 2012. http://www.theses.fr/2012GRENL046.
The author did not provide an abstract in English.
Zhou, Rongyan. "Exploration of opportunities and challenges brought by Industry 4.0 to the global supply chains and the macroeconomy by integrating Artificial Intelligence and more traditional methods." Electronic Thesis or Diss., université Paris-Saclay, 2021. http://www.theses.fr/2021UPAST037.
Industry 4.0 is a significant shift and a tremendous challenge for every industrial segment, especially for the manufacturing industry that gave birth to the new industrial revolution. The research first uses literature analysis to organize the literature, and focuses on the use of the “core literature extension method” to enumerate the development directions and application status of different fields, which is intended to play a guiding role for the theory and practice of Industry 4.0. The research then explores the main trend of multi-tier supply in Industry 4.0 by combining machine learning and traditional methods. Next, the research investigates the relationship between Industry 4.0 investment and employment in order to examine the inter-regional dependence of Industry 4.0, so as to present a reasonable clustering based on different criteria and to offer suggestions and analysis of the global supply chain for enterprises and organizations. Furthermore, our analysis system considers the macroeconomy. Combining natural language processing, a branch of machine learning, to classify research topics with a traditional literature review to investigate the multi-tier supply chain significantly improves the study's objectivity and lays a solid foundation for further research. Using complex networks and econometrics to analyze the global supply chain and macroeconomic issues enriches the research methodology at the macro and policy level. This research provides analysis and references to researchers, decision-makers, and companies for their strategic decision-making.
Ramadier, Lionel. "Indexation et apprentissage de termes et de relations à partir de comptes rendus de radiologie." Thesis, Montpellier, 2016. http://www.theses.fr/2016MONTT298/document.
In the medical field, the computerization of health professions and the development of the personal medical file (DMP) result in a fast increase in the volume of digital medical information. The need to convert and manipulate all this information in a structured form is a major challenge. This is the starting point for the development of appropriate tools, for which methods from natural language processing (NLP) seem well suited. The work in this thesis falls within the field of medical document analysis and addresses the issue of the representation of biomedical information (especially in radiology) and access to it. We propose to build a knowledge base dedicated to radiology within a general knowledge base (the lexical-semantic network JeuxDeMots). We show the value of the hypothesis of no separation between different types of knowledge through a document analysis. This hypothesis is that the use of general knowledge, in addition to specialized knowledge, significantly improves the analysis of medical documents. At the level of the lexical-semantic network, the manual and automated addition of meta-information on annotations (frequency information, pertinence, etc.) is particularly useful. This network combines weights and annotations on typed relationships between terms and concepts, as well as an inference mechanism which aims to improve the quality and coverage of the network. We describe how, from semantic information in the network, it is possible to define an augmentation of the raw index built for each record in order to improve information retrieval. We then present a method for extracting semantic relationships between terms or concepts. This extraction is performed using lexical patterns to which we added semantic constraints. The results show that the hypothesis of no separation between different types of knowledge improves the relevance of indexing. The index augmentation improves retrieval, while semantic constraints improve the accuracy of relationship extraction.
Fradet, Nathan. "Apprentissage automatique pour la modélisation de musique symbolique." Electronic Thesis or Diss., Sorbonne université, 2024. https://accesdistant.sorbonne-universite.fr/login?url=https://theses-intra.sorbonne-universite.fr/2024SORUS037.pdf.
Symbolic music modeling (SMM) covers the tasks performed by Deep Learning models on the symbolic music modality, among which are music generation and music information retrieval. SMM is often handled with sequential models that process data as sequences of discrete elements called tokens. This thesis studies how symbolic music can be tokenized, and how the different ways of doing so impact model performance and efficiency. Current challenges include the lack of software to perform this step, poor model efficiency and inexpressive tokens. We address these challenges by: 1) developing a complete, flexible and easy-to-use software library for tokenizing symbolic music; 2) analyzing the impact of various tokenization strategies on model performance; 3) increasing the performance and efficiency of models by leveraging large music vocabularies with the use of byte pair encoding; 4) building the first large-scale model for symbolic music generation.
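Byte pair encoding, mentioned in point 3 of this abstract, can be illustrated with a short Python sketch on a sequence of music tokens. This is a generic version of the technique and not the thesis's library: the token names (`Pitch_60`, `Dur_4`, ...) and the `+` joining convention are hypothetical, and ties between equally frequent pairs are broken arbitrarily by first occurrence.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most frequent pair of adjacent tokens, or None if seq is too short."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(seq, num_merges):
    """Repeatedly merge the most frequent adjacent pair into a single new token."""
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        seq = merge_pair(seq, pair, "+".join(pair))
        merges.append(pair)
    return seq, merges


# Hypothetical token sequence: alternating pitch and duration tokens.
tokens = ["Pitch_60", "Dur_4", "Pitch_64", "Dur_4",
          "Pitch_60", "Dur_4", "Pitch_67", "Dur_2"]
compressed, merges = learn_bpe(tokens, num_merges=1)
```

Here the pair (`Pitch_60`, `Dur_4`) occurs twice and is merged first, so the sequence shrinks from eight tokens to six while the vocabulary gains one entry. This is the trade-off between sequence length and vocabulary size that the abstract refers to.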
Lopez, Cédric. "Titrage automatique de documents textuels." Thesis, Montpellier 2, 2012. http://www.theses.fr/2012MON20071/document.
During the first millennium BC, the already existing libraries needed to organize the preservation of texts, and were thus immediately confronted with the difficulties of indexation. The use of a title then emerged as a first solution, enabling a quick identification of every work and, in most cases, helping to discern works thematically close to a given one. While in Ancient Greece titles had only a minor informative function, although still performing an identification function, the invention of the printing press with movable type (Gutenberg, 15th century AD) dramatically increased the number of documents, which are today distributed on a large scale. The title acquired new functions little by little, very often under sociocultural or political influence (in particular in journalistic articles). Today, for both electronic and paper documents, the presence of one or several titles is very often noticed. It helps create a first link between the reader and the subject of the document. But how can a few words have such a big influence? What functions do titles have to perform at the beginning of the 21st century? How can one automatically generate titles respecting these functions? The automatic titling of textual documents is one of the key domains of Web page accessibility (W3C standards), as defined in a standard put forward by disability associations. For a given reader, the goal is to increase the readability of pages obtained from a search, since usual searches often dishearten readers, who must expend great cognitive effort. For a Website designer, the aim is to improve the indexation of pages for a more relevant search. Other interests motivate this study (titling of commercial Web pages, titling in order to automatically generate content, titling to provide elements that enhance automatic summarization). In this study, we use NLP (Natural Language Processing) methods and systems.
While numerous works have been published about indexation and automatic summarization, automatic titling has received little attention and has had difficulty finding its place within NLP. We argue in this study that automatic titling must nevertheless be considered a task in its own right. Having defined the problems connected to automatic titling, and having positioned this task among the already existing tasks, we provide a series of methods enabling the production of syntactically correct titles, according to several objectives. In particular, we are interested in the generation of informative titles and, for the first time in the history of automatic titling, we introduce the concept of catchiness. Our TIT' system consists of three methods (POSTIT, NOMIT, and CATIT), which produce sets of informative titles in 81% of cases and catchy titles in 78% of cases.
Lesnikova, Tatiana. "Liage de données RDF : évaluation d'approches interlingues." Thesis, Université Grenoble Alpes (ComUE), 2016. http://www.theses.fr/2016GREAM011/document.
The Semantic Web extends the Web by publishing structured and interlinked data using RDF. An RDF data set is a graph where resources are nodes labelled in natural languages. One of the key challenges of linked data is to be able to discover links across RDF data sets. Given two data sets, equivalent resources should be identified and linked by owl:sameAs links. This problem is particularly difficult when resources are described in different natural languages. This thesis investigates the effectiveness of linguistic resources for interlinking RDF data sets. For this purpose, we introduce a general framework in which each RDF resource is represented as a virtual document containing text information of neighboring nodes. The context of a resource is made up of the labels of the neighboring nodes. Once virtual documents are created, they are projected into the same space in order to be compared. This can be achieved by using machine translation or multilingual lexical resources. Once documents are in the same space, similarity measures are applied to find identical resources. Similarity between elements of this space is taken as similarity between RDF resources. We evaluated cross-lingual techniques within the proposed framework, experimentally comparing different methods for linking RDF data. In particular, two strategies are explored: applying machine translation or using references to multilingual resources. Overall, the evaluation shows the effectiveness of cross-lingual string-based approaches for linking RDF resources expressed in different languages. The methods have been evaluated on resources in English, Chinese, French and German. The best performance (over 0.90 F-measure) was obtained by the machine translation approach. This shows that the similarity-based method can be successfully applied to RDF resources independently of their type (named entities or thesauri concepts).
The best experimental results involving just a pair of languages demonstrated the usefulness of such techniques for interlinking RDF resources cross-lingually.
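The virtual-document framework described in this abstract can be sketched in Python as follows. The sketch assumes that the translation step has already been applied, so that all labels are in English, and it uses bag-of-words cosine similarity as one possible similarity measure. The resource identifiers, labels and neighbour lists are hypothetical, and the abstract does not specify which measures the thesis actually used.

```python
from collections import Counter
from math import sqrt

# Hypothetical labels after translation into a common language (English).
LABELS = {
    "a:Paris": "Paris", "a:France": "France", "a:Seine": "Seine river",
    "b:Paris": "Paris", "b:France": "France", "b:Seine": "Seine",
    "b:Paris_Texas": "Paris", "b:Texas": "Texas", "b:USA": "United States",
}

# Hypothetical neighbourhoods in two RDF graphs (prefixes a: and b:).
NEIGHBOURS = {
    "a:Paris": ["a:France", "a:Seine"],
    "b:Paris": ["b:France", "b:Seine"],
    "b:Paris_Texas": ["b:Texas", "b:USA"],
}


def virtual_document(resource):
    """Bag of words built from the resource's label and its neighbours' labels."""
    words = LABELS[resource].lower().split()
    for neighbour in NEIGHBOURS.get(resource, []):
        words += LABELS[neighbour].lower().split()
    return Counter(words)


def cosine(doc_a, doc_b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(doc_a[w] * doc_b[w] for w in doc_a.keys() & doc_b.keys())
    norm = sqrt(sum(v * v for v in doc_a.values())) * sqrt(sum(v * v for v in doc_b.values()))
    return dot / norm if norm else 0.0


source = virtual_document("a:Paris")
candidates = ["b:Paris", "b:Paris_Texas"]
scores = {c: cosine(source, virtual_document(c)) for c in candidates}
best_match = max(scores, key=scores.get)  # candidate for an owl:sameAs link
```

The two candidates share the label "Paris", so the label alone cannot separate them. The neighbouring labels are what give the French capital the higher score in this toy example.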
Ratkovic, Zorana. "Predicative Analysis for Information Extraction : application to the biology domain." Thesis, Paris 3, 2014. http://www.theses.fr/2014PA030110.
The abundance of biomedical information expressed in natural language has resulted in the need for methods to process this information automatically. In the field of Natural Language Processing (NLP), Information Extraction (IE) focuses on the extraction of relevant information from unstructured data in natural language. Many IE methods today focus on Machine Learning (ML) approaches that rely on deep linguistic processing in order to capture the complex information contained in biomedical texts. In particular, syntactic analysis and parsing have played an important role in IE, by helping capture how words in a sentence are related. This thesis examines how dependency parsing can be used to facilitate IE. It focuses on a task-based approach to dependency parsing evaluation and parser selection, including a detailed error analysis. In order to achieve high-quality syntax-based IE, different stages of linguistic processing are addressed, including both pre-processing steps (such as tokenization) and the use of complementary linguistic processing (such as the use of semantics and coreference analysis). This thesis also explores how the different levels of linguistic processing can be represented for use within an ML-based IE algorithm, and how the interface between these two is of great importance. Finally, biomedical data is very heterogeneous, encompassing different subdomains and genres. This thesis explores how subdomain adaptation can be achieved by using already existing subdomain knowledge and resources. The methods and approaches described are explored using two different biomedical corpora, demonstrating how the IE results are used in real-life tasks.
Desot, Thierry. "Apport des modèles neuronaux de bout-en-bout pour la compréhension automatique de la parole dans l'habitat intelligent." Thesis, Université Grenoble Alpes, 2020. http://www.theses.fr/2020GRALM069.
Smart speakers offer the possibility of interacting with smart home systems, and make it possible to issue a range of requests about various subjects. They represent the first ambient voice interfaces that are frequently available in home environments. Very often they are only capable of inferring voice commands of a simple syntax in short utterances in the realm of smart homes that promote home care for senior adults. They support them during everyday situations by improving their quality of life, and also providing assistance in situations of distress. The design of these smart homes mainly focuses on the safety and comfort of their inhabitants. As a result, these research projects frequently concentrate on human activity detection, resulting in a lack of attention to the communicative aspects of smart home design. Consequently, there are insufficient speech corpora specific to the home automation field, in particular for languages other than English. However, the availability of these corpora is crucial for developing interactive communication systems between the smart home and its inhabitants. Having such corpora at one's disposal could also contribute to the development of a generation of smart speakers capable of extracting more complex voice commands. As a consequence, part of our work consisted in developing a corpus generator, producing home automation domain specific voice commands, automatically annotated with intent and concept labels. The extraction of intents and concepts from these commands by a Spoken Language Understanding (SLU) system is needed to provide the decision-making module with the information required for their execution. In order to react to speech, the natural language understanding (NLU) module is typically preceded by an automatic speech recognition (ASR) module, automatically converting speech into transcriptions.
As several studies have shown, the interaction between ASR and NLU in a sequential SLU approach accumulates errors. Therefore, one of the main motivations of our work is the development of an end-to-end SLU module, extracting concepts and intents directly from speech. To achieve this goal, we first develop a sequential SLU approach as our baseline, in which a classic ASR method generates transcriptions that are passed to the NLU module, before continuing with the development of an end-to-end SLU module. These two SLU systems were evaluated on a corpus recorded in the home automation domain. We investigate whether the prosodic information that the end-to-end SLU system has access to contributes to SLU performance. We also compare the robustness of the two approaches when facing speech with more semantic and syntactic variation. The context of this thesis is the ANR VocADom project.
Ameli, Samila. "Construction d'un langage de dictionnaire conceptuel en vue du traitement du langage naturel : application au langage médical." Compiègne, 1989. http://www.theses.fr/1989COMPD226.
This study deals with the realisation of a "new generation" information retrieval system that takes the meaning of texts into consideration. This system compares texts (questions and documents) by their content. Since a knowledge base is indispensable for text "comprehension", a dictionary of concepts has been designed in which the concepts and their mutual relations are defined through a user-friendly language called SUMIX. SUMIX enables us (1) to resolve ambiguities due to polysemy by considering context dependencies, (2) to make use of property inheritance, which can greatly help knowledge engineers in the creation of the knowledge and inference base, (3) to define subject-dependent relations between concepts, which makes metaknowledge handling possible. The dictionary of concepts is essentially used (1) to index concepts (and not character strings), which enables us to select a wide range of documents in the conceptual extraction phase, (2) to filter the previously selected documents by comparing the structure of each document with that of the query in the structural analysis phase.
Smart, John Ferguson. "L' analyse et la représentation de compte-rendus médicaux." Aix-Marseille 2, 1996. http://www.theses.fr/1996AIX22095.
Belabbas, Azeddine. "Satisfaction de contraintes et validation des grammaires du langage naturel." Paris 13, 1996. http://www.theses.fr/1996PA132044.
Nazarenko, Adeline. "Compréhension du langage naturel : le problème de la causalité." Paris 13, 1994. http://www.theses.fr/1994PA132007.
Fouqueré, Christophe. "Systèmes d'analyse tolérante du langage naturel." Paris 13, 1988. http://www.theses.fr/1988PA132003.
Ciortuz, Liviu-Virgil. "Programmation concurrente par contraintes et traitement du langage naturel : le système DF." Lille 1, 1996. http://www.theses.fr/1996LIL10145.
We implemented a prototype of the DF system in Oz, the concurrent multi-paradigm language developed at DFKI, by realizing a typed, object-oriented alternative to its open records subsystem. The DF system is applied to natural language processing: parsing, generation, and machine translation. We undertook the design of an HPSG core for Romanian, with a concurrent implementation. The definiteness, topic, and adjunction of the Romanian noun phrase are analysed, and the behaviour of Romanian clitic pronouns is explained on the basis of the linearization of the transitive verb phrase. We defined two meta-schemas on top of the immediate dominance (ID) rule schemas of HPSG theory: meta-schema ID 1, which concerns multiple subjects (for example, determiners in the Romanian noun phrase), and meta-schema ID 2/6, for the correlation of locally ordered constituents in unbounded dependencies (such as clitics in the Romanian transitive verb phrase). Both ID meta-schemas bring concurrency into play within the framework of HPSG grammar theory.
Fort, Karën. "Les ressources annotées, un enjeu pour l’analyse de contenu : vers une méthodologie de l’annotation manuelle de corpus." Paris 13, 2012. http://scbd-sto.univ-paris13.fr/intranet/edgalilee_th_2012_fort.pdf.
Manual corpus annotation has become a key issue for Natural Language Processing (NLP), as manually annotated corpora are used both to create and to evaluate NLP tools. However, the process of manual annotation remains underdescribed and the tools used to support it are often misused. This situation prevents the campaign manager from evaluating and guaranteeing the quality of the annotation. We propose in this work a unified vision of manual corpus annotation for NLP. It results from our experience of annotation campaigns, either as a manager or as a participant, as well as from collaborations with other researchers. We first propose a global methodology for managing manual corpus annotation campaigns that relies on two pillars: an organization for annotation campaigns that puts evaluation at the heart of the process, and an innovative grid for the analysis of the complexity dimensions of an annotation campaign. A second part of our work concerns the tools of the campaign manager. We evaluated the precise influence of automatic pre-annotation on the quality and speed of the correction by humans, through a series of experiments on part-of-speech tagging for English. Furthermore, we propose practical solutions for the evaluation of manual annotations that provide the campaign manager with the means to select the most appropriate measures. Finally, we brought to light the processes and tools involved in an annotation campaign and we instantiated the methodology that we described.
RAMMAL, MAHMOUD. "Une interface conceptuelle pour le traitement du langage naturel. Application au langage medical dans le systeme adm." Compiègne, 1993. http://www.theses.fr/1993COMP594S.
Dégremont, Jean-François. "Ethnométhodologie et innovation technologique : le cas du traitement automatique des langues naturelles." Paris 7, 1989. http://www.theses.fr/1989PA070043.
The thesis begins with a short historical overview of ethnomethodology, considered as a scientific field, from its very beginnings in the 1930s to its 1967 expansion in the US and Europe. The first part is an explanation of the main concepts of ethnomethodology. They are developed from the theoretical point of view of the pariseptist school, which tries to combine the strictest rejection of induction with the indifference principle, mainly when natural languages, considered both as objects of study and as communication tools, are used. The second part of the thesis is devoted to the concrete application of these theoretical concepts to the technological strategies that have been elaborated in France in the area of natural language processing. Three studies successively describe the ethnomethods and rational properties of the practical activities used in an administrative team, the elaboration of a technology policy, and indexical descriptions of the language industry field. The conclusion tries to show how the concepts and methods developed by ethnomethodology can increase, in this field, the effectiveness of strategic analysis and the quality of research and development programs.
Gayral, Françoise. "Sémantique du langage naturel et profondeur variable : Une première approche." Paris 13, 1992. http://www.theses.fr/1992PA132004.
Alain, Pierre. "Contributions à l'évaluation des modèles de langage." Rennes 1, 2007. http://www.theses.fr/2007REN1S003.
This work deals with the evaluation of language models independently of any applicative task. A comparative study between several language models is generally tied to the role that a model plays in a complete system. Our objective is to be independent of the applicative system, and thus to provide a true comparison of language models. Perplexity is a widely used criterion for comparing language models without any task assumptions. However, its main drawback is that perplexity presupposes probability distributions and hence cannot compare heterogeneous models. As an evaluation framework, we went back to the definition of Shannon's game, which is based on model prediction performance using rank-based statistics. Our methodology is able to predict joint word sequences that are independent of the task or model assumptions. Experiments are carried out on French and English modeling with large vocabularies, and compare different kinds of language models.
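The contrast drawn in this abstract between perplexity and a rank-based Shannon game score can be illustrated with a minimal Python sketch. It is not the thesis's method: the toy bigram model, the add-one smoothing, and the function names `train_bigram`, `perplexity`, and `mean_rank` are all assumptions made for illustration. The point is only that perplexity needs probabilities, whereas the rank score needs nothing more than an ordering of candidate words.

```python
import math
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count next-word frequencies for each word (a toy bigram model)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def perplexity(counts, tokens, vocab):
    """Perplexity with add-one smoothing; this needs a probability distribution."""
    log_sum = 0.0
    n = 0
    for prev, nxt in zip(tokens, tokens[1:]):
        total = sum(counts[prev].values()) + len(vocab)
        p = (counts[prev][nxt] + 1) / total
        log_sum += math.log2(p)
        n += 1
    return 2 ** (-log_sum / n)


def mean_rank(counts, tokens, vocab):
    """Shannon-game style score: average rank of the true next word.

    Only an ordering of candidates is required, so a model that does not
    output probabilities could be scored the same way.
    """
    ranks = []
    for prev, nxt in zip(tokens, tokens[1:]):
        # Most frequent continuation first; ties broken alphabetically.
        ordered = sorted(vocab, key=lambda w: (-counts[prev][w], w))
        ranks.append(ordered.index(nxt) + 1)
    return sum(ranks) / len(ranks)


corpus = "the cat sat on the mat".split()
vocab = sorted(set(corpus))
model = train_bigram(corpus)
print(perplexity(model, corpus, vocab), mean_rank(model, corpus, vocab))
```

A lower mean rank means the true word tends to appear earlier in the model's guesses; in this sketch both scores are computed on the training text, purely to keep the example short.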
Laskri, Mohamed Tayeb. "Approche de l'automatisation de thésaurus : étude de la sémantique adaptée du langage naturel." Aix-Marseille 2, 1987. http://www.theses.fr/1987AIX22076.
Mazahreh, Mazhar. "Recherche et analyse informatique des expressions du langage naturel correspondant à des questions sur les bases de données." Paris, EHESS, 1990. http://www.theses.fr/1990EHES0059.
Harrathi, Farah. "Extraction de concepts et de relations entre concepts à partir des documents multilingues : approche statistique et ontologique." Lyon, INSA, 2009. http://theses.insa-lyon.fr/publication/2009ISAL0073/these.pdf.
The research work of this thesis is related to the problem of document indexing for search, and more specifically to the extraction of semantic descriptors for document indexing. An Information Retrieval System (IRS) is a set of models and systems for selecting a set of documents that satisfy a user's information need expressed as a query. In IR, handling a query involves mainly two processes: representation and retrieval. The representation process is called indexing; it represents documents and queries by descriptors, or indexes. These descriptors reflect the contents of documents. The retrieval process consists of comparing document representations with the query representation. In classical IRSs, the descriptors used are words (simple or compound). These IRSs consider the document as a set of words, often called a "bag of words". In these systems, words are treated as written forms without semantics. The only information used for these words is their occurrence frequency in the documents. These systems do not take into account the semantic relationships between words. For example, it is impossible to find documents represented by a word M1 that is synonymous with a word M2 when the query is represented by M2. Likewise, in a classical IRS a document indexed by the term "bus" will never be found by a query indexed by the word "taxi", yet these two words deal with the same subject, "means of transportation". To address these limitations, several studies have taken the semantics of indexing terms into account. This type of indexing is called semantic or conceptual indexing. These works rely on the notion of concept in place of the notion of word. In this work, the terms denoting concepts are extracted from the document by using statistical techniques. These terms are then projected onto semantic resources such as ontologies and thesauri to extract the concepts involved.
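The "bus" versus "taxi" limitation described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CONCEPTS` mapping is a hypothetical stand-in for an ontology or thesaurus, and the helper names `word_index`, `concept_index`, and `matches` are invented here; the thesis's statistical term extraction is not reproduced.

```python
# Hypothetical toy semantic resource: maps a term to a concept label.
# A real system would project terms onto an ontology or thesaurus.
CONCEPTS = {
    "bus": "means_of_transportation",
    "taxi": "means_of_transportation",
    "train": "means_of_transportation",
}


def word_index(text):
    """Bag-of-words descriptors: the set of lower-cased surface forms."""
    return set(text.lower().split())


def concept_index(text):
    """Conceptual descriptors: concepts denoted by the terms of the text."""
    return {CONCEPTS[w] for w in word_index(text) if w in CONCEPTS}


def matches(query, document, indexer):
    """A document is retrieved if it shares at least one descriptor with the query."""
    return bool(indexer(query) & indexer(document))


doc = "the bus was late again"
print(matches("taxi", doc, word_index))     # word-level match
print(matches("taxi", doc, concept_index))  # concept-level match
```

With word descriptors the query and the document share nothing, while with concept descriptors both map to the same concept and the document is retrieved.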
PARK, SE YOUNG. "Un algorithme efficace pour l'analyse du langage naturel : application aux traitements des erreurs et aux grammaires discontinues." Paris 7, 1989. http://www.theses.fr/1989PA077214.
Fourour, Nordine. "Identification et catégorisation automatique des entités nommées dans les textes français." Nantes, 2004. http://www.theses.fr/2004NANT2126.
Named Entity (NE) Recognition is a recurring problem in the different domains of Natural Language Processing. Following a linguistic investigation that sets up operational parameters defining the concept of named entity, a state of the art of the domain, and a corpus investigation using referential and graphical criteria, we present Nemesis, a French named entity recognizer. This system analyzes internal and external evidence by using grammar rules and trigger word lexicons, and includes a learning process. With these processes, Nemesis achieves about 90% precision and 80% recall. To increase the recall, we put forward optional modules (analysis of the wide context and use of the Web as a source of new contexts) and investigate setting up a disambiguation and grammar rule inference module.
Tartier, Annie. "Analyse automatique de l'évolution terminologique : variations et distances." Nantes, 2004. http://www.theses.fr/2004NANT2040.
The aim of this thesis is to work out automatic methods for uncovering evolutionary phenomena within terms extracted from diachronic corpora of scientific or technical texts. The first research axis concerns the nature of changes. It is based on a typology of terminological variation that aims to define a distance between two terminological forms. That distance allows us to easily group together the variants of a term and to define measures over sets of studied terms. The second axis concerns the structuring of time and proposes several modes of diachronic examination in order to distinguish ephemeral changes from durable ones, which could be the signs of an evolution. These ideas are implemented in a prototype which first proposes temporal profiles, then some information about stable, old or new terms, given either for exact forms or to the nearest variant.
Balicco, Laurence. "Génération de repliques en français dans une interface homme-machine en langue naturelle." Grenoble 2, 1993. http://www.theses.fr/1993GRE21025.
This research takes place in the context of natural language generation. This field has been neglected for a long time because it seemed a much easier phase than that of analysis. The thesis corresponds to a first work on generation in the CRISS team and places the problem of generation in the context of a man-machine dialogue in natural language. Some of its consequences are: generation from a logical content to be translated into natural language, a translation kept as close as possible to the original content, etc. After studying the different works that have been done, we decided to create our own generation system, reusing where possible the tools elaborated during the analysis process. This generation process is based on a linguistic model, which uses syntactic and morphological information and in which linguistic transformations called operations are defined (coordination, anaphorisation, thematisation, etc.). These operations can be given by the dialogue or calculated during the generation process. The model allows the creation of several formulations of the same utterance and therefore a better adaptation to different users. This thesis presents the studied works, essentially on the French and English languages, the linguistic model developed, the computing model used, and a brief presentation of a European project which offers a possible application of ou
Alsandouk, Fatima. "Grammaire de scene : processus de comprehension de textes de description geometrique." Toulouse 2, 1990. http://www.theses.fr/1990TOU20058.
Denand, Nicolas. "Traitement automatique de phrases locatives statiques du français." Aix-Marseille 2, 2004. http://www.theses.fr/2004AIX22035.
Wolfarth, Claire. "Apport du TAL à l’exploitation linguistique d’un corpus scolaire longitudinal." Thesis, Université Grenoble Alpes (ComUE), 2019. http://www.theses.fr/2019GREAL025.
In recent years, there has been a real effort to build and promote corpora of children's writings, especially in French. The first research works on writing acquisition relied on small corpora that were not widely distributed. Longitudinal corpora, which monitor a cohort of children's productions under similar collection conditions from one year to the next, do not exist in French yet. Moreover, although natural language processing (NLP) has provided tools for a wide variety of corpora, few studies have been conducted on corpora of children's writings. This new scope represents a challenge for the NLP field because of the specificities of children's writings, and particularly their deviation from the written norm. Hence, the tools currently available are not suitable for the exploitation of these corpora. There is therefore a challenge for NLP to develop specific methods for these written productions. This thesis provides two main contributions. On the one hand, this work has led to the creation of a large, digitized longitudinal corpus of children's writings (from 6 to 11 years old) named the Scoledit corpus. Its constitution involves the collection, digitization and transcription of productions, the annotation of linguistic data, and the dissemination of the resulting resource. On the other hand, this work enables the development of a method for exploiting this corpus, called the comparison approach, which is based on the comparison between the transcription of children's productions and their standardized version. In order to create a first level of alignment, this method compared transcribed forms to their normalized counterparts, using the aligner AliScol. It also made it possible to explore various linguistic analyses (lexical, morphographic, graphical). Finally, in order to analyse graphemes, an aligner of transcribed and normalized graphemes, called AliScol_Graph, was created.
Krit, Hatem. "Locadelane : un langage objet d'aide à la compréhension automatique du discours exprimé en langage naturel et écri." Toulouse 3, 1990. http://www.theses.fr/1990TOU30008.
Maire-Reppert, Daniele. "L'imparfait de l'indicatif en vue d'un traitement informatique du français." Paris 4, 1990. http://www.theses.fr/1990PA040039.
My approach to the French imperfect is in keeping with the methodology of expert systems. I have first identified and then given a topological representation of the values of the imperfect, i.e.: descriptive state, permanent state, new state, progressive process, habit, possibility, hypothetical, politeness and hypocoristic. We have then defined its invariant in order to distinguish the imperfect from the other tenses. Finally, we have worked out a set of heuristic rules (about a hundred production rules), the function of which is to associate a semantic value with a temporal morpheme according to the context. This contextual research has been conducted at the level of the text, of the sentence and of the archetype of the verb. Our study of the imperfect has been completed by an example of inserting the "heuristic rules" module into a natural language processing architecture and by a short analysis of the contribution of such an approach to teaching foreign languages.
Tannier, Xavier. "Extraction et recherche d'information en langage naturel dans les documents semi-structurés." Phd thesis, Ecole Nationale Supérieure des Mines de Saint-Etienne, 2006. http://tel.archives-ouvertes.fr/tel-00121721.
(written in XML in practice) combines aspects of traditional IR with those of database querying. Structure is of primary importance, but the information need remains vague. The retrieval unit is variable (a paragraph, a figure, a complete article...). Moreover, the flexibility of the XML language allows manipulations of the content that sometimes cause arbitrary breaks in the natural flow of the text. These characteristics raise many problems, both in the pre-processing of documents and in their querying. Faced with these problems, we studied the specific solutions that natural language processing (NLP) could bring. We thus proposed a theoretical framework and a practical approach that allow textual analysis techniques to be used while abstracting away from the structure. We also designed a natural language query interface for IR in XML documents, and proposed methods that take advantage of the structure to improve the retrieval of relevant elements.
Arias, Aguilar José Anibal. "Méthodes spectrales pour le traitement automatique de documents audio." Toulouse 3, 2008. http://thesesups.ups-tlse.fr/436/.
Disfluencies are a frequently occurring phenomenon in any spontaneous speech production; they consist of interruptions of the normal flow of speech. They have given rise to numerous studies in Natural Language Processing. Indeed, their study and precise identification are essential, from both a theoretical and an applied perspective. However, most research on the subject relates to everyday uses of language: "small talk" dialogs, requests for schedules, speeches, etc. But what about spontaneous speech production in a constrained framework? To our knowledge, no study has ever been carried out in this context. However, we know that using a "language specialty" in the framework of a given task leads to specific behaviours. Our thesis work is devoted to the linguistic and computational study of disfluencies within such a framework. These dialogs concern air traffic control, which entails both pragmatic and linguistic constraints. We carry out an exhaustive study of disfluency phenomena in this context. First, we conduct a fine-grained analysis of these phenomena. Then we model them at a level of abstraction that allows us to obtain the patterns corresponding to the different configurations observed. Finally, we propose a methodology for automatic processing. It consists of several algorithms to identify the different phenomena, even in the absence of explicit markers. It is integrated into an automatic speech processing system. The methodology is then validated on a corpus of 400 sentences.
Kupsc, Anna. "Une grammaire hpsg des clitiques polonais." Paris 7, 2000. http://www.theses.fr/2000PA070086.
EL, HAROUCHY ZAHRA. "Dictionnaire et grammaire pour le traitement automatique des ambiguites morphologiques des mots simples en francais." Besançon, 1997. http://www.theses.fr/1997BESA1010.
When carrying out the automatic analysis of a text, one of the first stages consists in determining the grammatical categories of the words. In order to do this, a dictionary has been designed which recognises the grammatical category or categories of non-compound words from their endings. This dictionary, which we have called the automatic dictionary, is a collection of general rules (which can contain sub-rules). A general rule specifies an ending. An operator (the grammatical category or categories) is associated with each rule. For example, we have the following general rule: "words ending in 'able' are adjectives". Examples of exceptions to (or sub-rules of) this general rule are nouns such as "cartable", conjugated verbs like "accable", morphological ambiguities such as "noun and conjugated verb" (like "sable", "table", etc.), and ambiguities such as "adjectival nouns" (like, for example, "comptable"). Consequently, this sort of dictionary gives prominence to those words possessing several grammatical categories. When the automatic dictionary detects a word possessing several categories, the grammar system is consulted, the role of which is to resolve the morphological ambiguities by studying the immediate context. The rules in the grammar system work like a group of possible combinations of elements capable of coming after and/or before the ambiguous form (for example, a rule states that an ambiguous form such as "pronoun or article" preceded by "à cause de" is, in fact, an article).
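The two-stage mechanism described in this abstract (ending rules with exceptions, then contextual rules for ambiguous forms) might be sketched as follows in Python. The data structures and the names `RULES`, `EXCEPTIONS`, `categories`, `is_ambiguous`, and `disambiguate` are assumptions made for illustration; only the example words and the "à cause de" rule come from the abstract.

```python
# General rules: an ending and its default category (hypothetical encoding).
RULES = [("able", {"adjective"})]

# Sub-rules: exceptions to the general rule, taken from the abstract's examples.
EXCEPTIONS = {
    "cartable": {"noun"},
    "accable": {"verb"},
    "sable": {"noun", "verb"},
    "table": {"noun", "verb"},
    "comptable": {"adjective", "noun"},
}


def categories(word):
    """Return the candidate categories of a word; exceptions override general rules."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for ending, cats in RULES:
        if word.endswith(ending):
            return cats
    return set()  # no rule applies in this toy dictionary


def is_ambiguous(word):
    """A word is ambiguous when the dictionary yields more than one category."""
    return len(categories(word)) > 1


def disambiguate(left_context, cats):
    """Contextual rule: 'pronoun or article' preceded by 'à cause de' is an article."""
    if cats == {"pronoun", "article"} and left_context.endswith("à cause de"):
        return {"article"}
    return cats


print(categories("aimable"), is_ambiguous("table"))
print(disambiguate("à cause de", {"pronoun", "article"}))
```

Checking exceptions before general rules keeps the general rule short, which matches the abstract's description of sub-rules attached to an ending.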
Haddad, Afifa. "Traitement des nominalisations anaphoriques en indexation automatique." Lyon 2, 2001. http://theses.univ-lyon2.fr/documents/lyon2/2001/haddad_a.
This thesis proposes an indexing method for full texts based on anaphoric noun phrases. The motivation is to take advantage of the wide context of an anaphoric relation in order to build a rich descriptor and consequently obtain an effective index. The main contribution here is the design of a complete method enabling the systematic reconstitution of all arguments of each anaphoric nominalization encountered in the text. A completely resolved noun phrase constitutes a rich descriptor that is then added to the index. The resolution of a nominal anaphora makes use of the results of other preliminary activities. These consist in collecting the syntactic structures of the possible noun phrases corresponding to a nominalization, and in identifying a set of anaphoric noun phrases and the form of their antecedents. The feasibility of the proposed method has been demonstrated through an application to a real-life corpus.
Haddad, Afifa Le Guern Michel. "Traitement des nominalisations anaphoriques en indexation automatique." [S.l.] : [s.n.], 2001. http://theses.univ-lyon2.fr/sdx/theses/lyon2/intranet/haddad_a.
Véronis, Jean. "Contribution à l'étude de l'erreur dans le dialogue homme-machine en langage naturel." Aix-Marseille 3, 1988. http://www.theses.fr/1988AIX30043.
Perraud, Freddy. "Modélisation du langage naturel appliquée à la reconnaissance de l'écriture manuscrite en-ligne." Nantes, 2005. http://www.theses.fr/2005NANT2112.
N'Guéma, Sylvain Abraham. "Intégration de paramètres formels d'intonation à l'analyse syntaxique automatique dans une perspective d'aide à la désambigui͏̈sation syntaxique." Avignon, 1998. http://www.theses.fr/1998AVIG0121.
Poibeau, Thierry. "Extraction d'information à base de connaissances hybrides." Paris 13, 2002. http://www.theses.fr/2002PA132001.
Levrat, Bernard. "Le problème du sens dans les sytèmes de traitement du langage naturel : Une approche alternative au travers de la paraphrase." Paris 13, 1993. http://www.theses.fr/1993PA132023.