Doctoral dissertations on the topic "Fouille du texte"
Create an accurate reference in APA, MLA, Chicago, Harvard, and many other styles
Check out the 50 best doctoral dissertations on the topic "Fouille du texte".
An "Add to bibliography" button is available next to each work in the bibliography. Use it, and we will automatically create a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the scholarly publication as a ".pdf" file and read its abstract online, when these details are available in the metadata.
Browse doctoral dissertations from various fields and build appropriate bibliographies.
Dalloux, Clément. "Fouille de texte et extraction d'informations dans les données cliniques". Thesis, Rennes 1, 2020. http://www.theses.fr/2020REN1S050.
With the introduction of clinical data warehouses, more and more health data are available for research purposes. While a significant part of these data exists in structured form, much of the information contained in electronic health records is available as free text that can be used for many tasks. In this manuscript, two tasks are explored: the multi-label classification of clinical texts and the detection of negation and uncertainty. The first is studied in cooperation with the Rennes University Hospital, owner of the clinical texts that we use, while, for the second, we use publicly available biomedical texts that we annotate and release free of charge. To solve these tasks, we propose several approaches based mainly on deep learning algorithms, used in both supervised and unsupervised learning settings.
Marchand, Morgane. "Domaines et fouille d'opinion : une étude des marqueurs multi-polaires au niveau du texte". Thesis, Paris 11, 2015. http://www.theses.fr/2015PA112026/document.
In this thesis, we study the adaptation of a text-level opinion classifier across domains. People express their opinions differently depending on the subject of the conversation: the same word in two different domains can refer to different objects or carry a different connotation, and if such words are not detected they lead to classification errors. We call these words or bigrams "multi-polarity markers": their presence in a text signals a polarity that differs according to the domain of the text, and their study is the subject of this thesis. These markers are detected using a chi-squared test when labels exist in both targeted domains. We also propose a semi-supervised detection method for the case where labels exist in only one domain, using a collection of automatically filtered pivot words to ensure a stable polarity across domains. We also checked the linguistic interest of the selected words in a manual evaluation campaign. The validated words can be a context word, a word expressing an opinion, a word explaining an opinion, or a word referring to the evaluated object. Our study also shows that the causes of changing polarity are of three kinds: a change of meaning, a change of object, or a change of use. Finally, we studied the influence of multi-polarity markers on text-level opinion classification in three different settings: adaptation from a source domain to a target domain, multi-domain corpora, and open-domain corpora. Our experiments show that the potential improvement is larger when the initial transfer was difficult; in the favorable cases, we improve accuracy by up to five points.
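The chi-squared detection step mentioned in this abstract can be sketched as follows; the contingency counts and the example word are invented for illustration and are not taken from the thesis.

```python
# Sketch of a chi-squared test for flagging "multi-polarity markers":
# words whose polarity distribution differs significantly between two
# domains. Counts below are illustrative, not from the thesis.

def chi2_2x2(pos_a, neg_a, pos_b, neg_b):
    """Chi-squared statistic for a 2x2 contingency table:
    rows = domains A/B, columns = counts of positive/negative
    documents containing the word."""
    table = [[pos_a, neg_a], [pos_b, neg_b]]
    total = pos_a + neg_a + pos_b + neg_b
    row = [pos_a + neg_a, pos_b + neg_b]
    col = [pos_a + pos_b, neg_a + neg_b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Hypothetical word that is positive in movie reviews (domain A)
# but negative in appliance reviews (domain B), e.g. "unpredictable".
stat = chi2_2x2(pos_a=40, neg_a=10, pos_b=5, neg_b=45)
# With 1 degree of freedom, the critical value at p = 0.05 is 3.84.
is_marker = stat > 3.84
```

A word passing the threshold would be retained as a candidate multi-polarity marker; in practice a correction for multiple testing would also be needed.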
Tisserant, Guillaume. "Généralisation de données textuelles adaptée à la classification automatique". Thesis, Montpellier, 2015. http://www.theses.fr/2015MONTS231/document.
Texts have been classified for a long time. Early on, many documents of different types were grouped together in order to centralize knowledge, and classification and indexing systems were created to make it easy to find documents based on readers' needs. With the growing number of documents and the advent of computers and the internet, implementing automatic text classification systems has become a critical issue. However, textual data, complex and rich by nature, are difficult to process automatically. In this context, this thesis proposes an original methodology to organize and facilitate access to textual information. Our automatic classification approach and our semantic information extraction make it possible to find relevant information quickly. Specifically, this manuscript presents new text representations that facilitate processing for automatic classification. A partial generalization of textual data (the GenDesc approach) based on statistical and morphosyntactic criteria is proposed. Moreover, this thesis focuses on phrase construction and on the use of semantic information to improve document representation. We demonstrate through numerous experiments the relevance and genericity of our proposals, and that they improve classification results. Finally, as social networks are growing strongly, a method for the automatic generation of semantic hashtags is proposed. Our approach is based on statistical measures, semantic resources, and syntactic information. The generated hashtags can then be exploited for information-retrieval tasks on large volumes of data.
Charnois, Thierry. "Accès à l'information : vers une hybridation fouille de données et traitement automatique des langues". Habilitation à diriger des recherches, Université de Caen, 2011. http://tel.archives-ouvertes.fr/tel-00657919.
Roche, Mathieu. "Fouille de Textes : de l'extraction des descripteurs linguistiques à leur induction". Habilitation à diriger des recherches, Université Montpellier II - Sciences et Techniques du Languedoc, 2011. http://tel.archives-ouvertes.fr/tel-00816263.
Epure, Elena Viorica. "Modélisation automatique des conversations en tant que processus d'intentions de discours interdépendantes". Thesis, Paris 1, 2018. http://www.theses.fr/2018PA01E068/document.
The proliferation of digital data has enabled scientific and practitioner communities to create new data-driven technologies that learn about user behaviors in order to deliver better services and support to people in their digital experience. The majority of these technologies derive value mostly from data logs passively generated during human-computer interaction. A particularity of these behavioral traces is that they are structured. However, the pro-actively generated text across the Internet is highly unstructured and represents the overwhelming majority of behavioral traces. To date, despite its prevalence and the relevance of behavioral knowledge to many domains, such as recommender systems, cyber-security and social network analysis, digital text is still insufficiently exploited as a trace of human behavior that can automatically reveal extensive insights into behavior. The main objective of this thesis is to propose a corpus-independent method to automatically exploit asynchronous communication as pro-actively generated behavioral traces in order to discover process models of conversations, centered on comprehensive speech intentions and their relations. The solution is built in three iterations, following a design-science approach. Multiple original contributions are made. We conduct the only systematic study to date on the automatic modeling of asynchronous communication with speech intentions. A speech-intention taxonomy is derived from linguistics to model asynchronous communication; compared to all taxonomies from related work, it is corpus-independent and comprehensive (both finer-grained and exhaustive in the given context), and its application by non-experts is proven feasible through extensive experiments. A corpus-independent, automatic method to annotate utterances of asynchronous communication with the proposed speech-intention taxonomy is designed based on supervised machine learning.
For this, validated ground-truth corpora are created, and groups of features (discourse-, content- and conversation-related) are engineered for use by the classifiers. In particular, some of the discourse features are novel and defined by considering linguistic means of expressing speech intentions, without relying on the corpus's explicit content, its domain, or specificities of the asynchronous communication types. Then, an automatic method based on process mining is designed to generate process models of interrelated speech intentions from conversation turns annotated with multiple speech intentions per sentence. As process mining relies on well-defined, structured event logs, an algorithm to produce such logs from conversations is proposed. Additionally, an extensive design rationale on how conversations annotated with multiple labels per sentence can be transformed into event logs, and on the impact of different decisions on the output behavioral models, is released to support future research. Experiments and qualitative validations in medicine and conversation analysis show that the proposed solution yields reliable and relevant results, but limitations are also identified, to be addressed in future work.
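The event-log construction step described in this abstract can be sketched as follows, assuming a minimal multi-label annotation; the thread ids and intention labels are illustrative and do not reproduce the thesis's taxonomy.

```python
# Sketch of the event-log construction step: process-mining algorithms
# consume logs of (case, ordering, activity) events, so each annotated
# conversation turn is flattened into one event per speech intention.

def to_event_log(conversations):
    """conversations: {conversation_id: list of turns, where each turn
    is a list of speech-intention labels (multi-label annotation)}.
    Returns flat rows of (case_id, turn_index, intention)."""
    log = []
    for conv_id, turns in conversations.items():
        for index, intentions in enumerate(turns):
            for intention in intentions:
                log.append((conv_id, index, intention))
    return log

threads = {
    "thread-1": [["ASK"], ["ANSWER", "SUGGEST"], ["THANK"]],
    "thread-2": [["ASK"], ["CLARIFY"], ["ANSWER"]],
}
log = to_event_log(threads)
```

The thesis discusses several ways of ordering multiple labels within a turn; this sketch simply emits them in annotation order.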
Duthil, Benjamin. "De l'extraction des connaissances à la recommandation". Phd thesis, Montpellier 2, 2012. http://tel.archives-ouvertes.fr/tel-00771504.
Stavrianou, Anna. "Modeling and mining of Web discussions". Phd thesis, Université Lumière - Lyon II, 2010. http://tel.archives-ouvertes.fr/tel-00564764.
Valsamou, Dialekti. "Extraction d’Information pour les réseaux de régulation de la graine chez Arabidopsis Thaliana". Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLS027/document.
While information is abundant in the world, structured, ready-to-use information is rare. This work proposes Information Extraction (IE) as an efficient approach for producing structured, usable information on biology, by presenting a complete IE task on a model biological organism, Arabidopsis thaliana. Information Extraction is the process of extracting meaningful parts of text and identifying their semantic relations. In collaboration with experts on the plant A. thaliana, a knowledge model was conceived. The goal of this model is to provide a formal representation of the knowledge necessary to adequately describe the domain of seed development. This model contains all the essential entities and the relations between them, and it can be used directly by algorithms. In parallel, this model was tested and applied on a set of scientific articles of the domain. These documents constitute the corpus needed to train machine learning algorithms. The experts annotated the text using the entities and relations of the model. This corpus and this model are the first available for seed development and among very few on A. thaliana, despite the latter's importance in biology. The model manages to satisfy two needs: being complex enough to describe the domain well, while remaining general enough for machine learning. A relation extraction approach (AlvisRE) was also designed and developed. After entity recognition, the relation extractor detects the cases where the text mentions that two entities are in a relation, and identifies precisely to which type of the model these relations belong. AlvisRE's approach is based on textual similarity and uses all available types of information: lexical, syntactic and semantic. In the tests conducted, AlvisRE achieved results equivalent to, and sometimes better than, the state of the art.
Additionally, AlvisRE has the advantage of being modular and adaptive, since it uses semantic information that was produced automatically. This last feature allows us to expect similar performance in other domains.
Hoareau, Yann Vigile. "Occurrence du semblable et du différent : réflexion sur la modélisation de la sémantique à partir de la cognition et de la culture et de la fouille de texte". Paris 8, 2010. http://www.theses.fr/2010PA083817.
This thesis proposes a reflection on the process of iterating over similar and different episodes in both human and artificial cognition. This process has been identified as central by many researchers in psychology and artificial intelligence, such as Piaget, Bruner and Minsky. It is studied within the framework of text comprehension and text production, on the one hand, and within the framework of large-scale text categorization by artificial systems, on the other. The influence of cultural and linguistic proximity is studied in Réunion Island and in Kabylia, with the aim of identifying the cognitive processes involved in knowledge activation during text comprehension and production tasks. The modeling of semantic knowledge by semantic-space models such as LSA and Random Indexing is studied in the context of large-scale text categorization. The major contribution of this thesis is a cognitive model of text categorization based on representing textual categories at different levels of abstraction. This model, named Alida, is inspired by classical cognitive models of categorization. Alida was a finalist of the DEFT'09 text-mining evaluation campaign and a laureate of the National Contest of Business Projects in Innovative Technologies of the French Ministry for Research.
Poezevara, Guillaume. "Fouille de graphes pour la découverte de contrastes entre classes : application à l'estimation de la toxicité des molécules". Phd thesis, Université de Caen, 2011. http://tel.archives-ouvertes.fr/tel-01018425.
Zenasni, Sarah. "Extraction d'information spatiale à partir de données textuelles non-standards". Thesis, Montpellier, 2018. http://www.theses.fr/2018MONTS076/document.
The extraction of spatial information from textual data has become an important research topic in the field of Natural Language Processing (NLP). It meets a crucial need in the information society, in particular to improve the efficiency of Information Retrieval (IR) systems for different applications (tourism, spatial planning, opinion analysis, etc.). Such systems require a detailed analysis of the spatial information contained in the available textual data (web pages, e-mails, tweets, SMS, etc.). However, the multitude and variety of these data, as well as the regular emergence of new forms of writing, make the automatic extraction of information from such corpora difficult. To meet these challenges, we propose in this thesis new text mining approaches for automatically identifying variants of spatial entities and relations in the textual data of mediated communication. These approaches rest on three main contributions that provide intelligent navigation methods. Our first contribution focuses on the recognition and identification of spatial entities in corpora of short messages (SMS, tweets) characterized by weakly standardized writing. The second contribution is dedicated to the identification of new forms/variants of spatial relations in these specific corpora. Finally, the third contribution concerns the identification of the semantic relations associated with the textual spatial information.
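One simple way to match the weakly standardized spellings mentioned above against known place names is an edit-distance lookup in a gazetteer. This sketch illustrates the general technique only and is not the thesis's actual method; the gazetteer entries and the noisy token are invented.

```python
# Matching noisy surface forms (as found in SMS or tweets) against a
# gazetteer of known place names, using Levenshtein distance.

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

GAZETTEER = ["montpellier", "marseille", "toulouse"]  # illustrative

def match_variant(token, max_dist=2):
    """Return the closest gazetteer entry within max_dist edits, else None."""
    best = min(GAZETTEER, key=lambda place: edit_distance(token.lower(), place))
    return best if edit_distance(token.lower(), best) <= max_dist else None

place = match_variant("Montpeller")  # noisy SMS-style spelling
```

Real systems would combine such fuzzy matching with context, phonetic normalization and learned models, but the distance threshold captures the core idea of variant identification.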
Malherbe, Emmanuel. "Standardization of textual data for comprehensive job market analysis". Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLC058/document.
With so many job adverts and candidate profiles available online, e-recruitment constitutes a rich object of study. All this information is, however, textual data, which from a computational point of view is unstructured. The large number and heterogeneity of recruitment websites also mean that there are many vocabularies and nomenclatures. One of the difficulties when dealing with this type of raw textual data is grasping the concepts contained in it, which is the standardization problem tackled in this thesis. The aim of standardization is to create a unified process providing values in a nomenclature. A nomenclature is by definition a finite set of meaningful concepts, which means that the attributes resulting from standardization are a structured representation of the information. Several questions arise, however: Are the websites' structured data usable for a unified standardization? What structure of nomenclature is best suited for standardization, and how can it be leveraged? Is it possible to automatically build such a nomenclature from scratch, or to manage the standardization process without one? To illustrate the various obstacles of standardization, the examples we study include inferring the skills or the category of a job advert, or the level of training of a candidate profile. One of the challenges of e-recruitment is that the concepts are continuously evolving, which means that the standardization must keep up with job-market trends. In light of this, we propose a set of machine learning models that require minimal supervision and can easily adapt to the evolution of the nomenclatures. The questions raised found partial answers using Case-Based Reasoning, semi-supervised Learning-to-Rank, latent variable models, and the evolving sources of the semantic web and social media.
The different models proposed have been tested on real-world data before being implemented in an industrial environment. The resulting standardization is at the core of SmartSearch, a project which provides a comprehensive analysis of the job market.
Médoc, Nicolas. "A visual analytics approach for multi-resolution and multi-model analysis of text corpora : application to investigative journalism". Thesis, Sorbonne Paris Cité, 2017. http://www.theses.fr/2017USPCB042/document.
As the production of digital texts grows exponentially, a greater need to analyze text corpora arises in various domains of application, insofar as they constitute inexhaustible sources of shared information and knowledge. We therefore propose in this thesis a novel visual analytics approach for the analysis of text corpora, implemented for the real and concrete needs of investigative journalism. Motivated by the problems and tasks identified with a professional investigative journalist, visualizations and interactions are designed through a user-centered methodology involving the user throughout the development process. Specifically, investigative journalists formulate hypotheses and exhaustively explore the field under investigation in order to multiply the sources showing pieces of evidence related to their working hypothesis. Carrying out such tasks in a large corpus is, however, a daunting endeavor and requires visual analytics software addressing several challenging research issues covered in this thesis. First, the difficulty of making sense of a large text corpus lies in its unstructured nature. We resort to the Vector Space Model (VSM) and its strong relationship with the distributional hypothesis, leveraged by multiple text mining algorithms, to discover the latent semantic structure of the corpus. Topic models and biclustering methods are recognized to be well suited to the extraction of coarse-grained topics, i.e. groups of documents concerning similar topics, each one represented by a set of terms extracted from the textual contents. We provide a new Weighted Topic Map visualization that conveys a broad overview of coarse-grained topics: it allows quick interpretation of contents through multiple tag clouds while depicting the topical structure, such as the relative importance of topics and their semantic similarity.
Although the exploration of the coarse-grained topics helps locate topics of interest and their neighborhood, the identification of specific facts, viewpoints or angles related to events or stories requires a finer level of structure to represent topic variants. This nested structure, revealed by Bimax, a pattern-based overlapping biclustering algorithm, captures in biclusters the co-occurrences of terms shared by multiple documents and can disclose facts, viewpoints or angles related to events or stories. This thesis tackles issues related to the visualization of a large number of overlapping biclusters by organizing term-document biclusters in a hierarchy that limits term redundancy and conveys their commonalities and specificities. We evaluated the utility of our software through a usage scenario and a qualitative evaluation with an investigative journalist. In addition, the co-occurrence patterns of topic variants revealed by Bimax are determined by the enclosing topical structure supplied by the coarse-grained topic extraction method which is run beforehand. Nonetheless, little guidance exists regarding the choice of the latter method and its impact on the exploration and comprehension of topics and topic variants. We therefore conducted both a numerical experiment and a controlled user experiment to compare two topic extraction methods, namely Coclus, a disjoint biclustering method, and hierarchical Latent Dirichlet Allocation (hLDA), an overlapping probabilistic topic model. The theoretical foundations of both methods are systematically analyzed by relating them to the distributional hypothesis. The numerical experiment provides statistical evidence of the difference between the resulting topical structures of the two methods. The controlled experiment shows their impact on the comprehension of topics and topic variants, from the analyst's perspective. (...)
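The Vector Space Model and distributional-hypothesis machinery underlying the topic extraction step can be illustrated with a toy TF-IDF sketch; the documents and tokens are invented, and this is not the thesis's implementation.

```python
# Documents become TF-IDF vectors; cosine similarity between vectors
# approximates distributional (topical) similarity.
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [["topic", "model", "corpus"],
        ["topic", "model", "cluster"],
        ["recipe", "cooking", "oven"]]
vecs = tfidf_vectors(docs)
# The two text-mining documents are closer to each other than to the recipe.
similar = cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2])
```

Topic models and biclustering operate on exactly this kind of term-document weight matrix, grouping rows and columns that co-vary.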
Abdaoui, Amine. "Fouille des médias sociaux français : expertise et sentiment". Thesis, Montpellier, 2016. http://www.theses.fr/2016MONTT249/document.
Social media has changed the way we communicate between individuals, within organizations and within communities. The availability of these social data opens new opportunities to understand and influence user behavior, and Social Media Mining is therefore experiencing growing interest in various scientific and economic circles. In this thesis, we are specifically interested in the users of these networks, whom we try to characterize in two ways: (i) their expertise and reputation, and (ii) the sentiments they express. Conventionally, social data are often mined according to their network structure. However, the textual content of the exchanged messages may reveal additional knowledge that cannot be obtained through structural analysis alone. Until recently, the majority of work on analyzing textual content targeted English. The originality of this thesis is to develop methods and resources based on the textual content of messages for mining French social media. In the first axis, we first propose to predict user expertise. For this, we used forums that recruit health experts to learn classification models that identify messages posted by experts in any other health forum. We demonstrate that models learned on appropriate forums can be used effectively on other forums. We then focus on user reputation in these forums. The idea is to find expressions of trust and distrust in the textual content of the exchanged messages, to identify the recipients of these messages, and to use this information to infer users' reputations. We propose a new reputation measure that weights the score of each response by the reputation of its author. Automatic and manual evaluations have demonstrated the effectiveness of the proposed approach. In the second axis, we focus on the extraction of sentiments (emotions and polarity).
For this, we started by building a French lexicon of sentiments and emotions that we call FEEL (French Expanded Emotion Lexicon). This lexicon is built semi-automatically by translating and expanding its English counterpart, NRC EmoLex. We then compare FEEL with existing French lexicons from the literature on reference benchmarks. The results show that FEEL improves the classification of French texts according to their polarities and emotions. Finally, we evaluate different features, methods and resources for sentiment classification in French. The conducted experiments identify features and methods that are useful for classifying sentiments in different types of texts, and the learned systems prove particularly effective on reference benchmarks. More generally, this work opens promising perspectives on various analytical tasks of Social Media Mining, including: (i) combining multiple sources when mining social media users; (ii) multi-modal Social Media Mining using not just text but also images, videos, locations, etc.; and (iii) multilingual sentiment analysis.
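A lexicon such as FEEL supports simple polarity classification by score summation; the mini-lexicon below is a hypothetical stand-in, not an excerpt from FEEL, and real lexicon-based systems also handle negation, intensifiers and emotion categories.

```python
# Minimal lexicon-based polarity classifier. Entries are illustrative
# French words with hand-assigned polarity scores, not FEEL data.

LEXICON = {
    "excellent": 1, "agréable": 1, "aimer": 1,
    "horrible": -1, "décevant": -1, "détester": -1,
}

def polarity(tokens):
    """Sum lexicon scores over the tokens; the sign of the sum gives
    the predicted polarity of the message."""
    score = sum(LEXICON.get(t, 0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

label = polarity(["film", "excellent", "agréable"])
```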
Bigeard, Elise. "Détection et analyse de la non-adhérence médicamenteuse dans les réseaux sociaux". Thesis, Lille 3, 2019. http://www.theses.fr/2019LIL3H026.
Drug non-compliance refers to situations where the patient does not follow the instructions of medical authorities when taking medication. Such situations include taking too much (overuse) or too little (underuse) of a medication, drinking contraindicated alcohol, or making a suicide attempt using medication. According to [HAYNES 2002], increasing drug compliance may have a bigger impact on public health than any other medical improvement. However, non-compliance data are difficult to obtain, since non-adherent patients are unlikely to report their behaviour to their healthcare providers. This is why we use data from social media to study drug non-compliance. Our study is applied to French-speaking forums. First, we collect a corpus of messages written by users of medical forums. We build vocabularies of medication and disorder names as used by patients, and use these vocabularies to index medications and disorders in the corpus. Then we use supervised learning and information retrieval methods to detect messages talking about non-compliance. With machine learning, we obtain a 0.433 F-measure, with up to 0.421 precision or 0.610 recall. With information retrieval, we reach 0.8 precision on the first ten results. After that, we study the content of the non-compliance messages. We identify various non-compliance situations and patients' motivations, with three main motivations: self-medication, seeking an effect other than the one the medication was prescribed for, or being in a situation of addiction or habituation. Self-medication is an umbrella term for several situations: avoiding an adverse effect, adjusting the medication's effect, underusing a medication seen as useless, or making decisions without a doctor's advice.
Non-compliance can also occur through error or carelessness, without any particular motivation. Our work provides several kinds of results: a corpus annotated with non-compliance messages, a classifier for the detection of non-compliance messages, a typology of non-compliance situations, and an analysis of the causes of non-compliance.
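The precision, recall and F-measure figures reported in this abstract follow the standard definitions, sketched here; the confusion counts below are invented for illustration and do not reproduce the thesis's evaluation.

```python
# Standard evaluation metrics for a binary detection task such as
# spotting non-compliance messages.

def precision_recall_f1(tp, fp, fn):
    """tp: true positives, fp: false positives, fn: false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g. 61 non-compliance messages found out of 100 true ones,
# with 84 false alarms (counts are illustrative only)
p, r, f = precision_recall_f1(tp=61, fp=84, fn=39)
```

Note that a single classifier configuration fixes all three values at once; the "up to" phrasing in the abstract suggests different configurations traded precision against recall.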
Doucet, Antoine. "Extraction, Exploitation and Evaluation of Document-based Knowledge". Habilitation à diriger des recherches, Université de Caen, 2012. http://tel.archives-ouvertes.fr/tel-01070505.
Saad, Motaz. "Fouille de documents et d'opinions multilingue". Thesis, Université de Lorraine, 2015. http://www.theses.fr/2015LORR0003/document.
The aim of this thesis is to study sentiments in comparable documents. First, we collect English, French and Arabic comparable corpora from Wikipedia and Euronews, and we align each corpus at the document level. We further gather English-Arabic news documents from local and foreign news agencies: the English documents are collected from the BBC website and the Arabic documents from the Al Jazeera website. Second, we present a cross-lingual document similarity measure to automatically retrieve and align comparable documents. We then propose a cross-lingual sentiment annotation method to label source and target documents with sentiments. Finally, we use statistical measures to compare the agreement of sentiments between the source and target documents of each comparable pair. The methods presented in this thesis are language-independent and can be applied to any language pair.
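One statistical agreement measure that fits this setup is Cohen's kappa over the sentiment labels of aligned source/target document pairs; the choice of kappa and the toy labels are assumptions for illustration, not necessarily the measures used in the thesis.

```python
# Cohen's kappa: chance-corrected agreement between two label sequences,
# here the sentiment labels of aligned source and target documents.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labeled at random with
    # their observed label frequencies.
    expected = sum(count_a[l] * count_b[l] for l in count_a) / n ** 2
    return (observed - expected) / (1 - expected)

src = ["pos", "neg", "pos", "neu", "neg"]  # toy source-side labels
tgt = ["pos", "neg", "neg", "neu", "neg"]  # toy target-side labels
kappa = cohen_kappa(src, tgt)  # 1.0 would mean perfect agreement
```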
Valentin, Sarah. "Extraction et combinaison d’informations épidémiologiques à partir de sources informelles pour la veille des maladies infectieuses animales". Thesis, Montpellier, 2020. http://www.theses.fr/2020MONTS067.
Pełny tekst źródłaEpidemic intelligence aims to detect, investigate and monitor potential health threats while relying on formal (e.g. official health authorities) and informal (e.g. media) information sources. Monitoring of unofficial sources, so-called event-based surveillance (EBS), requires the development of systems designed to retrieve and process unstructured textual data published online. This manuscript focuses on the extraction and combination of epidemiological information from informal sources (i.e. online news), in the context of the international surveillance of animal infectious diseases. The first objective of this thesis is to propose and compare approaches to enhance the identification and extraction of relevant epidemiological information from the content of online news. The second objective is to study the use of epidemiological entities extracted from news articles (i.e. diseases, hosts, locations and dates) for event extraction and the retrieval of related online news. This manuscript proposes new textual representation approaches based on selecting, expanding, and combining relevant epidemiological features. We show that adapting and extending text mining and classification methods improves the added value of online news sources for event-based surveillance. We stress the role of domain expert knowledge for the relevance and interpretability of the methods proposed in this thesis. While our research is conducted in the context of animal disease surveillance, we discuss how our approaches generalize to unknown threats and One Health surveillance
Guibon, Dinabyll. "Recommandation automatique et adaptative d'émojis". Electronic Thesis or Diss., Aix-Marseille, 2019. http://www.theses.fr/2019AIXM0202.
Pełny tekst źródłaThe first emojis were created in 1999. Since then, their popularity has risen constantly in communication systems. Being images representing an idea, a concept, or an emotion, emojis are available to users in multiple software contexts: instant messaging, emails, forums, and other types of social media. Their usage grew constantly and, with the constant addition of new emojis, there have been more than 2,789 standard emojis since winter 2018. To access a specific emoji, scrolling through huge emoji libraries or using an emoji search engine is not enough to maximize their usage and diversity: an emoji recommendation system is required. To answer this need, we present our research on emoji recommendation. The objective is to create an emoji recommender system adapted to a private and informal conversational context. This system must enhance the user experience and the communication quality, and take into account possible new emerging emojis. Our first contribution is to show the limits of emoji prediction for real usage, and to demonstrate the need for a more global recommendation. We also verify the correlation between the real usage of emojis representing facial expressions and a related theory on facial expressions. We also tackle the evaluation of this system, with the limits of the metrics and the importance of a dedicated user interface. The approach is based on supervised and unsupervised machine learning, combined with language models. Several parts of this work were published in national and international conferences, including a best software award and a best poster award in a social media track
Kou, Huaizhong. "Génération d'adaptateurs web intelligents à l'aide de techniques de fouilles de texte". Versailles-St Quentin en Yvelines, 2003. http://www.theses.fr/2003VERS0011.
Pełny tekst źródłaThis thesis defines a system framework for semantically integrating Web information, called SEWISE. It can integrate text information from various Web sources belonging to an application domain into a common domain-specific concept ontology. In SEWISE, Web wrappers are built around different Web sites to automatically extract the information of interest from them. Text mining technologies are then used to discover the semantics that Web documents talk about. SEWISE can ease topic-oriented information search over the Web. Three problems related to document categorization are studied. First, we investigate approaches to feature selection and propose two approaches, CBA and IBA, to select features. To estimate statistical term associations and integrate them into a document similarity model, a mathematical model is proposed. Finally, the category score calculation algorithms used by k-NN classifiers are studied, and two weighted algorithms, CBW and IBW, are proposed to calculate category scores
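The abstract mentions category score calculation for k-NN text classifiers; the proposed CBW and IBW algorithms are not detailed here, so the following is only a generic similarity-weighted k-NN scoring sketch, not the thesis's algorithms:

```python
def knn_category_scores(similarities, labels, k=3):
    """Similarity-weighted k-NN category scoring (generic sketch).

    similarities: list of (neighbor_id, similarity) pairs for one
    test document. labels: dict neighbor_id -> category.
    The score of a category is the sum of the similarities of its
    members among the k nearest neighbors; the test document is
    assigned to the highest-scoring category.
    """
    top = sorted(similarities, key=lambda p: p[1], reverse=True)[:k]
    scores = {}
    for doc_id, sim in top:
        cat = labels[doc_id]
        scores[cat] = scores.get(cat, 0.0) + sim
    return scores
```

Weighting by similarity (rather than a plain majority vote) lets one very close neighbor outweigh several distant ones, which is typically what weighted category-score variants aim for.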
Ait, Saada Mira. "Unsupervised learning from textual data with neural text representations". Electronic Thesis or Diss., Université Paris Cité, 2023. http://www.theses.fr/2023UNIP7122.
Pełny tekst źródłaThe digital era generates enormous amounts of unstructured data such as images and documents, requiring specific processing methods to extract value from them. Textual data presents an additional challenge as it does not contain numerical values. Word embeddings are techniques that transform text into numerical data, enabling machine learning algorithms to process it. Unsupervised tasks are a major challenge in industry as they allow value creation from large amounts of data without requiring costly manual labeling. In this thesis we explore the use of Transformer models for unsupervised tasks such as clustering, anomaly detection, and data visualization. We also propose methodologies to better exploit multi-layer Transformer models in an unsupervised context, improving the quality and robustness of document clustering while avoiding the need to choose which layer to use and the number of classes. Additionally, we investigate Transformer language models and their application to clustering more deeply, examining in particular transfer learning methods that fine-tune pre-trained models on a different task to improve their quality for future tasks. We demonstrate through an empirical study that post-processing methods based on dimensionality reduction are more advantageous than the fine-tuning strategies proposed in the literature. Finally, we propose a framework for detecting text anomalies in French adapted to two cases: one where the data concerns a specific topic and the other where the data has multiple sub-topics. In both cases, we obtain results superior to the state of the art with significantly lower computation time
Al-Natsheh, Hussein. "Text Mining Approaches for Semantic Similarity Exploration and Metadata Enrichment of Scientific Digital Libraries". Thesis, Lyon, 2019. http://www.theses.fr/2019LYSE2062.
Pełny tekst źródłaFor scientists and researchers, it is critical to ensure that knowledge is accessible for re-use and development. Moreover, the way we store and manage scientific articles and their metadata in digital libraries determines the number of relevant articles we can discover and access depending on what is actually meant in a search query. Yet, are we able to explore all semantically relevant scientific documents with the existing keyword-based information retrieval systems? This is the primary question addressed in this thesis. Hence, the main purpose of our work is to broaden the knowledge spectrum of researchers working in an interdisciplinary domain when they use the information retrieval systems of multidisciplinary digital libraries. The problem arises when such researchers use community-dependent search keywords while other scientific names for the relevant concepts are used in a different research community. Towards proposing a solution to this semantic exploration task in multidisciplinary digital libraries, we applied several text mining approaches. First, we studied the semantic representation of words, sentences, paragraphs and documents for better semantic similarity estimation. In addition, we utilized the semantic information of words in lexical databases and knowledge graphs to enhance our semantic approach. Furthermore, the thesis presents a couple of use-case implementations of our proposed model
Elleuch, Marwa. "Business process discovery from emails, a first step towards business process management in less structured information systems". Electronic Thesis or Diss., Institut polytechnique de Paris, 2021. http://www.theses.fr/2021IPPAS014.
Pełny tekst źródłaProcess discovery aims at analysing the execution logs of information systems (IS), used when performing business activities, in order to discover business process (BP) knowledge. Significant research has been conducted in this area. However, it generally assumes that these execution logs are of high or middle maturity w.r.t. BP discovery. This means that (i) they are composed of structured records, each capturing evidence of one activity execution, and (ii) part of the events' attributes (e.g. activity name, timestamp) are explicitly included in these records, which facilitates their inference. Nevertheless, a BP can be entirely or partially performed through less structured IS that generate execution logs of low maturity. More precisely, emailing systems are widely used as an alternative tool to collaboratively perform BP tasks. Traditional BP discovery techniques cannot be applied, or at least not directly, due to the unstructured nature of email log data. Recently, there have been several initiatives to extend the scope of BP discovery to email logs. However, most of them (i) largely require human intervention, and (ii) are limited to BP discovery according to its behavioral perspective. In this thesis, we propose to discover BP fragments from email logs w.r.t. their functional, data, organizational and behavioral perspectives. We first formalize these perspectives considering the specificities of emailing systems. We introduce the notion of actors' contributions towards performing activities to enrich the organizational and behavioral perspectives. We additionally consider the informational entities manipulated by BP activities to describe the data perspective. To automate their discovery, we introduce a completely unsupervised approach, which transforms the unstructured email log into a structured event log before mining it to discover BP w.r.t. multiple perspectives.
We introduce in this context several algorithmic solutions for: (i) unsupervised learning of activities based on discovering frequent patterns of words in emails, (ii) discovering activity occurrences in emails to capture event attributes, (iii) discovering speech acts of activity occurrences to recognize the sender's purpose in including activities in emails, (iv) overlapping clustering of activities to discover their manipulated artifacts (i.e. informational entities), and (v) mining sequencing constraints between event types to discover the BP behavioral perspective. We validated our approach using emails from the public Enron dataset to show the effectiveness of the obtained results. We publicly provide these results to ensure reproducibility in the studied area. We finally show the usefulness of our results for improving BPM through two potential applications: (i) a BP discovery & recommendation tool to be integrated into emailing systems, and (ii) CRM data analysis for mining the reasons for users' satisfaction or dissatisfaction
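Step (i) above, unsupervised learning of activities from frequent word patterns, can be illustrated with a naive frequent-itemset sketch; the thesis's actual algorithm is not reproduced here, and `min_support` and `max_size` are illustrative parameters:

```python
from itertools import combinations
from collections import Counter

def frequent_word_patterns(emails, min_support=2, max_size=3):
    """Naive frequent-pattern mining over email word sets (a sketch).

    emails: list of token lists, one per email. Counts every
    combination of up to max_size distinct words inside each email
    and keeps those occurring in at least min_support emails.
    A recurring pattern such as {"approve", "purchase", "order"}
    can hint at a recurring business activity.
    """
    counts = Counter()
    for words in emails:
        unique = sorted(set(words))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return {pattern: n for pattern, n in counts.items() if n >= min_support}
```

This brute-force enumeration is exponential in email vocabulary size; real pattern miners (Apriori, FP-growth) prune candidates by support, but the output notion — word sets shared by enough emails — is the same.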
Séguéla, Julie. "Fouille de données textuelles et systèmes de recommandation appliqués aux offres d'emploi diffusées sur le web". Thesis, Paris, CNAM, 2012. http://www.theses.fr/2012CNAM0801/document.
Pełny tekst źródłaIn recent years, the expansion of e-recruitment has led to the multiplication of web channels dedicated to job postings. In an economic context where cost control is fundamental, assessing and comparing the performance of recruitment channels has become necessary. The purpose of this work is to develop a decision-making tool intended to guide recruiters while they are posting a job on the Internet. This tool provides recruiters with the expected performance on job boards for a given job offer. First, we identify the potential predictors of a recruiting campaign's performance. Then, we apply text mining techniques to the job offer texts in order to structure postings and to extract information relevant to improving their description in a predictive model. The job offer performance prediction algorithm is based on a hybrid recommender system suitable for the cold-start problem. The hybrid system, based on a supervised similarity measure, outperforms standard multivariate models. Our experiments are conducted on a real dataset coming from a job posting database
Albeiriss, Baian. "Etude terminologique de la chimie en arabe dans une approche de fouille de textes". Thesis, Lyon, 2018. http://www.theses.fr/2018LYSE2057/document.
Pełny tekst źródłaDespite the importance of an international nomenclature, the field of chemistry still suffers from some linguistic problems, linked in particular to its simple and complex terminological units, which can hinder scientific communication. Arabic is no exception, especially since its agglutinative spelling, generally not vowelized, may lead to enormous ambiguity problems, in addition to the recurring use of borrowings. The problem is how to represent the simple and complex terminological units of this specialized language; in other words, how to formalize the terminological characteristics by studying the mechanisms of the morphosyntactic construction of chemistry terms in Arabic. This study should lead to a semantic-disambiguation tool for extracting the terms of Arabic chemistry and their relationships. A relevant search in Arabic cannot be done without an automated language processing system; this automatic processing of a corpus written in Arabic cannot be done without a linguistic analysis; and this linguistic analysis, more exactly this terminological study, is the basis for building the rules of an identification grammar in order to identify chemistry terms in Arabic. The construction of this identification grammar requires the modelling of morphosyntactic patterns from their observation in corpus and leads to the definition of grammar rules and constraints
Béchet, Nicolas. "Extraction et regroupement de descripteurs morpho-syntaxiques pour des processus de Fouille de Textes". Phd thesis, Université Montpellier II - Sciences et Techniques du Languedoc, 2009. http://tel.archives-ouvertes.fr/tel-00462206.
Pełny tekst źródłaBéchet, Nicolas. "Extraction et regroupement de descripteurs morpho-syntaxiques pour des processus de Fouille de Textes". Phd thesis, Montpellier 2, 2009. http://www.theses.fr/2009MON20222.
Pełny tekst źródłaSéguéla, Julie. "Fouille de données textuelles et systèmes de recommandation appliqués aux offres d'emploi diffusées sur le web". Electronic Thesis or Diss., Paris, CNAM, 2012. http://www.theses.fr/2012CNAM0801.
Pełny tekst źródłaIn recent years, the expansion of e-recruitment has led to the multiplication of web channels dedicated to job postings. In an economic context where cost control is fundamental, assessing and comparing the performance of recruitment channels has become necessary. The purpose of this work is to develop a decision-making tool intended to guide recruiters while they are posting a job on the Internet. This tool provides recruiters with the expected performance on job boards for a given job offer. First, we identify the potential predictors of a recruiting campaign's performance. Then, we apply text mining techniques to the job offer texts in order to structure postings and to extract information relevant to improving their description in a predictive model. The job offer performance prediction algorithm is based on a hybrid recommender system suitable for the cold-start problem. The hybrid system, based on a supervised similarity measure, outperforms standard multivariate models. Our experiments are conducted on a real dataset coming from a job posting database
MacMurray, Erin. "Discours de presse et veille stratégique d'évènements. Approche textométrique et extraction d'informations pour la fouille de textes". Thesis, Paris 3, 2012. http://www.theses.fr/2012PA030083/document.
Pełny tekst źródłaThis research demonstrates two text mining methods for strategic monitoring purposes: information extraction and Textometry. In strategic monitoring, text mining is used to automatically obtain information on the activities of corporations. For this objective, information extraction identifies and labels units of information, named entities (companies, places, people), which then constitute entry points for the analysis of economic activities or events. These include mergers, bankruptcies, partnerships, etc., involving the corresponding corporations. A Textometric method, by contrast, uses several statistical models to study the distribution of words in large corpora, with the goal of shedding light on significant characteristics of the textual data. In this research, Textometry, an approach traditionally considered incompatible with information extraction methods, is applied to the same corpus as an information extraction procedure in order to obtain information on economic events. Several textometric analyses (characteristic elements, co-occurrences) are examined on a corpus of online news feeds. The results are then compared to those produced by the information extraction procedure. The two approaches contribute differently to processing textual data, producing complementary analyses of the corpus. Following the comparison, this research presents the advantages of these two text mining methods for the strategic monitoring of current events
Toussaint, Yannick. "Fouille de textes : des méthodes symboliques pour la construction d'ontologies et l'annotation sémantique guidée par les connaissances". Habilitation à diriger des recherches, Université Henri Poincaré - Nancy I, 2011. http://tel.archives-ouvertes.fr/tel-00764162.
Pełny tekst źródłaErin, Macmurray. "Discours de presse et veille stratégique d'événements Approche textométrique et extraction d'informations pour la fouille de textes". Phd thesis, Université de la Sorbonne nouvelle - Paris III, 2012. http://tel.archives-ouvertes.fr/tel-00740601.
Pełny tekst źródłaEl, Aouad Sara. "Personalized, Aspect-based Summarization of Movie Reviews". Electronic Thesis or Diss., Sorbonne université, 2019. https://accesdistant.sorbonne-universite.fr/login?url=https://theses-intra.sorbonne-universite.fr/2019SORUS019.pdf.
Pełny tekst źródłaOnline reviewing websites help users decide what to buy or where to go. These platforms allow users to express their opinions using numerical ratings as well as textual comments. The numerical ratings give a coarse idea of the service. Textual comments, on the other hand, give full details, which are tedious for users to read. In this dissertation, we develop novel methods and algorithms to generate personalized, aspect-based summaries of movie reviews for a given user. The first problem we tackle is extracting a set of words related to an aspect from movie reviews. Our evaluation shows that our method is able to extract even unpopular terms that represent an aspect, such as compound terms or abbreviations, as opposed to the methods from the related work. We then study the problem of annotating sentences with aspects, and propose a new method that annotates sentences based on a similarity between the aspect signature and the terms in the sentence. The third problem we tackle is the generation of personalized, aspect-based summaries. We propose an optimization algorithm to maximize the coverage of the aspects the user is interested in and the representativeness of sentences in the summary, subject to length and similarity constraints. Finally, we perform three user studies showing that our approach outperforms the state-of-the-art method for generating summaries
Roche, Mathieu. "Intégration de la construction de la terminologie de domaines spécialisés dans un processus global de fouille de textes". Paris 11, 2004. http://www.theses.fr/2004PA112330.
Pełny tekst źródłaInformation extraction from specialized texts requires the application of a complete text mining process. One of the steps of this process is term detection. Terms are defined as groups of words representing a linguistic instance of some user-defined concept; for example, the term "data mining" evokes the concept of "computational technique". Initially, the task of terminology acquisition consists in extracting groups of words instantiating simple syntactic patterns such as Noun-Noun, Adjective-Noun, etc. One specificity of our algorithm is its iterative mode, used to build complex terms: for example, if the Noun-Noun term "data mining" is found at the first iteration, the term "data-mining application" can be obtained at the following step. Moreover, with EXIT (Iterative EXtraction of the Terminology), the expert stands at the center of the terminology extraction process and can intervene throughout. In addition to the iterative aspect of the system, many parameters were added; one of them makes it possible to use various statistical criteria to rank the terms according to their relevance for the task at hand. Our approach was validated on four corpora differing in language, size and field of specialty. Lastly, a method based on supervised machine learning is proposed in order to improve the quality of the obtained terminology
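The iterative construction of complex terms described above ("data mining", then "data-mining application") can be sketched as follows; this is a simplified illustration of the iterative idea, not the EXIT system itself, and the coarse POS tags are assumptions:

```python
def iterative_terms(tagged, iterations=2):
    """Iterative candidate-term extraction (a simplified sketch).

    tagged: list of (token, pos) pairs, with pos in {"N", "ADJ", ...}.
    Iteration 1 collects Noun-Noun and Adj-Noun bigrams; each later
    iteration extends a previously found term with an adjacent noun,
    e.g. "data mining" -> "data mining application".
    """
    terms = set()
    # spans of current candidate terms, as (start index, length)
    spans = [(i, 2) for i in range(len(tagged) - 1)
             if tagged[i][1] in ("N", "ADJ") and tagged[i + 1][1] == "N"]
    for _ in range(iterations):
        new_spans = []
        for start, length in spans:
            terms.add(" ".join(tok for tok, _ in tagged[start:start + length]))
            nxt = start + length
            if nxt < len(tagged) and tagged[nxt][1] == "N":
                new_spans.append((start, length + 1))
        spans = new_spans
    return terms
```

In the full system, an expert would inspect and filter the candidates between iterations, and statistical criteria would rank them; here every candidate is simply kept.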
Zhang, Lei. "Analyse automatique d'opinion : problématique de l'intensité et de la négation pour l'application à un corpus journalistique". Phd thesis, Université de Caen, 2012. http://tel.archives-ouvertes.fr/tel-00777603.
Pełny tekst źródłaZerida, Nadia. "Apport de la combinaison des connaissances structuro-linguistiques et de la fouille de textes pour la catégorisation de documents". Paris 8, 2009. http://www.theses.fr/2009PA083147.
Pełny tekst źródłaThis thesis lies at the difficult intersection of linguistics and computer science. More precisely, we aim to demonstrate the value of simultaneously considering document structure and linguistic knowledge for the classification of documents according to their style. For this, we defined new descriptors which, combined with linguistic descriptors exploiting the hierarchy of the text, are relevant for characterizing types of documents. We then proposed a classification method based on the non-presence of patterns in the documents. One of the original aspects of our work is to combine linguistic and machine learning methods with local pattern search techniques. Assumptions giving priority to descriptors related to the structure of documents, with a relativization of the lexicon, are considered. These assumptions exploit a hierarchy of textual units, for which we defined a prioritization strategy over a set of hybrid multi-scale descriptors. This hierarchy represents the logical structure of the document, based on the principle that different observation windows correspond to different types of information. These windows are interconnected through the concept of context inheritance in order to preserve the global coherence of the document. On the other hand, assumptions related to the categorization task have emerged, such as the exploitation of the total or partial absence of patterns under certain constraints, which can be used to build new analogies for the categorization of documents. Then, by analyzing patterns with low or zero frequencies, a new approach of categorization by exclusion-inclusion was proposed, introducing new concepts such as exclusive patterns
Dang, Qinran. "Brouillard de pollution en Chine. Analyse sémantique différentielle de corpus institutionnels, médiatiques et de microblogues". Thesis, Paris, INALCO, 2020. http://www.theses.fr/2020INAL0009.
Pełny tekst źródłaAir pollution has increasingly become a serious problem in China; more and more journalistic articles and microblogs (weibo in Chinese, equivalent to tweets), coming from governmental or media websites, social networks, blogs and forums, discuss the issue of «雾霾» (wumai in Chinese, meaning smog) in China from several angles: political, ecological, economic, sociological, health, etc. The semantics of the themes addressed in these texts differ significantly from each other according to their textual genre. In the framework of our research, our objective is twofold: on the one hand, to identify the different themes of a purpose-built digital corpus relating to wumai; and on the other hand, to interpret the semantics of these themes differentially. First, we collect textual data written in Chinese and related to wumai. These journalistic articles and weibo, deriving from three traditional Chinese media and one social network, are divided into four genres of sub-corpus. Second, we constitute our corpus through a series of data processing steps: data cleaning, word segmentation, normalization, POS tagging, benchmarking and data organization. We study the characteristics of the four genres of sub-corpus through a series of discriminating variables (hyperstructural, lexical, semiotic, rhetorical, modal and syntactic) distributed at the infratextual and intratextual levels. After that, based on the characteristics of each textual genre, we identify the main themes exposed in each genre of sub-corpus, and analyze the semantics of these identified themes in a contrastive way. Our analysis results are interpreted from two angles: quantitative and qualitative. All statistical analyses are assisted by textometric tools, and the semantic interpretations build on several fundamental concepts of Interpretative Semantics (Sémantique interprétative) proposed by Rastier (1987)
Deschênes, Louis-Georges. "La maladie dans la Bible hébraïque à la lumière des textes d'Ougarit". Sherbrooke : Université de Sherbrooke, 2000.
Znajdź pełny tekst źródłaBoukhaled, Mohamed Amine. "On Computational Stylistics : mining Literary Texts for the Extraction of Characterizing Stylistic Patterns". Thesis, Paris 6, 2016. http://www.theses.fr/2016PA066517/document.
Pełny tekst źródłaThe present thesis locates itself in the interdisciplinary field of computational stylistics, namely the application of statistical and computational methods to the study of literary style. Historically, most work in computational stylistics has focused on lexical aspects, especially in the early decades of the discipline. In this thesis, however, our focus is on the syntactic aspect of style, which is much harder to capture and to analyze given its abstract nature. As our main contribution, we work on an approach to the computational stylistic study of classic French literary texts based on a hermeneutic point of view, in which interesting linguistic patterns are discovered without any prior knowledge. More concretely, we focus on the development and extraction of complex yet computationally feasible stylistic features that are linguistically motivated, namely morpho-syntactic patterns. Following the hermeneutic line of thought, we propose a knowledge discovery process for stylistic characterization with an emphasis on the syntactic dimension of style, extracting relevant patterns from a given text. This knowledge discovery process consists of two main steps: a sequential pattern mining step followed by the application of interestingness measures. In particular, the extraction of all possible syntactic patterns of a given length is proposed as a particularly useful way to extract interesting features in an exploratory scenario. We propose three interestingness measures, each based on a different theoretical linguistic and statistical background, and carry out an experimental evaluation reporting results for each
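The two-step process described above (mining all syntactic patterns of a given length, then ranking them by interestingness) can be illustrated with a toy sketch over POS tags; the lift-style score used here is a generic proxy of our own, not one of the thesis's three measures:

```python
from collections import Counter

def pos_ngram_interest(tags, n=3):
    """Extract POS n-grams and score each by a simple lift measure.

    lift = observed n-gram probability / product of its unigram
    probabilities. Values well above 1 flag tag sequences occurring
    more often than chance, a (very rough) interestingness proxy.
    """
    unigrams = Counter(tags)
    total = len(tags)
    ngrams = Counter(tuple(tags[i:i + n]) for i in range(total - n + 1))
    n_total = sum(ngrams.values())
    scores = {}
    for gram, count in ngrams.items():
        expected = 1.0
        for t in gram:
            expected *= unigrams[t] / total
        scores[gram] = (count / n_total) / expected
    return scores
```

In an authorship or style study, the highest-scoring patterns for one author but not another would be the candidate "characterizing stylistic patterns" handed to a human interpreter.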
Ahmia, Oussama. "Veille stratégique assistée sur des bases de données d’appels d’offres par traitement automatique de la langue naturelle et fouille de textes". Thesis, Lorient, 2020. http://www.theses.fr/2020LORIS555.
Pełny tekst źródłaThis thesis, carried out within the framework of a CIFRE contract with the OctopusMind company, focuses on developing a set of automated tools dedicated to and optimized for assisting the processing of call-for-tender databases, for the purpose of strategic intelligence monitoring. Our contribution is divided into three chapters. The first chapter describes the construction of a partially comparable multilingual corpus built from the European calls for tender published by TED (Tenders Electronic Daily); it contains more than 2 million documents translated into 24 languages, published over the last 9 years. The second chapter presents a study of word, sentence and document embeddings capable of capturing semantic features at different scales. We propose two approaches: the first combines a word embedding (word2vec) with latent semantic analysis (LSA); the second is based on a novel artificial neural network architecture with two-level convolutional attention mechanisms. These embedding methods are evaluated on text classification and clustering tasks. The third chapter concerns the extraction of semantic relationships in calls for tender, in particular linking buildings to areas, lots to budgets, and so on. The supervised approaches developed in this part of the thesis are essentially based on Conditional Random Fields. The end of the third chapter concerns the applied aspect, in particular the implementation of solutions deployed within OctopusMind's software environment, including information extraction, a recommender system, and the combination of these modules to solve more complex problems
Nguyen, Tuan Dang. "Extraction d'information `a partir de documents Web multilingues : une approche d'analyses structurelles". Phd thesis, Université de Caen, 2006. http://tel.archives-ouvertes.fr/tel-00258948.
Pełny tekst źródłaFili, Abdallah Bazzana André. "Des textes aux tessons la céramique médiévale de l'Occident musulman à travers le corpus mérinide de Fès (Maroc, XIVe siècle) /". Lyon : Université Lumière Lyon 2, 2001. http://theses.univ-lyon2.fr/sdx/theses/lyon2/2001/fili_a.
Pełny tekst źródłaStavrianopoulou, Eftychia. "Untersuchungen zur Struktur des Reiches von Pylos : die Stellung der Ortschaften im Lichte der Linear B-Texte /". Partille : P. Ǻströms, 1989. http://catalogue.bnf.fr/ark:/12148/cb388940220.
Pełny tekst źródłaBoukhaled, Mohamed Amine. "On Computational Stylistics : mining Literary Texts for the Extraction of Characterizing Stylistic Patterns". Electronic Thesis or Diss., Paris 6, 2016. http://www.theses.fr/2016PA066517.
Pełny tekst źródłaThe present thesis locates itself in the interdisciplinary field of computational stylistics, namely the application of statistical and computational methods to the study of literary style. Historically, most work in computational stylistics has focused on lexical aspects, especially in the early decades of the discipline. In this thesis, however, our focus is on the syntactic aspect of style, which is much harder to capture and to analyze given its abstract nature. As our main contribution, we work on an approach to the computational stylistic study of classic French literary texts based on a hermeneutic point of view, in which interesting linguistic patterns are discovered without any prior knowledge. More concretely, we focus on the development and extraction of complex yet computationally feasible stylistic features that are linguistically motivated, namely morpho-syntactic patterns. Following the hermeneutic line of thought, we propose a knowledge discovery process for stylistic characterization with an emphasis on the syntactic dimension of style, extracting relevant patterns from a given text. This knowledge discovery process consists of two main steps: a sequential pattern mining step followed by the application of interestingness measures. In particular, the extraction of all possible syntactic patterns of a given length is proposed as a particularly useful way to extract interesting features in an exploratory scenario. We propose three interestingness measures, each based on a different theoretical linguistic and statistical background, and carry out an experimental evaluation reporting results for each
Ramiandrisoa, Iarivony. "Extraction et fouille de données textuelles : application à la détection de la dépression, de l'anorexie et de l'agressivité dans les réseaux sociaux". Thesis, Toulouse 3, 2020. http://www.theses.fr/2020TOU30191.
Pełny tekst źródła
Our research mainly focuses on tasks with an applied purpose: depression and anorexia detection on the one hand, and aggression detection on the other, based on messages posted by users on a social media platform. We have also proposed an unsupervised method for keyphrase extraction. These three pieces of work were initiated at different times during this thesis. Our first contribution concerns the automatic extraction of keyphrases from scientific documents and news articles. More precisely, we improve an unsupervised graph-based method, addressing the weaknesses of graph-based methods by combining existing solutions. We evaluated our approach on eleven data collections: five containing long documents, four containing short documents, and two containing news articles. We have shown that our proposal improves the results in certain contexts. The second contribution of this thesis is a solution for early detection of depression and anorexia. We proposed models that use classical classifiers, namely logistic regression and random forests, based on (a) features and (b) sentence embeddings. We evaluated our models on the eRisk data collections. We observed that the feature-based models perform very well on precision-oriented measures for both depression and anorexia detection. The model based on sentence embeddings is more effective on ERDE_50 and recall-oriented measures. We also obtained better results than the state of the art on precision and ERDE_50 for depression detection, and on precision and recall for anorexia detection. Our last contribution is an approach for detecting aggression in messages posted by users on social networks. We reused the models developed for depression and anorexia detection, and added further models based on deep learning. We evaluated our models on the data collections of the TRAC shared task.
We observed that our deep learning models yield better results than our models using classical classifiers. Our results in this part of the thesis are mid-ranking (fifth and ninth place) compared to the competitors, although we obtained the best result on one of the data collections
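As a rough illustration of the kind of unsupervised graph-based ranking the first contribution builds on (the thesis combines several improvements not shown here), a minimal TextRank-style scorer over a word co-occurrence graph might look like this:

```python
from collections import defaultdict

def textrank_keywords(words, window=2, damping=0.85, iterations=50):
    """Minimal TextRank-style keyword scorer: build an undirected
    co-occurrence graph over the candidate words, then run the
    PageRank power iteration for a fixed number of steps."""
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    nodes = list(neighbors)
    score = {w: 1.0 for w in nodes}
    for _ in range(iterations):
        # Each node redistributes its score evenly to its neighbors.
        score = {w: (1 - damping) + damping * sum(
                     score[u] / len(neighbors[u]) for u in neighbors[w])
                 for w in nodes}
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)
```

Real systems add candidate filtering by part of speech, phrase reconstruction, and convergence checks on top of this skeleton.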
El, Haj Abir. "Stochastics blockmodels, classifications and applications". Thesis, Poitiers, 2019. http://www.theses.fr/2019POIT2300.
Pełny tekst źródła
This PhD thesis focuses on the analysis of weighted networks, where each edge is associated with a weight representing its strength. We introduce an extension of the binary stochastic block model (SBM), called the binomial stochastic block model (bSBM). This question is motivated by the study of co-citation networks in a text mining context where the data is represented by a graph: nodes are words, and each edge joining two words is weighted by the number of documents in the corpus that cite this pair of words simultaneously. We develop an inference method based on a variational expectation-maximization (VEM) algorithm to estimate the parameters of the model as well as to classify the words of the network. We then adopt a method based on maximizing an integrated classification likelihood (ICL) criterion to select the optimal model and the number of clusters. In addition, we develop a variational approach to analyze the given network, and we compare the two approaches. Applications to real data are presented to show the effectiveness of the two methods and to compare them. Finally, we develop an SBM with several attributes to deal with node-weighted networks. We motivate this approach by an application that aims at developing a tool to help characterize the different cognitive processes performed by the brain during the preparation of writing
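The weighted word network that serves as input to the bSBM can be constructed as in the sketch below (toy documents for illustration). In the bSBM itself, each resulting weight would then be modeled as a binomial count out of the total number of documents, which this sketch does not attempt:

```python
from collections import Counter
from itertools import combinations

def cocitation_network(documents):
    """Build the weighted word network described above: nodes are
    words, and the weight of edge (u, v) is the number of documents
    in which u and v occur together."""
    weights = Counter()
    for doc in documents:
        vocab = sorted(set(doc))  # sort so each pair has one canonical key
        for u, v in combinations(vocab, 2):
            weights[(u, v)] += 1
    return weights
```

Since every weight is a count bounded by the number of documents, a binomial edge model is a natural fit, which is the motivation the abstract gives for the bSBM extension.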
Amrani, Ahmed Charef Eddine. "Induction et visualisation interactive pour l'étiquetage morphosyntaxique des corpus de spécialité : application à la biologie moléculaire". Paris 11, 2005. http://www.theses.fr/2005PA112369.
Pełny tekst źródła
Within the framework of a complete text-mining process, we were interested in part-of-speech tagging of specialized corpora. Existing taggers are trained on general-language corpora and give inconsistent results on specialized texts. To solve this problem, we developed an interactive, user-friendly and inductive tagger named ETIQ. This tagger makes it possible for the expert to correct the tagging obtained by a general-purpose tagger and to adapt it to a specialized corpus. We extended our approach to efficiently handle recurring part-of-speech tagging errors due to ambiguous words that take different tags depending on the context. To this end, we used supervised learning to induce correction rules. In some cases, when the rules are too difficult for the domain expert to formulate, we let the expert annotate examples in a very simple way using the interface. To reduce the total number of examples to annotate, we used an active learning algorithm. Correcting difficult part-of-speech tagging ambiguities is an essential step in obtaining a 'perfectly' tagged specialized corpus. To resolve these ambiguities and thus decrease the number of tagging errors, we used an interactive and iterative approach we call Progressive Induction. This approach combines machine learning, hand-crafted rules, and manual corrections by the user. The proposed approach enabled us to obtain a correctly tagged molecular biology corpus. Using this corpus, we carried out a comparative study of several taggers
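The supervised rule-induction step can be illustrated with a deliberately simplified, Brill-style sketch: flat tag sequences and a single previous-tag context. The tagger described above learns richer rules and adds interactive correction on top; this only shows the greedy induce-and-apply loop:

```python
from collections import Counter

def apply_rule(tags, rule):
    """Apply one contextual correction rule over a tag sequence."""
    wrong, prev_ctx, right = rule
    out = list(tags)
    for i, t in enumerate(tags):
        prev = tags[i - 1] if i > 0 else "<s>"
        if t == wrong and prev == prev_ctx:
            out[i] = right
    return out

def induce_correction_rules(predicted, gold, max_rules=5):
    """Greedily induce rules of the form
    (wrong_tag, previous_tag) -> corrected_tag, each chosen to fix
    more tagger errors than it introduces on the training sequence."""
    rules, current = [], list(predicted)
    for _ in range(max_rules):
        gains = Counter()
        for i, (pred, true) in enumerate(zip(current, gold)):
            if pred != true:
                prev = current[i - 1] if i > 0 else "<s>"
                gains[(pred, prev, true)] += 1
        if not gains:
            break
        rule, gain = gains.most_common(1)[0]
        wrong, prev_ctx, _ = rule
        # Collateral damage: already-correct tags the rule would overwrite.
        broken = sum(
            1 for i, (pred, true) in enumerate(zip(current, gold))
            if pred == true == wrong
            and (current[i - 1] if i > 0 else "<s>") == prev_ctx)
        if gain <= broken:
            break
        rules.append(rule)
        current = apply_rule(current, rule)
    return rules
```

The net-gain check (fixes must outnumber new errors) is what keeps greedy rule induction from overfitting a single frequent context.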
Arsevska, Elena. "Élaboration d'une méthode semi-automatique pour l'identification et le traitement des signaux d'émergence pour la veille internationale sur les maladies animales infectieuses". Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLS008/document.
Pełny tekst źródła
Monitoring animal health worldwide, especially the early detection of outbreaks of emerging and exotic pathogens, is one of the means of preventing the introduction of infectious diseases into France. Recently, there has been increasing awareness among health authorities of the value of unstructured information published on the Web for epidemic intelligence purposes. In this manuscript we present a semi-automatic text mining approach that detects, collects, classifies and extracts information from unstructured textual data available in media reports on the Web. Our approach is generic; however, it was elaborated using five exotic animal infectious diseases: African swine fever, foot-and-mouth disease, bluetongue, Schmallenberg, and avian influenza. We show that text mining techniques, supplemented by the knowledge of domain experts, are the foundation of an efficient and reactive system for monitoring emerging animal health events on the Web. Our tool will be used by the French epidemic intelligence team for international animal health monitoring, and will facilitate the early detection of events related to emerging health hazards identified in media reports on the Web
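The classification step of such a pipeline can be caricatured as routing each collected media report to the diseases whose vocabulary it matches. This is a toy stand-in, not the method actually used in the thesis, and the keyword lists below are purely illustrative:

```python
def classify_report(text, disease_keywords):
    """Route a media report to every disease whose keyword list it
    matches. disease_keywords maps a disease name to a list of
    lowercase trigger keywords (illustrative values only)."""
    lowered = text.lower()
    return sorted(disease for disease, keywords in disease_keywords.items()
                  if any(kw in lowered for kw in keywords))
```

A production system would replace this with trained classifiers and named-entity extraction for dates, locations and hosts, with domain experts curating the vocabulary, as the abstract indicates.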
Ibrahim, Aly Sayed Mohamed. "Les petits souterrains du Sérapéum de Memphis : étude d'archéologie, religion, et histoire : textes inédits". Université Lumière - Lyon 2, 1991. http://www.theses.fr/1991LYO20036.
Pełny tekst źródła
The purpose of this thesis is to give an overview of the excavations carried out in the lesser vaults of the Serapeum of Memphis during 1986, to publish a corpus of the monuments discovered during the work, and to give a detailed commentary on these documents from the archaeological, historical and religious points of view. At the end, I deal with the ancient Egyptian concept of the Apis bull