Dissertations / Theses on the topic 'Mots de données'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 dissertations / theses for your research on the topic 'Mots de données.'
Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.
Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.
Pelfrêne, Johann. "Extraction de mots approchés." Rouen, 2004. http://www.theses.fr/2004ROUES013.
Indexing structures for exact subwords are well known (suffix array, suffix tree, suffix automaton), but no indexing structure is known for approximate patterns. We study patterns with don't cares, for which a recent result proposed a linear bound on the number of maximal irredundant patterns. We introduce primitive patterns, which reduce the number of interesting patterns that can be extracted from a given text; like the maximal irredundant patterns, the primitive patterns form a basis for the maximal patterns. We show, however, that the number of primitive patterns, and consequently of maximal irredundant patterns, is not linear but exponential. This work presents properties of such patterns, an extraction algorithm, and an algorithm that decides primitivity without computing the basis. These algorithms are extended to extraction from multiple texts, to updating after the addition of a new text, and to ambiguous characters, of which the don't-care character is a special case. We also introduce a scoring scheme that reduces the number of conserved patterns.
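As a point of reference for the abstract above: a pattern with don't cares matches at a text position when every non-wildcard character agrees. The following minimal Python sketch illustrates only the matching semantics with a naive scan; it is not the thesis's extraction algorithm, and the `.` wildcard convention is our own choice.

```python
def occurrences(pattern, text, dont_care="."):
    """Positions where `pattern` matches `text`, treating `dont_care`
    as a wildcard that matches any single character."""
    hits = []
    for i in range(len(text) - len(pattern) + 1):
        if all(p == dont_care or p == t
               for p, t in zip(pattern, text[i:i + len(pattern)])):
            hits.append(i)
    return hits

print(occurrences("a.a", "abacada"))  # matches at positions 0, 2 and 4
```

A pattern is then "interesting" when it occurs often enough; the thesis studies which such patterns need to be kept (primitive ones) and which are redundant.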
Vieilleribière, Adrien. "Transformations de mots, d'arbres et de statistiques." Paris 11, 2008. http://www.theses.fr/2008PA112238.
We live in a world of exchange where errors are everywhere. It is therefore essential to be able to verify approximately whether a property is close to or far from being satisfied, which leads to developing techniques for exchanging huge and imperfect data. The subject of this thesis is the study of approximate processes that take errors into account, and the comparison, in terms of complexity, between exact and approximate processes. From a theoretical point of view, the heart of data exchange is the transformation of words and trees. This thesis shows that it is possible to decide the approximate equivalence of finite state machines for a particular distance, and extends this decision to a class of linear tree transducers inspired by XSLT. To that end, a method to approximate instances (words or trees), languages, and transducers over these instances is introduced. Its main interest lies in the possibility of linking the edit distance with moves between two transducers to the geometric distance (L1 norm) between their embeddings. This geometric embedding also makes it possible to decide the approximate consistency of an instance in constant time, i.e., independently of the input size. Finally, the implementation of data exchange is illustrated: weighted transductions are used to simulate computations of distances between languages, and tree transductions are used to build maps of log files and to visualise the results of an OLAP query.
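The edit distance *with moves* used in the abstract above extends the classic edit distance with a block-move operation. As background, here is the standard Levenshtein edit distance by dynamic programming (a textbook sketch, not the thesis's distance, which additionally allows moves):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between strings a and b,
    computed by dynamic programming over a (m+1) x (n+1) table."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

print(edit_distance("transducer", "transduce"))  # 1
```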
Laurence, Grégoire. "Normalisation et Apprentissage de Transductions d'Arbres en Mots." Phd thesis, Université des Sciences et Technologie de Lille - Lille I, 2014. http://tel.archives-ouvertes.fr/tel-01053084.
Clément, Julien. "Algorithmes, mots et textes aléatoires." Habilitation à diriger des recherches, Université de Caen, 2011. http://tel.archives-ouvertes.fr/tel-00913127.
Full textSablayrolles, Jean-François. "Les néologismes du français contemporain : traitement théorique et analyses de données." Paris 8, 1996. http://www.theses.fr/1996PA081066.
The aim of the present work is to examine the place of neologisms in interlocution and the treatment of neology in linguistic theories. This examination leads us to compare the definitions given by dictionaries and encyclopedias of the end of the last century with those of today. A comparison of about a hundred typologies produced in these same periods then leads us to consider the concepts posited by several contemporary linguistic models, whether structuralist or generativist. The diversity of the proposed solutions, and the difficulties they encounter, prompt an inquiry into the nature of the relevant linguistic unit. The notion of word, debated and insufficient, is abandoned in favour of the lexie, a functional unit memorised in competence. The notion of newness is then examined from two points of view: new for whom, and new compared with what? The construction and examination of six corpora from different sources (miscellanea, weekly papers, a novel by R. Jorif, Le Burelain, the chronicles of Ph. Meyer, Le Monde, and neologisms collected in a lycée) allow us to test the adopted definitions and to confirm, on the main points, hypotheses about the inequalities between members of the linguistic community facing the neological phenomenon: not everybody creates as many neologisms, nor the same ones. Complementary analyses, prompted by the examination of the facts, consider the circumstances propitious to neology, then the causes that lead a speaker to create a new lexie and about which the interpreter makes hypotheses. Finally, considerations connected with the circulation of discourse refine the concept of newness and show the uncertainties that bear on the future of lexies created by a given speaker in given circumstances.
Thuilier, Juliette. "Contraintes préférentielles et ordre des mots en français." Phd thesis, Université Paris-Diderot - Paris VII, 2012. http://tel.archives-ouvertes.fr/tel-00781228.
Goyet, Louise. "Développement des capacités de segmentation de la parole continue en mots, chez les enfants francophones : données électrophysiologiques et comportementales." Paris 5, 2010. http://www.theses.fr/2010PA05H104.
The acquisition of the lexicon constitutes a major stage in language development. In order to acquire the lexicon of their native language, however, infants must learn to identify and segment word forms in continuous speech, an ability called word segmentation. This ability is crucial for language acquisition, since word boundaries are not clearly marked at the acoustic level and words are rarely presented in isolation. How, then, do infants segment fluent speech? What are the developmental origins of segmentation abilities, and which mechanisms underlie them? Numerous studies have shown that segmentation abilities emerge around 8 months (Saffran et al., 1996; Jusczyk et al., 1999b), develop during the following months, and rely on infants' processing of various word-boundary cues (allophonic, phonotactic, prosodic/rhythmic, and transitional-probability cues), whose relative weight changes across development. This PhD research focuses on the rhythmic unit, the main segmentation cue, which depends on the rhythmic type of the native language (Jusczyk et al., 1999; Nazzi et al., 2006). It follows from a proposed solution to the bootstrapping problem, the early rhythmic segmentation hypothesis (Nazzi et al., 1998), which states that infants rely on the rhythmic unit underlying their native language at the onset of segmentation abilities (the trochaic unit for stress-based languages such as English, German, and Dutch; the syllable for syllable-based languages such as French, Italian, and Spanish). For French, behavioural evidence (Nazzi et al., 2006) showed that by 12 months of age infants can use the rhythmic unit appropriate to their native language (the syllable) to segment fluent speech (words are segmented into syllabic units), but failed to show whole-word segmentation at that age, an ability which emerges at 16 months. Given the implications of such findings, the goal of this PhD research is to re-evaluate, across development, the early rhythmic (syllabic) segmentation hypothesis, the issue of whole-word segmentation, and the interaction between the various segmentation cues and their impact on word segmentation. To do so, we used two experimental methods, one electrophysiological (high-density event-related potentials, ERPs) and one behavioural (the head-turn preference procedure, HPP), testing French-learning 8- and 12-month-olds on bisyllabic word segmentation. The results confirm, for these French-learning infants, the rhythm-based segmentation hypothesis, which postulates that they rely on syllables to segment fluent speech; in addition, they show that the use of these cues differs according to their respective weight in fluent speech.
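The transitional-probability cue mentioned in the abstract above (following Saffran et al.) can be made concrete: TP(a→b) = count(ab) / count(a) over a syllable stream, and word boundaries tend to fall where TP dips. The sketch below, with an invented toy vocabulary and a hand-picked threshold, is an illustration of the statistic, not of the infant experiments:

```python
from collections import Counter

def transitional_probs(syllables):
    """TP(a -> b) = count(ab) / count(a) over a syllable stream."""
    pair = Counter(zip(syllables, syllables[1:]))
    single = Counter(syllables[:-1])
    return {p: pair[p] / single[p[0]] for p in pair}

def segment(syllables, threshold=0.6):
    """Posit a word boundary wherever the TP dips below threshold."""
    tp = transitional_probs(syllables)
    words, current = [], [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        if tp[(a, b)] < threshold:
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words

# toy stream built from the invented words "babu", "pati", "gola"
stream = ["ba", "bu", "pa", "ti", "ba", "bu", "go", "la",
          "pa", "ti", "go", "la", "ba", "bu"]
print(segment(stream))
```

Within-word syllable pairs recur (high TP) while cross-word pairs vary (low TP), so the stream is cut back into the three words.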
Andreewsky, Marina. "Construction automatique d'un système de type expert pour l'interrogation de bases de données textuelles." Paris 11, 1989. http://www.theses.fr/1989PA112310.
Full textSebastian, Tom. "Evaluation of XPath queries on XML streams with networks of early nested word automata." Thesis, Lille 1, 2016. http://www.theses.fr/2016LIL10037/document.
The challenge we tackle in this thesis is how to answer XPath queries on XML streams with low latency, full coverage, high time efficiency, and low memory cost. We first propose to approximate earliest query answering for navigational XPath queries by compilation to early nested word automata; it turns out that this leads to almost optimal latency and memory consumption. Second, we contribute a formal semantics of XPath 3.0, obtained by mapping XPath to the new query language λXP that we introduce. We then show how to compile λXP queries to networks of early nested word automata and develop streaming algorithms for the latter, thereby obtaining a streaming algorithm that indeed covers all of XPath 3.0. Third, we develop an algorithm for projecting XML streams with respect to the query defined by an early nested word automaton, which makes our streaming algorithms highly time-efficient. We have implemented all our algorithms with the objective of obtaining an industrially applicable streaming tool; they outperform all previous approaches in time efficiency, coverage, and latency.
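To give a feel for the streaming setting in the abstract above: the document is consumed event by event and answers are emitted as soon as they are certain, without materialising the tree. The sketch below handles only a fixed absolute child path with a depth stack (a far cry from nested word automata or full XPath, and the element names are invented), but it shows the low-memory, event-driven style:

```python
import io
import xml.etree.ElementTree as ET

def stream_match(xml_stream, path):
    """Stream an XML document and yield the text of elements whose
    absolute path of tags equals `path`, e.g. ('lib','book','title'),
    without building the whole tree in memory."""
    stack = []
    for event, elem in ET.iterparse(xml_stream, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            if tuple(stack) == path:
                yield (elem.text or "").strip()
            stack.pop()
            elem.clear()  # discard processed subtrees to bound memory

doc = "<lib><book><title>NWA</title></book><misc>x</misc></lib>"
print(list(stream_match(io.StringIO(doc), ("lib", "book", "title"))))
```

The answer for `title` is emitted at its closing tag, the earliest point at which it is complete; deciding the *earliest* emission point for general XPath is exactly what the thesis's automata are for.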
Bonis, Thomas. "Algorithmes d'apprentissage statistique pour l'analyse géométrique et topologique de données." Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLS459/document.
In this thesis, we study data analysis algorithms using random walks on neighborhood graphs, or random geometric graphs. It is known that random walks on such graphs approximate continuous objects called diffusion processes. In the first part of this thesis, we use this approximation result to propose a new soft clustering algorithm based on the mode-seeking framework. We want to define clusters using the properties of a diffusion process; since we do not have access to this continuous process, our algorithm uses a random walk on a random geometric graph instead. After proving the consistency of our algorithm, we evaluate its efficiency on both real and synthetic data. We then tackle the issue of the convergence of invariant measures of random walks on random geometric graphs. As these random walks converge to a diffusion process, we can expect their invariant measures to converge to the invariant measure of this diffusion process. Using an approach based on Stein's method, we manage to quantify this convergence; moreover, the method is more general and can be used to obtain other results, such as convergence rates for the central limit theorem. In the last part of this thesis, we use persistent homology, a concept from algebraic topology, to improve the pooling step of the bag-of-words approach for 3D shapes.
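The two central objects of the abstract above, a neighborhood graph on data points and the invariant measure of a random walk on it, can be sketched directly. The toy below uses 1-D points, an ε-neighborhood graph, and power iteration of a lazy walk (assumptions: tiny data, every node has a neighbor; this is an illustration, not the thesis's estimators):

```python
def neighborhood_graph(points, eps):
    """Adjacency lists of the eps-neighborhood graph on 1-D points.
    Assumes eps is large enough that no node is isolated."""
    return {i: [j for j in range(len(points))
                if j != i and abs(points[i] - points[j]) <= eps]
            for i in range(len(points))}

def stationary_measure(adj, steps=200):
    """Approximate the invariant measure of the lazy random walk
    (stay with prob. 1/2, else move to a uniform neighbor)
    by power iteration from the uniform distribution."""
    n = len(adj)
    mu = [1.0 / n] * n
    for _ in range(steps):
        nxt = [0.5 * mu[i] for i in range(n)]
        for i, nbrs in adj.items():
            for j in nbrs:
                nxt[j] += 0.5 * mu[i] / len(nbrs)
        mu = nxt
    return mu

points = [0.0, 0.1, 0.2, 1.0, 1.1]   # two clusters on the line
mu = stationary_measure(neighborhood_graph(points, eps=0.15))
```

The invariant measure concentrates proportionally to degree, so denser regions carry more mass; mode seeking exploits exactly this kind of density information.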
Ouksili, Hanane. "Exploration et interrogation de données RDF intégrant de la connaissance métier." Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLV069.
An increasing number of datasets are published on the Web, expressed in languages proposed by the W3C to describe Web data, such as RDF, RDF(S), and OWL. The Web has become an unprecedented source of information available to users and applications, but the meaningful usage of this information source is still a challenge. Querying these data sources requires knowledge of a formal query language such as SPARQL, but it mainly suffers from the lack of knowledge about the source itself, which is required in order to target the resources and properties relevant for the specific needs of the application. The work described in this thesis addresses the exploration of RDF data sources. This exploration is done in two complementary ways: discovering the themes or topics representing the content of the data source, and providing support for an alternative way of querying the data sources, using keywords instead of a query formulated in SPARQL. The proposed exploration approach thus combines two complementary strategies: thematic-based exploration and keyword search. Theme discovery from an RDF dataset consists in identifying a set of sub-graphs, not necessarily disjoint, such that each one represents a set of semantically related resources forming a theme from the user's point of view. These themes can be used to enable a thematic exploration of the data source, where users can target the relevant theme and limit their exploration to the resources composing it. Keyword search is a simple and intuitive way of querying data sources; in the case of RDF datasets, it raises several problems, such as indexing graph elements, identifying the relevant graph fragments for a specific query, aggregating these fragments to build the query results, and ranking these results. In our work, we address these different problems and propose an approach which takes as input a keyword query and provides a list of sub-graphs, each one representing a candidate result for the query, ordered according to their relevance to the query. For both keyword search and theme identification in RDF data sources, we take into account external knowledge in order to capture the users' needs, or to bridge the gap between the concepts invoked in a query and those of the data source. This external knowledge can be domain knowledge allowing us to refine the user's need expressed by a query, or to refine the definition of themes. We propose a formalisation of this external knowledge and introduce the notion of pattern to this end. These patterns represent equivalences between properties and paths in the dataset; they are evaluated and integrated into the exploration process to improve the quality of the result.
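The keyword-search pipeline sketched in the abstract above (match keywords to graph elements, aggregate matches into candidate sub-graphs, rank the candidates) can be illustrated on toy triples. The sketch below aggregates only by shared subject and ranks by keyword coverage; all URIs and data are invented, and real approaches use far richer aggregation and ranking:

```python
def keyword_search(triples, keywords):
    """Naive keyword search over RDF-like (s, p, o) triples: keep
    triples mentioning a keyword, group them by subject into candidate
    sub-graphs, rank groups by distinct keywords covered (then size)."""
    matched = {}
    for kw in keywords:
        for t in triples:
            if any(kw.lower() in part.lower() for part in t):
                entry = matched.setdefault(t[0], {"triples": set(), "kws": set()})
                entry["triples"].add(t)
                entry["kws"].add(kw)
    ranked = sorted(matched.items(),
                    key=lambda kv: (-len(kv[1]["kws"]), len(kv[1]["triples"])))
    return [(subj, sorted(g["triples"])) for subj, g in ranked]

triples = [
    ("ex:Curie", "ex:field", "Physics"),
    ("ex:Curie", "ex:prize", "Nobel"),
    ("ex:Turing", "ex:field", "Computing"),
]
print(keyword_search(triples, ["nobel", "physics"])[0][0])  # ex:Curie
```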
Badr, Georges. "Modèle théorique et outil de simulation pour une meilleure évaluation des claviers logiciels augmentés d'un système de prédiction de mots." Toulouse 3, 2011. http://thesesups.ups-tlse.fr/1549/.
Predictive model and simulation tool for better evaluation of soft keyboards augmented with a word prediction list. Software keyboards enable text input on the move and on devices without physical keyboards, such as the new generation of mobile phones. However, these keyboards have several drawbacks, such as slow text entry and the fatigue experienced by motor-impaired users. One solution is to augment the software keyboard with lists containing the words likely to complete the word being entered by the user. While these so-called prediction lists reduce the number of clicks and operations, they decrease the user's input speed. An experiment with an eye-tracking system identified the strategies users follow when consulting a list of words. These results helped refine the prediction models in order to reduce the gap between predicted performance and the performance actually recorded. Based on observations made during this first experiment, we propose two variants of the use of a word prediction list. The first proposes a new way of interacting with the list that allows maximal use of it; the second evaluates a repositioning of the list in order to reduce the number of eye movements towards it. Both propositions were evaluated theoretically and experimentally with users, and can improve input performance compared with a classic word prediction list.
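The core mechanism behind the prediction lists discussed above is prefix completion ranked by word frequency. A minimal sketch (the corpus and ranking are invented for illustration; real systems use language models and personalised lexica):

```python
from collections import Counter

class Predictor:
    """Frequency-ranked prefix completion, the heart of a word
    prediction list attached to a soft keyboard."""
    def __init__(self, corpus_words):
        self.freq = Counter(corpus_words)

    def predict(self, prefix, k=3):
        """Top-k known words starting with `prefix`, most frequent
        first, ties broken alphabetically."""
        cands = [w for w in self.freq if w.startswith(prefix)]
        return sorted(cands, key=lambda w: (-self.freq[w], w))[:k]

p = Predictor(["the", "the", "they", "then", "theory", "cat"])
print(p.predict("the"))  # ['the', 'then', 'theory']
```

Each keystroke re-queries the predictor; the thesis's question is precisely how the cost of visually scanning this list trades off against the keystrokes it saves.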
Guillaumin, Matthieu. "Données multimodales pour l'analyse d'image." Phd thesis, Grenoble, 2010. http://www.theses.fr/2010GRENM048. http://tel.archives-ouvertes.fr/tel-00522278/en/.
This dissertation delves into the use of textual metadata for image understanding. We seek to exploit this additional textual information as weak supervision to improve the learning of recognition models. There is a recent and growing interest in methods that exploit such data because they can potentially alleviate the need for manual annotation, which is a costly and time-consuming process. We focus on two types of visual data with associated textual information. First, we exploit news images that come with descriptive captions to address several face-related tasks, including face verification, which is the task of deciding whether two images depict the same individual, and face naming, the problem of associating faces in a data set with their correct names. Second, we consider data consisting of images with user tags. We explore models for automatically predicting tags for new images, i.e., image auto-annotation, which can also be used for keyword-based image search. We also study a multimodal semi-supervised learning scenario for image categorisation, in which the tags are assumed to be present in both labelled and unlabelled training data, while they are absent from the test data. Our work builds on the observation that most of these tasks can be solved if perfectly adequate similarity measures are used. We therefore introduce novel approaches that involve metric learning, nearest neighbour models, and graph-based methods to learn, from the visual and textual data, task-specific similarities. For faces, our similarities focus on the identities of the individuals while, for images, they address more general semantic visual concepts. Experimentally, our approaches achieve state-of-the-art results on several standard and challenging data sets. On both types of data, we clearly show that learning using additional textual information improves the performance of visual recognition systems.
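To make the nearest-neighbour flavour of the abstract above concrete, here is a rank-weighted neighbour vote for image auto-annotation: neighbours of a query image (under some learned similarity) vote for their tags, closer neighbours counting more. This is a generic illustration with invented feature vectors and tags, not the specific models of the thesis:

```python
def predict_tags(query_vec, train, k=2, n_tags=2):
    """Rank-weighted nearest-neighbour tag propagation: the k nearest
    training items vote for their tags with weight 1/(rank+1)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nbrs = sorted(train, key=lambda item: dist(query_vec, item[0]))[:k]
    votes = {}
    for rank, (_, tags) in enumerate(nbrs):
        for t in tags:
            votes[t] = votes.get(t, 0.0) + 1.0 / (rank + 1)
    return sorted(votes, key=lambda t: -votes[t])[:n_tags]

train = [
    ([0.0, 0.1], ["beach", "sea"]),
    ([0.1, 0.0], ["beach", "sun"]),
    ([0.9, 1.0], ["city", "night"]),
]
print(predict_tags([0.05, 0.05], train))  # ['beach', 'sea']
```

What metric learning adds, and what the thesis studies, is replacing the plain Euclidean `dist` with a similarity tuned so that semantically related images end up close.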
Cappuzzo, Riccardo. "Deep learning models for tabular data curation." Electronic Thesis or Diss., Sorbonne université, 2022. http://www.theses.fr/2022SORUS047.
Data curation is a pervasive and far-reaching topic, affecting everything from academia to industry. Current solutions rely on manual work by domain users, but they are not adequate. We investigate how to apply deep learning to tabular data curation, focusing on developing unsupervised data curation systems and on designing curation systems that intrinsically model categorical values in their raw form. We first implement EmbDI to generate embeddings for tabular data and address the tasks of entity resolution and schema matching. We then turn to the data imputation problem, using graph neural networks in a multi-task learning framework called GRIMP.
Ramiandrisoa, Iarivony. "Extraction et fouille de données textuelles : application à la détection de la dépression, de l'anorexie et de l'agressivité dans les réseaux sociaux." Thesis, Toulouse 3, 2020. http://www.theses.fr/2020TOU30191.
Our research mainly focuses on tasks with an applied purpose: depression and anorexia detection on the one hand, and aggression detection on the other, based on messages posted by users on social media platforms. We have also proposed an unsupervised method for keyphrase extraction. These three pieces of work were initiated at different times during this thesis. Our first contribution concerns automatic keyphrase extraction from scientific documents and news articles. More precisely, we improve an unsupervised graph-based method, addressing the weaknesses of such methods by combining existing solutions. We evaluated our approach on eleven data collections: five containing long documents, four containing short documents, and two containing news articles. We have shown that our proposal improves the results in certain contexts. The second contribution of this thesis is a solution for early detection of depression and anorexia. We proposed models that use classical classifiers, namely logistic regression and random forests, based on (a) features and (b) sentence embeddings. We evaluated our models on the eRisk data collections and observed that the feature-based models perform very well on precision-oriented measures, both for depression and for anorexia detection, while the model based on sentence embeddings is more efficient on ERDE_50 and recall-oriented measures. We also obtained better results than the state of the art on precision and ERDE_50 for depression detection, and on precision and recall for anorexia detection. Our last contribution is an approach for aggression detection in messages posted by users on social networks. We reused the models developed for depression and anorexia detection and added models based on deep learning. We evaluated them on the data collections of the TRAC shared task and observed that our deep learning models provide better results than those using classical classifiers. Our results in this part rank mid-field (fifth to ninth) compared with the competitors, though we obtained the best result on one of the data collections.
Arruda, Lima Katia. "Vers une éthique pour les médias numériques : défis entre le public et le privé : que faisons-nous en fin de compte avec les mots?" Thèse, Université de Sherbrooke, 2017. http://hdl.handle.net/11143/11623.
We confront the tension between legitimacy and manipulation in persuasive discourse: the old, tricky aporia of argumentation, dating back to the ancient Greeks when they first founded democracy. This has more recently been highlighted by Philippe Breton (2008) as the subtle "paradox of argumentation," which concerns the dynamics of human language as a valuable hermeneutical enterprise, one susceptible to (mis)interpretations as well as to phenomena of critical dissent and controversy. Our main questions subsequently revolve around the central concern of how we may promote democratic participation and discussion, in the era of the Internet, in ways that motivate the improvement of our inter-subjective communicative performances in healthy and legitimate manners, instead of facilitating corruption via blunt censorship or other manipulative tricks. As we consider dialogue and argumentation to be the most crucial traits of the democratic enterprise, we also discuss the role played by American pragmatism in nourishing this democratic ideal. In particular, we focus on the theoretical approaches proposed by Peirce and Mead concerning autonomy and reflexivity, not without mentioning pragmatism's champion in education, John Dewey, whose works were preoccupied with the maintenance and development of the main axes of well-functioning democratic societies, namely education, science, and communication. To reflect on this, we integrate into Breton's triangle a Peirce-Mead semiotic "triadic" approach that supports autonomy, so as to propose a compound model able both to encompass the rich possibilities of communication and to delimit as much as possible the range of interactive dialogism peculiar to human language, so as to foster ethical (legitimate) exchanges. All the elements considered in Part A prepare the terrain for the considerations developed in Part B, regarding an ethics for digital media. Conclusions: • The paradigm proposed by discourse ethics, in the light of a semiotic approach to autonomy, reflexivity, and the self, is suggested as a reliable theoretical framework of departure. • This leads us to a compound 'triadic' model that incorporates the most relevant aspects of the views of Peirce, Mead, Grize, and Breton. • Then, in Part B, concerning the challenges brought by digital media to contemporary societies, we conclude that the more of one's privacy an individual is required to relinquish to governments and/or companies (whatever the reasons involved), the more transparency should be required in return from those handling one's sensitive information. • All this in order to prevent manipulation and abuses of power as much as possible, and to keep a balanced 'communicative triangle' among interlocutors (according to the proposed triangular model), essential for democracies to be maintained and to thrive, making possible the adoption of a Magna Carta for the Internet that would be globally acceptable and focused on three main principles: net neutrality; freedom of expression; privacy protection.
Dao, Ngoc Bich. "Réduction de dimension de sac de mots visuels grâce à l’analyse formelle de concepts." Thesis, La Rochelle, 2017. http://www.theses.fr/2017LAROS010/document.
In several scientific fields, such as statistics, computer vision, and machine learning, reducing redundant and/or irrelevant information in data descriptions (dimension reduction) is an important step. This process comprises two distinct categories, feature extraction and feature selection, of which feature selection in unsupervised learning remains an open question. In this manuscript, we discuss feature selection on image datasets using Formal Concept Analysis (FCA), with a focus on lattice structure and lattice theory. The images in a dataset are described as sets of visual words by the bag-of-visual-words model. Two algorithms are proposed in this thesis to select relevant features; they can be used in both unsupervised and supervised learning. The first, RedAttsSansPerte, is based on lattice structure and lattice theory and removes redundant features using the precedence graph. The formal definition of the precedence graph is given in this thesis, and we demonstrate its properties and its relationship to the AC-poset. Experimental results indicate that the RedAttsSansPerte algorithm reduces the size of the feature set while maintaining performance under evaluation by classification. Second, the RedAttsFloue algorithm, an extension of RedAttsSansPerte, is proposed; this extension uses a fuzzy precedence graph, whose formal definition and properties are also demonstrated in this manuscript. RedAttsFloue removes redundant and irrelevant features while retaining relevant information, according to the flexibility threshold of the fuzzy precedence graph, and the quality of the retained information is evaluated by classification. The RedAttsFloue algorithm is suggested to be more robust than RedAttsSansPerte in terms of reduction.
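A small FCA-flavoured intuition for the abstract above: in a binary objects-by-attributes context, two attributes with the same extent (the same set of objects possessing them) carry identical information, so one of them is redundant. The sketch below removes only such exact duplicates (a toy special case; the thesis's precedence-graph algorithms handle much subtler redundancy):

```python
def remove_redundant(context):
    """In a binary objects x attributes context (attribute -> column of
    0/1 over objects), keep one representative attribute per extent."""
    extents = {}
    for attr, column in context.items():
        key = frozenset(i for i, v in enumerate(column) if v)
        extents.setdefault(key, attr)   # first attribute with this extent wins
    return sorted(extents.values())

# toy bag-of-visual-words context: rows = images, columns = visual words
context = {
    "w1": [1, 0, 1, 0],
    "w2": [1, 0, 1, 0],   # same extent as w1 -> redundant
    "w3": [0, 1, 0, 1],
}
print(remove_redundant(context))  # ['w1', 'w3']
```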
Bui, Quang Anh. "Vers un système omni-langage de recherche de mots dans des bases de documents écrits homogènes." Thesis, La Rochelle, 2015. http://www.theses.fr/2015LAROS010/document.
The objective of this thesis is to build an omni-language word retrieval system for scanned documents. We place ourselves in a context where the content of the documents is homogeneous and no prior knowledge about them (language, writer, writing style, etc.) is available. With this system, users can freely and intuitively compose their queries; with the query thus created, they can retrieve words in homogeneous documents of any language, without needing to find an occurrence of the word to search for. The key to our proposed system is the invariants: pieces of writing that appear frequently in the document collection. The invariants can be used in the query-making process, in which the user selects and composes appropriate invariants to build the query; they can also be used as structural descriptors to characterise word images in the retrieval process. We introduce in this thesis our method for automatically extracting invariants from a document collection, our method for evaluating the quality of invariants, and the applications of invariants to the query-making and retrieval processes.
Marie, Damien. "Anatomie du gyrus de Heschl et spécialisation hémisphérique : étude d'une base de données de 430 sujets témoins volontaire sains." Thesis, Bordeaux 2, 2013. http://www.theses.fr/2013BOR22072/document.
This thesis concerns the macroscopic anatomy of Heschl's gyrus (HG) in relation to manual preference (MP) and hemispheric specialisation (HS) for language, studied in a multimodal database dedicated to the investigation of HS and balanced for sex and MP (BIL&GIN). HG, located on the surface of the temporal lobe, hosts the primary auditory cortex. Previous studies have shown that HG volume is leftward-asymmetrical and that the left HG (LHG) covaries with phonological performance and with the amount of cortex dedicated to processing the temporal aspects of sounds, suggesting a relationship between LHG and HS for language. However, HG anatomy is highly variable and little known. In this thesis we have: 1) described the inter-hemispheric gyrification pattern of HG on the anatomical MRI images of 430 healthy participants; 2) studied the variation of the surface area of the first, or anterior, HG (aHG) and its asymmetry, showing that its area is reduced in the presence of duplication, that its leftward asymmetry is present only in the case of a single LHG, and that left-handers exhibit a lower incidence of right duplication and a loss of aHG leftward asymmetry; 3) tested whether the variance of HG anatomy explains the inter-individual variability of asymmetries measured with fMRI in 281 participants listening to a list of words, and whether differences in HG anatomy with MP are related to decreased HS for language in left-handers. The inter-hemispheric gyrification pattern of HG explained 11% of the variance of HG functional asymmetry, the patterns including a single LHG being those with the strongest leftward asymmetry; there was no effect of MP on HG functional lateralisation.
Kooli, Nihel. "Rapprochement de données pour la reconnaissance d'entités dans les documents océrisés." Thesis, Université de Lorraine, 2016. http://www.theses.fr/2016LORR0108/document.
Full textThis thesis focuses on entity recognition in documents recognized by OCR, driven by a database. An entity is a homogeneous group of attributes such as an enterprise in a business form described by the name, the address, the contact numbers, etc. or meta-data of a scientific paper representing the title, the authors and their affiliation, etc. Given a database which describes entities by its records and a document which contains one or more entities from this database, we are looking to identify entities in the document using the database. This work is motivated by an industrial application which aims to automate the image document processing, arriving in a continuous stream. We addressed this problem as a matching issue between the document and the database contents. The difficulties of this task are due to the variability of the entity attributes representation in the database and in the document and to the presence of similar attributes in different entities. Added to this are the record redundancy and typing errors in the database, and the alteration of the structure and the content of the document, caused by OCR. To deal with these problems, we opted for a two-step approach: entity resolution and entity recognition. The first step is to link the records referring to the same entity and to synthesize them in an entity model. For this purpose, we proposed a supervised approach based on a combination of several similarity measures between attributes. These measures tolerate character mistakes and take into account the word permutation. The second step aims to match the entities mentioned in documents with the resulting entity model. We proceeded by two different ways, one uses the content matching and the other integrates the structure matching. For the content matching, we proposed two methods: M-EROCS and ERBL. 
M-EROCS, an improvement/adaptation of a state-of-the-art method, matches OCR blocks with the entity model based on a score that tolerates OCR errors and attribute variability. ERBL labels the document with the entity attributes and groups these labels into entities. Structure matching exploits the structural relationships between the entity labels to correct mislabeling. The proposed method, called G-ELSE, is based on local structure graph matching with a structural model learned for this purpose. This thesis being carried out in collaboration with the ITESOFT-Yooz company, we have experimented all the proposed steps on two administrative corpora and a third one extracted from the web.
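The combination of a character-level similarity (tolerant to typing/OCR errors) with an order-insensitive one (tolerant to word permutations) described in this abstract can be sketched as follows. This is an illustrative toy version: the function names, weights and measures are our own, not the thesis's actual ones.

```python
from difflib import SequenceMatcher

def char_similarity(a: str, b: str) -> float:
    """Character-level similarity, tolerant to typing/OCR errors."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_set_similarity(a: str, b: str) -> float:
    """Order-insensitive similarity, tolerant to word permutations."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def attribute_similarity(a: str, b: str, w_char: float = 0.5) -> float:
    """Weighted combination of both measures."""
    return w_char * char_similarity(a, b) + (1 - w_char) * token_set_similarity(a, b)

# A permuted attribute still scores high:
s = attribute_similarity("Dupont Jean", "Jean Dupont")
```

Combining both views is what makes the measure robust: the character component catches OCR substitutions, while the token-set component ignores word order.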
Rihany, Mohamad. "Keyword Search and Summarization Approaches for RDF Dataset Exploration." Electronic Thesis or Diss., université Paris-Saclay, 2022. http://www.theses.fr/2022UPASG030.
An increasing number of datasets are published on the Web, expressed in the standard languages proposed by the W3C such as RDF, RDF(S) and OWL. These datasets represent an unprecedented amount of data available for users and applications. In order to identify and use the relevant datasets, users and applications need to explore them using queries written in SPARQL, a query language proposed by the W3C. But in order to write a SPARQL query, a user should not only be familiar with the query language but also have knowledge about the content of the RDF dataset in terms of the resources, classes or properties it contains. The goal of this thesis is to provide approaches to support the exploration of these RDF datasets. We have studied two alternative and complementary exploration techniques: keyword search and summarization of an RDF dataset. Keyword search returns RDF graphs in response to a query expressed as a set of keywords, where each resulting graph is the aggregation of elements extracted from the source dataset. These graphs represent possible answers to the keyword query, and they can be ranked according to their relevance. Keyword search in RDF datasets raises the following issues: (i) identifying, for each keyword in the query, the matching elements in the considered dataset, taking into account the differences of terminology between the keywords and the terms used in the RDF dataset; (ii) combining the matching elements to build the result by defining aggregation algorithms that find the best way of linking matching elements; and finally (iii) finding appropriate metrics to rank the results, as several matching elements may exist for each keyword and consequently several graphs may be returned. In our work, we propose a keyword search approach that addresses these issues. Providing a summarized view of an RDF dataset can help users identify whether this dataset is relevant to their needs, and highlight its most relevant elements.
This could be useful for the exploration of a given dataset. In our work, we propose a novel summarization approach based on the underlying themes of a dataset. Our theme-based summarization approach consists of extracting the existing themes in a data source, and building the summarized view so as to ensure that all these discovered themes are represented. This raises the following questions: (i) how to identify the underlying themes in an RDF dataset? (ii) what are the suitable criteria to identify the relevant elements in the themes extracted from the RDF graph? (iii) how to aggregate and connect the relevant elements to create a theme summary? and finally, (iv) how to create the summary for the whole RDF graph from the generated theme summaries? In our work, we propose a theme-based summarization approach for RDF datasets which answers these questions and provides a summarized representation ensuring that each theme is represented proportionally to its importance in the initial dataset.
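The keyword-matching and aggregation steps (i) and (ii) named in this abstract can be illustrated with a toy sketch over an in-memory triple list. This is a deliberate simplification: the dataset, the substring matching and the coverage score are illustrative stand-ins, not the thesis's actual terminological matching, aggregation or ranking algorithms.

```python
# Toy RDF dataset as (subject, predicate, object) triples.
TRIPLES = [
    ("db:Curie", "rdf:type", "db:Scientist"),
    ("db:Curie", "db:field", "Physics"),
    ("db:Curie", "db:bornIn", "db:Warsaw"),
    ("db:Einstein", "rdf:type", "db:Scientist"),
    ("db:Einstein", "db:field", "Physics"),
]

def matching_elements(keyword, triples):
    """Triples containing an element that (loosely) matches the keyword."""
    k = keyword.lower()
    return [t for t in triples if any(k in e.lower() for e in t)]

def keyword_search(keywords, triples):
    """Aggregate matched triples into one candidate answer graph,
    scored by how many of the keywords it covers."""
    graph = []
    covered = set()
    for kw in keywords:
        hits = matching_elements(kw, triples)
        if hits:
            covered.add(kw)
            graph.extend(h for h in hits if h not in graph)
    score = len(covered) / len(keywords) if keywords else 0.0
    return graph, score

g, score = keyword_search(["curie", "physics"], TRIPLES)
```

The coverage score gives one possible ranking criterion for issue (iii): graphs that cover more of the query's keywords rank higher.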
Lebboss, Georges. "Contribution à l’analyse sémantique des textes arabes." Thesis, Paris 8, 2016. http://www.theses.fr/2016PA080046/document.
The Arabic language is poor in electronic semantic resources. Among those resources there is Arabic WordNet, which is also poor in words and relationships. This thesis focuses on enriching Arabic WordNet with synsets (a synset is a set of synonymous words) taken from a large general corpus. This type of corpus does not exist in Arabic, so we had to build it before subjecting it to a number of preprocessing steps. Together with Gilles Bernard, we developed a word vectorization method called GraPaVec which can be used here. I built a system which includes an Add2Corpus module, preprocessing, and word vectorization using automatically generated frequency patterns; this yields a data matrix whose rows are the words and whose columns are the patterns, each component representing the frequency of a word in a pattern. The word vectors are fed to the Self Organizing Map (SOM) neural model; the classification produced constructs synsets. In order to validate the method, we had to create a gold standard corpus (there is none in Arabic for this area) from Arabic WordNet, and then compare the GraPaVec method with Word2Vec and GloVe. The results show that GraPaVec gives the best results for this problem, with an F-measure 25% higher than the others. The generated classes will be used to create new synsets to be included in Arabic WordNet.
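The word-by-pattern frequency matrix this abstract describes (rows are words, columns are patterns, each cell a frequency) can be sketched as below. The patterns here are hand-written stand-ins expressed as (word-before, word-after) context templates; GraPaVec generates its frequency patterns automatically, so this is only an illustration of the matrix shape, not the method itself.

```python
from collections import Counter

def pattern_matrix(corpus_sentences, vocab, patterns):
    """Build a word-by-pattern frequency matrix. Each pattern is a
    (before, after) pair constraining the target word's neighbours;
    None acts as a wildcard."""
    counts = {w: Counter() for w in vocab}
    for sent in corpus_sentences:
        toks = sent.split()
        for i, w in enumerate(toks):
            if w not in vocab:
                continue
            left = toks[i - 1] if i > 0 else None
            right = toks[i + 1] if i + 1 < len(toks) else None
            for j, (before, after) in enumerate(patterns):
                if (before is None or before == left) and \
                   (after is None or after == right):
                    counts[w][j] += 1
    return [[counts[w][j] for j in range(len(patterns))] for w in vocab]

# Hypothetical patterns and corpus, for illustration only:
patterns = [("the", None), (None, "runs")]
corpus = ["the dog runs", "the cat sleeps", "a dog runs"]
M = pattern_matrix(corpus, ["dog", "cat"], patterns)
```

Each row of `M` is a word vector that could then be fed to a clustering model such as a SOM.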
Nguyen, Nhu Khoa. "Emerging Trend Detection in News Articles." Electronic Thesis or Diss., La Rochelle, 2023. http://www.theses.fr/2023LAROS003.
In the financial domain, information plays a crucial role in making investment and business decisions, as good knowledge can lead to crafting correct approaches to how to invest, or to whether the investment is worth it. Moreover, being able to identify potential emerging themes and topics is an integral part of this field, since it can provide a head start over other investors, and thus a huge competitive advantage. To deduce topics that may emerge in the future, data such as annual financial reports, stock market data and management meeting summaries are usually reviewed by professional financial experts. Reliable sources of information coming from reputable news publishers can also be utilized for detecting emerging themes. Unlike social media, articles from these publishers have high credibility and quality; when analyzed in large volumes, they are likely to reveal dormant or hidden information about trends, or about what may become future trends. However, due to the vast amount of information generated each day, it has become more demanding and difficult to analyze the data manually for trend identification. Our research explores and analyzes data from different quality sources, such as scientific publication abstracts and a news article dataset provided by Bloomberg called the Event-Driven Feed (EDF), to experiment on Emerging Trend Detection. The enormous amount of available data spread over extended time periods encourages the use of contrastive approaches that measure the divergence between the past and present surrounding contexts of extracted words and phrases, comparing the similarity between the vector representations of each interval to discover shifts in word usage that can lead to the discovery of new trends. Experimental results reveal that assessing the context change of selected terms through time can detect critical emerging trends and points of emergence.
It is also found that assessing the evolution of context over a long time span is better than merely contrasting the two most recent points in time.
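The contrastive idea of measuring divergence between a term's past and present contexts can be sketched with simple bag-of-words context vectors and cosine similarity. The thesis works with richer vector representations over real news streams; the window size, corpus and scoring here are illustrative assumptions only.

```python
import math
from collections import Counter

def context_vector(term, docs, window=2):
    """Bag-of-words vector of the words surrounding `term` in a set of docs."""
    ctx = Counter()
    for doc in docs:
        toks = doc.lower().split()
        for i, t in enumerate(toks):
            if t == term:
                ctx.update(toks[max(0, i - window):i] + toks[i + 1:i + 1 + window])
    return ctx

def cosine(u, v):
    keys = set(u) | set(v)
    dot = sum(u[k] * v[k] for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def context_shift(term, docs_past, docs_present):
    """1 - similarity: high values suggest the term's usage is changing."""
    return 1.0 - cosine(context_vector(term, docs_past),
                        context_vector(term, docs_present))

past = ["apple pie tastes great", "fresh apple juice"]
present = ["apple stock price rises", "apple launches new phone"]
shift = context_shift("apple", past, present)
```

A term whose neighbouring vocabulary is stable across intervals scores near 0; a term whose context has moved scores near 1, flagging it as a possible emerging trend.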
Exibard, Léo. "Automatic synthesis of systems with data." Electronic Thesis or Diss., Aix-Marseille, 2021. http://www.theses.fr/2021AIXM0312.
We often interact with machines that react in real time to our actions (robots, websites, etc.). They are modelled as reactive systems, which continuously interact with their environment. The goal of reactive synthesis is to automatically generate a system from the specification of its behaviour, so as to replace the error-prone low-level development phase by a high-level specification design. In the classical setting, the set of signals available to the machine is assumed to be finite. However, this assumption is not realistic for modelling systems which process data from a possibly infinite set (e.g. a client id, a sensor value, etc.). The goal of this thesis is to extend reactive synthesis to the case of data words. We study a model that is well-suited for this more general setting, and examine the feasibility of its synthesis problem(s). We also explore the case of non-reactive systems, where the machine does not have to react immediately to its inputs.
Ehrhart, Hélène. "Essais sur la composition des recettes fiscales dans les pays en développement." Phd thesis, Université d'Auvergne - Clermont-Ferrand I, 2011. http://tel.archives-ouvertes.fr/tel-00638775.
Marchand, Morgane. "Domaines et fouille d'opinion : une étude des marqueurs multi-polaires au niveau du texte." Thesis, Paris 11, 2015. http://www.theses.fr/2015PA112026/document.
In this thesis, we study the adaptation of a text-level opinion classifier across domains. People express their opinions differently depending on the subject of the conversation. The same word in two different domains can refer to different objects or carry a different connotation. If these words are not detected, they lead to classification errors. We call these words or bigrams "multi-polarity markers": their presence in a text signals a polarity which differs according to the domain of the text. Their study is the subject of this thesis. These markers are detected using a chi-squared test when labels exist in both targeted domains. We also propose a semi-supervised detection method for the case where labels exist in only one domain. We use a collection of automatically filtered pivot words in order to ensure a stable polarity across domains. We have also checked the linguistic interest of the selected words with a manual evaluation campaign. A validated word can be: a context word, a word expressing an opinion, a word explaining an opinion, or a word referring to the evaluated object. Our study also shows that the causes of changing polarity are of three kinds: a change of meaning, a change of object, or a change of use. Finally, we have studied the influence of multi-polarity markers on opinion classification at text level in three different cases: adaptation of a source domain to a target domain, multi-domain corpora, and open-domain corpora. The results of our experiments show that the potential improvement is bigger when the initial transfer was difficult. In the favorable cases, we improve accuracy by up to five points.
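The chi-squared detection of multi-polarity markers can be illustrated on a 2x2 contingency table of a word's polarity counts in two domains. The counts below are made up for the example; only the test itself reflects the abstract.

```python
def chi2_statistic(table):
    """Chi-squared statistic for a 2x2 contingency table
    [[pos_domain_A, neg_domain_A], [pos_domain_B, neg_domain_B]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row = [a + b, c + d]
    col = [a + c, b + d]
    obs = [[a, b], [c, d]]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (obs[i][j] - expected) ** 2 / expected
    return stat

# A word appearing mostly in positive texts of domain A but negative
# texts of domain B yields a high statistic (candidate marker):
marker = chi2_statistic([[40, 10], [8, 42]])   # polarity flips across domains
stable = chi2_statistic([[25, 25], [26, 24]])  # polarity is stable
```

A statistic above the 5% critical value for one degree of freedom (about 3.84) would flag the word as a multi-polarity marker candidate.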
Boroş, Emanuela. "Neural Methods for Event Extraction." Thesis, Université Paris-Saclay (ComUE), 2018. http://www.theses.fr/2018SACLS302/document.
With the increasing amount of data and the exploding number of data sources, the extraction of information about events, whether from the perspective of acquiring knowledge or from a more directly operational perspective, becomes a more and more obvious need. This extraction nevertheless comes up against a recurring difficulty: most of the information is present in documents in textual form, thus unstructured and difficult for machines to grasp. From the point of view of Natural Language Processing (NLP), the extraction of events from texts is the most complex form of Information Extraction (IE), which more generally encompasses the extraction of named entities and of the relationships that bind them in texts. The event extraction task can be represented as a complex combination of relations linked to a set of empirical observations from texts. Compared to relations involving only two entities, there is therefore a new dimension that often requires going beyond the scope of the sentence, which constitutes an additional difficulty. In practice, an event is described by a trigger and a set of participants in that event whose values are text excerpts. While IE research has benefited significantly from manually annotated datasets to learn patterns for text analysis, the availability of these resources remains a significant problem. These datasets are often obtained through the sustained efforts of research communities, potentially complemented by crowdsourcing. In addition, many machine-learning-based IE approaches rely on the ability to extract large sets of manually defined features from text using sophisticated NLP tools. As a result, adaptation to a new domain is an additional challenge. This thesis presents several strategies for improving the performance of an Event Extraction (EE) system using neural approaches exploiting morphological, syntactic, and semantic properties of word embeddings.
These have the advantage of not requiring a priori modeling of domain knowledge and automatically generate a much larger set of features from which to learn a model. More specifically, we propose different deep learning models for two sub-tasks of EE: event detection, and argument detection and classification. Event Detection (ED) is considered an important subtask of event extraction since the detection of arguments depends very directly on its outcome. ED specifically involves identifying instances of events in texts and classifying them into specific event types. Classically, the same event may appear under different expressions, and these expressions may themselves represent different events in different contexts, hence the difficulty of the task. The detection of arguments is based on the detection of the expression considered as triggering the event and ensures the recognition of the participants of the event. Among the difficulties to take into account, it should be noted that an argument can be common to several events and that it does not necessarily coincide with an easily recognizable named entity. As a preliminary to the introduction of our proposed models, we begin by presenting in detail a state-of-the-art model which constitutes the baseline. In-depth experiments are conducted on the use of different types of word embeddings and on the influence of the model's hyperparameters, using the ACE 2005 evaluation framework, a standard benchmark for this task. We then propose two new models to improve an event detection system. One increases the context taken into account when predicting an event instance by using a sentential context, while the other exploits the internal structure of words by taking advantage of seemingly less obvious but essentially important morphological knowledge. We also reconsider the detection of arguments as a higher-order relation extraction, and we analyze the dependence of arguments on the ED task.
Bose, Sougata. "On decision problems on word transducers with origin semantics." Thesis, Bordeaux, 2021. http://www.theses.fr/2021BORD0073.
The origin semantics for word transducers was introduced by Bojańczyk in 2014 in order to obtain a machine-independent characterization of word-to-word functions defined by transducers. Our primary goal was to study some classical decision problems for transducers in the origin semantics, such as the containment and the equivalence problems. We showed that these problems become decidable in the origin semantics, even though their classical versions are undecidable. Motivated by the observation that the origin semantics is more fine-grained than the classical semantics, we defined resynchronizers as a way to describe distortions of origins, and to study the above problems in a more relaxed way. We extended the model of rational resynchronizers, introduced by Filiot et al. for one-way transducers, to regular resynchronizers, which work for larger classes of transducers. We studied the two variants of the containment up to resynchronizer problem, which asks if a transducer is contained in another up to a distortion specified by a resynchronizer. We showed that the problem is decidable when the resynchronizer is given as part of the input. When the resynchronizer is not specified in the input, we aimed to synthesize such a resynchronizer whenever possible. We call this the synthesis problem for resynchronizers and show that it is undecidable in general. We identified some restricted cases where the problem becomes decidable. We also studied the one-way resynchronizability problem, which asks whether a given two-way transducer can be resynchronized into a one-way transducer, and showed that this problem is decidable as well.
Tran, Hoang Tung. "Automatic tag correction in videos : an approach based on frequent pattern mining." Thesis, Saint-Etienne, 2014. http://www.theses.fr/2014STET4028/document.
This thesis presents a new system for video auto-tagging which aims at correcting the tags provided by users for videos uploaded on the Internet. Most existing auto-tagging systems rely mainly on textual information and learn a great number of classifiers (one per possible tag) to tag new videos. However, existing user-provided video annotations are often incorrect and incomplete. Indeed, users uploading videos may often want to rapidly increase their video's number of views by tagging it with popular tags which are irrelevant to the video. They can also forget an obvious tag which might greatly help an indexing process. In this thesis, we limit the use of this questionable textual information and do not build a supervised model to perform the tag propagation. We propose to compare directly the visual content of the videos, described by different sets of features such as SIFT-based bags of visual words or frequent patterns built from them. We then propose an original tag correction strategy based on the frequency of the tags in the visual neighborhood of the videos. We have also introduced a number of strategies and datasets to evaluate our system. The experiments show that our method can effectively improve the existing tags and that frequent patterns built from bags of visual words are useful for constructing accurate visual features.
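The tag-correction strategy based on tag frequency in a video's visual neighborhood can be sketched as nearest-neighbour tag voting. The toy bag-of-visual-words histograms and parameter names below are assumptions for illustration; the real system builds its features from SIFT descriptors and frequent patterns.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def correct_tags(video, collection, k=3, keep=2):
    """Re-tag a video with the most frequent tags among its k visually
    nearest neighbours (features = bag-of-visual-words histograms)."""
    neighbours = sorted(collection,
                        key=lambda v: cosine(video["bovw"], v["bovw"]),
                        reverse=True)[:k]
    votes = Counter(t for v in neighbours for t in v["tags"])
    return [t for t, _ in votes.most_common(keep)]

collection = [
    {"bovw": {"fur": 5, "whiskers": 3}, "tags": ["cat"]},
    {"bovw": {"fur": 4, "whiskers": 4}, "tags": ["cat", "pet"]},
    {"bovw": {"wheel": 6, "road": 2}, "tags": ["car"]},
]
# A video mis-tagged "funny" but visually close to the cat videos:
video = {"bovw": {"fur": 5, "whiskers": 2}, "tags": ["funny"]}
tags = correct_tags(video, collection, k=2, keep=2)
```

The misleading user tag is dropped because it never occurs among the visual neighbours, while tags frequent in the neighborhood are propagated.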
Doucet, Antoine. "Extraction, Exploitation and Evaluation of Document-based Knowledge." Habilitation à diriger des recherches, Université de Caen, 2012. http://tel.archives-ouvertes.fr/tel-01070505.
Firoozeh, Nazanin. "Semantic-oriented Recommandation for Content Enrichment." Thesis, Sorbonne Paris Cité, 2018. http://www.theses.fr/2018USPCD033.
In this thesis, we aim at enriching the content of an unstructured document with respect to a domain of interest. The goal is to minimize the vocabulary and informational gap between the document and the domain. Such an enrichment, which is based on Natural Language Processing and Information Retrieval technologies, has several applications. As an example, filling in the gap between a scientific paper and a collection of highly cited papers in a domain helps the paper to be better acknowledged by the community that refers to that collection. Another example is to fill in the gap between a web page and the usual keywords of visitors that are interested in a given domain, so that it is better indexed and referred to in that domain, i.e. more accessible for those visitors. We propose a method to fill that gap. We first generate an enrichment collection, which consists of the important documents related to the domain of interest. The main information of the enrichment collection is then extracted, disambiguated and proposed to a user, who performs the enrichment. This is achieved by decomposing the problem into two main components: keyword extraction and topic detection. We present a comprehensive study of different approaches to each component. Using our findings, we propose approaches for extracting keywords from web pages, detecting their underlying topics, disambiguating them and returning the ones related to the domain of interest. The enrichment is performed by recommending discriminative sets of semantically relevant keywords, i.e. topics, to a user. The topics are labeled with representative keywords and have a level of granularity that is easily interpretable. Topic keywords are ranked by importance. This helps to control the length of the document which needs to be enriched, by targeting the most important keywords of each topic. Our approach is robust to the noise in web pages. It is also knowledge-poor and domain-independent.
It does, however, exploit search engines for generating the required data, but it is optimized in the number of requests sent to them. In addition, the approach is easily tunable to different languages. We have implemented the keyword extraction approach in 12 languages, and four of them have been tested over various domains. The topic detection approach has been implemented and tested on English and French. It is on French, however, that the approaches have been tested on a large scale: keyword extraction on roughly 400 domains and topic detection on 80 domains. To evaluate the performance of our enrichment approach, we focused on French and performed different experiments on the proposed keyword extraction and topic detection methods. To evaluate their robustness, we studied them on 10 topically diverse domains. Results were evaluated both through user-based evaluations in a real application context and by comparing with baseline approaches. Our results on the keyword extraction approach showed that statistical features are not adequate for capturing the importance of words within a web page. In addition, we found our proposed keyword extraction approach to be effective when applied in real applications. The evaluations of the topic detection approach also showed that it can effectively filter out the keywords which are not related to a target domain and that it labels the topics with representative and discriminative keywords. In addition, the approach achieved a high precision in preserving the semantic consistency of the keywords within each topic. We showed that our approach outperforms a baseline approach, since the widely used co-occurrence feature between keywords is not enough for capturing their semantic similarity and consequently for detecting semantically consistent topics.
El, Aouad Sara. "Personalized, Aspect-based Summarization of Movie Reviews." Electronic Thesis or Diss., Sorbonne université, 2019. https://accesdistant.sorbonne-universite.fr/login?url=https://theses-intra.sorbonne-universite.fr/2019SORUS019.pdf.
Online review websites help users decide what to buy or where to go. These platforms allow users to express their opinions using numerical ratings as well as textual comments. The numerical ratings give a coarse idea of the service. Textual comments, on the other hand, give full details, which is tedious for users to read. In this dissertation, we develop novel methods and algorithms to generate personalized, aspect-based summaries of movie reviews for a given user. The first problem we tackle is extracting a set of words related to an aspect from movie reviews. Our evaluation shows that our method is able to extract even unpopular terms that represent an aspect, such as compound terms or abbreviations, as opposed to the methods from the related work. We then study the problem of annotating sentences with aspects, and propose a new method that annotates sentences based on a similarity between the aspect signature and the terms in the sentence. The third problem we tackle is the generation of personalized, aspect-based summaries. We propose an optimization algorithm to maximize the coverage of the aspects the user is interested in and the representativeness of sentences in the summary, subject to length and similarity constraints. Finally, we perform three user studies which show that the approach we propose outperforms the state-of-the-art method for generating summaries.
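The coverage-maximising summary generation can be sketched as a greedy selection under a length budget. This is a simplified stand-in: the thesis's optimisation also accounts for sentence representativeness and similarity constraints, and the sentences and aspects below are invented for the example.

```python
def greedy_summary(sentences, user_aspects, max_sents=2):
    """Greedily pick sentences maximising coverage of the aspects the
    user cares about, under a sentence budget."""
    chosen, covered = [], set()
    candidates = list(sentences)
    while candidates and len(chosen) < max_sents:
        best = max(candidates,
                   key=lambda s: len((set(s["aspects"]) & user_aspects) - covered))
        gain = (set(best["aspects"]) & user_aspects) - covered
        if not gain:  # nothing left to cover
            break
        chosen.append(best["text"])
        covered |= gain
        candidates.remove(best)
    return chosen

sentences = [
    {"text": "The acting was superb.", "aspects": {"acting"}},
    {"text": "Great plot and soundtrack.", "aspects": {"plot", "music"}},
    {"text": "The plot twists kept me hooked.", "aspects": {"plot"}},
]
summary = greedy_summary(sentences, user_aspects={"plot", "music", "acting"})
```

Greedy selection is a standard baseline for such coverage objectives: each step picks the sentence with the largest marginal aspect gain, so redundant sentences (covering already-covered aspects) are skipped.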
Julien, Robert. "Magmatologie des trois phases d'édification du massif du Mont-Dore (Massif Central, France) : données volcanologiques sur le site de Croizat." Paris 11, 1988. http://www.theses.fr/1988PA112092.
Coste, Marion. "Une leçon de musique donnée aux mots : ruser avec les frontières dans l'œuvre de Michel Butor." Thesis, Sorbonne Paris Cité, 2015. http://www.theses.fr/2015USPCA109/document.
Music has greatly influenced the writing of Michel Butor, whose works often translate musical structures into literary art. These can be counterpoint (fugue, theme and variations), serial music or jazz. This way of working shows the metamorphosis of these musical structures in the texts, particularly complex when the writer has to translate the simultaneousness inherent in musical polyphony. This musical practice of writing upsets conventional literary structures, associating with innovations which characterise the Nouveau Roman (frequent change of narrators, fragmentation of the narrative) and also proposing new constraints that lead the writing into novel forms: conference-concerts, mobile forms, radio works. This practice also modifies our reading habits, compelling the reader to be responsible for the construction of the work, and our perception of time, which is no longer linear but cyclical. Lastly, the influence of music enables the creation of what I have called cultural cosmoses, inventing connections between cultures usually isolated in time or space, in a gesture of hospitality and generosity which is characteristic of the works of Michel Butor. The writer sees this literary hospitality as an ethical or political model. The different literary genres practised by Michel Butor are studied through a few works which testify to the various modalities of the musical influence on his writing: the novel, the mobile works, the dialogues with artworks, the opera Your Faust and the narrations of dreams are related to the musical trends familiar to the writer.
Lasserre, Marine. "De l'intrusion d'un lexique allogène : l'exemple des éléments néoclassiques." Thesis, Toulouse 2, 2016. http://www.theses.fr/2016TOU20012/document.
Elements such as those borrowed from Ancient Greek or Latin, which enter into so-called neoclassical compounds, are often grouped together in the literature in a single class of elements. This dissertation focuses on formations that involve eleven French final elements: logie, logue, cratie, crate, phobie, phobe, phone, phage, vore, cide and cole. In French, these constructions generate large sets of lexemes and appear to be available for lexical creation outside learned vocabulary. The study of these elements is conducted within the framework of Construction Morphology (Booij 2010). A database, called NεoClassy, was built in order to analyse these elements. It gathers the lexemes formed with these final elements which were collected in dictionaries and on the Web. The analyses conducted on this database have brought morpho-phonological, semantic and lexical constraints to light. A distributional analysis has also shown that some neoclassical constructions from NεoClassy and some derivatives have a similar behaviour. The schemas that involve final neoclassical elements and those in which native final elements are specified are integrated in a single model, on the basis of phonological, semantic and lexical criteria. Neoclassical elements do not constitute a homogeneous class: some are considered as suppletive stems of lexemes while others behave similarly to affixes.
Deprez, Jean-François. "Estimation 3D de la déformation des tissus mous biologiques par traitement numérique des données ultrasonores radiofréquences." Lyon, INSA, 2008. http://theses.insa-lyon.fr/publication/2008ISAL0087/these.pdf.
Ultrasound elastography is now recognized as a promising technique for tissue characterization. Its aim is to provide information about the mechanical properties of soft biological tissues. Since many pathological processes, such as breast or prostate cancer, involve a significant change in tissue stiffness, this information may be of great help for clinicians. This thesis deals with static elastography, which investigates tissue deformation under an external load. In practical terms, pairs of pre- and post-compression ultrasonic radio-frequency signals are acquired, and the changes induced within the signals by the stress are analyzed to compute a map of local strains. Accurately estimating the strain is one of the fundamental challenges in elastography, because the clinician's diagnosis will rely on these estimations. Since static elastography appeared in the early 90s, mainly 1D methods were developed, estimating the deformation along the propagation axis of the ultrasound beam. But biological soft tissues are almost incompressible: tissue deformation due to the external static load is therefore three-dimensional. In such conditions, 1D or 2D techniques may lead to insufficiently accurate estimations. That is why we propose in this thesis a 3D technique, designed to accurately estimate the deformation of biological soft tissue under load. This estimator is applied to the early detection of pressure ulcers.
Maitre, Julien. "Détection et analyse des signaux faibles. Développement d’un framework d’investigation numérique pour un service caché Lanceurs d’alerte." Thesis, La Rochelle, 2022. http://www.theses.fr/2022LAROS020.
This manuscript provides the basis for a complete chain of document analysis for a whistleblower service, such as GlobalLeaks. We propose a chain of semi-automated analysis of text documents, combined with web search queries, to ultimately present dashboards describing weak signals. We identify and solve methodological and technological barriers inherent to: 1) automated analysis of text documents with minimal a priori information; 2) enrichment of information using web search; 3) data visualization dashboards and a 3D interactive environment. These static and dynamic approaches are used in the context of data journalism for processing heterogeneous types of information within documents. This thesis also proposes a feasibility study and prototyping through the implementation of a processing chain in the form of software. This construction requires a definition of weak signals. Our goal is to provide a configurable and generic tool. Our solution is based on two approaches: static and dynamic. In the static approach, we propose a solution requiring less intervention from the domain expert. In this context, we propose a new approach of multi-level topic modeling. This joint approach combines topic modeling, word embedding and an algorithm. The use of an expert helps to assess the relevance of the results and to identify topics with weak signals. In the dynamic approach, we integrate a solution for monitoring weak signals and follow up to study their evolution. We therefore propose an agent-mining solution which combines data mining and a multi-agent system, where agents representing documents and words are animated by attraction/repulsion forces. The results are presented in a data visualization dashboard and a 3D interactive environment in Unity. First, the static approach is evaluated in a proof of concept with synthetic and real text corpora.
Second, the complete chain of document analysis (static and dynamic) is implemented in software and applied to data from document databases.
Gkotse, Blerina. "Ontology-based Generation of Personalised Data Management Systems : an Application to Experimental Particle Physics." Thesis, Université Paris sciences et lettres, 2020. http://www.theses.fr/2020UPSLM017.
This thesis work aims at bridging the gap between the fields of Web Semantics and Experimental Particle Physics. Taking as a use case a specific type of physics experiment, namely the irradiation experiments used for assessing the resistance of components to radiation, a domain model, what in Web Semantics is called an ontology, has been created for describing the main concepts underlying the data management of irradiation experiments. Using such a formalisation, a methodology has been introduced for the automatic generation of data management systems based on ontologies, and used to generate a web application for IEDM, the previously introduced ontology. In the last part of this thesis work, using user-interface (UI) display preferences stored as instances of a UI-dedicated ontology we introduced, we present a method that represents these ontology instances as feature vectors (embeddings) for recommending personalised UIs.
Warintarawej, Pattaraporn. "Automatic Analysis of Blend Words." Thesis, Montpellier 2, 2013. http://www.theses.fr/2013MON20020.
Full textLexical blending is a remarkably productive morphological process, coining a new lexeme by fusing parts of at least two source words. Since new things need new words, blending has become a frequent means of word creation, as in smog (smoke and fog) or alicament (a French blend of aliment and médicament). The challenge is to design methods that discover how the two source words are combined. This thesis aims at automatically analysing blend words in order to find the source words they evoke. Its contributions fall into two main parts. First, as a contribution to automatic blend-word analysis, we develop top-k classification and its evaluation framework to predict the concepts of blend words, investigating three different word features: character n-grams, syllables and morpho-phonological stems. Moreover, we propose a novel approach, named Enqualitum, to automatically identify blend source words. The experiments are conducted both on synthetic French blend words and on words from a French thesaurus. Second, as a contribution to software engineering, we apply the idea of learning character patterns of identifiers to predict the concepts of source code, and introduce a method to automatically derive semantic context in source code. The experiments are conducted on real identifier names from open-source software packages. The results show the usefulness and effectiveness of the proposed approaches
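As an illustration of the first feature mentioned in the abstract above, character n-grams can be used to rank candidate concepts for a blend by n-gram overlap. This is a hedged sketch, not the thesis's actual top-k classifier; the lexicons and the scoring rule are invented for the example:

```python
from collections import Counter

def char_ngrams(word, n=3):
    """All character n-grams of a word, with boundary padding."""
    padded = f"#{word}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def top_k_concepts(blend, concept_lexicons, k=2, n=3):
    """Rank candidate concepts by the number of character n-gram types
    shared between the blend and each concept's word list."""
    blend_grams = Counter(char_ngrams(blend, n))
    scores = {}
    for concept, words in concept_lexicons.items():
        concept_grams = set(g for w in words for g in char_ngrams(w, n))
        scores[concept] = sum(1 for g in blend_grams if g in concept_grams)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

For instance, the blend "smog" shares the n-grams "#sm", "smo" and "og#" with a hypothetical weather lexicon containing "smoke" and "fog", so that concept ranks first.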
Ait, Saada Mira. "Unsupervised learning from textual data with neural text representations." Electronic Thesis or Diss., Université Paris Cité, 2023. http://www.theses.fr/2023UNIP7122.
Full textThe digital era generates enormous amounts of unstructured data such as images and documents, requiring specific processing methods to extract value from them. Textual data presents an additional challenge as it does not contain numerical values. Word embeddings are techniques that transform text into numerical data, enabling machine learning algorithms to process it. Unsupervised tasks are a major challenge in industry as they allow value creation from large amounts of data without requiring costly manual labeling. In this thesis we explore the use of Transformer models for unsupervised tasks such as clustering, anomaly detection, and data visualization. We also propose methodologies to better exploit multi-layer Transformer models in an unsupervised context, improving the quality and robustness of document clustering while avoiding having to choose which layer to use and the number of classes. Additionally, we investigate Transformer language models and their application to clustering more deeply, examining in particular transfer learning methods that fine-tune pre-trained models on a different task to improve their quality for future tasks. We demonstrate through an empirical study that post-processing methods based on dimensionality reduction are more advantageous than the fine-tuning strategies proposed in the literature. Finally, we propose a framework for detecting text anomalies in French adapted to two cases: one where the data concerns a specific topic and one where the data has multiple sub-topics. In both cases, we obtain results superior to the state of the art with significantly lower computation time
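One way to avoid choosing a single Transformer layer, in the spirit described above, is to aggregate the representations of all layers before clustering. The sketch below simulates layer outputs with plain arrays rather than a real Transformer, and uses a minimal k-means; the pooling scheme and function names are assumptions for illustration, not the thesis's actual method:

```python
import numpy as np

def aggregate_layers(layer_outputs):
    """Mean-pool tokens within each layer, then average over all layers,
    sidestepping the choice of a single 'best' layer."""
    # layer_outputs: list of arrays of shape (n_docs, n_tokens, dim)
    per_layer = [layer.mean(axis=1) for layer in layer_outputs]
    return np.mean(per_layer, axis=0)

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means for clustering the aggregated document vectors."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.stack([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels
```

In practice the `layer_outputs` list would come from a Transformer run with hidden-state output enabled; averaging layers like this is one simple aggregation choice among several possible ones.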
Morbieu, Stanislas. "Leveraging textual embeddings for unsupervised learning." Electronic Thesis or Diss., Université Paris Cité, 2020. http://www.theses.fr/2020UNIP5191.
Full textTextual data is ubiquitous and a useful information pool for many companies. In particular, the web provides an almost inexhaustible source of textual data that can be used for recommendation systems, business or technological watch, information retrieval, etc. Recent advances in natural language processing have made it possible to capture the meaning of words in their context in order to improve automatic translation systems, text summarization, or the classification of documents into predefined categories. However, the majority of these applications often rely on significant human intervention to annotate corpora: in the context of supervised classification, for example, this annotation consists in providing algorithms with examples of category assignments for documents. The algorithm thus learns to reproduce human judgment and apply it to new documents. The object of this thesis is to take advantage of these latest advances, which capture the semantics of text, and to use them in an unsupervised framework. The contributions of this thesis revolve around three main axes. First, we propose a method to transfer the information captured by a neural network to the co-clustering of documents and words. Co-clustering consists in partitioning the two dimensions of a data matrix simultaneously, forming both groups of similar documents and groups of coherent words. This facilitates the interpretation of a large corpus of documents, since groups of documents can be characterized by groups of words, thereby summarizing a large corpus of text. More precisely, we train the Paragraph Vectors algorithm on an augmented dataset by varying its hyperparameters, cluster the documents from the different vector representations, and apply a consensus algorithm to the different partitions. A constrained co-clustering of the co-occurrence matrix between terms and documents is then applied to preserve the consensus partitioning.
This method is found to yield significantly better document partitions on various document corpora and offers the advantage of the interpretability provided by co-clustering. Secondly, we present a method for evaluating co-clustering algorithms by exploiting vector representations of words called word embeddings. Word embeddings are vectors constructed from large volumes of text, one major characteristic of which is that two semantically close words have embeddings that are close in cosine distance. Our method measures the agreement between the partition of the documents and the partition of the words, thus offering a measure of co-clustering quality in a totally unsupervised setting. Thirdly, we are interested in the recommendation of classified ads. We present a system that recommends similar classified ads while one is being viewed. The descriptions of classified ads are often short and syntactically incorrect, and the use of synonyms makes it difficult for traditional systems to measure semantic similarity accurately. In addition, the high renewal rate of still-valid classified ads (product not yet sold) requires design choices that keep computation time low. Our method, simple to implement, addresses this use case and is again based on word embeddings. Their use has advantages but also raises some difficulties: creating such vectors requires choosing the values of several parameters, and the difference between the corpus on which the word embeddings were built upstream and the one on which they are used raises the problem of out-of-vocabulary words, which have no vector representation. To overcome these problems, we present an analysis of the impact of the different parameters on word embeddings as well as a study of methods for dealing with out-of-vocabulary words
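The evaluation idea described above can be illustrated by scoring how much more similar, in embedding space, words are within clusters than across them. This is a minimal sketch on invented data, not the thesis's actual measure:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def coclustering_score(word_vectors, word_labels):
    """Mean intra-cluster minus mean inter-cluster cosine similarity of
    word embeddings: higher means the word clusters are more coherent,
    with no labeled data needed."""
    intra, inter = [], []
    n = len(word_labels)
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(word_vectors[i], word_vectors[j])
            (intra if word_labels[i] == word_labels[j] else inter).append(sim)
    return np.mean(intra) - np.mean(inter)
```

A coherent word partition should score higher than a shuffled one on the same embeddings, which is what makes this usable as a fully unsupervised quality signal.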
Dutoit, Denis. "Reconnaissance de mots isoles a travers le reseau telephonique." Paris, ENST, 1988. http://www.theses.fr/1988ENST0008.
Full textBessenay, Carole. "La gestion des données environnementales dans un espace naturel sensible : le système d'information géographique des Hautes-Chaumes foréziennes (Massif central)." Saint-Etienne, 1995. http://www.theses.fr/1995STET2024.
Full textThe object of this research is to present and apply to a specific territory the concepts and potentialities of geographical information systems, which can help in understanding the functioning and evolution of natural spaces. The GIS of the "Hautes-Chaumes foréziennes" underlines the interest of computerizing "ecological planning" methods, whose aim is to integrate the environment into management practices through the analysis of the specific aptitudes or sensitivities of a space. This study is based on the inventory and mapping of the principal natural and human characteristics of the Hautes-Chaumes: topography, vegetation, humidity, pastoral activities, etc. The selection of several criteria allows the elaboration of a pluridisciplinary diagnosis which underlines the considerable sensitivity of this area. This diagnosis is then compared with an evaluation model of anthropic frequentation so as to define a zoning of the most vulnerable sectors, those both sensitive and subject to significant pressures. This analysis should urge policy-makers to conceive differentiated management measures in line with what is at stake in each area, in order to reconcile anthropic activities with the aptitudes of this natural space
Bouchon, Camillia. "Asymétrie fonctionnelle entre consonnes et voyelles de la naissance à l'âge de 6 mois : données d'imagerie cérébrale et de comportement." Thesis, Paris 5, 2014. http://www.theses.fr/2014PA05H119.
Full textSpeech is composed of two categories of sound, i.e. consonants and vowels, which have different properties and serve different linguistic functions. This consonant/vowel asymmetry, which is established in adults, has led Nespor, Peña and Mehler (2003) to suggest a division of labor present from birth, whereby consonants would facilitate lexical acquisition while vowels would help to learn grammatical rules of language. We have explored the developmental validity of this hypothesis by studying its origins in French-learning infants. First, our optical brain imaging studies show that both consonants and vowels provide input for precursory mechanisms of syntax processing (Exp. 1 - 3). Secondly, our studies on own-name recognition at 5 months demonstrate sensitivity to a vowel mispronunciation in monolingual infants (Alix/Elix), but fail to show a reaction to a consonant mispronunciation in initial position (Victor/Zictor) for monolinguals and bilinguals, or in final position (Luca/Luga) for monolinguals (Exp. 4 - 9). Thus, vowels are a better input for lexical processing in first familiar words. Our results contribute to the understanding of the developmental origin of consonant/vowel functional asymmetry, hence the influence of the native input on its emergence
Gong, Yifan. "Contribution à l'interprétation automatique des signaux en présence d'incertitude." Nancy 1, 1988. http://www.theses.fr/1988NAN10035.
Full textReuter, Sylvain. "La stimulation bi-ventriculaire dans l'insuffisance cardiaque réfractaire : corrélation entre les données cliniques et hémodynamiques sur un suivi de huit mois." Bordeaux 2, 2000. http://www.theses.fr/2000BOR23075.
Full textDeprez, Jean-François Basset Olivier Brusseau Elisabeth. "Estimation 3D de la déformation des tissus mous biologiques par traitement numérique des données ultrasonores radiofréquences." Villeurbanne : Doc'INSA, 2009. http://docinsa.insa-lyon.fr/these/pont.php?id=deprez.
Full textFath, Nour-Eddine. "Vers une homogénéisation en termes de données, justification et prétention, des propriétés argumentativo-illocutoires assciées au connecteur donc." Besançon, 1995. http://www.theses.fr/1995BESA1021.
Full textLarnaudie, Bruno. "Codesign, architecture fonctionnelle de fusion et architecture capteurs pour l'identification de situations accidentogènes : application à la sécurisation de véhicules deux-roues." Paris 11, 2006. http://www.theses.fr/2006PA112234.
Full textThis thesis was supported by the SUMOTORI PREDIT project, which won the "Technologies for Safety" award. Both the project and the thesis aim to prove the feasibility of a safety system for two-wheeled vehicles. To this end, we instrumented a two-wheeled vehicle in two steps: the first consisted in studying which sensors suit the dynamics of the motorbike and where to place them as optimally as possible; the second consisted in building a multi-sensor recorder to acquire data from all the sensors of the two-wheeled vehicle. This recorder was designed following a codesign approach, which led to a dual-microcontroller architecture. The test bench of the recorder confirms that it meets the real-time constraints of our application. The whole instrumentation was tested successfully in a series of measurements, during which a stuntman fell with the instrumented two-wheeled vehicle. These measurements constitute a database of accident situations (more than 70 experiments), probably one of the first in the world. Examining this database, by re-synchronising all the experiment scenarios, revealed "typical features" in the sensor signals that are characteristic of the motorbike's trajectories. A lack of coherence among some of these "typical features" is a good indicator of a fall of the motorbike. The three extracted indicators make it possible to detect the fall of the motorbike and thus address the problem posed by this thesis