Дисертації з теми "Text document classification"
Оформте джерело за APA, MLA, Chicago, Harvard та іншими стилями
Ознайомтеся з топ-50 дисертацій для дослідження на тему "Text document classification".
Біля кожної праці в переліку літератури доступна кнопка «Додати до бібліографії». Скористайтеся нею – і ми автоматично оформимо бібліографічне посилання на обрану працю в потрібному вам стилі цитування: APA, MLA, «Гарвард», «Чикаго», «Ванкувер» тощо.
Також ви можете завантажити повний текст наукової публікації у форматі «.pdf» та прочитати онлайн анотацію до роботи, якщо відповідні параметри наявні в метаданих.
Переглядайте дисертації для різних дисциплін та оформлюйте правильно вашу бібліографію.
Mondal, Abhro Jyoti. "Document Classification using Characteristic Signatures." University of Cincinnati / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1511793852923472.
Повний текст джерелаSendur, Zeynel. "Text Document Categorization by Machine Learning." Scholarly Repository, 2008. http://scholarlyrepository.miami.edu/oa_theses/209.
Повний текст джерелаBlein, Florent. "Automatic Document Classification Applied to Swedish News." Thesis, Linköping University, Department of Computer and Information Science, 2005. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-3065.
Повний текст джерелаThe first part of this paper presents briefly the ELIN[1] system, an electronic newspaper project. ELIN is a framework that stores news and displays them to the end-user. Such news are formatted using the xml[2] format. The project partner Corren[3] provided ELIN with xml articles, however the format used was not the same. My first task has been to develop a software that converts the news from one xml format (Corren) to another (ELIN).
The second and main part addresses the problem of automatic document classification and tries to find a solution for a specific issue. The goal is to automatically classify news articles from a Swedish newspaper company (Corren) into the IPTC[4] news categories.
This work has been carried out by implementing several classification algorithms, testing them and comparing their accuracy with existing software. The training and test documents were 3 weeks of the Corren newspaper that had to be classified into 2 categories.
The last tests were run with only one algorithm (Naïve Bayes) over a larger amount of data (7, then 10 weeks) and categories (12) to simulate a more real environment.
The results show that the Naïve Bayes algorithm, although the oldest, was the most accurate in this particular case. An issue raised by the results is that feature selection improves speed but can seldom reduce accuracy by removing too many features.
Anne, Chaitanya. "Advanced Text Analytics and Machine Learning Approach for Document Classification." ScholarWorks@UNO, 2017. http://scholarworks.uno.edu/td/2292.
Повний текст джерелаAlsaad, Amal. "Enhanced root extraction and document classification algorithm for Arabic text." Thesis, Brunel University, 2016. http://bura.brunel.ac.uk/handle/2438/13510.
Повний текст джерелаMcElroy, Jonathan David. "Automatic Document Classification in Small Environments." DigitalCommons@CalPoly, 2012. https://digitalcommons.calpoly.edu/theses/682.
Повний текст джерелаFelhi, Mehdi. "Document image segmentation : content categorization." Thesis, Université de Lorraine, 2014. http://www.theses.fr/2014LORR0109/document.
Повний текст джерелаIn this thesis I discuss the document image segmentation problem and I describe our new approaches for detecting and classifying document contents. First, I discuss our skew angle estimation approach. The aim of this approach is to develop an automatic approach able to estimate, with precision, the skew angle of text in document images. Our method is based on Maximum Gradient Difference (MGD) and R-signature. Then, I describe our second method based on Ridgelet transform.Our second contribution consists in a new hybrid page segmentation approach. I first describe our stroke-based descriptor that allows detecting text and line candidates using the skeleton of the binarized document image. Then, an active contour model is applied to segment the rest of the image into photo and background regions. Finally, text candidates are clustered using mean-shift analysis technique according to their corresponding sizes. The method is applied for segmenting scanned document images (newspapers and magazines) that contain text, lines and photo regions. Finally, I describe our stroke-based text extraction method. Our approach begins by extracting connected components and selecting text character candidates over the CIE LCH color space using the Histogram of Oriented Gradients (HOG) correlation coefficients in order to detect low contrasted regions. The text region candidates are clustered using two different approaches ; a depth first search approach over a graph, and a stable text line criterion. Finally, the resulted regions are refined by classifying the text line candidates into « text» and « non-text » regions using a Kernel Support Vector Machine K-SVM classifier
Wang, Yanbo Justin. "Language-independent pre-processing of large document bases for text classification." Thesis, University of Liverpool, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.445960.
Повний текст джерелаWang, Yalin. "Document analysis : table structure understanding and zone content classification /." Thesis, Connect to this title online; UW restricted, 2002. http://hdl.handle.net/1773/6079.
Повний текст джерелаWei, Zhihua. "The research on chinese text multi-label classification." Thesis, Lyon 2, 2010. http://www.theses.fr/2010LYO20025/document.
Повний текст джерелаLa thèse est centrée sur la Classification de texte, domaine en pleine expansion, avec de nombreuses applications actuelles et potentielles. Les apports principaux de la thèse portent sur deux points : Les spécificités du codage et du traitement automatique de la langue chinoise : mots pouvant être composés de un, deux ou trois caractères ; absence de séparation typographique entre les mots ; grand nombre d’ordres possibles entre les mots d’une phrase ; tout ceci aboutissant à des problèmes difficiles d’ambiguïté. La solution du codage en «n-grams »(suite de n=1, ou 2 ou 3 caractères) est particulièrement adaptée à la langue chinoise, car elle est rapide et ne nécessite pas les étapes préalables de reconnaissance des mots à l’aide d’un dictionnaire, ni leur séparation. La classification multi-labels, c'est-à-dire quand chaque individus peut être affecté à une ou plusieurs classes. Dans le cas des textes, on cherche des classes qui correspondent à des thèmes (topics) ; un même texte pouvant être rattaché à un ou plusieurs thème. Cette approche multilabel est plus générale : un même patient peut être atteint de plusieurs pathologies ; une même entreprise peut être active dans plusieurs secteurs industriels ou de services. La thèse analyse ces problèmes et tente de leur apporter des solutions, d’abord pour les classifieurs unilabels, puis multi-labels. Parmi les difficultés, la définition des variables caractérisant les textes, leur grand nombre, le traitement des tableaux creux (beaucoup de zéros dans la matrice croisant les textes et les descripteurs), et les performances relativement mauvaises des classifieurs multi-classes habituels
文本分类是信息科学中一个重要而且富有实际应用价值的研究领域。随着文本分类处理内容日趋复杂化和多元化,分类目标也逐渐多样化,研究有效的、切合实际应用需求的文本分类技术成为一个很有挑战性的任务,对多标签分类的研究应运而生。本文在对大量的单标签和多标签文本分类算法进行分析和研究的基础上,针对文本表示中特征高维问题、数据稀疏问题和多标签分类中分类复杂度高而精度低的问题,从不同的角度尝试运用粗糙集理论加以解决,提出了相应的算法,主要包括:针对n-gram作为中文文本特征时带来的维数灾难问题,提出了两步特征选择的方法,即去除类内稀有特征和类间特征选择相结合的方法,并就n-gram作为特征时的n值选取、特征权重的选择和特征相关性等问题在大规模中文语料库上进行了大量的实验,得出一些有用的结论。针对文本分类中运用高维特征表示文本带来的分类效率低,开销大等问题,提出了基于LDA模型的多标签文本分类算法,利用LDA模型提取的主题作为文本特征,构建高效的分类器。在PT3多标签分类转换方法下,该分类算法在中英文数据集上都表现出很好的效果,与目前公认最好的多标签分类方法效果相当。针对LDA模型现有平滑策略的随意性和武断性的缺点,提出了基于容差粗糙集的LDA语言模型平滑策略。该平滑策略首先在全局词表上构造词的容差类,再根据容差类中词的频率为每类文档的未登录词赋予平滑值。在中英文、平衡和不平衡语料库上的大量实验都表明该平滑方法显著提高了LDA模型的分类性能,在不平衡语料库上的提高尤其明显。针对多标签分类中分类复杂度高而精度低的问题,提出了一种基于可变精度粗糙集的复合多标签文本分类框架,该框架通过可变精度粗糙集方法划分文本特征空间,进而将多标签分类问题分解为若干个两类单标签分类问题和若干个标签数减少了的多标签分类问题。即,当一篇未知文本被划分到某一类文本的下近似区域时,可以直接用简单的单标签文本分类器判断其类别;当未知文本被划分在边界域时,则采用相应区域的多标签分类器进行分类。实验表明,这种分类框架下,分类的精确度和算法效率都有较大的提高。本文还设计和实现了一个基于多标签分类的网页搜索结果可视化系统(MLWC),该系统能够直接调用搜索引擎返回的搜索结果,并采用改进的Naïve Bayes多标签分类算法实现实时的搜索结果分类,使用户可以快速地定位搜索结果中感兴趣的文本。
Tisserant, Guillaume. "Généralisation de données textuelles adaptée à la classification automatique." Thesis, Montpellier, 2015. http://www.theses.fr/2015MONTS231/document.
Повний текст джерелаWe have work for a long time on the classification of text. Early on, many documents of different types were grouped in order to centralize knowledge. Classification and indexing systems were then created. They make it easy to find documents based on readers' needs. With the increasing number of documents and the appearance of computers and the internet, the implementation of text classification systems becomes a critical issue. However, textual data, complex and rich nature, are difficult to treat automatically. In this context, this thesis proposes an original methodology to organize and facilitate the access to textual information. Our automatic classification approache and our semantic information extraction enable us to find quickly a relevant information.Specifically, this manuscript presents new forms of text representation facilitating their processing for automatic classification. A partial generalization of textual data (GenDesc approach) based on statistical and morphosyntactic criteria is proposed. Moreover, this thesis focuses on the phrases construction and on the use of semantic information to improve the representation of documents. We will demonstrate through numerous experiments the relevance and genericity of our proposals improved they improve classification results.Finally, as social networks are in strong development, a method of automatic generation of semantic Hashtags is proposed. Our approach is based on statistical measures, semantic resources and the use of syntactic information. The generated Hashtags can then be exploited for information retrieval tasks from large volumes of data
Mazyad, Ahmad. "Contribution to automatic text classification : metrics and evolutionary algorithms." Thesis, Littoral, 2018. http://www.theses.fr/2018DUNK0487/document.
Повний текст джерелаThis thesis deals with natural language processing and text mining, at the intersection of machine learning and statistics. We are particularly interested in Term Weighting Schemes (TWS) in the context of supervised learning and specifically the Text Classification (TC) task. In TC, the multi-label classification task has gained a lot of interest in recent years. Multi-label classification from textual data may be found in many modern applications such as news classification where the task is to find the categories that a newswire story belongs to (e.g., politics, middle east, oil), based on its textual content, music genre classification (e.g., jazz, pop, oldies, traditional pop) based on customer reviews, film classification (e.g. action, crime, drama), product classification (e.g. Electronics, Computers, Accessories). Traditional classification algorithms are generally binary classifiers, and they are not suited for the multi-label classification. The multi-label classification task is, therefore, transformed into multiple single-label binary tasks. However, this transformation introduces several issues. First, terms distributions are only considered in relevance to the positive and the negative categories (i.e., information on the correlations between terms and categories is lost). Second, it fails to consider any label dependency (i.e., information on existing correlations between classes is lost). Finally, since all categories but one are grouped into one category (the negative category), the newly created tasks are imbalanced. This information is commonly used by supervised TWS to improve the effectiveness of the classification system. Hence, after presenting the process of multi-label text classification, and more particularly the TWS, we make an empirical comparison of these methods applied to the multi-label text classification task. We find that the superiority of the supervised methods over the unsupervised methods is still not clear. We show then that these methods are not fully adapted to the multi-label classification problem and they ignore much statistical information that coul be used to improve the classification results. Thus, we propose a new TWS based on information gain. This new method takes into consideration the term distribution, not only regarding the positive and the negative categories but also in relevance to all classes. Finally, aiming at finding specialized TWS that also solve the issue of imbalanced tasks, we studied the benefits of using genetic programming for generating TWS for the text classification task. Unlike previous studies, we generate formulas by combining statistical information at a microscopic level (e.g., the number of documents that contain a specific term) instead of using complete TWS. Furthermore, we make use of categorical information such as (e.g., the number of categories where a term occurs). Experiments are made to measure the impact of these methods on the performance of the model. We show through these experiments that the results are positive
Martinez-Alvarez, Miguel. "Knowledge-enhanced text classification : descriptive modelling and new approaches." Thesis, Queen Mary, University of London, 2014. http://qmro.qmul.ac.uk/xmlui/handle/123456789/27205.
Повний текст джерелаAl-Nashashibi, May Y. A. "Arabic Language Processing for Text Classification. Contributions to Arabic Root Extraction Techniques, Building An Arabic Corpus, and to Arabic Text Classification Techniques." Thesis, University of Bradford, 2012. http://hdl.handle.net/10454/6326.
Повний текст джерелаPetra University, Amman (Jordan)
Al-Nashashibi, May Yacoub Adib. "Arabic language processing for text classification : contributions to Arabic root extraction techniques, building an Arabic corpus, and to Arabic text classification techniques." Thesis, University of Bradford, 2012. http://hdl.handle.net/10454/6326.
Повний текст джерелаLund, Max. "Duplicate Detection and Text Classification on Simplified Technical English." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-158714.
Повний текст джерела鈴木, 祐介, Yusuke Suzuki, 茂樹 松原, Shigeki Matsubara, 正俊 吉川 та Masatoshi Yoshikswa. "アンカーテキストとハイパーリンクに基づくWeb 文書の階層的分類". 人工知能学会, 2005. http://hdl.handle.net/2237/97.
Повний текст джерелаFearn, Wilson Murray. "Exploring the Relationship Between Vocabulary Scaling and Algorithmic Performance in Text Classification for Large Datasets." BYU ScholarsArchive, 2019. https://scholarsarchive.byu.edu/etd/9053.
Повний текст джерелаCalarota, Gabriele. "Domain-specific word embeddings for ICD-9-CM classification." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2018. http://amslaurea.unibo.it/16714/.
Повний текст джерелаBouaziz, Ameni. "Méthodes d’apprentissage interactif pour la classification des messages courts." Thesis, Université Côte d'Azur (ComUE), 2017. http://www.theses.fr/2017AZUR4039/document.
Повний текст джерелаAutomatic short text classification is more and more used nowadays in various applications like sentiment analysis or spam detection. Short texts like tweets or SMS are more challenging than traditional texts. Therefore, their classification is more difficult owing to their shortness, sparsity and lack of contextual information. We present two new approaches to improve short text classification. Our first approach is "Semantic Forest". The first step of this approach proposes a new enrichment method that uses an external source of enrichment built in advance. The idea is to transform a short text from few words to a larger text containing more information in order to improve its quality before building the classification model. Contrarily to the methods proposed in the literature, the second step of our approach does not use traditional learning algorithm but proposes a new one based on the semantic links among words in the Random Forest classifier. Our second contribution is "IGLM" (Interactive Generic Learning Method). It is a new interactive approach that recursively updates the classification model by considering the new data arriving over time and by leveraging the user intervention to correct misclassified data. An abstraction method is then combined with the update mechanism to improve short text quality. The experiments performed on these two methods show their efficiency and how they outperform traditional algorithms in short text classification. Finally, the last part of the thesis concerns a complete and argued comparative study of the two proposed methods taking into account various criteria such as accuracy, speed, etc
Tran, Thi Quynh Nhi. "Robust and comprehensive joint image-text representations." Thesis, Paris, CNAM, 2017. http://www.theses.fr/2017CNAM1096/document.
Повний текст джерелаThis thesis investigates the joint modeling of visual and textual content of multimedia documents to address cross-modal problems. Such tasks require the ability to match information across modalities. A common representation space, obtained by eg Kernel Canonical Correlation Analysis, on which images and text can be both represented and directly compared is a generally adopted solution.Nevertheless, such a joint space still suffers from several deficiencies that may hinder the performance of cross-modal tasks. An important contribution of this thesis is therefore to identify two major limitations of such a space. The first limitation concerns information that is poorly represented on the common space yet very significant for a retrieval task. The second limitation consists in a separation between modalities on the common space, which leads to coarse cross-modal matching. To deal with the first limitation concerning poorly-represented data, we put forward a model which first identifies such information and then finds ways to combine it with data that is relatively well-represented on the joint space. Evaluations on emph{text illustration} tasks show that by appropriately identifying and taking such information into account, the results of cross-modal retrieval can be strongly improved. The major work in this thesis aims to cope with the separation between modalities on the joint space to enhance the performance of cross-modal tasks.We propose two representation methods for bi-modal or uni-modal documents that aggregate information from both the visual and textual modalities projected on the joint space. Specifically, for uni-modal documents we suggest a completion process relying on an auxiliary dataset to find the corresponding information in the absent modality and then use such information to build a final bi-modal representation for a uni-modal document. Evaluations show that our approaches achieve state-of-the-art results on several standard and challenging datasets for cross-modal retrieval or bi-modal and cross-modal classification
Walker, Briana Shanise. "Rethinking Document Classification: A Pilot for the Application of Text Mining Techniques To Enhance Standardized Assessment Protocols for Critical Care Medical Team Transfer of Care." Case Western Reserve University School of Graduate Studies / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=case1496760037827537.
Повний текст джерелаAlkhatib, Wael [Verfasser], Ralf [Akademischer Betreuer] Steinmetz, and Steffen [Akademischer Betreuer] Staab. "Semantically Enhanced and Minimally Supervised Models for Ontology Construction, Text Classification, and Document Recommendation / Wael Alkhatib ; Ralf Steinmetz, Steffen Staab." Darmstadt : Universitäts- und Landesbibliothek Darmstadt, 2020. http://d-nb.info/1216997691/34.
Повний текст джерелаElagouni, Khaoula. "Combining neural-based approaches and linguistic knowledge for text recognition in multimedia documents." Thesis, Rennes, INSA, 2013. http://www.theses.fr/2013ISAR0013/document.
Повний текст джерелаThis thesis focuses on the recognition of textual clues in images and videos. In this context, OCR (optical character recognition) systems, able to recognize caption texts as well as natural scene texts captured anywhere in the environment have been designed. Novel approaches, robust to text variability (differentfonts, colors, sizes, etc.) and acquisition conditions (complex background, non uniform lighting, low resolution, etc.) have been proposed. In particular, two kinds of methods dedicated to text recognition are provided:- A segmentation-based approach that computes nonlinear separations between characters well adapted to the localmorphology of images;- Two segmentation-free approaches that integrate a multi-scale scanning scheme. The first one relies on a graph model, while the second one uses a particular connectionist recurrent model able to handle spatial constraints between characters.In addition to the originalities of each approach, two extra contributions of this work lie in the design of a character recognition method based on a neural classification model and the incorporation of some linguistic knowledge that enables to take into account the lexical context.The proposed OCR systems were tested and evaluated on two datasets: a caption texts video dataset and a natural scene texts dataset (namely the public database ICDAR 2003). Experiments have demonstrated the efficiency of our approaches and have permitted to compare their performances to those of state-of-the-art methods, highlighting their advantages and limits
Risch, Jean-Charles. "Enrichissement des Modèles de Classification de Textes Représentés par des Concepts." Thesis, Reims, 2017. http://www.theses.fr/2017REIMS012/document.
Повний текст джерелаMost of text-classification methods use the ``bag of words” paradigm to represent texts. However Bloahdom and Hortho have identified four limits to this representation: (1) some words are polysemics, (2) others can be synonyms and yet differentiated in the analysis, (3) some words are strongly semantically linked without being taken into account in the representation as such and (4) certain words lose their meaning if they are extracted from their nominal group. To overcome these problems, some methods no longer represent texts with words but with concepts extracted from a domain ontology (Bag of Concept), integrating the notion of meaning into the model. Models integrating the bag of concepts remain less used because of the unsatisfactory results, thus several methods have been proposed to enrich text features using new concepts extracted from knowledge bases. My work follows these approaches by proposing a model-enrichment step using a domain ontology, I proposed two measures to estimate to belong to the categories of these new concepts. Using the naive Bayes classifier algorithm, I tested and compared my contributions on the Ohsumed corpus using the domain ontology ``Disease Ontology”. The satisfactory results led me to analyse more precisely the role of semantic relations in the enrichment step. These new works have been the subject of a second experiment in which we evaluate the contributions of the hierarchical relations of hypernymy and hyponymy
Synek, Radovan. "Klasifikace textu pomocí metody SVM." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2010. http://www.nusl.cz/ntk/nusl-237229.
Повний текст джерелаSullivan, Daniel Edward. "Evaluation of Word and Paragraph Embeddings and Analogical Reasoning as an Alternative to Term Frequency-Inverse Document Frequency-based Classification in Support of Biocuration." Diss., Virginia Tech, 2016. http://hdl.handle.net/10919/80572.
Повний текст джерелаPh. D.
Ghanmi, Nabil. "Segmentation d'images de documents manuscrits composites : application aux documents de chimie." Thesis, Université de Lorraine, 2016. http://www.theses.fr/2016LORR0109/document.
Повний текст джерелаThis thesis deals with chemistry document segmentation and structure analysis. This work aims to help chemists by providing the information on the experiments which have already been carried out. The documents are handwritten, heterogeneous and multi-writers. Although their physical structure is relatively simple, since it consists of a succession of three regions representing: the chemical formula of the experiment, a table of the used products and one or more text blocks describing the experimental procedure, several difficulties are encountered. In fact, the lines located at the region boundaries and the imperfections of the table layout make the separation task a real challenge. The proposed methodology takes into account these difficulties by performing segmentation at several levels and treating the region separation as a classification problem. First, the document image is segmented into linear structures using an appropriate horizontal smoothing. The horizontal threshold combined with a vertical overlapping tolerance favor the consolidation of fragmented elements of the formula without too merge the text. These linear structures are classified in text or graphic based on discriminant structural features. Then, the segmentation is continued on text lines to separate the rows of the table from the lines of the raw text locks. We proposed for this classification, a CRF model for determining the optimal labelling of the line sequence. The choice of this kind of model has been motivated by its ability to absorb the variability of lines and to exploit contextual information. For the segmentation of table into cells, we proposed a hybrid method that includes two levels of analysis: structural and syntactic. The first relies on the presence of graphic lines and the alignment of both text and spaces. The second tends to exploit the coherence of the cell content syntax. We proposed, in this context, a Recognition-based approach using contextual knowledge to detect the numeric fields present in the table. The thesis was carried out in the framework of CIFRE, in collaboration with the eNovalys campany.We have implemented and tested all the steps of the proposed system on a consequent dataset of chemistry documents
Albitar, Shereen. "De l'usage de la sémantique dans la classification supervisée de textes : application au domaine médical." Thesis, Aix-Marseille, 2013. http://www.theses.fr/2013AIXM4343/document.
Повний текст джерелаThe main interest of this research is the effect of using semantics in the process of supervised text classification. This effect is evaluated through an experimental study on documents related to the medical domain using the UMLS (Unified Medical Language System) as a semantic resource. This evaluation follows four scenarios involving semantics at different steps of the classification process: the first scenario incorporates the conceptualization step where text is enriched with corresponding concepts from UMLS; both the second and the third scenarios concern enriching vectors that represent text as Bag of Concepts (BOC) with similar concepts; the last scenario considers using semantics during class prediction, where concepts as well as the relations between them are involved in decision making. We test the first scenario using three popular classification techniques: Rocchio, NB and SVM. We choose Rocchio for the other scenarios for its extendibility with semantics. According to experiment, results demonstrated significant improvement in classification performance using conceptualization before indexing. Moderate improvements are reported using conceptualized text representation with semantic enrichment after indexing or with semantic text-to-text semantic similarity measures for prediction
Salah, Aghiles. "Von Mises-Fisher based (co-)clustering for high-dimensional sparse data : application to text and collaborative filtering data." Thesis, Sorbonne Paris Cité, 2016. http://www.theses.fr/2016USPCB093/document.
Повний текст джерелаCluster analysis or clustering, which aims to group together similar objects, is undoubtedly a very powerful unsupervised learning technique. With the growing amount of available data, clustering is increasingly gaining in importance in various areas of data science for several reasons such as automatic summarization, dimensionality reduction, visualization, outlier detection, speed up research engines, organization of huge data sets, etc. Existing clustering approaches are, however, severely challenged by the high dimensionality and extreme sparsity of the data sets arising in some current areas of interest, such as Collaborative Filtering (CF) and text mining. Such data often consists of thousands of features and more than 95% of zero entries. In addition to being high dimensional and sparse, the data sets encountered in the aforementioned domains are also directional in nature. In fact, several previous studies have empirically demonstrated that directional measures—that measure the distance between objects relative to the angle between them—, such as the cosine similarity, are substantially superior to other measures such as Euclidean distortions, for clustering text documents or assessing the similarities between users/items in CF. This suggests that in such context only the direction of a data vector (e.g., text document) is relevant, not its magnitude. It is worth noting that the cosine similarity is exactly the scalar product between unit length data vectors, i.e., L 2 normalized vectors. Thus, from a probabilistic perspective using the cosine similarity is equivalent to assuming that the data are directional data distributed on the surface of a unit-hypersphere. Despite the substantial empirical evidence that certain high dimensional sparse data sets, such as those encountered in the above domains, are better modeled as directional data, most existing models in text mining and CF are based on popular assumptions such as Gaussian, Multinomial or Bernoulli which are inadequate for L 2 normalized data. In this thesis, we focus on the two challenging tasks of text document clustering and item recommendation, which are still attracting a lot of attention in the domains of text mining and CF, respectively. In order to address the above limitations, we propose a suite of new models and algorithms which rely on the von Mises-Fisher (vMF) assumption that arises naturally for directional data lying on a unit-hypersphere
Poulain, d'Andecy Vincent. "Système à connaissance incrémentale pour la compréhension de document et la détection de fraude." Thesis, La Rochelle, 2021. http://www.theses.fr/2021LAROS025.
Повний текст джерелаThe Document Understanding is the Artificial Intelligence ability for machines to Read documents. In a global vision, it aims the understanding of the document function, the document class, and in a more local vision, it aims the understanding of some specific details like entities. The scientific challenge is to recognize more than 90% of the data. While the industrial challenge requires this performance with the least human effort to train the machine. This thesis defends that Incremental Learning methods can cope with both challenges. The proposals enable an efficient iterative training with very few document samples. For the classification task, we demonstrate (1) the continue learning of textual descriptors, (2) the benefit of the discourse sequence, (3) the benefit of integrating a Souvenir of few samples in the knowledge model. For the data extraction task, we demonstrate an iterative structural model, based on a star-graph representation, which is enhanced by the embedding of few a priori knowledges. Aware about economic and societal impacts because the document fraud, this thesis deals with this issue too. Our modest contribution is only to study the different fraud categories to open further research. This research work has been done in a non-classic framework, in conjunction of industrial activities for Yooz and collaborative research projects like the FEDER Securdoc project supported by la région Nouvelle Aquitaine, and the Labcom IDEAS supported by the ANR
Tiepmar, Jochen. "Release of the MySQL based implementation of the CTS protocol." Universitätsbibliothek Leipzig, 2016. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-201773.
Повний текст джерелаГригораш, Вадим Святославович, та Vadym Hryhorash. "Комп’ютеризована система тематичної рубрикації документів". Bachelor's thesis, Тернопільський національний технічний університет імені Івана Пулюя, 2021. http://elartu.tntu.edu.ua/handle/lib/35428.
Повний текст джерелаIn the qualification work the program prototype of the computerized system of thematic rubrication of text documents is designed and realized. The system includes a document repository and a component that corresponds to the rubrication of documents based on the analysis of their content. The component of rubrication of documents consists of the following modules: the module of preliminary processing of the text; text feature detection module; document classification module. As methods for detecting the features of the text in the document, it is proposed to use a variety of statistical features of the TF-IDF algorithm, as well as semantic vectorization.
Вступ. 1. Аналіз технічного завдання і сфери застосування комп’ютеризованої системи тема- тичної рубрикації документів 2. Проектування структури і компонентів комп’ютеризованої системи тематичної рубрикації документів. 3. Програмна реалізація комп’ютеризованої системи тематичної рубрикації документів. 4. Безпека життєдіяльності, основи охорони праці. Висновки
Liaghat, Zeinab. "Quality-efficiency trade-offs in machine learning applied to text processing." Doctoral thesis, Universitat Pompeu Fabra, 2017. http://hdl.handle.net/10803/402575.
Повний текст джерелаAvui en dia, la quantitat de documents digitals disponibles està creixent ràpidament, expandint- se a un ritme considerable i procedint de diverses fonts. Les fonts d’informació no estructurada i semiestructurada inclouen la World Wide Web, articles de notícies, bases de dades biològiques, correus electrònics, biblioteques digitals, repositoris electrònics governamentals, , sales de xat, forums en línia, blogs i mitjans socials com Facebook, Instagram, LinkedIn, Pinterest, Twitter, YouTube i molts d’altres. Extreure’n informació d’aquests recursos i trobar informació útil d’aquestes col.leccions s’ha convertit en un desafiament que fa que l’organització d’aquesta enorme quantitat de dades esdevingui una necessitat. La mineria de dades, l’aprenentatge automàtic i el processament del llenguatge natural són tècniques poderoses que poden utilitzar-se conjuntament per fer front a aquest gran desafiament. Segons la tasca o el problema en qüestió existeixen molts emfo- caments diferents que es poden utilitzar. Els mètodes que s’estan implementant s’optimitzen continuament, però aquests mètodes d’aprenentatge automàtic supervisats han estat provats i comparats amb grans dades d’entrenament. La pregunta és : Què passa amb la qualitat dels mètodes si incrementem les dades de 100 MB a 1 GB? Més encara: Les millores en la qualitat valen la pena quan la taxa de processament de les dades minva? Podem canviar qualitat per eficiència, tot recuperant la perdua de qualitat quan processem més dades? Aquesta tesi és una primera aproximació per resoldre aquestes preguntes de forma gene- ral per a tasques de processament de text, ja que no hi ha hagut suficient investigació per a comparar aquests mètodes considerant el balanç entre el tamany de les dades, la qualitat dels resultats i el temps de processament. Per tant, proposem un marc per analitzar aquest balanç i l’apliquem a tres problemes importants de processament de text: Reconeixement d’Entitats Anomenades, Anàlisi de Sentiments i Classificació de Documents. Aquests problemes tam- bé han estat seleccionats perquè tenen nivells diferents de granularitat: paraules, opinions i documents complerts. Per a cada problema seleccionem diferents algoritmes d’aprenentatge automàtic i avaluem el balanç entre aquestes variables per als diferents algoritmes en grans conjunts de dades públiques ( notícies, opinions, patents). Utilitzem subconjunts de diferents tamanys entre 50 MB i alguns GB per a explorar aquests balanç. Per acabar, com havíem suposat, no perquè un algoritme és eficient en poques dades serà eficient en grans quantitats de dades. Per als dos últims problemes considerem algoritmes similars i també dos conjunts diferents de dades i tècniques d’avaluació per a estudiar l’impacte d’aquests dos paràmetres en els resultats. Mostrem que els resultats no canvien significativament amb aquests canvis.
Hoy en día, la cantidad de documentos digitales disponibles está creciendo rápidamente, ex- pandiéndose a un ritmo considerable y procediendo de una variedad de fuentes. Estas fuentes de información no estructurada y semi estructurada incluyen la World Wide Web, artículos de noticias, bases de datos biológicos, correos electrónicos, bibliotecas digitales, repositorios electrónicos gubernamentales, salas de chat, foros en línea, blogs y medios sociales como Fa- cebook, Instagram, LinkedIn, Pinterest, Twitter, YouTube, además de muchos otros. Extraer información de estos recursos y encontrar información útil de tales colecciones se ha convertido en un desafío que hace que la organización de esa enorme cantidad de datos sea una necesidad. La minería de datos, el aprendizaje automático y el procesamiento del lenguaje natural son técnicas poderosas que pueden utilizarse conjuntamente para hacer frente a este gran desafío. Dependiendo de la tarea o el problema en cuestión, hay muchos enfoques dife- rentes que se pueden utilizar. Los métodos que se están implementando se están optimizando continuamente, pero estos métodos de aprendizaje automático supervisados han sido probados y comparados con datos de entrenamiento grandes. La pregunta es ¿Qué pasa con la calidad de los métodos si incrementamos los datos de 100 MB a 1GB? Más aún, ¿las mejoras en la cali- dad valen la pena cuando la tasa de procesamiento de los datos disminuye? ¿Podemos cambiar calidad por eficiencia, recuperando la perdida de calidad cuando procesamos más datos? Esta tesis es una primera aproximación para resolver estas preguntas de forma general para tareas de procesamiento de texto, ya que no ha habido investigación suficiente para comparar estos métodos considerando el balance entre el tamaño de los datos, la calidad de los resultados y el tiempo de procesamiento. Por lo tanto, proponemos un marco para analizar este balance y lo aplicamos a tres importantes problemas de procesamiento de texto: Reconocimiento de En- tidades Nombradas, Análisis de Sentimientos y Clasificación de Documentos. Estos problemas fueron seleccionados también porque tienen distintos niveles de granularidad: palabras, opinio- nes y documentos completos. Para cada problema seleccionamos distintos algoritmos de apren- dizaje automático y evaluamos el balance entre estas variables para los distintos algoritmos en grandes conjuntos de datos públicos (noticias, opiniones, patentes). Usamos subconjuntos de distinto tamaño entre 50 MB y varios GB para explorar este balance. Para concluir, como ha- bíamos supuesto, no porque un algoritmo es eficiente en pocos datos será eficiente en grandes cantidades de datos. Para los dos últimos problemas consideramos algoritmos similares y tam- bién dos conjuntos distintos de datos y técnicas de evaluación, para estudiar el impacto de estos dos parámetros en los resultados. Mostramos que los resultados no cambian significativamente con estos cambios.
Ailem, Melissa. "Sparsity-sensitive diagonal co-clustering algorithms for the effective handling of text data." Thesis, Sorbonne Paris Cité, 2016. http://www.theses.fr/2016USPCB087.
Повний текст джерелаIn the current context, there is a clear need for Text Mining techniques to analyse the huge quantity of unstructured text documents available on the Internet. These textual data are often represented by sparse high dimensional matrices where rows and columns represent documents and terms respectively. Thus, it would be worthwhile to simultaneously group these terms and documents into meaningful clusters, making this substantial amount of data easier to handle and interpret. Co-clustering techniques just serve this purpose. Although many existing co-clustering approaches have been successful in revealing homogeneous blocks in several domains, these techniques are still challenged by the high dimensionality and sparsity characteristics exhibited by document-term matrices. Due to this sparsity, several co-clusters are primarily composed of zeros. While homogeneous, these co-clusters are irrelevant and must be filtered out in a post-processing step to keep only the most significant ones. The objective of this thesis is to propose new co-clustering algorithms tailored to take into account these sparsity-related issues. The proposed algorithms seek a block diagonal structure and allow to straightaway identify the most useful co-clusters, which makes them specially effective for the text co-clustering task. Our contributions can be summarized as follows: First, we introduce and demonstrate the effectiveness of a novel co-clustering algorithm based on a direct maximization of graph modularity. While existing graph-based co-clustering algorithms rely on spectral relaxation, the proposed algorithm uses an iterative alternating optimization procedure to reveal the most meaningful co-clusters in a document-term matrix. Moreover, the proposed optimization has the advantage of avoiding the computation of eigenvectors, a task which is prohibitive when considering high dimensional data. This is an improvement over spectral approaches, where the eigenvectors computation is necessary to perform the co-clustering. Second, we use an even more powerful approach to discover block diagonal structures in document-term matrices. We rely on mixture models, which offer strong theoretical foundations and considerable flexibility that makes it possible to uncover various specific cluster structure. More precisely, we propose a rigorous probabilistic model based on the Poisson distribution and the well known Latent Block Model. Interestingly, this model includes the sparsity in its formulation, which makes it particularly effective for text data. Setting the estimate of this model’s parameters under the Maximum Likelihood (ML) and the Classification Maximum Likelihood (CML) approaches, four co-clustering algorithms have been proposed, including a hard, a soft, a stochastic and a fourth algorithm which leverages the benefits of both the soft and stochastic variants, simultaneously. As a last contribution of this thesis, we propose a new biomedical text mining framework that includes some of the above mentioned co-clustering algorithms. This work shows the contribution of co-clustering in a real biomedical text mining problematic. The proposed framework is able to propose new clues about the results of genome wide association studies (GWAS) by mining PUBMED abstracts. This framework has been tested on asthma disease and allowed to assess the strength of associations between asthma genes reported in previous GWAS as well as discover new candidate genes likely associated to asthma. In a nutshell, while several text co-clustering algorithms already exist, their performance can be substantially increased if more appropriate models and algorithms are available. According to the extensive experiments done on several challenging real-world text data sets, we believe that this thesis has served well this objective
Franco, Salvador Marc. "A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning." Doctoral thesis, Universitat Politècnica de València, 2017. http://hdl.handle.net/10251/84285.
Повний текст джерелаEl Procesamiento del Lenguaje Natural (PLN) es un campo de la informática, la inteligencia artificial y la lingüística computacional centrado en las interacciones entre las máquinas y el lenguaje de los humanos. Uno de sus mayores desafíos implica capacitar a las máquinas para inferir el significado del lenguaje natural humano. Con este propósito, diversas representaciones del significado y el contexto han sido propuestas obteniendo un rendimiento competitivo. Sin embargo, estas representaciones todavía tienen un margen de mejora en escenarios transdominios y translingües. En esta tesis estudiamos el uso de grafos de conocimiento como una representación transdominio y translingüe del texto y su significado. Un grafo de conocimiento es un grafo que expande y relaciona los conceptos originales pertenecientes a un conjunto de palabras. Sus propiedades se consiguen gracias al uso como base de conocimiento de una red semántica multilingüe de amplia cobertura. Esto permite tener una cobertura de cientos de lenguajes y millones de conceptos generales y específicos del ser humano. Como punto de partida de nuestra investigación empleamos características basadas en grafos de conocimiento - junto con otras tradicionales y meta-aprendizaje - para la tarea de PLN de clasificación de la polaridad mono- y transdominio. El análisis y conclusiones de ese trabajo muestra evidencias de que los grafos de conocimiento capturan el significado de una forma independiente del dominio. La siguiente parte de nuestra investigación aprovecha la capacidad de la red semántica multilingüe y se centra en tareas de Recuperación de Información (RI). Primero proponemos un modelo de análisis de similitud completamente basado en grafos de conocimiento para detección de plagio translingüe. A continuación, mejoramos ese modelo para cubrir palabras fuera de vocabulario y tiempos verbales, y lo aplicamos a las tareas translingües de recuperación de documentos, clasificación, y detección de plagio. Por último, estudiamos el uso de grafos de conocimiento para las tareas de PLN de respuesta de preguntas en comunidades, identificación del lenguaje nativo, y identificación de la variedad del lenguaje. Las contribuciones de esta tesis ponen de manifiesto el potencial de los grafos de conocimiento como representación transdominio y translingüe del texto y su significado en tareas de PLN y RI. Estas contribuciones han sido publicadas en diversas revistas y conferencias internacionales.
El Processament del Llenguatge Natural (PLN) és un camp de la informàtica, la intel·ligència artificial i la lingüística computacional centrat en les interaccions entre les màquines i el llenguatge dels humans. Un dels seus majors reptes implica capacitar les màquines per inferir el significat del llenguatge natural humà. Amb aquest propòsit, diverses representacions del significat i el context han estat proposades obtenint un rendiment competitiu. No obstant això, aquestes representacions encara tenen un marge de millora en escenaris trans-dominis i trans-llenguatges. En aquesta tesi estudiem l'ús de grafs de coneixement com una representació trans-domini i trans-llenguatge del text i el seu significat. Un graf de coneixement és un graf que expandeix i relaciona els conceptes originals pertanyents a un conjunt de paraules. Les seves propietats s'aconsegueixen gràcies a l'ús com a base de coneixement d'una xarxa semàntica multilingüe d'àmplia cobertura. Això permet tenir una cobertura de centenars de llenguatges i milions de conceptes generals i específics de l'ésser humà. Com a punt de partida de la nostra investigació emprem característiques basades en grafs de coneixement - juntament amb altres tradicionals i meta-aprenentatge - per a la tasca de PLN de classificació de la polaritat mono- i trans-domini. L'anàlisi i conclusions d'aquest treball mostra evidències que els grafs de coneixement capturen el significat d'una forma independent del domini. La següent part de la nostra investigació aprofita la capacitat\hyphenation{ca-pa-ci-tat} de la xarxa semàntica multilingüe i se centra en tasques de recuperació d'informació (RI). Primer proposem un model d'anàlisi de similitud completament basat en grafs de coneixement per a detecció de plagi trans-llenguatge. A continuació, vam millorar aquest model per cobrir paraules fora de vocabulari i temps verbals, i ho apliquem a les tasques trans-llenguatges de recuperació de documents, classificació, i detecció de plagi. Finalment, estudiem l'ús de grafs de coneixement per a les tasques de PLN de resposta de preguntes en comunitats, identificació del llenguatge natiu, i identificació de la varietat del llenguatge. Les contribucions d'aquesta tesi posen de manifest el potencial dels grafs de coneixement com a representació trans-domini i trans-llenguatge del text i el seu significat en tasques de PLN i RI. Aquestes contribucions han estat publicades en diverses revistes i conferències internacionals.
Franco Salvador, M. (2017). A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/84285
TESIS
Balikas, Georgios. "Explorer et apprendre à partir de collections de textes multilingues à l'aide des modèles probabilistes latents et des réseaux profonds." Thesis, Université Grenoble Alpes (ComUE), 2017. http://www.theses.fr/2017GREAM054/document.
Повний текст джерелаText is one of the most pervasive and persistent sources of information. Content analysis of text in its broad sense refers to methods for studying and retrieving information from documents. Nowadays, with the ever increasing amounts of text becoming available online is several languages and different styles, content analysis of text is of tremendous importance as it enables a variety of applications. To this end, unsupervised representation learning methods such as topic models and word embeddings constitute prominent tools.The goal of this dissertation is to study and address challengingproblems in this area, focusing on both the design of novel text miningalgorithms and tools, as well as on studying how these tools can be applied to text collections written in a single or several languages.In the first part of the thesis we focus on topic models and more precisely on how to incorporate prior information of text structure to such models.Topic models are built on the premise of bag-of-words, and therefore words are exchangeable. While this assumption benefits the calculations of the conditional probabilities it results in loss of information.To overcome this limitation we propose two mechanisms that extend topic models by integrating knowledge of text structure to them. We assume that the documents are partitioned in thematically coherent text segments. The first mechanism assigns the same topic to the words of a segment. The second, capitalizes on the properties of copulas, a tool mainly used in the fields of economics and risk management that is used to model the joint probability density distributions of random variables while having access only to their marginals.The second part of the thesis explores bilingual topic models for comparable corpora with explicit document alignments. Typically, a document collection for such models is in the form of comparable document pairs. The documents of a pair are written in different languages and are thematically similar. Unless translations, the documents of a pair are similar to some extent only. Meanwhile, representative topic models assume that the documents have identical topic distributions, which is a strong and limiting assumption. To overcome it we propose novel bilingual topic models that incorporate the notion of cross-lingual similarity of the documents that constitute the pairs in their generative and inference processes. Calculating this cross-lingual document similarity is a task on itself, which we propose to address using cross-lingual word embeddings.The last part of the thesis concerns the use of word embeddings and neural networks for three text mining applications. First, we discuss polylingual document classification where we argue that translations of a document can be used to enrich its representation. Using an auto-encoder to obtain these robust document representations we demonstrate improvements in the task of multi-class document classification. Second, we explore multi-task sentiment classification of tweets arguing that by jointly training classification systems using correlated tasks can improve the obtained performance. To this end we show how can achieve state-of-the-art performance on a sentiment classification task using recurrent neural networks. The third application we explore is cross-lingual information retrieval. Given a document written in one language, the task consists in retrieving the most similar documents from a pool of documents written in another language. In this line of research, we show that by adapting the transportation problem for the task of estimating document distances one can achieve important improvements
Sychra, Martin. "Analýza sentimentu s využitím dolování dat." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2016. http://www.nusl.cz/ntk/nusl-255424.
Повний текст джерелаPagliarani, Andrea. "New markov chain based methods for single and cross-domain sentiment classification." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2015. http://amslaurea.unibo.it/8445/.
Повний текст джерелаPrůša, Petr. "Multi-label klasifikace textových dokumentů." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2012. http://www.nusl.cz/ntk/nusl-412872.
Повний текст джерелаPhaweni, Thembani. "Classification and visualisation of text documents using networks." Master's thesis, University of Cape Town, 2018. http://hdl.handle.net/11427/29534.
Повний текст джерелаAbdaoui, Amine. "Fouille des médias sociaux français : expertise et sentiment." Thesis, Montpellier, 2016. http://www.theses.fr/2016MONTT249/document.
Повний текст джерелаSocial Media has changed the way we communicate between individuals, within organizations and communities. The availability of these social data opens new opportunities to understand and influence the user behavior. Therefore, Social Media Mining is experiencing a growing interest in various scientific and economic circles. In this thesis, we are specifically interested in the users of these networks whom we try to characterize in two ways: (i) their expertise and their reputations and (ii) the sentiments they express.Conventionally, social data is often mined according to its network structure. However, the textual content of the exchanged messages may reveal additional knowledge that can not be known through the analysis of the structure. Until recently, the majority of work done for the analysis of the textual content was proposed for English. The originality of this thesis is to develop methods and resources based on the textual content of the messages for French Social Media Mining.In the first axis, we initially suggest to predict the user expertise. For this, we used forums that recruit health experts to learn classification models that serve to identify messages posted by experts in any other health forum. We demonstrate that models learned on appropriate forums can be used effectively on other forums. Then, in a second step, we focus on the user reputation in these forums. The idea is to seek expressions of trust and distrust expressed in the textual content of the exchanged messages, to search the recipients of these messages and use this information to deduce users' reputation. We propose a new reputation measure that weighs the score of each response by the reputation of its author. Automatic and manual evaluations have demonstrated the effectiveness of the proposed approach.In the second axis, we focus on the extraction of sentiments (emotions and polarity). For this, we started by building a French lexicon of sentiments and emotions that we call FEEL (French Expanded Emotions Lexicon). This lexicon is built semi-automatically by translating and expanding its English counterpart NRC EmoLex. We then compare FEEL with existing French lexicons from literature on reference benchmarks. The results show that FEEL improves the classification of French texts according to their polarities and emotions. Finally, we propose to evaluate different features, methods and resources for the classification of sentiments in French. The conducted experiments have identified useful features and methods in the classification of sentiments for different types of texts. The learned systems have been particularly efficient on reference benchmarks.Generally, this work opens promising perspectives on various analytical tasks of Social Media Mining including: (i) combining multiple sources in mining Social Media users; (ii) multi-modal Social Media Mining using not just text but also image, videos, location, etc. and (iii) multilingual sentiment analysis
Pitou, Cynthia. "Extraction d'informations textuelles au sein de documents numérisés : cas des factures." Thesis, La Réunion, 2017. http://www.theses.fr/2017LARE0015.
Повний текст джерелаDocument processing is the transformation of a human understandable data in a computer system understandable format. Document analysis and understanding are the two phases of document processing. Considering a document containing lines, words and graphical objects such as logos, the analysis of such a document consists in extracting and isolating the words, lines and objects and then grouping them into blocks. The subsystem of document understanding builds relationships (to the right, left, above, below) between the blocks. A document processing system must be able to: locate textual information, identify if that information is relevant comparatively to other information contained in the document, extract that information in a computer system understandable format. For the realization of such a system, major difficulties arise from the variability of the documents characteristics, such as: the type (invoice, form, quotation, report, etc.), the layout (font, style, disposition), the language, the typography and the quality of scanning.This work is concerned with scanned documents, also known as document images. We are particularly interested in locating textual information in invoice images. Invoices are largely used and well regulated documents, but not unified. They contain mandatory information (invoice number, unique identifier of the issuing company, VAT amount, net amount, etc.) which, depending on the issuer, can take various locations in the document. The present work is in the framework of region-based textual information localization and extraction.First, we present a region-based method guided by quadtree decomposition. The principle of the method is to decompose the images of documents in four equals regions and each regions in four new regions and so on. Then, with a free optical character recognition (OCR) engine, we try to extract precise textual information in each region. A region containing a number of expected textual information is not decomposed further. Our method allows to determine accurately in document images, the regions containing text information that one wants to locate and retrieve quickly and efficiently.In another approach, we propose a textual information extraction model consisting in a set of prototype regions along with pathways for browsing through these prototype regions. The life cycle of the model comprises five steps:- Produce synthetic invoice data from real-world invoice images containing the textual information of interest, along with their spatial positions.- Partition the produced data.- Derive the prototype regions from the obtained partition clusters.- Derive pathways for browsing through the prototype regions, from the concept lattice of a suitably defined formal context.- Update incrementally the set of protype regions and the set of pathways, when one has to add additional data
Tagny, Ngompe Gildas. "Méthodes D'Analyse Sémantique De Corpus De Décisions Jurisprudentielles." Thesis, IMT Mines Alès, 2020. http://www.theses.fr/2020EMAL0002.
Повний текст джерелаA case law is a corpus of judicial decisions representing the way in which laws are interpreted to resolve a dispute. It is essential for lawyers who analyze it to understand and anticipate the decision-making of judges. Its exhaustive analysis is difficult manually because of its immense volume and the unstructured nature of the documents. The estimation of the judicial risk by individuals is thus impossible because they are also confronted with the complexity of the judicial system and language. The automation of decision analysis enable an exhaustive extraction of relevant knowledge for structuring case law for descriptive and predictive analyses. In order to make the comprehension of a case law exhaustive and more accessible, this thesis deals with the automation of some important tasks for the expert analysis of court decisions. First, we study the application of probabilistic sequence labeling models for the detection of the sections that structure court decisions, legal entities, and legal rules citations. Then, the identification of the demands of the parties is studied. The proposed approach for the recognition of the requested and granted quanta exploits the proximity between sums of money and automatically learned key-phrases. We also show that the meaning of the judges' result is identifiable either from predefined keywords or by a classification of decisions. Finally, for a given category of demands, the situations or factual circumstances in which those demands are made, are discovered by clustering the decisions. For this purpose, a method of learning a similarity distance is proposed and compared with established distances. This thesis discusses the experimental results obtained on manually annotated real data. Finally, the thesis proposes a demonstration of applications to the descriptive analysis of a large corpus of French court decisions
Alvarenga, Leonel Diógenes Carvalhaes. "Uso de Seleção de Características da Wikipedia na Classificação Automática de Textos." Universidade Federal de Goiás, 2012. http://repositorio.bc.ufg.br/tede/handle/tde/2870.
Повний текст джерелаMade available in DSpace on 2014-07-31T14:43:10Z (GMT). No. of bitstreams: 2 license_rdf: 23148 bytes, checksum: 9da0b6dfac957114c6a7714714b86306 (MD5) uso_de_selecao_de_caracteristicas_da_wikipedia_na_classificacao_automatica_de_textos.pdf: 1449954 bytes, checksum: 9086dec3868b6b703340b550c614d33d (MD5) Previous issue date: 2012-09-20
Fundação de Amparo à Pesquisa do Estado de Goiás - FAPEG
The traditional methods of text classification typically represent documents only as a set of words, also known as "Bag of Words"(BOW). Several studies have shown good results on making use of thesauri and encyclopedias as external information sources, aiming to expand the BOW representation by the identification of synonymy and hyponymy relationships between present terms in a document collection. However, the expansion process may introduce terms that lead to an erroneous classification. In this paper, we propose the use of feature selection measures in order to select features extracted from Wikipedia in order to improve the efectiveness of the expansion process. The study also proposes a feature selection measure called Tendency Factor to One Category (TF1C), so that the experiments showed that this measure proves to be competitive with the other measures Information Gain, Gain Ratio and Chisquared, in the process, delivering the best gains in microF1 and macroF1, in most experiments. The full use of features selected in this process showed to be more stable in assisting the classification, while it showed lower performance on restricting its insertion only to documents of the classes in which these features are well punctuated by the selection measures. When applied in the Reuters-21578, Ohsumed first - 20000 and 20Newsgroups collections, our approach to feature selection allowed the reduction of noise insertion inherent in the expansion process, and improved the results of use hyponyms, and demonstrated that the synonym relationship from Wikipedia can also be used in the document expansion, increasing the efectiveness of the automatic text classification.
Os métodos tradicionais de classificação de textos normalmente representam documentos apenas como um conjunto de palavras, também conhecido como BOW (do inglês, Bag of Words). Vários estudos têm mostrado bons resultados ao utilizar-se de tesauros e enciclopédias como fontes externas de informações, objetivando expandir a representação BOW a partir da identificação de relacionamentos de sinonômia e hiponômia entre os termos presentes em uma coleção de documentos. Todavia, o processo de expansão pode introduzir termos que conduzam a uma classificação errônea do documento. No presente trabalho, propõe-se a aplicação de medidas de avaliação de termos para a seleção de características extraídas da Wikipédia, com o objetivo de melhorar a eficácia de sua utilização durante o processo de expansão de documentos. O estudo também propõe uma medida de seleção de características denominada Fator de Tendência a uma Categoria (FT1C), de modo que os experimentos realizados demonstraram que esta medida apresenta desempenho competitivo com as medidas Information Gain, Gain Ratio e Chi-squared, neste processo, apresentando os melhores ganhos de microF1 e macroF1, na maioria dos experimentos realizados. O uso integral das características selecionadas neste processo, demonstrou auxiliar a classificação de forma mais estável, ao passo que apresentou menor desempenho ao se restringir sua inserção somente aos documentos das classes em que estas características são bem pontuadas pelas medidas de seleção. Ao ser aplicada nas coleções Reuters-21578, Ohsumed rst-20000 e 20Newsgroups, a abordagem com seleção de características permitiu a redução da inserção de ruídos inerentes do processo de expansão e potencializou o uso de hipônimos, assim como demonstrou que as relações de sinonômia da Wikipédia também podem ser utilizadas na expansão de documentos, elevando a eficácia da classificação automática de textos.
Saad, Mohammed Fathi Hassan. "A cluster classification method for the extraction of knowledge from text documents." Thesis, University of East Anglia, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.446513.
Повний текст джерелаKe, Guiyao. "Mesures de comparabilité pour la construction assistée de corpus comparables bilingues thématiques." Phd thesis, Université de Bretagne Sud, 2014. http://tel.archives-ouvertes.fr/tel-00997837.
Повний текст джерелаTsatsaronis, George. "An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition." BioMed Central, 2015. https://tud.qucosa.de/id/qucosa%3A29496.
Повний текст джерелаTeboul, Bruno. "Le développement du neuromarketing aux Etats-Unis et en France. Acteurs-réseaux, traces et controverses." Thesis, Paris Sciences et Lettres (ComUE), 2016. http://www.theses.fr/2016PSLED036/document.
Повний текст джерелаOur research explores the comparative development of neuromarketing between the United States and France. We start by analyzing the literature on neuromarketing. We use as theoretical and methodological framework the Actor Network Theory (ANT) (in the wake of the work of Bruno Latour and Michel Callon). We show how “human and non-human” entities (“actants”): actor-network, traces (publications) and controversies form the pillars of a new discipline such as the neuromarketing. Our hybrid approach “qualitative-quantitative” allows us to build an applied methodology of the ANT: bibliometric analysis (Publish Or Perish), text mining, clustering and semantic analysis of the scientific literature and web of the neuromarketing. From these results, we build data visualizations, mapping of network graphs (Gephi) that reveal the interrelations and associations between actors, traces and controversies about neuromarketing
Tsatsaronis, George. "An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition." Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2017. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-202687.
Повний текст джерела