Dissertations / Theses on the topic 'Textual data-mining'

To see the other types of publications on this topic, follow the link: Textual data-mining.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 38 dissertations / theses for your research on the topic 'Textual data-mining.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

Zhou, Wubai. "Data Mining Techniques to Understand Textual Data." FIU Digital Commons, 2017. https://digitalcommons.fiu.edu/etd/3493.

Full text
Abstract:
More than ever, online information delivery and storage rely heavily on text. Billions of texts are produced every day in the form of documents, news, logs, search queries, ad keywords, tags, tweets, messenger conversations, social network posts, etc. Text understanding is a fundamental and essential task involving broad research topics, and it contributes to many applications in areas such as text summarization, search engines, recommendation systems, online advertising, conversational bots, and so on. However, understanding text is never a trivial task for computers, especially for noisy and ambiguous text such as logs and search queries. This dissertation focuses on textual understanding tasks derived from two domains, disaster management and IT service management, that mainly use textual data as an information carrier. Improving situation awareness in disaster management and reducing the human effort involved in IT service management both call for more intelligent and efficient solutions for understanding the textual data that acts as the main information carrier in these two domains. From the perspective of data mining, four directions are identified: (1) intelligently generating a storyline that summarizes the evolution of a hurricane from a relevant online corpus; (2) automatically recommending resolutions according to the textual symptom description in a ticket; (3) gradually adapting the resolution recommendation system to time-correlated features derived from text; and (4) efficiently learning distributed representations for short, low-quality ticket symptom descriptions and resolutions. Given these different types of textual data, the data mining techniques proposed in the four research directions successfully address our tasks of understanding and extracting valuable knowledge from textual data. Concretely, the dissertation designs and develops data mining methodologies to better understand textual information, including (1) a storyline generation method for efficient summarization of natural hurricanes based on a crawled online corpus; (2) a recommendation framework for automated ticket resolution in IT service management; (3) an adaptive recommendation system for time-varying, temporally correlated features derived from text; and (4) a deep neural ranking model that not only recommends resolutions successfully but also efficiently outputs distributed representations for ticket descriptions and resolutions.
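To make direction (2) concrete, the sketch below retrieves resolutions for a new ticket by TF-IDF similarity to past symptom descriptions. It is a simplified stand-in for the dissertation's recommendation framework, and the tickets, resolutions and the `recommend` helper are hypothetical.

```python
# Illustrative only: nearest-neighbour resolution recommendation over TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

historical_tickets = [
    "disk usage exceeded threshold on server",
    "application server not responding to requests",
    "user cannot log in after password reset",
]
resolutions = [
    "clean up temporary files and extend the volume",
    "restart the application service and check heap settings",
    "unlock the account and resynchronize credentials",
]

vectorizer = TfidfVectorizer()
ticket_matrix = vectorizer.fit_transform(historical_tickets)

def recommend(symptom_text, top_k=1):
    """Return the resolutions attached to the most similar past tickets."""
    query_vec = vectorizer.transform([symptom_text])
    scores = cosine_similarity(query_vec, ticket_matrix).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [(resolutions[i], float(scores[i])) for i in ranked]

print(recommend("server disk is full, jobs failing"))
```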
APA, Harvard, Vancouver, ISO, and other styles
2

Ur-Rahman, Nadeem. "Textual data mining applications for industrial knowledge management solutions." Thesis, Loughborough University, 2010. https://dspace.lboro.ac.uk/2134/6373.

Full text
Abstract:
In recent years, knowledge has become an important resource for enhancing business, and many activities are required to manage these knowledge resources well and help companies remain competitive within industrial environments. The data available in most industrial setups is complex in nature, and multiple different data formats may be generated to track the progress of different projects, whether related to developing new products or to providing better services to customers. Knowledge discovery from different databases requires considerable effort, and data mining techniques serve this purpose by handling structured data formats. If, however, the data is semi-structured or unstructured, the combined use of data and text mining technologies may be needed to produce fruitful results. This thesis focuses on issues related to the discovery of knowledge from semi-structured or unstructured data formats through the application of textual data mining techniques that automate the classification of textual information into two different categories or classes, which can then be used to help manage the knowledge available in multiple data formats. Applications of different data mining techniques to discover valuable information and knowledge in the manufacturing and construction industries are explored as part of a literature review. The application of text mining techniques to handle semi-structured or unstructured data is discussed in detail. A novel integration of different data and text mining tools is proposed in the form of a framework in which knowledge discovery and its refinement are performed through the application of clustering and Apriori association rule mining algorithms. Finally, the hypothesis that better classification accuracies can be achieved is examined by applying the methodology to case study data available in the form of Post Project Review (PPR) reports. The process of discovering useful knowledge, interpreting it and utilising it is automated to classify the textual data into two classes.
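As a rough illustration of the Apriori-style association rule mining mentioned above, the sketch below counts frequent term pairs in a handful of hypothetical Post Project Review term sets and prints simple rules above a support threshold; the thesis's integrated clustering-plus-Apriori framework is considerably richer.

```python
# Illustrative only: frequent term pairs and naive association rules.
from itertools import combinations
from collections import Counter

reviews = [
    {"delay", "supplier", "cost"},
    {"delay", "supplier", "design"},
    {"cost", "design", "rework"},
    {"delay", "cost", "supplier"},
]

min_support = 2          # minimum number of reviews containing the itemset
item_counts = Counter(term for review in reviews for term in review)
pair_counts = Counter(pair for review in reviews
                      for pair in combinations(sorted(review), 2))

for (a, b), support in pair_counts.items():
    if support >= min_support:
        confidence = support / item_counts[a]     # confidence of the rule a -> b
        print(f"{a} -> {b}  support={support}  confidence={confidence:.2f}")
```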
APA, Harvard, Vancouver, ISO, and other styles
3

Kubalík, Jakub. "Mining of Textual Data from the Web for Speech Recognition." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2010. http://www.nusl.cz/ntk/nusl-237170.

Full text
Abstract:
The initial goal of this project was to study language modelling for speech recognition and techniques for obtaining textual data from the Web. The text introduces the basic techniques of speech recognition and describes in more detail language models based on statistical methods. In particular, the work deals with criteria for evaluating the quality of language models and of speech recognition systems. The text further describes models and techniques of data mining, especially information retrieval. The problems connected with obtaining data from the Web are then presented, and the Google search engine is introduced by way of contrast. Part of the project was the design and implementation of a system for obtaining text from the Web, which is described in detail. The main goal of the work, however, was to verify whether data obtained from the Web can bring any benefit to speech recognition. The described techniques therefore try to find the optimal way of using data obtained from the Web to improve both sample language models and models deployed in real recognition systems.
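To illustrate the statistical language modelling discussed above, the sketch below trains a bigram model with add-one smoothing on a tiny, hypothetical web-crawled text and evaluates a sentence by perplexity, one of the standard quality criteria mentioned in the abstract; real systems use far larger corpora and higher-order n-grams.

```python
# Illustrative only: bigram language model with Laplace smoothing and perplexity.
import math
from collections import Counter

train = "the weather today is nice the weather tomorrow is bad".split()
bigrams = Counter(zip(train, train[1:]))
unigrams = Counter(train)
vocab = set(train)

def bigram_prob(prev, word):
    # Add-one (Laplace) smoothing over the training vocabulary.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

def perplexity(sentence):
    words = sentence.split()
    log_prob = sum(math.log(bigram_prob(p, w)) for p, w in zip(words, words[1:]))
    return math.exp(-log_prob / (len(words) - 1))

print(perplexity("the weather is nice"))
```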
APA, Harvard, Vancouver, ISO, and other styles
4

Kalledat, Tobias. "Tracking domain knowledge based on segmented textual sources." Doctoral thesis, Humboldt-Universität zu Berlin, Wirtschaftswissenschaftliche Fakultät, 2009. http://dx.doi.org/10.18452/15925.

Full text
Abstract:
Die hier vorliegende Forschungsarbeit hat zum Ziel, Erkenntnisse über den Einfluss der Vorverarbeitung auf die Ergebnisse der Wissensgenerierung zu gewinnen und konkrete Handlungsempfehlungen für die geeignete Vorverarbeitung von Textkorpora in Text Data Mining (TDM) Vorhaben zu geben. Der Fokus liegt dabei auf der Extraktion und der Verfolgung von Konzepten innerhalb bestimmter Wissensdomänen mit Hilfe eines methodischen Ansatzes, der auf der waagerechten und senkrechten Segmentierung von Korpora basiert. Ergebnis sind zeitlich segmentierte Teilkorpora, welche die Persistenzeigenschaft der enthaltenen Terme widerspiegeln. Innerhalb jedes zeitlich segmentierten Teilkorpus können jeweils Cluster von Termen gebildet werden, wobei eines diejenigen Terme enthält, die bezogen auf das Gesamtkorpus nicht persistent sind und das andere Cluster diejenigen, die in allen zeitlichen Segmenten vorkommen. Auf Grundlage einfacher Häufigkeitsmaße kann gezeigt werden, dass allein die statistische Qualität eines einzelnen Korpus es erlaubt, die Vorverarbeitungsqualität zu messen. Vergleichskorpora sind nicht notwendig. Die Zeitreihen der Häufigkeitsmaße zeigen signifikante negative Korrelationen zwischen dem Cluster von Termen, die permanent auftreten, und demjenigen das die Terme enthält, die nicht persistent in allen zeitlichen Segmenten des Korpus vorkommen. Dies trifft ausschließlich auf das optimal vorverarbeitete Korpus zu und findet sich nicht in den anderen Test Sets, deren Vorverarbeitungsqualität gering war. Werden die häufigsten Terme unter Verwendung domänenspezifischer Taxonomien zu Konzepten gruppiert, zeigt sich eine signifikante negative Korrelation zwischen der Anzahl unterschiedlicher Terme pro Zeitsegment und den einer Taxonomie zugeordneten Termen. Dies trifft wiederum nur für das Korpus mit hoher Vorverarbeitungsqualität zu. Eine semantische Analyse auf einem mit Hilfe einer Schwellenwert basierenden TDM Methode aufbereiteten Datenbestand ergab signifikant unterschiedliche Resultate an generiertem Wissen, abhängig von der Qualität der Datenvorverarbeitung. Mit den in dieser Forschungsarbeit vorgestellten Methoden und Maßzahlen ist sowohl die Qualität der verwendeten Quellkorpora, als auch die Qualität der angewandten Taxonomien messbar. Basierend auf diesen Erkenntnissen werden Indikatoren für die Messung und Bewertung von Korpora und Taxonomien entwickelt sowie Empfehlungen für eine dem Ziel des nachfolgenden Analyseprozesses adäquate Vorverarbeitung gegeben.
The research work presented here has the goal of analysing the influence of pre-processing on the results of knowledge generation and of giving concrete recommendations for suitable pre-processing of text corpora in TDM projects. The research focuses on the extraction and tracking of concepts within certain knowledge domains using an approach that segments corpora horizontally (along the timeline) and vertically (by persistence of terms). The result is a set of corpus segments along the timeline. Within each timeline segment, clusters of terms can be built according to their persistence in relation to each single time-based corpus segment and to the whole corpus. Based on simple frequency measures, it can be shown that the statistical quality of a single corpus alone allows the pre-processing quality to be measured; comparison corpora are not necessary. The time series of the frequency measures show significant negative correlations between the cluster of terms that occur permanently and the cluster of terms that do not persist across all time segments, but only for the optimally pre-processed corpus; the opposite was found in every other test set, whose pre-processing quality was lower. The most frequent terms were grouped into concepts using domain-specific taxonomies. A significant negative correlation was found between the number of distinct terms per yearly corpus segment and the terms assigned to a taxonomy, again only for corpora with a high level of pre-processing quality. A semantic analysis based on a simple threshold-based TDM method resulted in significantly different extracted knowledge depending on the quality of the data pre-processing. With the measures introduced in this research, both the quality of the source corpora and the quality of the applied taxonomies can be measured. Based on these results, indicators for measuring and assessing corpora and taxonomies are developed, and recommendations are given for pre-processing appropriate to the goal of the subsequent analysis process.
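The sketch below illustrates the segmentation idea in miniature: a hypothetical corpus is split into yearly segments, terms are divided into a persistent cluster (present in every segment) and a transient one, and the two frequency time series are correlated. It is only an illustration of the measure, not the thesis's methodology.

```python
# Illustrative only: time-segmented term clusters and their frequency correlation.
import numpy as np
from collections import Counter

segments = {   # year -> tokenized documents for that time segment (hypothetical)
    2005: ["market risk model".split(), "risk report model".split()],
    2006: ["market crash risk".split(), "model risk audit".split()],
    2007: ["audit crash report".split(), "market model risk".split()],
}

per_segment_counts = {y: Counter(t for doc in docs for t in doc) for y, docs in segments.items()}
all_terms = set().union(*per_segment_counts.values())
persistent = {t for t in all_terms if all(t in c for c in per_segment_counts.values())}
transient = all_terms - persistent

years = sorted(segments)
persistent_freq = [sum(per_segment_counts[y][t] for t in persistent) for y in years]
transient_freq = [sum(per_segment_counts[y][t] for t in transient) for y in years]

print("persistent terms:", persistent)
print("correlation:", np.corrcoef(persistent_freq, transient_freq)[0, 1])
```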
APA, Harvard, Vancouver, ISO, and other styles
5

元吉, 忠寛, and Tadahiro MOTOYOSHI. "災害のイマジネーション力に関する探索的研究 - 大学生の想像力と阪神淡路大震災の事例との比較 -." 名古屋大学大学院教育発達科学研究科, 2006. http://hdl.handle.net/2237/9454.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Spiegler, Sebastian R. "Comparative study of clustering algorithms on textual databases : clustering of curricula vitae into competency-based groups to support knowledge management." Saarbrücken : VDM Verl. Müller, 2007. http://deposit.d-nb.de/cgi-bin/dokserv?id=3035354&prov=M&dok_var=1&dok_ext=htm.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Nieto, Erick Mauricio Gómez. "Projeção multidimensional aplicada a visualização de resultados de busca textual." Universidade de São Paulo, 2012. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-05122012-105730/.

Full text
Abstract:
Usuários da Internet estão muito familiarizados que resultados de uma consulta sejam exibidos como uma lista ordenada de snippets. Cada snippet possui conteúdo textual que mostra um resumo do documento referido (ou página web) e um link para o mesmo. Esta representação tem muitas vantagens como, por exemplo, proporcionar uma navegação fácil e simples de interpretar. No entanto, qualquer usuário que usa motores de busca poderia reportar possivelmente alguma experiência de decepção com este modelo. Todavia, ela tem limitações em situações particulares, como o não fornecimento de uma visão geral da coleção de documentos recuperados. Além disso, dependendo da natureza da consulta - por exemplo, pode ser muito geral, ou ambígua, ou mal expressa - a informação desejada pode ser mal classificada, ou os resultados podem contemplar temas variados. Várias tarefas de busca seriam mais fáceis se fosse devolvida aos usuários uma visão geral dos documentos organizados de modo a refletir a forma como são relacionados, em relação ao conteúdo. Propomos uma técnica de visualização para exibir os resultados de consultas web que visa superar tais limitações. Ela combina a capacidade de preservação de vizinhança das projeções multidimensionais com a conhecida representação baseada em snippets. Essa visualização emprega uma projeção multidimensional para derivar layouts bidimensionais dos resultados da pesquisa, que preservam as relações de similaridade de texto, ou vizinhança. A similaridade é calculada mediante a aplicação da similaridade do cosseno sobre uma representação bag-of-words vetorial de coleções construídas a partir dos snippets. Se os snippets são exibidos diretamente de acordo com o layout derivado, eles se sobrepõem consideravelmente, produzindo uma visualização pobre. Nós superamos esse problema definindo uma energia funcional que considera tanto a sobreposição entre os snippets e a preservação da estrutura de vizinhanças como foi dada no layout da projeção. Minimizando esta energia funcional é fornecida uma representação bidimensional com preservação das vizinhanças dos snippets textuais com sobreposição mínima. A visualização transmite tanto uma visão global dos resultados da consulta como os agrupamentos visuais que refletem documentos relacionados, como é ilustrado em vários dos exemplos apresentados
Internet users are very familiar with the results of a search query displayed as a ranked list of snippets. Each textual snippet shows a content summary of the referred document (or web page) and a link to it. This display has many advantages, e.g., it affords easy navigation and is straightforward to interpret. Nonetheless, any user of search engines could possibly report some experience of disappointment with this metaphor. Indeed, it has limitations in particular situations, as it fails to provide an overview of the document collection retrieved. Moreover, depending on the nature of the query - e.g., it may be too general, or ambiguous, or ill expressed - the desired information may be poorly ranked, or results may contemplate varied topics. Several search tasks would be easier if users were shown an overview of the returned documents, organized so as to reflect how related they are, content-wise. We propose a visualization technique to display the results of web queries aimed at overcoming such limitations. It combines the neighborhood preservation capability of multidimensional projections with the familiar snippet-based representation by employing a multidimensional projection to derive two-dimensional layouts of the query search results that preserve text similarity relations, or neighborhoods. Similarity is computed by applying the cosine similarity over a bag-of-words vector representation of the collection built from the snippets. If the snippets are displayed directly according to the derived layout they overlap considerably, producing a poor visualization. We overcome this problem by defining an energy functional that considers both the overlap amongst snippets and the preservation of the neighborhood structure as given in the projected layout. Minimizing this energy functional provides a neighborhood-preserving two-dimensional arrangement of the textual snippets with minimum overlap. The visualization conveys both a global view of the query results and visual groupings that reflect related results, as illustrated in several of the examples shown.
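A minimal sketch of the projection step described above: bag-of-words vectors for a few hypothetical snippets, cosine dissimilarities, and a two-dimensional layout from classical multidimensional scaling. The thesis's actual technique, including the overlap-removing energy functional, goes beyond this stand-in.

```python
# Illustrative only: snippet similarities projected to 2-D with MDS.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import MDS

snippets = [
    "python tutorial for beginners with examples",
    "learn python programming step by step",
    "best hiking trails in the alps",
    "alpine hiking routes and maps",
]

bow = CountVectorizer().fit_transform(snippets)
dissimilarity = 1.0 - cosine_similarity(bow)

layout = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dissimilarity)
for text, (x, y) in zip(snippets, layout):
    print(f"({x:+.2f}, {y:+.2f})  {text[:40]}")
```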
APA, Harvard, Vancouver, ISO, and other styles
8

Fabbri, Renato. "Topological stability and textual differentiation in human interaction networks: statistical analysis, visualization and linked data." Universidade de São Paulo, 2017. http://www.teses.usp.br/teses/disponiveis/76/76132/tde-11092017-154706/.

Full text
Abstract:
This work reports on stable (or invariant) topological properties and textual differentiation in human interaction networks, with benchmarks derived from public email lists. Activity over time and topology were observed in snapshots along a timeline and at different scales. Our analysis shows that activity is practically the same for all networks across timescales ranging from seconds to months. The principal components of the participants in the topological metrics space remain practically unchanged as different sets of messages are considered. The activity of participants follows the expected scale-free outline, thus yielding the hub, intermediary and peripheral classes of vertices by comparison against the Erdös-Rényi model. The relative sizes of these three sectors are essentially the same for all email lists and remain the same over time. Typically, 3-12% of the vertices are hubs, 15-45% are intermediary and 44-81% are peripheral vertices. Texts from each of these sectors are shown to be very different through direct measurements and through an adaptation of the Kolmogorov-Smirnov test. These properties are consistent with the literature and may be general for human interaction networks, which has important implications for establishing a typology of participants based on quantitative criteria. To guide and support this research, we also developed a visualization method for dynamic networks based on animations. To facilitate verification and further steps in the analyses, we supply a linked data representation of the data related to our results.
Este trabalho relata propriedades topológicas estáveis (ou invariantes) e diferenciação textual em redes de interação humana, com referências derivadas de listas públicas de e-mail. A atividade ao longo do tempo e a topologia foram observadas em instantâneos ao longo de uma linha do tempo e em diferentes escalas. A análise mostra que a atividade é praticamente a mesma para todas as redes em escalas temporais de segundos a meses. As componentes principais dos participantes no espaço das métricas topológicas mantêm-se praticamente inalteradas quando diferentes conjuntos de mensagens são considerados. A atividade dos participantes segue o esperado perfil livre de escala, produzindo, assim, as classes de vértices dos hubs, dos intermediários e dos periféricos em comparação com o modelo Erdös-Rényi. Os tamanhos relativos destes três setores são essencialmente os mesmos para todas as listas de e-mail e ao longo do tempo. Normalmente, 3-12% dos vértices são hubs, 15-45% são intermediários e 44-81% são vértices periféricos. Os textos de cada um destes setores são considerados muito diferentes através de uma adaptação dos testes de Kolmogorov-Smirnov. Estas propriedades são consistentes com a literatura e podem ser gerais para redes de interação humana, o que tem implicações importantes para o estabelecimento de uma tipologia dos participantes com base em critérios quantitativos. De modo a guiar e apoiar esta pesquisa, também desenvolvemos um método de visualização para redes dinâmicas através de animações. Para facilitar a verificação e passos seguintes nas análises, fornecemos uma representação em dados ligados dos dados relacionados aos nossos resultados.
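The sketch below illustrates the comparison against the Erdös-Rényi model described above: vertex degrees of an interaction-like network are compared with a random graph of the same size and density and split into hub, intermediary and peripheral classes. The cut-offs are a simplified heuristic, not the thesis's exact criterion, and the generated graph stands in for a real email network.

```python
# Illustrative only: hub / intermediary / peripheral split against an Erdos-Renyi null model.
import networkx as nx

G = nx.barabasi_albert_graph(200, 2, seed=1)      # stand-in for an email interaction network
n, m = G.number_of_nodes(), G.number_of_edges()
ER = nx.gnm_random_graph(n, m, seed=1)            # random graph with the same n and m

er_max_degree = max(d for _, d in ER.degree())
er_mean_degree = 2 * m / n

hubs = [v for v, d in G.degree() if d > er_max_degree]
peripheral = [v for v, d in G.degree() if d <= er_mean_degree]
intermediary = [v for v, _ in G.degree() if v not in hubs and v not in peripheral]

print(f"hubs: {len(hubs) / n:.1%}, intermediary: {len(intermediary) / n:.1%}, "
      f"peripheral: {len(peripheral) / n:.1%}")
```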
APA, Harvard, Vancouver, ISO, and other styles
9

Mendes, Marília Soares. "MALTU - model for evaluation of interaction in social systems from the Users Textual Language." Universidade Federal do Ceará, 2015. http://www.teses.ufc.br/tde_busca/arquivo.php?codArquivo=14296.

Full text
Abstract:
The field of Human Computer Interaction (HCI) has suggested various methods for evaluating systems in order to improve their usability and User eXperience (UX). The advent of Web 2.0 has allowed the development of applications marked by collaboration, communication and interaction among their users in a way and on a scale never seen before. Social Systems (SS) (e.g. Twitter, Facebook, MySpace, LinkedIn etc.) are examples of such applications and have features such as frequent exchange of messages, spontaneity and expression of feelings. The opportunities and challenges posed by these types of applications require the traditional evaluation methods to be reassessed, taking these new characteristics into consideration. For instance, the postings of users on SS reveal their opinions on various issues, including what they think of the system. This work aims to test the hypothesis that the postings of users in SS provide relevant data for the evaluation of usability and UX in SS. In our review of the literature, we did not identify any evaluation model focused on collecting and interpreting texts from users in order to assess user experience and system usability. Thus, this thesis proposes MALTU - Model for evaluation of interaction in social systems from the Users Textual Language. In order to provide a basis for the development of the proposed model, we conducted a study of how users express their opinions about the system in natural language. We extracted postings of users from four SS of different contexts. HCI experts classified these postings, which were then studied and processed using Natural Language Processing (NLP) techniques and data mining, and analyzed in order to obtain a generic model. MALTU was applied to two SS: an entertainment SS and an educational one. The results show that it is possible to evaluate a system from the postings of its users in SS. Such assessments are aided by extraction patterns related to usage, to the types of postings and to the HCI factors used in the evaluation of the system.
A Área de Interação Humano-Computador (IHC) tem sugerido muitas formas para avaliar sistemas a fim de melhorar sua usabilidade e a eXperiência do Usuário (UX). O surgimento da web 2.0 permitiu o desenvolvimento de aplicações marcadas pela colaboração, comunicação e interatividade entre seus usuários de uma forma e em uma escala nunca antes observadas. Sistemas Sociais (SS) (e.g., Twitter, Facebook, MySpace, LinkedIn etc.) são exemplos dessas aplicações e possuem características como: frequente troca de mensagens e expressão de sentimentos de forma espontânea. As oportunidades e os desafios trazidos por esses tipos de aplicações exigem que os métodos tradicionais de avaliação sejam repensados, considerando essas novas características. Por exemplo, as postagens dos usuários em SS revelam suas opiniões sobre diversos assuntos, inclusive sobre o que eles pensam do sistema em uso. Esta tese procura testar a hipótese de que as postagens dos usuários em SS fornecem dados relevantes para avaliação da Usabilidade e da UX (UUX) em SS. Durante as pesquisas realizadas na literatura, não foi identificado nenhum modelo de avaliação que tenha direcionado seu foco na coleta e análise das postagens dos usuários a fim de avaliar a UUX de um sistema em uso. Sendo assim, este estudo propõe o MALTU - Modelo para Avaliação da interação em sistemas sociais a partir da Linguagem Textual do Usuário. A fim de fornecer bases para o desenvolvimento do modelo proposto, foram realizados estudos de como os usuários expressam suas opiniões sobre o sistema em língua natural. Foram extraídas postagens de usuários de quatro SS de contextos distintos. Tais postagens foram classificadas por especialistas de IHC, estudadas e processadas utilizando técnicas de Processamento da Linguagem Natural (PLN) e mineração de dados e, analisadas a fim da obtenção de um modelo genérico. O MALTU foi aplicado em dois SS: um de entretenimento e um SS educativo. Os resultados mostram que é possível avaliar um sistema a partir das postagens dos usuários em SS. Tais avaliações são auxiliadas por padrões de extração relacionados ao uso, aos tipos de postagens e às metas de IHC utilizadas na avaliação do sistema.
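As a toy illustration of the first step of such an evaluation, the sketch below flags hypothetical postings that mention the system and assigns a crude polarity from keyword lists; MALTU itself relies on expert classification combined with NLP and data mining rather than fixed keyword rules.

```python
# Illustrative only: flagging postings that talk about the system and tagging polarity.
system_terms = {"app", "site", "update", "login", "feed", "button"}
positive_terms = {"love", "easy", "fast", "great", "useful"}
negative_terms = {"slow", "crash", "crashing", "hate", "confusing", "broken"}

postings = [
    "I love the new feed layout, so easy to use",
    "the app keeps crashing after the update",
    "had a great lunch with friends today",
]

for post in postings:
    words = set(post.lower().split())
    if words & system_terms:
        if words & positive_terms:
            polarity = "positive"
        elif words & negative_terms:
            polarity = "negative"
        else:
            polarity = "neutral"
        print(f"[about the system / {polarity}] {post}")
    else:
        print(f"[not about the system] {post}")
```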
APA, Harvard, Vancouver, ISO, and other styles
10

Kamenieva, Iryna. "Research Ontology Data Models for Data and Metadata Exchange Repository." Thesis, Växjö University, School of Mathematics and Systems Engineering, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:vxu:diva-6351.

Full text
Abstract:

For research in the fields of data mining and machine learning, the availability of various input data sets is a necessary condition, and researchers therefore create databases of such sets. Examples of such systems are the UCI Machine Learning Repository, the Data Envelopment Analysis Dataset Repository, the XMLData Repository, and the Frequent Itemset Mining Dataset Repository. Along with these statistical repositories, a whole range of resources, from simple file stores to specialized repositories, can be used by researchers when solving applied tasks and studying their own algorithms and scientific problems. At first sight, the only difficulty for the user would seem to be finding and understanding the structure of such scattered information stores. However, a detailed study of these repositories reveals deeper problems in the use of the data: in particular, a complete mismatch between the rigid structure of the data files and the SDMX (Statistical Data and Metadata Exchange) standard and structures used by many European organizations, the impossibility of adapting the data in advance to a concrete applied task, and the lack of a history of data usage for particular scientific and applied tasks.

There are now many data mining methods, as well as large quantities of data stored in various repositories. The repositories themselves, however, contain no data mining (DM) methods, and the methods are not linked to application areas. An essential problem is linking the subject (problem) domain, the DM methods, and the datasets appropriate for each method. Therefore, this work considers the problem of building ontological models of DM methods, describing the interaction between the methods and the corresponding data from repositories, and designing intelligent agents that allow the statistical repository user to choose the appropriate method and the data corresponding to the task being solved. The system structure is proposed, and an intelligent search agent over the ontological model of DM methods, which takes the user's personal requests into account, is implemented.

For the implementation of an intelligent data and metadata exchange repository, an agent-oriented approach was selected. The model uses a service-oriented architecture and is built with the cross-platform programming language Java, the multi-agent platform Jadex, the database server Oracle Spatial 10g, and the development environment for ontological models, Protégé version 3.4.
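The sketch below illustrates the underlying idea with a tiny, hand-written mapping that links problem domains to data mining methods and to datasets suited to them, plus a helper standing in for the search agent. All entries are hypothetical; the thesis realizes this as a proper ontology (Protégé) queried by Jadex agents.

```python
# Illustrative only: a dictionary standing in for the ontology of DM methods and datasets.
ontology = {
    "classification": {
        "methods": ["decision tree", "naive bayes", "svm"],
        "datasets": ["iris", "adult census"],
    },
    "association rules": {
        "methods": ["apriori", "fp-growth"],
        "datasets": ["retail transactions"],
    },
}

def recommend(task, preferred_method=None):
    """Return (method, datasets) for a task, honouring the user's preference if known."""
    entry = ontology[task]
    method = preferred_method if preferred_method in entry["methods"] else entry["methods"][0]
    return method, entry["datasets"]

print(recommend("classification", preferred_method="svm"))
```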

APA, Harvard, Vancouver, ISO, and other styles
11

Ammari, Ahmad N. "Transforming user data into user value by novel mining techniques for extraction of web content, structure and usage patterns : the development and evaluation of new Web mining methods that enhance information retrieval and improve the understanding of users' Web behavior in websites and social blogs." Thesis, University of Bradford, 2010. http://hdl.handle.net/10454/5269.

Full text
Abstract:
The rapid growth of the World Wide Web in the last decade has made it the largest publicly accessible data source in the world, and it has become one of the most significant and influential information revolutions of modern times. The influence of the Web has touched almost every aspect of human life, activity and work, causing paradigm shifts and transformational changes in business, governance, and education. Moreover, the rapid evolution of Web 2.0 and the Social Web in the past few years, such as social blogs and friendship networking sites, has dramatically transformed the Web from a raw environment for information consumption into a dynamic and rich platform for information production and sharing worldwide. However, this growth and transformation of the Web has resulted in an uncontrollable explosion and abundance of textual content, creating a serious challenge for any user trying to find and retrieve the relevant information that they truly seek on the Web. Finding a relevant Web page in a website easily and efficiently has become very difficult. This has created many challenges for researchers, who must develop new mining techniques in order to improve the user experience on the Web, as well as for organizations, which must understand the true informational interests and needs of their customers in order to improve their targeted services accordingly by providing the products, services and information that truly match the requirements of every online customer. With these challenges in mind, Web mining aims to extract hidden patterns and discover useful knowledge from Web page contents, Web hyperlinks, and Web usage logs. Based on the primary kinds of Web data used in the mining process, Web mining tasks can be categorized into three main types: Web content mining, which extracts knowledge from Web page contents using text mining techniques; Web structure mining, which extracts patterns from the hyperlinks that represent the structure of the website; and Web usage mining, which mines users' Web navigational patterns from Web server logs that record the Web page accesses made by every user, representing the interactional activities between the users and the Web pages in a website. The main goal of this thesis is to contribute toward addressing the challenges that have resulted from the information explosion and overload on the Web, by proposing and developing novel Web mining-based approaches. Toward achieving this goal, the thesis presents, analyzes, and evaluates three major contributions: first, the development of an integrated Web structure and usage mining approach that recommends a collection of hyperlinks to be placed on the homepage of a website for its visitors; second, the development of an integrated Web content and usage mining approach to improve the understanding of users' Web behavior and discover user group interests in a website; and third, the development of a supervised classification model based on recent Social Web concepts, such as tag clouds, in order to improve the retrieval of relevant articles and posts from Web social blogs.
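As a small illustration of the Web usage mining component described above, the sketch below parses a few hypothetical Web server log lines and reconstructs, per visitor, the ordered list of requested pages, which is the raw material for navigational pattern mining.

```python
# Illustrative only: extracting per-visitor page sequences from (hypothetical) server logs.
from collections import defaultdict

log_lines = [
    '10.0.0.1 - - [12/Mar/2010:10:01:02] "GET /index.html HTTP/1.1" 200',
    '10.0.0.1 - - [12/Mar/2010:10:01:40] "GET /products.html HTTP/1.1" 200',
    '10.0.0.2 - - [12/Mar/2010:10:02:11] "GET /blog/post-1 HTTP/1.1" 200',
]

visits = defaultdict(list)          # ip -> ordered list of pages requested
for line in log_lines:
    ip = line.split()[0]
    page = line.split('"')[1].split()[1]
    visits[ip].append(page)

for ip, pages in visits.items():
    print(ip, "->", " , ".join(pages))
```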
APA, Harvard, Vancouver, ISO, and other styles
12

Malherbe, Emmanuel. "Standardization of textual data for comprehensive job market analysis." Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLC058/document.

Full text
Abstract:
Sachant qu'une grande partie des offres d'emplois et des profils candidats est en ligne, le e-recrutement constitue un riche objet d'étude. Ces documents sont des textes non structurés, et le grand nombre ainsi que l'hétérogénéité des sites de recrutement implique une profusion de vocabulaires et nomenclatures. Avec l'objectif de manipuler plus aisément ces données, Multiposting, une entreprise française spécialisée dans les outils de e-recrutement, a soutenu cette thèse, notamment en terme de données, en fournissant des millions de CV numériques et offres d'emplois agrégées de sources publiques.Une difficulté lors de la manipulation de telles données est d'en déduire les concepts sous-jacents, les concepts derrière les mots n'étant compréhensibles que des humains. Déduire de tels attributs structurés à partir de donnée textuelle brute est le problème abordé dans cette thèse, sous le nom de normalisation. Avec l'objectif d'un traitement unifié, la normalisation doit fournir des valeurs dans une nomenclature, de sorte que les attributs résultants forment une représentation structurée unique de l'information. Ce traitement traduit donc chaque document en un language commun, ce qui permet d'agréger l'ensemble des données dans un format exploitable et compréhensible. Plusieurs questions sont cependant soulevées: peut-on exploiter les structures locales des sites web dans l'objectif d'une normalisation finale unifiée? Quelle structure de nomenclature est la plus adaptée à la normalisation, et comment l'exploiter? Est-il possible de construire automatiquement une telle nomenclature de zéro, ou de normaliser sans en avoir une?Pour illustrer le problème de la normalisation, nous allons étudier par exemple la déduction des compétences ou de la catégorie professionelle d'une offre d'emploi, ou encore du niveau d'étude d'un profil de candidat. Un défi du e-recrutement est que les concepts évoluent continuellement, de sorte que la normalisation se doit de suivre les tendances du marché. A la lumière de cela, nous allons proposer un ensemble de modèles d'apprentissage statistique nécessitant le minimum de supervision et facilement adaptables à l'évolution des nomenclatures. Les questions posées ont trouvé des solutions dans le raisonnement à partir de cas, le learning-to-rank semi-supervisé, les modèles à variable latente, ainsi qu'en bénéficiant de l'Open Data et des médias sociaux. Les différents modèles proposés ont été expérimentés sur des données réelles, avant d'être implémentés industriellement. La normalisation résultante est au coeur de SmartSearch, un projet qui fournit une analyse exhaustive du marché de l'emploi
With so many job adverts and candidate profiles available online, e-recruitment constitutes a rich object of study. All this information is, however, textual data, which from a computational point of view is unstructured. The large number and heterogeneity of recruitment websites also mean that there are many vocabularies and nomenclatures. One of the difficulties when dealing with this type of raw textual data is being able to grasp the concepts contained in it, which is the problem of standardization tackled in this thesis. The aim of standardization is to create a unified process providing values in a nomenclature. A nomenclature is by definition a finite set of meaningful concepts, which means that the attributes resulting from standardization are a structured representation of the information. Several questions are however raised: Are the websites' structured data usable for a unified standardization? What structure of nomenclature is best suited for standardization, and how can it be leveraged? Is it possible to automatically build such a nomenclature from scratch, or to manage the standardization process without one? To illustrate the various obstacles of standardization, the examples we study include the inference of the skills or the category of a job advert, and the level of training of a candidate profile. One of the challenges of e-recruitment is that the concepts are continuously evolving, which means that the standardization must stay up to date with job market trends. In light of this, we propose a set of machine learning models that require minimal supervision and can easily adapt to the evolution of the nomenclatures. The questions raised found partial answers using Case-Based Reasoning, semi-supervised Learning-to-Rank, latent variable models, and by leveraging the evolving sources of the semantic web and social media. The different models proposed have been tested on real-world data before being implemented in an industrial environment. The resulting standardization is at the core of SmartSearch, a project which provides a comprehensive analysis of the job market.
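To illustrate the standardization problem in miniature, the sketch below maps free-text job titles onto a small, hypothetical nomenclature using fuzzy string matching from the standard library; the thesis itself relies on machine learning models rather than this simple matching.

```python
# Illustrative only: mapping raw job titles to a fixed nomenclature with fuzzy matching.
import difflib

nomenclature = ["software engineer", "data scientist", "sales representative", "accountant"]
raw_titles = ["Sr. Software Engineer (backend)", "Data Scientst", "Field sales rep"]

def standardize(title):
    candidates = difflib.get_close_matches(title.lower(), nomenclature, n=1, cutoff=0.0)
    return candidates[0] if candidates else None

for title in raw_titles:
    print(f"{title!r:40} -> {standardize(title)!r}")
```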
APA, Harvard, Vancouver, ISO, and other styles
13

Saneifar, Hassan. "Locating Information in Heterogeneous log files." Thesis, Montpellier 2, 2011. http://www.theses.fr/2011MON20092/document.

Full text
Abstract:
Cette thèse s'inscrit dans les domaines des systèmes Question Réponse en domaine restreint, la recherche d'information ainsi que TALN. Les systèmes de Question Réponse (QR) ont pour objectif de retrouver un fragment pertinent d'un document qui pourrait être considéré comme la meilleure réponse concise possible à une question de l'utilisateur. Le but de cette thèse est de proposer une approche de localisation de réponses dans des masses de données complexes et évolutives décrites ci-dessous.. De nos jours, dans de nombreux domaines d'application, les systèmes informatiques sont instrumentés pour produire des rapports d'événements survenant, dans un format de données textuelles généralement appelé fichiers log. Les fichiers logs représentent la source principale d'informations sur l'état des systèmes, des produits, ou encore les causes de problèmes qui peuvent survenir. Les fichiers logs peuvent également inclure des données sur les paramètres critiques, les sorties de capteurs, ou une combinaison de ceux-ci. Ces fichiers sont également utilisés lors des différentes étapes du développement de logiciels, principalement dans l'objectif de débogage et le profilage. Les fichiers logs sont devenus un élément standard et essentiel de toutes les grandes applications. Bien que le processus de génération de fichiers logs est assez simple et direct, l'analyse de fichiers logs pourrait être une tâche difficile qui exige d'énormes ressources de calcul, de temps et de procédures sophistiquées. En effet, il existe de nombreux types de fichiers logs générés dans certains domaines d'application qui ne sont pas systématiquement exploités d'une manière efficace en raison de leurs caractéristiques particulières. Dans cette thèse, nous nous concentrerons sur un type des fichiers logs générés par des systèmes EDA (Electronic Design Automation). Ces fichiers logs contiennent des informations sur la configuration et la conception des Circuits Intégrés (CI) ainsi que les tests de vérification effectués sur eux. Ces informations, très peu exploitées actuellement, sont particulièrement attractives et intéressantes pour la gestion de conception, la surveillance et surtout la vérification de la qualité de conception. Cependant, la complexité de ces données textuelles complexes, c.-à-d. des fichiers logs générés par des outils de conception de CI, rend difficile l'exploitation de ces connaissances. Plusieurs aspects de ces fichiers logs ont été moins soulignés dans les méthodes de TALN et Extraction d'Information (EI). Le grand volume de données et leurs caractéristiques particulières limitent la pertinence des méthodes classiques de TALN et EI. Dans ce projet de recherche nous cherchons à proposer une approche qui permet de répondre à répondre automatiquement aux questionnaires de vérification de qualité des CI selon les informations se trouvant dans les fichiers logs générés par les outils de conception. Au sein de cette thèse, nous étudions principalement "comment les spécificités de fichiers logs peuvent influencer l'extraction de l'information et les méthodes de TALN?". Le problème est accentué lorsque nous devons également prendre leurs structures évolutives et leur vocabulaire spécifique en compte. Dans ce contexte, un défi clé est de fournir des approches qui prennent les spécificités des fichiers logs en compte tout en considérant les enjeux qui sont spécifiques aux systèmes QR dans des domaines restreints. 
Ainsi, les contributions de cette thèse consistent brièvement en :〉Proposer une méthode d'identification et de reconnaissance automatique des unités logiques dans les fichiers logs afin d'effectuer une segmentation textuelle selon la structure des fichiers. Au sein de cette approche, nous proposons un type original de descripteur qui permet de modéliser la structure textuelle et le layout des documents textuels.〉Proposer une approche de la localisation de réponse (recherche de passages) dans les fichiers logs. Afin d'améliorer la performance de recherche de passage ainsi que surmonter certains problématiques dûs aux caractéristiques des fichiers logs, nous proposons une approches d'enrichissement de requêtes. Cette approches, fondée sur la notion de relevance feedback, consiste en un processus d'apprentissage et une méthode de pondération des mots pertinents du contexte qui sont susceptibles d'exister dans les passage adaptés. Cela dit, nous proposons également une nouvelle fonction originale de pondération (scoring), appelée TRQ (Term Relatedness to Query) qui a pour objectif de donner un poids élevé aux termes qui ont une probabilité importante de faire partie des passages pertinents. Cette approche est également adaptée et évaluée dans les domaines généraux.〉Etudier l'utilisation des connaissances morpho-syntaxiques au sein de nos approches. A cette fin, nous nous sommes intéressés à l'extraction de la terminologie dans les fichiers logs. Ainsi, nous proposons la méthode Exterlog, adaptée aux spécificités des logs, qui permet d'extraire des termes selon des patrons syntaxiques. Afin d'évaluer les termes extraits et en choisir les plus pertinents, nous proposons un protocole de validation automatique des termes qui utilise une mesure fondée sur le Web associée à des mesures statistiques, tout en prenant en compte le contexte spécialisé des logs
In this thesis, we present contributions to the challenging issues encountered in question answering and in locating information in complex textual data, such as log files. Question answering systems (QAS) aim to find a relevant fragment of a document which could be regarded as the best possible concise answer to a question given by a user. In this work, we propose a complete solution to locate information in a special kind of textual data, i.e., log files generated by EDA design tools. Nowadays, in many application areas, modern computing systems are instrumented to generate huge reports about occurring events in the format of log files. Log files are generated in every computing field to report the status of systems, products, or even the causes of problems that can occur. Log files may also include data about critical parameters, sensor outputs, or a combination of those. Analyzing log files, as an attractive approach for automatic system management and monitoring, has been enjoying a growing amount of attention [Li et al., 2005]. Although the process of generating log files is quite simple and straightforward, log file analysis can be a tremendous task that requires enormous computational resources, long processing times and sophisticated procedures [Valdman, 2004]. Indeed, many kinds of log files generated in some application domains are not systematically exploited in an efficient way because of their special characteristics. In this thesis, we are mainly interested in log files generated by Electronic Design Automation (EDA) systems. Electronic design automation is a category of software tools for designing electronic systems such as printed circuit boards and Integrated Circuits (IC). In this domain, to ensure the design quality, there are quality check rules which should be verified. Verification of these rules is principally performed by analyzing the generated log files. In the case of large designs, where the design tools may generate megabytes or gigabytes of log files each day, the problem is to wade through all of this data to locate the critical information needed to verify the quality check rules. These log files typically include a substantial amount of data, so manually locating information is a tedious and cumbersome process. Furthermore, the particular characteristics of log files, especially those generated by EDA design tools, raise significant challenges for the retrieval of information from the log files. The specific features of log files limit the usefulness of manual analysis techniques and static methods, and automated analysis of such logs is complex due to their heterogeneous and evolving structures and their large, non-fixed vocabulary. In this thesis, each contribution answers questions raised by the data specificities or the domain requirements. Throughout this work we investigate the main concern of how the specificities of log files can influence information extraction and natural language processing methods. In this context, a key challenge is to provide approaches that take the log file specificities into account while considering the issues which are specific to QA in restricted domains. We present the following contributions:
> Proposing a novel method to recognize and identify the logical units in the log files in order to perform a segmentation according to their structure. We thus propose a method to characterize complex logical units found in log files according to their syntactic characteristics. Within this approach, we propose an original type of descriptor to model the textual structure and layout of text documents.
> Proposing an approach to locate the requested information in the log files based on passage retrieval. To improve the performance of passage retrieval, we propose a novel query expansion approach to adapt an initial query to all types of corresponding log files and to overcome difficulties such as vocabulary mismatch. Our query expansion approach relies on two relevance feedback steps. In the first one, we determine the explicit relevance feedback by identifying the context of questions. The second phase consists of a novel type of pseudo relevance feedback. Our method is based on a new term weighting function, called TRQ (Term Relatedness to Query), introduced in this work, which scores the terms of the corpus according to their relatedness to the query. We also investigate how to apply our query expansion approach to documents from general domains.
> Studying the use of morpho-syntactic knowledge in our approaches. For this purpose, we are interested in the extraction of terminology from the log files. We introduce our approach, named Exterlog (EXtraction of TERminology from LOGs), to extract the terminology of log files. To evaluate the extracted terms and choose the most relevant ones, we propose a candidate term evaluation method using a measure based on the Web, combined with statistical measures, that takes the specialized context of log files into account.
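The sketch below gives the flavour of query expansion by term relatedness: candidate terms are weighted by how often they co-occur with the query terms in the same hypothetical log segments. The exact TRQ weighting function is not reproduced here; this co-occurrence ratio is an assumed simplification for illustration only.

```python
# Illustrative only: scoring candidate expansion terms by co-occurrence with the query.
from collections import Counter

segments = [
    "timing violation on clock path after synthesis",
    "setup timing slack negative for clock domain",
    "power report generated without errors",
    "hold violation detected on scan path",
]

query_terms = {"timing", "violation"}
cooccurrence = Counter()
frequency = Counter()

for seg in segments:
    tokens = set(seg.split())
    for t in tokens:
        frequency[t] += 1
        if t not in query_terms and tokens & query_terms:
            cooccurrence[t] += 1

scores = {t: cooccurrence[t] / frequency[t] for t in cooccurrence}
expansion = sorted(scores, key=scores.get, reverse=True)[:3]
print("expansion terms:", expansion)
```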
APA, Harvard, Vancouver, ISO, and other styles
14

Valentin, Sarah. "Extraction et combinaison d’informations épidémiologiques à partir de sources informelles pour la veille des maladies infectieuses animales." Thesis, Montpellier, 2020. http://www.theses.fr/2020MONTS067.

Full text
Abstract:
L’intelligence épidémiologique a pour but de détecter, d’analyser et de surveiller au cours du temps les potentielles menaces sanitaires. Ce processus de surveillance repose sur des sources dites formelles, tels que les organismes de santé officiels, et des sources dites informelles, comme les médias. La veille des sources informelles est réalisée au travers de la surveillance basée sur les événements (event-based surveillance en anglais). Ce type de veille requiert le développement d’outils dédiés à la collecte et au traitement de données textuelles non structurées publiées sur le Web. Cette thèse se concentre sur l’extraction et la combinaison d’informations épidémiologiques extraites d’articles de presse en ligne, dans le cadre de la veille des maladies infectieuses animales. Le premier objectif de cette thèse est de proposer et de comparer des approches pour améliorer l’identification et l’extraction d’informations épidémiologiques pertinentes à partir du contenu d’articles. Le second objectif est d’étudier l’utilisation de descripteurs épidémiologiques (i.e. maladies, hôtes, localisations et dates) dans le contexte de l’extraction d’événements et de la mise en relation d’articles similaires au regard de leur contenu épidémiologique. Dans ce manuscrit, nous proposons de nouvelles représentations textuelles fondées sur la sélection, l’expansion et la combinaison de descripteurs épidémiologiques. Nous montrons que l’adaptation et l’extension de méthodes de fouille de texte et de classification permet d’améliorer l’utilisation des articles en ligne tant que source de données sanitaires. Nous mettons en évidence le rôle de l’expertise quant à la pertinence et l’interprétabilité de certaines des approches proposées. Bien que nos travaux soient menés dans le contexte de la surveillance de maladies en santé animale, nous discutons des aspects génériques des méthodes proposées, vis-à-vis de de maladies inconnues et dans un contexte One Health (« une seule santé »)
Epidemic intelligence aims to detect, investigate and monitor potential health threats while relying on formal (e.g. official health authorities) and informal (e.g. media) information sources. Monitoring of unofficial sources, or so-called event-based surveillance (EBS), requires the development of systems designed to retrieve and process unstructured textual data published online. This manuscript focuses on the extraction and combination of epidemiological information from informal sources (i.e. online news), in the context of the international surveillance of animal infectious diseases. The first objective of this thesis is to propose and compare approaches to enhance the identification and extraction of relevant epidemiological information from the content of online news. The second objective is to study the use of epidemiological entities extracted from the news articles (i.e. diseases, hosts, locations and dates) in the context of event extraction and the retrieval of related online news. This manuscript proposes new textual representation approaches based on selecting, expanding, and combining relevant epidemiological features. We show that adapting and extending text mining and classification methods improves the added value of online news sources for event-based surveillance. We stress the role of domain expert knowledge regarding the relevance and the interpretability of the methods proposed in this thesis. While our research is conducted in the context of animal disease surveillance, we discuss the generic aspects of our approaches with regard to unknown threats and One Health surveillance.
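As a toy illustration of the entity extraction step described above, the sketch below pulls disease, host, location and date mentions out of a hypothetical news sentence using keyword dictionaries and a date regex; the actual pipeline uses trained text mining models rather than fixed lists.

```python
# Illustrative only: dictionary- and regex-based extraction of epidemiological entities.
import re

diseases = {"avian influenza", "african swine fever", "foot-and-mouth disease"}
hosts = {"poultry", "wild birds", "pigs", "cattle"}
locations = {"France", "Germany", "Vietnam"}

news = "An outbreak of avian influenza was confirmed in poultry in France on 12 January 2020."

found = {
    "disease": [d for d in diseases if d in news.lower()],
    "host": [h for h in hosts if h in news.lower()],
    "location": [loc for loc in locations if loc in news],
    "date": re.findall(r"\d{1,2} \w+ \d{4}", news),
}
print(found)
```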
APA, Harvard, Vancouver, ISO, and other styles
15

Yang, Hsien-Min, 1957. "Principal Components and Texture Analysis of the NS-001 Thematic Mapper Simulator Data in the Rosemont Mining District, Arizona (Geologic, Digital Image Processing, Texture Extraction)." Thesis, The University of Arizona, 1985. http://hdl.handle.net/10150/275436.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Musil, David. "Algoritmus pro detekci pozitívního a negatívního textu." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2016. http://www.nusl.cz/ntk/nusl-242026.

Full text
Abstract:
As information and communication technology develops swiftly, the amount of information produced by various sources grows as well. Sorting this data and obtaining knowledge from it requires significant effort that cannot easily be provided by humans, so machine processing is taking its place. Detecting emotion in text data is an interesting research area that is expanding considerably and is widely used. The purpose of this thesis is to create a system for detecting positive and negative emotion in text and to evaluate its performance. The system was created in the Java programming language and allows training on large amounts of data (known as Big Data) using the Spark library. The thesis describes the structure of the database used as the source of input data and how its text is handled. The classifier model was created using Support Vector Machines and optimized with the n-gram method.
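The sketch below shows the general shape of such a classifier, with word n-gram features feeding a linear Support Vector Machine. scikit-learn is used as a compact stand-in for the thesis's Java/Spark implementation, and the labelled sentences are hypothetical.

```python
# Illustrative only: n-gram features plus a linear SVM for positive/negative text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = [
    "I really enjoyed this product, works great",
    "absolutely wonderful experience, highly recommend",
    "terrible quality, broke after one day",
    "very disappointed, waste of money",
]
labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # unigrams and bigrams
    LinearSVC(),
)
model.fit(texts, labels)
print(model.predict(["this was a wonderful purchase", "it broke and I am disappointed"]))
```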
APA, Harvard, Vancouver, ISO, and other styles
17

Kalmegh, Prajakta. "Image mining methodologies for content based retrieval." Thesis, Georgia Institute of Technology, 2010. http://hdl.handle.net/1853/39587.

Full text
Abstract:
The thesis presents a system for content-based image retrieval and mining. The research presents the design of a scalable solution for efficient retrieval of images from large image databases using image features such as color, shape and texture. A framework is proposed for automatically labeling images and clustering metadata in the database based on the dominant shapes, textures and colors in each image. The thesis also presents a new image tagging methodology that annotates the dominant image features to the image as metadata. Users of this system can input a query image and select similar-image retrieval criteria by choosing a feature type from among color, texture or shape. The system retrieves images from the database that match the specified pattern and displays them by relevance. The user can also enter a set of keywords, or a combination of keywords, that form the input text query; images in the database that match the input text query are fetched and displayed. This ensures content-based similar-image search even for text-based queries. An efficient clustering algorithm is shown to improve image retrieval by an order of magnitude.
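As a minimal example of colour-based retrieval, the sketch below summarises images by normalised colour histograms and ranks them by histogram intersection against a query image; the random arrays stand in for real pixel data, and shape and texture features would be handled analogously.

```python
# Illustrative only: colour-histogram retrieval ranked by histogram intersection.
import numpy as np

rng = np.random.default_rng(0)
database = {name: rng.integers(0, 256, size=(32, 32, 3)) for name in ["img_a", "img_b", "img_c"]}
query = rng.integers(0, 256, size=(32, 32, 3))

def color_histogram(img, bins=8):
    hist, _ = np.histogramdd(img.reshape(-1, 3), bins=(bins, bins, bins), range=[(0, 256)] * 3)
    return hist.ravel() / hist.sum()

def intersection(h1, h2):
    return float(np.minimum(h1, h2).sum())     # 1.0 means identical histograms

q_hist = color_histogram(query)
ranking = sorted(database, key=lambda n: intersection(q_hist, color_histogram(database[n])), reverse=True)
print("most similar first:", ranking)
```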
APA, Harvard, Vancouver, ISO, and other styles
19

Diaz, Alexandra Katiuska Ramos. "Biagrupamento heurístico e coagrupamento baseado em fatoração de matrizes: um estudo em dados textuais." Universidade de São Paulo, 2018. http://www.teses.usp.br/teses/disponiveis/100/100131/tde-12112018-182428/.

Full text
Abstract:
Biclustering and coclustering are data mining tasks that allow the extraction of relevant information from data and have been applied successfully in a wide variety of domains, including those involving textual data, the focus of this research. In biclustering and coclustering, similarity criteria are applied simultaneously to the rows and columns of the data matrices, grouping objects and attributes at the same time and enabling the discovery of biclusters/coclusters. Their definitions vary according to their nature and objectives, and coclustering can be seen as a generalization of biclustering. When applied to textual data, these tasks require a vector space representation, which commonly produces spaces characterized by high dimensionality and sparsity and affects the performance of many algorithms. This work analyses the behaviour of the Cheng and Church biclustering algorithm and of the Non-Negative Block Value Decomposition (NBVD) coclustering algorithm in the context of textual data. Quantitative and qualitative experimental results are reported for synthetic datasets created with different sparsity levels and for a real dataset. The results are evaluated in terms of biclustering-specific measures, internal clustering measures applied to the projections onto the rows of the biclusters/coclusters, and the information generated. The analysis clarifies the difficulties faced by these algorithms in the experimental environment, as well as whether they can provide distinctive, useful information for text mining. Overall, the analyses showed that the NBVD algorithm is better suited to high-dimensional, highly sparse datasets, while the Cheng and Church algorithm, although it achieved good results by its own criteria, produced results of low relevance in the context of textual data.
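As a rough illustration of what coclustering a document-term matrix looks like in practice, the sketch below uses scikit-learn's SpectralCoclustering on a toy corpus; note that this is a stand-in algorithm, not the Cheng and Church or NBVD methods studied in the thesis, and the documents are invented.

```python
# Illustration of coclustering a sparse document-term matrix using
# scikit-learn's SpectralCoclustering (a stand-in; the thesis studies
# Cheng & Church biclustering and NBVD, which are not in scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import SpectralCoclustering

docs = ["stocks market prices rise", "market prices fall on stocks",
        "soccer team wins match", "team loses the soccer match"]

X = CountVectorizer().fit_transform(docs)      # sparse, high-dimensional matrix
model = SpectralCoclustering(n_clusters=2, random_state=0).fit(X)

print(model.row_labels_)      # cocluster assignment of each document
print(model.column_labels_)   # cocluster assignment of each term
```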
APA, Harvard, Vancouver, ISO, and other styles
20

Matička, Jiří. "Extrakce klíčových slov z dokumentů." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2012. http://www.nusl.cz/ntk/nusl-236533.

Full text
Abstract:
This thesis addresses the automated extraction of keywords from documents. Its goal is to design and implement an application able to extract a set of keywords appropriate to the content of a document, with speed and accuracy as the main requirements. The first part of the thesis therefore surveys existing approaches and classifies them in detail according to various criteria. The second part focuses on selecting one of these methods for keyword extraction and describing how it works in depth. The following parts contain a detailed design of the application and its implementation. The last chapter is devoted to testing the application on a collection of text documents and evaluating the results of the extraction process.
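The abstract does not name the extraction method ultimately chosen, so the following sketch shows only a common baseline for the task: ranking a document's terms by TF-IDF weight with scikit-learn. The corpus and the number of keywords are arbitrary.

```python
# Baseline keyword extraction: rank a document's terms by TF-IDF weight.
# (Hedged sketch; the thesis may use a different method.)
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["data mining extracts patterns from large text collections",
          "keyword extraction summarizes a document with a few terms",
          "support vector machines are used for text classification"]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(corpus)

doc_id = 1                                   # document to summarize
terms = vectorizer.get_feature_names_out()
scores = tfidf[doc_id].toarray().ravel()
top = scores.argsort()[::-1][:3]             # three highest-scoring terms
print([terms[i] for i in top])
```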
APA, Harvard, Vancouver, ISO, and other styles
21

Sychra, Martin. "Analýza sentimentu s využitím dolování dat." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2016. http://www.nusl.cz/ntk/nusl-255424.

Full text
Abstract:
The topic of this work is sentiment analysis, primarily from the perspective of informatics and marginally from a linguistic point of view. The linguistic part discusses the notion of sentiment and language-processing methods for its analysis, e.g. lemmatization, POS tagging and stop-word lists. More attention is paid to the structure of the sentiment analyzer, which is based on machine learning methods (support vector machines, Naive Bayes and maximum entropy classification). On the basis of this theoretical background, a working analyzer is designed and implemented. The experiments focus mainly on comparing the classification methods and on the benefits of the individual preprocessing steps. The constructed classifier reaches an accuracy of up to 84% in cross-validation.
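A minimal sketch of the kind of comparison the abstract describes is shown below, evaluating Naive Bayes, a linear SVM and logistic regression (a maximum entropy classifier) by cross-validation with scikit-learn; the texts, labels and fold count are invented placeholders, not the thesis data.

```python
# Compare Naive Bayes, SVM and maximum entropy (logistic regression)
# by cross-validation; toy data stands in for the thesis corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["love it", "hate it", "excellent service", "awful experience",
         "really good", "really bad", "would recommend", "never again"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

for name, clf in [("NB", MultinomialNB()),
                  ("SVM", LinearSVC()),
                  ("MaxEnt", LogisticRegression(max_iter=1000))]:
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, texts, labels, cv=4)
    print(name, scores.mean())
```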
APA, Harvard, Vancouver, ISO, and other styles
22

Průša, Petr. "Multi-label klasifikace textových dokumentů." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2012. http://www.nusl.cz/ntk/nusl-412872.

Full text
Abstract:
The master's thesis deals with the automatic classification of text documents. It explains the basic terms and problems of text mining, introduces term clustering and presents some basic clustering algorithms. The thesis also reviews several classification methods and deals with matrix regression in detail. An application using matrix regression for classification was designed and developed. The experiments focused on normalization and thresholding.
APA, Harvard, Vancouver, ISO, and other styles
23

Križan, Viliam. "Analýza sociálních sítí využitím metod rozpoznání vzoru." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2015. http://www.nusl.cz/ntk/nusl-220399.

Full text
Abstract:
This master's thesis deals with recognizing emotions in text from social networks. It describes current feature extraction methods and commonly used lexicons, corpora and classifiers. Emotions were recognized with a classifier trained, without manual annotation, on data from the microblogging network Twitter. An advantage of using Twitter is the geographic delimitation of the data, which makes it possible to track changes in the emotions of the population in different cities. The first classification approach was a baseline algorithm using a simple lexicon. To improve classification, a more complex SVM classifier was used in the second step. The SVM classifiers and the feature extraction and selection methods were taken from the available Python library Scikit. Data for training the classifier were collected from the USA using an application built for this purpose. The classifier was trained on data labelled at collection time, without manual annotation, and two different SVM implementations were used. The resulting emotions classified in different cities and on different days were displayed as coloured markers on a map.
APA, Harvard, Vancouver, ISO, and other styles
24

Hasan, Maryam. "Extracting Structured Knowledge from Textual Data in Software Repositories." Master's thesis, 2011. http://hdl.handle.net/10048/1776.

Full text
Abstract:
Software team members, as they communicate and coordinate their work throughout the life-cycle of their projects, generate different kinds of textual artifacts. Despite the variety of work in mining software artifacts, relatively little research has focused on communication artifacts. Software communication artifacts, in addition to source code artifacts, contain useful semantic information that existing approaches do not fully exploit. This thesis presents the development of a text analysis method and tool to extract and represent useful pieces of information from a wide range of textual data sources associated with software projects. Our text analysis system integrates Natural Language Processing techniques and statistical text analysis methods with software domain knowledge. The extracted information is represented as RDF-style triples that capture relations between developers and software products. We applied the system to five kinds of textual data: source code commits, bug reports, email messages, chat logs and wiki pages. In our evaluation, the system achieved a precision of 82%, a recall of 58% and an F-measure of 68%.
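To make the triple representation concrete, the sketch below stores one extracted relation as an RDF-style triple using the rdflib library and an invented example.org namespace; the thesis describes the triple output format, not this particular library or example.

```python
# Representing an extracted developer-artifact relation as an RDF-style
# triple with rdflib; the namespace and relation names are invented.
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/")
g = Graph()

# e.g. extracted from a commit message: developer "alice" fixed bug 42
g.add((EX.alice, EX.fixed, EX.bug42))
g.add((EX.bug42, EX.describedBy, Literal("null pointer crash in the parser")))

for subj, pred, obj in g:
    print(subj, pred, obj)
```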
APA, Harvard, Vancouver, ISO, and other styles
25

Dlamini, Phezulu, and 佩祖露. "Mining Textual Relationships from Social Media Data for Users’ E-Learning Experiences." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/r4v6xc.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

"Stock market forecasting by integrating time-series and textual information." 2003. http://library.cuhk.edu.hk/record=b5896089.

Full text
Abstract:
Fung Pui Cheong Gabriel.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2003.
Includes bibliographical references (leaves 88-93).
Abstracts in English and Chinese.
Contents: Abstract (in English and Chinese), Acknowledgement, List of Figures, List of Tables
Part I: The Very Beginning
Chapter 1: Introduction
  1.1 Contributions
  1.2 Dissertation Organization
Chapter 2: Problem Formulation
  2.1 Defining the Prediction Task
  2.2 Overview of the System Architecture
Part II: Literatures Review
Chapter 3: The Social Dynamics of Financial Markets
  3.1 The Collective Behavior of Groups
  3.2 Prediction Based on Publicity Information
Chapter 4: Time Series Representation
  4.1 Technical Analysis
  4.2 Piecewise Linear Approximation
Chapter 5: Text Classification
  5.1 Document Representation
  5.2 Document Pre-processing
  5.3 Classifier Construction
    5.3.1 Naive Bayes (NB)
    5.3.2 Support Vectors Machine (SVM)
Part III: Mining Financial Time Series and Textual Documents Concurrently
Chapter 6: Time Series Representation
  6.1 Discovering Trends on the Time Series
  6.2 t-test Based Split and Merge Segmentation Algorithm - Splitting Phrase
  6.3 t-test Based Split and Merge Segmentation Algorithm - Merging Phrase
Chapter 7: Article Alignment and Pre-processing
  7.1 Aligning News Articles to the Stock Trends
  7.2 Selecting Positive Training Examples
  7.3 Selecting Negative Training Examples
Chapter 8: System Learning
  8.1 Similarity Based Classification Approach
  8.2 Category Sketch Generation
    8.2.1 Within-Category Coefficient
    8.2.2 Cross-Category Coefficient
    8.2.3 Average-Importance Coefficient
  8.3 Document Sketch Generation
Chapter 9: System Operation
  9.1 System Operation
Part IV: Results and Discussions
Chapter 10: Evaluations
  10.1 Time Series Evaluations
  10.2 Classifier Evaluations
    10.2.1 Batch Classification Evaluation
    10.2.2 Online Classification Evaluation
    10.2.3 Components Analysis
    10.2.4 Document Sketch Analysis
  10.3 Prediction Evaluations
    10.3.1 Simulation Results
    10.3.2 Hit Rate Analysis
Part V: The Final Words
Chapter 11: Conclusion and Future Work
Appendix A: Hong Kong Stocks Categorization Powered by Reuters
Appendix B: Morgan Stanley Capital International (MSCI) Classification
Appendix C: Precision, Recall and F1 measure
Bibliography
APA, Harvard, Vancouver, ISO, and other styles
27

Wren, Jonathan Daniel. "The iridescent system : an automated data-mining method to identify, evaluate, and analyze sets of relationships within textual databases." 2000. http://edissertations.library.swmed.edu/pdf/WrenJ012403/WrenJonathan.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Zois, Christos. "Applying text mining techniques to forecast the stock market fluctuations of large it companies with twitter data: descriptive and predictive approaches to enhance the research of stock market predictions with textual and semantic data." Master's thesis, 2019. http://hdl.handle.net/10362/92164.

Full text
Abstract:
Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies Management
This research project applies text mining techniques to predict stock market fluctuations by merging published tweets with daily stock prices for a set of American information technology companies. Using mainly R code, the project systematically investigates two main questions: (i) which descriptive criteria, patterns and variables are correlated with stock fluctuations, and (ii) whether tweets alone carry enough signal to predict stock market fluctuations with high accuracy. The expected output of the work is a set of findings on the significance and predictive power of Twitter text, indicating the importance of social media content for stock market fluctuations, obtained with descriptive and predictive data mining approaches such as natural language processing, topic modelling, sentiment analysis and binary classification with neural networks.
APA, Harvard, Vancouver, ISO, and other styles
29

(10157291), Yi-Yu Lai. "Relational Representation Learning Incorporating Textual Communication for Social Networks." Thesis, 2021.

Find full text
Abstract:
Representation learning (RL) for social networks facilitates real-world tasks such as visualization, link prediction and friend recommendation. Many methods have been proposed in this area to learn continuous low-dimensional embeddings of nodes, edges or relations in social and information networks. However, most previous network RL methods neglect social signals such as textual communication between users (nodes). Unlike more typical binary features on edges, such as post likes and retweet actions, social signals are more varied and contain ambiguous information. This makes them more challenging to incorporate into RL methods, but the ability to quantify social signals should allow RL methods to better capture the implicit relationships among real people in social networks. Second, most previous work in network RL has focused on learning from homogeneous networks (i.e., a single type of node, edge, role and direction), so most existing RL methods cannot capture the heterogeneous nature of relationships in social networks. Based on these gaps, this thesis studies the feasibility of incorporating heterogeneous information, e.g., texts, attributes, multiple relations and edge types (directions), to learn more accurate, fine-grained network representations.
In this dissertation, we discuss a preliminary study and outline three major works that aim to incorporate textual interactions to improve relational representation learning. The preliminary study learns a joint representation that captures the textual similarity in content between interacting nodes. The promising results motivate us to pursue broader research on using social signals for representation learning. The first major component aims to learn explicit node and relation embeddings in social networks. Traditional knowledge graph (KG) completion models learn latent representations of entities and relations by interpreting relations as translations operating on the embeddings of the entities. However, existing approaches do not consider the textual communications between users, which contain valuable information that provides meaning and context for social relationships. We propose a novel approach that incorporates the textual interactions between each pair of users to improve the representation learning of both users and relationships. The second major component focuses on analyzing how users interact with each other through natural language content. Although the data is interconnected and dependent, previous research has primarily modeled social network behavior separately from textual content. In this work, we model the data holistically, taking into account the connections between users' social behavior and the content generated when they interact, by learning a joint embedding over user characteristics and user language. In the third major component, we consider the task of learning edge representations in social networks. Edge representations are especially beneficial when we need to describe or explain the relationships, activities and interactions among users. However, previous work in this area lacks well-defined edge representations and ignores the relational signals over multiple views of social networks, which typically contain multi-view contexts (due to multiple edge types) that need to be considered when learning the representation. We propose a new methodology that captures asymmetry in multiple views by learning well-defined edge representations and incorporates textual communications to identify multiple sources of social signals that moderate the impact of different views between users.
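The translation-based scoring idea that the first component builds on can be summarised in a few lines of numpy, shown below with random vectors standing in for learned embeddings; this illustrates plain TransE-style scoring only, not the text-aware model proposed in the dissertation.

```python
# TransE-style scoring sketch: a relation is a translation vector, so the
# embedding h + r should lie close to t for a plausible triple.  Random
# vectors stand in for learned embeddings; this is not the proposed model.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
h = rng.normal(size=dim)   # embedding of the head user/entity
r = rng.normal(size=dim)   # embedding of the relation (e.g. "mentions")
t = rng.normal(size=dim)   # embedding of the tail user/entity

def score(h, r, t):
    """Lower is better: distance between the translated head and the tail."""
    return np.linalg.norm(h + r - t)

print(score(h, r, t))       # random triple: large distance
print(score(h, r, h + r))   # perfect triple: distance 0
```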
APA, Harvard, Vancouver, ISO, and other styles
30

Sarkas, Nikolaos. "Querying, Exploring and Mining the Extended Document." Thesis, 2011. http://hdl.handle.net/1807/29857.

Full text
Abstract:
The evolution of the Web into an interactive medium that encourages active user engagement has ignited a huge increase in the amount, complexity and diversity of available textual data. This evolution forces us to re-evaluate our view of documents as simple pieces of text and of document collections as immutable and isolated. Extended documents published in the context of blogs, micro-blogs, on-line social networks, customer feedback portals, can be associated with a wealth of meta-data in addition to their textual component: tags, links, sentiment, entities mentioned in text, etc. Collections of user-generated documents grow, evolve, co-exist and interact: they are dynamic and integrated. These unique characteristics of modern documents and document collections present us with exciting opportunities for improving the way we interact with them. At the same time, this additional complexity combined with the vast amounts of available textual data present us with formidable computational challenges. In this context, we introduce, study and extensively evaluate an array of effective and efficient solutions for querying, exploring and mining extended documents, dynamic and integrated document collections. For collections of socially annotated extended documents, we present an improved probabilistic search and ranking approach based on our growing understanding of the dynamics of the social annotation process. For extended documents, such as blog posts, associated with entities extracted from text and categorical attributes, we enable their interactive exploration through the efficient computation of strong entity associations. Associated entities are computed for all possible attribute value restrictions of the document collection. For extended documents, such as user reviews, annotated with a numerical rating, we introduce a keyword-query refinement approach. The solution enables the interactive navigation and exploration of large result sets. We extend the skyline query to document streams, such as news articles, associated with categorical attributes and partially ordered domains. The technique incrementally maintains a small set of recent, uniquely interesting extended documents from the stream. Finally, we introduce a solution for the scalable integration of structured data sources into Web search. Queries are analysed in order to determine what structured data, if any, should be used to augment Web search results.
APA, Harvard, Vancouver, ISO, and other styles
31

Chen, Jhih-Rong, and 陳之容. "Texture Synthesis Using Data Mining Technique." Thesis, 2004. http://ndltd.ncl.edu.tw/handle/48354363991458229077.

Full text
Abstract:
Master's thesis
National Dong Hwa University
Department of Computer Science and Information Engineering
92
We present a new texture synthesis algorithm that combines texture synthesis with a data mining technique. Our approach works well for many types of textures without requiring any knowledge of the underlying physical process. It first analyzes the input texture to construct patch candidate data, and then uses a data mining technique, sequential pattern mining, to find the frequent pattern sequences from which the synthesis results are produced.
APA, Harvard, Vancouver, ISO, and other styles
32

"All Purpose Textual Data Information Extraction, Visualization and Querying." Master's thesis, 2018. http://hdl.handle.net/2286/R.I.50530.

Full text
Abstract:
Since the advent of the internet, and even more so after the rise of social media platforms, the explosive growth of textual data and its availability has made analysis a tedious task. Information extraction systems exist, but they are generally too specific and often extract only the kinds of information they deem necessary and extraction-worthy. With data visualization theory and fast, interactive querying methods, leaving information out may not really be necessary. This thesis explores textual data visualization techniques, intuitive querying, and a novel approach to all-purpose textual information extraction that encodes large text corpora to improve human understanding of the information present in textual data. It presents a modified traversal algorithm over the dependency parse output of text that extracts all subject-predicate-object pairs while ensuring that no information is missed. To support full-scale, all-purpose information extraction from large text corpora, a data preprocessing pipeline is recommended before the extraction is run. The output format is designed specifically to fit a node-edge-node model and form the building blocks of a network, which makes understanding the text and querying information from the corpus quick and intuitive. The approach attempts to reduce reading time and enhance understanding of the text using an interactive graph and timeline.
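A hedged sketch of subject-predicate-object extraction from dependency parses is given below using spaCy and its small English model; the thesis describes its own modified traversal algorithm, so this is only an approximation of the idea, and the input sentence is invented.

```python
# Naive subject-predicate-object extraction from dependency parses with
# spaCy (requires: python -m spacy download en_core_web_sm).  This is an
# approximation of the idea, not the thesis's modified traversal algorithm.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_spo(text):
    """Collect (subject, verb lemma, object) triples from each sentence."""
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ != "VERB":
                continue
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            triples.extend((s.text, token.lemma_, o.text)
                           for s in subjects for o in objects)
    return triples

print(extract_spo("The committee approved the proposal. Analysts wrote a report."))
```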
Dissertation/Thesis
Master's Thesis, Software Engineering, 2018
APA, Harvard, Vancouver, ISO, and other styles
33

Louis, Anita Lily. "Unsupervised discovery of relations for analysis of textual data in digital forensics." Diss., 2010. http://hdl.handle.net/2263/27479.

Full text
Abstract:
This dissertation addresses the problem of analysing digital data in digital forensics. It shows that text mining methods can be adapted and applied to digital forensics to help analysts analyse data more quickly, efficiently and accurately and so reveal truly useful information. Investigators who wish to use digital evidence must examine and organise the data to piece together the events and facts of a crime. The difficulty with finding relevant information quickly using current tools and methods is that they rely heavily on background knowledge for query terms and do not fully exploit the content of the data. A novel framework for evidence discovery is proposed that reduces the quantity of data to be analysed, aids the analysts' exploration of the data and enhances the intelligibility of its presentation. The framework combines information extraction techniques with visual exploration techniques to provide a novel approach to evidence discovery, in the form of an evidence discovery system. Because it uses unrestricted, unsupervised information extraction, the investigator needs no input queries or keywords for searching and can therefore analyse portions of the data that keyword searches might not identify. The evidence discovery system produces text graphs of the most important concepts and associations extracted from the full text, establishing ties between the concepts and providing an overview and general representation of the text. Through an interactive visual interface the investigator can explore the data to identify suspects, events and the relations between suspects. Two models are proposed for the relation extraction step of the evidence discovery framework. The first takes a statistical approach, discovering relations based on co-occurrences of complex concepts. The second takes a linguistic approach, using named entity extraction and information extraction patterns. A preliminary study assessed the usefulness of a text mining approach to digital forensics compared with the traditional information retrieval approach, and concluded that the novel approach to text analysis for evidence discovery presented in this dissertation is viable and promising. The preliminary experiment showed that the results obtained from the evidence discovery system, using either of the relation extraction models, are sensible and useful. The approach advocated in this dissertation can therefore be successfully applied to the analysis of textual data for digital forensics.
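The first relation model's co-occurrence idea can be illustrated with a few lines of Python, shown below: concepts that appear in the same sentence are counted as related. The dissertation operates on extracted complex concepts; plain words and invented sentences stand in here.

```python
# Toy illustration of co-occurrence-based relation discovery: concepts that
# appear in the same sentence are counted as related.  Plain words and
# invented sentences stand in for the dissertation's complex concepts.
from collections import Counter
from itertools import combinations

sentences = ["alice emailed bob about the invoice",
             "bob forwarded the invoice to carol",
             "alice met carol on friday"]
concepts = {"alice", "bob", "carol", "invoice"}

cooc = Counter()
for sent in sentences:
    present = sorted(set(sent.split()) & concepts)
    cooc.update(combinations(present, 2))     # one count per concept pair

for pair, count in cooc.most_common():
    print(pair, count)
```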
Dissertation (MSc)--University of Pretoria, 2010.
Computer Science
APA, Harvard, Vancouver, ISO, and other styles
34

Moravcová, Libuše. "Srovnání sylabů předmětů na různých univerzitách dolováním znalosti z textu." Master's thesis, 2018. http://www.nusl.cz/ntk/nusl-428783.

Full text
Abstract:
The thesis focuses on obtaining the most accurate possible information about universities, faculties, fields of study and the syllabi of individual courses using text mining tools. The first part describes the basics of text mining and related topics, the collection and preparation of the textual data, and its translation into English. In the next phase, a database is generated from the accumulated data entries, and the following step retrieves the best-matching results, such as specific phrases. The thesis ends with evaluation and summarization, and where problems arise, possible solutions or alternatives are suggested.
APA, Harvard, Vancouver, ISO, and other styles
35

Liao, Shao-An, and 廖紹安. "Using Data Mining Techniques And Texture Analysis for Landslide Change Assessment-A Case Study at Chiufanershan Area." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/28409706792512680728.

Full text
Abstract:
Master's thesis
MingDao University
Graduate Institute of Environmental Planning and Design
96
Massive landslides caused by the catastrophic Chi-Chi earthquake of September 21, 1999 occurred in the Chiufanershan area of Nantou County. In this study, multi-temporal SPOT satellite images were chosen for landslide change analysis. First, an image subtraction method was employed to analyse the spectral characteristics of landslides, and the ISODATA method was used to select training sites before performing supervised classification. Landslide sites were then extracted and compared using the commonly used maximum likelihood estimation (MLE), data mining techniques including support vector machines (SVM) and the C5.0 decision tree, and texture analysis. The results can serve as a reference for disaster assessment in the landslide area. The analysis shows that the support vector machine achieves higher overall accuracy than the other classification methods. With texture analysis, the average overall classification accuracy (Kappa value) rose markedly from 76.5% to 87.2%, indicating that texture information can effectively separate the surface characteristics of different land covers. The results also show that the landslide area decreased from 210.8 ha on September 27, 1999 to 72.6 ha on May 5, 2007, a restoration of about 65.6% of the area, indicating that the landslide sites have gradually recovered over nine years of natural vegetation succession.
APA, Harvard, Vancouver, ISO, and other styles
36

Ray, A., P. K. Bala, and Nripendra P. Rana. "Exploring the drivers of customers’ brand attitudes of online travel agency services: A text-mining based approach." 2021. http://hdl.handle.net/10454/18339.

Full text
Abstract:
This paper explores the qualitative aspects of online user-generated content that reflect customers' brand attitudes. These aspects can help service providers understand customers' brand attitudes by focusing on what matters in a review rather than reading it in full, saving both time and effort. We use a total of 10,000 reviews from TripAdvisor, an online travel agency provider. The study analyses the data with a statistical technique (logistic regression), a predictive model (artificial neural networks) and a structural modelling technique to identify the aspects (sentiment, emotion or parts of speech) that best predict customers' brand attitudes. Results show that sentiment is the most important aspect in predicting brand attitudes. While total sentiment content and content polarity have a significant positive association with customers' brand attitudes, negative high-arousal emotions and low-arousal emotions have a significant negative association, and parts-of-speech aspects have no significant impact. The paper concludes with implications, limitations and future research directions.
APA, Harvard, Vancouver, ISO, and other styles
37

Samson, Anne-Renée. "Extraction automatique et visualisation des thèmes abordés dans des résumés de mémoires et de thèses en anthropologie au Québec, de 1985 à 2009." Thèse, 2013. http://hdl.handle.net/1866/10440.

Full text
Abstract:
Taking advantage of recent developments in the automated analysis of textual data, digital document management, information visualization and anthropology, this exploratory study uses text mining techniques to create a thematic map of anthropological documents. More precisely, we evaluate hierarchical agglomerative clustering (HCA) for thematic analysis and information extraction, supported by information visualizations based on network analysis. Our study is built on a corpus of 1,240 thesis and dissertation abstracts granted from 1985 to 2009 by the anthropology departments of the Université de Montréal and Université Laval, as well as the history department of Université Laval (for archaeological and ethnological abstracts). The first section presents the theoretical framework: definitions of text mining, its origins, practical applications and methodology, followed by a literature review. The second part is devoted to the methodological framework and discusses the stages through which the project was conducted: construction of the database, linguistic and statistical filtering, automated classification, and so on. Finally, the last section presents the results of two specific experiments and our interpretations, and discusses thematic navigation and conceptual approaches to thematization, for example the culture/biology dichotomy in anthropology. We conclude with the limitations of the project and avenues for future research.
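As a small illustration of the hierarchical agglomerative clustering step, the sketch below clusters a handful of invented abstracts on TF-IDF features with SciPy's linkage function; the thesis applies the method to 1,240 real abstracts after linguistic filtering.

```python
# Hierarchical agglomerative clustering (HCA) of a few invented abstracts
# on TF-IDF features; the thesis works on 1,240 real abstracts.
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = ["fieldwork on kinship and ritual in a rural community",
             "ritual practice and kinship networks in the village",
             "lithic artefacts from an archaeological excavation",
             "excavation report on stone tools and site stratigraphy"]

X = TfidfVectorizer(stop_words="english").fit_transform(abstracts).toarray()
Z = linkage(X, method="ward")                  # agglomerative merge tree
print(fcluster(Z, t=2, criterion="maxclust"))  # cut the tree into 2 themes
```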
APA, Harvard, Vancouver, ISO, and other styles
38

Zouaq, Amal. "Une approche d'ingénierie ontologique pour l'acquisition et l'exploitation des connaissances à partir de documents textuels : vers des objets de connaissances et d'apprentissage." Thèse, 2007. http://hdl.handle.net/1866/6437.

Full text
APA, Harvard, Vancouver, ISO, and other styles