Dissertations / Theses on the topic 'Big text data'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 dissertations / theses for your research on the topic 'Big text data.'
Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online, whenever these are available in the metadata.
Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.
Šoltýs, Matej. "Big Data v technológiách IBM." Master's thesis, Vysoká škola ekonomická v Praze, 2014. http://www.nusl.cz/ntk/nusl-193914.
Leis Machín, Angela. "Studying depression through big data analytics on Twitter." Doctoral thesis, TDX (Tesis Doctorals en Xarxa), 2021. http://hdl.handle.net/10803/671365.
Nhlabano, Valentine Velaphi. "Fast Data Analysis Methods For Social Media Data." Diss., University of Pretoria, 2018. http://hdl.handle.net/2263/72546.
Dissertation (MSc)--University of Pretoria, 2019.
Bischof, Jonathan Michael. "Interpretable and Scalable Bayesian Models for Advertising and Text." Thesis, Harvard University, 2014. http://dissertations.umi.com/gsas.harvard:11400.
Abrantes, Filipe André Catarino. "Processos e ferramentas de análise de Big Data : a análise de sentimento no twitter." Master's thesis, Instituto Superior de Economia e Gestão, 2017. http://hdl.handle.net/10400.5/15802.
Given the exponential increase of data produced worldwide, it has become crucial to find processes and tools for analysing this large volume of data (commonly called Big Data), especially unstructured data such as text. Companies today try to extract value from these data, much of them generated by customers or potential customers, which can give them a competitive advantage. The difficulty lies in how to analyse unstructured data, in particular data produced through digital networks, which are one of the main sources of information for organisations. This work frames the problem of structuring and analysing Big Data, presents the different approaches to solving it, and tests one of these approaches on a selected block of data. The sentiment analysis approach was chosen, using text mining techniques, the R language, and text shared on Twitter about four technology giants: Amazon, Apple, Google and Microsoft. The development and testing of the prototype built in this project show that it is possible to perform sentiment analysis of tweets using R, extracting valuable information from large blocks of data.
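The lexicon-based scoring behind this kind of tweet sentiment analysis can be sketched in a few lines. The fragment below is a minimal Python illustration; the five-word lexicons and sample tweets are invented examples, not the dictionaries or data used in the thesis (which worked in R).

```python
# Minimal lexicon-based sentiment scorer: count positive minus negative
# lexicon hits per tweet. The five-word lexicons are invented examples.
import re

POSITIVE = {"great", "love", "good", "excellent", "happy"}
NEGATIVE = {"bad", "hate", "terrible", "slow", "broken"}

def sentiment_score(tweet):
    """Return (#positive words - #negative words) for one tweet."""
    words = re.findall(r"[a-z']+", tweet.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

tweets = [
    "Love the new Apple keynote, great products",
    "Microsoft support was terrible and slow",
]
scores = [sentiment_score(t) for t in tweets]  # [2, -2]
```

A positive score marks a tweet as positive, a negative one as negative; real systems add negation handling and weighted lexicons on top of this skeleton.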
Hill, Geoffrey. "Sensemaking in Big Data: Conceptual and Empirical Approaches to Actionable Knowledge Generation from Unstructured Text Streams." Kent State University / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=kent1433597354.
Chennen, Kirsley. "Maladies rares et "Big Data" : solutions bioinformatiques vers une analyse guidée par les connaissances : applications aux ciliopathies." Thesis, Strasbourg, 2016. http://www.theses.fr/2016STRAJ076/document.
Over the last decade, biomedical research and medical practice have been revolutionized by the post-genomic era and the emergence of Big Data in biology. The field of rare diseases is characterized by scarcity, from the number of patients to the domain knowledge. Nevertheless, rare diseases represent a real interest, as the fundamental knowledge accumulated and the therapeutic solutions developed can also benefit more common disorders. This thesis focuses on the development of new bioinformatics solutions, integrating Big Data and Big Data-associated approaches to improve the study of rare diseases. In particular, my work resulted in (i) the creation of PubAthena, a tool for the recommendation of relevant literature updates, and (ii) the development of VarScrut, a tool for the analysis of exome datasets that combines multi-level knowledge to improve the resolution rate.
Soen, Kelvin, and Bo Yin. "Customer Behaviour Analysis of E-commerce : What information can we get from customers' reviews through big data analysis." Thesis, KTH, Entreprenörskap och Innovation, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-254194.
Lindén, Johannes. "Understand and Utilise Unformatted Text Documents by Natural Language Processing algorithms." Thesis, Mittuniversitetet, Avdelningen för informationssystem och -teknologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-31043.
Savalli, Antonino. "Tecniche analitiche per “Open Data”." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019. http://amslaurea.unibo.it/17476/.
Yu, Shuren. "How to Leverage Text Data in a Decision Support System? : A Solution Based on Machine Learning and Qualitative Analysis Methods." Thesis, Umeå universitet, Institutionen för informatik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-163899.
Alshaer, Mohammad. "An Efficient Framework for Processing and Analyzing Unstructured Text to Discover Delivery Delay and Optimization of Route Planning in Realtime." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSE1105/document.
The Internet of Things (IoT) is leading to a paradigm shift within the logistics industry, and its advent has been changing the logistics service management ecosystem. Logistics service providers today use sensor technologies such as GPS or telemetry to collect data in real time while the delivery is in progress. Real-time data collection enables service providers to track and manage their shipment process efficiently; its key advantage is that it allows providers to act proactively to prevent outcomes such as delivery delay caused by unexpected or unknown events. Furthermore, providers today tend to use data stemming from external sources such as Twitter, Facebook, and Waze, because these sources provide critical information about events such as traffic, accidents, and natural disasters. Data from such external sources enrich the dataset and add value to the analysis, and collecting them in real time provides an opportunity to use the data for on-the-fly analysis and to prevent unexpected outcomes (such as delivery delay) at run time. However, data are collected raw and need to be processed for effective analysis. Collecting and processing data in real time is an enormous challenge, mainly because data stem from heterogeneous sources at very high speed. The high speed and variety of data create challenges for complex processing operations such as cleansing, filtering, and handling incorrect data. The variety of data (structured, semi-structured, and unstructured) raises challenges in processing data both in batch style and in real time, since different types of data may require different processing techniques. A technical framework that enables the processing of such heterogeneous data is heavily challenging and not currently available.
In addition, performing data processing operations in real time is heavily challenging; efficient techniques are required to carry out the operations on high-speed data, which cannot be done using conventional logistics information systems. Therefore, in order to exploit Big Data in logistics service processes, an efficient solution for collecting and processing data in both real-time and batch style is critically important. In this thesis, we developed and experimented with two data processing solutions: SANA and IBRIDIA. SANA is built on a Multinomial Naïve Bayes classifier, whereas IBRIDIA relies on Johnson's hierarchical clustering (HCL) algorithm, a hybrid technology that enables data collection and processing in batch style and in real time. SANA is a service-based solution which deals with unstructured data. It serves as a multi-purpose system to extract relevant events, including the context of the event (such as place, location, time, etc.), and it can also be used to perform text analysis over the targeted events. IBRIDIA was designed to process unknown data stemming from external sources and cluster them on the fly in order to gain knowledge and understanding of the data, which assists in extracting events that may lead to delivery delay. According to our experiments, both approaches show a unique ability to process logistics data. However, SANA proved more promising, since its underlying technology (the Naïve Bayes classifier) outperformed IBRIDIA in our performance measurements. SANA generates graph knowledge from the events collected immediately in real time, without any need to wait, thus drawing maximum benefit from these events, whereas IBRIDIA remains valuable within the logistics domain for identifying the most influential categories of events that affect delivery.
Unfortunately, IBRIDIA must wait for a minimum number of events to arrive and always faces a cold start. Because we are interested in re-optimizing the route on the fly, we adopted SANA as our data processing framework.
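The Multinomial Naïve Bayes classification underlying SANA can be illustrated with a small from-scratch sketch. The labelled snippets about traffic events below are invented stand-ins for the thesis's logistics feeds, not its actual training data.

```python
# Toy multinomial Naive Bayes text classifier, the model family behind
# SANA; the labelled training snippets are invented examples.
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: iterable of (label, text) pairs."""
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for label, text in docs:
        class_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def predict(model, text):
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total)   # log prior
        n = sum(word_counts[label].values())
        for w in text.lower().split():
            # Laplace smoothing over the shared vocabulary
            lp += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train([
    ("delay", "accident on highway traffic jam"),
    ("delay", "road closed heavy traffic"),
    ("normal", "clear roads on time delivery"),
])
```

With this toy training set, `predict(model, "traffic accident ahead")` returns `"delay"`: the unseen word "ahead" is smoothed away and the known delay vocabulary dominates the log-likelihood.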
Musil, David. "Algoritmus pro detekci pozitívního a negatívního textu." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2016. http://www.nusl.cz/ntk/nusl-242026.
Cancellieri, Andrea. "Analisi di tecniche per l'estrazione di informazioni da documenti testuali e non strutturati." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2014. http://amslaurea.unibo.it/7773/.
Canducci, Marco. "Previsioni di borsa mediante analisi di dati testuali: studio ed estensione di un metodo basato su Google Trends." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2017.
Risch, Jean-Charles. "Enrichissement des Modèles de Classification de Textes Représentés par des Concepts." Thesis, Reims, 2017. http://www.theses.fr/2017REIMS012/document.
Most text-classification methods use the "bag of words" paradigm to represent texts. However, Bloehdorn and Hotho have identified four limits of this representation: (1) some words are polysemous, (2) others can be synonyms and yet be differentiated in the analysis, (3) some words are strongly semantically linked without this being taken into account in the representation, and (4) certain words lose their meaning if they are extracted from their nominal group. To overcome these problems, some methods no longer represent texts with words but with concepts extracted from a domain ontology (bag of concepts), integrating the notion of meaning into the model. Models based on the bag of concepts remain less used because of their unsatisfactory results, so several methods have been proposed to enrich text features with new concepts extracted from knowledge bases. My work follows these approaches by proposing a model-enrichment step using a domain ontology; I proposed two measures to estimate how strongly these new concepts belong to each category. Using the naive Bayes classifier algorithm, I tested and compared my contributions on the Ohsumed corpus using the "Disease Ontology" domain ontology. The satisfactory results led me to analyse more precisely the role of semantic relations in the enrichment step. This further work was the subject of a second experiment, in which we evaluate the contribution of the hierarchical relations of hypernymy and hyponymy.
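The bag-of-concepts enrichment step can be pictured as follows: terms already mapped to ontology concepts are extended with their hypernyms (parent concepts). The tiny ontology below is a made-up stand-in for the Disease Ontology, and the traversal depth is an assumed parameter.

```python
# Sketch of bag-of-concepts enrichment: extend each concept in the bag
# with its hypernyms. The tiny ontology is a made-up stand-in for the
# Disease Ontology used in the thesis.
HYPERNYM = {  # concept -> parent concept
    "influenza": "viral infection",
    "viral infection": "infectious disease",
    "diabetes": "metabolic disease",
}

def enrich(concepts, levels=2):
    """Add up to `levels` ancestors of each concept to the bag."""
    bag = set(concepts)
    for c in concepts:
        for _ in range(levels):
            c = HYPERNYM.get(c)
            if c is None:
                break
            bag.add(c)
    return bag

bag = enrich({"influenza"})  # adds "viral infection" and "infectious disease"
```

The enriched bag then feeds the classifier in place of raw word features; how strongly each added ancestor should count toward a category is exactly what the thesis's two membership measures estimate.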
Gerrish, Charlotte. "European Copyright Law and the Text and Data Mining Exceptions and Limitations : With a focus on the DSM Directive, is the EU Approach a Hindrance or Facilitator to Innovation in the Region?" Thesis, Uppsala universitet, Juridiska institutionen, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-385195.
Mariaux, Sébastien. "Les organisations de l'économie sociale et solidaire face aux enjeux écologiques : stratégies de communication et d'action environnementale." Electronic Thesis or Diss., Aix-Marseille, 2019. http://www.theses.fr/2019AIXM0463.
The protection of the natural environment is a key issue for the future of humanity. The social and solidarity economy (SSE), which shares the principles of sustainable development, is particularly well suited to implement more environmentally friendly development alternatives. The purpose of this research is to examine the factors and modalities of environmental action in this heterogeneous economy. The thesis looks at SSE organisations from the perspective of organisational identity and focuses on environmental communication on the one hand, and concrete actions on the other. The study of environmental communication draws on the social network Twitter; it is based on a program coded in Python and on automatic text-mining techniques, and it highlights several rhetorical strategies. A second study deals with seven cases, based on semi-directive interviews, and sheds light on the role of individual commitment as well as collective logic in environmental action. This work makes a methodological contribution by developing the approach of automatic text mining, which is rarely used in management sciences. On the theoretical level, the thesis introduces the collective dimension as an element of the organisational identity of SSE organisations. We then adapt an environmental action model by identifying an additional determinant specific to these organizations. Finally, the research invites the SSE to put ecological issues back at the centre and offers suggestions for supporting organisations in their efforts to protect the environment.
Francia, Matteo. "Progettazione di un sistema di Social Intelligence e Sentiment Analysis per un'azienda del settore consumer goods." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2012. http://amslaurea.unibo.it/3850/.
Doucet, Rachel A., Deyan M. Dontchev, Javon S. Burden, and Thomas L. Skoff. "Big data analytics test bed." Thesis, Monterey, California: Naval Postgraduate School, 2013. http://hdl.handle.net/10945/37615.
The proliferation of big data has significantly expanded the quantity and breadth of information throughout the DoD. The task of processing and analyzing this data has become difficult, if not infeasible, using traditional relational databases. The Navy has a growing priority for information processing, exploitation, and dissemination, which makes use of the vast network of sensors that produce a large amount of big data. This capstone report explores the feasibility of a scalable Tactical Cloud architecture that will harness and utilize open-source tools for big data analytics. A virtualized cloud environment was built and analyzed at the Naval Postgraduate School, offering a test bed suitable for studying novel variations of these architectures. The technologies used to implement the test bed demonstrate a sustainable methodology for rapidly configuring and deploying virtualized machines and provide an environment for performance benchmarking and testing. The capstone findings indicate the strategies and best practices to automate the deployment, provisioning, and management of big data clusters. The functionality we seek to support serves a far more general goal: finding open-source tools that help deploy and configure large clusters for on-demand big data analytics.
Lucchi, Giulia. "Applicazione web per visualizzare e gestire dati estratti da Twitter." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amslaurea.unibo.it/12555/.
Grepl, Filip. "Aplikace pro řízení paralelního zpracování dat." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2021. http://www.nusl.cz/ntk/nusl-445490.
Jing, Liping, and 景麗萍. "Text subspace clustering with feature weighting and ontologies." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2007. http://hub.hku.hk/bib/B39332834.
O'Sullivan, Jack William. "Biostatistical and meta-research approaches to assess diagnostic test use." Thesis, University of Oxford, 2018. http://ora.ox.ac.uk/objects/uuid:1419df96-1534-4cfe-b686-cde554ff7345.
陳我智 and Ngor-chi Chan. "Text-to-speech conversion for Putonghua." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1990. http://hub.hku.hk/bib/B31209580.
Cardinal, Robert W. "DATA REDUCTION AND PROCESSING SYSTEM FOR FLIGHT TEST OF NEXT GENERATION BOEING AIRPLANES." International Foundation for Telemetering, 1993. http://hdl.handle.net/10150/608878.
This paper describes the recently developed Loral Instrumentation ground-based equipment used to select and process post-flight test data from the Boeing 777 airplane as it is played back from a digital tape recorder (e.g., the Ampex DCRSi II) at very high speeds. Gigabytes (GB) of data, stored on recorder cassettes in the Boeing 777 during flight testing, are played back on the ground at a 15-30 MB/sec rate into ten multiplexed Loral Instrumentation System 500 Model 550s for high-speed decoding, processing, time correlation, and subsequent storage or distribution. The ten Loral 550s are multiplexed for independent data path processing from ten separate tape sources simultaneously. This system features a parallel multiplexed configuration that allows Boeing to perform critical 777 flight test processing at unprecedented speeds. Boeing calls this system the Parallel Multiplexed Processing Data (PMPD) System. The key advantage of the ground station's design is that Boeing engineers can add their own application-specific control and setup software. The Loral 550 VMEbus allows Boeing to add VME modules when needed, ensuring system growth with the addition of other LI-developed products, Boeing-developed products or purchased VME modules. With hundreds of third-party VME modules available, system expansion is unlimited. The final system has the capability to input data at 15 MB/sec. The present aggregate throughput capability of all ten 24-bit Decoders is 150 MB/sec from ten separate tape sources. A 24-bit Decoder was designed to support the 30 MB/sec DCRSi III so that the system can eventually support a total aggregate throughput of 300 MB/sec. Clearly, such high-speed data selection, rejection, and processing will significantly accelerate flight certification and production testing of today's state-of-the-art aircraft. This system was supplied with low-level software interfaces so that the customer could develop their own application-specific code and displays.
The Loral 550 lends itself to this kind of application due to its VME chassis, VxWorks operating system and the modularity of the software.
Hon, Wing-kai, and 韓永楷. "On the construction and application of compressed text indexes." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B31059739.
Lam, Yan-ki Jacky. "Developmental normative data for the random gap detection test." Click to view the E-thesis via HKU Scholars Hub, 2005. http://lookup.lib.hku.hk/lookup/bib/B38279289.
Full text"A dissertation submitted in partial fulfilment of the requirements for the Bachelor of Science (Speech and Hearing Sciences), The University of Hong Kong, June 30, 2005." Also available in print.
Lee, Wai-ming, and 李慧明. "Correlation of PCPT and SPT data from a shallow marine site investigation." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B44570077.
Ho, Yuen-ying, and 何婉瑩. "The effect of introducing a computer software in enhancing comprehension of classical Chinese text." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1995. http://hub.hku.hk/bib/B31957869.
Franco, Davide. "The Borexino experiment: test of the purification systems and data analysis in the counting test facility." [S.l.]: [s.n.], 2005. http://deposit.ddb.de/cgi-bin/dokserv?idn=974442968.
Yang, Wenwei, and 楊文衛. "Development and application of automatic monitoring system for standard penetration test in site investigation." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2006. http://hub.hku.hk/bib/B36811919.
Smedley, Mark, and Gary Simpson. "SHOCK & VIBRATION TESTING OF AN AIRBORNE INSTRUMENTATION DIGITAL RECORDER." International Foundation for Telemetering, 2000. http://hdl.handle.net/10150/606747.
Shock and vibration testing was performed on the Metrum-Datatape Inc. 32HE recorder to determine its viability as an airborne instrumentation recorder. A secondary goal of the testing was to characterize the recorder's operational shock and vibration envelope. Both flight testing and laboratory environmental testing of the recorder were performed to make these determinations. This paper addresses the laboratory portion of the shock and vibration testing and covers the test methodology and rationale, test set-up, results, challenges, and lessons learned.
Wong, Ping-wai, and 黃炳蔚. "Semantic annotation of Chinese texts with message structures based on HowNet." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2007. http://hub.hku.hk/bib/B38212389.
Kozák, David. "Indexace rozsáhlých textových dat a vyhledávání v zaindexovaných datech." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2020. http://www.nusl.cz/ntk/nusl-417263.
Stolz, Carsten Dirk. "Erfolgsmessung informationsorientierter Websites." kostenfrei, 2007. http://deposit.d-nb.de/cgi-bin/dokserv?idn=989985180.
Mittermayer, Marc-André. "Einsatz von Text Mining zur Prognose kurzfristiger Trends von Aktienkursen nach der Publikation von Unternehmensnachrichten." Berlin: dissertation.de, 2006. http://deposit.d-nb.de/cgi-bin/dokserv?id=2871284&prov=M&dok_var=1&dok_ext=htm.
Moyse, Gilles. "Résumés linguistiques de données numériques : interprétabilité et périodicité de séries." Thesis, Paris 6, 2016. http://www.theses.fr/2016PA066526/document.
Our research is in the field of fuzzy linguistic summaries (FLS), which allow the generation of natural language sentences describing very large amounts of numerical data, providing concise and intelligible views of these data. We first focus on the interpretability of FLS, crucial for providing end-users with easily understandable text but hard to achieve due to its linguistic form. Beyond existing works on that topic, based on the basic components of FLS, we propose a general approach to the interpretability of summaries, considering them globally as groups of sentences. We focus more specifically on their consistency. In order to guarantee it in the framework of standard fuzzy logic, we introduce a new model of oppositions between increasingly complex sentences. The model allows us to show that these consistency properties can be satisfied by selecting a specific negation approach. Moreover, based on this model, we design a 4-dimensional cube displaying all the possible oppositions between sentences in an FLS and show that it generalises several existing logical opposition structures. We then consider the case of data in the form of numerical series and focus on linguistic summaries about their periodicity: the sentences we propose indicate the extent to which the series are periodic and offer an appropriate linguistic expression of their periods. The proposed extraction method, called DPE for Detection of Periodic Events, splits the data in an adaptive manner and without any prior information, using tools from mathematical morphology. The segments are then exploited to compute the period and the periodicity, measuring the quality of the estimation and the extent to which the series is periodic. Lastly, DPE returns descriptive sentences of the form "Approximately every 2 hours, the customer arrival is important". Experiments with artificial and real data show the relevance of the proposed DPE method.
From an algorithmic point of view, we propose an incremental and efficient implementation of DPE, based on established update formulas. This implementation makes DPE scalable and allows it to process real-time streams of data. We also present an extension of DPE based on the local periodicity concept, allowing the identification of locally periodic subsequences in a numerical series, using an original statistical test. The method, validated on artificial and real data, returns natural language sentences that extract information of the form "Every two weeks during the first semester of the year, sales are high".
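For intuition, the period of a numerical series can be recovered without prior information by a simple autocorrelation sketch. This is a hypothetical illustration only: DPE itself relies on mathematical morphology rather than autocorrelation, and the sample series is invented.

```python
# Illustrative period estimation for a numerical series. DPE uses
# mathematical morphology; this autocorrelation sketch only conveys the
# idea of recovering a period without prior information.
def estimate_period(series, min_lag=2):
    """Return the lag with the strongest positive autocorrelation."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]

    def autocorr(lag):
        return sum(centered[i] * centered[i + lag] for i in range(n - lag))

    return max(range(min_lag, n // 2), key=autocorr)

# A noiseless series repeating every 5 samples.
series = [0, 1, 2, 1, 0] * 6
period = estimate_period(series)  # period == 5
```

A full summariser would then measure how well the series matches this period and verbalise both, along the lines of "approximately every 5 samples, the value is high".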
O’Donnell, John. "SOME PRACTICAL CONSIDERATIONS IN THE USE OF PSEUDO-RANDOM SEQUENCES FOR TESTING THE EOS AM-1 RECEIVER." International Foundation for Telemetering, 1998. http://hdl.handle.net/10150/609651.
There are well-known advantages in using pseudo-random sequences for testing of data communication links. The sequences, also called pseudo-noise (PN) sequences, approximate random data very well, especially for sequences thousands of bits long. They are easy to generate and are widely used for bit error rate testing because it is easy to synchronize a slave pattern generator to a received PN stream for bit-by-bit comparison. There are other aspects of PN sequences, however, that are not as widely known or applied. This paper points out how some of the less familiar characteristics of PN sequences can be put to practical use in the design of a Digital Test Set and other special-built test equipment used for checkout of the EOS AM-1 Space Data Receiver. The paper also shows how knowledge of these PN sequence characteristics can simplify troubleshooting the digital sections in the Space Data Receiver. Finally, the paper addresses the sufficiency of PN data testing in characterizing the performance of a receiver/data recovery system.
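A maximal-length PN sequence of the kind used for bit error rate testing is typically generated with a linear-feedback shift register (LFSR). The 7-stage register below is a generic textbook configuration, not necessarily the polynomial or length used for EOS AM-1 testing.

```python
# One period of a maximal-length pseudo-noise (PN) sequence from a 7-stage
# Fibonacci LFSR. The tap choice is a standard textbook configuration,
# not necessarily the one used for EOS AM-1 testing.
def pn_sequence(nbits=7):
    lfsr = 1                                   # any nonzero seed works
    out = []
    for _ in range(2**nbits - 1):              # full period: 2^7 - 1 = 127
        out.append(lfsr & 1)                   # output the low bit
        fb = (lfsr ^ (lfsr >> 1)) & 1          # feedback from the two end stages
        lfsr = (lfsr >> 1) | (fb << (nbits - 1))
    return out

seq = pn_sequence()
# A length-127 m-sequence is nearly balanced: 64 ones and 63 zeros.
```

Because the receiver can run an identical register synchronized to the incoming stream, a bit-by-bit XOR of the two sequences directly counts channel errors, which is exactly the bit error rate test setup the paper describes.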
Kopylova, Evguenia. "Algorithmes bio-informatiques pour l'analyse de données de séquençage à haut débit." Phd thesis, Université des Sciences et Technologie de Lille - Lille I, 2013. http://tel.archives-ouvertes.fr/tel-00919185.
Nyström, Josefina. "Multivariate non-invasive measurements of skin disorders." Doctoral thesis, Umeå University, Chemistry, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-865.
The present thesis proposes new methods for obtaining objective and accurate diagnoses in modern healthcare. Non-invasive techniques have been used to examine or diagnose three different medical conditions, namely neuropathy among diabetics, radiotherapy induced erythema (skin redness) among breast cancer patients and diagnoses of cutaneous malignant melanoma. The techniques used were Near-InfraRed spectroscopy (NIR), Multi Frequency Bio Impedance Analysis of whole body (MFBIA-body), Laser Doppler Imaging (LDI) and Digital Colour Photography (DCP).
The neuropathy for diabetics was studied in papers I and II. The first study was performed on diabetics and control subjects of both genders. A separation was seen between males and females and therefore the data had to be divided in order to obtain good models. NIR spectroscopy was shown to be a viable technique for measuring neuropathy once the division according to gender was made. The second study on diabetics, where MFBIA-body was added to the analysis, was performed on males exclusively. Principal component analysis showed that healthy reference subjects tend to separate from diabetics. Also, diabetics with severe neuropathy separate from persons less affected.
The preliminary study presented in paper III was performed on breast cancer patients in order to investigate if NIR, LDI and DCP were able to detect radiotherapy induced erythema. The promising results in the preliminary study motivated a new and larger study. This study, presented in papers IV and V, intended to investigate the measurement techniques further but also to examine the effect that two different skin lotions, Essex and Aloe vera have on the development of erythema. The Wilcoxon signed rank sum test showed that DCP and NIR could detect erythema, which is developed during one week of radiation treatment. LDI was able to detect erythema developed during two weeks of treatment. None of the techniques could detect any differences between the two lotions regarding the development of erythema.
The use of NIR to diagnose cutaneous malignant melanoma is presented as unpublished results in this thesis. This study gave promising but inconclusive results. NIR could be of interest for future development of instrumentation for diagnosis of skin cancer.
Narmack, Kirill. "Dynamic Speed Adaptation for Curves using Machine Learning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-233545.
Tomorrow's vehicles will be more sophisticated, intelligent and safe than today's. The future leans towards fully autonomous vehicles. This thesis provides a data-driven solution for a speed adaptation system that can compute a vehicle speed in curves suited to the driver's driving style, the road's properties and the prevailing weather. A curve speed adaptation system aims to compute a vehicle speed for curves that can be used in Advanced Driver Assistance Systems (ADAS) or Autonomous Driving (AD) applications. The thesis was carried out at Volvo Car Corporation. Literature on speed adaptation systems and on the factors that affect vehicle speed in curves was studied. Naturalistic driving data was collected by driving a car, extracted from Volvo's database, and processed. A new speed adaptation system was designed, implemented and evaluated. The system proved capable of computing a vehicle speed appropriate to the driver's driving style under the prevailing weather conditions and road properties. Two different artificial neural networks and two mathematical models were used to compute the vehicle speed, and these methods were compared and evaluated.
Andrade, Carina Sofia Marinho de. "Text mining na análise de sentimentos em contextos de big data." Master's thesis, 2015. http://hdl.handle.net/1822/40034.
Full textThe evolution of technology, together with the constant use of different devices connected to the internet, drives vast growth in the volume and variety of data generated daily at great velocity, a phenomenon usually called Big Data. Related to the growth in data volume is the rising prominence of the various Text Mining techniques, essentially because of the possibility of extracting greater value from the data generated by the many applications, thereby seeking information that benefits several areas of study. One of the current points of interest in this respect is Sentiment Analysis, where through several techniques it is possible to understand, across the most varied types of data, which sentiments and opinions are implicit in them. Since the purpose of this dissertation is the development of a system based on Big Data technology, built on Text Mining and Sentiment Analysis techniques for decision support, the document conceptually frames the three concepts mentioned above, providing a global view of them and describing practical applications where they are generally used. In addition, an architecture is proposed for Sentiment Analysis in the context of data from the Twitter social network, and practical applications are developed, drawing on everyday examples where Sentiment Analysis brings benefits when applied. With the demonstration cases presented, it is possible to verify the role of each technology used and of the technique adopted for Sentiment Analysis. On the other hand, the conclusions drawn from the demonstration cases reveal the difficulties still associated with carrying out Sentiment Analysis: the difficulties in text processing, the lack of Portuguese lexicons, among other topics addressed in this document.
The evolution of technology, associated with the common use of different devices connected to the internet, provides vast growth in the volume and variety of data generated daily at high velocity, a phenomenon commonly denominated Big Data. Related to the growth in data volume is the increased awareness of several Text Mining techniques, which make possible the extraction of useful insight from the data generated by multiple applications, thus trying to obtain information beneficial to multiple study areas. One of the current interests concerning this topic is Sentiment Analysis, where through the use of several data analysis techniques it is possible to understand, among a vast variety of data and data types, which sentiments and opinions are implicit in them. Since the purpose of this dissertation is the development of a system based on Big Data technologies that implements Text Mining and Sentiment Analysis techniques for decision support, this document presents a conceptual framework of the three concepts mentioned above, providing a global overview of them and describing practical applications where they are generally used. In addition, an architecture is proposed for Sentiment Analysis in the context of data from the Twitter social network. For that, practical applications are developed, using real-world examples where Sentiment Analysis brings benefits when applied. With the presented demonstration cases it is possible to verify the role of each technology used and the techniques adopted for Sentiment Analysis. Moreover, the conclusions drawn from the demonstration cases allow us to understand the difficulties that are still present in the development of Sentiment Analysis: difficulties in text processing, the lack of Portuguese lexicons, among other topics addressed in this document.
WU, JIA-HAO, and 吳家豪. "On-line Health News Analysis Involving Big Data based on Text Mining." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/70822968718916156554.
Full textNational United University
Master's Program, Department of Information Management
104
People in Taiwan have been alerted by problems of food safety in the past few years; therefore, they have paid more attention to health news. This study tries to find the critical terms in on-line health news and to predict the number of "Like" votes the news receives, based on text mining and business intelligence algorithms. In addition, to deal with the potentially big data from on-line news, this study proposes a big data system structure with a Hadoop-based platform and the Spark parallel framework, using parallel processing on multiple data nodes. The results show that the support vector machine with 50 concept dimensions has the best prediction accuracy. When the number of iterations was raised, the corresponding execution time increased; however, the execution time grew at a much lower rate than the number of iterations. Moreover, when the amount of data becomes huge, the performance of the Spark distributed computing structure improves significantly. The proposed approach can help managers of on-line news sites to select or invest in more popular health news and thus attract more potential readers. The proposed structure and analytic results regarding big data can also provide insights for future studies.
Amado, Maria Alexandra Amorim. "A review of the literature on big data in marketing using text mining." Master's thesis, 2015. http://hdl.handle.net/10071/11101.
Full textWith the amount of data that exists today, organisations have access to ever more information, collecting data of all kinds and easily accumulating terabytes or petabytes of data. These data come from several sources: social network streams, mobile devices, images, transactions, GPS signals, and so on. Analysing this large amount of data, currently called Big Data, is an increasing competitive concern for organisations, boosting productivity growth and innovation. But what exactly is Big Data? Big Data is more than just a question of size: with the emergence of new data-collection technologies and advanced Data Mining, supported by powerful data analysis tools, Big Data offers an unprecedented opportunity to acquire knowledge from new types of data and to discover business opportunities faster. Its application to Marketing can bring great potential to organisations, since it allows them to improve their view of the market and to create better interactions with customers by investigating their behaviour, in order to identify the right message to deliver on the right channel, at the right time, to the right customer. These improved interactions result in increased revenue and competitive differentiation. In this study, Text Mining was used to develop an automated literature review and to analyse the application of Big Data in Marketing along four dimensions: time, geography, sectors and products.
Ke, Cheng Hao, and 柯政豪. "The Ecology of Big Data: A Memetic Approach on the Evolution of Online Text." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/z5bard.
Full textNational Chengchi University
Department of Public Administration
104
The mismatch between theory and method is a crisis which the discipline of public administration cannot afford to ignore. The arrival of the "Era of Big Data" only serves to make matters worse. As data becomes uncoupled from the individual, so goes any pretense of trying to provide analyses beyond mere description. If public administration refuses to import new ontology and epistemology, then very little can be gained from online text research. The Darwinian theory of evolution, ever since the Modern Synthesis, has embraced the replicator-centered point of view when explaining all living phenomena. This has unshackled the theory from the limitations of the traditional individual-centered view of evolution. Memetics is a recent offshoot of the theory of evolution. It views social-cultural change as a process based on the evolution of a cultural unit of selection, the meme. Given memetics' ability to explain social-cultural evolution from the meme's point of view, it is a natural candidate for examining the dynamics of "big" online text data. The first part of this research constructs an online text analysis framework, with testable hypotheses, through the integration of past literature on evolution, social-cultural evolution, memetics and ecology. The second part tests the framework on empirical data. The text corpus used in this research contains 1,761 news reports from the Yahoo! News website on the issue of high school curriculum change. Chinese term segmentation and text clustering algorithms were applied to the corpus in order to extract text quasi-species composed of similar memes. Statistical tests were then used to determine the influence of text characteristics and temporal distribution dynamics on the population of quasi-species. Findings indicate that the population dynamics of text quasi-species were influenced by density dependence.
Text characteristics, such as word length and sentiment, also exert significant influence on the number of comments that each text receives. However, these influences are not equal under different density conditions. The location of the news articles within the website also creates a difference in the number of comments received. Finally, interactions between the temporal distributions of different quasi-species, and between quasi-species and term groups, also yielded significant positive and negative correlations. The results show that memetics is an ideal theoretical platform to connect theory with text mining and analysis methods. It allows for a theory-based approach and the creation of testable hypotheses. Frameworks and methods from evolutionary and ecological research are also applicable under memetics. The empirical findings point to the importance of monitoring the temporal distribution of online text, and to the significance of text characteristics and website environments for text population changes. The results also illustrate the importance of term groups in influencing text population dynamics. Together, these variables and effects are central to understanding changes in online text and comment numbers, and the effect of past text populations on current population changes. Online texts from different websites should also be analyzed separately. This research recommends that future public administration big data analyses continue to adopt the memetic approach. Nevertheless, attention should be given to the strengths and weaknesses of different text mining algorithms and density dependence tests. Big data time series from different websites and with longer temporal spans should also be considered, while social-cultural artifacts other than texts should not be excluded from memetics-based research.
New frameworks must also be constructed to integrate and understand the interactions between important variables, such as text characteristics and environmental influences. Findings on all forms of online data would also be enhanced through comparisons with results from questionnaires designed with memetics in mind.
熊原朗. "Optimization Study of Applying Apache Spark on Plain Text Big Data with Association Rules Operations." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/6b74rz.
Full textNational Changhua University of Education
Department of Computer Science and Information Engineering
107
The amount of plain text generated by humans on the Internet keeps increasing, and service providers use this data to build competitive systems that provide more appropriate services. Among the various big data computing frameworks, it is quite common to use Apache Spark to process plain text data and to use collaborative filtering to build recommendation systems. However, when using Spark for data processing, developers may implement the same text operation with different APIs, which has a considerable impact on performance and efficiency. Moreover, many researchers and medium-sized enterprises run small-scale clusters, while most research on Spark parameter tuning targets large-scale clusters; in small-scale clusters, the parameters and node performance interact differently. This paper provides a performance optimization study for small-scale cluster deployments in the context of applying Spark to association rule operations on plain text big data. Through different APIs and different operating parameters, it compensates for the limited computational power of small-scale clusters to achieve the highest efficiency in a constrained environment. Using the improved implementation of this paper, speed can be increased by up to 3.44 times, and the computation can complete even when the output data size exceeds 3 times the available memory of a single node. After simulating small-scale cluster loads, it was found that using Kryo serialization, the recommended parallelism, and letting Spark allocate core resources itself instead of allocating them manually yields the highest computing performance.
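As a rough illustration of the association-rule operations this thesis runs on Spark, the following is a minimal frequent-pair rule miner in plain Python; the function name, example documents and thresholds are invented for illustration and are not taken from the thesis:

```python
from itertools import combinations
from collections import Counter

def mine_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Mine pairwise association rules (A -> B) from tokenized documents."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        items = set(t)                  # deduplicate tokens per document
        item_counts.update(items)
        pair_counts.update(combinations(sorted(items), 2))
    rules = []
    for (a, b), c in pair_counts.items():
        if c / n < min_support:         # prune infrequent pairs
            continue
        for head, body in ((a, b), (b, a)):
            conf = c / item_counts[head]
            if conf >= min_confidence:
                rules.append((head, body, c / n, conf))
    return rules

docs = [["spark", "hadoop", "cluster"],
        ["spark", "cluster"],
        ["spark", "kryo"],
        ["hadoop", "cluster"]]
rules = mine_rules(docs, min_support=0.5, min_confidence=0.6)
```

In a Spark deployment like the one studied, the same item and pair counting would be distributed across partitions via map/reduce-style transformations rather than computed in a single pass on one machine.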
Veiga, Hugo Miguel Ferrão Casal da. "Text mining e twitter : o poder das redes sociais num mercado competitivo." Master's thesis, 2016. http://hdl.handle.net/10362/17365.
Full textToday, with the mass adoption of social networks, companies spread their message through their communication channels, but consumers give their opinion about it. They argue, opine and criticise (Nardi, Schiano, Gumbrecht, & Swartz, 2004). Positively or negatively. In this context, Text Mining emerges as an interesting approach to the need to obtain knowledge from the existing data. In this work we used a hierarchical Clustering algorithm with the goal of discovering distinct topics in a set of tweets obtained over a given period of time for the companies Burger King and McDonald's. To understand the sentiment associated with these topics, a sentiment analysis was performed on each topic found, using a Bag-of-Words algorithm. We concluded that the Clustering algorithm was able to find topics in the tweets obtained, essentially linked to products and services sold by the companies. The Sentiment Analysis algorithm assigned a sentiment to these topics, making it possible to understand which of the identified products/services obtained a positive or negative polarity, and thus to flag potential problem areas in the companies' strategy, as well as positive situations that may indicate successful operational decisions.
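The Bag-of-Words sentiment step described above can be sketched as a tiny lexicon-based scorer in Python; the lexicon, tweets and function names here are invented for illustration and are not taken from the dissertation:

```python
# Minimal Bag-of-Words sentiment scorer: a tweet's polarity is the
# sum of lexicon scores of its tokens (unknown words score 0).
POSITIVE = {"good", "great", "tasty", "love"}
NEGATIVE = {"bad", "slow", "cold", "hate"}

def score(tweet):
    """+1 per positive token, -1 per negative token."""
    tokens = tweet.lower().split()
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)

def polarity(tweet):
    s = score(tweet)
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"

tweets = ["Love the new fries, tasty", "Service was slow and food cold"]
labels = [polarity(t) for t in tweets]
```

A real pipeline of this kind would first clean the tweets (punctuation, stop words, hashtags) and use a much larger lexicon, which is precisely where the scarcity of non-English lexicons becomes a practical obstacle.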
KUANG, PEI-WEI, and 匡裴暐. "A Study of On-line Tourism News base on Business Intelligence and Text Mining – A Big Data Structure." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/jv5f42.
Full textNational United University
Master's Program, Department of Information Management
105
awareness in Taiwan, demands for travel are growing fast; therefore, people have paid more attention to tourism-related news. In addition, because of the booming development of internet technology, the amount of data is increasing dramatically, and traditional database structures are insufficient for problems involving big data. This study employs the concepts of text mining and business intelligence to analyze and predict on-line tourism news on a Hadoop-based big data structure. First, a text mining approach is utilized to analyze the content of the on-line tourism news. Correlation analysis and an association rule algorithm are adopted to analyze the relationships among the content of the news and the numbers of "Click", "Share" and "Hashtag". Then, a genetic-based ensemble method consisting of ordinal logistic regression, support vector machine and decision tree algorithms is developed to predict the numbers of "Click" and "Share" of on-line tourism news, and the number of domestic tourists. The results show that the proposed approach and structure can increase hit rates and computational efficiency.
CHEN, SZU-LING, and 陳思伶. "Using Big Data and Text Analytics to Understand How Customer Experiences Posted on Yelp.com Impact the Hospitality Industry." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/67055472370562629103.
Full textNational Taipei University
Department of Business Administration
104
Nowadays, the E-commerce systems used by major Internet organizations, such as Google, Amazon and expedia.com, include highly scalable E-commerce platforms and social media platforms. These companies try to make use of web data that is less structured but rich in customer views and behavioral information. However, the use of such unstructured data to generate business value is still under-researched. This paper focuses on exploring the value of customer reviews posted on social media platforms in the hospitality industry by using big data analytic techniques. We aim to find the keywords that can help customers find a suitable hotel. To be more specific, this study combines programming skills and applies data mining approaches to analyze a large number of consumer reviews extracted from Yelp.com, to deconstruct the hotel guest experience and uncover textual patterns that can be applied when searching for or booking hotels. More importantly, the new approach we use in this study makes it possible to utilize big data analytics to find perspectives that might not have been studied in the existing hospitality literature. Moreover, it serves as a basis for further research on unstructured data in the E-commerce and hospitality industries.
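One common way to surface review keywords of the kind this study looks for is TF-IDF weighting; the following plain-Python sketch uses invented reviews and function names, and is only an illustrative stand-in for the study's actual data mining pipeline:

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_n=2):
    """Rank each document's terms by TF-IDF; return the top terms per doc."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()                       # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    results = []
    for toks in tokenized:
        tf = Counter(toks)
        # term frequency weighted by inverse document frequency
        scores = {t: (c / len(toks)) * math.log(n / df[t])
                  for t, c in tf.items()}
        results.append(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return results

reviews = ["clean room friendly staff",
           "noisy room thin walls",
           "friendly staff great breakfast"]
keywords = tfidf_keywords(reviews)
```

Terms that appear in every review (here "room", "friendly", "staff") are down-weighted, so the distinctive words of each review ("clean", "noisy", "breakfast") rise to the top, which is the behaviour a hotel-search keyword extractor wants.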