Theses: « Sentient Machine »

1

OGURI, PEDRO. « MACHINE LEARNING FOR SENTIMENT CLASSIFICATION ». PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO, 2006. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=9947@1.

Abstract:

COORDENAÇÃO DE APERFEIÇOAMENTO DO PESSOAL DE ENSINO SUPERIOR
Sentiment Analysis is a text categorization problem in which we want to identify favorable and unfavorable opinions towards a given topic. Examples of such topics are organizations and their products. In this problem, documents are classified according to their sentiment, connotation, attitudes and opinions instead of being limited to the facts described in them. The main challenge in Sentiment Classification is identifying how sentiments are expressed in texts and whether they indicate a positive (favorable) or negative (unfavorable) opinion towards a topic. Due to the growing volume of information available online, in an environment where we all tend to be content generators and express opinions on a variety of subjects, Machine Learning techniques have become more and more attractive. In this dissertation, we investigate Machine Learning methods applied to Sentiment Analysis. We present document representation models such as bag-of-words and N-grams. We compare the performance of the Naive Bayes and Support Vector Machine (SVM) classifiers for each proposed model.
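
The two representation models named in this abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the whitespace tokenizer and the function names `tokenize`, `ngrams` and `bag_of_ngrams` are assumptions chosen for illustration only.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer, assumed here)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, n=1):
    """Count n-gram occurrences; n=1 gives the classic bag-of-words."""
    return Counter(ngrams(tokenize(text), n))


doc = "not good not bad"
unigrams = bag_of_ngrams(doc, 1)  # word order is discarded
bigrams = bag_of_ngrams(doc, 2)   # keeps local context such as ("not", "good")
```

The bigram counts keep "not good" and "not bad" apart, which a plain bag-of-words cannot do. This is one reason N-gram features are often tried for sentiment tasks.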

2

Alotaibi, Saud Saleh. « Sentiment analysis in the Arabic language using machine learning ». Thesis, Colorado State University, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3720340.

Abstract:

Sentiment analysis has recently become one of the growing areas of research related to natural language processing and machine learning. Many opinions and sentiments about specific topics are available online, which allows several parties, such as customers, companies and even governments, to explore these opinions. The first task is to classify the text according to whether it expresses opinion or factual information. Polarity classification is the second task, which distinguishes between the polarities (positive, negative or neutral) that sentences may carry. The analysis of natural language text for the identification of subjectivity and sentiment has been well studied for the English language. Conversely, the work that has been carried out for Arabic remains in its infancy; thus, more cooperation is required between research communities in order for them to offer a mature sentiment analysis system for Arabic. There are recognized challenges in this field, some of which are inherited from the nature of the Arabic language itself, while others derive from the scarcity of tools and sources.

This dissertation provides the rationale behind the current work and proposes methods to enhance the performance of sentiment analysis in the Arabic language. The first step is to increase the resources that help in the analysis process; the most important part of this task is to have annotated sentiment corpora. Several free corpora are available for the English language, but these resources are still limited in other languages, such as Arabic. This dissertation describes the work undertaken by the author to enrich sentiment analysis in Arabic by building a new Arabic Sentiment Corpus. The data is labeled not only with two polarities (positive and negative); the neutral sentiment is also used during the annotation process.

The second step includes the proposal of features that may capture sentiment orientation in the Arabic language, as well as the use of different machine learning classifiers that may work better and capture non-linearity in a morphologically rich and highly inflectional language such as Arabic. Different types of features are proposed to capture different aspects and characteristics of Arabic: morphological, semantic, and stylistic features are proposed and investigated. With regard to the classifier, the performance of linear and nonlinear machine learning approaches was compared. The results are promising for the continued use of nonlinear ML classifiers for this task. Learning knowledge from a particular dataset domain and applying it to a different domain is a useful method when resources are limited, as is the case with the Arabic language. This dissertation shows and discusses the possibility of applying cross-domain learning in the field of Arabic sentiment analysis. It also indicates the feasibility of using different mechanisms of the cross-domain method.

Other work in this dissertation includes the exploration of the effect of negation in Arabic subjectivity and polarity classification. Negation word lists were devised to help in this and other natural language processing tasks. These words cover both types of Arabic, Modern Standard Arabic and some dialects. Two methods of dealing with negation in Arabic sentiment analysis were proposed. The first method is based on a static approach that treats every sentence containing negation words as a negated sentence. To determine the effect of negation, different techniques were proposed, using different word window sizes or base phrase chunks. The second approach depends on a dynamic method that needs an annotated negation dataset in order to build a model that can determine whether or not the sentence is negated by the negation words and establish the effect of the negation on the sentence. The results achieved by adding negation to Arabic sentiment analysis were promising and indicate that negation has an effect on this task. Finally, the experiments and evaluations conducted in this dissertation encourage researchers to continue in this direction of research.
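
The static, window-based treatment of negation described above can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's method: the English negation list `NEGATION_WORDS`, the `NOT_` prefix convention and the function names are all hypothetical stand-ins for the Arabic resources the author built.

```python
# Hypothetical English stand-in for the Arabic negation word lists.
NEGATION_WORDS = {"not", "no", "never"}


def is_negated(tokens):
    """Static rule: a sentence counts as negated if it contains any negation word."""
    return any(tok in NEGATION_WORDS for tok in tokens)


def mark_negation(tokens, window=2):
    """Prefix 'NOT_' to up to `window` tokens that follow each negation word."""
    marked = []
    remaining = 0  # tokens still inside the current negation window
    for tok in tokens:
        if tok in NEGATION_WORDS:
            marked.append(tok)
            remaining = window  # (re)open the window after the negation word
        elif remaining > 0:
            marked.append("NOT_" + tok)
            remaining -= 1
        else:
            marked.append(tok)
    return marked


sentence = "i do not like this film at all".split()
marked = mark_negation(sentence, window=2)
```

Changing `window` corresponds to the different word window sizes the abstract mentions. The chunk-based variant would replace the fixed count with the boundary of the enclosing phrase.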

3

Paknejad, Sepideh. « Sentiment classification on Amazon reviews using machine learning approaches ». Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-233551.

Abstract:

As online marketplaces have become popular over the past decades, online sellers and merchants ask their purchasers to share their opinions about the products they have bought. As a result, millions of reviews are generated daily, which makes it difficult for a potential consumer to make a good decision on whether to buy a product. Analyzing this enormous number of opinions is also hard and time-consuming for product manufacturers. This thesis considers the problem of classifying reviews by their overall semantic orientation (positive or negative). To conduct the study, two different supervised machine learning techniques, SVM and Naïve Bayes, were applied to reviews of beauty products from Amazon. Their accuracies were then compared. The results showed that the SVM approach outperforms the Naïve Bayes approach when the data set is bigger. However, both algorithms reached promising accuracies of at least 80%.

4

WESTLING, ANDERS. « Sentiment Analysis of Microblog Posts from a Crisis Event using Machine Learning ». Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-138428.

Abstract:

With social media services becoming more and more popular, there now exists a constant stream of opinions publicly available on the Internet. These opinions can be analyzed to find the users' sentiments towards various topics. One example of interest is to see how people are feeling during a crisis situation, to get a better understanding of what kind of help would be most useful at the moment. The goal of this degree project has been to see if it is possible to create an automatic classifier, based on machine learning techniques, that can accurately determine whether a microblog post written during a political event in Russia is for, against, or neutral towards the group of people at the center of the event. Because microblog texts are short and often written in informal language, the problem is expected to be more difficult than sentiment analysis of normal-length texts. A number of different machine learning algorithms were studied, along with different ways to convert the microblog texts into a representation that can be used by the classifier algorithms. The most promising of these algorithms and representations were implemented and tested to see if an accurate classifier could be obtained. The results show that the algorithms are not good enough to create a sufficiently accurate classifier with the training data used. One major factor is believed to be the small size of the training data set. A better classifier could potentially be achieved by training the classifier with more microblog posts. It is also of interest to examine other sentiment classifications of microblog posts, since the one used in this project is believed to be especially difficult. This study and previous research on similar classifications suggest that this is a difficult problem that requires more work if an accurate classifier is to be obtained.

5

Erogul, Umut. « Sentiment Analysis In Turkish ». Master's thesis, METU, 2009. http://etd.lib.metu.edu.tr/upload/12610616/index.pdf.

Abstract:

Sentiment analysis is the automatic classification of a text, trying to determine the attitude of the writer with respect to a specific topic. The attitude may be their judgment or evaluation, their feelings, or the intended emotional communication. The recent increase in the use of review sites and blogs has made a great amount of subjective data available. Nowadays, it is nearly impossible to manually process all the relevant data available, and as a consequence, the importance given to the automatic classification of unformatted data has increased. To date, all of the research carried out on sentiment analysis has focused on the English language. In this thesis, two Turkish datasets tagged with sentiment information are introduced, and existing methods for English are applied to these datasets. This thesis also suggests new methods for Turkish sentiment analysis.

6

Di Gennaro, Pierluigi. « Due approcci alla sentiment polarity classification di tweet per la lingua italiana ». Master's thesis, Alma Mater Studiorum - Università di Bologna, 2017. http://amslaurea.unibo.it/13270/.

Abstract:

This thesis aims to provide a broad overview of the current state of the art in sentiment analysis research, presenting the methodologies, techniques and applications developed in recent years, and to present the concrete implementations (and the results obtained) of two different systems for sentiment polarity classification of Italian-language tweets. The first system (FICLIT+CS@Unibo System) takes an approach based on semantic orientation, building and using an annotated lexicon and propagating polarity along syntactic trees, while the second uses stochastic/statistical machine learning algorithms to build a generalized model for sentiment classification from an annotated training set.

7

Vaswani, Vishwas. « Predicting sentiment-mention associations in product reviews ». Thesis, Kansas State University, 2012. http://hdl.handle.net/2097/13714.

Abstract:

Master of Science
Department of Computing and Information Sciences
Doina Caragea
With the rising trend in social networking, more people express their opinions on the web. As a consequence, there has been an increase in the number of blogs where people write reviews about the products they buy or the services they experience. These reviews can be very helpful to other potential customers who want to know the pros and cons of a product, and also to manufacturers who want feedback from customers about their products. Sentiment analysis of online data (such as review blogs) is a rapidly growing field of research in Machine Learning, which can leverage online reviews and quickly extract the sentiment of a whole blog. The accuracy of a sentiment analyzer relies heavily on correctly identifying associations between a sentiment (opinion) word and the targeted mention (token or object) in blog sentences. In this work, we focus on the task of automatically identifying sentiment-mention associations; in other words, we identify the target mention that is associated with a sentiment word in a sentence. Support Vector Machines (SVM), a supervised machine learning algorithm, were used to learn classifiers for this task. Syntactic and semantic features extracted from sentences were used as input to the SVM algorithm. The dataset used in this work contains reviews from the car and camera domains. The work is divided into two phases. In the first phase, we learned domain-specific classifiers for the car and camera domains, respectively. To further improve the predictions of the domain-specific classifiers, we investigated the use of transfer learning techniques in the second phase. More precisely, the goal was to use knowledge from a source domain to improve predictions for a target domain. We considered two transfer learning approaches: a feature-level fusion approach and a classifier-level fusion approach. Experimental results show that transfer learning can help to improve on the predictions made using the domain-specific classifier approach.
While both the feature-level and classifier-level fusion approaches were shown to improve the prediction accuracy, the classifier-level fusion approach gave better results.

8

Svensson, Kristoffer. « Sentiment Analysis With Convolutional Neural Networks : Classifying sentiment in Swedish reviews ». Thesis, Linnéuniversitetet, Institutionen för datavetenskap (DV), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-64768.

Abstract:

Today many companies exist and market their products and services on social media, and may therefore receive reviews and thoughts from their end users directly on these platforms. Reading every text by hand can be time-consuming, so analysing the sentiment of all texts gives companies an overview of how positive or negative users are about a specific subject. Sentiment analysis is a feature that Beanloop AB is interested in implementing in its future projects, and the research problem of this thesis was to investigate how deep learning could be used for this task. This was done by conducting an experiment with deep learning and neural networks. Several convolutional neural network models were implemented with different settings to find the combination of settings that gave the highest accuracy on the given test dataset. There were two kinds of models: one classified texts as positive or negative, and the other classified them as positive, negative or neutral. The training dataset and the test dataset contained data from two recommendation sites, www.reco.se and se.trustpilot.com. The final result shows that when classifying three categories (positive, negative and neutral), the models had difficulty reaching an accuracy of 85%, and only one model reached 80% accuracy at best on the test dataset. However, when classifying only two categories (positive and negative), the models showed very good results, with every model reaching almost 95% accuracy.

9

CAMBA, GIACOMO. « Machine Learning in Social Media Sentiment Classification and Trading Strategy Design ». Doctoral thesis, Università degli Studi di Cagliari, 2022. http://hdl.handle.net/11584/333407.

Abstract:

The goal of this thesis is to build a trading strategy that jointly uses quantitative and qualitative sentiment variables. In particular, we want to see whether we can improve the equity line of a trading bot when it is trained in a trading environment that includes sentiment variables and attention measures in addition to price and volume variables. Our target market is the US stock market, and in particular the S&P 500. As a proxy for equity investors' attention, we use the S&P 500 Google Search Volume Index downloaded from Google Trends, while the sentiment variable is built from textual data from the four main financial social media. The text corpus includes the tweets posted on StockTwits and Twitter and the comments published on the Yahoo Finance and Investing message boards concerning the ticker of the American stock index and its ETF. The downloaded messages number over 5.7 million and cover a period of 15 years, from 2006 to 2021. 32% of this data has been labeled by users as bullish or bearish, while the remainder is unlabeled. We therefore had to find the best sentiment classifier and use it to label the messages that lacked a label, as we wanted our sentiment variable to include the full amount of data collected. To do this, we applied the two main financial sentiment analysis approaches to the labeled data, namely the lexicon approach and the machine learning approach. After testing the classification skills of 16 of the main financial and non-financial sentiment lexicons and verifying their poor performance, we had to adopt the machine learning strategy.
This meant, first of all, establishing the best word embedding techniques, distinguishing between frequentist and probabilistic methods; then comparing different unsupervised learning algorithms to understand whether data dimensionality could be reduced without losing the most valuable information; and finally testing the classification capabilities of the most advanced machine learning models in the field of textual data classification. Supervised model training included exhaustive parameter search via 5-fold cross-validation for simpler models and random parameter search for more complex models. Ultimately, we find that the best sentiment classifier on our data is the LSTM model, with a test accuracy of 77%. After employing it to label the unlabeled data, we were able to build a sentiment variable expressing investors' bullish and/or bearish moods. Subsequently, the sentiment and attention variables were combined with the price and volume data of the US stock market ETF to create a reinforcement learning environment in which to train our agent. Across several tests, we find that our agent achieves a significantly higher return when the sentiment and attention variables are also included in the RL environment.

10

CAPUA, M. DI. « A DEEP LEARNING APPROACH FOR SENTIMENT ANALYSIS ». Doctoral thesis, Università degli Studi di Milano, 2017. http://hdl.handle.net/2434/467844.

Abstract:

Sentiment Analysis refers to the process of computationally identifying and categorizing opinions expressed in a piece of text, in order to determine whether the writer's attitude towards a particular topic or product is positive, negative, or neutral. The views expressed and their related concepts, such as feelings, judgments, and emotions, have recently become a subject of study and research in both academic and industrial areas. Unfortunately, language comprehension of user comments, especially in social networks, is inherently complex for computers. The ways in which humans express themselves in natural language are nearly unlimited, and informal texts are riddled with typos, misspellings, slang, badly constructed syntax and platform-specific symbols (e.g. hashtags on Twitter), which considerably complicate this task. Recently, deep learning approaches have been emerging as powerful computational models that discover intricate semantic representations of texts automatically from data, without hand-made feature engineering. These approaches have improved the state of the art in many Sentiment Analysis tasks, including sentiment classification of sentences or documents and sentiment lexicon learning, and also in more complex problems such as cyber bullying detection. The contributions of this work are twofold.
First, related to the general Sentiment Analysis problem, we propose a semi-supervised neural network model, based on Deep Belief Networks, able to deal with data uncertainty in text sentences, with particular reference to the Italian language. We test this model against several datasets from the literature related to movie reviews, adopting a vectorized representation of text (Word2Vec) and exploiting pre-processing methods from Natural Language Processing (NLP). Second, assuming that the cyber bullying phenomenon can be treated as a particular Sentiment Analysis problem, we propose an unsupervised approach to automatic cyber bullying detection in social networks, based both on a Growing Hierarchical Self Organizing Map (GHSOM) and on a new specific features model, showing that our solution can achieve interesting results with respect to classical supervised approaches.

11

Salah, Zaher. « Machine learning and sentiment analysis approaches for the analysis of Parliamentary debates ». Thesis, University of Liverpool, 2014. http://livrepository.liverpool.ac.uk/19793/.

Abstract:

In this thesis the author seeks to establish the most appropriate mechanism for conducting sentiment analysis with respect to political debates; firstly so as to predict their outcome and secondly to support a mechanism for the visualisation of such debates in the context of further analysis. To this end two alternative approaches are considered, a classification-based approach and a lexicon-based approach. In the context of the second approach both generic and domain-specific sentiment lexicons are considered. Two techniques for generating domain-specific sentiment lexicons are also proposed: (i) direct generation and (ii) adaptation. The first is founded on the idea of generating a dedicated lexicon directly from labelled source data. The second is founded on the idea of taking an existing general-purpose lexicon and adapting it so that it becomes a specialised lexicon with respect to some domain. The operation of both the generic and domain-specific sentiment lexicons is compared with the classification-based approach. The comparison between the potential sentiment mining approaches was conducted by predicting the attitude of individual debaters (speakers) in political debates (using a corpus of labelled political speeches extracted from political debate transcripts taken from the proceedings of the UK House of Commons). The reported comparison indicates that the attitude of speakers can be effectively predicted using sentiment mining. The author then goes on to propose a framework, the Debate Graph Extraction (DGE) framework, for extracting debate graphs from transcripts of political debates. The idea is to represent the structure of a debate as a graph with speakers as nodes and “exchanges” as links. Links between nodes were established according to the exchanges between the speeches.
Nodes were labelled according to the “attitude” (sentiment) of the speakers, “positive” or “negative”, using one of the three proposed sentiment mining approaches. The attitude of the speakers was then used to label the graph links as being either “supporting” or “opposing”. If both speakers had the same attitude (both “positive” or both “negative”) the link was labelled as being “supporting”; otherwise the link was labelled as being “opposing”. The resulting graphs capture the abstract representation of a debate where two opposing factions exchange arguments on related content. Finally, the author moves on to discuss mechanisms whereby debate graphs can be structurally analysed using network mathematics and community detection techniques. To this end the debate graphs were conceptualised as networks in order to conduct appropriate network analysis. The significance is that network mathematics and community detection processes can draw conclusions about the general properties of debates in parliamentary practice through the exploration of the embedded patterns of connectivity and reactivity between the exchanging nodes (speakers).
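
The link-labelling rule stated in this abstract (same attitude gives "supporting", different attitudes give "opposing") can be written down directly. The sketch below is only an illustration of that rule: the function name `label_links`, the data layout and the speaker names are assumptions, not part of the DGE framework itself.

```python
def label_links(attitudes, exchanges):
    """Label each exchange between two speakers.

    attitudes: dict mapping speaker -> "positive" or "negative" (node labels)
    exchanges: list of (speaker_a, speaker_b) pairs (graph links)
    """
    links = []
    for a, b in exchanges:
        # Same attitude on both ends -> supporting; otherwise -> opposing.
        label = "supporting" if attitudes[a] == attitudes[b] else "opposing"
        links.append((a, b, label))
    return links


# Hypothetical speakers and exchanges, for illustration only.
attitudes = {"A": "positive", "B": "negative", "C": "positive"}
exchanges = [("A", "B"), ("A", "C"), ("B", "C")]
links = label_links(attitudes, exchanges)
```

The resulting labelled edge list is the kind of structure that network analysis or community detection tools could then take as input.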

12

Manda, Kundan Reddy. « Sentiment Analysis of Twitter Data Using Machine Learning and Deep Learning Methods ». Thesis, Blekinge Tekniska Högskola, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-18447.

Abstract:

Background: Twitter, Facebook, WordPress and similar platforms are major channels of information exchange in today's world. Tweets mostly express public opinion on a product, event or topic, and thus constitute large volumes of unprocessed data. Synthesising and analysing this data is important but difficult because of the size of the dataset. Sentiment analysis is chosen as the method for analysing this data because it does not go through every tweet in full but instead relates the tweets to sentiments expressed as positive, negative and neutral opinions. Sentiment analysis is normally performed in one of three ways: a machine learning-based approach, a sentiment lexicon-based approach, or a hybrid approach. The machine learning-based approach uses machine learning and deep learning algorithms to analyse the data, whereas the sentiment lexicon-based approach uses lexicons that contain vocabularies of positive and negative words. The hybrid approach combines the machine learning and sentiment lexicon approaches for classification. Objectives: The primary objectives of this research are to identify the algorithms and metrics for evaluating the performance of machine learning classifiers, and to compare the metrics obtained from the identified algorithms as a function of dataset size, in order to find the best-suited algorithm for sentiment analysis. Method: The method chosen to address the research questions is an experiment, in which the identified algorithms are evaluated with the selected metrics. Results: The identified machine learning algorithms are Naïve Bayes, Random Forest and XGBoost, and the deep learning algorithm is CNN-LSTM. The algorithms are evaluated and compared with respect to precision, accuracy, F1 score and recall. The CNN-LSTM model is best suited for sentiment analysis of Twitter data for the selected dataset size.
Conclusion: The analysis of the results achieves the aim of this research, which is to identify the best-suited algorithm for sentiment analysis of Twitter data for the selected dataset. The CNN-LSTM model has the highest accuracy, 88%, among the selected algorithms.
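The four metrics named in the abstract above can be illustrated with a minimal sketch. It is not taken from the thesis: the function name `binary_metrics`, the label strings and the one-vs-rest treatment of the "positive" class are assumptions made for illustration only.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Accuracy plus precision, recall and F1 for one class (one-vs-rest)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical gold labels and predictions for four tweets.
gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "positive", "neutral", "negative"]
scores = binary_metrics(gold, pred)
```

For a three-class task such as positive/negative/neutral, the same function can be called once per class and the per-class values averaged; whether the thesis averaged in this way is not stated in the abstract.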

13

Heeley, Robert. « A hybrid machine learning approach to measuring sentiment, credibility and influence on Twitter ». Thesis, City, University of London, 2017. http://openaccess.city.ac.uk/20213/.

Full text

Abstract:

Current sentiment analysis on Twitter is hampered by two factors: not all accounts are genuine, and not all users have the same level of influence. Including non-credible and irrelevant Tweets in sentiment analysis dilutes the effectiveness of any sentiment produced. Similarly, counting a Tweet with a potential audience of 10 users as having the same impact as a Tweet that could reach 1 million users does not accurately reflect its importance. To mitigate these inherent problems, a novel method was devised to account for credibility and to measure influence. The current definition of credibility on Twitter was redefined and expanded to incorporate the subtle nuances that exist beyond the simple distinction between human and bot accounts. Once basic sentiment was produced, it was filtered by removing non-credible Tweets, and the remaining sentiment was augmented by weighting it according to both the user's and the Tweet's influence scores. Measuring one person's opinion is costly and lacking in power; machine learning techniques, however, allow us to capture and analyse millions of opinions. Combining a Tweet's sentiment with the user's influence score and credibility rating greatly increases the understanding and usefulness of that sentiment. To gauge the impact of this research and highlight its generalisability, this thesis examined two distinct real-world datasets, the UK General Election 2015 and the Rugby World Cup 2015, which also served to validate the approach used. A better, more accurate understanding of sentiment on Twitter has the potential for broad impact, from providing targeted advertising that is in tune with people's needs and desires to giving governments a better understanding of the will and desire of the people.

14

Sundaram, Arun C. « A Comparison of Machine Learning Techniques on Automated Essay Grading and Sentiment Analysis ». The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1428460405.

Full text

15

Blomkvist, Oscar. « Machine Learning Based Sentiment Classification of Text, with Application to Equity Research Reports ». Thesis, KTH, Matematisk statistik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-257506.

Full text

Abstract:

In this thesis, we analyse the sentiment, or attitude, in equity research reports written by analysts at Skandinaviska Enskilda Banken (SEB). We provide a description of established statistical and machine learning methods for classifying the sentiment in text documents as positive or negative. Specifically, a form of recurrent neural network known as long short-term memory (LSTM) is of interest. We investigate two different labelling regimes for generating training data from the reports. Benchmark classification accuracies are obtained using logistic regression models. Finally, two different word embedding models and bidirectional LSTMs of varying network size are implemented and compared to the benchmark results. We find that the logistic regression works well for one of the labelling approaches, and that the best LSTM models outperform it slightly.

16

Poria, Soujanya. « Novel symbolic and machine-learning approaches for text-based and multimodal sentiment analysis ». Thesis, University of Stirling, 2017. http://hdl.handle.net/1893/25396.

Full text

Abstract:

Emotions and sentiments play a crucial role in our everyday lives. They aid decision-making, learning, communication, and situation awareness in human-centric environments. Over the past two decades, researchers in artificial intelligence have been attempting to endow machines with cognitive capabilities to recognize, infer, interpret and express emotions and sentiments. All such efforts can be attributed to affective computing, an interdisciplinary field spanning computer science, psychology, social sciences and cognitive science. Sentiment analysis and emotion recognition have also become a new trend in social media, helping users understand the opinions being expressed on different platforms on the web. In this thesis, we focus on developing novel methods for text-based sentiment analysis. As an application of the developed methods, we employ them to improve multimodal polarity detection and emotion recognition. Specifically, we develop innovative text- and visual-based sentiment-analysis engines and use them to improve the performance of multimodal sentiment analysis. We begin by discussing the challenges involved in both text-based and multimodal sentiment analysis. Next, we present a number of novel techniques to address these challenges. In particular, in the context of concept-based sentiment analysis, a paradigm of increasing interest, it is important to identify concepts in text; accordingly, we design a syntax-based concept-extraction engine. We then exploit the extracted concepts to develop a concept-based affective vector space, which we term EmoSenticSpace. We use this space for deep learning-based sentiment analysis, in combination with our novel linguistic pattern-based affective reasoning method, termed sentiment flow. Finally, we integrate all our text-based techniques and combine them with a novel deep learning-based visual feature extractor for multimodal sentiment analysis and emotion recognition. Comparative experimental results on a range of benchmark datasets demonstrate the effectiveness of the proposed approach.

17

Angioni, Consuelo. « Machine learning e fattore umano nella sentiment analysis. Il caso Starbucks a Milano ». Master's Degree Thesis, Università Ca' Foscari Venezia, 2017. http://hdl.handle.net/10579/10577.

Full text

Abstract:

Development of a machine learning-based sentiment analysis model and its application to a dataset of 30,000 observations from Twitter. Comparison of the model with other sentiment analysis approaches. Considerations on the evolution of sentiment analysis into opinion mining.

18

Memari, Majid. « Predicting the Stock Market Using News Sentiment Analysis ». OpenSIUC, 2018. https://opensiuc.lib.siu.edu/theses/2442.

Full text

Abstract:

Abstract of the thesis of Majid Memari, for the Master of Science degree in Computer Science, presented on November 3rd, 2017, at Southern Illinois University, Carbondale, IL. Title: Predicting the Stock Market Using News Sentiment Analysis. Major Professor: Dr. Norman Carver.
Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. GDELT is the largest, most comprehensive, and highest resolution open database ever created. It is a platform that monitors the world's news media from nearly every corner of every country in print, broadcast, and web formats, in over 100 languages, every moment of every day, stretching all the way back to January 1st, 1979, and it updates daily [1]. Stock market prediction is the act of trying to determine the future value of a company stock or other financial instrument traded on an exchange. The successful prediction of a stock's future price could yield significant profit. The efficient-market hypothesis suggests that stock prices reflect all currently available information, and that any price changes not based on newly revealed information are therefore inherently unpredictable [2]. On the other hand, other studies show that the market is predictable. Stock market prediction has long been an attractive topic and is extensively studied by researchers in different fields, with numerous studies of the correlation between stock market fluctuations and different data sources derived from the historical data of major world stock indices or from external information such as social media and news [6]. The main objective of this research is to investigate the accuracy of predicting the unseen prices of the Dow Jones Industrial Average using information derived from the GDELT database. The Dow Jones Industrial Average (DJIA) is a stock market index, one of several indices created by Wall Street Journal editor and Dow Jones & Company co-founder Charles Dow.
This research is based on data sets of events from the GDELT database and daily prices of the DJI from Yahoo Finance, all from March 2015 to October 2017. First, several machine learning classification models are applied to the generated datasets, followed by several ensemble methods. In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Afterwards, the performance of each model is evaluated using the optimized parameters. Finally, experimental results show that using ensemble methods has a significant positive impact on prediction accuracy.
Keywords: Big Data, GDELT, Stock Market, Prediction, Dow Jones Index, Machine Learning, Ensemble Methods

19

Larsson, Martin, and Samuel Ljungberg. « Readability : Man and Machine : Using readability metrics to predict results from unsupervised sentiment analysis ». Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-301842.

Full text

Abstract:

Readability metrics assess the ease with which human beings read and understand written texts. With the advent of machine learning techniques that allow computers to also analyse text, this provides an interesting opportunity to investigate whether readability metrics can be used to inform on the ease with which machines understand texts. To that end, the specific machine analysed in this paper uses word embeddings to conduct unsupervised sentiment analysis. This specification minimises the need for labelling and human intervention, thus relying heavily on the machine instead of the human. Across two different datasets, sentiment predictions are made using Google's Word2Vec word embedding algorithm and are compared with the correct answers to produce a dichotomous output variable per sentiment. This variable, representing whether a prediction is correct or not, is then used as the dependent variable in a logistic regression with 17 readability metrics as independent variables. The resulting model has high explanatory power, and the effects of readability metrics on the results from the sentiment analysis are mostly statistically significant. However, metrics affect sentiment classification in the two datasets differently, indicating that the metrics are expressions of linguistic behaviour unique to the datasets. The implication of the findings is that readability metrics could be used directly in sentiment classification models to improve modelling accuracy. Moreover, the results also indicate that machines are able to pick up on information that human beings do not pick up on, for instance that certain words are associated with more positive or negative sentiments.

20

Pereira, Vinicius Gomes. « Using supervised machine learning and sentiment analysis techniques to predict homophobia in portuguese tweets ». Repositório Institucional do FGV, 2018. http://hdl.handle.net/10438/24301.

Full text

Abstract:

This work studies the identification of homophobic tweets using a natural language processing and machine learning approach. The goal is to construct a predictive model that can detect, with reasonable accuracy, whether or not a tweet contains content offensive to LGBT individuals. The database used to train the predictive models was constructed by aggregating tweets from users who have interacted with politicians and/or political parties in Brazil. Tweets containing LGBT-related terms or references to openly LGBT individuals were collected and manually classified. A large part of this work lies in constructing features that accurately capture not only the text of the tweet but also specific characteristics of the users and of colloquial expressions in Portuguese. In particular, the use of swear words and strong vocabulary is a strong predictor of offensive tweets. Naturally, n-grams and term-weighting schemes were also considered as features of the model. A total of 12 sets of features were constructed. A broad range of machine learning techniques was employed in the classification task: naive Bayes, regularized logistic regressions, feedforward neural networks, extreme gradient boosting (XGBoost), random forest and support vector machines. After each model was estimated and tuned, the models were combined using voting and stacking. Voting using 10 models obtained the best result, with 89.42% accuracy.
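The voting step mentioned in the abstract above can be sketched in a few lines. This is an illustrative hard (majority) vote under stated assumptions, not the author's code: the abstract does not say whether hard or soft voting was used, and the function name `majority_vote` and the toy predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine class labels from several models by simple majority.

    predictions_per_model: one list of labels per model, all of equal length.
    Ties are not handled specially here; use an odd number of models for
    binary labels to avoid them.
    """
    n_items = len(predictions_per_model[0])
    combined = []
    for i in range(n_items):
        votes = Counter(model_preds[i] for model_preds in predictions_per_model)
        combined.append(votes.most_common(1)[0][0])
    return combined


# Hypothetical outputs of three classifiers on four tweets
# (1 = offensive, 0 = not offensive).
model_outputs = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
ensemble = majority_vote(model_outputs)
```

Stacking, the other combination method named in the abstract, would instead train a second-level model on the base models' outputs; it is not shown here.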

21

Dettori, Emilio. « Sentiment Analysis per la moderazione di una community ». Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2018.

Find the full text

Abstract:

After introducing the concepts of Machine Learning and Natural Language Processing, this thesis describes the Sentiment Analysis process and its applications. The goal of the thesis project was to study a system for the automatic moderation of text in an online community, built using the three techniques described. In particular, the project aims to perform a lexical analysis of the text on a purpose-built corpus, and then to develop Machine Learning algorithms able to learn from it. Examples and use cases are shown for each technology analysed.

22

Torchi, Andrea. « Sperimentazioni per "Sentiment Analysis" tramite Reti Neurali Profonde ». Master's thesis, Alma Mater Studiorum - Università di Bologna, 2020.

Find the full text

Abstract:

This thesis concerns the design and implementation of machine learning software that performs a Sentiment Analysis task; in particular, the work was carried out as part of a project commissioned to the company where I did my internship. As the data source from which to start and on which to base the project, I chose several social networks, because of the large amount of data they offer. Before and during the thesis work I studied the principles of machine learning and, in particular, of neural networks, concepts that were later applied in building the networks that carry out the Sentiment Analysis task.

23

Zhang, Jun. « Sentiment analysis of movie reviews in Chinese ». Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-412670.

Full text

Abstract:

Sentiment analysis aims to determine the opinions of users towards a certain service or product. In this research, the aim is to classify the sentiments of users based on the comments they have posted on the Douban movie website. In this thesis, I try two different ways of classifying the sentiments: the first classifies comments into five classes, the ratings 1 to 5, and the second classifies comments into three classes: negative, neutral and positive. For the latter, ratings of 1 and 2 are grouped as negative, ratings of 3 as neutral, and ratings of 4 and 5 as positive. First, Term Frequency-Inverse Document Frequency (TF-IDF) is used as the feature extraction technique for the machine learning algorithms. Chi-square and mutual information are used for feature selection. The selected features are fed into different machine learning methods: Logistic Regression, Linear SVC, SGD classifier and Multinomial Naive Bayes. The performance of models with feature selection is compared with the performance of models without feature selection, for both 5-class and 3-class classification. In addition, fastText and Skip-Gram are used as embedding methods for the deep learning algorithms LSTM and BiLSTM. FastText is also used both as an embedding method and as a classifier. The aim is to compare different machine learning and deep learning algorithms using different vectorization methods, to see which model performs best on both 5-class and 3-class classification. The two classification strategies are also compared with each other through error analysis, in order to identify the similarities and differences in the misclassifications they make.
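The grouping of five ratings into three classes described above is simple enough to state as code. The following is a minimal sketch of that mapping only; the function name `rating_to_sentiment` is hypothetical and nothing else about the thesis pipeline is implied.

```python
def rating_to_sentiment(rating):
    """Map a 1-5 rating to the 3-class scheme given in the abstract."""
    if rating in (1, 2):
        return "negative"
    if rating == 3:
        return "neutral"
    if rating in (4, 5):
        return "positive"
    raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")


# Relabel a hypothetical list of 5-class ratings for the 3-class experiment.
ratings = [5, 3, 1, 4, 2]
labels = [rating_to_sentiment(r) for r in ratings]
```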

24

Ahmed, Kachkach. « Analyzing user behavior and sentiment in music streaming services ». Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-186527.

Full text

Abstract:

In recent years, streaming services (for music, podcasts, TV shows and movies) have been in the spotlight for disrupting traditional media consumption platforms. While the technical implications of streaming huge amounts of data are well researched, much remains to be done to analyze the wealth of data collected by these services and to exploit it fully in order to understand and improve them. Using raw data about users' interactions with the music streaming service Spotify, this thesis focuses on three main concepts: streaming context, user attention and the sequential analysis of user actions. We discuss the importance of each of these aspects and propose different statistical and machine learning techniques to model them. We show how these models can be used to improve streaming services by inferring user sentiment and improving recommender systems, characterizing user sessions, extracting behavioral patterns and providing useful business metrics.

25

Nilsson, Ludvig, and Olle Djerf. « How to improve Swedish sentiment polarity classification using context analysis ». Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-446382.

Full text

Abstract:

This thesis considers sentiment polarity analysis in Swedish. Despite Swedish being the most widely spoken of the Nordic languages, less research on sentiment has been conducted in this area than for neighboring languages. As such, this is a largely exploratory project using techniques that have shown positive results for other languages. We compare techniques applied to a CNN with existing Swedish and multilingual variations of the state-of-the-art BERT model. We find that the preprocessing techniques do in fact benefit our CNN model, but still do not match the results of fine-tuned BERT models. We conclude that a Swedish-specific BERT model can outperform the generic multilingual ones, but only under certain conditions.

26

Nepal, Srijan. « Linguistic Approach to Information Extraction and Sentiment Analysis on Twitter ». University of Cincinnati / OhioLINK, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1342544962.

Full text

27

Clark, Eric Michael. « Applications In Sentiment Analysis And Machine Learning For Identifying Public Health Variables Across Social Media ». ScholarWorks @ UVM, 2019. https://scholarworks.uvm.edu/graddis/1006.

Full text

Abstract:

Twitter, a popular social media outlet, has evolved into a vast source of linguistic data, rich with opinion, sentiment, and discussion. We mined data from several public Twitter endpoints to identify content relevant to healthcare providers and public health regulatory professionals. We began by compiling content related to electronic nicotine delivery systems (or e-cigarettes), as these had become popular alternatives to tobacco products. There was an apparent need to remove high-frequency tweeting entities, called bots, that would spam messages and advertisements and fabricate testimonials. Algorithms were constructed using natural language processing and machine learning to separate human responses from automated accounts with high degrees of accuracy. We found that the average number of hyperlinks per tweet, the average character dissimilarity between each individual's content, and the rate of introduction of unique words were valuable attributes for identifying automated accounts. We performed 10-fold cross-validation and measured the performance of each set of tweet features at various bin sizes, the best of which performed with 97% accuracy. These methods were used to isolate automated content related to the advertising of electronic cigarettes. A rich taxonomy of automated entities, including robots, cyborgs, and spammers, each with different measurable linguistic features, was categorized. Electronic cigarette related posts were classified as automated or organic, and their content was investigated with a hedonometric sentiment analysis. The overwhelming majority (≈ 80%) were automated, many of which were commercial in nature. Others used false testimonials that were sent directly to individuals as a personalized form of targeted marketing.
Many tweets advertised nicotine vaporizer fluid (or e-liquid) in various “kid-friendly” flavors including 'Fudge Brownie', 'Hot Chocolate' and 'Circus Cotton Candy', along with every imaginable flavor of fruit, all of which were long ago banned for traditional tobacco products. Others offered free trials, as well as incentives to retweet and spread the post among the recipient's own network. Free prize giveaways were also hosted, with raffle tickets issued for sharing the tweet. Given the large youth presence on the public social media platform, this was evidence that the marketing of electronic cigarettes needed considerable regulation. Twitter has since officially banned all electronic cigarette advertising on its platform. Social media has the capacity to provide the healthcare industry with valuable feedback from patients who reveal and express their medical decision-making process, as well as self-reported quality of life indicators both during and after treatment. We have studied several active cancer patient populations discussing their experiences with the disease as well as survivorship. We experimented with a Convolutional Neural Network (CNN) as well as logistic regression to classify tweets as patient related. This led to a sample of 845 breast cancer survivor accounts to study over 16 months. We found positive sentiments regarding patient treatment, raising support, and spreading awareness. A large portion of negative sentiments concerned political legislation that could result in loss of healthcare coverage. We refer to these online public testimonies as “Invisible Patient Reported Outcomes” (iPROs), because they carry relevant indicators yet are difficult to capture by conventional means of self-reporting. Our methods can be readily applied across disciplines to obtain insights into the opinions of a particular group.
Capturing iPROs and public sentiments from online communication can help inform healthcare professionals and regulators, leading to more connected and personalized treatment regimens. Social listening can provide valuable insights into public health surveillance strategies.

28

Haddi, Emma. « Sentiment analysis : text, pre-processing, reader views and cross domains ». Thesis, Brunel University, 2015. http://bura.brunel.ac.uk/handle/2438/11196.

Full text

Abstract:

Sentiment analysis has attracted significant attention because a wide variety of applications could benefit from its results, such as news analytics, marketing, question answering and knowledge management. The field, however, is still early in its development, and urgent improvements are required on many issues, particularly the performance of sentiment classification. In this thesis, three key challenges affecting sentiment classification are outlined and innovative ways of addressing them are presented. First, text pre-processing has been found to be crucial to sentiment classification performance. Consequently, a combination of several existing pre-processing methods is proposed for the sentiment classification process. Second, text properties of financial news are utilised to build models that predict sentiment. Two different models are proposed: one uses financial events to predict financial news sentiment, and the other takes a new perspective that considers the view of the opinion reader, as opposed to the classic approach that examines the view of the opinion holder. A new method to capture the reader's sentiment is suggested. Third, financial news stretches over a number of domains, and it is very challenging to infer sentiment across different domains. Various approaches to cross-domain sentiment analysis are proposed and critically evaluated.

29

Sychra, Martin. « Analýza sentimentu s využitím dolování dat ». Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2016. http://www.nusl.cz/ntk/nusl-255424.

Full text

Abstract:

This thesis deals with sentiment analysis, mainly from the perspective of computer science and, marginally, from a linguistic point of view. The linguistic part discusses the term sentiment and the language methods used for its analysis, such as lemmatization, POS tagging and the use of a stopword list. More attention is paid to the structure of the sentiment analyzer, which is based on machine learning methods (support vector machines, Naive Bayes and maximum entropy classification). On the basis of the theoretical background, a functional analyzer is designed and implemented. The experiments focus mainly on comparing the classification methods and on the benefits of the individual preprocessing methods. The success rate of the constructed classifier reaches up to 84% in cross-validation.

30

Pecore, Stefania. « Analyse des sentiments et des émotions de commentaires complexes en langue française ». Thesis, Lorient, 2019. http://www.theses.fr/2019LORIS522/document.

Abstract:

"Sentiment", "opinion" and "emotion" are very vaguely defined words; not even the dictionary is of much help, since it defines each of the three by using the other two. And yet everyone is affected by opinions: companies need them to understand how to sell their products; people use them to buy the most fitting product and, more generally, to weigh their decisions; researchers in Artificial Intelligence exploit them to understand the nature of the human being. Today an unprecedented amount of information is available, but it is hard to use. The so-called "big data" are not always structured, especially for certain languages, which makes them difficult to exploit. French research suffers from a lack of readily available resources for running tests. In the context of Natural Language Processing and corpus processing, this thesis aims to explore the nature of sentiment and emotion. Its contributions include: the creation of new resources for sentiment and emotion analysis; the use and comparison of several machine learning methods; and, most importantly, the study of the problem from different points of view, namely the classification of online reviews by sentiment polarity (positive and negative) and Aspect-Based Sentiment Analysis of the characteristics of the reviewed product. Finally, a psycholinguistic study, supported by lexical and machine learning approaches, examines the relation between the one who judges (the reviewer) and the object being judged (the product).

31

Cunanan, Kevin. « Developing a Recurrent Neural Network with High Accuracy for Binary Sentiment Analysis ». Scholarship @ Claremont, 2018. http://scholarship.claremont.edu/cmc_theses/1835.

Abstract:

Sentiment analysis has taken on various machine learning approaches in order to optimize accuracy, precision, and recall. However, Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) account for the context of a sentence by using previous predictions as additional input for future sentence predictions. Our approach focused on developing an LSTM RNN that could perform binary sentiment analysis for positively and negatively labeled sentences. In collaboration with Mariam Salloum, I developed a collection of programs to classify individual sentences as either positive or negative. This paper additionally looks into machine learning, neural networks, data preprocessing, implementation, and resulting comparisons.

32

Viegas, Catarina Correia. « Assessing public figures’ reputation through sentiment analysis on twitter using machine learning : creation of a system ». Master's thesis, Instituto Superior de Economia e Gestão, 2020. http://hdl.handle.net/10400.5/20993.

Abstract:

Master's in Information Systems Management
Never has so much data been generated, and at such an astounding rate, as nowadays. This is undoubtedly an era of Big Data, a term that does not go unnoticed and that brings innumerable challenges but also a multitude of opportunities. Roughly 80% of the generated data is unstructured, and there is a special focus on the text format, which is common and carries great potential. There are several applications, techniques and tools connected to the analysis of textual documents, and this area is strongly linked to Natural Language Processing. One of the greatest challenges of both is related to Sentiment Analysis. Since it would be interesting to combine trends and address issues such as online reputation, this project focused on creating a system capable of identifying the sentiment associated with public figures as expressed in Twitter posts. First, a literature review covering the topics associated with the chosen subject was carried out. For the system itself, a Machine Learning approach using supervised learning methods was adopted. A manually annotated dataset was created and three of the most widely used classifiers (Naïve Bayes, Support Vector Machines and Maximum Entropy) were trained. The impact of some preprocessing techniques was also assessed. The results obtained were not as good as initially desired; nonetheless, the best model was incorporated into the system. This project contributes to the knowledge base of the areas it belongs to and provides a manually annotated dataset that can be used in further research.

33

Wright, Lindsey. « Classifying textual fast food restaurant reviews quantitatively using text mining and supervised machine learning algorithms ». Digital Commons @ East Tennessee State University, 2018. https://dc.etsu.edu/honors/451.

Abstract:

Companies continually seek to improve their business model through feedback and customer satisfaction surveys. Social media provides additional opportunities for this exploration into the mind of the customer. By extracting customer feedback from social media platforms, companies may increase the sample size of their feedback and remove bias often found in questionnaires, resulting in better-informed decision making. However, simply using personnel to analyze thousands of relevant social media posts is expensive and time-consuming. Thus, our study aims to establish a method to extract business intelligence from social media content by structuring opinionated textual data using text mining and classifying these reviews by the degree of customer satisfaction. By quantifying textual reviews, companies may perform statistical analysis to extract insight from the data as well as effectively address concerns. Specifically, we analyze a subset of 56,000 Yelp reviews of fast food restaurants and attempt to predict a quantitative value reflecting the overall opinion of each review. We compare two predictive modeling techniques, bagged Decision Trees and Random Forest classifiers. To simplify the problem, we train our model to classify strongly negative and strongly positive reviews (1 and 5 stars). In addition, we identify drivers behind strongly positive or negative reviews, allowing businesses to understand their strengths and weaknesses. This method provides companies with an efficient and cost-effective way to process and understand customer satisfaction as it is discussed on social media.

34

Jabreel, Mohammed Hamood Abdullah. « Sentiment Analysis of Textual Content in Social Networks. From Hand-Crafted to Deep Learning-Based Models ». Doctoral thesis, Universitat Rovira i Virgili, 2020. http://hdl.handle.net/10803/669441.

Abstract:

This thesis proposes several advanced methods to automatically analyse textual content shared on social networks and identify people's opinions, emotions and feelings at different levels of analysis and in different languages. We start by proposing a sentiment analysis system, called SentiRich, based on a set of rich features, including information extracted from sentiment lexicons and pre-trained word embedding models. Then, we propose an ensemble system based on Convolutional Neural Networks and XGboost regressors to solve an array of sentiment and emotion analysis tasks on Twitter. These tasks range from typical sentiment analysis tasks to automatically determining the intensity of an emotion (such as joy, fear, anger, etc.) and the intensity of sentiment (aka valence) of the authors from their tweets. We also propose a novel Deep Learning-based system to address the multiple emotion classification problem on Twitter. Moreover, we consider the problem of target-dependent sentiment analysis. For this purpose, we propose a Deep Learning-based system that identifies and extracts the target of the tweets. While some languages, such as English, have a vast array of resources to enable sentiment analysis, most languages lack them. So, we utilise the Cross-lingual Sentiment Analysis technique to develop a novel, multilingual and Deep Learning-based system for low-resource languages. We propose to combine Multi-Criteria Decision Aid and sentiment analysis to develop a system that gives users the ability to exploit reviews alongside their preferences in the process of ranking alternatives. Finally, we apply the developed systems to the field of communication of destination brands through social networks. To this end, we collected tweets of local people, visitors and official destination brand offices from different tourist destinations and analysed the opinions and emotions shared in these tweets. Overall, the methods proposed in this thesis improve on the performance of state-of-the-art approaches and show exciting findings.

35

Carter, David. « Inferring Aspect-Specific Opinion Structure in Product Reviews ». Thesis, Université d'Ottawa / University of Ottawa, 2015. http://hdl.handle.net/10393/32177.

Abstract:

Identifying differing opinions on a given topic as expressed by multiple people (as in a set of written reviews for a given product, for example) presents challenges. Opinions about a particular subject are often nuanced: a person may have both negative and positive opinions about different aspects of the subject of interest, and these aspect-specific opinions can be independent of the overall opinion on the subject. Being able to identify, collect, and count these nuanced opinions in a large set of data offers more insight into the strengths and weaknesses of competing products and services than does aggregating the overall ratings of such products and services. I make two useful and useable contributions in working with opinionated text. First, I present my implementation of a semi-supervised co-training machine classification method for identifying both product aspects (features of products) and sentiments expressed about such aspects. It offers better precision than fully-supervised methods while requiring much less text to be manually tagged (a time-consuming process). This algorithm can also be run in a fully supervised manner when more data is available. Second, I apply this co-training approach to reviews of restaurants and various electronic devices; such text contains both factual statements and opinions about features/aspects of products. The algorithm automatically identifies the product aspects and the words that indicate aspect-specific opinion polarity, while largely avoiding the problem of misclassifying the products themselves as inherently positive or negative. This method performs well compared to other approaches. When run on a set of reviews of five technology products collected from Amazon, the system performed with some demonstrated competence (with an average precision of 0.83) at the difficult task of simultaneously identifying aspects and sentiments, though comparison to contemporaries' simpler rules-based approaches was difficult. 
When run on a set of opinionated sentences about laptops and restaurants that formed the basis of a shared challenge in the SemEval-2014 Task 4 competition, it was able to classify the sentiments expressed about aspects of laptops better than any team that competed in the task (achieving 0.72 accuracy). It was above the mean in its ability to identify the aspects of restaurants about which people expressed opinions, even when co-training using only half of the labelled training data at the outset. While the SemEval-2014 aspect-based sentiment extraction task considered only separately the tasks of identifying product aspects and determining their polarities, I take an extra step and evaluate sentences as a whole, inferring aspects and the aspect-specific sentiments expressed simultaneously, a more difficult task that seems more applicable to real-world tasks. I present first results of this sentence-level task. The algorithm uses both lexical and syntactic information in a manner that is shown to be able to handle new words that it has never before seen. It offers some demonstrated ability to adapt to new subject domains for which it has no training data. The system is characterizable by very high precision and weak-to-average recall and it estimates its own confidence in its predictions; this characteristic should make the algorithm suitable for use on its own or for combination in a confidence-based voting ensemble. The software created for and described in the course of this dissertation is made available online.

36

COREA, FRANCESCO. « Essays on machine learning for economics and finance ». Doctoral thesis, Luiss Guido Carli, 2017. http://hdl.handle.net/11385/201135.

Abstract:

Econometrics and machine learning are closely related. It is increasingly important to extract value from raw data and to distil actionable insights from quantitative values as well as qualitative features. To address these topics, the first chapters (Chapters 1 to 4) introduce the new wave known as machine learning or big data and explain the most common techniques used in the field: regression, clustering, model selection and tree-based models (Chapter 2); time series analysis (Chapter 3); and forecasting models with shrinkage methods (Chapter 4). Three applications are then provided. Chapter 5 presents an example of a big dataset for the insurance vertical. Rothschild and Stiglitz ([30]) argued that people signal their risk profile through their insurance demand, i.e. individuals with a high risk profile buy as much insurance as they can, while people who buy no insurance are the ones with a lower risk profile. This issue is commonly known as adverse selection. Although their prediction seems to work quite well in many different markets, Cutler et al. ([13]) showed that there are some insurance markets in the United States in which the outcome is completely different. In the wake of this study, we provide empirical evidence that there are some European insurance markets in which the low-risk agents are the ones who buy more insurance. Chapter 6 provides a second application: it studies the effect of behavioural biases on entrepreneurs' choices to insure their firms against various kinds of corporate risk. A large sample of Italian small and medium-sized enterprises is used, and the finding is that they under-insure themselves. The dataset makes it possible to link corporate insurance choices with the personal traits of the entrepreneur and his household's financial choices.
Chapter 7, finally, presents an application to financial markets. Bollen et al. ([10]) reintroduced the idea of formulating predictions based on the general sentiment of investors, even if they originally exploited microblogging data. The purpose of this study is to verify whether social data have predictive power for stock prices, returns and volumes. The analysis is carried out for several large technology companies, and robustness is tested through a ten-day rolling window. The evidence shows that there is some intrinsic value in these new features, and that both the sentiment and the number of tweets posted online can improve the forecast given by a baseline autoregressive model. Finally, some additional variations were tested on the same dataset.

37

Elena, Podasca. « Predicting the Movement Direction of OMXS30 Stock Index Using XGBoost and Sentiment Analysis ». Thesis, Blekinge Tekniska Högskola, Institutionen för datavetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-21119.

Abstract:

Background. Stock market prediction is an active yet challenging research area. A lot of effort has been put in by both academia and practitioners to produce accurate stock market prediction models in an attempt to achieve investment objectives. Tree-based ensemble machine learning methods such as XGBoost have proven successful in practice. At the same time, there is a growing trend to incorporate multiple data sources in prediction models, such as historical prices and text, in order to achieve superior forecasting performance. However, most applications and research have so far focused on the American or Asian stock markets, while the Swedish stock market has not been studied extensively from the perspective of hybrid models using both price-derived and text-derived features. Objectives. The purpose of this thesis is to investigate whether augmenting a numerical dataset based on historical prices with sentiment features extracted from financial news improves classification performance when predicting the daily price trend of the Swedish stock market index, OMXS30. Methods. A dataset of 3,517 samples covering 2006 to 2020 was collected from two sources, historical prices and financial news. XGBoost was used as the classifier, and four different metrics were employed to compare model performance on three complementary datasets: the dataset containing only the sentiment feature, the dataset with only price-derived features and, finally, the price dataset augmented with the sentiment feature extracted from financial news. Results. Results show that XGBoost performs well in classifying the daily trend of OMXS30 given historical price features, achieving an accuracy of 73% on the test set. A small improvement across all metrics is recorded on the test set when augmenting the numerical dataset with sentiment features extracted from financial news. Conclusions. 
XGBoost is a powerful ensemble method for stock market prediction, reflected in a satisfactory classification performance of the daily movement direction of OMXS30. However, augmenting the numerical input set with sentiment features extracted from text did not have a powerful impact on classification performance in this case, as the improvements across all employed metrics were small.

38

Schulder, Marc [Verfasser], et Dietrich [Akademischer Betreuer] Klakow. « Sentiment polarity shifters : creating lexical resources through manual annotation and bootstrapped machine learning / Marc Schulder ; Betreuer : Dietrich Klakow ». Saarbrücken : Saarländische Universitäts- und Landesbibliothek, 2019. http://d-nb.info/1199932906/34.

39

Jamil, Zunaira. « Monitoring Tweets for Depression to Detect At-Risk Users ». Thesis, Université d'Ottawa / University of Ottawa, 2017. http://hdl.handle.net/10393/36030.

Abstract:

According to the World Health Organization, mental health is an integral part of health and well-being. Mental illness can affect anyone, rich or poor, male or female. One example of mental illness is depression. In Canada, 5.3% of the population had experienced a depressive episode in the past 12 months. Depression is difficult to diagnose, resulting in high under-diagnosis. Diagnosis is often based on self-reported experiences, behaviors reported by relatives, and a mental status examination. Currently, authorities use surveys and questionnaires to identify individuals who may be at risk of depression. This process is time-consuming and costly. We propose an automated system that can identify at-risk users from their public social media activity. More specifically, we identify at-risk users from Twitter. To achieve this goal, we trained a user-level classifier using a Support Vector Machine (SVM) that can detect at-risk users with a recall of 0.8750 and a precision of 0.7778. We also trained a tweet-level classifier that predicts whether a tweet indicates distress. This task was much more difficult due to the imbalanced data. In the dataset that we labeled, we came across 5% distress tweets and 95% non-distress tweets. To handle this class imbalance, we used undersampling methods. The resulting classifier uses SVM and performs with a recall of 0.8020 and a precision of 0.1237. Our system can be used by authorities to find a focused group of at-risk users. It is not a platform for labeling an individual as a patient with depression, but only a platform for raising an alarm so that the relevant authorities can take the necessary steps to further analyze the predicted user and confirm his/her state of mental health. We respect the ethical boundaries relating to the use of social media data and therefore do not use any user identification information in our research.
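The abstract reports precision and recall separately for the two classifiers. As a reading aid, the sketch below combines each pair into an F1 score (the harmonic mean of precision and recall). The F1 values are derived here from the reported figures and are not stated in the abstract; the function name is chosen for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the abstract.
user_level = f1_score(precision=0.7778, recall=0.8750)
tweet_level = f1_score(precision=0.1237, recall=0.8020)

print(f"user-level F1:  {user_level:.3f}")
print(f"tweet-level F1: {tweet_level:.3f}")
```

The result is roughly 0.82 at user level and 0.21 at tweet level. The gap illustrates how, with only 5% distress tweets, high recall can coexist with many false alarms.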

40

Tempfli, Peter. « Preprocessing method comparison and model tuning for natural language data ». Thesis, Högskolan Dalarna, Mikrodataanalys, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-34438.

Abstract:

Twitter and other microblogging services are a valuable source for near-real-time mining of marketing, public opinion and brand-related consumer information. As such, the collection and analysis of user-generated natural language content is a focus of research on automated sentiment analysis. The most successful approach in the field is supervised machine learning, where the three key problems are data cleaning and transformation, feature generation, and model choice and training parameter selection. Papers in recent years have examined the field thoroughly, and there is agreement that relatively simple techniques, such as a bag-of-words transformation of text and a Naive Bayes model, can generate acceptable results (F1-scores between 75% and 85% for an average dataset), while fine-tuning can be really difficult and yields relatively small gains. However, a difference of a few percentage points, even on a medium-sized dataset, can mean thousands of documents classified correctly or not, which can translate into thousands of missed sales or angry customers in any business domain. This work therefore presents and demonstrates a framework for better tailored, fine-tuned models for analysing Twitter data. The experiments show that Naive Bayes classifiers with domain-specific stopword selection work best (up to 88% F1-score); however, performance decreases dramatically if the data is unbalanced or the classes are not binary. Filtering stopwords is crucial to increasing prediction performance, and the experiments show that a stopword set should be domain-specific. The conclusion is that there is no single best way to train models and select stopwords in sentiment analysis. The work thus suggests that there is room for a comparison framework to fine-tune prediction models to a given problem: such a framework should compare different training settings on the same dataset, so that the best trained models can be found for a given real-life problem.
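The combination described in this abstract (bag-of-words features, a Naive Bayes model and a domain-specific stopword set) can be sketched in a few lines. The code below is a minimal multinomial Naive Bayes with add-one smoothing, written for this listing and not taken from the thesis; the toy texts, the stopword set and all function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text, stopwords):
    """Lowercase, split on whitespace and drop stopwords (bag-of-words)."""
    return [w for w in text.lower().split() if w not in stopwords]

def train_naive_bayes(texts, labels, stopwords=frozenset()):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter(labels)
    word_counts = {label: Counter() for label in class_docs}
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text, stopwords))
    vocab = set().union(*word_counts.values())
    return {"class_docs": class_docs, "word_counts": word_counts,
            "vocab": vocab, "stopwords": stopwords}

def predict(model, text):
    """Return the class with the highest smoothed log-posterior."""
    total_docs = sum(model["class_docs"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_docs"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in tokenize(text, model["stopwords"]):
            if word in model["vocab"]:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data; "phone" and "battery" act as hypothetical domain stopwords.
texts = ["great phone love it", "love the great battery",
         "terrible phone hate it", "hate the awful battery"]
labels = ["pos", "pos", "neg", "neg"]
domain_stopwords = frozenset({"the", "it", "phone", "battery"})
model = train_naive_bayes(texts, labels, domain_stopwords)
```

Swapping `domain_stopwords` for a different set and re-running the same evaluation is the kind of like-for-like comparison the abstract argues for.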

41

Smith, Anri. « Comparison of sovereign risk and its determinants ». Master's thesis, Faculty of Science, 2019. http://hdl.handle.net/11427/31119.

Abstract:

This paper aims to measure, compare and model Sovereign Risk. The risk position of South Africa is compared with that of Emerging Markets as well as Developed Markets. Particular interest is taken in how the South African Sovereign Risk environment, and its associated determinants, differs from and conforms to that of other Emerging Markets. This effectively highlights where the South African economy is similar to the Emerging Markets and where it behaves differently. Regression, optimisation techniques, dimension reduction techniques and Machine Learning techniques, the latter through the use of sentiment analysis, are utilised in this research.

42

Abdullah, Siti Norbaiti binti. « Machine learning approach for crude oil price prediction ». Thesis, University of Manchester, 2014. https://www.research.manchester.ac.uk/portal/en/theses/machine-learning-approach-for-crude-oil-price-prediction(949fa2d5-1a4d-416a-8e7c-dd66da95398e).html.

Texte intégral

Résumé :

Crude oil prices impact the world economy and are thus of interest to economic experts and politicians. The volatile behaviour of the oil price, which has moulded today’s world economy, society and politics, has motivated and continues to excite researchers for further study. This volatile behaviour is predicted to prompt more new and interesting research challenges. In the present research, machine learning and computational intelligence utilising historical quantitative data, together with the linguistic element of online news services, are used to predict crude oil prices via five different models: (1) the Hierarchical Conceptual (HC) model; (2) the Artificial Neural Network-Quantitative (ANN-Q) model; (3) the Linguistic model; (4) the Rule-based Expert model; and, finally, (5) the Hybridisation of Linguistic and Quantitative (LQ) model. First, to understand the behaviour of the crude oil price market, the HC model functions as a platform to retrieve information that explains the behaviour of the market. This information is retrieved from Google News articles using the keyword “Crude oil price”. Through a systematic approach, price data are classified into categories that explain the crude oil price’s level of impact on the market. The price data classification distinguishes crucial behaviour information contained in the articles. These distinguished data features are ranked hierarchically according to their level of impact and used as a reference to discover the numeric data implemented in model (2). Model (2) is developed to validate the features retrieved in model (1). It introduces the Back Propagation Neural Network (BPNN) technique as an alternative to conventional techniques used for forecasting the crude oil market. The BPNN technique is shown in model (2) to produce more accurate and competitive results. Likewise, the features retrieved from model (1) are also validated and shown to cause market volatility.
In model (3), a more systematic approach is introduced to extract the features from the news corpus. This approach applies a content utilisation technique to news articles and mines news sentiments by applying a fuzzy grammar fragment extraction. To extract the features from the news articles systematically, a domain-customised ‘dictionary’ containing grammar definitions is built beforehand. These retrieved features are used as the linguistic data to predict the market’s behaviour with crude oil price. A decision tree is also produced from this model which hierarchically delineates the events (i.e., the market’s rules) that made the market volatile, and later resulted in the production of model (4). Then, model (5) is built to complement the linguistic character performed in model (3) from the numeric prediction model made in model (2). To conclude, the hybridisation of these two models and the integration of models (1) to (5) in this research imitates the execution of crude oil market’s regulators in calculating their risk of actions before executing a price hedge in the market, wherein risk calculation is based on the ‘facts’ (quantitative data) and ‘rumours’ (linguistic data) collected. The hybridisation of quantitative and linguistic data in this study has shown promising accuracy outcomes, evidenced by the optimum value of directional accuracy and the minimum value of errors obtained.
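The abstract reports results in terms of directional accuracy without defining it. One common reading, assumed here and not confirmed by the source, is the share of time steps at which the forecast moves in the same direction as the actual price relative to the previous actual price. The sketch below uses that reading; the function name and the price series are hypothetical.

```python
def directional_accuracy(actual, predicted):
    """Fraction of steps where the predicted move has the same direction
    (up vs. not up) as the actual move, both measured from the previous
    actual value. This definition is an assumption, not the thesis's."""
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("need two equally long series with at least 2 points")
    hits = 0
    for t in range(1, len(actual)):
        actual_up = actual[t] - actual[t - 1] > 0
        predicted_up = predicted[t] - actual[t - 1] > 0
        if actual_up == predicted_up:
            hits += 1
    return hits / (len(actual) - 1)


# Hypothetical price series, for illustration only.
actual_prices = [50.0, 52.0, 51.0, 53.0, 54.0]
forecast_prices = [50.0, 51.5, 51.5, 50.5, 53.5]

score = directional_accuracy(actual_prices, forecast_prices)
```

Here the forecast gets the direction right at three of the four steps; it misses the third step, where it predicts a fall from 51.0 while the price actually rises.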

43

Sommar, Fredrik, et Milosz Wielondek. « Combining Lexicon- and Learning-based Approaches for Improved Performance and Convenience in Sentiment Classification ». Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-166430.

Texte intégral

Résumé :

Sentiment classification is the process of categorizing data based on its polarity, with a wide array of applications across several industries. This report examines a combination of two prominent approaches to sentiment classification: one using a lexicon of weighted words, the other using machine learning. These approaches are compared with the combined hybrid approach in order to give an account of their relative strengths and weaknesses. When run on a set of IMDb movie reviews, the results indicate that the hybrid model performs better than the lexicon-based approach but is in turn outperformed by the learning-based approach. However, the gain in convenience brought on by eliminating the need for training data makes the hybrid model an appealing alternative to the other approaches, with a slight trade-off in performance.

44

Mao, Yi. « Domain knowledge, uncertainty, and parameter constraints ». Diss., Georgia Institute of Technology, 2010. http://hdl.handle.net/1853/37295.

Texte intégral

45

Sundström, Johan. « Sentiment analysis of Swedish reviews and transfer learning using Convolutional Neural Networks ». Thesis, Uppsala universitet, Avdelningen för systemteknik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-339066.

Texte intégral

Résumé :

Sentiment analysis is a field within machine learning that focuses on determining the contextual polarity of subjective information. It is a technique that can be used to analyze the "voice of the customer" and has been applied with success to English-language opinionated information such as customer reviews, political opinions and social media data. A major problem with machine learning models is that they are domain dependent and will therefore not perform well in other domains. Transfer learning, or domain adaptation, is a research field that studies a model's ability to transfer knowledge across domains. In the extreme case, a model trains on data from one domain, the source domain, and tries to make accurate predictions on data from another domain, the target domain. The deep machine learning model known as the Convolutional Neural Network (CNN) has in recent years gained much attention due to its performance in computer vision, both for in-domain classification and for transfer learning. It has also performed well on natural language processing problems but has not been investigated to the same extent for transfer learning within this area. The purpose of this thesis has been to investigate how well suited the CNN is for cross-domain sentiment analysis of Swedish reviews. The research has been conducted by investigating how the model performs when trained with data from different domains with varying amounts of source and target data. Additionally, the impact of different text representations on the model’s transferability has been studied. This study has shown that a CNN without pre-trained word embeddings is not that well suited for transfer learning, since it performs worse than a traditional logistic regression model. Substituting 20% of the source training data with target data can, in many of the test cases, boost performance by 7-8% for both the logistic regression and the CNN model. Using pre-trained word embeddings produced by a word2vec model increases the CNN's transferability as well as its in-domain performance, and it outperforms the logistic regression model and the CNN model without pre-trained word embeddings in the majority of test cases.
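The data-mixing step described in this abstract, substituting a share of the source-domain training data with target-domain data, might look like the sketch below. It assumes the total training-set size is kept constant and that examples are drawn at random; the function name, the seed and the placeholder reviews are hypothetical, and the thesis may have built its splits differently.

```python
import random


def mix_training_data(source, target, target_fraction=0.2, seed=0):
    """Replace a fraction of the source training set with target-domain
    examples, keeping the total number of examples unchanged."""
    rng = random.Random(seed)
    n_target = round(len(source) * target_fraction)
    if n_target > len(target):
        raise ValueError("not enough target-domain examples")
    kept_source = rng.sample(source, len(source) - n_target)
    added_target = rng.sample(target, n_target)
    mixed = kept_source + added_target
    rng.shuffle(mixed)
    return mixed


# Placeholder (text, label, domain) triples, for illustration only.
source_reviews = [(f"source review {i}", i % 2, "source") for i in range(10)]
target_reviews = [(f"target review {i}", i % 2, "target") for i in range(5)]

mixed_reviews = mix_training_data(source_reviews, target_reviews, 0.2)
```

With ten source examples and `target_fraction=0.2`, the mixed set holds eight source and two target examples, so any change in accuracy can be attributed to the domain mix rather than to a larger training set.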

46

Barakat, Serena, et Mahesh Chathuranga. « Improving Hashtag Recommendation for Instagram Images by Considering Hashtag Relativity and Sentiment ». Thesis, Högskolan Dalarna, Mikrodataanalys, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:du-29455.

Texte intégral

Résumé :

Extracting knowledge from user-generated content (UGC) on social media platforms is a very active research topic in machine learning. The main challenge, however, is that UGC carries inference, abstraction and subjectivity alongside objectivity. Recognising the importance of subjectivity for obtaining human-like results from a machine learning algorithm, this study proposes a novel approach to improving Instagram hashtag recommendation by considering the sentiment that can be expressed for images. Two main points are studied in this thesis: the relativity of an Instagram image to a hashtag, for both objective and subjective features of the image, and the effect of sentiment on that relativity. This work examines three machine learning methods for hashtag recommendation: an AWS service, and developed algorithms with and without sentiment considerations. The models are tested on a collected dataset of de-identified Instagram posts located in London, gathered from public profiles. The results show that considering sentiment significantly improves Instagram hashtag recommendation.

47

Holmqvist, Carl. « Opinion analysis of microblogs for stock market prediction ». Thesis, KTH, Teoretisk datalogi, TCS, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-233197.

Texte intégral

Résumé :

This degree project investigates if a company’s stock price development can be predicted using the general opinion expressed in tweets about the company. The project starts off with the model from a previous project and then tries to improve the results using state-of-the-art neural network sentiment analysis and more tweet data. This project also attempts to perform hourly predictions along with daily predictions in order to investigate the method further. The results show a decrease in accuracy compared to the previous project. The results also indicate that the neural network sentiment analysis improves the accuracy of the stock price development when compared to the baseline model under comparable conditions.

48

Nhlabano, Valentine Velaphi. « Fast Data Analysis Methods For Social Media Data ». Diss., University of Pretoria, 2018. http://hdl.handle.net/2263/72546.

Texte intégral

Résumé :

The advent of Web 2.0 technologies, which support the creation and publishing of social media content in a collaborative and participatory way by all users, in the form of user-generated content and social networks, has led to the creation of vast amounts of structured, semi-structured and unstructured data. The sudden rise of social media has led to their wide adoption by organisations of various sizes worldwide, which use this new means of communication to engage with their stakeholders in ways that were unimaginable before. Data generated from social media is highly unstructured, which makes it challenging for most organisations, which are normally used to handling and analysing structured data from business transactions. The research reported in this dissertation was carried out to investigate fast and efficient methods for retrieving, storing and analysing unstructured data from social media in order to make crucial and informed business decisions on time. Sentiment analysis was conducted on Twitter data (tweets). Twitter, one of the most widely adopted social network services, provides an API (Application Programming Interface) that lets researchers and software developers connect to the Twitter database and collect public data sets. A Twitter application was created and used to collect streams of real-time public data via a Twitter source provided by Apache Flume, and the data was stored efficiently in the Hadoop Distributed File System (HDFS). Apache Flume is a distributed, reliable and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store such as HDFS. Apache Hadoop is an open source software library that runs on low-cost commodity hardware and can store, manage and analyse large amounts of both structured and unstructured data quickly, reliably and flexibly at low cost. A lexicon-based sentiment analysis approach was taken, and the AFINN-111 lexicon was used for scoring. The Twitter data was analysed from HDFS using a Java MapReduce implementation. MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. The results demonstrate that it is fast, efficient and economical to use this approach to analyse unstructured data from social media in real time.
Dissertation (MSc)--University of Pretoria, 2019.
National Research Foundation (NRF) - Scarce skills
Computer Science
MSc
Unrestricted
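The lexicon-based scoring described in this abstract can be illustrated with a small map/reduce-style sketch. The dissertation used a Java MapReduce job on Hadoop with the AFINN-111 lexicon; the version below is plain single-process Python, and `TOY_LEXICON` holds made-up word scores that are not copied from AFINN-111. The function names and example tweets are hypothetical.

```python
from collections import defaultdict

# Hypothetical word scores in the spirit of a valence lexicon;
# these values are NOT taken from AFINN-111.
TOY_LEXICON = {"good": 2, "great": 3, "bad": -2, "awful": -3}


def map_tweet(tweet_id, text):
    """Map step: emit one (tweet_id, score) pair per lexicon word found."""
    for word in text.lower().split():
        word = word.strip(".,!?#@:;\"'")
        if word in TOY_LEXICON:
            yield tweet_id, TOY_LEXICON[word]


def reduce_scores(pairs):
    """Reduce step: sum the scores emitted for each tweet id."""
    totals = defaultdict(int)
    for tweet_id, score in pairs:
        totals[tweet_id] += score
    return dict(totals)


def label(score):
    """Turn a summed score into a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up tweets, for illustration only.
tweets = {
    1: "Great service, good food!",
    2: "Awful delay and bad support",
    3: "Landing at noon",
}

pairs = [pair for tid, text in tweets.items() for pair in map_tweet(tid, text)]
totals = reduce_scores(pairs)
labels = {tid: label(totals.get(tid, 0)) for tid in tweets}
```

Because the map step handles each tweet independently and the reduce step only sums per key, the same logic can be distributed across a cluster, which is what makes the approach a natural fit for MapReduce.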

49

Hernández, Martínez Víctor Alejandro. « Identificación de la presencia de ironía en el texto generado por usuarios de Twitter utilizando técnicas de Opinion Mining y Machine Learning ». Tesis, Universidad de Chile, 2015. http://repositorio.uchile.cl/handle/2250/134793.

Texte intégral

Résumé :

Industrial Civil Engineer
The general objective of this work is to design and implement a text classification module that identifies the presence of irony in content generated by Twitter users, using tools associated with Opinion Mining and Machine Learning. Irony is a phenomenon that forms part of the content people generate on the Web, and it represents a new field of study that has attracted the attention of some Opinion Mining researchers because of its complexity and the impact it can have on the performance of current Sentiment Analysis applications. This thesis was developed within the framework of OpinionZoom, CORFO project code 13IDL2-23170, titled "OpinionZoom: Plataforma de análisis de sentimientos e ironía a partir de la información textual en redes sociales para la caracterización de la demanda de productos y servicios", carried out at the Web Intelligence Centre of the Department of Industrial Engineering of the Faculty of Physical and Mathematical Sciences of the Universidad de Chile, which seeks to build an advanced system for analysing data extracted from social networks in order to obtain information relevant to companies about their products and services. The research hypothesis of this work is that it is possible to detect the presence of irony in Spanish-language text with a certain level of precision, using an adaptation of the methodology proposed by Reyes et al. (2013) in [5], which involves building a corpus based on the structure of Twitter together with people's ability to detect irony. The model used consists of 11 features, covering syntactic, semantic and emotional or psychological characteristics, with the aim of describing irony in text. To this end, a corpus of ironic and non-ironic cases is generated through a semi-automatic selection using a series of Twitter hashtags, and its labelling is then validated by human evaluators. This is complemented by including objective texts as part of the set of non-ironic cases. Using this corpus, a supervised learning algorithm is then trained to perform the subsequent text classification. For this purpose, a feature extraction module is implemented that transforms each text into a vector representing the features. Finally, the resulting vectors are used to implement a text classification module, which classifies text as ironic or non-ironic. Two tests are run to evaluate its performance. The first uses the objective texts as the non-ironic cases, and the second uses as non-ironic cases those texts judged as such by people. The first achieved a high level of precision, while the second was insufficient. Based on the results, it is concluded that this implementation is not an absolute solution. There are some limitations associated with the construction of the corpus, the tools used and even the model; nevertheless, the results show that, under certain comparison scenarios, it is possible to detect irony in text, so the hypothesis holds. It is suggested that the research be extended, the corpus collection improved, more developed tools used, and the elements that the model cannot capture analysed.

50

Liaghat, Zeinab. « Quality-efficiency trade-offs in machine learning applied to text processing ». Doctoral thesis, Universitat Pompeu Fabra, 2017. http://hdl.handle.net/10803/402575.

Texte intégral

Résumé :

Nowadays, the amount of available digital documents is growing at a considerable rate and comes from a variety of sources. Sources of unstructured and semi-structured information include the World Wide Web, news articles, biological databases, electronic mail, digital libraries, governmental digital repositories, chat rooms, online forums, blogs, and social media such as Facebook, Instagram, LinkedIn, Pinterest, Twitter and YouTube, plus many others. Extracting information from these resources and finding useful information in such collections has become a challenge, which makes organizing massive amounts of data a necessity. Data mining, machine learning, and natural language processing are powerful techniques that can be used together to deal with this challenge. Depending on the task or problem at hand, there are many different approaches that can be used. The methods being implemented are continuously optimized, but not all of them have been tested and compared for quality after training on large corpora for supervised machine learning algorithms. The question is: what happens to the quality of the methods if we increase the data size from, say, 100 MB to over 1 GB? Moreover, are quality gains worth it when the rate of data processing diminishes? Can we trade quality for time efficiency and recover the quality loss just by being able to process more data? This thesis is a first attempt to answer these questions in a general way for text processing tasks, as not enough research has been done to compare those methods considering the trade-offs between data size, quality, and processing time. Hence, we propose a trade-off analysis framework and apply it to three important text processing problems: Named Entity Recognition, Sentiment Analysis, and Document Classification. These problems were also chosen because they have different levels of object granularity: words, passages, and documents. For each problem, we select several machine learning algorithms and evaluate the trade-offs of these methods on large publicly available datasets (news, reviews, patents). We use data subsets of increasing size, ranging from 50 MB to a few GB, to explore these trade-offs. We conclude, as hypothesized, that a method that performs well on small data does not necessarily perform equally well on big data. For the last two problems, we consider similar algorithms and also two different data sets and two different evaluation techniques, to study the impact of the data and the evaluation technique on the resulting trade-offs. We find that the results do not change significantly.
