Journal articles on the topic 'TWITTER DATASET'

Consult the top 50 journal articles for your research on the topic 'TWITTER DATASET.'

1

Meier, Florian. "TWikiL – the Twitter Wikipedia Link Dataset." Proceedings of the International AAAI Conference on Web and Social Media 16 (May 31, 2022): 1292–301. http://dx.doi.org/10.1609/icwsm.v16i1.19381.

Abstract:
Recent research has shown how strongly Wikipedia and other web services or platforms are connected. For example, search engines rely heavily on surfacing Wikipedia links to satisfy their users' information needs and volunteer-created Wikipedia content frequently gets re-used on other social media platforms like Reddit. However, publicly accessible datasets that enable researchers to study the interrelationship between Wikipedia and other platforms are sparse. In addition to that, most studies only focus on certain points in time and don't consider the historical perspective. To begin solving these problems we developed TWikiL, the Twitter Wikipedia Link Dataset, which contains all Wikipedia links posted on Twitter in the period 2006 to January 2021. We extract Wikipedia links from Tweets and enrich the referenced articles with their respective Wikidata identifiers and Wikipedia topic categories, which will make this dataset immediately useful for a large range of scholarly use cases. In this paper, we describe the data collection process, perform an initial exploratory analysis and present a comprehensive overview of how this dataset can be useful for the research community.
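For readers who want to prototype the core extraction step described above, here is a minimal Python sketch of pulling Wikipedia article links out of tweet text. The example tweets, the regex, and the helper name are illustrative assumptions rather than the TWikiL pipeline itself, and real tweets would first need their t.co-shortened URLs expanded before matching.

```python
import re
from urllib.parse import unquote

# Hypothetical tweet texts containing Wikipedia links.
tweets = [
    "Interesting read: https://en.wikipedia.org/wiki/Social_media",
    "Siehe https://de.wikipedia.org/wiki/Twitter fuer Details",
    "no wiki link here",
]

# Match <lang>.wikipedia.org/wiki/<title> and capture language edition and article title.
WIKI_LINK = re.compile(r"https?://([a-z\-]+)\.wikipedia\.org/wiki/([^\s?#]+)")

def extract_wikipedia_links(text):
    """Return (language, article title) pairs for every Wikipedia link in a tweet."""
    links = []
    for lang, raw_title in WIKI_LINK.findall(text):
        title = unquote(raw_title).replace("_", " ")
        links.append((lang, title))
    return links

for t in tweets:
    print(extract_wikipedia_links(t))
# Mapping each (language, title) pair to a Wikidata identifier and a topic category,
# as the dataset's enrichment step does, can then be done via the MediaWiki API.
```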
2

Almalki, Jameel. "A machine learning-based approach for sentiment analysis on distance learning from Arabic Tweets." PeerJ Computer Science 8 (July 26, 2022): e1047. http://dx.doi.org/10.7717/peerj-cs.1047.

Abstract:
Social media platforms such as Twitter, YouTube, Instagram and Facebook are leading sources of large datasets nowadays. Twitter’s data is one of the most reliable due to its privacy policy. Tweets have been used for sentiment analysis and to identify meaningful information within the dataset. Our study focused on the distance learning domain in Saudi Arabia by analyzing Arabic tweets about distance learning. This work proposes a model for analyzing people’s feedback using a Twitter dataset in the distance learning domain. The proposed model is based on the Apache Spark product to manage the large dataset. The proposed model uses the Twitter API to get the tweets as raw data. These tweets were stored in the Apache Spark server. A regex-based technique for preprocessing removed retweets, links, hashtags, English words and numbers, usernames, and emojis from the dataset. After that, a Logistic-based Regression model was trained on the pre-processed data. This Logistic Regression model, from the field of machine learning, was used to predict the sentiment inside the tweets. Finally, a Flask application was built for sentiment analysis of the Arabic tweets. The proposed model gives better results when compared to various applied techniques. The proposed model is evaluated on test data to calculate Accuracy, F1 Score, Precision, and Recall, obtaining scores of 91%, 90%, 90%, and 89%, respectively.
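As a rough illustration of the cleaning-plus-classification idea in this abstract, the sketch below uses plain scikit-learn rather than the paper's Apache Spark and Flask stack; the regex rules, the two example tweets, and the labels are placeholders, not the authors' actual pipeline.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def clean_arabic_tweet(text):
    """Regex-based cleaning in the spirit of the abstract: drop retweet markers,
    links, hashtags/mentions, English words and numbers, and emojis."""
    text = re.sub(r"^RT\s+", "", text)               # retweet marker
    text = re.sub(r"https?://\S+", " ", text)        # links
    text = re.sub(r"[@#]\S+", " ", text)             # mentions and hashtags
    text = re.sub(r"[A-Za-z0-9]+", " ", text)        # English words and numbers
    text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)  # anything outside the Arabic block (emojis, punctuation)
    return re.sub(r"\s+", " ", text).strip()

# Tiny hypothetical labelled sample (1 = positive, 0 = negative).
tweets = ["التعليم عن بعد رائع جدا", "تجربة سيئة مع التعليم عن بعد اليوم https://t.co/x"]
labels = [1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit([clean_arabic_tweet(t) for t in tweets], labels)
print(model.predict([clean_arabic_tweet("التعليم عن بعد رائع")]))
```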
3

Dar, Momna, Faiza Iqbal, Rabia Latif, Ayesha Altaf, and Nor Shahida Mohd Jamail. "Policy-Based Spam Detection of Tweets Dataset." Electronics 12, no. 12 (June 14, 2023): 2662. http://dx.doi.org/10.3390/electronics12122662.

Abstract:
Spam communications from spam ads and social media platforms such as Facebook, Twitter, and Instagram are increasing, making spam detection more popular. Many languages are used for spam review identification, including Chinese, Urdu, Roman Urdu, English, Turkish, etc.; however, there are fewer high-quality datasets available for Urdu. This is mainly because Urdu is less extensively used on social media networks such as Twitter, making it harder to collect huge volumes of relevant data. This paper investigates policy-based Urdu tweet spam detection. This study aims to collect over 1,100,000 real-time tweets from multiple users. The dataset is carefully filtered to comply with Twitter’s 100-tweet-per-hour limit. For data collection, the snscrape library is utilized, which is equipped with an API for accessing various attributes such as username, URL, and tweet content. Then, a machine learning pipeline consisting of TF-IDF, a Count Vectorizer, and the following machine learning classifiers: multinomial naïve Bayes, support vector classifier (RBF), logistic regression, and BERT, is developed. Based on Twitter policy standards, feature extraction is performed, and the dataset is separated into training and testing sets for spam analysis. Experimental results show that the logistic regression classifier achieved the highest accuracy, with an F1-score of 0.70 and an accuracy of 99.55%. The findings of the study show the effectiveness of policy-based spam detection in Urdu tweets using machine learning and BERT layer models and contribute to the development of a robust Urdu-language social media spam detection method.
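The classical part of the pipeline in this abstract can be approximated with a small scikit-learn comparison loop. The sketch below assumes made-up English placeholder texts (the paper works on Urdu tweets) and omits the BERT model, so it is only an illustration of comparing vectorizers and classifiers, not the authors' setup.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical labelled tweets: 1 = spam, 0 = not spam (placeholder English text).
texts = ["win a free prize click here", "assembly session coverage today",
         "free free offer click now", "weather update for lahore",
         "click to claim your prize", "new education policy announced"] * 5
labels = [1, 0, 1, 0, 1, 0] * 5

candidates = {
    "multinomial_nb": make_pipeline(CountVectorizer(), MultinomialNB()),
    "svm_rbf": make_pipeline(TfidfVectorizer(), SVC(kernel="rbf")),
    "logistic_regression": make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
}
for name, pipe in candidates.items():
    # 5-fold cross-validated F1 as a quick way to compare the classical classifiers.
    scores = cross_val_score(pipe, texts, labels, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.2f}")
```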
4

Ferragina, Paolo, Francesco Piccinno, and Roberto Santoro. "On Analyzing Hashtags in Twitter." Proceedings of the International AAAI Conference on Web and Social Media 9, no. 1 (August 3, 2021): 110–19. http://dx.doi.org/10.1609/icwsm.v9i1.14584.

Abstract:
Hashtags, originally introduced in Twitter, are now becoming the most used way to tag short messages in social networks since this facilitates subsequent search, classification and clustering over those messages. However, extracting information from hashtags is difficult because their composition is not constrained by any (linguistic) rule and they usually appear in short and poorly written messages which are difficult to analyze with classic IR techniques. In this paper we address two challenging problems regarding the meaning of hashtags — namely, hashtag relatedness and hashtag classification - and we provide two main contributions. First we build a novel graph upon hashtags and (Wikipedia) entities drawn from the tweets by means of topic annotators (such as TagME); this graph will allow us to model in an efficacious way not only classic co-occurrences but also semantic relatedness among hashtags and entities, or between entities themselves. Based on this graph, we design algorithms that significantly improve state-of-the-art results upon known publicly available datasets. The second contribution is the construction and the public release to the research community of two new datasets: the former is a new dataset for hashtag relatedness, the latter is a dataset for hashtag classification that is up to two orders of magnitude larger than the existing ones. These datasets will be used to show the robustness and efficacy of our approaches, showing improvements in F1 up to two-digits in percentage (absolute).
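A toy version of the hashtag-entity co-occurrence graph described above can be built with networkx. The annotated tweets, the node labels, and the crude relatedness score below are illustrative assumptions and are much simpler than the paper's graph-based measures.

```python
import itertools
import networkx as nx

# Toy tweets already annotated with hashtags and (hypothetical) Wikipedia entities,
# as a topic annotator such as TagME would produce.
annotated_tweets = [
    {"hashtags": ["worldcup"], "entities": ["FIFA World Cup", "Qatar"]},
    {"hashtags": ["worldcup", "football"], "entities": ["FIFA World Cup"]},
    {"hashtags": ["election"], "entities": ["United States Senate"]},
]

G = nx.Graph()
for tweet in annotated_tweets:
    nodes = [("hashtag", h) for h in tweet["hashtags"]] + \
            [("entity", e) for e in tweet["entities"]]
    # Connect every pair of hashtags/entities co-occurring in the same tweet,
    # accumulating co-occurrence counts as edge weights.
    for a, b in itertools.combinations(nodes, 2):
        w = G.get_edge_data(a, b, {}).get("weight", 0)
        G.add_edge(a, b, weight=w + 1)

def relatedness(h1, h2):
    """Crude relatedness: direct edge weight normalised by node degrees."""
    a, b = ("hashtag", h1), ("hashtag", h2)
    if not G.has_edge(a, b):
        return 0.0
    return G[a][b]["weight"] / (G.degree(a) * G.degree(b)) ** 0.5

print(relatedness("worldcup", "football"))
```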
5

Thakur, Nirmalya. "MonkeyPox2022Tweets: A Large-Scale Twitter Dataset on the 2022 Monkeypox Outbreak, Findings from Analysis of Tweets, and Open Research Questions." Infectious Disease Reports 14, no. 6 (November 14, 2022): 855–83. http://dx.doi.org/10.3390/idr14060087.

Abstract:
The mining of Tweets to develop datasets on recent issues, global challenges, pandemics, virus outbreaks, emerging technologies, and trending matters has been of significant interest to the scientific community in the recent past, as such datasets serve as a rich data resource for the investigation of different research questions. Furthermore, the virus outbreaks of the past, such as COVID-19, Ebola, Zika virus, and flu, just to name a few, were associated with various works related to the analysis of the multimodal components of Tweets to infer the different characteristics of conversations on Twitter related to these respective outbreaks. The ongoing outbreak of the monkeypox virus, declared a Global Public Health Emergency (GPHE) by the World Health Organization (WHO), has resulted in a surge of conversations about this outbreak on Twitter, which is resulting in the generation of tremendous amounts of Big Data. There has been no prior work in this field thus far that has focused on mining such conversations to develop a Twitter dataset. Furthermore, no prior work has focused on performing a comprehensive analysis of Tweets about this ongoing outbreak. To address these challenges, this work makes three scientific contributions to this field. First, it presents an open-access dataset of 556,427 Tweets about monkeypox that have been posted on Twitter since the first detected case of this outbreak. A comparative study is also presented that compares this dataset with 36 prior works in this field that focused on the development of Twitter datasets to further uphold the novelty, relevance, and usefulness of this dataset. Second, the paper reports the results of a comprehensive analysis of the Tweets of this dataset. This analysis presents several novel findings; for instance, out of all the 34 languages supported by Twitter, English has been the most used language to post Tweets about monkeypox, about 40,000 Tweets related to monkeypox were posted on the day WHO declared monkeypox as a GPHE, a total of 5470 distinct hashtags have been used on Twitter about this outbreak out of which #monkeypox is the most used hashtag, and Twitter for iPhone has been the leading source of Tweets about the outbreak. The sentiment analysis of the Tweets was also performed, and the results show that despite a lot of discussions, debate, opinions, information, and misinformation, on Twitter on various topics in this regard, such as monkeypox and the LGBTQI+ community, monkeypox and COVID-19, vaccines for monkeypox, etc., “neutral” sentiment was present in most of the Tweets. It was followed by “negative” and “positive” sentiments, respectively. Finally, to support research and development in this field, the paper presents a list of 50 open research questions related to the outbreak in the areas of Big Data, Data Mining, Natural Language Processing, and Machine Learning that may be investigated based on this dataset.
6

Gamal, Donia, Marco Alfonse, El-Sayed M. El-Horbaty, and Abdel-Badeeh M. Salem. "Twitter Benchmark Dataset for Arabic Sentiment Analysis." International Journal of Modern Education and Computer Science 11, no. 1 (January 8, 2019): 33–38. http://dx.doi.org/10.5815/ijmecs.2019.01.04.

7

Aguilar-Gallegos, Norman, Leticia Elizabeth Romero-García, Enrique Genaro Martínez-González, Edgar Iván García-Sánchez, and Jorge Aguilar-Ávila. "Dataset on dynamics of Coronavirus on Twitter." Data in Brief 30 (June 2020): 105684. http://dx.doi.org/10.1016/j.dib.2020.105684.

8

Guo, Xiaobo, and Soroush Vosoughi. "A Large-Scale Longitudinal Multimodal Dataset of State-Backed Information Operations on Twitter." Proceedings of the International AAAI Conference on Web and Social Media 16 (May 31, 2022): 1245–50. http://dx.doi.org/10.1609/icwsm.v16i1.19375.

Abstract:
This paper proposes a large-scale and comprehensive dataset of 28 sub-datasets of state-backed tweets and accounts affiliated with 14 different countries, spanning more than 3 years, and a corresponding "negative" dataset of background tweets from the same time period and on similar topics. To our knowledge, this is the first dataset that contains both state-sponsored propaganda tweets and carefully collected corresponding negative tweet datasets for so many countries spanning such a long period of time.
9

Nia, Zahra Movahedi, Ali Ahmadi, Bruce Mellado, Jianhong Wu, James Orbinski, Ali Asgary, and Jude D. Kong. "Twitter-based gender recognition using transformers." Mathematical Biosciences and Engineering 20, no. 9 (2023): 15957–77. http://dx.doi.org/10.3934/mbe.2023711.

Abstract:
Social media contains useful information about people and society that could help advance research in many different areas of health (e.g. by applying opinion mining, emotion/sentiment analysis and statistical analysis), such as mental health, health surveillance, socio-economic inequality and gender vulnerability. User demographics provide rich information that could help study the subject further. However, user demographics such as gender are considered private and are not freely available. In this study, we propose a model based on transformers to predict the user's gender from their images and tweets. The image-based classification model is trained in two different ways: using the profile image of the user and using various image contents posted by the user on Twitter. For the first method a Twitter gender recognition dataset publicly available on Kaggle is used, and for the second method the PAN-18 dataset is used. Several transformer models, i.e. vision transformers (ViT), LeViT and Swin Transformer, are fine-tuned for both of the image datasets and then compared. Next, different transformer models, namely bidirectional encoder representations from transformers (BERT), RoBERTa and ELECTRA, are fine-tuned to recognize the user's gender from their tweets. This is highly beneficial, because not all users provide an image that indicates their gender; the gender of such users can be detected from their tweets. The significance of the image and text classification models was evaluated using the Mann-Whitney U test. Finally, the combination model improved the accuracy of the image and text classification models by 11.73% and 5.26% for the Kaggle dataset and by 8.55% and 9.8% for the PAN-18 dataset, respectively. This shows that the image and text classification models are capable of complementing each other by providing additional information to one another. Our overall multimodal method has an accuracy of 88.11% for the Kaggle dataset and 89.24% for the PAN-18 dataset and outperforms state-of-the-art models. Our work benefits research that critically requires user demographic information such as gender to further analyze and study social media content for health-related issues.
10

Sagarika, Namasani, Bommadi Sreenija Reddy, Vanka Varshitha, Kodavati Geetanjali, N. V. Ganapathi Raju, and Latha Kunaparaju. "Sarcasm Discernment on Social Media Platform." E3S Web of Conferences 309 (2021): 01037. http://dx.doi.org/10.1051/e3sconf/202130901037.

Abstract:
Past studies in sarcasm detection mostly make use of Twitter datasets collected using hashtag-based supervision, but such datasets are noisy in terms of labels and language. To overcome the limitations related to noise in Twitter datasets, this News Headlines dataset for sarcasm detection is collected from two news websites, theonion.com and huffingtonpost.com. TheOnion aims at producing sarcastic versions of current events, and we collected all the headlines from its News in Brief and News in Photos categories (which are sarcastic); real (and non-sarcastic) news headlines were collected from HuffPost. Since news headlines are written by professionals in a formal manner, there are no spelling mistakes or informal usage. This reduces sparsity and also increases the chance of finding pre-trained embeddings. Furthermore, since the sole purpose of TheOnion is to publish sarcastic news, we get high-quality labels with much less noise compared to Twitter datasets. Unlike tweets that reply to other tweets, the news headlines obtained are self-contained.
11

Hernandez, Luis Alberto Robles, Tiffany J. Callahan, and Juan M. Banda. "A biomedically oriented automatically annotated Twitter COVID-19 dataset." Genomics & Informatics 19, no. 3 (September 30, 2021): e21. http://dx.doi.org/10.5808/gi.21011.

Abstract:
The use of social media data, like Twitter, for biomedical research has been gradually increasing over the years. With the coronavirus disease 2019 (COVID-19) pandemic, researchers have turned to more non-traditional sources of clinical data to characterize the disease in near-real time, study the societal implications of interventions, as well as the sequelae that recovered COVID-19 cases present (Long-COVID). However, manually curated social media datasets are difficult to come by due to the expensive costs of manual annotation and the efforts needed to identify the correct texts. When datasets are available, they are usually very small and their annotations don’t generalize well over time or to larger sets of documents. As part of the 2021 Biomedical Linked Annotation Hackathon, we release our dataset of over 120 million automatically annotated tweets for biomedical research purposes. Incorporating best-practices, we identify tweets with potentially high clinical relevance. We evaluated our work by comparing several SpaCy-based annotation frameworks against a manually annotated gold-standard dataset. Selecting the best method to use for automatic annotation, we then annotated 120 million tweets and released them publicly for future downstream usage within the biomedical domain.
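A minimal sketch of rule-based tweet annotation with spaCy, in the spirit of the SpaCy-based frameworks the abstract mentions; the labels and patterns below are invented for illustration and are not the project's actual annotation scheme.

```python
import spacy

# Blank English pipeline with a rule-based entity ruler; the hackathon work
# compared several spaCy-based annotators against a gold standard, and the
# patterns here are purely illustrative placeholders.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "SYMPTOM", "pattern": "loss of taste"},
    {"label": "SYMPTOM", "pattern": "fever"},
    {"label": "DISEASE", "pattern": "covid-19"},
])

doc = nlp("Day 4 of covid-19: fever is gone but still have loss of taste")
print([(ent.text, ent.label_) for ent in doc.ents])
```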
12

Chang, Rong-Ching, Ashwin Rao, Qiankun Zhong, Magdalena Wojcieszak, and Kristina Lerman. "#RoeOverturned: Twitter Dataset on the Abortion Rights Controversy." Proceedings of the International AAAI Conference on Web and Social Media 17 (June 2, 2023): 997–1005. http://dx.doi.org/10.1609/icwsm.v17i1.22207.

Abstract:
On June 24, 2022, the United States Supreme Court overturned landmark rulings made in its 1973 verdict in Roe v. Wade. The justices, by way of a majority vote in Dobbs v. Jackson Women's Health Organization, decided that abortion was not a constitutional right and returned the issue of abortion to elected representatives. This decision triggered multiple protests and debates across the US, especially in the context of the midterm elections in November 2022. Given that many citizens use social media platforms to express their views and mobilize for collective action, and given that online debate has tangible effects on public opinion, political participation, news media coverage, and political decision-making, it is crucial to understand online discussions surrounding this topic. Toward this end, we present the first large-scale Twitter dataset collected on the abortion rights debate in the United States. We present a set of 74M tweets systematically collected over the course of one year from January 1, 2022 to January 6, 2023.
13

Vanitha, C. N., S. Malathy, S. A. Krishna, M. Vanitha, and Sathishkumar V. E. "Sentimental analysis using machine learning in Twitter dataset." Applied and Computational Engineering 6, no. 1 (June 14, 2023): 953–58. http://dx.doi.org/10.54254/2755-2721/6/20230966.

Abstract:
Sentiment analysis is also known as opinion mining. It is used for detecting people's different emotions through the feedback they give, in order to know whether the customer is satisfied with an organization's product, service, and so on. Nowadays, rating a product or service has become essential, and such ratings mean little without capturing the sentiments of the customer. Collecting sentiment data is therefore essential, since it helps improve the product or service, satisfy the customer, and increase sales. In this fast-changing world, Twitter is one of the most widely used applications, and it is mostly used by people to share their thoughts. This research analysis is done with the help of the views, likes, comments, and shares of a particular tweet, and the output of the analysis may be positive, negative, or neutral. Machine learning refers to methods in which a task is carried out by an AI system that predicts an output value from the given input data.
14

Seno, Danar Wido, and Arief Wibowo. "Analisis Sentimen Data Twitter Tentang Pasangan Capres-Cawapres Pemilu 2019 Dengan Metode Lexicon Based Dan Support Vector Machine." Jurnal Ilmiah FIFO 11, no. 2 (November 1, 2019): 144. http://dx.doi.org/10.22441/fifo.2019.v11i2.004.

Abstract:
The growth of user-written content on social media produces many new words and abbreviations on Twitter, which makes it increasingly difficult for sentiment analysis to reach high accuracy on Twitter text data. In this study, the authors conducted research on sentiment analysis of the pairs of candidates for President and Vice President of Indonesia in the 2019 elections. To obtain higher accuracy and accommodate the evolution of textual data on Twitter, the authors combined an unsupervised method (lexicon-based labeling) with a supervised method (Support Vector Machine). The study used Twitter data from October 2018, collected with search keywords containing the names of each pair of candidates for President and Vice President of the 2019 elections, totaling 800 records. From this dataset, the best accuracy obtained was 92.5% with an 80% training and 20% testing split, with per-class precision between 85.7% and 97.2% and per-class recall between 78.2% and 93.5%. With the lexicon-based method used to label the dataset, labeling the Support Vector Machine training data is no longer done manually, and the lexicon dictionary can be extended as content on Twitter evolves.
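The lexicon-labels-then-SVM idea reads naturally as a two-stage pipeline. The sketch below assumes a tiny made-up Indonesian lexicon and a handful of placeholder tweets, and uses scikit-learn's LinearSVC rather than the authors' exact setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative sentiment lexicon; the paper's lexicon can be extended as
# new Twitter vocabulary appears.
LEXICON = {"hebat": 1, "bagus": 1, "menang": 1, "buruk": -1, "gagal": -1, "bohong": -1}

def lexicon_label(tweet):
    """Label a tweet positive (1) or negative (0) by summing lexicon scores."""
    score = sum(LEXICON.get(tok, 0) for tok in tweet.lower().split())
    return 1 if score >= 0 else 0

tweets = ["kampanye pasangan ini hebat dan bagus", "debat tadi buruk dan penuh bohong",
          "program kerja bagus sekali", "janji gagal dan bohong lagi"]
auto_labels = [lexicon_label(t) for t in tweets]   # lexicon-based labelling replaces manual labelling

# The auto-labelled data then trains a supervised SVM classifier.
svm = make_pipeline(TfidfVectorizer(), LinearSVC())
svm.fit(tweets, auto_labels)
print(svm.predict(["hasil debat buruk"]))
```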
15

Totare, R. Y., Aishwarya Ahergawli, Abhijeet Girase, Ishwari Tale, and Ayushi Khanbard. "A Review on Twitter Sentiment Analysis Using ML." International Journal for Research in Applied Science and Engineering Technology 10, no. 12 (December 31, 2022): 1930–33. http://dx.doi.org/10.22214/ijraset.2022.48382.

Abstract:
Social media sites like Twitter have billions of people sharing their opinions day by day as tweets, and a tweet is a characteristically short and simple expression of human emotion. In this paper we therefore focus on sentiment analysis of Twitter data. Most of Twitter's existing sentiment analysis solutions consider only the textual information of Twitter messages and struggle to work well in the face of short and ambiguous Twitter messages. Recent studies show that the patterns by which feelings spread on Twitter have close relationships with the polarities of Twitter messages. This paper focuses on how to combine the textual information of Twitter messages with sentiment dissemination models to obtain better sentiment analysis performance on Twitter data. To this end, the proposed system first analyses the diffusion of feelings by studying a phenomenon called sentiment reversal and finds some interesting properties of such reversals. It then considers the interrelations between the textual information of Twitter messages and the patterns of sentiment diffusion, and applies random forest machine learning to predict the polarities of the feelings expressed in Twitter messages. As far as we know, this work is the first to use sentiment dissemination models to improve Twitter sentiment analysis. Numerous experiments on a real-world dataset show favourable results compared to state-of-the-art text-based analysis algorithms.
16

Neagu, Dan Claudiu, Andrei Bogdan Rus, Mihai Grec, Mihai Augustin Boroianu, Nicolae Bogdan, and Attila Gal. "Towards Sentiment Analysis for Romanian Twitter Content." Algorithms 15, no. 10 (September 28, 2022): 357. http://dx.doi.org/10.3390/a15100357.

Abstract:
With the increased popularity of social media platforms such as Twitter or Facebook, sentiment analysis (SA) over the microblogging content becomes of crucial importance. The literature reports good results for well-resourced languages such as English, Spanish or German, but open research space still exists for underrepresented languages such as Romanian, where there is a lack of public training datasets or pretrained word embeddings. The majority of research on Romanian SA tackles the issue in a binary classification manner (positive vs. negative), using a single public dataset which consists of product reviews. In this paper, we respond to the need for a media surveillance project to possess a custom multinomial SA classifier for usage in a restrictive and specific production setup. We describe in detail how such a classifier was built, with the help of an English dataset (containing around 15,000 tweets) translated to Romanian with a public translation service. We test the most popular classification methods that could be applied to SA, including standard machine learning, deep learning and BERT. As we could not find any results for multinomial sentiment classification (positive, negative and neutral) in Romanian, we set two benchmark accuracies of ≈78% using standard machine learning and ≈81% using BERT. Furthermore, we demonstrate that the automatic translation service does not downgrade the learning performance by comparing the accuracies achieved by the models trained on the original dataset with the models trained on the translated data.
17

Pfeffer, Jürgen, Daniel Matter, Kokil Jaidka, Onur Varol, Afra Mashhadi, Jana Lasser, Dennis Assenmacher, et al. "Just Another Day on Twitter: A Complete 24 Hours of Twitter Data." Proceedings of the International AAAI Conference on Web and Social Media 17 (June 2, 2023): 1073–81. http://dx.doi.org/10.1609/icwsm.v17i1.22215.

Abstract:
At the end of October 2022, Elon Musk concluded his acquisition of Twitter. In the weeks and months before that, several questions were publicly discussed that were not only of interest to the platform's future buyers, but also of high relevance to the Computational Social Science research community. For example, how many active users does the platform have? What percentage of accounts on the site are bots? And, what are the dominating topics and sub-topical spheres on the platform? In a globally coordinated effort of 80 scholars to shed light on these questions, and to offer a dataset that will equip other researchers to do the same, we have collected all 375 million tweets published within a 24-hour time period starting on September 21, 2022. To the best of our knowledge, this is the first complete 24-hour Twitter dataset that is available for the research community. With it, the present work aims to accomplish two goals. First, we seek to answer the aforementioned questions and provide descriptive metrics about Twitter that can serve as references for other researchers. Second, we create a baseline dataset for future research that can be used to study the potential impact of the platform's ownership change.
18

Suhavi, Asmit Kumar Singh, Udit Arora, Somyadeep Shrivastava, Aryaveer Singh, Rajiv Ratn Shah, and Ponnurangam Kumaraguru. "Twitter-STMHD: An Extensive User-Level Database of Multiple Mental Health Disorders." Proceedings of the International AAAI Conference on Web and Social Media 16 (May 31, 2022): 1182–91. http://dx.doi.org/10.1609/icwsm.v16i1.19368.

Abstract:
Social Media is equipped with the ability to track and quantify user behavior, establishing it as an appropriate resource for mental health studies. However, previous efforts in the area have been limited by the lack of data and contextually relevant information. There is a need for large-scale, well-labeled mental health datasets with fast reproducible methods to facilitate their heuristic growth. In this paper, we cater to this need by building the Twitter - Self-Reported Temporally-Contextual Mental Health Diagnosis Dataset (Twitter-STMHD), a large scale, user-level dataset grouped into 8 disorder categories and a companion class of control users. The dataset is 60% hand-annotated, which lead to the creation of high-precision self-reported diagnosis report patterns, used for the construction of the rest of the dataset. The dataset, instead of being a corpus of tweets, is a collection of user-profiles of those suffering from mental health disorders to provide a holistic view of the problem statement. By leveraging temporal information, the data for a given profile in the dataset has been collected for disease prevalence periods: onset of disorder, diagnosis and progression, along with a fourth period: COVID-19. This is the only and the largest dataset that captures the tweeting activity of users suffering from mental health disorders during the COVID-19 period.
19

Shirolkar, Ashwini Anandrao, and R. J. Deshmukh. "Finding Topic Experts in the Twitter Dataset Using LDA Algorithm." International Journal of Applied Evolutionary Computation 10, no. 2 (April 2019): 19–26. http://dx.doi.org/10.4018/ijaec.2019040103.

Abstract:
In microblogging services like Twitter, the expert-finding problem has gained increasing attention. Twitter is a type of social media that provides a publicly available way for users to publish 140-character short messages (tweets). However, previous expert-finding methods cannot be directly applied to Twitter. They generally rely on the assumption that all the documents associated with candidate experts contain tacit knowledge related to the expertise of those individuals, whereas such documents might not be directly associated with their expertise: a user who is not an expert may still publish or retweet a substantial number of tweets containing the topic words. Recently, several attempts have used the relations among users and Twitter lists for expert finding, but these strategies only partially utilize such relations. To address these issues, a probabilistic method is developed to jointly exploit three types of relations (follower relation, user-list relation, and list-list relation) for finding experts. LDA is used for identifying topic experts, and a semi-supervised graph-based ranking approach (SSGR) is used to calculate the global authority of users offline. Then the local relevance between users and the given query is computed, all users are ranked, and the top-N users with the highest ranking scores are returned. The proposed approach can therefore jointly exploit the different types of relations among users and lists to improve the accuracy of finding experts on a given topic on Twitter.
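Assuming LDA here refers to Latent Dirichlet Allocation topic modelling, a minimal sketch of ranking candidate experts by topic weight might look like the following. The user documents and the ranking rule are illustrative stand-ins for the paper's probabilistic model and its SSGR ranking over follower and list relations.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tweets aggregated per candidate expert (one "document" per user).
user_docs = {
    "user_a": "deep learning neural networks training gpu models",
    "user_b": "election campaign votes senate policy debate",
    "user_c": "neural networks transformers attention models research",
}

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(user_docs.values())

# Latent Dirichlet Allocation over user documents; each user gets a topic distribution.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)

# Rank users for topic 0 by their topic weight -- a crude stand-in for the paper's
# graph-based authority and relevance scoring.
ranking = sorted(zip(user_docs, theta[:, 0]), key=lambda p: p[1], reverse=True)
print(ranking)
```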
20

Jiang, Julie, Xiang Ren, and Emilio Ferrara. "Retweet-BERT: Political Leaning Detection Using Language Features and Information Diffusion on Social Networks." Proceedings of the International AAAI Conference on Web and Social Media 17 (June 2, 2023): 459–69. http://dx.doi.org/10.1609/icwsm.v17i1.22160.

Abstract:
Estimating the political leanings of social media users is a challenging and ever more pressing problem given the increase in social media consumption. We introduce Retweet-BERT, a simple and scalable model to estimate the political leanings of Twitter users. Retweet-BERT leverages the retweet network structure and the language used in users' profile descriptions. Our assumptions stem from patterns of networks and linguistics homophily among people who share similar ideologies. Retweet-BERT demonstrates competitive performance against other state-of-the-art baselines, achieving 96%-97% macro-F1 on two recent Twitter datasets (a COVID-19 dataset and a 2020 United States presidential elections dataset). We also perform manual validation to validate the performance of Retweet-BERT on users not in the training data. Finally, in a case study of COVID-19, we illustrate the presence of political echo chambers on Twitter and show that it exists primarily among right-leaning users. Our code is open-sourced and our data is publicly available.
21

Balli, Cagla, Mehmet Serdar Guzel, Erkan Bostanci, and Alok Mishra. "Sentimental Analysis of Twitter Users from Turkish Content with Natural Language Processing." Computational Intelligence and Neuroscience 2022 (April 13, 2022): 1–17. http://dx.doi.org/10.1155/2022/2455160.

Abstract:
Artificial Intelligence has guided technological progress in recent years; it has shown significant development with increased academic studies on Machine Learning and high demand for this field in industry. In addition to the day-by-day advancement of technology, the pandemic, which has been part of our lives since early 2020, has led to social media occupying a larger place in the lives of individuals. Social media posts have therefore become an excellent data source for sentiment analysis. The main contribution of this study is based on Natural Language Processing, one of the machine learning topics in the literature. Sentiment classification is a solid example of a machine learning task belonging to human-machine interaction, and it is essential to make the computer understand people's emotional situations with classifiers. There are a limited number of Turkish-language studies in the literature. Turkish has linguistic features that differ from English, and since Turkish is an agglutinative language, sentiment analysis in that language is challenging. This paper aims to perform sentiment analysis with several machine learning algorithms on Turkish-language datasets collected from Twitter. In this research, besides using the public dataset of Beyaz (2021) to get more general results, another dataset is created to understand the impact of the pandemic on people and to learn about public opinion. This custom dataset, namely SentimentSet (Balli 2021), consists of Turkish tweets filtered with words such as pandemic and corona and manually marked as positive, negative, or neutral; SentimentSet could also be used as a benchmark dataset in future research. Results show classification accuracy of up to ∼87% with test data from both datasets and the trained models, and up to ∼84% with a small “Sample Test Data” set generated by the same methods as the SentimentSet dataset. These results contribute to sentiment analysis specific to the Turkish language, which depends on language-specific characteristics.
22

Chinthamu, Narender, Satheesh Kumar Gooda, P. Shenbagavalli, N. Krishnamoorthy, and S. Tamil Selvan. "Detecting the Anti-Social Activity on Twitter using EGBDT with BCM." International Journal on Recent and Innovation Trends in Computing and Communication 11, no. 4s (April 3, 2023): 109–15. http://dx.doi.org/10.17762/ijritcc.v11i4s.6313.

Abstract:
The rise of social media and its consequences is a hot topic on research platforms. Twitter has drawn the attention of the research community in recent years due to the various qualities it possesses, including its open nature, which, unlike other platforms, allows visitors to see posts made by Twitter users without having to register. On Twitter, sentiment analysis of tweets is used for detecting anti-social activity events, which is one of the challenging tasks in existing works. Many classification algorithms are used to detect anti-social activities, but they obtain lower accuracy. Here, EGBDT (Enhanced Gradient-Boosted Decision Tree) is used to select the best features from the NSD dataset, which are given as input to BCM (Bayesian Certainty Method) for detecting anti-social activities. In this work, tweets from the NSD dataset are used for analyzing sentiment polarity, i.e. positive or negative. The efficiency of the proposed work is compared with SVM, KNN, and C4.5; from this analysis, the proposed EGBDT and BCM obtain better results than the other techniques.
23

Elmas, Tuğrulcan, Rebekah Overdorf, and Karl Aberer. "A Dataset of State-Censored Tweets." Proceedings of the International AAAI Conference on Web and Social Media 15 (May 22, 2021): 1009–15. http://dx.doi.org/10.1609/icwsm.v15i1.18124.

Abstract:
Many governments impose traditional censorship methods on social media platforms. Instead of removing such content completely, many social media companies, including Twitter, only withhold the content from the requesting country. This makes the content still accessible outside of the censored region, allowing for an excellent setting in which to study government censorship on social media. We mine such content using the Internet Archive's Twitter Stream Grab. We release a dataset of 583,437 tweets by 155,715 users that were censored between 2012 and July 2020. We also release 4,301 accounts that were censored in their entirety. Additionally, we release a set of 22,083,759 supplemental tweets made up of all tweets by users with at least one censored tweet as well as instances of other users retweeting the censored user. We provide an exploratory analysis of this dataset. Our dataset will not only aid in the study of government censorship but will also aid in studying hate speech detection and the effect of censorship on social media users. The dataset is publicly available at https://doi.org/10.5281/zenodo.4439509
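Mining withheld content from archived stream data boils down to filtering on Twitter's withholding metadata. The sketch below assumes gzipped JSON-lines input and a hypothetical file name; it is not the authors' actual collection code.

```python
import gzip
import json

def censored_tweets(path):
    """Yield tweets carrying Twitter's country-withholding metadata from one
    gzipped JSON-lines file of the Internet Archive's Twitter Stream Grab."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            tweet = json.loads(line)
            # 'withheld_in_countries' is the field Twitter sets on country-withheld content.
            if tweet.get("withheld_in_countries"):
                yield tweet["id_str"], tweet["withheld_in_countries"]

# Example usage with a hypothetical archive file name:
# for tweet_id, countries in censored_tweets("stream_2019-06-01.json.gz"):
#     print(tweet_id, countries)
```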
24

Alhazmi, Huda. "Arabic Twitter Conversation Dataset about the COVID-19 Vaccine." Data 7, no. 11 (November 4, 2022): 152. http://dx.doi.org/10.3390/data7110152.

Abstract:
The development and rollout of COVID-19 vaccination around the world offers hope for controlling the pandemic. People turned to social media such as Twitter seeking information or to voice their opinion. Therefore, mining such conversation can provide a rich source of data for different applications related to the COVID-19 vaccine. In this data article, we developed an Arabic Twitter dataset of 1.1 M Arabic posts regarding the COVID-19 vaccine. The dataset was streamed over one year, covering the period from January to December 2021. We considered a set of crawling keywords in the Arabic language related to the conversation about the vaccine. The dataset consists of seven databases that can be analyzed separately or merged for further analysis. The initial analysis depicts the embedded features within the posts, including hashtags, media, and the dynamic of replies and retweets. Further, the textual analysis reveals the most frequent words that can capture the trends of the discussions. The dataset was designed to facilitate research across different fields, such as social network analysis, information retrieval, health informatics, and social science.
25

Saval, Pradnya. "Categorizing and Labelling of Twitter Dataset using RCNN Model." International Journal for Research in Applied Science and Engineering Technology 7, no. 5 (May 31, 2019): 279–85. http://dx.doi.org/10.22214/ijraset.2019.5045.

26

Jeremy, Nicholaus Hendrik, Cristian Prasetyo, and Derwin Suhartono. "Identifying Personality Traits for Indonesian User from Twitter Dataset." INTERNATIONAL JOURNAL of FUZZY LOGIC and INTELLIGENT SYSTEMS 19, no. 4 (December 31, 2019): 283–89. http://dx.doi.org/10.5391/ijfis.2019.19.4.283.

27

Fajri, Faisal, Bambang Tutuko, and Sukemi Sukemi. "Membandingkan Nilai Akurasi BERT dan DistilBERT pada Dataset Twitter." JUSIFO (Jurnal Sistem Informasi) 8, no. 2 (December 31, 2022): 71–80. http://dx.doi.org/10.19109/jusifo.v8i2.13885.

Abstract:
The growth of digital media has been incredibly fast, which has made consuming information a challenging task. Social media processing aided by Machine Learning has been very helpful in the digital era. Sentiment analysis is a fundamental task in Natural Language Processing (NLP). Based on the increasing number of social media users, the amount of data stored in social media platforms is also growing rapidly. As a result, many researchers are conducting studies that utilize social media data. Opinion mining (OM) or Sentiment Analysis (SA) is one of the methods used to analyze information contained in text from social media. Until now, several other studies have attempted to predict Data Mining (DM) using remarkable data mining techniques. The objective of this research is to compare the accuracy values of BERT and DistilBERT. DistilBERT is a technique derived from BERT that provides speed and maximizes classification. The research findings indicate that the use of DistilBERT method resulted in an accuracy value of 97%, precision of 99%, recall of 99%, and f1-score of 99%, which is higher compared to BERT that yielded an accuracy value of 87%, precision of 91%, recall of 91%, and f1-score of 89%.
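A minimal Hugging Face Transformers sketch of the BERT-versus-DistilBERT comparison is shown below. The checkpoints, the tiny in-memory sample, and the evaluation on the training split are all illustrative assumptions rather than the paper's setup.

```python
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Tiny hypothetical labelled tweet sample (1 = positive, 0 = negative); the real
# study fine-tunes on a much larger Twitter dataset.
data = Dataset.from_dict({
    "text": ["layanan ini sangat bagus", "aplikasi sering error dan lambat"] * 8,
    "label": [1, 0] * 8,
})

def finetune(checkpoint):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    tokenized = data.map(
        lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=32),
        batched=True,
    )
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    args = TrainingArguments(output_dir=f"out-{checkpoint.split('/')[-1]}",
                             num_train_epochs=1, per_device_train_batch_size=8,
                             logging_steps=1, report_to=[])
    trainer = Trainer(model=model, args=args, train_dataset=tokenized)
    trainer.train()
    # Accuracy on the same tiny sample, purely to illustrate the comparison loop.
    preds = trainer.predict(tokenized)
    return (np.argmax(preds.predictions, axis=-1) == np.array(data["label"])).mean()

# Checkpoint names are placeholders; the paper's exact pretrained models are not specified here.
for ckpt in ["bert-base-multilingual-cased", "distilbert-base-multilingual-cased"]:
    print(ckpt, finetune(ckpt))
```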
28

Juanita, Safitri. "Analisis Sentimen Persepsi Masyarakat Terhadap Pemilu 2019 Pada Media Sosial Twitter Menggunakan Naive Bayes." JURNAL MEDIA INFORMATIKA BUDIDARMA 4, no. 3 (July 20, 2020): 552. http://dx.doi.org/10.30865/mib.v4i3.2140.

Abstract:
According to BAWASLU's evaluation, a variety of negative content about the supported candidate pairs flooded various social media pages, and some of this content leads to hoaxes and to issues of religion, race, and inter-group relations (SARA). One of the social media platforms used by the people of Indonesia is Twitter; according to Kompas.com, the number of daily Twitter users globally is claimed to be increasing, and Twitter's third-quarter 2019 financial report records that daily active users on the platform increased by 17 percent, to 145 million users. A sentiment analysis study is therefore needed that can capture patterns of community perception of the 2019 elections on Twitter, and this research is expected to help interested parties increase the voter participation rate in the next five years. The method uses Indonesian tweet data taken from 16 April 2018 to 16 April 2019; the data then went through preprocessing, text transformation, Indonesian stemming, attribute-class specification, and dictionary loading, followed by Naive Bayes classification using Weka. The study concludes that, according to the Naive Bayes classification, the 2019 election tweet dataset shows a negative perception pattern of 52%, much greater than the positive perception of 18%, while neutral perception, at 31%, is also higher than positive perception. The Naive Bayes classification accuracy is 81% on the training dataset and 76% on the testing dataset; the average precision is 86.65% for positive sentiment, 77.15% for negative sentiment, and 80.95% for neutral sentiment, while the average recall is 36.8% for positive sentiment, 93.2% for negative sentiment, and 86.8% for neutral sentiment.
29

Noor, Ibrahim Moge. "Sentiment Analysis on New Currency in Kenya Using Twitter Dataset." Proceeding International Conference on Science and Engineering 3 (April 30, 2020): 237–40. http://dx.doi.org/10.14421/icse.v3.503.

Abstract:
Social media sites have recently become popular; it is clear that they have a major influence in society, and almost one third of the entire world is on social media. They have become platforms where people express their feelings, share their ideas and wisdom, and give feedback on an event or a product, and with the help of new technology we now have an opportunity to analyse this content easily. Twitter is one of these sites, full of people's opinions, where one can track the sentiment expressed about different kinds of topics; instead of wasting time and energy on long surveys, advances in sentiment analysis allow us to collect a huge amount of opinion data. Sentiment analysis is one of the major areas of research interest nowadays. In this paper we focus on sentiment insight into the 2019 Kenya currency replacement. The Kenyan government announced that the country's currency was to be replaced with a new generation of banknotes, and ordered Kenyan citizens to return the old 1,000-shilling notes (about $10) to banks by 1 October 2019, in a bid to fight corruption and money laundering. Kenyan citizens expressed their reactions to the new banknotes. We perform sentiment analysis of the tweets using the Multinomial Naïve Bayes algorithm, utilizing data from the Twitter social media platform: 1,122 tweets collected during this period of demonetization using a web scraper with the help of Twitter advanced search.
30

Singhal, Shilpa. "Rumor Detection on Twitter." International Journal for Research in Applied Science and Engineering Technology 9, no. 8 (August 31, 2021): 2543–46. http://dx.doi.org/10.22214/ijraset.2021.37799.

Abstract:
Social media interaction, such as news spreading around the network, is a great source of information nowadays. Its low effort, easy access, and rapid dissemination of information lead people to seek out and consume news from social media, and Twitter is among the most popular real-time news sources, making it one of the most dominant news-spreading mediums. It is also known to cause extensive harm by spreading fake news among people. Online users are often vulnerable and rely on social media as their source of information without checking the veracity of what is being spread. This research contributes a system for detecting rumors about real-world events that propagate on Twitter and designs a prediction algorithm that trains the machine to predict whether given data is information or a rumor. The work identifies the useful features of a tweet. The dataset used is the PHEME dataset of known rumors and non-rumors. Afterwards, we compare various known machine learning algorithms such as decision tree, SVM, and random tree classifiers.
31

Alruily, Meshrif. "Issues of Dialectal Saudi Twitter Corpus." International Arab Journal of Information Technology 17, no. 3 (May 1, 2019): 367–74. http://dx.doi.org/10.34028/iajit/17/3/10.

Abstract:
Text mining research relies heavily on the availability of a suitable corpus. This paper presents a dialectal Saudi corpus that contains 207452 tweets generated by Saudi Twitter users. In addition, a comparison was carried out between the Saudi tweets dataset, an Egyptian Twitter corpus, and an Arabic top news raw corpus (representing Modern Standard Arabic (MSA)) in various aspects, such as the differences between formal and colloquial texts. Moreover, issues and phenomena such as shortening, concatenation, colloquial language, compounding, foreign language, spelling errors, and neologisms were investigated on this type of dataset.
32

Melotte, Sara, and Mayank Kejriwal. "A Geo-Tagged COVID-19 Twitter Dataset for 10 North American Metropolitan Areas over a 255-Day Period." Data 6, no. 6 (June 16, 2021): 64. http://dx.doi.org/10.3390/data6060064.

Abstract:
One of the unfortunate findings from the ongoing COVID-19 crisis is the disproportionate impact the crisis has had on people and communities who were already socioeconomically disadvantaged. It has, however, been difficult to study this issue at scale and in greater detail using social media platforms like Twitter. Several COVID-19 Twitter datasets have been released, but they have very broad scope, both topically and geographically. In this paper, we present a more controlled and compact dataset that can be used to answer a range of potential research questions (especially pertaining to computational social science) without requiring extensive preprocessing or tweet-hydration from the earlier datasets. The proposed dataset comprises tens of thousands of geotagged (and in many cases, reverse-geocoded) tweets originally collected over a 255-day period in 2020 over 10 metropolitan areas in North America. Since there are socioeconomic disparities within these cities (sometimes to an extreme extent, as witnessed in ‘inner city neighborhoods’ in some of these cities), the dataset can be used to assess such socioeconomic disparities from a social media lens, in addition to comparing and contrasting behavior across cities.
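Assigning geotagged tweets to metro areas can be sketched as a simple bounding-box lookup. The box coordinates and tweet points below are rough illustrative values, not the dataset's actual collection boundaries.

```python
# Approximate bounding boxes (lon_min, lat_min, lon_max, lat_max) for two metro
# areas; the coordinates are rough illustrative values only.
METRO_BBOXES = {
    "new_york": (-74.26, 40.49, -73.70, 40.92),
    "los_angeles": (-118.67, 33.70, -118.15, 34.34),
}

def assign_metro(lon, lat):
    """Return the metro area whose bounding box contains a geotagged tweet, if any."""
    for city, (lon_min, lat_min, lon_max, lat_max) in METRO_BBOXES.items():
        if lon_min <= lon <= lon_max and lat_min <= lat <= lat_max:
            return city
    return None

# Hypothetical geotagged tweets as (tweet_id, longitude, latitude).
tweets = [("1", -73.98, 40.75), ("2", -118.40, 34.05), ("3", -0.12, 51.50)]
for tweet_id, lon, lat in tweets:
    print(tweet_id, assign_metro(lon, lat))
```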
33

Putri, Kaswili Sriwenda, Iwan Rizal Setiawan, and Agung Pambudi. "ANALISIS SENTIMEN TERHADAP BRAND SKINCARE LOKAL MENGGUNAKAN NAÏVE BAYES CLASSIFIER." Technologia : Jurnal Ilmiah 14, no. 3 (July 4, 2023): 227. http://dx.doi.org/10.31602/tji.v14i3.11259.

Abstract:
Along with the increasing development of information technology and ever easier internet access, many people voice their opinions on social media, one of which is Twitter. One issue or topic frequently discussed on Twitter is skin care, especially reviews of skincare products from particular brands. Reviews and opinions on Twitter about skincare brands, especially local brands such as Avoskin, Azarine, and Somethinc, were used as data sources to determine public perception of these local skincare brands. The tweet data used were divided into three datasets based on reviews of the targeted brands: Avoskin, Azarine, and Somethinc. To obtain clear results, a classification process was carried out. One algorithm that can be used to classify such data is the naïve Bayes classifier, which classifies the data into two classes: positive and negative. Classification using the naïve Bayes classifier achieved an accuracy of 79% for the Avoskin dataset, 78% for the Azarine dataset, and 75% for the Somethinc dataset, while testing with k-fold cross-validation yielded 79% for the Avoskin and Somethinc datasets and 78% for the Azarine dataset.
34

Ray Chowdhury, Jishnu, Cornelia Caragea, and Doina Caragea. "On Identifying Hashtags in Disaster Twitter Data." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 01 (April 3, 2020): 498–506. http://dx.doi.org/10.1609/aaai.v34i01.5387.

Abstract:
Tweet hashtags have the potential to improve the search for information during disaster events. However, there is a large number of disaster-related tweets that do not have any user-provided hashtags. Moreover, only a small number of tweets that contain actionable hashtags are useful for disaster response. To facilitate progress on automatic identification (or extraction) of disaster hashtags for Twitter data, we construct a unique dataset of disaster-related tweets annotated with hashtags useful for filtering actionable information. Using this dataset, we further investigate Long Short-Term Memory-based models within a Multi-Task Learning framework. The best performing model achieves an F1-score as high as 92.22%. The dataset, code, and other resources are available on GitHub.
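A multi-task BiLSTM in the spirit of the models the abstract describes can be sketched in Keras as follows. The vocabulary size, sequence length, task heads, and loss weights are illustrative assumptions, not the authors' architecture.

```python
from tensorflow.keras import Model, layers

VOCAB_SIZE = 20000   # assumed vocabulary size
MAX_LEN = 40         # assumed maximum tweet length in tokens

tokens = layers.Input(shape=(MAX_LEN,), dtype="int32", name="tokens")
x = layers.Embedding(VOCAB_SIZE, 128)(tokens)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)

# Task 1: per-token tagging -- does this token belong to an actionable hashtag?
tag_out = layers.TimeDistributed(layers.Dense(2, activation="softmax"), name="hashtag_tags")(x)

# Task 2: tweet-level relevance, sharing the same BiLSTM encoder (the multi-task part).
pooled = layers.GlobalMaxPooling1D()(x)
rel_out = layers.Dense(1, activation="sigmoid", name="relevance")(pooled)

model = Model(tokens, [tag_out, rel_out])
model.compile(
    optimizer="adam",
    loss={"hashtag_tags": "sparse_categorical_crossentropy", "relevance": "binary_crossentropy"},
    loss_weights={"hashtag_tags": 1.0, "relevance": 0.5},
)
model.summary()
```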
35

Smetanin, Sergey. "RuSentiTweet: a sentiment analysis dataset of general domain tweets in Russian." PeerJ Computer Science 8 (July 19, 2022): e1039. http://dx.doi.org/10.7717/peerj-cs.1039.

Abstract:
The Russian language is still not as well-resourced as English, especially in the field of sentiment analysis of Twitter content. Though several sentiment analysis datasets of tweets in Russia exist, they all are either automatically annotated or manually annotated by one annotator. Thus, there is no inter-annotator agreement, or annotation may be focused on a specific domain. In this article, we present RuSentiTweet, a new sentiment analysis dataset of general domain tweets in Russian. RuSentiTweet is currently the largest in its class for Russian, with 13,392 tweets manually annotated with moderate inter-rater agreement into five classes: Positive, Neutral, Negative, Speech Act, and Skip. As a source of data, we used Twitter Stream Grab, a historical collection of tweets obtained from the general Twitter API stream, which provides a 1% sample of the public tweets. Additionally, we released a RuBERT-based sentiment classification model that achieved F1 = 0.6594 on the test subset.
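Since the abstract reports moderate inter-rater agreement over five classes, a small sketch of measuring such agreement with Cohen's kappa may be useful; the annotator labels below are invented for illustration and are not drawn from RuSentiTweet.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators over the same tweets, using the five
# RuSentiTweet classes.
CLASSES = ["positive", "neutral", "negative", "speech_act", "skip"]
annotator_1 = ["positive", "neutral", "negative", "neutral", "skip", "speech_act", "negative", "neutral"]
annotator_2 = ["positive", "neutral", "neutral", "neutral", "skip", "speech_act", "negative", "positive"]

# Cohen's kappa corrects raw agreement for chance; values around 0.4-0.6 are
# commonly read as "moderate" agreement.
kappa = cohen_kappa_score(annotator_1, annotator_2, labels=CLASSES)
print(f"Cohen's kappa = {kappa:.2f}")
```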
36

Abilov, Anton, Yiqing Hua, Hana Matatov, Ofra Amir, and Mor Naaman. "VoterFraud2020: a Multi-modal Dataset of Election Fraud Claims on Twitter." Proceedings of the International AAAI Conference on Web and Social Media 15 (May 22, 2021): 901–12. http://dx.doi.org/10.1609/icwsm.v15i1.18113.

Abstract:
The wide spread of unfounded election fraud claims surrounding the U.S. 2020 election had resulted in undermining of trust in the election, culminating in violence inside the U.S. capitol. Under these circumstances, it is critical to understand the discussions surrounding these claims on Twitter, a major platform where the claims were disseminated. To this end, we collected and released the VoterFraud2020 dataset, a multi-modal dataset with 7.6M tweets and 25.6M retweets from 2.6M users related to voter fraud claims. To make this data immediately useful for a diverse set of research projects, we further enhance the data with cluster labels computed from the retweet graph, each user's suspension status, and the perceptual hashes of tweeted images. The dataset also includes aggregate data for all external links and YouTube videos that appear in the tweets. Preliminary analyses of the data show that Twitter's user suspension actions mostly affected a specific community of voter fraud claim promoters, and exposes the most common URLs, images and YouTube videos shared in the data.
37

Ang, Chee Siang, and Ranjith Venkatachala. "Generalizability of Machine Learning to Categorize Various Mental Illness Using Social Media Activity Patterns." Societies 13, no. 5 (May 5, 2023): 117. http://dx.doi.org/10.3390/soc13050117.

Abstract:
Mental illness has recently become a global health issue, causing significant suffering in people’s lives and having a negative impact on productivity. In this study, we analyzed the generalization capacity of machine learning to classify various mental illnesses across multiple social media platforms (Twitter and Reddit). Language samples were gathered from Reddit and Twitter postings in discussion forums devoted to various forms of mental illness (anxiety, autism, schizophrenia, depression, bipolar disorder, and BPD). Following this process, information from 606,208 posts (Reddit) created by a total of 248,537 people and from 23,102,773 tweets was used for the analysis. We initially trained and tested machine learning models (CNN and Word2vec) using labeled Twitter datasets, and then we utilized the dataset from Reddit to assess the effectiveness of our trained models and vice versa. According to the experimental findings, the suggested method successfully classified mental illness in social media texts even when training datasets did not include keywords or when unrelated datasets were utilized for testing.
APA, Harvard, Vancouver, ISO, and other styles
38

Feizollah, Ali, Nor Badrul Anuar, Riyadh Mehdi, Ahmad Firdaus, and Ainin Sulaiman. "Understanding COVID-19 Halal Vaccination Discourse on Facebook and Twitter Using Aspect-Based Sentiment Analysis and Text Emotion Analysis." International Journal of Environmental Research and Public Health 19, no. 10 (May 21, 2022): 6269. http://dx.doi.org/10.3390/ijerph19106269.

Full text
Abstract:
The COVID-19 pandemic introduced unprecedented challenges for people and governments. Vaccines are an available solution to this pandemic. Recipients of the vaccines are of different ages, genders, and religions. Muslims follow specific Islamic guidelines that prohibit them from taking a vaccine with certain ingredients. This study aims at analyzing Facebook and Twitter data to understand the discourse related to halal vaccines using aspect-based sentiment analysis and text emotion analysis. We searched for the term “halal vaccine”, limited the timeline to the period between 1 January 2020 and 30 April 2021, and collected 6037 tweets and 3918 Facebook posts. We performed data preprocessing on the tweets and Facebook posts and built a Latent Dirichlet Allocation (LDA) model to identify topics. Calculating the sentiment for each topic was the next step. Finally, this study further investigates emotions in the data using the National Research Council of Canada Emotion Lexicon. Our analysis identified four topics in each of the Twitter and Facebook datasets. Two topics, “COVID-19 vaccine” and “halal vaccine”, are shared between the two datasets. The other two topics in the tweets are “halal certificate” and “must halal”, while “sinovac vaccine” and “ulema council” are the other two topics in the Facebook dataset. The sentiment analysis shows that sentiment toward the halal vaccine is mostly neutral in the Twitter data, whereas it is positive in the Facebook data. The emotion analysis indicates that trust is the most present of the top three emotions in both datasets, followed by anticipation and fear.
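A compact sketch of the LDA step described above, using the gensim library; the token lists stand in for preprocessed posts, and the choice of four topics mirrors the abstract rather than a tuned value.

```python
from gensim import corpora, models

docs = [  # toy stand-ins for preprocessed tweets/posts
    ["halal", "vaccine", "certificate", "ingredients"],
    ["covid", "vaccine", "rollout", "dose"],
    ["ulema", "council", "fatwa", "vaccine"],
    ["sinovac", "vaccine", "halal", "approval"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=4, passes=10, random_state=0)
for topic_id, terms in lda.print_topics(num_words=4):
    print(topic_id, terms)
```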
APA, Harvard, Vancouver, ISO, and other styles
39

Verawati, Ike, and Bagas Sonas Audit. "Algoritma Naïve Bayes Classifier Untuk Analisis Sentiment Pengguna Twitter Terhadap Provider By.u." JURNAL MEDIA INFORMATIKA BUDIDARMA 6, no. 3 (July 25, 2022): 1411. http://dx.doi.org/10.30865/mib.v6i3.4132.

Full text
Abstract:
The growth of the internet in recent years has made it easy for people to share their opinions about a product. As a new internet service provider, by.u has attracted many new users who share their opinions with one another, especially on Twitter. Motivated by this, the research applies sentiment analysis. The research stages consist of collecting data from Twitter, preprocessing the data, weighting terms with TF-IDF, and classifying with the Naïve Bayes Classifier algorithm. To obtain the best evaluation results, different proportions of training and test data were compared. Classification is performed automatically after the data are cleaned in the preprocessing step, producing two labels, positive and negative, and the labelled data are then used as training and test sets. The data are tested at three sizes, namely 1000, 2000, and 3000 records, with each size tested three times, and accuracy is measured with a confusion matrix. The highest accuracy, 85%, was obtained by the multinomial Naïve Bayes classifier.
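A minimal sketch of the TF-IDF plus multinomial Naïve Bayes pipeline outlined above, assuming scikit-learn and a labelled CSV of tweets; the file name, column names, and 80/20 split are assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

df = pd.read_csv("byu_tweets.csv")           # assumed columns: text, label (positive/negative)
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(X_train, y_train)

pred = pipeline.predict(X_test)
print(confusion_matrix(y_test, pred))        # evaluation via confusion matrix, as in the paper
print("accuracy:", accuracy_score(y_test, pred))
```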
APA, Harvard, Vancouver, ISO, and other styles
40

Jalal, Niyaz, and Kayhan Z. Ghafoor. "Machine Learning Algorithms for Detecting and Analyzing Social Bots Using a Novel Dataset." ARO-THE SCIENTIFIC JOURNAL OF KOYA UNIVERSITY 10, no. 2 (September 10, 2022): 11–21. http://dx.doi.org/10.14500/aro.101032.

Full text
Abstract:
Social media is an internet-based technology and an electronic form of communication that facilitates sharing ideas, documents, and personal information. Twitter is a microblogging platform and one of the most effective social services for posting microblogs, liking, commenting, sharing, and communicating with others. The problem we shed light on in this paper is the misuse of bots on Twitter. The purpose of bots is to automate specific repetitive tasks instead of human interaction. However, bots are misused to influence people’s minds by spreading rumors and conspiracies related to controversial topics. In this paper, we introduce a new benchmark built from 1.5M Twitter profiles. We train different supervised machine learning models on our benchmark to detect bots on Twitter. In addition to increasing benchmark scalability, various automatic feature selection methods are utilized to identify the most influential features and remove the less influential ones. Furthermore, over-/under-sampling is applied to reduce the effect of class imbalance in the benchmark. Finally, our benchmark is compared with other state-of-the-art benchmarks and achieves a 6% higher area under the curve than other datasets in the case of generalization, with model performance improved by at least 2% through over-/under-sampling.
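The kind of pipeline the abstract describes (automatic feature selection followed by over-/under-sampling before a classifier) can be sketched with scikit-learn and the imbalanced-learn package; the synthetic feature matrix below is a placeholder, not the authors' 1.5M-profile benchmark.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for profile features (bots are the minority class).
X, y = make_classification(n_samples=5000, n_features=30, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=15)),       # keep the most informative features
    ("over", SMOTE(random_state=0)),                # oversample the minority (bot) class
    ("under", RandomUnderSampler(random_state=0)),  # then undersample the majority class
    ("clf", RandomForestClassifier(random_state=0)),
])
pipe.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1]))
```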
APA, Harvard, Vancouver, ISO, and other styles
42

Noor, Ibrahim Moge, and Metin Turan. "Sentiment Analysis on New Currency in Kenya using Twitter Dataset." IJID (International Journal on Informatics for Development) 8, no. 2 (March 23, 2020): 81. http://dx.doi.org/10.14421/ijid.2019.08206.

Full text
Abstract:
Social media sites have recently become popular, and it is clear that they have a major influence on society. Twitter is one such site, full of people’s opinions, where one can track the sentiment expressed about different kinds of topics. Sentiment analysis is one of the major research areas of interest nowadays. In this paper, we focus on sentiment insight into the 2019 Kenyan currency replacement, in which Kenyan citizens expressed their reactions to the new banknotes. We perform sentiment analysis of tweets from Twitter using the Multinomial Naïve Bayes algorithm. Because the amount of data was limited, we split the dataset using k-fold cross-validation to obtain an unbiased estimate of model performance, achieving an average accuracy of 75.3%.
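A minimal sketch of k-fold cross-validation with a multinomial Naïve Bayes text classifier, as described above, assuming scikit-learn; the tiny inline corpus is illustrative and not the Kenyan-currency tweet data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["love the new notes", "old notes were better", "great design",
         "waste of money", "happy with the change", "terrible decision"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(model, texts, labels, cv=3)   # k = 3 folds for this toy corpus
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```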
APA, Harvard, Vancouver, ISO, and other styles
43

Alvarez-Melis, David, and Martin Saveski. "Topic Modeling in Twitter: Aggregating Tweets by Conversations." Proceedings of the International AAAI Conference on Web and Social Media 10, no. 1 (August 4, 2021): 519–22. http://dx.doi.org/10.1609/icwsm.v10i1.14817.

Full text
Abstract:
We propose a new pooling technique for topic modeling in Twitter, which groups together tweets occurring in the same user-to-user conversation. Under this scheme, tweets and their replies are aggregated into a single document and the users who posted them are considered co-authors. To compare this new scheme against existing ones, we train topic models using Latent Dirichlet Allocation (LDA) and the Author-Topic Model (ATM) on datasets consisting of tweets pooled according to the different methods. Using the underlying categories of the tweets in this dataset as a noisy ground truth, we show that this new technique outperforms other pooling methods in terms of clustering quality and document retrieval.
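The conversation-pooling idea can be sketched as follows: tweets that share a conversation thread are concatenated into one document before topic modelling. The conversation_id field and the gensim LDA step are assumptions about a typical implementation, not the authors' code.

```python
from collections import defaultdict
from gensim import corpora, models

tweets = [  # toy tokenized tweets; conversation_id links replies to their thread
    {"conversation_id": 1, "tokens": ["election", "results", "tonight"]},
    {"conversation_id": 1, "tokens": ["watching", "the", "results", "too"]},
    {"conversation_id": 2, "tokens": ["new", "phone", "battery", "life"]},
    {"conversation_id": 2, "tokens": ["battery", "lasts", "two", "days"]},
]

# Pool: one document per conversation, aggregating all tweets in that thread.
pooled = defaultdict(list)
for t in tweets:
    pooled[t["conversation_id"]].extend(t["tokens"])
docs = list(pooled.values())

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)
print(lda.print_topics(num_words=3))
```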
APA, Harvard, Vancouver, ISO, and other styles
44

Fattah, St Fajriah, and Purnawansyah. "Analisis sentimen terhadap Body Shaming pada Twitter menggunakan Metode Naïve Bayes Classifier." Indonesian Journal of Data and Science 3, no. 2 (July 31, 2022): 61–71. http://dx.doi.org/10.56705/ijodas.v3i2.46.

Full text
Abstract:
Twitter is one of the most popular forms of social media today. However, Twitter users frequently post comments that tend to offend other users with negative sentences. One common form of negative comment on Twitter is body shaming, i.e., negative remarks about someone’s physical appearance, such as being fat, flat-nosed, or skinny. Based on this body-shaming behaviour on Twitter, this study performs sentiment analysis using the Naïve Bayes Classifier method. The aim of the research is to measure the accuracy, precision, recall, and f-measure of the Naïve Bayes Classifier in sentiment analysis of body shaming on Twitter. The dataset is used to classify tweets as positive or negative. Classification performance (accuracy, precision, recall, and f-measure) is evaluated using the Naïve Bayes Classifier with a trigram feature model on a body-shaming tweet dataset of 908 records. With the trigram model, the results are 61% accuracy, 56% precision, 55% recall, and 55% f-measure.
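A short sketch, assuming scikit-learn, of measuring accuracy, precision, recall, and F-measure for a Naïve Bayes classifier with n-gram features up to trigrams (the paper's exact trigram feature definition is an assumption); the toy data below is not the 908-tweet dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["you look so fat", "great job today", "nobody likes your nose",
         "what a lovely photo", "so skinny it is ugly", "proud of you"]
labels = ["negative", "positive", "negative", "positive", "negative", "positive"]

X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.33,
                                          random_state=1, stratify=labels)

model = make_pipeline(CountVectorizer(ngram_range=(1, 3)), MultinomialNB())
model.fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te)))  # precision, recall, f1, accuracy
```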
APA, Harvard, Vancouver, ISO, and other styles
45

Malik, Muzamil, Waqar Aslam, Zahid Aslam, Abdullah Alharbi, Bader Alouffi, and Hafiz Tayyab Rauf. "A Performance Comparison of Unsupervised Techniques for Event Detection from Oscar Tweets." Computational Intelligence and Neuroscience 2022 (May 24, 2022): 1–14. http://dx.doi.org/10.1155/2022/5980043.

Full text
Abstract:
People’s lives are influenced by social media. It is an essential source for sharing news, raising awareness, detecting events, gauging people’s interests, etc. Social media covers a wide range of topics and events to be discussed. Extensive work has been published on capturing interesting events and insights from datasets, and many techniques have been presented to detect events from social media networks like Twitter. In text mining, most of the work is done on a specific dataset, and there is a need to present new datasets to analyse the performance and generic nature of Topic Detection and Tracking methods. Therefore, this paper publishes a dataset of a real-life event, the Oscars 2018, gathered from Twitter and makes a comparison of soft frequent pattern mining (SFPM), singular value decomposition and k-means (K-SVD), feature-pivot (Feat-p), document-pivot (Doc-p), and latent Dirichlet allocation (LDA). The dataset contains 2,160,738 tweets collected using a set of seed words; only English tweets are considered. All of the methods applied in this paper are unsupervised, and this area needs to be explored on different datasets. The Oscars 2018 dataset is evaluated using keyword precision (K-Prec), keyword recall (K-Rec), and topic recall (T-Rec) for detecting events of greater interest. The highest K-Prec, K-Rec, and T-Rec were achieved by SFPM, but they started to decrease as the number of clusters increased. The lowest performance was achieved by Feat-p in terms of all three metrics. Experiments on the Oscars 2018 dataset demonstrated that all the methods are generic in nature and produce meaningful clusters.
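Keyword precision and recall are typically computed as the overlap between the keywords a method extracts for a detected topic and a ground-truth keyword set, as in the hedged sketch below; the exact definitions used in the paper may differ.

```python
def keyword_precision_recall(detected: set[str], ground_truth: set[str]) -> tuple[float, float]:
    """Return (K-Prec, K-Rec) for one detected topic against one reference event."""
    if not detected or not ground_truth:
        return 0.0, 0.0
    overlap = len(detected & ground_truth)
    return overlap / len(detected), overlap / len(ground_truth)

detected = {"oscars", "best", "picture", "shape", "water"}   # keywords from one cluster
truth = {"oscars", "best", "picture", "winner"}              # annotated event keywords
prec, rec = keyword_precision_recall(detected, truth)
print(f"K-Prec={prec:.2f}, K-Rec={rec:.2f}")
```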
APA, Harvard, Vancouver, ISO, and other styles
46

Ramasamy, Lakshmana Kumar, Seifedine Kadry, Yunyoung Nam, and Maytham N. Meqdad. "Performance analysis of sentiments in Twitter dataset using SVM models." International Journal of Electrical and Computer Engineering (IJECE) 11, no. 3 (June 1, 2021): 2275. http://dx.doi.org/10.11591/ijece.v11i3.pp2275-2284.

Full text
Abstract:
Sentiment analysis is a current research topic pursued by many researchers using supervised machine learning algorithms. The analysis can be done on movie reviews, Twitter reviews, online product reviews, blogs, discussion forums, Myspace comments, and social networks. The Twitter dataset is analyzed using a support vector machine (SVM) classifier with various parameters. The content of each tweet is classified to determine whether it contains factual or opinion data, and deeper analysis is required to find the opinion in the tweets posted by individuals. The sentiment is classified into positive, negative, and neutral. From this classification and analysis, important decisions can be made to improve productivity. The performance of the SVM radial kernel, SVM linear grid, and SVM radial grid models was compared, and the SVM linear grid was found to perform better than the other SVM models.
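A minimal sketch, assuming scikit-learn, of comparing an RBF-kernel SVM with grid-searched linear and RBF SVMs on TF-IDF features, in the spirit of the comparison above; the file name, column names, and parameter grids are assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

df = pd.read_csv("tweets_labelled.csv")      # assumed columns: text, sentiment
X_tr_text, X_te_text, y_tr, y_te = train_test_split(df["text"], df["sentiment"], random_state=0)

vec = TfidfVectorizer(max_features=20_000)
X_tr, X_te = vec.fit_transform(X_tr_text), vec.transform(X_te_text)

models = {
    "svm_radial":      GridSearchCV(SVC(kernel="rbf"), {"C": [1]}, cv=3),
    "svm_linear_grid": GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10]}, cv=3),
    "svm_radial_grid": GridSearchCV(SVC(kernel="rbf"),
                                    {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}, cv=3),
}
for name, gs in models.items():
    gs.fit(X_tr, y_tr)
    print(name, "test accuracy:", round(gs.score(X_te, y_te), 3))
```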
APA, Harvard, Vancouver, ISO, and other styles
47

Shirolkar, Ashwini Anandrao, and R. J. Deshmukh. "Finding Topic Experts in the Twitter dataset using LDA Algorithm." International Journal of Computer Sciences and Engineering 6, no. 8 (August 31, 2018): 742–46. http://dx.doi.org/10.26438/ijcse/v6i8.742746.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

S Nair, Pramod, Nidhi Choubey, and Srinivasa Rao D. "A Hybrid Technique to Classify Trending Topic on Twitter Dataset." International Journal of Engineering and Technology 9, no. 5 (October 30, 2017): 3470–80. http://dx.doi.org/10.21817/ijet/2017/v9i5/170905006.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Toșa, Cristian, and Ari K. M. Tarigan. "Comparing sustainable product hashtags: Insights from a historical twitter dataset." Data in Brief 49 (August 2023): 109427. http://dx.doi.org/10.1016/j.dib.2023.109427.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Alam, Firoj, Umair Qazi, Muhammad Imran, and Ferda Ofli. "HumAID: Human-Annotated Disaster Incidents Data from Twitter with Deep Learning Benchmarks." Proceedings of the International AAAI Conference on Web and Social Media 15 (May 22, 2021): 933–42. http://dx.doi.org/10.1609/icwsm.v15i1.18116.

Full text
Abstract:
Social networks are widely used for information consumption and dissemination, especially during time-critical events such as natural disasters. Despite its significantly large volume, social media content is often too noisy for direct use in any application. Therefore, it is important to filter, categorize, and concisely summarize the available content to facilitate effective consumption and decision-making. To address such issues, automatic classification systems have been developed using supervised modeling approaches, thanks to earlier efforts on creating labeled datasets. However, existing datasets are limited in different aspects (e.g., size, presence of duplicates) and are less suitable to support more advanced and data-hungry deep learning models. In this paper, we present a new large-scale dataset with ~77K human-labeled tweets, sampled from a pool of ~24 million tweets across 19 disaster events that happened between 2016 and 2019. Moreover, we propose a data collection and sampling pipeline, which is important for social media data sampling for human annotation. We report multiclass classification results using classic and deep learning (fastText and transformer) based models to set the ground for future studies. The dataset and associated resources are publicly available at https://crisisnlp.qcri.org/humaid_dataset.html.
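A brief sketch of a fastText baseline of the kind reported above, using the fasttext Python package; the file paths are placeholders, and the __label__ prefix is the standard fastText training format rather than the authors' exact setup.

```python
import fasttext

# Each line of the training file looks like:
#   __label__rescue_volunteering people trapped on rooftops need boats
model = fasttext.train_supervised(input="humaid_train.txt", epoch=10, wordNgrams=2)

labels, probs = model.predict("flood water rising near the bridge, send help")
print(labels[0], float(probs[0]))

# Evaluate on a held-out file in the same format: returns (N, precision@1, recall@1).
print(model.test("humaid_test.txt"))
```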
APA, Harvard, Vancouver, ISO, and other styles