Journal articles on the topic 'Big Data, Machine Learning, Data Science, Apache Spark'




Consult the top 50 journal articles for your research on the topic 'Big Data, Machine Learning, Data Science, Apache Spark.'




1

Mutasher, Watheq Ghanim, and Abbas Fadhil Aljuboori. "Real Time Big Data Sentiment Analysis and Classification of Facebook." Webology 19, no. 1 (January 20, 2022): 1112–27. http://dx.doi.org/10.14704/web/v19i1/web19076.

Abstract:
Many people use Facebook to connect and share their views on various issues, and the majority of user-generated content consists of textual information. Because so much real data is produced by people posting their thoughts in real time on a range of everyday subjects, collecting and analysing these data, which can support political decision-making or public-opinion monitoring, is a worthwhile research project. This paper therefore analyses public text posts from the Facebook stream in real time within a Hadoop ecosystem, using Apache Spark with NLTK in Python. Posts are gathered from the Facebook API in real time and stored in a database, and Apache Spark is used to query the text partitions on each data node (machine) quickly. An Amazon cloud-based Hadoop cluster ecosystem is used to process the huge volume of data and to eliminate on-site hardware, IT support, and other operational difficulties, including the installation and configuration of Hadoop components such as the Hadoop Distributed File System and Apache Spark. Using a decision-dictionary principle, sentiment is labelled as positive, negative, or neutral, and two machine learning algorithms (naive Bayes and support vector machine) are executed to build a predictive model; the outcome demonstrates a high level of precision in sentiment analysis.
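The decision-dictionary principle described in this abstract can be illustrated with a minimal lexicon-based scorer in plain Python. This is only a sketch of the general technique, not the paper's implementation; the word lists here are illustrative placeholders, not the authors' actual dictionary.

```python
# Minimal lexicon-based sentiment labelling: count positive and negative
# dictionary hits in a post and label it by the sign of the difference.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "sad", "awful"}

def label_post(text: str) -> str:
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

In the paper's setting, a scorer like this would run inside a Spark transformation over partitioned posts, with the rule-based labels then used to train the naive Bayes and SVM models.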
2

Omar, Hoger Khayrolla, and Alaa Khalil Jumaa. "Distributed big data analysis using spark parallel data processing." Bulletin of Electrical Engineering and Informatics 11, no. 3 (June 1, 2022): 1505–15. http://dx.doi.org/10.11591/eei.v11i3.3187.

Abstract:
Nowadays, the big data marketplace is rising rapidly. The big challenge is finding a system that can store and handle a huge volume of data and then process that data to mine the hidden knowledge. This paper proposes a comprehensive system for improving big data analysis performance. It contains a fast big data processing engine using Apache Spark and a big data storage environment using Apache Hadoop. The system was tested on about 11 gigabytes of text data, collected from multiple sources, for sentiment analysis. Three different machine learning (ML) algorithms, all supported by the Spark ML package, are used in this system. The system programs were written in the Java and Scala programming languages, and the constructed model combines the classification algorithms and the pre-processing steps in the form of an ML pipeline. The proposed system was implemented with both centralized and distributed data processing. Moreover, several dataset-manipulation strategies were applied in the system tests to check which provides the best accuracy and time performance. The results showed that the system handles big data efficiently, achieving excellent accuracy with fast execution times, especially on the distributed data nodes.
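The ML-pipeline idea the abstract mentions (pre-processing stages chained ahead of a classifier) can be sketched as a toy analogue in plain Python. This is not Spark's actual `Pipeline` API; the stages and the trivial keyword classifier below are invented for illustration only.

```python
# Toy analogue of an ML pipeline: an ordered list of transformer stages,
# each rewriting the data, followed by a final estimator that is fit on
# the transformed output.
def lowercase(docs):
    return [d.lower() for d in docs]

def tokenize(docs):
    return [d.split() for d in docs]

class KeywordClassifier:
    """Trivial final stage: predicts 1 if any keyword seen in positive
    training documents appears in the input."""
    def fit(self, token_lists, labels):
        self.keywords = {t for toks, y in zip(token_lists, labels)
                         if y == 1 for t in toks}
        return self

    def predict(self, token_lists):
        return [1 if any(t in self.keywords for t in toks) else 0
                for toks in token_lists]

def run_pipeline(stages, estimator, docs, labels):
    for stage in stages:
        docs = stage(docs)          # apply each pre-processing stage in order
    return estimator.fit(docs, labels)
```

In Spark ML the same shape is expressed declaratively (e.g. tokenizer and feature stages followed by a classifier inside one `Pipeline`), which is what lets the paper swap classification algorithms while keeping the pre-processing fixed.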
3

Omar, Hoger Khayrolla, and Alaa Khalil Jumaa. "Big Data Analysis Using Apache Spark MLlib and Hadoop HDFS with Scala and Java." Kurdistan Journal of Applied Research 4, no. 1 (May 8, 2019): 7–14. http://dx.doi.org/10.24017/science.2019.1.2.

Abstract:
With today's technology revolution, big data is the phenomenon of the decade, and it has a significant impact on applied science trends. Finding a good big data tool is a pressing need at present. Hadoop is a capable big data analysis technology, but it is slow because the job results of each phase must be stored before the following phase starts, and because of replication delays. Apache Spark is another tool, developed and established as a leading model for big data analysis, with its innovative in-memory processing framework and high-level programming libraries for machine learning, efficient data processing, and more. This paper presents comparisons of the time performance of Scala and Java in Apache Spark MLlib. Many tests were run with supervised and unsupervised machine learning methods on big datasets, loading the datasets from both Hadoop HDFS and the local disk to identify the pros and cons of each manner and to discover the ideal dataset-loading configuration for the best execution. The results showed that Scala performs about 10% to 20% better than Java, depending on the algorithm type. The aim of the study is to analyse big data with the more suitable programming language and consequently gain better performance.
4

Wei, Chih-Chiang, and Tzu-Hao Chou. "Typhoon Quantitative Rainfall Prediction from Big Data Analytics by Using the Apache Hadoop Spark Parallel Computing Framework." Atmosphere 11, no. 8 (August 17, 2020): 870. http://dx.doi.org/10.3390/atmos11080870.

Abstract:
Situated in the main tracks of typhoons in the Northwestern Pacific Ocean, Taiwan frequently encounters disasters from heavy rainfall during typhoons. Accurate and timely typhoon rainfall prediction is an imperative topic that must be addressed. The purpose of this study was to develop a Hadoop Spark distributed framework based on big-data technology, to accelerate the computation of typhoon rainfall prediction models. This study used deep neural networks (DNNs) and multiple linear regressions (MLRs) in machine learning to establish rainfall prediction models and evaluate rainfall prediction accuracy. The Hadoop Spark distributed cluster-computing framework was the big-data technology used; it consisted of the Hadoop Distributed File System, the MapReduce framework, and Spark, a new-generation technology that improves the efficiency of distributed computing. The research area was Northern Taiwan, with four surface observation stations as the experimental sites, and the study collected 271 typhoon events (from 1961 to 2017). The following results were obtained: (1) in machine-learning computation, prediction errors increased with prediction duration in the DNN and MLR models; and (2) the Hadoop Spark framework was faster than the standalone systems (single I7 central processing unit (CPU) and single E3 CPU). When complex computation is required in a model (e.g., DNN model parameter calibration), the big-data-based Hadoop Spark framework can be used to establish highly efficient computation environments. In summary, this study successfully used the big-data Hadoop Spark framework with machine learning to develop rainfall prediction models with effectively improved computing efficiency. The proposed system can therefore address real-time typhoon rainfall prediction with high timeliness and accuracy.
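The MLR side of the study reduces, in its simplest one-predictor form, to an ordinary least-squares line fit. The sketch below shows that special case in plain Python as an illustration of the regression family used; the actual models use multiple predictors and run distributed on Spark.

```python
# Least-squares fit of y = a + b*x: the one-predictor special case of the
# multiple linear regression (MLR) rainfall prediction models.
def fit_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x          # intercept from the fitted slope
    return a, b

def predict(a, b, x):
    return a + b * x
```

With several predictors (e.g., multiple stations' observations), the same normal-equations idea generalizes to a matrix solve, which is where a distributed framework pays off for parameter calibration.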
5

Gupta, Madhuri, and Bharat Gupta. "Survey of Breast Cancer Detection Using Machine Learning Techniques in Big Data." Journal of Cases on Information Technology 21, no. 3 (July 2019): 80–92. http://dx.doi.org/10.4018/jcit.2019070106.

Abstract:
Cancer is a disease in which cells in the body grow and divide beyond control. Breast cancer is the second most common disease after lung cancer in women. Incredible advances in the health sciences and biotechnology have produced a huge amount of gene expression and clinical data, and machine learning techniques are improving the early detection of breast cancer from these data. The research work carried out focuses on the application of machine learning methods, data analytic techniques, tools, and frameworks in the field of breast cancer research with respect to cancer survivability, recurrence, prediction, and detection. Among the machine learning techniques widely used for the detection of breast cancer are the support vector machine and the artificial neural network. The Apache Spark data processing engine is found to be compatible with most of the machine learning frameworks.
6

Kamburugamuve, Supun, Pulasthi Wickramasinghe, Saliya Ekanayake, and Geoffrey C. Fox. "Anatomy of machine learning algorithm implementations in MPI, Spark, and Flink." International Journal of High Performance Computing Applications 32, no. 1 (July 2, 2017): 61–73. http://dx.doi.org/10.1177/1094342017712976.

Abstract:
With the ever-increasing need to analyze large amounts of data for useful insights, it is essential to develop complex parallel machine learning algorithms that can scale with the data and the number of parallel processes. These algorithms need to run on large data sets and be executed in minimal time in order to extract useful information in a time-constrained environment. The message passing interface (MPI) is a widely used model for developing such algorithms in the high-performance computing paradigm, while Apache Spark and Apache Flink are emerging as big data platforms for large-scale parallel machine learning. Even though these big data frameworks are designed differently, they follow the data flow model for execution and user APIs. The data flow model offers fundamentally different capabilities than the MPI execution model, but the same type of parallelism can be used in applications developed in both models. This article presents three distinct machine learning algorithms implemented in MPI, Spark, and Flink, compares their performance, and identifies strengths and weaknesses in each platform.
7

Özgüven, Yavuz, Utku Gönener, and Süleyman Eken. "A Dockerized big data architecture for sports analytics." Computer Science and Information Systems, no. 00 (2022): 10. http://dx.doi.org/10.2298/csis220118010o.

Abstract:
The big data revolution has had an impact on sports analytics as well. Many large corporations have begun to see the financial benefits of integrating sports analytics with big data. When we rely on central processing systems to aggregate and analyze large amounts of sports data from many sources, we compromise the accuracy and timeliness of the data. As a response to these issues, distributed systems come to the rescue, and the MapReduce paradigm holds promise for large-scale data analytics. We describe a big data architecture based on Docker containers with Apache Spark in this paper. We evaluate the architecture on four data-intensive case studies in sports analytics, including structured analysis, streaming, machine learning approaches, and graph-based analysis.
8

Concolato, Claude E., and Li M. Chen. "Data Science: A New Paradigm in the Age of Big-Data Science and Analytics." New Mathematics and Natural Computation 13, no. 02 (July 2017): 119–43. http://dx.doi.org/10.1142/s1793005717400038.

Abstract:
As an emergent field of inquiry, Data Science serves both the information technology world and the applied sciences. Data Science is a known term that tends to be synonymous with the term Big-Data; however, Data Science is the application of solutions found through mathematical and computational research, while Big-Data Science describes problems concerning the analysis of data with respect to volume, variation, and velocity (3V). Even though little theory has been developed for Data Science from a scientific perspective, there is still great opportunity for tremendous growth. Data Science is proving to be of paramount importance to the IT industry due to the increased need to understand the enormous amount of data being produced and in need of analysis. In short, data is everywhere, in various formats. Scientists are currently using statistical and AI analysis techniques like machine learning methods to understand massive sets of data, and naturally they attempt to find relationships among datasets. In the past 10 years, the development of software systems within the cloud computing paradigm, using tools like Hadoop and Apache Spark, has aided in making tremendous advances to Data Science as a discipline [Z. Sun, L. Sun and K. Strang, Big data analytics services for enhancing business intelligence, Journal of Computer Information Systems (2016), doi: 10.1080/08874417.2016.1220239]. These advances enabled both scientists and IT professionals to use cloud computing infrastructure to process petabytes of data on a daily basis. This is especially true for large private companies such as Walmart, Nvidia, and Google. This paper seeks to address pragmatic ways of looking at how Data Science, with respect to Big-Data Science, is practiced in the modern world. We also examine how mathematics and computer science help shape Big-Data Science's terrain. We highlight how mathematics and computer science have significantly impacted the development of Data Science approaches and tools, and how those approaches pose new questions that can drive new research areas within these core disciplines involving data analysis, machine learning, and visualization.
9

Myung, Rohyoung, and Sukyong Choi. "Machine-Learning Based Memory Prediction Model for Data Parallel Workloads in Apache Spark." Symmetry 13, no. 4 (April 16, 2021): 697. http://dx.doi.org/10.3390/sym13040697.

Abstract:
A lack of memory can lead to job failures or increase processing times for garbage collection. However, if too much memory is provided, the processing time is only marginally reduced, and most of the memory is wasted. Many big data processing tasks are executed in cloud environments. When renting virtual resources in a cloud environment, it is necessary to pay the cost according to the specifications of resources (i.e., the number of virtual cores and the size of memory), as well as rental time. In this paper, given the type of workload and volume of the input data, we analyze the memory usage pattern and derive the efficient memory size of data-parallel workloads in Apache Spark. Then, we propose a machine-learning-based prediction model that determines the efficient memory for a given workload and data. To determine the validity of the proposed model, we applied it to data-parallel workloads which include a deep learning model. The predicted memory values were in close agreement with the actual amount of required memory. Additionally, the whole building time for the proposed model requires a maximum of 44% of the total execution time of a data-parallel workload. The proposed model can improve memory efficiency up to 1.89 times compared with the vanilla Spark setting.
10

Hussin, Sahar K., Salah M. Abdelmageid, Adel Alkhalil, Yasser M. Omar, Mahmoud I. Marie, and Rabie A. Ramadan. "Handling Imbalance Classification Virtual Screening Big Data Using Machine Learning Algorithms." Complexity 2021 (January 28, 2021): 1–15. http://dx.doi.org/10.1155/2021/6675279.

Abstract:
Virtual screening is the most critical process in drug discovery, and it relies on machine learning to facilitate the screening process. It enables the discovery of molecules that bind to a specific protein to form a drug. Despite its benefits, virtual screening generates enormous data and suffers from drawbacks such as high dimensionality and imbalance. This paper tackles data imbalance and aims to improve virtual screening accuracy, especially for the minority class. For a dataset built without considering the data's imbalanced nature, most classification methods tend to have high predictive accuracy for the majority category, while accuracy for the minority category is significantly poor. The paper proposes a K-means algorithm coupled with the Synthetic Minority Oversampling Technique (SMOTE) to overcome the problem of imbalanced datasets. The proposed algorithm is named KSMOTE. Using KSMOTE, minority data can be identified with high accuracy and detected with high precision. A large set of experiments was run on Apache Spark using numeric PaDEL and fingerprint descriptors. The proposed solution was compared to both the no-sampling method and SMOTE on the same datasets. Experimental results showed that the proposed solution outperformed the other methods.
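The core SMOTE step underlying KSMOTE can be sketched in a few lines of plain Python: a synthetic minority sample is placed at a random point on the segment between a minority point and one of its minority-class neighbours. This is an illustrative sketch of the general technique, not the paper's Spark implementation, and the helper names are invented here.

```python
import random

def nearest(x, candidates):
    """Pick the candidate minority point closest to x (squared distance)."""
    return min(candidates,
               key=lambda c: sum((xi - ci) ** 2 for xi, ci in zip(x, c)))

def smote_sample(x, neighbor, rng=random):
    """Interpolate a synthetic sample between x and a minority neighbour."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]
```

KSMOTE's twist, per the abstract, is to first partition the data with K-means and then oversample within clusters, so synthetic points stay inside locally coherent regions of the minority class.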
11

Abdel-Fattah, Manal A., Nermin Abdelhakim Othman, and Nagwa Goher. "Predicting Chronic Kidney Disease Using Hybrid Machine Learning Based on Apache Spark." Computational Intelligence and Neuroscience 2022 (February 23, 2022): 1–12. http://dx.doi.org/10.1155/2022/9898831.

Abstract:
Chronic kidney disease (CKD) has become a widespread disease. It is associated with various serious risks, such as cardiovascular disease and end-stage renal disease, which can feasibly be avoided by early detection and treatment of people at risk. Machine learning algorithms are a significant source of assistance for medical scientists in diagnosing the disease accurately at its onset. Recently, big data platforms have been integrated with machine learning algorithms to add value to healthcare. This paper therefore proposes hybrid machine learning techniques that combine feature selection methods and machine learning classification algorithms on a big data platform (Apache Spark) to detect chronic kidney disease. The feature selection techniques, namely Relief-F and the chi-squared method, were applied to select the important features. Six machine learning classification algorithms were used in this research: decision tree (DT), logistic regression (LR), naive Bayes (NB), random forest (RF), support vector machine (SVM), and gradient-boosted trees (GBT classifier) as an ensemble learning algorithm. Four evaluation measures, namely accuracy, precision, recall, and F1-measure, were applied to validate the results. For each algorithm, the cross-validation and testing results were computed on the full feature set, the features selected by Relief-F, and the features selected by the chi-squared method. The results showed that the SVM, DT, and GBT classifiers achieved the best performance with the selected features, at 100% accuracy, and that overall, the features selected by Relief-F are better than the full feature set and the features selected by chi-square.
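The chi-squared feature-selection step can be illustrated, for the simplest binary-feature / binary-label case, with a plain-Python score function: features whose observed 2x2 counts deviate most from the independence expectation rank highest. This is a sketch of the statistic only; the paper's pipeline would use Spark's built-in selector over many-valued features.

```python
# Chi-squared score of one binary feature against a binary label:
# larger scores mean stronger feature/label dependence, i.e. a feature
# worth keeping before classification.
def chi2_score(feature, labels):
    obs = [[0, 0], [0, 0]]               # observed 2x2 contingency counts
    for f, y in zip(feature, labels):
        obs[f][y] += 1
    n = len(feature)
    row = [sum(obs[0]), sum(obs[1])]
    col = [obs[0][0] + obs[1][0], obs[0][1] + obs[1][1]]
    score = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n   # count under independence
            if expected:
                score += (obs[i][j] - expected) ** 2 / expected
    return score
```

A feature perfectly aligned with the label scores n (here 4), while a feature independent of the label scores 0, which is exactly the ordering a selector exploits.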
12

Dener, Murat, Gökçe Ok, and Abdullah Orman. "Malware Detection Using Memory Analysis Data in Big Data Environment." Applied Sciences 12, no. 17 (August 27, 2022): 8604. http://dx.doi.org/10.3390/app12178604.

Abstract:
Malware is a significant threat that has grown with the spread of technology, which makes detecting malware a critical issue. Static and dynamic methods are widely used in malware detection, but traditional static and dynamic methods may fall short against advanced malware. Data obtained through memory analysis can provide important insights into the behavior and patterns of malware, because malware leaves various traces in memory; for this reason, memory analysis is one of the approaches that should be studied for malware detection. In this study, the use of memory data in malware detection is proposed. Malware detection was carried out using various deep learning and machine learning approaches in a big data environment with memory data. The study was carried out with PySpark on the Apache Spark big data platform in Google Colaboratory, and experiments were performed on the balanced CIC-MalMem-2022 dataset. Binary classification was performed using the Random Forest, Decision Tree, Gradient-Boosted Tree, Logistic Regression, Naive Bayes, Linear Support Vector Machine, Multilayer Perceptron, Deep Feed-Forward Neural Network, and Long Short-Term Memory algorithms, and the performance of these algorithms was compared. The results were evaluated using the accuracy, F1-score, precision, recall, and AUC performance metrics. The most successful malware detection was obtained with the Logistic Regression algorithm, at an accuracy of 99.97%; the Gradient-Boosted Tree follows with 99.94% accuracy, while Naive Bayes showed the lowest performance, at 98.41%. Many of the other algorithms also achieved very successful results. According to these results, data obtained from memory analysis is very useful for detecting malware, and deep learning and machine learning approaches trained on memory datasets achieve very successful malware detection.
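The evaluation metrics this study reports (accuracy, precision, recall, F1) all derive from the four confusion-matrix counts. A minimal plain-Python sketch of that computation, independent of any particular classifier or of the paper's PySpark code:

```python
# Binary-classification metrics from true and predicted labels
# (1 = malware, 0 = benign), via the confusion-matrix counts.
def metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

On a balanced dataset such as CIC-MalMem-2022, accuracy and F1 track each other closely, which is why the study can report both without contradiction.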
13

Asaithambi, Suriya, Sitalakshmi Venkatraman, and Ramanathan Venkatraman. "Big Data and Personalisation for Non-Intrusive Smart Home Automation." Big Data and Cognitive Computing 5, no. 1 (January 30, 2021): 6. http://dx.doi.org/10.3390/bdcc5010006.

Abstract:
With the advent of the Internet of Things (IoT), many different smart home technologies are commercially available. However, the adoption of such technologies is slow as many of them are not cost-effective and focus on specific functions such as energy efficiency. Recently, IoT devices and sensors have been designed to enhance the quality of personal life by having the capability to generate continuous data streams that can be used to monitor and make inferences by the user. While smart home devices connect to the home Wi-Fi network, there are still compatibility issues between devices from different manufacturers. Smart devices get even smarter when they can communicate with and control each other. The information collected by one device can be shared with others for achieving an enhanced automation of their operations. This paper proposes a non-intrusive approach of integrating and collecting data from open standard IoT devices for personalised smart home automation using big data analytics and machine learning. We demonstrate the implementation of our proposed novel technology instantiation approach for achieving non-intrusive IoT based big data analytics with a use case of a smart home environment. We employ open-source frameworks such as Apache Spark, Apache NiFi and FB-Prophet along with popular vendor tech-stacks such as Azure and DataBricks.
14

Khan, Muhammad Ashfaq, Md Rezaul Karim, and Yangwoo Kim. "A Two-Stage Big Data Analytics Framework with Real World Applications Using Spark Machine Learning and Long Short-Term Memory Network." Symmetry 10, no. 10 (October 11, 2018): 485. http://dx.doi.org/10.3390/sym10100485.

Abstract:
Every day we experience unprecedented data growth from numerous sources, which contributes to big data in terms of volume, velocity, and variability. These datasets impose great challenges on analytics frameworks and computational resources, making it difficult to extract meaningful information from the overall analysis in a timely manner. Thus, developing an efficient big data analytics framework to harness these challenges is an important research topic. Consequently, to address these challenges by exploiting non-linear relationships in very large, high-dimensional datasets, machine learning (ML) and deep learning (DL) algorithms are being used in analytics frameworks. Apache Spark has been in use as one of the fastest big data processing engines, helping to solve iterative ML tasks through its distributed ML library, Spark MLlib. For real-world research problems, DL architectures such as Long Short-Term Memory (LSTM) networks are an effective approach to overcoming practical issues such as reduced accuracy, long-term sequence dependency, and vanishing and exploding gradients in conventional deep architectures. In this paper, we propose an efficient analytics framework, technically a progressive machine learning technique that merges Spark-based linear models, a Multilayer Perceptron (MLP), and an LSTM in a two-stage cascade structure to enhance predictive accuracy. Our proposed architecture enables us to organize big data analytics in a scalable and efficient way. To show the effectiveness of our framework, we applied the cascading structure to two different real-life datasets to solve a multiclass and a binary classification problem, respectively. Experimental results show that our analytical framework outperforms state-of-the-art approaches with a high level of classification accuracy.
15

Hosseini, Behrooz, and Kourosh Kiani. "A Robust Distributed Big Data Clustering-based on Adaptive Density Partitioning using Apache Spark." Symmetry 10, no. 8 (August 15, 2018): 342. http://dx.doi.org/10.3390/sym10080342.

Abstract:
Unsupervised machine learning and knowledge discovery from large-scale datasets have recently attracted a lot of research interest. The present paper proposes a distributed big data clustering approach based on adaptive density estimation. The proposed method is developed on the Apache Spark framework and tested on several prevalent datasets. In the first step of the algorithm, the input data are divided into partitions using a Bayesian type of Locality Sensitive Hashing (LSH). Partitioning makes the processing fully parallel and much simpler by avoiding unneeded calculations. Each step of the proposed algorithm is completely independent of the others, and no serial bottleneck exists anywhere in the clustering procedure. Locality preservation also filters out outliers and enhances the robustness of the approach. Density is defined on the basis of an Ordered Weighted Averaging (OWA) distance, which makes the clusters more homogeneous. According to the density of each node, local density peaks are detected adaptively; by merging the local peaks, the final cluster centers are obtained, and the remaining data points become members of the cluster with the nearest center. The proposed method has been implemented and compared with similar recently published studies. The cluster validity indexes achieved by the proposed method show its superiority in precision and noise robustness, and comparison with similar approaches also shows its superiority in scalability, high performance, and low computation cost. The proposed method is a general clustering approach, and gene expression clustering is presented as a sample application.
16

Alexopoulos, Athanasios, Georgios Drakopoulos, Andreas Kanavos, Phivos Mylonas, and Gerasimos Vonitsanos. "Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark." Algorithms 13, no. 3 (March 24, 2020): 71. http://dx.doi.org/10.3390/a13030071.

Abstract:
At the dawn of the 10V, or big data, era, there are a considerable number of sources such as smart phones, IoT devices, social media, smart city sensors, and the health care system, all of which constitute but a small portion of the data lakes feeding the entire big data ecosystem. This 10V data growth poses two primary challenges, namely storing and processing. Concerning the latter, new frameworks have been developed, including distributed platforms such as the Hadoop ecosystem. Classification is a major machine learning task typically executed on distributed platforms, and as a consequence many algorithmic techniques have been developed tailored to these platforms. This article relies extensively, in two ways, on classifiers implemented in MLlib, the main machine learning library for the Hadoop ecosystem. First, a vast number of classifiers are applied to two datasets, namely Higgs and PAMAP. Second, a two-step classification is performed ab ovo on the same datasets: the singular value decomposition of the data matrix first determines a set of transformed attributes, which in turn drive the classifiers of MLlib. The twofold purpose of the proposed architecture is to reduce complexity while maintaining a similar, if not better, level of accuracy, recall, and F1. The intuition behind this approach stems from the engineering principle of breaking down complex problems into simpler and more manageable tasks. Experiments based on the same Spark cluster indicate that the proposed architecture outperforms the individual classifiers with respect to both complexity and the abovementioned metrics.
17

Kokkinos, Konstantinos, and Eftihia Nathanail. "Exploring an Ensemble of Textual Machine Learning Methodologies for Traffic Event Detection and Classification." Transport and Telecommunication Journal 21, no. 4 (December 1, 2020): 285–94. http://dx.doi.org/10.2478/ttj-2020-0023.

Abstract:
Late research has established the critical environmental, health and social impacts of traffic in highly populated urban regions. Apart from traffic monitoring, textual analysis of geo-located social media responses can provide an intelligent means in detecting and classifying traffic related events. This paper deals with the content analysis of Twitter textual data using an ensemble of supervised and unsupervised Machine Learning methods in order to cluster and properly classify traffic related events. Voluminous textual data was gathered using innovative Twitter APIs and managed by Big Data cloud methodologies via an Apache Spark system. Events were detected using a traffic related typology and the clustering K-Means model, where related event classification was achieved applying Support Vector Machines (SVM), Convolutional Neural Networks (CNN) and Long Short Term Memory (LSTM) networks. We provide experimental results for 2-class and 3-class classification examples indicating that the ensemble performs with accuracy and F-score reaching 98.5%.
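The K-means clustering step used here to group traffic-related events can be sketched in plain Python for the one-dimensional case. This is only an illustration of the algorithm family, not the paper's pipeline; real tweet clustering would operate on high-dimensional text features rather than the scalar values used below.

```python
# One-dimensional K-means: assign each point to its nearest center,
# then move each center to the mean of its assigned points, repeating
# for a fixed number of iterations.
def kmeans_1d(points, centers, iters=10):
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters
```

In the paper's setting the clusters produced at this stage seed the event typology, and the supervised models (SVM, CNN, LSTM) then classify events within it.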
18

Cerquitelli, Tania, Giovanni Malnati, and Daniele Apiletti. "Exploiting Scalable Machine-Learning Distributed Frameworks to Forecast Power Consumption of Buildings." Energies 12, no. 15 (July 31, 2019): 2933. http://dx.doi.org/10.3390/en12152933.

Abstract:
The pervasive and increasing deployment of smart meters allows the collection of a huge amount of fine-grained energy data in different urban scenarios. The analysis of such data is challenging and opens up a variety of interesting new research issues across the energy and computer science research areas. The key role of computer scientists is to provide energy researchers and practitioners with cutting-edge, scalable analytics engines that effectively support their daily research activities, hence fostering and leveraging data-driven approaches. This paper presents SPEC, a scalable and distributed engine to predict building-specific power consumption. SPEC addresses the full analytic stack and exploits a data stream approach over sliding time windows to train a prediction model tailored to each building. The model predicts the upcoming power consumption at a time instant in the near future. SPEC integrates different machine learning approaches, specifically ridge regression, artificial neural networks, and random forest regression, to predict fine-grained values of power consumption, and a classification model, the random forest classifier, to forecast a coarse consumption level. SPEC exploits state-of-the-art distributed computing frameworks to address the big data challenges of harvesting energy data: the current implementation runs on Apache Spark, the most widespread high-performance data-processing platform, and can natively scale to huge datasets. As a case study, SPEC has been tested on real data from a heating distribution network and on power consumption data collected in a major Italian city. Experimental results demonstrate the effectiveness of SPEC in forecasting both fine-grained values and coarse levels of the power consumption of buildings.
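The sliding-time-window training scheme the abstract describes can be sketched as a plain-Python feature builder: each example pairs the last `w` readings with the next reading as its target. This is an illustrative sketch of the windowing idea only, not SPEC's implementation.

```python
# Sliding-window dataset builder for near-future prediction: each
# training example is (last w readings, next reading).
def sliding_windows(series, w):
    return [(series[i:i + w], series[i + w])
            for i in range(len(series) - w)]
```

The resulting (window, target) pairs are exactly the shape that regression models such as ridge regression or a random forest regressor consume, with one model trained per building over its own stream.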
APA, Harvard, Vancouver, ISO, and other styles
19

Munawar, Hafiz Suliman, Siddra Qayyum, Fahim Ullah, and Samad Sepasgozar. "Big Data and Its Applications in Smart Real Estate and the Disaster Management Life Cycle: A Systematic Analysis." Big Data and Cognitive Computing 4, no. 2 (March 26, 2020): 4. http://dx.doi.org/10.3390/bdcc4020004.

Full text
Abstract:
Big data is the concept of enormous amounts of data being generated daily in different fields due to the increased use of technology and internet sources. Despite the various advancements and the hopes of better understanding, big data management and analysis remain a challenge, calling for more rigorous and detailed research, as well as the identification of methods and ways in which big data could be tackled and put to good use. The existing research falls short in discussing and evaluating the pertinent tools and technologies to analyze big data in an efficient manner, which calls for a comprehensive and holistic analysis of the published articles to summarize the concept of big data and see field-specific applications. To address this gap and keep a recent focus, research articles published in the last decade, belonging to top-tier and high-impact journals, were retrieved using the search engines of Google Scholar, Scopus, and Web of Science and narrowed down to a set of 139 relevant research articles. Different analyses were conducted on the retrieved papers including bibliometric analysis, keywords analysis, big data search trends, and the authors' names, countries, and affiliated institutes contributing the most to the field of big data. The comparative analyses show that, conceptually, big data lies at the intersection of the storage, statistics, technology, and research fields and emerged as an amalgam of these four fields with interlinked aspects such as data hosting and computing, data management, data refining, data patterns, and machine learning. The results further show that the major characteristics of big data can be summarized using the seven Vs, which include variety, volume, variability, value, visualization, veracity, and velocity.
Furthermore, the existing methods for big data analysis, their shortcomings, and the possible directions that could be taken to ensure data analysis tools are upgraded to be fast and efficient were also explored. The major challenges in handling big data include efficient storage, retrieval, analysis, and visualization of large heterogeneous data, which can be tackled through authentication such as Kerberos and encrypted files, logging of attacks, secure communication through Secure Sockets Layer (SSL) and Transport Layer Security (TLS), data imputation, building learning models, dividing computations into sub-tasks, checkpointing applications for recursive tasks, and using Solid State Drives (SSD) and Phase Change Material (PCM) for storage. In terms of frameworks for big data management, two frameworks exist, Hadoop and Apache Spark, which must be used simultaneously to capture the holistic essence of the data and make the analyses meaningful, swift, and speedy. Further field-specific applications of big data in two promising and integrated fields, i.e., smart real estate and disaster management, were investigated, and a framework for field-specific applications, as well as a merger of the two areas through big data, was highlighted. The proposed frameworks show that big data can tackle the ever-present issues of customer regret related to poor quality or lack of information in smart real estate, increasing customer satisfaction through an intermediate organization that can process and keep a check on the data being provided to the customers by the sellers and real estate managers. Similarly, for disaster risk management, data from social media, drones, multimedia, and search engines can be used to tackle natural disasters such as floods, bushfires, and earthquakes, as well as to plan emergency responses.
In addition, a merger framework for smart real estate and disaster risk management shows that big data generated from smart real estate in the form of occupant data, facilities management, and building integration and maintenance can be shared with disaster risk management and emergency response teams to help prevent, prepare for, respond to, or recover from disasters.
APA, Harvard, Vancouver, ISO, and other styles
20

Bandi, Raswitha, J. Amudhavel, and R. Karthik. "Machine Learning with PySpark - Review." Indonesian Journal of Electrical Engineering and Computer Science 12, no. 1 (October 1, 2018): 102. http://dx.doi.org/10.11591/ijeecs.v12.i1.pp102-106.

Full text
Abstract:
<p>Apache Spark is a reasonable distributed, memory-based computing system for machine learning. Spark is superior in computing performance compared with Hadoop. Apache Spark is a fast, easy-to-use engine for handling big data that has built-in modules for machine learning, streaming, SQL, and graph processing. We can easily apply machine learning algorithms to big data using Spark and its machine learning library MLlib, and this can be made simpler still by using the Python API, PySpark. This paper presents a study of how to develop machine learning algorithms in PySpark.</p>
APA, Harvard, Vancouver, ISO, and other styles
21

Malleswari, M., R. J. Manira, Praveen Kumar, and Murugan . "Comparative Analysis of Machine Learning Techniques to Identify Churn for Telecom Data." International Journal of Engineering & Technology 7, no. 3.34 (September 1, 2018): 291. http://dx.doi.org/10.14419/ijet.v7i3.34.19210.

Full text
Abstract:
Big data analytics has been the focus of large-scale data processing, and machine learning on big data has a great future in prediction. Churn prediction is one of its sub-domains; preventing customer attrition, especially in telecom, is its main advantage. Churn is a day-to-day affair involving millions, so a solution that prevents customer attrition can save a lot. This paper proposes a comparison of three machine learning techniques, the Decision tree, Random Forest, and Gradient Boosted tree algorithms, using Apache Spark. Apache Spark is a data processing engine used in big data that provides in-memory processing, so the processing speed is higher. The analysis is made by extracting the features of the data set and training the model. Scala, which combines object-oriented and functional programming, is a powerful programming language. The analysis is implemented using Apache Spark and modelling is done using Scala ML. The accuracy of the Decision tree model came out at 86%, the Random Forest model at 87%, and the Gradient Boosted tree at 85%.
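All three tree-based models compared in the paper grow from the same split criterion. The following toy sketch (synthetic churn data and an invented feature name, not the paper's dataset) shows how a single tree node picks a threshold by minimizing weighted Gini impurity:

```python
# Toy illustration of the split criterion behind Decision Tree / Random
# Forest / Gradient Boosted Tree: pick the feature threshold minimizing
# weighted Gini impurity. Data and feature name are invented.

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n              # fraction of churners (label 1)
    return 1.0 - p * p - (1 - p) * (1 - p)

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split."""
    best = (None, float("inf"))
    n = len(labels)
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best

monthly_minutes = [50, 60, 200, 220, 240]   # synthetic usage feature
churned = [1, 1, 0, 0, 0]                   # synthetic churn labels
print(best_split(monthly_minutes, churned))  # → (60, 0.0): a pure split
```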
APA, Harvard, Vancouver, ISO, and other styles
22

Jankatti, Santosh, Raghavendra B. K., Raghavendra S., and Meenakshi Meenakshi. "Performance evaluation of Map-reduce jar pig hive and spark with machine learning using big data." International Journal of Electrical and Computer Engineering (IJECE) 10, no. 4 (August 1, 2020): 3811. http://dx.doi.org/10.11591/ijece.v10i4.pp3811-3818.

Full text
Abstract:
Big data is one of the biggest challenges, as we need systems with huge processing power and good algorithms to make decisions. We need a Hadoop environment with Pig, Hive, machine learning, and Hadoop ecosystem components. The data comes from industries, the many devices and sensors around us, and from social media sites. According to McKinsey, there will be a shortage of 15,000,000 big data professionals by the end of 2020. There are many technologies to solve the problem of big data storage and processing, such as Apache Hadoop, Apache Spark, and Apache Kafka. Here we analyse the processing speed for 4 GB of data on CloudxLab with Hadoop MapReduce (with varying numbers of mappers and reducers), with Pig scripts and Hive queries, and in the Spark environment along with machine learning technology. From the results we can say that machine learning with Hadoop enhances processing performance along with Spark, and also that Spark is better than Hadoop MapReduce, Pig, and Hive; Spark with Hive and machine learning gives the best performance compared with Pig, Hive, and the Hadoop MapReduce jar.
APA, Harvard, Vancouver, ISO, and other styles
23

Aminudin, Aminudin, and Eko Budi Cahyono. "Pengukuran Performa Apache Spark dengan Library H2O Menggunakan Benchmark Hibench Berbasis Cloud Computing." Jurnal Teknologi Informasi dan Ilmu Komputer 6, no. 5 (October 8, 2019): 519. http://dx.doi.org/10.25126/jtiik.2019651520.

Full text
Abstract:
<p class="Judul2"><em>Apache Spark is a platform that can be used to process relatively large volumes of data (big data), with the ability to divide the data across predetermined clusters; this concept is called parallel computing. Apache Spark has advantages over similar frameworks such as Apache Hadoop: it is able to process data as a stream, meaning that data entering the Apache Spark environment can be processed immediately without waiting for other data to be collected. To enable machine learning processes in Apache Spark, this paper conducts an experiment integrating Apache Spark, which acts as a large-scale parallel data-processing environment, with the H2O library, which specifically handles data processing using machine learning algorithms. Based on the results of testing Apache Spark in a cloud computing environment, Apache Spark is able to process weather data obtained from the largest weather data archive, the NCDC data, with sizes up to 6 GB. The data is processed using one machine learning model, deep learning, by dividing it across several nodes formed in the cloud computing environment using the H2O library. This success can be seen from the tested parameters, including running time, throughput, average memory, and average CPU, obtained from the HiBench benchmark. All these values are influenced by the amount of data and the number of nodes.</em></p>
APA, Harvard, Vancouver, ISO, and other styles
24

Liu, K., and J. Boehm. "CLASSIFICATION OF BIG POINT CLOUD DATA USING CLOUD COMPUTING." ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XL-3/W3 (August 20, 2015): 553–57. http://dx.doi.org/10.5194/isprsarchives-xl-3-w3-553-2015.

Full text
Abstract:
Point cloud data plays a significant role in various geospatial applications as it conveys plentiful information which can be used for different types of analysis. Semantic analysis, an important one among them, aims to label points as belonging to different categories; in machine learning, this problem is called classification. In addition, processing point data is becoming more and more challenging due to the growing data volume. In this paper, we address point data classification in a big data context. The popular cluster computing framework Apache Spark is used throughout the experiments, and the promising results suggest a great potential of Apache Spark for large-scale point data processing.
APA, Harvard, Vancouver, ISO, and other styles
25

Armanur Rahman, Md, Abid Hossen, J. Hossen, Venkataseshaiah C, Thangavel Bhuvaneswari, and Aziza Sultana. "Towards machine learning-based self-tuning of Hadoop-Spark system." Indonesian Journal of Electrical Engineering and Computer Science 15, no. 2 (August 1, 2019): 1076. http://dx.doi.org/10.11591/ijeecs.v15.i2.pp1076-1085.

Full text
Abstract:
Apache Spark is an open-source distributed platform which uses the concept of distributed memory for processing big data. Spark has more than 180 predominant configuration parameters. Configuration settings directly control the efficiency of Apache Spark while processing big data, yet getting the best outcome is a challenging task as there are so many configuration parameters. Currently, these predominant parameters are tuned manually by trial and error. To overcome this manual tuning problem, this paper proposes and develops a self-tuning approach using machine learning, which can tune parameter values whenever required. The approach was implemented on a Dell server, and experiments were done on five different sizes of dataset and parameters. A comparison is provided to highlight the experimental results of the proposed approach against the default Spark configuration. The results demonstrate that execution is sped up by about 33% on average compared to the default configuration.
APA, Harvard, Vancouver, ISO, and other styles
26

Saeed, Mozamel M., Zaher Al Aghbari, and Mohammed Alsharidah. "Big data clustering techniques based on Spark: a literature review." PeerJ Computer Science 6 (November 30, 2020): e321. http://dx.doi.org/10.7717/peerj-cs.321.

Full text
Abstract:
A popular unsupervised learning method, known as clustering, is extensively used in data mining, machine learning and pattern recognition. The procedure involves grouping single and distinct points into groups in such a way that points within a group are similar to each other and dissimilar to points of other clusters. Traditional clustering methods are greatly challenged by the recent massive growth of data. Therefore, several research works have proposed novel designs for clustering methods that leverage the benefits of Big Data platforms, such as Apache Spark, which is designed for fast and distributed massive data processing. However, Spark-based clustering research is still in its early days. In this systematic survey, we investigate the existing Spark-based clustering methods in terms of their support for the characteristics of Big Data. Moreover, we propose a new taxonomy for Spark-based clustering methods. To the best of our knowledge, no survey has been conducted on Spark-based clustering of Big Data. Therefore, this survey aims to present a comprehensive summary of the previous studies in the field of Big Data clustering using Apache Spark during the span of 2010–2020. This survey also highlights new research directions in the field of clustering massive data.
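As a point of reference for the surveyed methods, the basic k-means loop that the Spark-based variants distribute can be sketched on a single machine. The 1-D data and initial centers below are synthetic, chosen purely for illustration:

```python
# Single-machine sketch of the k-means loop that Spark-based clustering
# methods parallelize: alternate assignment and centroid-update steps.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:                      # assignment step
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]   # update step
    return centers

pts = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]          # two obvious groups
print(sorted(kmeans_1d(pts, [0.0, 10.0])))     # → [1.0, 9.0]
```

In Spark MLlib the assignment step is what gets distributed across partitions, with the centroid update performed as an aggregation over the cluster.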
APA, Harvard, Vancouver, ISO, and other styles
27

Karsi, Redouane, Mounia Zaim, and Jamila El Alami. "Assessing naive Bayes and support vector machine performance in sentiment classification on a big data platform." IAES International Journal of Artificial Intelligence (IJ-AI) 10, no. 4 (December 1, 2021): 990. http://dx.doi.org/10.11591/ijai.v10.i4.pp990-996.

Full text
Abstract:
<p><span lang="EN-US">Nowadays, mining user reviews has become a very useful means for decision making in several areas. Traditionally, machine learning algorithms have been widely and effectively used to analyze users' opinions on a limited volume of data. In the case of massive data, powerful hardware resources (CPU, memory, and storage) are essential for dealing with the whole data processing pipeline, including collection, pre-processing, and learning, in an optimal time. Several big data technologies have emerged to efficiently process massive data, like Apache Spark, which is a distributed framework for data processing that provides libraries implementing several machine learning algorithms. In order to evaluate the performance of Apache Spark's machine learning library (MLlib) on a large volume of data, the classification accuracies and processing times of two machine learning algorithms implemented in Spark, naive Bayes and support vector machine (SVM), are compared to the performance achieved by the standard implementation of these two algorithms on large datasets of different sizes built from movie reviews. The results of our experiment show that the performance of classifiers running under Spark is higher than that of traditional ones and reaches an F-measure greater than 84%. At the same time, we found that under the Spark framework, the learning time is relatively low.</span></p>
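The F-measure that the paper reports can be reproduced directly from a confusion matrix. The counts below are invented for illustration, not taken from the paper:

```python
# How the reported F-measure is computed from true positives (tp),
# false positives (fp), and false negatives (fn). Counts are made up.

def f_measure(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(tp=420, fp=60, fn=80), 3))  # → 0.857
```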
APA, Harvard, Vancouver, ISO, and other styles
28

Boachie, Emmanuel, and Chunlin Li. "Big Data Processing with Apache Spark in University Institutions: Spark Streaming and Machine Learning Algorithm." International Journal of Continuing Engineering Education and Life-Long Learning 28, no. 4 (2018): 1. http://dx.doi.org/10.1504/ijceell.2018.10017171.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Boachie, Emmanuel, and Chunlin Li. "Big data processing with Apache Spark in university institutions: spark streaming and machine learning algorithm." International Journal of Continuing Engineering Education and Life-Long Learning 29, no. 1/2 (2019): 5. http://dx.doi.org/10.1504/ijceell.2019.099217.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Yamuna Bee, Mrs J., E. Naveena, Reshma Elizabeth Thomas, Arathi Chandran, Siva Subramania Raja M, and A. Akhilesh. "Intrusion Detection on Apache Spark Platform in Big data and Machine Learning Techniques." Journal of University of Shanghai for Science and Technology 23, no. 06 (June 22, 2021): 1257–66. http://dx.doi.org/10.51201/jusst/21/06427.

Full text
Abstract:
With the rise of cyber-physical power systems and the emerging danger of cyber-attacks, traditional power services face higher risks of being compromised, as vulnerabilities in cyber communications can be exploited to cause material damage. Therefore, adjustments need to be made in present control scheme design methods to moderate the impact of possible attacks on service quality. This paper focuses on the service of synchronized source-load participation in primary frequency regulation; a vulnerability study is performed by modeling the attack intrusion process, and a risk review of the service is made by further modeling the attack's impacts on the service's physical components. On that basis, the customary synchronized reserve allocation optimization model is adapted and the allocation scheme is corrected according to the cyber-attack impact. The proposed alteration methods are validated through a case study, showing efficiency in defending against cyber-attack impacts.
APA, Harvard, Vancouver, ISO, and other styles
31

Rodrigues, Anisha P., Roshan Fernandes, Adarsh Bhandary, Asha C. Shenoy, Ashwanth Shetty, and M. Anisha. "Real-Time Twitter Trend Analysis Using Big Data Analytics and Machine Learning Techniques." Wireless Communications and Mobile Computing 2021 (October 25, 2021): 1–13. http://dx.doi.org/10.1155/2021/3920325.

Full text
Abstract:
Twitter is a popular microblogging social medium, using which its users can share useful information. Keeping track of user postings and common hashtags allows us to understand what is happening around the world and what people's opinions on it are. As such, a Twitter trend analysis analyzes Twitter data and hashtags to determine what topics are being talked about the most on Twitter. Feature extraction and trend detection can be performed using machine learning algorithms. Big data tools and techniques are needed to extract relevant information from the continuous stream of data originating from Twitter. The objectives of this research work are to analyze the relative popularity of different hashtags and which field has the maximum share of voice, along with the common interests of the community. Twitter trends play an important role in the business field, marketing, politics, sports, and entertainment activities. The proposed work implemented the Twitter trend analysis using latent Dirichlet allocation, cosine similarity, K-means clustering, and Jaccard similarity techniques and compared the results with a Big Data Apache Spark tool implementation. The LDA technique for trend analysis resulted in an accuracy of 74%, and Jaccard in an accuracy of 83%, for static data. The results proved that real-time tweets are analyzed comparatively faster in the Big Data Apache Spark tool than in the normal execution environment.
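Among the techniques compared, Jaccard similarity is the simplest to illustrate: the overlap of two hashtag sets divided by their union. The two example "tweets" below are invented:

```python
# Jaccard similarity between hashtag sets, one of the trend-detection
# measures compared in the paper. Example hashtag sets are invented.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

t1 = {"#bigdata", "#spark", "#ml"}
t2 = {"#spark", "#ml", "#python"}
print(jaccard(t1, t2))  # → 0.5 (2 shared tags out of 4 distinct)
```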
APA, Harvard, Vancouver, ISO, and other styles
32

Boden, Christoph, Tilmann Rabl, and Volker Markl. "The Berlin Big Data Center (BBDC)." it - Information Technology 60, no. 5-6 (December 19, 2018): 321–26. http://dx.doi.org/10.1515/itit-2018-0016.

Full text
Abstract:
The last decade has been characterized by the collection and availability of unprecedented amounts of data due to rapidly decreasing storage costs and the omnipresence of sensors and data-producing global online services. In order to process and analyze this data deluge, novel distributed data processing systems resting on the data-flow paradigm, such as Apache Hadoop, Apache Spark, or Apache Flink, were built and have been scaled to tens of thousands of machines. However, writing efficient implementations of data analysis programs on these systems requires a deep understanding of systems programming, prohibiting large groups of data scientists and analysts from efficiently using this technology. In this article, we present some of the main achievements of the research carried out by the Berlin Big Data Center (BBDC). We introduce the two domain-specific languages Emma and LARA, which are deeply embedded in Scala and enable the declarative specification and automatic parallelization of data analysis programs, and the PEEL framework for transparent and reproducible benchmark experiments on distributed data processing systems; we also present approaches to foster the interpretability of machine learning models and finally provide an overview of the challenges to be addressed in the second phase of the BBDC.
APA, Harvard, Vancouver, ISO, and other styles
33

Khamaru, Ananda, and Tryambak Hiwarkar. "A Dynamics of Machine Learning on Map-Reduce Architecture for Enhancing Big Data Analysis Performance." International Journal of Computer Science and Mobile Computing 11, no. 11 (November 30, 2022): 109–30. http://dx.doi.org/10.47760/ijcsmc.2022.v11i11.009.

Full text
Abstract:
Big data is a fantastic resource for disseminating system-generated insights to external stakeholders. However, automation is required to manage such a large body of information, and this has spurred the development of data processing and machine learning tools. Just as in other fields of study and business, the ICT industry is serving and developing platforms and solutions to help professionals treat their knowledge and learn automatically. Large companies like Google and Microsoft, as well as the Apache Foundation's incubator, are the primary providers of these platforms. Spark is an open-source, unified platform for handling Big Data insights, providing a variety of methods for dealing with unstructured or structured text data, graph data, and real-time streaming data. Spark relies on MLlib to create customised ML algorithms; to parallelize a huge cluster of machines for data analytics, these methods require less memory, less processing time, and, to a large extent, hand-tuned specialized architecture. Data sets are analysed with machine learning methods including Linear Regression, Decision Tree, Random Forest, and Gradient Boosted Tree. The prediction model presented in this research is used to understand the data sets with the help of machine learning algorithms and to determine the best forecast value from the comparative study. One key goal of this study is to use the proposed model to make the most accurate forecast possible using machine learning methods. The proposed model uses the Apache Spark framework to perform a comparative analysis of various existing approaches that implement supervised and unsupervised techniques with the MapReduce approach, calculating the best prediction by comparing the time complexity of each method. This dissertation emphasizes the characteristics of datasets that are most useful for examining the most effective prediction using machine learning algorithms.
APA, Harvard, Vancouver, ISO, and other styles
34

Bagui, Sikha, Jason Simonds, Russell Plenkers, Timothy A. Bennett, and Subhash Bagui. "Classifying UNSW-NB15 Network Traffic in the Big Data Framework Using Random Forest in Spark." International Journal of Big Data Intelligence and Applications 2, no. 1 (January 2021): 1–23. http://dx.doi.org/10.4018/ijbdia.287617.

Full text
Abstract:
The focus of this work is on detecting and classifying attacks in network traffic using a binary as well as multi-class machine learning classifier, Random Forest, in a distributed Big Data environment using Apache Spark. The classifier is tested using the UNSW-NB15 dataset. Major problems in these types of datasets include high dimensionality and imbalanced data. To address the issue of high dimensionality, both Information Gain as well as Principal Components Analysis (PCA) were applied before training and testing the data using Random Forest in Apache Spark. Binary as well as multi-class Random Forest classifiers were compared in a distributed environment, with and without using PCA, using various number of Spark cores and Random Forest trees, in terms of performance time and statistical measures. The highest accuracy was obtained by the binary classifier at 99.94%, using 8 cores and 30 trees. This study obtained higher accuracy and lower FAR rates than previously achieved, with low testing times.
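Information Gain, one of the two dimensionality-reduction criteria applied before training, scores a feature by how much it reduces class entropy. The binary feature and labels below are a toy illustration, not the UNSW-NB15 data:

```python
# Information Gain of a discrete feature with respect to class labels:
# the entropy of the labels minus the weighted entropy after splitting
# on the feature's values. Toy data, not UNSW-NB15.
import math

def entropy(labels):
    n = len(labels)
    out = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        out -= p * math.log2(p)
    return out

def info_gain(feature, labels):
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

attack = [1, 1, 0, 0]            # class label (toy)
flag = [1, 1, 0, 0]              # perfectly predictive feature
print(info_gain(flag, attack))   # → 1.0, the full class entropy
```

Features whose gain is near zero carry little class information, which is why ranking by Information Gain (or projecting with PCA) shrinks the feature set before the Random Forest is trained.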
APA, Harvard, Vancouver, ISO, and other styles
35

Awan, Mazhar Javed, Umar Farooq, Hafiz Muhammad Aqeel Babar, Awais Yasin, Haitham Nobanee, Muzammil Hussain, Owais Hakeem, and Azlan Mohd Zain. "Real-Time DDoS Attack Detection System Using Big Data Approach." Sustainability 13, no. 19 (September 27, 2021): 10743. http://dx.doi.org/10.3390/su131910743.

Full text
Abstract:
Currently, Distributed Denial of Service (DDoS) attacks have become rampant and show up in various shapes and patterns; therefore they are not easy to detect and solve with previous solutions. Classification algorithms have been used in many studies aiming to detect and solve the DDoS attack. DDoS attacks are performed easily by using the weaknesses of networks and by generating requests for services for software. DDoS attacks are difficult to detect and mitigate in real time, but such a solution holds significant value as these attacks can cause big issues. This paper addresses the prediction of application-layer DDoS attacks in real time with different machine learning models. We applied two machine learning approaches, Random Forest (RF) and Multi-Layer Perceptron (MLP), through the Scikit ML library and the big data framework Spark ML library for the detection of Denial of Service (DoS) attacks. In addition to the detection of DoS attacks, we optimized the performance of the models by minimizing the prediction time as compared with other existing approaches using the big data framework (Spark ML). We achieved a mean accuracy of 99.5% with the models both with and without the big data approach. However, in training and testing time, the big data approach outperforms the non-big data approach because Spark performs its computations in memory and in a distributed manner. The minimum average training and testing times, 14.08 and 0.04 minutes respectively, were achieved using the big data tool (Apache Spark), while the maximum, 34.11 and 0.46 minutes respectively, were obtained using the non-big data approach. We can detect an attack in real time in a few milliseconds.
APA, Harvard, Vancouver, ISO, and other styles
36

Dayana, Ms, K. Keerthika, E. Bibilin Manuela, and J. Julie Christina. "Prediction of Cardiovascular Disease Using PySpark Techniques." International Journal for Research in Applied Science and Engineering Technology 10, no. 6 (June 30, 2022): 1228–33. http://dx.doi.org/10.22214/ijraset.2022.44018.

Full text
Abstract:
Day after day, human life is affected by different kinds of diseases, which is why life is in distress. Cardiovascular disease is a generic class of disease that is effective in spreading infections and notably affects the heart and veins. It is observed that vascular diseases are becoming common in old individuals as well as in children. It is very important to predict this kind of illness in its early phases; many kinds of tests are used for diagnosing these ailments. This implementation has been done by employing a big data tool, Apache Spark, and using Spark's MLlib and PySpark libraries that are integrated with it. Apache Spark is among the most widely used big data technologies, and it has a stack of libraries including Spark SQL, Spark MLlib, and Spark Streaming. This research work aims to build a prediction model to predict whether or not people have cardiovascular disease, using machine learning classification techniques that include logistic regression, decision tree, and random forest to enhance the performance of the models. The analyses of all applied machine learning models are compared. The results obtained are compared with the results of existing models in the same domain and found to be improved. Keywords: heart, blood vessels, Xampp server, data analytics, cardiovascular diseases.
APA, Harvard, Vancouver, ISO, and other styles
37

Kanavos, Andreas, Maria Trigka, Elias Dritsas, Gerasimos Vonitsanos, and Phivos Mylonas. "A Regularization-Based Big Data Framework for Winter Precipitation Forecasting on Streaming Data." Electronics 10, no. 16 (August 4, 2021): 1872. http://dx.doi.org/10.3390/electronics10161872.

Full text
Abstract:
In the current paper, we propose a machine learning forecasting model for the accurate prediction of qualitative weather information on winter precipitation types, deployed in the Apache Spark Streaming distributed framework. The proposed model receives, stores, and processes data in real time, in order to extract useful knowledge from different sensors related to weather data. In the following, the numerical weather prediction model aims at forecasting the weather type given three precipitation classes, namely rain, freezing rain, and snow, as recorded in the Automated Surface Observing System (ASOS) network. To depict the effectiveness of our proposed schema, a regularization technique for feature selection is implemented so as to avoid overfitting. Several classification models covering three different categorization methods, namely the Bayesian, decision tree, and meta/ensemble methods, have been investigated on a real dataset. The experimental analysis illustrates that the utilization of the regularization technique could offer a significant boost in forecasting performance.
APA, Harvard, Vancouver, ISO, and other styles
38

Lasri, Imane, Anouar Riadsolh, and Mourad Elbelkacemi. "Real-time Twitter Sentiment Analysis for Moroccan Universities using Machine Learning and Big Data Technologies." International Journal of Emerging Technologies in Learning (iJET) 18, no. 05 (March 7, 2023): 42–61. http://dx.doi.org/10.3991/ijet.v18i05.35959.

Full text
Abstract:
In recent years, sentiment analysis (SA) has raised the interest of researchers in several domains, including higher education. It can be applied to measure the quality of the services supplied by a higher education institution and to construct a university ranking mechanism from social media like Twitter. Hence, this study presents a novel system for Twitter sentiment prediction on Moroccan public universities in real time. It consists of two phases: an offline sentiment analysis phase and a real-time prediction phase. In the offline phase, the collected French tweets about twelve Moroccan universities were classified according to their sentiment into ‘positive’, ‘negative’, or ‘neutral’ using six machine learning algorithms (random forest, multinomial Naive Bayes classifier, logistic regression, decision tree, linear support vector classifier, and extreme gradient boosting) with the term frequency-inverse document frequency (TF-IDF) and count vectorizer feature extraction techniques. The results reveal that the random forest classifier coupled with TF-IDF obtained the best test accuracy of 90%. This model was then applied to real-time tweets. The real-time prediction pipeline comprises the Twitter streaming API for data collection, Apache Kafka for data ingestion, Apache Spark for real-time sentiment analysis, Elasticsearch for real-time data exploration, and Kibana for data visualization. The obtained results can be used by the Ministry of Higher Education, Scientific Research and Innovation of Morocco for the decision-making process.
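As a rough illustration of the TF-IDF weighting used here (not code from the study; the toy corpus is invented), each term's weight is its within-document frequency scaled by the log inverse document frequency, so terms appearing in every document score zero:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Term frequency-inverse document frequency weights for a
    tokenized corpus: tf(t, d) * log(N / df(t))."""
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # document frequency counts each doc once
    weighted = []
    for doc in corpus:
        tf = Counter(doc)
        weighted.append({t: (c / len(doc)) * math.log(n_docs / df[t])
                         for t, c in tf.items()})
    return weighted

docs = [["data", "spark"], ["data", "kafka"], ["data", "flink"]]
w = tf_idf(docs)
```

A term like "data" that occurs in all three documents is weighted 0 and effectively drops out, while rarer terms dominate the feature vectors fed to the classifiers.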
APA, Harvard, Vancouver, ISO, and other styles
39

Rabhi, Loubna, Noureddine Falih, Lekbir Afraites, and Belaid Bouikhalene. "Digital agriculture based on big data analytics: a focus on predictive irrigation for smart farming in Morocco." Indonesian Journal of Electrical Engineering and Computer Science 24, no. 1 (October 1, 2021): 581. http://dx.doi.org/10.11591/ijeecs.v24.i1.pp581-589.

Full text
Abstract:
Due to the spread of objects connected to the internet and objects connected to each other, agriculture nowadays involves a huge volume of exchanged data, called big data. Therefore, this paper discusses connected agriculture, or agriculture 4.0, instead of a traditional one. As irrigation is one of the foremost challenges in agriculture, it has also moved from manual watering towards smart watering based on big data analytics, where the farmer can water crops regularly and without wastage, even remotely. The method used in this paper combines big data, remote sensing, and data mining algorithms (neural network and support vector machine). In this paper, we interface the Databricks platform, based on the Apache Spark tool, to use machine learning to predict soil drought based on detecting soil moisture and temperature.
APA, Harvard, Vancouver, ISO, and other styles
40

Zhang, Xiongwei, Hager Saleh, Eman M. G. Younis, Radhya Sahal, and Abdelmgeid A. Ali. "Predicting Coronavirus Pandemic in Real-Time Using Machine Learning and Big Data Streaming System." Complexity 2020 (December 19, 2020): 1–10. http://dx.doi.org/10.1155/2020/6688912.

Full text
Abstract:
Twitter is a virtual social network where people share their posts and opinions about the current situation, such as the coronavirus pandemic. It is considered the most significant streaming data source for machine learning research in terms of analysis, prediction, knowledge extraction, and opinions. Sentiment analysis is a text analysis method that has gained further significance due to the emergence of social networks. Therefore, this paper introduces a real-time system for sentiment prediction on Twitter streaming data for tweets about the coronavirus pandemic. The proposed system aims to find the optimal machine learning model that obtains the best performance for coronavirus sentiment analysis prediction and then uses it in real time. The system has been developed into two components: an offline sentiment analysis component and an online prediction pipeline. For the offline component, a historical tweets dataset was collected between 23/01/2020 and 01/06/2020 and filtered by the #COVID-19 and #Coronavirus hashtags. Two feature extraction methods of textual data analysis, n-gram and TF-IDF, were used to extract the dataset’s essential features. Then, five regular machine learning algorithms were applied and compared: decision tree, logistic regression, k-nearest neighbors, random forest, and support vector machine, to select the best model for the online prediction component. The online prediction pipeline was developed using the Twitter Streaming API, Apache Kafka, and Apache Spark. The experimental results indicate that the RF model using the unigram feature extraction method has achieved the best performance, and it is used for sentiment prediction on Twitter streaming data for coronavirus.
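The n-gram feature extraction step can be sketched in a few lines (an illustrative helper, not the authors' code): a tweet's token list is mapped to a bag of contiguous unigrams and bigrams whose counts feed the classifiers.

```python
from collections import Counter

def ngram_features(tokens, n_values=(1, 2)):
    """Bag-of-n-grams counts for a tokenized tweet
    (unigrams and bigrams by default)."""
    feats = Counter()
    for n in n_values:
        feats.update(" ".join(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return feats

feats = ngram_features(["stay", "home", "stay", "home"])
```

The best model in the paper used unigrams only, which corresponds to calling the helper with `n_values=(1,)`.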
APA, Harvard, Vancouver, ISO, and other styles
41

Alnafessah, Ahmad, and Giuliano Casale. "Artificial neural networks based techniques for anomaly detection in Apache Spark." Cluster Computing 23, no. 2 (October 23, 2019): 1345–60. http://dx.doi.org/10.1007/s10586-019-02998-y.

Full text
Abstract:
Late detection and manual resolution of performance anomalies in Cloud Computing and Big Data systems may lead to performance violations and financial penalties. Motivated by this issue, we propose an artificial neural network based methodology for anomaly detection tailored to the Apache Spark in-memory processing platform. Apache Spark is widely adopted by industry because of its speed and generality; however, there is still a shortage of comprehensive performance anomaly detection methods applicable to this platform. We propose an artificial neural networks driven methodology to quickly sift through Spark logs data and operating system monitoring metrics to accurately detect and classify anomalous behaviors based on the Spark resilient distributed dataset characteristics. The proposed method is evaluated against three popular machine learning algorithms, decision trees, nearest neighbor, and support vector machine, as well as against four variants that consider different monitoring datasets. The results prove that our proposed method outperforms other methods, typically achieving 98–99% F-scores, and offering much greater accuracy than alternative techniques to detect both the period in which anomalies occurred and their type.
APA, Harvard, Vancouver, ISO, and other styles
42

Azeroual, Otmane, and Anastasija Nikiforova. "Apache Spark and MLlib-Based Intrusion Detection System or How the Big Data Technologies Can Secure the Data." Information 13, no. 2 (January 24, 2022): 58. http://dx.doi.org/10.3390/info13020058.

Full text
Abstract:
Since the turn of the millennium, the volume of data has increased significantly in both industries and scientific institutions. The processing of the volumes and variety of data we are dealing with is unlikely to be accomplished with conventional software solutions. Thus, new technologies belonging to the big data processing area, able to distribute and process data in a scalable way, are integrated into classical Business Intelligence (BI) systems or replace them. Furthermore, we can benefit from big data technologies to gain knowledge about security, which can be obtained from massive databases. The paper presents a security-relevant data analysis based on the big data analytics engine Apache Spark. A prototype intrusion detection system is developed, aimed at detecting data anomalies through machine learning by using the k-means algorithm for clustering analysis implemented in Spark's MLlib. The extraction of features to detect anomalies is currently challenging because the problem of detecting anomalies is not actively and exhaustively monitored. The detection of abnormal data can be effectuated by using relevant data that are already in companies’ and scientific organizations’ possession. Their interpretation and further processing in a continuous manner can sufficiently contribute to anomaly and intrusion detection.
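A compact sketch of the k-means-based anomaly scoring described above: fit centroids on normal records, then flag new records far from every centroid. This is a plain-Python stand-in for the Spark MLlib k-means used in the paper; the data, seeding strategy, and any threshold are invented for the illustration.

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means; centroids are seeded from the first k points
    for determinism (MLlib uses k-means|| initialization instead)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):
            if cl:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(xs) / len(cl) for xs in zip(*cl)]
    return centroids

def anomaly_score(point, centroids):
    """Distance to the nearest centroid; large values suggest an anomaly."""
    return min(math.dist(point, c) for c in centroids)

# two tight groups of "normal" feature vectors
normal = [(0.0, 0.0), (10.0, 10.0), (0.2, 0.1),
          (9.8, 10.1), (0.1, -0.2), (10.2, 9.9)]
centroids = kmeans(normal, k=2)
```

In practice a cutoff on the score (e.g. a high quantile of training-set scores) separates normal records from candidate intrusions.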
APA, Harvard, Vancouver, ISO, and other styles
43

Колмогорова, С. С., and Н. О. Голубятникова. "ON THE APPLICATION OF BIG DATA STRUCTURE REGULARIZATION IN A DISTRIBUTED EVALUATION SYSTEM FOR EMERGENCY PARAMETERS." ВЕСТНИК ВОРОНЕЖСКОГО ГОСУДАРСТВЕННОГО ТЕХНИЧЕСКОГО УНИВЕРСИТЕТА, no. 5 (November 17, 2022): 91–99. http://dx.doi.org/10.36622/vstu.2022.18.5.012.

Full text
Abstract:
Over the past decade, the number of machine learning methods, as well as their areas of application and approaches, has increased significantly due to the need for more accurate and reliable forecasting models. An approach to predicting the parameters of an electromagnetic field based on the Apache Spark Streaming distributed environment is considered. Initially, data from various real-time electromagnetic field sensors are processed into structured data, which serve as input to the prediction model, which focuses on predicting the type of value given several classes. In addition, to improve prediction performance, a regularization method was used for feature selection to reduce overfitting. The described architecture integrates Apache Kafka, Spark, and Cassandra and is recommended for applied monitoring and for predicting the state of systems of various profiles. Experimental analysis shows that the use of the regularization method increases the efficiency of the recurrent neural network in predicting the parameters of the electromagnetic field. The proposed model is able to use mixed applied data effectively, reduces the likelihood of model overfitting, and lowers computational costs.
APA, Harvard, Vancouver, ISO, and other styles
44

Elia, Domenico, Gioacchino Vino, Giacinto Donvito, and Marica Antonacci. "Developing a monitoring system for Cloud-based distributed data-centers." EPJ Web of Conferences 214 (2019): 08012. http://dx.doi.org/10.1051/epjconf/201921408012.

Full text
Abstract:
Nowadays, more and more datacenters cooperate with each other to achieve a common and more complex goal. New advanced functionalities are required to support experts during recovery and managing activities, like anomaly detection and fault pattern recognition. The proposed solution provides active support to problem solving for datacenter management teams by automatically providing the root cause of detected anomalies. The project has been developed in Bari using the datacenter ReCaS as a testbed. Big Data solutions have been selected to properly handle the complexity and size of the data. Features like open source, a big community, horizontal scalability, and high availability have been considered, and tools belonging to the Hadoop ecosystem have been selected. The collected information is sent to a combination of Apache Flume and Apache Kafka, used as the transport layer, in turn delivering data to databases and processing components. Apache Spark has been selected as the analysis component. Different kinds of databases have been considered in order to satisfy multiple requirements: Hadoop Distributed File System, Neo4j, InfluxDB, and Elasticsearch. Grafana and Kibana are used to show data in dedicated dashboards. The root-cause analysis engine has been implemented using custom machine learning algorithms. Finally, results are forwarded to experts by email or Slack, using Riemann.
APA, Harvard, Vancouver, ISO, and other styles
45

Alomari, Ebtesam, Iyad Katib, Aiiad Albeshri, Tan Yigitcanlar, and Rashid Mehmood. "Iktishaf+: A Big Data Tool with Automatic Labeling for Road Traffic Social Sensing and Event Detection Using Distributed Machine Learning." Sensors 21, no. 9 (April 24, 2021): 2993. http://dx.doi.org/10.3390/s21092993.

Full text
Abstract:
Digital societies could be characterized by their increasing desire to express themselves and interact with others. This is being realized through digital platforms such as social media, which have increasingly become convenient and inexpensive sensors compared to physical sensors in many sectors of smart societies. One such major sector is road transportation, which is the backbone of modern economies and costs 1.25 million deaths and 50 million human injuries globally every year. Cutting-edge work on big data-enabled social media analytics for transportation-related studies is limited. This paper brings a range of technologies together to detect road traffic-related events using big data and distributed machine learning. The most specific contribution of this research is an automatic labelling method for machine learning-based traffic-related event detection from Twitter data in the Arabic language. The proposed method has been implemented in a software tool called Iktishaf+ (an Arabic word meaning discovery) that is able to detect traffic events automatically from tweets in the Arabic language using distributed machine learning over Apache Spark. The tool is built using nine components and a range of technologies including Apache Spark, Parquet, and MongoDB. Iktishaf+ uses a light stemmer for the Arabic language developed by us. We also use in this work a location extractor developed by us that allows us to extract and visualize spatio-temporal information about the detected events. The specific data used in this work comprises 33.5 million tweets collected from Saudi Arabia using the Twitter API. Using support vector machines, naïve Bayes, and logistic regression-based classifiers, we are able to detect and validate several real events in Saudi Arabia without prior knowledge, including a fire in Jeddah, rains in Makkah, and an accident in Riyadh. The findings show the effectiveness of Twitter media in detecting important events with no prior knowledge about them.
APA, Harvard, Vancouver, ISO, and other styles
46

Almalki, Jameel. "A machine learning-based approach for sentiment analysis on distance learning from Arabic Tweets." PeerJ Computer Science 8 (July 26, 2022): e1047. http://dx.doi.org/10.7717/peerj-cs.1047.

Full text
Abstract:
Social media platforms such as Twitter, YouTube, Instagram and Facebook are leading sources of large datasets nowadays. Twitter’s data is one of the most reliable due to its privacy policy. Tweets have been used for sentiment analysis and to identify meaningful information within the dataset. Our study focused on the distance learning domain in Saudi Arabia by analyzing Arabic tweets about distance learning. This work proposes a model for analyzing people’s feedback using a Twitter dataset in the distance learning domain. The proposed model is based on the Apache Spark product to manage the large dataset. The proposed model uses the Twitter API to get the tweets as raw data. These tweets were stored in the Apache Spark server. A regex-based technique for preprocessing removed retweets, links, hashtags, English words and numbers, usernames, and emojis from the dataset. After that, a Logistic-based Regression model was trained on the pre-processed data. This Logistic Regression model, from the field of machine learning, was used to predict the sentiment inside the tweets. Finally, a Flask application was built for sentiment analysis of the Arabic tweets. The proposed model gives better results when compared to various applied techniques. The proposed model is evaluated on test data to calculate Accuracy, F1 Score, Precision, and Recall, obtaining scores of 91%, 90%, 90%, and 89%, respectively.
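The regex preprocessing stage described above might look like the following sketch (a guess at the kind of patterns involved, not the authors' actual code); it strips retweet markers, links, hashtags, mentions, and Latin words/numbers, leaving the Arabic text:

```python
import re

def clean_tweet(text):
    """Regex-based preprocessing for Arabic tweets.
    (Emoji stripping, also mentioned in the paper, is omitted for brevity.)"""
    text = re.sub(r"^RT\s+", "", text)          # retweet marker
    text = re.sub(r"https?://\S+", "", text)    # links (before Latin strip)
    text = re.sub(r"[@#]\w+", "", text)         # mentions and hashtags
    text = re.sub(r"[A-Za-z0-9]+", "", text)    # English words and numbers
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

cleaned = clean_tweet("RT @user123 https://t.co/xyz التعليم عن بعد رائع #COVID19 2020")
```

The link pattern runs before the Latin-character strip so that a URL is removed whole rather than leaving `://` fragments behind.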
APA, Harvard, Vancouver, ISO, and other styles
47

Sibirskaya, E. V., L. V. Oveshnikova, and I. R. Lyapina. "MONITORING THE LABOR MARKET USING BIG DATA ANALYTICS TECHNOLOGIES." Economic Science and Humanities 364, no. 5 (2022): 92–105. http://dx.doi.org/10.33979/2073-7424-2022-364-5-92-105.

Full text
Abstract:
The object of this study is the monitoring data of the professional and qualification sphere, in the context of the database of vacancies in the «food industry» areas of professional activity. The purpose of the whole range of work is monitoring the professional and qualification sphere in order to analyze data on vacancies and resumes collected from open sources (Work in Russia, HeadHunter, SuperJob), to study the dynamics and structure of their distribution across the «food industry» areas of professional activity, and to analyze the requirements of employers for employee positions in the labor market using Big Data analytics technologies. Due to the significant volume and complexity of the initial data, the entire scope of work was carried out using a Big Data analysis infrastructure deployed on a computing cluster. Apache Spark and Apache Flume were used as the main software packages for creating the infrastructure. To align the information with the reference book of professions, machine learning methods were used, including models prepared on general and specialized corpora of texts in Russian, using vector representations of words and expressions. Thus, the study analyzed the dynamics and structure of vacancies and resumes in the «food industry» field of professional activity for 55 professions in accordance with the Directory of Professions (regional section): changes in the number of vacancies and resumes, and maximum and minimum wages; current trends in the number of jobs are presented; a comparative analysis of changes in the average monthly nominal accrued wages of those working in the economy was performed; and the need for workers to fill vacancies was studied in accordance with the All-Russian Classifier of Occupations (OKZ).
APA, Harvard, Vancouver, ISO, and other styles
48

Gautier, W., S. Falquier, and S. Gaudan. "MARITIME BIG DATA ANALYSIS WITH ARLAS." International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLVI-4/W2-2021 (August 19, 2021): 71–76. http://dx.doi.org/10.5194/isprs-archives-xlvi-4-w2-2021-71-2021.

Full text
Abstract:
The maritime industry has become a major part of globalization. Political and economic actors are meeting challenges regarding shipping and people transport. The Automatic Identification System (AIS) records and broadcasts the location of numerous vessels and delivers a huge amount of information that can be used to analyze fluxes and behaviors. However, the exploitation of these numerous messages requires tools based on Big Data principles. Knowledge of the origin, destination, travel duration, and distance of each vessel can help transporters manage their fleets and ports analyze fluxes and focus their investigations on some containers based on their previous locations. Thanks to the historical AIS messages provided by the Danish Maritime Authority and ARLAS PROC/ML, an open source and scalable processing platform based on Apache Spark, we are able to apply our pipeline of processes and extract this information from millions of AIS messages. We use a Hidden Markov Model (HMM) to identify when a vessel is still or moving, and we create “courses” embodying the travel of the vessel. Then we derive the travel indicators. The visualization of results is made possible by ARLAS Exploration, an open source and scalable tool to explore geolocated data. This carto-centered application allows users to navigate the huge amount of enriched data and helps to take advantage of these new origin and destination indicators. This tool can also be used to help in the creation of Machine Learning algorithms in order to deal with many maritime transportation challenges.
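The still/moving segmentation with an HMM can be illustrated with a tiny two-state model over discretized speed observations; the transition and emission probabilities below are invented for the sketch and are not taken from ARLAS PROC/ML. Viterbi decoding recovers the most likely hidden sequence, which can then be cut into "courses":

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Log-space Viterbi decoding of the most likely hidden state path."""
    prob = {s: math.log(start_p[s] * emit_p[s][obs[0]]) for s in states}
    paths = {s: [s] for s in states}
    for o in obs[1:]:
        new_prob, new_paths = {}, {}
        for s in states:
            prev = max(states, key=lambda q: prob[q] + math.log(trans_p[q][s]))
            new_prob[s] = prob[prev] + math.log(trans_p[prev][s] * emit_p[s][o])
            new_paths[s] = paths[prev] + [s]
        prob, paths = new_prob, new_paths
    return paths[max(states, key=prob.get)]

states = ("still", "moving")
start = {"still": 0.5, "moving": 0.5}
# sticky transitions: a vessel tends to keep doing what it is doing
trans = {"still": {"still": 0.85, "moving": 0.15},
         "moving": {"still": 0.15, "moving": 0.85}}
# emissions over discretized speed readings
emit = {"still": {"low": 0.95, "high": 0.05},
        "moving": {"low": 0.1, "high": 0.9}}
decoded = viterbi(["low", "low", "high", "high", "low", "low"],
                  states, start, trans, emit)
```

The sticky transition matrix smooths over isolated noisy readings, so a single low-speed observation mid-voyage does not split a course.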
APA, Harvard, Vancouver, ISO, and other styles
49

Azaiz, Mohamed Amine, and Djamel Amar Bensaber. "An Efficient Parallel Hybrid Feature Selection Approach for Big Data Analysis." International Journal of Swarm Intelligence Research 13, no. 1 (January 1, 2022): 1–22. http://dx.doi.org/10.4018/ijsir.308291.

Full text
Abstract:
Classification algorithms face runtime complexity due to high data dimension, especially in the context of big data. Feature selection (FS) is a technique for reducing dimensions and improving learning performance. In this paper, the authors proposed a hybrid FS algorithm for classification in the context of big data. Firstly, only the most relevant features are selected using symmetric uncertainty (SU) as a measure of correlation. The features are distributed into subsets using Apache Spark to calculate SU between each feature and target class in parallel. Then a Binary PSO (BPSO) algorithm is used to find the optimal FS. The BPSO has limited convergence and restricted inertial weight adjustment, so the authors suggested using a multiple inertia weight strategy to influence the changes in particle motions so that the search process is more varied. Also, the authors proposed a parallel fitness evaluation for particles under Spark to accelerate the algorithm. The results showed that the proposed FS achieved higher classification performance with a smaller size in reasonable time.
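Symmetric uncertainty, the correlation measure used in the first stage, is the mutual information between a feature and the class normalized by their entropies. A small stdlib-only sketch (illustrative; in the paper this is computed per feature in parallel over Spark partitions):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a discrete sample, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(feature, target):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)); 1 means the feature is
    fully redundant with the class, 0 means it carries no information."""
    hx, hy = entropy(feature), entropy(target)
    mutual_info = hx + hy - entropy(list(zip(feature, target)))
    return 2 * mutual_info / (hx + hy) if hx + hy else 0.0

labels = [0, 1, 0, 1]
su_perfect = symmetric_uncertainty([0, 1, 0, 1], labels)  # duplicates the class
su_useless = symmetric_uncertainty([0, 0, 1, 1], labels)  # independent of it
```

Features whose SU with the class falls below a threshold are discarded before the BPSO search, shrinking the space the swarm has to explore.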
APA, Harvard, Vancouver, ISO, and other styles
50

Alotaibi, Shoayee, Rashid Mehmood, Iyad Katib, Omer Rana, and Aiiad Albeshri. "Sehaa: A Big Data Analytics Tool for Healthcare Symptoms and Diseases Detection Using Twitter, Apache Spark, and Machine Learning." Applied Sciences 10, no. 4 (February 19, 2020): 1398. http://dx.doi.org/10.3390/app10041398.

Full text
Abstract:
Smartness, which underpins smart cities and societies, is defined by our ability to engage with our environments, analyze them, and make decisions, all in a timely manner. Healthcare is the prime candidate needing the transformative capability of this smartness. Social media could enable a ubiquitous and continuous engagement between healthcare stakeholders, leading to better public health. Current works are limited in their scope, functionality, and scalability. This paper proposes Sehaa, a big data analytics tool for healthcare in the Kingdom of Saudi Arabia (KSA) using Twitter data in Arabic. Sehaa uses Naive Bayes, Logistic Regression, and multiple feature extraction methods to detect various diseases in the KSA. Sehaa found that the top five diseases in Saudi Arabia in terms of the actual afflicted cases are dermal diseases, heart diseases, hypertension, cancer, and diabetes. Riyadh and Jeddah need to do more in creating awareness about the top diseases. Taif is the healthiest city in the KSA in terms of the detected diseases and awareness activities. Sehaa is developed over Apache Spark allowing true scalability. The dataset used comprises 18.9 million tweets collected from November 2018 to September 2019. The results are evaluated using well-known numerical criteria (Accuracy and F1-Score) and are validated against externally available statistics.
APA, Harvard, Vancouver, ISO, and other styles