Academic literature on the topic 'Big Data, Machine Learning, Data Science, Apache Spark'

Author: Grafiati

Published: 14 March 2023

Contents

Journal articles
Dissertations / Theses
Books
Book chapters
Conference papers

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Big Data, Machine Learning, Data Science, Apache Spark.'

Journal articles on the topic "Big Data, Machine Learning, Data Science, Apache Spark"

1

Mutasher, Watheq Ghanim, and Abbas Fadhil Aljuboori. "Real Time Big Data Sentiment Analysis and Classification of Facebook." Webology 19, no. 1 (January 20, 2022): 1112–27. http://dx.doi.org/10.14704/web/v19i1/web19076.

Abstract:

Many people use Facebook to connect and share their views on various issues, and most user-generated content is textual. Because people post so much real-time data about their situations and their thoughts on everyday subjects, collecting and analyzing these data, which may help political decision-making or public opinion monitoring, is a worthwhile research project. This paper therefore analyzes public text posts on the Facebook stream in real time within a Hadoop ecosystem, using Apache Spark with the NLTK Python library. Posts and feeds are gathered from the Facebook API in real time and stored in a database, and Apache Spark is used for fast query processing of the text partitions on each data node (machine). An Amazon cloud-based Hadoop cluster is also used to process the large data volume and to eliminate on-site hardware, IT support, and other operational difficulties, including the installation and configuration of Hadoop components such as the Hadoop Distributed File System and Apache Spark. Using a decision dictionary, sentiment is classified as positive, negative, or neutral, and two machine learning algorithms (naive Bayes and support vector machine) are run to build a predictive model, which demonstrates a high level of precision in sentiment analysis.

2

Omar, Hoger Khayrolla, and Alaa Khalil Jumaa. "Distributed big data analysis using spark parallel data processing." Bulletin of Electrical Engineering and Informatics 11, no. 3 (June 1, 2022): 1505–15. http://dx.doi.org/10.11591/eei.v11i3.3187.

Abstract:

Nowadays, the big data marketplace is rising rapidly. The big challenge is finding a system that can store and handle huge volumes of data and then process them to mine the hidden knowledge. This paper proposes a comprehensive system for improving big data analysis performance. It contains a fast big data processing engine using Apache Spark and a big data storage environment using Apache Hadoop. The system was tested on about 11 gigabytes of text data collected from multiple sources for sentiment analysis. Three different machine learning (ML) algorithms, all already supported by the Spark ML package, are used in this system. The system programs were written in the Java and Scala programming languages, and the constructed model consists of the classification algorithms as well as the pre-processing steps in the form of an ML pipeline. The proposed system was implemented in both central and distributed data processing. Moreover, several dataset manipulation approaches were applied in the system tests to check which one provides the best accuracy and time performance. The results showed that the system handles big data efficiently: it achieves excellent accuracy with fast execution time, especially on the distributed data nodes.

3

Omar, Hoger Khayrolla, and Alaa Khalil Jumaa. "Big Data Analysis Using Apache Spark MLlib and Hadoop HDFS with Scala and Java." Kurdistan Journal of Applied Research 4, no. 1 (May 8, 2019): 7–14. http://dx.doi.org/10.24017/science.2019.1.2.

Abstract:

Nowadays, with the technology revolution, big data is a phenomenon of the decade, and it has a significant impact on applied science trends. Exploring good big data tools is currently a necessity. Hadoop is a good big data analysis technology, but it is slow because the job result of each phase must be stored before the following phase starts, and because of replication delays. Apache Spark is another tool, developed and established as a genuine model for analyzing big data, with its innovative in-memory processing framework and high-level programming libraries for machine learning, efficient data handling, and more. In this paper, some comparisons are presented of the time performance of Scala and Java in Apache Spark MLlib. Many tests were run with supervised and unsupervised machine learning methods on big datasets. The datasets were loaded both from Hadoop HDFS and from the local disk, to identify the pros and cons of each approach and to find the best way of reading or loading a dataset for the best execution. The results showed that Scala performs about 10% to 20% better than Java, depending on the algorithm type. The aim of the study is to analyze big data with more suitable programming languages and, as a consequence, gain better performance.

4

Wei, Chih-Chiang, and Tzu-Hao Chou. "Typhoon Quantitative Rainfall Prediction from Big Data Analytics by Using the Apache Hadoop Spark Parallel Computing Framework." Atmosphere 11, no. 8 (August 17, 2020): 870. http://dx.doi.org/10.3390/atmos11080870.

Abstract:

Situated in the main tracks of typhoons in the Northwestern Pacific Ocean, Taiwan frequently encounters disasters from heavy rainfall during typhoons. Accurate and timely typhoon rainfall prediction is an imperative topic that must be addressed. The purpose of this study was to develop a Hadoop Spark distributed framework based on big-data technology, to accelerate the computation of typhoon rainfall prediction models. This study used deep neural networks (DNNs) and multiple linear regressions (MLRs) in machine learning, to establish rainfall prediction models and evaluate rainfall prediction accuracy. The Hadoop Spark distributed cluster-computing framework was the big-data technology used. The Hadoop Spark framework consisted of the Hadoop Distributed File System, MapReduce framework, and Spark, which was used as a new-generation technology to improve the efficiency of the distributed computing. The research area was Northern Taiwan, which contains four surface observation stations as the experimental sites. This study collected 271 typhoon events (from 1961 to 2017). The following results were obtained: (1) in machine-learning computation, prediction errors increased with prediction duration in the DNN and MLR models; and (2) the system of Hadoop Spark framework was faster than the standalone systems (single I7 central processing unit (CPU) and single E3 CPU). When complex computation is required in a model (e.g., DNN model parameter calibration), the big-data-based Hadoop Spark framework can be used to establish highly efficient computation environments. In summary, this study successfully used the big-data Hadoop Spark framework with machine learning, to develop rainfall prediction models with effectively improved computing efficiency. Therefore, the proposed system can solve problems regarding real-time typhoon rainfall prediction with high timeliness and accuracy.

5

Gupta, Madhuri, and Bharat Gupta. "Survey of Breast Cancer Detection Using Machine Learning Techniques in Big Data." Journal of Cases on Information Technology 21, no. 3 (July 2019): 80–92. http://dx.doi.org/10.4018/jcit.2019070106.

Abstract:

Cancer is a disease in which cells in the body grow and divide uncontrollably. Breast cancer is the second most common disease after lung cancer in women. Incredible advances in health sciences and biotechnology have produced a huge amount of gene expression and clinical data. Machine learning techniques are improving the early detection of breast cancer from this data. The research work carried out focuses on the application of machine learning methods, data analytic techniques, tools, and frameworks in the field of breast cancer research with respect to cancer survivability, cancer recurrence, cancer prediction and detection. Some of the widely used machine learning techniques for detection of breast cancer are support vector machines and artificial neural networks. The Apache Spark data processing engine is found to be compatible with most machine learning frameworks.

6

Kamburugamuve, Supun, Pulasthi Wickramasinghe, Saliya Ekanayake, and Geoffrey C. Fox. "Anatomy of machine learning algorithm implementations in MPI, Spark, and Flink." International Journal of High Performance Computing Applications 32, no. 1 (July 2, 2017): 61–73. http://dx.doi.org/10.1177/1094342017712976.

Abstract:

With the ever-increasing need to analyze large amounts of data to get useful insights, it is essential to develop complex parallel machine learning algorithms that can scale with data and number of parallel processes. These algorithms need to run on large data sets and to execute in minimal time in order to extract useful information in a time-constrained environment. Message passing interface (MPI) is a widely used model for developing such algorithms in the high-performance computing paradigm, while Apache Spark and Apache Flink are emerging as big data platforms for large-scale parallel machine learning. Even though these big data frameworks are designed differently, they follow the data flow model for execution and user APIs. The data flow model offers fundamentally different capabilities than the MPI execution model, but the same type of parallelism can be used in applications developed in both models. This article presents three distinct machine learning algorithms implemented in MPI, Spark, and Flink, compares their performance, and identifies strengths and weaknesses in each platform.

7

Özgüven, Yavuz, Utku Gönener, and Süleyman Eken. "A Dockerized big data architecture for sports analytics." Computer Science and Information Systems, no. 00 (2022): 10. http://dx.doi.org/10.2298/csis220118010o.

Abstract:

The big data revolution has had an impact on sports analytics as well. Many large corporations have begun to see the financial benefits of integrating sports analytics with big data. When we rely on central processing systems to aggregate and analyze large amounts of sport data from many sources, we compromise the accuracy and timeliness of the data. As a response to these issues, distributed systems come to the rescue, and the MapReduce paradigm holds promise for large scale data analytics. We describe a big data architecture based on Docker containers with Apache Spark in this paper. We evaluate the architecture on four data-intensive case studies in sport analytics including structured analysis, streaming, machine learning approaches, and graph-based analysis.

8

Concolato, Claude E., and Li M. Chen. "Data Science: A New Paradigm in the Age of Big-Data Science and Analytics." New Mathematics and Natural Computation 13, no. 02 (July 2017): 119–43. http://dx.doi.org/10.1142/s1793005717400038.

Abstract:

As an emergent field of inquiry, Data Science serves both the information technology world and the applied sciences. Data Science is a known term that tends to be synonymous with the term Big-Data; however, Data Science is the application of solutions found through mathematical and computational research while Big-Data Science describes problems concerning the analysis of data with respect to volume, variation, and velocity (3V). Even though there is not much developed in theory from a scientific perspective for Data Science, there is still great opportunity for tremendous growth. Data Science is proving to be of paramount importance to the IT industry due to the increased need for understanding the insurmountable amount of data being produced and in need of analysis. In short, data is everywhere with various formats. Scientists are currently using statistical and AI analysis techniques like machine learning methods to understand massive sets of data, and naturally, they attempt to find relationships among datasets. In the past 10 years, the development of software systems within the cloud computing paradigm using tools like Hadoop and Apache Spark has aided in making tremendous advances to Data Science as a discipline [Z. Sun, L. Sun and K. Strang, Big data analytics services for enhancing business intelligence, Journal of Computer Information Systems (2016), doi: 10.1080/08874417.2016.1220239]. These advances enabled both scientists and IT professionals to use cloud computing infrastructure to process petabytes of data on a daily basis. This is especially true for large private companies such as Walmart, Nvidia, and Google. This paper seeks to address pragmatic ways of looking at how Data Science, with respect to Big-Data Science, is practiced in the modern world. We also examine how mathematics and computer science help shape Big-Data Science’s terrain. We will highlight how mathematics and computer science have significantly impacted the development of Data Science approaches, tools, and how those approaches pose new questions that can drive new research areas within these core disciplines involving data analysis, machine learning, and visualization.

9

Myung, Rohyoung, and Sukyong Choi. "Machine-Learning Based Memory Prediction Model for Data Parallel Workloads in Apache Spark." Symmetry 13, no. 4 (April 16, 2021): 697. http://dx.doi.org/10.3390/sym13040697.

Abstract:

A lack of memory can lead to job failures or increase processing times for garbage collection. However, if too much memory is provided, the processing time is only marginally reduced, and most of the memory is wasted. Many big data processing tasks are executed in cloud environments. When renting virtual resources in a cloud environment, it is necessary to pay the cost according to the specifications of resources (i.e., the number of virtual cores and the size of memory), as well as rental time. In this paper, given the type of workload and volume of the input data, we analyze the memory usage pattern and derive the efficient memory size of data-parallel workloads in Apache Spark. Then, we propose a machine-learning-based prediction model that determines the efficient memory for a given workload and data. To determine the validity of the proposed model, we applied it to data-parallel workloads which include a deep learning model. The predicted memory values were in close agreement with the actual amount of required memory. Additionally, the whole building time for the proposed model requires a maximum of 44% of the total execution time of a data-parallel workload. The proposed model can improve memory efficiency up to 1.89 times compared with the vanilla Spark setting.

10

Hussin, Sahar K., Salah M. Abdelmageid, Adel Alkhalil, Yasser M. Omar, Mahmoud I. Marie, and Rabie A. Ramadan. "Handling Imbalance Classification Virtual Screening Big Data Using Machine Learning Algorithms." Complexity 2021 (January 28, 2021): 1–15. http://dx.doi.org/10.1155/2021/6675279.

Abstract:

Virtual screening is the most critical process in drug discovery, and it relies on machine learning to facilitate the screening process. It enables the discovery of molecules that bind to a specific protein to form a drug. Despite its benefits, virtual screening generates enormous data and suffers from drawbacks such as high dimensions and imbalance. This paper tackles data imbalance and aims to improve virtual screening accuracy, especially for a minority dataset. For a dataset identified without considering the data’s imbalanced nature, most classification methods tend to have high predictive accuracy for the majority category, whereas accuracy is significantly poorer for the minority category. The paper proposes a K-means algorithm coupled with the Synthetic Minority Oversampling Technique (SMOTE) to overcome the problem of imbalanced datasets. The proposed algorithm is named KSMOTE. Using KSMOTE, minority data can be identified with high accuracy and detected with high precision. A large set of experiments was implemented on Apache Spark using numeric PaDEL and fingerprint descriptors. The proposed solution was compared to both a no-sampling method and SMOTE on the same datasets. Experimental results showed that the proposed solution outperformed the other methods.
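
The oversampling idea behind SMOTE, which this abstract builds on, can be illustrated with a minimal Python sketch. It shows only plain SMOTE interpolation between a minority sample and one of its nearest minority neighbours; it is not the paper's KSMOTE, whose K-means coupling is not described here. The function name `smote`, its parameters, and the sample points are hypothetical.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (plain SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-dimensional minority-class samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

Because each synthetic point lies on a segment between two existing minority samples, the new points stay inside the region the minority class already occupies.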

Dissertations / Theses on the topic "Big Data, Machine Learning, Data Science, Apache Spark"

1

Ray, Sujan. "Dimensionality Reduction in Healthcare Data Analysis on Cloud Platform." University of Cincinnati / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=ucin161375080072697.

2

Murgia, Antonio. "Lightweight Internet Traffic Classification - A Subject Based Solution with Word Embeddings." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amslaurea.unibo.it/10569/.

Abstract:

Internet traffic classification is a relevant and mature research field, yet one of growing importance and with still open technical challenges, partly due to the pervasive presence of Internet-connected devices in everyday life. We claim the need for innovative traffic classification solutions that are lightweight, adopt a domain-based approach, and not only concentrate on application-level protocol categorization but also classify Internet traffic by subject. To this purpose, this paper proposes an original classification solution that leverages domain name information extracted from IPFIX summaries, DNS logs, and DHCP leases, with the possibility to be applied to any kind of traffic. Our proposed solution is based on an extension of Word2vec unsupervised learning techniques running on a specialized Apache Spark cluster. In particular, learning techniques are leveraged to generate word embeddings from a mixed dataset composed of domain names and natural language corpora in a lightweight way and with general applicability. The paper also reports lessons learnt from our implementation and deployment experience, which demonstrates that our solution can process 5500 IPFIX summaries per second on an Apache Spark cluster with 1 slave instance in Amazon EC2 at a cost of $3860 per year. Reported experimental results on Precision, Recall, F-Measure, Accuracy, and Cohen's Kappa show the feasibility and effectiveness of the proposal. The experiments prove that words contained in domain names do have a relation with the kind of traffic directed towards them; therefore, using specifically trained word embeddings, we are able to classify them in customizable categories. We also show that training word embeddings on larger natural language corpora leads to improvements in precision of up to 180%.
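
The evaluation measures named in this abstract can be made concrete with a small Python sketch. It assumes a binary classifier summarized by four confusion-matrix counts, all denominators being non-zero; the function name `binary_metrics` and the counts used below are hypothetical illustrations, not values or code from the thesis.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common evaluation measures from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / total
    # Cohen's kappa compares observed agreement with the agreement expected
    # by chance, derived from the marginal totals of the confusion matrix.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    expected = p_yes + p_no
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "kappa": kappa,
    }


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
scores = binary_metrics(tp=40, fp=10, fn=20, tn=30)
```

Accuracy alone can look favorable on skewed class distributions, which is one reason kappa and the F-measure are often reported alongside it.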

3

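Соболь, Віталій Миколайович (Vitaliy Sobol). "Розподілена комп’ютерна система для прогнозування поширення рослинного покриву з використанням засобів машинного навчання" [Distributed computer system for forecasting the spread of vegetation cover using machine learning tools]. Master's thesis, Тернопільський національний технічний університет імені Івана Пулюя (Ternopil Ivan Puluj National Technical University), 2021. http://elartu.tntu.edu.ua/handle/lib/36653.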

Abstract:

The aim of the work is to develop software and implement machine learning algorithms for forecasting the forest cover of a given area, taking into account the diversity and uniqueness of the environment and the original plantings in that area. The study analyzes important concepts, principles, and process sequences used in designing computer systems, writing programs, and working with big data, in particular the terminological features of implementing forecasting software. This made it possible to understand, and subsequently identify, ways of implementing machine learning methods to improve the efficiency of green plantings in a given area.
Thesis contents:
List of main notation, symbols, and abbreviations
Introduction
Chapter 1. Analysis of the features of big data processing and classification of the main algorithms
1.1. Analysis and main challenges of data science
1.2. Comparison of Hadoop and Spark as the main competitors for working with big data
1.3. Rationale for choosing Apache Spark as the main framework
1.4. Fast forward to regression
1.5. Vectors and features
1.6. Training examples
1.7. Decision trees and forests
1.8. The forest cover dataset
1.9. Chapter conclusions
Chapter 2. Description and selection of machine learning methods for big data processing
2.1. Data preprocessing and data analysis
2.1.1. Missing values
2.1.2. Duplicate data
2.1.3. Noise and outliers
2.1.4. Data cleaning
2.1.5. Data normalization methods
2.1.6. Methods for filling in missing values
2.2. Selection of base classifiers
2.2.1. General statement of the classification problem
2.2.2. Linear classifiers
2.2.2.1. Fisher's linear discriminant
2.2.2.2. Single-layer perceptron
2.2.2.3. Logistic regression
2.2.2.4. Support vector machine
2.2.3. k-nearest neighbors method
2.2.4. Naive Bayes classifier
2.2.5. Decision trees
2.3. Using ensembles of classification models as a more effective algorithm
2.3.1. Bagging
2.3.2. Boosting
2.4. Metrics for evaluating classifier quality
2.4.1. Accuracy
2.4.2. Precision
2.4.3. Recall, or sensitivity
2.4.4. Specificity
2.4.5. F-measure
2.4.6. Log-loss (logarithmic loss)
2.4.7. ROC curve (Receiver Operating Characteristics Curve)
2.5. Chapter conclusions
Chapter 3. Selection and description of machine learning methods for big data processing
3.1. Preparing the input data and processing the CSV file
3.2. A first decision tree
3.3. Decision tree hyperparameters
3.4. Tuning decision trees
3.5. Categorical features revisited
3.6. Chapter conclusions
Chapter 4. Occupational safety and safety in emergencies
4.1. Occupational safety
4.2. Improving the operational resilience of economic facilities in wartime
4.3. Chapter conclusions
Conclusions
References
Appendix A. Conference abstracts
Appendix B. Full program code

4

"Large-Scale Matrix Completion Using Orthogonal Rank-One Matrix Pursuit, Divide-Factor-Combine, and Apache Spark." Master's thesis, 2014. http://hdl.handle.net/2286/R.I.24857.

Abstract:

As the size and scope of valuable datasets have exploded across many industries and fields of research in recent years, an increasingly diverse audience has sought out effective tools for their large-scale data analytics needs. Over this period, machine learning researchers have also been very prolific in designing improved algorithms which are capable of finding the hidden structure within these datasets. As consumers of popular Big Data frameworks have sought to apply and benefit from these improved learning algorithms, the problems encountered with the frameworks have motivated a new generation of Big Data tools to address the shortcomings of the previous generation. One important example of this is the improved performance in the newer tools with the large class of machine learning algorithms which are highly iterative in nature. In this thesis project, I set about to implement a low-rank matrix completion algorithm (as an example of a highly iterative algorithm) within a popular Big Data framework, and to evaluate its performance processing the Netflix Prize dataset. I begin by describing several approaches which I attempted, but which did not perform adequately. These include an implementation of the Singular Value Thresholding (SVT) algorithm within the Apache Mahout framework, which runs on top of the Apache Hadoop MapReduce engine. I then describe an approach which uses the Divide-Factor-Combine (DFC) algorithmic framework to parallelize the state-of-the-art low-rank completion algorithm Orthogonal Rank-One Matrix Pursuit (OR1MP) within the Apache Spark engine. I describe the results of a series of tests running this implementation with the Netflix dataset on clusters of various sizes, with various degrees of parallelism. For these experiments, I utilized the Amazon Elastic Compute Cloud (EC2) web service. In the final analysis, I conclude that the Spark DFC + OR1MP implementation does indeed produce competitive results, in both accuracy and performance. In particular, the Spark implementation performs nearly as well as the MATLAB implementation of OR1MP without any parallelism, and improves performance to a significant degree as the parallelism increases. In addition, the experience demonstrates how Spark's flexible programming model makes it straightforward to implement this parallel and iterative machine learning algorithm.
Dissertation/Thesis
M.S. Computer Science 2014

Books on the topic "Big Data, Machine Learning, Data Science, Apache Spark"

1

Apache Spark Quick Start Guide: Quickly Learn the Art of Writing Efficient Big Data Applications with Apache Spark. Packt Publishing, Limited, 2019.

2

Apache Spark Machine Learning Blueprints. Packt Publishing, Limited, 2016.

3

Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, 2015.

4

Karim, Rezaul, Romeo Kienzler, Sridhar Alla, Siamak Amirghodsi, and Meenakshi Rajendran. Apache Spark 2 : Data Processing and Real-Time Analytics: Master Complex Big Data Processing, Stream Analytics, and Machine Learning with Apache Spark. Packt Publishing, Limited, 2018.

5

Hands-On Data Science and Python Machine Learning. Packt Publishing - ebooks Account, 2017.

6

Spark: The Definitive Guide: Big Data Processing Made Simple. O'Reilly Media, 2018.

7

Quddus, Jillur. Machine Learning with Apache Spark Quick Start Guide: Uncover patterns, derive actionable insights, and learn from big data using MLlib. Packt Publishing, 2018.

8

Gulli, Dr Antonio. A collection of Advanced Data Science and Machine Learning Interview Questions Solved in Python and Spark: Hands-on Big Data and Machine ... Programming Interview Questions). Createspace Independent Publishing Platform, 2015.

Book chapters on the topic "Big Data, Machine Learning, Data Science, Apache Spark"

1

Mogha, Garima, Khyati Ahlawat, and Amit Prakash Singh. "Performance Analysis of Machine Learning Techniques on Big Data Using Apache Spark." In Data Science and Analytics, 17–26. Singapore: Springer Singapore, 2018. http://dx.doi.org/10.1007/978-981-10-8527-7_2.

2

Abdel Hai, Ameen, and Babak Forouraghi. "On Scalability of Distributed Machine Learning with Big Data on Apache Spark." In Big Data – BigData 2018, 209–19. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-319-94301-5_16.

3

Hafez, Manar Mohamed, Mohamed Elemam Shehab, Essam El Fakharany, and Abd El Ftah Abdel Ghfar Hegazy. "Effective Selection of Machine Learning Algorithms for Big Data Analytics Using Apache Spark." In Advances in Intelligent Systems and Computing, 692–704. Cham: Springer International Publishing, 2016. http://dx.doi.org/10.1007/978-3-319-48308-5_66.

4

Kerestely, Arpad, Alexandra Baicoianu, and Razvan Bocu. "A Research Study on Running Machine Learning Algorithms on Big Data with Spark." In Knowledge Science, Engineering and Management, 307–18. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-82136-4_25.

5

Cheng, Jane, and Peng Zhao. "Sustainable Big Data Analytics Process Pipeline Using Apache Ecosystem." In Encyclopedia of Data Science and Machine Learning, 1247–59. IGI Global, 2022. http://dx.doi.org/10.4018/978-1-7998-9220-5.ch073.

Abstract:

This article provides a comprehensive understanding of the cutting-edge big data workflow technologies that have been widely applied in industrial applications, covering a broad range of the most current big data processing methods and tools, including Hadoop, Hive, MapReduce, Sqoop, Hue, Spark, Cloudera, Airflow, and GitLab. An industrial data workflow pipeline is proposed and investigated in terms of the system architecture, which is designed to meet the needs of data-driven industrial big data analytics applications concentrated on large-scale data processing. It differs from traditional data pipelines and workflows in its ETL and analytical portal capabilities. The proposed data workflow can improve industrial analytics applications for multiple tasks. This article also provides big data researchers and professionals with an understanding of the challenges facing big data analytics in real-world environments and informs interdisciplinary studies in this field.

6

Chen, Li, and Lala Aicha Coulibaly. "Data Science and Big Data Practice Using Apache Spark and Python." In Advances in Data Mining and Database Management, 67–95. IGI Global, 2021. http://dx.doi.org/10.4018/978-1-7998-4963-6.ch004.

Abstract:

Data science and big data analytics are still at the center of computer science and information technology. Students and researchers outside computer science often find real data analytics difficult when using programming languages such as Python and Scala, especially when they attempt to use Apache Spark in cloud computing environments (Spark Scala and PySpark). At the same time, students in information technology may find it difficult to deal with the mathematical background of data science algorithms. To overcome these difficulties, this chapter provides practical guidelines for different users in this area. The authors cover the main algorithms for data science and machine learning, including principal component analysis (PCA), support vector machine (SVM), k-means, k-nearest neighbors (kNN), regression, neural networks, and decision trees. Each algorithm is briefly described, and the related code is selected to fit both simple and real data sets. Some visualization methods, including 2D and 3D displays, are also presented in this chapter.

7

Gupta, Madhuri, and Bharat Gupta. "Survey of Breast Cancer Detection Using Machine Learning Techniques in Big Data." In Research Anthology on Medical Informatics in Breast and Cervical Cancer, 371–85. IGI Global, 2022. http://dx.doi.org/10.4018/978-1-6684-7136-4.ch020.

Abstract:

Cancer is a disease in which cells in the body grow and divide uncontrollably. Breast cancer is the second most common disease in women after lung cancer. Remarkable advances in health sciences and biotechnology have produced a huge amount of gene expression and clinical data. Machine learning techniques are improving the early detection of breast cancer from this data. The research work carried out focuses on the application of machine learning methods, data analytic techniques, tools, and frameworks in the field of breast cancer research with respect to cancer survivability, cancer recurrence, and cancer prediction and detection. Widely used machine learning techniques for breast cancer detection include support vector machines and artificial neural networks. The Apache Spark data processing engine is found to be compatible with most of the machine learning frameworks.

8

Brahmane, Anilkumar V., and B. Chaitanya Krishna. "DSAE – Deep Stack Auto Encoder and RCBO – Rider Chaotic Biogeography Optimization Algorithm for Big Data Classification." In Recent Trends in Intensive Computing. IOS Press, 2021. http://dx.doi.org/10.3233/apc210198.

Abstract:

Big data classification is a crucial and widely encountered problem in many applications. Accurate big data classification is required not only in engineering but also in social, agricultural, banking, educational, and many other applications across science and engineering. We propose a novel and efficient methodology for big data classification using a Deep Stack Auto Encoder (DSAE) and the Rider Chaotic Biogeography Optimization algorithm. The proposed algorithm combines two algorithms: the Rider Optimization Algorithm (ROA) and the chaotic biogeography-based optimization (CBBO) algorithm. We therefore name it RCBO. The proposed system uses the DSAE for training, which produces the accurate classification. The Apache Spark platform is used for the initial distribution of the data from the master node to the slave nodes. The proposed system is tested on the UCI Machine Learning data set and gives excellent results compared with other algorithms such as KNN classification, Extreme Learning Machine, and Random Forest.

9

Rashid, Mamoon, Vishal Goyal, Shabir Ahmad Parah, and Harjeet Singh. "Drug Prediction in Healthcare Using Big Data and Machine Learning." In Advances in Social Networking and Online Communities, 79–92. IGI Global, 2019. http://dx.doi.org/10.4018/978-1-5225-9096-5.ch005.

Abstract:

The healthcare system is literally losing patients due to improper diagnosis, accidents, and infections in hospitals alone. To address these challenges, the authors propose a drug prediction model that acts as an informative guide for patients and helps them take the right medicines for a particular disease. In this chapter, the authors propose using the Hadoop Distributed File System to store medical datasets related to medicinal drugs. The MLlib library of Apache Spark is to be used for initial data analysis to suggest drugs based on symptoms gathered from a particular user. The model will analyze the patient's history for any side effects of the drug to be recommended. The proposal will also consider weather and maps APIs from Google so that patients can easily locate nearby stores where the medicines are available. It is believed that this research proposal will surely eradicate these issues by prescribing the optimal drug and indicating its availability through the location of a retailer near the customer.

10

Dumancas, Gerard G., Ghalib A. Bello, Jeff Hughes, Renita Murimi, Lakshmi Chockalingam Kasi Viswanath, Casey O'Neal Orndorff, Glenda Fe Dumancas, and Jacy D. O'Dell. "Visualization Tools for Big Data Analytics in Quantitative Chemical Analysis." In Advances in Data Mining and Database Management, 873–917. IGI Global, 2018. http://dx.doi.org/10.4018/978-1-5225-3142-5.ch030.

Abstract:

Modern instruments have the capacity to generate and store enormous volumes of data, and the challenges involved in processing, analyzing, and visualizing this data are well recognized. The field of Chemometrics (a subspecialty of Analytical Chemistry) grew out of efforts to develop a toolbox of statistical and computer applications for data processing and analysis. This chapter discusses key concepts of Big Data Analytics within the context of Analytical Chemistry. It places particular emphasis on preprocessing techniques, statistical and Machine Learning methodology for data mining and analysis, tools for big data visualization, and state-of-the-art applications for data storage. Various statistical techniques used for the analysis of Big Data in Chemometrics are introduced. The chapter also gives an overview of computational tools for Big Data Analytics in Analytical Chemistry. It concludes with a discussion of the latest platforms and programming tools for Big Data storage, such as Hadoop, Apache Hive, Spark, Google Bigtable, and more.

Conference papers on the topic "Big Data, Machine Learning, Data Science, Apache Spark"

1

Assefi, Mehdi, Ehsun Behravesh, Guangchi Liu, and Ahmad P. Tafti. "Big data machine learning using apache spark MLlib." In 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017. http://dx.doi.org/10.1109/bigdata.2017.8258338.

2

Dunner, Celestine, Thomas Parnell, Kubilay Atasu, Manolis Sifalakis, and Haralampos Pozidis. "Understanding and optimizing the performance of distributed machine learning applications on apache spark." In 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017. http://dx.doi.org/10.1109/bigdata.2017.8257942.

3

Junaid, Muhammad, Shiraz Ali Wagan, Nawab Muhammad Faseeh Qureshi, Choon Sung Nam, and Dong Ryeol Shin. "Big data Predictive Analytics for Apache Spark using Machine Learning." In 2020 Global Conference on Wireless and Optical Technologies (GCWOT). IEEE, 2020. http://dx.doi.org/10.1109/gcwot49901.2020.9391620.

4

Chen, Lin, Rui Li, Yige Liu, Ruixuan Zhang, and Diane Myung-kyung Woodbridge. "Machine learning-based product recommendation using Apache Spark." In 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI). IEEE, 2017. http://dx.doi.org/10.1109/uic-atc.2017.8397470.

5

Alomari, Ebtesam, Rashid Mehmood, and Iyad Katib. "Road Traffic Event Detection Using Twitter Data, Machine Learning, and Apache Spark." In 2019 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI). IEEE, 2019. http://dx.doi.org/10.1109/smartworld-uic-atc-scalcom-iop-sci.2019.00332.

6

Albaldawi, Wafaa S., Rafah M. Almuttairi, and Mehdi Ebady Manaa. "Big Data Analysis for Healthcare Application using Minhash and Machine Learning in Apache Spark Framework." In 2022 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA). IEEE, 2022. http://dx.doi.org/10.1109/hora55278.2022.9799934.

7

Sheshasaayee, Ananthi, and J. V. N. Lakshmi. "An insight into tree based machine learning techniques for big data analytics using Apache Spark." In 2017 International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT). IEEE, 2017. http://dx.doi.org/10.1109/icicict1.2017.8342833.

8

Sassi, Imad, Sara Ouaftouh, and Samir Anter. "Adaptation of Classical Machine Learning Algorithms to Big Data Context: Problems and Challenges: Case Study: Hidden Markov Models Under Spark." In 2019 1st International Conference on Smart Systems and Data Science (ICSSD). IEEE, 2019. http://dx.doi.org/10.1109/icssd47982.2019.9002857.

