
Journal articles on the topic 'MAPREDUCE FRAMEWORKS'



Consult the top 50 journal articles for your research on the topic 'MAPREDUCE FRAMEWORKS.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles on a wide variety of disciplines and organise your bibliography correctly.

1

Ajibade Lukuman Saheed, Abu Bakar Kamalrulnizam, Ahmed Aliyu, and Tasneem Darwish. "Latency-aware Straggler Mitigation Strategy in Hadoop MapReduce Framework: A Review." Systematic Literature Review and Meta-Analysis Journal 2, no. 2 (October 19, 2021): 53–60. http://dx.doi.org/10.54480/slrm.v2i2.19.

Full text
Abstract:
Processing huge and complex data to obtain useful information is challenging, even though several big data processing frameworks have been proposed and further enhanced. One of the prominent big data processing frameworks is MapReduce. The main concept of the MapReduce framework relies on distributed and parallel processing. However, the MapReduce framework faces serious performance degradation due to the slow execution of certain types of tasks, called stragglers. Failing to handle stragglers causes delays and affects the overall job execution time. Several straggler mitigation techniques have therefore been proposed to improve MapReduce performance. This study provides a comprehensive and qualitative review of the different existing straggler mitigation solutions. In addition, a taxonomy of the available straggler mitigation solutions is presented. Critical research issues and future research directions are identified and discussed to guide researchers and scholars.
2

Darapaneni, Chandra Sekhar, Bobba Basaveswara Rao, Boggavarapu Bhanu Venkata Satya Vara Prasad, and Suneetha Bulla. "An Analytical Performance Evaluation of MapReduce Model Using Transient Queuing Model." Advances in Modelling and Analysis B 64, no. 1-4 (December 31, 2021): 46–53. http://dx.doi.org/10.18280/ama_b.641-407.

Full text
Abstract:
Today, MapReduce frameworks have become the standard distributed computing mechanism to store, process, analyze, query and transform Bigdata. While processing Bigdata, evaluating the performance of the MapReduce framework is essential in order to understand process dependencies and to tune the hyper-parameters. Unfortunately, the built-in functions of the MapReduce framework can evaluate performance only to a limited extent. A reliable analytical performance model is required in this area to evaluate the performance of MapReduce frameworks. The main objective of this paper is to investigate the performance of MapReduce computing models under various configurations. To accomplish this, we propose an analytical transient queuing model, which evaluates the performance of the MapReduce model for different job arrival rates at the mappers and various job completion times of the mappers as well as the reducers. In our transient queuing model, we adopt an efficient multi-server M/M/C queuing model for optimal waiting-queue management. To conduct experiments on the proposed analytical model, we selected Bigdata applications with three mappers and two reducers, under various configurations. As part of the experiments, the transient differential equations, average queue lengths, mapper blocking probabilities, shuffle waiting probabilities and transient states are evaluated. MATLAB-based numerical simulations present the analytical results for various combinations of the input parameters λ, µ1 and µ2 and their effect on queue length.
3

Kang, Sol Ji, Sang Yeon Lee, and Keon Myung Lee. "Performance Comparison of OpenMP, MPI, and MapReduce in Practical Problems." Advances in Multimedia 2015 (2015): 1–9. http://dx.doi.org/10.1155/2015/575687.

Full text
Abstract:
With problem size and complexity increasing, several parallel and distributed programming models and frameworks have been developed to efficiently handle such problems. This paper briefly reviews the parallel computing models and describes three widely recognized parallel programming frameworks: OpenMP, MPI, and MapReduce. OpenMP is the de facto standard for parallel programming on shared memory systems. MPI is the de facto industry standard for distributed memory systems. The MapReduce framework has become the de facto standard for large scale data-intensive applications. Qualitative pros and cons of each framework are known, but quantitative performance indexes help get a good picture of which framework to use for the applications. As benchmark problems to compare those frameworks, two problems are chosen: the all-pairs-shortest-path problem and the data join problem. This paper presents the parallel programs for the problems implemented on the three frameworks, respectively. It shows the experimental results on a cluster of computers. It also discusses which is the right tool for the jobs by analyzing the characteristics and performance of the paradigms.
4

Srirama, Satish Narayana, Oleg Batrashev, Pelle Jakovits, and Eero Vainikko. "Scalability of Parallel Scientific Applications on the Cloud." Scientific Programming 19, no. 2-3 (2011): 91–105. http://dx.doi.org/10.1155/2011/361854.

Full text
Abstract:
Cloud computing, with its promise of virtually infinite resources, seems well suited to solving resource-greedy scientific computing problems. To study the effects of moving parallel scientific applications onto the cloud, we deployed several benchmark applications, such as matrix–vector operations, the NAS parallel benchmarks, and DOUG (Domain decomposition On Unstructured Grids), on the cloud. DOUG is an open source software package for the parallel iterative solution of very large sparse systems of linear equations. The detailed analysis of DOUG on the cloud showed that parallel applications benefit greatly and scale reasonably well on the cloud. We could also observe the limitations of the cloud and compare its performance with that of a cluster. However, for efficiently running scientific applications on cloud infrastructure, the applications must be reduced to frameworks that can successfully exploit the cloud resources, like the MapReduce framework. Several iterative and embarrassingly parallel algorithms are reduced to the MapReduce model and their performance is measured and analyzed. The analysis showed that Hadoop MapReduce has significant problems with iterative methods, while it suits embarrassingly parallel algorithms well. Scientific computing often uses iterative methods to solve large problems. Thus, for scientific computing on the cloud, this paper raises the necessity for better frameworks or optimizations for MapReduce.
5

Senthilkumar, M., and P. Ilango. "A Survey on Job Scheduling in Big Data." Cybernetics and Information Technologies 16, no. 3 (September 1, 2016): 35–51. http://dx.doi.org/10.1515/cait-2016-0033.

Full text
Abstract:
Big Data applications with scheduling have become an active research area in the last three years. The Hadoop framework has become one of the most popular and widely used frameworks for distributed data processing. Hadoop is also open source software that allows the user to effectively utilize the hardware. The various scheduling algorithms of the MapReduce model using Hadoop vary in design and behavior, and are used for handling many issues such as data locality, resource awareness, energy and time. This paper gives an outline of job scheduling, a classification of schedulers, and a comparison of different existing algorithms with their advantages, drawbacks and limitations. We also discuss various tools and frameworks used for monitoring and ways to improve MapReduce performance. This paper helps beginners and researchers understand the scheduling mechanisms used in Big Data.
6

Adornes, Daniel, Dalvan Griebler, Cleverson Ledur, and Luiz Gustavo Fernandes. "Coding Productivity in MapReduce Applications for Distributed and Shared Memory Architectures." International Journal of Software Engineering and Knowledge Engineering 25, no. 09n10 (November 2015): 1739–41. http://dx.doi.org/10.1142/s0218194015710096.

Full text
Abstract:
MapReduce was originally proposed as a suitable and efficient approach for analyzing and processing large amounts of data. Since then, many research efforts have contributed MapReduce implementations for distributed and shared memory architectures. Nevertheless, different architectural levels require different optimization strategies in order to achieve high-performance computing. Such strategies in turn have resulted in very different MapReduce programming interfaces among these works. This paper presents some research notes on coding productivity when developing MapReduce applications for distributed and shared memory architectures. As a case study, we introduce our current research on a unified MapReduce domain-specific language with code generation for Hadoop and Phoenix++, which has achieved coding productivity increases from 41.84% up to 94.71% without significant performance losses (below 3%) compared to those frameworks.
7

Song, Minjae, Hyunsuk Oh, Seungmin Seo, and Kyong-Ho Lee. "Map-Side Join Processing of SPARQL Queries Based on Abstract RDF Data Filtering." Journal of Database Management 30, no. 1 (January 2019): 22–40. http://dx.doi.org/10.4018/jdm.2019010102.

Full text
Abstract:
The amount of RDF data being published on the Web is increasing at a massive rate. MapReduce-based distributed frameworks have become the general trend in processing SPARQL queries against RDF data. Currently, query processing systems that use MapReduce have not been able to keep up with the increase of semantically annotated data, resulting in non-interactive SPARQL query processing. The principal reason is that intermediate query results from join operations in a MapReduce framework are so massive that they consume all available network bandwidth. In this article, the authors present an efficient SPARQL processing system that uses MapReduce and HBase. The system runs a job-optimized query plan using their proposed abstract RDF data to decrease the number of jobs and also decrease the amount of input data. The authors also present an efficient algorithm that uses map-side joins together with the abstract RDF data to filter out unneeded RDF data. Experimental results show that the proposed approach demonstrates better performance when processing queries with a large amount of input data than those found in previous works.
8

Thabtah, Fadi, Suhel Hammoud, and Hussein Abdel-Jaber. "Parallel Associative Classification Data Mining Frameworks Based MapReduce." Parallel Processing Letters 25, no. 02 (June 2015): 1550002. http://dx.doi.org/10.1142/s0129626415500024.

Full text
Abstract:
Associative classification (AC) is a research topic that integrates association rules with classification in data mining to build classifiers. After dissemination of the Classification-based Association Rule algorithm (CBA), the majority of its successors have been developed to improve either CBA's prediction accuracy or the search for frequent ruleitems in the rule discovery step. Both of these steps place high demands on processing time and memory, especially for large training data sets or a low minimum support threshold value. In this paper, we overcome the problem of mining large training data sets by proposing a new learning method that repeatedly transforms data between line and item spaces to quickly discover frequent ruleitems, generate rules, and subsequently rank and prune rules. This new learning method has been implemented in a parallel MapReduce (MR) algorithm called MRMCAR, which can be considered the first parallel AC algorithm in the literature. The new learning method can be utilised in the different steps within any AC or association rule mining algorithm and scales well when contrasted with current horizontal or vertical methods. Two versions of the learning method (Weka, Hadoop) have been implemented and a number of experiments against different data sets have been conducted. The bases of the comparisons are classification accuracy and the time required by the algorithm for data initialization, frequent ruleitems discovery, rule generation and rule pruning. The results reveal that MRMCAR is superior to both current AC mining algorithms and rule-based classification algorithms in improving classification performance with respect to accuracy.
9

Goncalves, Carlos, Luis Assuncao, and Jose C. Cunha. "Flexible MapReduce Workflows for Cloud Data Analytics." International Journal of Grid and High Performance Computing 5, no. 4 (October 2013): 48–64. http://dx.doi.org/10.4018/ijghpc.2013100104.

Full text
Abstract:
Data analytics applications handle large data sets subject to multiple processing phases, some of which can execute in parallel on clusters, grids or clouds. Such applications can benefit from using the MapReduce model, which only requires the end-user to define the application algorithms for input data processing and the map and reduce functions, but this poses a need to install/configure specific frameworks such as Apache Hadoop or Elastic MapReduce in the Amazon Cloud. In order to provide more flexibility in defining and adjusting the application configurations, as well as in the specification of the composition of the application phases and their orchestration, the authors describe an approach for supporting MapReduce stages as sub-workflows in the AWARD framework (Autonomic Workflow Activities Reconfigurable and Dynamic). The authors discuss how a text mining application is represented as a complex workflow with multiple phases, where individual workflow nodes support MapReduce computations. Access to intermediate data produced during the MapReduce computations is supported by a data sharing abstraction. The authors describe two implementations of this abstraction, one based on a shared tuple space and another based on an in-memory distributed key/value store. The authors describe the implementation of the framework, a set of developed tools, and their experimentation with the execution of the text mining algorithm over multiple Amazon EC2 (Elastic Compute Cloud) instances, and report on the speed-up and size-up results obtained for up to 20 EC2 instances and for different corpus sizes, up to 97 million words.
10

Esposito, Christian, and Massimo Ficco. "Recent Developments on Security and Reliability in Large-Scale Data Processing with MapReduce." International Journal of Data Warehousing and Mining 12, no. 1 (January 2016): 49–68. http://dx.doi.org/10.4018/ijdwm.2016010104.

Full text
Abstract:
The demand to access large volumes of data, distributed across hundreds or thousands of machines, has opened new opportunities in commerce, science, and computing applications. MapReduce is a paradigm that offers a programming model and an associated implementation for processing massive datasets in a parallel fashion, using non-dedicated distributed computing hardware. It has been successfully adopted in several academic and industrial projects for Big Data Analytics. However, since such analytics is increasingly demanded within the context of mission-critical applications, security and reliability in MapReduce frameworks are strongly required in order to manage sensitive information and to obtain the right answer at the right time. In this paper, the authors present the main implementation of the MapReduce programming paradigm, provided by Apache under the name Hadoop. They illustrate the security and reliability concerns in the context of a large-scale data processing infrastructure. They review the available solutions and their limitations in supporting security and reliability within the context of MapReduce frameworks. The authors conclude by describing the ongoing evolution of such solutions and the possible issues for improvement, which could be challenging research opportunities for academic researchers.
11

Al-Absi, Ahmed Abdulhakim, Najeeb Abbas Al-Sammarraie, Wael Mohamed Shaher Yafooz, and Dae-Ki Kang. "Parallel MapReduce: Maximizing Cloud Resource Utilization and Performance Improvement Using Parallel Execution Strategies." BioMed Research International 2018 (October 17, 2018): 1–17. http://dx.doi.org/10.1155/2018/7501042.

Full text
Abstract:
MapReduce is the preferred cloud computing framework for large data analysis and application processing. MapReduce frameworks currently in place suffer performance degradation due to the adoption of sequential processing approaches with little modification and thus exhibit underutilization of cloud resources. To overcome this drawback and reduce costs, we introduce a Parallel MapReduce (PMR) framework in this paper. We design a novel parallel execution strategy for Map and Reduce worker nodes. Our strategy enables further performance improvement and efficient utilization of cloud resources by executing Map and Reduce functions in a way that exploits the multicore environments available on computing nodes. We explain the makespan modeling and working principle of the PMR framework in detail. The performance of PMR is compared with Hadoop through experiments considering three biomedical applications. Experiments conducted for the BLAST, CAP3, and DeepBind biomedical applications report makespan time reductions of 38.92%, 18.00%, and 34.62%, respectively, for the PMR framework compared with the Hadoop framework. The experimental results show that the proposed PMR cloud computing platform is robust, cost-effective, and scalable, and sufficiently supports diverse applications on public and private cloud platforms. Consequently, the overall presentation and results indicate good agreement between the theoretical makespan model presented and the experimental values investigated.
12

Ferreira, Tharso, Antonio Espinosa, Juan Carlos Moure, and Porfidio Hernández. "An Optimization for MapReduce Frameworks in Multi-core Architectures." Procedia Computer Science 18 (2013): 2587–90. http://dx.doi.org/10.1016/j.procs.2013.05.446.

Full text
13

Marynowski, João Eugenio, Altair Olivo Santin, and Andrey Ricardo Pimentel. "Method for testing the fault tolerance of MapReduce frameworks." Computer Networks 86 (July 2015): 1–13. http://dx.doi.org/10.1016/j.comnet.2015.04.009.

Full text
14

Weipeng, Jing, Tian Dongxue, Chen Guangsheng, and Li Yiyuan. "Research on Improved Method of Storage and Query of Large-Scale Remote Sensing Images." Journal of Database Management 29, no. 3 (July 2018): 1–16. http://dx.doi.org/10.4018/jdm.2018070101.

Full text
Abstract:
Traditional methods for storing and processing massive remote sensing data suffer from low efficiency and poor scalability. This article presents a parallel processing method based on MapReduce and HBase. Remote sensing images are filled along a Hilbert curve so that the MapReduce method can construct pyramids in parallel and reduce network communication between nodes. The authors then design a massive remote sensing data storage model composed of a metadata storage model, an index structure and a filter column family. Finally, this article uses MapReduce frameworks to realize pyramid construction, storage and querying of remote sensing data. The experimental results show that this method can effectively improve the speed of data writing and querying, and has good scalability.
15

Diarra, Mamadou, and Telesphore B. Tiendrebeogo. "Performance Evaluation of Big Data Processing of Cloak-Reduce." International Journal of Distributed and Parallel systems 13, no. 1 (January 31, 2022): 13–22. http://dx.doi.org/10.5121/ijdps.2022.13102.

Full text
Abstract:
Big Data has introduced the challenge of storing and processing large volumes of data (text, images, and videos). Centralised exploitation of massive data on a single node is no longer adequate, leading to the emergence of distributed storage, parallel processing, and hybrid distributed storage and parallel processing frameworks. The main objective of this paper is to evaluate the load balancing and task allocation strategy of our hybrid distributed storage and parallel processing framework CLOAK-Reduce. To achieve this goal, we first took a theoretical approach to the architecture and operation of some DHT-MapReduce systems. Then, we compared the data collected on their load balancing and task allocation strategies by simulation. Finally, the simulation results show that CLOAK-Reduce C5R5 replication provides better load balancing efficiency and MapReduce job submission, with 10% churn or no churn.
16

Saundatt, Sujay I. "Databases In The 21’st Century." International Journal for Research in Applied Science and Engineering Technology 10, no. 6 (June 30, 2022): 1440–44. http://dx.doi.org/10.22214/ijraset.2022.43982.

Full text
Abstract:
NoSQL databases are 21st-century databases created to overcome the disadvantages of RDBMS. The objective of NoSQL is to provide scalability and availability and to meet the various requirements of distributed computing. The main motivations for NoSQL database systems are achieving scalability and failover needs. In the vast majority of NoSQL database systems, data is partitioned and replicated across numerous nodes. Most of them inherently use either Google's MapReduce, the Hadoop Distributed File System, or Hadoop MapReduce for data collection. Cassandra, HBase and MongoDB are the most widely used and can be regarded as representatives of the NoSQL world.
17

Memishi, Bunjamin, María S. Pérez, and Gabriel Antoniu. "Feedback-Based Resource Allocation in MapReduce-Based Systems." Scientific Programming 2016 (2016): 1–13. http://dx.doi.org/10.1155/2016/7241928.

Full text
Abstract:
Containers are considered an optimized fine-grain alternative to virtual machines in cloud-based systems. MapReduce frameworks are among the approaches that have adopted the use of containers. This paper analyzes the use of containers in MapReduce-based systems, concluding that the resource utilization of these systems in terms of containers is suboptimal. In order to solve this, the paper describes AdaptCont, a proposal for optimizing container allocation in MapReduce systems. AdaptCont is based on the foundations of feedback systems. Two different selection approaches, Dynamic AdaptCont and Pool AdaptCont, are defined. Whereas Dynamic AdaptCont calculates the exact amount of resources per container, Pool AdaptCont chooses a predefined container from a pool of available configurations. AdaptCont is evaluated for a particular case, the application master container of Hadoop YARN. The evaluation shows that AdaptCont behaves much better than the default resource allocation mechanism of Hadoop YARN.
18

Astsatryan, Hrachya, Aram Kocharyan, Daniel Hagimont, and Arthur Lalayan. "Performance Optimization System for Hadoop and Spark Frameworks." Cybernetics and Information Technologies 20, no. 6 (December 1, 2020): 5–17. http://dx.doi.org/10.2478/cait-2020-0056.

Full text
Abstract:
The optimization of large-scale data sets depends on the technologies and methods used. The MapReduce model, implemented on Apache Hadoop or Spark, allows splitting large data sets into a set of blocks distributed on several machines. Data compression reduces data size and transfer time between disks and memory but requires additional processing. Therefore, finding an optimal tradeoff is a challenge, as a high compression factor may underload Input/Output but overload the processor. The paper presents a system enabling the selection of compression tools and the tuning of the compression factor to reach the best performance in Apache Hadoop and Spark infrastructures, based on simulation analyses.
19

Khalid, Madiha, and Muhammad Murtaza Yousaf. "A Comparative Analysis of Big Data Frameworks: An Adoption Perspective." Applied Sciences 11, no. 22 (November 22, 2021): 11033. http://dx.doi.org/10.3390/app112211033.

Full text
Abstract:
The emergence of social media, the worldwide web, electronic transactions, and next-generation sequencing not only opens new horizons of opportunity but also leads to the accumulation of a massive amount of data. The rapid growth of digital data generated from diverse sources makes traditional storage, processing, and analysis methods inadequate. These limitations have led to the development of new technologies to process and store very large datasets. As a result, several execution frameworks emerged for big data processing. Hadoop MapReduce, the pioneering framework, set the ground for forthcoming frameworks that improve the processing and development of large-scale data in many ways. This research focuses on comparing the most prominent and widely used frameworks in the open-source landscape. We identify key requirements of a big data framework and review each of these frameworks from the perspective of those requirements. To enhance the clarity of comparison and analysis, we group the logically related features, forming a feature vector. We design seven feature vectors and present a comparative analysis of the frameworks with respect to those feature vectors. We identify use cases and highlight the strengths and weaknesses of each framework. Moreover, we present a detailed discussion that can serve as a decision-making guide for selecting the appropriate framework for an application.
20

Yang, Wen Chuan, Jiang Yong Wang, and Hao Yu Zeng. "A MapReduce Telecommunication Data Center Analysis Model." Advanced Materials Research 734-737 (August 2013): 2863–66. http://dx.doi.org/10.4028/www.scientific.net/amr.734-737.2863.

Full text
Abstract:
With the wide use of smartphones in China, all input packet streams are routed to the Content Distribution Service (CDS) switching centers. Each center receives up to 1.5 terabytes of data every day. Normally, the job of the switch is to transmit data. Obviously, an ordinary database cannot handle such a massive dataset and complex ad-hoc queries. In this paper, we propose DeepMR, a MapReduce deep service analysis system based on the Hive/Hadoop frameworks. The distributed file system HDFS is used in DeepMR for fast data sharing and querying. DeepMR also optimizes scheduling for switch analysis jobs and supports fault tolerance for the entire workflow. Our results show that the model achieves higher efficiency.
21

Tiwari, Jyotindra, Mahesh Pawar, and Anjajana Pandey. "A Survey on Accelerated Mapreduce for Hadoop." Oriental Journal of Computer Science and Technology 10, no. 3 (July 3, 2017): 597–602. http://dx.doi.org/10.13005/ojcst/10.03.07.

Full text
Abstract:
Big Data is defined by the 3Vs, which stand for variety, volume and velocity: the volume of data is very large, data exists in a variety of file types, and data grows very rapidly. Big data storage and processing have always been a major issue, and big data has become even more challenging to handle these days. To handle big data, high-performance techniques have been introduced. Several frameworks, such as Apache Hadoop, have been introduced to process big data. Apache Hadoop provides map/reduce to process big data, but this map/reduce can be further accelerated. In this paper, a survey has been performed of map/reduce acceleration and energy-efficient, fast computation.
22

Azhir, Elham, Mehdi Hosseinzadeh, Faheem Khan, and Amir Mosavi. "Performance Evaluation of Query Plan Recommendation with Apache Hadoop and Apache Spark." Mathematics 10, no. 19 (September 26, 2022): 3517. http://dx.doi.org/10.3390/math10193517.

Full text
Abstract:
Access plan recommendation is a query optimization approach that executes new queries using previously created query execution plans (QEPs). In this method, the query optimizer divides the query space into clusters. However, traditional clustering algorithms take a significant amount of execution time when clustering such large datasets. The MapReduce distributed computing model provides efficient solutions for storing and processing vast quantities of data. The Apache Spark and Apache Hadoop frameworks are used in the present investigation to cluster different sizes of query datasets in the MapReduce-based access plan recommendation method. The performance evaluation is based on execution time. The results of the experiments demonstrate the effectiveness of parallel query clustering in achieving high scalability. Furthermore, Apache Spark achieved better performance than Apache Hadoop, reaching an average speedup of 2x.
23

Jo, Junghee, and Kang-Woo Lee. "High-Performance Geospatial Big Data Processing System Based on MapReduce." ISPRS International Journal of Geo-Information 7, no. 10 (October 6, 2018): 399. http://dx.doi.org/10.3390/ijgi7100399.

Full text
Abstract:
With the rapid development of Internet of Things (IoT) technologies, the increasing volume and diversity of sources of geospatial big data have created challenges in storing, managing, and processing data. In addition to the general characteristics of big data, the unique properties of spatial data make the handling of geospatial big data even more complicated. To facilitate users implementing geospatial big data applications in a MapReduce framework, several big data processing systems have extended the original Hadoop to support spatial properties. Most of those platforms, however, have included spatial functionalities by embedding them as a form of plug-in. Although offering a convenient way to add new features to an existing system, the plug-in has several limitations. In particular, while executing spatial and nonspatial operations by alternating between the existing system and the plug-in, additional read and write overheads have to be added to the workflow, significantly reducing performance efficiency. To address this issue, we have developed Marmot, a high-performance, geospatial big data processing system based on MapReduce. Marmot extends Hadoop at a low level to support seamless integration between spatial and nonspatial operations of a solid framework, allowing improved performance of geoprocessing workflow. This paper explains the overall architecture and data model of Marmot as well as the main algorithm for automatic construction of MapReduce jobs from a given spatial analysis task. To illustrate how Marmot transforms a sequence of operators for spatial analysis to map and reduce functions in a way to achieve better performance, this paper presents an example of spatial analysis retrieving the number of subway stations per city in Korea. This paper also experimentally demonstrates that Marmot generally outperforms SpatialHadoop, one of the top plug-in based spatial big data frameworks, particularly in dealing with complex and time-intensive queries involving spatial index.
24

Yang, Wen Chuan, He Chen, and Qing Yi Qu. "Research of a MapReduce Model to Process the Traffic Big Data." Applied Mechanics and Materials 548-549 (April 2014): 1853–56. http://dx.doi.org/10.4028/www.scientific.net/amm.548-549.1853.

Full text
Abstract:
Normally, the job of the Traffic Data Processing Center (TDPC) is to monitor and retain data. There is a tendency to put more capability into the TDPC, such as ad-hoc queries for identifying speeding cars and feeding back abnormal traffic information. Thus we definitely need to think about what can be kept in working storage and how to analyze it. Obviously, an ordinary database cannot handle such a massive dataset and complex ad-hoc queries. MapReduce is a popular and widely used fine-grained parallel runtime developed for high-performance processing of large-scale datasets. In this paper, we propose MRTP, a MapReduce Traffic Processing system based on the Hive/Hadoop frameworks. The distributed file system HDFS is used in MRTP for fast data sharing and querying. MRTP supports quickly locating speeding cars and also optimizes the route to catch fugitives. Our results show that the model achieves higher efficiency.
25

Teffer, Dean, Ravi Srinivasan, and Joydeep Ghosh. "AdaHash: hashing-based scalable, adaptive hierarchical clustering of streaming data on Mapreduce frameworks." International Journal of Data Science and Analytics 8, no. 3 (August 1, 2018): 257–67. http://dx.doi.org/10.1007/s41060-018-0145-7.

Full text
26

Karamolegkos, Panagiotis, Argyro Mavrogiorgou, Athanasios Kiourtis, and Dimosthenis Kyriazis. "EverAnalyzer: A Self-Adjustable Big Data Management Platform Exploiting the Hadoop Ecosystem." Information 14, no. 2 (February 3, 2023): 93. http://dx.doi.org/10.3390/info14020093.

Full text
Abstract:
Big Data is a phenomenon that affects today’s world, with new data being generated every second. Today’s enterprises face major challenges from the increasingly diverse data, as well as from indexing, searching, and analyzing such enormous amounts of data. In this context, several frameworks and libraries for processing and analyzing Big Data exist. Among those frameworks Hadoop MapReduce, Mahout, Spark, and MLlib appear to be the most popular, although it is unclear which of them best suits and performs in various data processing and analysis scenarios. This paper proposes EverAnalyzer, a self-adjustable Big Data management platform built to fill this gap by exploiting all of these frameworks. The platform is able to collect data both in a streaming and in a batch manner, utilizing the metadata obtained from its users’ processing and analytical processes applied to the collected data. Based on this metadata, the platform recommends the optimum framework for the data processing/analytical activities that the users aim to execute. To verify the platform’s efficiency, numerous experiments were carried out using 30 diverse datasets related to various diseases. The results revealed that EverAnalyzer correctly suggested the optimum framework in 80% of the cases, indicating that the platform made the best selections in the majority of the experiments.
27

Yang, Wen Chuan, Rui Li, and Zhi Dong Shang. "A MapReduce Model to Process Massive Switching Center Data Set." Applied Mechanics and Materials 548-549 (April 2014): 1557–60. http://dx.doi.org/10.4028/www.scientific.net/amm.548-549.1557.

Full text
Abstract:
With the wide use of smartphones in China, all input packet streams are routed to the Telecommunication Content Distribution Service Switching Centers (TSC). There is a tendency to put more capability into the switch, such as retaining or querying passing data. Thus we definitely need to think about what can be kept in working storage and how to analyze it. Obviously, an ordinary database cannot handle such a massive dataset and complex ad-hoc queries. In this paper, we propose MRTSC, a MapReduce deep service analysis system based on the Hive/Hadoop frameworks. The distributed file system HDFS is used in MRTSC for fast data sharing and querying. MRTSC also optimizes scheduling for switch analysis jobs and supports fault tolerance for the entire workflow. Our results show that the model achieves higher efficiency.
28

Saadoon, Muntadher, Siti Hafizah Ab Hamid, Hazrina Sofian, Hamza Altarturi, Nur Nasuha, Zati Hakim Azizul, Asmiza Abdul Sani, and Adeleh Asemi. "Experimental Analysis in Hadoop MapReduce: A Closer Look at Fault Detection and Recovery Techniques." Sensors 21, no. 11 (May 31, 2021): 3799. http://dx.doi.org/10.3390/s21113799.

Full text
Abstract:
Hadoop MapReduce reactively detects and recovers faults after they occur, based on static heartbeat detection and re-execution from scratch. However, these techniques lead to excessive response time penalties and inefficient resource consumption during detection and recovery. Existing fault-tolerance solutions intend to mitigate these limitations without considering critical conditions such as fail-slow faults, the impact of faults at various infrastructure levels, and the relationship between the detection and recovery stages. This paper analyses the response time under two main conditions, fail-stop and fail-slow, when they manifest in nodes, services, and tasks at runtime. In addition, we focus on the relationship between the time for detecting and recovering faults. The experimental analysis is conducted on a real Hadoop cluster comprising the MapReduce, YARN and HDFS frameworks. Our analysis shows that the recovery of a single fault leads to an average response time penalty of 67.6%. Even when the detection and recovery times are well tuned, data locality and resource availability must also be considered to obtain the optimum tolerance time and the lowest penalties.
29

Astsatryan, Hrachya, Arthur Lalayan, Aram Kocharyan, and Daniel Hagimont. "Performance-efficient Recommendation and Prediction Service for Big Data frameworks focusing on Data Compression and In-memory Data Storage Indicators." Scalable Computing: Practice and Experience 22, no. 4 (November 26, 2021): 401–12. http://dx.doi.org/10.12694/scpe.v22i4.1945.

Full text
Abstract:
The MapReduce framework manages Big Data sets by splitting large datasets into a set of distributed blocks and processing them in parallel. Data compression and in-memory file systems are widely used methods in Big Data processing to reduce resource-intensive I/O operations and correspondingly improve the I/O rate. The article presents a performance-efficient, modular and configurable, robust decision-making service relying on data compression and in-memory data storage indicators. The service consists of Recommendation and Prediction modules, predicts the execution time of a given job based on metrics, and recommends the best configuration parameters to improve the performance of the Hadoop and Spark frameworks. Several CPU- and data-intensive applications and micro-benchmarks, including Log Analyzer, WordCount, and K-Means, have been evaluated to demonstrate the performance improvement.
30

Anand, L., K. Senthilkumar, N. Arivazhagan, and V. Sivakumar. "Analysis for guaranteeing performance in map reduce systems with hadoop and R." International Journal of Engineering & Technology 7, no. 3.3 (June 8, 2018): 445. http://dx.doi.org/10.14419/ijet.v7i2.33.14207.

Full text
Abstract:
Companies have rapidly growing amounts of data to process and store, and a data explosion is under way. At present, one of the most common ways to handle these huge data quantities is based on the MapReduce parallel programming paradigm. Although its use is widespread in industry, guaranteeing performance constraints while at the same time minimizing costs still poses considerable challenges. We propose a coarse-grained control-theoretic approach, based on techniques that have already proved their worth in the control community. We introduce the first algorithm to build dynamic models for big data MapReduce systems running a concurrent workload. Moreover, we study two key control use cases: relaxed performance with minimal resources, and strict performance. For the first case we develop two control mechanisms: a classical feedback controller and an event-based feedback controller that also minimizes the number of cluster reconfigurations. In addition, to handle strict performance requirements, a feedforward controller that rapidly suppresses the effects of large workload size variations is developed. All the controllers are validated online through a benchmark running on a real sixty-node MapReduce cluster, using a data-intensive Business Intelligence workload. Our experiments demonstrate the success of the control strategies employed in meeting service time constraints.
31

Yang, Wen Chuan, Guang Jie Lin, and Jiang Yong Wang. "A MapReduce Clone Car Identification Model over Traffic Data Stream." Applied Mechanics and Materials 346 (August 2013): 117–22. http://dx.doi.org/10.4028/www.scientific.net/amm.346.117.

Full text
Abstract:
With the wide use of intelligent traffic systems in China, all traffic input data streams to the Traffic Surveillance Center (TSC). Some metropolitan TSCs, such as Beijing's, receive up to 18 million records and 1 TB of image data every hour. Normally, the job of the TSC is to monitor and retain data. There is a tendency to put more capability into the TSC, such as ad-hoc queries for clone car identification and feeding back abnormal traffic information. Thus we definitely need to think about what can be kept in working storage and how to analyze it. Obviously, an ordinary database cannot handle such a massive dataset and complex ad-hoc queries. MapReduce is a popular and widely used fine-grained parallel runtime developed for high-performance processing of large-scale datasets. In this paper, we propose CarMR, a MapReduce clone car identification system based on the Hive/Hadoop frameworks. The distributed file system HDFS is used in CarMR for fast data sharing and querying. CarMR supports quickly locating clone cars and also optimizes the route to catch fugitives. Our results show that the model achieves higher efficiency.
32

Fernández, Alberto, Sara del Río, Victoria López, Abdullah Bawakid, María J. del Jesus, José M. Benítez, and Francisco Herrera. "Big Data with Cloud Computing: an insight on the computing environment, MapReduce, and programming frameworks." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4, no. 5 (September 2014): 380–409. http://dx.doi.org/10.1002/widm.1134.

Full text
33

Dhasaratham, M., and R. P. Singh. "A Survey on Data Anonymization Using Mapreduce on Cloud with Scalable Two-Phase Top-Down Approach." International Journal of Engineering & Technology 7, no. 2.20 (April 18, 2018): 254. http://dx.doi.org/10.14419/ijet.v7i2.20.14773.

Full text
Abstract:
Numerous cloud services expect users to share private data, such as electronic health records, for data analysis or mining, raising privacy concerns. Anonymizing data sets via generalization to satisfy certain privacy requirements, such as k-anonymity, is a widely used category of privacy-preserving techniques. At present, the scale of data in many cloud applications grows tremendously in accordance with the Big Data trend, making it a challenge for commonly used software tools to capture, manage, and process such large-scale data within a tolerable elapsed time. It is therefore a challenge for existing anonymization approaches to achieve privacy preservation on privacy-sensitive large-scale data sets due to their lack of scalability. In this paper, we propose a scalable two-phase top-down specialization (TDS) approach to anonymize large-scale data sets using the MapReduce framework on the cloud. In both phases of our approach, we deliberately design a group of innovative MapReduce jobs to accomplish the specialization computation in a highly scalable way. Experimental evaluation results demonstrate that with our approach, the scalability and efficiency of TDS can be significantly improved over existing approaches.
34

Rahman, Md Wasi-ur, Nusrat Sharmin Islam, Xiaoyi Lu, Dipti Shankar, and Dhabaleswar K. (DK) Panda. "MR-Advisor: A comprehensive tuning, profiling, and prediction tool for MapReduce execution frameworks on HPC clusters." Journal of Parallel and Distributed Computing 120 (October 2018): 237–50. http://dx.doi.org/10.1016/j.jpdc.2017.11.004.

Full text
35

Ravindra, Padmashree, and Kemafor Anyanwu. "Nesting Strategies for Enabling Nimble MapReduce Dataflows for Large RDF Data." International Journal on Semantic Web and Information Systems 10, no. 1 (January 2014): 1–26. http://dx.doi.org/10.4018/ijswis.2014010101.

Full text
Abstract:
Graph and semi-structured data are usually modeled in relational processing frameworks as “thin” relations (node, edge, node) and processing such data involves a lot of join operations. Intermediate results of joins with multi-valued attributes or relationships, contain redundant subtuples due to repetition of single-valued attributes. The amount of redundant content is high for real-world multi-valued relationships in social network (millions of Twitter followers of popular celebrities) or biological (multiple references to related proteins) datasets. In MapReduce-based platforms such as Apache Hive and Pig, redundancy in intermediate results contributes avoidable costs to the overall I/O, sorting, and network transfer overhead of join-intensive workloads due to longer workflows. Consequently, providing techniques for dealing with such redundancy will enable more nimble execution of such workflows. This paper argues for the use of a nested data model for representing intermediate data concisely using nesting-aware dataflow operators that allow for lazy and partial unnesting strategies. This approach reduces the overall I/O and network footprint of a workflow by concisely representing intermediate results during most of a workflow's execution, until complete unnesting is absolutely necessary. The proposed strategies are integrated into Apache Pig and experimental evaluation over real-world and synthetic benchmark datasets confirms their superiority over relational-style MapReduce systems such as Apache Pig and Hive.
36

Zheng, Kun, Kang Zheng, Falin Fang, Hong Yao, Yunlei Yi, and Deze Zeng. "Real-Time Massive Vector Field Data Processing in Edge Computing." Sensors 19, no. 11 (June 7, 2019): 2602. http://dx.doi.org/10.3390/s19112602.

Full text
Abstract:
The spread of sensors and industrial systems has fostered widespread real-time data processing applications. Massive vector field data (MVFD) are generated by vast distributed sensors and are characterized by high distribution, high velocity, and high volume. As a result, computing such data on a centralized cloud faces unprecedented challenges, especially in processing delay due to the distance between the data source and the cloud. Taking advantage of data source proximity and vast distribution, edge computing is ideal for timely computing on MVFD. Therefore, we are motivated to propose an edge computing based MVFD processing framework. In particular, we notice that the high volume feature of MVFD results in high data transmission delay. To solve this problem, we invent Data Fluidization Schedule (DFS) in our framework to reduce the data block volume and the latency on Input/Output (I/O). We evaluated the efficiency of our framework in a practical application on massive wind field data processing for cyclone recognition. The high efficiency of our framework was verified by the fact that it significantly outperformed the classical big data processing frameworks Spark and MapReduce.
37

Dey, Tonmoy, Yixin Chen, and Alan Kuhnle. "DASH: A Distributed and Parallelizable Algorithm for Size-Constrained Submodular Maximization." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 4 (June 26, 2023): 3941–48. http://dx.doi.org/10.1609/aaai.v37i4.25508.

Full text
Abstract:
MapReduce (MR) algorithms for maximizing monotone, submodular functions subject to a cardinality constraint (SMCC) are currently restricted to the use of the linear-adaptive (non-parallelizable) algorithm GREEDY. Low-adaptive algorithms do not satisfy the requirements of these distributed MR frameworks, thereby limiting their performance. We study the SMCC problem in a distributed setting and propose the first MR algorithms with sublinear adaptive complexity. Our algorithms, R-DASH, T-DASH and G-DASH provide 0.316 - ε, 3/8 - ε , and (1 - 1/e - ε) approximation ratios, respectively, with nearly optimal adaptive complexity and nearly linear time complexity. Additionally, we provide a framework to increase, under some mild assumptions, the maximum permissible cardinality constraint from O( n / ℓ^2) of prior MR algorithms to O( n / ℓ ), where n is the data size and ℓ is the number of machines; under a stronger condition on the objective function, we increase the maximum constraint value to n. Finally, we provide empirical evidence to demonstrate that our sublinear-adaptive, distributed algorithms provide orders of magnitude faster runtime compared to current state-of-the-art distributed algorithms.
38

Ragala, Ramesh, and G. Bharadwaja Kumar. "Recursive Block LU Decomposition based ELM in Apache Spark." Journal of Intelligent & Fuzzy Systems 39, no. 6 (December 4, 2020): 8205–15. http://dx.doi.org/10.3233/jifs-189141.

Full text
Abstract:
Due to the massive memory and computational resources required to build complex machine learning models on large datasets, many researchers employ distributed environments for training models on large datasets. Parallel implementations of the Extreme Learning Machine (ELM), with many variants, have been developed using the MapReduce and Spark frameworks in recent years. However, these approaches have severe limitations in terms of Input-Output (I/O) cost, memory, etc. From the literature, it is known that the complexity of ELM is directly proportional to the computation of the Moore-Penrose pseudo-inverse of the hidden layer matrix in ELM. Most of the ELM variants developed on the Spark framework have employed Singular Value Decomposition (SVD) to compute the Moore-Penrose pseudo-inverse. However, SVD has severe memory limitations when experimenting with large datasets. In this paper, a method that uses Recursive Block LU Decomposition to compute the Moore-Penrose generalized inverse over a Spark cluster is proposed to reduce the computational complexity. This method makes the ELM algorithm efficient in handling scalability and also yields faster execution of the model. The experimental results show that the proposed method is more efficient than the existing algorithms available in the literature.
39

Hung, Che-Lun, and Guan-Jie Hua. "Cloud Computing for Protein-Ligand Binding Site Comparison." BioMed Research International 2013 (2013): 1–7. http://dx.doi.org/10.1155/2013/170356.

Full text
Abstract:
The proteome-wide analysis of protein-ligand binding sites and their interactions with ligands is important in structure-based drug design and in understanding ligand cross reactivity and toxicity. The well-known and commonly used software, SMAP, has been designed for 3D ligand binding site comparison and similarity searching of a structural proteome. SMAP can also predict drug side effects and reassign existing drugs to new indications. However, the computing scale of SMAP is limited. We have developed a high availability, high performance system that expands the comparison scale of SMAP. This cloud computing service, called Cloud-PLBS, combines the SMAP and Hadoop frameworks and is deployed on a virtual cloud computing platform. To handle the vast amount of experimental data on protein-ligand binding site pairs, Cloud-PLBS exploits the MapReduce paradigm as a management and parallelizing tool. Cloud-PLBS provides a web portal and scalability through which biologists can address a wide range of computer-intensive questions in biology and drug discovery.
40

Cândido, Paulo Gustavo Lopes, Jonathan Andrade Silva, Elaine Ribeiro Faria, and Murilo Coelho Naldi. "Optimization Algorithms for Scalable Stream Batch Clustering with k Estimation." Applied Sciences 12, no. 13 (June 25, 2022): 6464. http://dx.doi.org/10.3390/app12136464.

Full text
Abstract:
The increasing volume and velocity of the continuously generated data (data stream) challenge machine learning algorithms, which must evolve to fit real-world problems. The data stream clustering algorithms face issues such as the rapidly increasing volume of the data, the variety of the number of clusters, and their shapes. The present work aims to improve the accuracy of sequential clustering batches of data streams for scenarios in which clusters evolve dynamically and continuously, automatically estimating their number. In order to achieve this goal, three evolutionary algorithms are presented, along with three novel algorithms designed to deal with clusters of normal distribution based on goodness-of-fit tests in the context of scalable batch stream clustering with automatic estimation of the number of clusters. All of them are developed on top of MapReduce, Discretized-Stream models, and the most recent MPC frameworks to provide scalability, reliability, resilience, and flexibility. The proposed algorithms are experimentally compared with state-of-the-art methods and present the best results for accuracy for normally distributed data sets, reaching their goal.
41

Ji, Yunhong, Yunpeng Chai, Xuan Zhou, Lipeng Ren, and Yajie Qin. "Smart Intra-query Fault Tolerance for Massive Parallel Processing Databases." Data Science and Engineering 5, no. 1 (December 19, 2019): 65–79. http://dx.doi.org/10.1007/s41019-019-00114-z.

Full text
Abstract:
Intra-query fault tolerance has increasingly been a concern for online analytical processing, as more and more enterprises migrate data analytical systems from mainframes to commodity computers. Most massive parallel processing (MPP) databases do not support intra-query fault tolerance. They may suffer from prolonged query latency when running on unreliable commodity clusters. While SQL-on-Hadoop systems can utilize the fault tolerance support of low-level frameworks, such as MapReduce and Spark, their cost-effectiveness is not always acceptable. In this paper, we propose a smart intra-query fault tolerance (SIFT) mechanism for MPP databases. SIFT achieves fault tolerance by performing checkpointing, i.e., materializing intermediate results of selected operators. Different from existing approaches, SIFT aims at promoting the query success rate within a given time. To achieve its goal, it needs to: (1) minimize query rerunning time after encountering failures and (2) introduce as little checkpointing overhead as possible. To evaluate SIFT in real-world MPP database systems, we implemented it in Greenplum. The experimental results indicate that it can effectively improve the success rate of query processing, especially when working with unreliable hardware.
42

Pal, Gautam, Gangmin Li, and Katie Atkinson. "Multi-Agent Big-Data Lambda Architecture Model for E-Commerce Analytics." Data 3, no. 4 (December 1, 2018): 58. http://dx.doi.org/10.3390/data3040058.

Full text
Abstract:
We study a big-data hybrid-data-processing lambda architecture, which consolidates low-latency real-time frameworks with high-throughput Hadoop batch frameworks over a massively distributed setup. In particular, the real-time and batch-processing engines act as autonomous multi-agent systems in collaboration. We propose a Multi-Agent Lambda Architecture (MALA) for e-commerce data analytics. We address the high-latency problem of Hadoop MapReduce jobs by simultaneously processing, at the speed layer, the requests that require a quick turnaround time. At the same time, the batch layer in parallel provides comprehensive coverage of the data by intelligently blending stream and historical data through a weighted voting method. The cold-start problem of streaming services is addressed through an initial offset taken from historical batch data. The challenges of high-velocity data ingestion are resolved with distributed message queues. A proposed multi-agent decision-maker component is placed in the MALA stack as the gateway of the data pipeline. We prove the efficiency of our batch model by implementing an array of features for an e-commerce site. The novelty of the model and its key significance is a scheme for multi-agent interaction between batch and real-time agents that produces deeper insights at low latency and at significantly lower cost. Hence, the proposed system is highly appealing for applications involving big data and caters to high-velocity streaming ingestion and a massive data pool.
APA, Harvard, Vancouver, ISO, and other styles
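The weighted-voting blend of stream and historical data, together with the cold-start fallback to the batch view, can be sketched in a few lines. The weights and the warm-up threshold below are illustrative assumptions, not the values used in MALA.

```python
def blended_metric(batch_value, stream_value, stream_sample_count,
                   w_batch=0.6, w_stream=0.4, min_stream_samples=50):
    # Cold start: until the speed layer has seen enough events, trust history.
    if stream_value is None or stream_sample_count < min_stream_samples:
        return batch_value
    # Otherwise blend the two views by weighted voting.
    return w_batch * batch_value + w_stream * stream_value

print(blended_metric(120.0, 150.0, stream_sample_count=10))    # 120.0 (cold start)
print(blended_metric(120.0, 150.0, stream_sample_count=500))   # 132.0 (blended)
```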
43

Akritidis, Leonidas, Athanasios Fevgas, Panagiota Tsompanopoulou, and Panayiotis Bozanis. "Evaluating the Effects of Modern Storage Devices on the Efficiency of Parallel Machine Learning Algorithms." International Journal on Artificial Intelligence Tools 29, no. 03n04 (June 2020): 2060008. http://dx.doi.org/10.1142/s0218213020600088.

Full text
Abstract:
Big Data analytics is presently one of the most rapidly emerging areas of research for both organizations and enterprises. The requirement for deploying efficient machine learning algorithms over huge amounts of data led to the development of parallelization frameworks and of specialized libraries (like Mahout and MLlib) which implement the most important of these algorithms. Moreover, recent advances in storage technology resulted in the introduction of high-performing devices, broadly known as Solid State Drives (SSDs). Compared to traditional Hard Disk Drives (HDDs), SSDs offer considerably higher performance and lower power consumption. Motivated by these appealing features and the growing necessity for efficient large-scale data processing, we compared the performance of several machine learning algorithms on MapReduce clusters whose nodes are equipped with HDDs, SSDs, and devices which implement the latest 3D XPoint technology. In particular, we evaluate several dataset preprocessing methods like vectorization and dimensionality reduction, two supervised classifiers, Naive Bayes and Linear Regression, and the popular k-Means clustering algorithm. We use an experimental cluster equipped with the three aforementioned storage devices under different configurations, and two large datasets, Wikipedia and HIGGS. The experiments showed that the benefits which derive from the usage of SSDs depend on the cluster setup and the nature of the applied algorithms.
APA, Harvard, Vancouver, ISO, and other styles
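The kind of measurement reported in this study can be reproduced, in outline, by timing the same MLlib job on clusters whose DataNodes sit on different storage devices. The sketch below is not the paper's benchmark harness; the dataset path, the label-in-first-column assumption and k=8 are placeholders.

```python
import time
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("storage-kmeans-benchmark").getOrCreate()
raw = spark.read.csv("hdfs:///datasets/HIGGS.csv", inferSchema=True)
# First column assumed to be the class label; the rest are numeric features.
data = VectorAssembler(inputCols=raw.columns[1:], outputCol="features").transform(raw)

start = time.time()                      # wall clock includes the HDFS read,
model = KMeans(k=8, seed=1, featuresCol="features").fit(data)   # so storage speed matters
print(f"k-means training time: {time.time() - start:.1f} s")
```

Running the identical script against HDD-, SSD- and 3D XPoint-backed clusters gives the comparison the authors analyse.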
44

Wang, Zhong, Bo Suo, and Zhuo Wang. "MRScheduling: An Effective Technique for Multi-Tenant Meeting Deadline in MapReduce." Applied Mechanics and Materials 644-650 (September 2014): 4482–86. http://dx.doi.org/10.4028/www.scientific.net/amm.644-650.4482.

Full text
Abstract:
The problem of scheduling multi-tenant jobs on the MapReduce framework has become increasingly significant. Existing scheduling approaches and algorithms no longer fit well in scenarios where numerous jobs are submitted by multiple users at the same time. Therefore, with the aim of enlarging job throughput for MapReduce, we first propose MRScheduling, which focuses on meeting each job's respective deadline. Considering the various parameters related to the execution time of a MapReduce job, we present a simple time-cost model that quantifies the number of map slots and reduce slots assigned to a job. The MRScheduling algorithm is then discussed in detail. Finally, we evaluate our approach on both real and synthetic data on a real distributed cluster to verify its effectiveness and efficiency.
APA, Harvard, Vancouver, ISO, and other styles
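The idea of deriving slot counts from a deadline can be illustrated with a simplified time-cost model (the paper's own model is not reproduced here): split the deadline between the map and reduce phases and size each phase from measured per-slot throughput.

```python
import math

def min_slots_for_deadline(input_mb, shuffle_mb, deadline_s,
                           map_mb_per_slot_s, reduce_mb_per_slot_s,
                           map_share=0.6):
    # map_share is an assumed split of the deadline between the two phases.
    map_deadline = deadline_s * map_share
    reduce_deadline = deadline_s * (1.0 - map_share)
    map_slots = math.ceil(input_mb / (map_mb_per_slot_s * map_deadline))
    reduce_slots = math.ceil(shuffle_mb / (reduce_mb_per_slot_s * reduce_deadline))
    return map_slots, reduce_slots

# 100 GB input, 20 GB shuffled to reducers, 30-minute deadline,
# 5 MB/s per map slot and 8 MB/s per reduce slot (all illustrative).
print(min_slots_for_deadline(102400, 20480, 1800, 5.0, 8.0))   # (19, 4)
```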
45

David Odera. "A survey on techniques, methods and security approaches in big data healthcare." Global Journal of Engineering and Technology Advances 14, no. 2 (February 28, 2023): 093–106. http://dx.doi.org/10.30574/gjeta.2023.14.2.0035.

Full text
Abstract:
A huge percentage of people, especially in developed countries, spend a good portion of their wealth on managing their health conditions. In order to administer healthcare adequately, governments and various organizations have embraced advanced technology for automating the health industry. In the recent past, electronic health records have largely been managed by Enterprise Resource Planning and legacy systems. Big data frameworks have steadily emerged as the underlying technology in healthcare, offering solutions that overcome the storage and reporting limits of other systems. Automation through cloud services, supported by storage of structured and unstructured health data in heterogeneous environments, has improved service delivery, efficiency, medication, diagnosis, reporting and storage in healthcare. Despite numerous studies by scholars on security management for big data in smart healthcare, big data healthcare still faces information security concerns, for instance patient image sharing, patient authentication, botnets, correlation attacks, man-in-the-middle attacks, Distributed Denial of Service (DDoS), blockchain payment gateways, and the time complexity of algorithms. Security techniques include digital image encryption, steganography, biometrics, rule-based policies, prescriptive analysis, blockchain contact tracing, cloud security, MapReduce, machine-learning algorithms and anonymization, among others. However, most of these security solutions and analyses target structured and semi-structured data as opposed to unstructured data. This may affect the output of medical reporting of patients' conditions, particularly on wearable devices and other examinations such as computerized tomography (CT) scans. A major concern is how to identify inherent security vulnerabilities in big data healthcare systems that generate images for transmission and storage. Therefore, this paper conducts a comparative survey of solutions that specifically safeguard structured and unstructured data using systems that run on big data frameworks. The literature highlights several security advancements in cryptography, machine learning, anonymization and protocols. Most of these security frameworks lack implementation evidence, and a number of studies did not provide comprehensive performance metrics (accuracy, error, recall, precision) for their models, besides using a single algorithm without validated justification. Therefore, a critique of the contributions, performance and areas for improvement is discussed and summarized in this paper.
APA, Harvard, Vancouver, ISO, and other styles
46

Gorawski, Marcin, and Michal Lorek. "Efficient storage, retrieval and analysis of poker hands: An adaptive data framework." International Journal of Applied Mathematics and Computer Science 27, no. 4 (December 20, 2017): 713–26. http://dx.doi.org/10.1515/amcs-2017-0049.

Full text
Abstract:
In online gambling, poker hands are one of the most popular and fundamental units of the game state and can be considered objects comprising all the events that pertain to the single hand played. In a situation where tens of millions of poker hands are produced daily and need to be stored and analysed quickly, the use of relational databases no longer provides high scalability and performance stability. The purpose of this paper is to present an efficient way of storing and retrieving poker hands in a big data environment. We propose a new, read-optimised storage model that offers significant data access improvements over traditional database systems as well as the existing Hadoop file formats such as ORC, RCFile or SequenceFile. Through index-oriented partition elimination, our file format allows reducing the number of file splits that need to be accessed, and improves query response time by up to three orders of magnitude in comparison with other approaches. In addition, our file format supports a range of new indexing structures to facilitate fast row retrieval at a split level. Both index types operate independently of the Hive execution context and allow other big data computational frameworks, such as MapReduce or Spark, to benefit from the optimized data access path to the hand information. Moreover, we present a detailed analysis of our storage model and its supporting index structures, and how they are organised in the overall data framework. We also describe in detail how predicate-based expression trees are used to build effective file-level execution plans. Our experimental tests, conducted on a production cluster holding nearly 40 billion hands which span over 4000 partitions, show that multi-way partition pruning outperforms other existing file formats, resulting in faster query execution times and better cluster utilisation.
APA, Harvard, Vancouver, ISO, and other styles
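Index-oriented partition elimination boils down to skipping every file split whose recorded key range cannot satisfy the query predicate. The toy example below captures that idea only; the split metadata and hand-identifier ranges are invented, whereas the paper stores such statistics inside its custom file format.

```python
splits = [
    {"path": "part-0000", "min_hand_id": 1,          "max_hand_id": 10_000_000},
    {"path": "part-0001", "min_hand_id": 10_000_001, "max_hand_id": 20_000_000},
    {"path": "part-0002", "min_hand_id": 20_000_001, "max_hand_id": 30_000_000},
]

def prune_splits(splits, lo, hi):
    # Keep a split only if its id range can overlap the predicate range [lo, hi].
    return [s for s in splits if s["max_hand_id"] >= lo and s["min_hand_id"] <= hi]

print([s["path"] for s in prune_splits(splits, 15_000_000, 16_000_000)])  # ['part-0001']
```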
47

Vidisha Sharma, Satish Kumar Alaria. "Improving the Performance of Heterogeneous Hadoop Clusters Using Map Reduce." International Journal on Recent and Innovation Trends in Computing and Communication 7, no. 2 (February 28, 2019): 11–17. http://dx.doi.org/10.17762/ijritcc.v7i2.5225.

Full text
Abstract:
The key issue arising from the tremendous growth of connectivity among devices and systems is that data is being generated at an exponential rate, making a feasible solution for processing it increasingly difficult to achieve. Consequently, building a platform for such an advanced level of data processing requires both hardware and software improvements to keep pace with such substantial data volumes. To improve the efficiency of Hadoop clusters in storing and analysing big data, we propose an algorithmic approach that caters to the needs of heterogeneous data stored over Hadoop clusters and improves both performance and efficiency. This paper aims to establish the effectiveness of the new algorithm through comparison, recommendations, and a competitive approach to finding the best solution for improving the big data scenario. The MapReduce technique from Hadoop helps maintain a close watch over unstructured or heterogeneous Hadoop clusters, with insights into results derived directly from the algorithm. We propose a new algorithm to address these issues for both commercial and non-commercial uses, which can aid the development of the community. The proposed algorithm can help improve the behaviour of the MapReduce data-indexing process in heterogeneous Hadoop clusters. The experiments conducted in this work have produced promising results, among them the selection of schedulers to schedule jobs, the arrangement of data in a similarity matrix, clustering before scheduling queries, and iterative mapping and reducing with internal dependencies bound together to avoid query stalling and long execution times. The experiments also establish that if a procedure is defined to handle the different use-case scenarios, one can always reduce the cost of computing and benefit from relying on distributed systems for fast execution.
APA, Harvard, Vancouver, ISO, and other styles
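The abstract does not spell out its algorithm, so the snippet below only illustrates one standard tactic for heterogeneous Hadoop clusters rather than the authors' method: place input blocks in proportion to each node's measured processing capacity so that faster nodes keep working on local data. The capacity figures are made up.

```python
def proportional_block_placement(total_blocks, node_capacities):
    total = sum(node_capacities.values())
    plan = {node: int(total_blocks * cap / total)
            for node, cap in node_capacities.items()}
    # Hand any rounding remainder to the fastest node.
    fastest = max(node_capacities, key=node_capacities.get)
    plan[fastest] += total_blocks - sum(plan.values())
    return plan

print(proportional_block_placement(1000, {"node-a": 4.0, "node-b": 2.0, "node-c": 1.0}))
# {'node-a': 573, 'node-b': 285, 'node-c': 142}
```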
48

Zhang, Guigang, Chao Li, Yong Zhang, and Chunxiao Xing. "A Semantic++ MapReduce Parallel Programming Model." International Journal of Semantic Computing 08, no. 03 (September 2014): 279–99. http://dx.doi.org/10.1142/s1793351x14400091.

Full text
Abstract:
Big data is playing a more and more important role in every area, such as medical health, internet finance, culture and education. How to process such big data efficiently is a huge challenge. MapReduce is a good parallel programming model for processing big data. However, it has several shortcomings: for example, it cannot handle complex computations and is not suited to real-time computing. In order to overcome these shortcomings of MapReduce and its variants, in this paper we propose a Semantic++ MapReduce parallel programming model. This study includes the following parts: (1) the Semantic++ MapReduce parallel programming model, including its physical framework and its logic framework; (2) a Semantic++ extraction and management method for big data; (3) the Semantic++ MapReduce parallel computing framework, including semantic++ map, semantic++ reduce and semantic++ shuffle; (4) Semantic++ MapReduce for multi-data centers, including its basic framework and its application framework; and (5) a case study of Semantic++ MapReduce across multi-data centers.
APA, Harvard, Vancouver, ISO, and other styles
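The Semantic++ model itself is not publicly available as code, so the sketch below only hints at the flavour of a "semantic map": the mapper emits a semantic category as the key, so the shuffle groups records by meaning rather than by a raw field value. The keyword-based tagger and the single-process driver standing in for the shuffle are assumptions for illustration.

```python
from collections import defaultdict

SEMANTIC_TAGS = {"ecg": "medical health", "loan": "internet finance", "exam": "education"}

def semantic_map(record):
    # Attach a semantic label that becomes the shuffle key.
    for keyword, tag in SEMANTIC_TAGS.items():
        if keyword in record.lower():
            return tag, record
    return "other", record

def semantic_reduce(tag, records):
    return tag, len(records)

groups = defaultdict(list)                 # stands in for the shuffle phase
for rec in ["Patient ECG trace", "Loan application 42", "Final exam scores"]:
    key, value = semantic_map(rec)
    groups[key].append(value)
print([semantic_reduce(tag, recs) for tag, recs in groups.items()])
```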
49

Bansal, Ajay Kumar, Manmohan Sharma, and Ashu Gupta. "Optimizing resources to mitigate stragglers through virtualization in run time." Journal of University of Shanghai for Science and Technology 23, no. 08 (August 31, 2021): 931–35. http://dx.doi.org/10.51201/jusst/21/08486.

Full text
Abstract:
Modern computing systems are generally enormous in scale, consisting of hundreds to thousands of heterogeneous machine nodes, to meet the rising demand for Cloud services. MapReduce and other parallel computing frameworks are frequently used on such cluster architectures to offer consumers dependable and timely services. However, the complex features of Cloud workloads, such as multi-dimensional resource requirements, and dynamically changing system settings, such as dynamic node performance, are posing new difficulties for providers in terms of both customer experience and system efficiency. The straggler problem occurs when a small subset of parallelized tasks takes an excessively long time to execute in contrast to their siblings, resulting in a delayed job response and the possibility of late-timing failure. Speculative execution is the state-of-the-art method for straggler mitigation. It has been used in numerous real-world systems with a variety of implementation improvements, but the results of this research demonstrate that it is typically wasteful: according to various data center production trace logs, the failure rate of speculative execution can be as high as 71 percent. Straggler mitigation is a difficult task in and of itself: (1) stragglers may have varying degrees of severity in parallel job execution; (2) whether a task should be considered a straggler is highly subjective, depending on various application and system conditions; (3) the efficiency of speculative execution would be improved if dynamic node quality could be adequately modeled and predicted; and (4) other sorts of stragglers, such as those caused by data skew, are beyond speculative execution's capabilities.
APA, Harvard, Vancouver, ISO, and other styles
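Speculative execution of the kind analysed above is usually driven by a progress-rate heuristic in the style of LATE: estimate each running task's remaining time from its progress rate and back up the tasks expected to finish farthest in the future. The sketch below illustrates that heuristic only; the thesis studies such mechanisms rather than prescribing this exact rule, and the threshold is an assumption.

```python
def time_left(progress, elapsed_s):
    # progress is the task's completed fraction in [0, 1].
    rate = progress / elapsed_s if elapsed_s > 0 else 0.0
    return (1.0 - progress) / rate if rate > 0 else float("inf")

def pick_speculative(tasks, slow_threshold_s):
    # tasks: list of (task_id, progress, elapsed seconds)
    return [tid for tid, p, t in tasks if time_left(p, t) > slow_threshold_s]

tasks = [("map_01", 0.90, 90), ("map_02", 0.20, 100), ("map_03", 0.85, 80)]
print(pick_speculative(tasks, slow_threshold_s=60))   # ['map_02'] gets a backup copy
```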
50

Gao, Tilei, Ming Yang, Rong Jiang, Yu Li, and Yao Yao. "Research on Computing Efficiency of MapReduce in Big Data Environment." ITM Web of Conferences 26 (2019): 03002. http://dx.doi.org/10.1051/itmconf/20192603002.

Full text
Abstract:
The emergence of big data has had a great impact on traditional computing models; the distributed computing framework represented by MapReduce has become an important solution to this problem. This paper studies the principles and framework of MapReduce programming in the big data setting and, on that basis, compares the time consumption of the distributed MapReduce framework with that of the traditional computing model through concrete programming experiments. The experiments show that MapReduce has great advantages for large data volumes.
APA, Harvard, Vancouver, ISO, and other styles
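A local, single-machine stand-in for the paper's comparison is to time the same word count done sequentially and in a map/reduce style with a process pool. The input size and four-way split below are arbitrary; as the abstract concludes, the parallel variant only pays off once the data volume is large enough to outweigh the coordination overhead.

```python
import time
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_count(chunk):
    return Counter(chunk.split())

def reduce_counts(a, b):
    a.update(b)
    return a

if __name__ == "__main__":
    words = ("big data map reduce " * 2_000_000).split()
    text = " ".join(words)
    chunks = [" ".join(words[i::4]) for i in range(4)]   # pre-split, like HDFS input splits

    start = time.time()
    sequential = Counter(text.split())
    print(f"sequential word count:        {time.time() - start:.2f} s")

    start = time.time()
    with Pool(4) as pool:
        parallel = reduce(reduce_counts, pool.map(map_count, chunks))
    print(f"map/reduce with four workers: {time.time() - start:.2f} s")
    assert sequential == parallel          # both approaches agree on the counts
```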
