Dissertations / Theses on the topic 'MAPREDUCE FRAMEWORKS'

Consult the top 44 dissertations / theses for your research on the topic 'MAPREDUCE FRAMEWORKS.'

1

de Souza Ferreira, Tharso. "Improving Memory Hierarchy Performance on MapReduce Frameworks for Multi-Core Architectures." Doctoral thesis, Universitat Autònoma de Barcelona, 2013. http://hdl.handle.net/10803/129468.

Full text
Abstract:
The need to analyze large data sets from many different application fields has fostered the use of simplified programming models like MapReduce. Its popularity is justified by being a useful abstraction for expressing data-parallel processing and by effectively hiding synchronization, fault tolerance and load balancing details from the application developer. MapReduce frameworks have also been ported to multi-core and shared-memory computer systems. These frameworks dedicate a different CPU core to each map or reduce task so that tasks execute concurrently, and the Map and Reduce phases share a common data structure where the main computations are applied. In this work we describe some limitations of current multi-core MapReduce frameworks. First, we describe the relevance of the data structure used to keep all input and intermediate data in memory. Current multi-core MapReduce frameworks are designed to keep all intermediate data in memory, so when executing applications with large inputs, the available memory becomes too small to store all of the framework's intermediate data and there is a severe performance loss. We propose a memory management subsystem that allows the intermediate data structures to process an unlimited amount of data through a disk spilling mechanism, together with a way to manage concurrent disk access by all threads participating in the computation. Finally, we study the effective use of the memory hierarchy by the data structures of MapReduce frameworks and propose a new implementation of partial MapReduce tasks over the input data set. The objective is to make better use of the cache by eliminating references to data blocks that are no longer in use. Our proposal significantly reduces main memory usage and improves overall performance through increased cache usage.
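As an aside, the disk-spilling idea described in this abstract can be pictured with a small, self-contained sketch: an in-memory buffer of intermediate key/value pairs that flushes to a spill file whenever it grows past a threshold and merges the spills back when the reduce phase reads the data. This is an illustrative toy, not the thesis's subsystem; the class name, threshold, pickle-based file format and merge strategy are all assumptions.

```python
import os
import pickle
import tempfile
from collections import defaultdict

class SpillingBuffer:
    """Toy intermediate-data store that spills to disk when memory fills up.

    Hypothetical illustration only; thresholds, file format and merge
    strategy are assumptions, not the thesis's implementation.
    """

    def __init__(self, max_items_in_memory=100_000, spill_dir=None):
        self.max_items = max_items_in_memory
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="mr_spill_")
        self.buffer = defaultdict(list)   # key -> list of values
        self.item_count = 0
        self.spill_files = []

    def emit(self, key, value):
        """Called by map tasks to add an intermediate (key, value) pair."""
        self.buffer[key].append(value)
        self.item_count += 1
        if self.item_count >= self.max_items:
            self._spill()

    def _spill(self):
        """Write the current in-memory buffer to a spill file and reset it."""
        path = os.path.join(self.spill_dir, f"spill_{len(self.spill_files)}.pkl")
        with open(path, "wb") as f:
            pickle.dump(dict(self.buffer), f)
        self.spill_files.append(path)
        self.buffer = defaultdict(list)
        self.item_count = 0

    def grouped(self):
        """Merge memory and spill files, yielding (key, all_values) for reduce."""
        merged = defaultdict(list)
        for key, values in self.buffer.items():
            merged[key].extend(values)
        for path in self.spill_files:
            with open(path, "rb") as f:
                for key, values in pickle.load(f).items():
                    merged[key].extend(values)
        return merged.items()

if __name__ == "__main__":
    buf = SpillingBuffer(max_items_in_memory=4)   # tiny threshold to force a spill
    for word in "a b a c b a".split():
        buf.emit(word, 1)
    print({k: sum(v) for k, v in buf.grouped()})  # {'a': 3, 'b': 2, 'c': 1}
```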
2

Kumaraswamy, Ravindranathan Krishnaraj. "Exploiting Heterogeneity in Distributed Software Frameworks." Diss., Virginia Tech, 2016. http://hdl.handle.net/10919/64423.

Full text
Abstract:
The objective of this thesis is to address the challenges faced in sustaining efficient, high-performance and scalable Distributed Software Frameworks (DSFs), such as MapReduce, Hadoop, Dryad, and Pregel, for supporting data-intensive scientific and enterprise applications on emerging heterogeneous compute, storage and network infrastructure. Large DSF deployments in the cloud continue to grow both in size and number, given that DSFs are cost-effective and easy to deploy. DSFs are becoming heterogeneous with the use of advanced hardware technologies and due to regular upgrades to the system. For instance, low-cost, power-efficient clusters that employ traditional servers along with specialized resources such as FPGAs, GPUs, PowerPC-, MIPS- and ARM-based embedded devices, and high-end server-on-chip solutions will drive future DSF infrastructure. Similarly, high-throughput DSF storage is trending towards hybrid and tiered approaches that use large in-memory buffers, SSDs, etc., in addition to disks. However, the schedulers and resource managers of these DSFs assume the underlying hardware to be similar or homogeneous. Another problem faced by evolving applications is that they are typically complex workflows comprising different kernels. The kernels can be diverse, e.g., compute-intensive processing followed by data-intensive visualization, and each kernel has a different affinity towards different hardware. Because DSFs cannot account for the heterogeneity of the underlying hardware architecture and of the applications, existing resource managers cannot ensure an appropriate resource-application match for better performance and resource usage. In this dissertation, we design, implement, and evaluate DerbyhatS, an application-characteristics-aware resource manager for DSFs, which predicts the performance of an application under different hardware configurations and dynamically manages compute and storage resources as per the application's needs. We adopt a quantitative approach in which we first study the detailed behavior of various Hadoop applications running on different hardware configurations and propose application-attuned dynamic system management in order to improve the resource-application match. We re-design the Hadoop Distributed File System (HDFS) into a multi-tiered storage system that seamlessly integrates heterogeneous storage technologies into HDFS. We also propose data placement and retrieval policies to improve the utilization of the storage devices based on their characteristics, such as I/O throughput and capacity. The DerbyhatS workflow scheduler is application-attuned and consists of two components: phi-Sched coupled with epsilon-Sched manages the compute heterogeneity, and DUX coupled with AptStore manages the storage substrate to exploit heterogeneity. DerbyhatS will help realize the full potential of the emerging infrastructure for DSFs, e.g., cloud data centers, by offering many advantages over the state of the art through application-attuned, dynamic, heterogeneous resource management.
Ph. D.
3

Venumuddala, Ramu Reddy. "Distributed Frameworks Towards Building an Open Data Architecture." Thesis, University of North Texas, 2015. https://digital.library.unt.edu/ark:/67531/metadc801911/.

Full text
Abstract:
Data is everywhere. Current technological advancements in digital and social media, and the ease with which different application services interact with a variety of systems, are generating tremendous volumes of data. Because of such varied services, data formats are no longer restricted to structured types such as text; unstructured content such as social media data, videos and images is generated as well. The generated data is of no use unless it is stored and analyzed to derive some value. Traditional database systems come with limitations on data format schemas, access rates, storage sizes, etc. Hadoop is an Apache open source distributed framework that supports storing huge data sets of differently formatted data reliably on its file system, the Hadoop Distributed File System (HDFS), and processing the data stored on HDFS using the MapReduce programming model. This thesis is about building a data architecture using Hadoop and its related open source distributed frameworks to support a data flow pipeline on low-cost commodity hardware. The data flow components are data sourcing, storage management on HDFS, and a data access layer. The study also discusses a use case that exercises the architecture components: Sqoop, a framework to ingest structured data from a database into Hadoop, and Flume, used to ingest semi-structured Twitter streaming JSON data onto HDFS for analysis. The data sourced using Sqoop and Flume is analyzed using Hive for SQL-like analytics, and at a higher level of the data access layer, Hadoop is compared with an in-memory computing system, Spark. Significant differences in query execution performance are observed between the Hadoop and Spark frameworks. This integration helps ingest huge volumes of streaming JSON data to derive better value-based analytics using Hive and Spark.
4

Peddi, Sri Vijay Bharat. "Cloud Computing Frameworks for Food Recognition from Images." Thesis, Université d'Ottawa / University of Ottawa, 2015. http://hdl.handle.net/10393/32450.

Full text
Abstract:
Distributed cloud computing, when integrated with smartphone capabilities, contributes to building efficient multimedia e-health applications for mobile devices. Unfortunately, mobile devices alone do not possess the ability to run complex machine learning algorithms, which require large amounts of graphics processing and computational power. Therefore, offloading the computationally intensive part to the cloud reduces the overhead on the mobile device. In this thesis, we introduce two such distributed cloud computing models, which implement machine learning algorithms in the cloud in parallel, thereby achieving higher accuracy. The first model is based on MapReduce SVM: through the use of Hadoop, the system distributes the data and processes it across resizable Amazon EC2 instances. Hadoop uses a distributed processing architecture called MapReduce, in which a task is mapped to a set of servers for processing and the results are then reduced back to a single set. In the second method, we implement cloud virtualization, wherein we are able to run our mobile application in the cloud using an Android x86 image. We describe a cloud-based virtualization mechanism for multimedia-assisted mobile food recognition, which allows users to control their virtual smartphone operations through a dedicated client application installed on their smartphone. The application continues to be processed on the virtual mobile image even if the user is disconnected for some reason. Using these two distributed cloud computing models, we were able to achieve higher accuracy and reduced running times for the overall execution of the machine learning algorithms and calorie measurement methodologies, compared with implementing them on the mobile device.
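The map/shuffle/reduce flow mentioned here ("a task is mapped to a set of servers ... reduced back to a single set") can be illustrated with a minimal in-process simulation. It is a toy word-count driver, not Hadoop and not the food-recognition SVM pipeline; all function names are illustrative.

```python
from collections import defaultdict
from itertools import chain
from multiprocessing import Pool

def map_word_count(chunk):
    """Map phase: emit (word, 1) pairs for one chunk of text."""
    return [(word, 1) for word in chunk.split()]

def reduce_word_count(key, values):
    """Reduce phase: combine all values for a key into one result."""
    return key, sum(values)

def mapreduce(chunks, map_fn, reduce_fn, workers=4):
    """Toy MapReduce driver: map chunks in parallel, shuffle, then reduce."""
    with Pool(workers) as pool:
        mapped = pool.map(map_fn, chunks)           # "mapped to a set of workers"
    shuffled = defaultdict(list)                     # group values by key
    for key, value in chain.from_iterable(mapped):
        shuffled[key].append(value)
    return dict(reduce_fn(k, v) for k, v in shuffled.items())  # "reduced to a single set"

if __name__ == "__main__":
    chunks = ["hadoop splits the input", "the input is split into chunks"]
    print(mapreduce(chunks, map_word_count, reduce_word_count))
```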
5

Elteir, Marwa Khamis. "A MapReduce Framework for Heterogeneous Computing Architectures." Diss., Virginia Tech, 2012. http://hdl.handle.net/10919/28786.

Full text
Abstract:
Nowadays, an increasing number of computational systems are equipped with heterogeneous compute resources, i.e., resources following different architectures. This applies at the level of a single chip, a single node, and even supercomputers and large-scale clusters. With their impressive price-to-performance ratio as well as power efficiency compared to traditional multicore processors, graphics processing units (GPUs) have become an integral part of these systems. GPUs deliver high peak performance; however, efficiently exploiting their computational power requires the exploration of a multi-dimensional space of optimization methodologies, which is challenging even for the well-trained expert. The complexity of this multi-dimensional space arises not only from the traditionally well-known but arduous task of architecture-aware GPU optimization at design and compile time, but also from the partitioning and scheduling of the computation across these heterogeneous resources. Even with programming models like the Compute Unified Device Architecture (CUDA) and the Open Computing Language (OpenCL), the developer still needs to manage the data transfer between host and device and vice versa, orchestrate the execution of several kernels, and, more arduously, optimize the kernel code. In this dissertation, we aim to deliver a transparent parallel programming environment for heterogeneous resources by leveraging the power of the MapReduce programming model and the OpenCL programming language. We propose a portable architecture-aware framework that efficiently runs an application across heterogeneous resources, specifically AMD GPUs and NVIDIA GPUs, while hiding complex architectural details from the developer. To further enhance performance portability, we explore approaches for asynchronously and efficiently distributing the computations across heterogeneous resources. When applied to benchmarks and representative applications, our proposed framework significantly enhances performance, including up to a 58% improvement over traditional approaches to task assignment and up to a 45-fold improvement over state-of-the-art MapReduce implementations.
Ph. D.
6

Alkan, Sertan. "A Distributed Graph Mining Framework Based On Mapreduce." Master's thesis, METU, 2010. http://etd.lib.metu.edu.tr/upload/12611588/index.pdf.

Full text
Abstract:
The frequent patterns hidden in a graph can reveal crucial information about the network the graph represents. Existing techniques for mining frequent subgraphs in a graph database generally rely on the premise that the data can fit into the main memory of the device on which the computation takes place. Even though some algorithms are designed with highly optimized methods, many lack a solution to the problem of scalability. In this thesis work, our aim is to find and enumerate the subgraphs that are at least as frequent as a designated threshold in a given graph. We propose a new distributed algorithm for the frequent subgraph mining problem that scales horizontally as the computing cluster size increases. The method described here uses a partitioning method and the Map/Reduce programming model to distribute the computation of frequent subgraphs. At the core of this algorithm, we make use of an existing graph partitioning method to split the given data in the distributed file system and to merge and join the computed subgraphs without losing information. The frequent subgraph computation in each split is done using another known method that can enumerate frequent patterns. Although current algorithms can efficiently find frequent patterns, they are not parallel or distributed algorithms: even when they partition the data, they are designed to work on a single machine. Furthermore, these algorithms are computationally expensive, not fault tolerant, and not designed to work on a distributed file system. Using the Map/Reduce paradigm, we distribute the computation of frequent patterns to every machine in a cluster. Our algorithm first bi-partitions the data via successive Map/Reduce jobs, then invokes another Map/Reduce job to compute the subgraphs in the partitions using CloseGraph, and finally recovers the whole set by invoking a series of Map/Reduce jobs to merge-join the previously found patterns. The implementation uses an open source Map/Reduce environment, Hadoop. In our experiments, our method scales up to large graphs, and as the graph data gets bigger, it performs better than existing algorithms.
7

Wang, Yongzhi. "Constructing Secure MapReduce Framework in Cloud-based Environment." FIU Digital Commons, 2015. http://digitalcommons.fiu.edu/etd/2238.

Full text
Abstract:
MapReduce, a parallel computing paradigm, has been gaining popularity in recent years as cloud vendors offer MapReduce computation services on their public clouds. However, companies are still reluctant to move their computations to the public cloud for the following reason: in the current business model, the entire MapReduce cluster is deployed on the public cloud, and if the public cloud is not properly protected, the integrity and the confidentiality of MapReduce applications can be compromised by attacks inside or outside of the public cloud. From the result integrity perspective, if any computation nodes on the public cloud are compromised, those nodes can return incorrect task results and therefore render the final job result inaccurate. From the algorithmic confidentiality perspective, as more and more companies devise innovative algorithms and deploy them to the public cloud, malicious attackers can reverse engineer those programs to uncover the algorithmic details and, therefore, compromise the intellectual property of those companies. In this dissertation, we propose to use a hybrid cloud architecture to defeat the above two threats. Based on the hybrid cloud architecture, we propose separate solutions to address the result integrity and the algorithmic confidentiality problems. To address the result integrity problem, we propose the Integrity Assurance MapReduce (IAMR) framework. IAMR performs result checking to guarantee high result accuracy of MapReduce jobs, even if the computation is executed on an untrusted public cloud. We implemented a prototype system for a real hybrid cloud environment and performed a series of experiments. Our theoretical simulations and experimental results show that IAMR can guarantee a very low job error rate while maintaining a moderate performance overhead. To address the algorithmic confidentiality problem, we focus on the program control flow and propose the Confidentiality Assurance MapReduce (CAMR) framework. CAMR performs Runtime Control Flow Obfuscation (RCFO) to protect the predicates of MapReduce jobs. We implemented a prototype system for a real hybrid cloud environment. The security analysis and experimental results show that CAMR defeats static analysis-based reverse engineering attacks, raises the bar for dynamic analysis-based reverse engineering attacks, and incurs a modest performance overhead.
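Result checking of the kind IAMR relies on is often built from task replication and majority agreement. The sketch below shows that general pattern only, with simulated nodes and a toy task; it is not the IAMR implementation, and every name in it is an assumption.

```python
import hashlib
import random

def run_task(task_input, node_is_malicious=False):
    """Stand-in for executing one map/reduce task; a malicious node may lie."""
    correct = sum(task_input)                 # the "real" computation
    return correct + 1 if node_is_malicious else correct

def result_digest(result):
    """Hash a task result so replicas can be compared cheaply."""
    return hashlib.sha256(repr(result).encode()).hexdigest()

def check_task(task_input, replicas=3, malicious_rate=0.2):
    """Replicate a task on several simulated public-cloud nodes and accept
    the result only if a majority of the replicas agree."""
    digests = {}
    for _ in range(replicas):
        result = run_task(task_input, node_is_malicious=random.random() < malicious_rate)
        digest = result_digest(result)
        digests.setdefault(digest, [0, result])
        digests[digest][0] += 1
    count, result = max(digests.values(), key=lambda entry: entry[0])
    accepted = count > replicas // 2
    return accepted, result

if __name__ == "__main__":
    random.seed(0)
    print(check_task([1, 2, 3, 4]))   # (True, 10) when the majority is honest
```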
8

Zhang, Yue. "A Workload Balanced MapReduce Framework on GPU Platforms." Wright State University / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=wright1450180042.

Full text
9

Raja, Anitha. "A Coordination Framework for Deploying Hadoop MapReduce Jobs on Hadoop Cluster." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-196951.

Full text
Abstract:
Apache Hadoop is an open source framework that delivers reliable, scalable, and distributed computing. Hadoop services are provided for distributed data storage, data processing, data access, and security. MapReduce is the heart of the Hadoop framework and was designed to process vast amounts of data distributed over a large number of nodes. MapReduce has been used extensively to process structured and unstructured data in diverse fields such as e-commerce, web search, social networks, and scientific computation. Understanding the characteristics of Hadoop MapReduce workloads is the key to achieving improved configurations and refining system throughput. Thus far, MapReduce workload characterization in a large-scale production environment has not been well studied. In this thesis project, the focus is mainly on composing a Hadoop cluster (as an execution environment for data processing) to analyze two types of Hadoop MapReduce (MR) jobs via a proposed coordination framework. This coordination framework is referred to as a workload translator. The outcome of this work includes: (1) a parametric workload model for the target MR jobs, (2) a cluster specification to develop an improved cluster deployment strategy using the model and coordination framework, and (3) better scheduling and hence better performance of jobs (i.e., shorter job completion time). We implemented a prototype of our solution using Apache Tomcat on (OpenStack) Ubuntu Trusty Tahr, which uses RESTful APIs to (1) create a Hadoop cluster (version 2.7.2) and (2) scale up and scale down the number of workers in the cluster. The experimental results showed that with well-tuned parameters, MR jobs can achieve a reduction in job completion time and improved utilization of the hardware resources. The target audience for this thesis is developers. As future work, we suggest adding additional parameters to develop a more refined workload model for MR and similar jobs.
10

Lakkimsetti, Praveen Kumar. "A framework for automatic optimization of MapReduce programs based on job parameter configurations." Kansas State University, 2011. http://hdl.handle.net/2097/12011.

Full text
Abstract:
Master of Science
Department of Computing and Information Sciences
Mitchell L. Neilsen
Recently, cost-effective and timely processing of large datasets has been playing an important role in the success of many enterprises and the scientific computing community. Two promising trends ensure that applications will be able to deal with ever-increasing data volumes: first, the emergence of cloud computing, which provides transparent access to a large number of processing, storage and networking resources; and second, the development of the MapReduce programming model, which provides a high-level abstraction for data-intensive computing. MapReduce has been widely used for large-scale data analysis in the Cloud [5]. The system is well recognized for its elastic scalability and fine-grained fault tolerance. However, even to run a single program in a MapReduce framework, a number of tuning parameters have to be set by users or system administrators to increase the efficiency of the program. Users often run into performance problems because they are unaware of how to set these parameters, or because they don't even know that these parameters exist. With MapReduce being a relatively new technology, it is not easy to find qualified administrators [4]. The major objective of this project is to provide a framework that optimizes MapReduce programs that run on large datasets. This is done by executing the MapReduce program on a part of the dataset using stored parameter combinations, then configuring the program with the most efficient combination so that the tuned program can be executed over different datasets. Many MapReduce programs are used over and over again in applications such as daily weather analysis, log analysis, and daily report generation, so once the parameter combination is set, it can be used efficiently on a number of datasets. This feature can go a long way towards improving the productivity of users who lack the skills to optimize programs themselves due to a lack of familiarity with MapReduce or with the data being processed.
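The tuning idea, running the program on a sample of the dataset under stored parameter combinations and keeping the best one, can be sketched as a simple grid search. The knob names mirror common Hadoop settings, but the job stub, grid values and timing loop are illustrative assumptions, not the framework built in this project.

```python
import itertools
import time

# A few common Hadoop-style knobs; the values are illustrative, not recommendations.
PARAMETER_GRID = {
    "mapreduce.job.reduces": [4, 8, 16],
    "mapreduce.task.io.sort.mb": [100, 200],
    "mapreduce.map.output.compress": [False, True],
}

def run_job_on_sample(sample, config):
    """Stand-in for launching the MapReduce program on a small input sample
    with the given configuration and returning its running time."""
    start = time.perf_counter()
    _ = sorted(sample)                                     # placeholder for the real job
    time.sleep(0.001 * config["mapreduce.job.reduces"])    # fake effect of the config
    return time.perf_counter() - start

def pick_best_configuration(sample):
    """Try every stored parameter combination on the sample and keep the fastest."""
    names = list(PARAMETER_GRID)
    best_config, best_time = None, float("inf")
    for values in itertools.product(*(PARAMETER_GRID[n] for n in names)):
        config = dict(zip(names, values))
        elapsed = run_job_on_sample(sample, config)
        if elapsed < best_time:
            best_config, best_time = config, elapsed
    return best_config

if __name__ == "__main__":
    print(pick_best_configuration(list(range(10_000))))
```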
11

Li, Min. "A resource management framework for cloud computing." Diss., Virginia Tech, 2014. http://hdl.handle.net/10919/47804.

Full text
Abstract:
The cloud computing paradigm is realized through large-scale distributed resource management and computation platforms such as MapReduce, Hadoop, Dryad, and Pregel. These platforms enable quick and efficient development of a large range of applications that can be sustained at scale in a fault-tolerant fashion. Two key technologies, namely resource virtualization and feature-rich enterprise storage, are further driving the wide-spread adoption of virtualized cloud environments. Many challenges arise when designing resource management techniques for both native and virtualized data centers. First, parameter tuning of MapReduce jobs for efficient resource utilization is a daunting and time consuming task. Second, while the MapReduce model is designed for and leverages information from native clusters to operate efficiently, the emergence of virtual cluster topologies results in overlaying or hiding the actual network information. This leads to two resource selection and placement anomalies: (i) loss of data locality, and (ii) loss of job locality. Consequently, jobs may be placed physically far from their associated data or related jobs, which adversely affects overall performance. Finally, the extant resource provisioning approach leads to significant wastage as enterprise cloud providers have to consider and provision for peak loads instead of the average load (which is many times lower). In this dissertation, we design and develop a resource management framework to address the above challenges. We first design an innovative resource scheduler, CAM, aimed at MapReduce applications running in virtualized cloud environments. CAM reconciles both data and VM resource allocation with a variety of competing constraints, such as storage utilization, changing CPU load and network link capacities, based on a flow-network algorithm. Additionally, our platform exposes the typically hidden lower-level topology information to the MapReduce job scheduler, which enables it to make optimal task assignments. Second, we design an online performance tuning system, mrOnline, which monitors MapReduce job execution, tunes the parameters based on collected statistics and provides fine-grained control over parameter configuration changes to the user. To this end, we employ a gray-box based smart hill-climbing algorithm that leverages MapReduce runtime statistics and effectively converges to a desirable configuration within a single iteration. Finally, we target enterprise applications in virtualized environments where typically a network-attached centralized storage system is deployed. We design a new protocol to share primary data de-duplication information available at the storage server with the client. This enables better client-side cache utilization and reduces server-client network traffic, which leads to higher overall performance. Based on the protocol, a workload-aware VM management strategy is further introduced to decrease the load on the storage server and enhance I/O efficiency for clients.
Ph. D.
12

Rahman, Md Wasi-ur. "Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1480475635778714.

Full text
13

Donepudi, Harinivesh. "An Apache Hadoop Framework for Large-Scale Peptide Identification." TopSCHOLAR®, 2015. http://digitalcommons.wku.edu/theses/1527.

Full text
Abstract:
Peptide identification is an essential step in protein identification, and the Peptide Spectrum Match (PSM) data sets involved are huge, which makes processing on a single machine very time consuming. In a typical run of a peptide identification method, PSMs are scored by a cross correlation, a statistical score, or a likelihood that the match between the experimental and hypothetical spectra is correct and unique. This process takes a long time to execute, and there is a demand for better performance to handle large peptide data sets. Distributed frameworks are needed to reduce the processing time, but this comes at the price of complexity in developing and executing them, since in distributed computing the program may be divided into multiple parts for execution. The work in this thesis describes an implementation of the Apache Hadoop framework for large-scale peptide identification using C-Ranker. The Apache Hadoop data processing software is immersed in a complex environment composed of massive machine clusters, large data sets, and several processing jobs. The framework uses the Hadoop Distributed File System (HDFS) and Hadoop MapReduce to store and process the peptide data, respectively. The proposed framework uses a peptide processing algorithm named C-Ranker, which takes peptide data as input and identifies the correct PSMs. The framework has two steps: execute the C-Ranker algorithm on a Hadoop cluster, and compare the correct PSM data generated via the Hadoop approach with that of the normal execution of C-Ranker. The goal of this framework is to process large peptide data sets using the Apache Hadoop distributed approach.
14

Huang, Xin. "Querying big RDF data : semantic heterogeneity and rule-based inconsistency." Thesis, Sorbonne Paris Cité, 2016. http://www.theses.fr/2016USPCB124/document.

Full text
Abstract:
The Semantic Web is the vision of the next generation of the Web proposed by Tim Berners-Lee in 2001. With the rapid development of Semantic Web technologies, large-scale RDF data already exist as linked open data, and their volume is growing rapidly. Traditional Semantic Web querying and reasoning tools are designed to run in a stand-alone environment, so processing large-scale data with such traditional solutions inevitably results in bottlenecks in memory space and computational performance. Moreover, large volumes of heterogeneous data are collected from different data sources by different organizations, and these sources often exhibit inconsistencies and uncertainties that are difficult to identify and evaluate. To address these challenges of the Semantic Web, the following contributions and approaches are proposed. First, we developed an inference-based semantic entity resolution approach and a linking mechanism for the case where the same entity is provided in multiple RDF resources described with different semantics and URI identifiers. We also developed a MapReduce-based rewriting engine for SPARQL queries over big RDF data, which handles implicit data described intentionally by inference rules during query evaluation. The rewriting approach also deals with transitive closure and cyclic rules in order to support rich rule languages such as RDFS and OWL, and several optimizations are proposed to improve the efficiency of the algorithms by reducing the number of MapReduce jobs. The second contribution concerns distributed inconsistency processing. We extend the approach presented in the first contribution by taking inconsistency in the data into account. This includes: (1) rule-based inconsistency detection with the help of our query rewriting engine; and (2) consistent query evaluation under three different semantics defined for this purpose. The third contribution concerns reasoning and querying over large-scale uncertain RDF data. We propose a MapReduce-based approach to perform large-scale reasoning in the presence of uncertainty, along with an algorithm that evaluates queries over large probabilistic RDF data by generating an intensional SPARQL query plan for computing and estimating the probabilities of the results.
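One round of rule-based inference of the sort the rewriting engine handles can be imitated in memory: map each rdfs:subClassOf triple under the node it can join on, join inside each reduce group, and iterate to a fixpoint for the transitive closure. This is a toy single-machine sketch under those assumptions, not the thesis's SPARQL rewriting engine.

```python
from collections import defaultdict

SUBCLASS = "rdfs:subClassOf"

def map_phase(triples):
    """Key each subClassOf triple by the node on which the join happens:
    (s, subClassOf, o) is emitted under key o (as 'left') and key s (as 'right')."""
    for s, p, o in triples:
        if p == SUBCLASS:
            yield o, ("left", s)    # ... s subClassOf KEY
            yield s, ("right", o)   # KEY subClassOf o ...

def reduce_phase(key, tagged_values):
    """Join left and right sides on the shared node to apply the rule
    (A subClassOf B) and (B subClassOf C)  =>  (A subClassOf C)."""
    lefts = [v for tag, v in tagged_values if tag == "left"]
    rights = [v for tag, v in tagged_values if tag == "right"]
    for a in lefts:
        for c in rights:
            if a != c:
                yield (a, SUBCLASS, c)

def closure(triples):
    """Iterate map/reduce rounds until no new triples are inferred (fixpoint)."""
    known = set(triples)
    while True:
        groups = defaultdict(list)
        for key, value in map_phase(known):
            groups[key].append(value)
        new = {t for k, vs in groups.items() for t in reduce_phase(k, vs)} - known
        if not new:
            return known
        known |= new

if __name__ == "__main__":
    data = [("Student", SUBCLASS, "Person"), ("Person", SUBCLASS, "Agent")]
    for triple in sorted(closure(data)):
        print(triple)   # includes the inferred ("Student", SUBCLASS, "Agent")
```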
15

RANJAN, RAVI. "PERFORMANCE ANALYSIS OF APRIORI AND FP GROWTH ON DIFFERENT MAPREDUCE FRAMEWORKS." Thesis, 2017. http://dspace.dtu.ac.in:8080/jspui/handle/repository/15814.

Full text
Abstract:
Association rule mining remains a very popular and effective method to extract meaningful information from large datasets. It tries to find possible associations between items in large transaction-based datasets. In order to create these associations, frequent patterns have to be generated. Apriori and FP-Growth are the two most popular algorithms for frequent itemset mining. To enhance the efficiency and scalability of Apriori and FP-Growth, a number of algorithms have been proposed addressing the design of efficient data structures, minimizing database scans, and parallel and distributed processing. MapReduce is the emerging parallel and distributed technology for processing big datasets on a Hadoop cluster. To mine big datasets it is essential to re-design data mining algorithms for this new paradigm. However, the existing parallel versions of the Apriori and FP-Growth algorithms implemented with the disk-based MapReduce model are not efficient enough for iterative computation. Hence a number of MapReduce-based platforms have been developed for parallel computing in recent years. Among them, two platforms, namely Spark and Flink, have attracted a lot of attention because of their built-in support for distributed computation. However, not much work has been done to test the capabilities of these two platforms in the field of parallel and distributed mining. This work therefore helps us better understand how the two algorithms perform on three different platforms. We conducted an in-depth experiment to gain insight into the effectiveness, efficiency and scalability of the Apriori and parallel FP-Growth algorithms on Hadoop, Spark and Flink.
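A single Apriori pass is the part that the parallel versions distribute over Map/Reduce: each mapper counts the candidate itemsets contained in its transactions and the reducers sum the partial counts. The sketch below runs that pass on one machine; the driver, toy data and names are illustrative, not the thesis's Hadoop/Spark/Flink code.

```python
from collections import Counter
from itertools import combinations

def map_count_candidates(transaction, candidates):
    """Map phase of one Apriori pass: emit (candidate, 1) for every candidate
    itemset contained in this transaction."""
    items = set(transaction)
    return [(c, 1) for c in candidates if c <= items]

def reduce_sum(counts_per_candidate):
    """Reduce phase: sum the partial counts of each candidate itemset."""
    total = Counter()
    for pairs in counts_per_candidate:
        for candidate, n in pairs:
            total[candidate] += n
    return total

def apriori_pass(transactions, frequent_prev, k, min_support):
    """Generate size-k candidates from the (k-1)-frequent itemsets, count them
    with a map/reduce-style pass, and keep those meeting min_support."""
    items = sorted({i for s in frequent_prev for i in s})
    candidates = [frozenset(c) for c in combinations(items, k)
                  if all(frozenset(sub) in frequent_prev
                         for sub in combinations(c, k - 1))]
    mapped = [map_count_candidates(t, candidates) for t in transactions]
    counts = reduce_sum(mapped)
    return {c for c, n in counts.items() if n >= min_support}

if __name__ == "__main__":
    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
    f1 = {frozenset({"a"}), frozenset({"b"}), frozenset({"c"})}
    print(apriori_pass(db, f1, k=2, min_support=2))
    # all three pairs {a,b}, {a,c}, {b,c} reach the support threshold of 2
```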
16

Huang, Ruei-Jyun, and 黃瑞竣. "A MapReduce Framework for Heterogeneous Mobile Devices." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/91818518409630056856.

Full text
Abstract:
Master's thesis
National Taiwan University of Science and Technology
Department of Electronic Engineering
Academic year 101
With the advance of science and technology, new models of mobile devices are introduced continually, and users are willing to buy them to experience improved hardware and software performance. Over the years, users therefore accumulate mobile devices with different computing capabilities. In this thesis, we use heterogeneous mobile devices and a wireless router to build a MapReduce framework. Through the MapReduce framework, we can not only control each mobile device but also execute different applications on a single mobile device or on multiple mobile devices. The MapReduce framework combines multi-threaded parallel computing with a load balancing method to improve performance compared to any single mobile device. In the experiments, we run two applications, word counting and prime counting, on four different types of mobile devices. We also run the two applications on a PC as a baseline comparison. The experimental results demonstrate the feasibility and efficiency of the MapReduce framework for heterogeneous mobile devices.
17

"Thermal Aware Scheduling in Hadoop MapReduce Framework." Master's thesis, 2013. http://hdl.handle.net/2286/R.I.20932.

Full text
Abstract:
The energy consumption of data centers is increasing steadily along with the associated power density. Approximately half of this energy consumption is attributed to cooling, so reducing cooling energy along with server energy consumption is becoming imperative in order to achieve greener data centers. This thesis deals with cooling energy management in data centers running data-processing frameworks. In particular, we propose thermal aware scheduling for the MapReduce framework and its Hadoop implementation to reduce cooling energy in data centers. Data-processing frameworks run many low-priority batch processing jobs, such as background log analysis, that do not have strict completion time requirements; they can be delayed by a bounded amount of time. Cooling energy savings are possible by temporally spreading the workload and assigning it to the computing equipment that reduces heat recirculation in the data center room and therefore the load on the cooling systems. We implement our scheme in Hadoop and perform experiments using both CPU-intensive and I/O-intensive workload benchmarks in order to evaluate its efficiency. The evaluation results highlight that our thermal aware scheduling reduces hot spots and makes a uniform temperature distribution within the data center possible. Summarizing the contribution, we incorporated thermal awareness into the Hadoop MapReduce framework by enhancing the native scheduler to make it thermally aware, compared the Thermal Aware Scheduler (TAS) with the default Hadoop scheduler (FCFS) by running the PageRank and TeraSort benchmarks in the BlueTool data center of the Impact lab, and showed that TAS yields a reduction in peak temperature and a decrease in cooling power compared with the FCFS scheduler.
Dissertation/Thesis
M.S. Computer Science 2013
18

Li, Jia-Hong, and 李家宏. "Using MapReduce Framework for Mining Association Rules." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/gbw4n8.

Full text
Abstract:
Master's thesis
National Taichung University of Science and Technology
Master's Program, Department of Computer Science and Information Engineering
Academic year 101
With the rapid development of computer hardware and network technologies, demand for related applications has grown, and cloud computing has become a very popular research area. Association rule mining plays an important role in data mining on cloud computing platforms: association rules are useful for discovering relationships among different products and thus provide beneficial input to marketing decisions. In association rule mining, the computation load of discovering all frequent itemsets from a transaction database is considerably high, and researchers have shown that mining big data on a single machine can be infeasible and ineffective. The PIETM (Principle of Inclusion-Exclusion and Transaction Mapping) approach benefits from ideas in two famous algorithms, Apriori and FP-Growth: Apriori joins and prunes candidate itemsets, while FP-Growth scans the database only twice. PIETM mines frequent itemsets recursively using the principle of inclusion and exclusion. To process big data, this thesis presents a parallel PIETM algorithm based on the MapReduce framework, suitable for large transaction databases. The experimental results show that, after tuning the MapReduce framework parameters, the proposed PIETM algorithm is efficient at processing big data.
19

Kao, Yu-Chon, and 高玉璁. "Data-Locality-Aware MapReduce Real-Time Scheduling Framework." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/95200425495617797846.

Full text
Abstract:
Master's thesis
National Taiwan University of Science and Technology
Department of Electrical Engineering
Academic year 103
MapReduce is widely used in cloud applications for large-scale data processing. The increasing number of interactive cloud applications has led to an increasing need for MapReduce real-time scheduling. Most MapReduce applications are data-oriented and nonpreemptively executed. Therefore, the problem of MapReduce real-time scheduling is complicated because of the trade-off between run-time blocking for nonpreemptive execution and data locality. This paper proposes a data-locality-aware MapReduce real-time scheduling framework for guaranteeing quality of service for interactive MapReduce applications. A scheduler and dispatcher that can be used for scheduling two-phase MapReduce jobs and for assigning jobs to computing resources are presented, and the dispatcher enables the consideration of blocking and data locality. Furthermore, dynamic power management for run-time energy saving is discussed. Finally, the proposed methodology is evaluated on synthetic workloads, and a comparative study of different scheduling algorithms is conducted.
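The trade-off described here, waiting (blocking) for a node that holds the task's data versus running remotely to meet a deadline, can be sketched with a toy dispatcher. The scoring rule, field names and penalty value are assumptions for illustration, not the scheduler proposed in the thesis.

```python
def pick_node(task, nodes, now):
    """Choose a node for a nonpreemptive MapReduce task: prefer nodes that hold
    the task's input split, but fall back to a remote node when waiting for a
    local one would miss the task's deadline."""
    def finish_time(node):
        start = max(now, node["busy_until"])          # blocking behind the running task
        penalty = 0 if node["name"] in task["data_nodes"] else task["remote_penalty"]
        return start + task["exec_time"] + penalty

    local = [n for n in nodes if n["name"] in task["data_nodes"]]
    best = min(local or nodes, key=finish_time)
    if finish_time(best) > task["deadline"]:          # local choice misses the deadline
        best = min(nodes, key=finish_time)            # consider every node instead
    return best["name"], finish_time(best)

if __name__ == "__main__":
    nodes = [{"name": "n1", "busy_until": 50}, {"name": "n2", "busy_until": 0}]
    task = {"data_nodes": {"n1"}, "exec_time": 20, "remote_penalty": 8, "deadline": 40}
    print(pick_node(task, nodes, now=0))   # ('n2', 28): remote, but meets the deadline
```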
20

Zeng, Wei-Chen, and 曾偉誠. "Efficient XML Data Processing based on MapReduce Framework." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/j8b55u.

Full text
Abstract:
Master's thesis
National Taichung University of Science and Technology
Master's Program, Department of Computer Science and Information Engineering
Academic year 103
As hardware and network technologies advance, the amount of data generated by all kinds of applications is increasing rapidly, and cloud computing has become one of the most important research topics for processing such big data. Cloud computing offers a new service architecture that makes more effective use of computing resources and storage space and also provides development environments and cloud services. Big data processing is currently performed mostly in the MapReduce environment, where data must be expressed in the form required by the MapReduce framework (i.e., as (key, value) pairs) so that it can be split and assigned to different machines for parallel processing. XML (Extensible Markup Language), proposed by the W3C, is a markup language for transmitting and processing all kinds of complex documents; it supports querying, electronic data interchange and other applications, and has become a common standard format for data exchange and storage. Although XML processing on a single computer is a mature technology, a single host cannot afford the huge amount of computation needed when the XML document is extremely long or large: path exploration may fail or become too slow. This study therefore applies cloud computing and parallel processing techniques to huge XML data. To analyze an XML document with MapReduce, the document must be cut into several parts and distributed to the compute nodes before it can be processed in parallel, but this cutting destroys the nested structure of XML tags and makes it difficult to recover the nesting relations, which is the main processing difficulty. This research designs a MapReduce mechanism that works in a single round of MapReduce: we design our own XML input format class, named XMLInputFormat.class, and use the correspondence between the XML stored on HDFS and the original document to process big XML data on the cloud platform, extract every XML path, and store the paths in an HBase cloud database for subsequent operations such as data mining. We build a Hadoop cluster of 16 cloud servers to test the effectiveness of the algorithm. The tests are divided into two parts, XML characteristics and Hadoop parameter tuning, and the experiments show that the proposed approach for the Hadoop MapReduce distributed parallel processing framework is effective for processing large XML documents. The largest test cases involve a 16 GB XML file (67.4%) and 13,600,000 XML paths (89.7%).
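The core trick the abstract describes, cutting the document only at record boundaries so nested tags are never split, and then extracting element paths, can be imitated in plain Python. The record tag name and the regex-based tag walker are assumptions; a real Hadoop implementation would subclass an input format and write the paths to HBase as the thesis does.

```python
import re
from collections import defaultdict

# Assumed record element; real data would use whatever tag delimits records.
RECORD_RE = re.compile(r"<record>.*?</record>", re.DOTALL)

def split_records(xml_text):
    """Imitates a custom XML input format: cut the document into complete
    <record>...</record> fragments so nested tags are never split apart."""
    return RECORD_RE.findall(xml_text)

def map_extract_paths(fragment):
    """Map phase: emit every root-to-leaf element path found in one record.
    Minimal tag walker for well-formed fragments without attributes/namespaces."""
    path, out = [], []
    for closing, name in re.findall(r"<(/?)(\w+)[^>]*>", fragment):
        if closing:
            path.pop()
        else:
            path.append(name)
            out.append(("/" + "/".join(path), 1))
    return out

def reduce_count(mapped_lists):
    """Reduce phase: count how often each XML path occurs."""
    counts = defaultdict(int)
    for pairs in mapped_lists:
        for xml_path, n in pairs:
            counts[xml_path] += n
    return dict(counts)

if __name__ == "__main__":
    doc = "<db><record><name>a</name></record><record><name>b</name></record></db>"
    fragments = split_records(doc)
    print(reduce_count(map_extract_paths(f) for f in fragments))
    # {'/record': 2, '/record/name': 2}
```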
21

CHEN, YI-TING, and 陳奕廷. "An Improved K-means Algorithm based on MapReduce Framework." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/582un3.

Full text
Abstract:
Master's thesis
Yuan Ze University
Department of Computer Science and Engineering
Academic year 104
As data is collected and accumulated, data mining is used in big data analysis to find relationships within huge amounts of data and to uncover the information behind them. Cluster analysis in data mining simplifies and analyzes data. In this thesis, we study the problems of the K-means algorithm and improve it. The K-means algorithm has several disadvantages: the user needs to determine the value of K, the number of clusters, and the starting points are generated randomly. In addition, the processing speed can be slow, and some problems cannot be handled at all, when dealing with huge amounts of data. In order to solve these problems, we propose a method based on the MapReduce framework to improve the K-means algorithm. We use hierarchical agglomerative clustering (HAC) to generate the starting points, which fixes the problem of randomly generated initial points, and we use the silhouette coefficient to choose the best number of clusters K. The results of this research show that the proposed method selects the best K value automatically, generates the initial points stably, and is able to deal with large amounts of data.
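One K-means iteration maps naturally onto MapReduce: the map step assigns each point to its nearest centroid and the reduce step averages each group into a new centroid. The sketch below shows only that iteration on toy 2-D data; the HAC seeding and silhouette-based choice of K described in the abstract are not reproduced, and the initial centroids are simply passed in.

```python
import math
from collections import defaultdict

def map_assign(points, centroids):
    """Map phase: emit (index of nearest centroid, point) for every point."""
    return [(min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i])), p)
            for p in points]

def reduce_recompute(assignments):
    """Reduce phase: average the points assigned to each centroid."""
    groups = defaultdict(list)
    for idx, point in assignments:
        groups[idx].append(point)
    return {idx: tuple(sum(axis) / len(pts) for axis in zip(*pts))
            for idx, pts in groups.items()}

def kmeans(points, initial_centroids, rounds=10):
    """Run a few map/reduce rounds; the initial centroids would come from the
    hierarchical-clustering step described in the abstract."""
    centroids = list(initial_centroids)
    for _ in range(rounds):
        new = reduce_recompute(map_assign(points, centroids))
        centroids = [new.get(i, c) for i, c in enumerate(centroids)]
    return centroids

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(kmeans(pts, initial_centroids=[(0, 0), (10, 10)]))
    # [(0.0, 0.5), (10.0, 10.5)]
```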
22

Chang, Hung-Yu, and 張弘諭. "Adaptive MapReduce Framework for Multi-Application Processing on GPU." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/40788660892721645478.

Full text
Abstract:
Master's thesis
National Cheng Kung University
Master's and Doctoral Program, Department of Engineering Science
Academic year 101
With improvements in electronics and computer technology, the amount of data to be processed by enterprises keeps growing. Handling such amounts of data is no longer a big challenge with the help of the MapReduce framework: applications from many fields can take advantage of MapReduce on large numbers of CPUs for efficient distributed and parallel computing. Meanwhile, graphics processing unit (GPU) technology is also improving; many-core GPUs provide strong computing power capable of handling heavier workloads and more data processing. Many MapReduce frameworks have therefore been designed and implemented on GPU hardware, following the general-purpose GPU computing concept, to achieve better performance. However, most GPU MapReduce frameworks so far focus on single-application processing; no mechanisms are provided for multi-application execution, so multiple applications can only be processed in sequential order. The GPU hardware resources may thus not be fully utilized and distributed, which decreases computing performance. This study designs and implements a multi-application execution mechanism based on a state-of-the-art GPU MapReduce framework, Mars. It provides a problem partitioning utility that considers the data size and hardware resource requirements of each application, and it feeds appropriate amounts of work into the GPU with overlapped GPU operations for efficient parallel execution. Finally, several common applications are used to verify the applicability of this mechanism, with execution time as the main evaluation metric. An overall 1.3x speedup for random application combinations is achieved with the proposed method.
23

Hua, Guanjie, and 滑冠傑. "Haplotype Block Partitioning and TagSNP Selection with MapReduce Framework." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/13072946241409858351.

Full text
Abstract:
Master's thesis
Providence University
Department of Computer Science and Information Engineering
Academic year 101
SNPs play important roles in various analysis applications, including medical diagnostics and drug design. They provide the highest-resolution genetic fingerprint for identifying disease associations and human features. A haplotype, composed of SNPs, is a region of linked genetic variants that are neighboring and usually inherited together. Recent genetics research shows that the SNPs within certain haplotype blocks induce only a few distinct common haplotypes in the majority of the population. The study of haplotype blocks has serious implications for association-based methods of mapping disease genes. We investigate several efficient combinatorial algorithms for selecting interesting haplotype blocks under different diversity functions, generalizing many previous results in the literature. However, the proposed method is computation-consuming. This thesis therefore adopts the MapReduce paradigm to parallelize the tools and manage their execution. The experiments show that the map/reduce-parallelized version of the original sequential combinatorial algorithm performs well on real-world data obtained from the HapMap data set; the computation efficiency is effectively improved in proportion to the number of processors being used.
24

Chou, Yen-Chen, and 周艷貞. "General Wrapping of Information Hiding Patterns on MapReduce Framework." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/21760802727746979445.

Full text
25

Chang, Zhi-Hong, and 張志宏. "Join Operations for Large-Scale Data on MapReduce Framework." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/t4c5e6.

Full text
Abstract:
Master's thesis
National Taichung University of Science and Technology
Master's Program, Department of Computer Science and Information Engineering
Academic year 100
With the rapid development of hardware and network technology, cloud computing has become an important research topic. It provides a solution for large-scale data processing problems. Data-parallel frameworks provide a platform for dealing with large-scale data, especially for data mining and data warehousing, and MapReduce is one of the most famous of them. It consists of two stages: the Map stage and the Reduce stage. Based on the MapReduce framework, Scatter-Gather-Merge (SGM) is an efficient algorithm supporting star join queries, one of the most important query types in data warehouses. However, SGM only supports equi-join operations. This thesis proposes a framework that supports not only equi-join operations but also non-equi-join operations. Since non-equi-join processing usually causes a large number of I/Os, the proposed method addresses this problem by reducing the cost of load balancing. Our experimental results show that, for equi-join operations, our method has an execution time similar to SGM's. For non-equi-join operations, we also illustrate the performance under different conditions.
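The equi-join case that Scatter-Gather-Merge handles is usually implemented as a reduce-side join: tag each row with its table, shuffle by the join key, and pair rows inside each reduce group. The sketch below shows that baseline pattern on toy data; the non-equi-join and load-balancing techniques of the thesis are not shown, and all names are illustrative.

```python
from collections import defaultdict

def map_tag(rows, table_name, key_column):
    """Map phase: emit each row keyed by its join attribute, tagged with its table."""
    return [(row[key_column], (table_name, row)) for row in rows]

def reduce_join(tagged_rows):
    """Reduce phase: pair every fact row with every dimension row sharing the key."""
    facts = [r for t, r in tagged_rows if t == "fact"]
    dims = [r for t, r in tagged_rows if t == "dim"]
    return [{**f, **d} for f in facts for d in dims]

def equi_join(fact_rows, dim_rows, key_column):
    """Shuffle both tagged inputs by join key, then join inside each reduce group."""
    groups = defaultdict(list)
    tagged = map_tag(fact_rows, "fact", key_column) + map_tag(dim_rows, "dim", key_column)
    for key, value in tagged:
        groups[key].append(value)
    return [row for values in groups.values() for row in reduce_join(values)]

if __name__ == "__main__":
    sales = [{"cust_id": 1, "amount": 30}, {"cust_id": 2, "amount": 15}]
    customers = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bo"}]
    print(equi_join(sales, customers, "cust_id"))  # each sale paired with its customer
```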
26

Huang, Yuan-Shao, and 黃元劭. "An Efficient Frequent Patterns Mining Algorithm Based on MapReduce Framework." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/78k6v3.

Full text
Abstract:
Master's thesis
Chung Hua University
Master's Program, Department of Computer Science and Information Engineering
Academic year 101
In recent years, the amount of data in every enterprise has been increasing continuously, and big data, cloud computing, and data mining have become hot topics. In this thesis, we modify the traditional Apriori algorithm to improve its execution efficiency, since Apriori suffers from computation times that increase dramatically as the data size increases. We design and implement two efficient algorithms: the Frequent Patterns Mining Algorithm Based on the MapReduce Framework (FAMR) and an optimized FAMR (OFAMR). We exploit the Hadoop MapReduce framework to shorten the mining time. Compared with a "one-phase" algorithm, experimental results show that FAMR achieves a 16.2x speedup in running time; because the previous method uses only a single MapReduce operation, it generates excessive candidates and can run out of memory. We also implement the optimized OFAMR algorithm, whose performance is superior to FAMR's because the number of candidates generated by OFAMR is smaller than that generated by FAMR.
27

You, Hsin-Han, and 尤信翰. "A Load-Aware Scheduler for MapReduce Framework in Heterogeneous Environments." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/30478140785981334531.

Full text
Abstract:
Master's thesis
National Chiao Tung University
Institute of Computer Science and Engineering
Academic year 99
MapReduce is becoming a popular programming model for large-scale data processing such as data mining, log processing, web indexing and scientific research. The MapReduce framework is a batch distributed data processing framework that disassembles a job into smaller map tasks and reduce tasks. In the MapReduce framework, the master node distributes tasks to worker nodes to complete the whole job. Hadoop MapReduce is the most popular open-source implementation of the MapReduce framework; it comes with a pluggable task scheduler interface and a default FIFO job scheduler. The performance of MapReduce jobs and overall cluster utilization depend on how the tasks are assigned and processed. In practice, issues such as dynamic loading, heterogeneity of nodes, and multiple-job scheduling need to be taken into account. We find that the current Hadoop scheduler suffers from performance degradation due to the above problems. We propose a new scheduler, named the Load-Aware Scheduler, to address these issues and improve overall performance and utilization. Experimental results show that we can improve utilization by 10 to 20 percent on average by avoiding unnecessary speculative tasks.
APA, Harvard, Vancouver, ISO, and other styles
28

Ho, Hung-Wei, and 何鴻緯. "Modeling and Analysis of Hadoop MapReduce Framework Using Petri Nets." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/37850810462698368045.

Full text
Abstract:
Master's thesis
National Taipei University
Department of Computer Science and Information Engineering
Academic year 103
Technological advances have significantly increased the amount of corporate data available, which has created a wide range of business opportunities related to big data and cloud computing. Hadoop is a popular programming framework used for the setup of cloud computing systems. The MapReduce framework forms the core of Hadoop's parallel computing, and its parallel structure can greatly increase the efficiency of big data analysis. This study used Petri nets to create a visual model of the MapReduce framework and verify its reachability. We present an actual big data analysis system to demonstrate the feasibility of the model, describe the internal procedures of the MapReduce framework in detail, list common errors made during the system development process, and propose error prevention mechanisms using the Petri net model in order to increase efficiency in system development.
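The thesis' Petri net model of MapReduce is not reproduced in the abstract; as a rough illustration of what a reachability check involves, the sketch below encodes a toy place/transition net (the "input split / map output / reduce output" places are invented, not the thesis' model) and explores its reachable markings by breadth-first search.

```java
// Toy place/transition Petri net with breadth-first reachability search.
// The net below (input -> map -> reduce) is an invented miniature, not the
// Petri net model of Hadoop MapReduce built in the thesis.
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class PetriReachabilitySketch {
  // A transition consumes tokens from 'in' places and produces tokens in 'out' places.
  record Transition(String name, int[] in, int[] out) {}

  static boolean enabled(int[] marking, Transition t) {
    for (int i = 0; i < marking.length; i++) if (marking[i] < t.in()[i]) return false;
    return true;
  }

  static int[] fire(int[] marking, Transition t) {
    int[] next = marking.clone();
    for (int i = 0; i < next.length; i++) next[i] += t.out()[i] - t.in()[i];
    return next;
  }

  public static void main(String[] args) {
    // Places: 0 = input splits, 1 = map outputs, 2 = reduce outputs.
    int[] initial = {2, 0, 0};  // two input splits
    List<Transition> transitions = List.of(
        new Transition("map",    new int[]{1, 0, 0}, new int[]{0, 1, 0}),
        new Transition("reduce", new int[]{0, 2, 0}, new int[]{0, 0, 1}));

    Set<String> seen = new HashSet<>();
    Queue<int[]> frontier = new ArrayDeque<>();
    frontier.add(initial);
    seen.add(Arrays.toString(initial));

    while (!frontier.isEmpty()) {            // BFS over the reachability graph
      int[] m = frontier.poll();
      for (Transition t : transitions) {
        if (!enabled(m, t)) continue;
        int[] next = fire(m, t);
        if (seen.add(Arrays.toString(next))) frontier.add(next);
      }
    }
    System.out.println("reachable markings: " + seen);
    System.out.println("job-completed marking reachable: "
        + seen.contains(Arrays.toString(new int[]{0, 0, 1})));
  }
}
```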
APA, Harvard, Vancouver, ISO, and other styles
29

Chang, Jui-Yen, and 張瑞岩. "MapReduce-Based Frequent Pattern Mining Framework with Multiple Item Support." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/bxj8r2.

Full text
Abstract:
Master's thesis
National Taipei University of Technology
Master's Program, Department of Information and Financial Management
Academic year 105
The analysis of big data for frequent patterns has become even more challenging. It has many applications and aims to make people's health and daily life better and easier. Association mining is the process of discovering interesting and useful association rules hidden in the huge and complicated data of different databases. However, using a single minimum item support value for all items is not sufficient, since it cannot reflect the characteristics of each item. When the minimum item support (MIS) value is set too low, rare items can be found, but a large number of meaningless patterns may also be generated. On the other hand, if the minimum support value is set too high, useful rare patterns are lost. Thus, how to set the minimum support threshold for each item so as to find correlated patterns efficiently and accurately is essential. In addition, efficient computing has been an active research issue in data mining in recent years. MapReduce, proposed by Google in 2004, makes it easier to implement parallel algorithms that compute various kinds of derived data and reduce run time. Accordingly, this thesis proposes a model that sets a multiple minimum support value for each item and uses the MapReduce framework to find correlated patterns involving both frequent and rare items accurately and efficiently. It does not require post-pruning and rebuilding phases, since each promising item has support greater than or equal to MIN-MIS, thereby improving the overall performance of mining frequent patterns and rare items accurately and efficiently.
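The thesis' mining pipeline itself is not shown; the reducer sketch below only illustrates the multiple-minimum-support rule the abstract describes: an itemset is kept when its support reaches the smallest MIS among its items (MIN-MIS). The per-item MIS table, the default threshold, and the item names are invented for the example.

```java
// Illustrative reducer applying a multiple minimum item support (MIS) rule:
// an itemset is frequent if its support >= min{ MIS(item) : item in itemset }.
// The MIS table and thresholds are invented; this is not the thesis' implementation.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MisSupportReducerSketch extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final Map<String, Integer> misTable = new HashMap<>();

  @Override
  protected void setup(Context context) {
    // Assumed per-item thresholds; rare items get a lower MIS so they are not lost.
    misTable.put("caviar", 2);
    misTable.put("bread", 50);
    misTable.put("milk", 40);
  }

  @Override
  protected void reduce(Text itemset, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int support = 0;
    for (IntWritable c : counts) support += c.get();

    // Threshold for the whole itemset is the minimum MIS over its items (MIN-MIS).
    int minMis = Integer.MAX_VALUE;
    for (String item : itemset.toString().split(",")) {
      minMis = Math.min(minMis, misTable.getOrDefault(item, 30));  // 30 = assumed default MIS
    }
    if (support >= minMis) {
      context.write(itemset, new IntWritable(support));
    }
  }
}
```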
APA, Harvard, Vancouver, ISO, and other styles
30

YEH, WEN-HSIN, and 葉文昕. "An Algorithm for Efficiently Mining Frequent Itemsets Based on MapReduce Framework." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/z8na4z.

Full text
Abstract:
Master's thesis
Minghsin University of Science and Technology
Master's Program, Department of Electrical Engineering
Academic year 107
With the maturity of cloud technology, big data, data mining, and cloud computing have become hot research topics. Association rule mining is one of the most important techniques in data mining, and Apriori is its most representative algorithm, but the performance of the traditional Apriori algorithm degrades as the amount of data grows and the support threshold becomes smaller. Using the MapReduce distributed architecture of cloud computing can remedy Apriori's shortcomings. Google Cloud Dataproc is a platform that makes Hadoop cluster management easy and is very helpful for big data analysis. Based on Apriori, this thesis proposes an association rule mining algorithm built on the MapReduce architecture. Using Google Cloud Dataproc as the experimental environment, the experimental results show that the proposed method performs better than other methods when the amount of data and computation is larger.
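As a complement to the abstract, here is a minimal Hadoop job driver of the kind such an Apriori pass would be packaged with; the mapper and reducer it references are the placeholder classes from the counting sketch shown under entry 26, not the thesis' algorithm. On Google Cloud Dataproc, the resulting jar would simply be submitted to the cluster with the `gcloud dataproc jobs submit hadoop` command, one job per Apriori pass.

```java
// Minimal Hadoop driver for one counting pass. The mapper and reducer referenced
// here are the placeholder classes from the AprioriPassSketch example above,
// standing in for the thesis' algorithm. No combiner is set because the reducer
// filters by minimum support, which is unsafe to apply to partial counts.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AprioriDriverSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "apriori-pass-k");
    job.setJarByClass(AprioriDriverSketch.class);

    job.setMapperClass(AprioriPassSketch.CandidateCountMapper.class);  // placeholder mapper
    job.setReducerClass(AprioriPassSketch.SupportReducer.class);       // placeholder reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // transaction file(s)
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // frequent itemsets for this pass
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```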
APA, Harvard, Vancouver, ISO, and other styles
31

Wei, Xiu-Hui, and 魏秀蕙. "Performance Comparison of Sequential Pattern Mining Algorithms Based on Mapreduce Framework." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/fz7kg8.

Full text
Abstract:
Master's thesis
National Taichung University of Science and Technology
Master's Program, Department of Computer Science and Information Engineering
Academic year 102
Because of the popularity of cloud technology and the accumulation of large amounts of data, reducing the time needed to process huge amounts of data efficiently has become a very important research direction. Many kinds of data mining techniques are used to analyze such data, including association rule mining algorithms and sequential pattern mining algorithms. In this study, two sequential pattern mining algorithms, the GSP algorithm and the AprioriAll algorithm, are parallelized through the MapReduce framework. We then design experiments to compare the efficiency of the two parallelized algorithms. The results show that the parallelized GSP algorithm performs better than the parallelized AprioriAll algorithm.
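Neither algorithm's parallel implementation is given in the abstract; the helper below only shows the containment test that the map-side support counting of both GSP and AprioriAll ultimately relies on. The data layout is an assumption: a customer sequence is a list of itemsets, and a candidate is contained if its itemsets appear in order as subsets.

```java
// Core containment test used when counting support for sequential patterns:
// a candidate sequence is contained in a customer sequence if its itemsets
// appear, in order, as subsets of the customer's itemsets. This is a generic
// helper, not code from the thesis.
import java.util.List;
import java.util.Set;

public class SequenceContainment {
  /** True if 'candidate' occurs as an ordered sub-sequence of 'sequence'. */
  public static boolean contains(List<Set<String>> sequence, List<Set<String>> candidate) {
    int i = 0;  // index into the candidate
    for (Set<String> element : sequence) {
      if (i < candidate.size() && element.containsAll(candidate.get(i))) {
        i++;  // matched this candidate element, move on to the next one
      }
      if (i == candidate.size()) return true;
    }
    return i == candidate.size();
  }

  public static void main(String[] args) {
    List<Set<String>> customer = List.of(Set.of("a"), Set.of("b", "c"), Set.of("d"));
    List<Set<String>> candidate = List.of(Set.of("a"), Set.of("c"));
    System.out.println(contains(customer, candidate));  // true: <(a)(c)> is contained
  }
}
```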
APA, Harvard, Vancouver, ISO, and other styles
32

Chen, Bo-Ting, and 陳柏廷. "Improving the techniques of Mining Association Rule based on MapReduce framework." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/b6g4eb.

Full text
Abstract:
Master's thesis
National Taichung University of Science and Technology
Master's Program, Department of Computer Science and Information Engineering
Academic year 103
Through data mining we can obtain useful and valuable information from seemingly insignificant data and gain huge benefits from professional analysis. However, improving the performance of data mining is important for Big Data processing. The purpose of this study is to improve the performance of the parallel association-rule mining algorithm PIETM (Principle of Inclusion-Exclusion and Transaction Mapping) under the MapReduce framework. PIETM arranges the transaction data in the database into a tree structure called the transaction tree (T-tree), transforms the T-tree into a transaction interval tree (TI-tree), and then uses the principle of inclusion-exclusion on the TI-tree to calculate all frequent itemsets. PIETM combines the benefits of the Apriori and FP-growth algorithms and only needs to scan the database twice during mining. However, some procedures still need improvement, for example constructing the transaction tree and generating candidate k-itemsets. We provide a solution for each of these two problems, adopting ideas from the FP-growth and Apriori algorithms respectively.
APA, Harvard, Vancouver, ISO, and other styles
33

Chang, Chih-Wei, and 張智崴. "An Adaptive Real-Time MapReduce Framework Based on Locutus and Borg-Tree." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/32093140696316430010.

Full text
Abstract:
Master's thesis
National Taipei University of Education
Master's Program, Department of Computer Science
Academic year 101
Google published the design of MapReduce in 2004. After years of development, Apache finally launched Hadoop version 1.0 in 2011, which means the open-source MapReduce ecosystem is mature enough to support business applications. However, some features are still unsatisfactory for big data processing: the first is support for real-time computing, and the other is cross-platform deployment and ease of use. In this thesis, we analyze the bottlenecks of Hadoop performance and try to resolve them, hoping to develop a simple real-time computing platform. We cite research indicating that the bottlenecks of MapReduce performance are the access speed of HDFS and the performance of ZooKeeper. This means that if we improve the coordination mechanism and replace HDFS with a faster storage mechanism (we use shared memory here), we can improve performance enough to support real-time analysis applications. We propose an algorithm based on Locutus and the Borg-Tree to support coordination for MapReduce. It is structured as a P2P topology designed for fast distributed processing, and it is programmed in Node.js so that it can be easily deployed to many cloud platforms. Finally, we conducted experiments to assess the feasibility of our prototype. Although the prototype did not reach the expected performance, we point out the problems with the shared-memory mechanism and our protocol for subsequent research and development.
APA, Harvard, Vancouver, ISO, and other styles
34

Chung, Wei-Chun, and 鐘緯駿. "Algorithms for Correcting Next-Generation Sequencing Errors Based on MapReduce Big Data Framework." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/k9jnna.

Full text
Abstract:
Doctoral dissertation
National Taiwan University
Graduate Institute of Computer Science and Information Engineering
Academic year 105
The rapid advancement of next-generation sequencing (NGS) technology has generated an explosive growth of ultra-large-scale data and computational problems, particularly in de novo genome assembly. Greater sequencing depths and increasingly longer reads have introduced numerous errors, which increase the probability of misassembly. The huge amounts of data cause severely high disk I/O overhead and lead to an unexpectedly long execution time. To speed up the time-consuming assembly processes without affecting their quality and to address problems pertaining to error correction, we focus on improving algorithm design, architecture design, and implementation of NGS de novo genome assembly based on cloud computing. Errors in sequencing data result in fragmented contigs, which lead to an assembly of poor quality. We therefore propose an error correction algorithm based on cloud computing. The algorithm emulates the design of the error correction algorithm of ALLPATHS-LG, and is designed to correct errors conservatively to avoid false decisions. To speed up execution time by reducing the massive disk I/O overhead, we introduce a message control strategy, the read-message (RM) diagram, to represent the structure of the intermediate data generated along with each read. Then, we develop various schemes to trim off portions of the RM diagram to shrink the size of the intermediate data and thereby reduce the number of disk I/O operations. We have implemented the proposed algorithms on the MapReduce cloud computing framework and evaluated them using state-of-the-art tools. The RM method reduces the intermediate data size and speeds up execution. Our proposed algorithms have dramatically improved not only the execution time of the pipeline but also the quality of the assembly. This dissertation presents algorithms and architectural designs that speed up execution time and improve the quality of de novo genome assembly. These studies are valuable for further development of NGS big data applications for bioinformatics, including transcriptomics, metagenomics, pharmacogenomics, and precision medicine.
APA, Harvard, Vancouver, ISO, and other styles
35

Chin, Bing-Da, and 秦秉達. "Design of Parallel Binary Classification Algorithm Based on Hadoop Cluster with MapReduce Framework." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/fu84aw.

Full text
Abstract:
Master's thesis
National Taichung University of Science and Technology
Master's Program, Department of Computer Science and Information Engineering
Academic year 103
With the increasing amount of data today, it is hard to analyze large data sets efficiently in a single-computer environment; the Hadoop cluster is therefore very important because it lets us store and process large data sets. Data mining plays an important role in data analysis. Because the time complexity of the binary-class SVM classification algorithm is a big issue, we design a parallel binary SVM algorithm to solve this problem and classify the data appropriately. By leveraging the parallel processing properties of MapReduce, we implement a multi-layer binary SVM with the MapReduce framework and run it successfully on a Hadoop cluster. By varying the parameters of the Hadoop cluster and using the same data set for training analysis, we show that the new algorithm can reduce the computation time significantly.
APA, Harvard, Vancouver, ISO, and other styles
36

Rosen, Andrew. "Towards a Framework for DHT Distributed Computing." 2016. http://scholarworks.gsu.edu/cs_diss/107.

Full text
Abstract:
Distributed Hash Tables (DHTs) are protocols and frameworks used by peer-to-peer (P2P) systems. They are used as the organizational backbone for many P2P file-sharing systems due to their scalability, fault-tolerance, and load-balancing properties. These same properties are highly desirable in a distributed computing environment, especially one that wants to use heterogeneous components. We show that DHTs can be used not only as the framework to build a P2P file-sharing service, but as a P2P distributed computing platform. We propose creating a P2P distributed computing framework using distributed hash tables, based on our prototype system ChordReduce. This framework would make it simple and efficient for developers to create their own distributed computing applications. Unlike Hadoop and similar MapReduce frameworks, our framework can be used both in the context of a datacenter and as part of a P2P computing platform. This opens up new possibilities for building platforms for distributed computing problems. One advantage our system will have is an autonomous load-balancing mechanism. Nodes will be able to independently acquire work from other nodes in the network, rather than sitting idle. More powerful nodes in the network will be able to use the mechanism to acquire more work, exploiting the heterogeneity of the network. By utilizing the load-balancing algorithm, a datacenter could easily leverage additional P2P resources at runtime on an as-needed basis. Our framework will allow MapReduce-like or distributed machine learning platforms to be easily deployed in a greater variety of contexts.
APA, Harvard, Vancouver, ISO, and other styles
37

Tsung-Chih Huang and 黃宗智. "The Design and Implementation of the MapReduce Framework based on OpenCL in GPU Environment." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/44218540678222651423.

Full text
Abstract:
Master's thesis
National Cheng Kung University
Institute of Computer and Communication Engineering
Academic year 101
With the advances and evolution of technology, general-purpose computation on the GPU (GPGPU) has been put forward because of the excellent performance of GPUs in parallel computing. This thesis presents the design and implementation of a MapReduce software framework based on the Open Computing Language (OpenCL) in a GPU environment. For users who develop parallel application software using OpenCL, this framework provides an alternative that simplifies the development process and handles the complicated details of parallel computing, so the burden on developers is considerably relieved. The framework is composed of many application programming interfaces, divided into two parts in the system architecture. The first part is the API framework working on the CPU, covering initialization, data transfer, program creation, device information queries, thread configuration, kernel preparation, input record addition, GPU memory allocation, copying output to the host, and resource release. The second part is the API framework working on the GPU, covering Map, Map count, Reduce, Reduce count, group, and GPU memory sum. The implementation is realized using the OpenCL APIs provided by OpenCL library modules, including application data computation, memory calculation, preparation of pending application data, and so on. Users can thus concentrate on their part of the design process, while the framework automatically invokes the OpenCL functions, passes the appropriate parameter values, and coordinates CPU and GPU processing. The main contribution of this thesis is using OpenCL to implement a MapReduce software framework. Users can use this framework to develop cross-platform programs, which makes the porting process much easier. Furthermore, the framework provides many application programming interfaces for development, and those APIs fully demonstrate how OpenCL works and its processing flow.
APA, Harvard, Vancouver, ISO, and other styles
38

Lin, Jia-Chun, and 林佳純. "Study of Job Execution Performance, Reliability, Energy Consumption, and Fault Tolerance in the MapReduce Framework." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/48439755319437778819.

Full text
Abstract:
Doctoral dissertation
National Chiao Tung University
Institute of Computer Science and Engineering
Academic year 103
Node/machine failure is the norm rather than an exception in a large-scale MapReduce cluster. To prevent jobs from being interrupted by machine/node failures, MapReduce employs several policies, such as a task re-execution policy, an intermediate-data replication policy, and a reduce-task assignment policy. However, the impacts of these policies on MapReduce jobs are not clear, especially in terms of Job Completion Reliability (JCR for short), Job Turnaround Time (JTT for short), and Job Energy Consumption (JEC for short). In this dissertation, JCR is the reliability with which a MapReduce job can be completed by a MapReduce cluster, JTT is the time period starting when the job is submitted to the cluster and ending when the job is completed by the cluster, and JEC is the energy consumed by the cluster to complete the job. To achieve a more reliable and energy-efficient computing environment than the current MapReduce infrastructure, it is essential to comprehend the impacts of the above policies. In addition, the MapReduce master servers suffer from a single-point-of-failure problem, which might interrupt MapReduce operations and filesystem services. To study how the above policies influence the performance of MapReduce jobs, in this dissertation we formally derive and analyze the JCR, JTT, and JEC of a MapReduce job under the abovementioned MapReduce policies. In addition, to mitigate the single-point-of-failure problem and improve the service quality of MapReduce master servers, we propose a hybrid takeover scheme called PAReS (Proactive and Adaptive Redundant System) for MapReduce master servers. The analyses in this dissertation enable MapReduce managers to comprehend the influence of these policies on MapReduce jobs, help MapReduce managers choose appropriate policies for their MapReduce clusters, and allow MapReduce designers to propose better policies for MapReduce. Furthermore, based on our extensive experimental results, the proposed PAReS system can mitigate the single-point-of-failure problem and improve the service quality of MapReduce master servers compared with current redundant schemes on Hadoop.
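The dissertation's actual reliability derivation is not reproduced in the abstract. Purely as an illustration of the kind of quantity JCR denotes, assume a job with n independent tasks, a per-attempt success probability p_i for task i, and at most k re-executions per task; under those assumptions (which are ours, not necessarily the author's model), the job completes only if every task succeeds within its k+1 attempts:

```latex
% Illustrative model only, under the independence and re-execution assumptions stated above.
\mathrm{JCR} = \prod_{i=1}^{n} \left( 1 - (1 - p_i)^{k+1} \right)
```

Raising k or replicating intermediate data pushes this product toward one, but at the cost of extra time and energy, which is the JCR/JTT/JEC trade-off the abstract refers to.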
APA, Harvard, Vancouver, ISO, and other styles
39

Roy, Sukanta. "Automated methods of natural resource mapping with Remote Sensing Big data in Hadoop MapReduce framework." Thesis, 2022. https://etd.iisc.ac.in/handle/2005/5836.

Full text
Abstract:
For several decades, remote sensing (RS) tools have provided platforms for the large-scale exploration of natural resources across the planetary bodies of our solar system. In the context of Indian remote sensing, mineral resources are being explored, and mangrove resources are being monitored towards a sustainable socio-economic structure and coastal eco-system, respectively, by utilising several remote analytical techniques. However, RS technologies and the corresponding data analytics have made a vast paradigm shift, which eventually has produced “RS Big data” in our scientific world of large-scale remote analysis. Consequently, the current practices in remote sensing need a systematic improvisation of data analytics to provide a real-time, accurate and feasible remote exploration of the RS Big data. Towards this, the improvement of corresponding scientific analysis has opened up new opportunities and research perspectives for both academia and industry in remote sensing. In this favour, different automated methods are proposed in the Hadoop MapReduce framework as a part of this thesis aiming to develop both decisive and time-efficient remote analysis under the RS Big data environment. This thesis studies the remote exploration of various surface types covering the mineralogy and mangrove regions, respectively, as two significant applications in natural resource mapping. Before starting, the reliability and outreach of RS Big data analysis in the Hadoop MapReduce framework are also assessed in the laboratory environment. In this thesis, each proposed automated method is validated first in the single node analysis as a standalone process for individual RS applications. Then the corresponding MapReduce designs of the proposed methods make them scaled to conduct the distributed analysis in a pseudo-distributed Hadoop architecture for a prototype RS Big data environment in this thesis. In particular, a “per-pixel” mapping of the mineralised belt is conducted with a proposition of Extreme Learning Machine (ELM)-based scaled-ML algorithm in the Hadoop MapReduce framework by addressing the primary challenge because of impurity in the representative spectra of an observed pixel. To an extent, the same mineralogical province is explored with a proposition of a fraction cover mapping model in the Hadoop MapReduce framework by addressing the primary challenge due to the spectral variation of pure mineral spectra within an observed pixel. These mineralogical explorations on Earth utilise airborne-based hyperspectral imagery, whereas mineralogical explorations on Moon utilise spaceborne-based hyperspectral lunar imagery in this thesis. An automated mineralogical anomaly detection method identifies the prominent lunar mineral occurrences by addressing the consequences of space weathering on lunar exposures. On the other side, the spaceborne-based active remote sensing of polarimetric Earth imagery is utilised for land cover classification over the mangrove region in the Hadoop MapReduce framework. The land features of fully polarimetric (FP) and compact polarimetric (CP) observations are explored with a proposition of Active learning Multi-Layered Perceptron (AMLP) by addressing the primary challenge due to the uncertainties in class labelling. The robustness, stability, and generalisation of all proposed shallow neural networks of single hidden layered ML models are analysed for varietal informative data classification. 
In fact, the advancements in methodology and architecture in this thesis support each other in attaining better remote analysis with less computationally demanding automated methods. Some of the crucial findings of this thesis are as follows. For a reliable and generalised mineral mapping, the perturbed/mixed spectra of hydrothermal minerals are required to be mapped along with the pure spectra of hydrothermal minerals. Further, the fraction cover mapping of hydrothermal minerals should address the spectral variation of pure spectra and the underlying physics of spectral mixing to get a reliable and accurate fractional contribution of minerals. In contrast to Earth mineralogy, automated lunar mineral exploration needs to identify the potential mineralogical map of the lunar surface because of the space weathering effect. On the other hand, the underlying physics behind polarimetric synthetic aperture radar (SAR) remote sensing plays a vital role in better discrimination of land features within the mangrove regions. The inherent data parallelism of the Hadoop architecture makes the analytical algorithms scalable and time-efficient, and this can be extended to real-time Big data environments even with other MapReduce frameworks. In conclusion, even the shallow learning of an automated method can provide efficient real-time analysis of the RS Big data prototype if the physical constraints or prior physics-based insights of the remote observations are taken into account. It is evident in this thesis that such consideration makes the prototype RS Big data analysis more reliable, accurate, scalable, automated and widely acceptable under a variety of remote sensing environments. In summary, this thesis builds a bridge between academia and industry to provide new directional research on RS Big data analysis, helping to make better real-time plans for the natural resource management of any country like India.
APA, Harvard, Vancouver, ISO, and other styles
40

Huu, Tinh Giang Nguyen, and 阮有淨江. "Design and Implement a MapReduce Framework for Converting Standalone Software Packages to Hadoop-based Distributed Environments." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/20649990806109007865.

Full text
Abstract:
Master's thesis
National Cheng Kung University
Institute of Manufacturing Information and Systems (Master's and Doctoral Program)
Academic year 101
Hadoop MapReduce is a programming model for designing automatically scalable distributed computing applications. It provides developers with an effective environment for attaining automatic parallelization. However, most existing manufacturing systems are arduous and restrictive to migrate to a MapReduce private cloud, due to platform incompatibility and the tremendous complexity of system reconstruction. To increase the efficiency of manufacturing systems with minimal modification of the existing systems, we design a framework in this thesis, called MC-Framework: Multi-users-based Cloudizing-Application Framework. It provides a simple interface for users to fairly execute requested tasks that work with traditional standalone software packages in MapReduce-based private cloud environments. Moreover, this thesis focuses on multi-user workloads, for which the default Hadoop scheduling scheme, i.e., FIFO, would increase delay. Hence, we also propose a new scheduling mechanism, called Job-Sharing Scheduling, to distribute the jobs fairly across the machines in the MapReduce-based private cloud. This study uses an experimental design to verify and analyze the proposed MC-Framework with two case studies: (1) an independent-model system, the stochastic Petri net model, and (2) a dependent-model system, the virtual-metrology module of a manufacturing system. The results of our experiments indicate that the proposed framework enormously improves time performance compared with the original packages.
APA, Harvard, Vancouver, ISO, and other styles
41

Lo, Chia-Huai, and 駱家淮. "Constructing Suffix Array and Longest-Common-Prefix Array for Next-Generation-Sequencing Data Using MapReduce Framework." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/71000795009259045140.

Full text
Abstract:
Master's thesis
National Taiwan University
Graduate Institute of Computer Science and Information Engineering
Academic year 103
Next-generation sequencing (NGS) data is growing rapidly and represents a source of many kinds of new knowledge in science. State-of-the-art sequencers, such as the HiSeq 2500, can generate up to 1 trillion base pairs of sequencing data in 6 days, with good quality at low cost. In genome sequencing projects today, the NGS data size often ranges from tens of billions to several hundreds of billions of base pairs. It is time-consuming to process such a big set of NGS data, especially for applications based on sequence alignment, e.g., de novo genome assembly and correction of sequencing errors. In the literature, the suffix array, the longest common prefix (LCP) array, and the Burrows-Wheeler Transform (BWT) have been proved to be efficient indexes for speeding up manifold sequence alignment tasks. For example, the all-pairs suffix-prefix matching problem, i.e., finding overlaps of reads to form the overlap graph for sequence assembly, can be solved in linear time by reading these arrays. However, constructing those arrays for NGS data remains challenging due to the huge amount of storage required to hold the suffix array. MapReduce is a promising alternative for tackling the NGS challenge, but the existing MapReduce method for suffix array construction, i.e., RPGI proposed by Menon et al. [1], can only deal with input strings of size no greater than 4G base pairs and does not give LCPs in its output. In this study, we developed a MapReduce algorithm to construct the suffix, BWT, and LCP arrays for NGS data based on the framework of RPGI. In addition, the proposed method supports inputs with more than 4G base pairs and is developed into new software. To evaluate its performance, we compare the time it takes to process subsets of the giant grouper NGS data set of size 125 Gbp.
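The distributed RPGI-style construction is beyond a short excerpt; as a point of reference, the sketch below computes a suffix array and its LCP array (via Kasai's algorithm) for a small string on a single node. This is only the object the thesis' MapReduce pipeline has to produce at terabase scale, not the distributed construction itself.

```java
// Single-node reference construction of a suffix array and LCP array (Kasai's
// algorithm) for a small string. Not the distributed MapReduce construction.
import java.util.Arrays;
import java.util.Comparator;

public class SuffixArraySketch {

  /** Naive suffix array: sort suffix start positions by the suffixes they denote. */
  static Integer[] suffixArray(String s) {
    Integer[] sa = new Integer[s.length()];
    for (int i = 0; i < sa.length; i++) sa[i] = i;
    Arrays.sort(sa, Comparator.comparing(s::substring));
    return sa;
  }

  /** Kasai's algorithm: lcp[r] = longest common prefix of suffixes at ranks r and r-1. */
  static int[] lcpArray(String s, Integer[] sa) {
    int n = s.length();
    int[] rank = new int[n], lcp = new int[n];
    for (int r = 0; r < n; r++) rank[sa[r]] = r;
    int h = 0;
    for (int i = 0; i < n; i++) {
      if (rank[i] > 0) {
        int j = sa[rank[i] - 1];                 // suffix preceding suffix i in sorted order
        while (i + h < n && j + h < n && s.charAt(i + h) == s.charAt(j + h)) h++;
        lcp[rank[i]] = h;
        if (h > 0) h--;                          // amortization step of Kasai's algorithm
      } else {
        h = 0;
      }
    }
    return lcp;
  }

  public static void main(String[] args) {
    String reads = "ACGTACGA";                   // toy stand-in for concatenated NGS reads
    Integer[] sa = suffixArray(reads);
    int[] lcp = lcpArray(reads, sa);
    System.out.println("SA:  " + Arrays.toString(sa));
    System.out.println("LCP: " + Arrays.toString(lcp));
  }
}
```

The difficulty the thesis addresses is that these arrays cannot be built this way at the 125 Gbp scale on one machine, which is what motivates the distributed construction.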
APA, Harvard, Vancouver, ISO, and other styles
42

Chou, Chien-Ting, and 周建廷. "Research on The Computing of Direct Geo Morphology Runoff on Hadoop Cluster by Using MapReduce Framework." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/13575176515358582342.

Full text
Abstract:
Master's thesis
National Taiwan Normal University
Graduate Institute of Computer Science and Information Engineering
Academic year 99
Because of the weather and landforms in Taiwan, heavy rain often causes sudden rises in the runoff of some basins and can even lead to serious disasters. As a result, flood information systems are heavily relied upon in Taiwan, especially during typhoon season. Computing the runoff of a basin is the most important module of a flood information system, used to check whether the runoff exceeds the warning level. However, this module is complicated and data-intensive, and it becomes the bottleneck when real-time information is needed while a typhoon is hitting the basins. The applications in this thesis are developed on Apache Hadoop, open-source software that builds a distributed storage and computing environment and allows for the distributed processing of large data sets across clusters of computers using the MapReduce programming model. We have developed the runoff computing module of a basin using the MapReduce framework on a Hadoop cluster; speeding up the runoff computation increases the efficiency of the flood information system. Running our programs on an 18-node Hadoop cluster, we conclude that it can speed up the execution of the runoff computation by a factor of 6.
APA, Harvard, Vancouver, ISO, and other styles
43

Chrimes, Dillon. "Towards a big data analytics platform with Hadoop/MapReduce framework using simulated patient data of a hospital system." Thesis, 2016. http://hdl.handle.net/1828/7645.

Full text
Abstract:
Background: Big data analytics (BDA) is important for reducing healthcare costs. However, there are many challenges. The study objective was to establish a high-performance, interactive BDA platform for a hospital system. Methods: A Hadoop/MapReduce framework formed the BDA platform with HBase (a NoSQL database), using hospital-specific metadata and file ingestion. Query performance was tested with Apache tools in Hadoop's ecosystem. Results: At the optimized iteration, Hadoop distributed file system (HDFS) ingestion required three seconds, but HBase required four to twelve hours to complete the Reducer of MapReduce. HBase bulkloads took a week for one billion records (10 TB) and over two months for three billion (30 TB). Simple and complex queries returned results in about two seconds for one and three billion records, respectively. Interpretations: The BDA platform with HBase distributed by Hadoop performed successfully at large volumes representing the Province's entire data. Inconsistencies of MapReduce limited operational efficiencies. The importance of Hadoop/MapReduce for the representation of health informatics is further discussed.
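The platform's actual schema and loaders are not shown in the abstract; the fragment below is only a generic example of writing one row into an HBase table with the standard client API. The table name, column family, and qualifiers are invented stand-ins for the hospital-specific metadata.

```java
// Generic HBase client write, illustrating the kind of row-at-a-time ingestion
// the platform benchmarks against bulkloads. Table and column names are invented;
// this is not the platform's actual schema or loader.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseIngestSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("patient_encounters"))) {
      // Row key and columns are placeholders for the hospital-specific metadata.
      Put put = new Put(Bytes.toBytes("encounter#0000001"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("diagnosis"), Bytes.toBytes("J18.9"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("admit_date"), Bytes.toBytes("2016-01-15"));
      table.put(put);
    }
  }
}
```

At the billion-row volumes reported above, row-at-a-time puts like this are exactly what HBase bulkloading (writing HFiles directly) is meant to avoid.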
Graduate
APA, Harvard, Vancouver, ISO, and other styles
44

(9530630), Akshay Jajoo. "EXPLOITING THE SPATIAL DIMENSION OF BIG DATA JOBS FOR EFFICIENT CLUSTER JOB SCHEDULING." Thesis, 2020.

Find full text
Abstract:
With the growing business impact of distributed big data analytics jobs, it has become crucial to optimize their execution and resource consumption. In most cases, such jobs consist of multiple sub-entities called tasks and are executed online in a large shared distributed computing system. The ability to accurately estimate runtime properties and coordinate the execution of a job's sub-entities allows a scheduler to schedule jobs efficiently. This thesis presents the first study that highlights the spatial dimension, an inherent property of distributed jobs, and underscores its importance in efficient cluster job scheduling. We develop two new classes of spatial-dimension-based algorithms to address the two primary challenges of cluster scheduling.
First, we propose, validate, and design two complete systems that employ learning algorithms exploiting the spatial dimension. We demonstrate high similarity in runtime properties between sub-entities of the same job through detailed trace analysis of four different industrial cluster traces. We identify design challenges and propose principles for a sampling-based learning system in two settings: first for a coflow scheduler, and second for a cluster job scheduler.
We also propose, design, and demonstrate the effectiveness of new multi-task scheduling algorithms based on effective synchronization across the spatial dimension. We underline and validate by experimental analysis the importance of synchronization between the sub-entities (flows, tasks) of a distributed entity (coflow, data analytics job) for its efficient execution. We also highlight that ignoring sibling sub-entities when scheduling can lead to sub-optimal overall cluster performance. We propose, design, and implement a full coflow scheduler based on these assertions.
APA, Harvard, Vancouver, ISO, and other styles