Dissertations / Theses on the topic 'MAPREDUCE FRAMEWORKS'
Consult the top 44 dissertations / theses for your research on the topic 'MAPREDUCE FRAMEWORKS.'
De Souza Ferreira, Tharso. "Improving Memory Hierarchy Performance on MapReduce Frameworks for Multi-Core Architectures." Doctoral thesis, Universitat Autònoma de Barcelona, 2013. http://hdl.handle.net/10803/129468.
The need to analyze large data sets from many different application fields has fostered the use of simplified programming models like MapReduce. Its current popularity is justified by being a useful abstraction for expressing data-parallel processing and by effectively hiding synchronization, fault tolerance and load balancing details from the application developer. MapReduce frameworks have also been ported to multi-core and shared-memory computer systems. These frameworks dedicate a different CPU core to each map or reduce task so that tasks execute concurrently, and the Map and Reduce phases share a common data structure where the main computations are applied. In this work we describe some limitations of current multi-core MapReduce frameworks. First, we describe the relevance of the data structure used to keep all input and intermediate data in memory. Current multi-core MapReduce frameworks are designed to keep all intermediate data in memory, so when executing applications with large inputs the available memory becomes too small to store all of the framework's intermediate data and there is a severe performance loss. We propose a memory management subsystem that allows the intermediate data structures to process an unlimited amount of data through a disk-spilling mechanism, and we implement a way to manage the concurrent disk access of all threads participating in the computation. Finally, we study the effective use of the memory hierarchy by the data structures of MapReduce frameworks and propose a new implementation of partial MapReduce tasks over the input data set. The objective is to make better use of the cache and to eliminate references to data blocks that are no longer in use. Our proposal significantly reduces main memory usage and improves overall performance through increased cache usage.
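As a rough illustration of the disk-spilling idea described above (a generic sketch only; the class name, threshold, and file layout are assumptions, not details taken from the thesis):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy intermediate-data buffer: keeps (key, value) pairs in memory and spills the
// sorted contents to a temporary file once a size threshold is reached, so the
// total amount of intermediate data is no longer bounded by available RAM.
class SpillingBuffer {
    private final int maxInMemoryPairs;                               // spill threshold (assumed)
    private final TreeMap<String, List<Long>> buf = new TreeMap<>();
    private final List<Path> spillFiles = new ArrayList<>();
    private int size = 0;

    SpillingBuffer(int maxInMemoryPairs) { this.maxInMemoryPairs = maxInMemoryPairs; }

    // Called by worker threads; synchronized so concurrent emitters share one spill path.
    synchronized void emit(String key, long value) throws IOException {
        buf.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        if (++size >= maxInMemoryPairs) spill();
    }

    // Write the current in-memory run to disk in key order and clear memory.
    private void spill() throws IOException {
        Path run = Files.createTempFile("mr-spill-", ".run");
        try (BufferedWriter w = Files.newBufferedWriter(run, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, List<Long>> e : buf.entrySet())
                w.write(e.getKey() + "\t" + e.getValue() + "\n");
        }
        spillFiles.add(run);
        buf.clear();
        size = 0;
    }

    // Flush the last run; the sorted runs would then be merged on the reduce side.
    synchronized List<Path> finish() throws IOException {
        if (!buf.isEmpty()) spill();
        return spillFiles;
    }
}
```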
Kumaraswamy, Ravindranathan Krishnaraj. "Exploiting Heterogeneity in Distributed Software Frameworks." Diss., Virginia Tech, 2016. http://hdl.handle.net/10919/64423.
Ph. D.
Venumuddala, Ramu Reddy. "Distributed Frameworks Towards Building an Open Data Architecture." Thesis, University of North Texas, 2015. https://digital.library.unt.edu/ark:/67531/metadc801911/.
Peddi, Sri Vijay Bharat. "Cloud Computing Frameworks for Food Recognition from Images." Thesis, Université d'Ottawa / University of Ottawa, 2015. http://hdl.handle.net/10393/32450.
Elteir, Marwa Khamis. "A MapReduce Framework for Heterogeneous Computing Architectures." Diss., Virginia Tech, 2012. http://hdl.handle.net/10919/28786.
Ph. D.
Alkan, Sertan. "A Distributed Graph Mining Framework Based On Mapreduce." Master's thesis, METU, 2010. http://etd.lib.metu.edu.tr/upload/12611588/index.pdf.
Wang, Yongzhi. "Constructing Secure MapReduce Framework in Cloud-based Environment." FIU Digital Commons, 2015. http://digitalcommons.fiu.edu/etd/2238.
Zhang, Yue. "A Workload Balanced MapReduce Framework on GPU Platforms." Wright State University / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=wright1450180042.
Raja, Anitha. "A Coordination Framework for Deploying Hadoop MapReduce Jobs on Hadoop Cluster." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-196951.
Full textApache Hadoop är ett öppen källkods system som levererar pålitlig, skalbar och distribuerad användning. Hadoop tjänster hjälper med distribuerad data förvaring, bearbetning, åtkomst och trygghet. MapReduce är en viktig del av Hadoop system och är designad att bearbeta stora data mängder och även distribuerad i flera leder. MapReduce är använt extensivt inom bearbetning av strukturerad och ostrukturerad data i olika branscher bl. a e-handel, webbsökning, sociala medier och även vetenskapliga beräkningar. Förståelse av MapReduces arbetsbelastningar är viktiga att få förbättrad konfigurationer och resultat. Men, arbetsbelastningar av MapReduce inom massproduktions miljö var inte djup-forskat hittills. I detta examensarbete, är en hel del fokus satt på ”Hadoop cluster” (som en utförande miljö i data bearbetning) att analysera två typer av Hadoop MapReduce (MR) arbeten genom ett tilltänkt system. Detta system är refererad som arbetsbelastnings översättare. Resultaten från denna arbete innehåller: (1) en parametrisk arbetsbelastningsmodell till inriktad MR arbeten, (2) en specifikation att utveckla förbättrad kluster strategier med båda modellen och koordinations system, och (3) förbättrad planering och arbetsprestationer, d.v.s kortare tid att utföra arbetet. Vi har realiserat en prototyp med Apache Tomcat på (OpenStack) Ubuntu Trusty Tahr som använder RESTful API (1) att skapa ”Hadoop cluster” version 2.7.2 och (2) att båda skala upp och ner antal medarbetare i kluster. Forskningens resultat har visat att med vältrimmad parametrar, kan MR arbete nå förbättringar dvs. sparad tid vid slutfört arbete och förbättrad användning av hårdvara resurser. Målgruppen för denna avhandling är utvecklare. I framtiden, föreslår vi tilläggning av olika parametrar att utveckla en allmän modell för MR och liknande arbeten.
Lakkimsetti, Praveen Kumar. "A framework for automatic optimization of MapReduce programs based on job parameter configurations." Kansas State University, 2011. http://hdl.handle.net/2097/12011.
Department of Computing and Information Sciences
Mitchell L. Neilsen
Recently, cost-effective and timely processing of large datasets has been playing an important role in the success of many enterprises and the scientific computing community. Two promising trends ensure that applications will be able to deal with ever increasing data volumes: first, the emergence of cloud computing, which provides transparent access to a large number of processing, storage and networking resources; and second, the development of the MapReduce programming model, which provides a high-level abstraction for data-intensive computing. MapReduce has been widely used for large-scale data analysis in the Cloud [5]. The system is well recognized for its elastic scalability and fine-grained fault tolerance. However, even to run a single program in a MapReduce framework, a number of tuning parameters have to be set by users or system administrators to increase the efficiency of the program. Users often run into performance problems because they are unaware of how to set these parameters, or because they don't even know that these parameters exist. With MapReduce being a relatively new technology, it is not easy to find qualified administrators [4]. The major objective of this project is to provide a framework that optimizes MapReduce programs that run on large datasets. This is done by executing the MapReduce program on part of the dataset with stored parameter combinations, configuring the program with the most efficient combination found, and then running the tuned program over the full datasets. Many MapReduce programs are used over and over again in applications such as daily weather analysis, log analysis and daily report generation, so once the parameter combination is set, it can be reused efficiently on a number of data sets. This feature can go a long way towards improving the productivity of users who lack the skills to optimize programs themselves due to lack of familiarity with MapReduce or with the data being processed.
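Such tuning parameters are normally supplied through Hadoop's job configuration; a minimal sketch of applying one stored parameter combination before submission (the property keys are standard Hadoop 2.x names, while the class and the chosen values are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedJobLauncher {
    // Apply one stored parameter combination to the job configuration before submitting.
    public static Job buildJob(Configuration conf) throws Exception {
        conf.setInt("mapreduce.task.io.sort.mb", 256);            // map-side sort buffer size
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.8f);  // spill threshold
        conf.setBoolean("mapreduce.map.output.compress", true);   // compress intermediate output
        Job job = Job.getInstance(conf, "tuned-job");
        job.setNumReduceTasks(8);                                  // reducer count from the stored combination
        return job;
    }
}
```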
Li, Min. "A resource management framework for cloud computing." Diss., Virginia Tech, 2014. http://hdl.handle.net/10919/47804.
Ph. D.
Rahman, Md Wasi-ur. "Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1480475635778714.
Donepudi, Harinivesh. "An Apache Hadoop Framework for Large-Scale Peptide Identification." TopSCHOLAR®, 2015. http://digitalcommons.wku.edu/theses/1527.
Huang, Xin. "Querying big RDF data : semantic heterogeneity and rule-based inconsistency." Thesis, Sorbonne Paris Cité, 2016. http://www.theses.fr/2016USPCB124/document.
The Semantic Web is the vision of the next generation of the Web proposed by Tim Berners-Lee in 2001. With the rapid development of Semantic Web technologies, large-scale RDF data already exist as linked open data, and their volume is growing rapidly. Traditional Semantic Web querying and reasoning tools are designed to run in a stand-alone environment, so processing large-scale bulk data with such solutions inevitably hits bottlenecks in memory space and computational performance. Large volumes of heterogeneous data are collected from different data sources by different organizations, and in this context the sources contain inconsistencies and uncertainties that are difficult to identify and evaluate. To address these challenges of the Semantic Web, the main research contributions and approaches are as follows. First, we developed an inference-based semantic entity resolution and linking mechanism for the case where the same entity is provided in multiple RDF resources described with different semantics and URI identifiers. We also developed a MapReduce-based rewriting engine for SPARQL queries over big RDF data that handles, during query evaluation, the implicit data described intentionally by inference rules; the rewriting approach also deals with transitive closure and cyclic rules to provide a rich inference language such as RDFS and OWL. The second contribution concerns distributed inconsistency processing. We extend the first contribution by taking inconsistency in the data into account, including (1) rule-based inconsistency detection with the help of our query rewriting engine and (2) consistent query evaluation under three different semantics. The third contribution concerns reasoning and querying over large-scale uncertain RDF data. We propose a MapReduce-based approach to large-scale reasoning with uncertainty: unlike the possible-worlds semantics, we propose an algorithm for generating an intensional SPARQL query plan over a probabilistic RDF graph for computing the probabilities of the query results.
RANJAN, RAVI. "PERFORMANCE ANALYSIS OF APRIORI AND FP GROWTH ON DIFFERENT MAPREDUCE FRAMEWORKS." Thesis, 2017. http://dspace.dtu.ac.in:8080/jspui/handle/repository/15814.
Huang, Ruei-Jyun, and 黃瑞竣. "A MapReduce Framework for Heterogeneous Mobile Devices." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/91818518409630056856.
National Taiwan University of Science and Technology
Department of Electronic Engineering
101
With the advance of science and technology, mobile device makers continue to introduce new models, and users are willing to buy them to experience improved hardware and software performance. After a few years, a user may therefore have accumulated mobile devices with different computing capabilities. In this thesis, we use heterogeneous mobile devices and a wireless router to build a MapReduce framework. Through this framework we can not only control each mobile device but also execute different applications on a single device or across multiple devices. The framework combines multi-threaded parallel computing with a load-balancing method to improve performance compared with any single mobile device. In the experiments, we run two applications, word counting and prime-number counting, on four different types of mobile devices, and also run both applications on a PC as a baseline. The experimental results demonstrate the feasibility and efficiency of the MapReduce framework for heterogeneous mobile devices.
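The word-count workload mentioned above follows the usual map/reduce pattern; the minimal Hadoop-style sketch below is shown purely to illustrate the map and reduce logic (the thesis targets its own mobile framework, not Hadoop):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, 1) for every token in the input line.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        StringTokenizer it = new StringTokenizer(value.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            ctx.write(word, ONE);
        }
    }
}

// Reduce: sum the counts collected for each word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
    }
}
```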
"Thermal Aware Scheduling in Hadoop MapReduce Framework." Master's thesis, 2013. http://hdl.handle.net/2286/R.I.20932.
M.S. Computer Science 2013
Li, Jia-Hong, and 李家宏. "Using MapReduce Framework for Mining Association Rules." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/gbw4n8.
National Taichung University of Science and Technology
Master's Program, Department of Computer Science and Information Engineering
101
With the rapid development of computer hardware and network technologies, demand for related applications keeps growing, and cloud computing has become a very popular research area. Association rule mining plays an important role in data mining on cloud computing platforms: association rules are useful for discovering relationships among different products and thus support better marketing decisions. In association rule mining, the computational load of discovering all frequent itemsets from a transaction database is considerably high, and researchers have shown that mining big data on a single machine can be infeasible and ineffective. The PIETM approach (Principle of Inclusion-Exclusion and Transaction Mapping) benefits from two well-known algorithms, Apriori and FP-Growth: Apriori contributes the joining and pruning of candidate itemsets, FP-Growth scans the database only twice, and PIETM mines frequent itemsets recursively using the principle of inclusion-exclusion. To support big data processing, this thesis presents a PIETM algorithm based on the MapReduce framework for parallel processing of large transaction databases. The experimental results show that, after re-tuning the MapReduce framework parameters, the proposed PIETM algorithm processes big data efficiently.
Kao, Yu-Chon, and 高玉璁. "Data-Locality-Aware MapReduce Real-Time Scheduling Framework." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/95200425495617797846.
National Taiwan University of Science and Technology
Department of Electrical Engineering
103
MapReduce is widely used in cloud applications for large-scale data processing. The increasing number of interactive cloud applications has led to an increasing need for MapReduce real-time scheduling. Most MapReduce applications are data-oriented and nonpreemptively executed; therefore, the problem of MapReduce real-time scheduling is complicated by the trade-off between run-time blocking due to nonpreemptive execution and data-locality. This paper proposes a data-locality-aware MapReduce real-time scheduling framework for guaranteeing quality of service for interactive MapReduce applications. A scheduler and a dispatcher are presented for scheduling two-phase MapReduce jobs and assigning them to computing resources, and the dispatcher takes both blocking and data-locality into consideration. Furthermore, dynamic power management for run-time energy saving is discussed. Finally, the proposed methodology is evaluated with synthetic workloads, and a comparative study of different scheduling algorithms is conducted.
Zeng, Wei-Chen, and 曾偉誠. "Efficient XML Data Processing based on MapReduce Framework." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/j8b55u.
National Taichung University of Science and Technology
Master's Program, Department of Computer Science and Information Engineering
103
With advances in hardware and network technology, the amount of data generated by all kinds of applications is increasing rapidly, and cloud computing has become one of the most important research topics for processing such big data. Cloud computing offers a new service architecture that uses computing resources and storage space more effectively and also provides development environments and cloud services. Large-scale data processing is currently done mostly in the MapReduce environment, where data must be expressed in the form required by the MapReduce model (i.e., as (key, value) pairs) so that it can be partitioned and assigned to computing nodes for parallel processing. XML (eXtensible Markup Language), the markup language proposed by W3C for transmitting and processing complex documents, supports information querying and electronic data exchange, and is today a common standard for data exchange and storage. Although XML processing on a single computer is a mature technology, a single host cannot afford the computation when an XML document is too long or too large: path exploration may fail or become too slow. This study therefore applies cloud computing and parallel processing techniques to huge XML data sets. When MapReduce analyzes an XML document, the document must be cut into parts and distributed to the computing nodes before parallel processing can proceed, but this cutting destroys the nested structure of XML tags and makes it difficult to recover their nesting relations, which complicates the processing. In this work we design a MapReduce workflow, organized as one round of MapReduce, in which we implement our own input format class, named XMLInputFormat.class, and relate the original XML stored on HDFS to the fragments being processed. The system processes big XML data on the cloud platform, extracts every XML path, and stores the paths in an HBase database for subsequent operations such as data mining. We built a Hadoop cluster of 16 cloud servers to test the effectiveness of the approach; the evaluation covers both XML characteristics and Hadoop parameter tuning. The experiments show that the proposed Hadoop MapReduce distributed parallel processing of large XML documents is effective: the largest XML file tested was 16 GB and the largest number of XML paths was 13,600,000, with reported results of 67.4% and 89.7% in the respective experiments.
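A rough sketch of the path-extraction step described above, assuming each map input value already carries one well-formed XML fragment (the custom XMLInputFormat and the HBase write path are omitted; class and counter names are illustrative):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: parse one XML fragment with StAX and emit every element path, e.g. /root/item/name.
class XmlPathMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(value.toString()));
            Deque<String> stack = new ArrayDeque<>();    // current element path
            while (r.hasNext()) {
                int ev = r.next();
                if (ev == XMLStreamConstants.START_ELEMENT) {
                    stack.addLast(r.getLocalName());
                    ctx.write(new Text("/" + String.join("/", stack)), ONE);
                } else if (ev == XMLStreamConstants.END_ELEMENT) {
                    stack.removeLast();
                }
            }
        } catch (XMLStreamException e) {
            ctx.getCounter("xml", "malformed_fragments").increment(1);  // skip bad fragments
        }
    }
}
```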
CHEN, YI-TING, and 陳奕廷. "An Improved K-means Algorithm based on MapReduce Framework." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/582un3.
Yuan Ze University
Department of Computer Science and Engineering
104
As data is collected and accumulated, data mining is used in big data analysis to find relationships within huge amounts of data and to uncover the information behind them; cluster analysis in data mining simplifies the data and supports such analysis. In this thesis we examine the problems of the K-means algorithm and improve it. K-means has several disadvantages: the user must choose the number of clusters K, the starting points are generated randomly, and processing can be slow or even infeasible when dealing with huge amounts of data. To address these problems we propose a method based on the MapReduce framework that improves the K-means algorithm: agglomerative hierarchical clustering (HAC) is used to generate the starting points, which avoids the instability of random initialization, and the Silhouette coefficient is used to choose the best number of clusters K. The results of this research show that the method selects a suitable K value automatically, generates stable initial centroids, and is able to handle large amounts of data.
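One MapReduce iteration of standard K-means, as a minimal sketch of the clustering step the abstract builds on (the HAC seeding and Silhouette-based selection of K are not shown; centroid loading is assumed to happen in setup()):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: assign each point (a comma-separated vector) to the nearest current centroid.
class AssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final List<double[]> centroids = new ArrayList<>();

    @Override
    protected void setup(Context ctx) {
        // In a real job the current centroids would be read here, e.g. from files
        // shipped via the distributed cache; omitted in this sketch.
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        double[] p = new double[parts.length];
        for (int i = 0; i < parts.length; i++) p[i] = Double.parseDouble(parts[i]);
        int best = 0; double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.size(); c++) {
            double d = 0;
            for (int i = 0; i < p.length; i++) {
                double diff = p[i] - centroids.get(c)[i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        ctx.write(new IntWritable(best), value);
    }
}

// Reduce: recompute each centroid as the mean of its assigned points.
class RecomputeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable cluster, Iterable<Text> points, Context ctx)
            throws IOException, InterruptedException {
        double[] sum = null; long n = 0;
        for (Text t : points) {
            String[] parts = t.toString().split(",");
            if (sum == null) sum = new double[parts.length];
            for (int i = 0; i < parts.length; i++) sum[i] += Double.parseDouble(parts[i]);
            n++;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < sum.length; i++) sb.append(i == 0 ? "" : ",").append(sum[i] / n);
        ctx.write(cluster, new Text(sb.toString()));
    }
}
```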
Chang, Hung-Yu, and 張弘諭. "Adaptive MapReduce Framework for Multi-Application Processing on GPU." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/40788660892721645478.
National Cheng Kung University
Master's and Doctoral Programs, Department of Engineering Science
101
With improvements in electronic and computer technology, the amount of data to be processed by enterprises keeps growing, and handling such amounts of data is no longer a big challenge with the help of the MapReduce framework: applications from many fields can take advantage of MapReduce on a large number of CPUs for efficient distributed and parallel computing. At the same time, graphics processing unit (GPU) technology is also improving; multi-core GPUs provide strong computing power capable of handling large workloads, and many MapReduce frameworks have been designed and implemented on GPU hardware following the general-purpose GPU computing concept to achieve better performance. However, most GPU MapReduce frameworks so far focus on single-application processing; no mechanisms are provided for multi-application execution, so applications can only be processed sequentially, GPU hardware resources may not be fully utilized, and computing performance decreases. This study designs and implements a multi-application execution mechanism based on the state-of-the-art GPU MapReduce framework Mars. It provides a problem-partitioning utility that considers the data size and hardware resource requirements of each application, and it feeds appropriate amounts of work into the GPU with overlapped GPU operations for efficient parallel execution. Several common applications are used to verify the applicability of this mechanism, with execution time as the main evaluation metric. An overall speedup of 1.3 for random application combinations is achieved with the proposed method.
Hua, Guanjie, and 滑冠傑. "Haplotype Block Partitioning and TagSNP Selection with MapReduce Framework." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/13072946241409858351.
Providence University
Department of Computer Science and Information Engineering
101
SNPs play important roles in various analysis applications, including medical diagnostics and drug design: they provide the highest-resolution genetic fingerprint for identifying disease associations and human traits. A haplotype, composed of SNPs, is a region of linked genetic variants that are neighboring and usually inherited together. Recent genetic research shows that SNPs within certain haplotype blocks induce only a few distinct common haplotypes in the majority of the population, and the notion of haplotype blocks has important implications for association-based methods for mapping disease genes. We investigated several efficient combinatorial algorithms for selecting interesting haplotype blocks under different diversity functions, generalizing many previous results in the literature; however, the proposed method is computation-intensive. This thesis adopts the MapReduce paradigm to parallelize these tools and manage their execution. The experiments show that the MapReduce-parallelized version of the original sequential combinatorial algorithm performs well on real-world data obtained from the HapMap data set, and that computational efficiency improves roughly in proportion to the number of processors used.
Chou, Yen-Chen, and 周艷貞. "General Wrapping of Information Hiding Patterns on MapReduce Framework." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/21760802727746979445.
Chang, Zhi-Hong, and 張志宏. "Join Operations for Large-Scale Data on MapReduce Framework." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/t4c5e6.
National Taichung University of Science and Technology
Master's Program, Department of Computer Science and Information Engineering
100
With the rapid development of hardware and network technology, cloud computing has become an important research topic; it provides a solution for large-scale data processing problems. Data-parallel frameworks provide a platform for dealing with large-scale data, especially for data mining and data warehousing, and MapReduce is one of the most famous of them. It consists of two stages: the Map stage and the Reduce stage. Based on the MapReduce framework, Scatter-Gather-Merge (SGM) is an efficient algorithm supporting star-join queries, one of the most important query types in data warehouses; however, SGM supports only equi-join operations. This thesis proposes a framework that supports not only equi-join but also non-equi-join operations. Because non-equi-join processing usually causes a large amount of I/O, the proposed method addresses this problem by reducing the cost of load balancing. Our experimental results show that, for equi-join operations, our method has execution times similar to SGM; for non-equi-join operations, we also illustrate the performance under different conditions.
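For reference, a minimal reduce-side (repartition) equi-join sketch illustrating the general MapReduce join pattern discussed above; this is not the Scatter-Gather-Merge algorithm itself, and the table tags and key position are assumptions:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Map: key every record by its join attribute and tag it with its source table.
class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
        String tag = file.startsWith("fact") ? "F" : "D";      // fact vs. dimension table (assumed naming)
        String[] cols = value.toString().split("\t");
        ctx.write(new Text(cols[0]), new Text(tag + "|" + value)); // cols[0] = join key (assumed)
    }
}

// Reduce: buffer dimension rows, then emit the cross product with fact rows sharing the key.
class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        List<String> dims = new ArrayList<>();
        List<String> facts = new ArrayList<>();
        for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("D|")) dims.add(s.substring(2)); else facts.add(s.substring(2));
        }
        for (String f : facts)
            for (String d : dims)
                ctx.write(key, new Text(f + "\t" + d));
    }
}
```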
Huang, Yuan-Shao, and 黃元劭. "An Efficient Frequent Patterns Mining Algorithm Based on MapReduce Framework." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/78k6v3.
Chung Hua University
Master's Program, Department of Computer Science and Information Engineering
101
Recently, the amount of data in every enterprise has been increasing continuously, and big data, cloud computing and data mining have become hot topics. In this thesis we modify the traditional Apriori algorithm to improve its execution efficiency, since Apriori faces the problem that its computation time increases dramatically as the data size grows. We design and implement two efficient algorithms: a Frequent Patterns Mining Algorithm based on the MapReduce Framework (FAMR) and an optimized version, OFAMR. We exploit the Hadoop MapReduce framework to shorten the mining execution time; compared with a "one-phase" algorithm, experimental results show that FAMR achieves a speedup of 16.2 in running time. Because the one-phase method uses only a single MapReduce pass, it generates excessive candidates and can run out of memory. We therefore also implemented the optimized OFAMR algorithm, whose performance is superior to FAMR because the number of candidates it generates is smaller.
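A minimal sketch of the per-pass candidate counting that an Apriori-style MapReduce job performs (FAMR's actual phase structure is not reproduced; the candidate set is assumed to be distributed to mappers beforehand, e.g. via the distributed cache):

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each transaction, emit (candidate, 1) for every candidate itemset it contains.
class CandidateCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Set<String> candidates = new HashSet<>(); // k-itemsets, items sorted and joined by ','
                                                            // populated in setup() in a real job (omitted)

    @Override
    protected void map(LongWritable key, Text transaction, Context ctx)
            throws IOException, InterruptedException {
        Set<String> items = new HashSet<>(Arrays.asList(transaction.toString().split(" ")));
        for (String cand : candidates) {
            boolean contained = true;
            for (String item : cand.split(",")) {
                if (!items.contains(item)) { contained = false; break; }
            }
            if (contained) ctx.write(new Text(cand), ONE);
        }
    }
}

// Reduce: sum the counts and keep only candidates meeting the minimum support.
class SupportReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private static final long MIN_SUPPORT = 1000;           // illustrative threshold
    @Override
    protected void reduce(Text cand, Iterable<LongWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable c : counts) sum += c.get();
        if (sum >= MIN_SUPPORT) ctx.write(cand, new LongWritable(sum));
    }
}
```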
You, Hsin-Han, and 尤信翰. "A Load-Aware Scheduler for MapReduce Framework in Heterogeneous Environments." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/30478140785981334531.
National Chiao Tung University
Institute of Computer Science and Engineering
99
MapReduce is becoming a popular programming model for large-scale data processing such as data mining, log processing, web indexing and scientific research. A MapReduce framework is a batch distributed data-processing framework that disassembles a job into smaller map tasks and reduce tasks; the master node distributes tasks to worker nodes to complete the whole job. Hadoop MapReduce is the most popular open-source implementation of the MapReduce framework, and it comes with a pluggable task-scheduler interface and a default FIFO job scheduler. The performance of MapReduce jobs and the overall cluster utilization depend on how tasks are assigned and processed. In practice, issues such as dynamic load, heterogeneity of nodes and multiple-job scheduling need to be taken into account, and we find that the current Hadoop scheduler suffers from performance degradation due to these problems. We propose a new scheduler, named the Load-Aware Scheduler, to address these issues and improve overall performance and utilization. Experimental results show that we can improve utilization by 10 to 20 percent on average by avoiding unnecessary speculative tasks.
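Speculative execution, which the abstract identifies as a source of wasted work, is switched through standard job properties; a minimal sketch using the stock Hadoop 2.x keys (the Load-Aware Scheduler itself makes finer-grained decisions than simply turning speculation off):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationConfig {
    // Disable speculative re-execution of map and reduce tasks for one job.
    public static Job withoutSpeculation(Configuration conf) throws Exception {
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        return Job.getInstance(conf, "no-speculation-job");
    }
}
```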
Ho, Hung-Wei, and 何鴻緯. "Modeling and Analysis of Hadoop MapReduce Framework Using Petri Nets." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/37850810462698368045.
National Taipei University
Department of Computer Science and Information Engineering
103
Technological advances have significantly increased the amount of corporate data available, which has created a wide range of business opportunities related to big data and cloud computing. Hadoop is a popular programming framework used for the setup of cloud computing systems. The MapReduce framework forms the core of the Hadoop program for parallel computing and its parallel framework can greatly increase the efficiency of big data analysis. This study used Petri nets to create a visual model of the MapReduce framework and verify its reachability. We present an actual big data analysis system to demonstrate the feasibility of the model, describe the internal procedures of the MapReduce framework in detail, list common errors during the system development process and propose error prevention mechanisms using the Petri net model in order to increase efficiency in the system development.
Chang, Jui-Yen, and 張瑞岩. "MapReduce-Based Frequent Pattern Mining Framework with Multiple Item Support." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/bxj8r2.
National Taipei University of Technology
Master's Program, Department of Information and Finance Management
105
Mining frequent patterns from big data is becoming ever more challenging, yet it has many applications that can make people's health and daily life better and easier. Association mining is the process of discovering interesting and useful association rules hidden in huge and complicated databases. However, using a single minimum support value for all items is not sufficient, because it cannot reflect the characteristics of each item: if the minimum support (MIS) value is set too low, rare items can be found but a large number of meaningless patterns may also be generated; if it is set too high, useful rare patterns are lost. How to set the minimum support threshold for each item so that correlated patterns can be found efficiently and accurately is therefore essential. In addition, efficient computing has been an active research issue in data mining in recent years; MapReduce, proposed in 2008, makes it easier to implement parallel algorithms that compute various kinds of derived data and reduce run time. Accordingly, this thesis proposes a model that assigns a separate minimum support value to each item and uses the MapReduce framework to find correlated patterns involving both frequent and rare items accurately and efficiently. It does not require post-pruning and rebuilding phases, since every promising item has support greater than or equal to MIN-MIS, thereby improving the overall performance of mining frequent patterns and rare items.
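As a rough illustration of the multiple-minimum-support idea (the thresholds below are invented for the example and follow the usual MSapriori-style rule, not necessarily the exact rule used in this thesis): if MIS(milk) = 2% and MIS(caviar) = 0.2%, the candidate itemset {milk, caviar} is judged against the smaller of the two thresholds, min(2%, 0.2%) = 0.2%, so a pattern containing the rare item is not discarded merely because it cannot reach the 2% bar, while {milk} on its own must still meet its own 2% threshold.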
YEH, WEN-HSIN, and 葉文昕. "An Algorithm for Efficiently Mining Frequent Itemsets Based on MapReduce Framework." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/z8na4z.
Minghsin University of Science and Technology
Master's Program, Department of Electrical Engineering
107
With the maturity of cloud technology, big data, data mining and cloud computing have become hot research topics. Association rule mining is one of the most important techniques in data mining, and Apriori is its most representative algorithm, but the performance of the traditional Apriori algorithm deteriorates as the amount of data grows and the support threshold shrinks. Using the MapReduce distributed architecture of cloud computing can remedy Apriori's shortcomings. Google Cloud Dataproc is a platform that makes Hadoop cluster management easy and is very helpful for big data analysis. Based on Apriori, this thesis proposes an association rule mining algorithm built on the MapReduce architecture, using Google Cloud Dataproc as the experimental environment. The experimental results show that the proposed method performs better than other methods when the amount of data and computation is large.
Wei, Xiu-Hui, and 魏秀蕙. "Performance Comparison of Sequential Pattern Mining Algorithms Based on Mapreduce Framework." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/fz7kg8.
National Taichung University of Science and Technology
Master's Program, Department of Computer Science and Information Engineering
102
With the popularity of cloud technology and the accumulation of large amounts of data, reducing the time needed to process large amounts of data efficiently has become a very important research direction. Many kinds of data mining techniques are used in the analysis of huge amounts of data, including association rule mining and sequential pattern mining algorithms. In this study, two sequential pattern mining algorithms, GSP and AprioriAll, are parallelized using the MapReduce framework. We design and compare the efficiency of the two parallelized algorithms and analyze the differences between GSP and AprioriAll. The results show that the parallelized GSP algorithm outperforms the parallelized AprioriAll algorithm.
Chen, Bo-Ting, and 陳柏廷. "Improving the techniques of Mining Association Rule based on MapReduce framework." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/b6g4eb.
National Taichung University of Science and Technology
Master's Program, Department of Computer Science and Information Engineering
103
Through data mining we can extract useful and valuable information from seemingly insignificant data and gain great benefit from professional analysis; however, improving the performance of data mining is essential for big data processing. The purpose of this study is to improve the performance of the parallel association-rule mining algorithm PIETM (Principle of Inclusion-Exclusion and Transaction Mapping) under the MapReduce framework. PIETM arranges the transaction data of the database into a tree structure called the Transaction tree (T-tree), transforms the T-tree into a Transaction Interval tree (TI-tree), and then uses the principle of inclusion-exclusion on the TI-tree to compute all frequent itemsets. PIETM combines the benefits of the Apriori and FP-growth algorithms and needs to scan the database only twice. Nevertheless, some procedures still need improvement, for example constructing the transaction tree and generating candidate k-itemsets; for these two problems we provide one solution each, adopting ideas from the FP-growth and Apriori algorithms respectively.
Chang, Chih-Wei, and 張智崴. "An Adaptive Real-Time MapReduce Framework Based on Locutus and Borg-Tree." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/32093140696316430010.
National Taipei University of Education
Master's Program, Department of Computer Science
101
Google released the design of MapReduce in 2004. After years of development, Apache launched Hadoop version 1.0 in 2011, which means the open-source MapReduce ecosystem is now mature enough to support business applications. However, some features are still unsatisfactory for big data processing: first, support for real-time computing, and second, cross-platform deployment and ease of use. In this thesis we analyze the performance bottleneck of Hadoop and try to resolve it, hoping to develop an easy-to-use real-time computing platform. We cite research indicating that the bottlenecks of MapReduce performance are the access speed of HDFS and the performance of ZooKeeper; this means that if we improve the coordination mechanism and replace HDFS with a faster storage mechanism (we use shared memory in this work), we can improve performance significantly, enough to support real-time analysis applications. In this thesis we propose an algorithm based on Locutus and Borg-Tree to support coordination for MapReduce. It is structured as a P2P topology designed for quick distributed processing, and it is programmed in NodeJS so that it can be easily deployed to many cloud platforms. We finally conducted experiments to examine the feasibility of our prototype. Although the program did not reach the expected performance, we point out the problems with the shared-memory mechanism and our protocol for subsequent research and development.
Chung, Wei-Chun, and 鐘緯駿. "Algorithms for Correcting Next-Generation Sequencing Errors Based on MapReduce Big Data Framework." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/k9jnna.
National Taiwan University
Graduate Institute of Computer Science and Information Engineering
105
The rapid advancement of next-generation sequencing (NGS) technology has generated an explosive growth of ultra-large-scale data and computational problems, particularly in de novo genome assembly. Greater sequencing depths and increasingly longer reads have introduced numerous errors, which increase the probability of misassembly. The huge amounts of data cause severely high disk I/O overhead and lead to an unexpectedly long execution time. To speed up the time-consuming assembly process without affecting its quality and to address problems pertaining to error correction, we focus on improving the algorithm design, architecture design, and implementation of NGS de novo genome assembly based on cloud computing. Errors in sequencing data result in fragmented contigs, which lead to an assembly of poor quality; we therefore propose an error correction algorithm based on cloud computing. The algorithm emulates the design of the error correction algorithm of ALLPATHS-LG and is designed to correct errors conservatively to avoid false decisions. To reduce the massive disk I/O overhead and thus speed up execution, we introduce a message control strategy, the read-message (RM) diagram, to represent the structure of the intermediate data generated along with each read, and we develop various schemes that trim portions of the RM diagram to shrink the size of the intermediate data and thereby reduce the number of disk I/O operations. We have implemented the proposed algorithms on the MapReduce cloud computing framework and evaluated them against state-of-the-art tools. The RM method reduces the intermediate data size and speeds up execution; our proposed algorithms have improved not only the execution time of the pipeline dramatically, but also the quality of the assembly. This dissertation presents algorithms and architectural designs that speed up execution time and improve the quality of de novo genome assembly. These studies are valuable for further development of NGS big data applications in bioinformatics, including transcriptomics, metagenomics, pharmacogenomics, and precision medicine.
Chin, Bing-Da, and 秦秉達. "Design of Parallel Binary Classification Algorithm Based on Hadoop Cluster with MapReduce Framework." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/fu84aw.
National Taichung University of Science and Technology
Master's Program, Department of Computer Science and Information Engineering
103
With the increasing amount of data today, it is hard to analyze large data sets efficiently in a single-computer environment, so Hadoop clusters, which can store and process large data, have become very important. Data mining plays an important role in data analysis, and because the time complexity of the binary-class SVM classification algorithm is a big issue, we design a parallel binary SVM algorithm to address this problem and classify data appropriately. By leveraging the parallel processing properties of MapReduce, we implement a multi-layer binary SVM with the MapReduce framework and run it successfully on a Hadoop cluster. By varying the parameters of the Hadoop cluster and using the same data set for training, we show that the new algorithm can reduce computation time significantly.
Rosen, Andrew. "Towards a Framework for DHT Distributed Computing." 2016. http://scholarworks.gsu.edu/cs_diss/107.
Huang, Tsung-Chih, and 黃宗智. "The Design and Implementation of the MapReduce Framework based on OpenCL in GPU Environment." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/44218540678222651423.
National Cheng Kung University
Institute of Computer and Communication Engineering
101
With advances in technology, general-purpose computation on the GPU (GPGPU) has been put forward because of the excellent performance of GPUs in parallel computing. This thesis presents the design and implementation of a MapReduce software framework based on the Open Computing Language (OpenCL) in a GPU environment. For users who develop parallel application software with OpenCL, this framework provides an alternative that simplifies the development process and handles the complicated details of parallel computing, considerably relieving the developer's burden. The framework consists of application programming interfaces divided into two parts. The first part works on the CPU and covers initialization, data transfer, program creation, device information queries, thread configuration, kernel preparation, adding input records, GPU memory allocation, copying output to the host, and releasing resources. The second part works on the GPU and covers Map, Map count, Reduce, Reduce count, group, and GPU memory sum. The implementation uses the application programming interfaces provided by the OpenCL library modules, including application data computation, memory calculation, and preparation of pending application data. Users can thus concentrate on the design of their applications; the framework automatically invokes the OpenCL functions, passes the appropriate parameter values, and coordinates CPU and GPU processing. The main contribution of this thesis is a MapReduce software framework implemented with OpenCL: users can develop cross-platform programs with it, which makes porting much easier, and the provided application programming interfaces fully demonstrate how OpenCL works and its processing flow.
Lin, Jia-Chun, and 林佳純. "Study of Job Execution Performance, Reliability, Energy Consumption, and Fault Tolerance in the MapReduce Framework." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/48439755319437778819.
National Chiao Tung University
Institute of Computer Science and Engineering
103
Node/machine failure is the norm rather than an exception in a large-scale MapReduce cluster. To prevent jobs from being interrupted by machine/node failures, MapReduce employs several policies, such as a task re-execution policy, an intermediate-data replication policy, and a reduce-task assignment policy. However, the impacts of these policies on MapReduce jobs are not clear, especially in terms of Job Completion Reliability (JCR), Job Turnaround Time (JTT), and Job Energy Consumption (JEC). In this dissertation, JCR is the reliability with which a MapReduce job can be completed by a MapReduce cluster, JTT is the time period starting when the job is submitted to the cluster and ending when the job is completed by the cluster, and JEC is the energy consumed by the cluster to complete the job. To achieve a more reliable and energy-efficient computing environment than the current MapReduce infrastructure, it is essential to comprehend the impacts of the above policies. In addition, the MapReduce master servers suffer from a single-point-of-failure problem, which might interrupt MapReduce operations and filesystem services. To study how the above policies influence the performance of MapReduce jobs, we formally derive and analyze the JCR, JTT, and JEC of a MapReduce job under the above-mentioned MapReduce policies. In addition, to mitigate the single-point-of-failure problem and improve the service quality of MapReduce master servers, we propose a hybrid takeover scheme called PAReS (Proactive and Adaptive Redundant System) for MapReduce master servers. The analyses in this dissertation enable MapReduce managers to comprehend the influences of these policies on MapReduce jobs, help them choose appropriate policies for their MapReduce clusters, and allow MapReduce designers to propose better policies. Furthermore, extensive experimental results show that the proposed PAReS system can mitigate the single-point-of-failure problem and improve the service quality of MapReduce master servers compared with current redundant schemes on Hadoop.
Roy, Sukanta. "Automated methods of natural resource mapping with Remote Sensing Big data in Hadoop MapReduce framework." Thesis, 2022. https://etd.iisc.ac.in/handle/2005/5836.
Huu, Tinh Giang Nguyen, and 阮有淨江. "Design and Implement a MapReduce Framework for Converting Standalone Software Packages to Hadoop-based Distributed Environments." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/20649990806109007865.
National Cheng Kung University
Master's and Doctoral Programs, Institute of Manufacturing Information and Systems
101
Hadoop MapReduce is a programming model for designing automatically scalable distributed computing applications; it gives developers an effective environment for achieving automatic parallelization. However, most existing manufacturing systems are arduous and restrictive to migrate to a MapReduce private cloud, owing to platform incompatibility and the tremendous complexity of system reconstruction. To increase the efficiency of manufacturing systems with minimal modification of the existing systems, this thesis designs a framework called MC-Framework: the Multi-users-based Cloudizing-Application Framework. It provides a simple interface through which users can fairly execute requested tasks of traditional standalone software packages in MapReduce-based private cloud environments. Moreover, this thesis focuses on multi-user workloads; the default Hadoop scheduling scheme, FIFO, increases delay under multi-user scenarios, so we also propose a new scheduling mechanism, called Job-Sharing Scheduling, to distribute and fairly share jobs among the machines of the MapReduce-based private cloud. An experimental design is used to verify and analyze the proposed MC-Framework with two case studies: (1) an independent-model system based on stochastic Petri nets, and (2) a dependent-model system based on the virtual-metrology module of a manufacturing system. The results of our experiments indicate that the proposed framework greatly improves time performance compared with the original packages.
Lo, Chia-Huai, and 駱家淮. "Constructing Suffix Array and Longest-Common-Prefix Array for Next-Generation-Sequencing Data Using MapReduce Framework." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/71000795009259045140.
National Taiwan University
Graduate Institute of Computer Science and Information Engineering
103
Next-generation sequencing (NGS) data is rapidly growing and represents a source of a wide variety of new scientific knowledge. State-of-the-art sequencers, such as the HiSeq 2500, can generate up to 1 trillion base pairs of sequencing data in 6 days, with good quality at low cost. In genome sequencing projects today, the NGS data size often ranges from tens of billions to several hundred billion base pairs. Processing such a big set of NGS data is time-consuming, especially for applications based on sequence alignment, e.g., de novo genome assembly and correction of sequencing errors. In the literature, the suffix array, the longest common prefix (LCP) array and the Burrows-Wheeler Transform (BWT) have been proved to be efficient indexes for speeding up many sequence alignment tasks. For example, the all-pairs suffix-prefix matching problem, i.e., finding overlaps of reads to form the overlap graph for sequence assembly, can be solved in linear time by reading these arrays. However, constructing these arrays for NGS data remains challenging due to the huge amount of storage required to hold the suffix array. MapReduce is a promising alternative for tackling the NGS challenge, but the existing MapReduce method for suffix array construction, i.e., RPGI proposed by Menon et al. [1], can only deal with input strings of size no greater than 4G base pairs and does not produce LCPs in its output. In this study, we developed a MapReduce algorithm to construct the suffix and BWT arrays, as well as the LCP array, for NGS data based on the framework of RPGI. In addition, the proposed method supports inputs with more than 4G base pairs and has been developed into new software. To evaluate its performance, we compare the time it takes to process subsets of the giant grouper NGS data set of size 125 Gbp.
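For a small string, the three index structures mentioned above can be built directly; the toy sketch below uses naive quadratic-time sorting for illustration only and is unrelated to the MapReduce construction developed in the thesis:

```java
import java.util.Arrays;
import java.util.Comparator;

public class ToyIndexes {
    // Build suffix array, LCP array and BWT of a text assumed to end with a unique sentinel '$'.
    public static void main(String[] args) {
        String text = "ACGTACGT$";
        int n = text.length();
        Integer[] sa = new Integer[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        Arrays.sort(sa, Comparator.comparing(text::substring));   // sort suffixes lexicographically

        int[] lcp = new int[n];                                    // lcp[i] = LCP of suffixes sa[i-1] and sa[i]
        for (int i = 1; i < n; i++) {
            int a = sa[i - 1], b = sa[i], l = 0;
            while (a + l < n && b + l < n && text.charAt(a + l) == text.charAt(b + l)) l++;
            lcp[i] = l;
        }

        StringBuilder bwt = new StringBuilder();                   // BWT[i] = character preceding suffix sa[i]
        for (int i = 0; i < n; i++)
            bwt.append(text.charAt((sa[i] + n - 1) % n));

        System.out.println("SA  = " + Arrays.toString(sa));
        System.out.println("LCP = " + Arrays.toString(lcp));
        System.out.println("BWT = " + bwt);
    }
}
```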
Chou, Chien-Ting, and 周建廷. "Research on The Computing of Direct Geo Morphology Runoff on Hadoop Cluster by Using MapReduce Framework." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/13575176515358582342.
National Taiwan Normal University
Graduate Institute of Computer Science and Information Engineering
99
Because of the weather and landforms in Taiwan, heavy rain often causes the runoff of a basin to rise suddenly and can even lead to serious disasters, so flood information systems are heavily relied on in Taiwan, especially during typhoon season. Computing the runoff of a basin is the most important module of a flood information system, as it checks whether the runoff exceeds the warning level; however, this module is complicated and data-intensive, and it becomes the bottleneck when real-time information is needed while a typhoon is striking the basins. The applications in this thesis are developed on Apache Hadoop, open-source software that builds a distributed storage and computing environment and allows the distributed processing of large data sets across clusters of computers using the MapReduce programming model. We have developed the basin runoff computing module using the MapReduce framework on a Hadoop cluster; speeding up the runoff computation increases the efficiency of the flood information system. Running our programs on an 18-node Hadoop cluster, we conclude that the execution of the runoff computation can be sped up by 6 times.
Chrimes, Dillon. "Towards a big data analytics platform with Hadoop/MapReduce framework using simulated patient data of a hospital system." Thesis, 2016. http://hdl.handle.net/1828/7645.
Jajoo, Akshay. "EXPLOITING THE SPATIAL DIMENSION OF BIG DATA JOBS FOR EFFICIENT CLUSTER JOB SCHEDULING." Thesis, 2020.
This thesis addresses the two primary challenges of cluster scheduling. First, we propose, validate, and design two complete systems that employ learning algorithms exploiting the spatial dimension. We demonstrate high similarity in runtime properties between sub-entities of the same job through detailed trace analysis of four different industrial cluster traces, identify design challenges, and propose principles for a sampling-based learning system for two examples: first a coflow scheduler, and second a cluster job scheduler.
We also propose, design, and demonstrate the effectiveness of new multi-task scheduling algorithms based on effective synchronization across the spatial dimension. We underline, and validate by experimental analysis, the importance of synchronization between the sub-entities (flows, tasks) of a distributed entity (coflow, data analytics job) for its efficient execution, and we highlight that ignoring sibling sub-entities when scheduling may lead to sub-optimal overall cluster performance. We propose, design, and implement a full coflow scheduler based on these assertions.