Dissertations / Theses on the topic 'Efficient parallel algorithms'

To see the other types of publications on this topic, follow the link: Efficient parallel algorithms.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Select a source type:

Consult the top 50 dissertations / theses for your research on the topic 'Efficient parallel algorithms.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press on it, and we will generate automatically the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as pdf and read online its abstract whenever available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Stadtherr, Hans. "Work efficient parallel scheduling algorithms." [S.l. : s.n.], 1998. http://deposit.ddb.de/cgi-bin/dokserv?idn=962681369.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Huang, Chun-Hsi. "Communication-efficient bulk synchronous parallel algorithms." Buffalo, N.Y. : Dept. of Computer Science, State University of New York at Buffalo, 2001. http://www.cse.buffalo.edu/tech%2Dreports/2001%2D06.ps.Z.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Liang, Weifa, and wliang@cs anu edu au. "Designing Efficient Parallel Algorithms for Graph Problems." The Australian National University. Department of Computer Science, 1997. http://thesis.anu.edu.au./public/adt-ANU20010829.114536.

Full text
Abstract:
Graph algorithms are concerned with the algorithmic aspects of solving graph problems. The problems are motivated from and have application to diverse areas of computer science, engineering and other disciplines. Problems arising from these areas of application are good candidates for parallelization since they often have both intense computational needs and stringent response time requirements. Motivated by these concerns, this thesis investigates parallel algorithms for these kinds of graph problems that have at least one of the following properties: the problems involve some type of dynamic updates; the sparsification technique is applicable; or the problems are closely related to communications network issues. The models of parallel computation used in our studies are the Parallel Random Access Machine (PRAM) model and the practical interconnection network models such as meshes and hypercubes. ¶ Consider a communications network which can be represented by a graph G = (V;E), where V is a set of sites (processors), and E is a set of links which are used to connect the sites (processors). In some cases, we also assign weights and/or directions to the edges in E. Associated with this network, there are many problems such as (i) whether the network is k-edge (k-vertex) connected withfixed k; (ii) whether there are k-edge (k-vertex) disjoint paths between u and v for a pair of given vertices u and v after the network is dynamically updated by adding and/or deleting an edge etc; (iii) whether the sites in the network can communicate with each other when some sites and links fail; (iv) identifying the first k edges in the network whose deletion will result in the maximum increase in the routing cost in the resulting network for fixed k; (v) how to augment the network at optimal cost with a given feasible set of weighted edges such that the augmented network is k-edge (k-vertex) connected; (vi) how to route messages through the network efficiently. In this thesis we answer the problems mentioned above by presenting efficient parallel algorithms to solve them. As far as we know, most of the proposed algorithms are the first ones in the parallel setting. ¶ Even though most of the problems concerned in this thesis are related to communications networks, we also study the classic edge-coloring problem. The outstanding difficulty to solve this problem in parallel is that we do not yet know whether or not it is in NC. In this thesis we present an improved parallel algorithm for the problem which needs [bigcircle]([bigtriangleup][superscript 4.5]log [superscript 3] [bigtriangleup] log n + [bigtriangleup][superscript 4] log [superscript 4] n) time using [bigcircle](n[superscript 2][bigtriangleup] + n[bigtriangleup][superscript 3]) processors, where n is the number of vertices and [bigtriangleup] is the maximum vertex degree. Compared with a previously known result on the same model, we improved by an [bigcircle]([bigtriangleup][superscript 1.5]) factor in time. The non-trivial part is to reduce this problem to the edge-coloring update problem. We also generalize this problem to the approximate edge-coloring problem by giving a faster parallel algorithm for the latter case. ¶ Throughout the design and analysis of parallel graph algorithms, we also find a technique called the sparsification technique is very powerful in the design of efficient sequential and parallel algorithms on dense undirected graphs. We believe that this technique may be useful in its own right for guiding the design of efficient sequential and parallel algorithms for problems in other areas as well as in graph theory.
APA, Harvard, Vancouver, ISO, and other styles
4

Yang, Yong. "Efficient parallel genetic algorithms applied to numerical optimisation." Thesis, Southampton Solent University, 2008. http://ssudl.solent.ac.uk/631/.

Full text
Abstract:
This research is concerned with the optimisation of multi-modal numerical problems using genetic algorithms (GAs). GAs use randomised operators operating over a population of candidate solutions to generate new points in the search space. As the scale and complexity of target applications increase, run time becomes a major inhibitor. Parallel genetic algorithms (PGAs) have therefore become an important area of research. Coarse-grained implementations are one of the most popular models and many researchers are concerned primarily with this area. The island model was the only one class of parallel genetic algorithm on the coarse-grained processing platform. There are indiscriminate overlaps between sub-populations in the island model even if there is no communication between sub-populations. In order to determine whether the removal of the overlaps between sub-populations is beneficial, the partition model based on domain decomposition was motivated and showed that it can offer superior performance on a number of two dimensional test problems. However the partition model has a certain scalability problem. The main contribution of this thesis is to propose and develop an alternative approach, which replicates the beneficial behaviour of the partition model using more scalable techniques. It operates in a similar manner to the island model, but periodically performs a clustering analysis on each sub-population. The clustering analysis is used to identify regions of the search space in which more than one sub-population are sampling. The overlapping clusters are then merged and redistributed amongst sub-populations so that only one sub-population has samples in that region of the search space. It is shown that these overlaps between sub-populations are identified and removed by the clustering analysis without a priori domain knowledge. The performance of the new approach employing the clustering analysis is then compared with the island model and the partition model on a number of non-trivial problems. These experiments show that the new approach is robust, efficient and effective, and removes the scalability problem that prevents the wide application of the partition model.
APA, Harvard, Vancouver, ISO, and other styles
5

Goldberg, Andrew Vladislav. "Efficient graph algorithms for sequential and parallel computers." Thesis, Massachusetts Institute of Technology, 1987. http://hdl.handle.net/1721.1/14912.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1987.
MICROFICHE COPY AVAILABLE IN ARCHIVES AND ENGINEERING.
Bibliography: p. 117-123.
by Andrew Vladislav Goldberg.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
6

He, Xin. "On efficient parallel algorithms for solving graph problems /." The Ohio State University, 1987. http://rave.ohiolink.edu/etdc/view?acc_num=osu1487331541710947.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Chen, Calvin Ching-Yuen. "Efficient Parallel Algorithms and Data Structures Related to Trees." Thesis, University of North Texas, 1991. https://digital.library.unt.edu/ark:/67531/metadc332626/.

Full text
Abstract:
The main contribution of this dissertation proposes a new paradigm, called the parentheses matching paradigm. It claims that this paradigm is well suited for designing efficient parallel algorithms for a broad class of nonnumeric problems. To demonstrate its applicability, we present three cost-optimal parallel algorithms for breadth-first traversal of general trees, sorting a special class of integers, and coloring an interval graph with the minimum number of colors.
APA, Harvard, Vancouver, ISO, and other styles
8

Halverson, Ranette Hudson. "Efficient Linked List Ranking Algorithms and Parentheses Matching as a New Strategy for Parallel Algorithm Design." Thesis, University of North Texas, 1993. https://digital.library.unt.edu/ark:/67531/metadc278153/.

Full text
Abstract:
The goal of a parallel algorithm is to solve a single problem using multiple processors working together and to do so in an efficient manner. In this regard, there is a need to categorize strategies in order to solve broad classes of problems with similar structures and requirements. In this dissertation, two parallel algorithm design strategies are considered: linked list ranking and parentheses matching.
APA, Harvard, Vancouver, ISO, and other styles
9

邱祖淇 and Cho-ki Joe Yau. "Efficient solutions for the load distribution problem." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1999. http://hub.hku.hk/bib/B31222031.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Yau, Cho-ki Joe. "Efficient solutions for the load distribution problem /." Hong Kong : University of Hong Kong, 1999. http://sunzi.lib.hku.hk/hkuto/record.jsp?B20971953.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Bangalore, Lakshminarayana Nagesh. "Efficient graph algorithm execution on data-parallel architectures." Diss., Georgia Institute of Technology, 2014. http://hdl.handle.net/1853/53058.

Full text
Abstract:
Mechanisms for improving the execution efficiency of graph algorithms on Data-Parallel Architectures were proposed and identified. Execution of graph algorithms on GPGPU architectures, the prevalent data-parallel architectures was considered. Irregular and data dependent accesses in graph algorithms were found to cause significant idle cycles in GPGPU cores. A prefetching mechanism that reduced the amount of idle cycles by prefetching a data-dependent access pattern found in graph algorithms was proposed. Storing prefetches in unused spare registers in addition to storing them in the cache was shown to be more effective by the prefetching mechanism. The design of the cache hierarchy for graph algorithms was explored. First, an exclusive cache hierarchy was shown to be beneficial at the cost of increased traffic; a region based exclusive cache hierarchy was shown to be similar in performance to an exclusive cache hierarchy while reducing on-chip traffic. Second, bypassing cache blocks at both the level one and level two caches was shown to be beneficial. Third, the use of fine-grained memory accesses (or cache sub-blocking) was shown to be beneficial. The combination of cache bypassing and fine-grained memory accesses was shown to be more beneficial than applying the two mechanisms individually. Finally, the impact of different implementation strategies on algorithm performance was evaluated for the breadth first search algorithm using different input graphs and heuristics to identify the best performing implementation for a given input graph were also discussed.
APA, Harvard, Vancouver, ISO, and other styles
12

Muhammad, Ali. "Parallel genetic algorithms : an efficient model and applications in control systems." Thesis, Southampton Solent University, 1997. http://ssudl.solent.ac.uk/1266/.

Full text
Abstract:
Optimisation is an adaptiver process and it is widely applied in science and engineering from scheduling a manufacturing process to control of a spacecraft. Genetic algorithms are search and optimisation methods based on the mechanics of natural evolution. they provide promising results in non-linear and complex search problems. They have been proven to be parallel and global in nature. Genetic algorithms run slowly on sequential machines which is their mmmajor drawback. Most of the applications of genetic algorithm in engineering are in the area of design and schedule optimisation, where usually enough time is avialable to simulate the algorithm. The computer architecture is a main bottleneck since the sequential computation does not reflect the true spatial structure of the algorithm. There are a couple of parallel models and implementations available which improve the performance of these algorithms. The aim of this research is to develop a new model and/or to improve existing parallel models for real-time application of these methods in system identification and intelligent control. The desired features of this new model are: it should be independent of the optimisation problem, so it could be able to cope with a black box problem, it could be used in real-time applications, where the exact model of the system is unknown, and it should be implementable within the current technological framework. An extensive study of the current literasture on genetic algorithms has been carried out. A detailed review of the underlying theory of genetic algorithms has also been presented. A parallel model of genetic algorithm has been proposed and implemented on a transputer based system using ANSI C toolset for transputers. It has been tested for different strategies on the traditional suite of optimisation problems, ie DeJong's function and Deceptive functions. The results are compared using the performance measures proposed by DeJong. The performance and efficiency measures of the algorithm have been defined and worked out for the simulation results. The research has advanced the understanding of genetic algorithms as stochastic processes. A Markov chain based mathematical model has been developed. An informal study of convergence properties of the algorithm are presented from different points of view, ie time series, real analysis, Markov chains and metric topology. Gradient like information has been integrated into the genetic search in order to improve the performance and efficiency of the algorithm. A novel directional search method has been developed tested on the same set of problems and comapred using the same performance and efficiency measures as those reported in the recent publications. Unlike neural networks and fuzzy systems, genetic algorithms do not provide any general logic for system modelliung. Therefore system identification is achieved by using the fuzzy network for general logic and a genetic algorithm for parameter estimation giving as a result an evolving fuzzy network. This novel method has been applied to modelling of chaotic time series and it has been used to control a highly non-linear system, ie inverse pendulum. It is expected that with the advance of re-configurable electronics, evolutionary chips will be realised in the near future. They will play an important role in the development of genetic algorithms based control systems
APA, Harvard, Vancouver, ISO, and other styles
13

Gupta, Sandeep K. S. "Synthesizing communication-efficient distributed memory parallel programs for block recursive algorithms /." The Ohio State University, 1995. http://rave.ohiolink.edu/etdc/view?acc_num=osu1487861796820607.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Bischof, Stefan. "Efficient algorithms for on-line scheduling and load distribution in parallel systems." [S.l. : s.n.], 1999. http://deposit.ddb.de/cgi-bin/dokserv?idn=959771409.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Banks, Nicola E. "Insights from the parallel implementation of efficient algorithms for the fractional calculus." Thesis, University of Chester, 2015. http://hdl.handle.net/10034/613841.

Full text
Abstract:
This thesis concerns the development of parallel algorithms to solve fractional differential equations using a numerical approach. The methodology adopted is to adapt existing numerical schemes and to develop prototype parallel programs using the MatLab Parallel Computing Toolbox (MPCT). The approach is to build on existing insights from parallel implementation of ordinary differential equations methods and to test a range of potential candidates for parallel implementation in the fractional case. As a consequence of the work, new insights on the use of MPCT for prototyping are presented, alongside conclusions and algorithms for the effective implementation of parallel methods for the fractional calculus. The principal parallel approaches considered in the work include: - A Runge-Kutta Method for Ordinary Differential Equations including the application of an adapted Richardson Extrapolation Scheme - An implementation of the Diethelm-Chern Algorithm for Fractional Differential Equations - A parallel version of the well-established Fractional Adams Method for Fractional Differential Equations - The adaptation for parallel implementation of Lubich's Fractional Multistep Method for Fractional Differential Equations An important aspect of the work is an improved understanding of the comparative diffi culty of using MPCT for obtaining fair comparisons of parallel implementation. We present details of experimental results which are not satisfactory, and we explain how the problems may be overcome to give meaningful experimental results. Therefore, an important aspect of the conclusions of this work is the advice for other users of MPCT who may be planning to use the package as a prototyping tool for parallel algorithm development: by understanding how implicit multithreading operates, controls can be put in place to allow like-for-like performance comparisons between sequential and parallel programs.
APA, Harvard, Vancouver, ISO, and other styles
16

Ramasamy, Kandasamy Manimozhian. "Efficient state space exploration for parallel test generation." Thesis, [Austin, Tex. : University of Texas, 2009. http://hdl.handle.net/2152/ETD-UT-2009-05-131.

Full text
Abstract:
Report (M.S. in Engineering)--University of Texas at Austin, 2009.
Title from PDF title page (University of Texas Digital Repository, viewed on August 10, 2009). Includes bibliographical references.
APA, Harvard, Vancouver, ISO, and other styles
17

Nelissen, Geoffrey. "Efficient optimal multiprocessor scheduling algorithms for real-time systems." Doctoral thesis, Universite Libre de Bruxelles, 2013. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/209528.

Full text
Abstract:
Real-time systems are composed of a set of tasks that must respect some deadlines. We find them in applications as diversified as the telecommunications, medical devices, cars, planes, satellites, military applications, etc. Missing deadlines in a real-time system may cause various results such as a diminution of the quality of service provided by the system, the complete stop of the application or even the death of people. Being able to prove the correct operation of such systems is therefore primordial. This is the goal of the real-time scheduling theory.

These last years, we have witnessed a paradigm shift in the computing platform architectures. Uniprocessor platforms have given place to multiprocessor architectures. While the real-time scheduling theory can be considered as being mature for uniprocessor systems, it is still an evolving research field for multiprocessor architectures. One of the main difficulties with multiprocessor platforms, is to provide an optimal scheduling algorithm (i.e. scheduling algorithm that constructs a schedule respecting all the task deadlines for any task set for which a solution exists). Although optimal multiprocessor real-time scheduling algorithms exist, they usually cause an excessive number of task preemptions and migrations during the schedule. These preemptions and migrations cause overheads that must be added to the task execution times. Therefore, task sets that would have been schedulable if preemptions and migrations had no cost, become unschedulable in practice. An efficient scheduling algorithm is therefore an algorithm that either minimize the number of preemptions and migrations, or reduce their cost.

In this dissertation, we expose the following results:

- We show that reducing the "fairness" in the schedule, advantageously impacts the number of preemptions and migrations. Hence, all the scheduling algorithms that will be proposed in this thesis, tend to reduce or even suppress the fairness in the computed schedule.

- We propose three new online scheduling algorithms. One of them --- namely, BF2 --- is optimal for the scheduling of sporadic tasks in discrete-time environments, and reduces the number of task preemptions and migrations in comparison with the state-of-the-art in discrete-time systems. The second one is optimal for the scheduling of periodic tasks in a continuous-time environment. Because this second algorithm is based on a semi-partitioned scheme, it should favorably impact the preemption overheads. The third algorithm --- named U-EDF --- is optimal for the scheduling of sporadic and dynamic task sets in a continuous-time environment. It is the first real-time scheduling algorithm which is not based on the notion of "fairness" and nevertheless remains optimal for the scheduling of sporadic (and dynamic) systems. This important result was achieved by extending the uniprocessor algorithm EDF to the multiprocessor scheduling problem.

- Because the coding techniques are also evolving as the degree of parallelism increases in computing platforms, we provide solutions enabling the scheduling of parallel tasks with the currently existing scheduling algorithms, which were initially designed for the scheduling of sequential independent tasks.
Doctorat en Sciences de l'ingénieur
info:eu-repo/semantics/nonPublished

APA, Harvard, Vancouver, ISO, and other styles
18

Witkowski, Thomas. "Software concepts and algorithms for an efficient and scalable parallel finite element method." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2014. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-141651.

Full text
Abstract:
Software packages for the numerical solution of partial differential equations (PDEs) using the finite element method are important in different fields of research. The basic data structures and algorithms change in time, as the user\'s requirements are growing and the software must efficiently use the newest highly parallel computing systems. This is the central point of this work. To make efficiently use of parallel computing systems with growing number of independent basic computing units, i.e.~CPUs, we have to combine data structures and algorithms from different areas of mathematics and computer science. Two crucial parts are a distributed mesh and parallel solver for linear systems of equations. For both there exists multiple independent approaches. In this work we argue that it is necessary to combine both of them to allow for an efficient and scalable implementation of the finite element method. First, we present concepts, data structures and algorithms for distributed meshes, which allow for local refinement. The central point of our presentation is to provide arbitrary geometrical information of the mesh and its distribution to the linear solver. A large part of the overall computing time of the finite element method is spend by the linear solver. Thus, its parallelization is of major importance. Based on the presented concept for distributed meshes, we preset several different linear solver methods. Hereby we concentrate on general purpose linear solver, which makes only little assumptions about the systems to be solver. For this, a new FETI-DP (Finite Element Tearing and Interconnect - Dual Primal) method is proposed. Those the standard FETI-DP method is quasi optimal from a mathematical point of view, its not possible to implement it efficiently for a large number of processors (> 10,000). The main reason is a relatively small but globally distributed coarse mesh problem. To circumvent this problem, we propose a new multilevel FETI-DP method which hierarchically decompose the coarse grid problem. This leads to a more local communication pattern for solver the coarse grid problem and makes it possible to scale for a large number of processors. Besides the parallelization of the finite element method, we discuss an approach to speed up serial computations of existing finite element packages. In many computations the PDE to be solved consists of more than one variable. This is especially the case in multi-physics modeling. Observation show that in many of these computation the solution structure of the variables is different. But in the standard finite element method, only one mesh is used for the discretization of all variables. We present a multi-mesh finite element method, which allows to discretize a system of PDEs with two independently refined meshes
Softwarepakete zur numerischen Lösung partieller Differentialgleichungen mit Hilfe der Finiten-Element-Methode sind in vielen Forschungsbereichen ein wichtiges Werkzeug. Die dahinter stehenden Datenstrukturen und Algorithmen unterliegen einer ständigen Neuentwicklung um den immer weiter steigenden Anforderungen der Nutzergemeinde gerecht zu werden und um neue, hochgradig parallel Rechnerarchitekturen effizient nutzen zu können. Dies ist auch der Kernpunkt dieser Arbeit. Um parallel Rechnerarchitekturen mit einer immer höher werdenden Anzahl an von einander unabhängigen Recheneinheiten, z.B.~Prozessoren, effizient Nutzen zu können, müssen Datenstrukturen und Algorithmen aus verschiedenen Teilgebieten der Mathematik und Informatik entwickelt und miteinander kombiniert werden. Im Kern sind dies zwei Bereiche: verteilte Gitter und parallele Löser für lineare Gleichungssysteme. Für jedes der beiden Teilgebiete existieren unabhängig voneinander zahlreiche Ansätze. In dieser Arbeit wird argumentiert, dass für hochskalierbare Anwendungen der Finiten-Elemente-Methode nur eine Kombination beider Teilgebiete und die Verknüpfung der darunter liegenden Datenstrukturen eine effiziente und skalierbare Implementierung ermöglicht. Zuerst stellen wir Konzepte vor, die parallele verteile Gitter mit entsprechenden Adaptionstrategien ermöglichen. Zentraler Punkt ist hier die Informationsaufbereitung für beliebige Löser linearer Gleichungssysteme. Beim Lösen partieller Differentialgleichung mit der Finiten Elemente Methode wird ein großer Teil der Rechenzeit für das Lösen der dabei anfallenden linearen Gleichungssysteme aufgebracht. Daher ist deren Parallelisierung von zentraler Bedeutung. Basierend auf dem vorgestelltem Konzept für verteilten Gitter, welches beliebige geometrische Informationen für die linearen Löser aufbereiten kann, präsentieren wir mehrere unterschiedliche Lösermethoden. Besonders Gewicht wird dabei auf allgemeine Löser gelegt, die möglichst wenig Annahmen über das zu lösende System machen. Hierfür wird die FETI-DP (Finite Element Tearing and Interconnect - Dual Primal) Methode weiterentwickelt. Obwohl die FETI-DP Methode vom mathematischen Standpunkt her als quasi-optimal bezüglich der parallelen Skalierbarkeit gilt, kann sie für große Anzahl an Prozessoren (> 10.000) nicht mehr effizient implementiert werden. Dies liegt hauptsächlich an einem verhältnismäßig kleinem aber global verteilten Grobgitterproblem. Wir stellen eine Multilevel FETI-DP Methode vor, die dieses Problem durch eine hierarchische Komposition des Grobgitterproblems löst. Dadurch wird die Kommunikation entlang des Grobgitterproblems lokalisiert und die Skalierbarkeit der FETI-DP Methode auch für große Anzahl an Prozessoren sichergestellt. Neben der Parallelisierung der Finiten-Elemente-Methode beschäftigen wir uns in dieser Arbeit mit der Ausnutzung von bestimmten Voraussetzung um auch die sequentielle Effizienz bestehender Implementierung der Finiten-Elemente-Methode zu steigern. In vielen Fällen müssen partielle Differentialgleichungen mit mehreren Variablen gelöst werden. Sehr häufig ist dabei zu beobachten, insbesondere bei der Modellierung mehrere miteinander gekoppelter physikalischer Phänomene, dass die Lösungsstruktur der unterschiedlichen Variablen entweder schwach oder vollständig voneinander entkoppelt ist. In den meisten Implementierungen wird dabei nur ein Gitter zur Diskretisierung aller Variablen des Systems genutzt. Wir stellen eine Finite-Elemente-Methode vor, bei der zwei unabhängig voneinander verfeinerte Gitter genutzt werden können um ein System partieller Differentialgleichungen zu lösen
APA, Harvard, Vancouver, ISO, and other styles
19

Melot, Nicolas. "Algorithms and Framework for Energy Efficient Parallel Stream Computing on Many-Core Architectures." Doctoral thesis, Linköpings universitet, Programvara och system, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-132308.

Full text
Abstract:
The rise of many-core processor architectures in the market answers to a constantly growing need of processing power to solve more and more challenging problems such as the ones in computing for big data. Fast computation is more and more limited by the very high power required and the management of the considerable heat produced. Many programming models compete to take profit of many-core architectures to improve both execution speed and energy consumption, each with their advantages and drawbacks. The work described in this thesis is based on the dataflow computing approach and investigates the benefits of a carefully pipelined execution of streaming applications, focusing in particular on off- and on-chip memory accesses. As case study, we implement classic and on-chip pipelined versions of mergesort for Intel SCC and Xeon. We see how the benefits of the on-chip pipelining technique are bounded by the underlying architecture, and we explore the problem of fine tuning streaming applications for many-core architectures to optimize for energy given a throughput budget. We propose a novel methodology to compute schedules optimized for energy efficiency given a fixed throughput target. We introduce \emph{Drake}, derived from Schedeval, a tool that generates pipelined applications for Many-Core architectures and allows the performance testing in time or energy of their static schedule. We show that streaming applications based on Drake compete with specialized implementations and we use Schedeval to demonstrate performance differences between schedules that are otherwise considered as equivalent by a simple model.

This thesis has also been funded by CUGS, Graduate School in Computer Science and FP7 EXCESS.

The electronic version has been corrected. See the published errata list.

APA, Harvard, Vancouver, ISO, and other styles
20

Ojiaku, Jude-Thaddeus. "A study of time and energy efficient algorithms for parallel and heterogeneous computing." Thesis, University of Liverpool, 2016. http://livrepository.liverpool.ac.uk/3004892/.

Full text
Abstract:
This PhD project is motivated by the need to develop and achieve better and energy efficient computing through the use of parallelism and heterogeneous systems. Our contribution consists of both theoretical aspects, as well as in-depth and comprehensive empirical studies that aim to provide more insight into parallel and heterogeneous computing. Our first problem is a theoretical problem that focuses on the scheduling of a special category of jobs known as deteriorating jobs. These kind of jobs will require more effort to complete them if postponed to a later time. They are intended to model several industrial processes including steel production, fire-fighting and financial management. We study the problem in the context of parallel machine scheduling in an online setting where jobs have arbitrary release times. Our main results show that List Scheduling is $(1+b_{max})$-competitive and that no deterministic algorithm is better than $(1+b_{max})^{1-\frac{1}{m}}$, where $b_{max}$ is the largest deteriorating rate. We also extend our results to online deterministic algorithms and show that no deterministic online algorithm is better than $(1+b_{max})$-competitive. Our next study concerns the scheduling of $n$ jobs with precedence constraints on $m$ parallel machines. We are interested in the precedence constraint known as chain precedence constraint where each job can have at most one predecessor and at most one successor. The jobs are modelled as directed acyclic graphs where nodes represent the jobs and edges represent the precedence constraints between jobs. The jobs have a strict deadline that must be met. The parallel machines are considered to be unrelated and a communication network connects each pair of machines. Execution of the jobs on the machines as well as communication across the network incurs costs in the form of time and energy. These costs are given by cost matrices that covers processing and communication. The goal is to construct a feasible schedule that minimizes the total energy required to execute the chain of jobs on the machines, such that all deadlines are met. We present a dynamic programming solution to the problem that leads to a pseudo polynomial time algorithm with running time $O(nm^2d_{max})$, where $d_{max}$ is the largest deadline. We show that the algorithm computes an optimal schedule where one exists. We then proceed to a similar problem that involves the scheduling of jobs to minimize flow time plus energy. This problem is based on a dynamic speed scaling heuristic in literature that is able to adjust the speed of a processor based on the number of \emph{active jobs}, called AJC. We present a comprehensive empirical study that consists of several job selection, speed selection and processor allocation heuristics. We also consider both single processor and multi processor settings. Our main goal is to investigate the viability of designing a fixed-speed counterpart for AJC, that is not as computationally intensive as AJC, while being very simple. We also evaluate the performance of this fixed speed heuristic and compare it with that of AJC. Our fourth and final study involves the use of graphics processing unit (GPU) as an accelerator for compute intensive tasks. The GPU has become a very popular multi processor for heterogeneous computing both from an economical point of view and performance standpoint. Firstly, we contribute to the development of a Bioinformatics tool, called GapsMis, by implementing a heterogeneous version that uses graphics processors for acceleration. GapsMis is a tool designed for the alignment of sequences, like protein and DNA sequences, and allows for the insertion of gaps in the alignment. Then we present a case study that aims to highlight the various aspects, including benefits and challenges, involved in developing heterogeneous applications that is vendor-agnostic. In order to do this we select four algorithms as case studies including GapsMis and the algorithm presented in our second problem. The other two algorithms are based on the Velocity-Verlet integration and the Fruchterman-Reingold force-based method for graph layout. We make use of the Open Computing Language (OpenCL) and C++ for implementation of the algorithms on a range of graphics processors from Advanced Micro Devices (AMD) and NVIDIA Corporation. We evaluate several factors that can affect performance of these applications on each hardware. We also compare the performance of our algorithms in a multi-GPU setting and against single and multi-core CPU implementations. Furthermore, several metrics are defined to capture several aspects of performance including execution time of application kernel(s), execution time of application including communication times, throughput, power and energy consumption.
APA, Harvard, Vancouver, ISO, and other styles
21

Leung, Kwong-Keung, and 梁光強. "Fast and efficient video coding based on communication and computationscheduling on multiprocessors." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2001. http://hub.hku.hk/bib/B29750945.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Mor, Stefano Drimon Kurz. "Analysis of synchronizations in greedy-scheduled executions and applications to efficient generation of pseudorandom numbers in parallel." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2015. http://hdl.handle.net/10183/130529.

Full text
Abstract:
Nous présentons deux contributions dans le domaine de la programmation parallèle. La première est théorique : nous introduisons l’analyse SIPS, une approche nouvelle pour dénombrer le nombre d’opérations de synchronisation durant l’exécution d’un algorithme parallèle ordonnancé par vol de travail. Basée sur le concept d’horloges logiques, elle nous permet : d’une part de donner de nouvelles majorations de coût en moyenne; d’autre part de concevoir des programmes parallèles plus efficaces par adaptation dynamique de la granularité. La seconde contribution est pragmatique : nous présentons une parallélisation générique d’algorithmes pour la génération déterministe de nombres pseudo-aléatoires, indépendamment du nombre de processus concurrents lors de l’exécution. Alternative à l’utilisation d’un générateur pseudo-aléatoire séquentiel par processus, nous introduisons une API générique, appelée Par-R qui est conçue et analysée grâce à SIPS. Sa caractéristique principale est d’exploiter un générateur séquentiel qui peut “sauter” directement d’un nombre à un autre situé à une distance arbitraire dans la séquence pseudo-aléatoire. Grâce à l’analyse SIPS, nous montrons qu’en moyenne, lors d’une exécution par vol de travail d’un programme très parallèle (dont la profondeur ou chemin critique est très petite devant le travail ou nombre d’opérations), ces opérations de saut sont rares. Par-R est comparé au générateur pseudo-aléatoire DotMix écrit pour Cilk Plus, une extension de C/C++ pour la programmation parallèle par vol de travail. Le surcout théorique de Par-R se compare favorablement au surcoput de DotMix, ce qui apparait aussi expériemntalement. De plus, étant générique, Par-R est indépendant du générateur séquentiel sous-jacent.
Nós apresentamos duas contribuições para a área de programação paralela. A primeira contribuição é teórica: nós introduzimos a análise SIPS, uma nova abordagem para a estimar o número de sincronizações realizadas durante a execução de um algoritmo paralelo. SIPS generaliza o conceito de relógios lógicos para contar o número de sincronizações realizadas por um algoritmo paralelo e é capaz de calcular limites do pior caso mesmo na presença de execuções paralelas não-determinísticas, as quais não são geralmente cobertas por análises no estado-da-arte. Nossa análise nos permite estimar novos limites de pior caso para computações escalonadas pelo popular algoritmo de roubo de tarefas e também projetar programas paralelos e adaptáveis que são mais eficientes. A segunda contribuição é pragmática: nós apresentamos uma estratégia de paralelização eficiente para a geração de números pseudoaleatórios. Como uma alternativa para implementações fixas de componentes de geração aleatória nós introduzimos uma API chamada Par-R, projetada e analisada utilizando-se SIPS. Sua principal idea é o uso da capacidade de um gerador sequencial R de realizar um “pulo” eficiente dentro do fluxo de números gerados; nós os associamos a operações realizadas pelo escalonador por roubo de tarefas, o qual nossa análise baseada em SIPS demonstra ocorrer raramente em média. Par-R é comparado com o gerador paralelo de números pseudoaleatórios DotMix, escrito para a plataforma de multithreading dinâmico Cilk Plus. A latência de Par-R tem comparação favorável à latência do DotMix, o que é confirmado experimentalmente, mas não requer o uso subjacente fixado de um dado gerador aleatório.
We present two contributions to the field of parallel programming. The first contribution is theoretical: we introduce SIPS analysis, a novel approach to estimate the number of synchronizations performed during the execution of a parallel algorithm. Based on the concept of logical clocks, it allows us: on one hand, to deliver new bounds for the number of synchronizations, in expectation; on the other hand, to design more efficient parallel programs by dynamic adaptation of the granularity. The second contribution is pragmatic: we present an efficient parallelization strategy for pseudorandom number generation, independent of the number of concurrent processes participating in a computation. As an alternative to the use of one sequential generator per process, we introduce a generic API called Par-R, which is designed and analyzed using SIPS. Its main characteristic is the use of a sequential generator that can perform a “jump-ahead” directly from one number to another on an arbitrary distance within the pseudorandom sequence. Thanks to SIPS, we show that, in expectation, within an execution scheduled by work stealing of a “very parallel” program (whose depth or critical path is subtle when compared to the work or number of operations), these operations are rare. Par-R is compared with the parallel pseudorandom number generator DotMix, written for the Cilk Plus dynamic multithreading platform. The theoretical overhead of Par-R compares favorably to DotMix’s overhead, what is confirmed experimentally, while not requiring a fixed generator underneath.
APA, Harvard, Vancouver, ISO, and other styles
23

Schwambach, Vítor. "Methods and tools for rapid and efficient parallel implementation of computer vision algorithms on embedded multiprocessors." Thesis, Université Grenoble Alpes (ComUE), 2016. http://www.theses.fr/2016GREAM022/document.

Full text
Abstract:
Les applications de vision par ordinateur embarquées demandent une forte capacité decalcul et poussent le développement des systèmes multi- et many-cores spécifiques à l’application. Les choix au départ de la conception du système peuvent impacter sa performance parallèle finale – parmi lesquelles la granularité de la parallélisation, le nombre de processeurs et l’équilibre entre calculs et l’acheminement des données. L’impact de ces choix est difficile à estimer dans les phases initiales de conception et il y a peu d’outils et méthodes pour aider les concepteurs dans cette tâche. Les contributions de cette thèse consistent en deux méthodes et les outils associés qui visent à faciliter la sélection des paramètres architecturaux d’un multiprocesseur embarqué et les stratégies de parallélisation des applications de vision embarquée. La première est une méthode d’exploration de l’espace de conception qui repose sur Parana, un outil fournissant une estimation rapide et précise de la performance parallèle. Parana permet l’évaluation de différents scénarios de parallélisation et peut déterminer la limite maximale de performance atteignable. La seconde contribution est une méthode pour l’optimisation du dimensionnement des tuiles d’images 2D utilisant la programmation par contraintes dans l’outil Tilana. La méthode proposée intègre pour plus de précision des facteurs non-linéaires comme les temps des transferts DMA et les surcoûts de l’ordonnancement parallèle
Embedded computer vision applications demand high system computational power and constitute one of the key drivers for application-specific multi- and many-core systems. A number of early system design choices can impact the system’s parallel performance – among which the parallel granularity, the number of processors and the balance between computation and communication. Their impact in the final system performance is difficult to assess in early design stages and there is a lack for tools that support designers in this task. The contributions of this thesis consist in two methods and associated tools that facilitate the selection of embedded multiprocessor’s architectural parameters and computer vision application parallelization strategies. The first consists of a Design Space Exploration (DSE) methodology that relies on Parana, a fast and accurate parallel performance estimation tool. Parana enables the evaluation of what-if parallelization scenarios and can determine their maximum achievable performance limits. The second contribution consists of a method for optimal 2D image tile sizing using constraint programming within the Tilana tool. The proposed method integrates non-linear DMA data transfer times and parallel scheduling overheads for increased accuracy
APA, Harvard, Vancouver, ISO, and other styles
24

Leung, Kwong-Keung. "Fast and efficient video coding based on communication and computation scheduling on multiprocessors." Hong Kong : University of Hong Kong, 2001. http://sunzi.lib.hku.hk/hkuto/record.jsp?B23272867.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Kosowski, Adrian. "Time and Space-Efficient Algorithms for Mobile Agents in an Anonymous Network." Habilitation à diriger des recherches, Université Sciences et Technologies - Bordeaux I, 2013. http://tel.archives-ouvertes.fr/tel-00867765.

Full text
Abstract:
Computing with mobile agents is rapidly becoming a topic of mainstream research in the theory of distributed computing. The main research questions undertaken in this study concern the feasibility of solving fundamental tasks in an anonymous network, subject to limitations on the resources available to the agent. The considered challenges include: exploring a graph by means of an agent with limited memory, discovery of the network topology, and attempting to meet with another agent in another network (rendezvous). The constraints imposed on the agent include the number of moves which the agent is allowed to perform in the network, the amount of state memory available to the agent, the ability of the agent to communicate with other agents, as well as its a priori knowledge of the network topology or of global parameters.
APA, Harvard, Vancouver, ISO, and other styles
26

Zönnchen, Benedikt Sebastian [Verfasser], Hans-Joachim [Akademischer Betreuer] Bungartz, Hans-Joachim [Gutachter] Bungartz, and Gerta [Gutachter] Köster. "Efficient parallel algorithms for large-scale pedestrian simulation / Benedikt Sebastian Zönnchen ; Gutachter: Hans-Joachim Bungartz, Gerta Köster ; Betreuer: Hans-Joachim Bungartz." München : Universitätsbibliothek der TU München, 2021. http://d-nb.info/1237048850/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
27

Chenini, Hanen. "A rapid design methodology for generating of parallel image processing applications and parallel architectures for smart camera." Thesis, Clermont-Ferrand 2, 2014. http://www.theses.fr/2014CLF22459.

Full text
Abstract:
Dû à la complexité des algorithmes de traitement d’images récents et dans le but d'accélérer la procédure de la conception des MPSoCs, méthodologies de prototypage rapide sont nécessaires pour fournir différents choix pour le programmeur de générer des programmes parallèles efficaces. Ce manuscrit présente les travaux menés pour proposer une méthodologie de prototypage rapide permettant la conception des architectures MPSOC ainsi que la génération automatique de système matériel / logiciel dédié un circuit reprogrammable (FPGA). Pour faciliter la programmation parallèle, l'approche MPSoC proposée est basée sur l’utilisation de Framework « CubeGen » qui permet la génération des différentes solutions envisageables pour réaliser des prototypes dans le domaine du traitement d’image. Ce document décrit une méthode basée sur le concept des squelettes générés en fonction des caractéristiques d'application afin d'exploiter tous les types de parallélisme des algorithmes réels. Un ensemble d’expérimentations utilisant des algorithmes courants permet d’évaluer les performances du flot de conception proposé équivalente à une architecture basé des processeurs hardcore et les solutions traditionnels basé sur cibles ASIC
Due to the complexity of image processing algorithms and the restrictions imposed by MPSoC designs to reach their full potentials, automatic design methodologies are needed to provide guidance for the programmer to generate efficient parallel programs. In this dissertation, we present a MPSoC-based design methodology solution supporting automatic design space exploration, automatic performance evaluation, as well as automatic hardware/software system generation. To facilitate the parallel programming, the presented MPSoC approach is based on a CubeGen framework that permits the expression of different scenarios for architecture and algorithmic design exploring to reach the desired level of performance, resulting in short time development. The generated design could be implemented in a FPGA technology with an expected improvement in application performance and power consumption. Starting from the application, we have evolved our effective methodology to provide several parameterizable algorithmic skeletons in the face of varying application characteristics to exploit all types of parallelism of the real algorithms. Implementing such applications on our parallel embedded system shows that our advanced methods achieve increased efficiency with respect to the computational and communication requirements. The experimental results demonstrate that the designed multiprocessing architecture can be programmed efficiently and also can have an equivalent performance to a more powerful designs based hard-core processors and better than traditional ASIC solutions which are too slow and too expensive
APA, Harvard, Vancouver, ISO, and other styles
28

Witkowski, Thomas [Verfasser], Axel [Akademischer Betreuer] Voigt, and Wolfgang [Akademischer Betreuer] Nagel. "Software concepts and algorithms for an efficient and scalable parallel finite element method / Thomas Witkowski. Gutachter: Axel Voigt ; Wolfgang Nagel. Betreuer: Axel Voigt." Dresden : Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2014. http://d-nb.info/1068446455/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Morrison, Adrian Franklin. "An Efficient Method for Computing Excited State Properties of Extended Molecular Aggregates Based on an Ab-Initio Exciton Model." The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1509730158943602.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Klein, Philip N. (Philip Nathan). "An efficient parallel algorithm for planarity." Thesis, Massachusetts Institute of Technology, 1986. http://hdl.handle.net/1721.1/34303.

Full text
Abstract:
Thesis (M.S.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1986.
MICROFICHE COPY AVAILABLE IN ARCHIVES AND ENGINEERING
Bibliography: leaves 56-57.
by Philip Nathan Klein.
M.S.
APA, Harvard, Vancouver, ISO, and other styles
31

Jain, Ankita. "Efficient Parallel Algorithm for Overlaying Surface Meshes." Thesis, Georgia Institute of Technology, 2007. http://hdl.handle.net/1853/16315.

Full text
Abstract:
Many computational applications involve multiple physical components, each having its own computational domain discretized by a mesh. An integrated simulation of these physical systems require transferring data across these boundaries, which are typically represented by surface meshes composed of triangles or quadrilaterals and are non-matching with differing connectivities and geometry. It is necessary to constructa common refinement (or common tessellation) of the surface meshes to transfer data between different domains accurately and conservatively. For large-scale problems that involve moving boundary, the common tessellation must be updated frequently within the integrated simulations running on parallel computers. Previously, Jiao and Heath developed an algorithm for constructing a common tessellation by overlaying the surface meshes. The original algorithm is efficient and robust, but unfortunately, it is complex and difficult to parallelize. In this thesis, we develop a modified algorithm for overlaying surface meshes. Our algorithm employs a high-level primitive, face-intersection, which combines the low-level point-projection and edge-intersection primitives of the original algorithm. A main advantage of our modified algorithm is its ease of implementation and parallelization. Our implementation utilizes flexible data structures for efficient computation and query of the common tessellation and avoids potential redundancy in computations to achieve high efficiency. To achieve robustness, we pay special attention to avoid potential topological inconsistencies due to numerical errors, and introduce a preprocessing step to project a far-apart surface mesh onto other before computing the common tessellation. We present numerical examples to demonstrate the robustness and efficiency of our method on parallel computers.
APA, Harvard, Vancouver, ISO, and other styles
32

Lu, Xin. "An efficient parallel optimization algorithm for the token bucket control mechanism." Thesis, University of Ottawa (Canada), 2004. http://hdl.handle.net/10393/26704.

Full text
Abstract:
The Token Bucket algorithm, one of the most widely used control mechanism nowadays, has been widely studied to ensure the QoS needs of various applications. However, one main drawback of current models of this algorithm is that most of them have focused on a single Token Bucket system. In this thesis, based on previous research efforts, we propose a parallel solution to the multiple Token Bucket model. We also develop a Reduced Memory Algorithm to decrease the algorithm's memory requirements at the cost of extra computation time. We test our parallel processing algorithm using two sets of traces. Our numerical results show that the model can effectively solve the multiple Token Bucket problems. Besides showing the benefits of using a parallel processing platform, our results also provide us with the guidelines to configure the parallel processing platform.
APA, Harvard, Vancouver, ISO, and other styles
33

Rawald, Tobias. "Scalable and Efficient Analysis of Large High-Dimensional Data Sets in the Context of Recurrence Analysis." Doctoral thesis, Humboldt-Universität zu Berlin, 2018. http://dx.doi.org/10.18452/18797.

Full text
Abstract:
Die Recurrence Quantification Analysis (RQA) ist eine Methode aus der nicht-linearen Zeitreihenanalyse. Im Mittelpunkt dieser Methode steht die Auswertung des Inhalts sogenannter Rekurrenzmatrizen. Bestehende Berechnungsansätze zur Durchführung der RQA können entweder nur Zeitreihen bis zu einer bestimmten Länge verarbeiten oder benötigen viel Zeit zur Analyse von sehr langen Zeitreihen. Diese Dissertation stellt die sogenannte skalierbare Rekurrenzanalyse (SRA) vor. Sie ist ein neuartiger Berechnungsansatz, der eine gegebene Rekurrenzmatrix in mehrere Submatrizen unterteilt. Jede Submatrix wird von einem Berechnungsgerät in massiv-paralleler Art und Weise untersucht. Dieser Ansatz wird unter Verwendung der OpenCL-Schnittstelle umgesetzt. Anhand mehrerer Experimente wird demonstriert, dass SRA massive Leistungssteigerungen im Vergleich zu existierenden Berechnungsansätzen insbesondere durch den Einsatz von Grafikkarten ermöglicht. Die Dissertation enthält eine ausführliche Evaluation, die den Einfluss der Anwendung mehrerer Datenbankkonzepte, wie z.B. die Repräsentation der Eingangsdaten, auf die RQA-Verarbeitungskette analysiert. Es wird untersucht, inwiefern unterschiedliche Ausprägungen dieser Konzepte Einfluss auf die Effizienz der Analyse auf verschiedenen Berechnungsgeräten haben. Abschließend wird ein automatischer Optimierungsansatz vorgestellt, der performante RQA-Implementierungen für ein gegebenes Analyseszenario in Kombination mit einer Hardware-Plattform dynamisch bestimmt. Neben anderen Aspekten werden drastische Effizienzgewinne durch den Einsatz des Optimierungsansatzes aufgezeigt.
Recurrence quantification analysis (RQA) is a method from nonlinear time series analysis. It relies on the identification of line structures within so-called recurrence matrices and comprises a set of scalar measures. Existing computing approaches to RQA are either not capable of processing recurrence matrices exceeding a certain size or suffer from long runtimes considering time series that contain hundreds of thousands of data points. This thesis introduces scalable recurrence analysis (SRA), which is an alternative computing approach that subdivides a recurrence matrix into multiple sub matrices. Each sub matrix is processed individually in a massively parallel manner by a single compute device. This is implemented exemplarily using the OpenCL framework. It is shown that this approach delivers considerable performance improvements in comparison to state-of-the-art RQA software by exploiting the computing capabilities of many-core hardware architectures, in particular graphics cards. The usage of OpenCL allows to execute identical SRA implementations on a variety of hardware platforms having different architectural properties. An extensive evaluation analyses the impact of applying concepts from database technology, such memory storage layouts, to the RQA processing pipeline. It is investigated how different realisations of these concepts affect the performance of the computations on different types of compute devices. Finally, an approach based on automatic performance tuning is introduced that automatically selects well-performing RQA implementations for a given analytical scenario on specific computing hardware. Among others, it is demonstrated that the customised auto-tuning approach allows to considerably increase the efficiency of the processing by adapting the implementation selection.
APA, Harvard, Vancouver, ISO, and other styles
34

McLaughlin, Adam. "Mapping parallel graph algorithms to throughput-oriented architectures." Diss., Georgia Institute of Technology, 2015. http://hdl.handle.net/1853/54374.

Full text
Abstract:
The stagnant performance of single core processors, increasing size of data sets, and variety of structure in information has made the domain of parallel and high-performance computing especially crucial. Graphics Processing Units (GPUs) have recently become an exciting alternative to traditional CPU architectures for applications in this domain. Although GPUs are designed for rendering graphics, research has found that the GPU architecture is well-suited to algorithms that search and analyze unstructured, graph-based data, offering up to an order of magnitude greater memory bandwidth over their CPU counterparts. This thesis focuses on GPU graph analysis from the perspective that algorithms should be efficient on as many classes of graphs as possible, rather than being specialized to a specific class, such as social networks or road networks. Using betweenness centrality, a popular analytic used to find prominent entities of a network, as a motivating example, we show how parallelism, distributed computing, hybrid and on-line algorithms, and dynamic algorithms can all contribute to substantial improvements in the performance and energy-efficiency of these computations. We further generalize this approach and provide an abstraction that can be applied to a whole class of graph algorithms that require many simultaneous breadth-first searches. Finally, to show that our findings can be applied in real-world scenarios, we apply these techniques to the problem of verifying that a multiprocessor complies with its memory consistency model.
APA, Harvard, Vancouver, ISO, and other styles
35

Dubreuil, Christophe. "A quad-tree algorithm for efficient polygon comparison, and its parallel implementation." Thesis, Heriot-Watt University, 1997. http://hdl.handle.net/10399/677.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Ahmed, Nova. "Parallel Algorithm for Memory Efficient Pairwise and Multiple Genome Alignment in Distributed Environment." Digital Archive @ GSU, 2004. http://digitalarchive.gsu.edu/cs_theses/2.

Full text
Abstract:
The genome sequence alignment problems are very important ones from the computational biology perspective. These problems deal with large amount of data which is memory intensive as well as computation intensive. In the literature, two separate algorithms have been studied and improved – one is a Pairwise sequence alignment algorithm which aligns pairs of genome sequences with memory reduction and parallelism for the computation and the other one is the multiple sequence alignment algorithm that aligns multiple genome sequences and this algorithm is also parallelized efficiently so that the workload of the alignment program is well distributed. The parallel applications can be launched on different environments where shared memory is very well suited for these kinds of applications. But shared memory environment has the limitation of memory usage as well as scalability also these machines are very costly. A better approach is to use the cluster of computers and the cluster environment can be further enhanced to a grid environment so that the scalability can be improved introducing multiple clusters. Here the grid environment is studied as well as the shared memory and cluster environment for the two applications. It can be stated that for carefully designed algorithms the grid environment is comparable for its performance to other distributed environments and it sometimes outperforms the others in terms of the limitations of resources the other distributed environments have.
APA, Harvard, Vancouver, ISO, and other styles
37

Zlateski, Aleksandar. "A design and implementation of an efficient, parallel watershed algorithm for affinity graphs." Thesis, Massachusetts Institute of Technology, 2011. http://hdl.handle.net/1721.1/66820.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 43).
In this thesis, I designed and implemented an efficient, parallel, generalized watershed algorithm for hierarchical segmentation of affinity graphs. By introducing four variable parameters the algorithm enables us to use previous knowledge about the input graph in order to achieve better results. The algorithm is very suitable for hierarchical segmentintation of large scale 3D images of the brain tissue obtained by electron microscopy making it an essential tool for reconstructing the brain's neural-networks called connectomes. The algorithm was fully implemented in C++ and tested on a currently largest available affinity graph of size 90GB on which no existent watershed implementation could be applied.
by Aleksandar Zlateski.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
38

Çağrıcı, Gökhan Koltuksuz Ahmet. "An analysis of key generation efficiency of rsa cryptos ystem in distributed environments/." [s.l.]: [s.n.], 2005. http://library.iyte.edu.tr/tezler/master/bilgisayaryazilimi/T000406.pdf.

Full text
Abstract:
Thesis (Master)--İzmir Institute of Technology, İzmir, 2005.
Keywords: Cryptosystem, rivest-Shamir-Adleman, parallel computing, parallel algorithms, Random number. Includes bibliographical references (leaves. 68).
APA, Harvard, Vancouver, ISO, and other styles
39

Yamamoto, Yusaku. "Efficient Parallel Implementation of a Weather Derivatives Pricing Algorithm based on the Fast Gauss Transform." IEEE, 2006. http://hdl.handle.net/2237/9478.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Janzén, Johan. "Evaluation of Energy-Optimizing Scheduling Algorithms for Streaming Computations on Massively Parallel Multicore Architectures." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-111385.

Full text
Abstract:
This thesis describes an environment to evaluate and compare static schedulers for real pipelined streaming applications on massively parallel architectures, such as Intel Single chip Cloud Computer (SCC), Adapteva Epiphany, and Tilera TILE-Gx series. The framework allows performance comparison of schedulers in their execution time, or the energy usage of static schedules with energy models and measurements on real platform. This thesis focuses on the implementation of a framework evaluating the energy consumption of such streaming applications on the SCC. The framework can run streaming applications, built as task collections, with static schedules including dynamic frequency scaling. Streams are handled by the framework with FIFO buffers, connected between tasks. We evaluate the framework by considering a pipelined mergesort implementation with different static schedules. The runtime is compared with the runtime of a previously published task based optimized mergesort implementation. The results show how much overhead the framework adds on to the streaming application. As a demonstration of the energy measuring capabilities, we schedule and analyze a Fast Fourier Transform application, and discuss the results. Future work may include quantitative comparative studies of a range of different static schedulers. This has, to our knowledge, not been done previously.
APA, Harvard, Vancouver, ISO, and other styles
41

Lambert, Thomas. "On the Effect of Replication of Input Files on the Efficiency and the Robustness of a Set of Computations." Thesis, Bordeaux, 2017. http://www.theses.fr/2017BORD0656/document.

Full text
Abstract:
Avec l’émergence du calcul haute-performance (HPC) et des applications Big Data, de nouvelles problématiques cruciales sont apparues. Parmi elles on trouve le problème du transfert de données, c’est-à-dire des communications entre machines, qui peut génerer des délais lors de gros calculs en plus d’avoir un impact sur la consommation énergétique. La réplication, que ce soit de tâches ou de fichiers, est un facteur qui accroît ces communications, tout en étant un outil quasi-indispensable pour améliorer le parallélisme du calcul et la résistance aux pannes. Dans cette thèse nous nous intéressons à la réplication de fichiers et à son impact sur les communications au travers de deux problèmes. Dans le premier, la multiplication de matrices en parallèle, le but est de limiter autant que possible ces réplications pour diminuer la quantité de données déplacées. Dans le second, l’ordonnancement de la phase « Map » de MapReduce, il existe une réplication initiale qu’il faut utiliser au mieux afin d’obtenir l’ordonnancement le plus rapide ou entraînant le moins de création de nouvelles copies. En plus de la réplication, nous nous intéressons aussi à la comparaison entre stratégies d’ordonnancement statiques (allocations faites en amont du calcul) et dynamiques (allocations faites pendant le calcul) sur ces deux problèmes avec pour objectif de créer des stratégies hybrides mélangeant les deux aspects. Pour le premier problème, le produit de matrices en parallèle, nous nous ramenons à un problème de partition de carré où l’équilibrage de charge est donné en entrée. Cet équilibrage donné, le but est de minimiser la somme des semi-paramètres des rectangles couvrant des zones ainsi créés. Ce problème a déjà été étudié par le passé et nous démontrons de nouveaux résultats. Nous proposons ainsi deux nouveaux algorithmes d’approximation, l’un fondé sur une stratégie récursive et l’autre sur l’usage d’une courbe fractale. Nous présentons également une modélisation alternative, fondée sur un problème similaire de partition de cube, dont nous prouvons la NP-complétude tout en fournissant deux algorithmes d’approximation. Pour finir, nous réalisons également une implémentation pratique du produit de matrices en utilisant nos stratégies d’allocation grâce à la librairie StarPU. Les résultats expérimentaux montrent une amélioration du temps de calcul ainsi qu’une diminution significative des transferts de données lorsqu’on utilise une stratégie statique d’allocation couplée à une technique de vol de tâches. Pour le second problème, l’ordonnancement de la phase « Map » de MapReduce, plusieurs copies des fichiers d’entrée sont distribuées parmi les processeurs disponibles. Le but ici est de faire en sorte que chaque tâche soit attribuée à un processeur possédant son fichier d’entrée tout en ayant le meilleur temps de calcul total. Une autre option étudiée est d’autoriser les tâches nonlocales (attribués à des processeurs ne possédant pas leurs fichiers d’entrée) mais d’en limiter le nombre. Dans cette thèse nous montrons premièrement qu’un algorithme glouton pour ce problème peut être modélisé par un processus de « balls-in-bins » avec choix, impliquant une surcharge (nombre de tâches supplémentaires par rapport à la moyenne) en O(mlogm) où m est le nombre de processeurs. Secondement, dans le cas où les tâches non-locales sont interdites, nous relions le problème à celui de l’orientation de graphes, ce qui permet d’obtenir des algorithmes optimaux et polynomiaux et l’existence d’une assignation presque parfaite avec forte probabilité. Dans le cas où les tâches non locales sont autorisées, nous proposons également des algorithmes polynomiaux et optimaux. Finalement, nous proposons un ensemble de simulations pour montrer l’efficacité de nos méthodes dans le cas de tâches faiblement hétérogènes
The increasing importance of High Performance Computing (HPC) and Big Data applications creates new issues in parallel computing. One of them is communication, the data transferred from a processor to another. Such data movements have an impact on computational time, inducing delays and increase of energy consumption. If replication, of either tasks or files, generates communication, it is also an important tool to improve resiliency and parallelism. In this thesis, we focus on the impact of the replication of input files on the overall amount of communication. For this purpose, we concentrate on two practical problems. The first one is parallel matrix multiplication. In this problem, the goal is to induce as few replications as possible in order to decrease the amount of communication. The second problem is the scheduling of the “Map” phase in the MapReduce framework. In this case, replication is an input of the problem and this time the goal is to use it in the best possible way. In addition to the replication issue, this thesis also considers the comparison between static and dynamic approaches for scheduling. For consistency, static approaches compute schedules before starting the computation while dynamic approaches compute the schedules during the computation itself. In this thesis we design hybrid strategies in order to take advantage of the pros of both. First, we relate communication-avoiding matrix multiplication with a square partitioning problem, where load-balancing is given as an input. In this problem, the goal is to split a square into zones (whose areas depend on the relative speed of resources) while minimizing the sum of their half-perimeters. We improve the existing results in the literature for this problem with two additional approximation algorithms. In addition we also propose an alternative model using a cube partitioning problem. We prove the NP-completeness of the associated decision problem and we design two approximations algorithms. Finally, we implement the algorithms for both problems in order to provide a comparison of the schedules for matrix multiplication. For this purpose, we rely on the StarPU library. Second, in the Map phase of MapReduce scheduling case, the input files are replicated and distributed among the processors. For this problem we propose two metrics. In the first one, we forbid non-local tasks (a task that is processed on a processor that does not own its input files) and under this constraint, we aim at minimizing the makespan. In the second problem, we allow non-local tasks and we aim at minimizing them while minimizing makespan. For the theoretical study, we focus on tasks with homogeneous computation times. First, we relate a greedy algorithm on the makespan metric with a “ball-into-bins” process, proving that this algorithm produces solutions with expected overhead (the difference between the number of tasks on the most loaded processor and the number of tasks in a perfect distribution) equal to O(mlogm) where m denotes the number of processors. Second, we relate this scheduling problem (with forbidden non-local tasks) to a problem of graph orientation and therefore prove, with the results from the literature, that there exists, with high probability, a near-perfect assignment (whose overhead is at most 1). In addition, there are polynomial-time optimal algorithms. For the communication metric case, we provide new algorithms based on a graph model close to matching problems in bipartite graphs. We prove that these algorithms are optimal for both communication and makespan metrics. Finally, we provide simulations based on traces from a MapReduce cluster to test our strategies with realistic settings and prove that the algorithms we propose perform very well in the case of low or medium variance of the computation times of the different tasks of a job
APA, Harvard, Vancouver, ISO, and other styles
42

Coutinho, Demetrios Ara?jo Magalh?es. "Implementa??o paralela escal?vel e eficiente do algoritmo simplex padr?o em arquitetura multicore." Universidade Federal do Rio Grande do Norte, 2014. http://repositorio.ufrn.br:8080/jspui/handle/123456789/15502.

Full text
Abstract:
Made available in DSpace on 2014-12-17T14:56:18Z (GMT). No. of bitstreams: 1 DemetriusAMC_DISSERT.pdf: 2429364 bytes, checksum: 57aaf24560c189720b218dbca0ef1a56 (MD5) Previous issue date: 2014-01-24
Coordena??o de Aperfei?oamento de Pessoal de N?vel Superior
This work presents a scalable and efficient parallel implementation of the Standard Simplex algorithm in the multicore architecture to solve large scale linear programming problems. We present a general scheme explaining how each step of the standard Simplex algorithm was parallelized, indicating some important points of the parallel implementation. Performance analysis were conducted by comparing the sequential time using the Simplex tableau and the Simplex of the CPLEXR IBM. The experiments were executed on a shared memory machine with 24 cores. The scalability analysis was performed with problems of different dimensions, finding evidence that our parallel standard Simplex algorithm has a better parallel efficiency for problems with more variables than constraints. In comparison with CPLEXR , the proposed parallel algorithm achieved a efficiency of up to 16 times better
Este trabalho apresenta uma implementa??o paralela escal?vel e eficiente do algoritmo Simplex padr?o em arquitetura de processadores multicore para resolver problemas de programa??o linear de grande escala. Apresenta-se um esquema geral explicando como foi paralelizado cada passo do algoritmo simplex padr?o, apontando pontos importantes da implementa??o paralela. Foram realizadas an?lises de desempenho atrav?s da compara??o dos tempos sequenciais utilizando o Simplex tableau e Simplex do CPLEXR da IBM. Os experimentos foram realizados em uma m?quina de mem?ria compartilhada com 24 n?cleos. A an?lise de escalabilidade foi feita com problemas de diferentes dimens?es, encontrando evid?ncias de que a implementa??o paralela proposta do algoritmo simplex padr?o tem melhor efici?ncia paralela para problemas com mais vari?veis do que restri??es. Na compara??o com CPLEXR , o algoritmo proposto paralelo obteve uma efici?ncia de at? 16 vezes maior
APA, Harvard, Vancouver, ISO, and other styles
43

Colombet, Laurent. "Parallélisation d'applications pour des réseaux de processeurs homogènes ou hétérogènes." Grenoble INPG, 1994. http://tel.archives-ouvertes.fr/tel-00005084.

Full text
Abstract:
Le but de cette these est d'etudier et developper des methodes pour la parallelisation efficace des applications scientifiques sur machines paralleles a memoire distribuee. Dans une premiere partie nous presentons deux bibliotheques de fonctions de communication PVM ((\it Parallel Virtual Machine)) et MPI ((\it Message Passing Interface)). Ces dernieres fournissent une portabilite des programmes sur la grande majorite des machines paralleles, mais aussi sur des reseaux d'ordinateurs heterogenes. Cette partie illustre le probleme de la mesure des performances pour des reseaux de processeurs heterogenes. Ceci nous a amene a adapter le calcul du facteur d'acceleration et de l'efficacite afin de pouvoir evaluer les performances d'un algorithme sur un reseau de processeurs heterogenes. La deuxieme partie est consacree a l'etude de bibliotheques numeriques paralleles, comme ScaLAPACK, et au developpement d'une methode etudiee de maniere theorique, mais peu utilisee en pratique pour augmenter les performances des fonctions de ces bibliotheques : le recouvrement calcul/communication. L'idee generale consiste a anticiper les communications, notamment en pipelinant l'envoi des messages. Des resultats experimentaux sur machines Cray T3D et IBM SP1, permettent de valider les etudes theoriques effectuees sur des algorithmes de base de ces bibliotheques
The aim of this thesis is to study and develop efficient methods for parallelization of scientific applications on parallel computers with distributed memory. In the first part we present two libraries of PVM((\it Parallel Virtual Machine)) and MPI ((\it Message Passing Interface)) communication tools. They allow implementation of programs on most parallel machines, but also on heterogeneous computer networks. This chapter illustrates the problems faced when trying to evaluate performances of networks with heterogeneous processors. To evaluate such performances we modified and adapted the concepts of speed-up and efficiency to account for heterogeneity. The second part deals with a study of parallel application libraries such as ScaLAPACK and with the development of communication masking techniques. The general concept is based on communication anticipation, in particular by pipelining message sending operations. Experimental results on Cray T3D and IBM SP1 machines validates the theoretical studies performed on basic algorithms of the libraries discussed above
APA, Harvard, Vancouver, ISO, and other styles
44

Li, Lei. "Fast Algorithms for Mining Co-evolving Time Series." Research Showcase @ CMU, 2011. http://repository.cmu.edu/dissertations/112.

Full text
Abstract:
Time series data arise in many applications, from motion capture, environmental monitoring, temperatures in data centers, to physiological signals in health care. In the thesis, I will focus on the theme of learning and mining large collections of co-evolving sequences, with the goal of developing fast algorithms for finding patterns, summarization, and anomalies. In particular, this thesis will answer the following recurring challenges for time series: 1. Forecasting and imputation: How to do forecasting and to recover missing values in time series data? 2. Pattern discovery and summarization: How to identify the patterns in the time sequences that would facilitate further mining tasks such as compression, segmentation and anomaly detection? 3. Similarity and feature extraction: How to extract compact and meaningful features from multiple co-evolving sequences that will enable better clustering and similarity queries of time series? 4. Scale up: How to handle large data sets on modern computing hardware? We develop models to mine time series with missing values, to extract compact representation from time sequences, to segment the sequences, and to do forecasting. For large scale data, we propose algorithms for learning time series models, in particular, including Linear Dynamical Systems (LDS) and Hidden Markov Models (HMM). We also develop a distributed algorithm for finding patterns in large web-click streams. Our thesis will present special models and algorithms that incorporate domain knowledge. For motion capture, we will describe the natural motion stitching and occlusion filling for human motion. In particular, we provide a metric for evaluating the naturalness of motion stitching, based which we choose the best stitching. Thanks to domain knowledge (body structure and bone lengths), our algorithm is capable of recovering occlusions in mocap sequences, better in accuracy and longer in missing period. We also develop an algorithm for forecasting thermal conditions in a warehouse-sized data center. The forecast will help us control and manage the data center in a energy-efficient way, which can save a significant percentage of electric power consumption in data centers.
APA, Harvard, Vancouver, ISO, and other styles
45

Lee, Young-Jun. "Routing and Efficient Evaluation Techniques for Multi-hop Mobile Wireless Networks." Diss., Georgia Institute of Technology, 2005. http://hdl.handle.net/1853/7455.

Full text
Abstract:
In this dissertation, routing protocols, load-balancing protocols, and efficient evaluation techniques for multi-hop mobile wireless networks are explored. With the advancements made in wireless communication and computer technologies, a new type of mobile wireless network, known as a mobile ad hoc network (MANET), has drawn constant attention. In recent years, several routing protocols for MANETs have been proposed. However, there still remains the need for mechanisms for better scalability support with respect to network size, traffic volume, and mobility. To address this issue, a new method for multi-hop routing in MANETs called Dynamic NIx-Vector Routing (DNVR) is proposed. DNVR has several distinct features compared to other existing on-demand routing protocols, which lead to more stable routes and better scalability. Currently, ad hoc routing protocols lack load-balancing capabilities. Therefore they often fail to provide good service quality, especially in the presence of a large volume of network traffic since the network load concentrates on some nodes, resulting in a highly congested environment. To address this issue, a novel load-balancing technique for ad hoc on-demand routing protocols is proposed. The new method is simple but very effective in achieving load balance and congestion alleviation. In addition, it operates in a completely distributed fashion. To evaluate and verify wireless network protocols effectively, especially to test their scalability properties, scalable and efficient network simulation methods are required. Usually simulation of such large-scale wireless networks needs a long execution time and requires a large amount of computing resources such as powerful CPUs and memory. Traditionally, to cope with this problem, parallel network simulation techniques with parallel computing capabilities have been considered. This dissertation explores a different type of method, which is efficient and can be achieved with a sequential simulation, as well as a parallel and distributed technique for large-scale mobile wireless networks.
APA, Harvard, Vancouver, ISO, and other styles
46

Vu, Chinh Trung. "Distributed Energy-Efficient Solutions for Area Coverage Problems in Wireless Sensor Networks." Digital Archive @ GSU, 2009. http://digitalarchive.gsu.edu/cs_diss/37.

Full text
Abstract:
Wireless sensor networks (WSNs) have recently attracted a great deal of attention due to their numerous attractive applications in many different fields. Sensors and WSNs possess a number of special characteristics that make them very promising in a wide range of applications, but they also put on them lots of constraints that make issues in sensor network particularly challenging. These issues may include topology control, routing, coverage, security, data management and many others. Among them, coverage problem is one of the most fundamental ones for which a WSN has to watch over the environment such as a forest (area coverage) or set of subjects such as collection of precious renaissance paintings (target of point coverage) in order for the network to be able to collect environment parameters, and maybe further monitor the environment. In this dissertation, we highly focus on the area coverage problem. With no assumption of sensors’ locations (i.e., the sensor network is randomly deployed), we only consider distributed and parallel scheduling methods with the ultimate objective of maximizing network lifetime. Additionally, the proposed solutions (including algorithms, a scheme, and a framework) have to be energy-efficient. Generally, we investigate numerous generalizations and variants of the basic coverage problem. Those problems of interest include k-coverage, composite event detection, partial coverage, and coverage for adjustable sensing range network. Various proposed algorithms. In addition, a scheme and a framework are also suggested to solve those problems. The scheme, which is designed for emergency alarming applications, specifies the guidelines for data and communication patterns that significantly reduce the energy consumption and guarantee very low notification delay. For partial coverage problem, we propose a universal framework (consisting of four strategies) which can take almost any complete-coverage algorithm as an input to generate an algorithm for partial coverage. Among the four strategies, two pairs of strategies are trade-off in terms of network lifetime and coverage uniformity. Extensive simulations are conducted to validate the efficiency of each of our proposed solutions.
APA, Harvard, Vancouver, ISO, and other styles
47

Fassi, Imen. "XFOR (Multifor) : A new programming structure to ease the formulation of efficient loop optimizations." Thesis, Strasbourg, 2015. http://www.theses.fr/2015STRAD043/document.

Full text
Abstract:
Nous proposons une nouvelle structure de programmation appelée XFOR (Multifor), dédiée à la programmation orientée réutilisation de données. XFOR permet de gérer simultanément plusieurs boucles "for" ainsi que d’appliquer/composer des transformations de boucles d’une façon intuitive. Les expérimentations ont montré des accélérations significatives des codes XFOR par rapport aux codes originaux, mais aussi par rapport au codes générés automatiquement par l’optimiseur polyédrique de boucles Pluto. Nous avons mis en œuvre la structure XFOR par le développement de trois outils logiciels: (1) un compilateur source-à-source nommé IBB, qui traduit les codes XFOR en un code équivalent où les boucles XFOR ont été remplacées par des boucles for sémantiquement équivalentes. L’outil IBB bénéficie également des optimisations implémentées dans le générateur de code polyédrique CLooG qui est invoqué par IBB pour générer des boucles for à partir d’une description OpenScop; (2) un environnement de programmation XFOR nommé XFOR-WIZARD qui aide le programmeur dans la ré-écriture d’un programme utilisant des boucles for classiques en un programme équivalent, mais plus efficace, utilisant des boucles XFOR; (3) un outil appelé XFORGEN, qui génère automatiquement des boucles XFOR à partir de toute représentation OpenScop de nids de boucles transformées générées automatiquement par un optimiseur automatique
We propose a new programming structure named XFOR (Multifor), dedicated to data-reuse aware programming. It allows to handle several for-loops simultaneously and map their respective iteration domains onto each other. Additionally, XFOR eases loop transformations application and composition. Experiments show that XFOR codes provides significant speed-ups when compared to the original code versions, but also to the Pluto optimized versions. We implemented the XFOR structure through the development of three software tools: (1) a source-to-source compiler named IBB for Iterate-But-Better!, which automatically translates any C/C++ code containing XFOR-loops into an equivalent code where XFOR-loops have been translated into for-loops. IBB takes also benefit of optimizations implemented in the polyhedral code generator CLooG which is invoked by IBB to generate for-loops from an OpenScop specification; (2) an XFOR programming environment named XFOR-WIZARD that assists the programmer in re-writing a program with classical for-loops into an equivalent but more efficient program using XFOR-loops; (3) a tool named XFORGEN, which automatically generates XFOR-loops from any OpenScop representation of transformed loop nests automatically generated by an automatic optimizer
APA, Harvard, Vancouver, ISO, and other styles
48

Kindap, Nihal. "On An Architecture For A Parallel Finite Field Multiplier With Low Complexity Based On Composite Fields." Master's thesis, METU, 2004. http://etd.lib.metu.edu.tr/upload/12605347/index.pdf.

Full text
Abstract:
In this thesis, a bit parallel architecture for a parallel finite field multiplier with low complexity in composite fields GF((2n)m) with k = n ·
m (k 32) is investigated. The architecture has lower complexity when the Karatsuba-Ofman algorithm is applied for certain k. Using particular primitive polynomials for composite fields improves the complexities. We demonstrated for the values m = 2, 4, 8 in details. This thesis is based on the paper &ldquo
A New Architecture for a Parallel Finite Field Multiplier with Low Complexity Based on Composite Fields &rdquo
by Christof Paar. The whole purpose of this thesis is to understand and present a detailed description of the results of the paper of Paar.
APA, Harvard, Vancouver, ISO, and other styles
49

Luo, Jia. "Algorithmes génétiques parallèles pour résoudre des problèmes d'ordonnancement de tâches dynamiques de manière efficace en prenant en compte l'énergie." Thesis, Toulouse 3, 2019. http://www.theses.fr/2019TOU30001.

Full text
Abstract:
Du fait de nouvelles législations gouvernementales et de la prise de conscience environnementale des consommateurs ainsi que de la hausse du coût de l'énergie, l'efficacité énergétique est devenue un paramètre essentiel des processus industriels ces dernières années. La plupart des avancées en ce qui concerne les économies d'énergie dans les problèmes d'ordonnancement se sont focalisées sur l'ordonnancement statique. Mais en fait, ces problèmes sont dynamiques dans le monde réel. Dans cette thèse, deux problèmes d'ordonnancement dynamique efficace énergiquement sont étudiés. Le Modèle I analyse le retard total et la durée de production avec une limite de puissance tout en tenant compte d'un flux dynamique de nouvelles tâches. Un rééchelonnement complet périodique est adopté. Le Modèle II vise à réduire au minimum le retard total et la consommation d'énergie totale dans le traitement des tâches en tenant compte de nouvelles tâches prioritaires. Une approche basée sur la réparation de la planification des événements est utilisée pour traiter la mise à jour de l'ordonnancement. Comme un nouveau plan d'ordonnancement adéquat doit être obtenu dans un temps de réponse court dans un environnement dynamique, deux Algorithmes Génétiques parallèles (AG) sont proposés pour résoudre ces deux modèles. L'algorithme parallèle AG I est une méthode hybride basée sur CUDA consistant en un modèle AG insulaire au niveau supérieur et un modèle AG fin, au niveau inférieur. Il combine les métriques de deux couches hiérarchiques et tire pleinement parti des capacités de calcul de la plateforme CUDA. L'algorithme AG II est conçu avec une double hétérogénéité qui résulte de l'utilisation d'un AG cellulaire parallèle et d'un pseudo AG parallèle. Avec ces deux structures différentes, les ilots augmentent la diversité de la population et peuvent être simultanément parallélisés sur des GPU et un processeur multi-cœur. Enfin, des solutions numériques sont présentées et analysées ; elles montrent que nos approches peuvent non seulement résoudre les problèmes de manière flexible, mais également obtenir des solutions avantageuses et réduire les temps de calcul
Due to new government legislation, customers' environmental concerns and continuously rising cost of energy, energy efficiency is becoming an essential parameter of industrial manufacturing processes in recent years. Most efforts considering energy issues in scheduling problems have focused on static scheduling. But in fact, scheduling problems are dynamic in the real world with uncertain new arrival jobs after the execution time. In this thesis, two energy efficient dynamic scheduling problems are studied. Model I analyzes the total tardiness and the makespan with a peak power limitation while considering the flexible flow shop with new arrival jobs. A periodic complete rescheduling approach is adopted to represent the optimization problem. Model II concerns an investigation into minimizing total tardiness and total energy consumption in the job shop with new urgent arrival jobs. An event driven schedule repair approach is utilized to deal with the updated schedule. As an adequate renewed scheduling plan needs to be obtained in a short response time in dynamic environment, two parallel Genetic Algorithms (GAs) are proposed to solve these two models respectively. The parallel GA I is a CUDA-based hybrid model consisting of an island GA at the upper level and a fine-grained GA at the lower level. It combines metrics of two hierarchical layers and takes full advantage of CUDA's compute capability. The parallel GA II is a dual heterogeneous design composed of a cellular GA and a pseudo GA. The islands with these two different structures increase the population diversity and can be well parallelized on GPUs simultaneously with multi-core CPU. Finally, numerical experiments are conducted and show that our approaches can not only solve the problems flexibly, but also gain competitive results and reduce time requirements
APA, Harvard, Vancouver, ISO, and other styles
50

Poggi, Giovanni. "Implementazione CUDA su GPU di un algoritmo sort-based per la distribuzione efficiente di dati in simulazioni distribuite." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021. http://amslaurea.unibo.it/24988/.

Full text
Abstract:
L'obiettivo di questa tesi è quello di elaborare, descrivere ed analizzare un algoritmo che risolva un problema di sort matching all'interno di una simulazione, utilizzando una GPU su architettura CUDA. Dopo questa breve introduzione, nel secondo capitolo si svolgerà un'introduzione rispetto all'argomento principale, con cenni al Data Distribution Management. Si introdurrà l'High Level Architecture, di cui si da una definizione e quindi una descrizione di massima, dando particolare attenzione alla sua struttura. Si prosegue con l'analisi dettagliata del servizio di DDM, la sua funzione e i principi chiave a cui deve attenersi. Nel terzo capitolo si effettua l'introduzione al matching, spiegando nel dettaglio di cosa si occupa questa operazione e le differenze sostanziali tra extent, region, update e subscription. Dopo un breve cenno al routing space si presentano alcuni degli algoritmi più diffusi ed impiegati nello svolgimento operativo dei servizi di DDM. Ponendo forte accento sugli algoritmi Brute-Force Parallel, Grid-Based Parallel e l'algoritmo preso in esame, il Sort-Based Parallel, aggiungendo qualche cenno sulla parallelizzazione con CUDA. Il quarto capitolo è dedicato all'ottimizzazione dell'algoritmo sort-based utilizzando l'architettura CUDA. Dopo una breve introduzione alla GPU e CUDA, si presentano tutte le modifiche sostanziali effettuate al Sort-Based. Si prosegue con l'adattamento realizzato dell'algoritmo su GPU ed i dettagli implementativi. Nel quinto capitolo si analizzano le prestazioni ottenute durante l'esecuzione dell'algoritmo utilizzato. Si descrivono i criteri di valutazione che hanno permesso di comprendere l'ottenimento del goal iniziale e si presentano i risultati ottenuti, descrivendo le rappresentazioni grafiche corredate dalle relative tabelle numeriche per una più corretta comprensione. Seguono le conclusioni con una riflessione e discussione dei risultati, corredati dai possibili sviluppi futuri dell'applicazione in oggetto alla tesi.
APA, Harvard, Vancouver, ISO, and other styles
We offer discounts on all premium plans for authors whose works are included in thematic literature selections. Contact us to get a unique promo code!

To the bibliography