Scientific literature on the topic "Dataflow-Threads"


Below are thematic lists of journal articles, books, theses, conference proceedings, and other academic sources on the topic "Dataflow-Threads".


Journal articles on the topic "Dataflow-Threads"

1

Lamport, Leslie. "Implementing dataflow with threads." Distributed Computing 21, no. 3 (July 29, 2008): 163–81. http://dx.doi.org/10.1007/s00446-008-0065-1.

2

Satoh, Shigehisa, Kazuhiro Kusano, and Mitsuhisa Sato. "Compiler Optimization Techniques for OpenMP Programs." Scientific Programming 9, no. 2-3 (2001): 131–42. http://dx.doi.org/10.1155/2001/189054.

Abstract:
We have developed compiler optimization techniques for explicit parallel programs using the OpenMP API. To enable optimization across threads, we designed dataflow analysis techniques in which interactions between threads are effectively modeled. Structured description of parallelism and relaxed memory consistency in OpenMP make the analyses effective and efficient. We developed algorithms for reaching definitions analysis, memory synchronization analysis, and cross-loop data dependence analysis for parallel loops. Our primary target is compiler-directed software distributed shared memory systems in which aggressive compiler optimizations for software-implemented coherence schemes are crucial to obtaining good performance. We also developed optimizations applicable to general OpenMP implementations, namely redundant barrier removal and privatization of dynamically allocated objects. Experimental results for the coherency optimization show that aggressive compiler optimizations are quite effective for a shared-write intensive program because the coherence-induced communication volume in such a program is much larger than that in shared-read intensive programs.
3

Thornton, Mitchell. "Performance Evaluation of a Parallel Decoupled Data Driven Multiprocessor." Parallel Processing Letters 13, no. 3 (September 2003): 497–507. http://dx.doi.org/10.1142/s0129626403001458.

Abstract:
The Decoupled Data-Driven (D3) architecture has shown promising results from performance evaluations based upon deterministic simulations. This paper provides performance evaluations of the D3 architecture through the formulation and analysis of a stochastic model. The D3 architecture is a hybrid control/dataflow approach that takes advantage of inherent parallelism present in a program by dynamically scheduling program threads based on data availability and it also takes advantage of locality through the use of conventional processing elements that execute the program threads. The model is validated by comparing the deterministic and stochastic model responses. After model validation, various input parameters are varied such as the number of available processing elements and average threadlength, then the performance of the architecture is evaluated. The stochastic model is based upon a closed queueing network and utilizes the concepts of available parallelism and virtual queues in order to be reduced to a Markovian system. Experiments with varying computation engine threadlengths and communication latencies indicate a high degree of tolerance with respect to exploited parallelism.
4

Lohstroh, Marten, Christian Menard, Soroush Bateni, and Edward A. Lee. "Toward a Lingua Franca for Deterministic Concurrent Systems." ACM Transactions on Embedded Computing Systems 20, no. 4 (June 2021): 1–27. http://dx.doi.org/10.1145/3448128.

Abstract:
Many programming languages and programming frameworks focus on parallel and distributed computing. Several frameworks are based on actors, which provide a more disciplined model for concurrency than threads. The interactions between actors, however, if not constrained, admit nondeterminism. As a consequence, actor programs may exhibit unintended behaviors and are less amenable to rigorous testing. We show that nondeterminism can be handled in a number of ways, surveying dataflow dialects, process networks, synchronous-reactive models, and discrete-event models. These existing approaches, however, tend to require centralized control, pose challenges to modular system design, or introduce a single point of failure. We describe “reactors,” a new coordination model that combines ideas from several of these approaches to enable determinism while preserving much of the style of actors. Reactors promote modularity and allow for distributed execution. By using a logical model of time that can be associated with physical time, reactors also provide control over timing. Reactors also expose parallelism that can be exploited on multicore machines and in distributed configurations without compromising determinacy.
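The determinism ingredient that reactors borrow from discrete-event models — reactions are ordered by a logical timestamp rather than by arrival order — can be sketched in a few lines. This is a hypothetical illustration, not the Lingua Franca runtime; the `run_events` helper and the event names are invented:

```python
import heapq

def run_events(events):
    # events: (logical_time, sequence_no, action) tuples. Ordering by
    # the full tuple gives a total order on reactions, so the log is
    # the same no matter how the events were interleaved when created.
    heap = list(events)
    heapq.heapify(heap)
    log = []
    while heap:
        t, _seq, action = heapq.heappop(heap)
        log.append((t, action))
    return log

evts = [(2, 0, "actuate"), (1, 1, "sense"), (1, 0, "compute")]
print(run_events(evts))  # [(1, 'compute'), (1, 'sense'), (2, 'actuate')]
```

Shuffling `evts` leaves the returned log unchanged, which is the deterministic property the paper argues for.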
5

Tatas, Konstantinos, Costas Kyriacou, Paraskevas Evripidou, Pedro Trancoso, and Stephan Wong. "Rapid Prototyping of the Data-Driven Chip-Multiprocessor (D2-CMP) using FPGAs." Parallel Processing Letters 18, no. 2 (June 2008): 291–306. http://dx.doi.org/10.1142/s0129626408003399.

Abstract:
This paper presents the FPGA implementation of the prototype for the Data-Driven Chip-Multiprocessor (D2-CMP). In particular, we study the implementation of a Thread Synchronization Unit (TSU) on FPGA, a hardware unit that enables thread execution using dataflow-like scheduling policy on a chip multiprocessor. Threads are scheduled for execution based on data availability, i.e., a thread is scheduled for execution only if its input data is available. This model of execution is called the non-blocking Data-Driven Multithreading (DDM) model of execution. The DDM model has been evaluated using an execution driven simulator. To validate the simulation results, a 2-node DDM chip multiprocessor has been implemented on a Xilinx Virtex-II Pro FPGA with two PowerPC processors hardwired on the FPGA. Measurements on the hardware prototype show that the TSU can be implemented with a moderate hardware budget. The 2-node multiprocessor has been implemented with less than half of the reconfigurable hardware available on the Xilinx Virtex-II Pro FPGA (45% slices), which corresponds to an ASIC equivalent gate count of 1.9 million gates. Measurements on the prototype showed that the delays incurred by the operation of the TSU can be tolerated.
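The scheduling policy the TSU implements can be illustrated in miniature: each thread carries a synchronization count of outstanding inputs and becomes ready only when that count reaches zero. This is a minimal software sketch under assumed names (`DThread`, `run`), not the actual DDM hardware interface:

```python
from collections import deque

class DThread:
    def __init__(self, name, n_inputs, body, consumers=None):
        self.name = name
        self.count = n_inputs             # remaining unsatisfied inputs
        self.body = body                  # work to run when fired
        self.consumers = consumers or []  # threads fed by this one's output

def run(threads):
    ready = deque(t for t in threads if t.count == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t.name)
        t.body()
        for c in t.consumers:             # data availability drives scheduling
            c.count -= 1
            if c.count == 0:
                ready.append(c)
    return order

# Diamond dependency: a feeds b and c, which both feed d.
d = DThread("d", 2, lambda: None)
b = DThread("b", 1, lambda: None, [d])
c = DThread("c", 1, lambda: None, [d])
a = DThread("a", 0, lambda: None, [b, c])
print(run([a, b, c, d]))  # ['a', 'b', 'c', 'd']
```

No thread blocks waiting for data — a thread is simply not enqueued until its inputs exist, which is the non-blocking property of the DDM model.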
6

Halappanavar, Mahantesh, John Feo, Oreste Villa, Antonino Tumeo, and Alex Pothen. "Approximate weighted matching on emerging manycore and multithreaded architectures." International Journal of High Performance Computing Applications 26, no. 4 (August 9, 2012): 413–30. http://dx.doi.org/10.1177/1094342012452893.

Abstract:
Graph matching is a prototypical combinatorial problem with many applications in high-performance scientific computing. Optimal algorithms for computing matchings are challenging to parallelize. Approximation algorithms are amenable to parallelization and are therefore important to compute matchings for large-scale problems. Approximation algorithms also generate nearly optimal solutions that are sufficient for many applications. In this paper we present multithreaded algorithms for computing half-approximate weighted matching on state-of-the-art multicore (Intel Nehalem and AMD Magny-Cours), manycore (Nvidia Tesla and Nvidia Fermi), and massively multithreaded (Cray XMT) platforms. We provide two implementations: the first uses shared work queues and is suited for all platforms; and the second implementation, based on dataflow principles, exploits special features available on the Cray XMT. Using a carefully chosen dataset that exhibits characteristics from a wide range of applications, we show scalable performance across different platforms. In particular, for one instance of the input, an R-MAT graph (RMAT-G), we show speedups of about [Formula: see text] on [Formula: see text] cores of an AMD Magny-Cours, [Formula: see text] on [Formula: see text] cores of Intel Nehalem, [Formula: see text] on Nvidia Tesla and [Formula: see text] on Nvidia Fermi relative to one core of Intel Nehalem, and [Formula: see text] on [Formula: see text] processors of Cray XMT. We demonstrate strong as well as weak scaling for graphs with up to a billion edges using up to 12,800 threads. We avoid excessive fine-tuning for each platform and retain the basic structure of the algorithm uniformly across platforms. An exception is the dataflow algorithm designed specifically for the Cray XMT. To the best of the authors' knowledge, this is the first such large-scale study of the half-approximate weighted matching problem on multithreaded platforms. 
Driven by the critical enabling role of combinatorial algorithms such as matching in scientific computing and the emergence of informatics applications, there is a growing demand to support irregular computations on current and future computing platforms. In this context, we evaluate the capability of emerging multithreaded platforms to tolerate latency induced by irregular memory access patterns, and to support fine-grained parallelism via light-weight synchronization mechanisms. By contrasting the architectural features of these platforms against the Cray XMT, which is specifically designed to support irregular memory-intensive applications, we delineate the impact of these choices on performance.
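The half-approximation guarantee the paper builds on comes from greedily preferring heavier edges. The sketch below shows the simple sequential variant (sorted greedy), not the parallel locally-dominant-edge algorithms the paper actually evaluates; the function name and edge encoding are invented for illustration:

```python
def half_approx_matching(edges):
    """Greedy 1/2-approximation for maximum weight matching.

    edges: list of (u, v, weight). Scanning edges in decreasing weight
    and taking any edge whose endpoints are both unmatched yields a
    matching whose weight is at least half of the optimum.
    """
    matched = set()
    matching = []
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        if u not in matched and v not in matched:
            matching.append((u, v, w))
            matched.update((u, v))
    return matching

# Triangle plus a pendant edge: the optimum picks (0,1) and (2,3),
# and here the greedy strategy happens to find it.
edges = [(0, 1, 5.0), (1, 2, 4.0), (0, 2, 3.0), (2, 3, 2.0)]
print(half_approx_matching(edges))  # [(0, 1, 5.0), (2, 3, 2.0)]
```

The parallel versions replace the global sort with local comparisons: an edge is added when it is the heaviest edge incident on both of its endpoints, which needs only neighborhood information and fine-grained synchronization.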

Theses on the topic "Dataflow-Threads"

1

Sahebi, Amin. "Reconfigurable Architectures for Accelerating Distributed Applications: A Graph Processing Application Case Study." Doctoral thesis, 2022. http://hdl.handle.net/2158/1271765.

Abstract:
This thesis focuses on state-of-the-art challenges of distributed execution models and on system support for artificial intelligence and high-performance computing applications. In this context, we investigate in detail the co-design of the Dataflow-Threads execution model. To facilitate the support, development, and debugging of the Dataflow-Threads execution model, we introduce DRT, a lightweight dataflow runtime. DRT is written in portable C code (tested with the GNU C compiler), is open-source, and can be used on real machines based on architectures such as x86, AArch, and RISC-V. Furthermore, we consider demanding applications in the domains of artificial intelligence (AI) and high-performance computing (HPC) and address the main challenges and bottlenecks in extending our dataflow runtime. To do this, we use widely known benchmarks to stress the capabilities of the DF-Threads execution model and to evaluate it against other parallel programming models: blocked matrix multiplication and recursive Fibonacci. Matrix multiplication is one of the main kernels of AI and HPC applications, while recursive Fibonacci is a simple benchmark that creates a high number of threads and processes and stresses the entire execution model. In this thesis we are mainly interested in heterogeneous platforms. A heterogeneous platform is a hardware device that contains a range of computing components, such as multicore CPUs, GPUs, or FPGAs; such systems are flexible, cost-efficient, and well supported by their communities. Our work focuses mainly on CPU+FPGA heterogeneous systems, mostly a general-purpose CPU (x86 or ARM) running a Unix-based operating system alongside an FPGA accelerator.
Subsequently, driven by a need in our hardware platform structure, we designed and fabricated the Gluon board, which uses the serial transceivers of the Xilinx UltraScale+ heterogeneous accelerator and exposes its GTH transceivers for high-rate data transfer applications. Gluon boards are modular and can carry up to 18 Gbps on each lane with specific data types and payload sizes; the end-user cost to manufacture a Gluon board is less than 400 euros. Moreover, we demonstrate a distributed graph processing application to exercise the distributed computing execution model and to extend our execution model to real-world workloads such as large-scale graph processing. As a first step, we provide a comprehensive baseline, then design and propose a large-scale distributed graph processing application and evaluate it with the PageRank algorithm on well-known datasets. We show how graph partitioning combined with a multi-FPGA architecture leads to higher performance without limitation on the size of the graph, even when the graph has trillions of vertices. Our performance analysis, in the case of PageRank, forecasts a performance improvement of up to 20 times and a cost-normalized improvement of up to 12 times when comparing the proposed approach on one Xilinx Alveo U250 FPGA accelerator against a state-of-the-art baseline graph processing software implementation on a 40-core Intel Xeon server processor at 2.50 GHz.
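The blocked matrix multiplication benchmark mentioned above can be sketched as follows. This is a plain sequential sketch with an invented signature; in DF-Threads each block computation would instead be a dataflow thread fired once its input blocks are available:

```python
def blocked_matmul(A, B, n, bs):
    # Multiply two n x n matrices (lists of lists) in bs x bs blocks.
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, n, bs):
            # One block task: in a dataflow runtime this unit of work
            # would fire when its input blocks of A and B are ready.
            for k0 in range(0, n, bs):
                for i in range(i0, min(i0 + bs, n)):
                    for j in range(j0, min(j0 + bs, n)):
                        s = 0.0
                        for k in range(k0, min(k0 + bs, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] += s
    return C

I = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
M = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
print(blocked_matmul(I, M, 4, 2) == M)  # True
```

The blocks over (i0, j0) are mutually independent, which is exactly what makes this kernel a natural stress test for a dataflow-threads runtime.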
2

Maybodi, Farnam Khalili. "A Data-Flow Threads Co-Processor for MPSoC FPGA Clusters." Doctoral thesis, 2021. http://hdl.handle.net/2158/1237446.


Conference proceedings on the topic "Dataflow-Threads"

1

Vilim, Matthew, Alexander Rucker, and Kunle Olukotun. "Aurochs: An Architecture for Dataflow Threads." In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2021. http://dx.doi.org/10.1109/isca52012.2021.00039.

2

Pace, Domenico, Krishna Kavi, and Charles Shelor. "MT-SDF: Scheduled Dataflow Architecture with Mini-threads." In 2013 Data-Flow Execution Models for Extreme Scale Computing (DFM). IEEE, 2013. http://dx.doi.org/10.1109/dfm.2013.18.

3

Alves, Tiago A. O., Leandro A. J. Marzulo, Felipe M. G. França, and Vítor Santos Costa. "Trebuchet: Explorando TLP com Virtualização DataFlow." In Simpósio em Sistemas Computacionais de Alto Desempenho. Sociedade Brasileira de Computação, 2009. http://dx.doi.org/10.5753/wscad.2009.17393.

Abstract (translated from Portuguese):
In the dataflow model, instructions execute as soon as their input operands are available, naturally exposing instruction-level parallelism (ILP). On the other hand, exploiting thread-level parallelism (TLP) has also become a major factor in improving application performance on multicore machines. This work proposes a program execution model, based on dataflow architectures, that turns ILP into TLP. The model is demonstrated through the implementation of a multi-threaded virtual machine, Trebuchet. The application is compiled to the dataflow model, and its independent instructions (according to the data flow) are executed on distinct Processing Elements (PEs) of Trebuchet. Each PE is mapped to a thread on the host machine. The model allows the definition of instruction blocks of different granularities, whose firing is guided by the data flow and whose execution runs directly on the host machine to reduce interpretation costs. Since synchronization is obtained from the dataflow model, no locks or barriers need to be introduced into the programs being parallelized. A set of three reduced benchmarks, compiled into eight threads and executed on a quad-core Intel Core i7 920 processor, allowed us to evaluate: (i) the operation of the model; (ii) the versatility of defining instructions with different granularities (blocks); (iii) a comparison with OpenMP. Speedups of 4.81, 2.4, and 4.03 were achieved over the sequential version, while speedups of 1.11, 1.3, and 1.0 were obtained over OpenMP.
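The ILP-to-TLP idea — an instruction fires as soon as all of its operands arrive, so independent instructions can execute concurrently without explicit locks or barriers — can be sketched as follows. The names (`Instr`, `send`, `execute`) are hypothetical; this is not Trebuchet's implementation:

```python
import operator

class Instr:
    def __init__(self, op, n_operands, targets):
        self.op = op                    # function applied to the operands
        self.operands = [None] * n_operands
        self.missing = n_operands       # operands not yet received
        self.targets = targets          # (instr, slot) pairs fed by the result

def send(instr, slot, value, ready):
    instr.operands[slot] = value
    instr.missing -= 1
    if instr.missing == 0:              # all operands present: instruction fires
        ready.append(instr)

def execute(ready):
    result = None
    while ready:
        instr = ready.pop()
        result = instr.op(*instr.operands)
        for tgt, slot in instr.targets:
            send(tgt, slot, result, ready)
    return result

# (2 + 3) * (2 - 3): the add and the sub are independent and, in a
# Trebuchet-like VM, could run on distinct processing elements (threads).
mul = Instr(operator.mul, 2, [])
add = Instr(operator.add, 2, [(mul, 0)])
sub = Instr(operator.sub, 2, [(mul, 1)])
ready = []
send(add, 0, 2, ready); send(add, 1, 3, ready)
send(sub, 0, 2, ready); send(sub, 1, 3, ready)
print(execute(ready))  # -5
```

The ready list is the only shared state; in a multi-threaded version it would become a work queue from which each host thread pops fireable instructions.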
4

Gonçalves, Ronaldo A. L., Rafael L. Sagula, Tiarajú A. Diverio, and Philippe O. A. Navaux. "Process Prefetching for a Simultaneous Multithreaded Architecture." In International Symposium on Computer Architecture and High Performance Computing. Sociedade Brasileira de Computação, 1999. http://dx.doi.org/10.5753/sbac-pad.1999.19772.

Abstract:
Traditional superscalar architectures will eventually prove incapable of taking full advantage of the billions of transistors available in future generations of microprocessors if they remain limited by dataflow dependencies. SMT (Simultaneous Multithreaded) architectures may be a possible solution to this problem, as they can fetch and execute many instruction flows while hiding both high-latency operations and data dependencies. But this capability of an SMT architecture depends on the existence of multithreaded applications and on an effective instruction-fetching mechanism that guarantees the presence of ready threads in the L1 i-cache to be used across context switches. SEMPRE (Superscalar Execution of Multiple PRocEsses) is an SMT architecture that makes use of the many processes found in today's operating systems to supply instructions to its SMT pipeline. This paper proposes and evaluates an effective mechanism that prefetches instructions from waiting processes in order to guarantee adequate context switching. An analytical model of the mechanism was developed using DSPN (Deterministic and Stochastic Petri Nets), and the results show that its use improves the dispatch width by 25% when realistic parameters are used. The mechanism reduces the problem of cache degradation (present in many SMT architectures) and tolerates L2 delays of up to 9 cycles in some cases without loss of performance.
