Academic literature on the topic 'Dataflow-Threads'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Dataflow-Threads.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Journal articles on the topic "Dataflow-Threads"

1

Lamport, Leslie. "Implementing dataflow with threads." Distributed Computing 21, no. 3 (July 29, 2008): 163–81. http://dx.doi.org/10.1007/s00446-008-0065-1.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Satoh, Shigehisa, Kazuhiro Kusano, and Mitsuhisa Sato. "Compiler Optimization Techniques for OpenMP Programs." Scientific Programming 9, no. 2-3 (2001): 131–42. http://dx.doi.org/10.1155/2001/189054.

Full text
Abstract:
We have developed compiler optimization techniques for explicit parallel programs using the OpenMP API. To enable optimization across threads, we designed dataflow analysis techniques in which interactions between threads are effectively modeled. Structured description of parallelism and relaxed memory consistency in OpenMP make the analyses effective and efficient. We developed algorithms for reaching definitions analysis, memory synchronization analysis, and cross-loop data dependence analysis for parallel loops. Our primary target is compiler-directed software distributed shared memory systems in which aggressive compiler optimizations for software-implemented coherence schemes are crucial to obtaining good performance. We also developed optimizations applicable to general OpenMP implementations, namely redundant barrier removal and privatization of dynamically allocated objects. Experimental results for the coherency optimization show that aggressive compiler optimizations are quite effective for a shared-write intensive program because the coherence-induced communication volume in such a program is much larger than that in shared-read intensive programs.
APA, Harvard, Vancouver, ISO, and other styles
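As a side note for readers unfamiliar with the optimization, the redundant barrier removal that Satoh et al. describe can be illustrated with a small OpenMP fragment. The code below is a hand-written sketch, not code from the paper: the explicit barrier is unnecessary because the worksharing loop before it already ends with an implicit barrier, so a compiler analysis of the kind the authors develop could safely remove it.

```c
/* Sketch (not from the paper): a worksharing loop in OpenMP ends with an
 * implicit barrier, so the explicit barrier that follows it is redundant
 * and can be removed by cross-thread dataflow analysis. */
#include <stdio.h>

#define N 1024

int main(void) {
    static double a[N], b[N];

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < N; i++)
            a[i] = i * 0.5;              /* first parallel loop */

        /* The implicit barrier at the end of the loop above already
         * guarantees that all of a[] has been written ... */
        #pragma omp barrier              /* ... so this barrier is redundant. */

        #pragma omp for
        for (int i = 0; i < N; i++)
            b[i] = 2.0 * a[i];           /* reads a[] written by other threads */
    }

    printf("b[N-1] = %f\n", b[N - 1]);
    return 0;
}
```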
3

Thornton, Mitchell. "Performance Evaluation of a Parallel Decoupled Data Driven Multiprocessor." Parallel Processing Letters 13, no. 3 (September 2003): 497–507. http://dx.doi.org/10.1142/s0129626403001458.

Full text
Abstract:
The Decoupled Data-Driven (D3) architecture has shown promising results from performance evaluations based upon deterministic simulations. This paper provides performance evaluations of the D3 architecture through the formulation and analysis of a stochastic model. The D3 architecture is a hybrid control/dataflow approach that takes advantage of the inherent parallelism present in a program by dynamically scheduling program threads based on data availability, and it also takes advantage of locality through the use of conventional processing elements that execute the program threads. The model is validated by comparing the deterministic and stochastic model responses. After model validation, input parameters such as the number of available processing elements and the average thread length are varied, and the performance of the architecture is evaluated. The stochastic model is based upon a closed queueing network and utilizes the concepts of available parallelism and virtual queues in order to be reduced to a Markovian system. Experiments with varying computation-engine thread lengths and communication latencies indicate a high degree of tolerance with respect to exploited parallelism.
APA, Harvard, Vancouver, ISO, and other styles
4

Lohstroh, Marten, Christian Menard, Soroush Bateni, and Edward A. Lee. "Toward a Lingua Franca for Deterministic Concurrent Systems." ACM Transactions on Embedded Computing Systems 20, no. 4 (June 2021): 1–27. http://dx.doi.org/10.1145/3448128.

Full text
Abstract:
Many programming languages and programming frameworks focus on parallel and distributed computing. Several frameworks are based on actors, which provide a more disciplined model for concurrency than threads. The interactions between actors, however, if not constrained, admit nondeterminism. As a consequence, actor programs may exhibit unintended behaviors and are less amenable to rigorous testing. We show that nondeterminism can be handled in a number of ways, surveying dataflow dialects, process networks, synchronous-reactive models, and discrete-event models. These existing approaches, however, tend to require centralized control, pose challenges to modular system design, or introduce a single point of failure. We describe “reactors,” a new coordination model that combines ideas from several of these approaches to enable determinism while preserving much of the style of actors. Reactors promote modularity and allow for distributed execution. By using a logical model of time that can be associated with physical time, reactors also provide control over timing. Reactors also expose parallelism that can be exploited on multicore machines and in distributed configurations without compromising determinacy.
APA, Harvard, Vancouver, ISO, and other styles
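To give a flavor of the discrete-event idea that underlies reactors (reactions are processed in logical-time order, which is what yields determinism), here is a minimal C sketch. The event structure, names, and queue are invented for illustration and are not the Lingua Franca runtime API.

```c
/* Minimal sketch of deterministic discrete-event processing: events carry a
 * logical timestamp and are handled strictly in timestamp order. Invented
 * for illustration; this is not the Lingua Franca / reactor runtime API. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    long long tag;            /* logical time of the event */
    const char *payload;      /* message carried by the event */
} event_t;

static int by_tag(const void *x, const void *y) {
    const event_t *a = x, *b = y;
    return (a->tag > b->tag) - (a->tag < b->tag);
}

int main(void) {
    /* Events may be created out of order (e.g., by different components)... */
    event_t queue[] = {
        { 30, "sensor reading" },
        { 10, "startup"        },
        { 20, "actuator cmd"   },
    };
    size_t n = sizeof queue / sizeof queue[0];

    /* ...but they are always dispatched in logical-time order, so every
     * run produces the same sequence of reactions. */
    qsort(queue, n, sizeof queue[0], by_tag);
    for (size_t i = 0; i < n; i++)
        printf("t=%lld: react to \"%s\"\n", queue[i].tag, queue[i].payload);

    return 0;
}
```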
5

Tatas, Konstantinos, Costas Kyriacou, Paraskevas Evripidou, Pedro Trancoso, and Stephan Wong. "Rapid Prototyping of the Data-Driven Chip-Multiprocessor (D2-CMP) using FPGAs." Parallel Processing Letters 18, no. 02 (June 2008): 291–306. http://dx.doi.org/10.1142/s0129626408003399.

Full text
Abstract:
This paper presents the FPGA implementation of the prototype for the Data-Driven Chip-Multiprocessor (D2-CMP). In particular, we study the implementation of a Thread Synchronization Unit (TSU) on FPGA, a hardware unit that enables thread execution using dataflow-like scheduling policy on a chip multiprocessor. Threads are scheduled for execution based on data availability, i.e., a thread is scheduled for execution only if its input data is available. This model of execution is called the non-blocking Data-Driven Multithreading (DDM) model of execution. The DDM model has been evaluated using an execution driven simulator. To validate the simulation results, a 2-node DDM chip multiprocessor has been implemented on a Xilinx Virtex-II Pro FPGA with two PowerPC processors hardwired on the FPGA. Measurements on the hardware prototype show that the TSU can be implemented with a moderate hardware budget. The 2-node multiprocessor has been implemented with less than half of the reconfigurable hardware available on the Xilinx Virtex-II Pro FPGA (45% slices), which corresponds to an ASIC equivalent gate count of 1.9 million gates. Measurements on the prototype showed that the delays incurred by the operation of the TSU can be tolerated.
APA, Harvard, Vancouver, ISO, and other styles
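The firing rule enforced by the TSU, namely that a thread is scheduled only once all of its inputs have arrived, can be mimicked in software with a simple ready count. The sketch below is an invented illustration of that DDM-style rule, not the D2-CMP hardware interface.

```c
/* Invented illustration of data-driven (DDM-style) thread scheduling:
 * each thread template holds a count of missing inputs, and the thread is
 * handed to a processor only when that count reaches zero. Software sketch
 * only; the actual TSU implements this in hardware. */
#include <stdio.h>

typedef struct {
    const char *name;
    int missing_inputs;        /* synchronization count */
    void (*body)(void);        /* code executed once the thread fires */
} dthread_t;

static void consumer_body(void) { puts("consumer thread runs"); }

static dthread_t consumer = { "consumer", 2, consumer_body };

/* Called whenever a producer delivers one of the thread's input tokens. */
static void token_arrived(dthread_t *t) {
    if (--t->missing_inputs == 0) {
        printf("%s is ready, scheduling it\n", t->name);
        t->body();             /* the TSU would enqueue it for a processor */
    }
}

int main(void) {
    token_arrived(&consumer);  /* first operand arrives: not ready yet */
    token_arrived(&consumer);  /* second operand arrives: thread fires  */
    return 0;
}
```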
6

Halappanavar, Mahantesh, John Feo, Oreste Villa, Antonino Tumeo, and Alex Pothen. "Approximate weighted matching on emerging manycore and multithreaded architectures." International Journal of High Performance Computing Applications 26, no. 4 (August 9, 2012): 413–30. http://dx.doi.org/10.1177/1094342012452893.

Full text
Abstract:
Graph matching is a prototypical combinatorial problem with many applications in high-performance scientific computing. Optimal algorithms for computing matchings are challenging to parallelize. Approximation algorithms are amenable to parallelization and are therefore important to compute matchings for large-scale problems. Approximation algorithms also generate nearly optimal solutions that are sufficient for many applications. In this paper we present multithreaded algorithms for computing half-approximate weighted matching on state-of-the-art multicore (Intel Nehalem and AMD Magny-Cours), manycore (Nvidia Tesla and Nvidia Fermi), and massively multithreaded (Cray XMT) platforms. We provide two implementations: the first uses shared work queues and is suited for all platforms; and the second implementation, based on dataflow principles, exploits special features available on the Cray XMT. Using a carefully chosen dataset that exhibits characteristics from a wide range of applications, we show scalable performance across different platforms. In particular, for one instance of the input, an R-MAT graph (RMAT-G), we show speedups of about [Formula: see text] on [Formula: see text] cores of an AMD Magny-Cours, [Formula: see text] on [Formula: see text] cores of Intel Nehalem, [Formula: see text] on Nvidia Tesla and [Formula: see text] on Nvidia Fermi relative to one core of Intel Nehalem, and [Formula: see text] on [Formula: see text] processors of Cray XMT. We demonstrate strong as well as weak scaling for graphs with up to a billion edges using up to 12,800 threads. We avoid excessive fine-tuning for each platform and retain the basic structure of the algorithm uniformly across platforms. An exception is the dataflow algorithm designed specifically for the Cray XMT. To the best of the authors' knowledge, this is the first such large-scale study of the half-approximate weighted matching problem on multithreaded platforms. Driven by the critical enabling role of combinatorial algorithms such as matching in scientific computing and the emergence of informatics applications, there is a growing demand to support irregular computations on current and future computing platforms. In this context, we evaluate the capability of emerging multithreaded platforms to tolerate latency induced by irregular memory access patterns, and to support fine-grained parallelism via light-weight synchronization mechanisms. By contrasting the architectural features of these platforms against the Cray XMT, which is specifically designed to support irregular memory-intensive applications, we delineate the impact of these choices on performance.
APA, Harvard, Vancouver, ISO, and other styles
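Half-approximate weighted matching algorithms in this line of work are typically built on locally dominant edges: each free vertex points to its heaviest free neighbor, and an edge is matched when its two endpoints point at each other. The sequential C sketch below illustrates that idea under this assumption; it is not the authors' multithreaded code, which runs the same pattern concurrently across many threads.

```c
/* Sequential sketch of 1/2-approximate weighted matching via locally
 * dominant edges: each free vertex picks its heaviest free neighbor, and an
 * edge is matched when the two endpoints pick each other. */
#include <stdio.h>

#define V 4

/* Small symmetric weight matrix; 0 means "no edge". */
static const double w[V][V] = {
    { 0, 5, 1, 0 },
    { 5, 0, 2, 3 },
    { 1, 2, 0, 4 },
    { 0, 3, 4, 0 },
};

int main(void) {
    int mate[V], cand[V];
    for (int v = 0; v < V; v++) mate[v] = -1;

    for (int changed = 1; changed; ) {
        changed = 0;
        /* Phase 1: every free vertex picks its heaviest free neighbor. */
        for (int v = 0; v < V; v++) {
            cand[v] = -1;
            if (mate[v] != -1) continue;
            double best = 0.0;
            for (int u = 0; u < V; u++)
                if (mate[u] == -1 && w[v][u] > best) { best = w[v][u]; cand[v] = u; }
        }
        /* Phase 2: match pairs that chose each other (locally dominant edges). */
        for (int v = 0; v < V; v++)
            if (mate[v] == -1 && cand[v] != -1 &&
                cand[cand[v]] == v && mate[cand[v]] == -1) {
                mate[v] = cand[v];
                mate[cand[v]] = v;
                changed = 1;
            }
    }

    for (int v = 0; v < V; v++)
        if (mate[v] > v)
            printf("matched (%d, %d) with weight %.1f\n", v, mate[v], w[v][mate[v]]);
    return 0;
}
```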

Dissertations / Theses on the topic "Dataflow-Threads"

1

Sahebi, Amin. "Reconfigurable Architectures for Accelerating Distributed Applications, "A Graph Processing Application Case Study"." Doctoral thesis, 2022. http://hdl.handle.net/2158/1271765.

Full text
Abstract:
This thesis focuses on state-of-the-art challenges of distributed execution models and on system support for artificial intelligence and high-performance computing applications. In this context, we investigate in detail the co-design of the Dataflow-Threads execution model. Moreover, to facilitate the support, development, and debugging of the Dataflow-Threads execution model, we introduce DRT, a lightweight dataflow runtime. DRT is written in portable C code (tested with the GNU C compiler) and is open-source. It can be used on real machines based on architectures such as x86, AArch, and RISC-V. Furthermore, we consider major problematic applications in the domains of Artificial Intelligence (AI) and High Performance Computing (HPC) and address the main challenges and bottlenecks in order to extend our dataflow runtime. To do this, we use widely known benchmarks to stress the capabilities of the DF-Threads execution model and to evaluate it against other parallel programming models. We choose Blocked Matrix Multiplication and Recursive Fibonacci. Matrix multiplication is one of the main kernels of AI and HPC applications, and Recursive Fibonacci is a simple benchmark that creates a large number of threads and processes and stresses the entire execution model. In this thesis we are mainly interested in heterogeneous platforms. A heterogeneous platform is a hardware device that contains a range of computing components, such as multicore CPUs, GPUs, or FPGAs. Heterogeneous systems are flexible, cost-efficient, and well supported by their communities, which has made them attractive for state-of-the-art research. Our work focuses mainly on CPU+FPGA heterogeneous systems, mostly a general-purpose CPU (x86 or ARM) running a Unix-based operating system alongside an FPGA accelerator. Subsequently, to meet a need in our hardware platform, we design and fabricate the Gluon board, which uses the serial GTH transceivers of a Xilinx UltraScale+ heterogeneous accelerator for high-rate data transfer applications. Gluon boards are modular and can carry up to 18 Gbps on each lane with specific data types and payload sizes, and the end-user cost to manufacture a Gluon board is less than 400 euros. Moreover, we demonstrate a distributed graph-processing application to exercise the distributed execution model and to extend it to real-world, large-scale graph processing. As a first step, we provide a comprehensive baseline, design a large-scale distributed graph-processing application, and evaluate it with the PageRank algorithm on well-known datasets. We show how graph partitioning combined with a multi-FPGA architecture leads to higher performance without limitation on the size of the graph, even when the graph has trillions of vertices. Our performance analysis, in the case of PageRank, forecasts a performance improvement of up to 20 times and a cost-normalized improvement of up to 12 times when comparing the proposed approach on one Xilinx Alveo U250 FPGA accelerator against a state-of-the-art baseline graph-processing software implementation on an Intel Xeon server with a 40-core processor at 2.50 GHz.
APA, Harvard, Vancouver, ISO, and other styles
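Blocked matrix multiplication, one of the two benchmarks the thesis uses to stress DF-Threads, has a well-known loop structure; the plain sequential C sketch below shows that generic kernel and is not code from DRT. In a dataflow-threads setting, each tile update would become a thread that fires once its input tiles are available.

```c
/* Generic blocked (tiled) matrix multiplication, the kind of kernel used as
 * a benchmark in this line of work. Plain sequential C sketch, not DRT code. */
#include <stdio.h>

#define N  8      /* matrix dimension (kept tiny for the example) */
#define BS 4      /* block (tile) size; must divide N */

static double A[N][N], B[N][N], C[N][N];

int main(void) {
    /* Initialize A to all ones and B to the identity, so C should equal A. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 1.0;
            B[i][j] = (i == j) ? 1.0 : 0.0;
        }

    /* Outer loops walk over tiles; the inner loops multiply one tile pair. */
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++)
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += A[i][k] * B[k][j];

    printf("C[0][0] = %.1f (expected 1.0)\n", C[0][0]);
    return 0;
}
```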
2

Maybodi, Farnam Khalili. "A Data-Flow Threads Co-Processor for MPSoC FPGA Clusters." Doctoral thesis, 2021. http://hdl.handle.net/2158/1237446.

Full text
APA, Harvard, Vancouver, ISO, and other styles

Conference papers on the topic "Dataflow-Threads"

1

Vilim, Matthew, Alexander Rucker, and Kunle Olukotun. "Aurochs: An Architecture for Dataflow Threads." In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2021. http://dx.doi.org/10.1109/isca52012.2021.00039.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Pace, Domenico, Krishna Kavi, and Charles Shelor. "MT-SDF: Scheduled Dataflow Architecture with Mini-threads." In 2013 Data-Flow Execution Models for Extreme Scale Computing (DFM). IEEE, 2013. http://dx.doi.org/10.1109/dfm.2013.18.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Alves, Tiago A. O., Leandro A. J. Marzulo, Felipe M. G. França, and Vítor Santos Costa. "Trebuchet: Explorando TLP com Virtualização DataFlow." In Simpósio em Sistemas Computacionais de Alto Desempenho. Sociedade Brasileira de Computação, 2009. http://dx.doi.org/10.5753/wscad.2009.17393.

Full text
Abstract:
In the DataFlow model, instructions are executed as soon as their input operands become available, naturally exposing instruction-level parallelism (ILP). On the other hand, exploiting thread-level parallelism (TLP) has also become a major factor in improving the performance of applications on multicore machines. This work proposes a program execution model, based on DataFlow architectures, that turns ILP into TLP. The model is demonstrated through the implementation of a multi-threaded virtual machine, Trebuchet. The application is compiled to the DataFlow model, and its independent instructions (according to the data flow) are executed on distinct Processing Elements (PEs) of Trebuchet. Each PE is mapped to a thread on the host machine. The model allows the definition of instruction blocks of different granularities, whose firing is driven by the data flow and which execute directly on the host machine to reduce interpretation costs. Since synchronization is obtained from the DataFlow model, no locks or barriers need to be introduced into the programs being parallelized. A set of three reduced benchmarks, compiled into eight threads and executed on a quad-core Intel Core i7 920 processor, allowed us to evaluate: (i) the operation of the model; (ii) the versatility of defining instructions with different granularities (blocks); (iii) a comparison with OpenMP. Speedups of 4.81, 2.4, and 4.03 were achieved over the sequential version, while speedups of 1.11, 1.3, and 1.0 were obtained over OpenMP.
APA, Harvard, Vancouver, ISO, and other styles
4

Gonçalves, Ronaldo A. L., Rafael L. Sagula, Tiarajú A. Diverio, and Philippe O. A. Navaux. "Process Prefetching for a Simultaneous Multithreaded Architecture." In International Symposium on Computer Architecture and High Performance Computing. Sociedade Brasileira de Computação, 1999. http://dx.doi.org/10.5753/sbac-pad.1999.19772.

Full text
Abstract:
Traditional superscalar architectures shall eventually prove incapable of taking full advantage of the billions of transistors to be available in future generations of microprocessors if they remain limited by dataflow dependencies. Thus, SMT (Simultaneous Multithreaded) architecture may be a possible solution to this problem, as far as it can fetch and execute a great number of instruction flows while at the same time hiding both high-latency operations and data dependencies. But this capability of SMT architecture depends on the existence of multithreaded applications and on an effective instruction-fetching mechanism that guarantees the presence of ready threads in the L1 i-cache to be used throughout context switching. SEMPRE (Superscalar Execution of Multiple PRocEsses) is a type of SMT architecture which makes use of the various processes found in today's operating systems to supply instructions to its SMT pipeline. This paper proposes and evaluates an effective mechanism that prefetches instructions from waiting processes in order to guarantee adequate context switching. An analytical model of this mechanism was developed using DSPN (Deterministic and Stochastic Petri Nets), and the results show that its use improves the dispatch width by 25% when realistic parameters are used. This method reduces the problem of cache degradation (present on many SMT architectures) and tolerates L2 delays of up to 9 cycles in some cases without loss of performance.
APA, Harvard, Vancouver, ISO, and other styles