Journal articles on the topic 'Execution thread'




Consult the top 50 journal articles for your research on the topic 'Execution thread.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of each academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles in a wide variety of disciplines and organise your bibliography correctly.

1

Chen, Caisen, Yangxia Xiang, Yuqin DengLiu, and Zeyun Zhou. "Research on Cache Timing Attack Against RSA with Sliding Window Exponentiation Algorithm." International Journal of Interdisciplinary Telecommunications and Networking 8, no. 2 (April 2016): 88–95. http://dx.doi.org/10.4018/ijitn.2016040108.

Abstract:
This paper analyzes vulnerabilities in insecure implementations of the RSA cryptographic algorithm. Because simultaneous multithreading lets multiple execution threads, such as a cipher process and a spy process, share the execution resources of a superscalar processor, shared access to memory caches provides an easily used, high-bandwidth covert channel between threads, allowing a malicious thread to monitor the execution of another thread. The paper targets the OpenSSL implementation of RSA that uses the sliding window exponentiation algorithm: the attacker monitors the cryptographic thread by running a spy thread that records cache-read timing characteristics during RSA decryption, and can then recover the original key by analyzing these timing measurements. Finally, the authors propose countermeasures by which this attack could be mitigated or eliminated entirely.
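The timing leak described in this abstract stems from key-dependent operations in the exponentiation loop. As a minimal illustration (binary square-and-multiply rather than the paper's sliding-window variant, and with toy numbers rather than RSA-sized keys), the sketch below shows how each set exponent bit triggers an extra multiply, which is exactly the kind of data-dependent work a cache-timing spy can observe:

```java
import java.math.BigInteger;

public class ModExp {
    // Left-to-right binary square-and-multiply.
    // The extra multiply on set exponent bits is key-dependent work;
    // window-based variants leak similarly through key-dependent table lookups.
    static BigInteger modPow(BigInteger base, BigInteger exp, BigInteger mod) {
        BigInteger result = BigInteger.ONE;
        for (int i = exp.bitLength() - 1; i >= 0; i--) {
            result = result.multiply(result).mod(mod);   // always: square
            if (exp.testBit(i)) {
                result = result.multiply(base).mod(mod); // only on 1-bits: multiply
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BigInteger b = new BigInteger("123456789");
        BigInteger e = new BigInteger("65537");
        BigInteger m = new BigInteger("1000000007");
        // Cross-check the sketch against the library implementation.
        System.out.println(modPow(b, e, m).equals(b.modPow(e, m)));
    }
}
```

Constant-time implementations avoid the leak by performing the same memory accesses and multiplies regardless of the key bits.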
2

Hamidi, Beqir, and Lindita Hamidi. "Synchronization Possibilities and Features in Java." European Journal of Interdisciplinary Studies 1, no. 1 (April 30, 2015): 75. http://dx.doi.org/10.26417/ejis.v1i1.p75-84.

Abstract:
In this paper we discuss one of the great features of the general-purpose programming language Java: its support for synchronization. Today's operating systems support multitasking, the execution of more than one task at the same time; tasks run on threads. Multitasking, doing many things at once, is one of the most fundamental concepts in computer engineering and computer science, because the processor executes the given tasks in parallel so that they appear to run simultaneously. Multitasking is related to other fundamental concepts such as processes and threads. A process is a computer program that is executing on a processor, while a thread is a part of a process with its own path of execution: a thread of execution. Every process has at least one thread of execution. There are two types of multitasking: process-based and thread-based. Process-based multitasking means that on a given computer more than one program or process can be executing, while thread-based multitasking, also known as multithreading, means that within a process there can be more than one thread of execution, each doing a job and so accomplishing the job of its process. When there are many processes, or many threads within processes, they may have to cooperate with each other or may concurrently try to access shared computer resources such as the processor, memory, and input/output devices. They may, for example, print a file on a printer or write to and/or read from the same file. We need a way of setting an order in which processes and/or threads can do their jobs without any problem: we need to synchronize them.

Java has built-in support for process and thread synchronization, providing constructs that we can use when we need synchronization. The first part of this paper discusses the concept of parallel programming and threads: how to create a thread, how to use a thread, and how to work with more than one thread. The second part is about synchronization, first in general and then the synchronization possibilities and features in Java.
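The most basic of the Java constructs the paper surveys is the `synchronized` keyword. A minimal sketch (a shared counter incremented by two threads; without synchronization the lost-update race would usually make the final count smaller):

```java
public class SyncCounter {
    private int count = 0;

    // synchronized makes the read-modify-write atomic with respect to
    // other synchronized methods on the same object.
    public synchronized void increment() { count++; }
    public synchronized int get() { return count; }

    public static void main(String[] args) throws InterruptedException {
        SyncCounter c = new SyncCounter();
        Runnable task = () -> { for (int i = 0; i < 100_000; i++) c.increment(); };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(c.get()); // 200000 with synchronization
    }
}
```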
3

Ramanauskaite, Simona, Asta Slotkiene, Kornelija Tunaityte, Ivan Suzdalev, Andrius Stankevicius, and Saulius Valentinavicius. "Reducing WCET Overestimations in Multi-Thread Loops with Critical Section Usage." Energies 14, no. 6 (March 21, 2021): 1747. http://dx.doi.org/10.3390/en14061747.

Abstract:
Worst-case execution time (WCET) is an important metric in real-time systems that supports energy-usage modeling and the evaluation of predefined execution-time requirements. While basic timing analysis relies on identifying the execution path and evaluating its length, multi-thread code that uses critical sections brings additional complications and requires estimation of resource-waiting time. In this paper, we address the problem of reducing worst-case execution time overestimation in situations where multiple threads execute loops that use the same critical section in each iteration. Our experiment showed that the standard worst-case execution time does not take into account the proportion between computational and critical sections; we therefore propose a new worst-case execution time calculation model to reduce the overestimation. The proposed model reduces the overestimation on average by half in comparison with the theoretical model, leading to more accurate execution time and energy consumption estimates.
4

Karasik, O. N., and A. A. Prihozhy. "ADVANCED SCHEDULER FOR COOPERATIVE EXECUTION OF THREADS ON MULTI-CORE SYSTEM." «System analysis and applied information science», no. 1 (May 4, 2017): 4–11. http://dx.doi.org/10.21122/2309-4923-2017-1-4-11.

Abstract:
Three architectures of a cooperative thread scheduler for a multithreaded application executed on a multi-core system are considered. Architecture A0 is based on the synchronization and scheduling facilities provided by the operating system. Architecture A1 introduces a new synchronization primitive and a single queue of blocked threads in the scheduler, which reduces the interaction between the threads and the operating system and significantly speeds up blocking and unblocking of threads. Architecture A2 replaces the single queue of blocked threads with dedicated queues, one for each synchronization primitive, extends the number of internal states of the primitive, reduces the interdependence of the scheduling threads, and further speeds up blocking and unblocking. All scheduler architectures are implemented on Windows operating systems on top of User Mode Scheduling. Experimental results are obtained for multithreaded applications that implement two blocked parallel algorithms for solving systems of linear algebraic equations by Gaussian elimination; the algorithms differ in how data are distributed among threads and in their thread synchronization models. The number of threads varied from 32 to 7936. For the blocked parallel algorithms computing the triangular form and performing the back substitution, architecture A1 shows an acceleration of up to 8.65% and architecture A2 of up to 11.98% compared to architecture A0. On the back substitution stage alone, A1 gives an acceleration of up to 125% and A2 of up to 413% compared to A0.
The experiments clearly show that the proposed architectures A1 and A2 outperform A0 depending on the number of thread blocking and unblocking operations that happen during the execution of a multithreaded application. The conducted computational experiments demonstrate improved parameters of multithreaded applications on a heterogeneous multi-core system due to the proposed advanced versions of the thread scheduler.
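The paper's schedulers are built on Windows User Mode Scheduling, but the core idea of a user-level ready queue can be sketched in plain Java (a hypothetical, single-threaded round-robin executor, not the authors' implementation): tasks yield control cooperatively, and the scheduler picks the next runnable task from its own queue rather than asking the operating system.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class CoopScheduler {
    // Each "thread" is modeled as a queue of named steps; the scheduler runs
    // one step, then requeues the task: cooperative, user-level round-robin.
    static List<String> run(List<Deque<String>> tasks) {
        Deque<Deque<String>> ready = new ArrayDeque<>(tasks);
        List<String> trace = new ArrayList<>();
        while (!ready.isEmpty()) {
            Deque<String> task = ready.pollFirst();   // pick next runnable task
            trace.add(task.pollFirst());              // execute one step, then yield
            if (!task.isEmpty()) ready.addLast(task); // requeue if not finished
        }
        return trace;
    }

    public static void main(String[] args) {
        Deque<String> a = new ArrayDeque<>(List.of("a1", "a2"));
        Deque<String> b = new ArrayDeque<>(List.of("b1", "b2"));
        System.out.println(run(List.of(a, b))); // [a1, b1, a2, b2]
    }
}
```

A blocked-thread queue, as in architectures A1 and A2, would simply be a second user-level queue from which tasks move back to `ready` when their primitive is signaled.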
5

Metzler, Patrick, Neeraj Suri, and Georg Weissenbacher. "Extracting safe thread schedules from incomplete model checking results." International Journal on Software Tools for Technology Transfer 22, no. 5 (June 26, 2020): 565–81. http://dx.doi.org/10.1007/s10009-020-00575-y.

Abstract:
Model checkers frequently fail to completely verify a concurrent program, even if partial-order reduction is applied. The verification engineer is left in doubt whether the program is safe and the effort toward verifying the program is wasted. We present a technique that uses the results of such incomplete verification attempts to construct a (fair) scheduler that allows the safe execution of the partially verified concurrent program. This scheduler restricts the execution to schedules that have been proven safe (and prevents executions that were found to be erroneous). We evaluate the performance of our technique and show how it can be improved using partial-order reduction. While constraining the scheduler results in a considerable performance penalty in general, we show that in some cases our approach, somewhat surprisingly, even leads to faster executions.
6

YONG, XIE, and HSU WEN-JING. "ALIGNED MULTITHREADED COMPUTATIONS AND THEIR SCHEDULING WITH PERFORMANCE GUARANTEES." Parallel Processing Letters 13, no. 03 (September 2003): 353–64. http://dx.doi.org/10.1142/s0129626403001331.

Abstract:
This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than required for a single-processor execution. Earlier research in the Cilk project proposed the "strict" computational model, in which every dependency goes from a thread x only to one of x's ancestor threads, and guaranteed both linear speedup and linear expansion of space. However, Cilk threads are stateless, the task graphs that the Cilk language expresses are series-parallel graphs, a proper subset of arbitrary task graphs, and Cilk does not support applications with pipelining. We propose the "aligned" multithreaded computational model, which extends the "strict" model of Cilk: dependencies can go from an arbitrary thread x not only to x's ancestor threads but also to x's younger brother threads, those spawned by x's parent thread after x. We use the same measures of time and space as Cilk: T1 is the time required to execute the computation on 1 processor, T∞ is the time required by an infinite number of processors, and S1 is the space required to execute the computation on 1 processor. We show that for any aligned computation there exists an execution schedule that achieves both efficient time and efficient space. Specifically, for an execution of any aligned multithreaded computation on P processors, the time required is bounded by O(T1/P + T∞), and the space required can be loosely bounded by O(λ·S1P), where λ is the maximum number of younger brother threads that share the same parent thread and can be blocked during execution. If we assume that λ is a constant and that the space requirements of elder and younger brother threads are the same, then the space required is bounded by O(S1P).
We further show that the aligned multithreaded computational model supports pipelined applications, and we propose a multithreaded programming language and show that it can express arbitrary task graphs.
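Java's fork/join framework follows the same strict discipline the abstract describes as Cilk's starting point: a spawned child's result flows back only to its ancestor via `join`. A minimal sketch:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class FibTask extends RecursiveTask<Long> {
    private final int n;
    FibTask(int n) { this.n = n; }

    @Override
    protected Long compute() {
        if (n < 2) return (long) n;
        FibTask left = new FibTask(n - 1);
        left.fork();                          // spawn a child thread of execution
        long right = new FibTask(n - 2).compute();
        return left.join() + right;           // dependency goes child -> ancestor only
    }

    public static void main(String[] args) {
        System.out.println(new ForkJoinPool().invoke(new FibTask(20)));
    }
}
```

The "aligned" model generalizes this by also allowing a dependency from x to a younger sibling spawned by the same parent, which strict fork/join cannot express directly.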
7

Hirata, Hiroaki, and Atsushi Nunome. "Decoupling Computation and Result Write-Back for Thread-Level Parallelization." International Journal of Software Innovation 8, no. 3 (July 2020): 19–34. http://dx.doi.org/10.4018/ijsi.2020070102.

Abstract:
Thread-level speculation (TLS) is an approach that enhances the opportunities for parallelizing programs. A TLS system enables multiple threads to begin executing tasks in parallel even if there may be dependencies between the tasks. When a dependency violation is detected, the TLS system forces the violating thread to abort and re-execute its task, so the frequency of aborts is one of the factors that harm the performance of speculative execution. This article proposes a new technique, named code shelving, which removes the need for threads to abort. It is useful not only for TLS but also as a flexible synchronization technique in conventional, non-speculative parallel execution. The authors implemented code shelving on their parallel execution system, called Speculative Memory (SM), and verified its effectiveness.
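The abort-and-re-execute behaviour that code shelving is designed to avoid can be illustrated with optimistic concurrency in plain Java (an analogy, not the authors' system): each thread speculatively reads, computes a new value, and retries when a conflicting update is detected, with a failed compare-and-set playing the role of a detected dependence violation.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class OptimisticRetry {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger shared = new AtomicInteger(0);
        Runnable task = () -> {
            for (int i = 0; i < 10_000; i++) {
                while (true) {
                    int seen = shared.get();           // speculative read
                    int computed = seen + 1;           // speculative computation
                    if (shared.compareAndSet(seen, computed)) break; // commit
                    // CAS failed: a conflicting write happened; "abort" and retry
                }
            }
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(shared.get()); // 20000: every update eventually commits
    }
}
```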
8

Tatas, Konstantinos, Costas Kyriacou, Paraskevas Evripidou, Pedro Trancoso, and Stephan Wong. "Rapid Prototyping of the Data-Driven Chip-Multiprocessor (D2-CMP) using FPGAs." Parallel Processing Letters 18, no. 02 (June 2008): 291–306. http://dx.doi.org/10.1142/s0129626408003399.

Abstract:
This paper presents the FPGA implementation of the prototype for the Data-Driven Chip-Multiprocessor (D2-CMP). In particular, we study the implementation of a Thread Synchronization Unit (TSU) on FPGA, a hardware unit that enables thread execution using dataflow-like scheduling policy on a chip multiprocessor. Threads are scheduled for execution based on data availability, i.e., a thread is scheduled for execution only if its input data is available. This model of execution is called the non-blocking Data-Driven Multithreading (DDM) model of execution. The DDM model has been evaluated using an execution driven simulator. To validate the simulation results, a 2-node DDM chip multiprocessor has been implemented on a Xilinx Virtex-II Pro FPGA with two PowerPC processors hardwired on the FPGA. Measurements on the hardware prototype show that the TSU can be implemented with a moderate hardware budget. The 2-node multiprocessor has been implemented with less than half of the reconfigurable hardware available on the Xilinx Virtex-II Pro FPGA (45% slices), which corresponds to an ASIC equivalent gate count of 1.9 million gates. Measurements on the prototype showed that the delays incurred by the operation of the TSU can be tolerated.
9

Sinharoy, Balaram. "Compiler Optimization to Improve Data Locality for Processor Multithreading." Scientific Programming 7, no. 1 (1999): 21–37. http://dx.doi.org/10.1155/1999/235625.

Abstract:
Over the last decade processor speed has increased dramatically, whereas the speed of the memory subsystem has improved at a modest rate. Due to the increase in cache miss latency (in terms of processor cycles), processors stall on cache misses for a significant portion of their execution time. Multithreaded processors have been proposed in the literature to reduce the processor stall time due to cache misses. Although multithreading improves processor utilization, it may also increase cache miss rates, because in a multithreaded processor multiple threads share the same cache, which effectively reduces the cache size available to each individual thread. Increased processor utilization and the increase in the cache miss rate demand higher memory bandwidth. This paper presents a novel compiler optimization method that improves data locality for each thread and enhances data sharing among the threads. The method is based on loop transformation theory and optimizes both spatial and temporal data locality. The created threads exhibit a high level of intra-thread and inter-thread data locality, which effectively reduces both the data cache miss rates and the total execution time of numerically intensive computations running on a multithreaded processor.
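Loop transformations of the kind this paper builds on can be illustrated with tiling (blocking), which reorders iterations so that a small block of data is reused while it is still cache-resident. The sketch below (generic textbook tiling, not the paper's specific transformation) checks that tiling changes only the traversal order, not the result:

```java
public class TiledMatMul {
    static final int N = 64, B = 16; // matrix size and tile (block) size

    static int[][] multiplyTiled(int[][] a, int[][] b) {
        int[][] c = new int[N][N];
        // Iterate over tiles so each B x B block of a and b is reused
        // many times while it is likely still in cache.
        for (int ii = 0; ii < N; ii += B)
            for (int kk = 0; kk < N; kk += B)
                for (int jj = 0; jj < N; jj += B)
                    for (int i = ii; i < ii + B; i++)
                        for (int k = kk; k < kk + B; k++)
                            for (int j = jj; j < jj + B; j++)
                                c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    static int[][] multiplyNaive(int[][] a, int[][] b) {
        int[][] c = new int[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    public static void main(String[] args) {
        int[][] a = new int[N][N], b = new int[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { a[i][j] = i + j; b[i][j] = i - j; }
        // Tiling changes only the iteration order, never the product.
        if (!java.util.Arrays.deepEquals(multiplyTiled(a, b), multiplyNaive(a, b)))
            throw new AssertionError();
        System.out.println("results match");
    }
}
```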
10

Tian, Zhenzhou, Qing Wang, Cong Gao, Lingwei Chen, and Dinghao Wu. "Plagiarism Detection of Multi-threaded Programs Using Frequent Behavioral Pattern Mining." International Journal of Software Engineering and Knowledge Engineering 30, no. 11n12 (November 2020): 1667–88. http://dx.doi.org/10.1142/s0218194020400252.

Abstract:
Software dynamic birthmark techniques construct birthmarks from execution traces captured while running the programs, and are among the most promising methods for obfuscation-resilient software plagiarism detection. However, due to the perturbation caused by non-deterministic thread scheduling in multi-threaded programs, dynamic approaches optimized for sequential programs may suffer from this randomness when detecting plagiarism of multi-threaded programs. In this paper, we propose a new dynamic thread-aware birthmark, FPBirth, to facilitate multi-threaded program plagiarism detection. We first use dynamic monitoring to capture multiple execution traces with respect to system calls for each multi-threaded program under a specified input, and then leverage the Apriori algorithm to mine frequent patterns to formulate our dynamic birthmark, which can not only depict the program's behavioral semantics, but also resist the changes and perturbations over execution traces caused by thread scheduling in multi-threaded programs. Using FPBirth, we design a multi-threaded program plagiarism detection system. The experimental results based on a public software plagiarism sample set demonstrate that the developed system integrating our proposed birthmark FPBirth copes better with multi-threaded plagiarism detection than alternative approaches. Compared against the dynamic System Call Short Sequence Birthmark (SCSSB), FPBirth achieves 12.4%, 4.1% and 7.9% performance improvements with respect to the union of resilience and credibility (URC), F-measure and Matthews correlation coefficient (MCC) metrics, respectively.
11

Ootsu, Kanemitsu, Hirohito Ogawa, Takashi Yokota, and Takanobu Baba. "Program Execution Path-Based Speculative Thread Partitioning." Transactions of the Institute of Systems, Control and Information Engineers 22, no. 6 (2009): 209–19. http://dx.doi.org/10.5687/iscie.22.209.

12

Arandi, Samer, George Matheou, Costas Kyriacou, and Paraskevas Evripidou. "Data-Driven Thread Execution on Heterogeneous Processors." International Journal of Parallel Programming 46, no. 2 (February 8, 2017): 198–224. http://dx.doi.org/10.1007/s10766-016-0486-6.

13

Chen, Yuting. "Platform Independent Analysis of Probabilities on Multithreaded Programs." International Journal of Software Innovation 1, no. 3 (July 2013): 48–65. http://dx.doi.org/10.4018/ijsi.2013070104.

Abstract:
A concurrent program is intuitively associated with probability: the executions of the program can produce nondeterministic execution paths due to the interleavings of threads, and some paths are always executed more frequently than others. An exploration of the probabilities of the execution paths is expected to provide engineers or compilers with support, either at coding time or at compile time, for optimizing the hottest paths. However, it is not easy to analyze the probabilities of a concurrent program statically, because the scheduling of the threads of a concurrent program usually depends on the operating system and the hardware (e.g., the processor) on which the program is executed, which may vary from machine to machine. In this paper the authors propose a platform-independent approach, called ProbPP, for analyzing probabilities on the execution paths of multithreaded programs. The main idea of ProbPP is to calculate the probabilities on the basis of two kinds of probabilities: Primitive Dependent Probabilities (PDPs), representing the control-dependent probabilities among the program statements, and Thread Execution Probabilities (TEPs), representing the probabilities of threads being scheduled to execute. The authors have also conducted two preliminary experiments to evaluate the effectiveness and performance of ProbPP, and the experimental results show that ProbPP can provide engineers with acceptable accuracy.
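The composition of the two probability kinds can be shown with simple arithmetic (hypothetical numbers and a deliberately simplified combination rule, not the ProbPP model itself): a path's probability is the product of the control-dependent branch probabilities along it, weighted by the probability of the executing thread being scheduled.

```java
public class PathProb {
    // Probability of a path: product of the branch probabilities along it,
    // weighted by the thread-scheduling probability (all values hypothetical).
    static double pathProbability(double threadProb, double... branchProbs) {
        double p = threadProb;
        for (double bp : branchProbs) p *= bp;
        return p;
    }

    public static void main(String[] args) {
        // Thread scheduled 50% of the time; path takes two branches
        // with probabilities 0.9 and 0.8.
        double p = pathProbability(0.5, 0.9, 0.8);
        System.out.printf("%.2f%n", p);
    }
}
```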
14

Bylina, Beata, and Jaroslaw Bylina. "An Experimental Evaluation of the OpenMP Thread Mapping for LU Factorisation on Xeon Phi Coprocessor and on Hybrid CPU-MIC Platform." Scalable Computing: Practice and Experience 19, no. 3 (September 14, 2018): 259–74. http://dx.doi.org/10.12694/scpe.v19i3.1373.

Abstract:
Efficient thread mapping relies upon matching the behaviour of the application with system characteristics. The main aim of this paper is to evaluate the influence of the OpenMP thread mapping on the computation performance of the matrix factorisations on Intel Xeon Phi coprocessor and hybrid CPU-MIC platforms. The authors consider parallel LU factorisations with and without pivoting, both from MKL (Math Kernel Library) library. The results show that the choice of thread affinity, the number of threads and the execution mode have a measurable impact on the performance and the scalability of the LU factorisations.
15

Bouksiaa, Mohamed Said Mosli, Francois Trahay, Alexis Lescouet, Gauthier Voron, Remi Dulong, Amina Guermouche, Elisabeth Brunet, and Gael Thomas. "Using Differential Execution Analysis to Identify Thread Interference." IEEE Transactions on Parallel and Distributed Systems 30, no. 12 (December 1, 2019): 2866–78. http://dx.doi.org/10.1109/tpds.2019.2927481.

16

Fujisawa, Kohei, Atsushi Nunome, Kiyoshi Shibayama, and Hiroaki Hirata. "Design Space Exploration for Implementing a Software-Based Speculative Memory System." International Journal of Software Innovation 6, no. 2 (April 2018): 37–49. http://dx.doi.org/10.4018/ijsi.2018040104.

Abstract:
To enlarge the opportunities for parallelizing sequentially coded programs, the authors have previously proposed speculative memory (SM). With SM, the parallel execution of a program can start under the assumption that it does not violate the data dependencies in the program; when the SM system detects a violation, it recovers the computational state of the program and restarts the execution. In this article, the authors explore the design space for implementing a software-based SM system. They compare the possible choices from three viewpoints: (1) which waiting scheme, suspending or busy-waiting, should be used; (2) when a speculative thread should commit; and (3) which version of data a speculative thread should read. They find that a busy-waiting system in which speculative threads commit early and read non-speculative values performs better than the alternatives.
17

Cavus, Mustafa, Mohammed Shatnawi, Resit Sendag, and Augustus K. Uht. "Fast Key-Value Lookups with Node Tracker." ACM Transactions on Architecture and Code Optimization 18, no. 3 (June 2021): 1–26. http://dx.doi.org/10.1145/3452099.

Abstract:
Lookup operations for in-memory databases are heavily memory bound, because they often rely on pointer-chasing linked data structure traversals. They also have many branches that are hard-to-predict due to random key lookups. In this study, we show that although cache misses are the primary bottleneck for these applications, without a method for eliminating the branch mispredictions only a small fraction of the performance benefit is achieved through prefetching alone. We propose the Node Tracker (NT), a novel programmable prefetcher/pre-execution unit that is highly effective in exploiting inter key-lookup parallelism to improve single-thread performance. We extend NT with branch outcome streaming (BOS) to reduce branch mispredictions and show that this achieves an extra 3× speedup. Finally, we evaluate the NT as a pre-execution unit and demonstrate that we can further improve the performance in both single- and multi-threaded execution modes. Our results show that, on average, NT improves single-thread performance by 4.1× when used as a prefetcher; 11.9× as a prefetcher with BOS; 14.9× as a pre-execution unit and 18.8× as a pre-execution unit with BOS. Finally, with 24 cores of the latter version, we achieve a speedup of 203× and 11× over the single-core and 24-core baselines, respectively.
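The memory-bound behaviour described in this abstract comes from dependent loads: each node's address is known only after the previous load completes, so the hardware cannot overlap the cache misses. A minimal linked-list lookup makes the pattern concrete (this is the traversal an NT-style prefetcher would run ahead of, not the unit itself):

```java
public class PointerChase {
    static final class Node {
        final int key, value;
        Node next;
        Node(int key, int value) { this.key = key; this.value = value; }
    }

    // Each iteration's load depends on the previous one (cur.next), which is
    // the pointer-chasing pattern that serializes cache misses.
    static int lookup(Node head, int key) {
        for (Node cur = head; cur != null; cur = cur.next)
            if (cur.key == key) return cur.value;
        return -1;
    }

    public static void main(String[] args) {
        Node head = null;
        for (int k = 0; k < 1000; k++) {       // build list 999 -> ... -> 0
            Node n = new Node(k, k * 10);
            n.next = head;
            head = n;
        }
        System.out.println(lookup(head, 7));   // found
        System.out.println(lookup(head, -5));  // not found
    }
}
```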
18

Huang, Kaijie, and Jie Cao. "Parallel Differential Evolutionary Particle Filtering Algorithm Based on the CUDA Unfolding Cycle." Wireless Communications and Mobile Computing 2021 (October 15, 2021): 1–12. http://dx.doi.org/10.1155/2021/1999154.

Abstract:
To address the low reduction efficiency of prefix-sum execution in the parallel differential evolutionary particle filtering algorithm, a filtering algorithm based on a CUDA loop-unrolled prefix sum is proposed: unrolling the loop and unrolling the thread warps removes the thread divergence and thread idleness present in the parallel prefix sum, optimizes the loop, and improves the efficiency of prefix-sum execution. By introducing this parallel strategy, the differential evolutionary particle filtering algorithm is parallelized and executed on the GPU using the improved prefix-sum computation during the algorithm's update step. Analysis of large data sets shows that the parallel differential evolutionary particle filtering algorithm with the improved prefix-sum reduction can effectively improve differential evolutionary particle filtering of nonlinear system states and its real-time performance on heterogeneous parallel processing systems.
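Independently of the CUDA-specific optimisations, the prefix-sum (scan) primitive itself is easy to state; Java's `Arrays.parallelPrefix` computes it in parallel, and the sketch below checks it against the sequential definition:

```java
import java.util.Arrays;

public class PrefixSum {
    // Sequential inclusive scan: out[i] = a[0] + ... + a[i].
    static long[] sequentialScan(long[] a) {
        long[] out = a.clone();
        for (int i = 1; i < out.length; i++) out[i] += out[i - 1];
        return out;
    }

    public static void main(String[] args) {
        long[] a = {3, 1, 4, 1, 5, 9, 2, 6};
        long[] parallel = a.clone();
        Arrays.parallelPrefix(parallel, Long::sum); // parallel inclusive prefix sum
        if (!Arrays.equals(parallel, sequentialScan(a))) throw new AssertionError();
        System.out.println(Arrays.toString(parallel));
    }
}
```

Because addition is associative, the parallel scan may combine elements in a different order yet always produces the same result as the sequential loop.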
19

DU, Yan-Ning, Yin-Liang ZHAO, Bo HAN, and Yuan-Cheng LI. "Data Structure Directed Thread Partitioning Method and Execution Model." Journal of Software 24, no. 10 (January 17, 2014): 2432–59. http://dx.doi.org/10.3724/sp.j.1001.2013.04353.

20

Kang, Jihun, and Heonchang Yu. "GPGPU Task Scheduling Technique for Reducing the Performance Deviation of Multiple GPGPU Tasks in RPC-Based GPU Virtualization Environments." Symmetry 13, no. 3 (March 20, 2021): 508. http://dx.doi.org/10.3390/sym13030508.

Abstract:
In remote procedure call (RPC)-based graphics processing unit (GPU) virtualization environments, GPU tasks requested by multiple user virtual machines (VMs) are delivered to the VM owning the GPU and are processed in a multi-process form. However, because a thread executing a computation on a typical GPU cannot arbitrarily stop the task or trigger context switching, GPU monopoly may be prolonged by a long-running general-purpose computing on graphics processing units (GPGPU) task. Furthermore, when scheduling tasks on the GPU, the time for which each user VM uses the GPU is not considered. Thus, in cloud environments that must provide fair use of computing resources, equal use of GPUs by each user VM cannot be guaranteed. We propose a GPGPU task scheduling scheme based on thread division processing that supports even GPU use by multiple VMs processing GPGPU tasks in an RPC-based GPU virtualization environment. Our method divides the threads of a GPGPU task into several groups and controls the execution time of each thread group to prevent a specific GPGPU task from monopolizing the GPU for a long time. The efficiency of the proposed technique is verified through an experiment in an environment where multiple VMs simultaneously perform GPGPU tasks.
21

Kyriacou, Costas, Paraskevas Evripidou, and Pedro Trancoso. "CacheFlow: Cache Optimizations for Data Driven Multithreading." Parallel Processing Letters 16, no. 02 (June 2006): 229–44. http://dx.doi.org/10.1142/s0129626406002599.

Abstract:
Data-Driven Multithreading is a non-blocking multithreading model of execution that provides effective latency tolerance by allowing the computation processor to do useful work while a long-latency event is in progress. With the Data-Driven Multithreading model, a thread is scheduled for execution only if all of its inputs have been produced and placed in the processor's local memory. Data-driven sequencing leads to irregular memory access patterns that could negatively affect cache performance; nevertheless, it enables the implementation of short-term optimal cache management policies. This paper presents the implementation of CacheFlow, an optimized cache management policy that eliminates the side effects of the loss of locality caused by data-driven sequencing and further reduces cache misses. CacheFlow employs thread-based prefetching to preload the data blocks of threads deemed executable. Simulation results for nine scientific applications on a 32-node Data-Driven Multithreaded machine show an average speedup improvement from 19.8 to 22.6. Two techniques to further improve the performance of CacheFlow, conflict avoidance and thread reordering, are proposed and tested; simulation experiments show speedup improvements of 24% and 32%, respectively. The average speedup for all applications on a 32-node machine with both optimizations is 26.1.
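The core scheduling rule (a thread runs only once all of its inputs have been produced) maps naturally onto futures. As a rough analogy in standard Java, not the DDM hardware model:

```java
import java.util.concurrent.CompletableFuture;

public class DataflowSketch {
    public static void main(String[] args) {
        // Two independent producer "threads" create the inputs.
        CompletableFuture<Integer> x = CompletableFuture.supplyAsync(() -> 6);
        CompletableFuture<Integer> y = CompletableFuture.supplyAsync(() -> 7);

        // The consumer is scheduled only after both inputs exist,
        // mirroring data-driven (non-blocking) thread scheduling.
        CompletableFuture<Integer> product = x.thenCombine(y, (a, b) -> a * b);

        System.out.println(product.join());
    }
}
```

In DDM the analogue of `thenCombine` is performed in hardware by the Thread Synchronization Unit, which counts produced inputs and enqueues a thread once its count is complete.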
22

Guo, JunXia, Zheng Li, CunFeng Shi, and RuiLian Zhao. "Thread Scheduling Sequence Generation Based on All Synchronization Pair Coverage Criteria." International Journal of Software Engineering and Knowledge Engineering 30, no. 01 (January 2020): 97–118. http://dx.doi.org/10.1142/s0218194020500059.

Abstract:
Testing multi-thread programs is extremely difficult because thread interleavings are uncertain, which may cause a program to produce different results in each execution. The Thread Scheduling Sequence (TSS) is thus a crucial factor in multi-thread program testing: a good TSS improves testing efficiency and saves testing cost, especially as the number of threads increases. Focusing on this problem, we discuss an approach that can efficiently generate TSSs based on concurrency coverage criteria. First, we define the Synchronization Pair (SP) and the All Synchronization Pairs Coverage (ASPC) criterion. Then we introduce the Synchronization Pair Thread Graph (SPTG) to describe the relationships between SPs and threads, and present a TSS generation method that achieves ASPC according to the SPTG. Finally, automatic TSS generation experiments are conducted on six multi-thread programs from the Java library with the help of the Java PathFinder (JPF) tool. The experimental results illustrate that our method not only generates TSSs that cover all SPs but also requires fewer states, transitions, and TSSs to satisfy ASPC, compared with three other widely used TSS generation methods. The efficiency of TSS generation is thus clearly improved.
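The blow-up in interleavings that motivates coverage criteria such as ASPC is easy to quantify: two threads with m and n atomic steps admit C(m + n, m) distinct schedules. A quick sketch with hypothetical step counts:

```java
public class InterleavingCount {
    // Number of distinct interleavings of two threads with m and n atomic
    // steps: the binomial coefficient C(m + n, m), built up incrementally.
    static long interleavings(int m, int n) {
        long result = 1;
        for (int i = 1; i <= m; i++) {
            result = result * (n + i) / i; // exact at each step: equals C(n+i, i)
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(interleavings(2, 2));   // small: easy to enumerate
        System.out.println(interleavings(10, 10)); // already far too many to test
    }
}
```

Coverage criteria make testing tractable by demanding only that every synchronization pair be exercised, rather than every one of these schedules.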
APA, Harvard, Vancouver, ISO, and other styles
23

WANG, SHENGYUE, PEN-CHUNG YEW, and ANTONIA ZHAI. "CODE TRANSFORMATIONS FOR ENHANCING THE PERFORMANCE OF SPECULATIVELY PARALLEL THREADS." Journal of Circuits, Systems and Computers 21, no. 02 (April 2012): 1240008. http://dx.doi.org/10.1142/s0218126612400087.

Full text
Abstract:
As technology advances, microprocessors that integrate multiple cores on a single chip are becoming increasingly common. How to use these processors to improve the performance of a single program has been a challenge. For general-purpose applications, it is especially difficult to create efficient parallel execution due to complex control flow and ambiguous data dependences. Thread-level speculation and transactional memory provide two hardware mechanisms that can optimistically parallelize potentially dependent threads. However, a compiler that performs detailed performance trade-off analysis is essential for generating efficient parallel programs for this hardware. Such a compiler must take into consideration the cost of intra-thread as well as inter-thread value communication. On the other hand, the ubiquitous presence of complex, input-dependent control flow and data dependence patterns in general-purpose applications makes it impossible for one technique to optimize all program patterns. In this paper, we propose three optimization techniques to improve thread performance: (i) scheduling instructions and generating recovery code to reduce the critical forwarding path introduced by synchronizing memory-resident values; (ii) identifying reduction variables and transforming the code to minimize serializing execution; and (iii) dynamically merging consecutive iterations of a loop to avoid stalls due to unbalanced workload. Detailed evaluation of the proposed mechanisms shows that each optimization technique improves a subset of the SPEC2000 benchmarks, but none improves all of them. On average, the proposed optimizations improve performance by 7% for the set of SPEC2000 benchmarks that have already been optimized for register-resident value communication.
APA, Harvard, Vancouver, ISO, and other styles
24

Kim, Seung Hun, Dohoon Kim, Changmin Lee, Won Seob Jeong, Won Woo Ro, and Jean-Luc Gaudiot. "A Performance-Energy Model to Evaluate Single Thread Execution Acceleration." IEEE Computer Architecture Letters 14, no. 2 (July 1, 2015): 99–102. http://dx.doi.org/10.1109/lca.2014.2368144.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Rivas, Mario Aldea, and Michael González Harbour. "Operating system support for execution time budgets for thread groups." ACM SIGAda Ada Letters XXVII, no. 2 (August 2007): 67–71. http://dx.doi.org/10.1145/1316002.1316017.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Choi, Kiho, Daejin Park, and Jeonghun Cho. "SSCFM: Separate Signature-Based Control Flow Error Monitoring for Multi-Threaded and Multi-Core Environments." Electronics 8, no. 2 (February 1, 2019): 166. http://dx.doi.org/10.3390/electronics8020166.

Full text
Abstract:
Soft error is a key challenge in computer systems. Without soft error mitigation, a control flow error (CFE) can lead to a system crash. Signature-based CFE monitoring is a representative technique for detecting CFEs at runtime. However, most of the signature-based CFE monitoring schemes proposed thus far are based on a single thread. The now widely used multi-threaded and multi-core environments have greatly improved the performance of computing systems, but if these schemes are applied in such environments, performance improvement is difficult to achieve, and performance may even degrade. In this paper, we propose a separate signature-based CFE monitoring (SSCFM) scheme that separates signature update and signature verification at the thread level. The signature update is combined with the application thread, while signature verification is executed on separate monitor threads, so that performance improvements can be expected in multi-threaded or multi-core environments. Furthermore, the SSCFM scheme can fully cover inter-procedural CFEs, which many signature-based CFE monitoring schemes do not cover, by using inter-procedural control flow analysis. With the proposed SSCFM scheme, the execution time overhead is reduced by approximately 26.67% on average compared to the SEDSR scheme, and the average CFE detection rate with SSCFM is approximately 93.69%. In addition, this paper introduces an LLVM compiler-based SSCFM generator that makes it easy to apply the SSCFM scheme to software applications.
APA, Harvard, Vancouver, ISO, and other styles
27

Xue, Xiaozhen, Sima Siami-Namini, and Akbar Siami Namin. "Testing Multi-Threaded Applications Using Answer Set Programming." International Journal of Software Engineering and Knowledge Engineering 28, no. 08 (August 2018): 1151–75. http://dx.doi.org/10.1142/s021819401850033x.

Full text
Abstract:
We introduce a technique to formally represent and specify race conditions in multithreaded applications. Answer set programming (ASP) is a logic-based knowledge representation paradigm for formally expressing belief acquired through reasoning in an application domain. The transparent and expressive representation of problems, along with powerful non-monotonic reasoning, enables ASP to abstractly represent and solve certain classes of NP-hard problems in polynomial time. We use ASP to formally express race conditions and thus represent potential data races that often occur in multithreaded applications with shared-memory models. We then use ASP to generate all possible test inputs and thread interleavings, i.e., schedules, whose executions would deterministically expose thread interleaving failures. We evaluated the proposed technique on some moderately sized Java programs, and our experimental results confirm that it can practically expose common data races in multithreaded programs with low false positive rates. We conjecture that, in addition to generating thread schedules whose execution order exposes data races, ASP has several other applications in constraint-based software testing research and can be utilized to express and solve similar test case generation problems where constraints play a key role in determining the complexity of searches.
APA, Harvard, Vancouver, ISO, and other styles
28

Petric, Vlad, and Amir Roth. "Energy-Effectiveness of Pre-Execution and Energy-Aware P-Thread Selection." ACM SIGARCH Computer Architecture News 33, no. 2 (May 2005): 322–33. http://dx.doi.org/10.1145/1080695.1069997.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Köster, M., J. Groß, and A. Krüger. "Massively Parallel Rule-Based Interpreter Execution on GPUs Using Thread Compaction." International Journal of Parallel Programming 48, no. 4 (June 24, 2020): 675–91. http://dx.doi.org/10.1007/s10766-020-00670-2.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Soliman, Mostafa I., and Elsayed A. Elsayed. "Simultaneous Multithreaded Matrix Processor." Journal of Circuits, Systems and Computers 24, no. 08 (August 12, 2015): 1550114. http://dx.doi.org/10.1142/s0218126615501145.

Full text
Abstract:
This paper proposes a simultaneous multithreaded matrix processor (SMMP) to improve the performance of data-parallel applications by exploiting instruction-level parallelism (ILP), data-level parallelism (DLP), and thread-level parallelism (TLP). In SMMP, the well-known five-stage pipeline (baseline scalar processor) is extended to execute multi-scalar/vector/matrix instructions on unified parallel execution datapaths. SMMP can issue four scalar instructions from two threads each cycle, or four vector/matrix operations from one thread, where the execution of vector/matrix instructions in threads proceeds in round-robin fashion. Moreover, this paper presents an implementation of the proposed SMMP in VHDL targeting the FPGA Virtex-6. In addition, the performance of SMMP is evaluated on some kernels from the basic linear algebra subprograms (BLAS). Our results show that the hardware complexity of SMMP is 5.68 times that of the baseline scalar processor. However, speedups of 4.9, 6.09, 6.98, 8.2, 8.25, 8.72, 9.36, 11.84, and 21.57 are achieved on the BLAS kernels for applying a Givens rotation, scalar times vector plus another, vector addition, vector scaling, setting up a Givens rotation, dot-product, matrix–vector multiplication, Euclidean length, and matrix–matrix multiplication, respectively. The average speedup over the baseline is 9.55, and the average speedup over complexity is 1.68. Compared with the Xilinx MicroBlaze, the complexity of SMMP is 6.36 times higher; however, its speedup ranges from 6.87 to 12.07 on vector/matrix kernels, 9.46 on average.
APA, Harvard, Vancouver, ISO, and other styles
31

GONTMAKHER, ALEX, SERGEY POLYAKOV, and ASSAF SCHUSTER. "COMPLEXITY OF VERIFYING JAVA SHARED MEMORY EXECUTION." Parallel Processing Letters 13, no. 04 (December 2003): 721–33. http://dx.doi.org/10.1142/s0129626403001628.

Full text
Abstract:
This paper studies the problem of testing shared memory Java implementations to determine whether the memory behavior they provide is consistent. The complexity of the task is analyzed. The problem is defined as that of analyzing memory access traces. The study showed that the problem is NP-complete, both in the general case and in some particular cases in which the number of memory operations per thread, the number of write operations per variable, and the number of variables are restricted.
APA, Harvard, Vancouver, ISO, and other styles
32

AMAMIYA, MAKOTO, HIDEO TANIGUCHI, and TAKANORI MATSUZAKI. "AN ARCHITECTURE OF FUSING COMMUNICATION AND EXECUTION FOR GLOBAL DISTRIBUTED PROCESSING." Parallel Processing Letters 11, no. 01 (March 2001): 7–24. http://dx.doi.org/10.1142/s0129626401000397.

Full text
Abstract:
We are pursuing the FUCE architecture project at Kyushu University. FUCE stands for FUsion of Communication and Execution. The main objective of our research is, as the name shows, to develop a new architecture that truly fuses communication and computation. The FUCE project is developing a new on-chip multiprocessor and kernel software for it. We name the processor the FUCE processor and the kernel software CEFOS (Communication and Execution Fusion OS). The FUCE processor is designed as a network node processor that mainly performs switching/transmitting of messages/transactions and handling of their contents. The FUCE processor architecture is designed as a multiprocessor-on-chip to support fine-grain multithreading. The kernel software CEFOS is also developed on the concept of multithreading. User and system processes are constructed as sets of threads, which are executed concurrently according to thread dependences.
APA, Harvard, Vancouver, ISO, and other styles
33

Știrb, Iulia. "Extending NUMA-BTLP Algorithm with Thread Mapping Based on a Communication Tree." Computers 7, no. 4 (December 3, 2018): 66. http://dx.doi.org/10.3390/computers7040066.

Full text
Abstract:
The paper presents a Non-Uniform Memory Access (NUMA)-aware compiler optimization for task-level parallel code. The optimization is based on the Non-Uniform Memory Access—Balanced Task and Loop Parallelism (NUMA-BTLP) algorithm (Ştirb, 2018). The algorithm determines the type of each thread in the source code through a static analysis of the code. After assigning a type to each thread, NUMA-BTLP (Ştirb, 2018) calls the NUMA-BTDM mapping algorithm (Ştirb, 2016), which uses the PThreads routine pthread_setaffinity_np to set the CPU affinities of the threads (i.e., thread-to-core associations) based on their type. The algorithms improve thread mapping for NUMA systems by mapping threads that share data onto the same core(s), allowing fast access to L1 cache data. The paper shows that PThreads-based task-level parallel code optimized at compile time by NUMA-BTLP (Ştirb, 2018) and NUMA-BTDM (Ştirb, 2016) runs time- and energy-efficiently on NUMA systems. The results show that energy consumption is reduced by up to 5% at the same execution time for one of the tested real benchmarks, and by up to 15% for another benchmark running in an infinite loop. The algorithms can be used in real-time control systems such as client/server-based applications which require efficient access to shared resources. Most often, task parallelism is used in the implementation of the server and loop parallelism is used for the client.
APA, Harvard, Vancouver, ISO, and other styles
34

Vasiliev, Ivan Aleksandrovich, Pavel Mikhailovich Dovgalyuk, and Maria Anatolyevna Klimushenkova. "Using the identification of threads of execution when solving problems of full-system analysis of binary code." Proceedings of the Institute for System Programming of the RAS 33, no. 6 (2021): 51–66. http://dx.doi.org/10.15514/ispras-2021-33(6)-4.

Full text
Abstract:
Dynamic binary analysis, often used for full-system analysis, provides the analyst with a sequence of executed instructions and the contents of RAM and system registers. This data is hard to process, as it is low-level and demands a deep understanding of the studied system and a highly skilled professional to perform the analysis. To simplify the analysis process, it is necessary to bring the input data into a more user-friendly form, i.e., to provide high-level information about the system. One such piece of high-level information is the program execution flow. To recover the execution flow of a program, it is important to have a representation of the procedures being called in it. Such a representation can be obtained from the function call stack of a specific thread. Building a call stack without information about the running threads is impossible, since each thread is uniquely associated with one stack, and vice versa. In addition, the very presence of thread information raises the level of knowledge about the system, allows finer profiling of the object of study, and enables highly focused analysis by applying the principles of selective instrumentation. The virtual machine provides only low-level data; thus, there is a need for a method of automatic identification of threads in the system under study, based on the available data. In this paper, existing approaches to obtaining high-level information in full-system analysis are considered, and a method is proposed for recovering thread information during full-system emulation with a low degree of OS dependency. Examples of the practical use of this method in the implementation of analysis tools are also given, namely: restoring the call stack, detecting suspicious return operations, and detecting calls to freed memory on the stack. The testing presented in the article shows that the slowdown imposed by the described algorithms still allows working with the system under study, and comparison with reference data confirms the correctness of the results obtained by the algorithms.
APA, Harvard, Vancouver, ISO, and other styles
35

Dong, Jing Chuan, and Tai Yong Wang. "A Pipeline Designed Reconfigurable CNC Architecture." Materials Science Forum 697-698 (September 2011): 288–91. http://dx.doi.org/10.4028/www.scientific.net/msf.697-698.288.

Full text
Abstract:
This paper proposes a pipeline-based reconfigurable architecture for CNC controllers. The architecture consists of an upper controller and an NC Microcode Processor (NCMP). The control software in the upper controller is a multi-thread program, including a management thread and an NC pipeline thread. The NC pipeline thread in the upper controller converts the machining program into NC Microcode (NCM), which is optimized for real-time execution in the NCMP. A pipelined feedrate planning algorithm is developed for the proposed architecture. The algorithm features a reconfigurable structure and look-ahead ability for high-speed machining. Two prototype systems were built to demonstrate the feasibility of the proposed architecture. The experimental results indicate that the NC pipeline is highly reconfigurable and flexible in design compared to the classic implementation.
APA, Harvard, Vancouver, ISO, and other styles
36

Yang, Yuer, Zeguang Chen, Shaobo Chen, Zhuoyun Du, Yuxin Luo, Liangtian Zhao, Lifeng Zhou, and Yujuan Quan. "Avpd: An Anti-virus Model with Remote Thread Injection for Android Based on ResNet50." Journal of Physics: Conference Series 2203, no. 1 (March 1, 2022): 012078. http://dx.doi.org/10.1088/1742-6596/2203/1/012078.

Full text
Abstract:
Most Android mobile anti-virus software in the industry performs its checks at the application level, and users familiar with the Android operating system are well aware that virtual clicks, function execution, or shell commands can force an application to stop, which threatens the real-time monitoring of anti-virus software. Moreover, current mainstream anti-virus software can only let users manually uninstall or deactivate malicious apps once they are detected, so Android mobile anti-virus software has lost the ability to remove or delete viruses and Trojans automatically. To solve these problems, in this paper we train a mobile anti-virus model based on ResNet50 and propose an Android mobile anti-virus method using remote thread injection (RTI): overriding the execution of malicious code by RTI means such as hooking APIs, nulling related functions, or rewriting related classes or functions, so as to preserve the app as much as possible. The model identifies malicious code with high accuracy: its recognition accuracy is up to 98.14%, and the malicious code blocking rate after recognition is up to 99.70%.
APA, Harvard, Vancouver, ISO, and other styles
37

Berisha, Artan, Eliot Bytyçi, and Ardeshir Tershnjaku. "Parallel Genetic Algorithms for University Scheduling Problem." International Journal of Electrical and Computer Engineering (IJECE) 7, no. 2 (April 1, 2017): 1096. http://dx.doi.org/10.11591/ijece.v7i2.pp1096-1102.

Full text
Abstract:
The university scheduling (timetabling) problem falls into the class of NP-hard problems. Researchers have tried many techniques to find the most suitable and fastest way to solve it. With the emergence of multi-core systems, parallel implementations were considered for finding the solution. Our approach combines several techniques in two algorithms: a coarse-grained algorithm and a multi-thread tournament algorithm. The results obtained from the two algorithms are compared using an algorithm evaluation function. Considering execution time, the coarse-grained algorithm performed twice as well as the multi-thread algorithm.
APA, Harvard, Vancouver, ISO, and other styles
38

SHCHERBAN, VOLODYMYR, JULY MAKARENKO, OKSANA KOLISKO, LUDMILA HALAVSKA, and YURYJ SHCHERBAN. "COMPUTER IMPLEMENTATION OF RECURSION ALGORITHM DETERMINATION OF THREAD TENSION DURING FORMATION OF MULTILAYER FABRICS FROM POLYETHYLENE THREADS." HERALD OF KHMELNYTSKYI NATIONAL UNIVERSITY 297, no. 3 (July 2, 2021): 204–7. http://dx.doi.org/10.31891/2307-5732-2021-297-3-204-207.

Full text
Abstract:
Multilayer fabrics made of polyethylene threads are widely used in the equipment and tactical gear of servicemen, capable of protecting the human body from firearms, cold weapons, cutting and piercing weapons, and shock and fragmentation impacts. Optimizing their manufacturing process means optimizing the tension of the main polyethylene threads in front of the formation zone. To do this, it is necessary to determine the change in relative tension across the filling zones of the polyethylene threads on the loom. This complex task should be carried out using specially designed computer programs. Given the specifics of processing threads on a weaving machine, the relative tension in each individual zone must be determined with a recursion algorithm, in which the output tension of the thread from the previous zone becomes the input for the next zone. Designing new, and improving existing, technological processes for processing polyethylene complex threads on weaving machines requires determining this change in relative tension across the filling zones of the main threads; doing so while taking into account the material of the guides will improve the technology for manufacturing these multilayer fabrics. Improving the existing technological processes will also reduce the downtime that arises when threads break. Thread breaks negatively affect the productivity of weaving machines and reduce the quality of multilayer fabrics. Minimizing the tension in each filling zone of the main polyethylene complex threads reduces the likelihood of a thread break, which is important for improving technological processes from the standpoint of increasing the productivity of weaving machines and the quality of multilayer fabrics. The mathematical core of such a computer program requires models of the interaction of a thread with the surfaces of the scala, the framing guides, and the holes of the heddle frames, taking into account the real physical and mechanical properties of complex threads and yarns and their real geometric and structural parameters. The main factor in the growth of tension in polyethylene complex threads is the force of friction, which characterizes the friction properties of the threads and the conditions of their interaction with the surfaces of the scala, the framing guides, and the holes of the heddle frames.
APA, Harvard, Vancouver, ISO, and other styles
39

Oh, Jaegeun, Seok Joong Hwang, Huong Giang Nguyen, Areum Kim, Seon Wook Kim, Chulwoo Kim, and Jong-Kook Kim. "Exploiting Thread-Level Parallelism in Lockstep Execution by Partially Duplicating a Single Pipeline." ETRI Journal 30, no. 4 (August 8, 2008): 576–86. http://dx.doi.org/10.4218/etrij.08.0107.0343.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Kazi, I. H., and D. J. Lilja. "Coarse-grained thread pipelining: a speculative parallel execution model for shared-memory multiprocessors." IEEE Transactions on Parallel and Distributed Systems 12, no. 9 (September 2001): 952–66. http://dx.doi.org/10.1109/71.954629.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Peternier, Achille, Danilo Ansaloni, Daniele Bonetta, Cesare Pautasso, and Walter Binder. "Improving execution unit occupancy on SMT-based processors through hardware-aware thread scheduling." Future Generation Computer Systems 30 (January 2014): 229–41. http://dx.doi.org/10.1016/j.future.2013.06.015.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Ristov, Sasko. "Special Issue on Infrastructures and Algorithms for Scalable Computing." Scalable Computing: Practice and Experience 19, no. 3 (September 17, 2018): iii—iv. http://dx.doi.org/10.12694/scpe.v19i3.1441.

Full text
Abstract:
We are happy to present this special issue of the scientific journal Scalable Computing: Practice and Experience. For this special issue on Infrastructures and Algorithms for Scalable Computing (Volume 19, No. 3, June 2018), we selected four papers out of the nine submitted, which went through peer review according to the journal policy. All papers present novel results in the fields of distributed algorithms and infrastructures for scalable computing. The first paper presents a novel approach for efficient data placement, which improves the performance of workflow execution in distributed datacenters. The greedy heuristic algorithm, based on a network flow optimization framework, minimizes the total storage cost, including the effort to move and store the data from different source locations and dependencies. The second paper evaluates the significance of different clustering techniques, viz. k-means, Hierarchical Agglomerative Clustering, and Markov Clustering, in grouping-aware data placement for data-intensive applications with interest locality. The evaluation on Azure reported that a Markov Clustering-based data placement strategy improves local map execution and reduces execution time compared to Hadoop's Default Data Placement Strategy and the other evaluated clustering techniques; this is more pronounced for data-intensive applications that have interest locality. The third paper presents an experimental evaluation of OpenMP thread-mapping strategies in different hardware environments (Intel Xeon Phi coprocessor and hybrid CPU-MIC platforms). The paper shows the optimal choice of thread affinity, number of threads, and execution mode for optimal performance of the LU factorization. In the fourth paper, the authors study the amount of memory occupied by sparse matrices split up into same-size blocks. The paper considers and statistically evaluates four popular storage formats and combinations among them. The conclusion is that block-based storage formats may significantly reduce the memory footprints of sparse matrices arising from a wide range of application domains. We use this opportunity to thank all contributors to this special issue: all authors who submitted the results of their latest research, and all reviewers for their valuable comments and suggestions for improvement. We would like to express our special gratitude to the Editor-in-Chief, Professor Dana Petcu, for her constant support during the whole process of this special issue.
APA, Harvard, Vancouver, ISO, and other styles
43

Roka, Sanjay, and Santosh Naik. "SURVEY ON SIGNATURE BASED INTRUCTION DETECTION SYSTEM USING MULTITHREADING." International Journal of Research -GRANTHAALAYAH 5, no. 4RACSIT (April 30, 2017): 58–62. http://dx.doi.org/10.29121/granthaalayah.v5.i4racsit.2017.3352.

Full text
Abstract:
The traditional way of protecting networks with firewalls and encryption software is no longer sufficient or effective. Many intrusion detection techniques have been developed for fixed wired networks but have turned out to be inapplicable in this new environment. We need to search for new architectures and mechanisms to protect computer networks. A signature-based Intrusion Detection System (IDS) matches network packets against a pre-configured set of intrusion signatures. Current IDS implementations employ only a single thread of execution and, as a consequence, benefit very little from multi-processor hardware platforms. A multi-threaded technique would allow more efficient and scalable exploitation of these multi-processor machines.
APA, Harvard, Vancouver, ISO, and other styles
44

DOROJEVETS, MIKHAIL. "COOL MULTITHREADING IN HTMT SPELL-1 PROCESSORS." International Journal of High Speed Electronics and Systems 10, no. 01 (March 2000): 247–53. http://dx.doi.org/10.1142/s0129156400000283.

Full text
Abstract:
A COOL-1 multiprocessor shared memory system based on superconductor Rapid Single-Flux Quantum (RSFQ) technology is being developed at SUNY (Stony Brook, USA) within the framework of the Hybrid Technology Multithreaded architecture (HTMT) petaflops project led by JPL. This paper describes a multithreading approach proposed in the COOL-I architecture and mechanisms to exploit the thread level parallelism in RSFQ processors called SPELL-1. Up to 128 fine-grain threads called (instruction) streams arranged in 16 groups of 8 streams each can run in parallel within a SPELL-1 processor. All eight streams comprising each COOL stream cluster can communicate and synchronize directly via shared registers. Fast creation and termination of streams including speculative stream execution are also supported.
APA, Harvard, Vancouver, ISO, and other styles
45

Tripathy, Devashree, Amirali Abdolrashidi, Laxmi Narayan Bhuyan, Liang Zhou, and Daniel Wong. "PAVER." ACM Transactions on Architecture and Code Optimization 18, no. 3 (June 2021): 1–26. http://dx.doi.org/10.1145/3451164.

Full text
Abstract:
The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache sizes per thread, leading to serious cache contention problems such as thrashing. Hence, the data access locality of an application should be considered during thread scheduling to improve execution time and energy consumption. Recent works have tried to use the locality behavior of regular and structured applications in thread scheduling, but the difficult case of irregular and unstructured parallel applications remains to be explored. We present PAVER, a Priority-Aware Vertex schedulER, which takes a graph-theoretic approach to thread scheduling. We analyze the cache locality behavior among thread blocks (TBs) through just-in-time compilation, and represent the problem using a graph of the TBs and the locality among them. This graph is then partitioned into TB groups that display maximum data sharing, which are assigned to the same streaming multiprocessor by the locality-aware TB scheduler. Through exhaustive simulation on the Fermi, Pascal, and Volta architectures using a number of scheduling techniques, we show that PAVER reduces L2 accesses by 43.3%, 48.5%, and 40.21% and increases the average performance benefit by 29%, 49.1%, and 41.2% for benchmarks with high inter-TB locality.
APA, Harvard, Vancouver, ISO, and other styles
46

Luo, Yangchun, Wei-Chung Hsu, and Antonia Zhai. "The design and implementation of heterogeneous multicore systems for energy-efficient speculative thread execution." ACM Transactions on Architecture and Code Optimization 10, no. 4 (December 2013): 1–29. http://dx.doi.org/10.1145/2541228.2541233.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Adam, George K. "Co-Design of Multicore Hardware and Multithreaded Software for Thread Performance Assessment on an FPGA." Computers 11, no. 5 (May 9, 2022): 76. http://dx.doi.org/10.3390/computers11050076.

Full text
Abstract:
Multicore and multithreaded architectures increase the performance of computing systems. The increase in cores and threads, however, raises further issues in the efficiency achieved in terms of speedup and parallelization, particularly for the real-time requirements of Internet of Things (IoT) embedded applications. This research investigates the efficiency of a 32-core field-programmable gate array (FPGA) architecture, with memory management unit (MMU) and real-time operating system (OS) support, in exploiting the thread-level parallelism (TLP) of tasks running in parallel as threads on multiple cores. The research outcomes confirm the feasibility of the proposed approach for the efficient execution of recursive sorting algorithms, evaluated in terms of speedup and parallelization. The results reveal that parallel implementation of the prevalent merge sort and quicksort algorithms on this platform is more efficient. The increase in speedup is proportional to core scaling, reaching a maximum of 53% for the configuration with the highest number of cores and threads. However, the maximum magnitude of parallelization (66%) was found to be bounded at a low count of two cores and four threads; a further increase in the number of cores and threads did not improve parallelism.
APA, Harvard, Vancouver, ISO, and other styles
48

Gilman, Guin, Samuel S. Ogden, Tian Guo, and Robert J. Walls. "Demystifying the Placement Policies of the NVIDIA GPU Thread Block Scheduler for Concurrent Kernels." ACM SIGMETRICS Performance Evaluation Review 48, no. 3 (March 5, 2021): 81–88. http://dx.doi.org/10.1145/3453953.3453972.

Full text
Abstract:
In this work, we empirically derive the scheduler's behavior under concurrent workloads for NVIDIA's Pascal, Volta, and Turing microarchitectures. In contrast to past studies that suggest the scheduler uses a round-robin policy to assign thread blocks to streaming multiprocessors (SMs), we instead find that the scheduler chooses the next SM based on the SM's local resource availability. We show how this scheduling policy can lead to significant, and seemingly counter-intuitive, performance degradation; for example, a decrease of one thread per block resulted in a 3.58X increase in execution time for one kernel in our experiments. We hope that our work will be useful for improving the accuracy of GPU simulators and aid in the development of novel scheduling algorithms.
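The availability-based placement the authors report can be mimicked with a toy simulator: each thread block goes to the SM with the most free resources rather than to the next SM in round-robin order. The capacities and block costs below are invented for illustration and do not reproduce the paper's measurements.

```python
def place_most_free(block_costs, num_sms, sm_capacity):
    """Toy placement policy: assign each thread block to the SM with
    the most remaining resources, mirroring the availability-based
    behavior the paper reports. Returns one SM index per block, or
    None if no SM can currently host the block."""
    free = [sm_capacity] * num_sms
    placement = []
    for cost in block_costs:
        # Pick the SM with the greatest free capacity (ties -> lowest index).
        sm = max(range(num_sms), key=lambda i: free[i])
        if free[sm] < cost:
            placement.append(None)  # block must wait for resources
        else:
            free[sm] -= cost
            placement.append(sm)
    return placement
```

With equal-cost blocks this happens to interleave across SMs, but unlike round-robin the choice is driven by residual capacity, so uneven block sizes skew placement toward less-loaded SMs.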
APA, Harvard, Vancouver, ISO, and other styles
49

Egorov, Alexander, Natalya Krupenina, and Lyubov Tyndykar. "The parallel approach to issue of operational management optimization problem on transport gateway system." E3S Web of Conferences 203 (2020): 05003. http://dx.doi.org/10.1051/e3sconf/202020305003.

Full text
Abstract:
A universal parallelization software shell for joint data processing, implemented on top of a distributed computing system, is considered. The aim of the research is to find the most effective way to organize an information system for managing a navigable canal. One optimization option is to increase the computing power of the available devices by combining them into a single computing cluster. The authors propose adapting the task of optimally managing a locked shipping channel to a multi-threaded environment, subject to the constraints of a technologically feasible schedule. The article presents algorithms and gives recommendations for applying them when forming subtasks for parallel processing, as well as within a separate thread. The proposed approach to building a tree of options makes it possible to distribute the load optimally across all resources of a multi-threaded system of any structure.
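A minimal sketch of the subtask-distribution idea, assuming the subtasks are independent evaluations of candidate schedules fanned out over a thread pool. The function and parameter names here are hypothetical stand-ins for illustration, not the paper's API.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_schedules(candidates, score, workers=4):
    """Score candidate schedules in parallel and return the best one.
    `candidates` is any list of schedule descriptions; `score` is a
    cost function where lower is better (both illustrative)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(score, candidates))
    best = min(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]
```

Because each candidate is scored independently, the work divides cleanly among threads; a branch-and-bound tree of options would add pruning on top of this fan-out.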
APA, Harvard, Vancouver, ISO, and other styles
50

Soliman, Mostafa I. "Performance Evaluation of Multi-Core Intel Xeon Processors on Basic Linear Algebra Subprograms." Parallel Processing Letters 19, no. 01 (March 2009): 159–74. http://dx.doi.org/10.1142/s0129626409000134.

Full text
Abstract:
Multi-core technology is a natural next step in delivering the benefits of Moore's law to computing platforms. On multi-core processors, the performance of many applications would be improved by processing threads of code in parallel using multi-threading techniques. This paper evaluates the performance of the multi-core Intel Xeon processors on the widely used basic linear algebra subprograms (BLAS). On two dual-core Intel Xeon processors with Hyper-Threading technology, our results show that a performance of around 20 GFLOPS is achieved on Level-3 (matrix-matrix operations) BLAS using multi-threading, SIMD, matrix blocking, and loop unrolling techniques. However, on small sizes of Level-2 (matrix-vector operations) and Level-1 (vector operations) BLAS, multi-threading slows down execution because of thread creation overheads. Thus, using the Intel SIMD instruction set is the way to improve the performance of single-threaded Level-2 (6 GFLOPS) and Level-1 BLAS (3 GFLOPS). When the problem size becomes large (cannot fit in L2 cache), the performance of the four Xeon cores is less than 2 and 1 GFLOPS on Level-2 and Level-1 BLAS, respectively, even though eight threads are executed in parallel on eight logical processors.
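The thread-creation overhead the authors observe for small Level-1 BLAS can be illustrated with a chunked dot product: each thread computes a partial sum, but for short vectors the cost of spawning and joining the threads exceeds the arithmetic itself. This is a sketch of the effect, not Intel's BLAS implementation.

```python
import threading

def dot_threaded(x, y, num_threads=4):
    """Chunked Level-1 dot product: each thread sums one slice of the
    element-wise products. For small vectors the thread start/join
    cost dominates the arithmetic -- the slowdown the paper reports."""
    n = len(x)
    chunk = max(1, n // num_threads)
    partial = [0.0] * num_threads

    def worker(t):
        lo = t * chunk
        hi = n if t == num_threads - 1 else lo + chunk
        partial[t] = sum(a * b for a, b in zip(x[lo:hi], y[lo:hi]))

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return sum(partial)
```

For production Level-1 work the cure the paper suggests is SIMD within one thread, not more threads: vectorized arithmetic has no per-call spawn cost to amortize.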
APA, Harvard, Vancouver, ISO, and other styles
