Log in

Relevant bibliographies by topics / Cache codé / Journal articles

To see the other types of publications on this topic, follow the link: Cache codé.

Journal articles on the topic 'Cache codé'

Author: Grafiati

Published: 25 May 2024

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Select a source type:

Consult the top 50 journal articles for your research on the topic 'Cache codé.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press on it, and we will generate automatically the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as pdf and read online its abstract whenever available in the metadata.

Browse journal articles on a wide variety of disciplines and organise your bibliography correctly.

1

Ding, Wei, Yuanrui Zhang, Mahmut Kandemir, and Seung Woo Son. "Compiler-Directed File Layout Optimization for Hierarchical Storage Systems." Scientific Programming 21, no. 3-4 (2013): 65–78. http://dx.doi.org/10.1155/2013/167581.

Full text

Abstract:

File layout of array data is a critical factor that effects the behavior of storage caches, and has so far taken not much attention in the context of hierarchical storage systems. The main contribution of this paper is a compiler-driven file layout optimization scheme for hierarchical storage caches. This approach, fully automated within an optimizing compiler, analyzes a multi-threaded application code and determines a file layout for each disk-resident array referenced by the code, such that the performance of the target storage cache hierarchy is maximized. We tested our approach using 16 I/O intensive application programs and compared its performance against two previously proposed approaches under different cache space management schemes. Our experimental results show that the proposed approach improves the execution time of these parallel applications by 23.7% on average.

APA, Harvard, Vancouver, ISO, and other styles

2

Calciu, Irina, M. Talha Imran, Ivan Puddu, Sanidhya Kashyap, Hasan Al Maruf, Onur Mutlu, and Aasheesh Kolli. "Using Local Cache Coherence for Disaggregated Memory Systems." ACM SIGOPS Operating Systems Review 57, no. 1 (June 26, 2023): 21–28. http://dx.doi.org/10.1145/3606557.3606561.

Full text

Abstract:

Disaggregated memory provides many cost savings and resource provisioning benefits for current datacenters, but software systems enabling disaggregated memory access result in high performance penalties. These systems require intrusive code changes to port applications for disaggregated memory or employ slow virtual memory mechanisms to avoid code changes. Such mechanisms result in high overhead page faults to access remote data and high dirty data amplification when tracking changes to cached data at page-granularity. In this paper, we propose a fundamentally new approach for disaggregated memory systems, based on the observation that we can use local cache coherence to track applications' memory accesses transparently, without code changes, at cache-line granularity. This simple idea (1) eliminates page faults from the application critical path when accessing remote data, and (2) decouples the application memory access tracking from the virtual memory page size, enabling cache-line granularity dirty data tracking and eviction. Using this observation, we implemented a new software runtime for disaggregated memory that improves average memory access time and reduces dirty data amplification1.

APA, Harvard, Vancouver, ISO, and other styles

3

Charrier, Dominic E., Benjamin Hazelwood, Ekaterina Tutlyaeva, Michael Bader, Michael Dumbser, Andrey Kudryavtsev, Alexander Moskovsky, and Tobias Weinzierl. "Studies on the energy and deep memory behaviour of a cache-oblivious, task-based hyperbolic PDE solver." International Journal of High Performance Computing Applications 33, no. 5 (April 15, 2019): 973–86. http://dx.doi.org/10.1177/1094342019842645.

Full text

Abstract:

We study the performance behaviour of a seismic simulation using the ExaHyPE engine with a specific focus on memory characteristics and energy needs. ExaHyPE combines dynamically adaptive mesh refinement (AMR) with ADER-DG. It is parallelized using tasks, and it is cache efficient. AMR plus ADER-DG yields a task graph which is highly dynamic in nature and comprises both arithmetically expensive tasks and tasks which challenge the memory’s latency. The expensive tasks and thus the whole code benefit from AVX vectorization, although we suffer from memory access bursts. A frequency reduction of the chip improves the code’s energy-to-solution. Yet, it does not mitigate burst effects. The bursts’ latency penalty becomes worse once we add Intel Optane technology, increase the core count significantly or make individual, computationally heavy tasks fall out of close caches. Thread overbooking to hide away these latency penalties becomes contra-productive with noninclusive caches as it destroys the cache and vectorization character. In cases where memory-intense and computationally expensive tasks overlap, ExaHyPE’s cache-oblivious implementation nevertheless can exploit deep, noninclusive, heterogeneous memory effectively, as main memory misses arise infrequently and slow down only few cores. We thus propose that upcoming supercomputing simulation codes with dynamic, inhomogeneous task graphs are actively supported by thread runtimes in intermixing tasks of different compute character, and we propose that future hardware actively allows codes to downclock the cores running particular task types.

APA, Harvard, Vancouver, ISO, and other styles

4

Mittal, Shaily, and Nitin. "Memory Map: A Multiprocessor Cache Simulator." Journal of Electrical and Computer Engineering 2012 (2012): 1–12. http://dx.doi.org/10.1155/2012/365091.

Full text

Abstract:

Nowadays, Multiprocessor System-on-Chip (MPSoC) architectures are mainly focused on by manufacturers to provide increased concurrency, instead of increased clock speed, for embedded systems. However, managing concurrency is a tough task. Hence, one major issue is to synchronize concurrent accesses to shared memory. An important characteristic of any system design process is memory configuration and data flow management. Although, it is very important to select a correct memory configuration, it might be equally imperative to choreograph the data flow between various levels of memory in an optimal manner. Memory map is a multiprocessor simulator to choreograph data flow in individual caches of multiple processors and shared memory systems. This simulator allows user to specify cache reconfigurations and number of processors within the application program and evaluates cache miss and hit rate for each configuration phase taking into account reconfiguration costs. The code is open source and in java.

APA, Harvard, Vancouver, ISO, and other styles

5

Moon, S. M. "Increasing cache bandwidth using multiport caches for exploiting ILP in non-numerical code." IEE Proceedings - Computers and Digital Techniques 144, no. 5 (1997): 295. http://dx.doi.org/10.1049/ip-cdt:19971283.

Full text

APA, Harvard, Vancouver, ISO, and other styles

6

Ma, Ruhui, Haibing Guan, Erzhou Zhu, Yongqiang Gao, and Alei Liang. "Code cache management based on working set in dynamic binary translator." Computer Science and Information Systems 8, no. 3 (2011): 653–71. http://dx.doi.org/10.2298/csis100327022m.

Full text

Abstract:

Software code cache employed to store translated or optimized codes, amortizes the overhead of dynamic binary translation via reusing of stored-altered copies of original program instructions. Though many conventional code cache managements, such as Flush, Least-Recently Used (LRU), have been applied on some classic dynamic binary translators, actually they are so unsophisticated yet unadaptable that it not only brings additional unnecessary overhead, but also wastes much cache space, since there exist several noticeable features in software code cache, unlike pages in memory. Consequently, this paper presents two novel alternative cache schemes-SCC (Static Code Cache) and DCC (Dynamic Code Cache) based on working set. In these new schemes, we utilize translation rate to judge working set. To evaluate these new replacement policies, we implement them on dynamic binary translator-CrossBit with several commonplace code cache managements. Through the experiment results based on benchmark SPECint 2000, we achieve better performance improvement and cache space utilization ratio.

APA, Harvard, Vancouver, ISO, and other styles

7

Das, Abhishek, and Nur A. Touba. "A Single Error Correcting Code with One-Step Group Partitioned Decoding Based on Shared Majority-Vote." Electronics 9, no. 5 (April 26, 2020): 709. http://dx.doi.org/10.3390/electronics9050709.

Full text

Abstract:

Technology scaling has led to an increase in density and capacity of on-chip caches. This has enabled higher throughput by enabling more low latency memory transfers. With the reduction in size of SRAMs and development of emerging technologies, e.g., STT-MRAM, for on-chip cache memories, reliability of such memories becomes a major concern. Traditional error correcting codes, e.g., Hamming codes and orthogonal Latin square codes, either suffer from high decoding latency, which leads to lower overall throughput, or high memory overhead. In this paper, a new single error correcting code based on a shared majority voting logic is presented. The proposed codes trade off decoding latency in order to improve the memory overhead posed by orthogonal Latin square codes. A latency optimization technique is also proposed which lowers the decoding latency by incurring a slight memory overhead. It is shown that the proposed codes achieve better redundancy compared to orthogonal Latin square codes. The proposed codes are also shown to achieve lower decoding latency compared to Hamming codes. Thus, the proposed codes achieve a balanced trade-off between memory overhead and decoding latency, which makes them highly suitable for on-chip cache memories which have stringent throughput and memory overhead constraints.

APA, Harvard, Vancouver, ISO, and other styles

8

Simecek, Ivan, and Pavel Tvrdík. "A new code transformation technique for nested loops." Computer Science and Information Systems 11, no. 4 (2014): 1381–416. http://dx.doi.org/10.2298/csis131126075s.

Full text

Abstract:

For good performance of every computer program, good cache utilization is crucial. In numerical linear algebra libraries, good cache utilization is achieved by explicit loop restructuring (mainly loop blocking), but it requires a complicated memory pattern behavior analysis. In this paper, we describe a new source code transformation called dynamic loop reversal that can increase temporal and spatial locality. We also describe a formal method for predicting cache behavior and evaluate results of the model accuracy by the measurements on a cache monitor. The comparisons of the numbers of measured cache misses and the numbers of cache misses estimated by the model indicate that the model is relatively accurate and can be used in practice.

APA, Harvard, Vancouver, ISO, and other styles

9

Luo, Ya Li. "Research of Adaptive Control Algorithm Based on the Cached Playing of Streaming Media." Applied Mechanics and Materials 539 (July 2014): 502–6. http://dx.doi.org/10.4028/www.scientific.net/amm.539.502.

Full text

Abstract:

Aimed at the quality issues of current network streaming media playing, it manages by introducing a streaming media caching mechanism to help improve the playing effect. But the cached playing also has its own deficiencies, so here combines the adaptive control algorithm with the caching mechanism to solve this problem. It firstly introduces the streaming media service, and analyzes the transmission process of streaming media and adaptive media playing in detail; secondly analyzes the adaptive control algorithm of streaming media caching from the principle and design of reserving cache algorithm and the balance frame dropping algorithm; and finally gives the core code of the algorithm. This paper has a certain reference value to the video website developers and computer system architecture researchers.

APA, Harvard, Vancouver, ISO, and other styles

10

Heirman, Wim, Stijn Eyerman, Kristof Du Bois, and Ibrahim Hur. "Automatic Sublining for Efficient Sparse Memory Accesses." ACM Transactions on Architecture and Code Optimization 18, no. 3 (June 2021): 1–23. http://dx.doi.org/10.1145/3452141.

Full text

Abstract:

Sparse memory accesses, which are scattered accesses to single elements of a large data structure, are a challenge for current processor architectures. Their lack of spatial and temporal locality and their irregularity makes caches and traditional stream prefetchers useless. Furthermore, performing standard caching and prefetching on sparse accesses wastes precious memory bandwidth and thrashes caches, deteriorating performance for regular accesses. Bypassing prefetchers and caches for sparse accesses, and fetching only a single element (e.g., 8 B) from main memory (subline access), can solve these issues. Deciding which accesses to handle as sparse accesses and which as regular cached accesses, is a challenging task, with a large potential impact on performance. Not only is performance reduced by treating sparse accesses as regular accesses, not caching accesses that do have locality also negatively impacts performance by significantly increasing their latency and bandwidth consumption. Furthermore, this decision depends on the dynamic environment, such as input set characteristics and system load, making a static decision by the programmer or compiler suboptimal. We propose the Instruction Spatial Locality Estimator ( ISLE ), a hardware detector that finds instructions that access isolated words in a sea of unused data. These sparse accesses are dynamically converted into uncached subline accesses, while keeping regular accesses cached. ISLE does not require modifying source code or binaries, and adapts automatically to a changing environment (input data, available bandwidth, etc.). We apply ISLE to a graph analytics processor running sparse graph workloads, and show that ISLE outperforms the performance of no subline accesses, manual sublining, and prior work on detecting sparse accesses.

APA, Harvard, Vancouver, ISO, and other styles

11

Пуйденко, Вадим Олексійович, and Вячеслав Сергійович Харченко. "МІНІМІЗАЦІЯ ЛОГІЧНОЇ СХЕМИ ДЛЯ РЕАЛІЗАЦІЇ PSEUDO LRU ШЛЯХОМ МІЖТИПОВОГО ПЕРЕХОДУ У ТРИГЕРНИХ СТРУКТУРАХ." RADIOELECTRONIC AND COMPUTER SYSTEMS, no. 2 (April 26, 2020): 33–47. http://dx.doi.org/10.32620/reks.2020.2.03.

Full text

Abstract:

The principle of program control means that the processor core turns to the main memory of the computer for operands or instructions. According to architectural features, operands are stored in data segments, and instructions are stored in code segments of the main memory. The operating system uses both page memory organization and segment memory organization. The page memory organization is always mapped to the segment organization. Due to the cached packet cycles of the processor core, copies of the main memory pages are stored in the internal associative cache memory. The associative cache memory consists of three units: a data unit, a tag unit, and an LRU unit. The data unit stores operands or instructions, the tag unit contains fragments of address information, and the LRU unit contains the logic of policy for replacement of string. The missing event attracts LRU logic to decide for substitution of reliable string in the data unit of associative cache memory. The pseudo-LRU algorithm is a simple and better substitution policy among known substitution policies. Two options for the minimization of the hardware for replacement policy by the pseudo-LRU algorithm in q - directed associative cache memory is implemented. The transition from the trigger structure of the synchronous D-trigger to the trigger structure of the synchronous JK-trigger is carried out reasonably in both options. The first option of minimization is based on the sequence for updating of the by the algorithm pseudo LRU, which allows deleting of the combinational logic for updating bits of LRU unit. The second option of minimization is based on the sequence for changing of the q - index of direction, as the consequence for updating the bits of LRU unit by the algorithm pseudo LRU. It allows additionally reducing the number of memory elements. Both options of the minimization allow improving such characteristics as productivity and reliability of the LRU unit.

APA, Harvard, Vancouver, ISO, and other styles

12

Sasongko, Muhammad Aditya, Milind Chabbi, Mandana Bagheri Marzijarani, and Didem Unat. "ReuseTracker : Fast Yet Accurate Multicore Reuse Distance Analyzer." ACM Transactions on Architecture and Code Optimization 19, no. 1 (March 31, 2022): 1–25. http://dx.doi.org/10.1145/3484199.

Full text

Abstract:

One widely used metric that measures data locality is reuse distance —the number of unique memory locations that are accessed between two consecutive accesses to a particular memory location. State-of-the-art techniques that measure reuse distance in parallel applications rely on simulators or binary instrumentation tools that incur large performance and memory overheads. Moreover, the existing sampling-based tools are limited to measuring reuse distances of a single thread and discard interactions among threads in multi-threaded programs. In this work, we propose ReuseTracker —a fast and accurate reuse distance analyzer that leverages existing hardware features in commodity CPUs. ReuseTracker is designed for multi-threaded programs and takes cache-coherence effects into account. By utilizing hardware features like performance monitoring units and debug registers, ReuseTracker can accurately profile reuse distance in parallel applications with much lower overheads than existing tools. It introduces only 2.9× runtime and 2.8× memory overheads. Our tool achieves 92% accuracy when verified against a newly developed configurable benchmark that can generate a variety of different reuse distance patterns. We demonstrate the tool’s functionality with two use-case scenarios using PARSEC, Rodinia, and Synchrobench benchmark suites where ReuseTracker guides code refactoring in these benchmarks by detecting spatial reuses in shared caches that are also false sharing and successfully predicts whether some benchmarks in these suites can benefit from adjacent cache line prefetch optimization.

APA, Harvard, Vancouver, ISO, and other styles

13

Zhang, Kang, Fan Fu Zhou, and Alei Liang. "DCC: A Replacement Strategy for DBT System Based on Working Sets." Applied Mechanics and Materials 251 (December 2012): 114–18. http://dx.doi.org/10.4028/www.scientific.net/amm.251.114.

Full text

Abstract:

Limited memory resource has always been deemed as the major bottleneck of the program performance, not excepting for dynamic binary translation (DBT) [1] system. However, traditional methods seldom enable this issue mentioned to be solved better due to a ﬁx cache size routinely assigned for codes without considering the program behavior on the ﬂy. Though some prediction methods implemented via LRU [2] histograms can achieve better performance in Operating System, unlike pages in memory, blocks in DBT have unﬁxed sizes each other, making the prediction imprecise. In this paper, we present a new replace policy named DCC (Dynamic Code Cache) based on working set [3] for bounded code cache, which would dynamically change the size of the code cache and clear the cache according to the information of working set detected. It would sacriﬁce a small part of the performance to save the space, especially when many great or complicated applications are running on the same physical machine, and the performance might be better under this condition. For Evaluating, we test this strategy on a DBT system named Crossbit [4].We get around 30% space saved up with about 12% running time sacriﬁced.

APA, Harvard, Vancouver, ISO, and other styles

14

Duangthong, Chatuporn, Pornchai Supnithi, and Watid Phakphisut. "Two-Dimensional Error Correction Code for Spin-Transfer Torque Magnetic Random-Access Memory (STT-MRAM) Caches." ECTI Transactions on Computer and Information Technology (ECTI-CIT) 16, no. 3 (June 18, 2022): 237–46. http://dx.doi.org/10.37936/ecti-cit.2022163.246903.

Full text

Abstract:

Spin-Transfer Torque Magnetic Random-Access Memory (STT-MRAM) is an emerging nonvolatile memory (NVM) technology that can replace conventional cache memory in computer systems. STT-RAM has many desirable properties such as high writing and reading speed, non-volatility, and low power consumption. Since the cache requires a high speed of writing and reading speed, a single-error correction and double error detection (SEC - DED) are applicable to improve the reliability of the cache. However, the process variation and thermal fluctuation of STT-MRAM cause errors. For example, writing ‘1’ bits has more errors than writing ‘0’ bits. We then design the weight reduction code to reduce the error caused by writing ‘1’ bits. Moreover, the performance of an SEC-DED code is improved by constructing an SED-DED code as the product code. The simulation results demonstrate that the two-dimensional error correction code consisting of product code and weight reduction code is roughly 5.67 × 10−4 lower than the SEC-DED code when the error rate of writing ‘1’ bits is equal to 6 × 10−3.

APA, Harvard, Vancouver, ISO, and other styles

15

Gordon-Ross, Ann, Frank Vahid, and Nikil Dutt. "Combining code reordering and cache configuration." ACM Transactions on Embedded Computing Systems 11, no. 4 (December 2012): 1–20. http://dx.doi.org/10.1145/2362336.2399177.

Full text

APA, Harvard, Vancouver, ISO, and other styles

16

Zhao, Yiqiang, Boning Shi, Qizhi Zhang, Yidong Yuan, and Jiaji He. "Research on Cache Coherence Protocol Verification Method Based on Model Checking." Electronics 12, no. 16 (August 11, 2023): 3420. http://dx.doi.org/10.3390/electronics12163420.

Full text

Abstract:

This paper analyzes the underlying logic of the processor’s behavior level code. It proposes an automatic model construction and formal verification method for the cache consistency protocol with the aim of ensuring data consistency in the processor and the correctness of the cache function. The main idea of this method is to analyze the register transfer level (RTL) code directly at the module level and variable level, and extract the key modules and key variables according to the code information. Then, based on key variables, conditional behavior statements are retrieved from the code, and unnecessary statements are deleted. The model construction and simplification of related core states are completed automatically, while also simultaneously generating the attribute library to be verified, using “white list” as the construction strategy. Finally, complete cache consistency protocol verification is implemented in the model detector UPPAAL. Ultimately, this mechanism reduces the 142 state-transition path-guided global states of the cache module to be verified into 4 core functional states driven by consistency protocol implementation, effectively reducing the complexity of the formal model, and extracting 32 verification attributes into 6 verification attributes, reducing the verification time cost by 76.19%.

APA, Harvard, Vancouver, ISO, and other styles

17

Ding, Chen, Dong Chen, Fangzhou Liu, Benjamin Reber, and Wesley Smith. "CARL: Compiler Assigned Reference Leasing." ACM Transactions on Architecture and Code Optimization 19, no. 1 (March 31, 2022): 1–28. http://dx.doi.org/10.1145/3498730.

Full text

Abstract:

Data movement is a common performance bottleneck, and its chief remedy is caching. Traditional cache management is transparent to the workload: data that should be kept in cache are determined by the recency information only, while the program information, i.e., future data reuses, is not communicated to the cache. This has changed in a new cache design named Lease Cache . The program control is passed to the lease cache by a compiler technique called Compiler Assigned Reference Lease (CARL). This technique collects the reuse interval distribution for each reference and uses it to compute and assign the lease value to each reference. In this article, we prove that CARL is optimal under certain statistical assumptions. Based on this optimality, we prove miss curve convexity, which is useful for optimizing shared cache, and sub-partitioning monotonicity, which simplifies lease compilation. We evaluate the potential using scientific kernels from PolyBench and show that compiler insertions of up to 34 leases in program code achieve similar or better cache utilization (in variable size cache) than the optimal fixed-size caching policy, which has been unattainable with automatic caching but now within the potential of cache programming for all tested programs and most cache sizes.

APA, Harvard, Vancouver, ISO, and other styles

18

Vishnekov, A. V., and E. M. Ivanova. "DYNAMIC CONTROL METHODS OF CACHE LINES REPLACEMENT POLICY." Vestnik komp'iuternykh i informatsionnykh tekhnologii, no. 191 (May 2020): 49–56. http://dx.doi.org/10.14489/vkit.2020.05.pp.049-056.

Full text

Abstract:

The paper investigates the issues of increasing the performance of computing systems by improving the efficiency of cache memory, analyzes the efficiency indicators of replacement algorithms. We show the necessity of creation of automated or automatic means for cache memory tuning in the current conditions of program code execution, namely a dynamic cache replacement algorithms control by replacement of the current replacement algorithm by more effective one in current computation conditions. Methods development for caching policy control based on the program type definition: cyclic, sequential, locally-point, mixed. We suggest the procedure for selecting an effective replacement algorithm by support decision-making methods based on the current statistics of caching parameters. The paper gives the analysis of existing cache replacement algorithms. We propose a decision-making procedure for selecting an effective cache replacement algorithm based on the methods of ranking alternatives, preferences and hierarchy analysis. The critical number of cache hits, the average time of data query execution, the average cache latency are selected as indicators of initiation for the swapping procedure for the current replacement algorithm. The main advantage of the proposed approach is its universality. This approach assumes an adaptive decision-making procedure for the effective replacement algorithm selecting. The procedure allows the criteria variability for evaluating the replacement algorithms, its’ efficiency, and their preference for different types of program code. The dynamic swapping of the replacement algorithm with a more efficient one during the program execution improves the performance of the computer system.

APA, Harvard, Vancouver, ISO, and other styles

19

Vishnekov, A. V., and E. M. Ivanova. "DYNAMIC CONTROL METHODS OF CACHE LINES REPLACEMENT POLICY." Vestnik komp'iuternykh i informatsionnykh tekhnologii, no. 191 (May 2020): 49–56. http://dx.doi.org/10.14489/vkit.2020.05.pp.049-056.

Full text

Abstract:

The paper investigates the issues of increasing the performance of computing systems by improving the efficiency of cache memory, analyzes the efficiency indicators of replacement algorithms. We show the necessity of creation of automated or automatic means for cache memory tuning in the current conditions of program code execution, namely a dynamic cache replacement algorithms control by replacement of the current replacement algorithm by more effective one in current computation conditions. Methods development for caching policy control based on the program type definition: cyclic, sequential, locally-point, mixed. We suggest the procedure for selecting an effective replacement algorithm by support decision-making methods based on the current statistics of caching parameters. The paper gives the analysis of existing cache replacement algorithms. We propose a decision-making procedure for selecting an effective cache replacement algorithm based on the methods of ranking alternatives, preferences and hierarchy analysis. The critical number of cache hits, the average time of data query execution, the average cache latency are selected as indicators of initiation for the swapping procedure for the current replacement algorithm. The main advantage of the proposed approach is its universality. This approach assumes an adaptive decision-making procedure for the effective replacement algorithm selecting. The procedure allows the criteria variability for evaluating the replacement algorithms, its’ efficiency, and their preference for different types of program code. The dynamic swapping of the replacement algorithm with a more efficient one during the program execution improves the performance of the computer system.

APA, Harvard, Vancouver, ISO, and other styles

20

Ma, Cong, Dinghao Wu, Gang Tan, Mahmut Taylan Kandemir, and Danfeng Zhang. "Quantifying and Mitigating Cache Side Channel Leakage with Differential Set." Proceedings of the ACM on Programming Languages 7, OOPSLA2 (October 16, 2023): 1470–98. http://dx.doi.org/10.1145/3622850.

Full text

Abstract:

Cache side-channel attacks leverage secret-dependent footprints in CPU cache to steal confidential information, such as encryption keys. Due to the lack of a proper abstraction for reasoning about cache side channels, existing static program analysis tools that can quantify or mitigate cache side channels are built on very different kinds of abstractions. As a consequence, it is hard to bridge advances in quantification and mitigation research. Moreover, existing abstractions lead to imprecise results. In this paper, we present a novel abstraction, called differential set, for analyzing cache side channels at compile time. A distinguishing feature of differential sets is that it allows compositional and precise reasoning about cache side channels. Moreover, it is the first abstraction that carries sufficient information for both side channel quantification and mitigation. Based on this new abstraction, we develop a static analysis tool DSA that automatically quantifies and mitigates cache side channel leakage at the same time. Experimental evaluation on a set of commonly used benchmarks shows that DSA can produce more precise leakage bound as well as mitigated code with fewer memory footprints, when compared with state-of-the-art tools that only quantify or mitigate cache side channel leakage.

APA, Harvard, Vancouver, ISO, and other styles

21

Sahuquillo, Julio, Noel Tomas, Salvador Petit, and Ana Pont. "Spim-Cache: A Pedagogical Tool for Teaching Cache Memories Through Code-Based Exercises." IEEE Transactions on Education 50, no. 3 (August 2007): 244–50. http://dx.doi.org/10.1109/te.2007.900021.

Full text

APA, Harvard, Vancouver, ISO, and other styles

22

Liu, Cong, Xinyu Xu, Zhenjiao Chen, and Binghao Wang. "A Universal-Verification-Methodology-Based Testbench for the Coverage-Driven Functional Verification of an Instruction Cache Controller." Electronics 12, no. 18 (September 9, 2023): 3821. http://dx.doi.org/10.3390/electronics12183821.

Full text

Abstract:

The Cache plays an important role in computer architecture by reducing the access time of the processor and improving its performance. The hardware design of the Cache is complex and it is challenging to verify its functions, so the traditional Verilog-based verification method is no longer applicable. This paper proposes a comprehensive and efficient verification testbench based on the SystemVerilog language and universal verification methodology (UVM) for an instruction Cache (I-Cache) controller. Corresponding testcases are designed for each feature of the I-Cache controller and automatically executed using a python script on an electronic design automation (EDA) tool. After simulating a large number of testcases, the statistics reveal that the module’s code coverage is 99.13%. Additionally, both the function coverage and the assertion coverage of the module reach 100%. Our results demonstrate that these coverage metrics meet the requirements and ensure the thoroughness of function verification. Furthermore, the established verification testbench exhibits excellent scalability and reusability, making it easily applicable to higher-level verification scenarios.

APA, Harvard, Vancouver, ISO, and other styles

23

Makhkamova, Ozoda, and Doohyun Kim. "A Conversation History-Based Q&A Cache Mechanism for Multi-Layered Chatbot Services." Applied Sciences 11, no. 21 (October 25, 2021): 9981. http://dx.doi.org/10.3390/app11219981.

Full text

Abstract:

Chatbot technologies have made our lives easier. To create a chatbot with high intelligence, a significant amount of knowledge processing is required. However, this can slow down the reaction time; hence, a mechanism to enable a quick response is needed. This paper proposes a cache mechanism to improve the response time of the chatbot service; while the cache in CPU utilizes the locality of references within binary code executions, our cache mechanism for chatbots uses the frequency and relevance information which potentially exists within the set of Q&A pairs. The proposed idea is to enable the broker in a multi-layered structure to analyze and store the keyword-wise relevance of the set of Q&A pairs from chatbots. In addition, the cache mechanism accumulates the frequency of the input questions by monitoring the conversation history. When a cache miss occurs, the broker selects a chatbot according to the frequency and relevance, and then delivers the query to the selected chatbot to obtain a response for answer. This mechanism showed a significant increase in the cache hit ratio as well as an improvement in the average response time.

APA, Harvard, Vancouver, ISO, and other styles

24

Lin, Bo, Shangwen Wang, Ming Wen, and Xiaoguang Mao. "Context-Aware Code Change Embedding for Better Patch Correctness Assessment." ACM Transactions on Software Engineering and Methodology 31, no. 3 (July 31, 2022): 1–29. http://dx.doi.org/10.1145/3505247.

Full text

Abstract:

Despite the capability in successfully fixing more and more real-world bugs, existing Automated Program Repair (APR) techniques are still challenged by the long-standing overfitting problem (i.e., a generated patch that passes all tests is actually incorrect). Plenty of approaches have been proposed for automated patch correctness assessment (APCA ). Nonetheless, dynamic ones (i.e., those that needed to execute tests) are time-consuming while static ones (i.e., those built on top of static code features) are less precise. Therefore, embedding techniques have been proposed recently, which assess patch correctness via embedding token sequences extracted from the changed code of a generated patch. However, existing techniques rarely considered the context information and program structures of a generated patch, which are crucial for patch correctness assessment as revealed by existing studies. In this study, we explore the idea of context-aware code change embedding considering program structures for patch correctness assessment. Specifically, given a patch, we not only focus on the changed code but also take the correlated unchanged part into consideration, through which the context information can be extracted and leveraged. We then utilize the AST path technique for representation where the structure information from AST node can be captured. Finally, based on several pre-defined heuristics, we build a deep learning based classifier to predict the correctness of the patch. We implemented this idea as Cache and performed extensive experiments to assess its effectiveness. Our results demonstrate that Cache can (1) perform better than previous representation learning based techniques (e.g., Cache relatively outperforms existing techniques by \( \approx \) 6%, \( \approx \) 3%, and \( \approx \) 16%, respectively under three diverse experiment settings), and (2) achieve overall higher performance than existing APCA techniques while even being more precise than certain dynamic ones including PATCH-SIM (92.9% vs. 83.0%). Further results reveal that the context information and program structures leveraged by Cache contributed significantly to its outstanding performance.

APA, Harvard, Vancouver, ISO, and other styles

25

Ansari, Ali, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. "Code Layout Optimization for Near-Ideal Instruction Cache." IEEE Computer Architecture Letters 18, no. 2 (July 1, 2019): 124–27. http://dx.doi.org/10.1109/lca.2019.2924429.

Full text

APA, Harvard, Vancouver, ISO, and other styles

26

Tomiyama, Hiroyuki, and Hiroto Yasuura. "Code placement techniques for cache miss rate reduction." ACM Transactions on Design Automation of Electronic Systems 2, no. 4 (October 1997): 410–29. http://dx.doi.org/10.1145/268424.268469.

Full text

APA, Harvard, Vancouver, ISO, and other styles

27

Ryoo, Jihyun, Mahmut Taylan Kandemir, and Mustafa Karakoy. "Memory Space Recycling." Proceedings of the ACM on Measurement and Analysis of Computing Systems 6, no. 1 (February 24, 2022): 1–24. http://dx.doi.org/10.1145/3508034.

Full text

Abstract:

Many program codes from different application domains process very large amounts of data, making their cache memory behavior critical for high performance. Most of the existing work targeting cache memory hierarchies focus on improving data access patterns, e.g., maximizing sequential accesses to program data structures via code and/or data layout restructuring strategies. Prior work has addressed this data locality optimization problem in the context of both single-core and multi-core systems. Another dimension of optimization, which can be as equally important/beneficial as improving data access pattern is to reduce the data volume (total number of addresses) accessed by the program code. Compared to data access pattern restructuring, this volume minimization problem has relatively taken much less attention. In this work, we focus on this volume minimization problem and address it in both single-core and multi-core execution scenarios. Specifically, we explore the idea of rewriting an application program code to reduce its "memory space footprint". The main idea behind this approach is to reuse/recycle, for a given data element, a memory location that has originally been assigned to another data element, provided that the lifetimes of these two data elements do not overlap with each other. A unique aspect is that it is "distance aware", i.e., in identifying the memory/cache locations to recycle it takes into account the physical distance between the location of the core and the memory/cache location to be recycled. We present a detailed experimental evaluation of our proposed memory space recycling strategy, using five different metrics: memory space consumption, network footprint, data access distance, cache miss rate, and execution time. The experimental results show that our proposed approach brings, respectively, 33.2%, 48.6%, 46.5%, 31.8%, and 27.9% average improvements in these metrics, in the case of single-threaded applications. With the multi-threaded versions of the same applications, the achieved improvements are 39.5%, 55.5%, 53.4%, 26.2%, and 22.2%, in the same order.

APA, Harvard, Vancouver, ISO, and other styles

28

Błaszyński, Piotr, and Włodzimierz Bielecki. "High-Performance Computation of the Number of Nested RNA Structures with 3D Parallel Tiled Code." Eng 4, no. 1 (February 3, 2023): 507–25. http://dx.doi.org/10.3390/eng4010030.

Full text

Abstract:

Many current bioinformatics algorithms have been implemented in parallel programming code. Some of them have already reached the limits imposed by Amdahl’s law, but many can still be improved. In our paper, we present an approach allowing us to generate a high-performance code for calculating the number of RNA pairs. The approach allows us to generate parallel tiled code of the maximal dimension of tiles, which for the discussed algorithm is 3D. Experiments carried out by us on two modern multi-core computers, an Intel(R) Xeon(R) Gold 6326 (2.90 GHz, 2 physical units, 32 cores, 64 threads, 24 MB Cache) and Intel(R) i7(11700KF (3.6 GHz, 8 cores, 16 threads, 16 MB Cache), demonstrate a significant increase in performance and scalability of the generated parallel tiled code. For the Intel(R) Xeon(R) Gold 6326 and Intel(R) i7, target code speedup increases linearly with an increase in the number of threads. An approach presented in the paper to generate target code can be used by programmers to generate target parallel tiled code for other bioinformatics codes whose dependence patterns are similar to those of the code implementing the counting algorithm.

APA, Harvard, Vancouver, ISO, and other styles

29

Bielecki, Włodzimierz, Piotr Błaszyński, and Marek Pałkowski. "3D Tiled Code Generation for Nussinov’s Algorithm." Applied Sciences 12, no. 12 (June 9, 2022): 5898. http://dx.doi.org/10.3390/app12125898.

Full text

Abstract:

Current state-of-the-art parallel codes used to calculate the maximum number of pairs for a given RNA sequence by means of Nussinov’s algorithm do not allow for achieving speedup close up to the number of the processors used for execution of those codes on multi-core computers. This is due to the fact that known codes do not make full use of and derive benefit from cache memory of such computers. There is a need to develop new approaches allowing for increasing cache exploitation in multi-core computers. One of such possibilities is increasing the dimension of tiles in generated target tiled code and assuring a similar size of generated tiles. The article presents an approach allowing us to produce 3D parallel code with tiling calculating Nussinov’s RNA folding, i.e., code with the maximal tile dimension possible for the loop nest, executing Nussinov’s algorithm. The approach guarantees that generated tiles are of a similar size. The code generated with the presented approach is characterized by increased code locality and outperforms all closely related ones examined by us. This allows us to considerably reduce execution time required for computing the maximum number of pairs of any nested structure for larger RNA sequences by means of Nussinov’s algorithm.

APA, Harvard, Vancouver, ISO, and other styles

30

Murugan, Dr. "Hybrid LRU Algorithm for Enterprise Data Hub using Serverless Architecture." Turkish Journal of Computer and Mathematics Education (TURCOMAT) 12, no. 4 (April 11, 2021): 441–49. http://dx.doi.org/10.17762/turcomat.v12i4.525.

Full text

Abstract:

In hybrid LRU algorithm was built to execute parameterized priority queue using Least Recently Used model. It helped to determine the object in an optimum mode to remove from cache. Experiment results demonstrated ~30% decrease of the execution time to extract data from cache store during object cache extraction process. In the era of modern utility computing theory, Serverless architecture is the cloud platform concept to hide the server usage from the development community and runs the code on-demand basis. This paper provides Hybrid LRU algorithm by leveraging Serverless Architecture benefits. Eventually, this new technique added few advantages like infrastructure instance scalability, no server management, reduced cost on efficient usage, etc. This paper depicts about the experimental advantage of Hybrid LRU execution time optimization using Serverless architecture.

APA, Harvard, Vancouver, ISO, and other styles

31

Steenkiste, P. "The impact of code density on instruction cache performance." ACM SIGARCH Computer Architecture News 17, no. 3 (June 1989): 252–59. http://dx.doi.org/10.1145/74926.74954.

Full text

APA, Harvard, Vancouver, ISO, and other styles

32

Marathe, Jaydeep, and Frank Mueller. "Source-Code-Correlated Cache Coherence Characterization of OpenMP Benchmarks." IEEE Transactions on Parallel and Distributed Systems 18, no. 6 (June 2007): 818–34. http://dx.doi.org/10.1109/tpds.2007.1058.

Full text

APA, Harvard, Vancouver, ISO, and other styles

33

Naik Dessai, Sanket Suresh, and Varuna Eswer. "Embedded Software Testing to Determine BCM5354 Processor Performance." International Journal of Software Engineering and Technologies (IJSET) 1, no. 3 (December 1, 2016): 121. http://dx.doi.org/10.11591/ijset.v1i3.4577.

Full text

Abstract:

Efficiency of a processor is a critical factor for an embedded system. One of the deciding factors for efficiency is the functioning of the L1 cache and Translation Lookaside Buffer (TLB). Certain processors have the L1 cache and TLB managed by the operating system, MIPS32 is one such processor. The performance of the L1 cache and TLB necessitates a detailed study to understand its management during varied load on the processor. This paper presents an implementation of embedded testing procedure to analyse the performance of the MIPS32 processor L1 cache and TLB management by the operating system (OS). The implementation proposed for embedded testing in the paper considers the counting of the respective cache and TLB management instruction execution, which is an event that is measurable with the use of dedicated counters. The lack of hardware counters in the MIPS32 processor results in the usage of software based event counters that are defined in the kernel. This paper implements embedding testbed with a subset of MIPS32 processor performance measurement metrics using software based counters. Techniques were developed to overcome the challenges posed by the kernel source code. To facilitate better understanding of the testbed implementation procedure of the software based processor performance counters; use-case analysis diagram, flow charts, screen shots, and knowledge nuggets are supplemented along with histograms of the cache and TLB events data generated by the proposed implementation. In this testbed twenty-seven metrics have been identified and implemented to provide data related to the events of the L1 cache and TLB on the MIPS32 processor. The generated data can be used in tuning of compiler, OS memory management design, system benchmarking, scalability, analysing architectural issues, address space analysis, understanding bus communication, kernel profiling, and workload characterisation.

APA, Harvard, Vancouver, ISO, and other styles

34

Oktrifianto, Rahmat, Dani Adhipta, and Warsun Najib. "Page Load Time Speed Increase on Disease Outbreak Investigation Information System Website." IJITEE (International Journal of Information Technology and Electrical Engineering) 2, no. 4 (September 10, 2019): 114. http://dx.doi.org/10.22146/ijitee.46599.

Full text

Abstract:

Outbreaks or extraordinary events often become an issue that occurs in Indonesia. Therefore, an outbreak investigation information system is required to collect, manage and analyze data quickly and accurately. On the other hand, challenges in data accessing processes in certain locations are still constrained by a slow internet connection. This paper conducted speed increase of a page load or site speed time from disease outbreaks investigation information system website.Page load time speed testing was carried out using Google Chrome Developer Tools and using simulation speeds of 2.5 Mbps. Testing time was carried out by dividing the time into three sections, morning hours, working hours and night hours. Implementation of page load time increase includes reducing HTTP requests, utilizing GZIP compression, performing code minification, setting browser chache, using CDN, and using other enhancement techniques.The results showed that after implementing an increase in page load time by turning off cache and using cache, there was an increase in site speed. When the browser cache was turned off, an average page load time increased of 54.79% from the previous time. Whereas when using the browser cache, page load time speed increased by 55.28% from the previous time.

APA, Harvard, Vancouver, ISO, and other styles

35

Wang, Xiang, Zongmin Zhao, Dongdong Xu, Zhun Zhang, Qiang Hao, Mengchen Liu, and Yu Si. "Two-Stage Checkpoint Based Security Monitoring and Fault Recovery Architecture for Embedded Processor." Electronics 9, no. 7 (July 18, 2020): 1165. http://dx.doi.org/10.3390/electronics9071165.

Full text

Abstract:

Nowadays, the secure program execution of embedded processor has attracted considerable research attention, since more and more code tampering attacks and transient faults are seriously affecting the security of embedded processors. The program monitoring and fault recovery strategies are not only closely related to the security of embedded devices, but also directly affect the performance of the processor. This paper presents a security monitoring and fault recovery architecture for run-time program execution, which takes regular backup copies of the two-stage checkpoint. In this framework, the integrity check technology based on the basic block (BB) is utilized to monitor the program execution in real-time, while the rollback operation is taken once the integrity check is failed. In addition, a Monitoring Cache (M-Cache) is built to buffer the reference data for integrity checking. Moreover, a recovery strategy mainly for three tampered positions (registers in processor, instructions in Cache, and codes in memory) is provided to ensure the smooth running of the embedded system. Finally, the open RISC processor is adopted to implement and verify the presented security architecture, which has been proved to be effective for program detection in the execution of tamper attack and quick recovery of the running environment as well as code.

APA, Harvard, Vancouver, ISO, and other styles

36

Eswer, Varuna, and Sanket Suresh Naik Dessai. "Embedded Software Engineering Approach to Implement BCM5354 Processor Performance." International Journal of Software Engineering and Technologies (IJSET) 1, no. 1 (April 1, 2016): 41. http://dx.doi.org/10.11591/ijset.v1i1.4568.

Full text

Abstract:

Efficiency of a processor is a critical factor for an embedded system. One of the deciding factors for efficiency is the functioning of the L1 cache and Translation Lookaside Buffer (TLB). Certain processors have the L1 cache and TLB managed by the operating system, MIPS32 is one such processor. The performance of the L1 cache and TLB necessitates a detailed study to understand its management during varied load on the processor. This paper presents an implementation to analyse the performance of the MIPS32 processor L1 cache and TLB management by the operating system (OS) using software engineering approach. Software engineering providing better clearity for the system developemt and its performance analysis.In the initial stage if the requirement analysis for the performance measurment sort very clearly,the methodologies for the implementation becomes very economical without any ambigunity.In this paper a implementation is proposed to determine the processor performance metrics using a software engineering approach considering the counting of the respective cache and TLB management instruction execution, which is an event that is measurable with the use of dedicated counters. The lack of hardware counters in the MIPS32 processor results in the usage of software based event counters that are defined in the kernel. This paper implements a subset of MIPS32 processor performance measurement metrics using software based counters. Techniques were developed to overcome the challenges posed by the kernel source code. To facilitate better understanding of the implementation procedure of the software based processor performance counters; use-case analysis diagram, flow charts, screen shots, and knowledge nuggets are supplemented along with histograms of the cache and TLB events data generated by the proposed implementation. Twenty-seven metrics have been identified and implemented to provide data related to the events of the L1 cache and TLB on the MIPS32 processor. The generated data can be used in tuning of compiler, OS memory management design, system benchmarking, scalability, analysing architectural issues, address space analysis, understanding bus communication, kernel profiling, and workload characterisation.

APA, Harvard, Vancouver, ISO, and other styles

37

Wang, Weike, Xiang Wang, Pei Du, Yuntong Tian, Xiaobing Zhang, Qiang Hao, Zhun Zhang, and Bin Xu. "Embedded System Confidentiality Protection by Cryptographic Engine Implemented with Composite Field Arithmetic." MATEC Web of Conferences 210 (2018): 02047. http://dx.doi.org/10.1051/matecconf/201821002047.

Full text

Abstract:

Embedded systems are subjecting to various kinds of security threats. Some malicious attacks exploit valid code gadgets to launch destructive actions or to reveal critical details. Some previous memory encryption strategies aiming at this issue suffer from unacceptable performance overhead and resource consumption. This paper proposes a hardware based confidentiality protection method to secure the code and data stored and transferred in embedded systems. This method takes advantage of the I/D-cache structure to reduce the frequency of the cryptographic encryption and decryption processing. We implement the AES engine with composite field arithmetic to reduce the cost of hardware implementation. The proposed architecture is implemented on EP2C70 FPGA chip with OpenRisc 1200 based SoC. The experiment results show that the AES engine is required to work only in the case of I/D-cache miss and the hardware implementation overhead can save 53.24% and 13.39% for the AES engine and SoC respectively.

APA, Harvard, Vancouver, ISO, and other styles

38

Benini, L., A. Macii, and A. Nannarelli. "Code compression architecture for cache energy minimisation in embedded systems." IEE Proceedings - Computers and Digital Techniques 149, no. 4 (2002): 157. http://dx.doi.org/10.1049/ip-cdt:20020467.

Full text

APA, Harvard, Vancouver, ISO, and other styles

39

Chen, W. Y., P. P. Chang, T. M. Conte, and W. W. Hwu. "The effect of code expanding optimizations on instruction cache design." IEEE Transactions on Computers 42, no. 9 (1993): 1045–57. http://dx.doi.org/10.1109/12.241594.

Full text

APA, Harvard, Vancouver, ISO, and other styles

40

Fahringer, T., and A. Požgaj. "P3T+: A Performance Estimator for Distributed and Parallel Programs." Scientific Programming 8, no. 2 (2000): 73–93. http://dx.doi.org/10.1155/2000/217384.

Full text

Abstract:

Developing distributed and parallel programs on today's multiprocessor architectures is still a challenging task. Particular distressing is the lack of effective performance tools that support the programmer in evaluating changes in code, problem and machine sizes, and target architectures. In this paper we introduceP3T+ which is a performance estimator for mostly regular HPF (High Performance Fortran) programs but partially covers also message passing programs (MPI).P3T+ is unique by modeling programs, compiler code transformations, and parallel and distributed architectures. It computes at compile-time a variety of performance parameters including work distribution, number of transfers, amount of data transferred, transfer times, computation times, and number of cache misses. Several novel technologies are employed to compute these parameters: loop iteration spaces, array access patterns, and data distributions are modeled by employing highly effective symbolic analysis. Communication is estimated by simulating the behavior of a communication library used by the underlying compiler. Computation times are predicted through pre-measured kernels on every target architecture of interest. We carefully model most critical architecture specific factors such as cache lines sizes, number of cache lines available, startup times, message transfer time per byte, etc.P3T+ has been implemented and is closely integrated with the Vienna High Performance Compiler (VFC) to support programmers develop parallel and distributed applications. Experimental results for realistic kernel codes taken from real-world applications are presented to demonstrate both accuracy and usefulness ofP3T+.

APA, Harvard, Vancouver, ISO, and other styles

41

Shin, Dong-Jin, and Jeong-Joon Kim. "Cache-Based Matrix Technology for Efficient Write and Recovery in Erasure Coding Distributed File Systems." Symmetry 15, no. 4 (April 6, 2023): 872. http://dx.doi.org/10.3390/sym15040872.

Full text

Abstract:

With the development of various information and communication technologies, the amount of big data has increased, and distributed file systems have emerged to store them stably. The replication technique divides the original data into blocks and writes them on multiple servers for redundancy and fault tolerance. However, there is a symmetrical space efficiency problem that arises from the need to store blocks larger than the original data. When storing data, the Erasure Coding (EC) technique generates parity blocks through encoding calculations and writes them separately on each server for fault tolerance and data recovery purposes. Even if a specific server fails, original data can still be recovered through decoding calculations using the parity blocks stored on the remaining servers. However, matrices generated during encoding and decoding are redundantly generated during data writing and recovery, which leads to unnecessary overhead in distributed file systems. This paper proposes a cache-based matrix technique that uploads the matrices generated during encoding and decoding to cache memory and reuses them, rather than generating new matrices each time encoding or decoding occurs. The design of the cache memory applies the Weighting Size and Cost Replacement Policy (WSCRP) algorithm to efficiently upload and reuse matrices to cache memory using parameters known as weights and costs. Furthermore, the cache memory table can be managed efficiently because the weight–cost model sorts and updates matrices using specific parameters, which reduces replacement cost. The experiment utilized the Hadoop Distributed File System (HDFS) as the distributed file system, and the EC volume was composed of Reed–Solomon code with parameters (6, 3). As a result of the experiment, it was possible to reduce the write, read, and recovery times associated with encoding and decoding. In particular, for up to three node failures, systems using WSCRP were able to reduce recovery time by about 30 s compared to regular HDFS systems.

APA, Harvard, Vancouver, ISO, and other styles

42

Sieck, Florian, Zhiyuan Zhang, Sebastian Berndt, Chitchanok Chuengsatiansup, Thomas Eisenbarth, and Yuval Yarom. "TeeJam: Sub-Cache-Line Leakages Strike Back." IACR Transactions on Cryptographic Hardware and Embedded Systems 2024, no. 1 (December 4, 2023): 457–500. http://dx.doi.org/10.46586/tches.v2024.i1.457-500.

Full text

Abstract:

The microarchitectural behavior of modern CPUs is mostly hidden from developers and users of computer software. Due to a plethora of attacks exploiting microarchitectural behavior, developers of security-critical software must, e.g., ensure their code is constant-time, which is cumbersome and usually results in slower programs. In practice, small leakages which are deemed not exploitable still remain in the codebase. For example, sub-cache-line leakages have previously been investigated in the CacheBleed and MemJam attacks, which are deemed impractical on modern platforms.In this work, we revisit and carefully analyze the 4k-aliasing effect and discover that the measurable delay introduced by this microarchitectural effect is higher than found by previous work and described by Intel. By combining the rediscovered effect with a high temporal resolution possible when single-stepping an SGX enclave, we construct a very precise, yet widely applicable attack with sub-cache-line leakage resolution. o demonstrate the significance of our findings, we apply the new attack primitive to break a hardened AES T-Table implementation that features constant cache line access patterns. The attack is up to three orders of magnitude more efficient than previous sub-cache-line attacks on AES in SGX. Furthermore, we improve upon the recent work of Sieck et al. which showed partial exploitability of very faint leakages in a utility function loading base64-encoded RSA keys. With reliable sub-cache-line resolution, we build an end-to-end attack exploiting the faint leakage that can recover 4096-bit keys in minutes on a laptop. Finally, we extend the key recovery algorithm to also work for RSA keys following the standard that uses Carmichael’s totient function, while previous attacks were restricted to RSA keys using Euler’s totient function.

APA, Harvard, Vancouver, ISO, and other styles

43

Cho, Won, and Joonho Kong. "Memory and Cache Contention Denial-of-Service Attack in Mobile Edge Devices." Applied Sciences 11, no. 5 (March 8, 2021): 2385. http://dx.doi.org/10.3390/app11052385.

Full text

Abstract:

In this paper, we introduce a memory and cache contention denial-of-service attack and its hardware-based countermeasure. Our attack can significantly degrade the performance of the benign programs by hindering the shared resource accesses of the benign programs. It can be achieved by a simple C-based malicious code while degrading the performance of the benign programs by 47.6% on average. As another side-effect, our attack also leads to greater energy consumption of the system by 2.1× on average, which may cause shorter battery life in the mobile edge devices. We also propose detection and mitigation techniques for thwarting our attack. By analyzing L1 data cache miss request patterns, we effectively detect the malicious program for the memory and cache contention denial-of-service attack. For mitigation, we propose using instruction fetch width throttling techniques to restrict the malicious accesses to the shared resources. When employing our malicious program detection with the instruction fetch width throttling technique, we recover the system performance and energy by 92.4% and 94.7%, respectively, which means that the adverse impacts from the malicious programs are almost removed.

APA, Harvard, Vancouver, ISO, and other styles

44

Savage, John E., and Mohammad Zubair. "Evaluating Multicore Algorithms on the Unified Memory Model." Scientific Programming 17, no. 4 (2009): 295–308. http://dx.doi.org/10.1155/2009/681708.

Full text

Abstract:

One of the challenges to achieving good performance on multicore architectures is the effective utilization of the underlying memory hierarchy. While this is an issue for single-core architectures, it is a critical problem for multicore chips. In this paper, we formulate the unified multicore model (UMM) to help understand the fundamental limits on cache performance on these architectures. The UMM seamlessly handles different types of multiple-core processors with varying degrees of cache sharing at different levels. We demonstrate that our model can be used to study a variety of multicore architectures on a variety of applications. In particular, we use it to analyze an option pricing problem using the trinomial model and develop an algorithm for it that has near-optimal memory traffic between cache levels. We have implemented the algorithm on a two Quad-Core Intel Xeon 5310 1.6 GHz processors (8 cores). It achieves a peak performance of 19.5 GFLOPs, which is 38% of the theoretical peak of the multicore system. We demonstrate that our algorithm outperforms compiler-optimized and auto-parallelized code by a factor of up to 7.5.

APA, Harvard, Vancouver, ISO, and other styles

45

Xu, Xiaoran, Keith Cooper, Jacob Brock, Yan Zhang, and Handong Ye. "ShareJIT: JIT code cache sharing across processes and its practical implementation." Proceedings of the ACM on Programming Languages 2, OOPSLA (October 24, 2018): 1–23. http://dx.doi.org/10.1145/3276494.

Full text

APA, Harvard, Vancouver, ISO, and other styles

46

Bottcher, Axel. "A visualization environment for super scalar machines." Facta universitatis - series: Electronics and Energetics 17, no. 2 (2004): 199–208. http://dx.doi.org/10.2298/fuee0402199b.

Full text

Abstract:

In this paper, we introduce an environment to visualize the internal activities of super scalar processors. This seems currently to be the dominating class of processors on the market. A programmer or a compiler can produce optimized code only with a thorough understanding of the internal structures. This usefulness of this environment is then demonstrated for two aspects of program optimization: loop unrolling in situations with cold or perfectly warmed cache and instruction ordering. We use matrix multiplication as representative example to reflect signal processing code.

APA, Harvard, Vancouver, ISO, and other styles

47

Wang, Bei, Stephane Ethier, William Tang, Khaled Z. Ibrahim, Kamesh Madduri, Samuel Williams, and Leonid Oliker. "Modern gyrokinetic particle-in-cell simulation of fusion plasmas on top supercomputers." International Journal of High Performance Computing Applications 33, no. 1 (June 29, 2017): 169–88. http://dx.doi.org/10.1177/1094342017712059.

Full text

Abstract:

The gyrokinetic toroidal code at Princeton (GTC-P) is a highly scalable and portable particle-in-cell (PIC) code. It solves the 5-D Vlasov–Poisson equation featuring efficient utilization of modern parallel computer architectures at the petascale and beyond. Motivated by the goal of developing a modern code capable of dealing with the physics challenge of increasing problem size with sufficient resolution, new thread-level optimizations have been introduced as well as a key additional domain decomposition. GTC-P’s multiple levels of parallelism, including internode 2-D domain decomposition and particle decomposition, as well as intranode shared memory partition and vectorization, have enabled pushing the scalability of the PIC method to extreme computational scales. In this article, we describe the methods developed to build a highly parallelized PIC code across a broad range of supercomputer designs. This particularly includes implementations on heterogeneous systems using NVIDIA GPU accelerators and Intel Xeon Phi (MIC) coprocessors and performance comparisons with state-of-the-art homogeneous HPC systems such as Blue Gene/Q. New discovery science capabilities in the magnetic fusion energy application domain are enabled, including investigations of ion–temperature–gradient driven turbulence simulations with unprecedented spatial resolution and long temporal duration. Performance studies with realistic fusion experimental parameters are carried out on multiple supercomputing systems spanning a wide range of cache capacities, cache-sharing configurations, memory bandwidth, interconnects, and network topologies. These performance comparisons using a realistic discovery-science-capable domain application code provide valuable insights on optimization techniques across one of the broadest sets of current high-end computing platforms worldwide.

APA, Harvard, Vancouver, ISO, and other styles

48

Ishitobi, Yuriko, Tohru Ishihara, and Hiroto Yasuura. "Code and Data Placement for Embedded Processors with Scratchpad and Cache Memories." Journal of Signal Processing Systems 60, no. 2 (November 5, 2008): 211–24. http://dx.doi.org/10.1007/s11265-008-0306-3.

Full text

APA, Harvard, Vancouver, ISO, and other styles

49

PARUCHURI, PAVAN KUMAR, Satyanarayana CH, Ananda Rao A, and Radica Raju P. "Design and Implementation of Task Reprocessing on Medium-large Multi-core Architecture." Application and Theory of Computer Technology 2, no. 3 (April 27, 2017): 25. http://dx.doi.org/10.22496/atct.v2i3.80.

Full text

Abstract:

EDF-Based real-time scheduling approach is one of the efficient way of scheduling the recurrent real-time tasks in both soft real-time and hard real-time systems. EDF algorithms entile heavy overhead due to lower scheduling and huge migration. In this paper, an exploratory procedure has been presented for reprocessing of neglected task sets and also examine, whether an improved heuristic set attributed as Enhanced shared CAche Performance (ENCAP) can improve the shared cache performance which processes the task set based on Earliest-Deadline-First (EDF). An Object-Oriented real-time code reusable scheduling components which is designed from the scratch referred as LITMUSRT has been used to test the efficiency and average-case accomplishments where per task utilizations are shortened and overall utilization are not restricted. We also discuss the implementation and experimental evaluation of Task-Repo procedure for co-scheduling task sets, under the utilization of 50-60% (per-task) on small to medium-large multicore LINUX SYSTEM.

APA, Harvard, Vancouver, ISO, and other styles

50

Morse, Gregory. "Self-Spectre, Write-Execute and the Hidden State." Tatra Mountains Mathematical Publications 73, no. 1 (August 1, 2019): 131–44. http://dx.doi.org/10.2478/tmmp-2019-0010.

Full text

Abstract:

Abstract The recent Meltdown and Spectre vulnerabilities have highlighted a very present and real threat in the on-chip memory cache units which can ultimately provide a hidden state, albeit only readable via memory timing instructions [Kocher, P.—Genkin, D.— Gruss, D.— Haas, W.—Hamburg, M.—Lipp, M.–Mangard, S.—Prescher, T.—Schwarz, M.—Yarom, Y.: Spectre attacks: Exploiting speculative execution, CoRR, abs/1801.01203, 2018]. Yet the exploits, although having some complexity and slowness, are demonstrably reliable on nearly all processors produced for the last two decades. Moving out from looking at this strictly as a means of reading protected memory, as the large microprocessor companies move to close this security vulnerability, an interesting question arises. Could the inherent design of the processor give the ability to hide arbitrary calculations in this speculative and parallel side channel? Without even using protected memory and exploiting the vulnerability, as has been the focus, there could very well be a whole class of techniques which exploit the side-channel. It could be done in a way which would be largely un-preventable behavior as the technology would start to become self-defeating or require a more complicated and expensive on-chip cache memory system to properly post-speculatively clean itself. And the ability to train the branch predictor to incorrectly speculatively behave is almost certain given hardware limitations, andthusprovidesexactly this pathway. A novel approach looks at just how much computation can be done speculatively with a result store via indirect reads and available through the memory cache. A multi-threaded approach can allow a multi-stage computation pipeline where each computation is passed to a read-out thread and then to the next computation thread [Swanson, S.—McDowell, L. K.—Swift, M. M.—Eggers, S. J.–Levy H. M.: An evaluation of speculative instruction execution on simultaneous multithreaded processors, ACM Trans. Comput. Syst. 21 (2003), 314–340]. Through channels like this, an application can surreptitiously make arbitrary calculations, or even leak data without any standard tracing tools being capable of monitoring the subtle changes. Like a variation of the famous physics Heisenberg uncertainty principle, even a tool capable of reading the cache states would not only be incredibly inefficient, but thereby tamper with and modify the state. Tools like in-circuit emulators, or specially designed cache emulators would be needed to unmask the speculative reads, and it is further difficult to visualize with a linear time-line. Specifically, the AES and RSA algorithms will be studied with respect to these ideas, looking at success rates for various calculation batches with speculative execution, while having a summary view to see the rather severe performance penalties for using such methods. Either approaches could provide for strong white-box cryptography when considering a binary, non-source code form. In terms of white-box methods, both could be significantly challenging to locate or deduce the inner workings of the code. Further, both methods can easily surreptitiously leak or hide data within shared memory in a seemingly innocuous manner.

APA, Harvard, Vancouver, ISO, and other styles

We offer discounts on all premium plans for authors whose works are included in thematic literature selections. Contact us to get a unique promo code!