Academic literature on the topic 'Bank Level Parallelism (BLP)'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Bank Level Parallelism (BLP).'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Bank Level Parallelism (BLP)"

1. Shin, Wongyu, Jaemin Jang, Jungwhan Choi, Jinwoong Suh, and Lee-Sup Kim. "Bank-Group Level Parallelism." IEEE Transactions on Computers 66, no. 8 (August 1, 2017): 1428–34. http://dx.doi.org/10.1109/tc.2017.2665475.

2. Xue, Dongliang, Linpeng Huang, and Chentao Wu. "A pure hardware-driven scheduler for enhancing bank-level parallelism in a persistent memory controller." Future Generation Computer Systems 107 (June 2020): 383–93. http://dx.doi.org/10.1016/j.future.2020.01.047.

3. Najoui, Mohamed, Mounir Bahtat, Anas Hatim, Said Belkouch, and Noureddine Chabini. "VLIW DSP-Based Low-Level Instruction Scheme of Givens QR Decomposition for Real-Time Processing." Journal of Circuits, Systems and Computers 26, no. 9 (April 24, 2017): 1750129. http://dx.doi.org/10.1142/s0218126617501298.

Abstract:
QR decomposition (QRD) is one of the most widely used numerical linear algebra (NLA) kernels in several signal processing applications. Its implementation has a considerable impact on system performance. As processor architectures continue to gain ground in the high-performance computing world, QRD algorithms have to be redesigned in order to take advantage of the architectural features of these new processors. However, in some processor architectures like very long instruction word (VLIW), compiler efficiency is not enough to make effective use of available computational resources. This paper presents an efficient and optimized approach to implement Givens QRD on a low-power platform based on VLIW architecture. To overcome the compiler's limits in parallelizing most of the Givens arithmetic operations, we propose a low-level instruction scheme that maximizes the parallelism rate and minimizes clock cycles. The key contributions of this work are as follows: (i) a new parallel and fast design of the Givens algorithm based on the VLIW features (i.e., instruction-level parallelism (ILP) and data-level parallelism (DLP)), including the cache memory properties; (ii) an efficient data management approach to avoid cache misses and memory bank conflicts. Two DSP platforms, C6678 and AK2H12, were used as implementation targets. The introduced parallel QR implementation method achieves, on average, more than 12× and 6× speedups over the standard algorithm version and the optimized QR routine implementations, respectively. Compared to the state of the art, the proposed scheme implementation is at least 3.65 and 2.5 times faster than the recent CPU and DSP implementations, respectively.
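
The bank-conflict issue mentioned in this abstract is easy to see with a small sketch. The snippet below is illustrative only and is not taken from the paper: it assumes a hypothetical memory with NUM_BANKS word-interleaved banks and shows how padding the row stride of a row-major matrix spreads a column walk across banks instead of serializing it on one bank.

```python
# Illustrative sketch (not from the paper): pad a row-major matrix so that
# column-wise accesses spread across memory banks instead of hitting one bank.
# Assumes a hypothetical memory with NUM_BANKS banks interleaved per word.

NUM_BANKS = 8

def bank_of(word_address: int) -> int:
    """Low-order interleaving: consecutive words map to consecutive banks."""
    return word_address % NUM_BANKS

def column_banks(rows: int, row_stride: int, col: int) -> list[int]:
    """Banks touched when walking down one column of a row-major matrix."""
    return [bank_of(r * row_stride + col) for r in range(rows)]

if __name__ == "__main__":
    rows, cols = 8, 8
    # Stride 8 is a multiple of NUM_BANKS: the whole column lands in one bank.
    print("unpadded:", column_banks(rows, row_stride=cols, col=0))
    # Stride 9 is co-prime with NUM_BANKS: the column spreads over all banks.
    print("padded:  ", column_banks(rows, row_stride=cols + 1, col=0))
```
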
4. Khadirsharbiyani, Soheil, Jagadish Kotra, Karthik Rao, and Mahmut Taylan Kandemir. "Data Convection." ACM SIGMETRICS Performance Evaluation Review 50, no. 1 (June 20, 2022): 37–38. http://dx.doi.org/10.1145/3547353.3522647.

Abstract:
Stacked DRAMs have been studied and productized in the last decade. The large available bandwidth they offer makes them an attractive choice, particularly in high-performance computing (HPC) environments. Consequently, many prior research efforts have studied and evaluated 3D stacked DRAM-based designs. Despite offering high bandwidth, stacked DRAMs are severely constrained by the overall memory capacity offered. In this paper, we study and evaluate integrating stacked DRAM on top of a GPU in a 3D manner, which, in tandem with the 2.5D stacked DRAM, boosts the capacity and the bandwidth without increasing the package size. It also helps meet the capacity needs of emerging workloads like deep learning. However, the bandwidth given by these 3D stacked DRAMs is significantly constrained by the GPU's heat production. Our investigations on a cycle-level simulator show that the 3D stacked DRAM portions closest to the GPU have shorter retention times than the layers further away. Depending on the retention period, certain regions of 3D stacked DRAM are refreshed more frequently than others, leading to thermally-induced NUMA paradigms. Our proposed approach attempts to place the most frequently requested data in a thermally conscious manner, taking into consideration both bank-level parallelism and channel-level parallelism. The results collected with a cycle-level GPU simulator indicate that the three implementations of our proposed approach lead to 1.8%, 11.7%, and 14.4% performance improvements over a baseline that already includes 3D+2.5D stacked DRAMs.
5. Ma, Jianliang, Jinglei Meng, Tianzhou Chen, and Minghui Wu. "CaLRS: A Critical-Aware Shared LLC Request Scheduling Algorithm on GPGPU." Scientific World Journal 2015 (2015): 1–10. http://dx.doi.org/10.1155/2015/848416.

Abstract:
Ultra-high thread-level parallelism in modern GPUs usually introduces numerous memory requests simultaneously, so there are always plenty of memory requests waiting at each bank of the shared LLC (L2 in this paper) and global memory. For global memory, various schedulers have already been developed to adjust the request sequence, but we find that little work has focused on the service sequence at the shared LLC. We measured that many GPU applications always have requests queuing at the LLC banks for service, which provides an opportunity to optimize the service order at the LLC. By adjusting the GPU memory request service order, we can improve the schedulability of the SMs. We therefore propose a critical-aware shared LLC request scheduling algorithm (CaLRS) in this paper. The priority representation of a memory request is central to CaLRS: we use the number of memory requests that originate from the same warp but have not yet been serviced when they arrive at the shared LLC bank to represent the criticality of each warp. Experiments show that the proposed scheme can boost SM schedulability effectively by promoting the scheduling priority of memory requests with high criticality, and thereby indirectly improves GPU performance.
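
As a rough illustration of the idea, the sketch below picks the next request to service at one LLC bank queue using the abstract's notion of criticality (the number of still-pending requests from the same warp). The data structures and the FCFS tie-break are assumptions made for the sake of the example, not the paper's exact mechanism.

```python
# Minimal sketch of a criticality-aware pick at one shared-LLC bank queue.
# Criticality here follows the abstract's wording: the count of pending
# requests from the same warp. Queue layout and tie-breaking are assumptions.
from collections import Counter
from dataclasses import dataclass

@dataclass
class LLCRequest:
    warp_id: int
    address: int
    arrival: int   # arrival order; used as an FCFS tie-breaker

def pick_next(queue: list[LLCRequest]) -> LLCRequest:
    """Pick the queued request whose warp has the most pending requests."""
    pending_per_warp = Counter(req.warp_id for req in queue)
    # Highest criticality first; earliest arrival breaks ties.
    return max(queue, key=lambda r: (pending_per_warp[r.warp_id], -r.arrival))

if __name__ == "__main__":
    q = [LLCRequest(warp_id=0, address=0x100, arrival=0),
         LLCRequest(warp_id=1, address=0x200, arrival=1),
         LLCRequest(warp_id=1, address=0x240, arrival=2),
         LLCRequest(warp_id=2, address=0x300, arrival=3)]
    nxt = pick_next(q)
    print(f"service warp {nxt.warp_id}, address {hex(nxt.address)}")  # warp 1 goes first
```
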
6. Gréwal, G., S. Coros, and M. Ventresca. "A Memetic Algorithm for Performing Memory Assignment in Dual-Bank DSPs." International Journal of Computational Intelligence and Applications 6, no. 4 (December 2006): 473–97. http://dx.doi.org/10.1142/s1469026806002039.

Abstract:
To increase memory bandwidth, many programmable Digital-Signal Processors (DSPs) employ two on-chip data memories. This architectural feature supports higher memory bandwidth by allowing multiple data memory accesses to occur in parallel. Exploiting dual memory banks, however, is a challenging problem for compilers. This, in part, is due to the instruction-level parallelism, small numbers of registers, and highly specialized register capabilities of most DSPs. In this paper, we present a new methodology based on a Memetic Algorithm (MA) for assigning data to dual-bank memories. Our approach is global, and integrates several important issues in memory assignment within a single model. Special effort is made to identify those data objects that could potentially benefit from an assignment to a specific memory, or perhaps duplication in both memories. Our computational results show that the MA is able to achieve a 54% reduction in the number of memory cycles and a reduction in the range of 7%–42% in the total number of cycles when tested with well-known DSP kernels and applications. Our computational results also show that, when compared with the Genetic Algorithm in Ref. 3, the memetic algorithm is able to find solutions that, on average, have 7%–20% less cost, with the biggest improvements being found for larger problem instances.
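
The cost being minimized can be illustrated with a toy model (an assumption made for exposition; the paper's memetic algorithm, duplication handling, and register constraints are far richer): two operands referenced by the same instruction can be fetched in parallel only if they are assigned to different banks.

```python
# Toy cost model only (not the paper's formulation): count memory cycles for a
# given assignment of data objects to the X or Y bank of a dual-bank DSP,
# assuming two operands of one instruction proceed in parallel only when they
# live in different banks.

def memory_cycles(assignment: dict[str, str], access_pairs: list[tuple[str, str]]) -> int:
    cycles = 0
    for a, b in access_pairs:
        # Parallel (1 cycle) if the operands sit in different banks,
        # serialized (2 cycles) otherwise.
        cycles += 1 if assignment[a] != assignment[b] else 2
    return cycles

if __name__ == "__main__":
    pairs = [("coef", "sample"), ("sample", "out"), ("coef", "out")]
    print(memory_cycles({"coef": "X", "sample": "X", "out": "X"}, pairs))  # 6: everything serialized
    print(memory_cycles({"coef": "X", "sample": "Y", "out": "Y"}, pairs))  # 4: two pairs run in parallel
```
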
7. Fang, Juan, Jiajia Lu, Mengxuan Wang, and Hui Zhao. "A Performance Conserving Approach for Reducing Memory Power Consumption in Multi-Core Systems." Journal of Circuits, Systems and Computers 28, no. 7 (June 27, 2019): 1950113. http://dx.doi.org/10.1142/s0218126619501135.

Abstract:
With more cores integrated into a single chip and the fast growth of main memory capacity, DRAM memory design faces ever-increasing challenges. Previous studies have shown that DRAM can consume up to 40% of the system power, which makes DRAM a major factor constraining the whole system's growth in performance. Moreover, memory accesses from different applications are usually interleaved and interfere with each other, which further exacerbates the situation in memory system management. Therefore, reducing memory power consumption has become an urgent problem to be solved in both academia and industry. In this paper, we first propose a novel strategy called Dynamic Bank Partitioning (DBP), which allocates banks to different applications based on their memory access characteristics. DBP not only effectively eliminates the interference among applications, but also fully takes advantage of bank-level parallelism. Secondly, to further reduce power consumption, we propose an adaptive method to dynamically select an optimal page policy for each bank according to the characteristics of the memory accesses that each bank receives. Our experimental results show that our strategy improves system performance while reducing memory power consumption. Our proposed scheme can reduce memory power consumption by up to 21.2% (10% on average across all workloads) and improve performance to some extent. When workloads are built with mixed applications, our scheme reduces power consumption by 14% on average and improves performance by up to 12.5% (3% on average).
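
A minimal sketch of the bank-partitioning idea, under assumed structures (the paper's DBP operates inside the memory system and also selects per-bank page policies, which is not modeled here): banks are divided among applications roughly in proportion to their memory intensity, and each application's pages are interleaved only over its own banks, removing inter-application bank conflicts while keeping bank-level parallelism within each partition.

```python
# Illustrative sketch, not the paper's implementation: split DRAM banks among
# co-running applications by measured memory intensity, then confine each
# application's pages to its own bank set so applications never contend for
# the same bank while each still sees several banks for bank-level parallelism.

NUM_BANKS = 16

def partition_banks(intensity: dict[str, int]) -> dict[str, list[int]]:
    """Assign contiguous bank ranges sized roughly by memory intensity.
    Assumes the rounded shares leave at least one bank for the last application."""
    total = sum(intensity.values())
    parts, next_bank = {}, 0
    apps = list(intensity)
    for i, app in enumerate(apps):
        if i == len(apps) - 1:
            count = NUM_BANKS - next_bank          # remainder goes to the last app
        else:
            count = max(1, round(NUM_BANKS * intensity[app] / total))
        parts[app] = list(range(next_bank, next_bank + count))
        next_bank += count
    return parts

def bank_of_page(app: str, page_number: int, parts: dict[str, list[int]]) -> int:
    """Interleave an application's pages across its own banks only."""
    banks = parts[app]
    return banks[page_number % len(banks)]

if __name__ == "__main__":
    parts = partition_banks({"streaming": 9, "pointer-chasing": 3})
    print(parts)
    print([bank_of_page("streaming", p, parts) for p in range(6)])
```
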
8. Fang, Juan, Mengxuan Wang, and Zelin Wei. "A memory scheduling strategy for eliminating memory access interference in heterogeneous system." Journal of Supercomputing 76, no. 4 (January 10, 2020): 3129–54. http://dx.doi.org/10.1007/s11227-019-03135-7.

Abstract:
Multiple CPUs and GPUs are integrated on the same chip to share memory, and access requests between cores interfere with each other. Memory requests from the GPU seriously interfere with CPU memory access performance. Requests from multiple CPUs are also intertwined when accessing memory, and their performance is greatly affected. The difference in access latency between GPU cores increases the average latency of memory accesses. In order to solve the problems encountered in the shared memory of heterogeneous multi-core systems, we propose a step-by-step memory scheduling strategy, which improves system performance. The step-by-step memory scheduling strategy first creates a new memory request queue based on the request source and isolates the CPU requests from the GPU requests when the memory controller receives a memory request, thereby preventing GPU requests from interfering with CPU requests. Then, for the CPU request queue, a dynamic bank partitioning strategy is implemented, which dynamically maps applications to different bank sets according to their memory characteristics and eliminates memory request interference among multiple CPU applications without affecting bank-level parallelism. Finally, for the GPU request queue, criticality is introduced to measure the difference in memory access latency between cores. Based on the first-ready first-come first-served strategy, we implement criticality-aware memory scheduling to balance the locality and criticality of application accesses.
9. Khadirsharbiyani, Soheil, Jagadish Kotra, Karthik Rao, and Mahmut Kandemir. "Data Convection." Proceedings of the ACM on Measurement and Analysis of Computing Systems 6, no. 1 (February 24, 2022): 1–25. http://dx.doi.org/10.1145/3508027.

Abstract:
Stacked DRAMs have been studied, evaluated in multiple scenarios, and even productized in the last decade. The large available bandwidth they offer makes them an attractive choice, particularly in high-performance computing (HPC) environments. Consequently, many prior research efforts have studied and evaluated 3D stacked DRAM-based designs. Despite offering high bandwidth, stacked DRAMs are severely constrained by the overall memory capacity offered. In this paper, we study and evaluate integrating stacked DRAM on top of a GPU in a 3D manner, which in tandem with the 2.5D stacked DRAM increases the capacity and the bandwidth without increasing the package size. This integration of 3D stacked DRAMs aids in satisfying the capacity requirements of emerging workloads like deep learning. Though this vertical 3D integration of stacked DRAMs also increases the total available bandwidth, we observe that the bandwidth offered by these 3D stacked DRAMs is severely limited by the heat generated on the GPU. Based on our experiments on a cycle-level simulator, we make a key observation that the sections of the 3D stacked DRAM that are closer to the GPU have lower retention times compared to the farther layers of stacked DRAM. These thermally-induced variable retention times cause certain sections of 3D stacked DRAM to be refreshed more frequently than others, thereby resulting in thermally-induced NUMA paradigms. To alleviate such thermally-induced NUMA behavior, we propose and experimentally evaluate three different incarnations of Data Convection, i.e., Intra-layer, Inter-layer, and Intra + Inter-layer, that aim at placing the most frequently accessed data in a thermally-induced retention-aware fashion, taking into account both bank-level and channel-level parallelism. Our evaluations on a cycle-level GPU simulator indicate that, in a multi-application scenario, our Intra-layer, Inter-layer, and Intra + Inter-layer algorithms improve the overall performance by 1.8%, 11.7%, and 14.4%, respectively, over a baseline that already encompasses 3D+2.5D stacked DRAMs.
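
A highly simplified sketch of one plausible placement policy consistent with the abstract (the layer geometry, frame counts, and hottest-first ordering below are assumptions; the paper's Intra-layer, Inter-layer, and Intra + Inter-layer algorithms are more sophisticated): the most frequently accessed pages are steered toward the cooler, longer-retention layers while being striped across channels and banks so that channel-level and bank-level parallelism are preserved.

```python
# Illustrative only: LAYERS, CHANNELS, BANKS and the one-frame-per-bank layout
# are assumptions, not the simulated GPU configuration from the paper.
LAYERS = 4                     # layer 0 sits closest to the GPU (hottest, shortest retention)
CHANNELS, BANKS = 2, 4
FRAMES_PER_LAYER = CHANNELS * BANKS

def place(pages_by_heat: list[str]) -> dict[str, tuple[int, int, int]]:
    """Map pages, hottest first, to (layer, channel, bank) frames."""
    assert len(pages_by_heat) <= LAYERS * FRAMES_PER_LAYER, "toy capacity exceeded"
    placement = {}
    for i, page in enumerate(pages_by_heat):
        layer = LAYERS - 1 - (i // FRAMES_PER_LAYER)  # fill the farthest (coolest) layer first
        channel = i % CHANNELS                        # alternate channels for consecutive hot pages
        bank = (i // CHANNELS) % BANKS                # then rotate over banks
        placement[page] = (layer, channel, bank)
    return placement

if __name__ == "__main__":
    pages = [f"page{i}" for i in range(10)]           # assume already sorted hottest -> coldest
    for page, (layer, channel, bank) in place(pages).items():
        print(f"{page} -> layer {layer}, channel {channel}, bank {bank}")
```
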
10. Liao, Xiaofei, Zhan Zhang, Haikun Liu, and Hai Jin. "Improving Bank-level Parallelism for In-Memory Checkpointing in Hybrid Memory Systems." IEEE Transactions on Big Data, 2018, 1. http://dx.doi.org/10.1109/tbdata.2018.2865964.


Dissertations / Theses on the topic "Bank Level Parallelism (BLP)"

1. Patil, Adarsh. "Heterogeneity Aware Shared DRAM Cache for Integrated Heterogeneous Architectures." Thesis, 2017. http://etd.iisc.ac.in/handle/2005/4124.

Abstract:
Integrated Heterogeneous System (IHS) processors pack throughput-oriented GPGPUs alongside latency-oriented CPUs on the same die, sharing certain resources, e.g., the shared last-level cache, network-on-chip (NoC), and the main memory. They also share virtual and physical address spaces and unify the memory hierarchy. The IHS architecture allows for easier programmability, data management and efficiency. However, the significant disparity in the demands for memory and other shared resources between the GPU cores and CPU cores poses significant problems in exploiting the full potential of this architecture. In this work, we propose adding a large-capacity stacked DRAM, used as a shared last-level cache, for the IHS processors. The reduced access latency and large bandwidth provided by the DRAM cache can help improve the performance of the CPU and GPGPU, respectively, while the large capacity can help contain the working set of the IHS workloads. However, adding the DRAM cache naively leaves significant performance on the table due to the disparate demands from CPU and GPU cores for DRAM cache and memory accesses. In particular, the imbalance can significantly reduce the performance benefits that the CPU cores would have otherwise enjoyed with the introduction of the DRAM cache. This necessitates a heterogeneity-aware management of this shared resource for improved performance. To address this, in this thesis, we propose three simple techniques to enhance the performance of CPU applications while ensuring very little or no performance impact to the GPU. Specifically, we propose (i) PrIS, a prioritization scheme for scheduling CPU requests at the DRAM cache controller, (ii) ByE, a selective and temporal bypassing scheme for CPU requests at the DRAM cache, and (iii) Chaining, an occupancy-controlling mechanism for GPU lines in the DRAM cache through pseudo-associativity. The resulting cache, HAShCache, is heterogeneity-aware and can adapt dynamically to address the inherent disparity of demands in an IHS architecture with simple, lightweight schemes. We enhance the gem5-gpu simulator to model an IHS architecture with stacked DRAM as a cache, a coherent GPU L2 cache, CPU caches, and a shared unified physical memory. Using this setup, we perform a detailed experimental evaluation of the proposed HAShCache and demonstrate an average system performance (combined performance of the CPU and GPU cores) improvement of 41% over a naive DRAM cache and over 100% improvement over a baseline system with no stacked DRAM cache.
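
The flavor of the prioritization idea can be conveyed with a toy software model; the age cap, queue layout, and starvation handling below are assumptions for illustration rather than the thesis's actual hardware design of PrIS: latency-sensitive CPU requests are picked ahead of queued GPU requests, with an age threshold preventing GPU requests from starving.

```python
# Toy sketch in the spirit of a CPU-prioritizing DRAM-cache scheduler: CPU
# requests are serviced before GPU requests; an age cap (an assumption here)
# keeps GPU requests from starving.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    source: str      # "cpu" or "gpu"
    address: int
    age: int = 0     # cycles spent waiting

GPU_AGE_CAP = 100    # illustrative anti-starvation threshold (assumption)

def pick_next(queue: deque[Request]) -> Request:
    for req in queue:
        req.age += 1
    # A GPU request that has waited too long goes first (avoid starvation).
    stale = [r for r in queue if r.source == "gpu" and r.age >= GPU_AGE_CAP]
    if stale:
        chosen = stale[0]
    else:
        cpu = [r for r in queue if r.source == "cpu"]
        chosen = cpu[0] if cpu else queue[0]
    queue.remove(chosen)
    return chosen

if __name__ == "__main__":
    q = deque([Request("gpu", 0x10), Request("cpu", 0x20), Request("gpu", 0x30)])
    while q:
        r = pick_next(q)
        print(r.source, hex(r.address))   # cpu first, then the two gpu requests
```
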
2. Wang, Shao-Fu (王少甫). "Exploiting Bank-level Parallelism via Data Consistency Relaxation for Non-volatile Memory System." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/55825295588163105110.

Abstract:
Master's thesis, National Taiwan University, Graduate Institute of Computer Science and Information Engineering, academic year 103.
The maturity of emerging non-volatile memory (NVM) technologies points to promising next-generation memory system designs. Because of its mixed characteristics between DRAM and persistent storage, e.g., high density, byte-addressability, and non-volatility, architects are rethinking the design of the traditional memory hierarchy. With NVM as main memory, programmers can place non-volatile data structures in main memory and access them directly with ld/st instructions. Non-volatile data structures demand consistency and atomicity guarantees in case of a sudden system crash. To guarantee consistency and atomicity, some form of write-ahead logging (WAL) semantics is needed. Because modern memory controllers reorder writes to exploit bank-level parallelism, persist barriers are adopted by many existing works to guarantee the order between writes. However, we observe that persist barriers introduce unnecessary write-ordering constraints and hurt system performance by restricting the memory controller from exploiting bank-level parallelism. In this thesis, we propose the Semantics-aware Memory Scheduler. By using a new software/hardware interface to convey knowledge of an application's logging semantics to the memory controller, the Semantics-aware Memory Scheduler eliminates unnecessary write-ordering constraints by differentiating between log writes and target data writes. By allowing more concurrent memory writes, the memory controller can deliver higher performance by maximizing bank-level parallelism. Experimental results from full-system simulation show that the Semantics-aware Memory Scheduler can improve throughput by up to 2.89x (2.13x on average).
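
The write-ordering relaxation can be sketched in a few lines (the transaction tagging and issue check below are illustrative assumptions; the thesis realizes this through a software/hardware interface inside the memory controller): a target-data write is held only until its own transaction's log writes have completed, so unrelated writes remain free to issue to different banks concurrently.

```python
# Minimal sketch of the ordering idea: within a transaction, target-data
# writes wait for that transaction's log writes, but independent writes can
# still be issued to different banks to preserve bank-level parallelism.
from dataclasses import dataclass

@dataclass
class Write:
    txn: int
    kind: str        # "log" or "data"
    bank: int
    done: bool = False

def issuable(w: Write, pending: list[Write]) -> bool:
    if w.kind == "log":
        return True
    # A data write waits only for its own transaction's unfinished log writes,
    # not for unrelated writes, so other banks stay busy.
    return not any(p.txn == w.txn and p.kind == "log" and not p.done
                   for p in pending)

if __name__ == "__main__":
    pending = [Write(1, "log", bank=0), Write(1, "data", bank=1),
               Write(2, "log", bank=2), Write(2, "data", bank=3)]
    print([(w.txn, w.kind) for w in pending if issuable(w, pending)])
    pending[0].done = True   # transaction 1's log write has reached NVM
    print([(w.txn, w.kind) for w in pending if issuable(w, pending)])
```
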

Conference papers on the topic "Bank Level Parallelism (BLP)"

1. Malik, Kshitiz, Mayank Agarwal, Sam S. Stone, Kevin M. Woley, and Matthew I. Frank. "Branch-mispredict level parallelism (BLP) for control independence." In 2008 IEEE 14th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2008. http://dx.doi.org/10.1109/hpca.2008.4658628.

2. Tang, Xulong, Mahmut Kandemir, Praveen Yedlapalli, and Jagadish Kotra. "Improving bank-level parallelism for irregular applications." In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016. http://dx.doi.org/10.1109/micro.2016.7783760.

3. Ding, Wei, Diana Guttman, and Mahmut Kandemir. "Compiler Support for Optimizing Memory Bank-Level Parallelism." In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2014. http://dx.doi.org/10.1109/micro.2014.34.

4. Lee, Chang Joo, Veynu Narasiman, Onur Mutlu, and Yale N. Patt. "Improving memory bank-level parallelism in the presence of prefetching." In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 42). New York, NY, USA: ACM Press, 2009. http://dx.doi.org/10.1145/1669112.1669155.

5. Poremba, Matthew, Tao Zhang, and Yuan Xie. "Fine-granularity tile-level parallelism in non-volatile memory architecture with two-dimensional bank subdivision." In DAC '16: The 53rd Annual Design Automation Conference 2016. New York, NY, USA: ACM, 2016. http://dx.doi.org/10.1145/2897937.2898024.

6. Kwon, Young-Cheon, Suk Han Lee, Jaehoon Lee, Sang-Hyuk Kwon, Je Min Ryu, Jong-Pil Son, O. Seongil, et al. "25.4 A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications." In 2021 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2021. http://dx.doi.org/10.1109/isscc42613.2021.9365862.
