Dissertations / Theses on the topic 'Shared-Memory Machines'

Consult the top 15 dissertations / theses for your research on the topic 'Shared-Memory Machines.'

1

Roberts, Harriet. "Preconditioned iterative methods on virtual shared memory machines." Thesis, Virginia Tech, 1994. http://scholar.lib.vt.edu/theses/available/etd-07292009-090522/.

2

Younge, Andrew J., Christopher Reidy, Robert Henschel, and Geoffrey C. Fox. "Evaluation of SMP Shared Memory Machines for Use with In-Memory and OpenMP Big Data Applications." IEEE, 2016. http://hdl.handle.net/10150/622702.

Abstract:
While distributed memory systems have shaped the field of distributed systems for decades, the demand for many-core shared memory resources is increasing. Symmetric Multiprocessor Systems (SMPs) have recently become increasingly important across a wide array of disciplines, ranging from bioinformatics to astrophysics and beyond. With the rise of big data computing, traditional commodity server systems are often outpaced in size and scope. While some big data applications can be mapped to the distributed memory systems found in many cluster and cloud technologies today, this effort represents a large barrier to entry that some projects cannot cross. Shared memory SMP systems look to fill this niche within distributed systems effectively and efficiently by providing high throughput and performance with minimal development effort, as the computing environment is often one many researchers are already familiar with. In this paper, we look at the use of two common shared memory systems: the ScaleMP vSMP virtualized SMP deployment at Indiana University, and the SGI UV architecture deployed at the University of Arizona. While the two systems are notably different in their design, their potential impact on computing is remarkably similar. As such, we first compare each system under a set of OpenMP threaded benchmarks from the SPEC group, and follow up with our experience using each machine for Trinity de novo assembly. We find both SMP systems are well suited to support various big data applications, with the newer vSMP deployment often slightly faster; however, certain caveats and performance considerations are necessary when evaluating such SMP systems.
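As an illustration of the kind of OpenMP-threaded, memory-bound workload such SMP evaluations exercise, here is a generic sketch (not code from the paper): a large in-memory array reduction whose runtime is dominated by memory bandwidth and NUMA placement, exactly the factors the paper compares across the two machines.

```c
/* Generic sketch of an OpenMP memory-bound kernel of the kind SPEC-style
   SMP evaluations time; illustrative only, not code from this paper. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long n = 1L << 26;                  /* large in-memory working set */
    double *a = malloc(n * sizeof *a);
    double sum = 0.0;
    for (long i = 0; i < n; i++) a[i] = 1.0 / (double)(i + 1);

    double t = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (long i = 0; i < n; i++)
        sum += a[i] * a[i];                   /* bandwidth- and NUMA-bound */
    t = omp_get_wtime() - t;

    printf("threads=%d sum=%.6f time=%.3fs\n", omp_get_max_threads(), sum, t);
    free(a);
    return 0;
}
```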
3

Hines, Michael R. "Techniques for collective physical memory ubiquity within networked clusters of virtual machines." Diss., Online access via UMI, 2009.

4

Huang, Wei. "High Performance Network I/O in Virtual Machines over Modern Interconnects." The Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=osu1218602792.

5

Melo, Alba Cristina M. A. "Conception d'un système supportant des modèles de cohérence multiples pour les machines parallèles à mémoire virtuelle partagée." Grenoble INPG, 1996. http://www.theses.fr/1996INPG0108.

Abstract:
Shared-variable programming is made possible on parallel architectures without common memory by a software layer that simulates physically shared memory. Maintaining the perfect abstraction of a single memory requires a large number of coherence operations and therefore incurs a significant performance penalty. To mitigate this degradation, several systems use more relaxed memory consistency models, which allow greater concurrency between accesses but complicate the programming model. The choice of a consistency model is thus a trade-off between performance and simplicity of programming; both factors depend on users' expectations and on the data-access characteristics of each parallel application. This thesis presents DIVA, a shared virtual memory system that supports multiple memory consistency models. With DIVA, users can choose the shared-memory semantics best suited to the correct and efficient execution of their application, and may also define their own consistency models. Supporting multiple models inside DIVA guided the design of several other mechanisms; we propose a unified synchronization interface, along with page-replacement and prefetching mechanisms adapted to a multi-model environment. A DIVA prototype was implemented on the Intel Paragon parallel machine. The analysis of an application running under different consistency models shows that the choice of consistency model directly affects application performance.
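The abstract does not show DIVA's actual interface; as a purely hypothetical sketch of the idea, letting the programmer pick a consistency model per shared region, such an API might look like this (all names below are invented for illustration):

```c
/* Hypothetical sketch only: the abstract does not show DIVA's real API.
   The idea is to let the programmer choose the memory consistency model
   per shared region; every name here is invented for illustration. */
#include <stddef.h>

typedef enum {
    SVM_SEQUENTIAL,     /* strongest: single-memory illusion, most traffic */
    SVM_RELEASE,        /* relaxed: coherence actions only at sync points  */
    SVM_USER_DEFINED    /* the user supplies their own coherence protocol  */
} svm_model_t;

void *svm_alloc(size_t bytes, svm_model_t model);  /* create a shared region   */
void  svm_acquire(void *region);  /* sync entry: pull other nodes' updates     */
void  svm_release(void *region);  /* sync exit: make local updates visible     */
```

Under release consistency, coherence traffic is needed only at acquire/release rather than at every store, which is exactly the performance-versus-programming-simplicity trade-off the thesis describes.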
6

Moreaud, Stéphanie. "Mouvement de données et placement des tâches pour les communications haute performance sur machines hiérarchiques." PhD thesis, Université Sciences et Technologies - Bordeaux I, 2011. http://tel.archives-ouvertes.fr/tel-00635651.

Abstract:
Computer architectures are increasingly complex and hierarchical, with multicore processors, distributed memory banks, and multiple I/O buses. In high-performance computing, the efficiency of parallel applications depends on the cost of communication between the participating tasks, which is affected by the organization of resources, in particular by NUMA and cache effects. This thesis studies and optimizes high-performance communication on modern hierarchical architectures. It first evaluates the impact of the hardware topology on the performance of data movement, both inside compute nodes and across fast networks, for different transfer strategies, hardware types, and platforms. To improve performance and portability, we then propose making communication libraries aware of the affinities between communications and the hardware. This work revolves around adapting task placement to the transfer patterns and machine topology, or conversely adapting data-movement strategies to a given task distribution. Integrated into the main MPI libraries, it significantly reduces communication costs and thereby improves application performance. The results demonstrate the need to take the hardware characteristics of modern machines into account in order to exploit them fully.
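A minimal sketch of the kind of topology-aware placement the thesis argues for, using the hwloc topology library (error handling omitted; the binding policy shown is illustrative, not the thesis's own):

```c
/* Minimal sketch of topology-aware thread binding with hwloc; the policy
   (bind near a chosen NUMA node) is illustrative, not the thesis's own. */
#include <hwloc.h>
#include <stdio.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Pick a NUMA node and bind the current thread to its cores, e.g. to
       sit close to a particular memory bank or I/O device. */
    int n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
    hwloc_obj_t node = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, n - 1);
    hwloc_set_cpubind(topo, node->cpuset, HWLOC_CPUBIND_THREAD);

    printf("bound to NUMA node %d of %d\n", node->logical_index, n);
    hwloc_topology_destroy(topo);
    return 0;
}
```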
7

Wen, Yuzhong. "Replication of Concurrent Applications in a Shared Memory Multikernel." Thesis, Virginia Tech, 2016. http://hdl.handle.net/10919/71813.

Abstract:
State Machine Replication (SMR) has become the de facto methodology for building replication-based fault-tolerant systems. Current SMR systems usually involve multiple machines, each acting as a replica of the others. However, multiple machines add infrastructure cost, in both hardware and power consumption. For tolerating non-critical CPU and memory failures that do not crash the entire machine, there is no need for extra machines, so intra-machine replication is a good fit for this scenario. However, current intra-machine replication approaches do not provide strong isolation among the replicas, which allows faults to propagate from one replica to another. To provide intra-machine replication with strong isolation, this thesis presents an SMR system on a multikernel OS. We implemented a replication system capable of replicating concurrent applications on different kernel instances of a multikernel OS. Modern concurrent applications can be deployed on our system with minimal code modification. Additionally, our system provides two replication modes that the user can switch between freely according to the application type. Evaluating multiple real-world applications, we show that they can be deployed on our system with 0 to 60 lines of changes to the source code. From the performance perspective, our system introduces only 0.23% to 63.39% overhead compared to non-replicated execution.
Master of Science
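The core SMR invariant the abstract relies on can be shown as a toy sketch (not the thesis's system): replicas that apply the same deterministic operations in the same order necessarily reach the same state.

```c
/* Toy sketch of the State Machine Replication invariant (not the thesis's
   system): replicas applying the same log in the same order stay identical. */
#include <stdio.h>

typedef struct { char op; int arg; } entry_t;

/* A deterministic state machine: same log in, same state out. */
static int apply(int state, entry_t e) {
    return e.op == '+' ? state + e.arg :
           e.op == '*' ? state * e.arg : state;
}

int main(void) {
    entry_t log[] = { {'+', 3}, {'*', 4}, {'+', 5} };  /* agreed-upon order */
    int replica_a = 0, replica_b = 0;  /* two replicas, e.g. on two kernels */
    for (int i = 0; i < 3; i++) {
        replica_a = apply(replica_a, log[i]);
        replica_b = apply(replica_b, log[i]);
    }
    printf("a=%d b=%d\n", replica_a, replica_b);  /* both print 17 */
    return 0;
}
```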
8

Lam, King-tin, and 林擎天. "Efficient shared object space support for distributed Java virtual machine." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2012. http://hub.hku.hk/bib/B47752877.

Abstract:
Given the popularity of Java, extending the standard Java virtual machine (JVM) to become cluster-aware effectively brings the vision of transparent horizontal scaling of applications to fruition. With a set of cluster-wide JVMs orchestrated as a virtually single system, thread-level parallelism in Java is no longer confined to one multiprocessor. An unmodified multithreaded Java application running on such a Distributed JVM (DJVM) can scale out transparently, tapping into the vast computing power of the cluster. While this notion creates an easy-to-use and powerful parallel programming paradigm, research on DJVMs has remained largely at the proof-of-concept stage where successes were proven using trivial scientific computing workloads only. Real-life Java applications with commercial server workloads have not been well-studied on DJVMs. Their natures including complex and sometimes huge object graphs, irregular access patterns and frequent synchronizations are key scalability hurdles. To design a scalable DJVM for real-life applications, we identify three major unsolved issues calling for a top-to-bottom overhaul of traditional systems. First, we need a more time- and space-efficient cache coherence protocol to support fine-grained object sharing over the distributed shared heap. The recent prevalence of concurrent data structures with heavy use of volatile fields has added complications to the matter. Second, previous generations of DJVMs lack true support for memory-intensive applications. While the network-wide aggregated physical memory can be huge, mutual sharing of huge object graphs like Java collections may cause nodes to eventually run out of local heap space because the cached copies of remote objects, linked by active references, can’t be arbitrarily discarded. Third, thread affinity, which determines the overall communication cost, is vital to the DJVM performance. Data access locality can be improved by collocating highly-correlated threads, via dynamic thread migration. Tracking inter-thread correlations trades profiling costs for reduced object misses. Unfortunately, profiling techniques like active correlation tracking used in page-based DSMs would entail prohibitively high overheads and low accuracy when ported to fine-grained object-based DJVMs. This dissertation presents technical contributions towards all these problems. We use a dual-protocol approach to address the first problem. Synchronized (lock-based) and volatile accesses are handled by a home-based lazy release consistency (HLRC) protocol and a sequential consistency (SC) protocol respectively. The two protocols’ metadata are maintained in a conflict-free, memory-efficient manner. With further techniques like hierarchical passing of lock ownerships, the overall communication overheads of fine-grained distributed object sharing are pruned to a minimal level. For the second problem, we develop a novel uncaching mechanism to safely break a huge active object graph. When a JVM instance runs low on free memory, it initiates an uncaching policy, which eagerly assigns nulls to selected reference fields, thus detaching some older or less useful cached objects from the root set for reclamation. Careful orchestration is made between uncaching, local garbage collection and the coherence protocol to avoid possible data races. Lastly, we devise lightweight sampling-based profiling methods to derive inter-thread correlations, and a profile-guided thread migration policy to boost the system performance. 
Extensive experiments have demonstrated the effectiveness of all our solutions.
Doctor of Philosophy (Computer Science)
9

Fross, Bradley K. "Splash-2 shared-memory architecture for supporting high level language compilers." Thesis, Virginia Tech, 1995. http://hdl.handle.net/10919/42064.

Abstract:
Modern computer technology has been evolving for nearly fifty years, and has seen many architectural innovations along the way. One of the latest technologies to come about is the reconfigurable processor-based custom computing machine (CCM). CCMs use field-programmable gate arrays (FPGAs) as their processing cores, giving them the flexibility of software systems with performance comparable to that of dedicated custom hardware. Hardware description languages are currently used to program CCMs. However, research is being performed to investigate the use of high-level languages (HLLs), such as the C programming language, to create CCM programs. Many aspects of CCM architectures, such as local memory systems, are not conducive to HLL compiler usage. This thesis proposes and evaluates the use of a shared-memory architecture on a Splash-2 CCM to promote the development and usage of HLL compilers for CCM systems.
Master of Science
10

Lee, Dong Ryeol. "A distributed kernel summation framework for machine learning and scientific applications." Diss., Georgia Institute of Technology, 2012. http://hdl.handle.net/1853/44727.

Abstract:
The class of computational problems considered in this thesis shares the common trait of requiring consideration of pairs (or higher-order tuples) of data points. I focus on the kernel summation operations ubiquitous in many data mining and scientific algorithms. In machine learning, kernel summations appear in popular kernel methods which can model nonlinear structures in data. Kernel methods include many non-parametric methods such as kernel density estimation, kernel regression, Gaussian process regression, kernel PCA, and kernel support vector machines (SVM). In computational physics, kernel summations occur inside the classical N-body problem for simulating the positions of a set of celestial bodies or atoms. This thesis attempts to marry, for the first time, the best relevant techniques in parallel computing, where kernel summations arise in low dimensions, with the best general-dimension algorithms from the machine learning literature. We provide a unified, efficient parallel kernel summation framework that can utilize: (1) various types of deterministic and probabilistic approximations that may be suitable for both low and high-dimensional problems with a large number of data points; (2) indexing the data using any multi-dimensional binary tree with both distributed memory (MPI) and shared memory (OpenMP/Intel TBB) parallelism; (3) a dynamic load balancing scheme to adjust work imbalances during the computation. I first summarize my previous research in serial kernel summation algorithms, which started from Greengard and Rokhlin's earlier work on fast multipole methods for approximating potential sums of many particles. The contributions of this part of the thesis include: (1) a reinterpretation of Greengard and Rokhlin's work for the computer science community; (2) the extension of the algorithms to use a larger class of approximation strategies, i.e., probabilistic error bounds via Monte Carlo techniques; (3) the multibody series expansion, a generalization of the theory of fast multipole methods to handle interactions of more than two entities; (4) the first O(N) proof of batch approximate kernel summation using a notion of intrinsic dimensionality. I then move on to the parallelization of kernel summations and to scaling two other kernel methods, Gaussian process regression (kernel matrix inversion) and kernel PCA (kernel matrix eigendecomposition). The software artifact of this thesis has contributed to an open-source machine learning package called MLPACK, first demonstrated at NIPS 2008 and subsequently at the NIPS 2011 Big Learning Workshop. Completing a portion of this thesis involved the use of high performance computing resources at XSEDE (eXtreme Science and Engineering Discovery Environment) and NERSC (National Energy Research Scientific Computing Center).
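Concretely, the kernel summation at the heart of this work computes, for each query point (a standard formulation, not copied from the thesis),

$$ f(x_j) \;=\; \sum_{i=1}^{N} K(x_j, x_i), \qquad \text{e.g. } K(x, y) = \exp\!\left(-\frac{\lVert x - y \rVert^2}{2h^2}\right), $$

which costs O(N^2) if done naively over N points; the tree-based and Monte Carlo approximations summarized above are what bring this toward the O(N) bound the thesis proves.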
11

Huster, Carl R. "A parallel/vector Monte Carlo MESFET model for shared memory machines." Thesis, 1992. http://hdl.handle.net/1957/37306.

Abstract:
The parallelization and vectorization of Monte Carlo algorithms for modelling charge transport in semiconductor devices are considered. The standard ensemble Monte Carlo simulation of a three parabolic band model for GaAs is first presented as partial verification of the simulation. The model includes scattering due to acoustic, polar-optical and intervalley phonons. This ensemble simulation is extended to a full device simulation by the addition of real-space positions and solution for the electrostatic potential from the charge density distribution using Poisson's equation. Poisson's equation was solved using the cloud-in-cell scheme for charge assignment, finite differences for spatial discretization, and simultaneous over-relaxation for solution. The particle movement (acceleration and scattering) and the solution of Poisson's equation are both separately parallelized. The parallelization techniques used in both parts are based on the use of semaphores for the protection of shared resources and processor synchronization. Speed increase results for parallelization with and without vectorization on the Ardent Titan II are presented. The results show saturation due to memory access limitations at a speed increase of approximately 3.3 times the serial case when four processors are used. Vectorization alone provides a speed increase of approximately 1.6 times when compared with the non-vectorized serial case. It is concluded that the speed increase achieved with the Titan II is limited by memory access considerations and that this limitation is likely to plague shared memory machines for the foreseeable future. For the program presented here, vectorization is concluded to provide a better speed increase per day of development time than parallelization. However, when vectorization is used in conjunction with parallelization, the speed increase due to vectorization is negligible.
Graduation date: 1993
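An illustrative sketch (not the thesis code) of the semaphore-based protection of shared resources described above, here guarding a shared charge-density array during particle updates; a single coarse lock also hints at why speedup saturates under contention:

```c
/* Illustrative sketch, not the thesis code: semaphore-protected updates to
   a shared charge-density array, in the spirit of the parallelization
   described above. The particle "move" is a random stand-in. */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>

#define CELLS   64
#define NTHREAD 4
#define MOVES   100000

static double charge[CELLS];
static sem_t charge_lock;                 /* protects the shared array */

static void *move_particles(void *arg) {
    unsigned seed = (unsigned)(size_t)arg;
    for (int p = 0; p < MOVES; p++) {
        int cell = rand_r(&seed) % CELLS; /* stand-in for drift + scattering */
        sem_wait(&charge_lock);           /* enter critical section */
        charge[cell] += 1.0;              /* crude cloud-in-cell assignment */
        sem_post(&charge_lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREAD];
    sem_init(&charge_lock, 0, 1);         /* binary semaphore */
    for (size_t i = 0; i < NTHREAD; i++)
        pthread_create(&t[i], NULL, move_particles, (void *)i);
    for (int i = 0; i < NTHREAD; i++) pthread_join(t[i], NULL);

    double total = 0.0;
    for (int i = 0; i < CELLS; i++) total += charge[i];
    printf("total assigned charge: %.0f\n", total);  /* NTHREAD * MOVES */
    return 0;
}
```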
12

Hsu, Po-Hsueh, and 許博學. "Run-Time Parallelization Techniques for Irregular Scientific Computations on Shared-Memory Machines." Thesis, 1999. http://ndltd.ncl.edu.tw/handle/30203154122182106494.

Abstract:
Doctoral dissertation, National Sun Yat-sen University, Department of Electrical Engineering, ROC academic year 87 (1998-99)
High-performance computing power is crucial for advanced calculations in scientific applications. A multiprocessor system derives its high performance from the fact that some computations can proceed in parallel, and a parallelizing compiler can take a sequential program as input and automatically translate it into parallel form for the target multiprocessor system. The compiler checks for data dependences in the program to determine which loops can execute in parallel. But for loops with irregular (i.e., indirectly indexed), nonlinear, or dynamic array access patterns, no state-of-the-art compiler can determine whether data dependences exist: either the necessary information is not statically available or the access patterns are too complex to analyze. Unfortunately, many scientific and engineering programs that perform complex simulations contain such loops, which greatly limits the applicability of parallelizing compilers. Since all information is available during program execution, we can resort to run-time data dependence analysis to overcome this problem. Two kinds of approaches have been developed for run-time parallelization: speculative doall parallelization and run-time doacross parallelization. The former assumes full parallelism and executes the loop speculatively, then examines the correctness of the parallel execution after loop termination; if the speculation succeeds, a significant speedup is obtained, otherwise the altered variables are restored and the original loop is re-executed sequentially. The latter, known as the inspector/executor method, constructs a proper execution schedule to enforce parallelism in doacross loops: the inspector examines cross-iteration dependences and produces a parallel execution schedule at run-time, and the executor then performs the actual operations of the loop based on the schedule arranged by the inspector. In this thesis, we present three practical run-time techniques to fully exploit loop-level parallelism. The first is speculative parallelization with new technology (SPNT). Two main characteristics distinguish the SPNT test: it improves the success rate of speculative parallelization by eliminating all cross-iteration data dependences except cross-processor flow dependences, and it reduces the failure penalty by aborting the speculative parallel execution immediately once a cross-processor flow dependence is detected during execution. The second is parallel group analysis (PGA), which schedules maximal sets of contiguous iterations with no cross-iteration flow dependence into parallel groups; PGA has the advantages of a simple but efficient algorithm and a small memory footprint. The third is the optimal parallel scheduler (OPS), which computes an optimal schedule in parallel. In addition, we make use of an atomic bitwise-OR instruction to remove the overhead of global synchronization and obtain satisfactory speedup. Both PGA and OPS serve run-time doacross parallelization. Applying run-time parallelization to irregular scientific computations can uncover much parallelism not found before, meeting the requirements of high-performance computing for scientific applications. We believe it is worthwhile to invest more resources in the development of run-time parallelization techniques.
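A minimal inspector/executor sketch for an irregular reduction loop (illustrative only; the thesis's PGA and OPS schedulers are more refined). The inspector groups iterations into wavefronts so that no two iterations in a group touch the same element; the executor then runs each group as a parallel loop:

```c
/* Minimal inspector/executor sketch for the irregular loop
 *     for (i = 0; i < n; i++) a[idx[i]] += f(i);
 * Illustrative only; the thesis's PGA and OPS schedulers are more refined. */
#include <omp.h>
#include <stdlib.h>

void run_doacross(double *a, const int *idx, int n, int n_elems,
                  double (*f)(int)) {
    int *group = malloc(n * sizeof *group);
    int *last  = calloc(n_elems, sizeof *last); /* wavefronts used per element */
    int n_groups = 0;

    /* Inspector: iteration i must come after every earlier iteration that
       touches the same element a[idx[i]]. */
    for (int i = 0; i < n; i++) {
        group[i] = last[idx[i]]++;              /* next free wavefront */
        if (group[i] + 1 > n_groups) n_groups = group[i] + 1;
    }

    /* Executor: iterations inside one group are independent, so each group
       runs as a parallel loop (simplified: rescans all i per group). */
    for (int g = 0; g < n_groups; g++) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            if (group[i] == g)
                a[idx[i]] += f(i);
    }
    free(group);
    free(last);
}
```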
13

Antony, Joseph. "Performance Models for Electronic Structure Methods on Modern Computer Architectures." PhD thesis, 2009. http://hdl.handle.net/1885/49420.

Abstract:
Electronic structure codes are computationally intensive scientific applications used to probe and elucidate chemical processes at an atomic level. Maximizing the performance of these applications on any given hardware platform is vital in order to facilitate larger and more accurate computations. An important part of this endeavor is the development of protocols for measuring performance, and of models to describe that performance as a function of system architecture. This thesis makes contributions in both areas, with a focus on shared memory parallel computer architectures and the Gaussian electronic structure code. Shared memory parallel computer systems are increasingly important as hardware manufacturers are unable to extract performance improvements by increasing clock frequencies; instead the emphasis is on using multi-core processors to provide higher performance. These processor chips generally have complex cache hierarchies, and may be coupled together in multi-socket systems which exhibit highly non-uniform memory access (NUMA) characteristics. This work seeks to understand how cache characteristics and memory/thread placement affect the performance of electronic structure codes, and to develop performance models that can be used to describe and predict code performance by accounting for these effects. A protocol for performing memory and thread placement experiments on NUMA systems is presented and its implementation under both the Solaris and Linux operating systems is discussed. A placement distribution model is proposed and subsequently used both to guide memory/thread placement experiments and as an aid in the analysis of results obtained from experiments. In order to describe single-threaded performance as a function of cache blocking, a simple linear performance model is investigated for use when computing the electron repulsion integrals that lie at the heart of virtually all electronic structure methods. A parametric cache variation study is performed by combining parameters obtained for the linear performance model on existing hardware with instruction and cache miss counts obtained by simulation, and predictions are made of performance as a function of cache architecture. Extension of the linear performance model to describe multi-threaded performance on complex NUMA architectures is discussed and investigated experimentally. Use of dynamic page migration to improve locality is also considered. Finally, the use of large-scale electronic structure calculations is demonstrated in a series of calculations studying the charge distribution for a single positive ion solvated within a shell of water molecules of increasing size.
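The "simple linear performance model" described above is presumably of the general form (a hedged paraphrase of the abstract, not a formula quoted from the thesis)

$$ T \;\approx\; c_{\mathrm{ins}} N_{\mathrm{ins}} \;+\; \sum_{\ell \in \{\mathrm{L1},\,\mathrm{L2},\,\ldots\}} c_\ell \, m_\ell, $$

where $N_{\mathrm{ins}}$ is the instruction count, $m_\ell$ the miss count at cache level $\ell$, and the coefficients $c$ are fitted on real hardware; feeding in simulated $m_\ell$ for a hypothetical cache then predicts performance, which is exactly the parametric cache variation study the abstract describes.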
14

Yih, Wang Chung, and 王崇懿. "Design of A Parallel Virtual Machine with Support of Distributed Shared Memory." Thesis, 1995. http://ndltd.ncl.edu.tw/handle/60621672842615961380.

Abstract:
Master's thesis, National Tsing Hua University, Department of Information Science, ROC academic year 83 (1994-95)
The distributed computing environment (DCE) is becoming more and more important as computing technology progresses. There is a trend toward speeding up computation with DCE tools that merge networked workstations into a powerful, cooperative parallel virtual machine. When developing parallel programs on such a parallel virtual machine, it is burdensome for programmers to maintain data consistency using the elementary communication primitives provided by most DCE tools. To simplify the programming style, the concept of distributed shared memory (DSM) is introduced in place of message passing. In this thesis, we discuss the evolution of distributed shared memory systems and implement a DSM mechanism on the DOS-PVM platform. DOS-PVM, a new DCE tool of our design, integrates distributed shared memory and virtual memory functions on a cluster of PCs. The goal of our DOS-PVM environment is to make a cluster of PCs work like a bus-based multiprocessor system.
15

Wang, Ying-Lung, and 王應龍. "Design and Implementation of a Parallel Virtual Machine with Distributed Shared Memory by Using Java." Thesis, 2000. http://ndltd.ncl.edu.tw/handle/62491381815597181663.

Abstract:
Master's thesis, National Tsing Hua University, Department of Computer Science, ROC academic year 88 (1999-2000)
In this thesis, we design and implement a system with a Distributed Shared Memory (DSM) mechanism for distributed computing environments using Java. Processor speed has its limits, so many complex and time-consuming computational tasks, such as astronomical, chemical, or meteorological data processing, need distributed systems to increase the available computing power. A distributed system may consist of many different types of machines whose instruction sets differ. When developing parallel programs for such a heterogeneous distributed system, programmers must be careful and may spend a great deal of effort writing several versions of their programs for different platforms. Porting programs is not an easy job, so the portability and cross-platform nature of Java make it a good solution for simplifying heterogeneous programming. For distributed computing, Java provides the Remote Method Invocation (RMI) mechanism for invoking methods of a remote object as if they were methods of a local object. However, programmers still need to take care of data replication, and maintaining data consistency slows the development of parallel programs. A DSM mechanism provides a shared memory abstraction over a cluster of physically distributed machines: communication between programs running on different machines is achieved through normal memory accesses, and the complicated procedures of network communication and data consistency are hidden by the DSM. The goal of this thesis is therefore to combine the advantages of Java and DSM to provide a more powerful distributed computing environment.
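For contrast with this object-based Java design, the classic page-based way a DSM hides communication behind ordinary memory accesses can be sketched in C (illustrative only, and not the mechanism this thesis implements): protect shared pages, catch the fault raised by the first access, and fetch the page before the access retries.

```c
/* Classic page-based DSM sketch, for contrast only; this thesis builds an
 * object-based DSM in Java, not this mechanism. Shared pages start
 * inaccessible; the first touch faults, the handler fetches the page
 * (stubbed here) and unprotects it, and the access then retries. */
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static size_t page_size;

static void fetch_page_from_home(void *page) {
    memset(page, 0, page_size);   /* stub for "receive page over network" */
}

static void on_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(page_size - 1));
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
    fetch_page_from_home(page);   /* ordering simplified for the sketch */
}

int main(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    char *shared = mmap(NULL, 16 * page_size, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct sigaction sa = {0};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    shared[123] = 'x';            /* faults once, then behaves like plain memory */
    return 0;
}
```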