Tesis sobre el tema "Vectorisation (informatique)"
Crea una cita precisa en los estilos APA, MLA, Chicago, Harvard y otros
Consulte los 23 mejores tesis para su investigación sobre el tema "Vectorisation (informatique)".
Junto a cada fuente en la lista de referencias hay un botón "Agregar a la bibliografía". Pulsa este botón, y generaremos automáticamente la referencia bibliográfica para la obra elegida en el estilo de cita que necesites: APA, MLA, Harvard, Vancouver, Chicago, etc.
También puede descargar el texto completo de la publicación académica en formato pdf y leer en línea su resumen siempre que esté disponible en los metadatos.
Explore tesis sobre una amplia variedad de disciplinas y organice su bibliografía correctamente.
Maini, Jean-Luc. "Vectorisation, segmentation de schémas éléctroniques et reconnaissance de composants". Le Havre, 2000. http://www.theses.fr/2000LEHA0008.
Texto completoPeou, Kenny. "Computing Tools for HPDA : a Cache-Oblivious and SIMD Approach". Electronic Thesis or Diss., université Paris-Saclay, 2021. http://www.theses.fr/2021UPASG105.
Texto completoThis work presents three contributions to the fields of CPU vectorization and machine learning. The first contribution is an algorithm for computing an average with half precision floating point values. In this work performed with limited half precision hardware support, we use an existing software library to emulate half precision computation. This allows us to compare the numerical precision of our algorithm to various commonly used algorithms. Finally, we perform runtime performance benchmarks using single and double floating point values in order to anticipate the potential gains from applying CPU vectorization to half precision values. Overall, we find that our algorithm has slightly worse best-case numerical performance in exchange for significantly better worst-case numerical performance, all while providing similar runtime performance to other algorithms. The second contribution is a fixed-point computational library designed specifically for CPU vectorization. Existing libraries fail rely on compiler auto-vectorization, which fail to vectorize arithmetic multiplication and division operations. In addition, these two operations require cast operations which reduce vectorizability and have a real computational cost. To allevieate this, we present a fixed-point data storage format that does not require any cast operations to perform arithmetic operations. In addition, we present a number of benchmarks comparing our implementation to existing libraries and present the CPU vectorization speedup on a number of architectures. Overall, we find that our fixed point format allows runtime performance equal to or better than all compared libraries. The final contribution is a neural network inference engine designed to perform experiments varying the numerical datatypes used in the inference computation. This inference engine allows layer-specific control of which data types are used to perform inference. We use this level of control to perform experiments to determine how aggressively it is possible to reduce the numerical precision used in inferring the PVANet neural network. In the end, we determine that a combination of the standardized float16 and bfloat16 data types is sufficient for the entire inference
Gallet, Camille. "Étude de transformations et d’optimisations de code parallèle statique ou dynamique pour architecture "many-core"". Electronic Thesis or Diss., Paris 6, 2016. http://www.theses.fr/2016PA066747.
Texto completoSince the 60s to the present, the evolution of supercomputers faced three revolutions : (i) the arrival of the transistors to replace triodes, (ii) the appearance of the vector calculations, and (iii) the clusters. These currently consist of standards processors that have benefited of increased computing power via an increase in the frequency, the proliferation of cores on the chip and expansion of computing units (SIMD instruction set). A recent example involving a large number of cores and vector units wide (512-bit) is the co-proceseur Intel Xeon Phi. To maximize computing performance on these chips by better exploiting these SIMD instructions, it is necessary to reorganize the body of the loop nests taking into account irregular aspects (control flow and data flow). To this end, this thesis proposes to extend the transformation named Deep Jam to extract the regularity of an irregular code and facilitate vectorization. This thesis presents our extension and application of a multi-material hydrodynamic mini-application, HydroMM. Thus, these studies show that it is possible to achieve a significant performance gain on uneven codes
Gallet, Camille. "Étude de transformations et d’optimisations de code parallèle statique ou dynamique pour architecture "many-core"". Thesis, Paris 6, 2016. http://www.theses.fr/2016PA066747/document.
Texto completoSince the 60s to the present, the evolution of supercomputers faced three revolutions : (i) the arrival of the transistors to replace triodes, (ii) the appearance of the vector calculations, and (iii) the clusters. These currently consist of standards processors that have benefited of increased computing power via an increase in the frequency, the proliferation of cores on the chip and expansion of computing units (SIMD instruction set). A recent example involving a large number of cores and vector units wide (512-bit) is the co-proceseur Intel Xeon Phi. To maximize computing performance on these chips by better exploiting these SIMD instructions, it is necessary to reorganize the body of the loop nests taking into account irregular aspects (control flow and data flow). To this end, this thesis proposes to extend the transformation named Deep Jam to extract the regularity of an irregular code and facilitate vectorization. This thesis presents our extension and application of a multi-material hydrodynamic mini-application, HydroMM. Thus, these studies show that it is possible to achieve a significant performance gain on uneven codes
Mercadier, Darius. "Usuba, Optimizing Bitslicing Compiler". Electronic Thesis or Diss., Sorbonne université, 2020. http://www.theses.fr/2020SORUS180.
Texto completoBitslicing is a technique commonly used in cryptography to implement high-throughput parallel and constant-time symmetric primitives. However, writing, optimizing and protecting bitsliced implementations by hand are tedious tasks, requiring knowledge in cryptography, CPU microarchitectures and side-channel attacks. The resulting programs tend to be hard to maintain due to their high complexity. To overcome those issues, we propose Usuba, a high-level domain-specific language to write symmetric cryptographic primitives. Usuba allows developers to write high-level specifications of ciphers without worrying about the actual parallelization: an Usuba program is a scalar description of a cipher, from which the Usuba compiler (Usubac) automatically produces vectorized bitsliced code. When targeting high-end Intel CPUs, the Usubac applies several domain-specific optimizations, such as interleaving and custom instruction-scheduling algorithms. We are thus able to match the throughputs of hand-tuned assembly and C implementations of several widely used ciphers. Futhermore, in order to protect cryptographic implementations on embedded devices against side-channel attacks, we extend our compiler in two ways. First, we integrate into Usubac state-of-the-art techniques in higher-order masking to generate implementations that are provably secure against power-analysis attacks. Second, we implement a backend for SKIVA, a custom 32-bit CPU enabling the combination of countermeasures against power-based and timing-based leakage, as well as fault injection
Meas-Yedid, Vannary. "Analyse de cartes scannées : interprétation de la planche de vert : contribution à l'analyse d'images en cartographie". Paris 5, 1998. http://www.theses.fr/1998PA05S003.
Texto completoKirschenmann, Wilfried. "Vers des noyaux de calcul intensif pérennes". Phd thesis, Université de Lorraine, 2012. http://tel.archives-ouvertes.fr/tel-00844673.
Texto completoNaouai, Mohamed. "Localisation et reconstruction du réseau routier par vectorisation d'image THR et approximation des contraintes de type "NURBS"". Phd thesis, Université de Strasbourg, 2013. http://tel.archives-ouvertes.fr/tel-00994333.
Texto completoSoyez-Martin, Claire. "From semigroup theory to vectorization : recognizing regular languages". Electronic Thesis or Diss., Université de Lille (2022-....), 2023. http://www.theses.fr/2023ULILB052.
Texto completoThe pursuit of optimizing regular expression validation has been a long-standing challenge,spanning several decades. Over time, substantial progress has been made through a vast range of approaches, spanning from ingenious new algorithms to intricate low-level optimizations.Cutting-edge tools have harnessed these optimization techniques to continually push the boundaries of efficient execution. One notable advancement is the integration of vectorization, a method that leverage low-level parallelism to process data in batches, resulting in significant performance enhancements. While there has been extensive research on designing handmade tailored algorithms for particular languages, these solutions often lack generalizability, as the underlying methodology cannot be applied indiscriminately to any regular expression, which makes it difficult to integrate to existing tools.This thesis provides a theoretical framework in which it is possible to generate vectorized programs for regular expressions corresponding to rational expressions in a given class. To do so, we rely on the algebraic theory of automata, which provides tools to process letters in parallel. These tools also allow for a deeper understanding of the underlying regular language, which gives access to some properties that are useful when producing vectorized algorithms. The contribution of this thesis is twofold. First, it provides implementations and preliminary benchmarks to study the potential efficiency of algorithms using algebra and vectorization. Second, it gives algorithms that construct vectorized programs for languages in specific classes of rational expressions, namely the first order logic and its subset restricted to two variables
Haine, Christopher. "Kernel optimization by layout restructuring". Thesis, Bordeaux, 2017. http://www.theses.fr/2017BORD0639/document.
Texto completoCareful data layout design is crucial for achieving high performance, as nowadays processors waste a considerable amount of time being stalled by memory transactions, and in particular spacial and temporal locality have to be optimized. However, data layout transformations is an area left largely unexplored by state-of-the-art compilers, due to the difficulty to evaluate the possible performance gains of transformations. Moreover, optimizing data layout is time-consuming, error-prone, and layout transformations are too numerous tobe experimented by hand in hope to discover a high performance version. We propose to guide application programmers through data layout restructuring with an extensive feedback, firstly by providing a comprehensive multidimensional description of the initial layout, built via analysis of memory traces collected from the application binary textit {in fine} aiming at pinpointing problematic strides at the instruction level, independently of theinput language. We choose to focus on layout transformations,translatable to C-formalism to aid user understanding, that we apply and assesson case study composed of two representative multithreaded real-lifeapplications, a cardiac wave simulation and lattice QCD simulation, with different inputs and parameters. The performance prediction of different transformations matches (within 5%) with hand-optimized layout code
Thebault, Loïc. "Algorithmes Parallèles Efficaces Appliqués aux Calculs sur Maillages Non Structurés". Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLV088/document.
Texto completoThe growing need for numerical simulations results in larger and more complex computing centers and more HPC softwares. Actual HPC system architectures have an increasing requirement for energy efficiency and performance. Recent advances in hardware design result in an increasing number of nodes and an increasing number of cores per node. However, some resources do not scale at the same rate. The increasing number of cores and parallel units implies a lower memory per core, higher requirement for concurrency, higher coherency traffic, and higher cost for coherency protocol. Most of the applications and runtimes currently in use struggle to scale with the present trend. In the context of finite element methods, exposing massive parallelism on unstructured mesh computations with efficient load balancing and minimal synchronizations is challenging. To make efficient use of these architectures, several parallelization strategies have to be combined together to exploit the multiple levels of parallelism. This P.h.D. thesis proposes several contributions aimed at overpassing this limitation by addressing irregular codes and data structures in an efficient way. We developed a hybrid parallelization approach combining the distributed, shared, and vectorial forms of parallelism in a fine grain taskbased approach applied to irregular structures. Our approach has been ported to several industrial applications developed by Dassault Aviation and has led to important speedups using standard multicores and the Intel Xeon Phi manycore
Ogier, Jean-Marc. "Contribution à l'analyse automatique de documents cartographiques. Interprétation de données cadastrales". Rouen, 1994. http://www.theses.fr/1994ROUES008.
Texto completoSornet, Gauthier. "Parallèlisme des calculs numériques appliqué aux géosciences". Thesis, Orléans, 2019. http://www.theses.fr/2019ORLE2009.
Texto completoDiscrete model resolution by numerical calculation benefits a dizzying number of industrial and general interest applications. It isalso necessary for heterogeneous machine evolutions to be exploited by calculation software. Thus, this thesis aims to explore theimpact of parallelism in modern architecture on geoscientific calculations. Consequently, this thesis work is particularly interestedin the categories of stencil resolution kernels and spectral finite elements. Work already carried out on these kernels consists instructuring, cutting and attacking the calculation in such a way as to exploit the resources of the machines in parallel. In addition,they also provide methods, models and tools for experimental analysis. Our approach consists in combining different parallelismcapacities for a given calculation kernel. Indeed, the latest Intel x86 processor developments have multiple superscalar, vector andmulti-core parallelism capabilities. The results of our work are consistent with those in the literature. First of all, we notice that theautomated vector optimizations of compilers are inefficient. In addition, the work of this thesis succeeds in refining the analysismodels. Thus, we observe an adequacy of the refined models with the experimental observations. In addition, the performanceachieved exceeds parallel reference implementations. These discoveries confirm the need to integrate these optimizations intocalculation abstraction solutions. Indeed, these solutions would better integrate the purpose of the calculations to be optimizedand can thus better couple them to the functioning of the target architectures
Wang, Yunsong. "Optimization of Monte Carlo Neutron Transport Simulations with Emerging Architectures". Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLX090/document.
Texto completoMonte Carlo (MC) neutron transport simulations are widely used in the nuclear community to perform reference calculations with minimal approximations. The conventional MC method has a slow convergence according to the law of large numbers, which makes simulations computationally expensive. Cross section computation has been identified as the major performance bottleneck for MC neutron code. Typically, cross section data are precalculated and stored into memory before simulations for each nuclide, thus during the simulation, only table lookups are required to retrieve data from memory and the compute cost is trivial. We implemented and optimized a large collection of lookup algorithms in order to accelerate this data retrieving process. Results show that significant speedup can be achieved over the conventional binary search on both CPU and MIC in unit tests other than real case simulations. Using vectorization instructions has been proved effective on many-core architecture due to its 512-bit vector units; on CPU this improvement is limited by a smaller register size. Further optimization like memory reduction turns out to be very important since it largely improves computing performance. As can be imagined, all proposals of energy lookup are totally memory-bound where computing units does little things but only waiting for data. In another word, computing capability of modern architectures are largely wasted. Another major issue of energy lookup is that the memory requirement is huge: cross section data in one temperature for up to 400 nuclides involved in a real case simulation requires nearly 1 GB memory space, which makes simulations with several thousand temperatures infeasible to carry out with current computer systems.In order to solve the problem relevant to energy lookup, we begin to investigate another on-the-fly cross section proposal called reconstruction. The basic idea behind the reconstruction, is to do the Doppler broadening (performing a convolution integral) computation of cross sections on-the-fly, each time a cross section is needed, with a formulation close to standard neutron cross section libraries, and based on the same amount of data. The reconstruction converts the problem from memory-bound to compute-bound: only several variables for each resonance are required instead of the conventional pointwise table covering the entire resolved resonance region. Though memory space is largely reduced, this method is really time-consuming. After a series of optimizations, results show that the reconstruction kernel benefits well from vectorization and can achieve 1806 GFLOPS (single precision) on a Knights Landing 7250, which represents 67% of its effective peak performance. Even if optimization efforts on reconstruction significantly improve the FLOP usage, this on-the-fly calculation is still slower than the conventional lookup method. Under this situation, we begin to port the code on GPGPU to exploit potential higher performance as well as higher FLOP usage. On the other hand, another evaluation has been planned to compare lookup and reconstruction in terms of power consumption: with the help of hardware and software energy measurement support, we expect to find a compromising solution between performance and energy consumption in order to face the "power wall" challenge along with hardware evolution
Hallou, Nabil. "Runtime optimization of binary through vectorization transformations". Thesis, Rennes 1, 2017. http://www.theses.fr/2017REN1S120/document.
Texto completoIn many cases, applications are not optimized for the hardware on which they run. This is due to backward compatibility of ISA that guarantees the functionality but not the best exploitation of the hardware. Many reasons contribute to this unsatisfying situation such as legacy code, commercial code distributed in binary form, or deployment on compute farms. Our work focuses on maximizing the CPU efficiency for the SIMD extensions. The first contribution is a lightweight binary translation mechanism that does not include a vectorizer, but instead leverages what a static vectorizer previously did. We show that many loops compiled for x86 SSE can be dynamically converted to the more recent and more powerful AVX; as well as, how correctness is maintained with regards to challenges such as data dependencies and reductions. We obtain speedups in line with those of a native compiler targeting AVX. The second contribution is a runtime auto-vectorization of scalar loops. For this purpose, we use open source frame-works that we have tuned and integrated to (1) dynamically lift the x86 binary into the Intermediate Representation form of the LLVM compiler, (2) abstract hot loops in the polyhedral model, (3) use the power of this mathematical framework to vectorize them, and (4) finally compile them back into executable form using the LLVM Just-In-Time compiler. In most cases, the obtained speedups are close to the number of elements that can be simultaneously processed by the SIMD unit. The re-vectorizer and auto-vectorizer are implemented inside a dynamic optimization platform; it is completely transparent to the user, does not require any rewriting of the binaries, and operates during program execution
Roch, Jean-Louis. "Calcul formel et parallélisme : l'architecture du système PAC et son arithmétique rationnelle". Phd thesis, Grenoble INPG, 1989. http://tel.archives-ouvertes.fr/tel-00334457.
Texto completoMoustafa, Salli. "Massively Parallel Cartesian Discrete Ordinates Method for Neutron Transport Simulation". Thesis, Bordeaux, 2015. http://www.theses.fr/2015BORD0408/document.
Texto completoHigh-fidelity nuclear reactor core simulations require a precise knowledge of the neutron flux inside the reactor core. This flux is modeled by the linear Boltzmann equation also called neutron transport equation. In this thesis, we focus on solving this equation using the discrete ordinates method (SN) on Cartesian mesh. This method involves a source iteration scheme including a sweep over the spatial mesh and gathering the vast majority of computations in the SN method. Due to the large amount of computations performed in the resolution of the Boltzmann equation, numerous research works were focused on the optimization of the time to solution by developing parallel algorithms for solving the transport equation. However, these algorithms were designed by considering a super-computer as a collection of independent cores, and therefore do not explicitly take into account the memory hierarchy and multi-level parallelism available inside modern super-computers. Therefore, we first proposed a strategy for designing an efficient parallel implementation of the sweep operation on modern architectures by combining the use of the SIMD paradigm thanks to C++ generic programming techniques and an emerging task-based runtime system: PaRSEC. We demonstrated the need for such an approach using theoretical performance models predicting optimal partitionings. Then we studied the challenge of converging the source iterations scheme in highly diffusive media such as the PWR cores. We have implemented and studied the convergence of a new acceleration scheme (PDSA) that naturally suits our Hybrid parallel implementation. The combination of all these techniques have enabled us to develop a massively parallel version of the SN Domino solver. It is capable of tackling the challenges posed by the neutron transport simulations and compares favorably with state-of-the-art solvers such as Denovo. The performance of the PaRSEC implementation of the sweep operation reaches 6.1 Tflop/s on 768 cores corresponding to 33.9% of the theoretical peak performance of this set of computational resources. For a typical 26-group PWR calculations involving 1.02×1012 DoFs, the time to solution required by the Domino solver is 46 min using 1536 cores
Borghi, Alexandre. "Adaptation de l’algorithmique aux architectures parallèles". Thesis, Paris 11, 2011. http://www.theses.fr/2011PA112205/document.
Texto completoIn this thesis, we are interested in adapting algorithms to parallel architectures. Current high performance platforms have several levels of parallelism and require a significant amount of work to make the most of them. Supercomputers possess more and more computational units and are more and more heterogeneous and hierarchical, which make their use very difficult.We take an interest in several aspects which enable to benefit from modern parallel architectures. Throughout this thesis, several problems with different natures are tackled, more theoretically or more practically according to the context and the scale of the considered parallel platforms.We have worked on modeling problems in order to adapt their formulation to existing solvers or resolution methods, in particular in the context of integer factorization problem modeled and solved with integer programming tools.The main contribution of this thesis corresponds to the design of algorithms thought from the beginning to be efficient when running on modern architectures (multi-core processors, Cell, GPU). Two algorithms which solve the compressive sensing problem have been designed in this context: the first one uses linear programming and enables to find an exact solution, whereas the second one uses convex programming and enables to find an approximate solution.We have also used a high-level parallelization library which uses the BSP model in the context of model checking to implement in parallel an existing algorithm. From a unique implementation, this tool enables the use of the algorithm on platforms with different levels of parallelism, while obtaining cutting edge performance for each of them. In our case, the largest-scale platform that we considered is the cluster of multi-core multiprocessors. More, in the context of the very particular Cell processor, an implementation has been written from scratch to take benefit from it
Vaxivière, Pascal. "Interprétation de dessins techniques mécaniques". Vandoeuvre-les-Nancy, INPL, 1995. http://www.theses.fr/1995INPL032N.
Texto completoChouh, Hamza. "Simulations interactives de champ ultrasonore pour des configurations complexes de contrôle non destructif". Thesis, Lyon, 2016. http://www.theses.fr/2016LYSE1220/document.
Texto completoIn order to fulfill increasing reliability and safety requirements, non destructive testing techniques are constantly evolving and so does their complexity. Consequently, simulation is an essential part of their design. We developed a tool for the simulation of the ultrasonic field radiated by any planar probes into non destructive testing configurations involving meshed geometries without prominent edges, isotropic and anisotropic, homogeneous and heterogeneous materials, and wave trajectories that can include reflections and transmissions. We approximate the ultrasonic wavefronts by using polynomial interpolators that are local to ultrasonic ray pencils. They are obtained using a surface research algorithm based on pencil tracing and successive subdivisions. Their interpolators enable the computation of the necessary quantities for the impulse response computation on each point of a sampling of the transducer surface that fulfills the Shannon criterion. By doing so, we can compute a global impulse response which, when convoluted with the excitation signal of the transducer, results in the ultrasonic field. The usage of task parallelism and of SIMD instructions on the most computationally expensive steps yields an important performance boost. Finally, we developed a tool for progressive visualization of field images. It benefits from an image reconstruction technique and schedules field computations in order to accelerate convergence towards the final image
Bramas, Bérenger. "Optimization and parallelization of the boundary element method for the wave equation in time domain". Thesis, Bordeaux, 2016. http://www.theses.fr/2016BORD0022/document.
Texto completoThe time-domain BEM for the wave equation in acoustics and electromagnetism is used to simulatethe propagation of a wave with a discretization in time. It allows to obtain several frequencydomainresults with one solve. In this thesis, we investigate the implementation of an efficientTD-BEM solver using different approaches. We describe the context of our study and the TD-BEMformulation expressed as a sparse linear system composed of multiple interaction/convolutionmatrices. This system is naturally computed using the sparse matrix-vector product (SpMV). Wework on the limits of the SpMV kernel by looking at the matrix reordering and the behavior of ourSpMV kernels using vectorization (SIMD) on CPUs and an advanced blocking-layout on NvidiaGPUs. We show that this operator is not appropriate for our problem, and we then propose toreorder the original computation to get a special matrix structure. This new structure is called aslice matrix and is computed with a custom matrix/vector product operator. We present an optimizedimplementation of this operator on CPUs and Nvidia GPUs for which we describe advancedblocking schemes. The resulting solver is parallelized with a hybrid strategy above heterogeneousnodes and relies on a new heuristic to balance the work among the processing units. Due tothe quadratic complexity of this matrix approach, we study the use of the fast multipole method(FMM) for our time-domain BEM solver. We investigate the parallelization of the general FMMalgorithm using several paradigms in both shared and distributed memory, and we explain howmodern runtime systems are well-suited to express the FMM computation. Finally, we investigatethe implementation and the parametrization of an FMM kernel specific to our TD-BEM, and weprovide preliminary results
Cieren, Emmanuel. "Molecular Dynamics for Exascale Supercomputers". Thesis, Bordeaux, 2015. http://www.theses.fr/2015BORD0174/document.
Texto completoIn the exascale race, supercomputer architectures are evolving towards massively multicore nodes with hierarchical memory structures and equipped with larger vectorization registers. These trends tend to make MPI-only applications less effective, and now require programmers to explicitly manage low-level elements to get decent performance.In the context of Molecular Dynamics (MD) applied to condensed matter physics, the need for a better understanding of materials behaviour under extreme conditions involves simulations of ever larger systems, on tens of thousands of cores. This will put molecular dynamics codes among software that are very likely to meet serious difficulties when it comes to fully exploit the performance of next generation processors.This thesis proposes the design and implementation of a high-performance, flexible and scalable framework dedicated to the simulation of large scale MD systems on future supercomputers. We managed to separate numerical modules from different expressions of parallelism, allowing developers not to care about optimizations and still obtain high levels of performance. Our architecture is organized in three levels of parallelism: domain decomposition using MPI, thread parallelization within each domain, and explicit vectorization. We also included a dynamic load balancing capability in order to equally share the workload among domains.Results on simple tests show excellent sequential performance and a quasi linear speedup on several thousands of cores on various architectures. When applied to production simulations, we report an acceleration up to a factor 30 compared to the code previously used by CEA’s researchers
El, Moussawi Ali Hassan. "SIMD-aware word length optimization for floating-point to fixed-point conversion targeting embedded processors". Thesis, Rennes 1, 2016. http://www.theses.fr/2016REN1S150/document.
Texto completoIn order to cut-down their cost and/or their power consumption, many embedded processors do not provide hardware support for floating-point arithmetic. However, applications in many domains, such as signal processing, are generally specified using floating-point arithmetic for the sake of simplicity. Porting these applications on such embedded processors requires a software emulation of floating-point arithmetic, which can greatly degrade performance. To avoid this, the application is converted to use fixed-point arithmetic instead. Floating-point to fixed-point conversion involves a subtle tradeoff between performance and precision ; it enables the use of narrower data word lengths at the cost of degrading the computation accuracy. Besides, most embedded processors provide support for SIMD (Single Instruction Multiple Data) as a mean to improve performance. In fact, this allows the execution of one operation on multiple data in parallel, thus ultimately reducing the execution time. However, the application should usually be transformed in order to take advantage of the SIMD instruction set. This transformation, known as Simdization, is affected by the data word lengths ; narrower word lengths enable a higher SIMD parallelism rate. Hence the tradeoff between precision and Simdization. Many existing work aimed at provide/improving methodologies for automatic floating-point to fixed-point conversion on the one side, and Simdization on the other. In the state-of-the-art, both transformations are considered separately even though they are strongly related. In this context, we study the interactions between these transformations in order to better exploit the performance/accuracy tradeoff. First, we propose an improved SLP (Superword Level Parallelism) extraction (an Simdization technique) algorithm. Then, we propose a new methodology to jointly perform floating-point to fixed-point conversion and SLP extraction. Finally, we implement this work as a fully automated source-to-source compiler flow. Experimental results, targeting four different embedded processors, show the validity of our approach in efficiently exploiting the performance/accuracy tradeoff compared to a typical approach, which considers both transformations independently