Dissertations on the topic "Synthèse audio"
Consult the top 28 dissertations for research on the topic "Synthèse audio".
Coulibaly, Patrice Yefoungnigui. "Codage audio à bas débit avec synthèse sinusoïdale." Mémoire, Université de Sherbrooke, 2000. http://savoirs.usherbrooke.ca/handle/11143/1078.
Oger, Marie. "Model-based techniques for flexible speech and audio coding." Nice, 2007. http://www.theses.fr/2007NICE4109.
The objective of this thesis is to develop optimal speech and audio coding techniques which are more flexible than the state of the art and can adapt in real time to various constraints (rate, bandwidth, delay). This problem is addressed using several tools: statistical models, high-rate quantization theory, and flexible entropy coding. Firstly, a novel method of flexible coding for linear prediction coding (LPC) coefficients is proposed using the Karhunen-Loève transform (KLT) and scalar quantization based on generalized Gaussian modelling. This method has a performance equivalent to the LPC quantizer used in AMR-WB with a lower complexity. Then, two transform audio coding structures are proposed using either stack-run coding or model-based bit-plane coding. In both cases the coefficients, after perceptual weighting and the modified discrete cosine transform (MDCT), are approximated by a generalized Gaussian distribution, and the coding of the MDCT coefficients is optimized according to this model. The performance is compared with that of ITU-T G.722.1. The stack-run coder is better than G.722.1 at low bit rates and equivalent at high bit rates. However, the computational complexity of the proposed stack-run coder is higher, while its memory requirement is low. The bit-plane coder has the advantage of being bit-rate scalable. The generalized Gaussian model is used to initialize the probability tables of an arithmetic coder. The bit-plane coder is worse than stack-run coding at low bit rates and equivalent at high bit rates. It has a computational complexity close to that of G.722.1, while its memory requirement remains low.
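The generalized Gaussian model invoked in this abstract is a standard density; for reference (this is the usual parameterization, not a formula quoted from the thesis), its probability density is

$$ p(x) \;=\; \frac{\beta}{2\,\alpha\,\Gamma(1/\beta)}\;\exp\!\left(-\left(\frac{|x|}{\alpha}\right)^{\beta}\right), $$

where \(\alpha\) is a scale parameter and \(\beta\) a shape parameter; \(\beta = 2\) gives the Gaussian and \(\beta = 1\) the Laplacian case, which is why this family fits the heavy-tailed statistics of MDCT coefficients.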
Liuni, Marco. "Adaptation Automatique de la Résolution pour l'Analyse et la Synthèse du Signal Audio." Phd thesis, Université Pierre et Marie Curie - Paris VI, 2012. http://tel.archives-ouvertes.fr/tel-00773550.
Olivero, Anaik. "Les multiplicateurs temps-fréquence : Applications à l'analyse et la synthèse de signaux sonores et musicaux." Thesis, Aix-Marseille, 2012. http://www.theses.fr/2012AIXM4788/document.
Analysis/transformation/synthesis is a general paradigm in signal processing that aims at manipulating or generating signals for practical applications. This thesis deals with time-frequency representations obtained with Gabor atoms. In this context, the complexity of a sound transformation can be modeled by a Gabor multiplier. Gabor multipliers are linear diagonal operators acting on signals, characterized by a time-frequency transfer function of complex values called the Gabor mask. Gabor multipliers formalize the concept of filtering in the time-frequency domain. As they act by multiplication in the time-frequency domain, they are a priori well adapted to producing sound transformations such as timbre transformations. In a first part, this work models the problem of estimating a Gabor mask between two given signals and provides algorithms to solve it. The Gabor multiplier between two signals is not uniquely defined, and the proposed estimation strategies are able to generate Gabor multipliers that produce signals with a satisfying sound quality. In a second part, we show that a Gabor mask contains relevant information, as it can be viewed as a time-frequency representation of the difference of timbre between two given sounds. By averaging the energy contained in a Gabor mask, we obtain a measure of this difference that allows us to discriminate different musical instrument sounds. We also propose strategies to automatically localize the time-frequency regions responsible for such a timbre dissimilarity between musical instrument classes. Finally, we show that Gabor multipliers can be used to construct many sound morphing trajectories, and we propose an extension.
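As background for the notion of Gabor multiplier used in this abstract, the standard definition (notation ours, not taken from the thesis) is

$$ M_{\mathbf m} f \;=\; \sum_{\lambda \in \Lambda} \mathbf m(\lambda)\, \langle f, g_\lambda \rangle\, \tilde g_\lambda, $$

where \(\{g_\lambda\}\) and \(\{\tilde g_\lambda\}\) are dual Gabor frames indexed by the time-frequency points \(\lambda\), and \(\mathbf m\) is the Gabor mask: the signal is analyzed, each coefficient is scaled by the mask value, and the result is resynthesized.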
Renault, Lenny. "Neural audio synthesis of realistic piano performances." Electronic Thesis or Diss., Sorbonne université, 2024. http://www.theses.fr/2024SORUS196.
Musician and instrument make up a central duo in the musical experience. Inseparable, they are the key actors of the musical performance, transforming a composition into an emotional auditory experience. To this end, the instrument is a sound device that the musician controls to transcribe and share their understanding of a musical work. Access to the sound of such instruments, often the result of advanced craftsmanship, and to the mastery of playing them, can require extensive resources that limit the creative exploration of composers. This thesis explores the use of deep neural networks to reproduce the subtleties introduced by the musician's playing and the sound of the instrument, making the music realistic and alive. Focusing on piano music, the conducted work has led to a sound synthesis model for the piano, as well as an expressive performance rendering model. DDSP-Piano, the piano synthesis model, is built upon the hybrid approach of Differentiable Digital Signal Processing (DDSP), which enables the inclusion of traditional signal processing tools into a deep learning model. The model takes symbolic performances as input and explicitly includes instrument-specific knowledge, such as inharmonicity, tuning, and polyphony. This modular, lightweight, and interpretable approach synthesizes sounds of realistic quality while separating the various components that make up the piano sound. As for the performance rendering model, the proposed approach enables the transformation of MIDI compositions into symbolic expressive interpretations. In particular, thanks to an unsupervised adversarial training, it stands out from previous works by not relying on aligned score-performance training pairs to reproduce expressive qualities. The combination of the sound synthesis and performance rendering models would enable the synthesis of expressive audio interpretations of scores, while enabling modification of the generated interpretations in the symbolic domain.
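The inharmonicity mentioned among the instrument-specific ingredients follows, in classical piano acoustics, the stiff-string law (given here as general background, not as the exact formulation used by DDSP-Piano):

$$ f_n \;=\; n\, f_0 \sqrt{1 + B\,n^2}, $$

where \(f_n\) is the frequency of the \(n\)-th partial, \(f_0\) the nominal fundamental, and \(B\) a small inharmonicity coefficient depending on string stiffness; partials are progressively sharpened as \(n\) grows.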
Molina, Villota Daniel Hernán. "Vocal audio effects : tuning, vocoders, interaction." Electronic Thesis or Diss., Sorbonne université, 2024. http://www.theses.fr/2024SORUS166.
This research focuses on the use of digital audio effects (DAFx) on vocal tracks in modern music, mainly pitch correction and vocoding. Despite its widespread use, there has not been enough discussion on how to improve autotune or on what makes a pitch modification more musically interesting. A taxonomic analysis of vocal effects has been conducted, demonstrating examples of how they can preserve or transform vocal identity and their musical use, particularly with pitch modification. Furthermore, a compendium of technical-musical terms has been developed to distinguish types of vocal tuning and cases of pitch correction. Additionally, a graphical correction method for vocal pitch correction is proposed. This method is validated with theoretical pitch curves (supported by audio) and compared with a reference method. Although the vocoder is essential for pitch correction, there is a lack of descriptive and comparative basis for vocoding techniques. Therefore, a sonic description of the vocoder is proposed, given its use for tuning, employing four different techniques: Antares, Retune, World, and Circe. Subsequently, a subjective psychoacoustic evaluation is conducted to compare the four systems in the following cases: original tone resynthesis, soft vocal correction, and extreme vocal correction. This psychoacoustic evaluation seeks to understand the coloring of each vocoder (preservation of vocal identity) and the role of melody in extreme vocal correction. Furthermore, a protocol for the subjective evaluation of pitch correction methods is proposed and implemented. This protocol compares our DPW pitch correction method with the ATA reference method. This study aims to determine whether there are perceptual differences between the systems and in which cases they occur, which is useful for developing new melodic modification methods in the future. Finally, the interactive use of vocal effects has been explored, capturing hand movement with wireless sensors and mapping it to control effects that modify the perception of space and vocal melody.
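As a toy illustration of the simplest form of pitch correction discussed here, the sketch below snaps an estimated fundamental frequency to the nearest equal-tempered semitone; real tuning tools (and the methods compared in the thesis) smooth this correction over time and preserve expressive deviations.

```python
# Hard quantization of a fundamental frequency to the nearest equal-tempered
# semitone (illustrative only; not the DPW or ATA method from the thesis).
import numpy as np

def snap_to_semitone(f0_hz):
    midi = 69 + 12 * np.log2(f0_hz / 440.0)              # Hz -> MIDI note number
    return 440.0 * 2.0 ** ((np.round(midi) - 69) / 12)   # back to Hz

print(snap_to_semitone(452.0))  # ~440.0 Hz, i.e. snapped to A4
```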
Meynard, Adrien. "Stationnarités brisées : approches à l'analyse et à la synthèse." Thesis, Aix-Marseille, 2019. http://www.theses.fr/2019AIXM0475.
Nonstationarity characterizes transient physical phenomena. For example, it may be caused by the speed variation of an accelerating engine. Similarly, because of the Doppler effect, a stationary sound emitted by a moving source is perceived as nonstationary by a motionless observer. These examples lead us to consider a class of nonstationary signals formed from stationary signals whose stationarity has been broken by a physically relevant deformation operator. After describing the considered deformation models (chapter 1), we present different methods that extend spectral analysis and synthesis to such signals. The spectral estimation amounts to determining simultaneously the spectrum of the underlying stationary process and the deformation breaking its stationarity. To this end, we consider representations of the signal in which this deformation is characterized by a simple operation. Thus, in chapter 2, we are interested in the analysis of locally deformed signals. The deformation describing these signals is simply expressed as a displacement of the wavelet coefficients in the time-scale domain. We take advantage of this property to develop a method for the estimation of these displacements. Then, we propose an instantaneous spectrum estimation algorithm, named JEFAS. In chapter 3, we extend this spectral analysis to multi-sensor signals, where the deformation operator takes a matrix form. This is a doubly nonstationary blind source separation problem. In chapter 4, we propose a synthesis approach to study locally deformed signals. Finally, in chapter 5, we construct a time-frequency representation adapted to the description of locally harmonic signals.
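A minimal way to write the deformation model discussed in this abstract (our notation, under the simplifying assumption that only a time warp acts on the signal) is

$$ y(t) \;=\; x\big(\gamma(t)\big), $$

where \(x\) is a stationary process and \(\gamma\) a smooth, increasing time warp; a moving source heard through the Doppler effect corresponds to a \(\gamma\) whose derivative varies slowly with time. Jointly estimating the spectrum of \(x\) and the warp \(\gamma\) is the kind of problem the thesis addresses.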
Nistal, Hurlé Javier. "Exploring generative adversarial networks for controllable musical audio synthesis." Electronic Thesis or Diss., Institut polytechnique de Paris, 2022. http://www.theses.fr/2022IPPAT009.
Audio synthesizers are electronic musical instruments that generate artificial sounds under some parametric control. While synthesizers have evolved since they were popularized in the 70s, two fundamental challenges are still unresolved: 1) the development of synthesis systems responding to semantically intuitive parameters; 2) the design of "universal," source-agnostic synthesis techniques. This thesis researches the use of Generative Adversarial Networks (GAN) towards building such systems. The main goal is to research and develop novel tools for music production that afford intuitive and expressive means of sound manipulation, e.g., by controlling parameters that respond to perceptual properties of the sound and other high-level features. Our first work studies the performance of GANs when trained on various common audio signal representations (e.g., waveform, time-frequency representations). These experiments compare different forms of audio data in the context of tonal sound synthesis. Results show that the Magnitude and Instantaneous Frequency of the phase and the complex-valued Short-Time Fourier Transform achieve the best results. Building on this, our following work presents DrumGAN, a controllable adversarial audio synthesizer of percussive sounds. By conditioning the model on perceptual features describing high-level timbre properties, we demonstrate that intuitive control can be gained over the generation process. This work results in the development of a VST plugin generating full-resolution audio and compatible with any Digital Audio Workstation (DAW). We show extensive musical material produced by professional artists from Sony ATV using DrumGAN. The scarcity of annotations in musical audio datasets challenges the application of supervised methods to conditional generation settings. Our third contribution employs a knowledge distillation approach to extract such annotations from a pre-trained audio tagging system. DarkGAN is an adversarial synthesizer of tonal sounds that employs the output probabilities of such a system (so-called “soft labels”) as conditional information. Results show that DarkGAN can respond moderately to many intuitive attributes, even with out-of-distribution input conditioning. Applications of GANs to audio synthesis typically learn from fixed-size two-dimensional spectrogram data, analogously to the "image data" in computer vision; thus, they cannot generate sounds with variable duration. In our fourth paper, we address this limitation by exploiting a self-supervised method for learning discrete features from sequential data. Such features are used as conditional input to provide step-wise, time-dependent information to the model. Global consistency is ensured by fixing the input noise z (characteristic in adversarial settings). Results show that, while models trained on a fixed-size scheme obtain better audio quality and diversity, ours can competently generate audio of any duration. One interesting direction for research is the generation of audio conditioned on preexisting musical material, e.g., the generation of a drum pattern given the recording of a bass line. Our fifth paper explores a simple pretext task tailored to learning such types of complex musical relationships. Concretely, we study whether a GAN generator, conditioned on highly compressed MP3 musical audio signals, can generate outputs resembling the original uncompressed audio.
Results show that the GAN can improve the quality of the audio signals over the MP3 versions for very high compression rates (16 and 32 kbit/s). As a direct consequence of applying artificial intelligence techniques in musical contexts, we ask how AI-based technology can foster innovation in musical practice. Therefore, we conclude this thesis by providing a broad perspective on the development of AI tools for music production, informed by theoretical considerations and reports from real-world AI tool usage by professional artists.
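To make the conditioning idea running through this abstract concrete, here is a schematic feature-conditioned generator in PyTorch; sizes and layer choices are arbitrary placeholders, and this is not DrumGAN's or DarkGAN's actual architecture.

```python
# Schematic conditional GAN generator: a noise vector is concatenated with a
# vector of conditioning features (e.g. perceptual timbre descriptors or tag
# probabilities) and mapped to a short waveform. Illustrative sketch only.
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    def __init__(self, z_dim=64, cond_dim=7, out_len=16384):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + cond_dim, 512), nn.ReLU(),
            nn.Linear(512, out_len), nn.Tanh(),   # waveform samples in [-1, 1]
        )

    def forward(self, z, cond):
        return self.net(torch.cat([z, cond], dim=-1))

gen = CondGenerator()
z = torch.randn(4, 64)        # batch of latent noise vectors
cond = torch.rand(4, 7)       # batch of conditioning feature values in [0, 1]
fake_audio = gen(z, cond)     # (4, 16384) generated waveforms
```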
Tiger, Guillaume. "Synthèse sonore d'ambiances urbaines pour les applications vidéoludiques." Thesis, Paris, CNAM, 2014. http://www.theses.fr/2015CNAM0968/document.
In video gaming and interactive media, the making of complex sound ambiences relies heavily on the available memory and computational resources, so a compromise is necessary regarding the choice of audio material and its treatment in order to reach immersive and credible real-time ambiences. Alternatively, the use of procedural audio techniques, i.e. the generation of audio content relative to the data provided by the virtual scene, has increased in recent years. Procedural methodologies seem appropriate to sonify complex environments such as virtual cities. In this thesis we specifically focus on the creation of interactive urban sound ambiences. Our analysis of these ambiences is based on the Soundscape theory and on a state of the art of game-oriented urban interactive applications. We infer that the virtual urban soundscape is made of several perceptive auditory grounds, including a background. As a first contribution we define the morphological and narrative properties of such a background. We then consider the urban background sound as a texture and propose, as a second contribution, to pinpoint, specify and prototype a granular synthesis tool dedicated to interactive urban sound backgrounds. The synthesizer prototype is created using the visual programming language Pure Data. On the basis of our state of the art, we include an urban ambiences recording methodology to feed the granular synthesis. Finally, two validation steps regarding the prototype are described: the integration into the virtual city simulation Terra Dynamica on the one hand, and a perceptive listening comparison test on the other.
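The granular synthesis approach chosen for the background textures can be illustrated in a few lines; the sketch below (not the Pure Data prototype from the thesis) overlap-adds randomly chosen, Hann-windowed grains taken from a source recording.

```python
# Minimal granular texture generator: random grains from a source signal are
# windowed and summed at random output positions. Illustrative sketch only.
import numpy as np

def granular_texture(source, out_len, grain_len=4410, n_grains=200, seed=0):
    rng = np.random.default_rng(seed)
    window = np.hanning(grain_len)
    out = np.zeros(out_len + grain_len)
    for _ in range(n_grains):
        start = rng.integers(0, len(source) - grain_len)   # where the grain is read
        pos = rng.integers(0, out_len)                      # where it is written
        out[pos:pos + grain_len] += source[start:start + grain_len] * window
    return out[:out_len]

source = np.random.randn(44100 * 5)                  # stand-in for an urban recording
texture = granular_texture(source, out_len=44100 * 2)
```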
Musti, Utpala. "Synthèse acoustico-visuelle de la parole par sélection d'unités bimodales." Thesis, Université de Lorraine, 2013. http://www.theses.fr/2013LORR0003.
This work deals with audio-visual speech synthesis. In the vast literature available in this direction, many approaches deal with it by dividing it into two synthesis problems: acoustic speech synthesis on the one hand, and the generation of the corresponding facial animation on the other. However, this does not guarantee perfectly synchronous and coherent audio-visual speech. To implicitly overcome this drawback, we propose a different approach to acoustic-visual speech synthesis based on the selection of naturally synchronous bimodal units. The synthesis is based on the classical unit selection paradigm. The main idea behind this synthesis technique is to keep the natural association between the acoustic and visual modalities intact. We describe the audio-visual corpus acquisition technique and the database preparation for our system. We present an overview of our system and detail the various aspects of bimodal unit selection that need to be optimized for good synthesis. The main focus of this work is to synthesize the speech dynamics well rather than a comprehensive talking head. We describe the visual target features that we designed. We subsequently present an algorithm for target feature weighting. This algorithm performs target feature weighting and redundant feature elimination iteratively, based on the comparison between a target-cost-based ranking and a distance computed from the acoustic and visual speech signals of the units in the corpus. Finally, we present the perceptual and subjective evaluation of the final synthesis system. The results show that we have achieved the goal of synthesizing the speech dynamics reasonably well.
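For readers less familiar with the unit selection paradigm invoked here, the standard formulation (notation ours, not the thesis's exact cost functions) selects the unit sequence minimizing a weighted target cost plus a concatenation cost:

$$ \{u_n\}^{\ast} \;=\; \arg\min_{\{u_n\}} \;\sum_{n=1}^{N}\Big( \sum_{j} w_j\, C^{t}_{j}(t_n, u_n) \;+\; C^{c}(u_{n-1}, u_n) \Big), $$

where \(t_n\) is the \(n\)-th target, \(C^{t}_{j}\) are the per-feature target sub-costs with weights \(w_j\) (the quantities tuned by the iterative weighting algorithm mentioned above), and \(C^{c}\) penalizes poor joins between consecutive units.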
Caillon, Antoine. "Hierarchical temporal learning for multi-instrument and orchestral audio synthesis." Electronic Thesis or Diss., Sorbonne université, 2023. http://www.theses.fr/2023SORUS115.
Recent advances in deep learning have offered new ways to build models addressing a wide variety of tasks through the optimization of a set of parameters based on minimizing a cost function. Amongst these techniques, probabilistic generative models have yielded impressive advances in text, image and sound generation. However, musical audio signal generation remains a challenging problem. This comes from the complexity of audio signals themselves, since a single second of raw audio spans tens of thousands of individual samples. Modeling musical signals is even more challenging as important information is structured across different time scales, from micro (e.g. timbre, transient, phase) to macro (e.g. genre, tempo, structure) information. Modeling every scale at once would require large architectures, precluding the use of the resulting models in real-time setups for computational complexity reasons. In this thesis, we study how a hierarchical approach to audio modeling can address the musical signal modeling task, while offering different levels of control to the user. Our main hypothesis is that extracting different representation levels of an audio signal allows us to abstract the complexity of lower levels at each modeling stage. This would eventually allow the use of lightweight architectures, each modeling a single audio scale. We start by addressing raw audio modeling by proposing an audio model combining Variational Auto Encoders and Generative Adversarial Networks, yielding high-quality 48kHz neural audio synthesis while being 20 times faster than real time on CPU. Then, we study how autoregressive models can be used to understand the temporal behavior of the representation yielded by this low-level audio model, using optional additional conditioning signals such as acoustic descriptors or tempo. Finally, we propose a method for using all the proposed models directly on audio streams, allowing their use in the real-time applications that we developed during this thesis. We conclude by presenting various creative collaborations led in parallel with this work with several composers and musicians, directly integrating the current state of the proposed technologies inside musical pieces.
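The hierarchical hypothesis described above (a low-level audio autoencoder plus an autoregressive model over its latent sequence) can be sketched schematically as follows; module sizes, frame length and latent dimension are arbitrary placeholders, and this is not the thesis's architecture.

```python
# Two-stage hierarchy: a frame-level autoencoder compresses audio into a low-rate
# latent sequence, and a small recurrent prior predicts the next latent frame.
import torch
import torch.nn as nn

FRAME, LATENT = 2048, 16                      # samples per frame, latent size (assumed)

encoder = nn.Sequential(nn.Linear(FRAME, 256), nn.ReLU(), nn.Linear(256, LATENT))
decoder = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(), nn.Linear(256, FRAME))
prior = nn.GRU(input_size=LATENT, hidden_size=64, batch_first=True)
prior_head = nn.Linear(64, LATENT)

audio = torch.randn(1, 10, FRAME)             # 10 frames of placeholder audio
z = encoder(audio)                            # (1, 10, LATENT) latent sequence
recon = decoder(z)                            # reconstruction by the low-level model
h, _ = prior(z[:, :-1])                       # the prior reads past latents...
z_next = prior_head(h[:, -1])                 # ...and predicts the next latent frame
```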
Douwes, Constance. "On the Environmental Impact of Deep Generative Models for Audio." Electronic Thesis or Diss., Sorbonne université, 2023. http://www.theses.fr/2023SORUS074.
In this thesis, we investigate the environmental impact of deep learning models for audio generation and aim to put computational cost at the core of the evaluation process. In particular, we focus on different types of deep learning models specialized in raw waveform audio synthesis. These models are now a key component of modern audio systems, and their use has increased significantly in recent years. Their flexibility and generalization capabilities make them powerful tools in many contexts, from text-to-speech synthesis to unconditional audio generation. However, these benefits come at the cost of expensive training sessions on large amounts of data, operated on energy-intensive dedicated hardware, which incurs large greenhouse gas emissions. The measures we use as a scientific community to evaluate our work are at the heart of this problem. Currently, deep learning researchers evaluate their work primarily based on improvements in accuracy, log-likelihood, reconstruction, or opinion scores, all of which overshadow the computational cost of generative models. Therefore, we propose a new methodology based on Pareto optimality to help the community better evaluate the significance of their work while bringing energy footprint -- and ultimately carbon emissions -- to the same level of interest as sound quality. In the first part of this thesis, we present a comprehensive report on the use of various evaluation measures of deep generative models for audio synthesis tasks. Even though computational efficiency is increasingly discussed, quality measurements are the most commonly used metrics to evaluate deep generative models, while energy consumption is almost never mentioned. Therefore, we address this issue by estimating the carbon cost of training generative models and comparing it to other noteworthy carbon costs to demonstrate that it is far from insignificant. In the second part of this thesis, we propose a large-scale evaluation of pervasive neural vocoders, a class of generative models used for speech generation conditioned on mel-spectrograms. We introduce a multi-objective analysis based on Pareto optimality of both quality, from human-based evaluation, and energy consumption. Within this framework, we show that lighter models can perform better than more costly ones. By proposing to rely on a novel definition of efficiency, we intend to provide practitioners with a decision basis for choosing the best model based on their requirements. In the last part of the thesis, we propose a method to reduce the inference costs of neural vocoders, based on quantized neural networks. We show a significant gain in memory size and give some hints for the future use of these models on embedded hardware. Overall, we provide keys to better understand the impact of deep generative models for audio synthesis as well as a new framework for developing models while accounting for their environmental impact. We hope that this work raises awareness of the need to investigate energy-efficient models simultaneously with high perceived quality.
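The Pareto-optimality criterion used throughout this abstract is easy to make concrete: a model is on the front if no other model is at least as good on both axes and strictly better on one. A toy sketch (made-up quality/energy values, not data from the thesis):

```python
# Toy Pareto-front extraction over (quality, energy) pairs; higher quality and
# lower energy are better. Values below are invented for illustration.
def pareto_front(models):
    front = []
    for name, quality, energy in models:
        dominated = any(
            q >= quality and e <= energy and (q > quality or e < energy)
            for _, q, e in models
        )
        if not dominated:
            front.append(name)
    return front

models = [("A", 4.1, 120.0), ("B", 4.3, 300.0), ("C", 3.9, 90.0), ("D", 4.0, 250.0)]
print(pareto_front(models))   # ['A', 'B', 'C'] -- D is dominated by A
```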
Andreux, Mathieu. "Foveal autoregressive neural time-series modeling." Electronic Thesis or Diss., Paris Sciences et Lettres (ComUE), 2018. http://www.theses.fr/2018PSLEE073.
This dissertation studies unsupervised time-series modelling. We first focus on the problem of linearly predicting future values of a time series under the assumption of long-range dependencies, which requires taking a large past into account. We introduce a family of causal and foveal wavelets which project past values onto a subspace adapted to the problem, thereby reducing the variance of the associated estimators. We then investigate under which conditions non-linear predictors exhibit better performance than linear ones. Time series which admit a sparse time-frequency representation, such as audio signals, satisfy those requirements, and we propose a prediction algorithm using such a representation. The last problem we tackle is audio time-series synthesis. We propose a new generation method relying on a deep convolutional neural network with an encoder-decoder architecture, which allows the synthesis of new realistic signals. Contrary to state-of-the-art methods, we explicitly use time-frequency properties of sounds to define an encoder with the scattering transform, while the decoder is trained to solve an inverse problem in an adapted metric.
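For context, the linear prediction problem considered in the first part has the standard form (our notation):

$$ \hat x(t) \;=\; \sum_{k=1}^{p} a_k\, x(t-k), \qquad \{a_k\} \;=\; \arg\min\; \mathbb E\big[\,|x(t) - \hat x(t)|^{2}\,\big], $$

and the long-range-dependence assumption means the order \(p\) must be large, which is precisely why projecting the past onto a well-chosen low-dimensional subspace (here, foveal wavelets) reduces the variance of the estimators.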
Gonon, Gilles. "Proposition d'un schéma d'analyse/synthèse adaptatif dans le plan temps-fréquence basé sur des critères entropiques : application au codage audio par transformée." Le Mans, 2002. http://cyberdoc.univ-lemans.fr/theses/2002/2002LEMA1004.pdf.
Adaptive representations contribute to the study and characterization of the information carried by a signal. In this work, we present a new decomposition which uses separate segmentation criteria in time and frequency to improve the adaptivity of the analysis to the signal. This scheme is applied to a transform-based perceptual audio coder. The signal is first temporally segmented using a local entropic criterion. Based upon an estimator of the local entropy, the segmentation criterion reflects the entropy variations in a signal and allows stationary parts to be separated from transient ones. The temporal frames thus defined are then filtered in frequency using the wavelet packet decomposition, and the adaptation is performed by means of the best basis search algorithm. An extension of the library of dyadic bases is derived to improve the entropic gain over the signal and thus the adaptivity of the decomposition. The perceptual audio coder we developed follows an original design in order to include the proposed scheme; its complete implementation is described in the document. This coder is evaluated with subjective tests, performed as absolute and blind comparisons at a rate of 96 kbps. As many parts of our coder can still be improved, results show a subjective quality equivalent to the tested standard and not yet fully transparent with respect to the original sounds.
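The local entropy driving the temporal segmentation can be illustrated with a simple frame-wise spectral entropy (one possible choice; not the exact estimator developed in the thesis):

```python
# Frame-wise spectral entropy of a signal: low values indicate energy concentrated
# in few frequency bins (tonal/stationary), high values a flatter spectrum.
import numpy as np

def frame_entropy(x, frame=1024, hop=512):
    ent = []
    for i in range(0, len(x) - frame, hop):
        spec = np.abs(np.fft.rfft(x[i:i + frame] * np.hanning(frame))) ** 2
        p = spec / (spec.sum() + 1e-12)              # normalize to a distribution
        ent.append(-np.sum(p * np.log2(p + 1e-12)))  # Shannon entropy in bits
    return np.array(ent)

x = np.random.randn(44100)        # placeholder signal
print(frame_entropy(x)[:5])
```

A change detector run on such an entropy curve is one way to place segment boundaries between stationary and transient regions.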
Sini, Aghilas. "Caractérisation et génération de l’expressivité en fonction des styles de parole pour la construction de livres audio." Thesis, Rennes 1, 2020. http://www.theses.fr/2020REN1S026.
In this thesis, we study the expressivity of read speech using a particular type of data: audiobooks. Audiobooks are audio recordings of literary works made by professionals (actors, singers, professional narrators) or by amateurs. These recordings may be intended for a particular audience (blind or visually impaired people). The availability of this kind of data in large quantities and with sufficient quality has attracted the attention of the research community in automatic speech and language processing in general, and of researchers specialized in expressive speech synthesis systems in particular. We propose in this thesis to study three elementary components of expressivity conveyed by audiobooks: emotion, variations related to discursive changes, and speaker properties. We treat these patterns from a prosodic point of view. The main contributions of this thesis are: the construction of a corpus of audiobooks with a large number of recordings partially annotated by an expert, a quantitative study characterizing the emotions in this type of data, the construction of a model based on machine learning techniques for the automatic annotation of discourse types, and finally a vector representation of the prosodic identity of a speaker in the framework of parametric statistical speech synthesis.
Pages, Guilhem. "Zones d’écoute personnalisées mobiles par approches adaptatives." Electronic Thesis or Diss., Le Mans, 2024. http://www.theses.fr/2024LEMA1012.
The thesis deals with the creation of mobile sound zones using adaptive approaches. The methods used to create sound zones aim to jointly solve, with an array of loudspeakers, the reproduction of a target sound field in one zone and the minimisation of the energy of the signal reproduced in the other zone. The thesis is divided into two parts: the estimation of impulse responses, and moving sound zones. The aim of this thesis is to create two zones in space with a controlled sound field, which can move in space over time. In the first part, the estimation of the system's impulse responses is detailed, a necessary prerequisite for sound zone algorithms. Building on existing adaptive methods for estimating time-varying multi-input, multi-output systems, a new method for the acoustic, multiple-input single-output (MISO) case is presented. This method, called MISO-Autostep, makes it possible to estimate impulse responses over time without having to fine-tune any parameters. In the second part, the BACC-PM sound zone algorithm is rewritten in recursive form. This ability to update the filter coefficients over time opens up the possibility of adapting to temporal changes in the system geometry. Finally, preliminary results are presented with the joint use of the two adaptive algorithms in the case of an abrupt change in the system geometry.
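As a baseline for the adaptive impulse-response estimation discussed here, a standard normalized LMS (NLMS) identification loop is sketched below; it is given only as the classical single-channel reference, not as the MISO-Autostep algorithm of the thesis.

```python
# NLMS identification of a single FIR impulse response from input x and
# desired signal d (here d is synthesized by filtering x with a known response).
import numpy as np

def nlms(x, d, L=64, mu=0.5, eps=1e-8):
    w = np.zeros(L)
    for n in range(L - 1, len(x)):
        u = x[n - L + 1:n + 1][::-1]          # most recent L input samples
        e = d[n] - w @ u                      # a priori error
        w += mu * e * u / (u @ u + eps)       # normalized gradient update
    return w

rng = np.random.default_rng(0)
h_true = rng.standard_normal(64) * np.exp(-np.arange(64) / 10.0)
x = rng.standard_normal(20000)
d = np.convolve(x, h_true)[:len(x)]
h_est = nlms(x, d)
print(np.linalg.norm(h_est - h_true) / np.linalg.norm(h_true))  # small misalignment
```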
Daudet, Laurent. "Représentations structurelles de signaux audiophoniques : méthodes hybrides pour des applications à la compression." Aix-Marseille 1, 2000. http://www.theses.fr/2000AIX11056.
Повний текст джерелаMusti, Utpala. "Synthèse Acoustico-Visuelle de la Parole par Séléction d'Unités Bimodales." Phd thesis, Université de Lorraine, 2013. http://tel.archives-ouvertes.fr/tel-00927121.
Lostanlen, Vincent. "Opérateurs convolutionnels dans le plan temps-fréquence." Thesis, Paris Sciences et Lettres (ComUE), 2017. http://www.theses.fr/2017PSLEE012/document.
This dissertation addresses audio classification by designing signal representations which satisfy appropriate invariants while preserving inter-class variability. First, we study time-frequency scattering, a representation which extracts modulations at various scales and rates, in a similar way to idealized models of spectrotemporal receptive fields in auditory neuroscience. We report state-of-the-art results in the classification of urban and environmental sounds, thus outperforming short-term audio descriptors and deep convolutional networks. Secondly, we introduce spiral scattering, a representation which combines wavelet convolutions along time, along log-frequency, and across octaves. Spiral scattering follows the geometry of the Shepard pitch spiral, which makes a full turn at every octave. We study voiced sounds with a nonstationary source-filter model where both the source and the filter are transposed through time, and show that spiral scattering disentangles and linearizes these transpositions. Furthermore, spiral scattering reaches state-of-the-art results in musical instrument classification of solo recordings. Aside from audio classification, time-frequency scattering and spiral scattering can be used as summary statistics for audio texture synthesis. We find that, unlike the previously existing temporal scattering transform, time-frequency scattering is able to capture the coherence of spectrotemporal patterns, such as those arising in bioacoustics or speech, up to an integration scale of about 500 ms. Based on this analysis-synthesis framework, an artistic collaboration with composer Florian Hecker has led to the creation of five computer music pieces.
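For orientation, the temporal scattering transform that the abstract contrasts with time-frequency scattering computes, in its standard form (notation ours),

$$ S_1 x(t, \lambda_1) = \big(|x \ast \psi_{\lambda_1}| \ast \phi\big)(t), \qquad S_2 x(t, \lambda_1, \lambda_2) = \big(\,\big||x \ast \psi_{\lambda_1}| \ast \psi_{\lambda_2}\big| \ast \phi\,\big)(t), $$

where the \(\psi_\lambda\) are wavelets and \(\phi\) a low-pass averaging window; time-frequency scattering additionally convolves the first-order modulus along the log-frequency axis, which is what lets it capture joint spectrotemporal patterns.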
Hennequin, Romain. "Décomposition de spectrogrammes musicaux informée par des modèles de synthèse spectrale : modélisation des variations temporelles dans les éléments sonores." Phd thesis, Télécom ParisTech, 2011. http://pastel.archives-ouvertes.fr/pastel-00648997.
Bazin, Théis. "Designing novel time-frequency scales for interactive music creation with hierarchical statistical modeling." Electronic Thesis or Diss., Sorbonne université, 2023. http://www.theses.fr/2023SORUS242.
Modern musical creation unfolds on many different time scales: from the vibration of a string or the resonance of an electronic instrument at the millisecond scale, through the few seconds typical of an instrument's note, to the tens of minutes of operas or DJ sets. The interleaving of these multiple scales has led to the development of numerous technical and theoretical tools to ease the manipulation of time. These abstractions, such as scales, rhythmic notations, or even usual models of audio synthesis, largely infuse current tools -- software and hardware -- for musical creation. However, these abstractions, which emerged for the most part during the 20th century in the West on the basis of classical theories of written music, are not devoid of cultural a priori. They reflect various principles aimed at abstracting away certain aspects of the music (for example, micro-deviations with respect to a metronomic time, or micro-deviations of frequency with respect to an idealized pitch) whose high degree of physical variability makes them typically inconvenient for musical writing. These compromises, typically relevant when the written music is intended to be performed by musicians who are able to reintroduce variations and physical and musical richness, are however limiting in the context of computer-assisted music creation, with computers coldly rendering these coarse abstractions, and they tend to restrict the diversity of the music that can be produced with these tools. Through a review of several typical interfaces for music creation, I show that an essential factor is the scale of the human-machine interactions proposed by these abstractions. At their most flexible level, such as audio representations or piano-roll representations with unquantized time, they prove difficult to manipulate, as they require a high degree of precision that is particularly unsuitable for modern mobile and touch devices. On the other hand, the most commonly used abstractions with discretized time, such as scores or sequencers, prove too constraining for the creation of culturally diverse music that does not follow the proposed time and pitch grids. In this thesis, I argue that artificial intelligence, through its ability to build high-level representations of complex objects, allows the construction of new scales of music creation, designed for interaction, and thus enables radically new approaches to music creation. I present and illustrate this idea through the design and implementation of three web-based prototypes of music creation assisted by artificial intelligence, one of which is based on a new neural model for the inpainting of musical instrument sounds, also designed in the framework of this thesis. These high-level representations -- for sheet music, piano-rolls, and spectrograms -- are deployed at a time-frequency scale coarser than the original data, but better suited to interaction. By allowing localized transformations on these representations while also capturing, through statistical modeling, the aesthetic specificities and fine micro-variations of the original musical training data, these tools make it possible to obtain musically rich results easily and controllably. Through the evaluation of these three prototypes in real conditions by several artists, I show that these new scales of interactive creation are useful for both experts and novices.
Thanks to the assistance of AI on technical aspects that normally require precision and expertise, they are also suitable for use on touch screens and mobile devices.
Coulibaly, Patrice Yefoungnigui. "Codage audio à bas débit avec synthèse sinusoïdale." Sherbrooke : Université de Sherbrooke, 2001.
Chemla Romeu Santos, Axel Claude André. "Manifold Representations of Musical Signals and Generative Spaces." Doctoral thesis, Università degli Studi di Milano, 2020. http://hdl.handle.net/2434/700444.
Among the diverse research fields within computer music, the synthesis and generation of audio signals epitomize the cross-disciplinarity of this domain, jointly nourishing scientific and artistic practices since its creation. Inherent in computer music since its genesis, audio generation has inspired numerous approaches, evolving both with musical practices and with scientific and technical advances. Moreover, some synthesis processes also naturally handle the reverse process, named analysis, such that synthesis parameters can be partially or totally extracted from actual sounds, providing an alternative representation of the analyzed audio signals. On top of that, the recent rise of machine learning algorithms has earnestly questioned the field of scientific research, bringing powerful data-centred methods that raised several epistemological questions amongst researchers, in spite of their efficiency. In particular, a family of machine learning methods, called generative models, focuses on the generation of original content using features extracted from an existing dataset. Such methods not only question previous approaches to generation, but also the way these methods can be integrated into existing creative processes. Yet, while these new generative frameworks are progressively being introduced in the domain of image generation, the application of such generative techniques in audio synthesis is still marginal. In this work, we aim to propose a new audio analysis-synthesis framework based on these modern generative models, enhanced by recent advances in machine learning. We first review existing approaches, both in sound synthesis and in generative machine learning, and focus on how our work fits into both practices and what can be expected from bringing them together. Subsequently, we focus on generative models, and on how modern advances in the domain can be exploited to learn complex sound distributions, while remaining sufficiently flexible to be integrated into the creative flow of the user. We then propose an inference/generation process, mirroring the analysis/synthesis paradigms that are natural in the audio processing domain, using latent models based on a continuous higher-level space which we use to control the generation. We first provide preliminary results of our method applied to spectral information extracted from several datasets, and evaluate the obtained results both qualitatively and quantitatively. Subsequently, we study how to make these methods more suitable for learning audio data, tackling successively three different aspects. First, we propose two different latent regularization strategies specifically designed for audio, based on signal/symbol translation and on perceptual constraints. Then, we propose different methods to address the inner temporality of musical signals, based on the extraction of multi-scale representations and on prediction, so that the obtained generative spaces also model the dynamics of the signal. In a last chapter, we shift from a purely scientific approach to a more research-and-creation-oriented point of view: first, we describe the architecture and design of our open-source library, vsacids, aiming to be used by expert and non-expert music makers as an integrated creation tool.
Then, we propose a first musical use of our system through the creation of a real-time performance, called aego, based jointly on our framework vsacids and on an exploratory agent trained during the performance through reinforcement learning. Finally, we draw some conclusions on the different ways to improve and strengthen the proposed generation method, as well as on possible further creative applications.
Roche, Fanny. "Music sound synthesis using machine learning : Towards a perceptually relevant control space." Thesis, Université Grenoble Alpes, 2020. http://www.theses.fr/2020GRALT034.
One of the main challenges of the synthesizer market and of research in sound synthesis nowadays lies in proposing new forms of synthesis allowing the creation of brand new sonorities while offering musicians more intuitive and perceptually meaningful controls to help them reach the perfect sound more easily. Indeed, today's synthesizers are very powerful tools that provide musicians with a considerable amount of possibilities for creating sonic textures, but the control of parameters still lacks user-friendliness and may require expert knowledge about the underlying generative processes. In this thesis, we are interested in developing and evaluating new data-driven machine learning methods for music sound synthesis allowing the generation of brand new high-quality sounds while providing high-level, perceptually meaningful control parameters. The first challenge of this thesis was thus to characterize synthetic musical timbre by identifying a set of perceptual verbal descriptors that are both frequently and consensually used by musicians. Two perceptual studies were then conducted: a free verbalization test enabling us to select eight commonly used terms for describing synthesizer sounds, and a semantic scale analysis enabling us to quantitatively evaluate the use of these terms to characterize a subset of synthetic sounds, as well as to analyze how consensual they were. In a second phase, we investigated the use of machine learning algorithms to extract, from a dataset of sounds, a high-level representation space with interesting interpolation and extrapolation properties, the goal being to relate this space to the perceptual dimensions evidenced earlier. Following previous studies interested in using deep learning for music sound synthesis, we focused on autoencoder models and carried out an extensive comparative study of several kinds of autoencoders on two different datasets. These experiments, together with a qualitative analysis made with a non-real-time prototype developed during the thesis, allowed us to validate the use of such models, and in particular the variational autoencoder (VAE), as relevant tools for extracting a high-level latent space in which we can navigate smoothly and create new sounds. However, so far, no link between this latent space and the perceptual dimensions evidenced by the perceptual tests emerged naturally. As a final step, we thus tried to enforce perceptual supervision of the VAE by adding a regularization during the training phase. Using the subset of synthetic sounds used in the second perceptual test and the corresponding perceptual grades along the eight perceptual dimensions provided by the semantic scale analysis, it was possible to constrain, to a certain extent, some dimensions of the VAE high-level latent space so as to match these perceptual dimensions. A final comparative test was then conducted in order to evaluate the efficiency of this additional regularization for conditioning the model and (partially) leading to a perceptual control of music sound synthesis.
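For reference, the training objective of the variational autoencoder discussed here is, in its standard form (our notation; the thesis adds a perceptual regularization term on top of it),

$$ \mathcal L(\theta, \phi; x) \;=\; \mathbb E_{q_\phi(z\mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big), $$

maximized over decoder parameters \(\theta\) and encoder parameters \(\phi\); perceptual supervision can then be imposed by penalizing, for the rated sounds, the distance between chosen latent coordinates and the perceptual grades along the corresponding dimensions.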
Weiss, Christian [Verfasser]. "Adaptive audio-visuelle Synthese audio-visuelle Sprachsynthese : automatische Trainingsverfahren fuer Unit-Selection-basierte audio-visuelle Sprachsynthese / vorgelegt von Christian Weiss." 2007. http://d-nb.info/986546127/34.
Повний текст джерелаZAMBON, Stefano. "Accurate sound synthesis of 3D object collisions in interactive virtual scenarios." Doctoral thesis, 2012. http://hdl.handle.net/11562/407137.
This thesis investigates efficient algorithms for the synthesis of sounds produced by colliding objects, starting from a physical description of the problem. The objective of this investigation is to provide tools capable of increasing the accuracy of synthetic auditory feedback in virtual environments through a physics-based approach, hence without the need for pre-recorded sounds. Due to their versatility in dealing with complex geometries, Finite Element Methods (FEM) are chosen for the space-domain discretization of generic three-dimensional resonators. The resulting state-space representations are rearranged so as to decouple the normal modes in the corresponding equations, through the use of modal analysis/synthesis techniques; such techniques conveniently lead to computationally efficient sound synthesis algorithms. The mathematical treatment is developed up to the derivation of these algorithms. Finally, implementation examples are provided which rely only on open-source software: this companion material guarantees the reproducibility of the results and can be handled without much effort by most researchers with a background in sound processing. The original results presented in this work include: (i) efficient physics-based techniques that help implement real-time sound synthesis algorithms on common hardware; (ii) a method for the efficient management of FEM data which, working together with an expressive damping model, makes it possible to pre-compute the information characterizing a resonating object and store it in a compact data structure; (iii) a time-domain transformation of the state-space representation of second-order digital filters, allowing the exact computation of dependent variables such as resonator velocity and energy, even when simple all-pole realizations are used; (iv) an efficient multirate realization of a parallel bank of resonators, derived using a quadrature-mirror-filter (QMF) subdivision. Compared to similar works previously proposed in the literature, this realization allows for the nonlinear feedback excitation of a multirate filter bank: the key idea is to perform an adaptive state change in the resonator bank, switching the sampling rate of the resonators from a common highest value, used while processing the initial transient of the signals at full bandwidth, to a set of lower values, so as to enable a multirate realization of the same bank during the steady-state evolution of the signals.
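The modal paradigm underlying this thesis can be illustrated very simply: once the modal frequencies, decay rates and gains of a resonator are known (here invented placeholders rather than values computed from a FEM model), its impulse response is a sum of exponentially decaying sinusoids, one per mode.

```python
# Minimal modal synthesis sketch: impulse response of a bank of decaying modes.
import numpy as np

def modal_impulse_response(freqs, decays, gains, fs=44100, dur=1.0):
    t = np.arange(int(fs * dur)) / fs
    out = np.zeros_like(t)
    for f, d, g in zip(freqs, decays, gains):
        out += g * np.exp(-d * t) * np.sin(2 * np.pi * f * t)   # one mode
    return out

ir = modal_impulse_response(freqs=[220.0, 455.0, 710.0],
                            decays=[3.0, 5.0, 9.0],
                            gains=[1.0, 0.6, 0.3])
```

In practice each mode is usually implemented as a second-order digital resonator filter, which is the representation the thesis manipulates and accelerates.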