Dissertations / Theses on the topic 'Audio analysis'

To see the other types of publications on this topic, follow the link: Audio analysis.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the top 50 dissertations / theses for your research on the topic 'Audio analysis.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

CHEMLA, ROMEU SANTOS AXEL CLAUDE ANDRE'. "MANIFOLD REPRESENTATIONS OF MUSICAL SIGNALS AND GENERATIVE SPACES." Doctoral thesis, Università degli Studi di Milano, 2020. http://hdl.handle.net/2434/700444.

Full text
Abstract:
Among the diverse research fields within computer music, the synthesis and generation of audio signals epitomize the cross-disciplinarity of this domain, jointly nourishing scientific and artistic practices since its creation. Inherent in computer music since its genesis, audio generation has inspired numerous approaches, evolving with both musical practices and scientific/technical advances. Moreover, some synthesis processes also naturally handle the reverse process, named analysis, such that synthesis parameters can be partially or totally extracted from actual sounds, providing an alternative representation of the analyzed audio signals. On top of that, the recent rise of machine learning algorithms has earnestly questioned the field of scientific research, bringing powerful data-centred methods that raised several epistemological questions among researchers, in spite of their efficiency. In particular, a family of machine learning methods called generative models focuses on the generation of original content using features extracted from an existing dataset. Such methods question not only previous approaches to generation, but also the way of integrating these methods into existing creative processes. While these new generative frameworks are progressively being introduced in the domain of image generation, the application of such generative techniques to audio synthesis is still marginal. In this work, we aim to propose a new audio analysis-synthesis framework based on these modern generative models, enhanced by recent advances in machine learning. We first review existing approaches, both in sound synthesis and in generative machine learning, and focus on how our work fits into both practices and what can be expected from their combination.
Subsequently, we focus more closely on generative models and on how modern advances in the domain can be exploited to learn complex sound distributions, while remaining flexible enough to be integrated into the creative flow of the user. We then propose an inference/generation process, mirroring the analysis/synthesis paradigms that are natural in the audio processing domain, using latent models based on a continuous higher-level space that we use to control the generation. We first provide preliminary results of our method applied to spectral information extracted from several datasets, and evaluate the obtained results both qualitatively and quantitatively. Subsequently, we study how to make these methods more suitable for learning audio data, tackling three different aspects in turn. First, we propose two different latent regularization strategies specifically designed for audio, based on signal/symbol translation and on perceptual constraints. Then, we propose different methods to address the inner temporality of musical signals, based on the extraction of multi-scale representations and on prediction, which allow the obtained generative spaces to also model the dynamics of the signal. In a last chapter, we shift from a scientific approach to a more research & creation-oriented point of view: first, we describe the architecture and design of our open-source library, vsacids, aiming to be used by both expert and non-expert music makers as an integrated creation tool. Then, we propose a first musical use of our system through the creation of a real-time performance, called ægo, based jointly on our framework vsacids and on an explorative agent trained with reinforcement learning during the performance. Finally, we draw some conclusions on the different ways to improve and reinforce the proposed generation method, as well as on possible further creative applications.
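The inference/generation mirror of analysis/synthesis described in this abstract can be sketched with a deliberately simplified, linear stand-in: project spectral frames into a low-dimensional latent space, then decode latent points back into spectra. The thesis uses learned variational models; the PCA-style encoder/decoder below (with fabricated data; none of these names come from the thesis) only illustrates the idea of controlling generation from a continuous latent space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake "dataset": 100 log-magnitude spectrum frames of 64 bins each,
# generated from 3 underlying factors plus a little noise.
latent_true = rng.normal(size=(100, 3))
mixing = rng.normal(size=(3, 64))
spectra = latent_true @ mixing + 0.01 * rng.normal(size=(100, 64))

mean = spectra.mean(axis=0)
_, _, Vt = np.linalg.svd(spectra - mean, full_matrices=False)

def encode(frame, k=3):
    """'Inference': project a spectrum frame into the k-dim latent space."""
    return (frame - mean) @ Vt[:k].T

def decode(z, k=3):
    """'Generation': map a latent point back into a spectrum frame."""
    return z @ Vt[:k] + mean

z = encode(spectra[0])
recon = decode(z)
rel_err = np.linalg.norm(recon - spectra[0]) / np.linalg.norm(spectra[0])
```

Moving a point `z` around the latent space and decoding it is the (heavily simplified) analogue of the high-level generative control the thesis develops with learned models.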
APA, Harvard, Vancouver, ISO, and other styles
2

TERENZI, Alessandro. "Innovative Digital Signal Processing Methodologies for Identification and Analysis of Real Audio Systems." Doctoral thesis, Università Politecnica delle Marche, 2021. http://hdl.handle.net/11566/287822.

Full text
Abstract:
Many real-world audio systems exist, each with its own characteristics, but almost all of them share the ability to generate or modify a sound. If a natural or artificial system can be defined as a sound system, then digital signal processing techniques can be applied to study and emulate it. In this thesis, innovative digital signal processing methodologies applied to real audio systems will be discussed. In particular, three different audio systems will be considered: the world of vacuum-tube-based nonlinear audio devices, with particular attention to guitar and hi-fi amplifiers; the acoustic environment of a room and its effect on sound propagation; and finally the sound emitted by honey bees in a beehive. Regarding the first system, innovative approaches for the identification of Volterra series and Hammerstein models will be proposed, in particular an approach to overcome some limitations of Volterra series identification. The application of a sub-band structure to reduce the computational cost and increase the convergence speed of an adaptive Hammerstein model identification will be proposed as well. Finally, an innovative approach for estimating several distortion parameters from a single measurement, exploiting a generalized Hammerstein model, will be presented. For the second system, the results of applying a multi-point equalizer in two different situations will be presented. In the first case, it will be shown that multi-point equalization can be used not only to compensate the acoustical anomalies of a room, but also to improve the frequency response of vibration transducers mounted on a rigid panel. The second contribution will show how a sub-band approach can reduce the computational cost and increase the speed of an adaptive algorithm for multi-point, multi-channel equalization.
Finally, the focus will be on a natural sound system, namely a honey bee colony. In this case, an innovative acquisition system for monitoring honey bee sound will be presented. Then, the approaches developed for sound analysis will be described and applied to sounds recorded in two different real-world situations, and the results obtained by studying the sound with classification algorithms will be reported. The final part of the work presents some minor contributions, still focused on signal processing applied to real sound systems: an implementation of an active noise control system, and two digital-effect algorithms, the first of which improves the sound performance of compact loudspeakers while the second generates a stereophonic effect for electric guitars.
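A Hammerstein system (a static nonlinearity followed by a linear filter), as studied in this thesis, can be illustrated with the textbook overparameterization trick: the output is linear in the combined kernel G[p, k] = c[p]·h[k], so it can be estimated by ordinary least squares and factored with a rank-1 SVD. This is a generic sketch on fabricated signals, not the thesis's adaptive sub-band algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

# True Hammerstein system: cubic static nonlinearity followed by a short FIR.
c_true = np.array([1.0, 0.5, -0.2])   # polynomial coefficients for x, x^2, x^3
h_true = np.array([0.8, 0.3, -0.1])   # FIR filter taps

x = rng.normal(size=2000)
v = c_true[0] * x + c_true[1] * x**2 + c_true[2] * x**3
y = np.convolve(v, h_true)[:len(x)]

# Overparameterized least squares: regress y on delayed powers of x.
# The combined kernel G[p, k] = c[p] * h[k] is linear in the unknowns.
P, K = 3, 3
cols = []
for p in range(1, P + 1):
    xp = x**p
    for k in range(K):
        col = np.zeros_like(x)
        col[k:] = xp[:len(x) - k]
        cols.append(col)
A = np.stack(cols, axis=1)
g, *_ = np.linalg.lstsq(A, y, rcond=None)
G = g.reshape(P, K)               # rank-1 matrix ~ outer(c_true, h_true)

# Factor G back into the nonlinearity and the filter (up to a scale factor).
U, s, Vt = np.linalg.svd(G)
c_est = U[:, 0] * np.sqrt(s[0])
h_est = Vt[0] * np.sqrt(s[0])
```

The rank-1 factorization recovers `c` and `h` only up to a shared scale and sign, which is the well-known inherent ambiguity of Hammerstein cascades.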
3

Djebbar, Fatiha. "Contributions to Audio Steganography : Algorithms and Robustness Analysis." Thesis, Brest, 2012. http://www.theses.fr/2012BRES0005.

Full text
Abstract:
Digital steganography is a young, flourishing science that has emerged as a prominent means of data security. The primary goal of steganography is to reliably send hidden information secretly, not merely to obscure its presence. It exploits the characteristics of digital media files such as image, audio, video, and text, using them as carriers to communicate data covertly. Encryption and watermarking techniques are already used to address concerns related to data security. However, constantly evolving attacks on the integrity of digital data require new techniques to break the cycle of malicious attempts and expand the scope of the applications involved. The main objective of steganographic systems is to provide secure, undetectable, and imperceptible ways to conceal high rates of data in digital media. Steganography is used under the assumption that it will not be detected if no one is attempting to uncover it. Steganography techniques have found their way into various and versatile applications. Some of these applications are used for the benefit of people; others are used maliciously. The threat posed by criminals, hackers, terrorists, and spies using steganography is indeed real. To defeat malicious attempts at covert communication, research has lately been extended to include a new, parallel branch called steganalysis, which aims to counter steganography techniques. The main purpose of steganalysis is to detect whether a hidden message is present, without necessarily considering its successful extraction. Digital speech, in particular, constitutes a prominent carrier for data hiding across novel telecommunication technologies such as covert voice-over-IP, audio conferencing, etc.
This thesis investigates digital speech steganography and steganalysis and aims at: (1) presenting an algorithm that meets the high data capacity, undetectability, and imperceptibility requirements of steganographic systems; (2) controlling the distortion induced by the embedding process; (3) presenting new concepts of spectral embedding areas in the Fourier domain, applicable to both the magnitude and phase spectra; and (4) introducing a simple yet effective speech steganalysis algorithm based on lossless data compression techniques. The steganographic algorithm's performance is measured by perceptual and statistical evaluation methods. On the other hand, the steganalysis algorithm's performance is measured by how well the system can distinguish between stego- and cover-audio signals. The results are very promising and show interesting performance tradeoffs compared to related methods. Future work focuses mainly on strengthening the proposed steganalysis algorithm so that it can detect small hidden payloads. As for our steganographic algorithm, we aim to integrate it into emerging devices such as the iPhone, and to further enhance its capabilities so that hidden data remains intact under severe compression, noise, and channel distortion.
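The intuition behind compression-based steganalysis, item (4) above, can be sketched in a few lines: LSB embedding replaces structured least-significant bits with near-random ones, so a lossless compressor squeezes the LSB plane of a stego signal less than that of a clean one. This toy sketch (a synthetic smooth waveform, zlib as the compressor) only illustrates the principle, not the thesis's actual detector.

```python
import zlib
import numpy as np

rng = np.random.default_rng(2)

def lsb_compress_ratio(samples):
    """Compressed size of the packed LSB plane relative to its raw size."""
    bits = (samples.astype(np.int64) & 1).astype(np.uint8)
    raw = np.packbits(bits).tobytes()
    return len(zlib.compress(raw)) / len(raw)

# "Cover": a smooth, slowly varying waveform whose LSB plane has long runs.
n = 8192
t = np.arange(n)
cover = np.round(100 * np.sin(2 * np.pi * t / n)).astype(np.int16)

# "Stego": the same waveform with random message bits embedded in the LSBs.
stego = (cover & ~1) | rng.integers(0, 2, n).astype(np.int16)

# A higher ratio means the LSB plane is less compressible, i.e. more random.
detectable = lsb_compress_ratio(stego) > lsb_compress_ratio(cover)
```

Real detectors must of course cope with covers whose LSBs are already noisy, which is exactly why detecting small hidden payloads remains the hard case mentioned as future work.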
4

Kafentzis, George. "Adaptive Sinusoidal Models for Speech with Applications in Speech Modifications and Audio Analysis." Thesis, Rennes 1, 2014. http://www.theses.fr/2014REN1S085/document.

Full text
Abstract:
Sinusoidal Modeling is one of the most widely used parametric methods for speech and audio signal processing. The accurate estimation of sinusoidal parameters (amplitudes, frequencies, and phases) is a critical task for close representation of the analyzed signal. In this thesis, based on recent advances in sinusoidal analysis, we propose high resolution adaptive sinusoidal models for analysis, synthesis, and modification systems of speech. Our goal is to provide systems that represent speech in a highly accurate and compact way. Inspired by the recently introduced adaptive Quasi-Harmonic Model (aQHM) and adaptive Harmonic Model (aHM), we overview the theory of adaptive Sinusoidal Modeling and propose a model named the extended adaptive Quasi-Harmonic Model (eaQHM), which is a non-parametric model able to adjust the instantaneous amplitudes and phases of its basis functions to the underlying time-varying characteristics of the speech signal, thus significantly alleviating the so-called local stationarity hypothesis. The eaQHM is shown to outperform aQHM in analysis and resynthesis of voiced speech. Based on the eaQHM, a hybrid analysis/synthesis system of speech is presented (eaQHNM), along with a hybrid version of the aHM (aHNM). Moreover, we present motivation for a full-band representation of speech using the eaQHM, that is, representing all parts of speech as high resolution AM-FM sinusoids. Experiments show that adaptation and quasi-harmonicity are sufficient to provide transparent quality in unvoiced speech resynthesis. The full-band eaQHM analysis and synthesis system is presented next, which outperforms state-of-the-art systems, hybrid or full-band, in speech reconstruction, providing transparent quality confirmed by objective and subjective evaluations. Regarding applications, the eaQHM and the aHM are applied on speech modifications (time and pitch scaling).
The resulting modifications are of high quality, and follow very simple rules, compared to other state-of-the-art modification systems. Results show that harmonicity is preferred over quasi-harmonicity in speech modifications due to the embedded simplicity of representation. Moreover, the full-band eaQHM is applied on the problem of modeling audio signals, and specifically of musical instrument sounds. The eaQHM is evaluated and compared to state-of-the-art systems, and is shown to outperform them in terms of resynthesis quality, successfully representing the attack, transient, and stationary parts of a musical instrument sound. Finally, another application is suggested, namely the analysis and classification of emotional speech. The eaQHM is applied on the analysis of emotional speech, providing its instantaneous parameters as features that can be used in recognition and Vector-Quantization-based classification of the emotional content of speech. Although sinusoidal models are not commonly used in such tasks, the results are promising.
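A minimal sketch of the sinusoidal-model idea underlying aQHM/eaQHM: speech is represented as a sum of AM-FM components, each harmonic carrying a time-varying amplitude and a phase obtained by integrating a time-varying frequency. The pitch and envelope trajectories below are hand-made for illustration; the thesis's models estimate such trajectories adaptively from real signals.

```python
import numpy as np

fs = 16000                                 # sample rate (Hz)
t = np.arange(0, 0.5, 1 / fs)              # half a second of samples

# Hand-made time-varying pitch (slow vibrato) and amplitude envelope.
f0 = 120 + 5 * np.sin(2 * np.pi * 4 * t)   # fundamental frequency in Hz
env = np.minimum(t / 0.05, 1.0) * np.exp(-2 * t)   # attack then decay

def am_fm_sum(f0, env, n_harm=10):
    """Synthesize a sum of harmonic AM-FM sinusoids.

    The phase of each harmonic is k times the running (integrated)
    phase of the fundamental; amplitudes roll off as 1/k.
    """
    phase0 = 2 * np.pi * np.cumsum(f0) / fs
    s = np.zeros_like(f0)
    for k in range(1, n_harm + 1):
        s += (env / k) * np.cos(k * phase0)
    return s

signal = am_fm_sum(f0, env)
```

Because both `env` and `f0` vary sample by sample, the components are genuinely non-stationary, which is precisely the regime where adaptive models improve on fixed-frame sinusoidal analysis.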
APA, Harvard, Vancouver, ISO, and other styles
5

SIMONETTA, FEDERICO. "MUSIC INTERPRETATION ANALYSIS. A MULTIMODAL APPROACH TO SCORE-INFORMED RESYNTHESIS OF PIANO RECORDINGS." Doctoral thesis, Università degli Studi di Milano, 2022. http://hdl.handle.net/2434/918909.

Full text
Abstract:
This Thesis discusses the development of technologies for the automatic resynthesis of music recordings using digital synthesizers. First, the main issue is identified in the understanding of how Music Information Processing (MIP) methods can take into consideration the influence of the acoustic context on the music performance. For this, a novel conceptual and mathematical framework named “Music Interpretation Analysis” (MIA) is presented. In the proposed framework, a distinction is made between the “performance” – the physical action of playing – and the “interpretation” – the action that the performer wishes to achieve. Second, the Thesis describes further works aiming at the democratization of music production tools via automatic resynthesis: 1) it elaborates software and file formats for historical music archiving and multimodal machine-learning datasets; 2) it explores and extends MIP technologies; 3) it presents the mathematical foundations of the MIA framework and shows preliminary evaluations to demonstrate the effectiveness of the approach
APA, Harvard, Vancouver, ISO, and other styles
6

Song, Guanghan. "Effect of sound in videos on gaze : contribution to audio-visual saliency modelling." Thesis, Grenoble, 2013. http://www.theses.fr/2013GRENT013/document.

Full text
Abstract:
Les humains reçoivent une grande quantité d'informations de l'environnement par la vue et l'ouïe. Pour nous aider à réagir rapidement et correctement, il existe dans le cerveau des mécanismes qui orientent l'attention vers des régions particulières, à savoir les régions saillantes. Ce biais attentionnel n'est pas seulement influencé par la vision, mais aussi par l'interaction audio-visuelle. Selon la littérature existante, l'attention visuelle peut être étudiée au travers des mouvements oculaires, mais l'effet du son sur le mouvement des yeux dans les vidéos est peu connu. L'objectif de cette thèse est d'étudier l'influence du son dans les vidéos sur le mouvement des yeux et de proposer un modèle de saillance audio-visuel pour prédire avec plus de précision les régions saillantes dans les vidéos. À cet effet, nous avons conçu une première expérience audio-visuelle de suivi du regard. Nous avons créé une base de données d'extraits vidéo courts choisis dans divers films. Ces extraits ont été visionnés par les participants soit avec leur bande originale (condition AV), soit sans bande sonore (condition V). Nous avons analysé la différence de positions de l'œil entre les participants des conditions AV et V. Les résultats montrent qu'il existe bien un effet du son sur le mouvement des yeux et que l'effet est plus important pour la classe de la parole à l'écran. Ensuite, nous avons conçu une deuxième expérience audio-visuelle avec treize classes de sons. En comparant la différence de positions de l'œil entre les participants des conditions AV et V, nous concluons que l'effet du son diffère selon le type de son, et que les classes avec la voix humaine (c'est-à-dire les classes parole, chanteur, bruit humain et chanteurs) ont le plus grand effet. Plus précisément, la source sonore n'a attiré significativement la position des yeux que lorsque le son était une voix humaine. 
En outre, les participants en condition AV avaient une durée moyenne de fixation plus courte que ceux en condition V. Enfin, nous avons proposé un modèle préliminaire de saillance audio-visuelle fondé sur les résultats des expériences ci-dessus. Dans ce modèle, deux stratégies de fusion des informations audio et visuelle sont décrites : l'une pour la classe de son parole, et l'autre pour la classe de son instrument de musique. Les stratégies de fusion audio-visuelle définies dans le modèle améliorent sa capacité de prédiction en condition AV.
Humans receive a large quantity of information from the environment with sight and hearing. To help us to react rapidly and properly, there exist mechanisms in the brain to bias attention towards particular regions, namely the salient regions. This attentional bias is not only influenced by vision, but also influenced by audio-visual interaction. According to the existing literature, visual attention can be studied through eye movements; however, the effect of sound on eye movement in videos is little known. The aim of this thesis is to investigate the influence of sound in videos on eye movement and to propose an audio-visual saliency model to predict salient regions in videos more accurately. For this purpose, we designed a first audio-visual experiment of eye tracking. We created a database of short video excerpts selected from various films. These excerpts were viewed by participants either with their original soundtrack (AV condition), or without soundtrack (V condition). We analyzed the difference in eye positions between participants in the AV and V conditions. The results show that there does exist an effect of sound on eye movement and the effect is greater for the on-screen speech class. Then, we designed a second audio-visual experiment with thirteen classes of sound. By comparing the difference in eye positions between participants in the AV and V conditions, we conclude that the effect of sound is different depending on the type of sound, and the classes with human voice (i.e. speech, singer, human noise and singers classes) have the greatest effect. More precisely, the sound source significantly attracted eye position only when the sound was a human voice. Moreover, participants in the AV condition had a shorter average fixation duration than in the V condition. Finally, we proposed a preliminary audio-visual saliency model based on the findings of the above experiments. 
In this model, two fusion strategies of audio and visual information were described: one for the speech sound class, and one for the musical instrument sound class. The audio-visual fusion strategies defined in the model improve its predictions in the AV condition.
APA, Harvard, Vancouver, ISO, and other styles
7

Elfitri, I. "Analysis by synthesis spatial audio coding." Thesis, University of Surrey, 2013. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.590657.

Full text
Abstract:
Spatial Audio Coding (SAC) is a technique used to encode multichannel audio signals by extracting the spatial parameters and downmixing the audio signals to a mono or stereo audio signal. Recently, various SAC techniques have been proposed to efficiently encode multichannel audio signals. However, all of them operate in open loop, where the encoder and decoder operate sequentially and independently, and thus lack a mechanism for minimising the decoded audio reconstruction error. This thesis proposes a novel SAC technique that utilises the closed-loop system configuration, termed Analysis by Synthesis (AbS), in order to optimise the downmix signal and the spatial parameters, so as to minimise the decoded signal error. In order to show the effect of the AbS optimisations, the Reverse One-To-Two (R-OTT) module, used in MPEG Surround (MPS), must first be applied in the frequency domain to recalculate the downmix and residual signals based on the quantised spatial parameters. This shows that the AbS scheme can minimise the quantisation errors of the spatial parameters. As the full AbS is far too complicated to be applied in practice, a simplified AbS algorithm for finding sub-optimal parameters, based on the adapted R-OTT module, is also proposed. Subjective tests show that the proposed Analysis by Synthesis Spatial Audio Coding (AbS-SAC), encoding 5-channel audio signals at a bitrate of 51.2 kb/s per audio channel, achieves higher Subjective Difference Grade (SDG) scores than the tested Advanced Audio Coding (AAC) technique. Furthermore, the objective test also shows that the proposed AbS-SAC method, operating at bitrates of 40 to 96 kb/s per audio channel, significantly outperforms (in terms of Objective Difference Grade (ODG) scores) the tested AAC multichannel technique.
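As a rough sketch of the parametric principle behind the R-OTT module and the AbS idea (a simplification with hypothetical helper names, not the MPS or AbS-SAC implementation): a stereo pair is reduced to a mono downmix plus a channel level difference (CLD) parameter, and a closed-loop encoder picks the quantised CLD that minimises the decoded error.

```python
import numpy as np

def ott_encode(left, right):
    """Downmix a stereo pair to mono plus a channel level difference (dB)."""
    downmix = 0.5 * (left + right)
    el, er = np.sum(left ** 2) + 1e-12, np.sum(right ** 2) + 1e-12
    cld_db = 10.0 * np.log10(el / er)
    return downmix, cld_db

def ott_decode(downmix, cld_db):
    """Re-pan the mono downmix with gains consistent with the CLD.

    Exact only for fully correlated channels; in MPS, residual signals
    cover what this purely parametric upmix misses."""
    g = 10.0 ** (cld_db / 20.0)                 # left/right amplitude ratio
    wl, wr = 2.0 * g / (1.0 + g), 2.0 / (1.0 + g)
    return wl * downmix, wr * downmix

def abs_quantise(left, right, grid):
    """Analysis by synthesis: decode each candidate CLD from a quantiser
    grid and keep the one minimising the reconstruction error."""
    downmix, _ = ott_encode(left, right)
    def err(c):
        lh, rh = ott_decode(downmix, c)
        return np.sum((lh - left) ** 2) + np.sum((rh - right) ** 2)
    return min(grid, key=err)

# A stereo pair whose left channel is 6 dB louder than the right.
t = np.arange(100) / 100.0
left, right = 2.0 * np.sin(2 * np.pi * t), np.sin(2 * np.pi * t)
dm, cld = ott_encode(left, right)
```

The closed-loop step is what distinguishes AbS from open-loop SAC: the encoder runs the decoder internally and scores each candidate parameter by the error it would actually produce.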
APA, Harvard, Vancouver, ISO, and other styles
8

Fazekas, György. "Semantic audio analysis utilities and applications." Thesis, Queen Mary, University of London, 2012. http://qmro.qmul.ac.uk/xmlui/handle/123456789/8443.

Full text
Abstract:
Extraction, representation, organisation and application of metadata about audio recordings are the concern of semantic audio analysis. Our broad interpretation, aligned with recent developments in the field, includes methodological aspects of semantic audio, such as those related to information management, knowledge representation and applications of the extracted information. In particular, we look at how Semantic Web technologies may be used to enhance information management practices in two audio related areas: music informatics and music production. In the first area, we are concerned with music information retrieval (MIR) and related research. We examine how structured data may be used to support reproducibility and provenance of extracted information, and aim to support multi-modality and context adaptation in the analysis. In creative music production, our goals can be summarised as follows: Off-the-shelf sound editors do not hold appropriately structured information about the edited material, thus human-computer interaction is inefficient. We believe that recent developments in sound analysis and music understanding are capable of bringing about significant improvements in the music production workflow. Providing visual cues related to music structure can serve as an example of intelligent, context-dependent functionality. The central contributions of this work are a Semantic Web ontology for describing recording studios, including a model of technological artefacts used in music production, methodologies for collecting data about music production workflows and describing the work of audio engineers which facilitates capturing their contribution to music production, and finally a framework for creating Web-based applications for automated audio analysis. 
This has applications demonstrating how Semantic Web technologies and ontologies can facilitate interoperability between music research tools, and the creation of semantic audio software, for instance, for music recommendation, temperament estimation or multi-modal music tutoring.
APA, Harvard, Vancouver, ISO, and other styles
9

Steinhour, Jacob B. "The Social and Pedagogical Advantages of Audio Forensics and Restoration Education." Ohio University Honors Tutorial College / OhioLINK, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=ouhonors1276014966.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Xiao, Zhongzhe Chen Liming. "Recognition of emotions in audio signals." Ecully : Ecole Centrale de Lyon, 2008. http://bibli.ec-lyon.fr/exl-doc/zxiao.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Åsén, Rickard. "Game Audio in Audio Games : Towards a Theory on the Roles and Functions of Sound in Audio Games." Thesis, Högskolan Dalarna, Ljud- och musikproduktion, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:du-13588.

Full text
Abstract:
For the past few decades, researchers have increased our understanding of how sound functions within various audio–visual media formats. With a different focus in mind, this study aims to identify the roles and functions of sound in relation to the game form Audio Games, in order to explore the potential of sound when acting as an autonomous narrative form. Because this is still a relatively unexplored research field, the main purpose of this study is to help establish a theoretical ground and stimulate further research within the field of audio games. By adopting an interdisciplinary approach to the topic, this research relies on theoretical studies, examinations of audio games and contact with the audio game community. In order to reveal the roles of sound, the gathered data is analyzed according to both a contextual and a functional perspective. The research shows that a distinction between the terms ‘function’ and ‘role’ is important when analyzing sound in digital games. The analysis therefore results in the identification of two analytical levels that help define the functions and roles of an entity within a social context, named the Functional and the Interfunctional levels. In addition to successfully identifying three main roles of sound within audio games—each describing the relationship between sound and the entities game system, player and virtual environment—many other issues are also addressed. Consequently, and in accordance with its purpose, this study provides a broad foundation for further research of sound in both audio games and video games.
APA, Harvard, Vancouver, ISO, and other styles
12

Skudelny, Sascha [Verfasser]. "Semantische Analyse von Audio-Logos : vom Audio-Branding-Element zur Metasprachlichen Betrachtung / Sascha Skudelny." Aachen : Shaker, 2012. http://d-nb.info/1069048127/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
13

Burka, Zak. "Perceptual audio classification using principal component analysis /." Online version of thesis, 2010. http://hdl.handle.net/1850/12247.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Stammers, Jon. "Audio event classification for urban soundscape analysis." Thesis, University of York, 2011. http://etheses.whiterose.ac.uk/19142/.

Full text
Abstract:
The study of urban soundscapes has gained momentum in recent years as more people become concerned with the level of noise around them and the negative impact this can have on comfort. Monitoring the sounds present in a sonic environment can be a laborious and time-consuming process if performed manually. Therefore, techniques for automated signal identification are gaining importance if soundscapes are to be objectively monitored. This thesis presents a novel approach to feature extraction for the purpose of classifying urban audio events, adding to the library of techniques already established in the field. The research explores how techniques with their origins in the encoding of speech signals can be adapted to represent the complex everyday sounds all around us to allow accurate classification. The analysis methods developed herein are based on the zero-crossings information contained within a signal. Originally developed for the classification of bioacoustic signals, the codebook of Time-Domain Signal Coding (TDSC) has its band-limited restrictions removed to become more generic. Classification using features extracted with the new codebook achieves accuracies of over 80% when combined with a Multilayer Perceptron classifier. Further advancements are made to the standard TDSC algorithm, drawing inspiration from wavelets, resulting in a novel dyadic representation of time-domain features. Carrying the label of Multiscale TDSC (MTDSC), classification accuracies of 70% are achieved using these features. Recommendations for further work focus on expanding the library of training data to improve the accuracy of the classification system. Further research into classifier design is also suggested.
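TDSC describes a waveform by the epochs between successive zero-crossings, coding each epoch by its duration and a simple shape measure such as the number of extrema. A minimal sketch of that idea (illustrative only, not the thesis codebook):

```python
import numpy as np

def zero_crossing_epochs(x):
    """Split a signal at its zero-crossings and describe each epoch by
    (duration in samples, number of extrema), in the spirit of TDSC."""
    signs = np.sign(x)
    signs[signs == 0] = 1                       # treat exact zeros as positive
    crossings = np.where(np.diff(signs) != 0)[0] + 1
    bounds = np.concatenate(([0], crossings, [len(x)]))
    features = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        epoch = x[start:stop]
        if len(epoch) < 3:
            features.append((len(epoch), 0))
            continue
        slope_signs = np.sign(np.diff(epoch))
        n_extrema = int(np.sum(np.diff(slope_signs) != 0))
        features.append((len(epoch), n_extrema))
    return features

# A 5 Hz sine sampled at 1 kHz: ten half-cycle epochs, one extremum each.
feats = zero_crossing_epochs(np.sin(2 * np.pi * 5 * np.arange(1000) / 1000))
```

A real TDSC front end maps each (duration, shape) pair to a codebook symbol and classifies on the symbol histogram; this sketch stops at the raw epoch description.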
APA, Harvard, Vancouver, ISO, and other styles
15

Mitianoudis, Nikolaos. "Audio source separation using independent component analysis." Thesis, Queen Mary, University of London, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.406171.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Spina, Michelle S. (Michelle Suzanne). "Analysis and transcription of general audio data." Thesis, Massachusetts Institute of Technology, 2000. http://hdl.handle.net/1721.1/86479.

Full text
Abstract:
Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2000.
Includes bibliographical references (p. 141-147).
by Michelle S. Spina.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
17

Bando, Yoshiaki. "Robust Audio Scene Analysis for Rescue Robots." Kyoto University, 2018. http://hdl.handle.net/2433/232410.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Occhipinti, Cristina. "Analisi di segnali audio mediante funzioni wavelet." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2010. http://amslaurea.unibo.it/1581/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
19

De, Sena Enzo. "Analysis, design and implementation of multichannel audio systems." Thesis, King's College London (University of London), 2013. https://kclpure.kcl.ac.uk/portal/en/theses/analysis-design-and-implementation-of-multichannel-audio-systems(2667506b-f58e-44f1-858a-bcb67d341720).html.

Full text
Abstract:
This thesis is concerned with the analysis, design and implementation of multichannel audio systems. The design objective is to reconstruct a given sound field such that it is perceptually equivalent to the recorded one. A framework for the design of circular microphone arrays is proposed. This framework is based on fitting of psychoacoustic data and enables the design of both coincident and quasi-coincident arrays. Results of formal listening experiments suggest that the proposed methodology performs on a par with state of the art methods, albeit with a more graceful degradation away from the centre of the loudspeaker array. A computational model of auditory perception is also developed to estimate the subjects' response in a broader class of conditions than the ones considered in the listening experiments. The model predictions suggest that quasi-coincident microphone arrays result in auditory events that are easier to localise for off centre listeners. Two technologies are developed to enable using the proposed framework for recording of real sound fields (e.g. live concert) and virtual ones (e.g. video-games). Differential microphones are identified as desirable candidates for the case of real sound fields and are adapted to suit the framework requirements. Their robustness to self-noise is assessed and measurements of a third-order prototype are presented. Finally, a scalable and interactive room acoustic simulator is proposed to enable virtual recordings in simulated sound fields.
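The first-order differential microphones identified above have the familiar polar response a + (1 - a)cos(theta); a = 0.5 gives a cardioid with a rear null. A small sketch of that textbook pattern (illustrative, not the thesis prototype):

```python
import numpy as np

def first_order_response(theta, a=0.5):
    """Magnitude response of a first-order differential microphone:
    |a + (1 - a) * cos(theta)|.  a = 0.5 gives a cardioid."""
    return np.abs(a + (1.0 - a) * np.cos(theta))

front = first_order_response(0.0)        # on-axis response
rear = first_order_response(np.pi)       # cardioid rear null
```

Varying `a` trades directivity against the rear-lobe shape, which is one reason such capsules suit perceptually motivated array design.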
APA, Harvard, Vancouver, ISO, and other styles
20

Xiao, Zhongzhe. "Recognition of emotions in audio signals." Ecully, Ecole centrale de Lyon, 2008. http://www.theses.fr/2008ECDL0002.

Full text
Abstract:
Les travaux de recherche réalisés dans le cadre de cette thèse de doctorat portent sur la reconnaissance automatique de l’émotion et de l’humeur au sein de signaux sonores. En effet, l’émotion portée par les signaux audio constitue une information sémantique particulièrement importante dont l’analyse automatique offre de nombreuses possibilités en termes d’applications, telles que les interactions homme-machine intelligentes et l’indexation multimédia. L’objectif de cette thèse est ainsi d’étudier des solutions informatiques d’analyse de l’émotion audio tant pour la parole que pour les signaux musicaux. Nous utilisons dans notre travail un modèle émotionnel discret combiné à un modèle dimensionnel, en nous appuyant sur des études existantes sur les corrélations entre les propriétés acoustiques et l’émotion dans la parole ainsi que l’humeur dans les signaux de musique. Les principales contributions de nos travaux sont les suivantes. Tout d’abord, nous avons proposé, en complément des caractéristiques audio basées sur les propriétés fréquentielles et d’énergie, de nouvelles caractéristiques harmoniques et Zipf, afin d’améliorer la caractérisation des propriétés des signaux de parole en termes de timbre et de prosodie. Deuxièmement, dans la mesure où très peu de ressources pour l’étude de l’émotion dans la parole et dans la musique sont disponibles par rapport au nombre important de caractéristiques audio qu’il est envisageable d’extraire, une méthode de sélection de caractéristiques nommée ESFS, basée sur la théorie de l’évidence, est proposée afin de simplifier le modèle de classification et d’en améliorer les performances. De plus, nous avons montré que l’utilisation d’un classifieur hiérarchique basé sur un modèle dimensionnel de l’émotion permet d’obtenir de meilleurs résultats de classification qu’un unique classifieur global, souvent utilisé dans la littérature. 
Par ailleurs, puisqu’il n’existe pas d’accord universel sur la définition des émotions de base, et parce que les états émotionnels considérés sont très dépendants des applications, nous avons également proposé un algorithme basé sur ESFS permettant de construire automatiquement un classifieur hiérarchique adapté à un ensemble spécifique d’états émotionnels dans le cadre d’une application particulière. Cette classification hiérarchique procède en divisant un problème de classification complexe en un ensemble de problèmes plus petits et plus simples grâce à la combinaison d’un ensemble de sous-classifieurs binaires organisés sous forme d’un arbre binaire. Enfin, les émotions étant par nature des notions subjectives, nous avons également proposé un classifieur ambigu, basé sur la théorie de l’évidence, permettant l’association d’un signal audio à de multiples émotions, comme le font souvent les êtres humains
This Ph.D. thesis is dedicated to automatic emotion/mood recognition in audio signals. Indeed, audio emotion is high-level semantic information and its automatic analysis may have many applications such as smart human-computer interactions or multimedia indexing. The purpose of this thesis is thus to investigate machine-based audio emotion analysis solutions for both speech and music signals. Our work makes use of a discrete emotional model combined with the dimensional one and relies upon existing studies on acoustic correlates of emotional speech and music mood. The key contributions are the following. First, we have proposed, in complement to popular frequency-based and energy-based features, some new audio features, namely harmonic and Zipf features, to better characterize the timbre and prosodic properties of emotional speech. Second, as there exist very few emotional resources for either speech or music for machine learning, compared to the number of audio features that one can extract, an evidence theory-based feature selection scheme named Embedded Sequential Forward Selection (ESFS) is proposed to deal with the classic “curse of dimensionality” problem and thus over-fitting. Third, using a manually built dimensional emotion model-based hierarchical classifier to deal with fuzzy borders of emotional states, we demonstrated that a hierarchical classification scheme performs better than the single global classifier mostly used in the literature. Furthermore, as there does not exist any universal agreement on basic emotion definition and as emotional states are typically application-dependent, we also proposed an ESFS-based algorithm for automatically building a hierarchical classification scheme (HCS) which is best adapted to a specific set of application-dependent emotional states. 
The HCS divides a complex classification problem into simpler and smaller problems by combining several binary sub-classifiers in the structure of a binary tree in several stages, and outputs the emotional state of the audio sample. Finally, to deal with the subjective nature of emotions, we also proposed an evidence theory-based ambiguous classifier allowing multiple-emotion labelling, as humans often do. The effectiveness of all these recognition techniques was evaluated on the Berlin and DES datasets for emotional speech recognition and on a music mood dataset that we collected in our laboratory, as no public dataset existed at the time. Keywords: audio signal, emotion classification, music mood analysis, audio features, feature selection, hierarchical classification, ambiguous classification, evidence theory
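The hierarchical scheme described above chains binary sub-classifiers in a tree, each stage resolving one dimension of the emotion model. A toy sketch with hypothetical thresholded features (energy, pitch variance) and labels, purely for illustration of the tree structure, not the ESFS-built classifier:

```python
def make_stump(index, threshold, below, above):
    """A trivial binary sub-classifier thresholding one feature."""
    def classify(x):
        return below if x[index] < threshold else above
    return classify

# Hypothetical two-stage tree on (energy, pitch-variance) features:
# stage 1 resolves arousal, stage 2 resolves valence within each branch.
arousal_node = make_stump(0, 0.5, "low", "high")
high_node = make_stump(1, 0.5, "anger", "joy")
low_node = make_stump(1, 0.5, "sadness", "calm")

def hierarchical_classify(x):
    branch = high_node if arousal_node(x) == "high" else low_node
    return branch(x)

label = hierarchical_classify((0.8, 0.9))   # high energy, high pitch variance
```

In the thesis each node is a trained binary classifier and the tree layout is chosen automatically by ESFS; the point of the sketch is only that one hard global decision is replaced by a cascade of small ones.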
APA, Harvard, Vancouver, ISO, and other styles
21

Nesvadba, Jan. "Segmentation sémantique des contenus audio-visuels." Bordeaux 1, 2007. http://www.theses.fr/2007BOR13456.

Full text
Abstract:
In this work, we developed a method for segmenting audiovisual content that is applicable to home storage devices. To this end, we experimented with a distributed content-analysis system composed of individual analysis modules, the service units. One of them was dedicated to characterising non-content elements, i.e. commercials, and delivers good performance. In parallel, we tested several shot-change detectors in order to retain the best of them for the subsequent steps. We then proposed a study of film production rules, i.e. film grammar, which allowed us to define parallel-shot sequences. We evaluated four similarity-based grouping methods in order to retain the best one. Finally, we investigated several methods for detecting scene boundaries and obtained the best results by combining a colour-based method with a shot-length criterion. The latter offers performance that justifies its integration into consumer storage devices.
APA, Harvard, Vancouver, ISO, and other styles
22

Parekh, Sanjeel. "Learning representations for robust audio-visual scene analysis." Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLT015/document.

Full text
Abstract:
L'objectif de cette thèse est de concevoir des algorithmes qui permettent la détection robuste d’objets et d’événements dans des vidéos en s’appuyant sur une analyse conjointe de données audio et visuelle. Ceci est inspiré par la capacité remarquable des humains à intégrer les caractéristiques auditives et visuelles pour améliorer leur compréhension de scénarios bruités. À cette fin, nous nous appuyons sur deux types d'associations naturelles entre les modalités d'enregistrements audiovisuels (réalisés à l'aide d'un seul microphone et d'une seule caméra), à savoir la corrélation mouvement/audio et la co-occurrence apparence/audio. Dans le premier cas, nous utilisons la séparation de sources audio comme application principale et proposons deux nouvelles méthodes dans le cadre classique de la factorisation par matrices non négatives (NMF). L'idée centrale est d'utiliser la corrélation temporelle entre l'audio et le mouvement pour les objets / actions où le mouvement produisant le son est visible. La première méthode proposée met l'accent sur le couplage flexible entre les représentations audio et de mouvement capturant les variations temporelles, tandis que la seconde repose sur la régression intermodale. Nous avons séparé plusieurs mélanges complexes d'instruments à cordes en leurs sources constituantes en utilisant ces approches.Pour identifier et extraire de nombreux objets couramment rencontrés, nous exploitons la co-occurrence apparence/audio dans de grands ensembles de données. Ce mécanisme d'association complémentaire est particulièrement utile pour les objets où les corrélations basées sur le mouvement ne sont ni visibles ni disponibles. 
Le problème est traité dans un contexte faiblement supervisé dans lequel nous proposons un framework d’apprentissage de représentation pour la classification robuste des événements audiovisuels, la localisation des objets visuels, la détection des événements audio et la séparation de sources.Nous avons testé de manière approfondie les idées proposées sur des ensembles de données publics. Ces expériences permettent de faire un lien avec des phénomènes intuitifs et multimodaux que les humains utilisent dans leur processus de compréhension de scènes audiovisuelles
The goal of this thesis is to design algorithms that enable robust detection of objects and events in videos through joint audio-visual analysis. This is motivated by humans’ remarkable ability to meaningfully integrate auditory and visual characteristics for perception in noisy scenarios. To this end, we identify two kinds of natural associations between the modalities in recordings made using a single microphone and camera, namely motion-audio correlation and appearance-audio co-occurrence. For the former, we use audio source separation as the primary application and propose two novel methods within the popular non-negative matrix factorization framework. The central idea is to utilize the temporal correlation between audio and motion for objects/actions where the sound-producing motion is visible. The first proposed method focuses on soft coupling between audio and motion representations capturing temporal variations, while the second is based on cross-modal regression. We segregate several challenging audio mixtures of string instruments into their constituent sources using these approaches. To identify and extract many commonly encountered objects, we leverage appearance–audio co-occurrence in large datasets. This complementary association mechanism is particularly useful for objects where motion-based correlations are not visible or available. The problem is dealt with in a weakly-supervised setting wherein we design a representation learning framework for robust AV event classification, visual object localization, audio event detection and source separation. We extensively test the proposed ideas on publicly available datasets. The experiments demonstrate several intuitive multimodal phenomena that humans utilize on a regular basis for robust scene understanding.
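The non-negative matrix factorization framework mentioned above approximates a non-negative matrix V (e.g. a magnitude spectrogram) as W @ H. A generic sketch using the standard Lee-Seung multiplicative updates for the Euclidean cost (not the audio-visual coupled variants proposed in the thesis):

```python
import numpy as np

def nmf(V, rank, n_iter=500, seed=0):
    """Euclidean NMF, V ~ W @ H, via Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + 1e-3
    H = rng.random((rank, m)) + 1e-3
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)
    return W, H

# Factorise an exactly rank-2 non-negative matrix.
rng = np.random.default_rng(1)
V = rng.random((4, 2)) @ rng.random((2, 5))
W, H = nmf(V, rank=2)
```

In audio separation, each column of W acts as a spectral template and each row of H as its activation over time; the thesis couples those activations to visual motion.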
APA, Harvard, Vancouver, ISO, and other styles
23

Faircloth, Ryan. "AUDIO AND VIDEO TEMPO ANALYSIS FOR DANCE DETECTION." Master's thesis, University of Central Florida, 2008. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/2633.

Full text
Abstract:
The amount of multimedia in existence has become so extensive that the organization of this data cannot be performed manually. Systems designed to maintain such quantity need superior methods of understanding the information contained in the data. Aspects of Computer Vision deal with such problems for the understanding of image and video content. Additionally, large ontologies such as LSCOM are collections of feasible high-level concepts that are of interest to identify within multimedia content. While ontologies often include the activity of dance, it has had virtually no coverage in Computer Vision literature in terms of actual detection. We demonstrate that training-based approaches are challenged by dance because the activity is defined by an unlimited set of movements, and therefore unreasonable amounts of training data would be required to recognize even a small portion of the immense possibilities for dance. In this thesis we present a non-training, tempo-based approach to dance detection which yields very good results when compared to another method with state-of-the-art performance for other common activities; the testing dataset contains videos acquired mostly through YouTube. The algorithm is based on one-dimensional analysis in which we perform visual beat detection through the computation of optical flow. Next, we obtain a set of tempo hypotheses, and the final stage of our method tracks visual beats through a video sequence in order to determine the most likely tempo for the object motion. In this thesis we not only demonstrate the utility of visual beats in visual tempo detection but also demonstrate their existence in most of the common activities considered by state-of-the-art methods.
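The tempo-hypothesis stage can be illustrated generically: given a one-dimensional motion-energy signal (e.g. summed optical-flow magnitude per frame), the dominant beat period is the lag that maximises its autocorrelation. An assumed simplification, not the thesis algorithm:

```python
import numpy as np

def estimate_period(motion_energy, min_lag=2):
    """Return the lag (in frames) maximising the autocorrelation of a
    1-D motion-energy signal, i.e. the dominant visual beat period."""
    x = motion_energy - np.mean(motion_energy)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # non-negative lags
    return min_lag + int(np.argmax(ac[min_lag:len(x) // 2]))

# A synthetic visual-beat signal: one impulse every 8 frames.
frames = np.zeros(160)
frames[::8] = 1.0
period = estimate_period(frames)
```

Converting the period to beats per minute (`60 * frame_rate / period`) would give a tempo hypothesis comparable against the audio tempo.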
M.S.E.E.
School of Electrical Engineering and Computer Science
Engineering and Computer Science
Electrical Engineering MSEE
APA, Harvard, Vancouver, ISO, and other styles
24

Kolozali, Sefki. "Automatic ontology generation based on semantic audio analysis." Thesis, Queen Mary, University of London, 2014. http://qmro.qmul.ac.uk/xmlui/handle/123456789/8452.

Full text
Abstract:
Ontologies provide an explicit conceptualisation of a domain and a uniform framework that represents domain knowledge in a machine interpretable format. The Semantic Web heavily relies on ontologies to provide well-defined meaning and support for automated services based on the description of semantics. However, considering the open, evolving and decentralised nature of the Semantic Web – though many ontology engineering tools have been developed over the last decade – it can be a laborious and challenging task to deal with manual annotation, hierarchical structuring and organisation of data as well as maintenance of previously designed ontology structures. For these reasons, we investigate how to facilitate the process of ontology construction using semantic audio analysis. The work presented in this thesis contributes to solving the problems of knowledge acquisition and manual construction of ontologies. We develop a hybrid system that involves a formal method of automatic ontology generation for web-based audio signal processing applications. The proposed system uses timbre features extracted from audio recordings of various musical instruments. The proposed system is evaluated using a database of isolated notes and melodic phrases recorded in neutral conditions, and we make a detailed comparison between musical instrument recognition models to investigate their effects on the automatic ontology generation system. Finally, the automatically-generated musical instrument ontologies are evaluated in comparison with the terminology and hierarchical structure of the Hornbostel and Sachs organology system. We show that the proposed system is applicable in multi-disciplinary fields that deal with knowledge management and knowledge representation issues.
APA, Harvard, Vancouver, ISO, and other styles
25

Hainsworth, Stephen Webley. "Techniques for the automated analysis of musical audio." Thesis, University of Cambridge, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.616011.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Solomon, Mary Joanna. "Multivariate Analysis of Korean Pop Music Audio Features." Bowling Green State University / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1617105874719868.

Full text
APA, Harvard, Vancouver, ISO, and other styles
27

Carlo, Diego Di. "Echo-aware signal processing for audio scene analysis." Thesis, Rennes 1, 2020. http://www.theses.fr/2020REN1S075.

Full text
Abstract:
La plupart des méthodes de traitement du signal audio considèrent la réverbération et en particulier les échos acoustiques comme une nuisance. Cependant, ceux-ci transmettent des informations spatiales et sémantiques importantes sur les sources sonores et des méthodes essayant de les prendre en compte ont donc récemment émergé. Dans ce travail, nous nous concentrons sur deux directions. Tout d’abord, nous étudions la manière d’estimer les échos acoustiques à l’aveugle à partir d’enregistrements microphoniques. Deux approches sont proposées, l’une s’appuyant sur le cadre des dictionnaires continus, l’autre sur des techniques récentes d’apprentissage profond. Ensuite, nous nous concentrons sur l’extension de méthodes existantes d’analyse de scènes audio à leurs formes sensibles à l’écho. Le cadre NMF multicanal pour la séparation de sources audio, la méthode de localisation SRP-PHAT et le formateur de voies MVDR pour l’amélioration de la parole sont tous étendus pour prendre en compte les échos. Ces applications montrent comment un simple modèle d’écho peut conduire à une amélioration des performances.
Most audio signal processing methods regard reverberation, and in particular acoustic echoes, as a nuisance. However, echoes convey important spatial and semantic information about sound sources, and recent echo-aware methods have been proposed to exploit it. In this work we focus on two directions. First, we study how to estimate acoustic echoes blindly from microphone recordings. Two approaches are proposed, one leveraging continuous dictionaries, the other using recent deep learning techniques. Then, we focus on extending existing methods in audio scene analysis to their echo-aware forms. The multichannel NMF framework for audio source separation, the SRP-PHAT localization method, and the MVDR beamformer for speech enhancement are all extended to their echo-aware versions.
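As an illustrative aside (not code from the thesis), the correlation core underlying localization methods such as SRP-PHAT is the Generalized Cross-Correlation with Phase Transform (GCC-PHAT), which estimates the time difference of arrival between a pair of microphones; a minimal NumPy sketch under our own naming:

```python
import numpy as np

def gcc_phat(x, y, fs):
    """Time difference of arrival between signals x and y (in seconds),
    via Generalized Cross-Correlation with Phase Transform."""
    n = len(x) + len(y)                       # zero-pad to avoid circular wrap
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n)
    cc = np.concatenate((cc[-(n // 2):], cc[:n // 2]))  # center zero lag
    lag = np.argmax(cc) - n // 2
    return lag / fs

fs = 16000
rng = np.random.default_rng(0)
s = rng.standard_normal(4096)
delay = 25                                    # y is x delayed by 25 samples
x = np.concatenate((s, np.zeros(delay)))
y = np.concatenate((np.zeros(delay), s))
print(round(gcc_phat(x, y, fs) * fs))         # -25 (y lags x by 25 samples)
```

SRP-PHAT aggregates this kind of phase-transformed correlation over candidate source locations; echo-aware variants, as studied in the thesis, additionally account for early reflections rather than treating them as noise.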
APA, Harvard, Vancouver, ISO, and other styles
28

Hartquist, John E. "Real-time Musical Analysis of Polyphonic Guitar Audio." DigitalCommons@CalPoly, 2012. https://digitalcommons.calpoly.edu/theses/808.

Full text
Abstract:
In this thesis, we analyze the audio signal of a guitar to extract musical data in real-time. Specifically, the pitch and octave of notes and chords are displayed over time. Previous work has shown that non-negative matrix factorization is an effective method for classifying the pitches of simultaneous notes. We explore the effect of window size, hop length, and other parameters to maximize the resolution and accuracy of the output. Other groups have required prerecorded note samples to build a library of note templates to search for. We automate this step and compute the library at run-time, tuning it specifically for the input guitar. The program we present generates a musical visualization of the results in addition to suggestions for fingerings of chords in the form of a fretboard display and tablature notation. This program is built as an applet and is accessible from the web browser.
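The NMF-based pitch classification described above can be illustrated with a minimal sketch (ours, not the author's code): given a fixed dictionary of note spectral templates W, the per-frame note activations H are obtained by multiplicative updates on a magnitude spectrogram V:

```python
import numpy as np

def note_activations(V, W, n_iter=200, eps=1e-9):
    """Given a magnitude spectrogram V (freq x time) and a fixed
    dictionary of note templates W (freq x notes), estimate the
    activation matrix H (notes x time) with NMF multiplicative
    updates minimising the Frobenius reconstruction error."""
    H = np.full((W.shape[1], V.shape[1]), 0.1)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # standard multiplicative update
    return H

# Toy example: two 'notes' with disjoint spectral templates.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.0]])            # freq bins x 2 notes
H_true = np.array([[1.0, 0.0, 1.0],
                   [0.0, 2.0, 0.0]])  # 2 notes x 3 frames
V = W @ H_true
H = note_activations(V, W)
print(np.allclose(H, H_true, atol=1e-3))  # True
```

In the thesis's setting, W would be learned at run-time from the input guitar itself rather than fixed by hand, and pitches would be reported wherever activations exceed a threshold.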
APA, Harvard, Vancouver, ISO, and other styles
29

Ning, Daryl. "Analysis and coding of high quality audio signals." Thesis, Queensland University of Technology, 2003. https://eprints.qut.edu.au/15814/1/Daryl_Ning_Thesis.pdf.

Full text
Abstract:
Digital audio is becoming more and more a part of our daily lives. Unfortunately, the excessive bitrate associated with the raw digital signal makes it an extremely expensive representation. Applications such as digital audio broadcasting, high definition television, and internet audio require high quality audio at low bitrates. The field of audio coding addresses this important issue of reducing the bitrate of digital audio while maintaining a high perceptual quality. Developing an efficient audio coder requires a detailed analysis of the audio signals themselves. It is important to find a representation that can concisely model any general audio signal. In this thesis, we propose two new high quality audio coders based on two different audio representations - the sinusoidal-wavelet representation, and the warped linear predictive coding (WLPC)-wavelet representation. In addition to high quality coding, it is also important for audio coders to be flexible in their application. With the increasing popularity of internet audio, it is advantageous for audio coders to address issues related to real-time audio delivery. The issue of bitstream scalability has been targeted in this thesis, and therefore, a third audio coder capable of bitstream scalability is also proposed. The performance of each of the proposed coders was evaluated by comparison with the MPEG layer III coder. The first coder proposed is based on a hybrid sinusoidal-wavelet representation. This assumes that each frame of audio can be modelled as a sum of sinusoids plus a noisy residual. The discrete wavelet transform (DWT) is used to decompose the residual into subbands that approximate the critical bands of human hearing. A perceptually derived bit allocation algorithm is then used to minimise the audible distortions introduced from quantising the DWT coefficients. Listening tests showed that the coder delivers near-transparent quality for a range of critical audio signals at 64 kbps.
It also outperforms the MPEG layer III coder operating at this same bitrate. This coder, however, is only useful for high quality coding, and is difficult to scale to operate at lower rates. The second coder proposed is based on a hybrid WLPC-wavelet representation. In this approach, the spectrum of the audio signal is estimated by an all-pole filter using warped linear prediction (WLP). WLP operates on a warped frequency domain, where the resolution can be adjusted to approximate that of the human auditory system. This makes the inherent noise shaping of the synthesis filter even more suited to audio coding. The excitation to this filter is transformed using the DWT and perceptually encoded. Listening tests showed that near-transparent coding is achieved at 64 kbps. The coder was also found to be slightly superior to the MPEG layer III coder operating at this same bitrate. The third proposed coder is similar to the previous WLPC-wavelet coder, but modified to achieve bitstream scalability. A noise model for high frequency components is included to keep the overall bitrate low, and a two-stage quantisation scheme for the DWT coefficients is implemented. The first stage uses fixed-rate scalar and vector quantisation to provide a coarse approximation of the coefficients. This allows for low bitrate, low quality versions of the input signal to be embedded in the overall bitstream. The second stage of quantisation adds detail to the coefficients, and hence, enhances the quality of the output signal. Listening tests showed that signal quality gracefully improves as the bitrate increases from 16 kbps to 80 kbps. This coder has a performance that is comparable to the MPEG layer III coder operating at a similar (but fixed) bitrate.
APA, Harvard, Vancouver, ISO, and other styles
30

Ning, Daryl. "Analysis and Coding of High Quality Audio Signals." Queensland University of Technology, 2003. http://eprints.qut.edu.au/15814/.

Full text
Abstract:
Digital audio is becoming more and more a part of our daily lives. Unfortunately, the excessive bitrate associated with the raw digital signal makes it an extremely expensive representation. Applications such as digital audio broadcasting, high definition television, and internet audio require high quality audio at low bitrates. The field of audio coding addresses this important issue of reducing the bitrate of digital audio while maintaining a high perceptual quality. Developing an efficient audio coder requires a detailed analysis of the audio signals themselves. It is important to find a representation that can concisely model any general audio signal. In this thesis, we propose two new high quality audio coders based on two different audio representations - the sinusoidal-wavelet representation, and the warped linear predictive coding (WLPC)-wavelet representation. In addition to high quality coding, it is also important for audio coders to be flexible in their application. With the increasing popularity of internet audio, it is advantageous for audio coders to address issues related to real-time audio delivery. The issue of bitstream scalability has been targeted in this thesis, and therefore, a third audio coder capable of bitstream scalability is also proposed. The performance of each of the proposed coders was evaluated by comparison with the MPEG layer III coder. The first coder proposed is based on a hybrid sinusoidal-wavelet representation. This assumes that each frame of audio can be modelled as a sum of sinusoids plus a noisy residual. The discrete wavelet transform (DWT) is used to decompose the residual into subbands that approximate the critical bands of human hearing. A perceptually derived bit allocation algorithm is then used to minimise the audible distortions introduced from quantising the DWT coefficients. Listening tests showed that the coder delivers near-transparent quality for a range of critical audio signals at 64 kbps.
It also outperforms the MPEG layer III coder operating at this same bitrate. This coder, however, is only useful for high quality coding, and is difficult to scale to operate at lower rates. The second coder proposed is based on a hybrid WLPC-wavelet representation. In this approach, the spectrum of the audio signal is estimated by an all-pole filter using warped linear prediction (WLP). WLP operates on a warped frequency domain, where the resolution can be adjusted to approximate that of the human auditory system. This makes the inherent noise shaping of the synthesis filter even more suited to audio coding. The excitation to this filter is transformed using the DWT and perceptually encoded. Listening tests showed that near-transparent coding is achieved at 64 kbps. The coder was also found to be slightly superior to the MPEG layer III coder operating at this same bitrate. The third proposed coder is similar to the previous WLPC-wavelet coder, but modified to achieve bitstream scalability. A noise model for high frequency components is included to keep the overall bitrate low, and a two-stage quantisation scheme for the DWT coefficients is implemented. The first stage uses fixed-rate scalar and vector quantisation to provide a coarse approximation of the coefficients. This allows for low bitrate, low quality versions of the input signal to be embedded in the overall bitstream. The second stage of quantisation adds detail to the coefficients, and hence, enhances the quality of the output signal. Listening tests showed that signal quality gracefully improves as the bitrate increases from 16 kbps to 80 kbps. This coder has a performance that is comparable to the MPEG layer III coder operating at a similar (but fixed) bitrate.
APA, Harvard, Vancouver, ISO, and other styles
31

Ren, Reede. "Audio-visual football video analysis, from structure detection to attention analysis." Thesis, Connect to e-thesis. Move to record for print version, 2008. http://theses.gla.ac.uk/77/.

Full text
Abstract:
Thesis (Ph.D.) - University of Glasgow, 2008.
Ph.D. thesis submitted to the Faculty of Information and Mathematical Sciences, Department of Computing Science, University of Glasgow, 2008. Includes bibliographical references. Print version also available.
APA, Harvard, Vancouver, ISO, and other styles
32

Alameda-Pineda, Xavier. "Egocentric Audio-Visual Scene Analysis : a machine learning and signal processing approach." Thesis, Grenoble, 2013. http://www.theses.fr/2013GRENM024/document.

Full text
Abstract:
Au cours des vingt dernières années, l'industrie a développé plusieurs produits commerciaux dotés de capacités auditives et visuelles. La grande majorité de ces produits est composée d'un caméscope et d'un microphone embarqué (téléphones portables, tablettes, etc). D'autres, comme la Kinect, sont équipés de capteurs de profondeur et/ou de petits réseaux de microphones. On trouve également des téléphones portables dotés d'un système de vision stéréo. En même temps, plusieurs systèmes orientés recherche sont apparus (par exemple, le robot humanoïde NAO). Du fait que ces systèmes sont compacts, leurs capteurs sont positionnés près les uns des autres. En conséquence, ils ne peuvent pas capturer la scène complète, mais seulement un point de vue très particulier de l'interaction sociale en cours. On appelle cela "Analyse Égocentrique de Scènes Audio-Visuelles". Cette thèse contribue à cette thématique de plusieurs façons. D'abord, en fournissant une base de données publique qui cible des applications comme la reconnaissance d'actions et de gestes, la localisation et le suivi d'interlocuteurs, l'analyse du tour de parole, la localisation de sources auditives, etc. Cette base a été utilisée dans le cadre de cette thèse et en dehors. Nous avons aussi travaillé sur le problème de la détection d'événements audio-visuels. Nous avons montré comment la confiance en une des modalités (issue de la vision en l'occurrence) peut être modélisée pour biaiser la méthode, donnant lieu à un algorithme d'espérance-maximisation visuellement supervisé. Ensuite, nous avons modifié l'approche pour cibler la détection audio-visuelle d'interlocuteurs en utilisant le robot humanoïde NAO. En parallèle aux travaux en détection audio-visuelle d'interlocuteurs, nous avons développé une nouvelle approche pour la reconnaissance audio-visuelle de commandes.
Nous avons évalué la qualité de plusieurs indices et classifieurs, et confirmé que l'utilisation des données auditives et visuelles favorise la reconnaissance, en comparaison aux méthodes qui n'utilisent que l'audio ou que la vidéo. Plus tard, nous avons cherché la meilleure méthode pour des ensembles d'entraînement minuscules (5-10 observations par catégorie). Il s'agit d'un problème intéressant, car les systèmes réels ont besoin de s'adapter très rapidement et d'apprendre de nouvelles commandes. Ces systèmes doivent être opérationnels avec très peu d'échantillons pour l'usage public. Pour finir, nous avons contribué au champ de la localisation de sources sonores, dans le cas particulier des réseaux coplanaires de microphones. C'est une problématique importante, car la géométrie du réseau est arbitraire et inconnue. En conséquence, cela ouvre la voie pour travailler avec des réseaux de microphones dynamiques, qui peuvent adapter leur géométrie pour mieux répondre à certaines tâches. De plus, la conception des produits commerciaux peut être contrainte de façon que les réseaux linéaires ou circulaires ne soient pas bien adaptés.
Over the past two decades, the industry has developed several commercial products with audio-visual sensing capabilities. Most of them consist of a video camera with an embedded microphone (mobile phones, tablets, etc). Others, such as Kinect, include depth sensors and/or small microphone arrays. Also, there are some mobile phones equipped with a stereo camera pair. At the same time, many research-oriented systems became available (e.g., humanoid robots such as NAO). Since all these systems are small in volume, their sensors are close to each other. Therefore, they are not able to capture the global scene, but only one point of view of the ongoing social interplay. We refer to this as "Egocentric Audio-Visual Scene Analysis". This thesis contributes to this field in several aspects. Firstly, by providing a publicly available data set targeting applications such as action/gesture recognition, speaker localization, tracking and diarisation, sound source localization, dialogue modelling, etc. This work has later been used both inside and outside the thesis. We also investigated the problem of audio-visual event detection. We showed how the trust in one of the modalities (the visual one, to be precise) can be modeled and used to bias the method, leading to a visually-supervised EM algorithm (ViSEM). Afterwards we modified the approach to target audio-visual speaker detection, yielding an on-line method working in the humanoid robot NAO. In parallel to the work on audio-visual speaker detection, we developed a new approach for audio-visual command recognition. We explored different features and classifiers and confirmed that the use of audio-visual data increases the performance when compared to audio-only and video-only classifiers. Later, we sought the best method using tiny training sets (5-10 samples per class). This is interesting because real systems need to adapt and learn new commands from the user.
Such systems need to be operational with a few examples for general public usage. Finally, we contributed to the field of sound source localization, in the particular case of non-coplanar microphone arrays. This is interesting because the geometry of the array can be arbitrary. Consequently, this opens the door to dynamic microphone arrays that would adapt their geometry to fit particular tasks. Also, the design of commercial systems may be subject to certain constraints for which circular or linear arrays are not suited.
APA, Harvard, Vancouver, ISO, and other styles
33

Xiong, Xin, and Shuang Li. "Energy Audit of a Building : Energy Audit and Saving Analysis." Thesis, University of Gävle, University of Gävle, Department of Technology and Built Environment, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:hig:diva-4617.

Full text
Abstract:

The typical residential building studied is located at the crossing of S. Centralgatan Street and Nedre Akargatan Street in the city of Gävle. It is a quadrangular building of six floors with a yard in the middle. There are 180 apartments of five types in total, and on the first floor there is a kindergarten. The building has district heating and a heat-recovery ventilation system that uses a heat exchanger to reheat incoming air.

Several solutions are used for reducing the heat loss. In the first step, the heat losses and heat gains have been calculated. Several parameters determine the heat loss and heat gain of the whole building, so each parameter in the energy balance equation is extracted and calculated, and the energy balance sheet is then built. Among the heat losses, transmission is 1237 MWh, hot tap water is 332 MWh, mechanical ventilation is 1041 MWh, and natural ventilation is 325.7 MWh. On the heat-gain side, district heating is 1265.7 MWh, the heat pump is 793 MWh, solar radiation is 562 MWh, and internal heating is 315 MWh. In the second step, after analysing the heat-loss data, the improvements are focused on the transmission and hot tap water parts, because the heat losses in those two parts are the largest. In the final step, solutions are discussed to optimise the heating system.

As a conclusion, there are several suggested solutions. The total reduction of heat loss after the adjustments is 163 MWh, accounting for 5.6% of the original heat loss. The heat loss of the building has been reduced from 2935.7 MWh to 2772.7 MWh.
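The figures quoted in the abstract form a closed energy balance, which is easy to verify; a quick arithmetic check (values transcribed from the abstract, all in MWh):

```python
# Energy balance from the abstract (all values in MWh).
losses = {"transmission": 1237, "hot tap water": 332,
          "mechanical ventilation": 1041, "natural ventilation": 325.7}
gains = {"district heating": 1265.7, "heat pump": 793,
         "solar radiation": 562, "internal heating": 315}

total_loss = sum(losses.values())
total_gain = sum(gains.values())
print(round(total_loss, 1), round(total_gain, 1))  # 2935.7 2935.7 (balanced)

reduction = 163
print(round(100 * reduction / total_loss, 1))      # 5.6 (% of original loss)
print(round(total_loss - reduction, 1))            # 2772.7 (remaining loss)
```

The losses and gains both sum to 2935.7 MWh, and the reported 163 MWh reduction and 2772.7 MWh remaining loss are consistent with the totals.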

APA, Harvard, Vancouver, ISO, and other styles
34

Gebru, Israel Dejene. "Analyse audio-visuelle dans le cadre des interactions humaines avec les robots." Thesis, Université Grenoble Alpes (ComUE), 2018. http://www.theses.fr/2018GREAM020.

Full text
Abstract:
Depuis quelques années, on observe un intérêt grandissant pour les interactions homme-robot (HRI), avec pour but de développer des robots pouvant interagir (ou plus généralement communiquer) avec des personnes de manière naturelle. Cela requiert que les robots aient la capacité non seulement de comprendre une conversation et les signaux non verbaux associés à la communication (e.g. le regard et les expressions du visage), mais aussi la capacité de comprendre les dynamiques des interactions sociales, e.g. détecter et identifier les personnes présentes, savoir où elles sont, les suivre au cours de la conversation, savoir qui est le locuteur, à qui il parle, mais aussi qui regarde qui, etc. Tout cela nécessite que les robots aient des capacités de perception multimodales pour détecter et intégrer de manière significative les informations provenant de leurs multiples canaux sensoriels. Dans cette thèse, nous nous concentrons sur les entrées sensorielles audio-visuelles du robot, composées de (multiples) microphones et de caméras vidéo, et sur trois tâches associées à la perception des robots, à savoir : (P1) la localisation de plusieurs locuteurs, (P2) la localisation et le suivi de plusieurs personnes, et (P3) la journalisation de locuteurs. La majorité des travaux existants en traitement du signal et en vision par ordinateur abordent ces problèmes en utilisant uniquement soit des signaux audio, soit des informations visuelles. Cependant, dans cette thèse, nous les abordons à travers la fusion des informations audio et visuelles recueillies par deux microphones et une caméra vidéo. Notre objectif est d'exploiter la nature complémentaire des modalités auditive et visuelle dans l'espoir d'améliorer de manière significative la robustesse et la performance par rapport aux systèmes utilisant une seule modalité.
De plus, les trois problèmes sont abordés en considérant des scénarios d'interaction Homme-Robot difficiles comme, par exemple, un robot engagé dans une interaction avec un nombre variable de participants, qui peuvent parler en même temps et qui peuvent se déplacer autour de la scène et tourner la tête / faire face aux autres participants plutôt qu’au robot
In recent years, there has been a growing interest in human-robot interaction (HRI), with the aim of enabling robots to naturally interact and communicate with humans. Natural interaction implies that robots not only need to understand speech and non-verbal communication cues such as body gesture, gaze, or facial expressions, but they also need to understand the dynamics of the social interplay, e.g., find people in the environment, distinguish between different people, track them through the physical space, parse their actions and activity, estimate their engagement, identify who is speaking, who speaks to whom, etc. All this requires robots to have multimodal perception skills to meaningfully detect and integrate information from their multiple sensory channels. In this thesis, we focus on the robot's audio-visual sensory inputs, consisting of (multiple) microphones and video cameras. Among the different addressable perception tasks, in this thesis we explore three, namely: (P1) multiple-speaker localization, (P2) multiple-person location tracking, and (P3) speaker diarization. The majority of existing works in signal processing and computer vision address these problems by utilizing audio signals alone, or visual information only. In this thesis, however, we address them via fusion of the audio and visual information gathered by two microphones and one video camera. Our goal is to exploit the complementary nature of the audio and visual modalities in the hope of attaining significant improvements in robustness and performance over systems that use a single modality. Moreover, the three problems are addressed considering challenging HRI scenarios, e.g., a robot engaged in a multi-party interaction with a varying number of participants, who may speak at the same time and may move around the scene, turning their heads/faces towards the other participants rather than facing the robot.
APA, Harvard, Vancouver, ISO, and other styles
35

Phillips, Nicola Jane. "Audio-visual scene analysis : attending to music in film." Thesis, University of Cambridge, 2000. https://www.repository.cam.ac.uk/handle/1810/251745.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Dietrich, Kelly. "Analysis of talker characteristics in audio-visual speech integration." Connect to resource, 2008. http://hdl.handle.net/1811/32149.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Gower, Ephraim. "Mathematical Analysis and Audio Applications in Blind Signal Decomposition." Thesis, University of Essex, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.520088.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Swartz, Jordan. "Audit Office Growth Under Analysis." Thesis, The University of Arizona, 2015. http://hdl.handle.net/10150/578932.

Full text
Abstract:
This study researches whether potential indicators of auditor dependence affect the negative relationship between audit office size and audit quality in growing audit offices. I consider high market concentration and a low average number of public clients available to auditors in a city as indicators of dependence, and an auditor being a market specialist as an indicator of independence. Although the majority of my tests proved inconclusive, I do find some evidence indicating that there may be more to the growth effect than workload balancing. I provide some evidence supporting the hypothesis that the growth effect would be weaker (stronger) when there are more (fewer) potential clients in a local audit market.
APA, Harvard, Vancouver, ISO, and other styles
39

Best, Peter J. "Machine-independent audit trail analysis." Thesis, Queensland University of Technology, 1994.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
40

Tore, Gian Maria. "Expérience et figuration au cinéma : pour une sémiotique audio-visuelle." Limoges, 2007. http://www.theses.fr/2007LIMO2007.

Full text
Abstract:
This thesis proposes a semiotics of cinema. It confronts today's structural semiotics with current cognitivist and historicist studies, and thereby with a wide range of questions, from perception to "auctoriality", from subjectivity to the "history" of cinematographic art. Through theoretical discussions and then, above all, very detailed film analyses, it proposes a theory of the cinematic medium, understood as an approach to the ways in which audiovisual experience is controlled by a praxis anchored in a specific apparatus of textualisation, namely cinema. The analyses concern: the opening of R. Bresson's "Pickpocket", the finale of A. Penn's "Bonnie and Clyde", the film "Walden - Notes, Sketches and Diaries" by J. Mekas, and the two films by A. Sokurov, "Mère et fils" and "L'Arche russe". They study, respectively and among other things: the poetic use of cinema and spatialisation ("Pickpocket"); the slow-motion technique, temporalisation and passional effects ("Bonnie and Clyde"); the experimental genre, "rhythm" and forms of life ("Walden"); and the question of the author, "composition" and cinematic praxis (Sokurov). Through these analyses emerges the possibility, proper to semiotics, of creating paradigms: of confronting different questions on a single text-object, and of linking different objects through the same questions.
APA, Harvard, Vancouver, ISO, and other styles
41

Lecomte, Sébastien. "Classification partiellement supervisée par SVM : application à la détection d’événements en surveillance audio." Thesis, Troyes, 2013. http://www.theses.fr/2013TROY0031/document.

Full text
Abstract:
Cette thèse s’intéresse aux méthodes de classification par Machines à Vecteurs de Support (SVM) partiellement supervisées permettant la détection de nouveauté (One-Class SVM). Celles-ci ont été étudiées dans le but de réaliser la détection d’événements audio anormaux pour la surveillance d’infrastructures publiques, en particulier dans les transports. Dans ce contexte, l’hypothèse « ambiance normale » est relativement bien connue (même si les signaux correspondants peuvent être très non stationnaires). En revanche, tout signal « anormal » doit pouvoir être détecté et, si possible, regroupé avec les signaux de même nature. Ainsi, un système de référence s’appuyant sur une modélisation unique de l’ambiance normale est présenté, puis nous proposons d’utiliser plusieurs SVM de type One Class mis en concurrence. La masse de données à traiter a impliqué l’étude de solveurs adaptés à ces problèmes. Les algorithmes devant fonctionner en temps réel, nous avons également investi le terrain de l’algorithmie pour proposer des solveurs capables de démarrer à chaud. Par l’étude de ces solveurs, nous proposons une formulation unifiée des problèmes à une et deux classes, avec et sans biais. Les approches proposées ont été validées sur un ensemble de signaux réels. Par ailleurs, un démonstrateur intégrant la détection d’événements anormaux pour la surveillance de station de métro en temps réel a également été présenté dans le cadre du projet Européen VANAHEIM
This thesis addresses partially supervised Support Vector Machine methods for novelty detection (One-Class SVM). These have been studied with the aim of detecting abnormal audio events for the surveillance of public infrastructures, in particular public transportation systems. In this context, the null hypothesis ("normal" audio signals) is relatively well known (even though the corresponding signals can be notably non-stationary). Conversely, every "abnormal" signal should be detected and, if possible, clustered with similar signals. Thus, a reference system based on a single model of normal signals is presented, and we then propose to use several concurrent One-Class SVMs to cluster new data. Given the amount of data to process, dedicated solvers have been studied. Since the algorithms must run in real time, we have also investigated solvers with warm-start capabilities. Through the study of these solvers, we propose a unified formulation of the one-class and two-class problems, with and without bias. The proposed approach has been validated on a database of real signals. The whole process, applied to the monitoring of a subway station, was presented during the final review of the European Project VANAHEIM.
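One-Class SVM novelty detection of the kind described can be sketched with scikit-learn's `OneClassSVM` (an illustrative toy on 2-D features, not the thesis's audio pipeline or its custom warm-start solvers):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# "Normal ambience": feature vectors drawn from a single cluster.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Train only on normal data; nu upper-bounds the fraction of
# training points treated as outliers.
detector = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(normal)

# Predict: +1 = consistent with the normal model, -1 = novelty.
print(detector.predict(np.array([[0.1, -0.2]])))   # [1]
print(detector.predict(np.array([[8.0, 8.0]])))    # [-1]
```

In the surveillance setting, the feature vectors would be audio descriptors computed on short frames, and, as in the thesis, several concurrent one-class models can be maintained so that detected novelties are grouped with signals of the same nature.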
APA, Harvard, Vancouver, ISO, and other styles
42

Klevhamre, Benny, and Peter Nilsson. "Further Development of an Audio Analyzer." Thesis, Linköping University, Department of Electrical Engineering, 2002. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-1155.

Full text
Abstract:

A part of an audio analyzer has been developed and implemented as an application in the hardware description language VHDL. This part was then programmed into a PLD device on a circuit board used for audio tests on mobile telephones at Flextronics. The application converts data, making it possible to send information between the telephone and different measuring instruments. The application consists of two older applications. One of them has been fully implemented. In the other, the cause of its erroneous output, in the form of unwanted noise, remains to be found. The work was stopped when the deadline was reached.



43

Aissa-El-Bey, Abdeldjalil. "Séparation aveugle de sources audio dans le contexte automobile." Paris, ENST, 2007. http://www.theses.fr/2007ENST0011.

Full text
Abstract:
This thesis is part of a study on audio source separation in a reverberant environment. Within this study, we showed how to perform audio source separation using a method based on modal decomposition algorithms (EMD or ESPRIT). The advantages of this approach are that it handles both instantaneous and convolutive mixtures and, in particular, the under-determined case. Also within this thesis, we showed how to separate instantaneous and convolutive mixtures of audio sources in the under-determined case by exploiting the sparsity of audio signals in the time-frequency domain. We propose two methods using different time-frequency transforms: the first uses quadratic time-frequency distributions, the second the short-time Fourier transform. Both methods assume that the sources are disjoint in the time-frequency domain, i.e. that a single source is present at each time-frequency point. We then propose to relax this constraint by assuming that the sources are not necessarily disjoint in the time-frequency domain. We also exploited the sparsity of audio signals in the time domain, proposing an iterative method based on a relative-gradient technique that minimizes a contrast function based on the Lp norm. Finally, we considered an iterative source separation method using second-order statistics.
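The time-frequency disjointness assumption underlying both methods can be illustrated with a toy example: two tones occupying different frequency bins are recovered from their one-channel mixture by binary masking of its spectrum. The mask here is an oracle built from the known sources, standing in for the blind estimation developed in the thesis; frequencies and sizes are arbitrary choices.

```python
# Toy illustration of disjointness-based separation: a single spectral
# frame, with each bin attributed to whichever source dominates it.
import numpy as np

fs, n = 8000, 1024
t = np.arange(n) / fs
s1 = np.sin(2 * np.pi * 500 * t)    # source 1: 500 Hz tone (exact bin 64)
s2 = np.sin(2 * np.pi * 1500 * t)   # source 2: 1500 Hz tone (exact bin 192)
mix = s1 + s2

# Single-frame spectra (a stand-in for one STFT column).
S1, S2, M = (np.fft.rfft(x) for x in (s1, s2, mix))

# Oracle binary mask: each bin belongs to the dominant source.
mask1 = np.abs(S1) > np.abs(S2)
est1 = np.fft.irfft(np.where(mask1, M, 0), n=n)
est2 = np.fft.irfft(np.where(~mask1, M, 0), n=n)

err1 = np.linalg.norm(est1 - s1) / np.linalg.norm(s1)
err2 = np.linalg.norm(est2 - s2) / np.linalg.norm(s2)
print(f"relative errors: {err1:.2e}, {err2:.2e}")
```

Because the two sources occupy disjoint bins, the masked reconstructions match the originals almost exactly; overlapping sources are precisely the case in which the thesis relaxes this assumption.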
44

Aïssa-el-Bey, Abdeldjalil. "Séparation aveugle de sources audio dans le contexte automobile /." Paris : École nationale supérieure des télécommunications, 2007. http://catalogue.bnf.fr/ark:/12148/cb41198300d.

Full text
45

Comer, K. Allen. "A wavelet-based technique for reducing noise in audio signals." Thesis, This resource online, 1996. http://scholar.lib.vt.edu/theses/available/etd-06082009-170933/.

Full text
46

Hlísta, Juraj. "Reaktivní audit." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2010. http://www.nusl.cz/ntk/nusl-237105.

Full text
Abstract:
The thesis deals with the proposal and implementation of an extension for the audit system in Linux: the reactive audit. It brings new functionality to auditing in the form of triggering reactions to certain audit events. The reactive audit is implemented within an audit plugin, and its use is optional. Additionally, there is another plugin which stores certain audit events and provides time-related statistics to the first plugin. As a result, the reactive-audit mechanism not only reacts to audit events, it is also able to reveal anomalies from the statistical information and set off the appropriate reactions. It is a fairly general mechanism that can be useful in various situations.
47

Sigtia, Siddharth. "Neural networks for analysing music and environmental audio." Thesis, Queen Mary, University of London, 2017. http://qmro.qmul.ac.uk/xmlui/handle/123456789/24741.

Full text
Abstract:
In this thesis, we consider the analysis of music and environmental audio recordings with neural networks. Recently, neural networks have been shown to be an effective family of models for speech recognition, computer vision, natural language processing and a number of other statistical modelling problems. The composite layer-wise structure of neural networks allows for flexible model design, where prior knowledge about the domain of application can be used to inform the design and architecture of the neural network models. Additionally, it has been shown that when trained on sufficient quantities of data, neural networks can be directly applied to low-level features to learn mappings to high-level concepts like phonemes in speech and object classes in computer vision. In this thesis we investigate whether neural network models can be usefully applied to processing music and environmental audio. With regard to music signal analysis, we investigate two different problems. The first problem, automatic music transcription, aims to identify the score or the sequence of musical notes that comprise an audio recording. We also consider the problem of automatic chord transcription, where the aim is to identify the sequence of chords in a given audio recording. For both problems, we design neural network acoustic models which are applied to low-level time-frequency features in order to detect the presence of notes or chords. Our results demonstrate that the neural network acoustic models perform similarly to state-of-the-art acoustic models, without the need for any feature engineering. The networks are able to learn complex transformations from time-frequency features to the desired outputs, given sufficient amounts of training data. Additionally, we use recurrent neural networks to model the temporal structure of sequences of notes or chords, similar to language modelling in speech.
Our results demonstrate that the combination of the acoustic and language model predictions yields improved performance over the acoustic models alone. We also observe that convolutional neural networks yield better performance compared to other neural network architectures for acoustic modelling. For the analysis of environmental audio recordings, we consider the problem of acoustic event detection. Acoustic event detection has a similar structure to automatic music and chord transcription, where the system is required to output the correct sequence of semantic labels along with onset and offset times. We compare the performance of neural network architectures against Gaussian mixture models and support vector machines. In order to account for the fact that such systems are typically deployed on embedded devices, we compare performance as a function of the computational cost of each model. We evaluate the models on two large datasets of real-world recordings of baby cries and smoke alarms. Our results demonstrate that the neural networks clearly outperform the other models, and that they are able to do so without incurring a heavy computation cost.
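The acoustic-model idea above, mapping a column of time-frequency features to independent note activations, can be sketched as a tiny multi-label network. The layer sizes, the 88-pitch piano range, and the untrained random weights are illustrative assumptions, not the thesis's trained convolutional or recurrent models.

```python
# Schematic acoustic model: one frame of time-frequency features in,
# one independent sigmoid activation per pitch out (multi-label output).
import numpy as np

rng = np.random.default_rng(1)
n_bins, n_hidden, n_pitches = 229, 128, 88  # illustrative dimensions

W1 = rng.normal(0, 0.05, (n_hidden, n_bins))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.05, (n_pitches, n_hidden))
b2 = np.zeros(n_pitches)

def acoustic_model(frame):
    """One frame of features -> per-pitch activation probabilities."""
    h = np.maximum(0.0, W1 @ frame + b1)        # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2))) # sigmoid per pitch

frame = rng.random(n_bins)
p = acoustic_model(frame)
print(p.shape, float(p.min()), float(p.max()))
```

Thresholding these activations per frame yields a piano-roll; the language-model stage discussed above would then smooth the resulting note sequences over time.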
48

Fourer, Dominique. "Approche informée pour l’analyse du son et de la musique." Thesis, Bordeaux 1, 2013. http://www.theses.fr/2013BOR14973/document.

Full text
Abstract:
In the field of audio signal processing, analysis is an essential step that enables understanding of, and interaction with, existing signals. Indeed, the quality of signals obtained by transformation or by synthesis depends on the accuracy of the estimated model parameters. However, theoretical limits exist and show that the best accuracy a classic estimator can reach may be insufficient for the most demanding applications (e.g. active listening of music). The work developed in this thesis revisits well-known audio analysis problems such as spectral analysis, automatic music transcription, and audio source separation using a novel "informed" approach. This approach takes advantage of the configuration of today's music studios, which control the processing chain before the mixing stage: the parameters of the elementary signals that compose a mixture are known before the mixing process. With the tools proposed in this thesis, minimal side information is computed and transmitted along with the mixture signal, allowing certain transformations of the mixture while guaranteeing the quality level. When compatibility with existing audio formats is required, this side information is embedded inaudibly into the mixture itself using audio watermarking. This work presents several theoretical and practical aspects of audio signal processing in which we show that combining an estimator with side information outperforms the usual approaches, such as non-informed estimation or pure coding.
49

Guo, Ziyuan. "Objective Audio Quality Assessment Based on Spectro-Temporal Modulation Analysis." Thesis, KTH, Ljud- och bildbehandling, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-91847.

Full text
Abstract:
Objective audio quality assessment is an interdisciplinary research area that incorporates audiology and machine learning. Although much work has been done on the machine learning aspect, the audiology aspect also deserves investigation. This thesis proposes a non-intrusive audio quality assessment algorithm based on an auditory model that simulates the human auditory system. The auditory model rests on spectro-temporal modulation analysis of the spectrogram, which has been shown to be effective in predicting the neural activity of the human auditory cortex. The performance of an implementation of the algorithm demonstrates the effectiveness of spectro-temporal modulation analysis in audio quality assessment.
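As a rough sketch of the spectro-temporal modulation analysis mentioned above, a 2-D Fourier transform of a (log-)spectrogram exposes energy as a function of temporal modulation rate and spectral modulation scale. The STFT parameters and the 4 Hz amplitude-modulated test tone below are arbitrary choices of ours, not the thesis's auditory model.

```python
# Sketch: modulation spectrum as the 2-D FFT of a log-spectrogram.
import numpy as np

fs = 8000
t = np.arange(fs) / fs
# Test signal: a 1 kHz tone with 4 Hz amplitude modulation.
x = (1 + 0.8 * np.cos(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)

# Magnitude spectrogram from overlapping windowed frames.
frame, hop = 256, 128
frames = np.stack([x[i:i + frame] * np.hanning(frame)
                   for i in range(0, len(x) - frame, hop)])
spec = np.abs(np.fft.rfft(frames, axis=1)).T   # (freq bins, time frames)

# Modulation spectrum: 2-D FFT over the frequency and time axes.
mod = np.abs(np.fft.fft2(np.log1p(spec)))
rates = np.fft.fftfreq(spec.shape[1], d=hop / fs)  # temporal rates in Hz

# The dominant nonzero temporal modulation rate should sit near 4 Hz.
profile = mod.sum(axis=0)
profile[0] = 0.0  # ignore the DC (unmodulated) component
peak_rate = abs(rates[int(np.argmax(profile))])
print(f"peak temporal modulation rate: {peak_rate:.2f} Hz")
```

A full auditory model would replace the linear-frequency spectrogram with a cochlear-like filter bank and localized 2-D modulation filters, but the rate/scale decomposition principle is the same.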
50

Boyes, Graham. "Dictionary-based analysis/synthesis and structured representations of musical audio." Thesis, McGill University, 2012. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=106507.

Full text
Abstract:
In the representation of musical audio, it is common to favour either a signal or a symbol interpretation; mid-level representation is an emerging topic. In this thesis we investigate the perspective of structured, intermediate representations through an integration of theoretical aspects related to separable sound objects, dictionary-based methods of signal analysis, and object-oriented programming. In contrast to examples in the literature that approach an intermediate representation from the signal level, we orient our formulation towards the symbolic level. This methodology is applied to both the specification of analytical techniques and the design of a software framework. Experimental results demonstrate that our method achieves a lower Itakura-Saito distance, a perceptually motivated measure of spectral dissimilarity, than a generic model, and that our structured representation can be applied to visualization and agglomerative post-processing, as well as to musical composition.
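The Itakura-Saito distance used in the evaluation above has a direct closed form for two power spectra. A minimal sketch (the epsilon guard and the example spectra are our own additions, not the thesis's data):

```python
# Itakura-Saito distance between two power spectra p and q.
import numpy as np

def itakura_saito(p, q, eps=1e-12):
    """d_IS(p, q) = sum over bins of (p/q - log(p/q) - 1)."""
    r = (np.asarray(p) + eps) / (np.asarray(q) + eps)
    return float(np.sum(r - np.log(r) - 1.0))

p = np.array([1.0, 4.0, 2.0, 0.5])
print(itakura_saito(p, p))       # identical spectra -> 0.0
print(itakura_saito(p, 2 * p))   # mismatch -> strictly positive
```

Note that the measure is asymmetric (d(p, q) differs from d(q, p)) and scale-sensitive, which is part of its perceptual motivation for comparing spectral envelopes.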