Dissertations / Theses on the topic 'Acoustic Scene Analysis'

To see the other types of publications on this topic, follow the link: Acoustic Scene Analysis.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 20 dissertations / theses for your research on the topic 'Acoustic Scene Analysis.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Kudo, Hiroaki, Jinji Chen, and Noboru Ohnishi. "Scene Analysis by Clues from the Acoustic Signals." INTELLIGENT MEDIA INTEGRATION NAGOYA UNIVERSITY / COE, 2004. http://hdl.handle.net/2237/10426.

Full text
2

Ford, Logan H. "Large-scale acoustic scene analysis with deep residual networks." Thesis, Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/123026.

Full text
Abstract:
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2019
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 63-66).
Many of the recent advances in audio event detection, particularly on the AudioSet dataset, have focused on improving performance using the released embeddings produced by a pre-trained model. In this work, we instead study the task of training a multi-label event classifier directly from the audio recordings of AudioSet. Using the audio recordings, we are not only able to reproduce results from prior work, but also confirm the improvements brought by other proposed additions, such as an attention module. Moreover, by training the embedding network jointly with these additions, we achieve a mean Average Precision (mAP) of 0.392 and an area under the ROC curve (AUC) of 0.971, surpassing the state of the art without transfer learning from a large dataset. We also analyze the output activations of the network and find that the models are able to localize audio events when a finer time resolution is needed. In addition, we use this model to explore multimodal learning, transfer learning, and real-time sound event detection tasks.
by Logan H. Ford.
M. Eng.
M.Eng. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
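To make the training setup described in this abstract concrete, here is a minimal sketch of a multi-label audio tagger with an attention pooling module, evaluated with mAP and AUC. It is an illustration only: the backbone, layer sizes, mel front end, and class count are assumptions, not the thesis implementation.

```python
# Minimal sketch (assumptions, not the thesis code): a small CNN over log-mel
# frames with class-wise attention pooling, producing clip-level multi-label scores.
import torch
import torch.nn as nn
from sklearn.metrics import average_precision_score, roc_auc_score

class AttentionTagger(nn.Module):
    def __init__(self, n_mels=64, n_classes=527):
        super().__init__()
        self.cnn = nn.Sequential(                      # stand-in for a residual backbone
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d((None, 1)),           # average over mel bins, keep time frames
        )
        self.cla = nn.Linear(64, n_classes)            # per-frame class probabilities
        self.att = nn.Linear(64, n_classes)            # per-frame, per-class attention logits

    def forward(self, logmel):                         # logmel: (batch, time, n_mels)
        h = self.cnn(logmel.unsqueeze(1))              # -> (batch, 64, time', 1)
        h = h.squeeze(-1).transpose(1, 2)              # -> (batch, time', 64)
        w = torch.softmax(self.att(h), dim=1)          # attention weights over time, per class
        p = torch.sigmoid(self.cla(h))                 # per-frame class probabilities
        return (w * p).sum(dim=1)                      # clip-level multi-label scores

def evaluate(scores, labels):
    """scores, labels: arrays of shape (n_clips, n_classes); returns (mAP, AUC)."""
    return (average_precision_score(labels, scores, average="macro"),
            roc_auc_score(labels, scores, average="macro"))
```

Training would then minimize a binary cross-entropy loss (e.g. `nn.BCELoss`) against the multi-hot clip labels.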
3

Teutsch, Heinz. "Wavefield decomposition using microphone arrays and its application to acoustic scene analysis." [S.l.] : [s.n.], 2006. http://deposit.ddb.de/cgi-bin/dokserv?idn=97902806X.

Full text
4

McMullan, Amanda R. "Electroencephalographic measures of auditory perception in dynamic acoustic environments." Thesis, Lethbridge, Alta. : University of Lethbridge, Dept. of Neuroscience, c2013, 2013. http://hdl.handle.net/10133/3354.

Full text
Abstract:
We are capable of effortlessly parsing a complex scene presented to us. In order to do this, we must segregate objects from each other and from the background. While this process has been extensively studied in vision science, it remains relatively less understood in auditory science. This thesis sought to characterize the neuroelectric correlates of auditory scene analysis using electroencephalography. Chapter 2 determined components evoked by first-order energy boundaries and second-order pitch boundaries. Chapter 3 determined components evoked by first-order and second-order discontinuous motion boundaries. Both of these chapters focused on analysis of event-related potential (ERP) waveforms and time-frequency analysis. In addition, these chapters investigated the contralateral nature of a negative ERP component. These results extend the current knowledge of auditory scene analysis by providing a starting point for discussing and characterizing first-order and second-order boundaries in an auditory scene.
x, 90 leaves : col. ill. ; 29 cm
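For readers unfamiliar with the ERP analysis mentioned in this abstract, the core step is averaging EEG epochs time-locked to the auditory boundary after baseline correction. A minimal sketch with assumed data shapes (not the thesis pipeline):

```python
# Minimal sketch (assumed data shapes, not the thesis pipeline): baseline-correct
# and average EEG epochs time-locked to an auditory boundary to obtain an ERP.
import numpy as np

def erp(epochs, times, baseline=(-0.2, 0.0)):
    """epochs: (n_trials, n_samples) single-electrode EEG, in volts.
    times: (n_samples,) time in seconds relative to boundary onset."""
    mask = (times >= baseline[0]) & (times < baseline[1])
    corrected = epochs - epochs[:, mask].mean(axis=1, keepdims=True)  # remove pre-stimulus offset
    return corrected.mean(axis=0)                                     # average across trials

# Synthetic example: 200 trials, 1-s epochs at 500 Hz, with a small "component" near 150 ms.
fs = 500
times = np.arange(-0.2, 0.8, 1 / fs)
epochs = 5e-6 * np.random.randn(200, times.size)            # ~5 µV background noise
epochs += 2e-6 * np.exp(-((times - 0.15) ** 2) / 0.002)     # injected evoked response
waveform = erp(epochs, times)
```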
5

Narayanan, Arun. "Computational auditory scene analysis and robust automatic speech recognition." The Ohio State University, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=osu1401460288.

Full text
6

Carlo, Diego Di. "Echo-aware signal processing for audio scene analysis." Thesis, Rennes 1, 2020. http://www.theses.fr/2020REN1S075.

Full text
Abstract:
Most audio signal processing methods regard reverberation, and in particular acoustic echoes, as a nuisance. However, echoes convey important spatial and semantic information about sound sources, and echo-aware methods that try to exploit them have recently emerged. In this work we focus on two directions. First, we study how to estimate acoustic echoes blindly from microphone recordings. Two approaches are proposed, one building on the framework of continuous dictionaries, the other on recent deep learning techniques. Second, we focus on extending existing audio scene analysis methods to their echo-aware forms: the multichannel NMF framework for audio source separation, the SRP-PHAT localization method, and the MVDR beamformer for speech enhancement are all extended to account for echoes. These applications show how a simple echo model can lead to improved performance.
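As context for the beamforming application named in this abstract, here is a minimal numpy sketch of the conventional MVDR beamformer that the thesis extends; the anechoic plane-wave steering model and toy covariance below are assumptions, not the echo-aware formulation.

```python
# Minimal numpy sketch of a conventional (not echo-aware) narrowband MVDR beamformer:
# w = R^{-1} d / (d^H R^{-1} d), applied per frequency bin. Geometry and covariance
# below are toy assumptions for illustration.
import numpy as np

def mvdr_weights(noise_cov, steering):
    """noise_cov: (M, M) noise spatial covariance at one bin; steering: (M,) vector."""
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)

def apply_beamformer(weights, stft_frames):
    """stft_frames: (M, n_frames) microphone STFT values at the same bin."""
    return weights.conj() @ stft_frames            # (n_frames,) enhanced output

# Example: 4 microphones, far-field steering vector for hypothetical inter-mic delays.
f, c = 1000.0, 343.0
delays = np.array([0.0, 1e-4, 2e-4, 3e-4])         # seconds, hypothetical array geometry
d = np.exp(-2j * np.pi * f * delays)
R = np.eye(4) + 0.1 * np.ones((4, 4))              # toy noise covariance
w = mvdr_weights(R, d)
```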
7

Deleforge, Antoine. "Acoustic Space Mapping : A Machine Learning Approach to Sound Source Separation and Localization." Thesis, Grenoble, 2013. http://www.theses.fr/2013GRENM033/document.

Full text
Abstract:
In this thesis, we address the long-studied problem of binaural (two-microphone) sound source separation and localization through supervised learning. To this end, we develop a new paradigm referred to as acoustic space mapping, at the crossroads of binaural perception, robot hearing, audio signal processing, and machine learning. The proposed approach consists in learning a link between the auditory cues perceived by the system and the position of the emitting sound source in another modality of the system, such as the visual space or the motor space. We propose new experimental protocols to automatically gather large training sets that associate such data. The obtained datasets are then used to reveal fundamental intrinsic properties of acoustic spaces and lead to the development of a general family of probabilistic models for locally linear mapping from a high-dimensional space to a low-dimensional one. We show that these models unify several existing regression and dimensionality reduction techniques, while encompassing a large number of new models that generalize previous ones. The properties and inference of these models are detailed in depth, and the clear advantage of the proposed methods over state-of-the-art techniques is established on several space mapping applications, beyond the scope of auditory scene analysis. We then show how the proposed methods can be probabilistically extended to tackle the well-known cocktail party problem, i.e., localizing one or several sound sources emitting simultaneously in a real-world environment and separating the mixed signals. We show that the resulting techniques perform these tasks with unequaled accuracy. This demonstrates the important role of learning and puts forward the acoustic space mapping paradigm as a promising tool for robustly addressing the most challenging problems in computational binaural audition.
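The paradigm can be illustrated with a toy supervised regressor from binaural cues to source azimuth; the feature extraction and nearest-neighbour regressor below are placeholders for illustration, not the thesis's probabilistic locally-linear models.

```python
# Toy illustration of acoustic space mapping (not the thesis's probabilistic
# locally-linear models): learn a mapping from high-dimensional binaural cues to
# source azimuth from labelled examples. Features and regressor are assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def binaural_cues(left, right, n_fft=512):
    """Interaural level and phase differences for one frame of a binaural recording."""
    L, R = np.fft.rfft(left, n_fft), np.fft.rfft(right, n_fft)
    ild = 20 * np.log10((np.abs(L) + 1e-12) / (np.abs(R) + 1e-12))
    ipd = np.angle(L * np.conj(R))
    return np.concatenate([ild, np.cos(ipd), np.sin(ipd)])

def train_mapping(cue_matrix, azimuths):
    """cue_matrix: (n_examples, n_cues); azimuths: (n_examples,) in degrees."""
    return KNeighborsRegressor(n_neighbors=5).fit(cue_matrix, azimuths)

# azimuth_hat = train_mapping(X_train, az_train).predict(binaural_cues(l, r)[None, :])
```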
8

Mouterde, Solveig. "Long-range discrimination of individual vocal signatures by a songbird : from propagation constraints to neural substrate." Thesis, Saint-Etienne, 2014. http://www.theses.fr/2014STET4012/document.

Full text
Abstract:
One of the biggest challenges in communication systems is that the information encoded by the emitter is always modified before reaching the receiver, who has to process this altered information in order to recover the intended message. In acoustic communication in particular, the transmission of sound through the environment is a major source of signal degradation, caused by attenuation, absorption, and reflections, all of which decrease the signal relative to the background noise. How animals exchange information in spite of such constraining conditions has been the subject of many studies, focusing either on the emitter or on the receiver. However, a more integrated approach to auditory scene analysis has seldom been used, and it is needed to address this process in all its complexity. The goal of my research was to use a transversal approach to study how birds adapt to the constraints of long-distance communication, by investigating information coding at the emitter's level, the propagation-induced degradation of the acoustic signal, and the discrimination of this degraded information by the receiver, at both the behavioral and neural levels. Taking into account the everyday issues faced by animals in their natural environment, and using stimuli and paradigms that reflect the behavioral relevance of these challenges, has been the cornerstone of my approach. Focusing on the information about individual identity in the distance calls of zebra finches (Taeniopygia guttata), I investigated how the individual vocal signature is encoded, degraded, and finally discriminated, from the emitter to the receiver. This study shows that the individual signature of zebra finches is very resistant to propagation-induced degradation, and that the most individualized acoustic parameters vary with distance. Testing female birds in operant conditioning experiments, I showed that they are experts at discriminating between the degraded vocal signatures of two males, and that they can improve substantially when they can train over increasing distances. Finally, I showed that this impressive discrimination ability also exists at the neural level: we found a population of neurons in the avian auditory forebrain that discriminate individual voices at various degrees of propagation-induced degradation, without prior familiarization or training. The finding of such high-level auditory processing, in the primary auditory cortex, opens a new range of investigations at the interface of neural processing and behavior.
9

Teki, S. "Cognitive analysis of complex acoustic scenes." Thesis, University College London (University of London), 2013. http://discovery.ucl.ac.uk/1413017/.

Full text
Abstract:
Natural auditory scenes consist of a rich variety of temporally overlapping sounds that originate from multiple sources and locations and are characterized by distinct acoustic features. It is an important biological task to analyze such complex scenes and extract sounds of interest. The thesis addresses this question, also known as the “cocktail party problem”, by developing an approach based on the analysis of a novel stochastic signal, in contrast to the deterministic narrowband signals used in previous work. This low-level signal, known as the Stochastic Figure-Ground (SFG) stimulus, captures the spectrotemporal complexity of natural sound scenes and enables parametric control of stimulus features. In a series of experiments based on this stimulus, I have investigated specific behavioural and neural correlates of human auditory figure-ground segregation. This thesis is presented in seven chapters. Chapter 1 reviews key aspects of auditory processing and existing models of auditory segregation. Chapter 2 presents the principles of the techniques used, including psychophysics, modeling, functional Magnetic Resonance Imaging (fMRI) and Magnetoencephalography (MEG). Experimental work is presented in the following chapters and covers figure-ground segregation behaviour (Chapter 3), modeling of the SFG stimulus based on a temporal coherence model of auditory perceptual organization (Chapter 4), and analysis of brain activity related to detection of salient targets in the SFG stimulus using fMRI (Chapter 5) and MEG (Chapter 6). Finally, Chapter 7 concludes with a general discussion of the results and future directions for research. Overall, this body of work emphasizes the use of stochastic signals for auditory scene analysis and demonstrates an automatic, highly robust segregation mechanism in the auditory system that is sensitive to temporal correlations across frequency channels.
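A rough idea of how an SFG-like stimulus can be generated follows; all parameter values are illustrative assumptions, not the exact stimulus used in the thesis.

```python
# Minimal sketch of an SFG-like stimulus: brief chords of random pure tones, with a
# fixed set of "figure" frequencies repeated across chords. All parameter values
# are illustrative assumptions, not the exact stimulus used in the thesis.
import numpy as np

def sfg_stimulus(fs=16000, chord_dur=0.05, n_chords=40, n_bg=10, n_fig=4,
                 fig_onset=20, fmin=200.0, fmax=7000.0, seed=0):
    rng = np.random.default_rng(seed)
    pool = np.geomspace(fmin, fmax, 120)                      # candidate frequencies
    fig_freqs = rng.choice(pool, n_fig, replace=False)        # temporally coherent "figure"
    t = np.arange(int(fs * chord_dur)) / fs
    chords = []
    for k in range(n_chords):
        freqs = list(rng.choice(pool, n_bg, replace=False))   # random background tones
        if k >= fig_onset:
            freqs += list(fig_freqs)                           # figure repeats across chords
        chord = sum(np.sin(2 * np.pi * f * t) for f in freqs)
        chords.append(chord / len(freqs))
    return np.concatenate(chords)

x = sfg_stimulus()   # the figure emerges halfway through the 2-second stimulus
```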
10

Wang, Yuxuan. "Supervised Speech Separation Using Deep Neural Networks." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1426366690.

Full text
11

Chen, Jitong. "On Generalization of Supervised Speech Separation." The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1492038295603502.

Full text
12

Huet, Moïra-Phoebé. "Voice mixology at a cocktail party : Combining behavioural and neural tracking for speech segregation." Thesis, Lyon, 2020. http://www.theses.fr/2020LYSEI070.

Full text
Abstract:
It is not always easy to follow a conversation in a noisy environment. In order to discriminate two speakers, we have to mobilize many perceptual and cognitive processes to maintain attention on a target voice and avoid shifting attention to the background. In this dissertation, the processes underlying speech segregation are explored through behavioural and neurophysiological experiments. In a preliminary phase, the development of an intelligibility task, the Long-SWoRD test, is introduced. This protocol allows participants to benefit from cognitive resources, such as linguistic knowledge, to separate two talkers in a realistic listening environment. The similarity between the two speakers, and thus by extension the difficulty of the task, was controlled by manipulating the acoustic parameters of the target and masker voices. In a second phase, the performance of participants on this task is evaluated through three behavioural and neurophysiological (EEG) studies. Behavioural results are consistent with the literature and show that the distance between voices, spatialisation cues, and semantic information influence participants' performance. Neurophysiological results, analysed with temporal response functions (TRFs), indicate that the neural representations of the two speakers differ according to the difficulty of the listening conditions. In addition, these representations are built more quickly when the voices are easily distinguishable. It is often presumed in the literature that participants' attention remains constantly on the same voice. The experimental protocol presented in this work makes it possible to retrospectively infer when participants were listening to each voice. Therefore, in a third stage, a combined analysis of this attentional information and the EEG signals is presented. Results show that information about attentional focus can be used to improve the neural representation of the attended voice in situations where the voices are similar.
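As background for the TRF analysis mentioned in this abstract, a forward TRF is essentially a regularized linear regression from time-lagged copies of a speech feature to the EEG. A minimal sketch with assumed shapes and regularization:

```python
# Minimal sketch (assumed shapes and regularization) of a forward TRF: ridge
# regression from time-lagged copies of a speech envelope to one EEG channel.
# Circular edge effects from np.roll are ignored for brevity.
import numpy as np

def estimate_trf(envelope, eeg, fs, tmin=-0.1, tmax=0.4, alpha=1.0):
    """envelope, eeg: 1-D arrays sampled at fs; returns (lags_in_seconds, trf_weights)."""
    lags = np.arange(int(tmin * fs), int(tmax * fs) + 1)
    X = np.column_stack([np.roll(envelope, lag) for lag in lags])        # lagged design matrix
    w = np.linalg.solve(X.T @ X + alpha * np.eye(len(lags)), X.T @ eeg)  # ridge solution
    return lags / fs, w
```

The attended talker can then be inferred by checking which talker's envelope, mapped through its TRF, best predicts the recorded EEG.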
13

Baque, Mathieu. "Analyse de scène sonore multi-capteurs : un front-end temps-réel pour la manipulation de scène." Thesis, Le Mans, 2017. http://www.theses.fr/2017LEMA1013/document.

Full text
Abstract:
The context of this thesis is the rise of spatialized audio (5.1 content, Dolby Atmos...) and particularly of 3D audio. Among the existing 3D audio formats, Ambisonics and Higher Order Ambisonics (HOA) provide a homogeneous spatial representation of the sound field and lend themselves naturally to manipulations such as rotations or distortions of the sound field. The aim of this thesis is to provide efficient tools for analysing and manipulating (mainly speech) content in the Ambisonics and HOA formats. Real-time operation and robustness in real acoustic conditions are the main constraints. The implemented algorithm is based on a frame-by-frame independent component analysis (ICA), which decomposes the sound field into a set of contributions corresponding either to sources (direct field) or to reverberation. A Bayesian classification step, applied to the extracted components, then identifies and counts the sound sources present in the mixture. The identified sources are localized using the mixing matrix estimated by ICA, according to the Ambisonic formalism, yielding a real-time map of the sound scene. Performance is evaluated on real recordings as a function of several parameters: the number of sources, the acoustic environment, the frame length, and the Ambisonic order. Reliable localization and source-counting results are obtained for frames of a few hundred milliseconds. Used as a pre-processing front end in a prototype domestic voice assistant, the algorithm significantly improves recognition performance, notably for far-field capture and in the presence of interfering sources.
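The analysis idea can be sketched as frame-wise ICA on first-order Ambisonic signals with directions of arrival read from the estimated mixing matrix; the channel ordering, component count, and plane-wave B-format model below are assumptions, and the Bayesian source/reverberation classification step is omitted.

```python
# Minimal sketch (not the thesis implementation): frame-wise ICA on first-order
# Ambisonic (B-format) signals, with a direction of arrival read from each
# estimated mixing-matrix column. Channel order (W, X, Y, Z) and the plane-wave
# model are assumptions; source vs. reverberation classification is omitted.
import numpy as np
from sklearn.decomposition import FastICA

def frame_doas(bformat_frame, n_sources=2):
    """bformat_frame: (n_samples, 4) array with channels ordered (W, X, Y, Z)."""
    ica = FastICA(n_components=n_sources)
    ica.fit(bformat_frame)
    doas = []
    for col in ica.mixing_.T:                                  # one column per component
        col = col * np.sign(col[0]) if col[0] != 0 else col    # resolve ICA sign via W > 0
        _, x, y, z = col
        azimuth = np.degrees(np.arctan2(y, x))
        elevation = np.degrees(np.arctan2(z, np.hypot(x, y)))
        doas.append((azimuth, elevation))
    return doas
```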
14

Woodruff, John F. "Integrating Monaural and Binaural Cues for Sound Localization and Segregation in Reverberant Environments." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1332425718.

Full text
15

Sundar, Harshavardhan. "Who Spoke What And Where? A Latent Variable Framework For Acoustic Scene Analysis." Thesis, 2016. https://etd.iisc.ac.in/handle/2005/2569.

Full text
Abstract:
Speech is by far the most natural form of communication between human beings. It is intuitive, expressive, and contains information at several cognitive levels. We, as humans, are perceptive to several of these cognitive levels of information: we can gather information pertaining to the identity of the speaker, the speaker's gender, emotion, location, language, and so on, in addition to the content of what is being spoken. This makes speech-based human-machine interaction (HMI) both desirable and challenging for the same set of reasons. For HMI to be natural for humans, it is imperative that a machine understands the information present in speech, at least at the level of speaker identity, language, location in space, and the summary of what is being spoken. Although one can draw parallels between human-human interaction and HMI, the two differ in their purpose. We, as humans, interact with a machine mostly in the context of getting a task done more efficiently than is possible without the machine. Thus, typically in HMI, controlling the machine in a specific manner is the primary goal. In this context, it can be argued that HMI with a limited vocabulary containing specific commands would suffice for a more efficient use of the machine. In this thesis, we address the problem of "Who spoke what and where?", in the context of a machine understanding the information pertaining to the identities of the speakers, their locations in space, and the keywords they spoke, thus considering three levels of information: speaker identity (who), location (where), and keywords (what). This could be addressed with the help of multiple sensors such as microphones, video cameras, proximity sensors, and motion detectors, combining all these modalities. However, we explore the use of only microphones to address this issue. In practical scenarios, there are often times when multiple people are talking at the same time. Thus, the goal of this thesis is to detect all the speakers, their keywords, and their locations in mixture signals containing speech from simultaneous speakers. Addressing this problem of "Who spoke what and where?" using only microphone signals forms a part of acoustic scene analysis (ASA) of speech-based acoustic events. We divide the problem into two sub-problems: "Who spoke what?" and "Who spoke where?". Each of these problems is cast in a generic latent variable (LV) framework to capture information in speech at different levels. We associate an LV with each of these levels and model the relationship between the levels using conditional dependency. The sub-problem of "who spoke what" is addressed using a single-channel microphone signal, by modeling the mixture signal in terms of LV mass functions of speaker identity, the conditional mass function of the keyword spoken given the speaker identity, and a speaker-specific-keyword model. The LV mass functions are estimated in a maximum likelihood (ML) framework with the Expectation Maximization (EM) algorithm, using Student's-t mixture models (tMM) as speaker-specific-keyword models. Motivated by HMI in a home environment, we have created our own database. In mixture signals containing two speakers uttering the keywords simultaneously, the proposed framework achieves an accuracy of 82% for detecting both the speakers and their respective keywords. The other sub-problem, "who spoke where?", is addressed in two stages. In the first stage, the enclosure is discretized into sectors.
The speakers and the sectors in which they are located are detected in an approach similar to the one employed for "who spoke what", using signals collected from a uniform circular array (UCA). However, in place of speaker-specific-keyword models, we use tMM-based speaker models trained on clean speech, along with a simple delay-and-sum beamformer (DSB). In the second stage, the speakers are localized within the active sectors using a novel region-constrained localization technique based on time difference of arrival (TDOA). Since the problem being addressed is a multi-label classification task, we use the average Hamming score (accuracy) as the performance metric. Although the proposed approach yields an accuracy of 100% in an anechoic setting for detecting both the speakers and their corresponding sectors in two-speaker mixture signals, the performance degrades to an accuracy of 67% in a reverberant setting with a 60 dB reverberation time (RT60) of 300 ms. To improve the performance under reverberation, prior knowledge of the location of multiple sources is derived using a novel technique based on geometrical insights into TDOA estimation. With this prior knowledge, the accuracy of the proposed approach improves to 91%. It is worthwhile to note that the accuracies are computed for mixture signals containing more than 90% overlap of the competing speakers. The proposed LV framework offers a convenient methodology to represent information at broad levels. In this thesis, we have shown its use with three different levels. This can be extended to several such levels for a generic analysis of an acoustic scene consisting of broad levels of events. It will turn out that not all levels are dependent on each other, and hence the LV dependencies can be reduced by independence assumptions, which leads to solving several smaller sub-problems, as shown above. The LV framework is also attractive for incorporating prior knowledge about the acoustic setting, which is combined with the evidence from the data to derive information about the presence of an acoustic event. The performance of the framework is dependent on the choice of stochastic models, which model the likelihood function of the data given the presence of acoustic events. However, it provides a way to compare and contrast the use of different stochastic models for representing the likelihood function.
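For reference, the TDOA estimates that underpin the localization stage described above are commonly obtained with GCC-PHAT between microphone pairs; the following sketch shows that standard estimator only, not the thesis's region-constrained refinement.

```python
# Minimal sketch of standard TDOA estimation with GCC-PHAT between two microphones;
# the thesis's region-constrained refinement and the delay-and-sum / tMM stages are
# not reproduced here.
import numpy as np

def gcc_phat_tdoa(x, y, fs, max_tau=None):
    """Return the delay (seconds) of signal y relative to signal x."""
    n = 2 * max(len(x), len(y))
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    cross = Y * np.conj(X)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)     # PHAT-weighted correlation
    max_shift = n // 2 if max_tau is None else min(n // 2, int(max_tau * fs))
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```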
16

Sundar, Harshavardhan. "Who Spoke What And Where? A Latent Variable Framework For Acoustic Scene Analysis." Thesis, 2016. http://etd.iisc.ernet.in/handle/2005/2569.

Full text
17

Aaronson, Neil L. "Speech-on-speech masking in a front-back dimension and analysis of binaural parameters in rooms using MLS methods." Diss., 2008.

Find full text
18

Teutsch, Heinz [Verfasser]. "Wavefield decomposition using microphone arrays and its application to acoustic scene analysis / vorgelegt von Heinz Teutsch." 2006. http://d-nb.info/97902806X/34.

Full text
19

Pelluri, Sai Gunaranjan. "Joint Spectro-Temporal Analysis of Moving Acoustic Sources." Thesis, 2017. http://etd.iisc.ac.in/handle/2005/4279.

Full text
Abstract:
Signals generated by fast-moving acoustic sources are both challenging to analyze and rich in information. The motion is conceptually relative between the source and the receiver, i.e., either one of them is moving or both are moving. Thus, the receiver can gather information about the relative motion as well as the nature of the source itself: direction, velocity, acceleration, the number of different sources, friend/foe, and so on can all be gathered. All these parameters are inherently embedded in the received signal. Given this rich information content, we address the task of extracting maximum information from a minimum number of receivers, even a single microphone recording. By using more sensors in specific configurations, we can characterize the moving acoustic source even better. When a moving source generates a signal, the Doppler effect naturally comes into the picture: the received signal becomes non-stationary in its spectral content, even if the generated signal is stationary. This non-stationarity provides information about the motion of the source. Oechslin et al. [24], through a series of experiments, demonstrated the role of the Doppler effect as a fundamental principle for identifying the direction of a moving sound source. The authors showed that the Doppler effect is sufficient to distinguish between an approaching and a receding sound source. An interesting finding from their experiments is that observers are more sensitive to approaching sound sources than to receding ones. This compelling result can be used to hypothesize about interactions in biological systems, particularly between predator and prey. Thus, biological systems, with a minimal number of auditory sensors (two ears), are able to extract from the source signal information sufficient for their survival. With this motivation, we explore restricting the number of microphones used to estimate source parameters to a minimum. Since the Doppler effect manifests as time-varying frequency content in the signal, we use this non-stationarity to estimate source parameters. The thesis also addresses the issue of disambiguating the inherent non-stationarity of the source signal from the non-stationarity introduced by the motion dynamics of the source, and we show analytically that these two kinds of non-stationarity can be decoupled under certain conditions. We next explore various methods of instantaneous frequency (IF) estimation from the received signal. In particular, we use non-stationary signal processing tools such as the AM-FM model, time-frequency representations (TFRs), and time-varying linear prediction (TV-LP) to aid the signal analysis. We explore in detail the effectiveness of the chirplet transform and its variants for IF estimation and also discuss their time-frequency resolution properties. We propose a new variant of the chirplet transform for multi-component non-stationary signals such as Doppler signals. We conclude the thesis by summarising our research contributions and opening up various problems for further research in this field.
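The Doppler relationship at the heart of this abstract can be illustrated with a short simulation: a tone from a source passing a stationary microphone, with its instantaneous frequency tracked from STFT peaks. The geometry and parameter values are assumed, and the thesis's chirplet-based methods are not reproduced.

```python
# Minimal simulation (not the thesis's chirplet-based method) of the Doppler effect
# described above: a tone from a source passing a stationary microphone, with its
# instantaneous frequency tracked from STFT peaks. All parameter values are assumed.
import numpy as np
from scipy.signal import stft

fs, f0, c = 16000, 1000.0, 343.0        # sample rate, source frequency (Hz), speed of sound (m/s)
v, d = 30.0, 10.0                       # source speed (m/s) and closest-approach distance (m)
t = np.arange(-2.0, 2.0, 1 / fs)        # the source passes the microphone at t = 0
r = np.sqrt(d**2 + (v * t) ** 2)        # time-varying source-microphone distance
x = np.cos(2 * np.pi * f0 * (t - r / c)) / r        # retarded-time phase, spherical spreading

freqs, frames, Z = stft(x, fs=fs, nperseg=1024)
f_inst = freqs[np.argmax(np.abs(Z), axis=0)]        # crude IF estimate: per-frame peak bin
# f_inst sweeps from roughly f0*c/(c - v) on approach down to f0*c/(c + v) on recession.
```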
20

"Psychophysical and Neural Correlates of Auditory Attraction and Aversion." Master's thesis, 2014. http://hdl.handle.net/2286/R.I.27518.

Full text
Abstract:
This study explores the psychophysical and neural processes associated with the perception of sounds as either pleasant or aversive. The underlying psychophysical theory is based on auditory scene analysis, the process through which listeners parse auditory signals into individual acoustic sources. The first experiment tests and confirms that a self-rated pleasantness continuum reliably exists for 20 various stimuli (r = .48). In addition, the pleasantness continuum correlated with the physical acoustic characteristics of consonance/dissonance (r = .78), which can facilitate auditory parsing processes. The second experiment uses an fMRI block design to test blood oxygen level dependent (BOLD) changes elicited by a subset of 5 exemplar stimuli chosen from Experiment 1 that are evenly distributed over the pleasantness continuum. Specifically, it tests and confirms that the pleasantness continuum produces systematic changes in brain activity for unpleasant acoustic stimuli beyond what occurs with pleasant auditory stimuli. Results revealed that the combination of two positively and two negatively valenced experimental sounds compared to one neutral baseline control elicited BOLD increases in the primary auditory cortex, specifically the bilateral superior temporal gyrus, and left dorsomedial prefrontal cortex; the latter being consistent with a frontal decision-making process common in identification tasks. The negatively-valenced stimuli yielded additional BOLD increases in the left insula, which typically indicates processing of visceral emotions. The positively-valenced stimuli did not yield any significant BOLD activation, consistent with consonant, harmonic stimuli being the prototypical acoustic pattern of auditory objects that is optimal for auditory scene analysis. Both the psychophysical findings of Experiment 1 and the neural processing findings of Experiment 2 support that consonance is an important dimension of sound that is processed in a manner that aids auditory parsing and functional representation of acoustic objects and was found to be a principal feature of pleasing auditory stimuli.
Dissertation/Thesis
Masters Thesis Psychology 2014