Rozprawy doktorskie: „Speech and audio signals”

1

Mason, Michael. "Hybrid coding of speech and audio signals". Thesis, Queensland University of Technology, 2001.

Style APA, Harvard, Vancouver, ISO itp.

2

Trinkaus, Trevor R. "Perceptual coding of audio and diverse speech signals". Diss., Georgia Institute of Technology, 1999. http://hdl.handle.net/1853/13883.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

3

Mészáros, Tomáš. "Speech Analysis for Processing of Musical Signals". Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2015. http://www.nusl.cz/ntk/nusl-234974.

Pełny tekst źródła

Streszczenie:

Hlavním cílem této práce je obohatit hudební signály charakteristikami lidské řeči. Práce zahrnuje tvorbu audioefektu inspirovaného efektem talk-box: analýzu hlasového ústrojí vhodným algoritmem jako je lineární predikce, a aplikaci odhadnutého filtru na hudební audio-signál. Důraz je kladen na dokonalou kvalitu výstupu, malou latenci a nízkou výpočetní náročnost pro použití v reálném čase. Výstupem práce je softwarový plugin využitelný v profesionálních aplikacích pro úpravu audia a při využití vhodné hardwarové platformy také pro živé hraní. Plugin emuluje reálné zařízení typu talk-box a poskytuje podobnou kvalitu výstupu s unikátním zvukem.

Style APA, Harvard, Vancouver, ISO itp.

4

Choi, Hyung Keun. "Blind source separation of the audio signals in a real world". Thesis, Georgia Institute of Technology, 2002. http://hdl.handle.net/1853/14986.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

5

Lucey, Simon. "Audio-visual speech processing". Thesis, Queensland University of Technology, 2002. https://eprints.qut.edu.au/36172/7/SimonLuceyPhDThesis.pdf.

Pełny tekst źródła

Streszczenie:

Speech is inherently bimodal, relying on cues from the acoustic and visual speech modalities for perception. The McGurk effect demonstrates that when humans are presented with conflicting acoustic and visual stimuli, the perceived sound may not exist in either modality. This effect has formed the basis for modelling the complementary nature of acoustic and visual speech by encapsulating them into the relatively new research field of audio-visual speech processing (AVSP). Traditional acoustic based speech processing systems have attained a high level of performance in recent years, but the performance of these systems is heavily dependent on a match between training and testing conditions. In the presence of mismatched conditions (eg. acoustic noise) the performance of acoustic speech processing applications can degrade markedly. AVSP aims to increase the robustness and performance of conventional speech processing applications through the integration of the acoustic and visual modalities of speech, in particular the tasks of isolated word speech and text-dependent speaker recognition. Two major problems in AVSP are addressed in this thesis, the first of which concerns the extraction of pertinent visual features for effective speech reading and visual speaker recognition. Appropriate representations of the mouth are explored for improved classification performance for speech and speaker recognition. Secondly, there is the question of how to effectively integrate the acoustic and visual speech modalities for robust and improved performance. This question is explored in-depth using hidden Markov model(HMM)classifiers. The development and investigation of integration strategies for AVSP required research into a new branch of pattern recognition known as classifier combination theory. A novel framework is presented for optimally combining classifiers so their combined performance is greater than any of those classifiers individually. The benefits of this framework are not restricted to AVSP, as they can be applied to any task where there is a need for combining independent classifiers.

Style APA, Harvard, Vancouver, ISO itp.

6

Anderson, David Verl. "Audio signal enhancement using multi-resolution sinusoidal modeling". Diss., Georgia Institute of Technology, 1999. http://hdl.handle.net/1853/15394.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

7

Zeghidour, Neil. "Learning representations of speech from the raw waveform". Thesis, Paris Sciences et Lettres (ComUE), 2019. http://www.theses.fr/2019PSLEE004/document.

Pełny tekst źródła

Streszczenie:

Bien que les réseaux de neurones soient à présent utilisés dans la quasi-totalité des composants d’un système de reconnaissance de la parole, du modèle acoustique au modèle de langue, l’entrée de ces systèmes reste une représentation analytique et fixée de la parole dans le domaine temps-fréquence, telle que les mel-filterbanks. Cela se distingue de la vision par ordinateur, un domaine où les réseaux de neurones prennent en entrée les pixels bruts. Les mel-filterbanks sont le produit d’une connaissance précieuse et documentée du système auditif humain, ainsi que du traitement du signal, et sont utilisées dans les systèmes de reconnaissance de la parole les plus en pointe, systèmes qui rivalisent désormais avec les humains dans certaines conditions. Cependant, les mel-filterbanks, comme toute représentation fixée, sont fondamentalement limitées par le fait qu’elles ne soient pas affinées par apprentissage pour la tâche considérée. Nous formulons l’hypothèse qu’apprendre ces représentations de bas niveau de la parole, conjontement avec le modèle, permettrait de faire avancer davantage l’état de l’art. Nous explorons tout d’abord des approches d’apprentissage faiblement supervisé et montrons que nous pouvons entraîner un unique réseau de neurones à séparer l’information phonétique de celle du locuteur à partir de descripteurs spectraux ou du signal brut et que ces représentations se transfèrent à travers les langues. De plus, apprendre à partir du signal brut produit des représentations du locuteur significativement meilleures que celles d’un modèle entraîné sur des mel-filterbanks. Ces résultats encourageants nous mènent par la suite à développer une alternative aux mel-filterbanks qui peut être entraînée à partir des données. Dans la seconde partie de cette thèse, nous proposons les Time-Domain filterbanks, une architecture neuronale légère prenant en entrée la forme d’onde, dont on peut initialiser les poids pour répliquer les mel-filterbanks et qui peut, par la suite, être entraînée par rétro-propagation avec le reste du réseau de neurones. Au cours d’expériences systématiques et approfondies, nous montrons que les Time-Domain filterbanks surclassent systématiquement les melfilterbanks, et peuvent être intégrées dans le premier système de reconnaissance de la parole purement convolutif et entraîné à partir du signal brut, qui constitue actuellement un nouvel état de l’art. Les descripteurs fixes étant également utilisés pour des tâches de classification non-linguistique, pour lesquelles elles sont d’autant moins optimales, nous entraînons un système de détection de dysarthrie à partir du signal brut, qui surclasse significativement un système équivalent entraîné sur des mel-filterbanks ou sur des descripteurs de bas niveau. Enfin, nous concluons cette thèse en expliquant en quoi nos contributions s’inscrivent dans une transition plus large vers des systèmes de compréhension du son qui pourront être appris de bout en bout
While deep neural networks are now used in almost every component of a speech recognition system, from acoustic to language modeling, the input to such systems are still fixed, handcrafted, spectral features such as mel-filterbanks. This contrasts with computer vision, in which a deep neural network is now trained on raw pixels. Mel-filterbanks contain valuable and documented prior knowledge from human auditory perception as well as signal processing, and are the input to state-of-the-art speech recognition systems that are now on par with human performance in certain conditions. However, mel-filterbanks, as any fixed representation, are inherently limited by the fact that they are not fine-tuned for the task at hand. We hypothesize that learning the low-level representation of speech with the rest of the model, rather than using fixed features, could push the state-of-the art even further. We first explore a weakly-supervised setting and show that a single neural network can learn to separate phonetic information and speaker identity from mel-filterbanks or the raw waveform, and that these representations are robust across languages. Moreover, learning from the raw waveform provides significantly better speaker embeddings than learning from mel-filterbanks. These encouraging results lead us to develop a learnable alternative to mel-filterbanks, that can be directly used in replacement of these features. In the second part of this thesis we introduce Time-Domain filterbanks, a lightweight neural network that takes the waveform as input, can be initialized as an approximation of mel-filterbanks, and then learned with the rest of the neural architecture. Across extensive and systematic experiments, we show that Time-Domain filterbanks consistently outperform melfilterbanks and can be integrated into a new state-of-the-art speech recognition system, trained directly from the raw audio signal. Fixed speech features being also used for non-linguistic classification tasks for which they are even less optimal, we perform dysarthria detection from the waveform with Time-Domain filterbanks and show that it significantly improves over mel-filterbanks or low-level descriptors. Finally, we discuss how our contributions fall within a broader shift towards fully learnable audio understanding systems

Style APA, Harvard, Vancouver, ISO itp.

8

Bando, Yoshiaki. "Robust Audio Scene Analysis for Rescue Robots". Kyoto University, 2018. http://hdl.handle.net/2433/232410.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

9

Moghimi, Amir Reza. "Array-based Spectro-temporal Masking For Automatic Speech Recognition". Research Showcase @ CMU, 2014. http://repository.cmu.edu/dissertations/334.

Pełny tekst źródła

Streszczenie:

Over the years, a variety of array processing techniques have been applied to the problem of enhancing degraded speech to improve automatic speech recognition. In this context, linear beamforming has long been the approach of choice, for reasons including good performance, robustness and analytical simplicity. While various non-linear techniques - typically based to some extent on the study of auditory scene analysis - have also been of interest, they tend to lag behind their linear counterparts in terms of simplicity, scalability and exibility. Nonlinear techniques are also more difficult to analyze and lack the systematic descriptions available in the study of linear beamformers. This work focuses on a class of nonlinear processing, known as time-frequency (T-F) masking - a.k.a. spectro-temporal masking { whose variants comprise a significant portion of the existing techniques. T-F masking is based on accepting or rejecting individual time-frequency cells based on some estimate of local signal quality. Analyses are developed that attempt to mirror the beam patterns used to describe linear processing, leading to a view of T-F masking as "nonlinear beamforming". Two distinct formulations of these "nonlinear beam patterns" are developed, based on different metrics of the algorithms behavior; these formulations are modeled in a variety of scenarios to demonstrate the flexibility of the idea. While these patterns are not quite as simple or all-encompassing as traditional beam patterns in microphone-array processing, they do accurately represent the behavior of masking algorithms in analogous and intuitive ways. In addition to analyzing this class of nonlinear masking algorithm, we also attempt to improve its performance in a variety of ways. Improvements are proposed to the baseline two-channel version of masking, by addressing both the mask estimation and the signal reconstruction stages; the latter more successfully than the former. Furthermore, while these approaches have been shown to outperform linear beamforming in two-sensor arrays, extensions to larger arrays have been few and unsuccessful. We find that combining beamforming and masking is a viable method of bringing the benefits of masking to larger arrays. As a result, a hybrid beamforming-masking approach, called "post-masking", is developed that improves upon the performance of MMSE beamforming (and can be used with any beamforming technique), with the potential for even greater improvement in the future.

Style APA, Harvard, Vancouver, ISO itp.

10

Brangers, Kirstin M. "Perceptual Ruler for Quantifying Speech Intelligibility in Cocktail Party Scenarios". UKnowledge, 2013. http://uknowledge.uky.edu/ece_etds/31.

Pełny tekst źródła

Streszczenie:

Systems designed to enhance intelligibility of speech in noise are difficult to evaluate quantitatively because intelligibility is subjective and often requires feedback from large populations for consistent evaluations. Attempts to quantify the evaluation have included related measures such as the Speech Intelligibility Index. These require separating speech and noise signals, which precludes its use on experimental recordings. This thesis develops a procedure using an Intelligibility Ruler (IR) for efficiently quantifying intelligibility. A calibrated Mean Opinion Score (MOS) method is also implemented in order to compare repeatability over a population of 24 subjective listeners. Results showed that subjects using the IR consistently estimated SII values of the test samples with an average standard deviation of 0.0867 between subjects on a scale from zero to one and R2=0.9421. After a calibration procedure from a subset of subjects, the MOS method yielded similar results with an average standard deviation of 0.07620 and R2=0.9181.While results suggest good repeatability of the IR method over a broad range of subjects, the calibrated MOS method is capable of producing results more closely related to actual SII values and is a simpler procedure for human subjects.

Style APA, Harvard, Vancouver, ISO itp.

11

Harvilla, Mark J. "Compensation for Nonlinear Distortion in Noise for Robust Speech Recognition". Research Showcase @ CMU, 2014. http://repository.cmu.edu/dissertations/437.

Pełny tekst źródła

Streszczenie:

The performance, reliability, and ubiquity of automatic speech recognition systems has flourished in recent years due to steadily increasing computational power and technological innovations such as hidden Markov models, weighted finite-state transducers, and deep learning methods. One problem which plagues speech recognition systems, especially those that operate offline and have been trained on specific in-domain data, is the deleterious effect of noise on the accuracy of speech recognition. Historically, robust speech recognition research has focused on traditional noise types such as additive noise, linear filtering, and reverberation. This thesis describes the effects of nonlinear dynamic range compression on automatic speech recognition and develops a number of novel techniques for characterizing and counteracting it. Dynamic range compression is any function which reduces the dynamic range of an input signal. Dynamic range compression is a widely-used tool in audio engineering and is almost always a component of a practical telecommunications system. Despite its ubiquity, this thesis is the first work to comprehensively study and address the effect of dynamic range compression on speech recognition. More specifically, this thesis treats the problem of dynamic range compression in three ways: (1) blind amplitude normalization methods, which counteract dynamic range compression when its parameter values allow the function to be mathematically inverted, (2) blind amplitude reconstruction techniques, i.e., declipping, which attempt to reconstruct clipped segments of the speech signal that are lost through non-invertible dynamic range compression, and (3) matched-training techniques, which attempt to select the pre-trained acoustic model with the closest set of compression parameters. All three of these methods rely on robust estimation of the dynamic range compression distortion parameters. Novel algorithms for the blind prediction of these parameters are also introduced. The algorithms' quality is evaluated in terms of the degree to which they decrease speech recognition word error rate, as well as in terms of the degree to which they increase a given speech signal's signal-to-noise ratio. In all evaluations, the possibility of independent additive noise following the application of dynamic range compression is assumed.

Style APA, Harvard, Vancouver, ISO itp.

12

Nylén, Helmer. "Detecting Signal Corruptions in Voice Recordings for Speech Therapy". Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-291429.

Pełny tekst źródła

Streszczenie:

When recording voice samples from a patient in speech therapy the quality of the recording may be affected by different signal corruptions, for example background noise or clipping. The equipment and expertise required to identify small disturbances are not always present at smaller clinics. Therefore, this study investigates possible machine learning algorithms to automatically detect selected corruptions in speech signals, including infrasound and random muting. Five algorithms are analyzed: kernel substitution based Support Vector Machine, Convolutional Neural Network, Long Short-term Memory (LSTM), Gaussian Mixture Model based Hidden Markov Model and Generative Model based Hidden Markov Model. A tool to generate datasets of corrupted recordings is developed to test the algorithms in both single-label and multi-label settings. Mel-frequency Cepstral Coefficients are used as the main features. For each type of corruption different ways to increase the classification accuracy are tested, for example by using a Voice Activity Detector to filter out less relevant parts of the recording, changing the feature parameters, or using an ensemble of classifiers. The experiments show that a machine learning approach is feasible for this problem as a balanced accuracy of at least 75% is reached on all tested corruptions. While the single-label study gave mixed results with no algorithm clearly outperforming the others, in the multi-label case the LSTM in general performs better than other algorithms. Notably it achieves over 95% balanced accuracy on both white noise and infrasound. As the algorithms are trained only on spoken English phrases the usability of this tool in its current state is limited, but the experiments are easily expanded upon with other types of audio recordings, corruptions, features, or classification algorithms.
När en patients röst spelas in för analys i talterapi kan inspelningskvaliteten påverkas av olika signalproblem, till exempel bakgrundsljud eller klippning. Utrustningen och expertisen som behövs för att upptäcka små störningar finns dock inte alltid tillgänglig på mindre kliniker. Därför undersöker denna studie olika maskininlärningsalgoritmer för att automatiskt kunna upptäcka utvalda problem i talinspelningar, bland andra infraljud och slumpmässig utsläckning av signalen. Fem algoritmer analyseras: stödvektormaskin, Convolutional Neural Network, Long Short-term Memory (LSTM), Gaussian mixture model-baserad dold Markovmodell och generatorbaserad dold Markovmodell. Ett verktyg för att skapa datamängder med försämrade inspelningar utvecklas för att kunna testa algoritmerna. Vi undersöker separat fallen där inspelningarna tillåts ha en eller flera problem samtidigt, och använder framförallt en slags kepstralkoefficienter, MFCC:er, som särdrag. För varje typ av problem undersöker vi också sätt att förbättra noggrannheten, till exempel genom att filtrera bort irrelevanta delar av signalen med hjälp av en röstupptäckare, ändra särdragsparametrarna, eller genom att använda en ensemble av klassificerare. Experimenten visar att maskininlärning är ett rimligt tillvägagångssätt för detta problem då den balanserade träffsäkerheten överskrider 75%för samtliga testade störningar. Den delen av studien som fokuserade på enproblemsinspelningar gav inga resultat som tydde på att en algoritm var klart bättre än de andra, men i flerproblemsfallet överträffade LSTM:en generellt övriga algoritmer. Värt att notera är att den nådde över 95 % balanserad träffsäkerhet på både vitt brus och infraljud. Eftersom algoritmerna enbart tränats på engelskspråkiga, talade meningar så har detta verktyg i nuläget begränsad praktisk användbarhet. Däremot är det lätt att utöka dessa experiment med andra typer av inspelningar, signalproblem, särdrag eller algoritmer.

Style APA, Harvard, Vancouver, ISO itp.

13

Sekiguchi, Kouhei. "A Unified Statistical Approach to Fast and Robust Multichannel Speech Separation and Dereverberation". Doctoral thesis, Kyoto University, 2021. http://hdl.handle.net/2433/263770.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

14

Yoo, Heejong. "Low-Power Audio Input Enhancement for Portable Devices". Diss., Georgia Institute of Technology, 2005. http://hdl.handle.net/1853/6821.

Pełny tekst źródła

Streszczenie:

With the development of VLSI and wireless communication technology, portable devices such as personal digital assistants (PDAs), pocket PCs, and mobile phones have gained a lot of popularity. Many such devices incorporate a speech recognition engine, enabling users to interact with the devices using voice-driven commands and text-to-speech synthesis. The power consumption of DSP microprocessors has been consistently decreasing by half about every 18 months, following Gene's law. The capacity of signal processing, however, is still significantly constrained by the limited power budget of these portable devices. In addition, analog-to-digital (A/D) converters can also limit the signal processing of portable devices. Many systems require very high-resolution and high-performance A/D converters, which often consume a large fraction of the limited power budget of portable devices. The proposed research develops a low-power audio signal enhancement system that combines programmable analog signal processing and traditional digital signal processing. By utilizing analog signal processing based on floating-gate transistor technology, the power consumption of the overall system as well as the complexity of the A/D converters can be reduced significantly. The system can be used as a front end of portable devices in which enhancement of audio signal quality plays a critical role in automatic speech recognition systems on portable devices. The proposed system performs background audio noise suppression in a continuous-time domain using analog computing elements and acoustic echo cancellation in a discrete-time domain using an FPGA.

Style APA, Harvard, Vancouver, ISO itp.

15

Della, Corte Giuseppe. "Text and Speech Alignment Methods for Speech Translation Corpora Creation : Augmenting English LibriVox Recordings with Italian Textual Translations". Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-413064.

Pełny tekst źródła

Streszczenie:

The recent uprise of end-to-end speech translation models requires a new generation of parallel corpora, composed of a large amount of source language speech utterances aligned with their target language textual translations. We hereby show a pipeline and a set of methods to collect hundreds of hours of English audio-book recordings and align them with their Italian textual translations, using exclusively public domain resources gathered semi-automatically from the web. The pipeline consists in three main areas: text collection, bilingual text alignment, and forced alignment. For the text collection task, we show how to automatically find e-book titles in a target language by using machine translation, web information retrieval, and named entity recognition and translation techniques. For the bilingual text alignment task, we investigated three methods: the Gale–Church algorithm in conjunction with a small-size hand-crafted bilingual dictionary, the Gale–Church algorithm in conjunction with a bigger bilingual dictionary automatically inferred through statistical machine translation, and bilingual text alignment by computing the vector similarity of multilingual embeddings of concatenation of consecutive sentences. Our findings seem to indicate that the consecutive-sentence-embeddings similarity computation approach manages to improve the alignment of difficult sentences by indirectly performing sentence re-segmentation. For the forced alignment task, we give a theoretical overview of the preferred method depending on the properties of the text to be aligned with the audio, suggesting and using a TTS-DTW (text-to-speech and dynamic time warping) based approach in our pipeline. The result of our experiments is a publicly available multi-modal corpus composed of about 130 hours of English speech aligned with its Italian textual translation and split in 60561 triplets of English audio, English transcript, and Italian textual translation. We also post-processed the corpus so as to extract 40-MFCCs features from the audio segments and released them as a data-set.

Style APA, Harvard, Vancouver, ISO itp.

16

Leis, John W. "Spectral coding methods for speech compression and speaker identification". Thesis, Queensland University of Technology, 1998. https://eprints.qut.edu.au/36062/7/36062_Digitised_Thesis.pdf.

Pełny tekst źródła

Streszczenie:

This thesis investigates aspects of encoding the speech spectrum at low bit rates, with extensions to the effect of such coding on automatic speaker identification. Vector quantization (VQ) is a technique for jointly quantizing a block of samples at once, in order to reduce the bit rate of a coding system. The major drawback in using VQ is the complexity of the encoder. Recent research has indicated the potential applicability of the VQ method to speech when product code vector quantization (PCVQ) techniques are utilized. The focus of this research is the efficient representation, calculation and utilization of the speech model as stored in the PCVQ codebook. In this thesis, several VQ approaches are evaluated, and the efficacy of two training algorithms is compared experimentally. It is then shown that these productcode vector quantization algorithms may be augmented with lossless compression algorithms, thus yielding an improved overall compression rate. An approach using a statistical model for the vector codebook indices for subsequent lossless compression is introduced. This coupling of lossy compression and lossless compression enables further compression gain. It is demonstrated that this approach is able to reduce the bit rate requirement from the current 24 bits per 20 millisecond frame to below 20, using a standard spectral distortion metric for comparison. Several fast-search VQ methods for use in speech spectrum coding have been evaluated. The usefulness of fast-search algorithms is highly dependent upon the source characteristics and, although previous research has been undertaken for coding of images using VQ codebooks trained with the source samples directly, the product-code structured codebooks for speech spectrum quantization place new constraints on the search methodology. The second major focus of the research is an investigation of the effect of lowrate spectral compression methods on the task of automatic speaker identification. The motivation for this aspect of the research arose from a need to simultaneously preserve the speech quality and intelligibility and to provide for machine-based automatic speaker recognition using the compressed speech. This is important because there are several emerging applications of speaker identification where compressed speech is involved. Examples include mobile communications where the speech has been highly compressed, or where a database of speech material has been assembled and stored in compressed form. Although these two application areas have the same objective - that of maximizing the identification rate - the starting points are quite different. On the one hand, the speech material used for training the identification algorithm may or may not be available in compressed form. On the other hand, the new test material on which identification is to be based may only be available in compressed form. Using the spectral parameters which have been stored in compressed form, two main classes of speaker identification algorithm are examined. Some studies have been conducted in the past on bandwidth-limited speaker identification, but the use of short-term spectral compression deserves separate investigation. Combining the major aspects of the research, some important design guidelines for the construction of an identification model when based on the use of compressed speech are put forward.

Style APA, Harvard, Vancouver, ISO itp.

17

Jaureguiberry, Xabier. "Fusion pour la séparation de sources audio". Thesis, Paris, ENST, 2015. http://www.theses.fr/2015ENST0030/document.

Pełny tekst źródła

Streszczenie:

La séparation aveugle de sources audio dans le cas sous-déterminé est un problème mathématique complexe dont il est aujourd'hui possible d'obtenir une solution satisfaisante, à condition de sélectionner la méthode la plus adaptée au problème posé et de savoir paramétrer celle-ci soigneusement. Afin d'automatiser cette étape de sélection déterminante, nous proposons dans cette thèse de recourir au principe de fusion. L'idée est simple : il s'agit, pour un problème donné, de sélectionner plusieurs méthodes de résolution plutôt qu'une seule et de les combiner afin d'en améliorer la solution. Pour cela, nous introduisons un cadre général de fusion qui consiste à formuler l'estimée d'une source comme la combinaison de plusieurs estimées de cette même source données par différents algorithmes de séparation, chaque estimée étant pondérée par un coefficient de fusion. Ces coefficients peuvent notamment être appris sur un ensemble d'apprentissage représentatif du problème posé par minimisation d'une fonction de coût liée à l'objectif de séparation. Pour aller plus loin, nous proposons également deux approches permettant d'adapter les coefficients de fusion au signal à séparer. La première formule la fusion dans un cadre bayésien, à la manière du moyennage bayésien de modèles. La deuxième exploite les réseaux de neurones profonds afin de déterminer des coefficients de fusion variant en temps. Toutes ces approches ont été évaluées sur deux corpus distincts : l'un dédié au rehaussement de la parole, l'autre dédié à l'extraction de voix chantée. Quelle que soit l'approche considérée, nos résultats montrent l'intérêt systématique de la fusion par rapport à la simple sélection, la fusion adaptative par réseau de neurones se révélant être la plus performante
Underdetermined blind source separation is a complex mathematical problem that can be satisfyingly resolved for some practical applications, providing that the right separation method has been selected and carefully tuned. In order to automate this selection process, we propose in this thesis to resort to the principle of fusion which has been widely used in the related field of classification yet is still marginally exploited in source separation. Fusion consists in combining several methods to solve a given problem instead of selecting a unique one. To do so, we introduce a general fusion framework in which a source estimate is expressed as a linear combination of estimates of this same source given by different separation algorithms, each source estimate being weighted by a fusion coefficient. For a given task, fusion coefficients can then be learned on a representative training dataset by minimizing a cost function related to the separation objective. To go further, we also propose two ways to adapt the fusion coefficients to the mixture to be separated. The first one expresses the fusion of several non-negative matrix factorization (NMF) models in a Bayesian fashion similar to Bayesian model averaging. The second one aims at learning time-varying fusion coefficients thanks to deep neural networks. All proposed methods have been evaluated on two distinct corpora. The first one is dedicated to speech enhancement while the other deals with singing voice extraction. Experimental results show that fusion always outperform simple selection in all considered cases, best results being obtained by adaptive time-varying fusion with neural networks

Style APA, Harvard, Vancouver, ISO itp.

18

Herms, Robert. "Effective Speech Features for Cognitive Load Assessment: Classification and Regression". Universitätsverlag Chemnitz, 2018. https://monarch.qucosa.de/id/qucosa%3A33346.

Pełny tekst źródła

Streszczenie:

This thesis is about the effectiveness of speech features for cognitive load assessment, with particular attention being paid to new perspectives of this research area. A new cognitive load database, called CoLoSS, is introduced containing speech recordings of users who performed a learning task. Various acoustic features from different categories including prosody, voice quality, and spectrum are investigated in terms of their relevance. Moreover, Teager energy parameters, which have proven highly successful in stress detection, are introduced for cognitive load assessment and it is demonstrated how automatic speech recognition technology can be used to extract potential indicators. The suitability of the extracted features is systematically evaluated by recognition experiments with speaker-independent systems designed for discriminating between three levels of load. Additionally, a novel approach to speech-based cognitive load modelling is introduced, whereby the load is represented as a continuous quantity and its prediction can thus be regarded as a regression problem.
Die vorliegende Arbeit befasst sich mit der automatischen Erkennung von kognitiver Belastung auf Basis menschlicher Sprachmerkmale. Der Schwerpunkt liegt auf der Effektivität von akustischen Parametern, wobei die aktuelle Forschung auf diesem Gebiet um neuartige Ansätze erweitert wird. Hierzu wird ein neuer Datensatz – als CoLoSS bezeichnet – vorgestellt, welcher Sprachaufzeichnungen von Nutzern enthält und speziell auf Lernprozesse fokussiert. Zahlreiche Parameter der Prosodie, Stimmqualität und des Spektrums werden im Hinblick auf deren Relevanz analysiert. Darüber hinaus werden die Eigenschaften des Teager Energy Operators, welche typischerweise bei der Stressdetektion Verwendung finden, im Rahmen dieser Arbeit berücksichtigt. Ebenso wird gezeigt, wie automatische Spracherkennungssysteme genutzt werden können, um potenzielle Indikatoren zu extrahieren. Die Eignung der extrahierten Merkmale wird systematisch evaluiert. Dabei kommen sprecherunabhängige Klassifikationssysteme zur Unterscheidung von drei Belastungsstufen zum Einsatz. Zusätzlich wird ein neuartiger Ansatz zur sprachbasierten Modellierung der kognitiven Belastung vorgestellt, bei dem die Belastung eine kontinuierliche Größe darstellt und eine Vorhersage folglich als ein Regressionsproblem betrachtet werden kann.

Style APA, Harvard, Vancouver, ISO itp.

19

Fong, Katherine KaYan. "IR-Depth Face Detection and Lip Localization Using Kinect V2". DigitalCommons@CalPoly, 2015. https://digitalcommons.calpoly.edu/theses/1425.

Pełny tekst źródła

Streszczenie:

Face recognition and lip localization are two main building blocks in the development of audio visual automatic speech recognition systems (AV-ASR). In many earlier works, face recognition and lip localization were conducted in uniform lighting conditions with simple backgrounds. However, such conditions are seldom the case in real world applications. In this paper, we present an approach to face recognition and lip localization that is invariant to lighting conditions. This is done by employing infrared and depth images captured by the Kinect V2 device. First we present the use of infrared images for face detection. Second, we use the face’s inherent depth information to reduce the search area for the lips by developing a nose point detection. Third, we further reduce the search area by using a depth segmentation algorithm to separate the face from its background. Finally, with the reduced search range, we present a method for lip localization based on depth gradients. Experimental results demonstrated an accuracy of 100% for face detection, and 96% for lip localization.

Style APA, Harvard, Vancouver, ISO itp.

20

Almajai, Ibrahim M. "Audio Visual Speech Enhancement". Thesis, University of East Anglia, 2009. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.514309.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

21

Gómez, Gutiérrez Emilia. "Tonal description of music audio signals". Doctoral thesis, Universitat Pompeu Fabra, 2006. http://hdl.handle.net/10803/7537.

Pełny tekst źródła

Streszczenie:

Aquesta tesi doctoral proposa i avalua un enfocament computacional per a la descripció automàtica dels aspectes tonals de la música a partir de l'anàlisi de senyals d'-audio polifòniques. Aquests mètodes es centren en el càlcul de descriptors de distribucions de notes, en l'estimació de tonalitat d'una peça, en la visualització de l'evolució del centre tonal o en la mesura de la similitud tonal entre dues peces diferents.
Aquesta tesi contribueix substancialment al camp de la descripció tonal mitjançant mètodes computacionals: a) Proporciona una revisió multidisciplinària dels sistemes d'estimació de la tonalitat; b) Defineix una sèrie de requeriments que han de complir els descriptors tonals de baix nivell; c) Proporciona una avaluació quantitativa i modular dels mètodes proposats; d) Justifica la idea de que per a certes aplicacions es poden fer servir mètodes que treballen amb partitures sense la necessitat de realitzar una transcripció automàtica e) Estén la literatura existent que treballa amb música clàssica a altres generes musicals; f) Demostra la utilitat dels descriptors tonals per a comparar peces musicals; g) Proporciona un algoritme optimitzat que es fa servir dins un sistema real per a visualització, cerca i recomanació musical, que treballa amb més d'un milió de obres musicals.
Esta tesis doctoral propone y evalúa un enfoque computacional para la descripción automática de aspectos tonales de la música a partir del análisis de señales de audio polifónicas. Estos métodos se centran en calcular descriptores de distribución de notas, en estimar la tonalidad de una pieza, en visualizar la evolución del centro tonal o en medir la similitud tonal entre dos piezas diferentes.
Esta tesis contribuye sustancialmente al campo de la descripción tonal mediante métodos computacionales: a) Proporciona una revisión multidisciplinar de los sistemas de estimación de la tonalidad; b) Define una serie de requerimientos que deben cumplir los descriptores tonales de bajo nivel; c) Proporciona una evaluación cuantitativa y modular de los métodos propuestos; d) Respalda la idea de que para ciertas aplicaciones no es necesario obtener una transcripción perfecta de la partitura, y que se pueden utilizar métodos que trabajan con partituras sin realizar una transcripción automática; e) Extiende la literatura existente que trabaja con música clásica a otros géneros musicales; f) Demuestra la utilidad de los descriptores tonales para comparar piezas musicales; g) Proporciona un algoritmo optimizado que se utiliza en un sistema real para visualización, búsqueda y recomendación musical, que trabaja con mas de un millón de piezas musicales.
This doctoral dissertation proposes and evaluates a computational approach for the automatic description of tonal aspects of music from the analysis of polyphonic audio signals. These algorithms focus on the computation of pitch class distributions descriptors, the estimation of the key of a piece, the visualization of the evolution of its tonal center or the measurement of the similarity between two different musical pieces.
This dissertation substantially contributes to the field of computational tonal description: a) It provides a multidisciplinary review of tonal induction systems; b) It defines a set of requirements for low-level tonal features; c) It provides a quantitative and modular evaluation of the proposed methods; d) It contributes to bridge the gap between audio and symbolic-oriented methods without the need of a perfect transcription; e) It extents current literature dealing with classical music to other musical genres; f) It shows the usefulness of tonal descriptors for music similarity; g) It provides an optimized method which is used in a real system for music visualization and retrieval, working with over a million of musical pieces.

Style APA, Harvard, Vancouver, ISO itp.

22

Saruwatari, Hiroshi. "BLIND SIGNAL SEPARATION OF AUDIO SIGNALS". INTELLIGENT MEDIA INTEGRATION NAGOYA UNIVERSITY / COE, 2006. http://hdl.handle.net/2237/10406.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

23

Najafzadeh-Azghandi, Hossein. "Perceptual coding of narrowband audio signals". Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2000. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape4/PQDD_0033/NQ64628.pdf.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

24

Godsill, Simon John. "The restoration of degraded audio signals". Thesis, University of Cambridge, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.296641.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

25

Bolton, Jered. "Gestural extraction from musical audio signals". Thesis, University of Glasgow, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.417664.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

26

Hicks, C. M. "Modelling of multi-channel audio signals". Thesis, University of Cambridge, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.603997.

Pełny tekst źródła

Streszczenie:

This dissertation is concerned with the mathematical modelling of musical and audio signals. The emphasis is on multi-channel signals where either more than one copy of a single original is available for analysis, or where the signal comprises two or more parts. The most common example of this latter class is stereo signals which comprise a left and a right signal to create an auditory illusion of space. Two models are analysed in which we have multiple observations of a single signal. Both are based on the well-known auto-regressive (AR) model which has previously been successfully deployed in many audio applications. The first of these is the Multiply-Observed AR Model in which a single AR signal is contaminated by a number of independent interference signals to give multiple noisy observations of the original. It is shown that the statistics of the noise sources can be determined given certain broad assumptions. The model is applied to the problem of broadband noise reduction of a 78 r.p.m. record, of which a number of copies are available. The second model is the Ensemble-AR Model in which an ensemble of excitation sources drive identical AR filters to give multiple observed signals. Methods for estimation of the AR parameters from the observed data are derived. The model is applied to the detection of impulsive noise in audio signals, and interpolation of the missing data. The E-AR model is demonstrated to be superior to a similar single-channel approach in both of these areas. There is such a variety of stereo signals in existence that a very general model is needed to encompass their whole spectrum. The Coupled-ARMA Model put forward here is based on the ARMA model, but generates a pair of interdependent signals. Its structure allows efficient estimation of its parameters, and various methods for this are examined. Interpolators for Coupled-ARMA signals are derived. For much multi-channel audio work it is necessary to ensure that the observed signals are accurately aligned with each other. Where multiple copies of a disc or tape are under examination this is a difficult problem, since even minute time offsets and speed fluctuations lead to effects such as time-varying comb-filtering when the signals are summed. We examine this problem in detail, and develop a robust scheme for resynchronising signals in a Bayesian statistical framework. Quantisation of audio signals has received much recent research effort. The final part of the dissertation presents a flexible model-based quantisation algorithm. The algorithm is demonstrated in the quantisation of narrow-band signals, and as a powerful enhancement to a simple linear prediction coding system.

Style APA, Harvard, Vancouver, ISO itp.

27

Shoji, Seiichiro. "Efficient individualisation of binaural audio signals". Thesis, University of York, 2007. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.442378.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

28

Xiao, Zhongzhe. "Recognition of emotions in audio signals". Ecully, Ecole centrale de Lyon, 2008. http://www.theses.fr/2008ECDL0002.

Pełny tekst źródła

Streszczenie:

Les travaux de recherche réalisés dans le cadre de cette thèse de doctorat portent sur la reconnaissance automatique de l’émotion et de l’humeur au sein de signaux sonores. En effet, l’émotion portée par les signaux audio constitue une information sémantique particulièrement importante dont l’analyse automatique offre de nombreuses possibilités en termes d’applications, telles que les interactions homme-machine intelligentes et l’indexation multimédia. L’objectif de cette thèse est ainsi d’étudier des solutions informatique d’analyse de l’émotion audio tant pour la parole que pour les signaux musicaux. Nous utilisons dans notre travail un modèle émotionnel discret combiné à un modèle dimensionnel, en nous appuyant sur des études existantes sur les corrélations entre les propriétés acoustiques et l’émotion dans la parole ainsi que l’humeur dans les signaux de musique. Les principales contributions de nos travaux sont les suivantes. Tout d’abord, nous avons proposé, en complément des caractéristiques audio basées sur les propriétés fréquentielles et d’énergie, de nouvelles caractéristiques harmoniques et Zipf, afin d’améliorer la caractérisation des propriétés des signaux de parole en terme de timbre et de prosodie. Deuxièmement, dans la mesure où très peu de ressources pour l’étude de l’émotion dans la parole et dans la musique sont disponibles par rapport au nombre important de caractéristiques audio qu’il est envisageable d’extraire, une méthode de sélection de caractéristiques nomméeESFS, basée sur la théorie de l’évidence est proposée afin de simplifier le modèle de classification et d’en améliorer les performances. De plus, nous avons montré que l’utilisation d’un classifieur hiérarchique basé sur un modèle dimensionnel de l’émotion, permet d’obtenir de meilleurs résultats de classification qu’un unique classifieur global, souvent utilisé dans la littérature. Par ailleurs, puisqu’il n’existe pas d’accord universel sur la définition des émotions de base, et parce que les états émotionnels considérés sont très dépendant des applications, nous avons également proposé un algorithme basés sur ESFS et permettant de construire automatiquement un classifieur hiérarchique adapté à un ensemble spécifique d’états émotionnels dans le cadre d’une application particulière. Cette classification hiérarchique procède en divisant un problème de classification complexe en un ensemble de problèmes plus petits et plus simples grâce à la combinaison d’un ensemble de sous-classifieurs binaires organisés sous forme d’un arbre binaire. Enfin, les émotions étant par nature des notions subjectives, nous avons également proposé un classifieur ambigu, basé sur la théorie de l’évidence, permettant l’association d’un signal audio à de multiples émotions, comme le font souvent les êtres humains
This Ph. D thesis work is dedicated to automatic emotion/mood recognition in audio signals. Indeed, audio emotion is high semantic information and its automatic analysis may have many applications such as smart human-computer interactions or multimedia indexing. The purpose of this thesis is thus to investigate machine-based audio emotion analysis solutions for both speech and music signals. Our work makes use of a discrete emotional model combined with the dimensional one and relies upon existing studies on acoustics correlates of emotional speech and music mood. The key contributions are the following. First, we have proposed, in complement to popular frequency-based and energy-based features, some new audio features, namely harmonic and Zipf features, to better characterize timbre and prosodic properties of emotional speech. Second, as there exists very few emotional resources either for speech or music for machine learning as compared to audio features that one can extract, an evidence theory-based feature selection scheme named Embedded Sequential Forward Selection (ESFS) is proposed to deal with the classic “curse of dimensionality” problem and thus over-fitting. Third, using a manually built dimensional emotion model-based hierarchical classifier to deal with fuzzy borders of emotional states, we demonstrated that a hierarchical classification scheme performs better than single global classifier mostly used in the literature. Furthermore, as there does not exist any universal agreement on basic emotion definition and as emotional states are typically application dependent, we also proposed a ESFS-based algorithm for automatically building a hierarchical classification scheme (HCS) which is best adapted to a specific set of application dependent emotional states. The HCS divides a complex classification problem into simpler and smaller problems by combining several binary sub-classifiers in the structure of a binary tree in several stages, and gives the result as the type of emotional states of the audio samples. Finally, to deal with the subjective nature of emotions, we also proposed an evidence theory-based ambiguous classifier allowing multiple emotions labeling as human often does. The effectiveness of all these recognition techniques was evaluated on Berlin and DES datasets for emotional speech recognition and on a music mood dataset that we collected in our laboratory as there exist no public dataset so far. Keywords: audio signal, emotion classification, music mood analysis, audio features, feature selection, hierarchical classification, ambiguous classification, evidence theory

Style APA, Harvard, Vancouver, ISO itp.

29

Xiao, Zhongzhe Chen Liming. "Recognition of emotions in audio signals". Ecully : Ecole Centrale de Lyon, 2008. http://bibli.ec-lyon.fr/exl-doc/zxiao.pdf.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

30

Miyajima, C., D. Negi, Y. Ninomiya, M. Sano, K. Mori, K. Itou, K. Takeda i Y. Suenaga. "Audio-Visual Speech Database for Bimodal Speech Recognition". INTELLIGENT MEDIA INTEGRATION NAGOYA UNIVERSITY / COE, 2005. http://hdl.handle.net/2237/10460.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

31

Othman, Noor Shamsiah. "Wireless speech and audio communications". Thesis, University of Southampton, 2008. https://eprints.soton.ac.uk/64488/.

Pełny tekst źródła

Streszczenie:

The limited applicability of Shannon’s separation theorem in practical speech/audio systems motivates the employment of joint source and channel coding techniques. Thus, considerable efforts have been invested in designing various types of joint source and channel coding schemes. This thesis discusses two different types of Joint Source and Channel Coding (JSCC) schemes, namely Unequal Error Protection (UEP) aided turbo transceivers as well as Iterative Source and Channel Decoding (ISCD) exploiting the residual redundancy inherent in the source encoded parameters. More specifically, in Chapter 2, two different UEP JSCC philosophies were designed for wireless audio and speech transmissions, namely a turbo-detected UEP scheme using twin-class convolutional codes and another turbo detector using more sophisticated Irregular Convolutional Codes (IRCC). In our investigations, the MPEG-4 Advanced Audio Coding (AAC), the MPEG-4 Transform-Domain Weighted Interleaved Vector Quantization (TwinVQ) and the Adaptive MultiRate WideBand (AMR-WB) audio/speech codecs were incorporated in the sophisticated UEP turbo transceiver, which consisted of a threestage serially concatenated scheme constituted by Space-Time Trellis Coding (STTC), Trellis Coded Modulation (TCM) and two different-rate Non-Systematic Convolutional codes (NSCs) used for UEP. Explicitly, both the twin-class UEP turbo transceiver assisted MPEG-4 TwinVQ and the AMR-WB audio/speech schemes outperformed their corresponding single-class audio/speech benchmarkers by approximately 0.5 dB, in terms of the required Eb/N0, when communicating over uncorrelated Rayleigh fading channels. By contrast, when employing the MPEG-4 AAC audio codec and protecting the class-1 audio bits using a 2/3-rate NSC code, a more substantial Eb/N0 gain of about 2 dB was achieved. As a further design alternative, we also proposed a turbo transceiver employing IRCCs for the sake of providing UEP for the AMR-WB speech codec. The resultant UEP schemes exhibited a better performance when compared to the corresponding Equal Error Protection (EEP) benchmark schemes, since the former protected the audio/speech bits according to their sensitivity. The proposed UEP aided system using IRCCs exhibits an Eb/N0 gain of about 0.4 dB over the EEP system employing regular convolutional codes, when communicating over AWGN channels, at the point of tolerating a SegSNR degradation of 1 dB. In Chapter 3, a novel system that invokes jointly optimised ISCD for enhancing the error resilience of the AMR-WB speech codec was proposed and investigated. The resultant AMR-WB coded speech signal is protected by a Recursive Systematic onvolutional (RSC) code and transmitted using a non-coherently detected Multiple-Input Multiple-Output (MIMO) Differential Space-Time Spreading (DSTS) scheme. To further enhance the attainable system performance and to maximise the coding advantage of the proposed transmission scheme, the system is also combined with multi-dimensional Sphere Packing (SP) modulation. The AMR-WB speech decoder was further developed for the sake of accepting the a priori information passed to it from the channel decoder as extrinsic information, where the residual redundancy inherent in the AMR-WB encoded parameters was exploited. Moreover, the convergence behaviour of the proposed scheme was evaluated with the aid of both Three-Dimensional (3D) and Two-Dimenstional (2D) EXtrinsic Information Transfer (EXIT) charts. The proposed scheme benefitted from the exploitation of the residual redundancy inherent in the AMR-WB encoded parameters, where an approximately 0.5 dB Eb/N0 gain was achieved in comparison to its corresponding hard speech decoding based counterpart. At the point of tolerating a SegSNR degradation of 1 dB, the advocated scheme exhibited an Eb/N0 gain of about 1.0 dB in comparison to the benchmark scheme carrying out joint channel decoding and DSTS aided SP-demodulation in conjunction with separate AMR-WB decoding, when communicating over narrowband temporally correlated Rayleigh fading channels. In Chapter 4, two jointly optimized ISCD schemes invoking the soft-output AMRWB speech codec using DSTS assisted SP modulation were proposed. More specifically, the soft-bit assisted iterative AMR-WB decoder’s convergence characteristics were further enhanced by using Over-Complete source-Mapping (OCM), as well as a recursive precoder. EXIT charts were used to analyse the convergence behaviour of the proposed turbo transceivers using the soft-bit assisted AMR-WB decoder. Explicitly, the OCM aided AMR-WB MIMO transceiver exhibits an Eb/N0 gain of about 3.0 dB in comparison to the benchmark scheme also using ISCD as well as DSTS aided SP-demodulation, but dispensing with the OCM scheme, when communicating over narrowband temporally correlated Rayleigh fading channels. Finally, the precoded soft-bit AMR-WB MIMO transceiver exhibits an Eb/N0 gain of about 1.5 dB in comparison to the benchmark scheme dispensing with the precoder, when communicating over narrowband temporally correlated Rayleigh fading channels.

Style APA, Harvard, Vancouver, ISO itp.

32

Barkmeier, Julie Marie. "Intelligibility of dysarthric speakers: audio-only and audio-visual presentations". Thesis, University of Iowa, 1988. https://ir.uiowa.edu/etd/5698.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

33

Xia, Feng. "Perceptual coding for high-quality audio signals". Ohio : Ohio University, 1998. http://www.ohiolink.edu/etd/view.cgi?ohiou1176235728.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

34

Lanciani, Christopher A. "Compressed-domain processing of MPEG audio signals". Diss., Georgia Institute of Technology, 1999. http://hdl.handle.net/1853/13760.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

35

Conway, Alexander. "Improving Broadband Noise Filter For Audio Signals". DigitalCommons@CalPoly, 2012. https://digitalcommons.calpoly.edu/theses/747.

Pełny tekst źródła

Streszczenie:

Abstract Improving broadband noise filter for audio signals By Alex Conway Filtering broadband noise in audio signals has always been a challenge since the noise is spread across a large bandwidth and overlaps the spectral range of the audio signal being recovered. There have been several broadband noise reduction techniques, such as spectral subtraction, developed for speech enhancement that might be applied to reduce noise in musical audio recordings as well. One such technique that is investigated in this thesis identifies the harmonic components of a signal to be preserved as those spectral magnitude peaks that exceed a chosen decision threshold. All other spectral components below the threshold are considered to be noise. Noise components are then attenuated enough to render them imperceptible in the presence of the harmonic signal, using a psycho-acoustical model of sound masking. The objective of this thesis is to show that this algorithm can be adapted and improved for musical audio noise reduction. Improvements include relaxing the filter when percussion events are anticipated, since these appear spectrally similar to broadband noise but are essential to the musical experience; and using harmonic prediction to preserve more of the signal harmonics than the simple threshold would pass. Harmonic prediction takes advantage of the known harmonic spacing of musical content and places noise filter pass bands around a fixed number of harmonics of every fundamental pitch found; passing some harmonics that would get filtered out by the original thresholding algorithm. Noise filtering improvements were assessed for both noise reduction and signal degradation effects by different signal to noise ratio computations. The audio results with and without improvements were also assessed using subjective listening tests, since how the sound is perceived is what matters most in musical recordings. Quantitative results show improved signal to noise ratios of filtered audio signals when the improvements were included compared to the original threshold-based filter. Perceived sound quality in listening tests was also higher with the percussion preservation and harmonic prediction improvements. In all listening tests, every listener rated the improved filter as the best.

Style APA, Harvard, Vancouver, ISO itp.

36

Le, Cornu Thomas. "Reconstruction of intelligible audio speech from visual speech information". Thesis, University of East Anglia, 2016. https://ueaeprints.uea.ac.uk/67012/.

Pełny tekst źródła

Streszczenie:

The aim of the work conducted in this thesis is to reconstruct audio speech signals using information which can be extracted solely from a visual stream of a speaker's face, with application for surveillance scenarios and silent speech interfaces. Visual speech is limited to that which can be seen of the mouth, lips, teeth, and tongue, where the visual articulators convey considerably less information than in the audio domain, leading to the task being difficult. Accordingly, the emphasis is on the reconstruction of intelligible speech, with less regard given to quality. A speech production model is used to reconstruct audio speech, where methods are presented in this work for generating or estimating the necessary parameters for the model. Three approaches are explored for producing spectral-envelope estimates from visual features as this parameter provides the greatest contribution to speech intelligibility. The first approach uses regression to perform the visual-to-audio mapping, and then two further approaches are explored using vector quantisation techniques and classification models, with long-range temporal information incorporated at the feature and model-level. Excitation information, namely fundamental frequency and aperiodicity, is generated using artificial methods and joint-feature clustering approaches. Evaluations are first performed using mean squared error analyses and objective measures of speech intelligibility to refine the various system configurations, and then subjective listening tests are conducted to determine word-level accuracy, giving real intelligibility scores, of reconstructed speech. The best performing visual-to-audio domain mapping approach, using a clustering-and-classification framework with feature-level temporal encoding, is able to achieve audio-only intelligibility scores of 77 %, and audiovisual intelligibility scores of 84 %, on the GRID dataset. Furthermore, the methods are applied to a larger and more continuous dataset, with less favourable results, but with the belief that extensions to the work presented will yield a further increase in intelligibility.

Style APA, Harvard, Vancouver, ISO itp.

37

Fackrell, Justin W. A. "Bispectral analysis of speech signals". Thesis, University of Edinburgh, 1997. http://hdl.handle.net/1842/1384.

Pełny tekst źródła

Streszczenie:

Techniques which utilise a signal's Higher Order Statistics (HOS) can reveal information about non-Gaussian signals and nonlinearities which cannot be obtained using conventional (second-order) techniques. This information may be useful in speech processing because it may provide clues about how to construct new models of speech production which are better than existing models. There has been a recent surge of interest in the application of HOS techniques to speech processing, but this has been handicapped by a lack of understanding of what the HOS properties of speech signals are. Without this understanding the HOS information which is in speech signals can not be efficiently utilised. This thesis describes an investigation into the use of HOS techniques, in particular the third-order frequency domain measure called the bispectrum, to speech signals. Several issues relating to bispectral speech analysis are addressed, including nonlinearity detection, pitch-synchronous analysis, estimation criteria and stationarity. A flaw is identified in an existing algorithm for detecting quadratic nonlinearities, and a new detector is proposed which has better statistical properties. In addition, a new algorithm is developed for estimating the normalised bispectrum of signals contaminated by transient noise. Finally the tools developed in the study are applied to a specially constructed database of continuant speech sounds. The results are consistent with the hypothesis that speech signals do not exhibit quadratic nonlinearity.

Style APA, Harvard, Vancouver, ISO itp.

38

Anderson, Mark David. "Pitch determination of speech signals". Thesis, Massachusetts Institute of Technology, 1986. http://hdl.handle.net/1721.1/14999.

Pełny tekst źródła

Streszczenie:

Thesis (M.S.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1986.
MICROFICHE COPY AVAILABLE IN ARCHIVES AND ENGINEERING
Bibliography: leaves 138-147.
by Mark David Anderson.
M.S.

Style APA, Harvard, Vancouver, ISO itp.

39

Seymour, R. "Audio-visual speech and speaker recognition". Thesis, Queen's University Belfast, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.492489.

Pełny tekst źródła

Streszczenie:

In this thesis, a number of important issues relating to the use of both audio and video information for speech and speaker recognition are investigated. A comprehensive comparison of different visual feature types is given, including both geometric and image transformation based features. A new geometric based method for feature extraction is described, as well as the novel use of curvelet based features. Different methods for constructing the feature vectors are compared, as well as feature vector sizes and the use of dynamic features. Each feature type is tested against three types of visual noise: compression, blurring and jitter. A novel method of integrating the audio and video information streams called the maximum stream posterior (MSP) is described. This method is tested in both speaker dependent and speaker independent audio-visual speech recognition (AVSR) systems, and is shown to be robust to noise in either the audio or video streams, given no prior knowledge of the noise. This method is then extended to form the maximum weighted stream posterior (MWSP) method. Finally, both the MSP and MWSP are tested in an audio-visual speaker recognition system (AVSpR). / Experiments using the XM2VTS database will show that both of these methods can outperform ,_.','/ standard methods in terms of recognition accuracy in situations where either stream is corrupted.

Style APA, Harvard, Vancouver, ISO itp.

40

Pachoud, Samuel. "Audio-visual speech and emotion recognition". Thesis, Queen Mary, University of London, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.528923.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

41

Matthews, Iain. "Features for audio-visual speech recognition". Thesis, University of East Anglia, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.266736.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

42

Rao, Ram Raghavendra. "Audio-visual interaction in multimedia". Diss., Georgia Institute of Technology, 1998. http://hdl.handle.net/1853/13349.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

43

Clemedson, Johan. "Audio Generation from Radar signals, for target classification". Thesis, KTH, Optimeringslära och systemteori, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-215502.

Pełny tekst źródła

Streszczenie:

Classification in radar application are often of great interest, since one does not only want to know where a target is, but also what type of target it is. This thesis focus on transforming the radar return from a target into a audio signal. So that the classification can be done by human perception, in this case human hearing. The aim of these classification methods is to be able to distinguish between two types of targets of roughly the same size, namely birds and smaller Unmanned Aerial Vehicles (UAV). It is possible with the radar to measure the targets velocity by using the Doppler effect. To be able to distinguish in which direction the target is moving are a so called I/Q representation of the radar return used, which is a complex representation of the signal. Using signal processing techniques, we extract radar signals generated from the target. By spectral transforms it is possible to generate real valued signals from the extracted target signals. It is required to extend these signals to be able to use them as audio signals, this is done with an extrapolation technique based on Autoregressive (AR) processes. The extrapolated signals are the signals used as the audio output, it is possible to perform the audio classification in most of the cases. This project is done in collaboration with Sebastian Edman [7], where different perspectives of radar classification has been investigated. As mentioned this thesis focus on transforming the radar return into an audio signal. While Edman in his thesis [7] making use of a machine learning approach to classify the targets from the generated audio signal.
Klassificering är ofta av stort intresse inom radarapplikation, eftersom man inte bara vill veta var ett mål befinner sig men också vad för typ av mål det är. Denna uppsats fokuserar på att omvandla radarekot från ett mål till en ljudsignal. Så att klassificeringen kan ske med mänskliga sinnen, i detta fall hörseln. Syftet med dessa klassificeringsmetoder är att kunna klassificera två typer av mål med ungefär samma storlek, nämligen fåglar och mindre obemannade flygfordon (UAV). Det är möjligt att med radarn mäta målets hastighet med hjälp av Doppler-effekten. För att kunna avgöra i vilken riktning målet rör sig används en I/Q-representation, som är en komplex representation av radar signalen. Med signalbehandling är det möjligt att extrahera radar signaler som målet generar. Genom att använda spektrala transformationer är det möjligt att generera reellvärda signaler från de extraherade målsignalerna. Det är nödvändigt att förlänga dessa signaler för att kunna använda dem som ljudsignaler, detta görs med en extrapoleringsteknik baserad på Autoregressiva (AR) -processer. De ljudsignaler som används är dessa extrapolerade signalerna, det är i det flesta fall möjligt att utifrån ljudet genomföra klassificeringen. Detta projekt är utfört i samarbete med Sebastian Edman [7], där olika inriktningar av radarklassificering har undersökts. Som nämnts ovan fokuserar denna uppsats på att omvandla

Style APA, Harvard, Vancouver, ISO itp.

44

Chen, Bingwei. "Adaptive watermarking algorithms for MP3 compressed audio signals". Thesis, University of Ottawa (Canada), 2008. http://hdl.handle.net/10393/27963.

Pełny tekst źródła

Streszczenie:

MPEG-1 Layer 3, known as MP3, has generated a significant popularity for distributing digital music over the Internet. MP3 compresses digital music with high ratio while keeping high sound quality. However, copyright issue is raised because of illegal copy, redistribution and various malicious attacks. Digital watermarking is a technology that allows users to embed some imperceptible data into digital contents such as image, movie and audio data. Once a watermark is embedded into the original MP3 signal, it can be used to identify the copyright holder in order to prevent illegal copy and to verify the modification from the original content. This thesis presents two novel adaptive watermarking algorithms for MP3 compressed audio signals for copyright protection. Based on Human Auditory System, the proposed algorithms calculate the energy of the original audio signal and apply Gaussian analysis on MP3 frames to adaptively adjust the watermarking coefficients. Watermark is embedded adaptively and transparently during the MP3 compression. The first watermarking algorithm detects watermark based on Gaussian distribution analysis. To enhance the security of the watermark, the second watermarking algorithm embeds random watermark pattern and uses correlation coefficient to detect watermark. Both algorithms support blind watermark detection and perform well. The first algorithm is more robust while the second algorithm is more secure. LAME 3.96.2 open source was used as standard ISO MP3 encoder and decoder reference in this study. The experimental results show that the proposed watermarking algorithms can work on a variety of audio signals and survive most common signal manipulation and malicious attacks. As expected, the watermarking algorithms provide superior performance on MP3 compression.

Style APA, Harvard, Vancouver, ISO itp.

45

Ning, Daryl. "Analysis and coding of high quality audio signals". Thesis, Queensland University of Technology, 2003. https://eprints.qut.edu.au/15814/1/Daryl_Ning_Thesis.pdf.

Pełny tekst źródła

Streszczenie:

Digital audio is increasingly becoming more and more a part of our daily lives. Unfortunately, the excessive bitrate associated with the raw digital signal makes it an extremely expensive representation. Applications such as digital audio broadcasting, high definition television, and internet audio, require high quality audio at low bitrates. The field of audio coding addresses this important issue of reducing the bitrate of digital audio, while maintaining a high perceptual quality. Developing an efficient audio coder requires a detailed analysis of the audio signals themselves. It is important to find a representation that can concisely model any general audio signal. In this thesis, we propose two new high quality audio coders based on two different audio representations - the sinusoidal-wavelet representation, and the warped linear predictive coding (WLPC)-wavelet representation. In addition to high quality coding, it is also important for audio coders to be flexible in their application. With the increasing popularity of internet audio, it is advantageous for audio coders to address issues related to real-time audio delivery. The issue of bitstream scalability has been targeted in this thesis, and therefore, a third audio coder capable of bitstream scalability is also proposed. The performance of each of the proposed coders was evaluated by comparisons with the MPEG layer III coder. The first coder proposed is based on a hybrid sinusoidal-wavelet representation. This assumes that each frame of audio can be modelled as a sum of sinusoids plus a noisy residual. The discrete wavelet transform (DWT) is used to decompose the residual into subbands that approximate the critical bands of human hearing. A perceptually derived bit allocation algorithm is then used to minimise the audible distortions introduced from quantising the DWT coefficients. Listening tests showed that the coder delivers near-transparent quality for a range of critical audio signals at G4 kbps. It also outperforms the MPEG layer III coder operating at this same bitrate. This coder, however, is only useful for high quality coding, and is difficult to scale to operate at lower rates. The second coder proposed is based on a hybrid WLPC-wavelet representation. In this approach, the spectrum of the audio signal is estimated by an all pole filter using warped linear prediction (WLP). WLP operates on a warped frequency domain, where the resolution can be adjusted to approximate that of the human auditory system. This makes the inherent noise shaping of the synthesis filter even more suited to audio coding. The excitation to this filter is transformed using the DWT and perceptually encoded. Listening tests showed that near-transparent coding is achieved at G4 kbps. The coder was also found to be slightly superior to the MPEG layer III coder operating at this same bitrate. The third proposed coder is similar to the previous WLPC-wavelet coder, but modified to achieve bitstream scalability. A noise model for high frequency components is included to keep the overall bitrate low, and a two stage quantisation scheme for the DWT coefficients is implemented. The first stage uses fixed rate scalar and vector quantisation to provide a coarse approximation of the coefficients. This allows for low bitrate, low quality versions of the input signal to be embedded in the overall bitstream. The second stage of quantisation adds detail to the coefficients, and hence, enhances the quality of the output signal. Listening tests showed that signal quality gracefully improves as the bitrate increases from 16 kbps to SO kbps. This coder has a performance that is comparable to the MPEG layer III coder operating at a similar (but fixed) bitrate.

Style APA, Harvard, Vancouver, ISO itp.

46

Ning, Daryl. "Analysis and Coding of High Quality Audio Signals". Queensland University of Technology, 2003. http://eprints.qut.edu.au/15814/.

Pełny tekst źródła

Streszczenie:

Digital audio is increasingly becoming more and more a part of our daily lives. Unfortunately, the excessive bitrate associated with the raw digital signal makes it an extremely expensive representation. Applications such as digital audio broadcasting, high definition television, and internet audio, require high quality audio at low bitrates. The field of audio coding addresses this important issue of reducing the bitrate of digital audio, while maintaining a high perceptual quality. Developing an efficient audio coder requires a detailed analysis of the audio signals themselves. It is important to find a representation that can concisely model any general audio signal. In this thesis, we propose two new high quality audio coders based on two different audio representations - the sinusoidal-wavelet representation, and the warped linear predictive coding (WLPC)-wavelet representation. In addition to high quality coding, it is also important for audio coders to be flexible in their application. With the increasing popularity of internet audio, it is advantageous for audio coders to address issues related to real-time audio delivery. The issue of bitstream scalability has been targeted in this thesis, and therefore, a third audio coder capable of bitstream scalability is also proposed. The performance of each of the proposed coders was evaluated by comparisons with the MPEG layer III coder. The first coder proposed is based on a hybrid sinusoidal-wavelet representation. This assumes that each frame of audio can be modelled as a sum of sinusoids plus a noisy residual. The discrete wavelet transform (DWT) is used to decompose the residual into subbands that approximate the critical bands of human hearing. A perceptually derived bit allocation algorithm is then used to minimise the audible distortions introduced from quantising the DWT coefficients. Listening tests showed that the coder delivers near-transparent quality for a range of critical audio signals at G4 kbps. It also outperforms the MPEG layer III coder operating at this same bitrate. This coder, however, is only useful for high quality coding, and is difficult to scale to operate at lower rates. The second coder proposed is based on a hybrid WLPC-wavelet representation. In this approach, the spectrum of the audio signal is estimated by an all pole filter using warped linear prediction (WLP). WLP operates on a warped frequency domain, where the resolution can be adjusted to approximate that of the human auditory system. This makes the inherent noise shaping of the synthesis filter even more suited to audio coding. The excitation to this filter is transformed using the DWT and perceptually encoded. Listening tests showed that near-transparent coding is achieved at G4 kbps. The coder was also found to be slightly superior to the MPEG layer III coder operating at this same bitrate. The third proposed coder is similar to the previous WLPC-wavelet coder, but modified to achieve bitstream scalability. A noise model for high frequency components is included to keep the overall bitrate low, and a two stage quantisation scheme for the DWT coefficients is implemented. The first stage uses fixed rate scalar and vector quantisation to provide a coarse approximation of the coefficients. This allows for low bitrate, low quality versions of the input signal to be embedded in the overall bitstream. The second stage of quantisation adds detail to the coefficients, and hence, enhances the quality of the output signal. Listening tests showed that signal quality gracefully improves as the bitrate increases from 16 kbps to SO kbps. This coder has a performance that is comparable to the MPEG layer III coder operating at a similar (but fixed) bitrate.

Style APA, Harvard, Vancouver, ISO itp.

47

Dabis, Homam Sabih. "The computer enhancement of speech signals". Thesis, University of the West of Scotland, 1991. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.304636.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

48

Kaucic, Robert August. "Lip tracking for audio-visual speech recognition". Thesis, University of Oxford, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.360392.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

49

Sharma, Dinkar. "Effects of attention on audio-visual speech". Thesis, University of Reading, 1989. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.329379.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

50

Hollier, M. P. "Audio quality prediction for telecomunications speech systems". Thesis, University of Essex, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.282496.

Pełny tekst źródła

Style APA, Harvard, Vancouver, ISO itp.

Rozprawy doktorskie na temat „Speech and audio signals”

Utwórz poprawne odniesienie w stylach APA, MLA, Chicago, Harvard i wielu innych