Dissertations / Theses on the topic 'Speech and audio signals'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 dissertations / theses for your research on the topic 'Speech and audio signals.'
Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.
Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.
Mason, Michael. "Hybrid coding of speech and audio signals." Thesis, Queensland University of Technology, 2001.
Trinkaus, Trevor R. "Perceptual coding of audio and diverse speech signals." Diss., Georgia Institute of Technology, 1999. http://hdl.handle.net/1853/13883.
Mészáros, Tomáš. "Speech Analysis for Processing of Musical Signals." Master's thesis, Vysoké učení technické v Brně, Fakulta informačních technologií, 2015. http://www.nusl.cz/ntk/nusl-234974.
Choi, Hyung Keun. "Blind source separation of the audio signals in a real world." Thesis, Georgia Institute of Technology, 2002. http://hdl.handle.net/1853/14986.
Lucey, Simon. "Audio-visual speech processing." Thesis, Queensland University of Technology, 2002. https://eprints.qut.edu.au/36172/7/SimonLuceyPhDThesis.pdf.
Anderson, David Verl. "Audio signal enhancement using multi-resolution sinusoidal modeling." Diss., Georgia Institute of Technology, 1999. http://hdl.handle.net/1853/15394.
Zeghidour, Neil. "Learning representations of speech from the raw waveform." Thesis, Paris Sciences et Lettres (ComUE), 2019. http://www.theses.fr/2019PSLEE004/document.
Full textWhile deep neural networks are now used in almost every component of a speech recognition system, from acoustic to language modeling, the input to such systems are still fixed, handcrafted, spectral features such as mel-filterbanks. This contrasts with computer vision, in which a deep neural network is now trained on raw pixels. Mel-filterbanks contain valuable and documented prior knowledge from human auditory perception as well as signal processing, and are the input to state-of-the-art speech recognition systems that are now on par with human performance in certain conditions. However, mel-filterbanks, as any fixed representation, are inherently limited by the fact that they are not fine-tuned for the task at hand. We hypothesize that learning the low-level representation of speech with the rest of the model, rather than using fixed features, could push the state-of-the art even further. We first explore a weakly-supervised setting and show that a single neural network can learn to separate phonetic information and speaker identity from mel-filterbanks or the raw waveform, and that these representations are robust across languages. Moreover, learning from the raw waveform provides significantly better speaker embeddings than learning from mel-filterbanks. These encouraging results lead us to develop a learnable alternative to mel-filterbanks, that can be directly used in replacement of these features. In the second part of this thesis we introduce Time-Domain filterbanks, a lightweight neural network that takes the waveform as input, can be initialized as an approximation of mel-filterbanks, and then learned with the rest of the neural architecture. Across extensive and systematic experiments, we show that Time-Domain filterbanks consistently outperform melfilterbanks and can be integrated into a new state-of-the-art speech recognition system, trained directly from the raw audio signal. Fixed speech features being also used for non-linguistic classification tasks for which they are even less optimal, we perform dysarthria detection from the waveform with Time-Domain filterbanks and show that it significantly improves over mel-filterbanks or low-level descriptors. Finally, we discuss how our contributions fall within a broader shift towards fully learnable audio understanding systems
Bando, Yoshiaki. "Robust Audio Scene Analysis for Rescue Robots." Kyoto University, 2018. http://hdl.handle.net/2433/232410.
Moghimi, Amir Reza. "Array-based Spectro-temporal Masking For Automatic Speech Recognition." Research Showcase @ CMU, 2014. http://repository.cmu.edu/dissertations/334.
Brangers, Kirstin M. "Perceptual Ruler for Quantifying Speech Intelligibility in Cocktail Party Scenarios." UKnowledge, 2013. http://uknowledge.uky.edu/ece_etds/31.
Harvilla, Mark J. "Compensation for Nonlinear Distortion in Noise for Robust Speech Recognition." Research Showcase @ CMU, 2014. http://repository.cmu.edu/dissertations/437.
Nylén, Helmer. "Detecting Signal Corruptions in Voice Recordings for Speech Therapy." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-291429.
Full textNär en patients röst spelas in för analys i talterapi kan inspelningskvaliteten påverkas av olika signalproblem, till exempel bakgrundsljud eller klippning. Utrustningen och expertisen som behövs för att upptäcka små störningar finns dock inte alltid tillgänglig på mindre kliniker. Därför undersöker denna studie olika maskininlärningsalgoritmer för att automatiskt kunna upptäcka utvalda problem i talinspelningar, bland andra infraljud och slumpmässig utsläckning av signalen. Fem algoritmer analyseras: stödvektormaskin, Convolutional Neural Network, Long Short-term Memory (LSTM), Gaussian mixture model-baserad dold Markovmodell och generatorbaserad dold Markovmodell. Ett verktyg för att skapa datamängder med försämrade inspelningar utvecklas för att kunna testa algoritmerna. Vi undersöker separat fallen där inspelningarna tillåts ha en eller flera problem samtidigt, och använder framförallt en slags kepstralkoefficienter, MFCC:er, som särdrag. För varje typ av problem undersöker vi också sätt att förbättra noggrannheten, till exempel genom att filtrera bort irrelevanta delar av signalen med hjälp av en röstupptäckare, ändra särdragsparametrarna, eller genom att använda en ensemble av klassificerare. Experimenten visar att maskininlärning är ett rimligt tillvägagångssätt för detta problem då den balanserade träffsäkerheten överskrider 75%för samtliga testade störningar. Den delen av studien som fokuserade på enproblemsinspelningar gav inga resultat som tydde på att en algoritm var klart bättre än de andra, men i flerproblemsfallet överträffade LSTM:en generellt övriga algoritmer. Värt att notera är att den nådde över 95 % balanserad träffsäkerhet på både vitt brus och infraljud. Eftersom algoritmerna enbart tränats på engelskspråkiga, talade meningar så har detta verktyg i nuläget begränsad praktisk användbarhet. Däremot är det lätt att utöka dessa experiment med andra typer av inspelningar, signalproblem, särdrag eller algoritmer.
Sekiguchi, Kouhei. "A Unified Statistical Approach to Fast and Robust Multichannel Speech Separation and Dereverberation." Doctoral thesis, Kyoto University, 2021. http://hdl.handle.net/2433/263770.
Yoo, Heejong. "Low-Power Audio Input Enhancement for Portable Devices." Diss., Georgia Institute of Technology, 2005. http://hdl.handle.net/1853/6821.
Della Corte, Giuseppe. "Text and Speech Alignment Methods for Speech Translation Corpora Creation: Augmenting English LibriVox Recordings with Italian Textual Translations." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-413064.
Leis, John W. "Spectral coding methods for speech compression and speaker identification." Thesis, Queensland University of Technology, 1998. https://eprints.qut.edu.au/36062/7/36062_Digitised_Thesis.pdf.
Jaureguiberry, Xabier. "Fusion pour la séparation de sources audio." Thesis, Paris, ENST, 2015. http://www.theses.fr/2015ENST0030/document.
Full textUnderdetermined blind source separation is a complex mathematical problem that can be satisfyingly resolved for some practical applications, providing that the right separation method has been selected and carefully tuned. In order to automate this selection process, we propose in this thesis to resort to the principle of fusion which has been widely used in the related field of classification yet is still marginally exploited in source separation. Fusion consists in combining several methods to solve a given problem instead of selecting a unique one. To do so, we introduce a general fusion framework in which a source estimate is expressed as a linear combination of estimates of this same source given by different separation algorithms, each source estimate being weighted by a fusion coefficient. For a given task, fusion coefficients can then be learned on a representative training dataset by minimizing a cost function related to the separation objective. To go further, we also propose two ways to adapt the fusion coefficients to the mixture to be separated. The first one expresses the fusion of several non-negative matrix factorization (NMF) models in a Bayesian fashion similar to Bayesian model averaging. The second one aims at learning time-varying fusion coefficients thanks to deep neural networks. All proposed methods have been evaluated on two distinct corpora. The first one is dedicated to speech enhancement while the other deals with singing voice extraction. Experimental results show that fusion always outperform simple selection in all considered cases, best results being obtained by adaptive time-varying fusion with neural networks
Herms, Robert. "Effective Speech Features for Cognitive Load Assessment: Classification and Regression." Universitätsverlag Chemnitz, 2018. https://monarch.qucosa.de/id/qucosa%3A33346.
Full textDie vorliegende Arbeit befasst sich mit der automatischen Erkennung von kognitiver Belastung auf Basis menschlicher Sprachmerkmale. Der Schwerpunkt liegt auf der Effektivität von akustischen Parametern, wobei die aktuelle Forschung auf diesem Gebiet um neuartige Ansätze erweitert wird. Hierzu wird ein neuer Datensatz – als CoLoSS bezeichnet – vorgestellt, welcher Sprachaufzeichnungen von Nutzern enthält und speziell auf Lernprozesse fokussiert. Zahlreiche Parameter der Prosodie, Stimmqualität und des Spektrums werden im Hinblick auf deren Relevanz analysiert. Darüber hinaus werden die Eigenschaften des Teager Energy Operators, welche typischerweise bei der Stressdetektion Verwendung finden, im Rahmen dieser Arbeit berücksichtigt. Ebenso wird gezeigt, wie automatische Spracherkennungssysteme genutzt werden können, um potenzielle Indikatoren zu extrahieren. Die Eignung der extrahierten Merkmale wird systematisch evaluiert. Dabei kommen sprecherunabhängige Klassifikationssysteme zur Unterscheidung von drei Belastungsstufen zum Einsatz. Zusätzlich wird ein neuartiger Ansatz zur sprachbasierten Modellierung der kognitiven Belastung vorgestellt, bei dem die Belastung eine kontinuierliche Größe darstellt und eine Vorhersage folglich als ein Regressionsproblem betrachtet werden kann.
Fong, Katherine KaYan. "IR-Depth Face Detection and Lip Localization Using Kinect V2." DigitalCommons@CalPoly, 2015. https://digitalcommons.calpoly.edu/theses/1425.
Almajai, Ibrahim M. "Audio Visual Speech Enhancement." Thesis, University of East Anglia, 2009. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.514309.
Gómez Gutiérrez, Emilia. "Tonal description of music audio signals." Doctoral thesis, Universitat Pompeu Fabra, 2006. http://hdl.handle.net/10803/7537.
Full textAquesta tesi contribueix substancialment al camp de la descripció tonal mitjançant mètodes computacionals: a) Proporciona una revisió multidisciplinària dels sistemes d'estimació de la tonalitat; b) Defineix una sèrie de requeriments que han de complir els descriptors tonals de baix nivell; c) Proporciona una avaluació quantitativa i modular dels mètodes proposats; d) Justifica la idea de que per a certes aplicacions es poden fer servir mètodes que treballen amb partitures sense la necessitat de realitzar una transcripció automàtica e) Estén la literatura existent que treballa amb música clàssica a altres generes musicals; f) Demostra la utilitat dels descriptors tonals per a comparar peces musicals; g) Proporciona un algoritme optimitzat que es fa servir dins un sistema real per a visualització, cerca i recomanació musical, que treballa amb més d'un milió de obres musicals.
Esta tesis doctoral propone y evalúa un enfoque computacional para la descripción automática de aspectos tonales de la música a partir del análisis de señales de audio polifónicas. Estos métodos se centran en calcular descriptores de distribución de notas, en estimar la tonalidad de una pieza, en visualizar la evolución del centro tonal o en medir la similitud tonal entre dos piezas diferentes.
Esta tesis contribuye sustancialmente al campo de la descripción tonal mediante métodos computacionales: a) Proporciona una revisión multidisciplinar de los sistemas de estimación de la tonalidad; b) Define una serie de requerimientos que deben cumplir los descriptores tonales de bajo nivel; c) Proporciona una evaluación cuantitativa y modular de los métodos propuestos; d) Respalda la idea de que para ciertas aplicaciones no es necesario obtener una transcripción perfecta de la partitura, y que se pueden utilizar métodos que trabajan con partituras sin realizar una transcripción automática; e) Extiende la literatura existente que trabaja con música clásica a otros géneros musicales; f) Demuestra la utilidad de los descriptores tonales para comparar piezas musicales; g) Proporciona un algoritmo optimizado que se utiliza en un sistema real para visualización, búsqueda y recomendación musical, que trabaja con mas de un millón de piezas musicales.
This doctoral dissertation proposes and evaluates a computational approach for the automatic description of tonal aspects of music from the analysis of polyphonic audio signals. These algorithms focus on the computation of pitch class distribution descriptors, the estimation of the key of a piece, the visualization of the evolution of its tonal center, and the measurement of the similarity between two different musical pieces.
This dissertation substantially contributes to the field of computational tonal description: a) It provides a multidisciplinary review of tonal induction systems; b) It defines a set of requirements for low-level tonal features; c) It provides a quantitative and modular evaluation of the proposed methods; d) It contributes to bridging the gap between audio and symbolic-oriented methods without the need for a perfect transcription; e) It extends current literature dealing with classical music to other musical genres; f) It shows the usefulness of tonal descriptors for music similarity; g) It provides an optimized method which is used in a real system for music visualization and retrieval, working with over a million musical pieces.
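Template-based key estimation of the kind evaluated here can be illustrated by correlating a 12-bin pitch class distribution with the 24 rotated major/minor key templates. A minimal sketch using the Krumhansl-Kessler probe-tone profiles; the dissertation's HPCP features and tuned profiles differ in detail:

```python
import numpy as np

# Krumhansl-Kessler probe-tone profiles for major and minor keys.
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def estimate_key(pcp):
    # Correlate the pitch class profile with all 24 rotated templates
    # and return the best-matching (tonic, mode) pair.
    best, best_r = None, -np.inf
    for mode, profile in (("major", MAJOR), ("minor", MINOR)):
        for tonic in range(12):
            r = np.corrcoef(pcp, np.roll(profile, tonic))[0, 1]
            if r > best_r:
                best, best_r = (NOTES[tonic], mode), r
    return best
```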
Saruwatari, Hiroshi. "BLIND SIGNAL SEPARATION OF AUDIO SIGNALS." INTELLIGENT MEDIA INTEGRATION NAGOYA UNIVERSITY / COE, 2006. http://hdl.handle.net/2237/10406.
Najafzadeh-Azghandi, Hossein. "Perceptual coding of narrowband audio signals." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2000. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape4/PQDD_0033/NQ64628.pdf.
Godsill, Simon John. "The restoration of degraded audio signals." Thesis, University of Cambridge, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.296641.
Bolton, Jered. "Gestural extraction from musical audio signals." Thesis, University of Glasgow, 2004. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.417664.
Hicks, C. M. "Modelling of multi-channel audio signals." Thesis, University of Cambridge, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.603997.
Shoji, Seiichiro. "Efficient individualisation of binaural audio signals." Thesis, University of York, 2007. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.442378.
Xiao, Zhongzhe. "Recognition of emotions in audio signals." Ecully, Ecole centrale de Lyon, 2008. http://www.theses.fr/2008ECDL0002.
This PhD thesis is dedicated to automatic emotion/mood recognition in audio signals. Indeed, emotion in audio is high-level semantic information whose automatic analysis may have many applications, such as smart human-computer interaction or multimedia indexing. The purpose of this thesis is thus to investigate machine-based audio emotion analysis solutions for both speech and music signals. Our work makes use of a discrete emotional model combined with the dimensional one, and relies upon existing studies on acoustic correlates of emotional speech and music mood. The key contributions are the following. First, we have proposed, in complement to popular frequency-based and energy-based features, some new audio features, namely harmonic and Zipf features, to better characterize the timbre and prosodic properties of emotional speech. Second, as there exist very few emotional resources for either speech or music for machine learning, compared to the number of audio features one can extract, an evidence theory-based feature selection scheme named Embedded Sequential Forward Selection (ESFS) is proposed to deal with the classic "curse of dimensionality" problem and thus overfitting. Third, using a manually built hierarchical classifier based on a dimensional emotion model to deal with the fuzzy borders of emotional states, we demonstrated that a hierarchical classification scheme performs better than the single global classifier mostly used in the literature. Furthermore, as there does not exist any universal agreement on the definition of basic emotions and as emotional states are typically application dependent, we also proposed an ESFS-based algorithm for automatically building a hierarchical classification scheme (HCS) best adapted to a specific set of application-dependent emotional states. The HCS divides a complex classification problem into simpler and smaller problems by combining several binary sub-classifiers in the structure of a binary tree in several stages, and gives as its result the type of emotional state of the audio sample. Finally, to deal with the subjective nature of emotions, we also proposed an evidence theory-based ambiguous classifier allowing multiple emotion labels, as humans often assign. The effectiveness of all these recognition techniques was evaluated on the Berlin and DES datasets for emotional speech recognition and on a music mood dataset that we collected in our laboratory, as no public dataset exists so far. Keywords: audio signal, emotion classification, music mood analysis, audio features, feature selection, hierarchical classification, ambiguous classification, evidence theory.
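A minimal sketch of a two-stage hierarchical classification scheme in the spirit of the HCS described above, splitting first on arousal and then among the states within each branch; the four states, their arousal assignment, and the SVM nodes are illustrative assumptions (the thesis builds the tree automatically with ESFS):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical assignment of four emotional states to a binary arousal split.
AROUSAL = {"anger": 1, "joy": 1, "sadness": 0, "calm": 0}

def train_hcs(X, labels):
    # Root node separates high from low arousal; one leaf classifier per
    # branch then separates the emotional states within that branch.
    arousal = np.array([AROUSAL[l] for l in labels])
    root = SVC().fit(X, arousal)
    leaves = {a: SVC().fit(X[arousal == a], np.asarray(labels)[arousal == a])
              for a in (0, 1)}
    return root, leaves

def predict_hcs(model, x):
    # Route the sample down the tree: arousal first, then the leaf's label.
    root, leaves = model
    branch = int(root.predict(x[None, :])[0])
    return leaves[branch].predict(x[None, :])[0]
```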
Xiao, Zhongzhe, and Liming Chen. "Recognition of emotions in audio signals." Ecully: Ecole Centrale de Lyon, 2008. http://bibli.ec-lyon.fr/exl-doc/zxiao.pdf.
Miyajima, C., D. Negi, Y. Ninomiya, M. Sano, K. Mori, K. Itou, K. Takeda, and Y. Suenaga. "Audio-Visual Speech Database for Bimodal Speech Recognition." INTELLIGENT MEDIA INTEGRATION NAGOYA UNIVERSITY / COE, 2005. http://hdl.handle.net/2237/10460.
Othman, Noor Shamsiah. "Wireless speech and audio communications." Thesis, University of Southampton, 2008. https://eprints.soton.ac.uk/64488/.
Barkmeier, Julie Marie. "Intelligibility of dysarthric speakers: audio-only and audio-visual presentations." Thesis, University of Iowa, 1988. https://ir.uiowa.edu/etd/5698.
Xia, Feng. "Perceptual coding for high-quality audio signals." Ohio: Ohio University, 1998. http://www.ohiolink.edu/etd/view.cgi?ohiou1176235728.
Lanciani, Christopher A. "Compressed-domain processing of MPEG audio signals." Diss., Georgia Institute of Technology, 1999. http://hdl.handle.net/1853/13760.
Conway, Alexander. "Improving Broadband Noise Filter For Audio Signals." DigitalCommons@CalPoly, 2012. https://digitalcommons.calpoly.edu/theses/747.
Le Cornu, Thomas. "Reconstruction of intelligible audio speech from visual speech information." Thesis, University of East Anglia, 2016. https://ueaeprints.uea.ac.uk/67012/.
Fackrell, Justin W. A. "Bispectral analysis of speech signals." Thesis, University of Edinburgh, 1997. http://hdl.handle.net/1842/1384.
Anderson, Mark David. "Pitch determination of speech signals." Thesis, Massachusetts Institute of Technology, 1986. http://hdl.handle.net/1721.1/14999.
Full textMICROFICHE COPY AVAILABLE IN ARCHIVES AND ENGINEERING
Bibliography: leaves 138-147.
by Mark David Anderson.
M.S.
Seymour, R. "Audio-visual speech and speaker recognition." Thesis, Queen's University Belfast, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.492489.
Pachoud, Samuel. "Audio-visual speech and emotion recognition." Thesis, Queen Mary, University of London, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.528923.
Matthews, Iain. "Features for audio-visual speech recognition." Thesis, University of East Anglia, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.266736.
Rao, Ram Raghavendra. "Audio-visual interaction in multimedia." Diss., Georgia Institute of Technology, 1998. http://hdl.handle.net/1853/13349.
Clemedson, Johan. "Audio Generation from Radar signals, for target classification." Thesis, KTH, Optimeringslära och systemteori, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-215502.
Full textKlassificering är ofta av stort intresse inom radarapplikation, eftersom man inte bara vill veta var ett mål befinner sig men också vad för typ av mål det är. Denna uppsats fokuserar på att omvandla radarekot från ett mål till en ljudsignal. Så att klassificeringen kan ske med mänskliga sinnen, i detta fall hörseln. Syftet med dessa klassificeringsmetoder är att kunna klassificera två typer av mål med ungefär samma storlek, nämligen fåglar och mindre obemannade flygfordon (UAV). Det är möjligt att med radarn mäta målets hastighet med hjälp av Doppler-effekten. För att kunna avgöra i vilken riktning målet rör sig används en I/Q-representation, som är en komplex representation av radar signalen. Med signalbehandling är det möjligt att extrahera radar signaler som målet generar. Genom att använda spektrala transformationer är det möjligt att generera reellvärda signaler från de extraherade målsignalerna. Det är nödvändigt att förlänga dessa signaler för att kunna använda dem som ljudsignaler, detta görs med en extrapoleringsteknik baserad på Autoregressiva (AR) -processer. De ljudsignaler som används är dessa extrapolerade signalerna, det är i det flesta fall möjligt att utifrån ljudet genomföra klassificeringen. Detta projekt är utfört i samarbete med Sebastian Edman [7], där olika inriktningar av radarklassificering har undersökts. Som nämnts ovan fokuserar denna uppsats på att omvandla
Chen, Bingwei. "Adaptive watermarking algorithms for MP3 compressed audio signals." Thesis, University of Ottawa (Canada), 2008. http://hdl.handle.net/10393/27963.
Ning, Daryl. "Analysis and coding of high quality audio signals." Thesis, Queensland University of Technology, 2003. https://eprints.qut.edu.au/15814/1/Daryl_Ning_Thesis.pdf.
Ning, Daryl. "Analysis and Coding of High Quality Audio Signals." Queensland University of Technology, 2003. http://eprints.qut.edu.au/15814/.
Dabis, Homam Sabih. "The computer enhancement of speech signals." Thesis, University of the West of Scotland, 1991. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.304636.
Kaucic, Robert August. "Lip tracking for audio-visual speech recognition." Thesis, University of Oxford, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.360392.
Sharma, Dinkar. "Effects of attention on audio-visual speech." Thesis, University of Reading, 1989. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.329379.
Hollier, M. P. "Audio quality prediction for telecommunications speech systems." Thesis, University of Essex, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.282496.