Doctoral dissertations on the topic "Audio speech recognition"
Create an accurate reference in APA, MLA, Chicago, Harvard, and many other citation styles
Consult the top 50 doctoral dissertations on the topic "Audio speech recognition".
An "Add to bibliography" button is available next to each work in the list. Use it, and we will automatically create a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the scholarly publication as a .pdf file and read its abstract online, whenever these details are available in the work's metadata.
Browse doctoral dissertations from a wide range of disciplines and compile your bibliography correctly.
Miyajima, C., D. Negi, Y. Ninomiya, M. Sano, K. Mori, K. Itou, K. Takeda and Y. Suenaga. "Audio-Visual Speech Database for Bimodal Speech Recognition". INTELLIGENT MEDIA INTEGRATION NAGOYA UNIVERSITY / COE, 2005. http://hdl.handle.net/2237/10460.
Seymour, R. "Audio-visual speech and speaker recognition". Thesis, Queen's University Belfast, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.492489.
Pachoud, Samuel. "Audio-visual speech and emotion recognition". Thesis, Queen Mary, University of London, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.528923.
Matthews, Iain. "Features for audio-visual speech recognition". Thesis, University of East Anglia, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.266736.
Kaucic, Robert August. "Lip tracking for audio-visual speech recognition". Thesis, University of Oxford, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.360392.
Lucey, Simon. "Audio-visual speech processing". Thesis, Queensland University of Technology, 2002. https://eprints.qut.edu.au/36172/7/SimonLuceyPhDThesis.pdf.
Eriksson, Mattias. "Speech recognition availability". Thesis, Linköping University, Department of Computer and Information Science, 2004. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-2651.
This project investigates the importance of availability in the context of dictation programs. Speech recognition technology for dictation has not reached the general public, and that may very well be a result of the poor availability of today's technical solutions.
I have constructed a persona, Johanna, who personifies the target user. I have also developed a solution that streams audio to a speech recognition server and sends back the interpreted text. Johanna affirmed that the solution was successful in theory.
I then brought in test users who tried the solution in practice. Half of them do indeed report that their usage has increased, and will continue to increase, thanks to the new level of availability.
Rao, Ram Raghavendra. "Audio-visual interaction in multimedia". Diss., Georgia Institute of Technology, 1998. http://hdl.handle.net/1853/13349.
Dean, David Brendan. "Synchronous HMMs for audio-visual speech processing". Thesis, Queensland University of Technology, 2008. https://eprints.qut.edu.au/17689/3/David_Dean_Thesis.pdf.
Dean, David Brendan. "Synchronous HMMs for audio-visual speech processing". Queensland University of Technology, 2008. http://eprints.qut.edu.au/17689/.
Reikeras, Helge. "Audio-visual automatic speech recognition using Dynamic Bayesian Networks". Thesis, Stellenbosch : University of Stellenbosch, 2011. http://hdl.handle.net/10019.1/6777.
Rintala, Jonathan. "Speech Emotion Recognition from Raw Audio using Deep Learning". Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-278858.
Traditionally, speech-based emotion recognition models require a large number of manually constructed features and intermediate representations, such as spectrograms, for training. Constructing such features by hand, however, often demands both domain-specific expertise and resources. Recently, emerging end-to-end deep learning models, which extract features and learn directly from the raw audio signal, have been investigated. One earlier approach combined parallel CNNs with different filter lengths to extract multiple temporal features from the audio signal, and then passed the resulting sequence on to a recurrent neural network. Other previous studies have also reached high accuracy by using local feature-learning blocks (LFLBs) to reduce the dimensionality of the raw audio signal, thereby extracting the most important information from the audio. This study therefore combines the idea of using LFLBs for feature extraction with a block of parallel CNNs with different filter lengths to capture multi-temporal features; the result is finally fed into an LSTM layer for global learning of contextual information. To the best of our knowledge, such a combined architecture has not yet been investigated. Furthermore, this study examines different configurations of the architecture. The proposed model is then trained and evaluated on the well-known speech databases EmoDB and RAVDESS, in both a speaker-dependent and a speaker-independent setting. The results indicate that the proposed architecture can produce results comparable with the state of the art, even though no data augmentation or advanced pre-processing was included. Three parallel CNN layers are reported to give the highest accuracy, together with a series of modified LFLBs that use average pooling and ReLU as the activation function. This demonstrates the benefits of leaving feature learning to the network, and opens up interesting future research on time complexity and the trade-off between introducing complexity in the pre-processing or in the model architecture itself.
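A compact way to see how the pieces of the architecture described above fit together is a small PyTorch sketch. The layer sizes, kernel widths, and seven-class output below are illustrative assumptions, not the configuration evaluated in the thesis.

```python
# Minimal sketch: LFLBs on the raw waveform, parallel 1-D CNNs with different
# kernel sizes, and an LSTM for global context. All sizes are assumptions.
import torch
import torch.nn as nn

class LFLB(nn.Module):
    """Conv1d + BatchNorm + ReLU + average pooling over the raw signal."""
    def __init__(self, in_ch, out_ch, kernel, pool):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel, padding=kernel // 2),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(),
            nn.AvgPool1d(pool),
        )

    def forward(self, x):
        return self.block(x)

class RawAudioSER(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        # Two LFLBs reduce the temporal dimension of the waveform.
        self.lflb = nn.Sequential(LFLB(1, 32, 5, 4), LFLB(32, 64, 5, 4))
        # Three parallel CNNs with different kernel sizes capture
        # multi-temporal features; their outputs are concatenated.
        self.parallel = nn.ModuleList(
            [nn.Conv1d(64, 64, k, padding=k // 2) for k in (3, 7, 11)]
        )
        self.lstm = nn.LSTM(3 * 64, 128, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, wav):              # wav: (batch, samples)
        x = self.lflb(wav.unsqueeze(1))  # -> (batch, 64, frames)
        x = torch.cat([conv(x) for conv in self.parallel], dim=1)
        _, (h, _) = self.lstm(x.transpose(1, 2))
        return self.head(h[-1])          # class logits

logits = RawAudioSER()(torch.randn(2, 16000))  # two 1-second 16 kHz clips
print(logits.shape)                            # torch.Size([2, 7])
```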
Ahmad, Nasir. "A motion based approach for audio-visual automatic speech recognition". Thesis, Loughborough University, 2011. https://dspace.lboro.ac.uk/2134/8564.
Ibrahim, Zamri. "A novel lip geometry approach for audio-visual speech recognition". Thesis, Loughborough University, 2014. https://dspace.lboro.ac.uk/2134/16526.
Susman, Derya. "Turkish Large Vocabulary Continuous Speech Recognition By Using Limited Audio Corpus". Master's thesis, METU, 2012. http://etd.lib.metu.edu.tr/upload/12614207/index.pdf.
Ravindran, Sourabh. "Physiologically Motivated Methods For Audio Pattern Classification". Diss., Georgia Institute of Technology, 2006. http://hdl.handle.net/1853/14066.
Dong, Junda. "Designing a Visual Front End in Audio-Visual Automatic Speech Recognition System". DigitalCommons@CalPoly, 2015. https://digitalcommons.calpoly.edu/theses/1382.
Moghimi, Amir Reza. "Array-based Spectro-temporal Masking For Automatic Speech Recognition". Research Showcase @ CMU, 2014. http://repository.cmu.edu/dissertations/334.
Zhang, Xianxian. "Robust speech processing based on microphone array, audio-visual, and frame selection for in-vehicle speech recognition and in-set speaker recognition". Diss., University of Colorado, 2005. http://wwwlib.umi.com/cr/colorado/fullcit?p3190350.
Goussard, George Willem. "Unsupervised clustering of audio data for acoustic modelling in automatic speech recognition systems". Thesis, Stellenbosch : University of Stellenbosch, 2011. http://hdl.handle.net/10019.1/6686.
ENGLISH ABSTRACT: This thesis presents a system that is designed to replace the manual process of generating a pronunciation dictionary for use in automatic speech recognition. The proposed system has several stages. The first stage segments the audio into what will be known as the subword units, using a frequency domain method. In the second stage, dynamic time warping is used to determine the similarity between the segments of each possible pair of these acoustic segments. These similarities are used to cluster similar acoustic segments into acoustic clusters. The final stage derives a pronunciation dictionary from the orthography of the training data and corresponding sequence of acoustic clusters. This process begins with an initial mapping between words and their sequence of clusters, established by Viterbi alignment with the orthographic transcription. The dictionary is refined iteratively by pruning redundant mappings, hidden Markov model estimation and Viterbi re-alignment in each iteration. This approach is evaluated experimentally by applying it to two subsets of the TIMIT corpus. It is found that, when test words are repeated often in the training material, the approach leads to a system whose accuracy is almost as good as one trained using the phonetic transcriptions. When test words are not repeated often in the training set, the proposed approach leads to better results than those achieved using the phonetic transcriptions, although the recognition is poor overall in this case.
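For the second stage, the similarity between two acoustic segments can be computed with standard dynamic time warping. The sketch below is a generic, textbook DTW over sequences of feature frames (assumed here to be MFCC-like vectors), not the thesis's own implementation.

```python
# Generic DTW: minimal total frame-to-frame distance over all monotonic
# alignments of two variable-length segments of feature frames.
import numpy as np

def dtw_distance(seg_a: np.ndarray, seg_b: np.ndarray) -> float:
    """seg_a: (n, d) and seg_b: (m, d) arrays of feature frames."""
    n, m = len(seg_a), len(seg_b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(seg_a[i - 1] - seg_b[j - 1])
            # extend the cheapest of the three allowed predecessors
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]

rng = np.random.default_rng(0)
a, b = rng.normal(size=(40, 13)), rng.normal(size=(55, 13))  # two MFCC segments
print(dtw_distance(a, b))  # pairwise score used for clustering
```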
Morrone, Giovanni. "Metodologie di Apprendimento Profondo per l'Elaborazione Audio-Video del Parlato in Ambienti Rumorosi". Doctoral thesis, Università degli studi di Modena e Reggio Emilia, 2021. http://hdl.handle.net/11380/1245516.
Human communication is often an audio-visual experience. Indeed, listeners hear words uttered by speakers and can also see facial movements and other gestures which convey speech information. However, speech communication can be negatively affected by background noises and artifacts, which are very common in real environments. Restoring clean speech from degraded audio sources is crucial for many applications, e.g., automatic speech recognition and hearing aids. Neuroscience research has shown that looking at a talking face enhances the human capability to focus auditory attention on a particular stimulus while muting external noisy sources. This dissertation is an attempt to exploit the bi-modal, i.e., audio-visual, nature of speech for speech enhancement, automatic speech recognition and speech inpainting. We start by presenting a novel approach to the problem of extracting the speech of a speaker of interest in a cocktail party scenario. Contrary to most previous work, we exploit a pre-trained face landmark detector and use facial landmark motion as visual features in a deep learning model. In that way, we relieve our models of the task of learning useful visual features from raw pixels. We train and test our models on two widely used, limited-size datasets and achieve speaker-independent speech enhancement in a multi-talker setting. Motivated by these results, we study how audio-visual speech enhancement can help automatic speech recognition, exploiting a multi-task learning framework. We then design a strategy where the speech enhancement training phase is alternated with the speech recognition phase. We observe that, in general, the joint optimization of the two phases shows a remarkable improvement in speech recognition performance compared to the audio-visual baseline models trained only to perform speech recognition. Finally, we explore whether visual information can be useful for speech inpainting, i.e., the task of restoring missing parts of an acoustic speech signal from reliable audio context. We design a system that is able to inpaint multiple variable-length missing time gaps in a speech signal. We test our system with time gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for time gaps of different duration. Experiments show that the performance of audio-only baseline models degrades rapidly when time gaps get large, while the proposed audio-visual approach is still able to plausibly restore the missing information.
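The visual front-end idea above, using the motion of detected facial landmarks instead of raw pixels, reduces to a simple transformation once a pre-trained landmark detector is assumed upstream. A minimal NumPy sketch, with an assumed 68-point landmark layout:

```python
# Turn a per-frame (x, y) landmark track into normalized motion features.
# The landmark detector itself is assumed to exist upstream.
import numpy as np

def landmark_motion_features(landmarks: np.ndarray) -> np.ndarray:
    """landmarks: (frames, points, 2) array of (x, y) positions per video frame.
    Returns (frames - 1, points * 2) motion vectors, normalized per sequence."""
    motion = np.diff(landmarks, axis=0)            # displacement between frames
    motion = motion.reshape(motion.shape[0], -1)   # flatten (dx, dy) per point
    std = motion.std() + 1e-8                      # avoid division by zero
    return (motion - motion.mean()) / std          # zero-mean, unit-variance

frames = np.cumsum(np.random.randn(75, 68, 2), axis=0)  # fake 68-point track
print(landmark_motion_features(frames).shape)            # (74, 136)
```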
Dookhoo, Raul. "Automated Regression Testing Approach to Expansion and Refinement of Speech Recognition Grammars". Master's thesis, University of Central Florida, 2008. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/2634.
Harvilla, Mark J. "Compensation for Nonlinear Distortion in Noise for Robust Speech Recognition". Research Showcase @ CMU, 2014. http://repository.cmu.edu/dissertations/437.
Duckitt, William. "The design of a high-performance, floating-point embedded system for speech recognition and audio research purposes". Thesis, Link to the online version, 2008. http://hdl.handle.net/10019/824.
Brady-Herbst, Brenene Marie. "An Analysis of Spondee Recognition Thresholds in Auditory-only and Audio-visual Conditions". PDXScholar, 1996. https://pdxscholar.library.pdx.edu/open_access_etds/5218.
Choi, Hyung Keun. "Blind source separation of the audio signals in a real world". Thesis, Georgia Institute of Technology, 2002. http://hdl.handle.net/1853/14986.
Zeghidour, Neil. "Learning representations of speech from the raw waveform". Thesis, Paris Sciences et Lettres (ComUE), 2019. http://www.theses.fr/2019PSLEE004/document.
While deep neural networks are now used in almost every component of a speech recognition system, from acoustic to language modelling, the inputs to such systems are still fixed, handcrafted spectral features such as mel-filterbanks. This contrasts with computer vision, in which a deep neural network is now trained on raw pixels. Mel-filterbanks contain valuable and documented prior knowledge from human auditory perception as well as signal processing, and are the input to state-of-the-art speech recognition systems that are now on par with human performance in certain conditions. However, mel-filterbanks, as any fixed representation, are inherently limited by the fact that they are not fine-tuned for the task at hand. We hypothesize that learning the low-level representation of speech with the rest of the model, rather than using fixed features, could push the state of the art even further. We first explore a weakly-supervised setting and show that a single neural network can learn to separate phonetic information and speaker identity from mel-filterbanks or the raw waveform, and that these representations are robust across languages. Moreover, learning from the raw waveform provides significantly better speaker embeddings than learning from mel-filterbanks. These encouraging results lead us to develop a learnable alternative to mel-filterbanks that can be used directly in replacement of these features. In the second part of this thesis we introduce Time-Domain filterbanks, a lightweight neural network that takes the waveform as input, can be initialized as an approximation of mel-filterbanks, and is then learned with the rest of the neural architecture. Across extensive and systematic experiments, we show that Time-Domain filterbanks consistently outperform mel-filterbanks and can be integrated into a new state-of-the-art speech recognition system, trained directly from the raw audio signal. Fixed speech features are also used for non-linguistic classification tasks, for which they are even less optimal, so we perform dysarthria detection from the waveform with Time-Domain filterbanks and show that they significantly improve over mel-filterbanks or low-level descriptors. Finally, we discuss how our contributions fall within a broader shift towards fully learnable audio understanding systems.
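The core idea of a learnable front-end can be illustrated with a bank of 1-D convolutions applied to the waveform, followed by squaring, pooling and log compression. The following PyTorch sketch is a simplified stand-in: the thesis's Time-Domain filterbanks are initialized as an approximation of mel-filterbanks, whereas this toy version uses random initialization.

```python
# A bank of learnable 1-D filters applied directly to the waveform, yielding
# a spectrogram-like output that trains end-to-end with the rest of a model.
import torch
import torch.nn as nn

class LearnableFilterbank(nn.Module):
    def __init__(self, n_filters=40, win=400, hop=160):  # 25 ms / 10 ms @ 16 kHz
        super().__init__()
        self.conv = nn.Conv1d(1, n_filters, kernel_size=win, stride=1,
                              padding=win // 2, bias=False)
        self.pool = nn.AvgPool1d(kernel_size=win, stride=hop)

    def forward(self, wav):                    # (batch, samples)
        x = self.conv(wav.unsqueeze(1)) ** 2   # filter energies
        x = self.pool(x)                       # frame-rate decimation
        return torch.log1p(x)                  # log compression, like log-mels

feats = LearnableFilterbank()(torch.randn(2, 16000))
print(feats.shape)  # (2, 40, frames) -- a learnable "spectrogram"
```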
Thambiratnam, Albert J. K. "Acoustic keyword spotting in speech with applications to data mining". Thesis, Queensland University of Technology, 2005. https://eprints.qut.edu.au/37254/1/Albert_Thambiratnam_Thesis.pdf.
Sklar, Alexander Gabriel. "Channel Modeling Applied to Robust Automatic Speech Recognition". Scholarly Repository, 2007. http://scholarlyrepository.miami.edu/oa_theses/87.
Navarathna, Rajitha Dharshana Bandara. "Robust recognition of human behaviour in challenging environments". Thesis, Queensland University of Technology, 2014. https://eprints.qut.edu.au/66235/1/Rajitha%20Dharshana%20Bandara_Navarathna_Thesis.pdf.
Martí Guerola, Amparo. "Multichannel audio processing for speaker localization, separation and enhancement". Doctoral thesis, Universitat Politècnica de València, 2013. http://hdl.handle.net/10251/33101.
Martí Guerola, A. (2013). Multichannel audio processing for speaker localization, separation and enhancement [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/33101
Raghunathan, Anusha. "Evaluation of Intelligibility and Speaker Similarity of Voice Transformation". UKnowledge, 2011. http://uknowledge.uky.edu/gradschool_theses/101.
Sheikh, Imran. "Exploitation du contexte sémantique pour améliorer la reconnaissance des noms propres dans les documents audio diachroniques". Thesis, Université de Lorraine, 2016. http://www.theses.fr/2016LORR0260/document.
The diachronic nature of broadcast news causes frequent variations in linguistic content and vocabulary, leading to the problem of out-of-vocabulary (OOV) words in automatic speech recognition. Most OOV words are found to be proper names, and proper names are important both for the automatic indexing of audio-video content and for obtaining reliable automatic transcriptions. The goal of this thesis is to model the semantic and topical context of new proper names in order to retrieve those which are relevant to the spoken content of the audio document. Training context models is a challenging problem in this task because several new names come with a low amount of data, and the context model should be robust to errors in the automatic transcription. Probabilistic topic models and word embeddings from neural network models are explored for the task of retrieving relevant proper names, and a thorough evaluation of these contextual representations is performed. It is argued that these representations, which are learned in an unsupervised manner, are not the best for the given retrieval task. Neural network context models trained with an objective to maximise retrieval performance are therefore proposed. The proposed Neural Bag-of-Weighted-Words (NBOW2) model learns to assign a degree of importance to input words and is able to capture task-specific keywords. Experiments on automatic speech recognition of French broadcast news videos demonstrate the effectiveness of the proposed models. Evaluation of the NBOW2 model on standard text classification tasks shows that it learns interesting information and gives the best classification accuracies among the BOW models.
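The NBOW2 idea, a bag-of-words in which each word contributes according to a learned scalar importance, can be sketched compactly. The vocabulary size, embedding dimension, and two-class relevance head below are illustrative assumptions:

```python
# Importance-weighted average of word embeddings, fed to a linear classifier.
import torch
import torch.nn as nn

class NBOW2(nn.Module):
    def __init__(self, vocab=10000, dim=128, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.importance = nn.Embedding(vocab, 1)  # scalar weight per word
        self.out = nn.Linear(dim, n_classes)

    def forward(self, tokens):                       # (batch, seq)
        w = torch.sigmoid(self.importance(tokens))   # (batch, seq, 1) in (0, 1)
        e = self.emb(tokens)                         # (batch, seq, dim)
        doc = (w * e).sum(1) / (w.sum(1) + 1e-8)     # weighted average
        return self.out(doc)

tokens = torch.randint(0, 10000, (4, 50))  # a batch of 50-word contexts
print(NBOW2()(tokens).shape)               # (4, 2) relevance logits
```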
Guenebaut, Boris. "Automatic Subtitle Generation for Sound in Videos". Thesis, University West, Department of Economics and IT, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:hv:diva-1784.
The last ten years have witnessed the emergence of all kinds of video content. Moreover, the appearance of websites dedicated to this phenomenon has increased the importance the public gives to it. At the same time, some individuals are deaf and cannot understand the meaning of such videos when no text transcription is available. It is therefore necessary to find solutions to make these media artefacts accessible to most people. Several software packages offer utilities for creating video subtitles, but all require extensive participation by the user. Hence, a more automated concept is envisaged. This thesis report describes a way to generate standards-compliant subtitles using speech recognition. Three parts are distinguished. The first consists in separating the audio from the video and converting the audio into a suitable format if necessary. The second phase performs recognition of the speech contained in the audio. The final stage generates a subtitle file from the recognition results of the previous step. Directions of implementation are proposed for the three distinct modules. The experimental results were not fully satisfactory, and adjustments will have to be made in further work. Decoding parallelization, the use of well-trained models, and punctuation insertion are some of the improvements still to be made.
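The third stage, turning time-stamped recognition results into a standards-following subtitle file, is mechanical. A small sketch for the SRT format, assuming the recognizer yields (start, end, text) tuples in seconds:

```python
# SRT expects a counter, "HH:MM:SS,mmm --> HH:MM:SS,mmm", then the text.
def to_srt_time(seconds: float) -> str:
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def write_srt(segments, path="out.srt"):
    with open(path, "w", encoding="utf-8") as f:
        for i, (start, end, text) in enumerate(segments, 1):
            f.write(f"{i}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n\n")

write_srt([(0.0, 2.4, "Hello and welcome."), (2.6, 5.1, "Today we discuss ASR.")])
```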
Walters, Thomas C. "Auditory-based processing of communication sounds". Thesis, University of Cambridge, 2011. https://www.repository.cam.ac.uk/handle/1810/240577.
Fong, Katherine KaYan. "IR-Depth Face Detection and Lip Localization Using Kinect V2". DigitalCommons@CalPoly, 2015. https://digitalcommons.calpoly.edu/theses/1425.
Wallace, Roy Geoffrey. "Fast and accurate phonetic spoken term detection". Thesis, Queensland University of Technology, 2010. https://eprints.qut.edu.au/39610/1/Roy_Wallace_Thesis.pdf.
Zhezhela, Oleksandr. "Vizualizace výstupu z řečových technologií pro potřeby kontaktních center". Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-236041.
Kalantari, Shahram. "Improving spoken term detection using complementary information". Thesis, Queensland University of Technology, 2015. https://eprints.qut.edu.au/90074/1/Shahram_Kalantari_Thesis.pdf.
Lucey, Patrick Joseph. "Lipreading across multiple views". Thesis, Queensland University of Technology, 2007. https://eprints.qut.edu.au/16676/1/Patrick_Joseph_Lucey_Thesis.pdf.
Lucey, Patrick Joseph. "Lipreading across multiple views". Queensland University of Technology, 2007. http://eprints.qut.edu.au/16676/.
Temko, Andriy. "Acoustic event detection and classification". Doctoral thesis, Universitat Politècnica de Catalunya, 2007. http://hdl.handle.net/10803/6880.
The human activity that takes place in meeting-rooms or class-rooms is reflected in a rich variety of acoustic events, either produced by the human body or by objects handled by humans, so the determination of both the identity of sounds and their position in time may help to detect and describe that human activity.
Additionally, detection of sounds other than speech may be useful to enhance the robustness of speech technologies like automatic speech recognition. Automatic detection and classification of acoustic events is the objective of this thesis work. It aims at processing the acoustic signals collected by distant microphones in meeting-room or classroom environments to convert them into symbolic descriptions corresponding to a listener's perception of the different sound events that are present in the signals and their sources. First of all, the task of acoustic event classification is faced using Support Vector Machine (SVM) classifiers, a choice motivated by the scarcity of training data. A confusion-matrix-based variable-feature-set clustering scheme is developed for the multiclass recognition problem, and tested on the gathered database. With it, a higher classification rate than with the GMM-based technique is obtained, arriving at a large relative average error reduction with respect to the best result from the conventional binary tree scheme. Moreover, several ways to extend SVMs to sequence processing are compared, in an attempt to avoid the drawback of SVMs when dealing with audio data, i.e. their restriction to work with fixed-length vectors, observing that the dynamic time warping kernels work well for sounds that show a temporal structure. Furthermore, concepts and tools from fuzzy theory are used to investigate, first, the importance of and degree of interaction among features, and second, ways to fuse the outputs of several classification systems. The developed AEC systems are also tested by participating in several international evaluations from 2004 to 2006, and the results are reported.
The second main contribution of this thesis work is the development of systems for detection of acoustic events. The detection problem is more complex, since it includes both classification and determination of the time intervals where the sound takes place. Two system versions are developed and tested on the datasets of the two CLEAR international evaluation campaigns in 2006 and 2007. Two kinds of databases are used: two databases of isolated acoustic events, and a database of interactive seminars containing a significant number of acoustic events of interest. Our systems, which consist of SVM-based classification within a sliding window plus post-processing, were the only submissions not using HMMs, and each of them obtained competitive results in the corresponding evaluation. Speech activity detection was also pursued in this thesis since it is, in fact, an especially important particular case of acoustic event detection. An enhanced SVM training approach for the speech activity detection task is developed, mainly to cope with the problem of dataset reduction. The resulting SVM-based system is tested with several NIST Rich Transcription (RT) evaluation datasets, and it shows better scores than our GMM-based system, which ranked among the best systems in the RT06 evaluation. Finally, it is worth mentioning a few side outcomes of this thesis work. As it was carried out in the framework of the CHIL EU project, the author was responsible for the organization of the above-mentioned international evaluations in acoustic event classification and detection, taking a leading role in the specification of acoustic event classes, databases, and evaluation protocols, and, especially, in the proposal and implementation of the various metrics that have been used. Moreover, the detection systems have been implemented in the UPC's smart-room, where they run in real time for purposes of testing and demonstration.
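The detection scheme described, SVM classification within a sliding window followed by post-processing, can be illustrated with a toy example; the features, window length, and median-filter smoothing below are stand-in choices, not the systems submitted to the CLEAR evaluations.

```python
# Classify sliding windows over a feature stream with an SVM, then smooth
# the frame-wise decisions with a median filter as simple post-processing.
import numpy as np
from scipy.ndimage import median_filter
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Fake training data: 200 windows x 20 features, two event classes.
X_train, y_train = rng.normal(size=(200, 20)), rng.integers(0, 2, 200)
clf = SVC(kernel="rbf").fit(X_train, y_train)

stream = rng.normal(size=(500, 20))          # feature frames of a recording
win = 10
# Classify each sliding window (here: mean-pooled frames as window features).
windows = np.stack([stream[i:i + win].mean(0) for i in range(len(stream) - win)])
labels = clf.predict(windows)
smoothed = median_filter(labels, size=15)    # suppress isolated label flips
print(smoothed[:30])
```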
Verdet, Florian. "Exploring variabilities through factor analysis in automatic acoustic language recognition". PhD thesis, Université d'Avignon, 2011. http://tel.archives-ouvertes.fr/tel-00954255.
Silvestre Cerdà, Joan Albert. "Different Contributions to Cost-Effective Transcription and Translation of Video Lectures". Doctoral thesis, Universitat Politècnica de València, 2016. http://hdl.handle.net/10251/62194.
In recent years, online multimedia repositories have grown rapidly and established themselves as fundamental sources of knowledge, especially in the field of education, where large repositories of educational video lectures have been created to complement and even replace traditional teaching methods. However, most of these lectures are neither transcribed nor translated, owing to the absence of low-cost solutions that can do so while guaranteeing an acceptable minimum level of quality. Solutions of this kind are clearly needed to make video lectures more accessible to speakers of other languages and to people with hearing disabilities. They would also facilitate search and analysis functions such as classification, recommendation or plagiarism detection, as well as the development of advanced educational functionalities, such as automatically generated content summaries to help students take notes. For this reason, the main objective of this thesis is to develop a low-cost solution capable of transcribing and translating video lectures at a reasonable level of quality. More specifically, we address the integration of state-of-the-art Automatic Speech Recognition and Machine Translation techniques into large repositories of educational video lectures to generate high-quality multilingual subtitles without human intervention and at a reduced computational cost. We also explore the potential benefits of exploiting the information known a priori about these repositories, that is, lecture-specific knowledge such as the speaker, the topic or the slides, to build specialised transcription and translation systems through massive adaptation techniques. The solutions proposed in this thesis have been tested in real-life scenarios through numerous objective and subjective evaluations, with very good results. The main legacy of this thesis, The transLectures-UPV Platform, has been publicly released as open-source software and, at the time of writing, is serving automatic transcriptions and translations for several thousand educational video lectures at numerous Spanish and European universities and institutions.
Silvestre Cerdà, JA. (2016). Different Contributions to Cost-Effective Transcription and Translation of Video Lectures [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/62194
Fernández López, Adriana. "Learning of meaningful visual representations for continuous lip-reading". Doctoral thesis, Universitat Pompeu Fabra, 2021. http://hdl.handle.net/10803/671206.
In recent decades there has been growing interest in decoding speech using visual signals alone, i.e., mimicking the human ability to read lips, giving rise to automatic lip-reading (ALR) systems. However, it is well known that access to speech through the visual channel is subject to many limitations compared with the acoustic signal: it has been argued that humans can read around 30% of the information from the lips, and the rest is filled in from context. Thus, one of the main challenges of ALR lies in the visual ambiguities that arise at the word level, highlighting that not all the sounds we hear can easily be distinguished by observing the lips. In the literature, early ALR systems addressed simple recognition tasks such as alphabet or digit recognition, but progressively moved to more complex and realistic settings that have led to several recent systems targeting continuous lip-reading. To a large extent, these advances have been made possible by powerful systems based on deep learning architectures that have quickly begun to replace traditional systems. Although the recognition rates for continuous lip-reading may seem modest compared with those achieved by audio-based systems, the field has clearly taken a step forward. Interestingly, an analogous effect can be observed when humans try to decode speech: given noise-free signals, most people can decode the audio channel effortlessly but would struggle to read lips, since the ambiguity of visual signals makes additional context necessary to decode the message. In this thesis we explore the proper modelling of visual representations with the aim of improving continuous lip-reading. To this end, we present different data-driven mechanisms to address the main challenges of lip-reading related to the ambiguities and the speaker dependency of visual signals. Our results highlight the advantages of a properly encoded visual channel, for which the most useful features are those that encode corresponding lip positions in a similar way regardless of the speaker. This opens the door to (i) lip-reading in many different languages without the need for large-scale datasets, and (ii) increasing the contribution of the visual channel in audio-visual speech systems. Our experiments also identify a trend towards modelling the temporal context as the key to advancing the field, where there is a need for ALR models trained on datasets that include wide speech variability at several context levels. In this thesis, we show that both the proper modelling of visual representations and the ability to retain context at several levels are necessary conditions for building successful lip-reading systems.
Jain, Abhilash. "Visual Speech Recognition". Thesis, 2018. https://etd.iisc.ac.in/handle/2005/4767.
Liu, Feng. "Audio fingerprinting for speech reconstruction and recognition in noisy environments". Thesis, 2017. http://hdl.handle.net/1828/7912.
Makkook, Mustapha. "A Multimodal Sensor Fusion Architecture for Audio-Visual Speech Recognition". Thesis, 2007. http://hdl.handle.net/10012/3065.
Chen, Bidong. "Audio recognition with distributed wireless sensor networks". Thesis, 2010. http://hdl.handle.net/1828/2683.
Liao, Wen-Yuan, and 廖文淵. "A Study on Audio-Visual Feature Extraction for Mandarin Digit Speech Recognition". Thesis, 2009. http://ndltd.ncl.edu.tw/handle/46704732964354703864.
大同大學 (Tatung University)
資訊工程學系(所) (Department of Computer Science and Engineering)
Academic year 97 (ROC calendar)
In recent years, many machine speechreading systems that combine audio and visual speech features have been proposed. For all such systems, the objective of the audio-visual speech recognizer is to improve recognition accuracy, particularly in difficult conditions. This thesis presents a Mandarin audio-visual recognition system that achieves better recognition rates in noisy conditions as well as for speech spoken under emotional conditions. We first extract the visual features of the lips, including geometric and motion features. These features are very important to the recognition system, especially in noisy conditions or in the presence of emotional effects. The motion features are obtained by applying an automatic face feature extractor followed by a fast motion feature extractor, and we compare the system's performance when using motion versus geometric features. As the classifier, we propose the weighted-discrete KNN (WD-KNN), compare it with two popular classifiers, the GMM and the HMM, and evaluate their performance on a Mandarin audio-visual speech corpus. We find that WD-KNN is a suitable classifier for Mandarin speech because of the monosyllabic nature of Mandarin and the classifier's low computational cost. Experimental results for the different classifiers at various SNR levels are presented and show that the WD-KNN classifier yields better recognition accuracy than the other classifiers on the Mandarin speech corpus used. Several weighting functions for the weighted KNN-based classifier were also studied: linear distance weighting, inverse distance weighting, rank weighting and a reverse Fibonacci weighting function. The overall results show that the WD-KNN classifier with the reverse Fibonacci weighting function achieves the highest recognition rate among the three extended versions of KNN. Finally, we perform emotional speech recognition experiments, which show that recognition is more robust when visual information is included: the audio-visual speech recognition system achieves a higher recognition rate when visual cues are incorporated.
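A rank-weighted KNN of the kind described is easy to reconstruct schematically: the k nearest training samples vote with rank-dependent weights, and a reversed Fibonacci sequence gives the closest neighbours the largest weights. This generic NumPy sketch is a reconstruction of the idea, not the thesis code:

```python
# Rank-weighted KNN: neighbours vote for their classes with weights drawn
# from a reversed Fibonacci sequence; the class with the largest sum wins.
import numpy as np

def reverse_fibonacci_weights(k: int) -> np.ndarray:
    fib = [1, 1]
    while len(fib) < k:
        fib.append(fib[-1] + fib[-2])
    return np.array(fib[:k][::-1], dtype=float)  # largest weight at rank 0

def wd_knn_predict(X_train, y_train, x, k=5):
    dists = np.linalg.norm(X_train - x, axis=1)
    order = np.argsort(dists)[:k]                # indices of the k nearest
    weights = reverse_fibonacci_weights(k)
    scores = {}
    for rank, idx in enumerate(order):
        scores[y_train[idx]] = scores.get(y_train[idx], 0.0) + weights[rank]
    return max(scores, key=scores.get)           # class with the largest vote

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 12))                   # audio-visual feature vectors
y = rng.integers(0, 10, 100)                     # ten Mandarin digit classes
print(wd_knn_predict(X, y, rng.normal(size=12)))
```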