Dissertations / Theses on the topic 'Visual speech recognition'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 46 dissertations / theses for your research on the topic 'Visual speech recognition.'
Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.
Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.
Luettin, Juergen. "Visual speech and speaker recognition." Thesis, University of Sheffield, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.264432.
Miyajima, C., D. Negi, Y. Ninomiya, M. Sano, K. Mori, K. Itou, K. Takeda, and Y. Suenaga. "Audio-Visual Speech Database for Bimodal Speech Recognition." INTELLIGENT MEDIA INTEGRATION NAGOYA UNIVERSITY / COE, 2005. http://hdl.handle.net/2237/10460.
Pachoud, Samuel. "Audio-visual speech and emotion recognition." Thesis, Queen Mary, University of London, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.528923.
Matthews, Iain. "Features for audio-visual speech recognition." Thesis, University of East Anglia, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.266736.
Seymour, R. "Audio-visual speech and speaker recognition." Thesis, Queen's University Belfast, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.492489.
Rabi, Gihad. "Visual speech recognition by recurrent neural networks." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1997. http://www.collectionscanada.ca/obj/s4/f2/dsk2/tape16/PQDD_0010/MQ36169.pdf.
Kaucic, Robert August. "Lip tracking for audio-visual speech recognition." Thesis, University of Oxford, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.360392.
Saeed, Mehreen. "Soft AI methods and visual speech recognition." Thesis, University of Bristol, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.299270.
Saenko, Ekaterina 1976. "Articulatory features for robust visual speech recognition." Thesis, Massachusetts Institute of Technology, 2004. http://hdl.handle.net/1721.1/28736.
Includes bibliographical references (p. 99-105).
This thesis explores a novel approach to visual speech modeling. Visual speech, or a sequence of images of the speaker's face, is traditionally viewed as a single stream of contiguous units, each corresponding to a phonetic segment. These units are defined heuristically by mapping several visually similar phonemes to one visual phoneme, sometimes referred to as a viseme. However, experimental evidence shows that phonetic models trained from visual data are not synchronous in time with acoustic phonetic models, indicating that visemes may not be the most natural building blocks of visual speech. Instead, we propose to model the visual signal in terms of the underlying articulatory features. This approach is a natural extension of feature-based modeling of acoustic speech, which has been shown to increase robustness of audio-based speech recognition systems. We start by exploring ways of defining visual articulatory features: first in a data-driven manner, using a large, multi-speaker visual speech corpus, and then in a knowledge-driven manner, using the rules of speech production. Based on these studies, we propose a set of articulatory features, and describe a computational framework for feature-based visual speech recognition. Multiple feature streams are detected in the input image sequence using Support Vector Machines, and then incorporated in a Dynamic Bayesian Network to obtain the final word hypothesis. Preliminary experiments show that our approach increases viseme classification rates in visually noisy conditions, and improves visual word recognition through feature-based context modeling.
by Ekaterina Saenko.
S.M.
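To illustrate the kind of pipeline the abstract above describes, the following Python sketch trains one SVM detector per articulatory feature stream on frame-level lip features. The feature names, data shapes, and random data are assumptions for demonstration only; the thesis combines the per-stream posteriors in a Dynamic Bayesian Network, which is not shown here.

```python
# Illustrative sketch (not Saenko's code): one SVM per articulatory feature stream.
# Feature names and data shapes are assumptions for demonstration only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_frames, n_dims = 500, 32                        # hypothetical reduced lip-image features
X = rng.normal(size=(n_frames, n_dims))

# Hypothetical articulatory feature streams, each with a small set of discrete values.
streams = {
    "lip_opening": rng.integers(0, 3, n_frames),   # closed / narrow / wide
    "lip_rounding": rng.integers(0, 2, n_frames),  # unrounded / rounded
}

detectors = {}
for name, labels in streams.items():
    clf = SVC(kernel="rbf", probability=True)      # per-stream frame classifier
    clf.fit(X, labels)
    detectors[name] = clf

# Per-frame posteriors for each stream; in the thesis these would feed a
# Dynamic Bayesian Network that produces the word hypothesis.
frame = X[:1]
posteriors = {name: clf.predict_proba(frame)[0] for name, clf in detectors.items()}
print(posteriors)
```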
Pass, A. R. "Towards pose invariant visual speech processing." Thesis, Queen's University Belfast, 2013. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.580170.
Dean, David Brendan. "Synchronous HMMs for audio-visual speech processing." Queensland University of Technology, 2008. http://eprints.qut.edu.au/17689/.
Rao, Ram Raghavendra. "Audio-visual interaction in multimedia." Diss., Georgia Institute of Technology, 1998. http://hdl.handle.net/1853/13349.
Dong, Junda. "Designing a Visual Front End in Audio-Visual Automatic Speech Recognition System." DigitalCommons@CalPoly, 2015. https://digitalcommons.calpoly.edu/theses/1382.
Reikeras, Helge. "Audio-visual automatic speech recognition using Dynamic Bayesian Networks." Thesis, Stellenbosch : University of Stellenbosch, 2011. http://hdl.handle.net/10019.1/6777.
Mukherjee, Niloy 1978. "Spontaneous speech recognition using visual context-aware language models." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/62380.
Includes bibliographical references (p. 83-88).
The thesis presents a novel situationally-aware multimodal spoken language system called Fuse that performs speech understanding for visual object selection. An experimental task was created in which people were asked to refer, using speech alone, to objects arranged on a table top. During training, Fuse acquires a grammar and vocabulary from a "show-and-tell" procedure in which visual scenes are paired with verbal descriptions of individual objects. Fuse determines a set of visually salient words and phrases and associates them with a set of visual features. Given a new scene, Fuse uses the acquired knowledge to generate class-based language models conditioned on the objects present in the scene, as well as a spatial language model that predicts the occurrences of spatial terms conditioned on target and landmark objects. The speech recognizer in Fuse uses a weighted mixture of these language models to search for more likely interpretations of user speech in the context of the current scene. During decoding, the weights are updated using a visual attention model which redistributes attention over objects based on partially decoded utterances. The dynamic situationally-aware language models enable Fuse to jointly infer the spoken language utterances underlying speech signals as well as the identities of the target objects they refer to. In an evaluation of the system, visual situationally-aware language modeling shows a significant decrease, of more than 30%, in speech recognition and understanding error rates. The underlying ideas of situation-aware speech understanding developed in Fuse may be applied in numerous areas including assistive and mobile human-machine interfaces.
by Niloy Mukherjee.
S.M.
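The core idea in the abstract above, mixing object-conditioned language models with weights driven by visual attention, can be sketched in a few lines of Python. The vocabulary, objects, probabilities, and attention values below are invented for illustration and are not from the thesis.

```python
# Minimal sketch of mixing object-conditioned language models by visual attention
# weights; all words, objects, and probabilities here are hypothetical.
def mixture_word_prob(word, attention, object_lms):
    """P(word) = sum over objects o of attention[o] * P(word | o)."""
    return sum(attention[o] * object_lms[o].get(word, 1e-6) for o in attention)

object_lms = {
    "red_cup":   {"red": 0.30, "cup": 0.40, "left": 0.05},
    "blue_ball": {"blue": 0.35, "ball": 0.40, "left": 0.05},
}
attention = {"red_cup": 0.7, "blue_ball": 0.3}   # would be updated as the utterance is decoded

for w in ["red", "ball", "left"]:
    print(w, round(mixture_word_prob(w, attention, object_lms), 3))
```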
Rochford, Matthew. "Visual Speech Recognition Using a 3D Convolutional Neural Network." DigitalCommons@CalPoly, 2019. https://digitalcommons.calpoly.edu/theses/2109.
Scott, Simon David. "A data-driven approach to visual speech synthesis." Thesis, University of Bath, 1996. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.307116.
Ahmad, Nasir. "A motion based approach for audio-visual automatic speech recognition." Thesis, Loughborough University, 2011. https://dspace.lboro.ac.uk/2134/8564.
Monteiro, Axel. "Spatial and temporal replication in visual and audiovisual speech recognition." Thesis, University of Nottingham, 2003. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.410421.
Ibrahim, Zamri. "A novel lip geometry approach for audio-visual speech recognition." Thesis, Loughborough University, 2014. https://dspace.lboro.ac.uk/2134/16526.
Dew, Andrea M. "A study of computer-based visual feedback of speech for the hearing impaired." Thesis, University of Leeds, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.277195.
Zhang, Xianxian. "Robust speech processing based on microphone array, audio-visual, and frame selection for in-vehicle speech recognition and in-set speaker recognition." Diss., Connect to online resource, 2005. http://wwwlib.umi.com/cr/colorado/fullcit?p3190350.
Martin, Claire. "Investigating the influence of natural variations in the quality of the visual image for visual and audiovisual speech recognition." Thesis, University of Nottingham, 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.395576.
Brady-Herbst, Brenene Marie. "An Analysis of Spondee Recognition Thresholds in Auditory-only and Audio-visual Conditions." PDXScholar, 1996. https://pdxscholar.library.pdx.edu/open_access_etds/5218.
Lew, Kum Hoi Chantal. "Talker variability and the roles of configural and featural information in visual and audiovisual speech recognition." Thesis, University of Nottingham, 2007. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.446370.
Silberer, Amanda Beth. "Importance of high frequency audibility on speech recognition with and without visual cues in listeners with normal hearing." Diss., University of Iowa, 2014. https://ir.uiowa.edu/etd/4755.
Beckmeyer, Cynthia S. "Comprehensive Evaluation of Non-Verbal Communication. A visual alternative to assist Alzheimer's patients' communication with their caregivers." University of Cincinnati / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1367927393.
Khaldieh, Salim Ahmad. "The role of phonological encoding (speech recoding) and visual processes in word recognition of American learners of Arabic as a foreign language." The Ohio State University, 1990. http://rave.ohiolink.edu/etdc/view?acc_num=osu1272465854.
Khaldieh, Salim Ahmad. "The role of phonological encoding (speech recoding) and visual processes in word recognition of American learners of Arabic as a foreign language." The Ohio State University, 1990. http://rave.ohiolink.edu/etdc/view?acc_num=osu1487685204966592.
Lucey, Patrick Joseph. "Lipreading across multiple views." Queensland University of Technology, 2007. http://eprints.qut.edu.au/16676/.
Fernández López, Adriana. "Learning of meaningful visual representations for continuous lip-reading." Doctoral thesis, Universitat Pompeu Fabra, 2021. http://hdl.handle.net/10803/671206.
Full textEn les darreres dècades, hi ha hagut un interès creixent en la descodificació de la parla utilitzant exclusivament senyals visuals, es a dir, imitant la capacitat humana de llegir els llavis, donant lloc a sistemes de lectura automàtica de llavis (ALR). No obstant això, se sap que l’accès a la parla a través del canal visual està subjecte a moltes limitacions en comparació amb el senyal acústic, es a dir, s’ha argumentat que els humans poden llegir al voltant del 30% de la informació dels llavis, i la resta es completa fent servir el context. Així, un dels principals reptes de l’ALR resideix en les ambigüitats visuals que sorgeixen a escala de paraula, destacant que no tots els sons que escoltem es poden distingir fàcilment observant els llavis. A la literatura, els primers sistemes ALR van abordar tasques de reconeixement senzilles, com ara el reconeixement de l’alfabet o els dígits, però progressivament van passar a entorns mes complexos i realistes que han conduït a diversos sistemes recents dirigits a la lectura continua dels llavis. En gran manera, aquests avenços han estat possibles gracies a la construcció de sistemes potents basats en arquitectures d’aprenentatge profund que han començat a substituir ràpidament els sistemes tradicionals. Tot i que les taxes de reconeixement de la lectura continua dels llavis poden semblar modestes en comparació amb les assolides pels sistemes basats en audio, és evident que el camp ha fet un pas endavant. Curiosament, es pot observar un efecte anàleg quan els humans intenten descodificar la parla: donats senyals sense soroll, la majoria de la gent pot descodificar el canal d’àudio sense esforç¸, però tindria dificultats per llegir els llavis, ja que l’ambigüitat dels senyals visuals fa necessari l’ús de context addicional per descodificar el missatge. En aquesta tesi explorem el modelatge adequat de representacions visuals amb l’objectiu de millorar la lectura contínua dels llavis. Amb aquest objectiu, presentem diferents mecanismes basats en dades per fer front als principals reptes de la lectura de llavis relacionats amb les ambigüitats o la dependència dels parlants dels senyals visuals. Els nostres resultats destaquen els avantatges d’una correcta codificació del canal visual, per a la qual les característiques més útils són aquelles que codifiquen les posicions corresponents dels llavis d’una manera similar, independentment de l’orador. Aquest fet obre la porta a i) la lectura de llavis en molts idiomes diferents sense necessitat de conjunts de dades a gran escala, i ii) a l’augment de la contribució del canal visual en sistemes de parla audiovisuals.´ D’altra banda, els nostres experiments identifiquen una tendència a centrar-se en iii la modelització del context temporal com la clau per avançar en el camp, on hi ha la necessitat de models d’ALR que s’entrenin en conjunts de dades que incloguin una gran variabilitat de la parla a diversos nivells de context. En aquesta tesi, demostrem que tant el modelatge adequat de les representacions visuals com la capacitat de retenir el context a diversos nivells són condicions necessàries per construir sistemes de lectura de llavis amb èxit.
Fong, Katherine KaYan. "IR-Depth Face Detection and Lip Localization Using Kinect V2." DigitalCommons@CalPoly, 2015. https://digitalcommons.calpoly.edu/theses/1425.
Yau, Wai Chee (waichee@ieee.org). "Video Analysis of Mouth Movement Using Motion Templates for Computer-based Lip-Reading." RMIT University, Electrical and Computer Engineering, 2008. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20081209.162504.
Besson, Gabriel. "Approche temporelle de la mémoire de reconnaissance visuelle et atteinte au stade prodromal de la maladie d'Alzheimer." PhD thesis, Aix-Marseille Université, 2013. http://tel.archives-ouvertes.fr/tel-00858502.
Chum, Ting Chia (丁家群). "Speech Recognition And Visual Basic." Thesis, 2003. http://ndltd.ncl.edu.tw/handle/49328880098397218445.
I-Shou University (義守大學)
Department of Electronic Engineering (電子工程學系)
91
Whatever language a speech recognition system (SRS) is applied to, the essential method is almost the same apart from minor details; the main differences lie in the characteristics of each language. The pronunciation of a Chinese character, for example, is composed of an initial, a final (rhyme), and a tone, so tone discrimination is necessary, whereas an English SRS recognizes the word as a unit and does not have to distinguish tones. This thesis gives an introduction to SRSs and implements a recognition program in Visual Basic, which runs under Windows and offers object-oriented capabilities. The sampled signals are pre-processed through endpoint detection and Hamming windowing. Hidden Markov Models (HMMs) are used as the recognition and classification tool: the characteristic parameters of each speech signal are extracted to build the reference database, and the recognition result is then obtained with the Viterbi algorithm. Taking the digits 0-9 as the recognition targets, the number of states and the number of sampled speech frames are varied to demonstrate how settings such as "state number" and "number of sampled speech frames" influence recognition.
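As a rough illustration of the HMM/Viterbi decoding step mentioned in the abstract (not the thesis's Visual Basic code), the Python sketch below scores a short observation sequence against a toy three-state left-to-right HMM; all transition and emission probabilities are made up.

```python
# Toy Viterbi decoding sketch: best state path and log-score for one observation
# sequence under a small left-to-right HMM (illustrative numbers only).
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """log_pi: (S,), log_A: (S,S), log_B: (S,V); obs: list of symbol ids."""
    delta = log_pi + log_B[:, obs[0]]
    back = []
    for o in obs[1:]:
        scores = delta[:, None] + log_A          # rows: previous state, cols: next state
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + log_B[:, o]
    path = [int(delta.argmax())]
    for ptr in reversed(back):                   # backtrack through the stored pointers
        path.append(int(ptr[path[-1]]))
    return delta.max(), path[::-1]

# Tiny 3-state left-to-right model over a 4-symbol codebook.
log_pi = np.log([1.0, 1e-9, 1e-9])
log_A = np.log([[0.6, 0.4, 1e-9], [1e-9, 0.6, 0.4], [1e-9, 1e-9, 1.0]])
log_B = np.log(np.array([[0.7, 0.1, 0.1, 0.1],
                         [0.1, 0.7, 0.1, 0.1],
                         [0.1, 0.1, 0.4, 0.4]]))
print(viterbi(log_pi, log_A, log_B, [0, 1, 2, 3]))
```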
Abreu, Hélder Paulo Monteiro. "Visual speech recognition for European Portuguese." Master's thesis, 2014. http://hdl.handle.net/1822/37465.
Speech recognition based on visual features began in the early 1980s, embedded in audio-visual speech recognition systems. In fact, the initial purpose of using visual cues was to increase the robustness of automatic speech recognition systems, which rapidly lose accuracy in noisy environments. However, the potential to keep good accuracy whenever the acoustic stream is unavailable, and in any other situation where a human lip reader would be needed, led researchers to create and explore the Visual Speech Recognition (VSR) field. Traditional VSR systems used only RGB information, following a unimodal approach, since adding other visual modalities could be expensive and present synchronization issues. The release of the Microsoft Kinect sensor brought new possibilities for the speech recognition fields: the sensor includes a microphone array, an RGB camera, and a depth sensor, and all its input modalities can be synchronized using the features of its SDK. More recently, Microsoft released the new Kinect One, offering a better camera and a different, improved depth-sensing technology. This thesis sets out the hypothesis that, using the available input modalities of such a sensor, such as RGB video and depth, as well as the skeletal tracking features available in the SDK, and by adopting a multimodal articulatory VSR approach, we can improve the word recognition rate of a VSR system compared to a unimodal approach using only RGB data. Regarding feature extraction, recent approaches based on articulatory features have shown promising results compared to standard shape-based viseme approaches. In this thesis, we also aim to verify the hypothesis that an articulatory VSR system can outperform a shape-based approach in terms of word recognition rate. The VSR system developed in this thesis, named ViKi (Visual Speech Recognition for Kinect), achieved a 68% word recognition rate in a scenario where 8 speakers pronounced a vocabulary of 25 isolated words, outperforming our tested unimodal approach. The use of depth information increased the system accuracy for both the articulatory (+8%) and the shape-based approach (+2%). In a speaker-dependent context, ViKi also achieved an average accuracy of ≈70%. The articulatory approach performed worse than the shape-based one, reaching 34% word accuracy, contrary to previous research based on appearance approaches and not confirming our third hypothesis.
Hill, Brian (廖峻廷). "Robust Speech Recognition Integrating Visual Information." Thesis, 1997. http://ndltd.ncl.edu.tw/handle/97538191028447078081.
Wei, Chun-Chuan (魏俊全). "Discriminative Analysis on Visual Features for Mandarin Speech Recognition." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/09955904216369882293.
National Taiwan University of Science and Technology (國立臺灣科技大學)
Department of Information Management (資訊管理系)
97
Visual features can improve the performance of a speech recognition system in noisy environments. However, it is hard to achieve acceptable performance on a multi-word recognition task using visual features alone, because visual features carry less speech information than acoustic features. In this thesis, we measure the distance between visual models to assess the discriminability of visual features and apply it to a pair-wise recognition task over Chinese syllable pairs. From the analysis of model distance and recognition error, we identify the discriminative pairs of Chinese syllables. The experimental results show that the average error rate of this pair-wise task is 10.47%, and 18.17% of the model pairs have an error rate below 2.5%. The model distance is highly correlated with the recognition error. Compared with the analysis of audio features, we also find model pairs that are more discriminative with visual features than with audio features.
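The abstract above does not specify the exact model-distance measure; as a hedged illustration of the general idea, the Python sketch below computes the Bhattacharyya distance between two Gaussian visual models, which could serve as a simple proxy for how separable (and hence how confusable) a syllable pair is. All numbers are invented.

```python
# Bhattacharyya distance between two Gaussians, used here as a stand-in for a
# model-distance measure between visual models of a syllable pair.
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

mu_a, mu_b = np.array([1.0, 0.5]), np.array([1.2, 0.4])
cov_a = np.array([[0.20, 0.0], [0.0, 0.30]])
cov_b = np.array([[0.25, 0.0], [0.0, 0.28]])
print(bhattacharyya_gaussian(mu_a, cov_a, mu_b, cov_b))  # larger = more separable pair
```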
Makkook, Mustapha. "A Multimodal Sensor Fusion Architecture for Audio-Visual Speech Recognition." Thesis, 2007. http://hdl.handle.net/10012/3065.
Liang, Shin-Hwei (梁欣蕙). "Feature-Based Visual Speech Recognition Using Time-Delay Neural Network." Thesis, 1997. http://ndltd.ncl.edu.tw/handle/10921510042799247430.
National Chiao Tung University (國立交通大學)
Department of Control Engineering (控制工程系)
85
An automatic mouth feature detection and mouth motion recognition technique for visual speech recognition is proposed in this thesis. The technique consists of three stages: mouth detection and extraction, mouth feature detection, and neural network learning. In the mouth detection stage, the first step is to find the locations of human faces without imposing any constraints on the users, for practicability. The Hough transform is used to determine candidate face locations in complex environments; we simplify it to a three-dimensional search and restrict the search region using the symmetry of the human face. A Mouth Detection Algorithm (MDA) is then proposed to verify the mouth location, followed by normalization, adjustment, and template matching of the candidate mouth images, after which a single mouth image is selected as the winner. In the mouth feature detection stage, one procedure searches for the mouth corners and a refined Mouth Feature Searching Algorithm (MFSA) locates four additional points on the two lips. These points play an important role in our system, since two parabolas can be fitted through the mouth corners and these points. Finally, a precise mouth model is established from the two parabolas, and eleven features selected from the mouth model serve as the input patterns for the classifier. In the last stage, a TDNN is used as the classifier because of its tolerance to time shifts. We have carried out many experiments to decide which kinds of features are crucial and sufficient for the lip-reading system. The off-line, speaker-dependent recognition rate reaches 90% in our experiments. Two other methods are compared with our system, and our method achieves better performance than both while using less memory and training time. Finally, we generalize our system to six speakers to verify its robustness; the experimental results show the stability and practicability of the proposed approach.
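A minimal sketch of the two-parabola mouth model described above: fit one quadratic through the mouth corners and the upper-lip point, another through the corners and the lower-lip point, then derive simple geometric features. The point coordinates below are made-up detector outputs, not values from the thesis.

```python
# Sketch (assumed) of a two-parabola lip model built from four detected points.
import numpy as np

left, right = (10.0, 50.0), (60.0, 52.0)           # mouth corners (x, y)
upper_mid, lower_mid = (35.0, 40.0), (35.0, 65.0)  # mid-points on the upper and lower lips

def fit_parabola(p1, p2, p3):
    xs, ys = zip(p1, p2, p3)
    return np.polyfit(xs, ys, 2)                   # coefficients a, b, c of a*x^2 + b*x + c

upper = fit_parabola(left, right, upper_mid)
lower = fit_parabola(left, right, lower_mid)

# Simple geometric features a classifier could use: mouth width and opening height.
width = right[0] - left[0]
opening = np.polyval(lower, 35.0) - np.polyval(upper, 35.0)
print(upper, lower, width, opening)
```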
Liao, Wen-Yuan (廖文淵). "A Study on Audio-Visual Feature Extraction for Mandarin Digit Speech Recognition." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/46704732964354703864.
Tatung University (大同大學)
Department of Computer Science and Engineering (資訊工程學系(所))
97
In recent years, many machine speechreading systems that combine audio and visual speech features have been proposed. The objective of such audio-visual speech recognizers is to improve recognition accuracy, particularly in difficult conditions. This thesis presents a Mandarin audio-visual recognition system that achieves better recognition rates in noisy conditions as well as for speech spoken under emotional conditions. We first extract visual features of the lips, including geometric and motion features; these features are especially important to the recognition system in noisy conditions or in the presence of emotional effects. The motion features are obtained by applying an automatic face feature extractor followed by a fast motion feature extractor, and we compare the performance of the system using motion versus geometric features. For this recognition system, we propose the weighted-discrete KNN (WD-KNN) as the classifier and compare it with two popular classifiers, the GMM and HMM, evaluating their performance on a Mandarin audio-visual speech corpus. We find that WD-KNN is a suitable classifier for Mandarin speech because of the monosyllabic property of Mandarin and its low computational cost. Experimental results for the different classifiers at various SNR levels show that the WD-KNN classifier yields better recognition accuracy than the other classifiers on the Mandarin speech corpus used. Several weighting functions were also studied for the weighted KNN based classifier, including linear distance weighting, inverse distance weighting, rank weighting, and a reverse Fibonacci weighting function. The overall results show that the WD-KNN classifier with the reverse Fibonacci weighting function achieves the highest recognition rate among the extended versions of KNN. Finally, we perform emotional speech recognition experiments; the results show that the system is more robust when visual information is included, and the audio-visual speech recognition system achieves a higher recognition rate when visual cues are incorporated.
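The following Python sketch illustrates a weighted KNN with a reverse-Fibonacci weighting of the k nearest neighbours, in the spirit of the WD-KNN described above; the feature vectors and labels are random placeholders, and this is not the thesis implementation.

```python
# Illustrative weighted KNN with reverse-Fibonacci neighbour weights.
import numpy as np

def reverse_fibonacci_weights(k):
    fib = [1, 1]
    while len(fib) < k:
        fib.append(fib[-1] + fib[-2])
    return np.array(fib[:k][::-1], dtype=float)    # nearest neighbour gets the largest weight

def wdknn_predict(X_train, y_train, x, k=5):
    d = np.linalg.norm(X_train - x, axis=1)        # Euclidean distance to every training sample
    idx = np.argsort(d)[:k]                        # indices of the k nearest neighbours
    w = reverse_fibonacci_weights(k)
    votes = {}
    for weight, label in zip(w, y_train[idx]):
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)

rng = np.random.default_rng(1)
X_train = rng.normal(size=(40, 8))                 # hypothetical audio-visual feature vectors
y_train = rng.integers(0, 10, 40)                  # ten Mandarin digit classes
print(wdknn_predict(X_train, y_train, X_train[0]))
```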
Frisky, Aufaclav Zatu Kusuma (柯奧福). "Visual Speech Recognition and Password Verification Using Local Spatiotemporal Features and Kernel Sparse Representation Classifier." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/03868492706552896766.
National Central University (國立中央大學)
Department of Computer Science and Information Engineering, in-service master's program (資訊工程學系在職專班)
103
Visual speech recognition (VSR) applications play an important role in many aspects of human life, with research efforts directed at recognition systems for security, biometrics, and human-machine interaction. In this thesis, we propose two lip-based systems. The first is a letter recognition system using spatiotemporal feature descriptors: local texture and local temporal features represent the visual lip data, non-negative matrix factorization (NMF) reduces the feature dimensionality, and a kernel sparse representation classifier performs classification. The visual lip data are first preprocessed by enhancing image contrast before feature extraction. In our experiments, promising accuracies of 67.13%, 45.37%, and 63.12% are achieved in the semi-speaker-dependent, speaker-independent, and speaker-dependent settings on the AVLetters database. We also compared our method with others on the AVLetters 2 database; using the same configuration, our method achieves 89.02% accuracy in the speaker-dependent case and 25.9% in the speaker-independent case, outperforming the other methods. The second system is a lip-based password approach for home-entrance security, using a confidence point in a home automation system. We propose new features based on a modified spatiotemporal descriptor with L2-Hellinger normalization, use two-dimensional semi non-negative matrix factorization (2D Semi-NMF) for dimensionality reduction, and propose a forward-backward kernel sparse representation classifier (FB-KSRC). Our experimental results show that the system classifies the password robustly: using ten visual passwords of five combined letters from the AVLetters 2 dataset, across all combination experiments, the system verifies the password well. The complexity experiments also indicate a reasonable classification time for a real-world implementation.
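As a hedged illustration of the first system's pipeline, the Python sketch below reduces non-negative features with NMF and then classifies with a plain (non-kernel) sparse-representation classifier based on per-class reconstruction residuals; the thesis uses a kernelised variant and real spatiotemporal lip features, and all data here is random.

```python
# NMF dimensionality reduction followed by a simple sparse-representation classifier:
# assign the class whose training atoms best reconstruct the test sample.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.random((60, 100))                  # 60 non-negative feature vectors (placeholder data)
y = np.repeat(np.arange(3), 20)            # three letter classes

Z = NMF(n_components=10, init="nndsvda", max_iter=500).fit_transform(X)

def src_predict(Z_train, y_train, z):
    lasso = Lasso(alpha=0.01, max_iter=5000)
    lasso.fit(Z_train.T, z)                # sparse code of z over the training dictionary
    code = lasso.coef_
    residuals = {}
    for c in np.unique(y_train):
        mask = (y_train == c)
        recon = Z_train[mask].T @ code[mask]
        residuals[c] = np.linalg.norm(z - recon)
    return min(residuals, key=residuals.get)

print(src_predict(Z, y, Z[5]))
```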
Rigas, Dimitrios I., and M. Alsuraihi. "A Toolkit for Multimodal Interface Design: An Empirical Investigation." 2007. http://hdl.handle.net/10454/3156.
This paper introduces a comparative multi-group study carried out to investigate the use of multimodal interaction metaphors (visual, oral, and aural) for improving learnability (usability from first-time use) of interface-design environments. An initial survey gathered views on the effectiveness of, and satisfaction with, employing speech and speech recognition to solve some common usability problems. The investigation was then carried out empirically by testing the usability parameters efficiency, effectiveness, and satisfaction for three design toolkits (TVOID, OFVOID, and MMID) built especially for the study. TVOID and OFVOID interacted with the user visually only, using typical and time-saving interaction metaphors, while the third environment, MMID, added another modality through vocal and aural interaction. The results showed that using vocal commands and the mouse concurrently to complete tasks on first use was more efficient and more effective than using visual-only interaction metaphors.
McIvor, Tom. "Continuous speech recognition : an analysis of its effect on listening comprehension, listening strategies and notetaking : a thesis presented in part fulfilment of the requirements for the degree of Doctorate in Education, Massey University." 2006. http://hdl.handle.net/10179/1471.
Chen, Yi-Ling (陳怡伶). "The Relationship between Recognition and Phonological Awareness, Naming Speed, Visual Skills for Reading-disabled Readers." Thesis, 2005. http://ndltd.ncl.edu.tw/handle/96447031387736628310.
National University of Tainan (國立臺南大學)
Master's Program, Department of Special Education (特殊教育學系碩士班)
93
The purpose of this research is to examine how phonological awareness (PA), naming speed (NS), and visual skills (VS) influence character recognition in Chinese reading-disabled readers (RDs). By comparing RDs with two contrasting groups, the differences between RDs and normal readers on the three abilities can be identified. In addition, for RDs' recognition of phonetic versus non-phonetic compound characters, I further analyzed whether the three abilities differ in their predictive power. Ninety subjects were chosen from the second and fourth graders of four elementary schools in Kaohsiung City and were divided into the RD group and two control groups: an age-matched (AM) group of fourth graders and a reading-level-matched (RM) group of second graders. All subjects received four tests, covering PA, NS, VS, and Chinese character recognition. The data were analyzed with descriptive statistics, one-way analysis of variance, product-moment correlation, and stepwise multiple regression. The main findings are summarized as follows: 1. RDs differ from the age-matched fourth graders in PA, NS, VS, and recognition. 2. RDs differ from the reading-level-matched group in PA, but not in VS, NS, or recognition. 3. Recognition ability is related to PA, NS, and VS. 4. PA predicts RDs' recognition, whereas NS predicts recognition for the second graders. 5. For RDs, PA predicts phonetic-compound and non-phonetic-compound recognition differently. Theoretical and practical implications, as well as suggestions for future research, are discussed in the thesis.
Fortier-St-Pierre, Simon. "La dynamique spatio-temporelle de l’attention en lecture chez les dyslexiques." Thesis, 2019. http://hdl.handle.net/1866/24649.
Full textDyslexia is a neurodevelopmental disorder that affects the normal development of reading fluency. Deficits affecting basic reading processes may affect dyslexics and would thus alter high-level word representations: orthographic, phonological, and semantic. One of these basic processes is the attentional mechanism that is involved in the visual processing of horizontal multi-element strings such as words. The effectiveness of this mechanism could be closely related to reading expertise in normal readers and anomalies thereof could be observed in dyslexics. Unfortunately, it remains unclear how attention is deployed during visual word recognition and how it may impact on reading speed and potentially on certain language skills. The first article of this thesis aims to shed light on divergences in the deployment of attention through time and space during the recognition of familiar words in a group of adults with dyslexia in comparison to normal readers. These groups were matched in terms of age and intellectual functioning. This objective is pursued with the attentional probe technique. Results reveal that less attentional resources are directed to the first letter of a word in dyslexics, which is suboptimal considering that the first letter of a word has a higher diagnostic value than any other letter position. The goal of the second article is to determine if reading fluency and phonological awareness in dyslexics may benefit from a short attentional training. The effects of an active training using the NeuroTracker program and a placebo training in adults with dyslexia shows systematic gains immediately after active training. The order of the training (active then placebo, or placebo then active) was counter-balanced across two groups. These gains are observed on reading speed as well as on phonological awareness. The third article of this thesis finally brings a significant additional contribution to the evaluation of reading speed among Quebec university students. The use of test sentences from an existing tool (MNRead) has been incorporated into a rapid visual serial presentation protocol to assess reading speed. In addition to this set of test sentences, four other sets of test sentences have been standardized. The tool is reliable, as reading speed measurements are similar in the same individual at different times (Exp 1). Moreover, it meets different psychometric standards (Exps 1 and 2) while being particularly sensitive to the presence of the reading difficulties found in dyslexics (Exp.2). In sum, it appears that particular visual-attention processes underlie reading expertise and that these show anomalies in dyslexics. The characterization of a suboptimal attention deployment in visual word recognition as well as the benefits obtained in reading and phonological awareness subsequent to an attentional training highlight the importance of these basic processes in reading.