Dissertations / Theses on the topic 'Visual speech recognition'


Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 46 dissertations / theses for your research on the topic 'Visual speech recognition.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and the bibliographic reference for the chosen work will be generated automatically in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of each publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Luettin, Juergen. "Visual speech and speaker recognition." Thesis, University of Sheffield, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.264432.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Miyajima, C., D. Negi, Y. Ninomiya, M. Sano, K. Mori, K. Itou, K. Takeda, and Y. Suenaga. "Audio-Visual Speech Database for Bimodal Speech Recognition." INTELLIGENT MEDIA INTEGRATION NAGOYA UNIVERSITY / COE, 2005. http://hdl.handle.net/2237/10460.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Pachoud, Samuel. "Audio-visual speech and emotion recognition." Thesis, Queen Mary, University of London, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.528923.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Matthews, Iain. "Features for audio-visual speech recognition." Thesis, University of East Anglia, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.266736.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Seymour, R. "Audio-visual speech and speaker recognition." Thesis, Queen's University Belfast, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.492489.

Full text
Abstract:
In this thesis, a number of important issues relating to the use of both audio and video information for speech and speaker recognition are investigated. A comprehensive comparison of different visual feature types is given, including both geometric and image transformation based features. A new geometric based method for feature extraction is described, as well as the novel use of curvelet based features. Different methods for constructing the feature vectors are compared, as well as feature vector sizes and the use of dynamic features. Each feature type is tested against three types of visual noise: compression, blurring and jitter. A novel method of integrating the audio and video information streams called the maximum stream posterior (MSP) is described. This method is tested in both speaker dependent and speaker independent audio-visual speech recognition (AVSR) systems, and is shown to be robust to noise in either the audio or video streams, given no prior knowledge of the noise. This method is then extended to form the maximum weighted stream posterior (MWSP) method. Finally, both the MSP and MWSP are tested in an audio-visual speaker recognition system (AVSpR). Experiments using the XM2VTS database show that both of these methods can outperform standard methods in terms of recognition accuracy in situations where either stream is corrupted.
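For readers unfamiliar with stream-weighted fusion, the following minimal sketch illustrates the general idea behind posterior-based integration of audio and video scores. It is not Seymour's exact MSP formulation; the grid search over weights and the confidence criterion are assumptions made purely for illustration.

```python
import numpy as np

def fuse_stream_posteriors(log_p_audio, log_p_video, weights=np.linspace(0.0, 1.0, 11)):
    """Illustrative stream fusion: combine audio and video class log-likelihoods with
    an exponent weight and keep the weighting that yields the most confident class
    posterior. This is only a sketch of posterior-based stream weighting, not the
    thesis's exact MSP rule.

    log_p_audio, log_p_video: arrays of shape (n_classes,) with per-class log-likelihoods.
    """
    best = None
    for lam in weights:
        combined = lam * log_p_audio + (1.0 - lam) * log_p_video    # log-linear fusion
        posterior = np.exp(combined - np.logaddexp.reduce(combined))  # normalise to sum to 1
        confidence = posterior.max()
        if best is None or confidence > best[0]:
            best = (confidence, lam, int(posterior.argmax()))
    return best  # (max posterior, chosen stream weight, predicted class index)

# Example with dummy scores for 4 classes
audio_scores = np.log(np.array([0.1, 0.6, 0.2, 0.1]))
video_scores = np.log(np.array([0.3, 0.3, 0.3, 0.1]))
print(fuse_stream_posteriors(audio_scores, video_scores))
```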
APA, Harvard, Vancouver, ISO, and other styles
6

Rabi, Gihad. "Visual speech recognition by recurrent neural networks." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1997. http://www.collectionscanada.ca/obj/s4/f2/dsk2/tape16/PQDD_0010/MQ36169.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Kaucic, Robert August. "Lip tracking for audio-visual speech recognition." Thesis, University of Oxford, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.360392.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Saeed, Mehreen. "Soft AI methods and visual speech recognition." Thesis, University of Bristol, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.299270.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Saenko, Ekaterina 1976. "Articulatory features for robust visual speech recognition." Thesis, Massachusetts Institute of Technology, 2004. http://hdl.handle.net/1721.1/28736.

Full text
Abstract:
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004.
Includes bibliographical references (p. 99-105).
This thesis explores a novel approach to visual speech modeling. Visual speech, or a sequence of images of the speaker's face, is traditionally viewed as a single stream of contiguous units, each corresponding to a phonetic segment. These units are defined heuristically by mapping several visually similar phonemes to one visual phoneme, sometimes referred to as a viseme. However, experimental evidence shows that phonetic models trained from visual data are not synchronous in time with acoustic phonetic models, indicating that visemes may not be the most natural building blocks of visual speech. Instead, we propose to model the visual signal in terms of the underlying articulatory features. This approach is a natural extension of feature-based modeling of acoustic speech, which has been shown to increase robustness of audio-based speech recognition systems. We start by exploring ways of defining visual articulatory features: first in a data-driven manner, using a large, multi-speaker visual speech corpus, and then in a knowledge-driven manner, using the rules of speech production. Based on these studies, we propose a set of articulatory features, and describe a computational framework for feature-based visual speech recognition. Multiple feature streams are detected in the input image sequence using Support Vector Machines, and then incorporated in a Dynamic Bayesian Network to obtain the final word hypothesis. Preliminary experiments show that our approach increases viseme classification rates in visually noisy conditions, and improves visual word recognition through feature-based context modeling.
by Ekaterina Saenko.
S.M.
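As a rough illustration of the detection stage described in the abstract, the sketch below trains one SVM per articulatory feature stream and returns per-frame posteriors that a Dynamic Bayesian Network could then combine into word hypotheses. The stream names, label format and classifier settings are assumptions, not the thesis's actual configuration.

```python
from sklearn.svm import SVC

def train_articulatory_detectors(X_train, labels_per_stream):
    """Train one SVM detector per articulatory feature stream.
    X_train: (n_frames, n_dims) visual features extracted from the mouth region.
    labels_per_stream: dict mapping a stream name (e.g. 'lip_opening') to per-frame labels."""
    detectors = {}
    for name, y in labels_per_stream.items():
        clf = SVC(kernel="rbf", probability=True)  # probabilistic outputs for later fusion
        clf.fit(X_train, y)
        detectors[name] = clf
    return detectors

def detect(detectors, X_frames):
    """Return per-frame class posteriors for each articulatory stream; in the full
    system these would feed a Dynamic Bayesian Network, which is omitted here."""
    return {name: clf.predict_proba(X_frames) for name, clf in detectors.items()}
```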
APA, Harvard, Vancouver, ISO, and other styles
10

Pass, A. R. "Towards pose invariant visual speech processing." Thesis, Queen's University Belfast, 2013. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.580170.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Dean, David Brendan. "Synchronous HMMs for audio-visual speech processing." Queensland University of Technology, 2008. http://eprints.qut.edu.au/17689/.

Full text
Abstract:
Both human perceptual studies and automatic machine-based experiments have shown that visual information from a speaker's mouth region can improve the robustness of automatic speech processing tasks, especially in the presence of acoustic noise. By taking advantage of the complementary nature of the acoustic and visual speech information, audio-visual speech processing (AVSP) applications can work reliably in more real-world situations than would be possible with traditional acoustic speech processing applications. The two most prominent applications of AVSP for viable human-computer interfaces involve the recognition of the speech events themselves, and the recognition of speakers' identities based upon their speech. However, while these two fields of speech and speaker recognition are closely related, there has been little systematic comparison of the two tasks under similar conditions in the existing literature. Accordingly, the primary focus of this thesis is to compare the suitability of general AVSP techniques for speech or speaker recognition, with a particular focus on synchronous hidden Markov models (SHMMs). The cascading appearance-based approach to visual speech feature extraction has been shown to work well in removing irrelevant static information from the lip region to greatly improve visual speech recognition performance. This thesis demonstrates that these dynamic visual speech features also provide for an improvement in speaker recognition, showing that speakers can be visually recognised by how they speak, in addition to their appearance alone. This thesis investigates a number of novel techniques for training and decoding of SHMMs that improve the audio-visual speech modelling ability of the SHMM approach over the existing state-of-the-art joint-training technique. Novel experiments are conducted to demonstrate that the reliability of the two streams during training is of little importance to the final performance of the SHMM. Additionally, two novel techniques of normalising the acoustic and visual state classifiers within the SHMM structure are demonstrated for AVSP. Fused hidden Markov model (FHMM) adaptation is introduced as a novel method of adapting SHMMs from existing well-performing acoustic hidden Markov models (HMMs). This technique is demonstrated to provide improved audio-visual modelling over the jointly-trained SHMM approach at all levels of acoustic noise for the recognition of audio-visual speech events. However, the close coupling of the SHMM approach is shown to be less useful for speaker recognition, where a late integration approach is demonstrated to be superior.
APA, Harvard, Vancouver, ISO, and other styles
12

Rao, Ram Raghavendra. "Audio-visual interaction in multimedia." Diss., Georgia Institute of Technology, 1998. http://hdl.handle.net/1853/13349.

Full text
APA, Harvard, Vancouver, ISO, and other styles
13

Dong, Junda. "Designing a Visual Front End in Audio-Visual Automatic Speech Recognition System." DigitalCommons@CalPoly, 2015. https://digitalcommons.calpoly.edu/theses/1382.

Full text
Abstract:
Audio-visual automatic speech recognition (AVASR) is a speech recognition technique integrating audio and video signals as input. A traditional audio-only speech recognition system uses only acoustic information from an audio source; however, its recognition performance degrades significantly in acoustically noisy environments. It has been shown that visual information can also be used to identify speech. To improve speech recognition performance, audio-visual automatic speech recognition has been studied. In this work, we focus on the design of the visual front end of an AVASR system, which mainly consists of face detection and lip localization. The front end is built upon the AVICAR database that was recorded in moving vehicles. Therefore, diverse lighting conditions and poor quality of imagery are the problems we must overcome. We first propose the use of the Viola-Jones face detection algorithm, which can process images rapidly with high detection accuracy. When the algorithm is applied to the AVICAR database, we reach a face detection rate of 89%. By separately detecting and then integrating the detection results from all the different color channels, we further improve the detection accuracy to 95%. To reliably localize the lips, three algorithms are studied and compared: the Gabor filter algorithm, the lip enhancement algorithm, and the modified Viola-Jones algorithm for lip features. Finally, to increase the detection rate, the modified Viola-Jones and lip enhancement algorithms are cascaded based on the results of the three lip localization methods. Overall, the front end achieves an accuracy of 90% for lip localization.
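The per-channel detection-and-integration step lends itself to a short sketch: a standard OpenCV Haar-cascade (Viola-Jones) detector is run on each colour channel and the resulting boxes are merged by simple voting. This only illustrates the idea; the thesis's actual integration rule and detector parameters may differ.

```python
import cv2

# Standard frontal-face Haar cascade shipped with OpenCV
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces_per_channel(bgr_image):
    candidates = []
    for channel in cv2.split(bgr_image):          # B, G and R channels, each 8-bit
        faces = cascade.detectMultiScale(channel, scaleFactor=1.1, minNeighbors=4)
        candidates.extend([[int(v) for v in box] for box in faces])
    # Keep boxes supported by at least two channels (simple voting/merging step)
    merged, _weights = cv2.groupRectangles(candidates, groupThreshold=1, eps=0.2)
    return merged

# Usage: faces = detect_faces_per_channel(cv2.imread("frame.png"))
```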
APA, Harvard, Vancouver, ISO, and other styles
14

Reikeras, Helge. "Audio-visual automatic speech recognition using Dynamic Bayesian Networks." Thesis, Stellenbosch : University of Stellenbosch, 2011. http://hdl.handle.net/10019.1/6777.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Mukherjee, Niloy 1978. "Spontaneous speech recognition using visual context-aware language models." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/62380.

Full text
Abstract:
Thesis (S.M.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2003.
Includes bibliographical references (p. 83-88).
The thesis presents a novel situationally-aware multimodal spoken language system called Fuse that performs speech understanding for visual object selection. An experimental task was created in which people were asked to refer, using speech alone, to objects arranged on a table top. During training, Fuse acquires a grammar and vocabulary from a "show-and-tell" procedure in which visual scenes are paired with verbal descriptions of individual objects. Fuse determines a set of visually salient words and phrases and associates them to a set of visual features. Given a new scene, Fuse uses the acquired knowledge to generate class-based language models conditioned on the objects present in the scene, as well as a spatial language model that predicts the occurrences of spatial terms conditioned on target and landmark objects. The speech recognizer in Fuse uses a weighted mixture of these language models to search for more likely interpretations of user speech in the context of the current scene. During decoding, the weights are updated using a visual attention model which redistributes attention over objects based on partially decoded utterances. The dynamic situationally-aware language models enable Fuse to jointly infer spoken language utterances underlying speech signals as well as the identities of the target objects they refer to. In an evaluation of the system, visual situationally-aware language modeling shows a significant decrease, of more than 30%, in speech recognition and understanding error rates. The underlying ideas of situation-aware speech understanding that have been developed in Fuse may be applied in numerous areas including assistive and mobile human-machine interfaces.
by Niloy Mukherjee.
S.M.
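The weighted mixture of class-based language models can be illustrated with a toy sketch: word probabilities from per-object language models are interpolated using weights from a visual attention model. The function names and toy unigram models below are hypothetical and stand in for Fuse's actual components.

```python
# Illustrative sketch of mixing language models with visually derived weights.
def mixed_lm_probability(word, history, class_lms, attention_weights):
    """class_lms: dict object_class -> callable (word, history) -> probability.
    attention_weights: dict object_class -> weight from the visual attention model,
    assumed to sum to one and to be updated as the utterance is decoded."""
    return sum(attention_weights[c] * lm(word, history) for c, lm in class_lms.items())

# Example with two toy unigram "models" tied to objects in the current scene
class_lms = {
    "cup":  lambda w, h: {"red": 0.2, "cup": 0.5}.get(w, 0.01),
    "ball": lambda w, h: {"red": 0.3, "ball": 0.4}.get(w, 0.01),
}
weights = {"cup": 0.7, "ball": 0.3}   # more visual attention currently on the cup
print(mixed_lm_probability("cup", (), class_lms, weights))
```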
APA, Harvard, Vancouver, ISO, and other styles
16

Rochford, Matthew. "Visual Speech Recognition Using a 3D Convolutional Neural Network." DigitalCommons@CalPoly, 2019. https://digitalcommons.calpoly.edu/theses/2109.

Full text
Abstract:
Mainstream automatic speech recognition (ASR) makes use of audio data to identify spoken words; however, visual speech recognition (VSR) has recently been of increased interest to researchers. VSR is used when audio data is corrupted or missing entirely, and also to further enhance the accuracy of audio-based ASR systems. In this research, we present both a framework for building 3D feature cubes of lip data from videos and a 3D convolutional neural network (CNN) architecture for performing classification on a dataset of 100 spoken words, recorded in an uncontrolled environment. Our 3D-CNN architecture achieves a testing accuracy of 64%, comparable with recent works, but using an input data size that is up to 75% smaller. Overall, our research shows that 3D-CNNs can be successful in finding spatial-temporal features using unsupervised feature extraction and are a suitable choice for VSR-based systems.
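A compact PyTorch sketch of a 3D CNN operating on fixed-size lip feature cubes (frames x height x width) is shown below. The layer sizes are illustrative and do not reproduce the architecture evaluated in the thesis; only the 100-word output follows the task description.

```python
import torch
import torch.nn as nn

class Lip3DCNN(nn.Module):
    """Minimal 3D-CNN sketch for word classification from lip feature cubes."""
    def __init__(self, num_words=100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, num_words)
        )

    def forward(self, x):          # x: (batch, 1, frames, height, width)
        return self.classifier(self.features(x))

# Usage: logits = Lip3DCNN()(torch.randn(2, 1, 16, 48, 48))
```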
APA, Harvard, Vancouver, ISO, and other styles
17

Scott, Simon David. "A data-driven approach to visual speech synthesis." Thesis, University of Bath, 1996. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.307116.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Ahmad, Nasir. "A motion based approach for audio-visual automatic speech recognition." Thesis, Loughborough University, 2011. https://dspace.lboro.ac.uk/2134/8564.

Full text
Abstract:
The research work presented in this thesis introduces novel approaches for both visual region of interest extraction and visual feature extraction for use in audio-visual automatic speech recognition. In particular, the speaker's movement that occurs during speech is used to isolate the mouth region in video sequences, and motion-based features obtained from this region are used to provide new visual features for audio-visual automatic speech recognition. The mouth region extraction approach proposed in this work is shown to give superior performance compared with existing colour-based lip segmentation methods. The new features are obtained from three separate representations of motion in the region of interest, namely the difference in luminance between successive images, block matching based motion vectors and optical flow. The new visual features are found to improve visual-only and audio-visual speech recognition performance when compared with the commonly-used appearance feature-based methods. In addition, a novel approach is proposed for visual feature extraction from either the discrete cosine transform or discrete wavelet transform representations of the mouth region of the speaker. In this work, the image transform is explored from a new viewpoint of data discrimination, in contrast to the more conventional data preservation viewpoint. The main findings of this work are that audio-visual automatic speech recognition systems using the new features, extracted from the frequency bands selected according to their discriminatory abilities, generally outperform those using features designed for data preservation. To establish the noise robustness of the new features proposed in this work, their performance has been studied in the presence of a range of different types of noise and at various signal-to-noise ratios. In these experiments, the audio-visual automatic speech recognition systems based on the new approaches were found to give superior performance both to audio-visual systems using appearance based features and to audio-only speech recognition systems.
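Two of the motion representations mentioned above, the luminance difference between successive frames and dense optical flow, can be sketched with OpenCV as follows. The pooled descriptor at the end is a simplification for illustration and not the feature design used in the thesis; block-matching vectors are omitted.

```python
import cv2
import numpy as np

def motion_features(prev_gray, curr_gray):
    """Compute simple motion features for a cropped mouth region (two grayscale frames)."""
    diff = cv2.absdiff(curr_gray, prev_gray)                       # luminance difference image
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)  # dense (H, W, 2) flow field
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    # Toy fixed-length descriptor: mean difference plus coarse flow statistics
    return np.array([diff.mean(), magnitude.mean(), magnitude.std(), np.cos(angle).mean()])
```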
APA, Harvard, Vancouver, ISO, and other styles
19

Monteiro, Axel. "Spatial and temporal replication in visual and audiovisual speech recognition." Thesis, University of Nottingham, 2003. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.410421.

Full text
APA, Harvard, Vancouver, ISO, and other styles
20

Ibrahim, Zamri. "A novel lip geometry approach for audio-visual speech recognition." Thesis, Loughborough University, 2014. https://dspace.lboro.ac.uk/2134/16526.

Full text
Abstract:
By identifying lip movements and characterizing their associations with speech sounds, the performance of speech recognition systems can be improved, particularly when operating in noisy environments. Various methods have been studied by research groups around the world to incorporate lip movements into speech recognition in recent years; however, exactly how best to incorporate the additional visual information is still not known. This study aims to extend the knowledge of relationships between visual and speech information, specifically using lip geometry information due to its robustness to head rotation and the fewer number of features required to represent movement. A new method has been developed to extract lip geometry information, to perform classification and to integrate visual and speech modalities. This thesis makes several contributions. First, this work presents a new method to extract lip geometry features using the combination of a skin colour filter, a border following algorithm and a convex hull approach. The proposed method was found to improve lip shape extraction performance compared to existing approaches. Lip geometry features including height, width, ratio, area, perimeter and various combinations of these features were evaluated to determine which performs best when representing speech in the visual domain. Second, a novel template matching technique able to adapt to dynamic differences in the way words are uttered by speakers has been developed, which determines the best fit of an unseen feature signal to those stored in a database template. Third, following an evaluation of integration strategies, a novel method has been developed based on an alternative decision fusion strategy, in which the outcome from the visual and speech modalities is chosen by measuring the quality of the audio based on kurtosis and skewness analysis and driven by white noise confusion. Finally, the performance of the new methods introduced in this work is evaluated using the CUAVE and LUNA-V data corpora under a range of different signal-to-noise ratio conditions using the NOISEX-92 dataset.
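A rough OpenCV sketch of the geometry pipeline (colour filter, contour following, convex hull, then height, width, ratio, area and perimeter) is given below. The HSV thresholds are illustrative stand-ins for the skin-colour filter described in the thesis, not its actual parameters.

```python
import cv2
import numpy as np

def lip_geometry_features(bgr_mouth_roi):
    """Extract simple geometric lip features from a cropped BGR mouth image."""
    hsv = cv2.cvtColor(bgr_mouth_roi, cv2.COLOR_BGR2HSV)
    # Crude "lip-ish" hue/saturation mask standing in for the colour filter
    mask = cv2.inRange(hsv, np.array([0, 60, 40]), np.array([15, 255, 255]))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    lip = max(contours, key=cv2.contourArea)            # largest blob assumed to be the lips
    hull = cv2.convexHull(lip)                          # convex hull of the lip border
    x, y, w, h = cv2.boundingRect(hull)
    return {"height": h, "width": w, "ratio": h / float(w),
            "area": cv2.contourArea(hull), "perimeter": cv2.arcLength(hull, True)}
```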
APA, Harvard, Vancouver, ISO, and other styles
21

Dew, Andrea M. "A study of computer-based visual feedback of speech for the hearing impaired." Thesis, University of Leeds, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.277195.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Zhang, Xianxian. "Robust speech processing based on microphone array, audio-visual, and frame selection for in-vehicle speech recognition and in-set speaker recognition." Diss., Connect to online resource, 2005. http://wwwlib.umi.com/cr/colorado/fullcit?p3190350.

Full text
APA, Harvard, Vancouver, ISO, and other styles
23

Martin, Claire. "Investigating the influence of natural variations in the quality of the visual image for visual and audiovisual speech recognition." Thesis, University of Nottingham, 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.395576.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Brady-Herbst, Brenene Marie. "An Analysis of Spondee Recognition Thresholds in Auditory-only and Audio-visual Conditions." PDXScholar, 1996. https://pdxscholar.library.pdx.edu/open_access_etds/5218.

Full text
Abstract:
To date there are no acceptable speechreading tests with normative or psychometric data indicating the test is a valid and reliable measure of speechreading assessment. Middlewerd and Plomp (1987) completed a study of speechreading assessment using sentences (auditory-only and auditory-visual) in the presence of background noise. Results revealed speech reception thresholds to be lower in the auditory-visual condition. Montgomery and Demorest (1988) concurred that these results were appealing, but unfortunately not efficient enough to be used clinically. The purpose of this study was to develop a clinically valid and reliable assessment of speechreading ability, following Middlewerd and Plomp's (1987) framework to achieve this goal. The method of obtaining a valid assessment tool was to define a group of stimuli that can be administered and scored to produce reliable data efficiently. Because spondaic words are accepted as a reliable method of clinically achieving speech reception thresholds, they were chosen as the stimuli in this study to develop an efficient clinical speechreading assessment tool. Ten subjects were presented with spondaic words in each of two conditions, auditory-only and auditory-visual, in the presence of background noise. The spondee words were randomized for each presentation to validate the data. A computerized presentation was used so that each subject received identical input. The computer also produced a performance-intensity function for each spondaic word. Results revealed an acceptable speech recognition threshold for 18 of the 36 spondee words in the auditory-only condition; 6 words were outside of one standard deviation; and the remaining 12 words did not produce obtainable thresholds. In the auditory-visual condition, all words except one had no obtainable threshold. Although these results invalidated the spondee words as acceptable stimuli, the study does validate the foundation for further research into different types of stimuli using this same framework.
APA, Harvard, Vancouver, ISO, and other styles
25

Lew, Kum Hoi Chantal. "Talker variability and the roles of configural and featuralinformation in visual and audiovisual speech recognition." Thesis, University of Nottingham, 2007. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.446370.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Silberer, Amanda Beth. "Importance of high frequency audibility on speech recognition with and without visual cues in listeners with normal hearing." Diss., University of Iowa, 2014. https://ir.uiowa.edu/etd/4755.

Full text
Abstract:
Purpose: To study the impact of visual cues, speech materials and age on the frequency bandwidth necessary for optimizing speech recognition performance in listeners with normal hearing. Method: Speech recognition abilities of adults and children with normal hearing were assessed using three speech perception tests that were low-pass (LP) filtered and presented in quiet and noise. The speech materials included the Multimodal Lexical Sentence Test (MLST) that was presented in auditory-only and auditory-visual modalities for the purpose of determining the listener's visual benefit. In addition, The University of Western Ontario Plurals Test (UWO) assessed listeners' ability to detect high frequency acoustic information (e.g., /s/ and /z/) in isolated words and The Maryland CNC test that assessed speech recognition performance using isolated single words. Speech recognition performance was calculated as percent correct and was compared across groups (children and adults), tests (MLST, UWO, and CNC) and conditions (quiet and noise). Results: Statistical analyses revealed a number of significant findings. The effect of visual cues was significant in adults and children. The type of speech material had significant impact on the frequency bandwidth required for adults and children to optimize speech recognition performance. The children required significantly more bandwidth to optimize performance than adults across speech perception tests and conditions of quiet and noise. Adults and children required significantly more bandwidth in noise than in quiet across speech perception tests. Conclusion: The results suggest that children and adults require significantly less bandwidth for optimizing speech recognition performance when assessed using sentence materials which provide visual cues. Children, however, showed less benefit from visual cues in the noise condition than adults. The amount of bandwidth required by both groups decreased as a function of the speech material. In other words, the more ecologically valid the speech material (e.g., sentences with visual cues versus single isolated words), the less bandwidth was required for optimizing performance. In all, the optimal bandwidth (except for the noise condition of the UWO test) is achievable with current amplification schemes.
APA, Harvard, Vancouver, ISO, and other styles
27

Beckmeyer, Cynthia S. "Comprehensive Evaluation of Non-Verbal Communication. A visual alternative to assist Alzheimer's patients' communication with their caregivers." University of Cincinnati / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1367927393.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Khaldieh, Salim Ahmad. "The role of phonological encoding (speech recoding) and visual processes in word recognition of American learners of Arabic as a foreign language." The Ohio State University, 1990. http://rave.ohiolink.edu/etdc/view?acc_num=osu1272465854.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Khaldieh, Salim Ahmad. "The role of phonological encoding (speech recoding) and visual processes in word recognition of American learners of Arabic as a foreign language /." The Ohio State University, 1990. http://rave.ohiolink.edu/etdc/view?acc_num=osu1487685204966592.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Lucey, Patrick Joseph. "Lipreading across multiple views." Queensland University of Technology, 2007. http://eprints.qut.edu.au/16676/.

Full text
Abstract:
Visual information from a speaker's mouth region is known to improve automatic speech recognition (ASR) robustness, especially in the presence of acoustic noise. Currently, the vast majority of audio-visual ASR (AVASR) studies assume frontal images of the speaker's face, which is a rather restrictive human-computer interaction (HCI) scenario. The lack of research into AVASR across multiple views has been dictated by the lack of large corpora that contain varying pose/viewpoint speech data. Recently, research has concentrated on recognising human behaviours within "meeting" or "lecture" type scenarios via "smart-rooms". This has resulted in the collection of audio-visual speech data which allows for the recognition of visual speech from both frontal and non-frontal views. Using this data, the main focus of this thesis was to investigate and develop various methods within the confines of a lipreading system which can recognise visual speech across multiple views. This research constitutes the first published work within the field which looks at this particular aspect of AVASR. The task of recognising visual speech from non-frontal views (i.e. profile) is in principle very similar to that of frontal views, requiring the lipreading system to initially locate and track the mouth region and subsequently extract visual features. However, this task is far more complicated than the frontal case, because the facial features required to locate and track the mouth lie in a much more limited spatial plane. Nevertheless, accurate mouth region tracking can be achieved by employing techniques similar to frontal facial feature localisation. Once the mouth region has been extracted, the same visual feature extraction process can take place as for the frontal view. A novel contribution of this thesis is to quantify the degradation in lipreading performance between the frontal and profile views. In addition to this, novel patch-based analysis of the various views is conducted, and as a result a novel multi-stream patch-based representation is formulated. Having a lipreading system which can recognise visual speech from both frontal and profile views is a novel contribution to the field of AVASR. However, given both the frontal and profile viewpoints, this begs the question: is there any benefit in having the additional viewpoint? Another major contribution of this thesis is an exploration of a novel multi-view lipreading system. This system shows that there does exist complementary information in the additional viewpoint (possibly that of lip protrusion), with superior performance achieved in the multi-view system compared to the frontal-only system. Even though having a multi-view lipreading system which can recognise visual speech from both frontal and profile views is very beneficial, it can hardly be considered realistic, as each particular viewpoint is dedicated to a single pose (i.e. front or profile). In an effort to make the lipreading system more realistic, a unified system based on a single camera was developed which enables a lipreading system to recognise visual speech from both frontal and profile poses. This is called pose-invariant lipreading. Pose-invariant lipreading can be performed on either stationary or continuous tasks.
Methods which effectively normalise the various poses into a single pose were investigated for the stationary scenario and, in another contribution of this thesis, an algorithm based on regularised linear regression was employed to project all the visual speech features into a uniform pose. This particular method is shown to be beneficial when the lipreading system is biased towards the dominant pose (i.e. frontal). The final contribution of this thesis is the formulation of a continuous pose-invariant lipreading system which contains a pose-estimator at the start of the visual front-end. This system highlights the complexity of developing such a system, as introducing more flexibility within the lipreading system invariably means the introduction of more error. All the work contained in this thesis presents novel and innovative contributions to the field of AVASR, and hopefully this will aid in the future deployment of an AVASR system in realistic scenarios.
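The pose-normalisation step based on regularised linear regression can be sketched as a ridge regression that maps profile-view features onto paired frontal-view features. The regularisation weight, feature dimensionality and the availability of paired training data are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge

def learn_pose_mapping(profile_feats, frontal_feats, alpha=1.0):
    """Learn a regularised linear mapping from profile-view visual speech features to
    the corresponding frontal-view features (rows are assumed to be paired frames)."""
    mapper = Ridge(alpha=alpha)           # regularised (ridge) linear regression
    mapper.fit(profile_feats, frontal_feats)
    return mapper

# Usage with random stand-in data (500 paired frames, 30-D features per view)
X_profile, X_frontal = np.random.randn(500, 30), np.random.randn(500, 30)
mapper = learn_pose_mapping(X_profile, X_frontal)
normalised = mapper.predict(np.random.randn(10, 30))   # profile frames projected to frontal pose
```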
APA, Harvard, Vancouver, ISO, and other styles
31

Fernández, López Adriana. "Learning of meaningful visual representations for continuous lip-reading." Doctoral thesis, Universitat Pompeu Fabra, 2021. http://hdl.handle.net/10803/671206.

Full text
Abstract:
In the last decades, there has been an increased interest in decoding speech exclusively using visual cues, i.e. mimicking the human capability to perform lip-reading, leading to Automatic Lip-Reading (ALR) systems. However, it is well known that access to speech through the visual channel is subject to many limitations when compared to the audio channel; it has been argued that humans can actually read around 30% of the information from the lips, and the rest is filled in from context. Thus, one of the main challenges in ALR resides in the visual ambiguities that arise at the word level, highlighting that not all sounds that we hear can be easily distinguished by observing the lips. In the literature, early ALR systems addressed simple recognition tasks such as alphabet or digit recognition but progressively shifted to more complex and realistic settings, leading to several recent systems that target continuous lip-reading. To a large extent, these advances have been possible thanks to the construction of powerful systems based on deep learning architectures that have quickly started to replace traditional systems. Although the recognition rates for continuous lip-reading may appear modest in comparison to those achieved by audio-based systems, the field has undeniably made a step forward. Interestingly, an analogous effect can be observed when humans try to decode speech: given sufficiently clean signals, most people can effortlessly decode the audio channel but would struggle to perform lip-reading, since the ambiguity of the visual cues makes the use of further context necessary to decode the message. In this thesis, we explore the appropriate modeling of visual representations with the aim of improving continuous lip-reading. To this end, we present different data-driven mechanisms to handle the main challenges in lip-reading related to the ambiguities or the speaker dependency of visual cues. Our results highlight the benefits of a proper encoding of the visual channel, for which the most useful features are those that encode corresponding lip positions in a similar way, independently of the speaker. This fact opens the door to i) lip-reading in many different languages without requiring large-scale datasets, and ii) increasing the contribution of the visual channel in audio-visual speech systems. On the other hand, our experiments identify a tendency to focus on the modeling of temporal context as the key to advancing the field, where there is a need for ALR models that are trained on datasets comprising large speech variability at several context levels. In this thesis, we show that both proper modeling of visual representations and the ability to retain context at several levels are necessary conditions for building successful lip-reading systems.
APA, Harvard, Vancouver, ISO, and other styles
32

Fong, Katherine KaYan. "IR-Depth Face Detection and Lip Localization Using Kinect V2." DigitalCommons@CalPoly, 2015. https://digitalcommons.calpoly.edu/theses/1425.

Full text
Abstract:
Face recognition and lip localization are two main building blocks in the development of audio visual automatic speech recognition systems (AV-ASR). In many earlier works, face recognition and lip localization were conducted in uniform lighting conditions with simple backgrounds. However, such conditions are seldom the case in real world applications. In this paper, we present an approach to face recognition and lip localization that is invariant to lighting conditions. This is done by employing infrared and depth images captured by the Kinect V2 device. First we present the use of infrared images for face detection. Second, we use the face’s inherent depth information to reduce the search area for the lips by developing a nose point detection. Third, we further reduce the search area by using a depth segmentation algorithm to separate the face from its background. Finally, with the reduced search range, we present a method for lip localization based on depth gradients. Experimental results demonstrated an accuracy of 100% for face detection, and 96% for lip localization.
APA, Harvard, Vancouver, ISO, and other styles
33

Yau, Wai Chee, and waichee@ieee org. "Video Analysis of Mouth Movement Using Motion Templates for Computer-based Lip-Reading." RMIT University. Electrical and Computer Engineering, 2008. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20081209.162504.

Full text
Abstract:
This thesis presents a novel lip-reading approach to classifying utterances from video data, without evaluating voice signals. This work addresses two important issues, namely the efficient representation of mouth movement for visual speech recognition and the temporal segmentation of utterances from video. The first part of the thesis describes a robust movement-based technique used to identify mouth movement patterns while uttering phonemes. This method temporally integrates the video data of each phoneme into a 2-D grayscale image named a motion template (MT). This is a view-based approach that implicitly encodes the temporal component of an image sequence into a scalar-valued MT. The data size was reduced by extracting image descriptors such as Zernike moments (ZM) and discrete cosine transform (DCT) coefficients from the MT. Support vector machines (SVM) and hidden Markov models (HMM) were used to classify the feature descriptors. A video speech corpus of 2800 utterances was collected for evaluating the efficacy of MT for lip-reading. The experimental results demonstrate the promising performance of MT in mouth movement representation. The advantages and limitations of MT for visual speech recognition were identified and validated through experiments. A comparison between ZM and DCT features indicates that the accuracy of classification for both methods is very comparable when there is no relative motion between the camera and the mouth. Nevertheless, ZM is resilient to rotation of the camera and continues to give good results despite rotation, but DCT is sensitive to rotation. DCT features are demonstrated to have better tolerance to image noise than ZM. The results also demonstrate a slight improvement of 5% using SVM as compared to HMM. The second part of this thesis describes a video-based, temporal segmentation framework to detect key frames corresponding to the start and stop of utterances from an image sequence, without using the acoustic signals. This segmentation technique integrates mouth movement and appearance information. The efficacy of this technique was tested through experimental evaluation and satisfactory performance was achieved. This segmentation method has been demonstrated to perform efficiently for utterances separated with short pauses. Potential applications for lip-reading technologies include human computer interfaces (HCI) for mobility-impaired users, defense applications that require voice-less communication, lip-reading mobile phones, in-vehicle systems, and improvement of speech-based computer control in noisy environments.
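A minimal sketch of the motion template idea, with low-order DCT coefficients kept as features, is shown below. The difference threshold, the normalisation of recency values and the number of retained coefficients are illustrative assumptions; the Zernike-moment branch and the SVM/HMM classifiers are omitted.

```python
import cv2
import numpy as np

def motion_template(gray_frames, threshold=25):
    """Accumulate frame differences into one grayscale image whose intensity encodes
    how recently each pixel moved (more recent motion is brighter)."""
    mt = np.zeros_like(gray_frames[0], dtype=np.float32)
    for t in range(1, len(gray_frames)):
        moving = cv2.absdiff(gray_frames[t], gray_frames[t - 1]) > threshold
        mt[moving] = t / float(len(gray_frames) - 1)
    return mt

def dct_features(mt, k=8):
    """Keep the low-frequency k x k block of the 2-D DCT of the motion template."""
    coeffs = cv2.dct(mt)
    return coeffs[:k, :k].flatten()
```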
APA, Harvard, Vancouver, ISO, and other styles
34

Besson, Gabriel. "Approche temporelle de la mémoire de reconnaissance visuelle et atteinte au stade prodromal de la maladie d'Alzheimer." Phd thesis, Aix-Marseille Université, 2013. http://tel.archives-ouvertes.fr/tel-00858502.

Full text
Abstract:
Visual recognition memory (VRM) is impaired early in Alzheimer's disease (AD). It is thought to rely on two processes: familiarity (the simple feeling of having already encountered an item) and recollection (retrieval of details associated with the item at encoding). While recollection is clearly impaired at the onset of AD, results concerning familiarity remain contradictory to date. Assumed to be faster than recollection, familiarity should be assessable directly through a temporal approach, and its impairment in AD could then be better understood. To test these hypotheses, the SAB (Speed and Accuracy Boosting) behavioural procedure was created. Designed to study the properties of VRM (its speed limit, Articles 1 and 2, or its "bottom-up" nature, Article 3) as well as the hypothesis that familiarity is faster than recollection, this method was shown to assess mainly familiarity (Article 1). In patients at risk of AD, an unexpected dissociation within familiarity was then revealed, with an impairment of late familiarity signals (used in a classical judgement) but a preservation of the earliest signals (supporting the rapid detection assessed with SAB) (Article 4). In addition, manual segmentation of MRI images of the medial temporal lobe (the first brain regions affected in AD, and key regions for VRM) was applied to the related question of the effect of age at the onset of AD (Article 5). Independently, these methods have improved our understanding of VRM and of its impairment at the onset of AD; their combination appears very promising.
APA, Harvard, Vancouver, ISO, and other styles
35

chum, Ting chia, and 丁家群. "Speech Recognition And Visual Basic." Thesis, 2003. http://ndltd.ncl.edu.tw/handle/49328880098397218445.

Full text
Abstract:
Master's thesis
I-Shou University
Department of Electronic Engineering
Academic year 91 (ROC calendar)
Whatever language a speech recognition system (SRS) is applied to, the essential method is almost the same apart from minor details; the main difference lies in the characteristics of each language. The pronunciation of Chinese characters, for example, is composed of an initial sound, a rhyme and a tone, so tone discrimination is necessary, whereas an English SRS recognizes whole words as units and intonation does not have to be distinguished. This thesis gives an introduction to speech recognition and implements a recognizer in Visual Basic, which runs under Windows and offers object-oriented capabilities. The sampled signals are pre-processed with endpoint detection and Hamming windowing. Hidden Markov models (HMMs) are used as the recognition and classification tool: characteristic parameters are extracted from each speech signal to build the reference database, and recognition results are then obtained with the Viterbi algorithm. Taking the digits 0-9 as the recognition targets, the number of HMM states and the sampling of speech frames are varied to demonstrate how parameters such as the number of states and the number of sampled speech frames influence performance.
APA, Harvard, Vancouver, ISO, and other styles
36

Abreu, Hélder Paulo Monteiro. "Visual speech recognition for European Portuguese." Master's thesis, 2014. http://hdl.handle.net/1822/37465.

Full text
Abstract:
Master's dissertation in Informatics Engineering
Speech recognition based on visual features began in the early 1980s, embedded in audio-visual speech recognition systems. In fact, the initial purpose of the use of visual cues was to increase the robustness of automatic speech recognition systems, which rapidly lose accuracy in noisy environments. However, the potential to keep a good accuracy whenever the use of an acoustic stream is excluded, and in any other situations where a human lip reader would be needed, led researchers to create and explore the Visual Speech Recognition (VSR) field. Traditional VSR systems used only RGB information, following a unimodal approach, since the addition of other visual modalities could be expensive and present synchronization issues. The release of the Microsoft Kinect sensor brought new possibilities for the speech recognition fields. This sensor includes a microphone array, an RGB camera and a depth sensor. Furthermore, all its input modalities can be synchronized using the features of its SDK. Recently, Microsoft released the new Kinect One, offering a better camera and a different and improved depth sensing technology. This thesis sets the hypothesis that, using the available input HCI modalities of such a sensor, such as RGB video and depth, as well as the skeletal tracking features available in the SDK, and by adopting a multimodal VSR articulatory approach, we can improve the word recognition rate of a VSR system, compared to a unimodal approach using only RGB data. Regarding the feature extraction process, recent approaches based on articulatory features have shown promising results when compared to standard shape-based viseme approaches. In this thesis, we also aim to verify the hypothesis that an articulatory VSR system can outperform a shape-based approach in terms of word recognition rate. The VSR system developed in this thesis, named ViKi (Visual Speech Recognition for Kinect), achieved a 68% word recognition rate in a scenario where 8 speakers pronounced a vocabulary of 25 isolated words, outperforming our tested unimodal approach. The use of depth information proved to increase the system accuracy, both for the articulatory (+8%) and the shape-based approach (+2%). In a speaker-dependent context, ViKi also achieved an interesting average accuracy of ≈70%. The articulatory approach performed worse than the shape-based one, reaching 34% word accuracy, contrary to previous research based on appearance approaches and not confirming our third hypothesis.
APA, Harvard, Vancouver, ISO, and other styles
37

Hill, Brian, and 廖峻廷. "Robust Speech Recognition Integrating Visual Information." Thesis, 1997. http://ndltd.ncl.edu.tw/handle/97538191028447078081.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Wei, Chun-Chuan, and 魏俊全. "Discriminative Analysis on Visual Features for Mandarin Speech Recognition." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/09955904216369882293.

Full text
Abstract:
Master's thesis
National Taiwan University of Science and Technology
Department of Information Management
Academic year 97 (ROC calendar)
Visual features can improve the performance of a speech recognition system in noisy environments. However, it is hard to achieve acceptable performance in a multi-word recognition task using visual features alone, since the speech information delivered by visual features is less than that delivered by acoustic features. In this work, we apply a model-distance measurement to visual models to understand the discriminability of visual features. We then construct a pair-wise recognition task over Chinese syllable pairs. According to the analysis of model distance and recognition error, we find the discriminative pairs of Chinese syllables. The experimental results show that the average error rate of this pair-wise task is 10.47%, and that 18.17% of the model pairs have an error rate lower than 2.5%. The model distance is highly correlated with the recognition error. Comparing with the analysis of audio features, we find model pairs that are more discriminative in visual features than in audio features.
APA, Harvard, Vancouver, ISO, and other styles
39

Makkook, Mustapha. "A Multimodal Sensor Fusion Architecture for Audio-Visual Speech Recognition." Thesis, 2007. http://hdl.handle.net/10012/3065.

Full text
Abstract:
A key requirement for developing any innovative system in a computing environment is to integrate a sufficiently friendly interface with the average end user. Accurate design of such a user-centered interface, however, means more than just the ergonomics of the panels and displays. It also requires that designers precisely define what information to use and how, where, and when to use it. Recent advances in user-centered design of computing systems have suggested that multimodal integration can provide different types and levels of intelligence to the user interface. The work of this thesis aims at improving speech recognition-based interfaces by making use of the visual modality conveyed by the movements of the lips. Designing a good visual front end is a major part of this framework. For this purpose, this work derives the optical flow fields for consecutive frames of people speaking. Independent Component Analysis (ICA) is then used to derive basis flow fields. The coefficients of these basis fields comprise the visual features of interest. It is shown that using ICA on optical flow fields yields better classification results than the traditional approaches based on Principal Component Analysis (PCA). In fact, ICA can capture higher order statistics that are needed to understand the motion of the mouth. This is due to the fact that lip movement is complex in nature, as it involves large image velocities, self occlusion (due to the appearance and disappearance of the teeth) and a lot of non-rigidity. Another issue that is of great interest to audio-visual speech recognition system designers is the integration (fusion) of the audio and visual information into an automatic speech recognizer. For this purpose, a reliability-driven sensor fusion scheme is developed. A statistical approach is developed to account for the dynamic changes in reliability. This is done in two steps. The first step derives suitable statistical reliability measures for the individual information streams. These measures are based on the dispersion of the N-best hypotheses of the individual stream classifiers. The second step finds an optimal mapping between the reliability measures and the stream weights that maximizes the conditional likelihood. For this purpose, genetic algorithms are used. The addressed issues are challenging problems and are substantial for developing an audio-visual speech recognition framework that can maximize the information gathered about the words uttered and minimize the impact of noise.
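The visual front end described above (dense optical flow between consecutive mouth-region frames followed by ICA to obtain basis flow fields) can be sketched as follows. The Farneback parameters and the number of independent components are arbitrary illustrative choices, not the thesis's settings.

```python
import cv2
import numpy as np
from sklearn.decomposition import FastICA

def flow_fields(gray_frames):
    """Compute dense optical flow between consecutive frames and flatten each field."""
    fields = []
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        fields.append(flow.reshape(-1))          # (H, W, 2) flow field -> one row vector
    return np.array(fields)

def ica_flow_features(gray_frames, n_components=10):
    """Fit ICA on the stacked flow fields; the mixing coefficients per frame pair
    serve as visual features (requires more frame pairs than components)."""
    X = flow_fields(gray_frames)
    ica = FastICA(n_components=n_components, random_state=0)
    return ica.fit_transform(X)
```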
APA, Harvard, Vancouver, ISO, and other styles
40

Liang, Shin-Hwei, and 梁欣蕙. "Feature-Based Visual Speech Recognition Using Time-Delay Neural Network." Thesis, 1997. http://ndltd.ncl.edu.tw/handle/10921510042799247430.

Full text
Abstract:
Master's thesis
National Chiao Tung University
Department of Control Engineering
Academic year 85 (ROC calendar)
An automatic mouth feature detection and mouth motion recognition technique for visual speech recognition is proposed in this thesis. This technique consists of three stages: human mouth detection and extraction, mouth feature detection, and neural network learning. In the mouth detection stage, the first step is to find the locations of human faces without any constraints on the users, for the sake of practicability. The Hough transform is used here for determining candidate face locations in complex environments. We simplify it to a three-dimensional search and redefine the search region using the symmetry property of the human face. Then, a Mouth Detection Algorithm (MDA) is proposed to verify the mouth location, and the next three procedures are normalization, adjustment, and template matching for the candidate mouth images. After these processes only one mouth image is treated as the winner among the candidate mouth images. In the mouth feature detection stage, one procedure searches for the mouth corners and a refined Mouth Feature Searching Algorithm (MFSA) is used to locate four points on the two lips. These four points play an important role in our system since two parabolas can be approximated using the mouth corners and these points. Finally, a precise mouth model is established after calculating the two parabolas and selecting eleven features from the mouth model as the input patterns for the classifier. In the last stage, a TDNN is used as our classifier due to its tolerance to time shifting. We have conducted many experiments to decide which kinds of features are crucial and sufficient for the lip-reading system. The off-line recognition rate reaches 90% in the speaker-dependent case in our experiments. Two other methods are compared with our system, and we find that our method reaches better performance than the other two methods with less memory space and training time. Finally, we generalize our system to a six-speaker system to verify the robustness of our method. The experimental results show the stability and practicability of the proposed approach.
APA, Harvard, Vancouver, ISO, and other styles
41

Liao, Wen-Yuan, and 廖文淵. "A Study on Audio-Visual Feature Extraction for Mandarin Digit Speech Recognition." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/46704732964354703864.

Full text
Abstract:
Doctoral dissertation
Tatung University
Department of Computer Science and Engineering
97
In recent years, many machine speechreading systems that combine audio and visual speech features have been proposed. For all such systems, the objective is to improve recognition accuracy, particularly in difficult conditions. This thesis presents a Mandarin audio-visual recognition system that achieves better recognition rates in noisy conditions as well as for speech spoken with emotional expression. We first extract visual features of the lips, including geometric and motion features. These features are particularly important to the recognition system in noisy conditions or in the presence of emotional effects. The motion features are obtained by an automatic face feature extractor followed by a fast motion feature extractor, and we compare system performance when using motion versus geometric features. For classification, we propose the weighted-discrete KNN (WD-KNN) and compare it with two popular classifiers, the GMM and the HMM, evaluating their performance on a Mandarin audio-visual speech corpus. We find that the WD-KNN is well suited to Mandarin speech because of the monosyllabic nature of Mandarin and its low computational cost. Experimental results for the different classifiers at various SNR levels show that the WD-KNN classifier yields better recognition accuracy than the other classifiers on the Mandarin speech corpus used. Several weighting functions for the weighted KNN classifier were also studied, including linear distance weighting, inverse distance weighting, rank weighting, and a reverse Fibonacci weighting function. The overall results show that the WD-KNN classifier with the reverse Fibonacci weighting function achieves the highest recognition rate among the three extended versions of KNN. Finally, we perform emotional speech recognition experiments. The results show that recognition is more robust when visual information is included: the audio-visual speech recognition system achieves a higher recognition rate when visual cues are incorporated.
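For illustration, a minimal Python sketch of a weighted KNN vote with a reverse Fibonacci weighting function, in the spirit of the WD-KNN classifier described above, might look as follows (the distance metric, the discretization step, and all names are assumptions, not the thesis implementation):

import numpy as np
from collections import defaultdict

def reverse_fibonacci_weights(k):
    # Fibonacci sequence reversed so the nearest neighbour gets the largest vote.
    fib = [1, 1]
    while len(fib) < k:
        fib.append(fib[-1] + fib[-2])
    return np.array(fib[:k][::-1], dtype=float)

def wd_knn_predict(train_X, train_y, x, k=5):
    # train_X: (n_samples, n_features) feature matrix, train_y: class labels,
    # x: a single test feature vector.
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dists)[:k]                  # indices of the k closest samples
    votes = defaultdict(float)
    for w, idx in zip(reverse_fibonacci_weights(k), nearest):
        votes[train_y[idx]] += w
    return max(votes, key=votes.get)                 # class with the largest weighted vote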
APA, Harvard, Vancouver, ISO, and other styles
42

Frisky, Aufaclav Zatu Kusuma, and 柯奧福. "Visual Speech Recognition and Password Verification Using Local Spatiotemporal Features and Kernel Sparse Representation Classifier." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/03868492706552896766.

Full text
Abstract:
Master's thesis
National Central University
Department of Computer Science and Information Engineering (in-service master's program)
103
Visual speech recognition (VSR) applications play an important role in various aspects of human life, with research efforts directed at recognition systems for security, biometrics, and human-machine interaction. In this thesis, we propose two lip-based systems. In the first, we propose a letter recognition system using spatiotemporal feature descriptors. The system adopts non-negative matrix factorization (NMF) to reduce feature dimensionality and a kernel sparse representation classifier for the classification step. Local texture and local temporal features represent the visual lip data. The visual lip data are first preprocessed by enhancing image contrast and then used for feature extraction. In our experiments, promising accuracies of 67.13%, 45.37%, and 63.12% are achieved in the semi-speaker-dependent, speaker-independent, and speaker-dependent settings on the AVLetters database. We also compare our method with others on the AVLetters 2 database: using the same configuration, our method achieves accuracy rates of 89.02% for the speaker-dependent case and 25.9% for the speaker-independent case, showing that it outperforms the others under the same configuration. In the second system, we propose a new approach to lip-based passwords for home entrance security, using confidence points in a home automation system. We also propose new features based on a modified version of the spatiotemporal descriptors, adopting L2-Hellinger normalization, and use two-dimensional semi non-negative matrix factorization (2D Semi-NMF) for dimensionality reduction. For classification, we propose a forward-backward kernel sparse representation classifier (FB-KSRC). Our experimental results show that the system classifies passwords robustly. We apply the system to the AVLetters 2 dataset: using ten visual passwords, each a combination of five letters from AVLetters 2, and testing all combinations, the results show that the system verifies passwords very well. In the complexity experiment, classification time is also reasonable for real-world deployment.
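A hedged sketch of the first system's classification pipeline is given below in Python: NMF (via scikit-learn) reduces the spatiotemporal features, and a plain sparse representation classifier assigns the class with the smallest reconstruction residual. This uses a linear rather than the kernel variant described in the abstract, and all names (nmf_src_classify, train_X, alpha) are illustrative assumptions:

import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import Lasso

def nmf_src_classify(train_X, train_y, test_x, n_components=40, alpha=0.01):
    # train_X must be non-negative (e.g. spatiotemporal texture histograms);
    # rows are training samples, train_y holds their class labels.
    nmf = NMF(n_components=n_components, init="nndsvda", max_iter=500)
    D = nmf.fit_transform(train_X)                   # reduced training features
    x = nmf.transform(test_x.reshape(1, -1))[0]      # reduced test feature

    # Sparse code of the test sample over the training "dictionary".
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    lasso.fit(D.T, x)
    code = lasso.coef_

    # Assign the class whose training atoms reconstruct the sample best.
    residuals = {}
    for c in np.unique(train_y):
        mask = (train_y == c)
        residuals[c] = np.linalg.norm(x - D[mask].T @ code[mask])
    return min(residuals, key=residuals.get)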
APA, Harvard, Vancouver, ISO, and other styles
43

Rigas, Dimitrios I., and M. Alsuraihi. "A Toolkit for Multimodal Interface Design: An Empirical Investigation." 2007. http://hdl.handle.net/10454/3156.

Full text
Abstract:
This paper introduces a comparative multi-group study carried out to investigate the use of multimodal interaction metaphors (visual, oral, and aural) for improving learnability (or usability from first-time use) of interface-design environments. An initial survey gathered views on the effectiveness of, and satisfaction with, employing speech and speech recognition to solve some common usability problems. The investigation then proceeded empirically by testing the usability parameters efficiency, effectiveness, and satisfaction for three design toolkits (TVOID, OFVOID, and MMID) built especially for the study. TVOID and OFVOID interacted with the user visually only, using typical and time-saving interaction metaphors; the third environment, MMID, added another modality through vocal and aural interaction. The results showed that using vocal commands and the mouse concurrently to complete tasks on first use was more efficient and more effective than using visual-only interaction metaphors.
APA, Harvard, Vancouver, ISO, and other styles
44

McIvor, Tom. "Continuous speech recognition : an analysis of its effect on listening comprehension, listening strategies and notetaking : a thesis presented in part fulfilment of the requirements for the degree of Doctorate in Education, Massey University." 2006. http://hdl.handle.net/10179/1471.

Full text
Abstract:
This thesis presents an investigation into the effect of Liberated Learning Technology (LLP) on academic listening comprehension, notetaking, and listening strategies in an English as a foreign language (L2) context. Two studies are reported: an exploratory study and a subsequent main study. The exploratory study was undertaken to determine L2 and native speaker (L1) students' perceptions of the effectiveness of the technology for academic listening and notetaking. The main study took a more focused approach, extending the exploratory study, which was conducted in an authentic lecture context, in order to gather data measuring listening comprehension and notetaking quality. The participants in the main study were six L2 students, five of whom intended to go to university. A multimethod methodology was used: data were gathered from notetaking samples, protocol analysis, email responses, and a questionnaire. Results indicated that continuous speech recognition (CSR) has the potential to support the listening comprehension and notetaking abilities of L2 students, as well as to facilitate metacognitive listening strategy use and enhance affective factors in academic listening. However, as CSR is an innovative technology, it must first meet a number of challenges before its full potential can be realized. Recommendations for future research and potential innovative uses for the technology are therefore discussed. This thesis contributes to L2 academic listening and notetaking measurement in two areas: 1. the measurement of LLP-supported notetaking; and 2. the measurement of LLP-supported academic listening comprehension.
APA, Harvard, Vancouver, ISO, and other styles
45

Chen, Yi-Ling, and 陳怡伶. "The Relationship between Recognition and Phonological Awareness, Naming Speed, Visual Skills for Reading-disabled Readers." Thesis, 2005. http://ndltd.ncl.edu.tw/handle/96447031387736628310.

Full text
Abstract:
Master's thesis
National University of Tainan
Master's program, Department of Special Education
93
The purpose of this research is to examine how PA (phonological awareness), NS (naming speed), and VS (visual skills) influence character recognition in Chinese RDs (reading-disabled readers). By comparing RDs with two contrasting groups, the differences between RDs and normal readers in these three abilities can be identified. In addition, for RDs' recognition of phonetic versus nonphonetic compound characters, the study further analyzed whether the three abilities differ in predictive power. The ninety subjects were chosen from second and fourth graders of four elementary schools in Kaohsiung City and were divided into an RD group and two control groups: age-matched (AM) fourth graders and reading-level-matched (RM) second graders. All subjects received four tests, covering PA, NS, VS, and Chinese character recognition. The data were analyzed with descriptive statistics, one-way analysis of variance, product-moment correlation, and stepwise multiple regression. The main findings were as follows: 1. RDs differ from the AM fourth graders in PA, NS, VS, and recognition. 2. RDs differ from the RM group in PA, but not in VS, NS, or recognition. 3. Recognition ability is related to PA, NS, and VS. 4. PA predicts recognition for RDs, whereas NS predicts recognition for the second graders. 5. For RDs, PA predicts the recognition of phonetic compounds and nonphonetic compounds differently. Theoretical and practical implications, as well as suggestions for future research, are discussed in the thesis.
APA, Harvard, Vancouver, ISO, and other styles
46

Fortier-St-Pierre, Simon. "La dynamique spatio-temporelle de l’attention en lecture chez les dyslexiques." Thesis, 2019. http://hdl.handle.net/1866/24649.

Full text
Abstract:
Dyslexia is a neurodevelopmental disorder that affects the normal development of reading fluency. Deficits in basic reading processes may affect dyslexics and thereby alter high-level word representations: orthographic, phonological, and semantic. One of these basic processes is the attentional mechanism involved in the visual processing of horizontal multi-element strings such as words. The effectiveness of this mechanism could be closely related to reading expertise in normal readers, and anomalies in it could be observed in dyslexics. It remains unclear, however, how attention is deployed during visual word recognition and how it may impact reading speed and, potentially, certain language skills. The first article of this thesis aims to shed light on divergences in the deployment of attention through time and space during the recognition of familiar words in a group of adults with dyslexia compared to normal readers. The groups were matched in terms of age and intellectual functioning, and the objective was pursued with the attentional probe technique. Results reveal that fewer attentional resources are directed to the first letter of a word in dyslexics, which is suboptimal considering that the first letter of a word has a higher diagnostic value than any other letter position. The goal of the second article is to determine whether reading fluency and phonological awareness in dyslexics may benefit from a short attentional training. The effects of an active training using the NeuroTracker program and a placebo training in adults with dyslexia show systematic gains immediately after the active training. The order of the trainings (active then placebo, or placebo then active) was counterbalanced across two groups. These gains are observed on reading speed as well as on phonological awareness. The third article of this thesis makes a significant additional contribution to the evaluation of reading speed among French-speaking Quebec university students. Test sentences from an existing tool (MNRead) were incorporated into a rapid serial visual presentation protocol to assess reading speed, and four additional sets of test sentences were standardized. The tool is reliable, as reading speed measurements are similar for the same individual at different times (Exp. 1); moreover, it meets various psychometric standards (Exps. 1 and 2) while being particularly sensitive to the reading difficulties found in dyslexics (Exp. 2). In sum, it appears that particular visual-attention processes underlie reading expertise and that these show anomalies in dyslexics. The characterization of a suboptimal attention deployment in visual word recognition, together with the benefits obtained in reading and phonological awareness following attentional training, highlights the importance of these basic processes in reading.
APA, Harvard, Vancouver, ISO, and other styles