Dissertations / Theses on the topic 'Lipreading'
Consult the top 45 dissertations / theses for your research on the topic 'Lipreading.'
Lucey, Patrick Joseph. "Lipreading across multiple views." Thesis, Queensland University of Technology, 2007. https://eprints.qut.edu.au/16676/1/Patrick_Joseph_Lucey_Thesis.pdf.
MacLeod, A. "Effective methods for measuring lipreading skills." Thesis, University of Nottingham, 1987. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.233400.
MacDermid, Catriona. "Lipreading and language processing by deaf children." Thesis, University of Surrey, 1991. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.291020.
Yuan, Hanfeng. "Tactual display of consonant voicing to supplement lipreading." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/87906.
Includes bibliographical references (p. 241-251).
This research is concerned with the development of tactual displays to supplement the information available through lipreading. Because voicing carries a high informational load in speech and is not well transmitted through lipreading, the efforts are focused on providing tactual displays of voicing to supplement the information available on the lips of the talker. This research includes exploration of 1) signal-processing schemes to extract information about voicing from the acoustic speech signal, 2) methods of displaying this information through a multi-finger tactual display, and 3) perceptual evaluations of voicing reception through the tactual display alone (T), lipreading alone (L), and the combined condition (L+T). Signal processing for the extraction of voicing information used amplitude-envelope signals derived from filtered bands of speech (i.e., envelopes derived from a lowpass-filtered band at 350 Hz and from a highpass-filtered band at 3000 Hz). Acoustic measurements made on the envelope signals of a set of 16 initial consonants represented through multiple tokens of C₁VC₂ syllables indicate that the onset-timing difference between the low- and high-frequency envelopes (EOA: envelope-onset asynchrony) provides a reliable and robust cue for distinguishing voiced from voiceless consonants. This acoustic cue was presented through a two-finger tactual display such that the envelope of the high-frequency band was used to modulate a 250-Hz carrier signal delivered to the index finger (250-I) and the envelope of the low-frequency band was used to modulate a 50-Hz carrier delivered to the thumb (50-T).
The temporal-onset order threshold for these two signals, measured with roving signal amplitude and duration, averaged 34 msec, which is sufficiently small for use of the EOA cue. Perceptual evaluations of the tactual display of EOA with speech signals indicated: 1) that the cue was highly effective for discrimination of pairs of voicing contrasts; 2) that the identification of 16 consonants was improved by roughly 15 percentage points with the addition of the tactual cue over L alone; and 3) that no improvements in L+T over L were observed for reception of words in sentences, indicating the need for further training on this task.
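The EOA measurement described above lends itself to a compact signal-processing illustration. The following minimal Python sketch (assuming numpy and scipy) derives the onset asynchrony between the two band envelopes; the 350 Hz and 3000 Hz band edges come from the abstract, while the filter order and the -20 dB onset criterion are illustrative assumptions, not the thesis's actual parameters.

```python
import numpy as np
from scipy.signal import butter, hilbert, sosfilt

def band_envelope(x, fs, cutoff, btype):
    """Amplitude envelope (Hilbert magnitude) of a filtered band."""
    sos = butter(4, cutoff, btype=btype, fs=fs, output='sos')
    return np.abs(hilbert(sosfilt(sos, x)))

def envelope_onset_asynchrony(x, fs, onset_db=-20.0):
    """EOA: onset of the low-frequency (voicing) envelope minus onset of
    the high-frequency envelope. Large positive values suggest a voiceless
    consonant (voicing starts late); values near zero suggest a voiced one."""
    lo = band_envelope(x, fs, 350.0, 'lowpass')    # band edges per the abstract
    hi = band_envelope(x, fs, 3000.0, 'highpass')
    def onset(env):
        level = env.max() * 10.0 ** (onset_db / 20.0)
        return np.argmax(env > level) / fs         # first crossing, in seconds
    return onset(lo) - onset(hi)
```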
Chiou, Greg I. "Active contour models for distinct feature tracking and lipreading." Thesis, University of Washington, 1995. http://hdl.handle.net/1773/6023.
Kaucic, Robert August. "Lip tracking for audio-visual speech recognition." Thesis, University of Oxford, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.360392.
Matthews, Iain. "Features for audio-visual speech recognition." Thesis, University of East Anglia, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.266736.
Thangthai, Kwanchiva. "Computer lipreading via hybrid deep neural network hidden Markov models." Thesis, University of East Anglia, 2018. https://ueaeprints.uea.ac.uk/69215/.
Hiramatsu, Sandra. "Does lipreading help word reading? An investigation of the relationship between visible speech and early reading achievement." Thesis, University of Washington, 2005. http://hdl.handle.net/1773/7913.
Divin, William. "The irrelevant speech effect, lipreading and theories of short-term memory." Thesis, University of Ulster, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.365401.
Full textAlness, Borg Axel, and Marcus Enström. "A study of the temporal resolution in lipreading using event vision." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-280325.
Machine analysis of visual features of the lips to extract spoken words amounts to finding patterns in movements, and earlier work has applied machine learning to this problem, using conventional frame-based cameras with good results. Classifying visual features is computationally expensive, and capturing just enough information can be important. Event cameras are a new type of camera, inspired by how human vision works, that capture only changes in the scene and offer very high temporal resolution. In this report we investigate the importance of temporal resolution in lipreading and whether an event camera can be used for lipreading. We observe a trend in which accuracy initially rises, peaks at a maximum, and then falls as the frame rate decreases. The study therefore concludes that, when a frame-based representation of event data is used, increasing the temporal resolution does not necessarily increase classification accuracy. It is difficult to be certain of this conclusion, however, because too many parameters can affect accuracy; for example, a higher temporal resolution demands a larger dataset, and the result depends on the parameters of the neural network used.
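The frame-based representation of event data that this conclusion hinges on can be sketched briefly. The hypothetical numpy fragment below accumulates an event stream into frames at a chosen rate; the (t, x, y, polarity) array layout and polarity encoding are assumptions for illustration, not the thesis's code.

```python
import numpy as np

def events_to_frames(events, sensor_hw, fps):
    """Accumulate DVS events into fixed-rate frames.

    events: (N, 4) array of (t_seconds, x, y, polarity in {-1, +1}),
    sorted by time. Lowering fps merges more events per frame, i.e. it
    trades temporal resolution for denser frames -- the trade-off the
    thesis investigates."""
    h, w = sensor_hw
    t0 = events[0, 0]
    n_frames = max(1, int(np.ceil((events[-1, 0] - t0) * fps)))
    frames = np.zeros((n_frames, h, w), dtype=np.float32)
    idx = np.minimum(((events[:, 0] - t0) * fps).astype(int), n_frames - 1)
    np.add.at(frames,
              (idx, events[:, 2].astype(int), events[:, 1].astype(int)),
              events[:, 3])
    return frames
```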
Gray, Michael Stewart. "Unsupervised statistical methods for processing of image sequences." Diss., University of California, San Diego, 1998. http://wwwlib.umi.com/cr/ucsd/fullcit?p9901442.
Dupuis, Karine. "Bimodal cueing in aphasia: the influence of lipreading on speech discrimination and language comprehension." Thesis, University of British Columbia, 2011. http://hdl.handle.net/2429/33791.
Full textZhou, Yichao. "Lip password-based speaker verification system with unknown language alphabet." HKBU Institutional Repository, 2018. https://repository.hkbu.edu.hk/etd_oa/562.
Montserrat, Maria Navarro. "The influence of situational cues on a standardized speechreading test." PDXScholar, 1985. https://pdxscholar.library.pdx.edu/open_access_etds/3546.
Nayfeh, Taysir H. "Multi-signal processing for voice recognition in noisy environments." Thesis, Virginia Polytechnic Institute and State University, 1991. http://scholar.lib.vt.edu/theses/available/etd-10222009-125021/.
Ho, Eve. "Speechreading abilities of Cantonese-speaking hearing-impaired children on consonants and words." Thesis, University of Hong Kong, 1997. http://sunzi.lib.hku.hk/hkuto/record/B36209454.
Full text"A dissertation submitted in partial fulfilment of the requirements for the Bachelor of Science (Speech and Hearing Sciences), The University of Hong Kong, April 30, 1997." Also available in print.
Liu, Xin. "Lip motion tracking and analysis with application to lip-password based speaker verification." HKBU Institutional Repository, 2013. http://repository.hkbu.edu.hk/etd_ra/1538.
Gorman, Benjamin Millar. "A framework for speechreading acquisition tools." Thesis, University of Dundee, 2018. https://discovery.dundee.ac.uk/en/studentTheses/fc05921f-024e-471e-abd4-0d053634a2e7.
Li, Meng. "On study of lip segmentation in color space." HKBU Institutional Repository, 2014. https://repository.hkbu.edu.hk/etd_oa/42.
Lees, Nicole C. "Vocalisations with a better view: hyperarticulation augments the auditory-visual advantage for the detection of speech in noise." Thesis, University of Western Sydney, 2007. http://handle.uws.edu.au:8081/1959.7/19576.
A thesis submitted to the University of Western Sydney, College of Arts, in fulfilment of the requirements for the degree of Doctor of Philosophy. Includes bibliography.
Habermann, Barbara L. "Speechreading ability in elementary school-age children with and without functional articulation disorders." PDXScholar, 1990. https://pdxscholar.library.pdx.edu/open_access_etds/4087.
Engelbrecht, Elizabeth M. "Die ontwikkeling van sosiale verhoudings van adolessente met ernstige gehoorverlies met hulle normaal horende portuurgroep [The development of social relationships between adolescents with severe hearing loss and their normal-hearing peers]." Pretoria: [s.n.], 2007. http://upetd.up.ac.za/thesis/available/etd-09122008-135458/.
Wagner, Jessica Lynn. "Exploration of Lip Shape Measures and their Association with Tongue Contact Patterns." Diss., Brigham Young University, 2005. http://contentdm.lib.byu.edu/ETD/image/etd984.pdf.
Lidestam, Björn. "Semantic Framing of Speech: Emotional and Topical Cues in Perception of Poorly Specified Speech." Doctoral thesis, Linköpings universitet, Institutionen för beteendevetenskap, 2003. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-6344.
Shou, Virginia. "WHAT?: Visual Interpretations of the Miscommunication Between the Hearing and Deaf." VCU Scholars Compass, 2013. http://scholarscompass.vcu.edu/etd/3125.
Horacio, Camila Paes. "Manifestações linguísticas em adultos com alterações no espectro da neuropatia auditiva [Linguistic manifestations in adults with auditory neuropathy spectrum disorder]." Universidade de São Paulo, 2010. http://www.teses.usp.br/teses/disponiveis/5/5143/tde-26082010-170001/.
Introduction: Postlinguistic neural hearing loss in adults can lead to speech alterations and to difficulties in the auditory discrimination of sounds and in comprehension of the message. Auditory neuropathy spectrum disorder (ANSD) is among the causes of neural hearing loss. Most studies on ANSD describe the standard for auditory diagnosis; however, descriptions of the consequences of this hearing impairment for communication, and of its implications for speech therapy, are scarce. It is therefore necessary to identify the specific language aspects to be assessed in neurologically impaired individuals through a directed assessment protocol, to allow the development of treatment guidelines. Objective: This study aimed to describe the linguistic manifestations in adults with ANSD. Methods: The study included adults diagnosed with ANSD who were literate and had no neurological or cognitive alterations. Data collection was carried out between 2007 and 2009 at the Speech, Language and Hearing service of the Clinic of Otorhinolaryngology of HCFMUSP. Twelve patients were selected, eight of them male (66.7%), with ages ranging from 18 to 50 years. An anamnesis protocol was designed covering education, use of hearing aids (HA) and specific hearing complaints. The assessment protocol consisted of tests of auditory reception and production of speech (phonemic identification; intelligibility; reading and text comprehension; and phonological awareness) and of expression (speech and elaboration). Stimuli were presented in auditory-only and in auditory-plus-visual mode (with lipreading). Results: The characteristics most often observed were male gender, incomplete primary schooling, use of hearing aids for less than three months in both ears, and difficulty hearing in noisy environments; dialogue was the communicative situation that caused the greatest difficulty in expression. A significant improvement in speech perception was observed in all tests when lipreading was available. Conclusions: The language specificities of individuals with ANSD were: low educational level; speech rate alterations; difficulty in text comprehension both through hearing and through reading; difficulty with phonological awareness; and improved repetition of words and phrases with lipreading (LR).
Charlier, Brigitte. "Le développement des représentations phonologiques chez l'enfant sourd: étude comparative du langage parlé complété avec d'autres outils de communication [The development of phonological representations in deaf children: a comparative study of Cued Speech with other communication tools]." Doctoral thesis, Universite Libre de Bruxelles, 1994. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/212631.
Bayard, Clémence. "Perception de la langue française parlée complétée: intégration du trio lèvres-main-son [Perception of Cued French Speech: integration of the lips-hand-sound trio]." Doctoral thesis, Universite Libre de Bruxelles, 2014. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/209168.
In this thesis we sought to objectify and characterise lip-hand integration in the perception of cued speech. Does the weight the perceptual system gives to manual information, on the one hand, and to labial information, on the other, depend on the quality of each? Does it vary with hearing status? When auditory information is available, how is the processing of manual information incorporated into audio-visual processing? To address this series of questions, five experimental paradigms were created and administered to deaf and normal-hearing adults who decode French Cued Speech (LPC).
The first three studies focused on the perception of cued speech without auditory information. The aim of study 1 was to objectify lip-hand integration; the impact of the quality of the labial information and of hearing status on this integration was also investigated. Study 2 examined the joint impact of the quality of the manual and labial information, and compared normal-hearing with deaf decoders. Finally, study 3 examined, in normal-hearing and deaf decoders, the effect of incongruence between labial and manual information on word perception.
The last two studies focused on the perception of cued speech with sound. Study 4 compared the impact of LPC on audio-visual integration between deaf and normal-hearing participants. Finally, study 5 compared the impact of LPC in deaf decoders with weak versus strong auditory recovery.
Our results confirm that the LPC code is genuinely anchored in speech, and show that the weight of each source of information within the integration process depends in particular on the quality of the manual stimulus, the quality of the labial stimulus, and the level of auditory performance.
Doctorate in Psychological and Educational Sciences.
Huyse, Aurélie. "Intégration audio-visuelle de la parole: le poids de la vision varie-t-il en fonction de l'âge et du développement langagier? [Audio-visual speech integration: does the weight of vision vary with age and language development?]" Doctoral thesis, Universite Libre de Bruxelles, 2012. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/209690.
During face-to-face conversation, perception of auditory speech is influenced by the visual speech cues contained in lip movements. Indeed, previous research has highlighted the ability of lip-reading to enhance and even modify speech perception. This phenomenon is known as audio-visual integration. The aim of this doctoral thesis is to study the possibility of modifying this audio-visual integration according to several variables. This work lies within the scope of an important debate between invariant versus subject-dependent audio-visual integration in speech processing. Each study of this dissertation investigates the impact of a specific variable on bimodal integration: the quality of the visual input, age of participants, the use of a cochlear implant, age at cochlear implantation and the presence of specific language impairments.
The paradigm used always consisted of a syllable identification task, where syllables were presented in three modalities: auditory only, visual only and audio-visual (congruent and incongruent). There was also a condition where the quality of the visual input was reduced, in order to prevent a lip-reading of good quality. The aim of each of the five studies was not only to examine whether performances were modified according to the variable under study but also to ascertain that differences were indeed issued from the integration process itself. Thereby, our results were analyzed in the framework of model predictive of audio-visual speech performance (weighted fuzzy-logical model of perception) in order to disentangle unisensory effects from audio-visual integration effects.
Taken together, our data suggest that speech integration is not automatic but rather depends on the context. We propose a new architecture of bimodal fusion that takes these considerations into account. Finally, there are also practical implications, suggesting the need to incorporate not only auditory but also visual exercises in the rehabilitation programs of older adults and of children with cochlear implants or with specific language impairments.
Doctorate in Psychological and Educational Sciences.
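The weighted fuzzy-logical model of perception used in these studies to generate predicted audio-visual scores can be stated compactly. The Python sketch below (assuming numpy) shows one common formulation, in which a modality weight enters as an exponent before multiplicative combination and normalisation; the exact weighting scheme used in the thesis may differ.

```python
import numpy as np

def weighted_flmp(a, v, w=0.5):
    """Predicted audio-visual response probabilities from auditory-only
    support `a` and visual-only support `v`, with modality weight `w`."""
    a = np.asarray(a, dtype=float)
    v = np.asarray(v, dtype=float)
    support = (a ** w) * (v ** (1.0 - w))
    return support / support.sum()

# Auditory evidence favours response 1, visual evidence favours
# response 2; the prediction reflects both, scaled by w.
print(weighted_flmp(a=[0.8, 0.2], v=[0.1, 0.9], w=0.4))
```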
Chang, Fang-Chen [昌芳騁]. "Lipreading System." Thesis, 1999. http://ndltd.ncl.edu.tw/handle/42032761223426994701.
Full textYuan, Hanfeng. "Tactual Display of Consonant Voicing to Supplement Lipreading." 2004. http://hdl.handle.net/1721.1/6568.
Full textThesis Supervisor: Nathaniel I. Durlach, Senior Research Scientist. Thesis Supervisor: Charlotte M. Reed, Senior Research Scientist.
Chang, Chih-Yu [張志瑜]. "A Lipreading System Based on Hidden Markov Model." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/68212276335580923113.
Tamkang University, Master's Program, Department of Electrical Engineering, ROC academic year 97 (2008–2009).
Conventional speech recognition systems are now used in many applications, but they are easily disturbed by acoustic noise, and their recognition rate drops in noisy conditions. Researchers have therefore proposed speech recognition systems that use only visual features, i.e., lipreading systems, which are immune to acoustic noise; such a system can also serve as an auxiliary component of a conventional recognizer to raise its recognition rate. In this research we propose a lipreading system in which lip-image segmentation is performed in a chromaticity colour space combined with the K-means algorithm, and a hidden Markov model serves as the recognizer to improve the recognition rate. In the experiments, our method is compared with other colour-based lip segmentation approaches, and recognition rates are compared across different features.
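The segmentation approach named in this abstract, clustering pixels in a chromaticity colour space with K-means, can be illustrated in a few lines. This is a hypothetical numpy sketch of the general idea, not the thesis's implementation; the two-cluster setup and the "redder cluster = lips" heuristic are assumptions.

```python
import numpy as np

def lip_mask_kmeans(rgb, n_iter=20, seed=0):
    """2-means clustering of pixels in chromaticity space
    (r, g) = (R, G) / (R + G + B); the redder cluster is taken as lips."""
    rgb = rgb.astype(np.float64)
    chroma = (rgb / (rgb.sum(axis=2, keepdims=True) + 1e-9))[..., :2]
    pts = chroma.reshape(-1, 2)
    rng = np.random.default_rng(seed)
    centers = pts[rng.choice(len(pts), size=2, replace=False)]
    for _ in range(n_iter):                       # plain Lloyd iterations
        d = ((pts[:, None, :] - centers[None]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = pts[labels == k].mean(axis=0)
    lips = centers[:, 0].argmax()                 # higher r-chromaticity
    return (labels == lips).reshape(rgb.shape[:2])
```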
Southard, Stuart D. "Speechreading's benefit to the recognition of sentences as a function of signal-to-noise ratio." Thesis, Florida State University, 2003. http://etd.lib.fsu.edu/theses/available/etd-11202003-175600/.
Advisor: Dr. Richard Morris, Florida State University, College of Communication, Dept. of Communication Disorders. Includes bibliographical references.
Mirus, Gene R. "The linguistic repertoire of deaf cuers: an ethnographic query on practice." Thesis, University of Texas at Austin, 2008. http://hdl.handle.net/2152/3889.
Lin, Wen-Chieh [林文杰]. "A Space-Time Delay Neural Network for Motion Recognition and Its Application to Lipreading in Bimodal Speech Recognition." Thesis, 1996. http://ndltd.ncl.edu.tw/handle/30448892490229517714.
National Chiao Tung University, Department of Control Engineering, ROC academic year 84 (1995–1996).
Motion recognition has received increasing attention in recent years, as the need for computer vision grows in domains such as surveillance systems, multimodal human-computer interfaces, and traffic control systems. Most existing approaches separate recognition into spatial feature extraction and time-domain recognition. However, we believe that the information of motion resides in the space-time domain, not in the time domain or the space domain alone, so it seems more reasonable to integrate feature extraction and classification over space and time together. We propose a Space-Time Delay Neural Network (STDNN) that can deal with 3-D dynamic information, such as motion recognition. For the motion recognition problem addressed here, the STDNN is a unified structure in which low-level spatiotemporal feature extraction and space-time recognition are embedded. It possesses the spatiotemporal shift-invariant recognition abilities inherited from the time delay neural network (TDNN) and the space displacement neural network (SDNN). Unlike the multilayer perceptron (MLP), TDNN, and SDNN, the STDNN is constructed from vector-type nodes and matrix-type links, so that spatiotemporal information can be gracefully represented in a neural network. Experiments were conducted to evaluate the performance of the proposed STDNN. In the moving Arabic numerals (MAN) experiments, which simulate an object's movement through the space-time domain by image sequences, the STDNN shows its generalization ability in spatiotemporally shift-invariant recognition. In the lipreading experiment, the STDNN recognizes lip motions from real image sequences, and performs better than the existing TDNN-based system, especially in generalization. Although lipreading is a specific application, the STDNN can be applied to other problems, since no domain-dependent knowledge is used in the experiments.
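The abstract's central idea — kernels replicated across both space and time, so that recognition is shift-invariant over the space-time volume — corresponds, in modern terms, to 3-D convolution. The numpy sketch below illustrates only that shared-kernel principle; it is a loose analogue, not the STDNN's vector-node/matrix-link formulation, and the ReLU nonlinearity is an assumption.

```python
import numpy as np

def spacetime_conv(x, kernels):
    """Apply the same kernels at every (t, y, x) offset of an image
    sequence. x: (T, H, W); kernels: (K, kt, kh, kw) -> (K, T', H', W')."""
    T, H, W = x.shape
    K, kt, kh, kw = kernels.shape
    out = np.empty((K, T - kt + 1, H - kh + 1, W - kw + 1))
    for k in range(K):
        for t in range(out.shape[1]):
            for i in range(out.shape[2]):
                for j in range(out.shape[3]):
                    patch = x[t:t + kt, i:i + kh, j:j + kw]
                    out[k, t, i, j] = max(np.sum(patch * kernels[k]), 0.0)
    return out
```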
Gritzman, Ashley Daniel. "Adaptive threshold optimisation for colour-based lip segmentation in automatic lip-reading systems." Thesis, University of the Witwatersrand, 2016. http://hdl.handle.net/10539/22664.
Having survived the ordeal of a laryngectomy, the patient must come to terms with the resulting loss of speech. With recent advances in portable computing power, automatic lip-reading (ALR) may become a viable approach to voice restoration. This thesis addresses the image-processing aspect of ALR and focuses on three contributions to colour-based lip segmentation. The first contribution concerns the colour transform used to enhance the contrast between the lips and skin. This thesis presents the most comprehensive study to date, measuring the overlap between lip and skin histograms for 33 different colour transforms. The hue component of HSV obtains the lowest overlap of 6.15%, and results show that selecting the correct transform can increase the segmentation accuracy by up to three times. The second contribution is the development of a new lip segmentation algorithm that utilises the best colour transforms from the comparative study. The algorithm is tested on 895 images and achieves a percentage overlap (OL) of 92.23% and a segmentation error (SE) of 7.39%. The third contribution focuses on the impact of the histogram threshold on segmentation accuracy, and introduces a novel technique called Adaptive Threshold Optimisation (ATO) to select a better threshold value. The first stage of ATO incorporates ε-SVR to train the lip shape model. ATO then uses feedback of shape information to validate and optimise the threshold. After applying ATO, the SE decreases from 7.65% to 6.50%, corresponding to an absolute improvement of 1.15 pp or a relative improvement of 15.1%. While this thesis concerns lip segmentation in particular, ATO is a threshold selection technique that can be used in various segmentation applications.
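ATO's feedback loop can be outlined in a short sketch. The hypothetical Python below (numpy plus matplotlib's rgb_to_hsv) sweeps candidate thresholds on the hue channel and keeps the one a shape model scores highest; `shape_score` stands in for the thesis's trained ε-SVR lip-shape model, and the red-hue heuristic is an illustrative assumption.

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv

def lips_by_hue(rgb01, threshold):
    """Threshold the HSV hue channel -- the transform found to give the
    lowest lip/skin overlap. Lip hues cluster near red, i.e. near hue 0
    (wrapping around from 1)."""
    hue = rgb_to_hsv(rgb01)[..., 0]
    dist_to_red = np.minimum(hue, 1.0 - hue)   # circular distance to red
    return dist_to_red < threshold

def adaptive_threshold(rgb01, candidates, shape_score):
    """Keep the candidate threshold whose mask best fits the lip-shape
    model (an epsilon-SVR in the thesis; any fitness function works here)."""
    scores = [shape_score(lips_by_hue(rgb01, t)) for t in candidates]
    return candidates[int(np.argmax(scores))]
```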
Hochstrasser, Daniel. "Investigating the effect of visual phonetic cues on the auditory N1 & P2." Thesis, 2017. http://hdl.handle.net/1959.7/uws:44884.
Tan, Sok Hui (Jessica). "Seeing a talking face matters to infants, children and adults: behavioural and neurophysiological studies." Thesis, 2020. http://hdl.handle.net/1959.7/uws:59610.
Goecke, Roland. "A stereo vision lip tracking algorithm and subsequent statistical analyses of the audio-video correlation in Australian English." PhD thesis, Australian National University, 2004. http://hdl.handle.net/1885/149999.
Fitzpatrick, Michael F. "Auditory and auditory-visual speech perception and production in noise in younger and older adults." Thesis, 2014. http://handle.uws.edu.au:8081/1959.7/uws:31936.
Beadle, Julianne M. "Contributions of visual speech, visual distractors, and cognition to speech perception in noise for younger and older adults." Thesis, 2019. http://hdl.handle.net/1959.7/uws:55879.