Dissertations / Theses on the topic 'Language identification'


Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Language identification.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Botha, Gerrit Reinier. "Text-based language identification for the South African languages." Pretoria : [s.n.], 2007. http://upetd.up.ac.za/thesis/available/etd-090942008-133715/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Yin, Bo. "Language identification with language and feature dependency." Thesis, University of New South Wales, Electrical Engineering & Telecommunications, 2009. http://handle.unsw.edu.au/1959.4/44045.

Full text
Abstract:
The purpose of Language Identification (LID) is to identify a specific language from a spoken utterance, automatically. Language-specific characteristics are always associated with different languages. Most existing LID approaches utilise a statistical modelling process with common acoustic/phonotactic features to model specific languages while avoiding any language-specific knowledge. Great successes have been achieved in this area over past decades. However, there is still a huge gap between these language-independent methods and the actual language-specific patterns. It is extremely useful to address these specific acoustic or semantic construction patterns, without spending huge labour on annotation which requires language-specific knowledge. Inspired by this goal, this research focuses on the language-feature dependency. Several practical methods have been proposed. Various features and modelling techniques have been studied in this research. Some of them carry additional language-specific information without manual labelling, such as a novel duration modelling method based on articulatory features, and a novel Frequency-Modulation (FM) based feature. The performance of each individual feature is studied for each of the language-pair combinations. The similarity between languages and the contribution in identifying a language by using a particular feature are defined for the first time, in a quantitative style. These distance measures and language-dependent contributions become the foundations of the later-presented frameworks: language-dependent weighting and hierarchical language identification. The latter particularly provides remarkable flexibility and enhancement when identifying a relatively large number of languages and accents, due to the fact that the most discriminative feature or feature-combination is used when separating each of the languages. The proposed systems are evaluated on various corpora and task contexts including NIST language recognition evaluation tasks. The performances have been improved to various degrees. The key techniques developed for this work have also been applied to solve a different problem other than LID: speech-based cognitive load monitoring.
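A minimal sketch of the language-dependent weighting idea described in this abstract, assuming per-feature log-likelihood scores are already available; all numbers and the weighting scheme are invented for illustration, not taken from the thesis:

```python
import numpy as np

# Hypothetical per-feature log-likelihood scores for one utterance:
# rows = candidate languages, columns = feature streams
# (e.g. acoustic, phonotactic, duration, FM-based).
scores = np.array([
    [-11.2, -9.8, -10.1],   # language A
    [-12.0, -9.1, -10.6],   # language B
    [-10.7, -10.5, -9.9],   # language C
])

# Language-dependent weights: one weight per (language, feature) pair,
# reflecting how discriminative each feature is for that language.
weights = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.2, 0.5],
])

fused = (weights * scores).sum(axis=1)  # weighted fusion per language
print(f"predicted language index: {int(np.argmax(fused))}")
```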
APA, Harvard, Vancouver, ISO, and other styles
3

Newman, Jacob Laurence. "Language identification using visual features." Thesis, University of East Anglia, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.539371.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Berkling, Kay Margarethe. "Automatic language identification with sequences of language-independent phoneme clusters." 1996. http://content.ohsu.edu/u?/etd,204.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Conti, Matteo. "Machine Learning Based Programming Language Identification." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2020. http://amslaurea.unibo.it/20875/.

Full text
Abstract:
The advent of the digital era has contributed to the development of new technological sectors, which, as a direct consequence, have created demand for new professional profiles capable of playing a key role in the process of technological innovation. The growth of this demand has particularly affected the software development sector, following the emergence of new programming languages and new fields in which to apply them. The main component of a piece of software is, in fact, its source code, which can be represented as an archive of one or more text files containing a series of instructions written in one or more programming languages. Although many of these languages are used in different technological sectors, it often happens that two or more of them share a very similar syntactic and semantic structure. Clearly, this aspect can generate confusion in identifying the language within a code fragment, especially if we consider the possibility that not even the file extension is specified. Indeed, to date, most of the code available online carries manually specified information about its programming language. In this thesis we focus on demonstrating that the programming language of a 'generic' source code file can be identified automatically using Machine Learning algorithms, without any 'a priori' assumption about the extension or any particular information unrelated to the content of the file. This project follows the line set by previous research based on the same approach, comparing different feature extraction techniques and classification algorithms with very different characteristics, and trying to optimize the feature extraction phase according to the model considered.
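As a rough illustration of the approach this abstract describes (content-only programming-language identification with machine learning), a hedged sketch using character n-grams and a Naive Bayes classifier from scikit-learn; the snippets, labels and feature settings are placeholders, not the thesis's actual pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training snippets; a real system would use many files per language.
snippets = [
    "def main():\n    print('hello')",            # Python
    "#include <stdio.h>\nint main(void) {}",      # C
    "function main() { console.log('hi'); }",     # JavaScript
    "public static void main(String[] args) {}",  # Java
]
labels = ["python", "c", "javascript", "java"]

# Character n-grams capture keywords, operators and punctuation patterns
# without relying on the file extension.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    MultinomialNB(),
)
clf.fit(snippets, labels)
print(clf.predict(["for (int i = 0; i < n; i++) printf(\"%d\", i);"]))
```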
APA, Harvard, Vancouver, ISO, and other styles
6

Munday, Emma Rachel. "Language and identification in contemporary Kazakhstan." Thesis, University of Edinburgh, 2010. http://hdl.handle.net/1842/6200.

Full text
Abstract:
In the years since the dissolution of the Soviet Union Central Asia has experienced wide-reaching and ongoing social change. The structures and values of all social strata have been questioned and re-evaluated in a continuing exploration of what it means to be part of the post-Soviet space. Within this space, identity formation and reformation has been a pre-eminent process for individuals, for groups of all kinds and for the newly emerging states and their leaders. Through the analysis of individual interviews and selected newspaper extracts and government policy documents this study explores the ways in which ethnic and state identities are being negotiated in Kazakhstan. Using the social identity theory framework it investigates the value and content of these identities by examining the state ideologies of language and the policies which are their expression as well as the discourses of language and identity engaged in by individuals and in the media. There is an exploration of common and conflicting themes referred to as aspects of these identities, of outgroups deemed relevant for comparison and of the roles of Kazakh and Russian in particular, alongside other languages, in relation to these identities. The study focuses on the availability to an individual of multiple possible identities of differing levels of inclusiveness. The saliency of a particular identity is demonstrated to vary according both to context and to the beliefs and goals of the individual concerned. The importance of discourse to processes of identity formation and maintenance is also described and the interaction between discourse and social context is highlighted. The ongoing construction of a Kazakhstani identity is described and the importance of group norms of hospitality, inclusiveness and interethnic accord observed. The sense of learning from other cultures and of mutual enrichment is also demonstrated. However, these themes exist in tension with those of Kazakhstan as belonging primarily to Kazakhs and of cultural oppression and loss. The multi-dimensional nature of ethnic identity is highlighted as is the difficulty, experienced by some, in maintaining a positive sense of ethnic group identity. Perceptions of the importance of language in the construction of ethnic and state identity are explored as are the tensions created by the ideological and instrumental values adhering to different languages in use in Kazakhstan.
APA, Harvard, Vancouver, ISO, and other styles
7

Nkadimeng, Calvin. "Language identification using Gaussian mixture models." Thesis, Stellenbosch : University of Stellenbosch, 2010. http://hdl.handle.net/10019.1/4170.

Full text
Abstract:
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2010.
The importance of Language Identification for African languages is seeing a dramatic increase due to the development of telecommunication infrastructure and, as a result, an increase in volumes of data and speech traffic in public networks. By automatically processing the raw speech data, the vital assistance given to people in distress can be sped up by referring their calls to a person knowledgeable in that language. To this effect a speech corpus was developed and various algorithms were implemented and tested on raw telephone speech data. These algorithms entailed data preparation, signal processing, and statistical analysis aimed at discriminating between languages. The statistical model of Gaussian Mixture Models (GMMs) was chosen for this research due to their ability to represent an entire language with a single stochastic model that does not require phonetic transcription. Language Identification for African languages using GMMs is feasible, although there are a few challenges to overcome, such as proper classification and an accurate study of the relationships between languages. Other methods that make use of phonetically transcribed data need to be explored and tested with the new corpus for the research to be more rigorous.
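A minimal sketch of the GMM-based back-end such a system uses: one mixture model per language, scored by average per-frame log-likelihood. The features here are synthetic stand-ins for MFCC-like vectors; the component count and covariance type are illustrative assumptions, not the thesis's settings:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-ins for per-frame acoustic features; in practice these would be
# extracted from labelled telephone speech for each language.
train = {
    "lang_a": rng.normal(0.0, 1.0, size=(500, 13)),
    "lang_b": rng.normal(0.5, 1.2, size=(500, 13)),
}

# One GMM per language: a single stochastic model of the whole language,
# requiring no phonetic transcription.
models = {
    lang: GaussianMixture(n_components=8, covariance_type="diag",
                          random_state=0).fit(feats)
    for lang, feats in train.items()
}

def identify(utterance_feats: np.ndarray) -> str:
    # Score the utterance under each language model; average per-frame
    # log-likelihood, highest score wins.
    return max(models, key=lambda l: models[l].score(utterance_feats))

test = rng.normal(0.5, 1.2, size=(200, 13))
print(identify(test))  # likely "lang_b" for this synthetic data
```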
APA, Harvard, Vancouver, ISO, and other styles
8

Avenberg, Anna. "Automatic language identification of short texts." Thesis, Uppsala universitet, Avdelningen för beräkningsvetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-421032.

Full text
Abstract:
The world is growing more connected through the use of online communication, exposing software and humans to all the world's languages. While devices are able to understand and share raw data between themselves and with humans, the information itself is not expressed in a monolithic format. This causes issues both in human-to-computer interaction and human-to-human communication. Automatic language identification (LID) is a field within artificial intelligence and natural language processing that strives to solve a part of these issues by identifying languages from text, sign language and speech. One of the challenges is to identify the short pieces of text that can be found online, such as messages, comments and posts on social media, due to the small amount of information they carry. The goal of this thesis has been to build a machine learning model that can identify the language of these short pieces of text. A long short-term memory (LSTM) machine learning model was built and benchmarked against Facebook's fastText model. The results show that the LSTM model reached an accuracy of around 95% and the fastText model used as comparison reached an accuracy of 97%. The LSTM model struggled more when identifying texts shorter than 50 characters than with longer text. The classification performance of the LSTM model was also relatively poor in cases where languages were similar, like Croatian and Serbian. Both the LSTM model and the fastText model reached accuracies above 94%, which can be considered high, depending on how it is evaluated. There are however many improvements and possible directions for future work to be considered: looking further into texts shorter than 50 characters, evaluating the model's softmax output vector values, and how to handle similar languages.
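For illustration, a hedged sketch of a character-level LSTM classifier of the kind the thesis benchmarks, written in PyTorch; the dimensions, vocabulary handling and untrained toy batch are assumptions, not the thesis's architecture:

```python
import torch
import torch.nn as nn

class CharLSTMClassifier(nn.Module):
    """Minimal character-level LSTM for short-text language ID."""
    def __init__(self, vocab_size: int, n_langs: int,
                 emb_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_langs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, (h, _) = self.lstm(self.emb(x))  # h: final hidden state
        return self.out(h[-1])              # logits over languages

# Toy usage: a batch of 2 texts, each encoded as 50 character ids.
model = CharLSTMClassifier(vocab_size=256, n_langs=5)
batch = torch.randint(1, 256, (2, 50))
logits = model(batch)
print(logits.argmax(dim=1))  # predicted language indices (untrained)
```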
APA, Harvard, Vancouver, ISO, and other styles
9

Foran, Jeffrey (Jeffrey Matthew) 1977. "Missing argument referent identification in natural language." Thesis, Massachusetts Institute of Technology, 1999. http://hdl.handle.net/1721.1/80532.

Full text
Abstract:
Thesis (S.B. and M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999.
Includes bibliographical references (p. 54-55).
by Jeffrey Foran.
S.B. and M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
10

Gambardella, Maria-Elena. "Cleartext detection and language identification in ciphers." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-446439.

Full text
Abstract:
In historical cryptology, cleartext represents text written in a known language in a cipher (a hand-written manuscript aiming at hiding the content of a message). Cleartext can give us a historical interpretation and contextualisation of the manuscript and could help researchers in cryptanalysis, but to this day there is still no research on how to automatically detect cleartext and identify its language. In this paper, we investigate to what extent we can automatically distinguish cleartext from ciphertext in transcribed historical ciphers and to what extent we are able to identify its language. We took a rule-based approach and ran 7 different models using historical language models on ciphertexts provided by the DECRYPT-Project. Our results show that using unigrams and bigrams on a word level combined with 3-grams, 4-grams and 5-grams on a character level is the best approach to tackle cleartext detection.
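A toy sketch of the character n-gram scoring underlying such rule-based cleartext detection; the Latin training sample, the 3-gram choice and the absence of a decision threshold are illustrative assumptions:

```python
from collections import Counter

def char_ngrams(text: str, n: int):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Hypothetical training text in the candidate cleartext language.
latin_sample = "in nomine domini amen anno domini millesimo"
lm = Counter(char_ngrams(latin_sample, 3))
total = sum(lm.values())

def score(token: str) -> float:
    """Average 3-gram relative frequency; higher = more language-like."""
    grams = char_ngrams(token, 3)
    if not grams:
        return 0.0
    return sum(lm[g] / total for g in grams) / len(grams)

# Tokens from a transcribed cipher: likely cleartext scores higher than
# ciphertext symbol sequences.
for token in ["domini", "xqzvkp"]:
    print(token, round(score(token), 4))
```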
APA, Harvard, Vancouver, ISO, and other styles
11

Williams, A. Lynn, and Carol Stoel-Gammon. "Identification of Speech-language Disorders in Toddlers." Digital Commons @ East Tennessee State University, 2016. https://dc.etsu.edu/etsu-works/2038.

Full text
Abstract:
This session is developed by, and presenters invited by, Speech Sound Disorders in Children and Language in Infants Toddlers and Preschoolers. This invited session provides an overview of early speech/language development with a focus on identifying delay/disorders in toddlers. Types of speech/language behaviors in prelinguistic/ early linguistic development that serve as “red flags” for possible disorders will be discussed. The need for developmentally appropriate assessments will be highlighted.
APA, Harvard, Vancouver, ISO, and other styles
12

Yang, Xi. "Discriminative acoustic and sequence models for GMM based automatic language identification /." View abstract or full-text, 2007. http://library.ust.hk/cgi/db/thesis.pl?ECED%202007%20YANG.

Full text
APA, Harvard, Vancouver, ISO, and other styles
13

Vindfallet, Vegar Enersen. "Language Identification Based on Detection of Phonetic Characteristics." Thesis, Norges teknisk-naturvitenskapelige universitet, Institutt for elektronikk og telekommunikasjon, 2012. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-19506.

Full text
Abstract:
This thesis has taken a closer look at the implementation of the back-end of a language recognition system. The front-end of the system is a Universal Attribute Recognizer (UAR), which is used to detect phonetic characteristics in an utterance. When a speech signal is sent through the UAR, it is decoded into a sequence of attributes which is used to generate a vector of term counts. Vector Space Modeling (VSM) has been used for training the language classifiers in the back-end. The main principle of VSM is that term-count vectors from the same language will position themselves close to each other when they are mapped into a vector space, and this property can be exploited for recognizing languages. The implemented back-end has trained vector space classifiers for 12 different languages, and a NIST recognition task has been performed for evaluating the recognition rate of the system. The NIST task was a verification task and the system achieved an equal error rate (EER) of 6.73%. Tools like Support Vector Machines (SVM) and Gaussian Mixture Models (GMM) have been used in the implementation of the back-end. Thus, there are quite a few parameters which can be varied and tweaked, and different experiments were conducted to investigate how these parameters would affect the EER of the language recognizer. To test the robustness of the system, the language recognizer was exposed to a so-called out-of-set language, which is a language that the system has not been trained to handle. The system showed a poor performance at rejecting these speech segments correctly.
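A minimal sketch of a VSM back-end in this spirit: term-count vectors over decoded attribute sequences, separated by a linear SVM. The attribute strings and labels are invented placeholders, not the output of an actual universal attribute recognizer:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Hypothetical attribute sequences, one decoded string per utterance
# (tokens stand in for attribute labels from the UAR front-end).
utterances = [
    "stop vowel fricative nasal vowel stop",
    "vowel glide vowel nasal nasal vowel",
    "fricative stop vowel stop vowel fricative",
    "nasal vowel glide vowel vowel nasal",
]
langs = ["lang_a", "lang_b", "lang_a", "lang_b"]

# Term-count vectors over attribute n-grams place utterances of the same
# language close together in the vector space; an SVM separates them.
backend = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LinearSVC(),
)
backend.fit(utterances, langs)
print(backend.predict(["vowel nasal vowel glide vowel nasal"]))
```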
APA, Harvard, Vancouver, ISO, and other styles
14

del Castillo Iglesias, Daniel. "End-to-end Learning for Singing-Language Identification." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-277837.

Full text
Abstract:
Singing-language identification (SLID) consists in identifying the language of the sung lyrics directly from a given music recording. This task is of special interest to music-streaming businesses who benefit from music localization applications. However, language is a complex semantic quality of music recordings, making the finding and exploiting of its characteristic features extremely challenging. In recent years, most Music Information Retrieval (MIR) research efforts have been directed to problems that are not related to language, and most of the progress in speech recognition methods stays far from musical applications. This work investigates the SLID problem, its challenges and limitations, with the aim of finding a novel solution that effectively leverages the power of deep learning architectures and a relatively large-scale private dataset. As part of the dataset pre-processing, a novel method for identifying the high-level structure of songs is proposed. As the classifier model, a Temporal Convolutional Network (TCN) is trained and evaluated on music recordings belonging to seven of the most prominent languages in the global music market. Although results show much lower performance with respect to the current state of the art, a thorough discussion is realized with the purpose of exploring the limitations of SLID, identifying the causes of the poor performance, and expanding the current knowledge about the SLID problem. Future improvements and lines of work are delineated, attempting to stimulate further research in this direction.
APA, Harvard, Vancouver, ISO, and other styles
15

Hubeika, Valiantsina. "Intersession Variability Compensation in Language and Speaker Identification." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2008. http://www.nusl.cz/ntk/nusl-235432.

Full text
Abstract:
Channel and session variability is a very important problem in the task of speaker recognition. A number of techniques for channel compensation have been presented in recent scientific papers. Channel compensation can be implemented in the model domain as well as in the feature and score domains. A relatively new and powerful technique is so-called eigenchannel adaptation for GMMs (Gaussian Mixture Models). The disadvantage of this method is that it cannot be applied to other classifiers, such as SVMs (Support Vector Machines), GMMs with a different number of Gaussian components, or speech recognition using hidden Markov models (HMMs). A solution is an approximation of this method: eigenchannel adaptation in the feature domain. Both techniques, eigenchannel adaptation in the model domain and in the feature domain, are presented in this work in the context of speaker recognition systems. After good results were achieved in speaker recognition, the benefit of these techniques was examined for an acoustic language recognition system covering 14 languages. In this task, not only channel variability but also speaker variability has an adverse effect. Results are presented on the data defined for the 2006 speaker recognition evaluation and the 2007 language recognition evaluation, both organized by the American National Institute of Standards and Technology (NIST).
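As a heavily simplified, hedged sketch of feature-domain channel compensation: real eigenchannel adaptation estimates the channel factor from GMM posterior statistics, whereas this toy version uses a plain least-squares projection of the session mean onto an assumed channel subspace:

```python
import numpy as np

def compensate(features: np.ndarray, U: np.ndarray) -> np.ndarray:
    """Crude feature-domain channel compensation: estimate the channel
    factor as the least-squares projection of the session mean onto the
    subspace U, then subtract the estimated offset from every frame."""
    mean = features.mean(axis=0)
    # channel factor x minimising ||mean - U x||
    x, *_ = np.linalg.lstsq(U, mean, rcond=None)
    return features - U @ x

rng = np.random.default_rng(1)
U = rng.normal(size=(13, 2))        # assumed eigenchannel directions
feats = rng.normal(size=(300, 13))  # per-frame features for one session
print(compensate(feats, U).mean(axis=0).round(2))
```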
APA, Harvard, Vancouver, ISO, and other styles
16

Nariyama, Shigeko. "Referent identification for ellipted arguments in Japanese." Connect to thesis, 2000. http://repository.unimelb.edu.au/10187/2870.

Full text
Abstract:
Nominal arguments, such as the subject and the object, are not grammatically required to be overt in Japanese, and are frequently unexpressed, approximately 50% of the time in written narrative texts. Despite this high frequency of ellipsis, Japanese is not equipped with such familiar devices as the cross-referencing systems and verbal inflections commonly found in pro-drop languages for referent identification. Yet the mechanisms governing argument ellipsis have been little explicated. This thesis elucidates the linguistic mechanisms with which to identify the referents of ellipted arguments.
These mechanisms stem from three tiers of linguistic system. Each sentence is structured in such a way as to anchor the subject (using Sentence devices following the principle of direct alignment), with argument-inferring cues on the verbal predicate (using Predicate devices). These subject-oriented sentences are cohesively sequenced with the topic as a pivot (using Discourse devices). It is this topicalised subject which is most prone to ellipsis. I develop an algorithm summing up these mechanisms, using naturally occurring texts. I demonstrate how it can detect the existence of ellipsis in sentences and track its referential identity.
A generalisation for ellipsis resolution and the way in which the algorithm is constituted is as follows. Sentence devices formulate sentences to make the subject most prone to ellipsis; discourse devices enable the interaction of wa (the topic marker) and ga (the nominative marker), which mark the majority of subjects, to provide the default reading for referent identification of ellipsis; and predicate devices furnish additional cues to verify that reading. Since Japanese is an SOV language, it is intuitively tenable from the perspective of language processing that the interplay of wa/ga representing subjects gives initial cues, which are then verified by predicate devices. This multiple layering of mechanisms, therefore, can determine referents for ellipted arguments more accurately.
APA, Harvard, Vancouver, ISO, and other styles
17

Samperio, Sanchez Nahum. "General learning strategies : identification, transfer to language learning and effect on language achievement." Thesis, University of Southampton, 2016. https://eprints.soton.ac.uk/412008/.

Full text
Abstract:
Each learner has a set repertoire of general learning strategies that he or she uses regardless of the learning context. The purpose of this study is to identify the general learning strategies that beginner learners of English have in their repertoire, the transfer of such strategies to language learning and the predictive value they have for language achievement. It is also intended to discover the effect that the teaching of infrequently used general learning strategies has on learners' language achievement, and to identify possible differences in strategy types and frequency of strategy use between low and high strategy users as well as high and low achievers among beginner English language learners. This study followed a mixed-methods research methodology, collecting numerical data by means of a 51-item general strategies questionnaire (Martinez-Guerrero 2004) applied in two administrations. The sample consists of 118 beginner English language learners in a language center at a northern Mexican university. Data were analyzed with the SPSS and Excel software. The qualitative data were collected through twenty individual semi-structured interviews; furthermore, three one-hour-forty-minute strategy instruction sessions were included as the treatment. Quantitative results show that learners make more frequent use of Achievement Motivation, Cognitive and Concentration strategies, and less frequent use of Study, Study Organization, and Interaction in Class strategies. Qualitative findings indicate that learners use Study, Study Organization and Concentration strategies largely in both general learning and language learning. Qualitative data complement and extend the quantitative data gathered in the questionnaire. No significant differences were found in the type of strategies that learners use in general learning contexts and language learning, which suggests that learners transfer their learning strategies from their general strategy repertoire to language learning as the first tools to deal with language learning tasks. A positive correlation was found between learning strategy use and language achievement test scores. Achievement test scores were primarily predicted by the use of Achievement Motivation and Interaction in Class strategies, and to a lesser extent by affective and study strategies. Strategy instruction sessions produced no significant increase in the adoption and use of strategies. Furthermore, high and low achievers and strategy users seem to use the same type of strategies; the frequency of strategy use and how they use the strategies represented the difference between types of learners. Finally, a number of language learning strategies that learners use in language learning emerge from the qualitative data. The pedagogical implications of the findings of this study provide a potential framework to help not only teachers but also institutions in identifying and teaching new and specific learning strategies.
APA, Harvard, Vancouver, ISO, and other styles
18

Knudson, Ryan Charles. "Automatic Language Identification for Metadata Records: Measuring the Effectiveness of Various Approaches." Thesis, University of North Texas, 2015. https://digital.library.unt.edu/ark:/67531/metadc801895/.

Full text
Abstract:
Automatic language identification has been applied to short texts such as queries in information retrieval, but it has not yet been applied to metadata records. Applying this technology to metadata records, particularly their title elements, would enable creators of metadata records to obtain a value for the language element, which is often left blank due to a lack of linguistic expertise. It would also enable the addition of the language value to existing metadata records that currently lack a language value. Titles lend themselves to the problem of language identification mainly due to their shortness, a factor which increases the difficulty of accurately identifying a language. This study implemented four proven approaches to language identification as well as one open-source approach on a collection of multilingual titles of books and movies. Of the five approaches considered, a reduced N-gram frequency profile and distance measure approach outperformed all others, accurately identifying over 83% of all titles in the collection. Future plans are to offer this technology to curators of digital collections for use.
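A short sketch of the reduced n-gram frequency profile and distance measure approach this abstract describes (in the spirit of Cavnar and Trenkle's out-of-place measure); the profile size, n-gram range and toy training strings are illustrative assumptions:

```python
from collections import Counter

def profile(text: str, max_n: int = 3, size: int = 300):
    """Ranked list of the most frequent character n-grams: a reduced
    n-gram frequency profile."""
    counts = Counter()
    padded = f" {text.lower()} "
    for n in range(1, max_n + 1):
        counts.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [g for g, _ in counts.most_common(size)]

def out_of_place(doc_profile, lang_profile) -> int:
    """Sum of rank displacements; n-grams missing from the language
    profile receive the maximum penalty."""
    rank = {g: i for i, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))

# Toy language profiles; real ones would be built from large corpora.
langs = {
    "english": profile("the quick brown fox jumps over the lazy dog"),
    "german": profile("der schnelle braune fuchs springt über den faulen hund"),
}
title = "the identification of language"
print(min(langs, key=lambda l: out_of_place(profile(title), langs[l])))
```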
APA, Harvard, Vancouver, ISO, and other styles
19

Rupe, Jonathan C. "Vision-based hand shape identification for sign language recognition /." Link to online version, 2005. https://ritdml.rit.edu/dspace/handle/1850/940.

Full text
APA, Harvard, Vancouver, ISO, and other styles
20

Strømhaug, Tommy. "Discriminating Music,Speech and other Sounds and Language Identification." Thesis, Norwegian University of Science and Technology, Department of Computer and Information Science, 2008. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-8953.

Full text
Abstract:

The tasks of discriminating music, speech and other sounds and of language identification have a broad range of applications in today's multilingual multimedia community. Both tasks offered many possibilities regarding methods and development tools, which also brings some risk. The Language Identification (LID) problem ended up with two different approaches. One approach was discarded due to poor results in the pre-study, while the other approach had some promising potential but did not deliver as hoped in the first place. On the other hand, the music/speech discrimination was solved with great accuracy using 3 simple time-domain features and Support Vector Machines (SVM). Adding 'other sounds' to this discrimination problem did complicate it, but the final solution delivered great results using the enormous BBC Sound Effects library as examples of non-speech and non-music. Both tasks were approached using Gaussian Mixture Models (GMM) because of their known ability to model arbitrary feature space segmentations. The tools used were Matlab together with a number of different toolboxes explained further in the text.
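A hedged sketch of the music/speech discrimination recipe: a few time-domain features per clip fed to an SVM. The abstract does not name its three features, so zero-crossing rate and short-time energy statistics are assumed here, and the clips are synthetic stand-ins for audio:

```python
import numpy as np
from sklearn.svm import SVC

def time_domain_features(signal: np.ndarray, frame: int = 512) -> np.ndarray:
    """Three simple time-domain features per clip: zero-crossing rate,
    short-time energy mean, and short-time energy variance."""
    frames = signal[: len(signal) // frame * frame].reshape(-1, frame)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0)
    energy = (frames ** 2).mean(axis=1)
    return np.array([zcr, energy.mean(), energy.var()])

rng = np.random.default_rng(2)
# Stand-ins for audio clips; real input would be PCM samples.
clips = [rng.normal(size=8192) * a for a in (0.1, 1.0, 0.2, 0.9)]
labels = ["speech", "music", "speech", "music"]

X = np.array([time_domain_features(c) for c in clips])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict([time_domain_features(rng.normal(size=8192))]))
```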

APA, Harvard, Vancouver, ISO, and other styles
21

Peyton, Kari C. "Literacy programs identification and assessment of English language learners /." Menomonie, WI : University of Wisconsin--Stout, 2007. http://www.uwstout.edu/lib/thesis/2007/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Clark, Jessica Celeste. "Automated Identification of Adverbial Clauses in Child Language Samples." Diss., Brigham Young University, 2009. http://contentdm.lib.byu.edu/ETD/image/etd2803.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
23

Brown, Brittany Cheree. "Automated Identification of Adverbial Clauses in Child Language Samples." BYU ScholarsArchive, 2013. https://scholarsarchive.byu.edu/etd/3404.

Full text
Abstract:
Adverbial clauses are grammatical constructions that are of relevance in both typical language development and impaired language development. In recent years, computer software has been used to assist in the automated analysis of clinical language samples. This software has attempted to accurately identify adverbial clauses with limited success. The present study investigated the accuracy of software for the automated identification of adverbial clauses. Two separate collections of language samples were used. One collection included 10 children with language impairment, with ages ranging from 7;6 to 11;1 (years;months), 10 age-matched peers, and 10 language-matched peers. A second collection contained 30 children ranging from 2;6 to 7;11 in age, with none considered to have language or speech impairments. Language sample utterances were manually coded for the presence of adverbial clauses (both finite and non-finite). Samples were then automatically tagged using the computer software. Results were tabulated and compared for accuracy. ANOVA revealed differences in frequencies of so-adverbial clauses whereas ANCOVA revealed differences in frequencies of both types of finite adverbial clauses. None of the structures were significantly correlated with age; however, frequencies of both types of finite adverbial clauses were correlated with mean length of utterance. Kappa levels revealed that agreement between manual and automated coding was high on both types of finite adverbial clauses.
APA, Harvard, Vancouver, ISO, and other styles
24

Michaelis, Hali Anne. "Automated Identification of Relative Clauses in Child Language Samples." BYU ScholarsArchive, 2009. https://scholarsarchive.byu.edu/etd/1997.

Full text
Abstract:
Previously existing computer analysis programs have been unable to correctly identify many complex syntactic structures thus requiring further manual analysis by the clinician. Complex structures, including the relative clause, are of interest in child language samples due to the difference in development between children with and without language impairment. The purpose of this study was to assess the comparability of results from a new automated program, Cx, to results from manual identification of relative clauses. On language samples from 10 children with language impairment (LI), 10 language matched peers (LA), and 10 chronologically age matched peers (CA), a computerized analysis based on probabilities of sequences of grammatical markers agreed with a manual analysis with a Kappa of 0.88.
APA, Harvard, Vancouver, ISO, and other styles
25

Manning, Britney Richey. "Automated Identification of Noun Clauses in Clinical Language Samples." BYU ScholarsArchive, 2009. https://scholarsarchive.byu.edu/etd/2197.

Full text
Abstract:
The identification of complex grammatical structures including noun clauses is of clinical importance because differences in the use of these structures have been found between individuals with and without language impairment. In recent years, computer software has been used to assist in analyzing clinical language samples. However, this software has been unable to accurately identify complex syntactic structures such as noun clauses. The present study investigated the accuracy of new software, called Cx, in identifying finite wh- and that-noun clauses. Two sets of language samples were used. One set included 10 children with language impairment, 10 age-matched peers, and 10 language-matched peers. The second set included 40 adults with mental retardation. Levels of agreement between computerized and manual analysis were similar for both sets of language samples; Kappa levels were high for wh-noun clauses and very low for that-noun clauses.
APA, Harvard, Vancouver, ISO, and other styles
26

Ehlert, Erika E. "Automated Identification of Relative Clauses in Child Language Samples." BYU ScholarsArchive, 2013. https://scholarsarchive.byu.edu/etd/3615.

Full text
Abstract:
Relative clauses are grammatical constructions that are of relevance in both typical and impaired language development. Thus, the accurate identification of these structures in child language samples is clinically important. In recent years, computer software has been used to assist in the automated analysis of clinical language samples. However, this software has had only limited success when attempting to identify relative clauses. The present study explores the development and clinical importance of relative clauses and investigates the accuracy of the software used for automated identification of these structures. Two separate collections of language samples were used. The first collection included 10 children with language impairment, ranging in age from 7;6 to 11;1 (years;months), 10 age-matched peers, and 10 language-matched peers. A second collection contained 30 children considered to have typical speech and language skills and who ranged in age from 2;6 to 7;11. Language samples were manually coded for the presence of relative clauses (including those containing a relative pronoun, those without a relative pronoun and reduced relative clauses). These samples were then tagged using computer software and finally tabulated and compared for accuracy. ANCOVA revealed a significant difference in the frequency of relative clauses containing a relative pronoun but not for those without a relative pronoun nor for reduced relative clauses. None of the structures were significantly correlated with age; however, frequencies of both relative clauses with and without relative pronouns were correlated with mean length of utterance. Kappa levels revealed that agreement between manual and automated coding was relatively high for each relative clause type and highest for relative clauses containing relative pronouns.
APA, Harvard, Vancouver, ISO, and other styles
27

Lareau, Jonathan. "Application of shifted delta cepstral features for GMM language identification /." Electronic version of thesis, 2006. https://ritdml.rit.edu/dspace/handle/1850/2686.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Ford, George Harold. "Spoken Language Identification from Processing and Pattern Analysis of Spectrograms." NSUWorks, 2014. http://nsuworks.nova.edu/gscis_etd/152.

Full text
Abstract:
Prior speech and linguistics research has focused on the use of phoneme recognition in speech, and the use of phonemes in the formulation of recognizable words, to determine language identification. Some languages have additional phoneme sounds, which can help identify a language; however, most phonemes are common to a wide variety of languages. Legacy approaches recognize strings of phonemes as syllables, used in dictionary queries to see if a word can be found that uniquely identifies a language. This dissertation research considers an alternative means of determining language identification of speech data based solely on analysis of frequency-domain data. An analytical approach to speech language identification by three comparative techniques is performed. First, a character-based pattern analysis is performed using the Rix and Forster algorithm to replicate their research on language identification. Second, techniques of phoneme recognition and their relative pattern of occurrence in speech samples are measured for performance in their ability to identify a language using the Rix and Forster approach. Finally, an experiment using statistical analysis of time-ensemble frequency spectrum data is assessed for its ability to establish spectral patterns for language identification, along with its performance. This novel approach is applied to spectrogram audio data using pattern analysis techniques for language identification. It applies the Rix and Forster method to the ensemble of spectral frequencies used over the duration of a speech waveform. This novel approach is compared to the applications of the Rix and Forster algorithm to character-based and phoneme symbols for language identification on the basis of statistical accuracy, processing time requirements, and spatial processing resource needs. The audio spectrum analysis also demonstrates the ability to perform speaker identification using the same techniques performed for language identification. The results of this research demonstrate the efficacy of audio frequency-domain pattern analysis applied to speech waveform data. It provides an efficient technique for language identification without reliance upon linguistic approaches using phonemes or word derivations. This work also demonstrates a quick, automated means by which information gatherers, travelers, and diplomatic officials might obtain rapid language identification, supporting time-critical determination of appropriate translator resource needs.
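A rough sketch of the time-ensemble spectral statistics idea: average the spectrogram over time into one frequency profile per recording and match profiles by cosine similarity. The sampling rate, window settings and synthetic signals are assumptions, not the dissertation's setup:

```python
import numpy as np
from scipy.signal import spectrogram

def spectral_profile(samples: np.ndarray, fs: int = 8000) -> np.ndarray:
    """Time-ensemble statistics of the frequency spectrum: the mean
    magnitude in each frequency bin over the whole utterance."""
    _, _, sxx = spectrogram(samples, fs=fs, nperseg=256)
    prof = sxx.mean(axis=1)              # average over time
    return prof / np.linalg.norm(prof)   # length-normalise

rng = np.random.default_rng(3)
# Stand-ins for speech in two languages; real profiles would be averaged
# over many utterances per language.
profiles = {
    "lang_a": spectral_profile(rng.normal(size=16000)),
    "lang_b": spectral_profile(rng.normal(size=16000) * 0.5),
}

test = spectral_profile(rng.normal(size=16000))
print(max(profiles, key=lambda l: float(profiles[l] @ test)))  # cosine match
```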
APA, Harvard, Vancouver, ISO, and other styles
29

Rock, Jonna. "Intergenerational Memory, Language and Jewish Identification of the Sarajevo Sephardim." Doctoral thesis, Humboldt-Universität zu Berlin, 2019. http://dx.doi.org/10.18452/19793.

Full text
Abstract:
This study analyzes issues of language and Jewish identification pertaining to the Sephardim in Sarajevo. Complexity of the Sarajevo Sephardi history means that I explore Bosnia-Herzegovina/Yugoslavia, Israel and Spain as possible identity-creating factors for the Sephardim in Sarajevo today. My findings show that the elderly Sephardic generation insist on calling their language Serbo-Croatian, whereas the younger generations do not really know what language they speak – and laugh about the linguistic situation in Sarajevo, or rely on made-up categories such as ‘Sarajevan.’ None of the interviewees emphasize the maintenance of Judeo-Spanish as a crucial condition for the continuation of Sephardic culture in Sarajevo. Similarly, the celebration of Jewish holidays is more important for the maintenance of identity across the generations than speaking a Jewish language. At the same time, the individuals also assert alternative forms of being Bosnian, ones that encompass multiple ethnicities and religious ascriptions. All the youngest interviewees however fear that the Sarajevo Sephardic identity will disappear in a near future. Unique characteristics of Sarajevo Sephardim include the status of the Sephardim and minorities in Bosnia and Herzegovina given (1) the discriminatory Bosnian Constitution; (2) the absence of a law in Bosnia on the return of property; (3) the special situation wherein three major ethnic groups, and not just a single, ethnically homogeneous ‘majority,’ dominate the country; (4) the lack of a well-developed Jewish cultural infrastructure. Despite all of this, a rapprochement between the Sarajevo Jewish Community members and their religion and tradition is taking place. This phenomenon is partly attributable to the Community’s young religious activist and chazan, Igor Kožemjakin, who has attracted younger members to the religious services.
APA, Harvard, Vancouver, ISO, and other styles
30

Wong, Kim-Yung Eddie. "Automatic spoken language identification utilizing acoustic and phonetic speech information." Thesis, Queensland University of Technology, 2004. https://eprints.qut.edu.au/37259/1/Kim-Yung_Wong_Thesis.pdf.

Full text
Abstract:
Automatic spoken Language Identification (LID) is the process of identifying the language spoken within an utterance. The challenge that this task presents is that no prior information is available indicating the content of the utterance or the identity of the speaker. The trend of globalization and the pervasive popularity of the Internet will amplify the need for the capabilities spoken language identification systems provide. A prominent application arises in call centers dealing with speakers speaking different languages. Another important application is to index or search huge speech data archives and corpora that contain multiple languages. The aim of this research is to develop techniques targeted at producing a fast and more accurate automatic spoken LID system compared to the previous National Institute of Standards and Technology (NIST) Language Recognition Evaluation. Acoustic and phonetic speech information are targeted as the most suitable features for representing the characteristics of a language. To model the acoustic speech features a Gaussian Mixture Model based approach is employed. Phonetic speech information is extracted using existing speech recognition technology. Various techniques to improve LID accuracy are also studied. One approach examined is the employment of Vocal Tract Length Normalization to reduce the speech variation caused by different speakers. A linear data fusion technique is adopted to combine the various aspects of information extracted from speech. As a result of this research, a LID system was implemented and presented for evaluation in the 2003 Language Recognition Evaluation conducted by the NIST.
APA, Harvard, Vancouver, ISO, and other styles
31

Zeberlein, Jennifer Catherine. "Examination of the Accuracy of the Social Language Development Test for Identification of Social Language Impairments." Miami University / OhioLINK, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=miami1398975745.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Johnson, Marie A. F., and A. Rice. "Early Childhood Language Delay: Identification of Children At-risk, Characteristics, and Strategies for Building Language Skills." Digital Commons @ East Tennessee State University, 2010. https://dc.etsu.edu/etsu-works/1550.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Johnson, Marie A. F., and A. Rice. "Early Childhood Language Delay: Identification of Children At-risk, Characteristics, and Strategies for Building Language Skills." Digital Commons @ East Tennessee State University, 2011. https://dc.etsu.edu/etsu-works/1549.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Smolenska, Greta. "Complex Word Identification for Swedish." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-352349.

Full text
Abstract:
Complex Word Identification (CWI) is a task of identifying complex words in text data and it is often viewed as a subtask of Automatic Text Simplification (ATS) where the main task is making a complex text simpler. The ways in which a text should be simplified depend on the target readers such as second language learners or people with reading disabilities. In this thesis, we focus on Complex Word Identification for Swedish. First, in addition to exploring existing resources, we collect a new dataset for Swedish CWI. We continue by building several classifiers of Swedish simple and complex words. We then use the findings to analyze the characteristics of lexical complexity in Swedish and English. Our method for collecting training data based on second language learning material has shown positive evaluation scores and resulted in a new dataset for Swedish CWI. Additionally, the built complex word classifiers have an accuracy at least as good as similar systems for English. Finally, the analysis of the selected features confirms the findings of previous studies and reveals some interesting characteristics of lexical complexity.
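A minimal sketch of a complex word classifier of this kind, assuming a few simple surface features; the toy words, labels and feature choices are illustrative, not the thesis's feature set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy Swedish words labelled simple (0) or complex (1); a real dataset
# would come from second language learning material, as in the thesis.
words = ["hus", "bok", "springa", "förutsättningarna",
         "tillgänglighetsanpassning", "och"]
labels = [0, 0, 0, 1, 1, 0]

def features(word: str) -> list:
    """Simple surface features often used for lexical complexity:
    word length, vowel count, and number of consonant bigrams."""
    vowels = set("aeiouyåäö")
    n_vowels = sum(ch in vowels for ch in word)
    clusters = sum(1 for a, b in zip(word, word[1:])
                   if a not in vowels and b not in vowels)
    return [len(word), n_vowels, clusters]

X = np.array([features(w) for w in words])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([features("universitetsutbildning")]))
```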
APA, Harvard, Vancouver, ISO, and other styles
35

Dwyer, Edward J. "Word Identification Strategies." Digital Commons @ East Tennessee State University, 2018. https://dc.etsu.edu/etsu-works/3417.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Dwyer, Edward J. "Word Identification Strategies." Digital Commons @ East Tennessee State University, 2016. https://dc.etsu.edu/etsu-works/3419.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Chou, Christine S. (Christine Susan). "Language identification through parallel phone recognition." Thesis, Massachusetts Institute of Technology, 1994. http://hdl.handle.net/1721.1/34056.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Xiang, Yang. "Grammatical Error Identification for Learners of Chinese as a Foreign Language." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-361927.

Full text
Abstract:
This thesis aims to build a system to tackle the task of diagnosing the grammatical errors in sentences written by learners of Chinese as a foreign language with the help of the CRF model (Conditional Random Field). The goal of this task is threefold: 1) identify whether the sentence is correct or not, 2) identify the specific error types in the sentence, 3) find the location of the identified errors. In this thesis, the task of Chinese grammatical error diagnosis is approached as a sequence tagging problem. The data and evaluation tool come from the previous shared tasks on Chinese Grammatical Error Diagnosis in 2016 and 2017. First, we use characters and POS tags as features to train the model and build the baseline system. We then notice that there are overlapping errors in the data. To solve this problem, we adopt three approaches: filtering out the problematic data, assigning encodings to characters with more than one label, and building separate classifiers for each error type. We continue by increasing the amount of training data and including syntactic features. The results show that both filtering out the problematic data and including syntactic features have a positive impact on the results. In addition, a difference between the domains of the training data and the test data can hurt performance to a large extent.
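A hedged sketch of the CRF sequence-tagging setup this abstract describes, using the sklearn-crfsuite package; the feature template, toy sentences and the single error tag are placeholders for the shared task's actual data and label set:

```python
import sklearn_crfsuite

def char_features(sent, i):
    """Features for one position: the character, its POS tag, and the
    neighbouring characters, mirroring the character + POS baseline."""
    ch, pos = sent[i]
    feats = {"char": ch, "pos": pos}
    if i > 0:
        feats["prev_char"] = sent[i - 1][0]
    if i < len(sent) - 1:
        feats["next_char"] = sent[i + 1][0]
    return feats

# Toy example: each sentence is a sequence of (character, POS) pairs and
# each character carries an error tag (O = correct; B-R here stands in
# for a redundancy-type error).
train_sents = [
    [("我", "PN"), ("很", "AD"), ("很", "AD"), ("好", "VA")],
    [("他", "PN"), ("是", "VC"), ("学", "NN"), ("生", "NN")],
]
train_tags = [["O", "O", "B-R", "O"], ["O", "O", "O", "O"]]

X = [[char_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_tags)
print(crf.predict(X[:1]))
```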
APA, Harvard, Vancouver, ISO, and other styles
39

Gerl, Armin. "Modelling of a privacy language and efficient policy-based de-identification." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSEI105.

Full text
Abstract:
The processing of personal information is omnipresent in our data-driven society, enabling personalized services, which are regulated by privacy policies. Although privacy policies are strictly defined by the General Data Protection Regulation (GDPR), no systematic mechanism is in place to enforce them. Especially if data is merged from several sources into a data-set with different privacy policies associated, the management of and compliance with all privacy requirements is challenging during the processing of the data-set. Privacy policies can vary due to different policies for each source or the personalization of privacy policies by individual users. Thus, there is a risk of negligent or malicious processing of personal data in defiance of privacy policies. To tackle this challenge, a privacy-preserving framework is proposed. Within this framework privacy policies are expressed in the proposed Layered Privacy Language (LPL), which allows the specification of legal privacy policies and privacy-preserving de-identification methods. The policies are enforced by a Policy-based De-identification (PD) process. The PD process enables efficient compliance with various privacy policies simultaneously while applying pseudonymization, personal privacy anonymization and privacy models for de-identification of the data-set. Thus, the privacy requirements of each individual privacy policy are enforced, filling the gap between legal privacy policies and their technical enforcement.
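A toy, hedged sketch of the policy-based de-identification idea: per-field rules drawn from a user's personalized policy decide whether a value is kept, pseudonymized or suppressed before processing. The FieldPolicy class and its method names are invented stand-ins, not LPL itself:

```python
from dataclasses import dataclass
import hashlib

@dataclass
class FieldPolicy:
    """Toy stand-in for one LPL-style rule: which de-identification
    method to apply to a given field before processing."""
    field: str
    method: str  # "pseudonymize", "suppress" or "keep"

def apply_policies(record: dict, policies: list) -> dict:
    out = {}
    for p in policies:
        value = record.get(p.field)
        if p.method == "keep":
            out[p.field] = value
        elif p.method == "pseudonymize":
            # Stable pseudonym via hashing; a real system would use a
            # keyed or reversible scheme as the policy demands.
            out[p.field] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        # "suppress": drop the field entirely
    return out

# One record, one user's personalized policy.
record = {"name": "Alice", "city": "Lyon", "diagnosis": "flu"}
policies = [FieldPolicy("name", "pseudonymize"),
            FieldPolicy("city", "keep"),
            FieldPolicy("diagnosis", "suppress")]
print(apply_policies(record, policies))
```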
APA, Harvard, Vancouver, ISO, and other styles
40

Asadullah, Munshi. "Identification of Function Points in Software Specifications Using Natural Language Processing." Thesis, Paris 11, 2015. http://www.theses.fr/2015PA112228/document.

Full text
Abstract:
The need to estimate the size of a piece of software, in order to estimate the cost and effort required to develop it, is a consequence of the growing use of software in almost every human activity. Moreover, the competitive nature of the software-development industry has made accurate size estimation, as early as possible in the development process, common practice. Traditionally, software size was estimated a posteriori, using various measures applied to the source code. However, as the software-engineering community came to understand that code-size estimation is crucial for controlling development and costs, early software size estimation became a widespread concern. Once the code has been written, estimating its size and cost makes contrastive studies possible and can support productivity monitoring. On the other hand, the benefits of size estimation are all the greater the earlier in development it is performed. Furthermore, if size estimation can be carried out periodically as design and development progress, it can give project managers valuable information for tracking progress and refining resource allocation accordingly. Our research is positioned around functional size estimation measures, commonly known as Function Point Analysis, which estimate the size of a piece of software from the functionality it must provide to the end user, expressed solely from the user's point of view and excluding, in particular, any development-specific considerations. A significant problem with using function points is the need for human experts to perform the count according to a set of counting rules; the estimation process therefore represents a substantial workload and a significant cost. Moreover, because the counting rules necessarily involve some human interpretation, they introduce imprecision into the estimates and make the measurements harder to reproduce. At present, the estimation process is entirely manual and forces human experts to read the full specifications in detail, a long and tedious task. We propose to give human experts automatic assistance in the estimation process by identifying, in the text of the specifications, the passages most likely to contain function points. This automatic assistance should significantly reduce reading time and the cost of estimation, without loss of precision. Finally, unambiguous identification of function points will make the measurements easier to reproduce. To our knowledge, the work presented in this thesis is the first to rely solely on the analysis of the textual content of specifications, applicable as soon as preliminary specifications are available, using a generic approach based on established practices of automatic natural language analysis.
The need to estimate the size of a software system, and thus its probable cost and effort, is a direct outcome of the growing demand for large and complex software in almost every conceivable situation. Furthermore, because the software-development industry is competitive, reliance on accurate size estimation at early stages of development has become commonplace. Traditionally, software size was estimated a posteriori from the resulting source code, and several metrics were in use for the task. However, as the software-engineering community came to understand the importance of code-size estimation, early-stage software size estimation became a mainstream concern. Once the code has been written, size and cost estimation primarily supports contrastive studies and, possibly, productivity monitoring. On the other hand, if size estimation can be performed at an early development stage (the earlier the better), the benefits are far greater: the central financial and management goals of software development, namely development cost and effort estimation, can be addressed before the first line of code is conceived. Furthermore, if size estimation can be performed periodically as design and development progress, it can provide project managers with valuable information on progress, resource allocation and expectation management. This research focuses on the functional size estimation metric commonly known as Function Point Analysis (FPA), which estimates the size of a software system in terms of the functionality it is expected to deliver from the user's point of view. One significant problem with FPA is its reliance on human counters, who must follow a set of standard counting rules, making the process labour- and cost-intensive (the process is called Function Point Counting, and the professionals are analysts or counters). Moreover, these rules are in many cases open to interpretation and therefore often produce inconsistent counts. The process is also entirely manual and requires Function Point (FP) counters to read large specification documents, making it rather slow. Some automation of the process could make a significant difference to current counting practice: accurately automating the identification of FPs in a document would at least reduce how much the counters must read, making the process faster and significantly cheaper. Moreover, consistent identification of FPs would yield consistent raw function point counts. To the best of our knowledge, the work presented in this thesis is a unique attempt to analyse specification documents from the early stages of software development, using a generic approach adapted from well-established Natural Language Processing (NLP) practices.
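The assistance the abstract proposes amounts to pre-filtering a specification so counters read only the passages likely to hold function points. The following sketch illustrates that idea with an intentionally crude keyword heuristic; the trigger list, threshold and function names are our invention, not the thesis's actual pipeline.

```python
# Minimal sketch: flag specification sentences likely to contain
# function points, so a human counter can read less. The trigger
# list and threshold are illustrative, not the thesis's method.
import re

FP_TRIGGERS = re.compile(
    r"\b(user|system|shall|must|display|enter|store|retrieve|report|update|delete)\b",
    re.IGNORECASE,
)

def candidate_sentences(spec_text: str, min_hits: int = 2):
    """Yield sentences with at least `min_hits` trigger words."""
    for sentence in re.split(r"(?<=[.!?])\s+", spec_text):
        hits = FP_TRIGGERS.findall(sentence)
        if len(hits) >= min_hits:
            yield sentence.strip(), len(hits)

spec = (
    "The system shall allow the user to enter billing details. "
    "Performance testing is scheduled for phase two. "
    "The user must be able to retrieve and display past invoices."
)

for sent, score in candidate_sentences(spec):
    print(f"[{score} hits] {sent}")
```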
APA, Harvard, Vancouver, ISO, and other styles
41

Eyecioglu, Ozmutlu Asli. "Paraphrase identification using knowledge-lean techniques." Thesis, University of Sussex, 2016. http://sro.sussex.ac.uk/id/eprint/65497/.

Full text
Abstract:
This research addresses the problem of identifying sentential paraphrases; that is, the ability of an estimator to predict well whether two sentential text fragments are paraphrases. The paraphrase identification task has practical importance in the Natural Language Processing (NLP) community because of the need to deal with the pervasive problem of linguistic variation. Accurate methods for identifying paraphrases should help to improve the performance of NLP systems that require language understanding, including key applications such as machine translation, information retrieval and question answering, amongst others. Over the last decade, a growing body of research has been conducted on paraphrase identification, and it has become a working area of NLP in its own right. Our objective is to investigate whether techniques that focus on automated text understanding while requiring fewer resources can achieve results comparable to methods employing more sophisticated NLP processing tools and other resources. These techniques, which we call “knowledge-lean”, range from simple, shallow overlap methods based on lexical items or n-grams to more sophisticated methods that employ automatically generated distributional thesauri. The work begins by focusing on techniques that exploit lexical overlap and text-based statistical techniques with little need of NLP tools, investigating to what extent these methods can be used for a paraphrase identification task. On two gold-standard datasets, we obtained competitive results on the Microsoft Research Paraphrase Corpus (MSRPC) and state-of-the-art results on the Twitter Paraphrase Corpus, using only n-gram overlap features in conjunction with support vector machines (SVMs). These techniques do not require any language-specific tools or external resources and appear to perform well without the need to normalise colloquial language such as that found on Twitter. It was natural to extend the scope of the research and to experiment on another, resource-poor language. The scarcity of available paraphrase data led us to construct our own corpus: a paraphrase corpus in Turkish. This corpus is relatively small but provides a representative collection, including a variety of texts. While there is still debate as to whether binary or fine-grained judgements better suit a paraphrase corpus, we chose to provide data for a sentential textual similarity task by agreeing on fine-grained scoring, knowing that this could be converted to binary scoring, but not the other way around. The correlation between the results from the different corpora is promising; it can therefore be surmised that resource-poor languages can benefit from knowledge-lean techniques. Discovering the strengths of knowledge-lean techniques led us to extend the work, with a new perspective, to techniques that use distributional statistical features of text by representing each word as a vector (word2vec). While recent research applies word2vec to larger fragments of text, such as phrases, sentences and even paragraphs, we present a new approach that introduces vectors of character n-grams carrying the same attributes as word vectors. The proposed method is able to capture syntactic as well as semantic relations without semantic knowledge, and it is shown to be competitive on Twitter with more sophisticated methods.
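As one concrete reading of the knowledge-lean setup (n-gram overlap features fed to an SVM), here is a minimal sketch using scikit-learn; the toy sentence pairs and the exact feature set are illustrative assumptions, not the thesis's configuration.

```python
# Sketch of a knowledge-lean paraphrase classifier: n-gram overlap
# features plus an SVM. Toy data and feature set are illustrative.
from sklearn.svm import SVC

def ngrams(text: str, n: int) -> set:
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap_features(s1: str, s2: str):
    """Jaccard overlap of word unigrams and character 2-4 grams."""
    feats = []
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    feats.append(len(w1 & w2) / max(len(w1 | w2), 1))
    for n in (2, 3, 4):
        c1, c2 = ngrams(s1.lower(), n), ngrams(s2.lower(), n)
        feats.append(len(c1 & c2) / max(len(c1 | c2), 1))
    return feats

pairs = [
    ("the cat sat on the mat", "a cat was sitting on the mat", 1),
    ("he bought a new car", "the weather is nice today", 0),
    ("she enjoys reading books", "she loves to read books", 1),
    ("open the window please", "stock prices fell sharply", 0),
]

X = [overlap_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([overlap_features("the cat sat on a mat",
                                    "a cat sat on the mat")]))
```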
APA, Harvard, Vancouver, ISO, and other styles
42

Sardinha, Antonio Paulo Berber. "Automatic identification of segments in written texts." Thesis, University of Liverpool, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.364227.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Dwyer, Edward J. "Encouraging Word Identification Skills." Digital Commons @ East Tennessee State University, 2016. https://dc.etsu.edu/etsu-works/3401.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Alharthi, Haifa. "Natural Language Processing for Book Recommender Systems." Thesis, Université d'Ottawa / University of Ottawa, 2019. http://hdl.handle.net/10393/39134.

Full text
Abstract:
The act of reading has benefits for individuals and societies, yet studies show that reading is declining, especially among the young. Recommender systems (RSs) can help stop such decline. There is a lot of research on literary books using natural language processing (NLP) methods, but the analysis of textual book content to improve recommendations is relatively rare. We propose content-based recommender systems that extract elements learned from book texts to predict readers’ future interests. One factor that influences reading preferences is writing style; we propose a system that recommends books after learning their authors’ writing style. To our knowledge, this is the first work that transfers the information learned by an author-identification model to a book RS. Another approach that we propose uses over a hundred lexical, syntactic, stylometric, and fiction-based features that might play a role in generating high-quality book recommendations. Previous book RSs include very few stylometric features; hence, our study is the first to include and analyze a wide variety of textual elements for book recommendations. We evaluated both approaches in a top-k recommendation scenario. They give better accuracy than state-of-the-art content-based and collaborative filtering methods. We highlight the significant factors that contributed to the accuracy of the recommendations using a forest of randomized regression trees. We also conducted a qualitative analysis by checking whether similar books/authors were annotated similarly by experts. Our content-based systems suffer from the new-user problem, well known in the field of RSs, which hinders their ability to make accurate recommendations. Therefore, we propose a Topic Model-Based book recommendation component (TMB) that addresses the issue by using the topics learned from a user’s shared text on social media to recognize their interests and map them to related books. To our knowledge, there is no literature on book RSs that exploits public social networks other than book-cataloging websites. Using topic modeling techniques, user interests can be extracted automatically and dynamically, without the need to search for predefined concepts. Though TMB is designed to complement other systems, we evaluated it against a traditional content-based book recommender (CB). We assessed the top-k recommendations made by TMB and CB and found that both retrieved a comparable number of books, even though CB relied on users’ rating history while TMB only required their social profiles.
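The TMB component maps topics learned from a user's shared text to related books. A toy sketch of that matching step follows, using scikit-learn's LDA as a stand-in topic model; the texts, dimensions and similarity measure are illustrative assumptions, not the thesis's setup.

```python
# Sketch of topic-model-based matching: infer a topic distribution
# from a user's posts and rank book blurbs by cosine similarity of
# topic vectors. LDA settings and texts are toy-sized illustrations.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

books = {
    "Space Opera":  "galaxy starship alien empire space war fleet",
    "Cosy Mystery": "village detective murder clue tea suspect garden",
    "Tech Memoir":  "startup code software founder product engineer",
}
user_posts = "spent the weekend refactoring code and shipping a product demo"

corpus = list(books.values()) + [user_posts]
counts = CountVectorizer().fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
topics = lda.fit_transform(counts)          # rows: 3 books + the user

user_vec, book_vecs = topics[-1], topics[:-1]
sims = book_vecs @ user_vec / (
    np.linalg.norm(book_vecs, axis=1) * np.linalg.norm(user_vec)
)
for title, s in sorted(zip(books, sims), key=lambda t: -t[1]):
    print(f"{s:.3f}  {title}")
```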
APA, Harvard, Vancouver, ISO, and other styles
45

Vdovichenko, Susan E. C. "The Beholder’s Eye: How Self-Identification and Linguistic Ideology Affect Shifting Language Attitudes and Language Maintenance in Ukraine." The Ohio State University, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=osu1305582855.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Sin, Wan-san Dorene. "The identification and characterization of Cantonese-speaking children with specific language impairment." Thesis (B.Sc.), University of Hong Kong, 2000. http://sunzi.lib.hku.hk/hkuto/record/B3620769X.

Full text
Abstract:
Thesis (B.Sc)--University of Hong Kong, 2000.
"A dissertation submitted in partial fulfilment of the requirements for the Bachelor of Science (Speech and Hearing Sciences), The University of Hong Kong, May 10, 2000." Also available in print.
APA, Harvard, Vancouver, ISO, and other styles
47

Combrinck, Hendrik Petrus. "A cost, complexity and performance comparison of two automatic language identification architectures." Pretoria : [s.n.], 2006. http://upetd.up.ac.za/thesis/available/etd-12212006-141335/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Esquierdo, Jennifer Joy. "Early identification of Hispanic English language learners for gifted and talented programs." Diss., Texas A&M University, 2003. http://hdl.handle.net/1969.1/3944.

Full text
Abstract:
The exponential growth of the Hispanic student population and the controversial educational issue surrounding the assessment of English language learners are the two fundamental topics of this study. Owing to the uncertainty and ambiguity surrounding the assessment of the escalating Hispanic student population, the underrepresentation of Hispanics in gifted and talented (GT) programs has become a critical educational concern (Bernal, 2002; Irby & Lara-Alecio, 1996; Ortiz & Gonzalez, 1998). The research questions that guided this study focused on finding validated assessments for early identification of gifted Hispanic English language learners (ELLs) in kindergarten. The first research question aimed to determine the concurrent validity of the Hispanic Bilingual Gifted Screening Instrument (HBGSI) using the Naglieri Nonverbal Ability Test (NNAT) and three selected subtests of the Woodcock Language Proficiency Battery-Revised (WLPB-R), administered in English and Spanish. This study found a positive, statistically significant correlation between the HBGSI, the NNAT, and the WLPB-R subtests. The second question focused on the correlation between language proficiency as measured by the WLPB-R subtests and nonverbal intelligence as measured by the NNAT; this analysis found a statistically significant correlation between the NNAT and the WLPB-R subtests. The third question concentrated on the difference in performance on the NNAT and WLPB-R subtests between two student groups, those identified and those not identified as GT using the HBGSI. The study determined that the students identified as GT performed statistically significantly differently on the NNAT from those not identified as GT. The fourth question centered on the difference in performance on the HBGSI between students enrolled in a transitional bilingual education (TBE) classroom and those enrolled in an English as a second language (ESL) classroom. The results of this study showed that students in a TBE classroom performed statistically significantly differently on five HBGSI clusters (Social & Academic Language, Familial, Collaboration, Imagery, and Creative Performance) from students in an ESL classroom. The study’s results are analyzed, interpreted and discussed in this dissertation.
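The concurrent-validity analyses in this study rest on score correlations between instruments; as a minimal illustration of that computation (with invented numbers, not the study's data):

```python
# Toy sketch of a concurrent-validity check: correlate scores from
# two instruments. Numbers are invented; they are not the study's data.
from scipy.stats import pearsonr

hbgsi = [72, 85, 60, 90, 78, 66, 88, 74]   # screening instrument scores
nnat  = [70, 82, 58, 95, 75, 63, 85, 71]   # nonverbal ability scores

r, p = pearsonr(hbgsi, nnat)
print(f"r = {r:.2f}, p = {p:.4f}")  # a significant positive correlation
```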
APA, Harvard, Vancouver, ISO, and other styles
49

Segers, Vaughn Mackman. "The efficacy of the Eigenvector approach to South African sign language identification." Thesis, University of the Western Cape, 2010. http://etd.uwc.ac.za/index.php?module=etd&action=viewtitle&id=gen8Srv25Nme4_2697_1298280657.

Full text
Abstract:

The communication barriers between deaf and hearing society mean that interaction between these communities is kept to a minimum. The South African Sign Language research group, Integration of Signed and Verbal Communication: South African Sign Language Recognition and Animation (SASL), at the University of the Western Cape aims to create technologies to bridge the communication gap. In this thesis we address the subject of whole hand gesture recognition. We demonstrate a method to identify South African Sign Language classifiers using an eigenvector approach. The classifiers researched within this thesis are based on those outlined by the Thibologa Sign Language Institute for SASL. Gesture recognition is achieved in real-time. Utilising a pre-processing method for image registration we are able to increase the recognition rates for the eigenvector approach.
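The eigenvector approach described here follows the classic eigenfaces recipe: project flattened images onto principal components and classify in the reduced space. A sketch under that reading, with random arrays standing in for real SASL frames:

```python
# Eigen-decomposition sketch: PCA on flattened gesture images,
# nearest-neighbour classification in the reduced eigenspace.
# Random arrays stand in for real sign-language frames.
import numpy as np

rng = np.random.default_rng(0)
h, w, n_train = 32, 32, 20
train = rng.random((n_train, h * w))          # flattened training images
labels = rng.integers(0, 4, size=n_train)     # 4 hypothetical classifiers

mean = train.mean(axis=0)
centered = train - mean
# SVD of the centered data gives the principal components.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
components = vt[:10]                          # keep 10 components

def project(img_flat):
    return components @ (img_flat - mean)

train_proj = centered @ components.T

def classify(img_flat):
    """Label of the nearest training image in eigenspace."""
    dists = np.linalg.norm(train_proj - project(img_flat), axis=1)
    return labels[np.argmin(dists)]

print(classify(rng.random(h * w)))
```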

APA, Harvard, Vancouver, ISO, and other styles
50

Van der Merwe, Ruan Henry. "Triplet entropy loss: improving the generalisation of short speech language identification systems." Master's thesis, Faculty of Science, 2021. http://hdl.handle.net/11427/33953.

Full text
Abstract:
Spoken language identification systems form an integral part of many speech recognition tools today. Over the years many techniques have been used to identify the language spoken, given just the audio input, but in recent years the trend has been towards end-to-end deep learning systems. Most of these techniques involve converting the audio signal into a spectrogram that can be fed into a Convolutional Neural Network (CNN), which then predicts the spoken language. This approach performs very well when the data fed to the model originates from the same domain as the training examples, but as soon as the input comes from a different domain these systems tend to perform poorly. An example is a system trained on WhatsApp recordings but put into production in an environment where it receives recordings from a phone line. The research presented here investigates several methods to improve the generalisation of language identification systems to new speakers and new domains. These methods involve spectral augmentation, where frequency or time bands of the spectrogram are masked during training, and CNN architectures pre-trained on the ImageNet dataset. The research also introduces the novel Triplet Entropy Loss training method, in which a network is trained simultaneously using cross entropy and triplet loss. Several tests were run with three different CNN architectures to investigate the effect each of these methods has on the generalisation of an LID system. The tests were done in a South African context on six languages, namely Afrikaans, English, Sepedi, Setswana, Xhosa and Zulu. The two domains tested were data from the NCHLT speech corpus, used as the training domain, and the Lwazi speech corpus, the unseen domain. It was found that all three methods improved the generalisation of the models, though not significantly. Even though the models trained using Triplet Entropy Loss showed a better understanding of the languages and higher accuracies, it appears that the models still memorise word patterns present in the spectrograms rather than learning the finer nuances of a language. The research shows that Triplet Entropy Loss has great potential and should be investigated further, not only in language identification tasks but in any classification task.
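The core of Triplet Entropy Loss, as the abstract presents it, is optimizing cross entropy and triplet loss simultaneously over one network. A minimal PyTorch sketch under that reading follows; the tiny encoder, margin and equal loss weighting are our assumptions, not the thesis's exact settings.

```python
# Minimal sketch of Triplet Entropy Loss as described: one network
# trained with cross entropy and triplet loss at the same time.
# Architecture, margin and loss weighting are illustrative choices.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_classes=6, emb_dim=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(8 * 16, emb_dim),
        )
        self.classifier = nn.Linear(emb_dim, n_classes)

    def forward(self, x):
        emb = self.features(x)
        return emb, self.classifier(emb)

model = Encoder()
ce = nn.CrossEntropyLoss()
triplet = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake spectrogram triplets: anchor/positive share a language label.
anchor, positive, negative = (torch.randn(4, 1, 64, 64) for _ in range(3))
labels = torch.randint(0, 6, (4,))

emb_a, logits_a = model(anchor)
emb_p, _ = model(positive)
emb_n, _ = model(negative)

# Joint objective: classify the anchor AND pull same-language
# embeddings together while pushing different languages apart.
loss = ce(logits_a, labels) + triplet(emb_a, emb_p, emb_n)
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```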
APA, Harvard, Vancouver, ISO, and other styles