Academic literature on the topic 'Tokenisation'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Tokenisation.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Tokenisation"

1. Pretorius, Laurette, Biffie Viljoen, Ansu Berg, and Rigardt Pretorius. "Tswana finite state tokenisation." Language Resources and Evaluation 49, no. 4 (December 24, 2014): 831–56. http://dx.doi.org/10.1007/s10579-014-9292-1.

2. Martin, Luther. "Protecting credit card information: encryption vs tokenisation." Network Security 2010, no. 6 (June 2010): 17–19. http://dx.doi.org/10.1016/s1353-4858(10)70084-2.

3. Haggett, Shawn, and Greg Knowles. "Tokenisation and compression of Java class files." Journal of Systems Architecture 58, no. 1 (January 2012): 1–12. http://dx.doi.org/10.1016/j.sysarc.2011.09.002.

4. Fam, Rashel, and Yves Lepage. "A Study of Analogical Density in Various Corpora at Various Granularity." Information 12, no. 8 (August 5, 2021): 314. http://dx.doi.org/10.3390/info12080314.

Abstract:
In this paper, we inspect the theoretical problem of counting the number of analogies between sentences contained in a text. Based on this, we measure the analogical density of the text. We focus on analogy at the sentence level, based on the level of form rather than on the level of semantics. Experiments are carried out on two different corpora in six European languages known to have various levels of morphological richness. Corpora are tokenised using several tokenisation schemes: character, sub-word and word. For the sub-word tokenisation scheme, we employ two popular sub-word models: the unigram language model and byte-pair encoding. The results show that a corpus with a higher Type-Token Ratio tends to have a higher analogical density. We also observe that masking tokens based on their frequency helps to increase the analogical density. As for the tokenisation scheme, the results show that analogical density decreases from the character level to the word level. However, this is not true when tokens are masked based on their frequencies. We find that tokenising the sentences using sub-word models and masking the least frequent tokens increases analogical density.
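As a quick illustration of the notions this abstract compares, the sketch below (plain Python, not the authors' code) tokenises a toy text at character and word granularity, computes its Type-Token Ratio, and masks low-frequency tokens; trained sub-word models such as byte-pair encoding or the unigram language model would require a model-fitting step that is out of scope here.

```python
# A minimal sketch of tokenisation granularities, Type-Token Ratio (TTR),
# and frequency-based masking as discussed in the abstract above.
from collections import Counter

def tokenise(text: str, scheme: str) -> list[str]:
    """Split text into tokens at character or word granularity.
    Sub-word schemes (BPE, unigram LM) would need a trained model."""
    if scheme == "character":
        return [c for c in text if not c.isspace()]
    if scheme == "word":
        return text.split()
    raise ValueError(f"unknown scheme: {scheme}")

def type_token_ratio(tokens: list[str]) -> float:
    """TTR = number of distinct token types / total number of tokens."""
    return len(set(tokens)) / len(tokens)

def mask_rare_tokens(tokens: list[str], min_freq: int = 2) -> list[str]:
    """Replace tokens below a frequency threshold with a mask symbol,
    mirroring the frequency-based masking the abstract describes."""
    freq = Counter(tokens)
    return [t if freq[t] >= min_freq else "<unk>" for t in tokens]

text = "the cat sat on the mat and the dog sat on the rug"
for scheme in ("character", "word"):
    toks = tokenise(text, scheme)
    print(scheme, round(type_token_ratio(toks), 3))
print(mask_rare_tokens(tokenise(text, "word")))
```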
5. Ortiz-Yepes, Diego. "A critical review of the EMV payment tokenisation specification." Computer Fraud & Security 2014, no. 10 (October 2014): 5–12. http://dx.doi.org/10.1016/s1361-3723(14)70539-1.

6. Lochlainn, Mícheál Mac. "Sintéiseoir 1.0: a multidialectical TTS application for Irish." ReCALL 22, no. 2 (May 2010): 152–71. http://dx.doi.org/10.1017/s0958344010000054.

Abstract:
This paper details the development of a multidialectical text-to-speech (TTS) application, Sintéiseoir, for the Irish language. This work is being carried out in the context of Irish as a lesser-used language, where learners and other L2 speakers have limited direct exposure to L1 speakers and speech communities, and where native sound systems and vocabularies can be seen to be receding even among L1 speakers, particularly the young.

Sintéiseoir essentially implements the diphone concatenation model, albeit augmented to include phones, half-phones and, potentially, other phonic units. It is based on a platform-independent framework comprising a user interface, a set of dialect-specific tokenisation engines, a concatenation engine and a playback device.

The tokenisation strategy is entirely rule-based and does not refer to dictionary look-ups. Provision has been made for prosodic processing in the framework but has not yet been implemented. Concatenation units are stored in the form of WAV files on the local file system.

Sintéiseoir’s user interface (UI) provides a text field that allows the user to submit a grapheme string for synthesis and a prompt to select a dialect. It also filters input to reject graphotactically invalid strings, restrict input to alphabetic characters and certain punctuation marks found in Irish orthography, and ensure that a dialect has, indeed, been selected.

The UI forwards the filtered grapheme string to the appropriate tokenisation engine. This searches for specified substrings and maps them to corresponding tokens that themselves correspond to concatenation units. The resultant token string is then forwarded to the concatenation engine, which retrieves the relevant concatenation units, extracts their audio data and combines them in a new unit. This is then forwarded to the playback device.

The terms of reference for the initial development of Sintéiseoir specified that it should be capable of uttering, individually, the 99 most common Irish lemmata in the dialects of An Spidéal, Músgraí Uí Fhloínn and Gort a’ Choirce, which are internally consistent dialects within the Connacht, Munster and Ulster regions, respectively, of the dialect continuum. Audio assets to satisfy this requirement have already been prepared, and have been found to produce reasonably accurate output. The tokenisation engine is, however, capable of processing a wider range of input strings and, when required concatenation units are found to be unavailable, returns a report via the user interface.
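The pipeline the abstract outlines (rule-based tokenisation engine, then concatenation of stored audio units) can be illustrated with a short sketch. This is not the Sintéiseoir source; the grapheme-to-token rules below are invented placeholders, not real Irish dialect rules, and the concatenation step is a stub that only resolves unit file paths.

```python
# A minimal sketch of a rule-based tokenisation engine that greedily maps
# grapheme substrings to concatenation-unit tokens, plus a stub
# concatenation step. Rules and unit names are hypothetical.
RULES = {  # longest-match-first mapping of grapheme substrings to unit tokens
    "ch": "x",
    "a": "a",
    "t": "t",
}

def tokenise(graphemes: str, rules: dict[str, str]) -> list[str]:
    """Greedy longest-match tokenisation; raises on unmappable input,
    mirroring how Sintéiseoir reports unavailable units via the UI."""
    tokens, i = [], 0
    keys = sorted(rules, key=len, reverse=True)
    while i < len(graphemes):
        for k in keys:
            if graphemes.startswith(k, i):
                tokens.append(rules[k])
                i += len(k)
                break
        else:
            raise ValueError(f"no rule covers input at position {i}")
    return tokens

def concatenate(tokens: list[str]) -> list[str]:
    """Stub: resolve each token to a WAV concatenation unit on disk.
    A real engine would splice the audio data and hand it to playback."""
    return [f"units/{t}.wav" for t in tokens]

print(concatenate(tokenise("chat", RULES)))  # ['units/x.wav', 'units/a.wav', 'units/t.wav']
```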
7. Saario, Lassi, Tanja Säily, Samuli Kaislaniemi, and Terttu Nevalainen. "The burden of legacy: Producing the Tagged Corpus of Early English Correspondence Extension (TCEECE)." Research in Corpus Linguistics 9, no. 1 (2021): 104–31. http://dx.doi.org/10.32714/ricl.09.01.07.

Abstract:
This paper discusses the process of part-of-speech tagging the Corpus of Early English Correspondence Extension (CEECE), as well as the end result. The process involved normalisation of historical spelling variation, conversion from a legacy format into TEI-XML, and finally, tokenisation and tagging by the CLAWS software. At each stage, we had to face and work around problems such as whether to retain original spelling variants in corpus markup, how to implement overlapping hierarchies in XML, and how to calculate the accuracy of tagging in a way that acknowledges errors in tokenisation. The final tagged corpus is estimated to have an accuracy of 94.5 per cent (in the C7 tagset), which is circa two percentage points (pp) lower than that of present-day corpora but respectable for Late Modern English. The most accurate tag groups include pronouns and numerals, whereas adjectives and adverbs are among the least accurate. Normalisation increased the overall accuracy of tagging by circa 3.7pp. The combination of POS tagging and social metadata will make the corpus attractive to linguists interested in the interplay between language-internal and -external factors affecting variation and change.
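One way to "calculate the accuracy of tagging in a way that acknowledges errors in tokenisation", as the abstract puts it, is sketched below. This is an illustrative scoring scheme, not the TCEECE procedure: a system token counts as correct only if both its character span and its tag match the gold standard, so mis-tokenised words can never score, and missed gold tokens still count in the denominator. The C7-style tags in the toy data are only examples.

```python
# A minimal sketch of tagging accuracy that penalises tokenisation errors:
# gold/system items are (start_offset, end_offset, tag) triples.
def accuracy(gold: list[tuple[int, int, str]],
             system: list[tuple[int, int, str]]) -> float:
    """A gold token is correct iff the system produced the same span
    with the same tag; tokenisation mismatches thus count as errors."""
    system_by_span = {(s, e): tag for s, e, tag in system}
    correct = sum(1 for s, e, tag in gold if system_by_span.get((s, e)) == tag)
    return correct / len(gold)

gold = [(0, 3, "AT"), (4, 9, "NN1"), (10, 15, "VVD")]
system = [(0, 3, "AT"), (4, 9, "NN2"), (10, 15, "VVD")]  # one tag error
print(accuracy(gold, system))  # ≈ 0.667
```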
8. Santos, Igor, Carlos Laorden, Borja Sanz, and Pablo G. Bringas. "Reversing the effects of tokenisation attacks against content-based spam filters." International Journal of Security and Networks 8, no. 2 (2013): 106. http://dx.doi.org/10.1504/ijsn.2013.055944.

9. Corcoran, Padraig, Geraint Palmer, Laura Arman, Dawn Knight, and Irena Spasić. "Creating Welsh Language Word Embeddings." Applied Sciences 11, no. 15 (July 27, 2021): 6896. http://dx.doi.org/10.3390/app11156896.

Abstract:
Word embeddings are representations of words in a vector space that models semantic relationships between words by means of distance and direction. In this study, we adapted two existing methods, word2vec and fastText, to automatically learn Welsh word embeddings, taking into account the syntactic and morphological idiosyncrasies of this language. These methods exploit the principles of distributional semantics and, therefore, require a large corpus to be trained on. However, Welsh is a minoritised language, hence significantly less Welsh language data are publicly available in comparison to English. Consequently, assembling a sufficiently large text corpus is not a straightforward endeavour. Nonetheless, we compiled a corpus of 92,963,671 words from 11 sources, which represents the largest corpus of Welsh. The relative complexity of Welsh punctuation made the tokenisation of this corpus relatively challenging, as punctuation could not be used for boundary detection. We considered several tokenisation methods, including one designed specifically for Welsh. To account for rich inflection, we used a method for learning word embeddings that is based on subwords and, therefore, can more effectively relate different surface forms during the training phase. We conducted both qualitative and quantitative evaluation of the resulting word embeddings, which outperformed the Welsh word embeddings previously described as part of a larger study including 157 languages. Our study was the first to focus specifically on Welsh word embeddings.
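A minimal sketch of the subword-based training the abstract describes, assuming gensim 4.x for its fastText implementation; the toy sentences merely stand in for the tokenised 92-million-word corpus, and the hyperparameters are illustrative rather than those used in the study.

```python
# A sketch of training subword-aware embeddings in the spirit of the
# abstract: fastText character n-grams relate different surface forms
# of inflected words. Assumes gensim 4.x is installed.
from gensim.models import FastText

sentences = [
    ["mae", "hi", "yn", "braf", "heddiw"],
    ["roedd", "hi", "yn", "braf", "ddoe"],
]  # in practice: the tokenised 92M-word Welsh corpus

model = FastText(
    sentences=sentences,
    vector_size=100,   # embedding dimensionality
    window=5,
    min_count=1,
    min_n=3, max_n=6,  # character n-gram range used for subwords
    epochs=10,
)

# Subword n-grams let the model produce a vector even for an unseen
# surface form of a known word.
print(model.wv["braf"].shape)
```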
10. Gajdoš, Ľuboš. "Chinese legal texts – Quantitative Description." Acta Linguistica Asiatica 7, no. 1 (June 28, 2017): 77–87. http://dx.doi.org/10.4312/ala.7.1.77-87.

Abstract:
The aim of the paper is to provide a quantitative description of legal Chinese. The study adopts a corpus-based approach and reports basic statistical parameters of legal texts in Chinese, namely sentence length, the proportions of parts of speech, etc. The research is conducted on the Chinese monolingual corpus Hanku. The paper also discusses issues in processing statistical data from various corpora, e.g. tokenisation and part-of-speech tagging, and their relevance to the study of register variation.
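The two corpus statistics the abstract names, sentence length and part-of-speech proportions, can be computed along these lines; the sketch below uses toy data and invented tag labels, not the Hanku corpus or its actual tagset.

```python
# A minimal sketch of mean sentence length and POS proportions over a
# segmented ("tokenised"), POS-tagged corpus. Data and tags are toy examples.
from collections import Counter

# Each sentence is a list of (token, pos) pairs.
corpus = [
    [("本", "r"), ("法", "n"), ("自", "p"), ("公布", "v"), ("之", "u"), ("日", "n")],
    [("施行", "v")],
]

mean_len = sum(len(s) for s in corpus) / len(corpus)
pos_counts = Counter(pos for sent in corpus for _, pos in sent)
total = sum(pos_counts.values())

print(f"mean sentence length: {mean_len:.1f} tokens")
for pos, n in pos_counts.most_common():
    print(f"{pos}: {n / total:.0%}")
```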

Dissertations / Theses on the topic "Tokenisation"

1. Puttkammer, Martin Johannes. "Outomatiese Afrikaanse tekseenheididentifisering / deur Martin J. Puttkammer." Thesis, North-West University, 2006. http://hdl.handle.net/10394/872.

Abstract:
An important core technology in the development of human language technology applications is an automatic morphological analyser. Such a morphological analyser consists of various modules, one of which is a tokeniser. At present no tokeniser exists for Afrikaans, and it has therefore been impossible to develop a morphological analyser for Afrikaans. In this research project such a tokeniser is developed, and the project therefore has two objectives: i) to postulate a tag set for integrated tokenisation, and ii) to develop an algorithm for integrated tokenisation. In order to achieve the first objective, a tag set for the tagging of sentences, named entities, words, abbreviations and punctuation is proposed specifically for the annotation of Afrikaans texts. It consists of 51 tags, which can be expanded in future in order to establish a larger, more specific tag set. The postulated tag set can also be simplified according to the level of specificity required by the user. It is subsequently shown that an effective tokeniser cannot be developed using only linguistic, or only statistical, methods. This is due to the complexity of the task: rule-based modules should be used for certain processes (for example sentence recognition), while other processes (for example named-entity recognition) can only be executed successfully by means of a machine-learning module. It is argued that a hybrid system (a system in which rule-based and statistical components are integrated) would achieve the best results on Afrikaans tokenisation. Various rule-based and statistical techniques, including a TiMBL-based classifier, are then employed to develop such a hybrid tokeniser for Afrikaans. The final tokeniser achieves an f-score of 97.25% when the complete set of tags is used. For sentence recognition an f-score of 100% is achieved. The tokeniser also recognises 81.39% of named entities. When a simplified tag set (consisting of only 12 tags) is used to annotate named entities, the f-score rises to 94.74%. The conclusion of the study is that a hybrid approach is indeed suitable for Afrikaans sentencisation, named-entity recognition and tokenisation. The tokeniser will improve if it is trained with more data, while the expansion of gazetteers as well as the tag set will also lead to a more accurate system.
Thesis (M.A. (Applied Language and Literary Studies))--North-West University, Potchefstroom Campus, 2006.
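The hybrid architecture the thesis argues for, rule-based modules for some processes and machine learning for others, might be sketched as follows. The sentence-splitting rules and the stub classifier below are illustrative stand-ins, not Puttkammer's system or the TiMBL-based classifier it employs.

```python
# A minimal sketch of a hybrid tokeniser: a rule-based sentenciser
# combined with a (stub) trainable named-entity classifier.
import re

def split_sentences(text: str) -> list[str]:
    """Rule-based sentencisation: split after ., ! or ? followed by
    whitespace and a capital letter. Real rules would also handle
    abbreviations, initials, etc."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

def classify_token(token: str) -> str:
    """Stub NE classifier: a machine-learning module would assign one
    of the 51 postulated tags based on learned features."""
    return "NE" if token[:1].isupper() else "WORD"

def tokenise(text: str) -> list[list[tuple[str, str]]]:
    return [[(tok, classify_token(tok)) for tok in sent.split()]
            for sent in split_sentences(text)]

print(tokenise("Martin woon in Potchefstroom. Hy studeer taalkunde."))
```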
2. Asian, Jelita. "Effective Techniques for Indonesian Text Retrieval." RMIT University, Computer Science and Information Technology, 2007. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20080110.084651.

Abstract:
The Web is a vast repository of data, and information on almost any subject can be found with the aid of search engines. Although the Web is international, the majority of research on finding information has focused on languages such as English and Chinese. In this thesis, we investigate information retrieval techniques for Indonesian. Although Indonesia is the fourth most populous country in the world, little attention has been given to searching Indonesian documents. Stemming is the process of reducing morphological variants of a word to a common stem form. Previous research has shown that stemming is language-dependent. Although several stemming algorithms have been proposed for Indonesian, there is no consensus on which gives better performance. We empirically explore these algorithms, showing that even the best algorithm still has scope for improvement. We propose novel extensions to this algorithm and develop a new Indonesian stemmer, and show that these can improve stemming correctness by up to three percentage points; our approach makes less than one error in thirty-eight words. We propose a range of techniques to enhance the performance of Indonesian information retrieval. These techniques include stopping, sub-word tokenisation, identification of proper nouns, and modifications to existing similarity functions. Our experiments show that many of these techniques can increase retrieval performance, with the highest increase achieved when we use n-grams of size five to tokenise words. We also present an effective method for identifying the language of a document; this allows various information retrieval techniques to be applied selectively depending on the language of the target documents. We also address the problem of automatic creation of parallel corpora (collections of documents that are direct translations of each other), which are essential for cross-lingual information retrieval tasks. Well-curated parallel corpora are rare, and for many languages, such as Indonesian, do not exist at all. We describe algorithms that we have developed to automatically identify parallel documents for Indonesian and English. Unlike most current approaches, which consider only the context and structure of the documents, our approach is based on the document content itself. Our algorithms do not make any prior assumptions about the documents, and are based on the Needleman-Wunsch algorithm for global alignment of protein sequences. Our approach works well in identifying Indonesian-English parallel documents, especially when no translation is performed. It can increase the separation value, a measure that discriminates good matches of parallel documents from bad ones, by approximately ten percentage points. We also investigate the applicability of our identification algorithms to other languages that use the Latin alphabet. Our experiments show that, with minor modifications, our alignment methods are effective for English-French, English-German, and French-German corpora, especially when the documents are not translated. Our technique can increase the separation value for the European corpus by up to twenty-eight percentage points. Together, these results provide a substantial advance in understanding techniques that can be applied for effective Indonesian text retrieval.
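The thesis bases its parallel-document identification on the Needleman-Wunsch algorithm for global alignment; a minimal sketch of that algorithm applied to token sequences follows, with illustrative scoring constants rather than the thesis's parameters.

```python
# A minimal sketch of Needleman-Wunsch global alignment over two token
# sequences; higher scores suggest documents are more plausibly parallel.
MATCH, MISMATCH, GAP = 2, -1, -1  # illustrative scoring constants

def needleman_wunsch(a: list[str], b: list[str]) -> int:
    """Return the optimal global alignment score of sequences a and b."""
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = i * GAP
    for j in range(1, len(b) + 1):
        dp[0][j] = j * GAP
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                           dp[i - 1][j] + GAP,      # gap in b
                           dp[i][j - 1] + GAP)      # gap in a
    return dp[len(a)][len(b)]

print(needleman_wunsch("the cat sat".split(), "the cats sat".split()))  # 3
```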

Book chapters on the topic "Tokenisation"

1. Umar, Assad, Iakovos Gurulian, Keith Mayes, and Konstantinos Markantonakis. "Tokenisation Blacklisting Using Linkable Group Signatures." In Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, 182–98. Cham: Springer International Publishing, 2017. http://dx.doi.org/10.1007/978-3-319-59608-2_10.

2. Butler, Simon, Michel Wermelinger, Yijun Yu, and Helen Sharp. "Improving the Tokenisation of Identifier Names." In Lecture Notes in Computer Science, 130–54. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011. http://dx.doi.org/10.1007/978-3-642-22655-7_7.

3. Lecomte, Patrick. "The tokenisation of commercial real estate." In New Frontiers in Real Estate Finance, 134–55. 1st ed. New York: Routledge, 2021. http://dx.doi.org/10.1201/9780429344145-3.

4. Lugo-Ocando, Jairo, and An Nguyen. "The “tokenisation” of development in the news." In Developing News, 14–29. London; New York: Routledge, 2017. http://dx.doi.org/10.4324/9781315269245-2.

5. Jayasinghe, Danushka, Konstantinos Markantonakis, Raja Naeem Akram, and Keith Mayes. "Enhancing EMV Tokenisation with Dynamic Transaction Tokens." In Radio Frequency Identification and IoT Security, 107–22. Cham: Springer International Publishing, 2017. http://dx.doi.org/10.1007/978-3-319-62024-4_8.


Conference papers on the topic "Tokenisation"

1. Pretorius, Rigardt, Ansu Berg, Laurette Pretorius, and Biffie Viljoen. "Setswana tokenisation and computational verb morphology." In the First Workshop. Morristown, NJ, USA: Association for Computational Linguistics, 2009. http://dx.doi.org/10.3115/1564508.1564522.

2. Haggett, Shawn, Greg Knowles, and Graham Bignell. "Tokenisation of Class Files for an Embedded Java Processor." In 6th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2007). IEEE, 2007. http://dx.doi.org/10.1109/icis.2007.181.

3. Grover, Claire, Michael Matthews, and Richard Tobin. "Tools to address the interdependence between tokenisation and standoff annotation." In the 5th Workshop. Morristown, NJ, USA: Association for Computational Linguistics, 2006. http://dx.doi.org/10.3115/1621034.1621038.

4. "Information Extraction from Web Services - A Comparison of Tokenisation Algorithms." In International Workshop on Software Knowledge. SciTePress - Science and Technology Publications, 2011. http://dx.doi.org/10.5220/0003698000120023.
