
Journal articles on the topic 'News text corpus'



Consult the top 50 journal articles for your research on the topic 'News text corpus.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of each publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Sharjeel, Muhammad, Rao Muhammad Adeel Nawab, and Paul Rayson. "COUNTER: corpus of Urdu news text reuse." Language Resources and Evaluation 51, no. 3 (September 10, 2016): 777–803. http://dx.doi.org/10.1007/s10579-016-9367-2.

2

Zhang, Yiqiong. "Retailing science: genre hybridization in online science news stories." Text & Talk 38, no. 2 (February 23, 2018): 243–65. http://dx.doi.org/10.1515/text-2017-0040.

Abstract:
This study explores how marketing and science rhetoric have become entrenched in online science news stories. The schematic structures of a corpus of 270 news stories from three types of website (university websites, the websites of Futurity.org and MSNBC.com) have been analyzed and compared. An eight-move structure identified from the corpus suggests that the genre of news stories is a hybridization of promotional discourse for marketization and science discourse for explanation. Hybridization is first evident in university press releases, which are then spread by the mass media without significant changes. From the perspective of intertextual chains, the emerging discourse practices can be attributed to the power shifting of news production from journalists to science institutions and further from journalistic to scientific norms. In turn, the discourse practices accelerate the shift of power, which could ultimately lead to the loss of independent and critical science journalism.
3

Watanabe, Chiaki, and Ichiro Kobayashi. "Intelligent Information Presentation Corresponding to User Request Based on Collaboration Between Text and 2D Charts." Journal of Advanced Computational Intelligence and Intelligent Informatics 12, no. 1 (January 20, 2008): 10–15. http://dx.doi.org/10.20965/jaciii.2008.p0010.

Abstract:
We discuss intelligent information provision in which different modal information is presented collaboratively, with an example of news articles about stock prices summarized based on a 2D chart representation of stock prices. We use the MuST corpus, an annotated corpus for easily extracting trends in information, e.g., statistical values, etc., as the news article corpus to be summarized. We associate the MuST corpus with numerical data on the stock prices, and propose a way to provide people with a summarized text about news articles on prices corresponding to a 2D chart representation.
4

Dong, Min, and Mengfei Gao. "Appraisal as co-selection and media performativity: 5G technology imaged in German news discourse." Text & Talk 42, no. 2 (November 2, 2021): 177–208. http://dx.doi.org/10.1515/text-2020-0012.

Abstract:
This article views appraisal as co-selection patterns of target, source and evaluative parameters and investigates the ways in which news discourse retells news stories and reproduces truthful reality. We combined the corpus-assisted method and quantitative/qualitative analysis of the data, i.e., 904 sentences which were extracted from the corpus of German 5G news reports by selecting the top 5 items from each of the noun keyword lists of the three subcorpora of economics, politics and technology news reports. It was found that the German media restage the necessity and desirability of promoting the development of German communication facilities/technology through international cooperation, particularly Sino-German cooperation. In addition, a hesitant image was evoked as to the high-profile 5G development in Germany, with an awareness of the potential security risks and economic losses. On the intersubjective dimension, our findings suggest that journalists make full use of different dialogistic positioning strategies for closing down or opening up the dialogic space to a greater or lesser degree. More specifically, they tend to acknowledge and endorse the positive/negative attitudes attributed to the non-authorial voices towards particular targets in the fields of economics, politics or technology. A future comparison with the genre of news comments or editorials would deepen our understanding of the performativity of media.
5

Hou, Zhide. "The American Dream meets the Chinese Dream: a corpus-driven phraseological analysis of news texts." Text & Talk 38, no. 3 (April 25, 2018): 317–40. http://dx.doi.org/10.1515/text-2018-0006.

Abstract:
This study is a corpus-driven examination of frequent lexical words and keywords in the news texts related to the American Dream and the Chinese Dream. Based on Sinclair's (Sinclair, John McHardy. 2004. Trust the Text. Routledge: London) five categories of co-selection as framework, it discusses the patterns of co-selection across the corpora of news texts, with a particular focus on the cumulative effects of the co-construction of situated meanings and establishment of ideological positions associated with the two dreams. The corpus linguistic tool Wordsmith is used to generate frequent words and keywords for detailed concordance analysis along both syntagmatic and paradigmatic relations in order to indicate collocation, colligation, semantic preference, and semantic prosody. The findings demonstrate the individualistic home, work and education associations of the American Dream versus the collectivistic attributions of the Chinese Dream of national rejuvenation. The study not only confirms different cultural practices, but also reveals different social-historical conditions, and political influences associated with media representations of the American Dream and the Chinese Dream.
6

Ho, Janet. "An earthquake or a category 4 financial storm? A corpus study of disaster metaphors in the media framing of the 2008 financial crisis." Text & Talk 39, no. 2 (March 26, 2019): 191–212. http://dx.doi.org/10.1515/text-2019-2024.

Abstract:
This study investigates the use of disaster metaphors in the American media coverage of the 2008 global financial crisis. More specifically, it aims to examine the role of different sub-metaphors in performing various pragmatic and rhetorical functions in financial news discourse. Using the Metaphor Identification Procedure, this study identifies key words from the 1-million-word corpus which comprised the news articles published from September 15, 2008 to March 15, 2009, and examines the associated concordance lines to discern their metaphorical connotations. The findings show that a wide range of sub-source domains of disaster—namely, wind, storm, and water—metaphors was deployed by journalists to capture the various negative impacts of the financial crisis. These findings suggest that the salient extension and mixing of metaphors could enhance the popularization of specialist financial news discourse. The findings also indicate that the news media was complicit in constructing the collective illusion that the financial crisis was unavoidable and not caused by anyone.
7

Best, Michael L. "An Ecology of Text: Using Text Retrieval to Study Alife on the Net." Artificial Life 3, no. 4 (October 1997): 261–87. http://dx.doi.org/10.1162/artl.1997.3.4.261.

Abstract:
I introduce a new alife model, an ecology based on a corpus of text, and apply it to the analysis of posts to USENET News. In this corporal ecology posts are organisms, the newsgroups of NetNews define an environment, and human posters situated in their wider context make up a scarce resource. I apply latent semantic indexing (LSI), a text retrieval method based on principal component analysis, to distill from the corpus those replicating units of text. LSI arrives at suitable replicators because it discovers word co-occurrences that segregate and recombine with appreciable frequency. I argue that natural selection is necessarily in operation because sufficient conditions for its occurrence are met: replication, mutagenicity, and trait/fitness covariance. I describe a set of experiments performed on a static corpus of over 10,000 posts. In these experiments I study average population fitness, a fundamental element of population ecology. My study of fitness arrives at the unhappy discovery that a flame-war, centered around an overly prolific poster, is the king of the jungle.
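
For readers curious about the LSI step this abstract relies on, a minimal sketch with scikit-learn's truncated SVD follows (our illustration on invented toy posts, not the paper's implementation):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    # Toy stand-ins for USENET posts (hypothetical data).
    posts = [
        "first post about artificial life",
        "a reply quoting the first post about artificial life",
        "an unrelated announcement",
    ]
    X = CountVectorizer().fit_transform(posts)  # term-document count matrix
    lsi = TruncatedSVD(n_components=2)          # latent dimensions ~ candidate replicators
    doc_coords = lsi.fit_transform(X)           # each post embedded in LSI space
    print(doc_coords.shape)                     # (3, 2)
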
8

Cenek, Martin, Rowan Bulkow, Eric Pak, Levi Oyster, Boyd Ching, and Ashika Mulagada. "Semantic Network Analysis Pipeline—Interactive Text Mining Framework for Exploration of Semantic Flows in Large Corpus of Text." Applied Sciences 9, no. 24 (December 5, 2019): 5302. http://dx.doi.org/10.3390/app9245302.

Abstract:
Historical topic modeling and semantic concept exploration in a large corpus of unstructured text remain a hard, open problem. Despite advancements in natural language processing tools, statistical linguistic models, graph theory and visualization, there is no framework that combines these piece-wise tools under one roof. We designed and constructed a Semantic Network Analysis Pipeline (SNAP) that is available as an open-source web service and implements the work-flow needed by a data scientist to explore historical semantic concepts in a text corpus. We define a graph-theoretic notion of a semantic concept as a flow of closely related tokens through the corpus of text. The modular work-flow pipeline processes text using natural language processing tools and statistical content narrowing, creates semantic networks from lexical token chaining, performs social network analysis of token networks and creates a 3D visualization of the semantic concept flows through the corpus for interactive concept exploration. Finally, we illustrate the framework's utility by extracting information from a text corpus of Herman Melville's novel Moby Dick, the transcript of the 2015–2016 United States (U.S.) Senate Hearings on Environment and Public Works, and the Australian Broadcasting Corporation's short news articles on rural and science topics.
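
For a concrete picture of the "semantic networks from lexical token chaining" step, here is a toy token co-occurrence network built with networkx (our sketch; SNAP's actual chaining, narrowing and 3D visualization are far richer):

    import itertools
    import networkx as nx

    # Toy tokenised sentences (hypothetical data).
    sentences = [["whale", "sea", "ship"], ["ship", "captain", "whale"]]
    G = nx.Graph()
    for sent in sentences:
        for a, b in itertools.combinations(sorted(set(sent)), 2):
            count = G.get_edge_data(a, b, default={"weight": 0})["weight"]
            G.add_edge(a, b, weight=count + 1)  # co-occurrence count as edge weight
    centrality = nx.degree_centrality(G)        # a basic social-network-analysis measure
    print(max(centrality, key=centrality.get))  # most connected token
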
9

Yulita, Winda, Sigit Priyanta, and Azhari SN. "Automatic Text Summarization Based on Semantic Networks and Corpus Statistics." IJCCS (Indonesian Journal of Computing and Cybernetics Systems) 13, no. 2 (April 30, 2019): 137. http://dx.doi.org/10.22146/ijccs.38261.

Abstract:
One simple automatic text summarization method that can minimize redundancy in a summary is the Maximum Marginal Relevance (MMR) method. The MMR method has the disadvantage that parts of the summary results may not be semantically connected to each other. Therefore, this study aims to compare summary results produced by a semantic-based MMR method and a non-semantic-based MMR method. The semantic-based MMR method utilizes WordNet Bahasa and a corpus in processing text summaries, while the non-semantic MMR method is based on TF-IDF. This study also carried out summary compression of 30%, 20%, and 10%. The research data used are 50 online news texts. Testing of the summary results is done using the ROUGE toolkit. The results of the study show that the best f-score for the semantic-based MMR method is 0.561, while the best f-score for the non-semantic MMR method is 0.598. This value is produced by adding a preprocessing step in the form of stemming and compressing the summary result to 30%. The difference in the values obtained is due to the incompleteness of WordNet Bahasa and to several words in the news titles that are not in accordance with EYD (KBBI).
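
The greedy MMR selection rule at the heart of this abstract can be sketched in a few lines of Python (a generic formulation; the lambda weight and the similarity functions are assumptions, not the authors' code):

    def mmr_select(sentences, sim_to_doc, sim, lam=0.7, k=3):
        """Greedy Maximal Marginal Relevance selection (illustrative sketch)."""
        chosen, candidates = [], list(sentences)
        while candidates and len(chosen) < k:
            def score(s):
                redundancy = max((sim(s, c) for c in chosen), default=0.0)
                return lam * sim_to_doc(s) - (1.0 - lam) * redundancy
            best = max(candidates, key=score)  # most relevant, least redundant
            chosen.append(best)
            candidates.remove(best)
        return chosen
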
10

Pryzant, Reid, Richard Diehl Martinez, Nathan Dass, Sadao Kurohashi, Dan Jurafsky, and Diyi Yang. "Automatically Neutralizing Subjective Bias in Text." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 01 (April 3, 2020): 480–89. http://dx.doi.org/10.1609/aaai.v34i01.5385.

Abstract:
Texts like news, encyclopedias, and some social media strive for objectivity. Yet bias in the form of inappropriate subjectivity — introducing attitudes via framing, presupposing truth, and casting doubt — remains ubiquitous. This kind of bias erodes our collective trust and fuels social conflict. To address this issue, we introduce a novel testbed for natural language generation: automatically bringing inappropriately subjective text into a neutral point of view (“neutralizing” biased text). We also offer the first parallel corpus of biased language. The corpus contains 180,000 sentence pairs and originates from Wikipedia edits that removed various framings, presuppositions, and attitudes from biased sentences. Last, we propose two strong encoder-decoder baselines for the task. A straightforward yet opaque concurrent system uses a BERT encoder to identify subjective words as part of the generation process. An interpretable and controllable modular algorithm separates these steps, using (1) a BERT-based classifier to identify problematic words and (2) a novel join embedding through which the classifier can edit the hidden states of the encoder. Large-scale human evaluation across four domains (encyclopedias, news headlines, books, and political speeches) suggests that these algorithms are a first step towards the automatic identification and reduction of bias.
11

Cardoso, Paula C. F., Thiago A. S. Pardo, and Maite Taboada. "Subtopic annotation and automatic segmentation for news texts in Brazilian Portuguese." Corpora 12, no. 1 (April 2017): 23–54. http://dx.doi.org/10.3366/cor.2017.0108.

Abstract:
Subtopic segmentation aims to break documents into subtopical text passages, which develop a main topic in a text. Being capable of automatically detecting subtopics is very useful for several Natural Language Processing applications. For instance, in automatic summarisation, having the subtopics at hand enables the production of summaries with good subtopic coverage. Given the usefulness of subtopic segmentation, it is common to assemble a reference-annotated corpus that supports the study of the envisioned phenomena and the development and evaluation of systems. In this paper, we describe the subtopic annotation process in a corpus of news texts written in Brazilian Portuguese, following a systematic annotation process and answering the main research questions when performing corpus annotation. Based on this corpus, we propose novel methods for subtopic segmentation following patterns of discourse organisation, specifically using Rhetorical Structure Theory. We show that discourse structures mirror the subtopic changes in news texts. An important outcome of this work is the freely available annotated corpus, which, to the best of our knowledge, is the only one for Portuguese. We demonstrate that some discourse knowledge may significantly help to find boundaries automatically in a text. In particular, the relation type and the level of the tree structure are important features.
12

Pan, Feng, Rutu Mulkar-Mehta, and Jerry R. Hobbs. "Annotating and Learning Event Durations in Text." Computational Linguistics 37, no. 4 (December 2011): 727–52. http://dx.doi.org/10.1162/coli_a_00075.

Abstract:
This article presents our work on constructing a corpus of news articles in which events are annotated for estimated bounds on their duration, and automatically learning from this corpus. We describe the annotation guidelines, the event classes we categorized to reduce gross discrepancies in inter-annotator judgments, and our use of normal distributions to model vague and implicit temporal information and to measure inter-annotator agreement for these event duration distributions. We then show that machine learning techniques applied to this data can produce coarse-grained event duration information automatically, considerably outperforming a baseline and approaching human performance. The methods described here should be applicable to other kinds of vague but substantive information in texts.
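
The paper's use of normal distributions over duration judgments can be illustrated with a small overlap computation (a toy numeric approximation; the authors work with distributions over log durations, and their agreement measure may differ in detail):

    import math

    def normal_pdf(x, mu, sigma):
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    def overlap(mu1, s1, mu2, s2, lo, hi, steps=10000):
        """Approximate the shared area of two normal densities on [lo, hi]."""
        dx = (hi - lo) / steps
        return sum(
            min(normal_pdf(lo + i * dx, mu1, s1), normal_pdf(lo + i * dx, mu2, s2))
            for i in range(steps)
        ) * dx

    # Two annotators' duration estimates for one event (hypothetical log-second values):
    print(round(overlap(4.0, 1.0, 5.0, 1.2, 0.0, 10.0), 3))  # agreement in [0, 1]
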
13

Károly, Krisztina. "Shifts in repetition vs. shifts in text meaning." Target. International Journal of Translation Studies 22, no. 1 (June 30, 2010): 40–70. http://dx.doi.org/10.1075/target.22.1.04kar.

Abstract:
This study focuses on the discoursal role of repetition, exploring the way shifts in repetition patterns in text trigger coherence shifts, altering the meaning potential of translations. As repetition in translation has been hypothesized to be affected by certain universals of translation, the paper also offers initial data to support the universals of explicitation and avoiding repetition. Lexical repetitions are investigated using Hoey’s (1991) theory in a corpus of Hungarian—English news texts. Analyses reveal considerable shifts in repetition in translations; however, these differences are not statistically significant. The corpus also provides evidence for repetition shifts affecting the macropropositional structure of target texts, leading to macropropositional shifts, which alter the global meaning of translations compared to sources.
14

Ling Lee, Joanna Chiew, Phoey Lee Teh, Sian Lun Lau, and Irina Pak. "Compilation of malay criminological terms from online news." Indonesian Journal of Electrical Engineering and Computer Science 15, no. 1 (July 1, 2019): 355. http://dx.doi.org/10.11591/ijeecs.v15.i1.pp355-364.

Abstract:
A Malay language corpus has been established by the Institute of Language and Literature (Dewan Bahasa dan Pustaka, DBP) in Malaysia. Most of the past research on the Malay language corpus has focused on the description, lexicography and translation of the Malay language. However, in the existing literature, there is no list of Malay words that categorizes crime terminologies. This study aims to fill that linguistic gap. First, we aggregated the most frequently used crime terminology words from Malaysian online news sources. Five hundred crime-related words were compiled. No automatic machines were used in the initial process, but they were subsequently employed to verify the data. Four human coders were used to validate the data and ensure the originality of the semantic understanding of the Malay text. Finally, major crime terminologies were outlined from a set of keywords to serve as taggers in our solution. The ultimate goal of this study is to provide a corpus for forensic linguistics, police investigations, and general crime research. This study has established the first corpus of criminological text in the Malay language.
15

SCOTT, Mike. "A Parser for News Downloads." DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada 34, no. 1 (March 2018): 1–16. http://dx.doi.org/10.1590/0102-445083054975354211.

Abstract:
This paper presents the Download Parser, a tool for handling text downloads from large online databases. Many universities have access to full-text databases which allow the user to search their holdings and then view and ideally download the full text of relevant articles, but there are important problems in practice in managing such downloads, because of factors such as duplication, unevenness of formatting standards, and lack of documentation. The tool under discussion was devised to parse downloads, clean them up and standardise them, identify headlines and insert suitably marked-up headers for corpus analysis.
16

Mahi, Gurjot Singh, and Amandeep Verma. "Development of Focused Crawlers for Building Large Punjabi News Corpus." Journal of ICT Research and Applications 15, no. 3 (December 28, 2021): 205–15. http://dx.doi.org/10.5614/itbj.ict.res.appl.2021.15.3.1.

Abstract:
Web crawlers are as old as the Internet and are most commonly used by search engines to visit websites and index them into repositories. They are not limited to search engines but are also widely utilized to build corpora in different domains and languages. This study developed a focused set of web crawlers for three Punjabi news websites. The web crawlers were developed to extract quality text articles and add them to a local repository to be used in further research. The crawlers were implemented using the Python programming language and were utilized to construct a corpus of more than 134,000 news articles in nine different news genres. The crawler code and extracted corpora were made publicly available to the scientific community for research purposes.
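
Since the paper's crawlers are written in Python, a minimal focused-crawler skeleton in the same spirit may be useful (our sketch; the seed URL, the article-body selector and the politeness delay are assumptions, not the authors' values):

    import time
    import requests
    from bs4 import BeautifulSoup

    SEED = "https://example-news-site.test/latest"  # hypothetical seed URL

    def crawl(seed, max_pages=100, delay=1.0):
        seen, frontier, articles = set(), [seed], []
        site = seed.rsplit("/", 1)[0]
        while frontier and len(articles) < max_pages:
            url = frontier.pop(0)
            if url in seen:
                continue
            seen.add(url)
            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
            body = soup.find("div", class_="article-body")  # site-specific guess
            if body:
                articles.append(body.get_text(" ", strip=True))
            frontier += [a["href"] for a in soup.find_all("a", href=True)
                         if a["href"].startswith(site)]  # stay on the target site
            time.sleep(delay)  # politeness delay between requests
        return articles
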
17

Yoko, Kuncoro, Viny Christanti Mawardi, and Janson Hendryli. "SISTEM PERINGKAS OTOMATIS ABSTRAKTIF DENGAN MENGGUNAKAN RECURRENT NEURAL NETWORK." Computatio : Journal of Computer Science and Information Systems 2, no. 1 (May 22, 2018): 65. http://dx.doi.org/10.24912/computatio.v2i1.1481.

Abstract:
Abstractive text summarization tries to create a shorter version of a text while preserving its meaning. We try to use a Recurrent Neural Network (RNN) to create summaries of Bahasa Indonesia text. We obtained our corpus from the Detik and Kompas news sites. We used word2vec to create word embeddings from our corpus and then trained our data set with an RNN to create a model. This model is used to generate news summaries. We searched for the best model by varying the word2vec size and the number of RNN hidden states. We used system evaluation and Q&A evaluation to assess our model. System evaluation showed that the model with a 6457-document data set, a word2vec size of 200, and 256 RNN hidden states gives the best accuracy, at 99.8810%. This model was then assessed by Q&A evaluation, which showed that it gives 46.65% accuracy.
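
The word2vec step described above looks roughly like this with gensim (illustrative; the toy sentences are invented, and only the 200-dimension setting comes from the abstract):

    from gensim.models import Word2Vec

    # Toy tokenised Indonesian news sentences (hypothetical data).
    sentences = [["contoh", "kalimat", "berita"], ["teks", "berita", "lain"]]
    w2v = Word2Vec(sentences, vector_size=200, window=5, min_count=1)
    vector = w2v.wv["berita"]  # 200-dimensional embedding fed to the RNN encoder
    print(vector.shape)        # (200,)
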
18

Kondath, Manju, David Peter Suseelan, and Sumam Mary Idicula. "Extractive summarization of Malayalam documents using latent Dirichlet allocation: An experience." Journal of Intelligent Systems 31, no. 1 (January 1, 2022): 393–406. http://dx.doi.org/10.1515/jisys-2022-0027.

Abstract:
Automatic text summarization (ATS) extracts information from a source text and presents it to the user in a condensed form while preserving its primary content. Many text summarization approaches have been investigated in the literature for highly resourced languages. At the same time, ATS is a complicated and challenging task for under-resourced languages like Malayalam. The lack of a standard corpus and enough processing tools are challenges when it comes to language processing. In the absence of a standard corpus, we have developed a dataset consisting of Malayalam news articles. This article proposes an extractive topic modeling-based multi-document text summarization approach for Malayalam news documents. We first cluster the contents based on latent topics identified using the latent Dirichlet allocation topic modeling technique. Then, by adopting the vector space model, the topic vector and sentence vector of the given document are generated. According to the relevant status value, sentences are ranked between the document's topic and sentence vectors. The summary obtained is optimized for non-redundancy. Evaluation results on Malayalam news articles show that the summary generated by the proposed method is closer to the human-generated summaries than the existing text summarization methods.
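
A compact sketch of the LDA-plus-vector-space ranking described above (our reconstruction with gensim; the topic count, tokenisation and the use of cosine similarity as the "relevant status value" are assumptions):

    import numpy as np
    from gensim import corpora, models

    def summarize(sentences, num_topics=3, k=2):
        tokens = [s.lower().split() for s in sentences]
        dictionary = corpora.Dictionary(tokens)
        bows = [dictionary.doc2bow(t) for t in tokens]
        lda = models.LdaModel(bows, num_topics=num_topics, id2word=dictionary)

        def topic_vec(bow):
            v = np.zeros(num_topics)
            for tid, w in lda.get_document_topics(bow, minimum_probability=0.0):
                v[tid] = w
            return v

        doc_vec = topic_vec(dictionary.doc2bow([w for t in tokens for w in t]))

        def rsv(i):  # cosine similarity between sentence and document topic vectors
            v = topic_vec(bows[i])
            return float(v @ doc_vec) / (np.linalg.norm(v) * np.linalg.norm(doc_vec) + 1e-9)

        top = sorted(range(len(sentences)), key=rsv, reverse=True)[:k]
        return [sentences[i] for i in sorted(top)]  # keep original sentence order
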
19

Lyashevskaya, Olga, Victor Bocharov, Alexey Sorokin, Tatiana Shavrina, Dmitry Granovsky, and Svetlana Alexeeva. "Text collections for evaluation of Russian morphological taggers." Journal of Linguistics/Jazykovedný casopis 68, no. 2 (December 1, 2017): 258–67. http://dx.doi.org/10.1515/jazcas-2017-0035.

Abstract:
The paper describes the preparation and development of the text collections within the framework of MorphoRuEval-2017, a shared task and evaluation campaign designed to stimulate the development of automatic morphological processing technologies for Russian. The main challenge for the organizers was to standardize all available Russian corpora with manually verified high-quality tagging to a single format (Universal Dependencies CONLL-U). The sources of the data were the disambiguated subcorpus of the Russian National Corpus, SynTagRus, OpenCorpora.org data and the GICR corpus with resolved homonymy, all exhibiting different tagsets, rules for lemmatization, pipeline architectures, technical solutions and error systematicity. The collections include both normative texts (news and modern literature) and more informal discourse (social media and spoken data); the texts are available under a CC BY-NC-SA 3.0 license.
20

Alruily, Meshrif. "Issues of Dialectal Saudi Twitter Corpus." International Arab Journal of Information Technology 17, no. 3 (May 1, 2019): 367–74. http://dx.doi.org/10.34028/iajit/17/3/10.

Abstract:
Text mining research relies heavily on the availability of a suitable corpus. This paper presents a dialectal Saudi corpus that contains 207,452 tweets generated by Saudi Twitter users. In addition, a comparison was carried out between the Saudi tweets dataset, an Egyptian Twitter corpus and an Arabic top news raw corpus (representing Modern Standard Arabic (MSA)) in various aspects, such as the differences between formal and colloquial texts. Moreover, an investigation into issues and phenomena such as shortening, concatenation, colloquial language, compounding, foreign language, spelling errors and neologisms in this type of dataset was performed.
21

Kanekar, Saurabh A., Alind Sharma, Gaurang S. Patkar, and Amey K. Shet Tilve. "Building semantically annotated corpus for text classification of Indian defence news articles." International Journal of Information Technology 13, no. 4 (June 17, 2021): 1539–44. http://dx.doi.org/10.1007/s41870-021-00679-x.

22

Din, Muhammad, and Mamuna Ghani. "Corpus-based Study of Identifying Verb Patterns Used in Pakistani Newspaper Headlines." Theory and Practice in Language Studies 10, no. 2 (February 1, 2020): 149. http://dx.doi.org/10.17507/tpls.1002.02.

Abstract:
Newspaper headlines are an important subgenre of the media genre and enjoy much significance in news discourse. Headlines are ascribed different functions, as they are the opening section of their respective texts. This corpus-driven study strives to identify the verb patterns used in Pakistani newspaper headlines. To do so, the researcher compiled a corpus of 3135 newspaper headlines consisting of 28,646 words drawn from three online Pakistani English newspapers: The Dawn, The Nation and The News. The researcher tagged this corpus using the software TagAnt and analyzed it with the corpus tool AntConc to identify the verb patterns used in these Pakistani English newspaper headlines. To this end, the researcher analyzed the compiled corpus in accordance with the POS tags given by the TreeTagger tag set. The study identifies the different verb patterns used in newspaper headlines.
23

Oktavianti, Ikmi Nur, and Novi Retno Ardianti. "A CORPUS-BASED ANALYSIS OF VERBS IN NEWS SECTION OF THE JAKARTA POST: HOW FREQUENCY IS RELATED TO TEXT CHARACTERISTICS." JOALL (Journal of Applied Linguistics & Literature) 4, no. 2 (August 27, 2019): 203–14. http://dx.doi.org/10.33369/joall.v4i2.7623.

Abstract:
Verbs are one of the most important word classes in linguistic constructions due to their prominent role and dynamic nature. Interestingly, the use of verbs may vary across linguistic contexts, because the context can limit or allow certain verbs to occur more frequently than others. It is therefore compelling to study the use of verbs in a particular linguistic context. This paper aims at examining the use of verbs in the news section of The Jakarta Post to establish the frequency of verb types and how it relates to the characteristics of news text. The study compiled a Jakarta Post corpus comprising news articles belonging to the category of hard news from October to December 2018, with a total size of 21,682 words. The verb types used in this study are those compiled by Scheibmann (combining Halliday's verb taxonomy and Dixon's verb types). Based on the analysis, the verbal type is the most frequent, followed by the material and existential types; the least frequent are the corporeal and perception/relational types. It is plausible that the verbal type occupies the most frequent position because the nature of news text is to deliver information, and it thus needs verbal verbs quite often. Likewise, material verbs are frequent because they state concrete actions, and existential verbs denote existence; both are vital in constructing news text. Meanwhile, corporeal and perception/relational types are least frequent because corporeal verbs deal with bodily actions and perception/relational verbs express subjectivity, both rather insignificant concepts in news writing. Based on the results of the analysis, there is a firm relation between the frequency of verbs used in news text and the characteristics of the text: linguistic units that are not in accordance with the function of the text are not really needed and are thus used infrequently.
24

Tripathi, Rajeev. "PERFECTION OF CLASSIFICATION ACCURACY IN TEXT CATEGORIZATION." International Journal of Advanced Research 9, no. 09 (September 30, 2021): 484–88. http://dx.doi.org/10.21474/ijar01/13437.

Abstract:
Problems and strategies for text classification have been known for a long time. They're widely utilised by companies like Google and Yahoo for email spam screening, sentiment analysis of Twitter data, and automatic news categorisation in Google Alerts. We're still working on getting the findings to be as accurate as possible. When dealing with large amounts of text data, however, the model's performance and accuracy become a difficulty. The type of words utilised in the corpus and the type of features produced for classification have a big impact on the performance of a text classification model.
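
The point about feature choice can be made concrete with a generic scikit-learn pipeline (a sketch with invented examples, not the paper's setup):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["stocks fall on rate fears", "team wins the championship final"]
    labels = ["business", "sports"]  # hypothetical training labels
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    clf.fit(texts, labels)
    print(clf.predict(["market rallies after rate cut"]))  # likely ['business']
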
25

Deuber, Dagmar. "“First year of nation’s return to government of make you talk your own make I talk my own”." English World-Wide 23, no. 2 (December 20, 2002): 195–222. http://dx.doi.org/10.1075/eww.23.2.03deu.

Abstract:
As a language which for the greater part of its history was used only for simple everyday interactions and which lacks any kind of standardization, Nigerian Pidgin (NigP) is not well equipped for the wide range of functions it has to perform in present-day Nigeria. Among educated NigP speakers, borrowing from English is a common strategy, but broadcasters who translate news from English into NigP have to produce a form of the language that will be intelligible to a target audience whose command of English is limited. The paper offers a discussion of this problem based on a corpus of spoken NigP comprising news and several other text categories. Text samples from the news texts are analysed, and corpus data illustrating Anglicisms and pidginization on the lexical, grammatical and discourse levels are discussed. In addition, the results of an elicitation experiment in which Nigerian informants were asked to evaluate extracts from the corpus by means of a questionnaire are reported. The news texts were found to be less satisfactory than others, and it is argued that this is due not only to Anglicisms but in some cases also to an overuse of pidginization strategies. However, there are also examples of successful adaptation of an English script, and it is argued that even with only a moderate degree of language engineering, one could build on such achievements to make NigP a more viable medium of news broadcasting.
26

Lakshika, M. V. P. T., and H. A. Caldera. "Knowledge Graphs Representation for Event-Related E-News Articles." Machine Learning and Knowledge Extraction 3, no. 4 (September 26, 2021): 802–18. http://dx.doi.org/10.3390/make3040040.

Abstract:
E-newspaper readers are overloaded with the massive text of e-news articles, which often misleads the reader who reads and tries to understand the information. Thus, there is an urgent need for a technology that can automatically represent the gist of these e-news articles more quickly. Currently, popular machine learning approaches have greatly improved presentation accuracy compared to traditional methods, but they cannot accommodate the contextual information needed to acquire higher-level abstraction. Recent research efforts in knowledge representation using graph approaches are neither user-driven nor flexible to deviations in the data. Thus, there is a striking concentration on constructing knowledge graphs by combining the background information related to the subjects in text documents. We propose an enhanced representation of a scalable knowledge graph by automatically extracting the information from the corpus of e-news articles, and we determine whether a knowledge graph can be used as an efficient application in analyzing and generating knowledge representation from the extracted e-news corpus. This knowledge graph consists of a knowledge base built using triples that automatically produce knowledge representation from e-news articles. Inclusively, it has been observed that the proposed knowledge graph generates a comprehensive and precise knowledge representation for the corpus of e-news articles.
27

Ethelb, Hamza. "Changing the Structure of Paragraphs and Texts in Arabic: A Case from News Reporting." International Journal of Comparative Literature and Translation Studies 7, no. 3 (July 31, 2019): 8. http://dx.doi.org/10.7575/aiac.ijclts.v.7n.3p.8.

Abstract:
This study explores the textual alterations of Arabic news structure and how it has been influenced by news texts produced in English. The paper specifically examines sentence, paragraph and text structures in terms of form and content in relation to news translation. It analyses news articles collated from the Aljazeera and Al-Arabiya news networks. The collated corpus consists of translations from English into Arabic by these two media outlets. The analysis showed considerable changes in the form of Arabic textual structures, especially in the general layout of texts. Although it confirms Hatim's (1997) text-type categorisation with regard to argumentation in Arabic news texts, namely that Arabic lacks argumentative elements in its news content, it exhibited significant shifts in internal cohesion, paragraph transitions, and syntactic patterns. These changes could emanate from many other influencing factors, but translation is definitely one.
28

Pala, Mythilisharan, Laxminarayana Parayitam, and Venkataramana Appala. "Unsupervised stemmed text corpus for language modeling and transcription of Telugu broadcast news." International Journal of Speech Technology 23, no. 3 (September 2020): 695–704. http://dx.doi.org/10.1007/s10772-020-09749-0.

29

Wong, Dora. "A Corpus-Based Study of Peer Comments and Self-Reflections." International Journal of Online Pedagogy and Course Design 8, no. 4 (October 2018): 65–90. http://dx.doi.org/10.4018/ijopcd.2018100105.

Abstract:
Engaging students in peer reviewing in the writing classroom has been widely practiced as a form of assessment for learning. In-depth research is needed, however, to investigate how students specifically use peer comments in their editing process. Using a corpus-based approach, this article investigates the acquisition of journalistic writing skills by 112 undergraduates in Hong Kong. The learner corpora comprise student comments and self-reflections from an online news writing project. While grammatical accuracy remains a concern for the effectiveness of the peer review practice, the findings reflect a sound understanding of the structure, layout and style of the online news genre among the participants. Although the students commented more on the contents and organization of news writing, findings from keyword searches and co-text in the concordances demonstrate awareness of the main features of the online news genre. The findings further clarified the judgements and choices made by the ESL learners during the drafting and editing processes. The study suggests how peer assessment and self-assessment can be effectively practiced through a cycle of reviewing peer writing, receiving peer comments and self-reflecting on one's own drafts. It also indicates how peer review may help the acquisition of style and lexico-grammar, which can be demanding for many ESL learners.
30

Song, Min. "Critical Discourse Analysis of Chinese English News Reports from the Perspective of Ecolinguistics." BCP Social Sciences & Humanities 14 (December 17, 2021): 238–47. http://dx.doi.org/10.54691/bcpssh.v14i.214.

Abstract:
This study makes a critical discourse analysis of the ecological and non-ecological features of Chinese English news reports and holds that news reporting ultimately serves the interest groups it represents and cannot escape their ideological influence. With Fairclough's three-dimensional model as the theoretical framework and from the perspective of ecolinguistics, the study builds a small corpus of 29 English news reports on the northward migration of Asian elephants from China Daily and, combining qualitative and quantitative research methods, analyzes the selected corpus with the help of AntConc 3.5.9 (Windows) 2020. Through text analysis, discourse interpretation, and social interpretation, it is found that Chinese English news reports have both ecological and non-ecological features. At the same time, English news reports in China also try to express and build a harmonious relationship between man and nature. This reflects that China follows the ecological order concept of "harmonious co-existence between man and nature" and the ecological basis that "man and nature are a community of life".
31

Fišer, Darja, Tomaž Erjavec, and Nikola Ljubešić. "JANES v0.4: Korpus slovenskih spletnih uporabniških vsebin." Slovenščina 2.0: empirical, applied and interdisciplinary research 4, no. 2 (September 27, 2016): 67. http://dx.doi.org/10.4312/slo2.0.2016.2.67-100.

Abstract:
The paper presents the current version of the Slovene corpus of netspeak Janes which contains tweets, forum posts, news comments, blogs and blog comments, and user and talk pages from Wikipedia. First, we describe the harvesting procedure for each data source and provide a quantitative analysis of the corpus. Next, we present automatic and manual procedures for enriching the corpus with metadata, such as user type, gender and region, and text sentiment and standardness level. Finally, we give a detailed account of the linguistic annotation workflow which includes tokenization, sentence segmentation, rediacritisation, normalization, morphosyntactic tagging and lemmatization.
32

Chen, Yanxin, and Qinling Jing. "Semantic Study on Network News Texts in Mode of “Distant Reading”." BCP Social Sciences & Humanities 14 (December 17, 2021): 256–68. http://dx.doi.org/10.54691/bcpssh.v14i.226.

Abstract:
The corpus adopted in this study comes from official news texts of Chinese and foreign network media, collected and processed by the researchers. Using Voyant, a web-based text reading and analysis platform, the study identifies and analyzes the semantic differences of the lexical chunk "Chinese culture" in Chinese and foreign news stories under the semantic view of systemic-functional grammar, with the digital humanities mode of "distant reading" as the means of semantic analysis. The study explores the implicit semantic deviation between Chinese and foreign news texts and its logical semantic relationships.
33

Studer, Patrick. "Textual structures in eighteenth-century newspapers." Media and Language Change 4, no. 1 (January 31, 2003): 19–44. http://dx.doi.org/10.1075/jhp.4.1.03stu.

Abstract:
Newspapers have recently become attractive objects of interest to linguists, but little research has been done thus far on news discourse of the seventeenth and eighteenth centuries. The present study contributes to filling this gap by reporting results from a corpus-based study of early English-language newspaper headlines. The analysis reveals that the modern segmentation of news into the three elements of headline, lead, and news story cannot be applied to forerunners of modern newspapers. Instead, a classification model is proposed that takes account of the specific properties of the genre. The physical organisation of early newspapers is first considered, so as to be able to identify typographical categories of headings. In a second step, the intended textual functions of headlines are identified, along with typical correlations of headline forms and functions. Applying these categories to an eighteenth-century corpus reveals general tendencies of text structuring in early newspapers.
34

Olaleye, Taiwo Olapeju. "Veracity Assessment of Multimedia Facebook Posts for Infodemic Symptom Detection using Bi-modal Unsupervised Machine Learning Approach." International Journal for Research in Applied Science and Engineering Technology 9, no. 12 (December 31, 2021): 2234–41. http://dx.doi.org/10.22214/ijraset.2021.39406.

Abstract:
Ascertaining the truthfulness and trustworthiness of information posted on social media has been challenging with the proliferation of unsubstantiated, misleading, and inciting news posted by purveyors with different intents. Unlike the traditional media, which operates under some level of regulation, user-generated posts on social networks do not pass through censorship to establish the truth of news items, hence the need to be cautious about information posted on the networks. The lingering issue of the recent suspension of the Twitter microblogging site by the Nigerian government and the consequent decision to regulate social network operations in the country similarly centers on the subject of social media dependability for legitimate social engagement by millions of savvy Nigerian users. Whereas existing models in the literature have proposed state-of-the-art solutions, this study seeks to improve on available studies with a bi-modal machine learning methodology that indicates symptoms of infodemic social media posts. Using a multimedia Facebook corpus, an unsupervised natural language processor, the Inception v3 model, coupled with a hierarchical clustering network, is deployed for both image and text sentiment analytics. Experimental results uniquely identified infodemic tendencies in the Facebook text corpus and efficiently differentiated the image corpus into respective clusters through the Euclidean distance metric. The most infodemic post returned a -0.9719 compound score while the most positive post returned 0.9488. Veracity assessment of polarized opinions expressed in negative clusters reveals that provocative, derogatory, and obnoxious posts indicate a propensity for infodemic tendencies. Keywords: Fake news. Facebook. Social media. Sentiment Analysis. Infodemic
35

Pattnaik, Sagarika, and Ajit Kumar Nayak. "A Modified Markov-Based Maximum-Entropy Model for POS Tagging of Odia Text." International Journal of Decision Support System Technology 14, no. 1 (January 2022): 1–24. http://dx.doi.org/10.4018/ijdsst.286690.

Abstract:
POS (parts of speech) tagging, a vital step in diverse Natural Language Processing (NLP) tasks, has not drawn much attention in the case of Odia, a computationally under-developed language. The proposed hybrid method suggests a robust POS tagger for Odia. Given the rich morphology of the language and the unavailability of a sufficiently large annotated text corpus, a combination of machine learning and linguistic rules is adopted in building the tagger. The tagger is trained on a tagged text corpus from the domain of tourism and is capable of obtaining a perceptible improvement in the result. An appreciable performance is also observed for news article texts from varied domains. The performance of the proposed algorithm on Odia shows that it dominates existing methods such as rule-based, hidden Markov model (HMM), maximum entropy (ME) and conditional random field (CRF) approaches.
36

Satyapanich, Taneeya, Francis Ferraro, and Tim Finin. "CASIE: Extracting Cybersecurity Event Information from Text." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 8749–57. http://dx.doi.org/10.1609/aaai.v34i05.6401.

Abstract:
We present CASIE, a system that extracts information about cybersecurity events from text and populates a semantic model, with the ultimate goal of integration into a knowledge graph of cybersecurity data. It was trained on a new corpus of 1,000 English news articles from 2017–2019 that are labeled with rich, event-based annotations and that covers both cyberattack and vulnerability-related events. Our model defines five event subtypes along with their semantic roles and 20 event-relevant argument types (e.g., file, device, software, money). CASIE uses different deep neural networks approaches with attention and can incorporate rich linguistic features and word embeddings. We have conducted experiments on each component in the event detection pipeline and the results show that each subsystem performs well.
37

Wu, Xingsu, and Jinhui Chen. "Text Classification on Large Scale Chinese News Corpus using Character-level Convolutional Neural Network." Journal of Physics: Conference Series 1693 (December 2020): 012171. http://dx.doi.org/10.1088/1742-6596/1693/1/012171.

38

Chiang, Jung-Hsien, and Yan-Cheng Chen. "An intelligent news recommender agent for filtering and categorizing large volumes of text corpus." International Journal of Intelligent Systems 19, no. 3 (2004): 201–16. http://dx.doi.org/10.1002/int.10136.

39

Pfeilstetter, Richard. "Nations in news." Pragmatics and Society 8, no. 4 (December 31, 2017): 477–97. http://dx.doi.org/10.1075/ps.15060.pfe.

Abstract:
This contribution investigates the stereotyping of nations in TV news text. It compares the headline appearances of the names Germany and Spain on each other's leading national evening TV news program during the peak of the European financial crisis (2011–13). The paper combines quantitative analysis of word-frequency and topic-distribution in a 621 headline-corpus, with in-depth case analysis of news values underpinning 32 extracted headline examples. A discussion of literature in media anthropology and Critical Discourse Analysis concludes with the argument that intentions and consequences of media discourse should be separated, whereas differences between ordinary and official language should not be overvalued. The case study shows how the textual display of Germans and Spaniards supports the everyday imagining of national belonging, how othering works through the labelling of nations as "economies", and how negativity, competition and relatedness are prevailing values underlying the examined news headlines.
40

Nguyen, Vu H., Hien T. Nguyen, Hieu N. Duong, and Vaclav Snasel. "n-Gram-Based Text Compression." Computational Intelligence and Neuroscience 2016 (2016): 1–11. http://dx.doi.org/10.1155/2016/9483646.

Abstract:
We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It has a significant compression ratio in comparison with those of state-of-the-art methods on the same dataset. Given a text, first, the proposed method splits it into n-grams and then encodes them based on n-gram dictionaries. In the encoding phase, we use a sliding window with a size that ranges from bigram to five grams to obtain the best encoding stream. Each n-gram is encoded by two to four bytes accordingly based on its corresponding n-gram dictionary. We collected a 2.5 GB text corpus from some Vietnamese news agencies to build n-gram dictionaries from unigram to five grams and achieved dictionaries with a size of 12 GB in total. In order to evaluate our method, we collected a testing set of 10 different text files with different sizes. The experimental results indicate that our method achieves a compression ratio of around 90% and outperforms state-of-the-art methods.
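
A toy version of the longest-match dictionary encoding described above (our sketch; the paper's byte-level output format and dictionary construction are more involved):

    def encode(words, dicts):
        """dicts maps n -> {ngram_tuple: integer_code}, built from a reference corpus."""
        out, i = [], 0
        while i < len(words):
            for n in range(5, 0, -1):                # prefer the longest match (5-gram first)
                gram = tuple(words[i:i + n])
                if len(gram) == n and gram in dicts.get(n, {}):
                    out.append((n, dicts[n][gram]))  # (dictionary id, code) pair
                    i += n
                    break
            else:
                out.append((0, words[i]))            # literal fallback for unseen words
                i += 1
        return out

    # Hypothetical dictionaries and input:
    dicts = {2: {("hello", "world"): 7}, 1: {("hello",): 1, ("world",): 2}}
    print(encode(["hello", "world", "again"], dicts))  # [(2, 7), (0, 'again')]
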
41

Barberá, Pablo, Amber E. Boydstun, Suzanna Linn, Ryan McMahon, and Jonathan Nagler. "Automated Text Classification of News Articles: A Practical Guide." Political Analysis 29, no. 1 (June 9, 2020): 19–42. http://dx.doi.org/10.1017/pan.2020.8.

Abstract:
Automated text analysis methods have made possible the classification of large corpora of text by measures such as topic and tone. Here, we provide a guide to help researchers navigate the consequential decisions they need to make before any measure can be produced from the text. We consider, both theoretically and empirically, the effects of such choices using as a running example efforts to measure the tone of New York Times coverage of the economy. We show that two reasonable approaches to corpus selection yield radically different corpora and we advocate for the use of keyword searches rather than predefined subject categories provided by news archives. We demonstrate the benefits of coding using article segments instead of sentences as units of analysis. We show that, given a fixed number of codings, it is better to increase the number of unique documents coded rather than the number of coders for each document. Finally, we find that supervised machine learning algorithms outperform dictionaries on a number of criteria. Overall, we intend this guide to serve as a reminder to analysts that thoughtfulness and human validation are key to text-as-data methods, particularly in an age when it is all too easy to computationally classify texts without attending to the methodological choices therein.
42

Mohamed, Sally, Mahmoud Hussien, and Hamdy M. Mousa. "ADPBC: Arabic Dependency Parsing Based Corpora for Information Extraction." International Journal of Information Technology and Computer Science 13, no. 1 (February 8, 2021): 54–61. http://dx.doi.org/10.5815/ijitcs.2021.01.04.

Abstract:
There is a massive amount of different information and data on the World Wide Web, and the number of Arabic users and contents is increasing widely. Information extraction is an essential issue for accessing and sorting the data on the web. In this regard, information extraction becomes a challenge, especially for languages that have a complex morphology, like Arabic. Consequently, the trend today is to build new corpora that make information extraction easier and more precise. This paper presents an Arabic linguistically analyzed corpus, including dependency relations. The collected data cover five fields: sport, religion, weather, news and biomedicine. The output is the CoNLL universal lattice file format (CoNLL-UL). The corpus contains an index for the sentences and their linguistic meta-data to enable quick mining and search across the corpus. The corpus has seventeen morphological annotations and eight features based on the identification of textual structures, which help to recognize and understand the grammatical characteristics of the text and to derive the dependency relations. The parsing and dependency process was conducted with the universal dependency model and corrected manually. The results illustrated the enhancement in the dependency relation corpus. The designed Arabic corpus helps to quickly obtain linguistic annotations for a text and makes information extraction techniques easy and clear to learn.
43

Petrasova, Svitlana, Nina Khairova, and Anastasiia Kolesnyk. "TECHNOLOGY FOR IDENTIFICATION OF INFORMATION AGENDA IN NEWS DATA STREAMS." Bulletin of National Technical University "KhPI". Series: System Analysis, Control and Information Technologies, no. 1 (5) (July 12, 2021): 86–90. http://dx.doi.org/10.20998/2079-0023.2021.01.14.

Abstract:
Currently, the volume of news data streams is growing, which contributes to increasing interest in systems that automate the processing of big data streams. Based on intelligent data processing tools, the identification of semantic similarity in text information makes it possible to select the common information spaces of news. The article analyzes up-to-date statistical metrics for identifying coherent fragments, in particular from news texts displaying the agenda, and identifies their main advantages and disadvantages. An information technology is proposed for identifying the common information space of relevant news in a data stream for a certain period of time. The technology includes logical-linguistic and distributive-statistical models for identifying collocations. The MI distributional semantic model is applied at the stage of potential collocation extraction. At the same time, regular expressions developed in accordance with the grammar of the English language make it possible to identify grammatically correct constructions. The advantage of the developed logical-linguistic model, which formalizes the semantic-grammatical characteristics of collocations based on the use of algebraic predicate operations and a semantic equivalence predicate, is that both the grammatical structure of the language and the meaning of words (collocates) are analyzed. The WordNet thesaurus is used to determine the synonymy relationship between the main and dependent collocation components. Based on the investigated corpus of news texts from the CNN and BBC services, the effectiveness of the developed technology is assessed. The analysis shows that the precision coefficient is 0.96. The use of the proposed technology could improve the quality of news stream processing. The solution to the problem of automatic identification of semantic similarity can be used to identify texts of the same domain and relevant information, extract facts, eliminate semantic ambiguity, etc. Keywords: data stream, agenda, logical-linguistic model, distribution-statistical model, collocation, semantic similarity, WordNet, news text corpus, precision.
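
For reference, the pointwise mutual information score that typically underlies such an "MI distributional semantic model" (a standard formulation; the authors' exact variant may differ):

    import math

    def mi(count_xy, count_x, count_y, total):
        """MI(x, y) = log2(p(x, y) / (p(x) * p(y))), estimated from corpus counts."""
        p_xy = count_xy / total
        return math.log2(p_xy / ((count_x / total) * (count_y / total)))

    # Hypothetical counts: the pair occurs ~60x more often than independence
    # predicts, giving MI = log2(60) ≈ 5.91.
    print(round(mi(30, 1000, 500, 1_000_000), 2))
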
APA, Harvard, Vancouver, ISO, and other styles
45

Syahira, T. Aldilla, T. Silvana Sinar, and Masdiana Lubis. "Types of Modality in News Item is Used in the Texts News in the Jakarta Post Newspaper." Budapest International Research and Critics Institute (BIRCI-Journal): Humanities and Social Sciences 4, no. 1 (January 14, 2021): 66–71. http://dx.doi.org/10.33258/birci.v4i1.1537.

Full text
Abstract:
This research aimed to identify the types of modality used in news item texts in The Jakarta Post newspaper and to explain how the most dominant type is used. The research was conducted using corpus analysis as an appropriate tool for analyzing online written texts. The data were taken from The Jakarta Post newspaper as published online from 1 August to 31 December 2019, covering four themes: politics, education, sports and economic news. As a result, the researcher found two types of modality, modalization and modulation, each with two subtypes of intermediacy: probability and usuality for modalization, and obligation and inclination for modulation.
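For readers unfamiliar with this kind of corpus counting, a toy sketch follows; the marker lists are illustrative assumptions and are far cruder than a proper systemic-functional analysis.

```python
# Toy sketch: tallying modal markers by Halliday's two modality types.
# The marker lists are illustrative assumptions, not the study's coding scheme.
import re
from collections import Counter

MODALIZATION = {"probably", "possibly", "usually", "always", "may", "might"}
MODULATION = {"should", "ought", "will", "would", "want", "need"}

def modality_counts(text):
    tokens = Counter(re.findall(r"[a-z']+", text.lower()))
    return {
        "modalization": sum(tokens[w] for w in MODALIZATION),
        "modulation": sum(tokens[w] for w in MODULATION),
    }

print(modality_counts("The government will probably raise rates; "
                      "banks should prepare and markets may react."))
```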
APA, Harvard, Vancouver, ISO, and other styles
46

Hassan, Waqar, Nadia Perveen Thalho, and Yasmeen Mehboob. "Professional and Institutional Discourse: A Case Study of Media Discourse." International Journal of English Language Studies 3, no. 3 (March 29, 2021): 16–25. http://dx.doi.org/10.32996/ijels.2021.3.3.3.

Full text
Abstract:
Professional discourse has become established as a field over the last few decades; many applied linguists and discourse analysts have treated it in a scholarly way. The most striking work on professional discourse is The Construction of Professional Discourse (Gunnarson et al., 1997). The objective of this quantitative study was to identify professional discourse and define its types. For this purpose, data were collected from three Pakistani TV news channels: ARY Digital, Express News and Geo News. The data consisted of one week of recordings of the TV news channels. The audio recordings were transcribed into text format and compiled into a single corpus file. The corpus analysis tool AntConc version 3.5.9 was then used to obtain frequencies and concordances from the data. On the basis of the extracted concordances and frequencies, a descriptive analysis was carried out, followed by a qualitative analysis to extract the professional discourse from the media channels. The study's findings show that the media is a vast profession with its own particular vocabulary that identifies the profession, and that media discourse has specific domains and topics for discussion. These findings will help learners of sociolinguistics and discourse analysis in their case studies.
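The frequency and concordance steps that AntConc performs can be approximated programmatically; the sketch below does so with NLTK on an invented snippet of transcript (an illustration only, not the study's data or tooling).

```python
# Sketch: AntConc-style word frequencies and keyword-in-context concordance
# lines, reproduced with NLTK on a made-up transcript fragment.
import nltk

nltk.download("punkt")

transcript = ("The anchor said the breaking news tonight concerns the economy. "
              "Our correspondent reports live with more news from Islamabad.")
tokens = nltk.word_tokenize(transcript.lower())

# Word frequencies, as AntConc's Word List tool would report them.
print(nltk.FreqDist(tokens).most_common(5))

# Concordance (KWIC) lines for a search term, as in AntConc's Concordance tool.
nltk.Text(tokens).concordance("news", width=60)
```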
APA, Harvard, Vancouver, ISO, and other styles
47

Samadi, Mohammadreza, Maryam Mousavian, and Saeedeh Momtazi. "Persian Fake News Detection: Neural Representation and Classification at Word and Text Levels." ACM Transactions on Asian and Low-Resource Language Information Processing 21, no. 1 (January 31, 2022): 1–11. http://dx.doi.org/10.1145/3472620.

Full text
Abstract:
Nowadays, the broadcasting of news on social media and websites has grown at a swift pace, with negative impacts on both the general public and governments; this has urged us to build a fake news detection system. Contextualized word embeddings have achieved great success in recent years due to their power to embed both syntactic and semantic features of textual content. In this article, we aim to address the lack of Persian fake news datasets by introducing a new dataset crawled from different news agencies, and we propose two deep models based on the Bidirectional Encoder Representations from Transformers model (BERT), a deep contextualized pre-trained model, for extracting valuable features. Our proposed models benefit from two different settings of BERT: pool-based representation, which provides a representation of the whole document, and sequence representation, which provides a representation of each token of the document. In the former, we connect a Single-Layer Perceptron (SLP) to BERT to use the embedding directly for detecting fake news. The latter applies a Convolutional Neural Network (CNN) after BERT's embedding layer to extract extra features based on the collocation of words in the corpus. Furthermore, we present the TAJ dataset, a new Persian fake news dataset crawled from news agencies' websites. We evaluate our proposed models on the newly provided TAJ dataset as well as two different Persian rumor datasets as baselines. The results indicate the effectiveness of deep contextualized embedding approaches for the fake news detection task. We also show that both the BERT-SLP and BERT-CNN models achieve superior performance to the previous baselines and traditional machine learning models, with 15.58% and 17.1% improvements over the results reported by Zamani et al. [30], and 11.29% and 11.18% improvements over the results reported by Jahanbakhsh-Nagadeh et al. [9].
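The two settings can be pictured schematically as follows; this PyTorch sketch is an interpretation of the abstract, and the checkpoint name and hyperparameters are assumptions rather than the authors' released code.

```python
# Schematic sketch of the two settings described above (not the authors' code):
# pool-based representation -> SLP, and sequence representation -> CNN.
import torch
import torch.nn as nn
from transformers import AutoModel

MODEL = "HooshvareLab/bert-fa-base-uncased"  # assumed Persian BERT checkpoint

class BertSLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = AutoModel.from_pretrained(MODEL)
        self.slp = nn.Linear(self.bert.config.hidden_size, 2)  # fake vs. real

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.slp(out.pooler_output)  # one vector for the whole document

class BertCNN(nn.Module):
    def __init__(self, n_filters=128, kernel=3):
        super().__init__()
        self.bert = AutoModel.from_pretrained(MODEL)
        self.conv = nn.Conv1d(self.bert.config.hidden_size, n_filters, kernel)
        self.fc = nn.Linear(n_filters, 2)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        x = out.last_hidden_state.transpose(1, 2)       # one vector per token
        x = torch.relu(self.conv(x)).max(dim=2).values  # collocation features
        return self.fc(x)
```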
APA, Harvard, Vancouver, ISO, and other styles
48

Qi, Shanshan, Limin Zheng, and Feiyu Shang. "Dependency Parsing-based Entity Relation Extraction over Chinese Complex Text." ACM Transactions on Asian and Low-Resource Language Information Processing 20, no. 4 (June 9, 2021): 1–34. http://dx.doi.org/10.1145/3450273.

Full text
Abstract:
Open Relation Extraction (ORE) plays a significant role in the field of Information Extraction. It removes the limitation of traditional relation extraction, which must pre-define relation types in an annotated corpus and is restricted to specific domains, in order to extract entities and the relations between them in the open domain. However, as sentence complexity increases, the precision and recall of entity relation extraction drop significantly. To solve this problem, we present an unsupervised method, Clause_CORE, based on Chinese grammar and dependency parsing features. Clause_CORE processes complex sentences by decomposing them and dynamically completing sentence components, which reduces sentence complexity while maintaining sentence integrity. We then perform dependency parsing on the completed sentences and implement open entity relation extraction based on a model constructed from Chinese grammar rules. The experimental results show that the Clause_CORE method outperforms other advanced Chinese ORE systems on Wikipedia and Sina news datasets, which demonstrates the correctness and effectiveness of the method. The results on mixed datasets of news data and encyclopedia data demonstrate the generalization and portability of the method.
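The core idea of extracting relations from dependency structure can be illustrated with a simple subject-verb-object walk over a parse; the sketch below uses Stanza's Chinese UD model and is only a simplified stand-in for Clause_CORE's grammar rules.

```python
# Sketch: subject-verb-object triple extraction from a Chinese dependency
# parse, a simplified illustration of the idea (not Clause_CORE's actual rules).
import stanza

stanza.download("zh")
nlp = stanza.Pipeline("zh", processors="tokenize,pos,lemma,depparse")

doc = nlp("苹果公司发布了新手机。")  # "Apple Inc. released a new phone."

for sent in doc.sentences:
    for word in sent.words:
        if word.deprel == "nsubj":            # subject attached to a verb
            head = sent.words[word.head - 1]  # the governing verb
            for obj in sent.words:
                if obj.head == head.id and obj.deprel == "obj":
                    print((word.text, head.text, obj.text))  # (entity, rel, entity)
```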
APA, Harvard, Vancouver, ISO, and other styles
49

Hassanzadeh, Oktie, Debarun Bhattacharjya, Mark Feblowitz, Kavitha Srinivas, Michael Perrone, Shirin Sohrabi, and Michael Katz. "Causal Knowledge Extraction through Large-Scale Text Mining." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 09 (April 3, 2020): 13610–11. http://dx.doi.org/10.1609/aaai.v34i09.7092.

Full text
Abstract:
In this demonstration, we present a system for mining causal knowledge from large corpora of text documents, such as millions of news articles. Our system provides a collection of APIs for causal analysis and retrieval. These APIs enable searching for the effects of a given cause and the causes of a given effect, as well as analyzing whether a causal relation exists between a given pair of phrases. The analysis includes a score indicating the likelihood that a causal relation exists, along with evidence from the input corpus supporting it. Our system uses generic unsupervised and weakly supervised methods of causal relation extraction that do not impose semantic constraints on causes and effects. We show example use cases developed for a commercial application in enterprise risk management.
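At its simplest, unsupervised causal extraction of this kind can start from cue-phrase patterns; the toy sketch below illustrates the idea only, since the system described uses far richer scoring and evidence retrieval.

```python
# Toy sketch of cue-phrase-based causal pair extraction; the patterns and the
# example sentence are assumptions that only illustrate the general idea.
import re

# "<cause> CUE <effect>" patterns; "because of" reverses the roles.
FORWARD = re.compile(r"(.+?)\s+(?:causes|caused|leads to|results in)\s+(.+)")
REVERSED = re.compile(r"(.+?)\s+because of\s+(.+)")

def extract_causal_pair(sentence):
    sentence = sentence.rstrip(".,;")
    m = FORWARD.search(sentence)
    if m:
        return (m.group(1).strip(), m.group(2).strip())
    m = REVERSED.search(sentence)
    if m:
        return (m.group(2).strip(), m.group(1).strip())  # effect stated first
    return None

print(extract_causal_pair("Heavy rainfall leads to flooding in coastal cities."))
```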
APA, Harvard, Vancouver, ISO, and other styles
50

Veszelszki, Ágnes. "Linguistic and Non-Linguistic Elements in Detecting (Hungarian) Fake News." Acta Universitatis Sapientiae Communicatio 4, no. 1 (December 1, 2017): 7–35. http://dx.doi.org/10.1515/auscom-2017-0001.

Full text
Abstract:
Fake news texts often show clear signs of their deceptive nature; still, they are shared by many users on Facebook. What could be the reason for this? The paper tries to answer this question by collecting the linguistic and non-linguistic characteristics of fake news. Linguistic characteristics include, among others, exaggerated, sensational titles; eye-catching, tabloid-style text; the correct or incorrect use of terms; and fake URLs imitating real websites. Non-linguistic characteristics include expressive pictures often featuring celebrities, the use of all caps, excessive punctuation, and spelling mistakes. The corpus was compiled using snowball sampling: manipulative news items not originating from big news portals were collected from the social networking website Facebook. The aim of the study is to identify the characteristics of Hungarian fake news in comparison to English ones and to elaborate a system of criteria that helps identify fake news.
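Several of the surface signals listed here lend themselves to simple automatic checks; the sketch below scores a text on three of them, with the patterns and the additive weighting chosen purely for illustration.

```python
# Toy sketch: counting three surface signals of fake news named above
# (all caps, excessive punctuation, suspicious URLs); the regexes and the
# additive score are illustrative assumptions, not a validated detector.
import re

def fake_news_signals(text):
    signals = {
        "all_caps_words": len(re.findall(r"\b[A-Z]{3,}\b", text)),
        "excessive_punct": len(re.findall(r"[!?]{2,}", text)),
        "suspicious_urls": len(re.findall(r"https?://\S+\.(?:xyz|info)\S*", text)),
    }
    signals["score"] = sum(signals.values())  # crude additive score
    return signals

print(fake_news_signals(
    "SHOCKING!!! Celebrity reveals ALL: http://news-portal.xyz/item"))
```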
APA, Harvard, Vancouver, ISO, and other styles