Journal articles on the topic 'Automated audio captioning'

Consult the top 23 journal articles for your research on the topic 'Automated audio captioning.'

Browse journal articles on a wide variety of disciplines and organise your bibliography correctly.

1

Bokhove, Christian, and Christopher Downey. "Automated generation of ‘good enough’ transcripts as a first step to transcription of audio-recorded data." Methodological Innovations 11, no. 2 (May 2018): 205979911879074. http://dx.doi.org/10.1177/2059799118790743.

Abstract:
In the last decade, automated captioning services have appeared in mainstream technology use. Until now, the focus of these services has been on the technical aspects, supporting pupils with special educational needs and supporting teaching and learning of second language students. Only limited explorations have been attempted regarding their use for research purposes: transcription of audio recordings. This article presents a proof-of-concept exploration utilising three examples of automated transcription of audio recordings from different contexts: an interview, a public hearing, and a classroom setting, and compares them against ‘manual’ transcription techniques in each case. It begins with an overview of literature on automated captioning and the use of voice recognition tools for the purposes of transcription. An account is provided of the specific processes and tools used for the generation of the automated captions, followed by some basic processing of the captions to produce automated transcripts. Originality checking software was used to determine a percentage match between the automated transcript and a manual version as a basic measure of the potential usability of each of the automated transcripts. Some analysis of the more common and persistent mismatches observed between automated and manual transcripts is provided, revealing that the majority of mismatches would be easily identified and rectified in a review and edit of the automated transcript. Finally, some of the challenges and limitations of the approach are considered. These limitations notwithstanding, we conclude that this form of automated transcription provides ‘good enough’ transcription for first versions of transcripts. The time and cost advantages of this could be considerable, even for the production of summary or gisted transcripts.
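For readers who want to approximate the percentage-match step described above without dedicated originality-checking software, a minimal sketch follows; difflib is only a stand-in for the tool used in the study, and the sample strings are illustrative.

```python
# A minimal sketch of the "percentage match" idea: comparing an automated
# transcript against a manual one with a simple similarity ratio.
from difflib import SequenceMatcher

def transcript_match(automated: str, manual: str) -> float:
    """Return a rough percentage match between two transcripts."""
    # Normalise case and whitespace so trivial differences are ignored.
    a = " ".join(automated.lower().split())
    b = " ".join(manual.lower().split())
    return 100.0 * SequenceMatcher(None, a, b).ratio()

print(f"{transcript_match('the cat sat on the mat', 'The cat sat on a mat'):.1f}% match")
```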
2

Koenecke, Allison, Andrew Nam, Emily Lake, Joe Nudell, Minnie Quartey, Zion Mengesha, Connor Toups, John R. Rickford, Dan Jurafsky, and Sharad Goel. "Racial disparities in automated speech recognition." Proceedings of the National Academy of Sciences 117, no. 14 (March 23, 2020): 7684–89. http://dx.doi.org/10.1073/pnas.1915768117.

Abstract:
Automated speech recognition (ASR) systems, which use sophisticated machine-learning algorithms to convert spoken language to text, have become increasingly widespread, powering popular virtual assistants, facilitating automated closed captioning, and enabling digital dictation platforms for health care. Over the last several years, the quality of these systems has dramatically improved, due both to advances in deep learning and to the collection of large-scale datasets used to train the systems. There is concern, however, that these tools do not work equally well for all subgroups of the population. Here, we examine the ability of five state-of-the-art ASR systems—developed by Amazon, Apple, Google, IBM, and Microsoft—to transcribe structured interviews conducted with 42 white speakers and 73 black speakers. In total, this corpus spans five US cities and consists of 19.8 h of audio matched on the age and gender of the speaker. We found that all five ASR systems exhibited substantial racial disparities, with an average word error rate (WER) of 0.35 for black speakers compared with 0.19 for white speakers. We trace these disparities to the underlying acoustic models used by the ASR systems as the race gap was equally large on a subset of identical phrases spoken by black and white individuals in our corpus. We conclude by proposing strategies—such as using more diverse training datasets that include African American Vernacular English—to reduce these performance differences and ensure speech recognition technology is inclusive.
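The disparity reported above rests on word error rate (WER). A minimal sketch of computing average WER per speaker group is shown below; the group names and transcript pairs are placeholders, not the study's corpus.

```python
# Hedged sketch: computing word error rate (WER) per speaker group.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein distance over words (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / max(len(ref), 1)

samples = {  # hypothetical (reference, ASR hypothesis) pairs per group
    "group_a": [("she had your dark suit", "she had your dark suit")],
    "group_b": [("she had your dark suit", "she had the dark suit in")],
}
for group, pairs in samples.items():
    avg = sum(wer(r, h) for r, h in pairs) / len(pairs)
    print(group, round(avg, 2))
```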
3

Mirzaei, Maryam Sadat, Kourosh Meshgi, Yuya Akita, and Tatsuya Kawahara. "Partial and synchronized captioning: A new tool to assist learners in developing second language listening skill." ReCALL 29, no. 2 (March 2, 2017): 178–99. http://dx.doi.org/10.1017/s0958344017000039.

Abstract:
This paper introduces a novel captioning method, partial and synchronized captioning (PSC), as a tool for developing second language (L2) listening skills. Unlike conventional full captioning, which provides the full text and allows comprehension of the material merely by reading, PSC promotes listening to the speech by presenting a selected subset of words, where each word is synched to its corresponding speech signal. In this method, word-level synchronization is realized by an automatic speech recognition (ASR) system, dedicated to the desired corpora. This feature allows the learners to become familiar with the correspondences between words and their utterances. Partialization is done by automatically selecting words or phrases likely to hinder listening comprehension. In this work we presume that the incidence of infrequent or specific words and fast delivery of speech are major barriers to listening comprehension. The word selection criteria are thus based on three factors: speech rate, word frequency and specificity. The thresholds for these features are adjusted to the proficiency level of the learners. The selected words are presented to aid listening comprehension while the remaining words are masked in order to keep learners listening to the audio. PSC was evaluated against no-captioning and full-captioning conditions using TED videos. The results indicate that PSC leads to the same level of comprehension as the full-captioning method while presenting less than 30% of the transcript. Furthermore, compared with the other methods, PSC can serve as an effective medium for decreasing dependence on captions and preparing learners to listen without any assistance.
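A rough sketch of the partialization idea, under the assumption that word-level timings are already available from ASR or forced alignment; the stand-in frequency list, speech-rate proxy, and thresholds are illustrative, not the authors' calibrated values.

```python
# Minimal sketch: show only words likely to hinder listening (infrequent or
# delivered fast) and mask the rest, as in partial and synchronized captioning.
COMMON_WORDS = {"the", "a", "of", "to", "and", "in", "is", "it"}  # stand-in frequency list

def select_caption_words(aligned_words, rate_threshold=4.0, frequency_list=COMMON_WORDS):
    """aligned_words: list of (word, start_sec, end_sec) from ASR / forced alignment."""
    shown = []
    for word, start, end in aligned_words:
        duration = max(end - start, 1e-3)
        syllables_per_sec = max(len(word) / 3, 1) / duration  # crude speech-rate proxy
        infrequent = word.lower() not in frequency_list
        fast = syllables_per_sec > rate_threshold
        shown.append(word if (infrequent or fast) else "_" * len(word))
    return " ".join(shown)

# Masks the common, slowly spoken words and keeps the rare one:
print(select_caption_words([("the", 0.0, 0.35), ("spectrogram", 0.35, 0.8), ("is", 0.8, 1.2)]))
```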
4

Guo, Rundong. "Advancing real-time close captioning: blind source separation and transcription for hearing impairments." Applied and Computational Engineering 30, no. 1 (January 22, 2024): 125–30. http://dx.doi.org/10.54254/2755-2721/30/20230084.

Abstract:
This project investigates the potential of integrating Blind Source Separation (DUET algorithm) and Automatic Speech Recognition (Wav2Vec2 model) for real-time, accurate transcription in multi-speaker scenarios. Specifically targeted towards improving accessibility for individuals with hearing impairments, the project addresses the challenging task of separating and transcribing speech from simultaneous speakers in various contexts. The DUET algorithm effectively separates individual voices from complex audio scenarios, which are then accurately transcribed into text by the machine learning model, Wav2Vec2. However, despite their remarkable capabilities, both techniques present limitations, particularly when handling complicated audio scenarios and in terms of computational efficiency. Looking ahead, the research suggests incorporating a feedback mechanism between the two systems as a potential solution for these issues. This innovative mechanism could contribute to a more accurate and efficient separation and transcription process by enabling the systems to dynamically adjust to each other's outputs. Nevertheless, this promising direction also brings with it new challenges, particularly in terms of system complexity, defining actionable feedback parameters, and maintaining system efficiency in real-time applications.
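A hedged sketch of the transcription stage only, using the Hugging Face Transformers Wav2Vec2 API; the DUET separation step is assumed to have already produced one 16 kHz mono waveform per speaker, and `separated_sources` is a hypothetical variable, not part of the paper's code.

```python
# Transcribing already-separated speaker channels with Wav2Vec2 (CTC decoding).
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def transcribe(waveform, sampling_rate=16000):
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]

# separated_sources: hypothetical list of 1-D numpy arrays, one per speaker (from DUET).
# for i, source in enumerate(separated_sources):
#     print(f"Speaker {i}: {transcribe(source)}")
```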
5

Prabhala, Jagat Chaitanya, Venkatnareshbabu K, and Ragoju Ravi. "OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIARIZATION SYSTEMS: A MATHEMATICAL FORMULATION." Applied Mathematics and Sciences An International Journal (MathSJ) 10, no. 1/2 (June 26, 2023): 1–10. http://dx.doi.org/10.5121/mathsj.2023.10201.

Abstract:
Speaker diarization is a critical task in speech processing that aims to identify "who spoke when?" in an audio or video recording that contains unknown amounts of speech from an unknown number of speakers. Diarization has numerous applications in speech recognition, speaker identification, and automatic captioning. Supervised and unsupervised algorithms are used to address speaker diarization problems, but providing exhaustive labeling for the training dataset can become costly in supervised learning, while accuracy can be compromised when using unsupervised approaches. This paper presents a novel approach to speaker diarization, which defines loosely labeled data and employs x-vector embeddings and a formalized approach for threshold searching with a given abstract similarity metric to cluster temporal segments into unique speaker segments. The proposed algorithm uses concepts from graph theory, matrix algebra, and genetic algorithms to formulate and solve the optimization problem. Additionally, the algorithm is applied to English, Spanish, and Chinese audio, and the performance is evaluated using well-known similarity metrics. The results demonstrate the robustness of the proposed approach. The findings have significant implications for speech processing and speaker identification, including for tonal languages. The proposed method offers a practical and efficient solution for speaker diarization in real-world scenarios where there are labeling time and cost constraints.
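A minimal sketch of the thresholding idea: link segments whose embedding similarity exceeds a threshold, then search the threshold that maximises a quality score. The greedy linking and grid search below are loose stand-ins for the paper's graph-theoretic and genetic-algorithm formulation, and the scoring function is left abstract.

```python
# Threshold-based clustering of segment embeddings (e.g., x-vectors) into speakers.
import numpy as np

def cluster_by_threshold(embeddings, threshold):
    n = len(embeddings)
    sim = embeddings @ embeddings.T  # assumes L2-normalised embeddings (cosine similarity)
    labels = [-1] * n
    current = 0
    for i in range(n):
        if labels[i] == -1:          # start a new speaker cluster
            labels[i] = current
            current += 1
        for j in range(i + 1, n):    # greedily attach similar, unlabeled segments
            if sim[i, j] >= threshold and labels[j] == -1:
                labels[j] = labels[i]
    return labels

def search_threshold(embeddings, score_fn, grid=np.linspace(0.3, 0.9, 13)):
    # score_fn(labels) -> higher is better (e.g., agreement with loose labels).
    return max(grid, key=lambda t: score_fn(cluster_by_threshold(embeddings, t)))
```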
6

Nam, Somang, and Deborah Fels. "Simulation of Subjective Closed Captioning Quality Assessment Using Prediction Models." International Journal of Semantic Computing 13, no. 01 (March 2019): 45–65. http://dx.doi.org/10.1142/s1793351x19400038.

Abstract:
As a primary user group, Deaf or Hard of Hearing (D/HOH) audiences use the Closed Captioning (CC) service to enjoy TV programs with audio by reading text. However, the D/HOH communities are not completely satisfied with the quality of CC even though government regulators enforce certain rules on CC quality factors. The measure of CC quality is often interpreted as translation accuracy, which regulators assess using empirical models. The need for a subjective quality scale arises from the gap between current empirical assessment models and audience-perceived quality. It is possible to fill this gap by including subjective assessments from D/HOH audiences. This research proposes the design of an automatic quality assessment system for CC that can predict D/HOH audiences' subjective ratings. A simulated rater is implemented based on the literature, and an algorithm for extracting representative values of CC quality factors is developed. Three prediction models are trained with a set of CC quality values and corresponding rating scores, and they are then compared to find the most feasible prediction model.
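A small sketch of the prediction step, under the assumption that caption-quality factors have already been extracted as numeric features; the feature names, data, and the choice of a random-forest regressor are illustrative, not the paper's models.

```python
# Mapping caption-quality factors to a simulated subjective rating with regression.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# columns: [delay_sec, words_per_minute, paraphrasing_rate, missed_word_rate]
X = np.array([[1.0, 140, 0.05, 0.01],
              [4.5, 180, 0.30, 0.10],
              [2.0, 160, 0.10, 0.03]])
y = np.array([4.5, 2.0, 3.8])  # simulated D/HOH ratings on a 5-point scale

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(model.predict([[1.5, 150, 0.08, 0.02]]))  # predicted rating for a new caption stream
```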
7

Gotmare, Abhay, Gandharva Thite, and Laxmi Bewoor. "A multimodal machine learning approach to generate news articles from geo-tagged images." International Journal of Electrical and Computer Engineering (IJECE) 14, no. 3 (June 1, 2024): 3434. http://dx.doi.org/10.11591/ijece.v14i3.pp3434-3442.

Abstract:
Classical machine learning algorithms typically operate on unimodal data, and hence they can analyze and make predictions based on data from a single source (modality). Multimodal machine learning algorithms, in contrast, learn from information across multiple modalities, such as text, images, audio, and sensor data. This paper leverages multimodal machine learning (ML) for generating text from images. The proposed work presents an innovative multimodal algorithm that automates the creation of news articles from geo-tagged images by leveraging cutting-edge developments in machine learning, image captioning, and advanced text generation technologies. Employing a multimodal approach that integrates machine learning and transformer algorithms, such as the visual geometry group network 16 (VGGNet16), a convolutional neural network (CNN), and a long short-term memory (LSTM) based system, the algorithm begins by extracting the location from the image's exchangeable image file format (Exif) data. Features are extracted from the image and a corresponding news headline is generated. The headlines are used for generating a comprehensive article with a contemporary large language model (LLM). Further, the algorithm generates the news article using the BigScience large open-science open-access multilingual language model (BLOOM). The algorithm was tested on real-time photographs as well as images from the internet. In both cases, the generated news articles were validated with ROUGE and BLEU scores. The proposed work is found to be a successful attempt in the field of journalism.
8

Verma, Dr Neeta. "Assistive Vision Technology using Deep Learning Techniques." International Journal for Research in Applied Science and Engineering Technology 9, no. VII (July 31, 2021): 2695–704. http://dx.doi.org/10.22214/ijraset.2021.36815.

Abstract:
One of the most important functions of the human visual system is automatic captioning. Caption generation is one of the more interesting and focused areas of AI, with numerous challenges to overcome. If there is an application that automatically captions the scenes in which a person is present and converts the caption into a clear message, people will benefit from it in a variety of ways. In this work, we offer a deep learning model that automatically detects objects or features in images, produces descriptions for the images, and converts the descriptions to audio to be read aloud. The model uses pre-trained CNN and LSTM models to perform the task of extracting objects or features to get the captions. In our model, the first task is to detect objects within the image using a pre-trained MobileNet CNN (Convolutional Neural Network) model; the second is to caption the pictures based on the detected objects using an LSTM (Long Short-Term Memory) network and to convert the caption into speech, read aloud to the person via the SpeechSynthesisUtterance interface of the Web Speech API. The interface of the model is developed using NodeJS as a backend for the web page. Caption generation entails a number of complex steps, including selecting the dataset, training the model, validating the model, creating pre-trained models to check the images, detecting the images, and finally generating captions.
9

Eren, Aysegul Ozkaya, and Mustafa Sert. "Automated Audio Captioning with Topic Modeling." IEEE Access, 2023, 1. http://dx.doi.org/10.1109/access.2023.3235733.

10

Xiao, Feiyang, Jian Guan, Qiaoxi Zhu, and Wenwu Wang. "Graph Attention for Automated Audio Captioning." IEEE Signal Processing Letters, 2023, 1–5. http://dx.doi.org/10.1109/lsp.2023.3266114.

11

Mei, Xinhao, Xubo Liu, Mark D. Plumbley, and Wenwu Wang. "Automated audio captioning: an overview of recent progress and new challenges." EURASIP Journal on Audio, Speech, and Music Processing 2022, no. 1 (October 9, 2022). http://dx.doi.org/10.1186/s13636-022-00259-2.

Abstract:
Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips. This task has received increasing attention with the release of freely available datasets in recent years. The problem has been addressed predominantly with deep learning techniques. Numerous approaches have been proposed, such as investigating different neural network architectures, exploiting auxiliary information such as keywords or sentence information to guide caption generation, and employing different training strategies, which have greatly facilitated the development of this field. In this paper, we present a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets. We also discuss open challenges and envisage possible future research directions.
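To make the "cross-modal translation" framing concrete, here is a minimal PyTorch sketch of the canonical encoder-decoder setup the review describes: a CNN encoder over log-mel spectrograms feeding an autoregressive text decoder. All dimensions, the pooling strategy, and the GRU decoder are illustrative simplifications, not any specific published model.

```python
# Minimal audio-captioning encoder-decoder: CNN over a log-mel spectrogram,
# pooled into a context vector that initialises a GRU caption decoder.
import torch
import torch.nn as nn

class AudioCaptioner(nn.Module):
    def __init__(self, vocab_size=5000, n_mels=64, d_model=256):
        super().__init__()
        self.encoder = nn.Sequential(              # frame-level audio encoder
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((1, None)),
        )
        self.proj = nn.Linear(64, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        # mel: (batch, 1, n_mels, time); tokens: (batch, seq_len) caption prefix
        feats = self.encoder(mel).squeeze(2).transpose(1, 2)   # (batch, time', 64)
        ctx = self.proj(feats).mean(dim=1, keepdim=True)       # crude pooled context
        h = ctx.transpose(0, 1).contiguous()                   # (1, batch, d_model)
        dec_out, _ = self.decoder(self.embed(tokens), h)
        return self.out(dec_out)                               # (batch, seq_len, vocab)

model = AudioCaptioner()
logits = model(torch.randn(2, 1, 64, 500), torch.randint(0, 5000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 5000])
```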
12

Won, Hyejin, Baekseung Kim, Il-Youp Kwak, and Changwon Lim. "Using various pre-trained models for audio feature extraction in automated audio captioning." Expert Systems with Applications, June 2023, 120664. http://dx.doi.org/10.1016/j.eswa.2023.120664.

13

Poongodi, M., Mounir Hamdi, and Huihui Wang. "Image and audio caps: automated captioning of background sounds and images using deep learning." Multimedia Systems, February 26, 2022. http://dx.doi.org/10.1007/s00530-022-00902-0.

Abstract:
Image recognition based on computers is something human beings have been working on for many years. It is one of the most difficult tasks in the field of computer science, and improvements to this system are being made even as we speak. In this paper, we propose a methodology to automatically suggest an appropriate title and add a specific sound to an image. Two models have been extensively trained and combined to achieve this effect. Sounds are recommended based on the image scene, and the headings are generated using a combination of natural language processing and state-of-the-art computer vision models. A Top-5 accuracy of 67% and a Top-1 accuracy of 53% have been achieved. It is also worth mentioning that this is the first model of its kind to make this prediction.
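The Top-1 and Top-5 figures reported above can be computed for any scoring model with a few lines; the score matrix and labels below are random placeholders rather than the paper's outputs.

```python
# Simple Top-k accuracy over per-class scores.
import numpy as np

def top_k_accuracy(scores, labels, k):
    # scores: (n_samples, n_classes); labels: (n_samples,)
    top_k = np.argsort(scores, axis=1)[:, -k:]
    return float(np.mean([label in row for label, row in zip(labels, top_k)]))

scores = np.random.rand(100, 20)
labels = np.random.randint(0, 20, size=100)
print("Top-1:", top_k_accuracy(scores, labels, 1), "Top-5:", top_k_accuracy(scores, labels, 5))
```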
14

Kala, Sankalp, and Prof Sridhar Ranganathan. "Deep Learning Based Lipreading for Video Captioning." Engineering and Technology Journal 09, no. 05 (May 15, 2024). http://dx.doi.org/10.47191/etj/v9i05.08.

Abstract:
Visual speech recognition, often referred to as lipreading, has garnered significant attention in recent years due to its potential applications in various fields such as human-computer interaction, accessibility technology, and biometric security systems. This paper explores the challenges and advancements in the field of lipreading, which involves deciphering speech from visual cues, primarily movements of the lips, tongue, and teeth. Despite being an essential aspect of human communication, lipreading presents inherent difficulties, especially in noisy environments or when contextual information is limited. The McGurk effect, where conflicting audio and visual cues lead to perceptual illusions, highlights the complexity of lipreading. Human lipreading performance varies widely, with hearing-impaired individuals achieving relatively low accuracy rates. Automating lipreading using machine learning techniques has emerged as a promising solution, with potential applications ranging from silent dictation in public spaces to biometric authentication systems. Visual speech recognition methods can be broadly categorized into those that focus on mimicking words and those that model visemes, visually distinguishable phonemes. While word-based approaches are suitable for isolated word recognition, viseme-based techniques are better suited for continuous speech recognition tasks. This study proposes a novel deep learning architecture for lipreading, leveraging Conv3D layers for spatiotemporal feature extraction and bidirectional LSTM layers for sequence modelling. The proposed model demonstrates significant improvements in lipreading accuracy, outperforming traditional methods on benchmark datasets. The practical implications of automated lipreading extend beyond accessibility technology to include biometric identity verification, security surveillance, and enhanced communication aids for individuals with hearing impairments. This paper provides insights into the advancements, challenges, and future directions of visual speech recognition research, paving the way for innovative applications in diverse domains.
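A hedged PyTorch sketch of the architecture described above: Conv3D layers for spatiotemporal features over mouth-region frames, followed by a bidirectional LSTM and a per-frame character classifier (as would typically be trained with a CTC loss). All layer sizes are illustrative, not the authors' configuration.

```python
# Conv3D + bidirectional LSTM lipreading backbone (per-frame character logits).
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, n_chars=28, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),   # keep the time axis, pool space
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_chars)

    def forward(self, frames):                     # frames: (batch, 3, time, H, W)
        x = self.conv(frames).squeeze(-1).squeeze(-1).transpose(1, 2)  # (batch, time, 64)
        x, _ = self.lstm(x)
        return self.fc(x)                          # (batch, time, n_chars)

out = LipReader()(torch.randn(1, 3, 30, 64, 96))
print(out.shape)  # torch.Size([1, 30, 28])
```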
15

Gencyilmaz, Izel Zeynep, and Kürşat Mustafa Karaoğlan. "Optimizing Speech to Text Conversion in Turkish: An Analysis of Machine Learning Approaches." Bitlis Eren Üniversitesi Fen Bilimleri Dergisi, March 20, 2024. http://dx.doi.org/10.17798/bitlisfen.1434925.

Abstract:
The Conversion of Speech to Text (CoST) is crucial for developing automated systems to understand and process voice commands. Studies have focused on developing this task, especially for Turkish-specific voice commands, a strategic language in the international arena. However, researchers face various challenges, such as Turkish's suffixed structure, phonological features and unique letters, dialect and accent differences, word stress, word-initial vowel effects, background noise, gender-based sound variations, and dialectal differences. To address the challenges above, this study aims to convert speech data consisting of Turkish-specific audio clips, which have been limitedly researched in the literature, into texts with high-performance accuracy using different Machine Learning (ML) models, especially models such as Convolutional Neural Networks (CNNs) and Convolutional Recurrent Neural Networks (CRNNs). For this purpose, experimental studies were conducted on a dataset of 26,485 Turkish audio clips, and performance evaluation was performed with various metrics. In addition, hyperparameters were optimized to improve the model's performance in experimental studies. A performance of over 97% has been achieved according to the F1-score metric. The highest performance results were obtained with the CRNN approach. In conclusion, this study provides valuable insights into the strengths and limitations of various ML models applied to CoST. In addition to potentially contributing to a wide range of applications, such as supporting hard-of-hearing individuals, facilitating notetaking, automatic captioning, and improving voice command recognition systems, this study is one of the first in the literature on CoST in Turkish.
16

Lucia-Mulas, Maria Jose, Pablo Revuelta-Sanz, Belen Ruiz-Mezcua, and Israel Gonzalez-Carrasco. "Automatic music emotion classification model for movie soundtrack subtitling based on neuroscientific premises." Applied Intelligence, September 1, 2023. http://dx.doi.org/10.1007/s10489-023-04967-w.

Abstract:
The ability of music to induce emotions has been arousing a lot of interest in recent years, especially due to the boom in music streaming platforms and the use of automatic music recommenders. Music Emotion Recognition approaches are based on combining multiple audio features extracted from digital audio samples and different machine learning techniques. In these approaches, neuroscience results on musical emotion perception are not considered. The main goal of this research is to facilitate the automatic subtitling of music. The authors approached the problem of automatic musical emotion detection in movie soundtracks considering these characteristics and using scientific musical databases, which have become a reference in neuroscience research. In the experiments, the Constant-Q-Transform spectrograms, the ones that best represent the relationships between musical tones from the point of view of human perception, are combined with Convolutional Neural Networks. Results show an efficient emotion classification model for 2-second musical audio fragments representative of intense basic feelings of happiness, sadness, and fear. Those emotions are the most interesting to be identified in the case of movie music captioning. The quality metrics have demonstrated that the results of the different models differ significantly and show no homogeneity. Finally, these results pave the way for an accessible and automatic captioning of music, which could automatically identify the emotional intent of the different segments of the movie soundtrack.
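A minimal sketch of the input representation the paper builds on: a Constant-Q Transform spectrogram of a 2-second excerpt computed with librosa, which would then be fed to a CNN classifier over happiness, sadness, and fear. The file path and CQT parameters are placeholders.

```python
# Constant-Q Transform spectrogram of a 2-second music excerpt (CNN input).
import librosa
import numpy as np

y, sr = librosa.load("soundtrack_clip.wav", sr=22050, duration=2.0)
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12))
log_cqt = librosa.amplitude_to_db(cqt, ref=np.max)   # (84, frames) "image" for the CNN
print(log_cqt.shape)
```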
17

Bochner, Joseph, Mark Indelicato, and Pralhad Konnur. "Effects of Sound Quality on the Accuracy of Telephone Captions Produced by Automatic Speech Recognition: A Preliminary Investigation." American Journal of Audiology, December 14, 2022, 1–8. http://dx.doi.org/10.1044/2022_aja-22-00102.

Abstract:
Purpose: Automatic speech recognition (ASR) is commonly used to produce telephone captions to provide telecommunication access for individuals who are d/Deaf and hard of hearing (DHH). However, little is known about the effects of degraded telephone audio on the intelligibility of ASR captioning. This research note investigates the accuracy of telephone captions produced by ASR under degraded audio conditions. Method: Packet loss, delay, and repetition are common sources of degradation in sound quality for telephone audio. Eleven sets of wideband filtered sentences were degraded by high and low levels of simulated packet loss, delay, and repetition. These sets, along with a clean set of sentences, were submitted to ASR, and the accuracy of the resulting output was evaluated using three metrics: a word recognition score, word error rate, and word information loss. Results: The resulting pattern of data indicated the relative impact of each degraded condition on message intelligibility. The high and low packet loss conditions had the largest effect on message intelligibility. This finding was interpreted to indicate that packet loss can have a substantial impact on the accuracy of telephone captions produced with ASR. Conclusions: The results of this investigation point to a potential area of improvement in service quality that could have a substantial impact on telecommunication services for consumers who are DHH. Further research in this area is needed to provide additional information concerning the scope and impact of packet loss on the accuracy of telephone captioning produced by ASR. Supplemental Material: https://doi.org/10.23641/asha.21699557
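The three reported metrics can be computed with the jiwer package as sketched below for a single degraded condition; approximating the word recognition score as 1 − WER is an assumption made here for illustration, and the sentences are invented.

```python
# Word error rate and word information lost for one reference/hypothesis pair.
import jiwer

reference = "please call the office before noon to confirm your appointment"
hypothesis = "please call office before new to confirm appointment"

wer = jiwer.wer(reference, hypothesis)
wil = jiwer.wil(reference, hypothesis)
print(f"WER={wer:.2f}  WIL={wil:.2f}  word recognition score≈{1 - wer:.2f}")
```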
18

Starr, Kim Linda, Sabine Braun, and Jaleh Delfani. "Taking a Cue From the Human." Journal of Audiovisual Translation 3, no. 2 (December 18, 2020). http://dx.doi.org/10.47476/jat.v3i2.2020.138.

Abstract:
Human beings find the process of narrative sequencing in written texts and moving imagery a relatively simple task. Key to the success of this activity is establishing coherence by using critical cues to identify key characters, objects, actions and locations as they contribute to plot development. In the drive to make audiovisual media more widely accessible (through audio description), and media archives more searchable (through content description), computer vision experts strive to automate video captioning in order to supplement human description activities. Existing models for automating video descriptions employ deep convolutional neural networks for encoding visual material and feature extraction (Krizhevsky, Sutskever, & Hinton, 2012; Szegedy et al., 2015; He, Zhang, Ren, & Sun, 2016). Recurrent neural networks decode the visual encodings and supply a sentence that describes the moving images in a manner mimicking human performance. However, these descriptions are currently “blind” to narrative coherence. Our study examines the human approach to narrative sequencing and coherence creation using the MeMAD [Methods for Managing Audiovisual Data: Combining Automatic Efficiency with Human Accuracy] film corpus involving five-hundred extracts chosen as stand-alone narrative arcs. We examine character recognition, object detection and temporal continuity as indicators of coherence, using linguistic analysis and qualitative assessments to inform the development of more narratively sophisticated computer models in the future.
19

Kuhn, Korbinian, Verena Kersken, Benedikt Reuter, Niklas Egger, and Gottfried Zimmermann. "Measuring the Accuracy of Automatic Speech Recognition Solutions." ACM Transactions on Accessible Computing, December 8, 2023. http://dx.doi.org/10.1145/3636513.

Abstract:
For d/Deaf and hard of hearing (DHH) people, captioning is an essential accessibility tool. Significant developments in artificial intelligence (AI) mean that Automatic Speech Recognition (ASR) is now a part of many popular applications. This makes creating captions easy and broadly available, but transcription needs high levels of accuracy to be accessible. Scientific publications and industry report very low error rates, claiming AI has reached human parity or even outperforms manual transcription. At the same time, the DHH community reports serious issues with the accuracy and reliability of ASR. There seems to be a mismatch between technical innovations and the real-life experience for people who depend on transcription. Independent and comprehensive data is needed to capture the state of ASR. We measured the performance of eleven common ASR services with recordings of Higher Education lectures. We evaluated the influence of technical conditions like streaming, the use of vocabularies, and differences between languages. Our results show that accuracy ranges widely between vendors and for the individual audio samples. We also measured significantly lower quality for streaming ASR, which is used for live events. Our study shows that despite recent improvements in ASR, common services lack reliability in accuracy.
20

Hekanaho, Laura, Maija Hirvonen, and Tuomas Virtanen. "Language-based machine perception: linguistic perspectives on the compilation of captioning datasets." Digital Scholarship in the Humanities, June 21, 2024. http://dx.doi.org/10.1093/llc/fqae029.

Abstract:
Over the last decade, a plethora of training datasets have been compiled for use in language-based machine perception and in human-centered AI, alongside research regarding their compilation methods. From a primarily linguistic perspective, we add to these studies in two ways. First, we provide an overview of sixty-six training datasets used in automatic image, video, and audio captioning, examining their compilation methods with a metadata analysis. Second, we delve into the annotation process of crowdsourced datasets with an interest in understanding the linguistic factors that affect the form and content of the captions, such as contextualization and perspectivation. With a qualitative content analysis, we examine annotator instructions with a selection of eleven datasets. Drawing from various theoretical frameworks that help assess the effectiveness of the instructions, we discuss the visual and textual presentation of the instructions, as well as the perspective-guidance that is an essential part of the language instructions. While our analysis indicates that some standards in the formulation of instructions seem to have formed in the field, we also identified various reoccurring issues potentially hindering readability and comprehensibility of the instructions, and therefore, caption quality. To enhance readability, we emphasize the importance of text structure, organization of the information, consistent use of typographical cues, and clarity of language use. Last, engaging with previous research, we assess the compilation of both web-sourced and crowdsourced captioning datasets from various perspectives, discussing factors affecting the diversity of the datasets.
21

Man, Xin, Jie Shao, Feiyu Chen, Mingxing Zhang, and Heng Tao Shen. "TEVL: Trilinear Encoder for Video-Language Representation Learning." ACM Transactions on Multimedia Computing, Communications, and Applications, February 24, 2023. http://dx.doi.org/10.1145/3585388.

Abstract:
Pre-training model on large-scale unlabeled web videos followed by task-specific fine-tuning is a canonical approach to learning video and language representations. However, the accompanying Automatic Speech Recognition (ASR) transcripts in these videos are directly transcribed from audio, which may be inconsistent with visual information and would impair the language modeling ability of the model. Meanwhile, previous V-L models fuse visual and language modality features using single- or dual-stream architectures, which are not suitable for the current situation. Besides, traditional V-L research focuses mainly on the interaction between vision and language modalities and leaves the modeling of relationships within modalities untouched. To address these issues and maintain a small manual labor cost, we add automatically extracted dense captions as a supplementary text and propose a new trilinear video-language interaction framework TEVL (Trilinear Encoder for Video-Language representation learning). TEVL contains three unimodal encoders, a TRIlinear encOder (TRIO) block and a temporal Transformer. TRIO is specially designed to support effective text-vision-text interaction, which encourages inter-modal cooperation while maintaining intra-modal dependencies. We pre-train TEVL on the HowTo100M and TV datasets with four task objectives. Experimental results demonstrate that TEVL can learn powerful video-text representation and achieve competitive performance on three downstream tasks, including multimodal video captioning, video Question Answering (QA) as well as video and language inference. Implementation code is available at https://github.com/Gufrannn/TEVL.
22

Ellis, Katie, Mike Kent, and Gwyneth Peaty. "Captioned Recorded Lectures as a Mainstream Learning Tool." M/C Journal 20, no. 3 (June 21, 2017). http://dx.doi.org/10.5204/mcj.1262.

Abstract:
In Australian universities, many courses provide lecture notes as a standard learning resource; however, captions and transcripts of these lectures are not usually provided unless requested by a student through dedicated disability support officers (Worthington). As a result, to date their use has been limited. However, while the requirement for—and benefits of—captioned online lectures for students with disabilities is widely recognised, these captions or transcripts might also represent further opportunity for a personalised approach to learning for the mainstream student population (Podszebka et al.; Griffin). This article reports findings of research assessing the usefulness of captioned recorded lectures as a mainstream learning tool to determine their usefulness in enhancing inclusivity and learning outcomes for the disabled, international, and broader student population.Literature ReviewCaptions have been found to be of benefit for a number of different groups considered at-risk. These include people who are D/deaf or hard of hearing, those with other learning difficulties, and those from a non-English speaking background (NESB).For students who are D/deaf or hard of hearing, captions play a vital role in providing access to otherwise inaccessible audio content. Captions have been found to be superior to sign language interpreters, note takers, and lip reading (Stinson et al.; Maiorana-Basas and Pagliaro; Marschark et al.).The use of captions for students with a range of cognitive disabilities has also been shown to help with student comprehension of video-based instruction in a higher education context (Evmenova; Evmenova and Behrmann). This includes students with autism spectrum disorder (ASD) (Knight et al.; Reagon et al.) and students with dyslexia (Alty et al.; Beacham and Alty). While, anecdotally, captions are also seen as of benefit for students with attention deficit hyperactivity disorder (ADHD) (Kent et al.), studies have proved inconclusive (Lewis and Brown).The third group of at-risk students identified as benefiting from captioning recorded lecture content are those from a NESB. The use of captions has been shown to increase vocabulary learning (Montero Perez, Peters, Clarebout, and Desmet; Montero Perez, Van Den Noortgate, and Desmet) and to assist with comprehension of presenters with accents or rapid speech (Borgaonkar, 2013).In addition to these three main groups of at-risk students, captions have also been demonstrated to increase the learning outcomes for older students (Pachman and Ke, 2012; Schmidt and Haydu, 1992). Captions also have demonstrable benefits for the broader student cohort beyond these at-risk groups (Podszebka et al.; Griffin). For example, a recent study found that the broader student population utilised lecture captions and transcripts in order to focus, retain information, and overcome poor audio quality (Linder). However, the same study revealed that students were largely unaware about the availability of captions and transcripts, nor how to access them.MethodologyIn 2016 students in the Curtin University unit Web Communications (an introductory unit for the Internet Communications major) and its complementary first year unit, Internet and Everyday Life, along with a second year unit, Web Media, were provided with access to closed captions for their online recorded lectures. 
The latter unit was added to the study serendipitously when its lectures were required to be captioned through a request from the Curtin Disability Office during the study period. Recordings and captions were created using the existing captioning system available through Curtin’s lecture recording platform—Echo360. As well as providing a written caption of what is being said during the lectures, this system also offers a sophisticated search functionality, as well as access to a total transcript of the lecture. The students were provided access to an online training module, developed specifically for this study, to explain the use of this system.Enrolled Curtin students, both on-campus and online, Open Universities Australia (OUA) students studying through Curtin online, teaching staff, and disability officers were then invited to participate in a survey and interviews. The study sought to gain insights into students’ use of both recorded lectures and captioned video at the time of the survey, and their anticipated future usage of these services (see Kent et al.).A total of 50 students—of 539 enrolled across the different instances of the three units—completed the survey. In addition, five follow-up interviews with students, teaching staff, and disability support staff were conducted once the surveys had been completed. Staff interviewed included tutors and unit coordinators who taught and supervised units in which the lecture captions were provided. The interviews assessed the awareness, use, and perceived validity of the captions system in the context of both learning and teaching.ResultsA number of different questions were asked regarding students’ demographics, their engagement with online unit materials, including recorded lectures, their awareness of Echo360’s lecture captions, as well as its additional features, their perceived value of online captions for their studies, and the future significance of captions in a university context.Of the 50 participants in the survey, only six identified themselves as a person with a disability—almost 90 per cent did not identify as disabled. Additionally, 45 of the 50 participants identified English as their primary language. Only one student identified as a person with both a disability and coming from a NESB.Engagement with Online Unit Materials and Recorded LecturesThe survey results provide insight into the ways in which participants interact with the Echo360 lecture system. Over 90 per cent of students had accessed the recorded lectures via the Echo360 system. While this might not seem notable at first, given such materials are essential elements of the units surveyed, the level of repeated engagement seen in these results is important because it indicates the extent to which students are revising the same material multiple times—a practice that captions are designed to facilitate and assist. For instance, one lecture was recorded per week for each unit surveyed, and most respondents (70 per cent) were viewing these lectures at least once or twice a week, while 10 per cent were viewing the lectures multiple times a week. Over half of the students surveyed reported viewing the same lecture more than once. Out these participants, 19 (or 73 per cent) had viewed a lecture twice and 23 per cent had viewed it three times or more. This illustrates that frequent revision is taking place, as students watch the same lecture repeatedly to absorb and clarify its contents. 
This frequency of repeated engagement with recorded unit materials—lectures in particular—indicates that students were making online engagement and revision a key element of their learning process.Awareness of the Echo360 Lecture Captions and Additional FeaturesHowever, while students were highly engaged with both the online learning material and the recorded lectures, there was less awareness of the availability of the captioning system—only 34 per cent of students indicated they were aware of having access to captions. The survey also asked students whether or not they had used additional features of the Echo360 captioning system such as the search function and downloadable lecture transcripts. Survey results confirm that these features were being used; however, responses indicated that only a minority of students using the captions system used these features, with 28 per cent using the search function and 33 per cent making use of the transcripts. These results can be seen as an indication that additional features were useful for revision, albeit for the minority of students who used them. A Curtin disability advisor noted in their interview that:transcripts are particularly useful in addition to captions as they allow the user to quickly skim the material rather than sit through a whole lecture. Transcripts also allow translation into other languages, highlighting text and other features that make the content more accessible.Teaching staff were positive about these features and suggested that providing transcripts saved time for tutors who are often approached to provide these to individual students:I typically receive requests for lecture transcripts at the commencement of each study period. In SP3 [during this study] I did not receive any requests.I feel that lecture transcripts would be particularly useful as this is the most common request I receive from students, especially those with disabilities.I think transcripts and keyword searching would likely be useful to many students who access lectures through recordings (or who access recordings even after attending the lecture in person).However, the one student who was interviewed preferred the keyword search feature, although they expressed interest in transcripts as well:I used the captions keyword search. I think I would like to use the lecture transcript as well but I did not use that in this unit.In summary, while not all students made use of Echo360’s additional features for captions, those who did access them did so frequently, indicating that these are potentially useful learning tools.Value of CaptionsOf the students who were aware of the captions, 63 per cent found them useful for engaging with the lecture material. According to one of the students:[captions] made a big difference to me in terms on understanding and retaining what was said in the lectures. I am not sure that many students would realise this unless they actually used the captions…I found it much easier to follow what was being said in the recorded lectures and I also found that they helped stay focussed and not become distracted from the lecture.It is notable that the improvements described above do not involve assistance with hearing or language issues, but the extent to which captions improve a more general learning experience. 
This participant identified themselves as a native English speaker with no disabilities, yet the captions still made a “big difference” in their ability to follow, understand, focus on, and retain information drawn from the lectures.However, while over 60 per cent of students who used the captions reported they found them useful, it was difficult to get more detailed feedback on precisely how and why. Only 52.6 per cent reported actually using them when accessing the lectures, and a relatively small number reported taking advantage of the search and transcripts features available through the Echo360 system. Exactly how they were being used and what role they play in student learning is therefore an area to pursue in future research, as it will assist in breaking down the benefits of captions for all learners.Teaching staff also reported the difficulty in assessing the full value of captions—one teacher interviewed explained that the impact of captions was hard to monitor quantitatively during regular teaching:it is difficult enough to track who listens to lectures at all, let alone who might be using the captions, or have found these helpful. I would like to think that not only those with hearing impairments, but also ESL students and even people who find listening to and taking in the recording difficult for other reasons, might have benefitted.Some teaching staff, however, did note positive feedback from students:one student has given me positive feedback via comments on the [discussion board].one has reported that it helps with retention and with times when speech is soft or garbled. I suspect it helps mediate my accent and pitch!While 60 per cent claiming captions were useful is a solid majority, it is notable that some participants skipped this question. As discussed above, survey answers indicate that this was because these 37 students did not think they had access to captions in their units.Future SignificanceOverall, these results indicate that while captions can provide a benefit to students’ engagement with online lecture learning material, there is a need for more direct and ongoing information sharing to ensure both students and teaching staff are fully aware of captions and how to use them. Technical issues—such as the time delay in captions being uploaded—potentially dissuade students from using this facility, so improving the speed and reliability of this tool could increase the number of learners keen to use it. All staff interviewed agreed that implementing captions for all lectures would be beneficial for everyone:any technology that can assist in making lectures more accessible is useful, particularly in OUA [online] courses.it would be a good example of Universal Design as it would make the lecture content more accessible for students with disabilities as well as students with other equity needs.YES—it benefits all students. I personally find that I understand and my attention is held more by captioned content.it certainly makes my role easier as it allows effective access to recorded lectures. Captioning allows full access as every word is accessible as opposed to note taking which is not verbatim.DiscussionThe results of this research indicate that captions—and their additional features—available through the Echo360 captions system are an aid to student learning. 
However, there are significant challenges to be addressed to make students aware of these features and their potential benefits.This study has shown that in a cohort of primarily English speaking students without disabilities, over 60 per cent found captions a useful addition to recorded lectures. This suggests that the implementation of captions for all recorded lectures would have widespread benefits for all learners, not only those with hearing or language difficulties. However, at present, only “eligible” students who approach the disability office would be considered for this service, usually students who are D/deaf or hard of hearing. Yet it can be argued that these benefits—and challenges—could also extend to other groups that are might traditionally have been seen to benefit from the use of captions such as students with other disabilities or those from a NESB.However, again, a lack of awareness of the training module meant that this potential cohort did not benefit from this trial. In this study, none of the students who identified as having a disability or coming from a NESB indicated that they had access to the training module. Further, five of the six students with disabilities reported that they did not have access to the captions system and, similarly, only two of the five NESB students. Despite these low numbers, all the students who were part of these two groups and who did access the captions system did find it useful.It can therefore be seen that the main challenge for teaching staff is to ensure all students are aware of captions and can access them easily. One option for reducing the need for training or further instructions might be having captions always ON by default. This means students could incorporate them into their study experience without having to take direct action or, equally, could simply choose to switch them off.There are also a few potential teething issues with implementing captions universally that need to be noted, as staff expressed some concerns regarding how this might alter the teaching and learning experience. For example:because the captioning is once-off, it means I can’t re-record the lectures where there was a failure in technology as the new versions would not be captioned.a bit cautious about the transcript as there may be problems with students copying that content and also with not viewing the lectures thinking the transcripts are sufficient.Despite these concerns, the survey results and interviews support the previous findings showing that lecture captions have the potential to benefit all learners, enhancing each student’s existing capabilities. As one staff member put it:in the main I just feel [captions are] important for accessibility and equity in general. Why should people have to request captions? Recorded lecture content should be available to all students, in whatever way they find it most easy (or possible) to engage.Follow-up from students at the end of the study further supported this. As one student noted in an email at the start of 2017:hi all, in one of my units last semester we were lucky enough to have captions on the recorded lectures. They were immensely helpful for a number of reasons. 
I really hope they might become available to us in this unit.ConclusionsWhen this project set out to investigate the ways diverse groups of students could utilise captioned lectures if they were offered it as a mainstream learning tool rather than a feature only disabled students could request, existing research suggested that many accommodations designed to assist students with disabilities actually benefit the entire cohort. The results of the survey confirmed this was also the case for captioning.However, currently, lecture captions are typically utilised in Australian higher education settings—including Curtin—only as an assistive technology for students with disabilities, particularly students who are D/deaf or hard of hearing. In these circumstances, the student must undertake a lengthy process months in advance to ensure timely access to essential captioned material. Mainstreaming the provision of captions and transcripts for online lectures would greatly increase the accessibility of online learning—removing these barriers allows education providers to harness the broad potential of captioning technology. Indeed, ensuring that captions were available “by default” would benefit the educational outcomes and self-determination of the wide range of students who could benefit from this technology.Lecture captioning and transcription is increasingly cost-effective, given technological developments in speech-to-text or automatic speech recognition software, and the increasing re-use of content across different iterations of a unit in online higher education courses. At the same time, international trends in online education—not least the rapidly evolving interpretations of international legislation—provide new incentives for educational providers to begin addressing accessibility shortcomings by incorporating captions and transcripts into the basic materials of a course.Finally, an understanding of the diverse benefits of lecture captions and transcripts needs to be shared widely amongst higher education providers, researchers, teaching staff, and students to ensure the potential of this technology is accessed and used effectively. Understanding who can benefit from captions, and how they benefit, is a necessary step in encouraging greater use of such technology, and thereby enhancing students’ learning opportunities.AcknowledgementsThis research was funded by the Curtin University Teaching Excellence Development Fund. Natalie Latter and Kai-ti Kao provided vital research assistance. We also thank the students and staff who participated in the surveys and interviews.ReferencesAlty, J.L., A. Al-Sharrah, and N. Beacham. “When Humans Form Media and Media Form Humans: An Experimental Study Examining the Effects Different Digital Media Have on the Learning Outcomes of Students Who Have Different Learning Styles.” Interacting with Computers 18.5 (2006): 891–909.Beacham, N.A., and J.L. Alty. “An Investigation into the Effects That Digital Media Can Have on the Learning Outcomes of Individuals Who Have Dyslexia.” Computers & Education 47.1 (2006): 74–93.Borgaonkar, R. “Captioning for Classroom Lecture Videos.” University of Houston 2013. <https://uh-ir.tdl.org/uh-ir/handle/10657/517>.Evmenova, A. “Lights. Camera. Captions: The Effects of Picture and/or Word Captioning Adaptations, Alternative Narration, and Interactive Features on Video Comprehension by Students with Intellectual Disabilities.” Ph.D. thesis. Virginia: George Mason U, 2008.Evmenova, A., and M. Behrmann. 
APA, Harvard, Vancouver, ISO, and other styles
23

Burwell, Catherine. "New(s) Readers: Multimodal Meaning-Making in AJ+ Captioned Video." M/C Journal 20, no. 3 (June 21, 2017). http://dx.doi.org/10.5204/mcj.1241.

Full text
Abstract:
Introduction

In 2013, Facebook introduced autoplay video into its newsfeed. In order not to produce sound disruptive to hearing users, videos were muted until a user clicked on them to enable audio. This move, recognised as a competitive response to the popularity of video-sharing sites like YouTube, has generated significant changes to the aesthetics, form, and modalities of online video. Many video producers have incorporated captions into their videos as a means of attracting and maintaining user attention. Of course, captions are not simply a replacement or translation of sound, but have instead added new layers of meaning and changed the way stories are told through video.

In this paper, I ask how the use of captions has altered the communication of messages conveyed through online video. In particular, I consider the role captions have played in news reporting, as online platforms like Facebook become increasingly significant sites for the consumption of news. One of the most successful producers of online news video has been Al Jazeera Plus (AJ+). I examine two recent AJ+ news videos to consider how meaning is generated when captions are integrated into the already multimodal form of the video—their online reporting of Australian versus US healthcare systems, and the history of the Black Panther movement. I analyse interactions amongst image, sound, language, and typography and consider the role of captions in audience engagement, branding, and profit-making. Sean Zdenek notes that captions have yet to be recognised “as a significant variable in multimodal analysis, on par with image, sound and video” (xiii). Here, I attempt to pay close attention to the representational, cultural and economic shifts that occur when captions become a central component of online news reporting. I end by briefly enquiring into the implications of captions for our understanding of literacy in an age of constantly shifting media.

Multimodality in Digital Media

Jeff Bezemer and Gunther Kress define a mode as a “socially and culturally shaped resource for meaning making” (171). Modes include meaning communicated through writing, sound, image, gesture, oral language, and the use of space. Of course, all meanings are conveyed through multiple modes. A page of written text, for example, requires us to make sense through the simultaneous interpretation of words, space, colour, and font. Media such as television and film have long been understood as multimodal; however, with the appearance of digital technologies, media’s multimodality has become increasingly complex. Video games, for example, demonstrate an extraordinary interplay between image, sound, oral language, written text, and interactive gestures, while technologies such as the mobile phone combine the capacity to produce meaning through speaking, writing, and image creation.

These multiple modes are not simply layered one on top of the other, but are instead “enmeshed through the complexity of interaction, representation and communication” (Jewitt 1). The rise of multimodal media—as well as the increasing interest in understanding multimodality—occurs against the backdrop of rapid technological, cultural, political, and economic change. These shifts include media convergence, political polarisation, and increased youth activism across the globe (Herrera), developments that are deeply intertwined with uses of digital media and technology.
Indeed, theorists of multimodality like Jay Lemke challenge us to go beyond formalist readings of how multiple modes work together to create meaning, and to consider multimodality “within a political economy and a cultural ecology of identities, markets and values” (140).

Video’s long history as an inexpensive and portable way to produce media has made it an especially dynamic form of multimodal media. In 1974, avant-garde video artist Nam June Paik predicted that “new forms of video … will stimulate the whole society to find more imaginative ways of telecommunication” (45). Fast forward more than 40 years, and we find that video has indeed become an imaginative and accessible form of communication. The cultural influence of video is evident in the proliferation of video genres, including remix videos, fan videos, Let’s Play videos, video blogs, live stream video, short form video, and video documentary, many of which combine semiotic resources in novel ways. The economic power of video is evident in the profitability of video sharing sites—YouTube in particular—as well as the recent appearance of video on other social media platforms such as Instagram and Facebook.

These platforms constitute significant “sites of display.” As Rodney Jones notes, sites of display are not merely the material media through which information is displayed. Rather, they are complex spaces that organise social interactions—for example, between producers and users—and shape how meaning is made. Certainly we can see the influence of sites of display by considering Facebook’s 2013 introduction of autoplay into its newsfeed, a move that forced video producers to respond with new formats. As Edson Tandoc and Julian Maitra write, news organisations have been forced to “play by Facebook’s frequently modified rules and change accordingly when the algorithms governing the social platform change” (2). AJ+ has been considered one of the media companies that has most successfully adapted to these changes, an adaptation I examine below. I begin by taking up Lemke’s challenge to consider multimodality contextually, reading AJ+ videos through the conceptual lens of the “attention economy,” a lens that highlights the profitability of attention within digital cultures. I then follow with analyses of two short AJ+ videos to show captions’ central role, not only in conveying meaning, but also in creating markets and communicating branded identities and ideologies.

AJ+, Facebook and the New Economies of Attention

The Al Jazeera news network was founded in 1996 to cover news of the Arab world, with a declared commitment to give “voice to the voiceless.” Since that time, the network has gained global influence, yet many of its attempts to break into the American market have been unsuccessful (Youmans). In 2013, the network acquired Current TV in an effort to move into cable television. While that effort ultimately failed, Al Jazeera’s purchase of the youth-oriented Current TV nonetheless led to another, surprisingly fruitful enterprise, the development of the digital media channel Al Jazeera Plus (AJ+). AJ+ content, which is made up almost entirely of video, is directed at 18 to 35-year-olds. As William Youmans notes, AJ+ videos are informal and opinionated, and, while staying consistent with Al Jazeera’s mission to “give voice to the voiceless,” they also take an openly activist stance (114). Another distinctive feature of AJ+ videos is the way they are tailored for specific platforms.
From the beginning, AJ+ has had particular success on Facebook, a success that has been recognised in popular and trade publications. A 2015 profile of AJ+ videos in Variety (Roettgers) noted that AJ+ was the ninth biggest video publisher on the social network, while a story on Journalism.co (Reid, “How AJ+ Reaches”) that same year commented on the remarkable extent to which Facebook audiences shared and interacted with AJ+ videos. These stories also note the distinctive video style that has become associated with the AJ+ brand—short, bold captions; striking images that include photos, maps, infographics, and animations; an effective opening hook; and a closing call to share the video.

AJ+ video producers were developing this unique style just as Facebook’s autoplay was being introduced into newsfeeds. Autoplay—a mechanism through which videos are played automatically, without action from a user—predates Facebook’s introduction of the feature. However, autoplay on Internet sites had already begun to raise the ire of many users before its appearance on Facebook (Oremus, “In Defense of Autoplay”). By playing video automatically, autoplay wrests control away from users, and causes particular problems for users using assistive technologies. Reporting on Facebook’s decision to introduce autoplay, Josh Constine notes that the company was looking for a way to increase advertising revenues without increasing the number of actual ads. Encouraging users to upload and share video normalises the presence of video on Facebook, and opens the door to the eventual addition of profitable video ads. Ensuring that video plays automatically gives video producers an opportunity to capture the attention of users without the need for them to actively click to start a video. Further, ensuring that videos can be understood when played silently means that both deaf users and users who are situationally unable to hear the audio can consume their content in any setting.

While Facebook has promoted its introduction of autoplay as a benefit to users (Oremus, “Facebook”), it is perhaps more clearly an illustration of the carefully crafted production strategies used by digital platforms to capture, maintain, and control attention. Within digital capitalism, attention is a highly prized and scarce resource. Michael Goldhaber argues that once attention is given, it builds the potential for further attention in the future. He writes that “obtaining attention is obtaining a kind of enduring wealth, a form of wealth that puts you in a preferred position to get anything this new economy offers” (n.p.). In the case of Facebook, this offers video producers the opportunity to capture users’ attention quickly—in the time it takes them to scroll through their newsfeed. While this may equate to only a few seconds, those few seconds hold, as Goldhaber predicted, the potential to create further value and profit when videos are viewed, liked, shared, and commented on.

Interviews with AJ+ producers reveal that an understanding of the value of this attention drives the organisation’s production decisions, and shapes content, aesthetics, and modalities. They also make it clear that it is captions that are central to their efforts to engage audiences. Jigar Mehta, former head of engagement at AJ+, explains that “those first three to five seconds have become vital in grabbing the audience’s attention” (qtd. in Reid, “How AJ+ Reaches”).
While early videos began with the AJ+ logo, that was soon dropped in favour of a bold image and text, a decision that dramatically increased views (Reid, “How AJ+ Reaches”). Captions and titles are not only central to grabbing attention, but also to maintaining it, particularly as many audience members consume video on mobile devices without sound. Mehta tells an editor at the Nieman Journalism Lab:

we think a lot about whether a video works with the sound off. Do we have to subtitle it in order to keep the audience retention high? Do we need to use big fonts? Do we need to use color blocking in order to make words pop and make things stand out? (Mehta, qtd. in Ellis)

An AJ+ designer similarly suggests that the most important aspects of AJ+ videos are brand, aesthetic style, consistency, clarity, and legibility (Zou). While questions of brand, style, and clarity are not surprising elements to associate with online video, the matter of legibility is. And yet, in contexts where video is viewed on small, hand-held screens and sound is not an option, legibility—as it relates to the arrangement, size and colour of type—does indeed take on new importance to storytelling and sense-making.

While AJ+ producers frame the use of captions as an innovative response to Facebook’s algorithmic changes, it makes sense to also remember the significant histories of captioning that their videos ultimately draw upon. This lineage includes silent films of the early twentieth century, as well as the development of closed captions for deaf audiences later in that century. Just as he argues for the complexity, creativity, and transformative potential of captions themselves, Sean Zdenek also urges us to view the history of closed captioning not as a linear narrative moving inevitably towards progress, but as something far more complicated and marked by struggle, an important reminder of the fraught and human histories that are often overlooked in accounts of “new media.” Another important historical strand to consider is the centrality of the written word to digital media, and to the Internet in particular. As Carmen Lee writes, despite public anxieties and discussions over a perceived drop in time spent reading, digital media in fact “involve extensive use of the written word” (2). While this use takes myriad forms, many of these forms might be seen as connected to the production, consumption, and popularity of captions, including practices such as texting, tweeting, and adding titles and catchphrases to photos.

Captions, Capture, and Contrast in Australian vs. US Healthcare

On May 4, 2017, US President Donald Trump was scheduled to meet with Australian Prime Minister Malcolm Turnbull in New York City. Trump delayed the meeting, however, in order to await the results of a vote in the US House of Representatives to repeal the Affordable Care Act—commonly known as Obamacare. When he finally sat down with the Prime Minister later that day, Trump told him that Australia has “better health care” than the US, a statement that, in the words of a Guardian report, “triggered astonishment and glee” amongst Trump’s critics (Smith). In response to Trump’s surprising pronouncement, AJ+ produced a 1-minute video extending Trump’s initial comparison with a series of contrasts between Australian government-funded health care and American privatised health care (Facebook, “President Trump Says…”).
The video provides an excellent example of the role captions play in both generating attention and creating the unique aesthetic that is crucial to the AJ+ brand.

The opening frame of the video begins with a shot of the two leaders seated in front of the US and Australian flags, a diplomatic scene familiar to anyone who follows politics. The colours of the picture are predominantly red, white and blue. Superimposed on top of the image is a textbox containing the words “How does Australia’s healthcare compare to the US?” The question appears in white capital letters on a black background, and the box itself is heavily outlined in yellow. The white and yellow AJ+ logo appears in the upper right corner of the frame. This opening frame poses a question to the viewer, encouraging a kind of rhetorical interactivity. Through the use of colour in and around the caption, it also quickly establishes the AJ+ brand. This opening scene also draws on the Internet’s history of humorous “image macros”—exemplified by the early LOL cat memes—that create comedy through the superimposition of captions on photographic images (Shifman).

Captions continue to play a central role in meaning-making once the video plays. In the next frame, Trump is shown speaking to Turnbull. As he speaks, his words—“We have a failing healthcare”—drop onto the screen (Image 1). The captions are an exact transcription of Trump’s awkward phrase and appear centred in caps, with the words “failing healthcare” emphasised in larger, yellow font. With or without sound, these bold captions are concise, easily read on a small screen, and visually dominate the frame. The next few seconds of the video complete the sequence, as Trump tells Turnbull, “I shouldn’t say this to our great gentleman, my friend from Australia, ‘cause you have better healthcare than we do.” These words continue to appear over the image of the two men, still filling the screen. In essence, Trump’s verbal gaffe, transcribed word for word and appearing in AJ+’s characteristic white and yellow lettering, becomes the video’s hook, designed to visually call out to the Facebook user scrolling silently through their newsfeed.

Image 1: “We have a failing healthcare.”

The middle portion of the video answers the opening question, “How does Australia’s healthcare compare to the US?” There is no verbal language in this segment—the only sound is a simple synthesised soundtrack. Instead, captions, images, and spatial design, working in close cooperation, are used to draw five comparisons. Each of these comparisons uses the same format. A title appears at the top of the screen, with the remainder of the screen divided in two. The left side is labelled Australia, the right U.S. Underneath these headings, a representative image appears, followed by two statistics, one for each country. For example, the third comparison contrasts Australian and American infant mortality rates (Image 2). The left side of the screen shows a close-up of a mother kissing a baby, with the superimposed caption “3 per 1,000 births.” On the other side of the yellow border, the American infant mortality rate is illustrated with an image of a sleeping baby superimposed with a corresponding caption, “6 per 1,000 births.” Without voiceover, captions do much of the work of communicating the national differences.
They are, however, complemented and made more quickly comprehensible through the video’s spatial design and its subtly contrasting images, which help to visually organise the written content.

Image 2: “Infant mortality rate”

The final 10 seconds of the video bring sound back into the picture. We once again see and hear Trump tell Turnbull, “You have better healthcare than we do.” This image transforms into another pair of male faces—liberal American commentator Chris Hayes and US Senator Bernie Sanders—taken from an MSNBC cable television broadcast. On one side, Hayes says “They do have, they have universal healthcare.” On the other, Sanders laughs uproariously in response. The only added caption for this segment is “Hahahaha!”, the simplicity of which suggests that the video’s target audience is assumed to have a context for understanding Sanders’s laughter. Here and throughout the video, autoplay leads to a far more visual style of relating information, one in which captions—working alongside images and layout—become, in Zdenek’s words, a sort of “textual performance” (6).

The Black Panther Party and the Textual Performance of Progressive Politics

Reports on police brutality and Black Lives Matter protests have been amongst AJ+’s most widely viewed and shared videos (Reid, “Beyond Websites”). Their 2-minute video (Facebook, Black Panther) commemorating the 50th anniversary of the Black Panther Party, viewed 9.5 million times, provides background to these contemporary events. Like the comparison of American and Australian healthcare, captions shape the video’s structure. But here, rather than using contrast as a means of quick visual communication, the video is structured as a list of five significant points about the Black Panther Party. Captions are used not only to itemise and simplify—and ultimately to reduce—the party’s complex history, but also, somewhat paradoxically, to promote the news organisation’s own progressive values.

After announcing the intent and structure of the video—“5 things you should know about the Black Panther Party”—in its first 3 seconds, the video quickly sets out to describe each item in turn. The themes themselves correspond with AJ+’s own interests in policing, community, and protest, while the language used to announce each theme is characteristically concise and colloquial:

They wanted to end police brutality.
They were all about the community.
They made enemies in high places.
Women were vocal and active Panthers.
The Black Panthers’ legacy is still alive today.

Each of these themes is represented using a combination of archival black and white news footage and photographs depicting Black Panther members, marches, and events. These still and moving images are accompanied by audio recordings from party members, explaining the party’s origins, purposes, and influences. Captions are used throughout the video both to indicate the five themes and to transcribe the recordings. As the video moves from one theme to another, the corresponding number appears in the centre of the screen to indicate the transition, and then shrinks and moves to the upper left corner of the screen as a reminder for viewers. A musical soundtrack of strings and percussion, communicating a sense of urgency, underscores the full video.

While typographic features like font size, colour, and placement were significant in communicating meaning in AJ+’s healthcare video, there is an even broader range of experimentation here.
The numbers 1 to 5 that appear in the centre of the screen to announce each new theme blink and flicker like the countdown at the beginning of bygone film reels, gesturing towards the historical topic and complementing the black and white footage. For those many viewers watching the video without sound, an audio waveform above the transcribed interviews provides a visual clue that the captions are transcriptions of recorded voices. Finally, the colour green, used infrequently in AJ+ videos, is chosen to emphasise a select number of key words and phrases within the short video. Significantly, all of these words are spoken by Black Panther members. For example, captions transcribing former Panther leader Ericka Huggins speaking about the party’s slogan—“All power to the people”—highlight the words “power” and “people” with large, lime green letters that stand out against the grainy black and white photos (Image 3). The captions quite literally highlight ideas about oppression, justice, and social change that are central not only to an understanding of the history of the Black Panther Party, but also to the communication of the AJ+ brand.

Image 3: “All power to the people”

Conclusion

Employing distinctive combinations of word and image, AJ+ videos are produced to call out to users through the crowded semiotic spaces of social media. But they also call out to scholars to think carefully about the new kinds of literacies associated with rapidly changing digital media formats. Captioned video makes clear the need to recognise how meaning is constructed through sophisticated interpretive strategies that draw together multiple modes. While captions are certainly not new, an analysis of AJ+ videos suggests the use of novel typographical experiments that sit “midway between language and image” (Stöckl 289). Discussions of literacy need to expand to recognise this experimentation and to account for the complex interactions between the verbal and visual that get lost when written text is understood to function similarly across multiple platforms. In his interpretation of closed captioning, Zdenek provides an insightful list of the ways that captions transform meaning, including their capacity to contextualise, clarify, formalise, linearise and distill (8–9). His list signals not only the need for a deeper understanding of the role of captions, but also for a broader and more vivid vocabulary to describe multimodal meaning-making. Indeed, as Allan Luke suggests, within the complex multimodal and multilingual contexts of contemporary global societies, literacy requires that we develop and nurture “languages to talk about language” (459).

Just as importantly, an analysis of captioned video that takes into account the economic reasons for captioning also reminds us of the need for critical media literacies. AJ+ videos reveal how the commercial goals of branding, promotion, and profit-making influence the shape and presentation of news. As meaning-makers and as citizens, we require the capacity to assess how we are being addressed by news organisations that are themselves responding to the interests of economic and cultural juggernauts such as Facebook.
In schools, universities, and informal learning spaces, as well as through discourses circulated by research, media, and public policy, we might begin to generate more explicit and critical discussions of the ways that digital media—including texts that inform us and even those that exhort us towards more active forms of citizenship—simultaneously seek to manage, direct, and profit from our attention.

References

Bezemer, Jeff, and Gunther Kress. “Writing in Multimodal Texts: A Social Semiotic Account of Designs for Learning.” Written Communication 25.2 (2008): 166–195.
Constine, Josh. “Facebook Adds Automatic Subtitling for Page Videos.” TechCrunch 4 Jan. 2017. 1 May 2017 <https://techcrunch.com/2017/01/04/facebook-video-captions/>.
Ellis, Justin. “How AJ+ Embraces Facebook, Autoplay, and Comments to Make Its Videos Stand Out.” Nieman Labs 3 Aug. 2015. 28 Apr. 2017 <http://www.niemanlab.org/2015/08/how-aj-embraces-facebook-autoplay-and-comments-to-make-its-videos-stand-out/>.
Facebook. “President Trump Says…” Facebook, 2017. <https://www.facebook.com/ajplusenglish/videos/954884227986418/>.
Facebook. “Black Panther.” Facebook, 2017. <https://www.facebook.com/ajplusenglish/videos/820822028059306/>.
Goldhaber, Michael. “The Attention Economy and the Net.” First Monday 2.4 (1997). 9 June 2013 <http://firstmonday.org/article/view/519/440>.
Herrera, Linda. “Youth and Citizenship in the Digital Age: A View from Egypt.” Harvard Educational Review 82.3 (2012): 333–352.
Jewitt, Carey. “Introduction.” Routledge Handbook of Multimodal Analysis. Ed. Carey Jewitt. New York: Routledge, 2009. 1–8.
Jones, Rodney. “Technology and Sites of Display.” Routledge Handbook of Multimodal Analysis. Ed. Carey Jewitt. New York: Routledge, 2009. 114–126.
Lee, Carmen. “Micro-Blogging and Status Updates on Facebook: Texts and Practices.” Digital Discourse: Language in the New Media. Eds. Crispin Thurlow and Kristine Mroczek. Oxford Scholarship Online, 2011. DOI: 10.1093/acprof:oso/9780199795437.001.0001.
Lemke, Jay. “Multimodality, Identity, and Time.” Routledge Handbook of Multimodal Analysis. Ed. Carey Jewitt. New York: Routledge, 2009. 140–150.
Luke, Allan. “Critical Literacy in Australia: A Matter of Context and Standpoint.” Journal of Adolescent and Adult Literacy 43.5 (2000): 448–461.
Oremus, Will. “Facebook Is Eating the Media.” National Post 14 Jan. 2015. 15 June 2017 <http://news.nationalpost.com/news/facebook-is-eating-the-media-how-auto-play-videos-could-put-news-websites-out-of-business>.
———. “In Defense of Autoplay.” Slate 16 June 2015. 14 June 2017 <http://www.slate.com/articles/technology/future_tense/2015/06/autoplay_videos_facebook_twitter_are_making_them_less_annoying.html>.
Paik, Nam June. “The Video Synthesizer and Beyond.” The New Television: A Public/Private Art. Eds. Douglas Davis and Allison Simmons. Cambridge, MA: MIT Press, 1977. 45.
Reid, Alistair. “Beyond Websites: How AJ+ Is Innovating in Digital Storytelling.” Journalism.co 17 Apr. 2015. 13 Feb. 2017 <https://www.journalism.co.uk/news/beyond-websites-how-aj-is-innovating-in-digital-storytelling/s2/a564811/>.
———. “How AJ+ Reaches 600% of Its Audience on Facebook.” Journalism.co. 5 Aug. 2015. 13 Feb. 2017 <https://www.journalism.co.uk/news/how-aj-reaches-600-of-its-audience-on-facebook/s2/a566014/>.
Roettgers, Janko. “How Al Jazeera’s AJ+ Became One of the Biggest Video Publishers on Facebook.” Variety 30 July 2015. 1 May 2017 <http://variety.com/2015/digital/news/how-al-jazeeras-aj-became-one-of-the-biggest-video-publishers-on-facebook-1201553333/>.
Shifman, Limor. Memes in Digital Culture. Cambridge, MA: MIT Press, 2014.
Smith, David. “Trump Says ‘Everybody’, Not Just Australia, Has Better Healthcare than US.” The Guardian 5 May 2017. 5 May 2017 <https://www.theguardian.com/us-news/2017/may/05/trump-healthcare-australia-better-malcolm-turnbull>.
Stöckl, Hartmut. “Typography: Visual Language and Multimodality.” Interactions, Images and Texts. Eds. Sigrid Norris and Carmen Daniela Maier. Amsterdam: De Gruyter, 2014. 283–293.
Tandoc, Edson, and Julian Maitra. “News Organizations’ Use of Native Videos on Facebook: Tweaking the Journalistic Field One Algorithm Change at a Time.” New Media & Society (2017). DOI: 10.1177/1461444817702398.
Youmans, William. An Unlikely Audience: Al Jazeera’s Struggle in America. New York: Oxford University Press, 2017.
Zdenek, Sean. Reading Sounds: Closed-Captioned Media and Popular Culture. Chicago: University of Chicago Press, 2015.
Zou, Yanni. “How AJ+ Applies User-Centered Design to Win Millennials.” Medium 16 Apr. 2016. 7 May 2017 <https://medium.com/aj-platforms/how-aj-applies-user-centered-design-to-win-millennials-3be803a4192c>.
APA, Harvard, Vancouver, ISO, and other styles
