Academic literature on the topic 'Automated audio captioning'
Create an accurate reference in APA, MLA, Chicago, Harvard, and other citation styles.
Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Automated audio captioning.'
Journal articles on the topic "Automated audio captioning"
Bokhove, Christian, and Christopher Downey. "Automated generation of ‘good enough’ transcripts as a first step to transcription of audio-recorded data." Methodological Innovations 11, no. 2 (May 2018): 205979911879074. http://dx.doi.org/10.1177/2059799118790743.
Koenecke, Allison, Andrew Nam, Emily Lake, Joe Nudell, Minnie Quartey, Zion Mengesha, Connor Toups, John R. Rickford, Dan Jurafsky, and Sharad Goel. "Racial disparities in automated speech recognition." Proceedings of the National Academy of Sciences 117, no. 14 (March 23, 2020): 7684–89. http://dx.doi.org/10.1073/pnas.1915768117.
Mirzaei, Maryam Sadat, Kourosh Meshgi, Yuya Akita, and Tatsuya Kawahara. "Partial and synchronized captioning: A new tool to assist learners in developing second language listening skill." ReCALL 29, no. 2 (March 2, 2017): 178–99. http://dx.doi.org/10.1017/s0958344017000039.
Guo, Rundong. "Advancing real-time close captioning: blind source separation and transcription for hearing impairments." Applied and Computational Engineering 30, no. 1 (January 22, 2024): 125–30. http://dx.doi.org/10.54254/2755-2721/30/20230084.
Prabhala, Jagat Chaitanya, Venkatnareshbabu K, and Ragoju Ravi. "Optimizing Similarity Threshold for Abstract Similarity Metric in Speech Diarization Systems: A Mathematical Formulation." Applied Mathematics and Sciences: An International Journal (MathSJ) 10, no. 1/2 (June 26, 2023): 1–10. http://dx.doi.org/10.5121/mathsj.2023.10201.
Nam, Somang, and Deborah Fels. "Simulation of Subjective Closed Captioning Quality Assessment Using Prediction Models." International Journal of Semantic Computing 13, no. 01 (March 2019): 45–65. http://dx.doi.org/10.1142/s1793351x19400038.
Gotmare, Abhay, Gandharva Thite, and Laxmi Bewoor. "A multimodal machine learning approach to generate news articles from geo-tagged images." International Journal of Electrical and Computer Engineering (IJECE) 14, no. 3 (June 1, 2024): 3434. http://dx.doi.org/10.11591/ijece.v14i3.pp3434-3442.
Verma, Neeta. "Assistive Vision Technology using Deep Learning Techniques." International Journal for Research in Applied Science and Engineering Technology 9, no. VII (July 31, 2021): 2695–704. http://dx.doi.org/10.22214/ijraset.2021.36815.
Eren, Aysegul Ozkaya, and Mustafa Sert. "Automated Audio Captioning with Topic Modeling." IEEE Access, 2023, 1. http://dx.doi.org/10.1109/access.2023.3235733.
Xiao, Feiyang, Jian Guan, Qiaoxi Zhu, and Wenwu Wang. "Graph Attention for Automated Audio Captioning." IEEE Signal Processing Letters, 2023, 1–5. http://dx.doi.org/10.1109/lsp.2023.3266114.
Full textDissertations / Theses on the topic "Automated audio captioning"
Labbé, Etienne. "Description automatique des événements sonores par des méthodes d'apprentissage profond." Electronic Thesis or Diss., Université de Toulouse (2023-....), 2024. http://www.theses.fr/2024TLSES054.
In the audio research field, most machine learning systems focus on recognizing a limited number of sound events. However, when a machine interacts with real data, it must be able to handle much more varied and complex situations. To tackle this problem, annotators use natural language, which can summarize any sound information. Automated Audio Captioning (AAC) was introduced recently to develop systems capable of automatically producing a description of any type of sound in text form. The task concerns all kinds of sound events: environmental, urban, and domestic sounds, sound effects, music, and speech. Such a system could serve people who are deaf or hard of hearing, and could improve the indexing of large audio databases. In the first part of this thesis, we present the state of the art of the AAC task through a global description of public datasets, learning methods, architectures, and evaluation metrics. Building on this, we present the architecture of our first AAC system, which obtains encouraging scores on the main AAC metric, SPIDEr: 24.7% on the Clotho corpus and 40.1% on the AudioCaps corpus. In the second part, we explore several aspects of AAC systems. We first focus on evaluation methods through a study of SPIDEr. To this end, we propose a variant called SPIDEr-max, which considers several candidate captions for each audio file and shows that the SPIDEr metric is very sensitive to the predicted words. We then improve our reference system by exploring different architectures and numerous hyperparameters, exceeding the state of the art on AudioCaps (SPIDEr of 49.5%). Next, we explore a multi-task learning method aimed at improving the semantics of the sentences generated by our system. Finally, we build a general and unbiased AAC system called CONETTE, which can generate different types of descriptions approximating those of the target datasets.
In the third and last part, we study the ability of an AAC system to automatically search for audio content in a database. Our approach obtains scores competitive with systems dedicated to this task, while using fewer parameters. We also introduce semi-supervised methods to improve our system using new unlabeled audio data, and we show how pseudo-label generation can impact an AAC model. Finally, we study AAC systems in languages other than English: French, Spanish, and German. In addition, we propose a system capable of producing all four languages at once, and we compare it with systems specialized in each language.
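As a brief illustration of the metrics discussed in the abstract above: SPIDEr is commonly computed as the mean of the CIDEr and SPICE scores, and SPIDEr-max, the variant proposed in the thesis, keeps the best SPIDEr score over several candidate captions generated for one audio file. The sketch below assumes the per-candidate CIDEr and SPICE scores have already been computed by an external captioning-metrics toolkit; the function names are illustrative and not taken from the thesis code.

```python
# Minimal sketch of SPIDEr and the SPIDEr-max variant, assuming
# per-candidate CIDEr and SPICE scores are already available.

def spider(cider: float, spice: float) -> float:
    """SPIDEr: the mean of the CIDEr and SPICE scores."""
    return (cider + spice) / 2.0

def spider_max(candidate_scores: list[tuple[float, float]]) -> float:
    """SPIDEr-max: the best SPIDEr score over all candidate captions
    generated for a single audio file (e.g. via beam search)."""
    return max(spider(c, s) for c, s in candidate_scores)

# Example: two candidate captions for one audio clip
print(spider_max([(0.5, 0.25), (0.25, 0.25)]))  # 0.375
```

Taking the maximum over candidates, rather than scoring only the top beam, is what reveals the metric's sensitivity to individual word choices: near-equivalent candidates can receive very different SPIDEr scores.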
Book chapters on the topic "Automated audio captioning"
M., Nivedita, Asnath Victy Phamila Y., Umashankar Kumaravelan, and Karthikeyan N. "Voice-Based Image Captioning System for Assisting Visually Impaired People Using Neural Networks." In Principles and Applications of Socio-Cognitive and Affective Computing, 177–99. IGI Global, 2022. http://dx.doi.org/10.4018/978-1-6684-3843-5.ch011.
Venturini, Shamira, Michaela Mae Vann, Martina Pucci, and Giulia M. L. Bencini. "Towards a More Inclusive Learning Environment: The Importance of Providing Captions That Are Suited to Learners' Language Proficiency in the UDL Classroom." In Studies in Health Technology and Informatics. IOS Press, 2022. http://dx.doi.org/10.3233/shti220884.
Full textConference papers on the topic "Automated audio captioning"
Kim, Minkyu, Kim Sung-Bin, and Tae-Hyun Oh. "Prefix Tuning for Automated Audio Captioning." In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023. http://dx.doi.org/10.1109/icassp49357.2023.10096877.
Drossos, Konstantinos, Sharath Adavanne, and Tuomas Virtanen. "Automated audio captioning with recurrent neural networks." In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2017. http://dx.doi.org/10.1109/waspaa.2017.8170058.
Chen, Chen, Nana Hou, Yuchen Hu, Heqing Zou, Xiaofeng Qi, and Eng Siong Chng. "Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning." In Interspeech 2022. ISCA, 2022. http://dx.doi.org/10.21437/interspeech.2022-10510.
Kim, Jaeyeon, Jaeyoon Jung, Jinjoo Lee, and Sang Hoon Woo. "EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning." In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024. http://dx.doi.org/10.1109/icassp48485.2024.10446672.
Ye, Zhongjie, Yuqing Wang, Helin Wang, Dongchao Yang, and Yuexian Zou. "FeatureCut: An Adaptive Data Augmentation for Automated Audio Captioning." In 2022 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2022. http://dx.doi.org/10.23919/apsipaasc55919.2022.9980325.
Koh, Andrew, Soham Tiwari, and Chng Eng Siong. "Automated Audio Captioning with Epochal Difficult Captions for Curriculum Learning." In 2022 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2022. http://dx.doi.org/10.23919/apsipaasc55919.2022.9980242.
Wijngaard, Gijs, Elia Formisano, Bruno L. Giordano, and Michel Dumontier. "ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds." In 2023 31st European Signal Processing Conference (EUSIPCO). IEEE, 2023. http://dx.doi.org/10.23919/eusipco58844.2023.10289793.
Jain, Arushi, Navaneeth B. R, Shelly Mohanty, R. Sujatha, Sujatha R, Sourabh Tiwari, and Rashmi T. Shankarappa. "Web Framework for Enhancing Automated Audio Captioning Performance for Domestic Environment." In 2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT). IEEE, 2022. http://dx.doi.org/10.1109/icccnt54827.2022.9984255.
Sun, Jianyuan, Xubo Liu, Xinhao Mei, Volkan Kılıç, Mark D. Plumbley, and Wenwu Wang. "Dual Transformer Decoder based Features Fusion Network for Automated Audio Captioning." In INTERSPEECH 2023. ISCA, 2023. http://dx.doi.org/10.21437/interspeech.2023-943.
Xu, Xuenan, Heinrich Dinkel, Mengyue Wu, Zeyu Xie, and Kai Yu. "Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning." In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021. http://dx.doi.org/10.1109/icassp39728.2021.9413982.