
Journal articles on the topic "Video text"

Create an accurate citation in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 journal articles for your research on the topic "Video text".

Next to each source in the reference list there is an "Add to bibliography" button. Press this button, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Vancouver, Chicago, etc.

You can also download the full text of the academic publication in PDF format and read its abstract online whenever it is available in the metadata.

Explore journal articles on a wide variety of disciplines and organize your bibliography correctly.

1

Huang, Bin, Xin Wang, Hong Chen, Houlun Chen, Yaofei Wu, and Wenwu Zhu. "Identity-Text Video Corpus Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 4 (2025): 3608–16. https://doi.org/10.1609/aaai.v39i4.32375.

Full text
Abstract
Video corpus grounding (VCG), which aims to retrieve relevant video moments from a video corpus, has attracted significant attention in the multimedia research community. However, the existing VCG setting primarily focuses on matching textual descriptions with videos and ignores the distinct visual identities in the videos, thus resulting in inaccurate understanding of video content and deteriorated retrieval performances. To address this limitation, we introduce a novel task, Identity-Text Video Corpus Grounding (ITVCG), which simultaneously utilize textual descriptions and visual identities as queries. As such, ITVCG benefits in enabling more accurate video corpus grounding with visual identities, as well as providing users with more flexible options to locate relevant frames based on either textual descriptions or textual descriptions and visual identities. To conduct evaluations regarding the novel ITVCG task, we propose the TVR-IT dataset, comprising 463 identity images from 6 TV shows, with 68,840 out of 72,840 queries containing at least one identity image. Furthermore, we propose Video-Locator, the first model designed for the ITVCG task. Our proposed Video-Locator integrates video-identity-text alignment and multi-modal fine-grained fusion components, facilitating a video large language model (Video LLM) to jointly understand textual descriptions, visual identities, as well as videos. Experimental results demonstrate the effectiveness of the proposed Video-Locator model and highlight the importance of identity-generalization capability for ITVCG.
2

Avinash, N. Bhute, and Meshram B.B. "Text Based Approach For Indexing And Retrieval Of Image And Video: A Review." Advances in Vision Computing: An International Journal (AVC) 1, no. 1 (2014): 27–38. https://doi.org/10.5281/zenodo.3554868.

Full text
Abstract
Text data present in multimedia contain useful information for automatic annotation and indexing. The extracted information is used for recognition of overlay or scene text in a given video or image, and the extracted text can then be used for retrieving videos and images. In this paper, we first discuss the different techniques for text extraction from images and videos, and then review techniques for indexing and retrieval of images and videos using the extracted text.
3

Avinash, N. Bhute, and Meshram B.B. "Text Based Approach For Indexing And Retrieval Of Image And Video: A Review." Advances in Vision Computing: An International Journal (AVC) 1, no. 1 (2014): 27–38. https://doi.org/10.5281/zenodo.3357696.

Full text
Abstract
Text data present in multimedia contain useful information for automatic annotation and indexing. The extracted information is used for recognition of overlay or scene text in a given video or image, and the extracted text can then be used for retrieving videos and images. In this paper, we first discuss the different techniques for text extraction from images and videos, and then review techniques for indexing and retrieval of images and videos using the extracted text.
4

V, Divya, Prithica G, and Savija J. "Text Summarization for Education in Vernacular Languages." International Journal for Research in Applied Science and Engineering Technology 11, no. 7 (2023): 175–78. http://dx.doi.org/10.22214/ijraset.2023.54589.

Full text
Abstract
Abstract: This project proposes a video summarizing system based on natural language processing (NLP) and machine learning to summarize YouTube video transcripts without losing the key elements. The quantity of videos available on web platforms is steadily expanding. The content is made available globally, primarily for educational purposes. Additionally, educational content is available on YouTube, Facebook, Google, and Instagram. A significant issue in extracting information from videos is that, unlike an image, where data can be collected from a single frame, a viewer must watch the entire video to grasp the context. This study aims to shorten the length of the transcript of the given video. The suggested method involves retrieving the transcript from the video link provided by the user and then summarizing it using Hugging Face Transformers and pipelining. The built model accepts a video link and the required summary duration as input from the user and generates a summarized transcript as output. According to the results, the final summarized transcript was obtained in less time when compared with other proposed techniques. Furthermore, the video's central concept is accurately present in the final summary without any deviations.
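
As an illustration of the pipeline this abstract outlines (fetching a YouTube transcript and condensing it with Hugging Face Transformers), here is a minimal sketch; the model checkpoint, video ID, and length limits are placeholder assumptions, not details taken from the paper.

# Sketch: fetch a YouTube transcript and summarize it with a Transformers pipeline.
from youtube_transcript_api import YouTubeTranscriptApi
from transformers import pipeline

video_id = "dQw4w9WgXcQ"  # hypothetical video ID
segments = YouTubeTranscriptApi.get_transcript(video_id)
transcript = " ".join(seg["text"] for seg in segments)

# Any seq2seq summarization checkpoint works here; this one is just an example.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
# Crude character truncation keeps the input within the model's context window.
summary = summarizer(transcript[:3000], max_length=150, min_length=40, do_sample=False)
print(summary[0]["summary_text"])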
5

Namrata, Dave, and S. Holia Mehfuza. "News Story Retrieval Based on Textual Query." International Journal of Engineering and Advanced Technology (IJEAT) 9, no. 3 (2021): 2918–22. https://doi.org/10.5281/zenodo.5589205.

Full text
Abstract
This paper presents news video retrieval using text queries for Gujarati-language news videos. Because broadcast video in India lacks metadata such as closed captioning and transcriptions, retrieval of videos based on text data is not a trivial task for most Indian-language video. Retrieving a specific story based on a text query in a regional language is the key idea behind our approach. Broadcast video is segmented into shots representing short news stories. To represent each shot efficiently, key frame extraction using singular value decomposition and the rank of a matrix is proposed. Text is extracted from the keyframes for further indexing. The next task is to process the text using natural language processing steps such as tokenization, removal of punctuation and extra symbols, and stemming of words to their root forms. Due to the unavailability of stemming and other text preprocessing methods for the Gujarati language, we provide a basic stemming technique to reduce the dictionary size for efficient indexing of the text data. With the proposed system, 82.5 percent accuracy is achieved on the Gujarati news video dataset ETV.
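
The indexing step described here (tokenization, punctuation removal, light stemming, then an inverted index over the OCR text of each story) can be sketched roughly as follows; the toy suffix list and sample data are illustrative assumptions, not the paper's actual Gujarati stemmer.

import re
from collections import defaultdict

def tokenize(text):
    # Lowercase, strip punctuation and symbols, split on word boundaries.
    return re.findall(r"\w+", text.lower())

def crude_stem(token, suffixes=("ing", "ed", "s")):
    # Stand-in for the basic suffix-stripping stemmer the paper proposes.
    for suf in suffixes:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[: -len(suf)]
    return token

def build_index(stories):
    # stories: {story_id: OCR text of its key frames}
    index = defaultdict(set)
    for sid, text in stories.items():
        for tok in tokenize(text):
            index[crude_stem(tok)].add(sid)
    return index

def search(index, query):
    # Return story ids containing every stemmed query term.
    terms = [crude_stem(t) for t in tokenize(query)]
    hits = [index.get(t, set()) for t in terms]
    return set.intersection(*hits) if hits else set()

stories = {"story1": "Election results announced today", "story2": "Cricket match highlights"}
idx = build_index(stories)
print(search(idx, "election result"))  # -> {'story1'}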
6

Doran, Michael, Adrian Barnett, Joan Leach, William Lott, Katie Page, and Will Grant. "Can video improve grant review quality and lead to more reliable ranking?" Research Ideas and Outcomes 3 (February 1, 2017): e11931. https://doi.org/10.3897/rio.3.e11931.

Full text
Abstract
Multimedia video is rapidly becoming mainstream, and many studies indicate that it is a more effective communication medium than text. In this project we AIM to test if videos can be used, in place of text-based grant proposals, to improve communication and increase the reliability of grant ranking. We will test if video improves reviewer comprehension (AIM 1), if external reviewer grant scores are more consistent with video (AIM 2), and if mock Australian Research Council (ARC) panels award more consistent scores when grants are presented as videos (AIM 3). This will be the first study to evaluate the use of video in this application. The ARC reviewed over 3500 Discovery Project applications in 2015, awarding 635 Projects. Selecting the "best" projects is extremely challenging. This project will improve the selection process by facilitating the transition from text-based to video-based proposals. The impact could be profound: Improved video communication should streamline the grant preparation and review processes, enable more reliable ranking of applications, and more accurate identification of the "next big innovations".
7

Jiang, Ai Wen, and Gao Rong Zeng. "Multi-information Integrated Method for Text Extraction from Videos." Advanced Materials Research 225-226 (April 2011): 827–30. http://dx.doi.org/10.4028/www.scientific.net/amr.225-226.827.

Full text
Abstract
Video text provides important semantic information in video content analysis. However, video text with complex background has a poor recognition performance for OCR. Most of the previous approaches to extracting overlay text from videos are based on traditional binarization and give little attention on multi-information integration, especially fusing the background information. This paper presents an effective method to precisely extract characters from videos to enable it for OCR with a good recognition performance. The proposed method combines multi-information together including background information, edge information, and character’s spatial information. Experimental results show that it is robust to complex background and various text appearances.
8

Ma, Fan, Xiaojie Jin, Heng Wang, Jingjia Huang, Linchao Zhu, and Yi Yang. "Stitching Segments and Sentences towards Generalization in Video-Text Pre-training." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 5 (2024): 4080–88. http://dx.doi.org/10.1609/aaai.v38i5.28202.

Full text
Abstract
Video-language pre-training models have recently achieved remarkable results on various multi-modal downstream tasks. However, most of these models rely on contrastive learning or masked modeling to align global features across modalities, neglecting the local associations between video frames and text tokens. This limits the model's ability to perform fine-grained matching and generalization, especially for tasks that involve selecting segments in long videos based on query texts. To address this issue, we propose a novel stitching and matching pretext task for video-language pre-training that encourages fine-grained interactions between modalities. Our task involves stitching video frames or sentences into longer sequences and predicting the positions of cross-modal queries in the stitched sequences. The individual frame and sentence representations are thus aligned via the stitching and matching strategy, encouraging fine-grained interactions between videos and texts. We conduct extensive experiments on various benchmarks covering text-to-video retrieval, video question answering, video captioning, and moment retrieval. Our results demonstrate that the proposed method significantly improves the generalization capacity of video-text pre-training models.
9

Liu, Yang, Shudong Huang, Deng Xiong, and Jiancheng Lv. "Learning Dynamic Similarity by Bidirectional Hierarchical Sliding Semantic Probe for Efficient Text Video Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 6 (2025): 5667–75. https://doi.org/10.1609/aaai.v39i6.32604.

Full text
Abstract
Text-video retrieval is a foundation task in multi-modal research which aims to align texts and videos in the embedding space. The key challenge is to learn the similarity between videos and texts. A conventional approach involves directly aligning video-text pairs using cosine similarity. However, due to the disparity in the information conveyed by videos and texts, i.e., a single video can be described from multiple perspectives, the retrieval accuracy is suboptimal. An alternative approach employs cross-modal interaction to enable videos to dynamically acquire distinct features from various texts, thus facilitating similarity calculations. Nevertheless, this solution incurs a computational complexity of O(n^2) during retrieval. To this end, this paper proposes a novel method called Bidirectional Hierarchical Sliding Semantic Probe (BiHSSP), which calculates dynamic similarity between videos and texts with O(n) complexity during retrieval. We introduce a hierarchical semantic probe module that learns semantic probes at different scales for both video and text features. Semantic probe involves a sliding calculation of the cross-correlation between semantic probes at different scales and embeddings from another modality, allowing for dynamic similarity computation between video and text descriptions from various perspectives. Specifically, for text descriptions from different angles, we calculate the similarity at different locations within the video features and vice versa. This approach preserves the complete information of the video while addressing the issue of unequal information between video and text without requiring cross-modal interaction. Additionally, our method can function as a plug-and-play module across various methods, thereby enhancing the corresponding performance. Experimental results demonstrate that our BiHSSP significantly outperforms the baseline.
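
A rough numpy illustration of the sliding idea (computing the correlation between a text-side probe and each temporal position of the video features and keeping the best response in linear time) is given below; it is a simplification under assumed shapes, not the authors' BiHSSP implementation.

import numpy as np

def sliding_probe_similarity(text_probe, frame_feats):
    # text_probe: (d,) semantic probe derived from the text side
    # frame_feats: (n, d) per-frame video embeddings
    text_probe = text_probe / np.linalg.norm(text_probe)
    frames = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    responses = frames @ text_probe          # O(n) sliding dot products over the video
    return responses.max(), int(responses.argmax())

rng = np.random.default_rng(0)
probe = rng.normal(size=256)
video = rng.normal(size=(40, 256))
score, position = sliding_probe_similarity(probe, video)
print(f"best similarity {score:.3f} at frame {position}")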
10

Sun, Shangkun, Xiaoyu Liang, Songlin Fan, Wenxu Gao, and Wei Gao. "VE-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 7 (2025): 7105–13. https://doi.org/10.1609/aaai.v39i7.32763.

Full text
Abstract
Text-driven video editing has recently experienced rapid development. Despite this, evaluating edited videos remains a considerable challenge. Current metrics tend to fail to align with human perceptions, and effective quantitative metrics for video editing are still notably absent. To address this, we introduce VE-Bench, a benchmark suite tailored to the assessment of text-driven video editing. This suite includes VE-Bench DB, a video quality assessment (VQA) database for video editing. VE-Bench DB encompasses a diverse set of source videos featuring various motions and subjects, along with multiple distinct editing prompts, editing results from 8 different models, and the corresponding Mean Opinion Scores (MOS) from 24 human annotators. Based on VE-Bench DB, we further propose VE-Bench QA, a quantitative human-aligned measurement for the text-driven video editing task. In addition to the aesthetic, distortion, and other visual quality indicators that traditional VQA methods emphasize, VE-Bench QA focuses on the text-video alignment and the relevance modeling between source and edited videos. It introduces a new assessment network for video editing that attains superior performance in alignment with human preferences.To the best of our knowledge, VE-Bench introduces the first quality assessment dataset for video editing and proposes an effective subjective-aligned quantitative metric for this domain. All models, data, and code will be publicly available to the community.
11

Yariv, Guy, Itai Gat, Sagie Benaim, Lior Wolf, Idan Schwartz, and Yossi Adi. "Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 7 (2024): 6639–47. http://dx.doi.org/10.1609/aaai.v38i7.28486.

Full text
Abstract
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresponding segment of that video. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it also enables video generation conditioned on text, audio, and, for the first time as far as we can ascertain, on both text and audio. We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples and further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. AV-Align is based on the detection and comparison of energy peaks in both modalities. In comparison to recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, both with respect to content and temporal axis. We also show that videos produced by our method present higher visual quality and are more diverse. Code and samples are available at: https://pages.cs.huji.ac.il/adiyoss-lab/TempoTokens/.
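
The AV-Align idea (detect energy peaks in the audio and in the video motion, then check how often they co-occur) could be approximated along these lines; the tolerance window and the use of generic 1-D energy signals are assumptions for illustration, not the paper's exact formulation.

import numpy as np
from scipy.signal import find_peaks

def peak_alignment(audio_energy, motion_energy, tolerance=2):
    # audio_energy, motion_energy: 1-D arrays sampled at the same (per-frame) rate.
    a_peaks, _ = find_peaks(audio_energy)
    v_peaks, _ = find_peaks(motion_energy)
    if len(a_peaks) == 0:
        return 0.0
    # Fraction of audio peaks that have a video motion peak within the tolerance window.
    matched = sum(any(abs(p - q) <= tolerance for q in v_peaks) for p in a_peaks)
    return matched / len(a_peaks)

rng = np.random.default_rng(1)
audio = np.abs(rng.normal(size=120))
motion = np.abs(rng.normal(size=120))
print(f"alignment score: {peak_alignment(audio, motion):.2f}")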
12

Rachidi, Youssef. "Text Detection in Video for Video Indexing." International Journal of Computer Trends and Technology 68, no. 4 (2020): 96–99. http://dx.doi.org/10.14445/22312803/ijctt-v68i4p117.

Full text
13

Cao, Shuqiang, Bairui Wang, Wei Zhang, and Lin Ma. "Visual Consensus Modeling for Video-Text Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 1 (2022): 167–75. http://dx.doi.org/10.1609/aaai.v36i1.19891.

Full text
Abstract
In this paper, we propose a novel method to mine the commonsense knowledge shared between the video and text modalities for video-text retrieval, namely visual consensus modeling. Different from the existing works, which learn the video and text representations and their complicated relationships solely based on the pairwise video-text data, we make the first attempt to model the visual consensus by mining the visual concepts from videos and exploiting their co-occurrence patterns within the video and text modalities with no reliance on any additional concept annotations. Specifically, we build a shareable and learnable graph as the visual consensus, where the nodes denoting the mined visual concepts and the edges connecting the nodes representing the co-occurrence relationships between the visual concepts. Extensive experimental results on the public benchmark datasets demonstrate that our proposed method, with the ability to effectively model the visual consensus, achieves state-of-the-art performances on the bidirectional video-text retrieval task. Our code is available at https://github.com/sqiangcao99/VCM.
14

Chiu, Chih-Yi, Po-Chih Lin, Sheng-Yang Li, Tsung-Han Tsai, and Yu-Lung Tsai. "Tagging Webcast Text in Baseball Videos by Video Segmentation and Text Alignment." IEEE Transactions on Circuits and Systems for Video Technology 22, no. 7 (2012): 999–1013. http://dx.doi.org/10.1109/tcsvt.2012.2189478.

Full text
15

Liu, Yi, Yue Zhang, Haidong Hu, Xiaodong Liu, Lun Zhang, and Ruijun Liu. "An Extended Text Combination Classification Model for Short Video Based on Albert." Journal of Sensors 2021 (October 16, 2021): 1–7. http://dx.doi.org/10.1155/2021/8013337.

Full text
Abstract
With the rise and rapid development of short video sharing websites, the number of short videos on the Internet has been growing explosively. The organization and classification of short videos have become the basis for their effective use, which is also a problem faced by major short video platforms. Given the complex content categories and rich extended text information of short videos, this paper uses methods from the text classification field to solve the short video classification problem. Compared with the traditional way of classifying and understanding short video key frames, this method has lower computational cost, more accurate classification results, and easier application. This paper proposes a text classification model based on an attention mechanism over multiple text embeddings of a short video's extended information. The experiment first uses the pre-trained language model ALBERT to extract sentence-level vectors and then uses the attention mechanism to learn a weighting factor for the various kinds of extended text information in short video classification. The research also applied Google's unsupervised data augmentation (UDA) method, creatively combining it with a Chinese knowledge graph to realize TF-IDF word replacement. During the training process, we introduced a large amount of unlabeled data, which significantly improved the accuracy of model classification. A final series of experiments compares the proposed method with existing short video title classification methods, classification methods based on video key frames, and hybrid methods, and shows that it is more accurate and robust on the test set.
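
Extracting sentence-level vectors with a pretrained ALBERT encoder, the first step of the pipeline above, might look like the following sketch; the English checkpoint name and mean pooling are assumptions rather than the authors' Chinese-language setup.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
encoder = AutoModel.from_pretrained("albert-base-v2")

def sentence_vector(text):
    # Mean-pool the last hidden states into one fixed-size sentence embedding.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

title_vec = sentence_vector("Funny cat compilation with captions")
print(title_vec.shape)  # torch.Size([1, 768])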
16

Bodyanskaya, Alisa, and Kapitalina Sinegubova. "Music Video as a Poetic Interpretation." Virtual Communication and Social Networks 2023, no. 2 (2023): 47–55. http://dx.doi.org/10.21603/2782-4799-2023-2-2-47-55.

Full text
Abstract
This article introduces the phenomenon of videopoetry as a hybrid product of mass media whose popularity is based on intermediality, i.e., the cumulative effect on different perception channels. Videopoetry is a productive form of verbal creativity in the contemporary media culture with its active reception of art. The research featured poems by W. B. Yeats, T. S. Eliot, and W. H. Auden presented as videos and the way they respond to someone else's poetic word. The authors analyzed 15 videos by comparing the original text and the video sequence in line with the method developed by N. V. Barkovskaya and A. A. Zhitenev. The analysis revealed several options for relaying a poetic work as a music video. Three videos provided a direct illustration of the source text, suggesting a complete or partial visual duplication of the original poetic imagery. Five videos offered an indirect illustration of the source text by using associative images in relation to the central images of the poem. Five videos gave a minimal illustration: the picture did not dominate the text of the poem, but its choice implied a certain interpretation. Two videos featured the video maker as a reciter. The video makers did not try to transform the poetic text but used the video sequence as a way to enter into a dialogue with the original poem or resorted to indirect illustration to generate occasional meanings. Thus, video makers keep the original text unchanged and see the video sequence and musical accompaniment as their responsibility but maintain a dialogue between the original text and its game reinterpretation.
17

Letroiwen, Kornelin, Aunurrahman ., and Indri Astuti. "PENGEMBANGAN VIDEO ANIMASI UNTUK MENINGKATKAN KEMAMPUAN READING COMPREHENSION FACTUAL REPORT TEXT." Jurnal Teknologi Pendidikan (JTP) 16, no. 1 (2023): 16. http://dx.doi.org/10.24114/jtp.v16i1.44842.

Full text
Abstract
Abstrak: Penelitian ini bertujuan untuk mengembangkan desain video animasi untuk pembelajaran Bahasa Inggris materi factual report text. Metode penelitian ini adalah Research and Development dengan model desain pengembangan ASSURE. Sebanyak 42 siswa kelas XI SMKN 1 Ngabang terlibat dalam penelitian ini. Adapun data yang diperoleh dianalisis secara kualitatif dan kuantitatif. Profil video animasi menampilkan video animasi dengan karakter animasi 2D dan terdiri dari cover, profil pengembang, salam (greeting), kompetensi dasar, tujuan pembelajaran, definisi, fungsi sosial, struktur teks, unsur kebahasaan, materi pembelajaran, contoh materi factual report text. Hasil Uji validasi ahli desain yaitu 3,79, hasil uji validasi ahli materi 3,57 dan hasil uji validasi ahli media yaitu 3,55. Dari hasil uji validasi ahli desain, ahli materi, ahli media menunjukkan semua rata – rata nya diatas 3,0, yang artinya hasil nya valid. Hal ini menandakan video animasi factual report text daapat digunakan pada pembelajaran bahasa inggris. Untuk memperkuat validnya video animasi ini maka dilakukaan uji coba pada siswa SMK yang merupakan pengguna langsung video animasi tersbut. Adapun hasil uji coba one to one adalah 95,31, hasil uji coba kelompok sedang adalah 93,81, hasil ujicoba kelompok besar adalah 94,75, hasil uji coba ketiga kelompok tersebut lebih tinggi dari kriteria respons siswa Rs ≥ 85. Rata-rata hasil pretest yaitu 62,67 dan rata-rata hasil postest yaitu 81,3, hal tersebut menunjukkan ada peningkatan hasil belajar siswa setelah penggunaan video animasi factual report text. Kata kunci: media pembelajaran, video animasi, reading comprehension, factual report text Abstract: This study aims to develop an animated video design for learning English on factual report text material. This research method is Research and Development with the ASSURE development design model. A total of 42 class XI students of SMKN 1 Ngabang were involved in this study. The data obtained were analyzed qualitatively and quantitatively. The animated video profile displays animated videos with 2D animated characters and consists of covers, developer profiles, greetings, basic competencies, learning objectives, definitions, social functions, text structures, linguistic elements, learning materials, animated images. The result of the design expert validation test was 3.79, the material expert validation test result was 3.57 and the media expert validation test result was 3.55. From the results of the validation test by design experts, material experts, media experts all show that the average is above 3.0, which means the results are valid. This indicates that factual report text animated videos can be used in learning English. To strengthen the validity of this animated video, a trial was carried out on SMK students who are direct users of the animated video. The results of the one to one trial were 95.31, the results of the medium group trial were 93.81, the results of the large group trial were 94.75, the results of the three group trials were higher than the media response criteria of Rs ≥ 85. The average pretest result is 62.67 and the average posttest result is 81.3. This shows that there is an increase in student learning outcomes after using factual report text animated videos. Keywords: learning media, animated videos, reading comprehension, factual report text
18

Ghorpade, Jayshree, Raviraj Palvankar, Ajinkya Patankar, and Snehal Rathi. "Extracting Text from Video." Signal & Image Processing : An International Journal 2, no. 2 (2011): 103–12. http://dx.doi.org/10.5121/sipij.2011.2209.

Full text
19

Wadaskar, Ghanshyam, Sanghdip Udrake, Vipin Bopanwar, Shravani Upganlawar, and Prof Minakshi Getkar. "Extract Text from Video." International Journal for Research in Applied Science and Engineering Technology 12, no. 5 (2024): 2881–83. http://dx.doi.org/10.22214/ijraset.2024.62287.

Full text
Abstract
Abstract: The code imports the YoutubeTranscriptionApi from the youtube_transcription_api library, and the YouTube video ID is defined. The transcription data for the given video ID is fetched using the get_transcription method. The transcription text is extracted from the data and stored in the transcription variable. The transcription is split into lines and then joined back into a single string. Finally, the processed transcript is written into a text file named "Love.text" with UTF-8 encoding. The commented-out code block is an alternative way to write the transcript into a text file using the open function directly, which you can use if you prefer.
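
For reference, the workflow this abstract walks through corresponds to a few lines against the youtube_transcript_api package (note that the package name and its long-standing get_transcript helper differ slightly from the spellings used in the abstract); the video ID below is a placeholder.

from youtube_transcript_api import YouTubeTranscriptApi

video_id = "dQw4w9WgXcQ"  # placeholder YouTube video ID
data = YouTubeTranscriptApi.get_transcript(video_id)

# Collect the per-segment text, then join everything into a single string.
lines = [segment["text"] for segment in data]
transcript = " ".join(" ".join(lines).splitlines())

# Write the processed transcript to "Love.text" with UTF-8 encoding, as described.
with open("Love.text", "w", encoding="utf-8") as handle:
    handle.write(transcript)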
20

Vishwashanthi, M. "Text-To-Video Generator." International Scientific Journal of Engineering and Management 04, no. 05 (2025): 1–9. https://doi.org/10.55041/isjem03655.

Full text
Abstract
Abstract: The integration of artificial intelligence in multimedia content creation has paved the way for innovative applications like text-to-video generation. This research presents an advanced Text-to-Video Generator capable of converting textual inputs into coherent video narratives. The system is further enhanced with multilingual support for Indian languages and the inclusion of subtitles, broadening its accessibility and user engagement. By leveraging natural language processing and machine learning techniques, the application ensures accurate interpretation and representation of diverse linguistic inputs. The addition of subtitles not only aids in comprehension but also caters to audiences with hearing impairments. This paper delves into the system's architecture, implementation, and performance evaluation, highlighting its potential in educational, entertainment, and informational domains. Key Word: Text-to-Video Generation, Multilingual Support, Subtitles, Natural Language Processing, Machine Learning, Indian Languages
21

Luo, Dezhao, Shaogang Gong, Jiabo Huang, Hailin Jin, and Yang Liu. "Generative Video Diffusion for Unseen Novel Semantic Video Moment Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 6 (2025): 5847–55. https://doi.org/10.1609/aaai.v39i6.32624.

Full text
Abstract
Video moment retrieval (VMR) aims to locate the most likely video moment(s) corresponding to a text query in untrimmed videos. Training of existing methods is limited by the lack of diverse and generalisable VMR datasets, hindering their ability to generalise moment-text associations to queries containing novel semantic concepts (unseen both visually and textually in a training source domain). For model generalisation to novel semantics, existing methods rely heavily on assuming to have access to both video and text sentence pairs from a target domain in addition to the source domain pair-wise training data. This is neither practical nor scalable. In this work, we introduce a more generalisable approach by assuming only text sentences describing new semantics are available in model training without having seen any videos from a target domain. To that end, we propose a Fine-grained Video Editing framework, termed FVE, that explores generative video diffusion to facilitate fine-grained video editing from the seen source concepts to the unseen target sentences consisting of new concepts. This enables generative hypotheses of unseen video moments corresponding to the novel concepts in the target domain. This fine-grained generative video diffusion retains the original video structure and subject specifics from the source domain while introducing semantic distinctions of unseen novel vocabularies in the target domain. A critical challenge is how to enable this generative fine-grained diffusion process to be meaningful in optimising VMR, more than just synthesising visually pleasing videos. We solve this problem by introducing a hybrid selection mechanism that integrates three quantitative metrics to selectively incorporate synthetic video moments (novel video hypotheses) as enlarged additions to the original source training data, whilst minimising potential detrimental noise or unnecessary repetitions in the novel synthetic videos harmful to VMR learning. Experiments on three datasets demonstrate the effectiveness of FVE to unseen novel semantic video moment retrieval tasks
22

Godha, Ashima, and Puja Trivedi. "CNN Filter based Text Region Segmentation from Lecture Video and Extraction using NeuroOCR." SMART MOVES JOURNAL IJOSCIENCE 5, no. 7 (2019): 7. http://dx.doi.org/10.24113/ijoscience.v5i7.218.

Full text
Abstract
Lecture videos are rich in textual information, and being able to understand that text is quite useful for larger video understanding and analysis applications. Though text recognition from images has been an active research area in computer vision, text in lecture videos has mostly been overlooked. This paper focuses on text extraction from different types of lecture videos, such as slide, whiteboard, and paper-based lecture videos. For text extraction, the text regions are segmented in video frames and recognized using a recurrent neural network based OCR. Finally, the extracted text is converted into audio for convenience. The designed algorithm is tested on videos from different lectures, and the experimental results show that the proposed methodology is more efficient than existing work.
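
A simplified stand-in for the pipeline sketched in this abstract — sample frames, run OCR on them, and voice the recognized text — is shown below using pytesseract and gTTS instead of the paper's neural OCR; the frame step and file names are arbitrary assumptions.

import cv2
import pytesseract
from gtts import gTTS

def lecture_video_to_audio(video_path, frame_step=150, out_mp3="lecture_text.mp3"):
    cap = cv2.VideoCapture(video_path)
    texts, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_step == 0:  # sample roughly every few seconds
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            texts.append(pytesseract.image_to_string(gray))
        idx += 1
    cap.release()
    combined = " ".join(t.strip() for t in texts if t.strip())
    if combined:
        gTTS(combined).save(out_mp3)  # convert the extracted text to speech
    return combined

# lecture_video_to_audio("slides_lecture.mp4")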
23

Tahwiana, Zein, Regina Regina, Eka Fajar Rahmani, Yohanes Gatot Sutapa Yuliana, and Wardah Wardah. "The ENHANCING NARRATIVE WRITING SKILLS THROUGH ANIMATION VIDEOS IN THE EFL CLASSROOM." Getsempena English Education Journal 12, no. 1 (2025): 1–13. https://doi.org/10.46244/geej.v12i1.2902.

Full text
Abstract
This study examined the use of animation videos to teach narrative text writing to SMP Negeri 21 Pontianak eighth-grade students. The study used the 8B class of SMP Negeri 21 Pontianak as the research sample, consisting of 35 students taken from cluster random sampling from a population of 209 students. This pre-experimental study also used a group pre-test and post-test design, consisting of three procedures: pre-test, treatment, and post-test. This study was conducted in two treatments for 120 minutes per meeting by using animation videos to teach narrative text. Two methods were used in the treatment: the first was a full viewing, in which learners watched the entire video without pausing to learn about the plot, characters, and moral lessons. The second method is freeze with text, in which the teacher paused the video and explained each of the generic structures and language features in the narrative text based on the video scenes. The data were gathered through a narrative writing test for the pre-test and post-test, and as for the assessment, a rubric consisting of content, organization, vocabulary, grammar, and mechanics. The finding revealed that animation videos effectively taught students narrative text writing skills. The t-test result was higher than the t-table (10.90>1.691), indicating the effectiveness. The effect size was also categorized as a “very strong effect” with a calculation value of 1.74. These findings imply that animation videos can be used as media support to teach narrative text writing to students, especially those at SMP Negeri 21 Pontianak.
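
The statistics reported here (a paired t-test on pre-/post-test scores plus a Cohen's d-style effect size) can be reproduced with scipy on any paired score arrays; the numbers below are invented stand-ins, not the study's data.

import numpy as np
from scipy.stats import ttest_rel

pre = np.array([60, 55, 70, 62, 58, 65, 59, 61])    # illustrative pre-test scores
post = np.array([75, 68, 82, 74, 70, 80, 73, 76])   # illustrative post-test scores

t_stat, p_value = ttest_rel(post, pre)              # paired samples t-test
diff = post - pre
cohens_d = diff.mean() / diff.std(ddof=1)           # effect size of the paired difference

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {cohens_d:.2f}")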
24

Nazmun, Nessa Moon, Salehin Imrus, Parvin Masuma, et al. "Natural language processing based advanced method of unnecessary video detection." International Journal of Electrical and Computer Engineering (IJECE) 11, no. 6 (2021): 5411–19. https://doi.org/10.11591/ijece.v11i6.pp5411-5419.

Full text
Abstract
In this study we describe the process of identifying unnecessary videos using an advanced combined method of natural language processing and machine learning. The system also includes a framework that contains analytics databases, helps to measure statistical accuracy, and can detect, accept, or reject unnecessary and unethical video content. In our video detection system, we extract text data from video content in two steps: first from video to MPEG-1 audio layer 3 (MP3), and then from MP3 to WAV format. We used the text-processing part of natural language processing to analyze and prepare the data set. We use both Naive Bayes and logistic regression classification algorithms in this detection system to determine the best accuracy for our system. In our research, the video MP4 data was converted to plain text data using advanced Python library functions. This brief study discusses the identification of unauthorized, unsocial, unnecessary, unfinished, and malicious videos from spoken video recordings. By analyzing our data sets with this model, we can decide which videos should be accepted or rejected for further action.
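
The classification stage described above — text features over transcribed video speech, compared across Naive Bayes and logistic regression — can be outlined with scikit-learn as follows; the tiny labelled corpus and TF-IDF features are purely illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "educational lecture about algebra",
    "tutorial on healthy cooking",
    "violent threatening speech clip",
    "spam advertisement repeated over and over",
]
labels = [0, 0, 1, 1]  # 0 = acceptable, 1 = unnecessary/unethical (toy labels)

# Compare the two classifiers named in the abstract on the same features.
for clf in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(texts, labels)
    print(type(clf).__name__, model.predict(["lecture about cooking algebra"]))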
25

Alabsi, Thuraya. "Effects of Adding Subtitles to Video via Apps on Developing EFL Students’ Listening Comprehension." Theory and Practice in Language Studies 10, no. 10 (2020): 1191. http://dx.doi.org/10.17507/tpls.1010.02.

Full text
Abstract
It is unclear if using videos and education apps in learning adds additional value to students’ listening comprehension. This study assesses the impact of adding text to videos on English as a Foreign Language (EFL) learners’ listening comprehension. The participants were 76 prep college EFL students from Taibah University, divided into two groups. The semi-experimental measure was employed to compare the experimental group and the control group. The experimental group watched an English learning video and then wrote text subtitles relating to the video using apps, and later took a listening test to evaluate their ability in acquiring information through the videos. The control group watched videos during live lectures but did not add subtitles on the content they viewed. A paired samples t-test was used to assess the extent of listening comprehension achievement and posttest results were compared. Results revealed statistically significant increases in posttest listening comprehension scores. The result indicated superior performance and a significant positive impact through teaching/learning via video watching and adding text apps.
26

Wu, Peng, Wanshun Su, Xiangteng He, Peng Wang, and Yanning Zhang. "VarCMP: Adapting Cross-Modal Pre-Training Models for Video Anomaly Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 8 (2025): 8423–31. https://doi.org/10.1609/aaai.v39i8.32909.

Full text
Abstract
Video anomaly retrieval (VAR) aims to retrieve pertinent abnormal or normal videos from collections of untrimmed and long videos through cross-modal queries such as textual descriptions and synchronized audio. Cross-modal pre-training (CMP) models, by pre-training on large-scale cross-modal pairs, e.g., image and text, can learn the rich associations between different modalities, and this cross-modal association capability gives CMP an advantage in conventional retrieval tasks. Inspired by this, how to utilize the robust cross-modal association capabilities of CMP in VAR to search for the crucial visual components in these untrimmed, long videos becomes a critical research problem. Therefore, this paper proposes a VAR method based on CMP models, named VarCMP. First, a unified hierarchical alignment strategy is proposed to constrain the semantic and spatial consistency between video and text, as well as the semantic, temporal, and spatial consistency between video and audio. It fully leverages the efficient cross-modal association capabilities of CMP models by considering cross-modal similarities at multiple granularities, enabling VarCMP to achieve effective all-round information matching for both video-text and video-audio VAR tasks. Moreover, to further solve the problem of untrimmed and long video alignment, an anomaly-biased weighting is devised in the fine-grained alignment, which identifies key segments in untrimmed long videos using anomaly priors, giving them more attention, thereby discarding irrelevant segment information, and achieving more accurate matching with cross-modal queries. Extensive experiments demonstrate the high efficacy of VarCMP in both video-text and video-audio VAR tasks, improving on the best competitors by 5.0% and 5.3% R@1 on the text-video (UCFCrime-AR) and audio-video (XDViolence-AR) datasets, respectively.
27

Bi, Xiuli, Jian Lu, Bo Liu, et al. "CustomTTT: Motion and Appearance Customized Video Generation via Test-Time Training." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 2 (2025): 1871–79. https://doi.org/10.1609/aaai.v39i2.32182.

Full text
Abstract
Benefiting from large-scale pre-training on text-video pairs, current text-to-video (T2V) diffusion models can generate high-quality videos from a text description. In addition, given some reference images or videos, a parameter-efficient fine-tuning method such as LoRA can generate high-quality customized concepts, e.g., a specific subject or the motion from a reference video. However, combining multiple concepts trained from different references into a single network produces obvious artifacts. To this end, we propose CustomTTT, with which we can easily and jointly customize the appearance and the motion of a given video. In detail, we first analyze the prompt influence in the current video diffusion model and find that LoRAs are only needed in specific layers for appearance and motion customization. Moreover, since each LoRA is trained individually, we propose a novel test-time training technique to update parameters after combination, utilizing the trained customized models. We conduct detailed experiments to verify the effectiveness of the proposed methods. Our method outperforms several state-of-the-art works in both qualitative and quantitative evaluations.
28

Gawade, Shruti. "A Deep Learning Approach to Text-to-Video Generation." International Journal for Research in Applied Science and Engineering Technology 12, no. 6 (2024): 2489–93. http://dx.doi.org/10.22214/ijraset.2024.63513.

Full text
Abstract
Abstract: In the ever-evolving landscape of multimedia content creation, there is a growing demand for automated tools that can seamlessly transform textual descriptions into engaging and realistic videos. This research paper introduces a state-of-the-art Text to Video Generation Model, a groundbreaking approach designed to bridge the gap between textual input and visually compelling video output. Leveraging advanced deep learning techniques, the proposed model not only captures the semantic nuances of the input text but also generates dynamic and contextually relevant video sequences. The model architecture combines both natural language processing and computer vision components, allowing it to understand textual descriptions and transform them into visually cohesive scenes.. Through a carefully curated dataset and extensive training, the model learns to understand the intricate relationships between words, phrases, and visual elements, allowing for the creation of videos that faithfully represent the intended narrative. The incorporation of attention mechanisms further enhances the model's ability to focus on key details, ensuring a more nuanced and accurate translation from text to video.
29

P, Ilampiray, Naveen Raju D, Thilagavathy A, et al. "Video Transcript Summarizer." E3S Web of Conferences 399 (2023): 04015. http://dx.doi.org/10.1051/e3sconf/202339904015.

Full text
Abstract
In today's world, a large number of videos are uploaded every day, each containing information about something. The major challenge is to find the right video and understand its content: many videos contain useless material, and even when a video with the right content exists, it still has to be the content we actually need. If we do not find the right one, it wastes our effort and time to extract the correct, useful information. We propose an innovative approach that uses NLP processing for text extraction and BERT summarization for text summarization. It provides a video's main content as a text description and an abstractive summary, enabling users to discriminate between relevant and irrelevant information according to their needs. Furthermore, our experiments show that the joint model can attain good results, with informative, concise, and readable multi-line video descriptions and summaries in a human evaluation.
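
One way to realize the "BERT summarization" step mentioned here is with the third-party bert-extractive-summarizer package (an assumption on our part; the paper does not name its implementation), applied to a placeholder transcript string.

# pip install bert-extractive-summarizer
from summarizer import Summarizer

transcript = (
    "The lecturer introduces gradient descent. "
    "She explains the learning rate and its effect on convergence. "
    "Several worked examples follow, including a quadratic loss. "
    "The session ends with questions about momentum."
)

model = Summarizer()                        # BERT-based extractive summarizer
summary = model(transcript, num_sentences=2)
print(summary)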
30

Choudhary, Waffa. "Text Extraction from Videos Using the Combination of Edge-Based and Stroke Filter Techniques." Advanced Materials Research 403-408 (November 2011): 1068–74. http://dx.doi.org/10.4028/www.scientific.net/amr.403-408.1068.

Full text
Abstract
A novel method that combines edge-based and stroke-filter-based text extraction in videos is presented. Several researchers have used edge-based and filter-based text extraction in video frames; however, each of these techniques has its own advantages and disadvantages for extracting text from video frames, and combining the two yields better results than either technique alone. In this paper, the Canny edge-based approach and the stroke filter for text extraction in video frames are amalgamated. The effectiveness of the proposed method is evaluated against the individual edge-based and stroke-filter-based techniques, and the proposed method is found to significantly improve the text extraction rate in videos, with precision of 91.99% and recall of 87.18%.
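
The edge-based half of the combination can be prototyped quickly with OpenCV (Canny edges, dilation, and contour filtering to obtain candidate text boxes); the stroke-filter half and the thresholds below are simplifications assumed for illustration only.

import cv2

def candidate_text_boxes(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)                        # Canny edge map
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 3))
    merged = cv2.dilate(edges, kernel, iterations=2)         # merge character edges into blocks
    contours, _ = cv2.findContours(merged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        # Text regions tend to be wide, shallow blocks; a stroke filter would refine these.
        if w > 2 * h and w * h > 300:
            boxes.append((x, y, w, h))
    return boxes

# frame = cv2.imread("video_frame.png"); print(candidate_text_boxes(frame))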
31

Ilaslan, Muhammet Furkan, Ali Köksal, Kevin Qinghong Lin, Burak Satar, Mike Zheng Shou, and Qianli Xu. "VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 4 (2025): 3886–94. https://doi.org/10.1609/aaai.v39i4.32406.

Full text
Abstract
Large Language Model (LLM)-based agents have shown promise in procedural tasks, but the potential of multimodal instructions augmented by texts and videos to assist users remains under-explored. To address this gap, we propose the Visually Grounded Text-Video Prompting (VG-TVP) method which is a novel LLM-empowered Multimodal Procedural Planning (MPP) framework. It generates cohesive text and video procedural plans given a specified high-level objective. The main challenges are achieving textual and visual informativeness, temporal coherence, and accuracy in procedural plans. VG-TVP leverages the zero-shot reasoning capability of LLMs, the video-to-text generation ability of the video captioning models, and the text-to-video generation ability of diffusion models. VG-TVP improves the interaction between modalities by proposing a novel Fusion of Captioning (FoC) method and using Text-to-Video Bridge (T2V-B) and Video-to-Text Bridge (V2T-B). They allow LLMs to guide the generation of visually-grounded text plans and textual-grounded video plans. To address the scarcity of datasets suitable for MPP, we have curated a new dataset called Daily-Life Task Procedural Plans (Daily-PP). We conduct comprehensive experiments and benchmarks to evaluate human preferences (regarding textual and visual informativeness, temporal coherence, and plan accuracy). Our VG-TVP method outperforms unimodal baselines on the Daily-PP dataset.
32

Wu, Yihong, Mingli Lin, and Wenlong Yao. "The Influence of Titles on YouTube Trending Videos." Communications in Humanities Research 29, no. 1 (2024): 285–94. http://dx.doi.org/10.54254/2753-7064/29/20230835.

Full text
Abstract
The global video platform market has been growing remarkably in recent years. As part of a video, the title can compel people to view it. However, few scholars have yet studied the relationship between video trendiness and titles. This work studies the influence of the sentiment polarity of video titles using the Valence Aware Dictionary and Sentiment Reasoner (VADER) and investigates the feasibility of applying video title text to research on YouTube trending videos using Doc2Vec. It is found that the text of YouTube trending video titles possesses predictive value for video trendiness, but it requires advanced techniques such as deep learning for full exploitation. The sentiment polarity of titles impacts video views, and this impact varies across video categories.
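
The two tools named in this abstract can be exercised in a few lines — VADER for title sentiment polarity and gensim's Doc2Vec for title embeddings; the sample titles and model sizes are illustrative assumptions.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

titles = [
    "This AMAZING trick will change your life",
    "Quiet morning routine for studying",
    "I can't believe what happened next...",
]

# Sentiment polarity of each title.
analyzer = SentimentIntensityAnalyzer()
for t in titles:
    print(t, analyzer.polarity_scores(t)["compound"])

# Doc2Vec embeddings of the titles for downstream trendiness prediction.
docs = [TaggedDocument(t.lower().split(), [i]) for i, t in enumerate(titles)]
d2v = Doc2Vec(docs, vector_size=32, min_count=1, epochs=50)
print(d2v.dv[0][:5])  # first few dimensions of the first title's vector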
33

Frobenius, Maximiliane. "Pointing gestures in video blogs." Text & Talk 33, no. 1 (2013): 1–23. http://dx.doi.org/10.1515/text-2013-0001.

Full text
Abstract
Video blogs are a form of CMC (computer-mediated communication) that feature speakers who talk into a camera, and thereby produce a viewer-directed performance. Pointing gestures are part of the resources that the medium affords to design vlogs for the absent recipients. Based on a corpus of 40 vlogs, this research categorizes different kinds of common pointing actions in vlogs. Close analysis reveals the role multimodal factors such as gaze and body posture play along with deictic gestures and verbal reference in the production of a viewer-directed monologue. Those instances where vloggers point at referents outside the video frame, e.g., elements of the Web site that represent alternative modes of communication, such as written comments, receive particular attention in the present study, as they require mutual knowledge about the shared virtual context the vlog is situated in.
34

Puspita, Widya, Teti Sobari, and Wikanengsih Wikanengsih. "Improving Students Writing Skills Explanation Text using Animated Video." JLER (Journal of Language Education Research) 6, no. 1 (2023): 35–60. http://dx.doi.org/10.22460/jler.v6i1.10198.

Full text
Abstract
This study focuses on the influence of an animated video on students' ability to write explanation texts. It uses a descriptive qualitative research method. The purpose of this study is to find out whether the animated video used can help students improve their explanation text writing skills and to see the differences in students' abilities before and after using animated videos in Indonesian language learning. The subjects in this study were 20 students of class VII A at MTs Pasundan Cimahi, and the objects were the results of the students' pre-test and post-test in writing explanation texts before and after using animated videos. The results of the analysis show that, based on the data in output table 1, there is no decrease from the pre-test score to the post-test score. Between the pre-test and post-test results of the explanation text writing test there were 20 positive data points (N), meaning that 20 students improved their results on the explanation text writing test. Furthermore, output table 2 shows that there is a difference between the pre-test and post-test results of the explanation text writing test, meaning that the use of animated videos influences the learning outcomes of writing explanation texts for class VII A students. Animated videos are a learning medium that can be used to improve explanation text writing skills.
35

Rushikesh, Chandrakant Konapure, and L.M.R.J. Lobo Dr. "Text Data Analysis for Advertisement Recommendation System Using Multi-label Classification of Machine Learning." Journal of Data Mining and Management 5, no. 1 (2020): 1–6. https://doi.org/10.5281/zenodo.3600112.

Full text
Abstract
Everyone today can very easily access streaming content on their mobile phones and laptops, and video has become a very important and popular type of content on the internet. Nowadays, people create their own content and upload it to streaming platforms, so video datasets have become massive compared to text, audio, and image datasets. Providing advertisements on a video that relate to the topic of that video will therefore help to boost business. In the proposed system, the title and description of a video are taken as input to classify the video using a natural language processing text classification method. The aim of natural language processing here is to solve the text classification problem by analyzing the contents of the text data and deciding its category. The proposed system extracts features from videos such as the title, description, and hashtags; based on these extracted features, we intend to produce classification labels with multi-label classification models. By analyzing the produced labels against advertisement datasets, we intend to provide advertisements on the video that are related to the topic of the video.
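
A compact scikit-learn sketch of the proposed multi-label setup — TF-IDF over concatenated title, description, and hashtags, with a one-vs-rest classifier producing multiple topic labels — is given below; the toy data and label set are assumptions for illustration only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

videos = [
    "home workout routine #fitness #health no equipment needed",
    "easy pasta recipe dinner ideas #cooking",
    "marathon training tips and meal prep #fitness #cooking",
]
labels = [["fitness"], ["food"], ["fitness", "food"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression(max_iter=1000)))
model.fit(videos, Y)

pred = model.predict(["quick protein breakfast before the gym #health"])
print(mlb.inverse_transform(pred))  # predicted labels drive which advertisements to attach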
36

S, Ramacharan, Akshara Reddy P., Rukmini Reddy R, and Ch.Chathurya. "Script Abstract from Video Clip." Journal of Advancement in Software Engineering and Testing 5, no. 3 (2022): 1–4. https://doi.org/10.5281/zenodo.7321898.

Full text
Abstract
In a world where technology is developing at a tremendously fast pace, the educational field has witnessed various new technologies that help in better learning, teaching, and understanding. Video tutorials play a major role in helping students and learners understand new concepts at a much faster rate and at their own comfort level, but watching long tutorial or lecture videos can be time-consuming and tiring; a solution for this can be found in a video-to-text summarization application. With the help of advanced NLP and machine learning we can summarize a video tutorial, and this summarized tutorial will contain all the important points of the topic. At the core of this system we have used NLTK, a Python library for natural language processing.
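
The NLTK-based core the authors mention typically amounts to frequency-based extractive summarization of the transcript; a minimal sketch under that assumption follows (the transcript string is a placeholder).

import heapq
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def summarize(transcript, n_sentences=2):
    stops = set(stopwords.words("english"))
    words = [w.lower() for w in word_tokenize(transcript) if w.isalpha() and w.lower() not in stops]
    freq = nltk.FreqDist(words)
    # Score each sentence by the frequencies of its content words.
    scores = {
        sent: sum(freq[w.lower()] for w in word_tokenize(sent) if w.lower() in freq)
        for sent in sent_tokenize(transcript)
    }
    return " ".join(heapq.nlargest(n_sentences, scores, key=scores.get))

text = ("The tutorial explains loops in Python. Loops repeat a block of code. "
        "The instructor shows for loops and while loops. Finally, loops are combined with lists.")
print(summarize(text))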
37

Sanjeeva, Polepaka, Vanipenta Balasri Nitin Reddy, Jagirdar Indraj Goud, Aavula Guru Prasad, and Ashish Pathani. "TEXT2AV – Automated Text to Audio and Video Conversion." E3S Web of Conferences 430 (2023): 01027. http://dx.doi.org/10.1051/e3sconf/202343001027.

Full text
Abstract
The paper aims to develop a machine learning-based system that can automatically convert text to audio and text to video as per the user's request. Reading a large text is difficult for anyone, but this TTS model makes it easy by converting the text into audio, with the output spoken by a lip-synced avatar to make the interaction look more attractive and human-like in many languages. The TTS model is built on Waveform Recurrent Neural Networks (WaveRNN), a type of auto-regressive model that predicts future data from the present. The system identifies the keywords in the input texts and uses diffusion models to generate high-quality video content, with a GAN (Generative Adversarial Network) used to generate the videos. Frame interpolation is used to insert intermediate frames between two adjacent frames to generate slow-motion video. WebVid-20M, ImageNet, and Hugging Face are the datasets used for text-to-video, and the LibriTTS corpus and lip-sync data are used for text-to-audio. The system provides a user-friendly, automated platform that takes text as input and quickly and efficiently produces either high-quality audio or high-resolution video.
38

Creamer, MeLisa, Heather R. Bowles, Belinda von Hofe, Kelley Pettee Gabriel, Harold W. Kohl, and Adrian Bauman. "Utility of Computer-Assisted Approaches for Population Surveillance of Physical Activity." Journal of Physical Activity and Health 11, no. 6 (2014): 1111–19. http://dx.doi.org/10.1123/jpah.2012-0266.

Full text
Abstract
Background:Computer-assisted techniques may be a useful way to enhance physical activity surveillance and increase accuracy of reported behaviors.Purpose:Evaluate the reliability and validity of a physical activity (PA) self-report instrument administered by telephone and internet.Methods:The telephone-administered Active Australia Survey was adapted into 2 forms for internet self-administration: survey questions only (internet-text) and with videos demonstrating intensity (internet-video). Data were collected from 158 adults (20–69 years, 61% female) assigned to telephone (telephone-interview) (n = 56), internet-text (n = 51), or internet-video (n = 51). Participants wore an accelerometer and completed a logbook for 7 days. Test-retest reliability was assessed using intraclass correlation coefficients (ICC). Convergent validity was assessed using Spearman correlations.Results:Strong test-retest reliability was observed for PA variables in the internet-text (ICC = 0.69 to 0.88), internet-video (ICC = 0.66 to 0.79), and telephone-interview (ICC = 0.69 to 0.92) groups (P-values &lt; 0.001). For total PA, correlations (ρ) between the survey and Actigraph+logbook were ρ = 0.47 for the internet-text group, ρ = 0.57 for the internet-video group, and ρ = 0.65 for the telephone-interview group. For vigorous-intensity activity, the correlations between the survey and Actigraph+logbook were 0.52 for internet-text, 0.57 for internet-video, and 0.65 for telephone-interview (P &lt; .05).Conclusions:Internet-video of the survey had similar test-retest reliability and convergent validity when compared with the telephone-interview, and should continue to be developed.
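
The convergent-validity figures reported here are Spearman correlations between survey minutes and accelerometer-plus-logbook minutes; computing such a coefficient is a one-liner with scipy (the arrays below are invented examples, not study data).

import numpy as np
from scipy.stats import spearmanr

survey_minutes = np.array([150, 300, 90, 420, 60, 240, 180, 330])    # self-reported activity
device_minutes = np.array([120, 280, 110, 390, 80, 200, 210, 300])   # accelerometer + logbook

rho, p_value = spearmanr(survey_minutes, device_minutes)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")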
APA, Harvard, Vancouver, ISO, etc. styles
39

Du, Wanru, Xiaochuan Jing, Quan Zhu, Xiaoyin Wang, and Xuan Liu. "A cross-modal conditional mechanism based on attention for text-video retrieval." Mathematical Biosciences and Engineering 20, no. 11 (2023): 20073–92. http://dx.doi.org/10.3934/mbe.2023889.

Full text
Abstract
Current research in cross-modal retrieval has primarily focused on aligning the global features of videos and sentences. However, video conveys a much more comprehensive range of information than text. Thus, text-video matching should focus on the similarities between frames containing critical information and text semantics. This paper proposes a cross-modal conditional feature aggregation model based on the attention mechanism. It includes two innovative modules: (1) A cross-modal attentional feature aggregation module, which uses the semantic text features as conditional projections to extract the most relevant features from the video frames. It aggregates these frame features to form global video features. (2) A global-local similarity calculation module calculates similarities at two granularities (video-sentence and frame-word features) to consider both the topic and detail features in the text-video matching process. Our experiments on the four widely used MSR-VTT, LSMDC, MSVD and DiDeMo datasets demonstrate the effectiveness of our model and its superiority over state-of-the-art methods. The results show that the cross-modal attention aggregation approach can effectively capture the primary semantic information of the video. At the same time, the global-local similarity calculation model can accurately match text and video based on topic and detail features.
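A minimal PyTorch sketch of the first module's idea: the sentence feature acts as a query that weights per-frame features before pooling into a global video feature. Dimensions and module names are assumptions for illustration, not the authors' code.

```python
# Sketch: text-conditioned attention over frame features, then pooling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalFrameAggregator(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # text -> query
        self.k_proj = nn.Linear(dim, dim)   # frames -> keys
        self.scale = dim ** -0.5

    def forward(self, frame_feats, text_feat):
        # frame_feats: (B, T, D), text_feat: (B, D)
        q = self.q_proj(text_feat).unsqueeze(1)          # (B, 1, D)
        k = self.k_proj(frame_feats)                     # (B, T, D)
        attn = (q @ k.transpose(1, 2)) * self.scale      # (B, 1, T)
        weights = attn.softmax(dim=-1)                   # per-frame weights
        video_feat = (weights @ frame_feats).squeeze(1)  # (B, D) global feature
        return video_feat, weights

agg = ConditionalFrameAggregator(dim=512)
frames = torch.randn(2, 12, 512)   # 2 videos, 12 frame features each
text = torch.randn(2, 512)         # 2 sentence embeddings
video_global, w = agg(frames, text)
sim = F.cosine_similarity(video_global, text)  # video-sentence similarity
```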
APA, Harvard, Vancouver, ISO, etc. styles
40

Hua, Hang, Yunlong Tang, Chenliang Xu, and Jiebo Luo. "V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 4 (2025): 3599–607. https://doi.org/10.1609/aaai.v39i4.32374.

Full text
Abstract
Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Despite the existence of various video summarization datasets, a notable limitation is the limited number of source videos, which hampers the effective training of advanced large vision-language models (VLMs). Additionally, most existing datasets are created for video-to-video summarization, overlooking the contemporary need for multimodal video content summarization. Recent efforts have been made to expand from unimodal to multimodal video summarization, categorizing the task into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and a combination of video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indexes, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, specifically V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks into one large language model's (LLM) text decoder and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks.
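The pairing of textual summaries with frame indexes can be illustrated with a small sketch: pull the referenced indexes out of the text and compute the summarization ratio. The "[frame N]" reference format below is an assumption for illustration, not the dataset's actual markup.

```python
# Sketch: extract frame indexes cited in a textual summary and compute the
# summarization ratio (selected frames / total frames).
import re

def extract_keyframe_indexes(text_summary: str) -> list[int]:
    return sorted({int(m) for m in re.findall(r"\[frame (\d+)\]", text_summary)})

def summarization_ratio(selected_frames: list[int], total_frames: int) -> float:
    return len(selected_frames) / total_frames

summary = ("A chef dices onions [frame 12] and sears them in a pan [frame 87], "
           "then plates the dish [frame 240].")
keyframes = extract_keyframe_indexes(summary)        # [12, 87, 240]
print(keyframes, f"{summarization_ratio(keyframes, 1500):.2%}")
```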
APA, Harvard, Vancouver, ISO, etc. styles
41

Adams, Aubrie, and Weimin Toh. "Student Emotion in Mediated Learning: Comparing a Text, Video, and Video Game." Electronic Journal of e-Learning 19, no. 6 (2021): 575–87. http://dx.doi.org/10.34190/ejel.19.6.2546.

Full text
Abstract
Although serious games are generally praised by scholars for their potential to enhance teaching and e-learning practices, more empirical evidence is needed to support these accolades. Existing research in this area tends to show that gamified teaching experiences produce significant improvements in students' cognitive, motivational, and behavioural learning outcomes, but these effects are usually small. In addition, less research examines how different types of mediated learning tools compare to one another in influencing student outcomes associated with learning and motivation. As such, a question can be asked in this area: how do video games compare to other types of mediated tools, such as videos or texts, in influencing student emotion outcomes? This study used an experimental design (N = 153) to examine the influence of different types of mass media modalities (text, video, and video game) on college students' emotions in a mediated learning context. Research examining the impact of video games on instruction has begun to grow, but few studies appropriately acknowledge the nuanced differences between media tools. Using a media-attributes approach as a lens, this study first compared these mediated tools along the attributional dimensions of textuality, channel, interactivity, and control. This study next tested the impact of each media type on thirteen emotion outcomes. Results showed no between-group differences for six emotion outcomes (fear, guilt, sadness, shyness, serenity, and general negative emotions). However, six of the tested emotion outcomes did differ between groups, with students experiencing higher levels of emotional arousal in both the text and video game conditions (in comparison to the video condition) for the emotions of joviality, self-assurance, attentiveness, surprise, hostility, and general positive emotions. Lastly, students also felt less fatigue in the video game condition. Overall, implications for e-learning suggest that when a message's content is held constant, both video games and texts may be better at inducing emotional intensity and reducing fatigue than videos alone, which could enhance motivation to learn when teaching is mediated by technology.
APA, Harvard, Vancouver, ISO, etc. styles
42

Chen, Yizhen, Jie Wang, Lijian Lin, Zhongang Qi, Jin Ma, and Ying Shan. "Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 1 (2023): 396–404. http://dx.doi.org/10.1609/aaai.v37i1.25113.

Full text
Abstract
Vision-language alignment learning for video-text retrieval has attracted a lot of attention in recent years. Most of the existing methods either transfer the knowledge of an image-text pretraining model to the video-text retrieval task without fully exploring the multi-modal information of videos, or simply fuse multi-modal features in a brute-force manner without explicit guidance. In this paper, we integrate multi-modal information in an explicit manner by tagging, and use the tags as anchors for better video-text alignment. Various pretrained experts are utilized for extracting the information of multiple modalities, including object, person, motion, audio, etc. To take full advantage of this information, we propose the TABLE (TAgging Before aLignmEnt) network, which consists of a visual encoder, a tag encoder, a text encoder, and a tag-guiding cross-modal encoder for jointly encoding multi-frame visual features and multi-modal tag information. Furthermore, to strengthen the interaction between video and text, we build a joint cross-modal encoder with the triplet input of [vision, tag, text] and perform two additional supervised tasks, Video Text Matching (VTM) and Masked Language Modeling (MLM). Extensive experimental results demonstrate that the TABLE model is capable of achieving State-Of-The-Art (SOTA) performance on various video-text retrieval benchmarks, including MSR-VTT, MSVD, LSMDC and DiDeMo.
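A minimal sketch of the triplet-input idea, assuming the three token sequences are simply concatenated with modality type embeddings and fed to a shared Transformer encoder. This is an illustration of the concept, not the authors' architecture; sizes and layer counts are placeholders.

```python
# Sketch: joint cross-modal encoding over a [vision, tag, text] triplet.
import torch
import torch.nn as nn

class TripletCrossModalEncoder(nn.Module):
    def __init__(self, dim=512, n_layers=2, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.type_emb = nn.Embedding(3, dim)  # distinguishes the three modalities

    def forward(self, vision_tokens, tag_tokens, text_tokens):
        # each input: (B, L_i, dim)
        parts = [vision_tokens, tag_tokens, text_tokens]
        types = [torch.full(p.shape[:2], i, dtype=torch.long, device=p.device)
                 for i, p in enumerate(parts)]
        x = torch.cat(parts, dim=1) + self.type_emb(torch.cat(types, dim=1))
        return self.encoder(x)  # fused sequence for VTM / MLM heads

enc = TripletCrossModalEncoder()
fused = enc(torch.randn(2, 12, 512),   # frame features
            torch.randn(2, 5, 512),    # tag embeddings
            torch.randn(2, 20, 512))   # word embeddings
```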
APA, Harvard, Vancouver, ISO, etc. styles
43

Huang, Hong-Bo, Yao-Lin Zheng, and Zhi-Ying Hu. "Video Abnormal Action Recognition Based on Multimodal Heterogeneous Transfer Learning." Advances in Multimedia 2024 (January 19, 2024): 1–12. http://dx.doi.org/10.1155/2024/4187991.

Full text
Abstract
Human abnormal action recognition is crucial for video understanding and intelligent surveillance. However, the scarcity of labeled data for abnormal human actions often hinders the development of high-performance models. Inspired by the multimodal approach, this paper proposes a novel approach that leverages text descriptions associated with abnormal human action videos. Our method exploits the correlation between the text domain and the video domain in the semantic feature space and introduces a multimodal heterogeneous transfer learning framework from the text domain to the video domain. The text of the videos is used for feature encoding and knowledge extraction, and knowledge transfer and sharing are realized in the feature space, which is used to assist in the training of the abnormal action recognition model. The proposed method reduces the reliance on labeled video data, improves the performance of the abnormal human action recognition algorithm, and outperforms the popular video-based models, particularly in scenarios with sparse data. Moreover, our framework contributes to the advancement of automatic video analysis and abnormal action recognition, providing insights for the application of multimodal methods in a broader context.
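The transfer idea can be sketched as training a video encoder both to classify actions and to stay close, in a shared feature space, to frozen embeddings of the clips' text descriptions. The encoders, dimensions, and loss weighting below are placeholders, not the paper's exact formulation.

```python
# Sketch: text-to-video knowledge transfer via feature-space alignment plus
# a standard classification loss. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

video_encoder = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(),
                              nn.Linear(512, 256))   # clip feature -> shared space
classifier = nn.Linear(256, 5)                       # 5 abnormal-action classes

video_feats = torch.randn(8, 2048)    # pre-extracted clip features (batch of 8)
text_embeds = torch.randn(8, 256)     # frozen text-description embeddings
labels = torch.randint(0, 5, (8,))

z = video_encoder(video_feats)
cls_loss = F.cross_entropy(classifier(z), labels)
align_loss = 1 - F.cosine_similarity(z, text_embeds).mean()  # pull modalities together
loss = cls_loss + 0.5 * align_loss    # 0.5 is an arbitrary transfer weight
loss.backward()
```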
APA, Harvard, Vancouver, ISO, etc. styles
44

Mochurad, Lesia. "A NEW APPROACH FOR TEXT RECOGNITION ON A VIDEO CARD." Computer systems and information technologies, no. 3 (September 28, 2022): 22–30. http://dx.doi.org/10.31891/csit-2022-3-3.

Full text
Abstract
An important task is to develop a computer system that can automatically read text content from images or videos with a complex background. Because of the large number of calculations involved, such systems are quite difficult to apply in real time. Therefore, the use of parallel and distributed computing in the development of real-time or near-real-time systems is relevant, especially in areas such as automated video recording of traffic violations, text recognition, machine vision, fingerprint recognition, and speech. The paper proposes a new approach to text recognition on a video card. A parallel algorithm for processing a group of images and a video sequence has been developed and tested. Parallelization on the GPU cores is provided by the OpenCL framework and CUDA technology. Without loss of generality, the approach was evaluated on images containing vehicles, from which the license-plate text was obtained. The developed system was tested for the processing speed of a group of images and videos, achieving an average processing speed of 207 frames per second. As for the execution time of the parallel algorithm, image preprocessing of 50 images and a 63-frame video took 0.4 seconds, which is sufficient for real-time or near-real-time systems. The maximum speedup was up to 8 times for image processing and up to 12 times for the video sequence. Speedup generally increases with the dimensions of the processed image, which confirms the relevance of parallel computation for this problem.
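The paper parallelizes preprocessing on the GPU with OpenCL and CUDA; as a language-consistent stand-in, the sketch below shows the same structure on the CPU: preprocess a group of frames in parallel (grayscale plus Otsu binarization, a common step before license-plate OCR) and measure throughput. File paths are hypothetical and the OCR stage itself is omitted.

```python
# Sketch: parallel preprocessing of a group of frames, with a frames-per-second
# measurement. Not the authors' GPU code.
import time
from multiprocessing import Pool

import cv2

def preprocess(path: str):
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # binarize so plate characters stand out for a later OCR stage
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

if __name__ == "__main__":
    frame_paths = [f"frames/{i:04d}.png" for i in range(50)]  # hypothetical files
    start = time.perf_counter()
    with Pool(processes=8) as pool:
        binarized = pool.map(preprocess, frame_paths)
    elapsed = time.perf_counter() - start
    print(f"{len(frame_paths) / elapsed:.1f} frames per second")
```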
APA, Harvard, Vancouver, ISO, etc. styles
45

Lokkondra, Chaitra Yuvaraj, Dinesh Ramegowda, Gopalakrishna Madigondanahalli Thimmaiah, Ajay Prakash Bassappa Vijaya, and Manjula Hebbaka Shivananjappa. "ETDR: An Exploratory View of Text Detection and Recognition in Images and Videos." Revue d'Intelligence Artificielle 35, no. 5 (2021): 383–93. http://dx.doi.org/10.18280/ria.350504.

Full text
Abstract
Images and videos with text content are a direct source of information. Today, there is a high need for image and video data that can be intelligently analyzed. A growing number of researchers are focusing on text identification, making it a hot issue in machine vision research. As a result, several real-time applications such as text detection, localization, and tracking have become more prevalent in text analysis systems. This survey examines how text information can be extracted. The study first presents reliable datasets for text identification in images and videos. The second part of the article details the numerous text formats found in images and video. Third, it describes the process flow for extracting information from the text and the existing machine learning and deep learning techniques used to train models. Fourth, it explains the assessment measures used to validate the models. Finally, it brings together the applications and difficulties of text extraction across a wide range of fields. The difficulties focus on the most frequent challenges faced in the real world, such as capture techniques, lighting, and environmental conditions. Images and videos have evolved into valuable sources of data, and the text inside them provides a massive quantity of facts and statistics; however, such data is not easy to access. This exploratory view provides easier and more accurate mathematical modeling and evaluation techniques to retrieve the text in images and video into an accessible form.
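A minimal sketch of the detect-then-recognize flow the survey describes, using pytesseract's word-level output (bounding boxes plus confidences). It assumes the Tesseract binary is installed; the confidence threshold and file names are illustrative, and robust scene-text systems would use dedicated detectors instead.

```python
# Sketch: word-level text detection and recognition on a single image.
import cv2
import pytesseract
from pytesseract import Output

image = cv2.imread("scene.jpg")  # hypothetical input image
data = pytesseract.image_to_data(image, output_type=Output.DICT)

for i, word in enumerate(data["text"]):
    conf = int(float(data["conf"][i]))
    if word.strip() and conf > 60:            # keep confident detections only
        x, y, w, h = (data["left"][i], data["top"][i],
                      data["width"][i], data["height"][i])
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
        print(word, conf, (x, y, w, h))

cv2.imwrite("scene_with_text_boxes.jpg", image)
```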
APA, Harvard, Vancouver, ISO, etc. styles
46

Aljorani, Reem, and Boshra Zopon. "Encapsulation Video Classification and Retrieval Based on Arabic Text." Diyala Journal For Pure Science 17, no. 4 (2021): 20–36. http://dx.doi.org/10.24237/djps.17.04.558b.

Full text
Abstract
Arabic video classification is not a well-studied field, and there is little research in this area, especially for education. A system was proposed to address this problem and make educational Arabic videos more accessible to students. A survey of several papers was carried out in order to design and implement a system that classifies Arabic-language videos by extracting their audio with Azure Cognitive Services, which produces text transcripts. Several preprocessing operations are then applied to the text transcripts. A stochastic gradient descent (SGD) algorithm is used to classify the transcripts and assign a suitable label to each video. In addition, a search technique was implemented to enable students to retrieve the videos they need. The results showed that the SGD algorithm achieved the highest classification accuracy, 89.3%, compared with other learning models. The paper also includes a survey of the most relevant and recent work related to this study.
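The transcript-classification step maps naturally onto scikit-learn: TF-IDF features over Arabic transcripts fed to an SGDClassifier. The transcripts and labels below are tiny placeholders, and obtaining the transcripts from Azure Cognitive Services is out of scope here.

```python
# Sketch: TF-IDF + SGD classification of Arabic transcripts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

transcripts = [
    "شرح قواعد اللغة العربية للمرحلة الثانوية",        # grammar lesson
    "درس في الجبر وحل المعادلات من الدرجة الثانية",     # algebra lesson
    "تجربة كيميائية حول التفاعلات الحمضية",             # chemistry experiment
    "مراجعة قواعد النحو والإعراب",                      # grammar review
]
labels = ["arabic", "math", "science", "arabic"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("sgd", SGDClassifier(max_iter=1000, random_state=0)),
])
model.fit(transcripts, labels)
print(model.predict(["شرح مبسط لحل المعادلات"]))   # expected label: "math"
```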
APA, Harvard, Vancouver, ISO, etc. styles
47

Chen, Yupeng, Penglin Chen, Xiaoyu Zhang, Yixian Huang, and Qian Xie. "EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 15 (2025): 15975–83. https://doi.org/10.1609/aaai.v39i15.33754.

Full text
Abstract
The rapid development of diffusion models has significantly advanced AI-generated content (AIGC), particularly in Text-to-Image (T2I) and Text-to-Video (T2V) generation. Text-based video editing, leveraging these generative capabilities, has emerged as a promising field, enabling precise modifications to videos based on text prompts. Despite the proliferation of innovative video editing models, there is a conspicuous lack of comprehensive evaluation benchmarks that holistically assess these models’ performance across various dimensions. Existing evaluations are limited and inconsistent, typically summarizing overall performance with a single score, which obscures models’ effectiveness on individual editing tasks. To address this gap, we propose EditBoard, the first comprehensive evaluation benchmark for text-based video editing models. EditBoard encompasses nine automatic metrics across four dimensions, evaluating models on four task categories and introducing three new metrics to assess fidelity. This task-oriented benchmark facilitates objective evaluation by detailing model performance and providing insights into each model’s strengths and weaknesses. By open-sourcing EditBoard, we aim to standardize evaluation and advance the development of robust video editing models.
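The benchmark's central point is reporting per-task, per-metric scores rather than a single number. A tiny sketch of that kind of task-oriented aggregation follows; the task categories and metric names are placeholders, not EditBoard's actual nine metrics.

```python
# Sketch: aggregate per-example metric values by editing-task category instead
# of collapsing everything into one overall score.
from collections import defaultdict
from statistics import mean

results = [  # (task_category, metric_name, value) for individual edited videos
    ("style_transfer", "temporal_consistency", 0.81),
    ("style_transfer", "prompt_fidelity", 0.74),
    ("object_replacement", "temporal_consistency", 0.68),
    ("object_replacement", "prompt_fidelity", 0.79),
    ("object_replacement", "prompt_fidelity", 0.71),
]

table = defaultdict(lambda: defaultdict(list))
for task, metric, value in results:
    table[task][metric].append(value)

for task, metrics in table.items():
    summary = {m: round(mean(v), 3) for m, v in metrics.items()}
    print(task, summary)
```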
APA, Harvard, Vancouver, ISO, etc. styles
48

Krishnamoorthy, Niveda, Girish Malkarnenkar, Raymond Mooney, Kate Saenko, and Sergio Guadarrama. "Generating Natural-Language Video Descriptions Using Text-Mined Knowledge." Proceedings of the AAAI Conference on Artificial Intelligence 27, no. 1 (2013): 541–47. http://dx.doi.org/10.1609/aaai.v27i1.8679.

Full text
Abstract
We present a holistic data-driven technique that generates natural-language descriptions for videos. We combine the output of state-of-the-art object and activity detectors with "real-world" knowledge to select the most probable subject-verb-object triplet for describing a video. We show that this knowledge, automatically mined from web-scale text corpora, enhances the triplet selection algorithm by providing it contextual information and leads to a four-fold increase in activity identification. Unlike previous methods, our approach can annotate arbitrary videos without requiring the expensive collection and annotation of a similar training video corpus. We evaluate our technique against a baseline that does not use text-mined knowledge and show that humans prefer our descriptions 61% of the time.
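A small sketch of the triplet-selection idea: combine visual detector confidences with corpus-mined plausibility of the subject-verb-object combination and keep the highest-scoring triplet. All scores below are made-up placeholders, and the mixing weight is arbitrary.

```python
# Sketch: score candidate (subject, verb, object) triplets with a blend of
# visual detector confidence and text-mined plausibility.
from itertools import product

subjects = {"person": 0.9, "dog": 0.4}          # object-detector confidences
verbs = {"ride": 0.6, "pet": 0.5}               # activity-detector confidences
objects = {"bicycle": 0.8, "horse": 0.3}

# plausibility mined from a text corpus, e.g. normalized co-occurrence counts
corpus_plausibility = {
    ("person", "ride", "bicycle"): 0.7,
    ("person", "ride", "horse"): 0.6,
    ("person", "pet", "horse"): 0.3,
    ("dog", "ride", "bicycle"): 0.01,
}

def score(s, v, o, alpha=0.5):
    visual = subjects[s] * verbs[v] * objects[o]
    language = corpus_plausibility.get((s, v, o), 0.05)  # small smoothing prior
    return alpha * visual + (1 - alpha) * language

best = max(product(subjects, verbs, objects), key=lambda t: score(*t))
print(best, round(score(*best), 3))   # e.g. ('person', 'ride', 'bicycle')
```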
APA, Harvard, Vancouver, ISO, etc. styles
49

Chen, Datong, Jean-Marc Odobez, and Jean-Philippe Thiran. "Monte Carlo Video Text Segmentation." International Journal of Pattern Recognition and Artificial Intelligence 19, no. 05 (2005): 647–61. http://dx.doi.org/10.1142/s0218001405004216.

Full text
Abstract
This paper presents a probabilistic algorithm for segmenting and recognizing text embedded in video sequences based on adaptive thresholding using a Bayes filtering method. The algorithm approximates the posterior distribution of segmentation thresholds of video text by a set of weighted samples. The set of samples is initialized by applying a classical segmentation algorithm on the first video frame and further refined by random sampling under a temporal Bayesian framework. This framework allows us to evaluate a text image segmentor on the basis of recognition result instead of visual segmentation result, which is directly relevant to our character recognition task. Results on a database of 6944 images demonstrate the validity of the algorithm.
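The weighted-sample idea can be sketched as a simple particle filter over binarization thresholds: weight each candidate threshold by how well the resulting segmentation supports recognition, resample, and perturb for the next frame. The recognition score below is a stand-in heuristic (it just prefers a plausible foreground fraction), not a real OCR engine, and the toy frames are random gray levels.

```python
# Sketch: particle filter over segmentation thresholds across video frames.
import numpy as np

rng = np.random.default_rng(0)

def recognition_score(frame, threshold):
    """Placeholder for 'OCR confidence on the thresholded image'."""
    binary = (frame > threshold).astype(np.float64)
    fg = binary.mean()
    return np.exp(-((fg - 0.15) ** 2) / 0.01)   # prefer ~15% text pixels

def update_particles(frame, thresholds, weights, jitter=5.0):
    scores = np.array([recognition_score(frame, t) for t in thresholds])
    weights = weights * scores
    weights = weights / weights.sum()
    # resample proportionally to weight, then perturb for the next frame
    idx = rng.choice(len(thresholds), size=len(thresholds), p=weights)
    new_thresholds = thresholds[idx] + rng.normal(0, jitter, size=len(thresholds))
    uniform = np.full(len(thresholds), 1.0 / len(thresholds))
    return np.clip(new_thresholds, 0, 255), uniform

# toy "video": three frames of random gray levels (stand-in for real frames)
frames = [rng.integers(0, 256, size=(64, 256)).astype(np.float64) for _ in range(3)]
thresholds = rng.uniform(0, 255, size=50)            # initial threshold samples
weights = np.full(50, 1.0 / 50)
for frame in frames:
    thresholds, weights = update_particles(frame, thresholds, weights)
print("estimated threshold:", round(float(thresholds.mean()), 1))
```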
APA, Harvard, Vancouver, ISO, etc. styles
50

Welsh, Stephen, and Damian Conway. "Encoding Video Narration as Text." Real-Time Imaging 6, no. 5 (2000): 391–405. http://dx.doi.org/10.1006/rtim.1999.0189.

Full text
APA, Harvard, Vancouver, ISO, etc. styles