Journal articles on the topic 'Video Vision Transformer'

Author: Grafiati

Published: 12 April 2025

Consult the top 50 journal articles for your research on the topic 'Video Vision Transformer.'

1

Naikwadi, Sanket Shashikant. "Video Summarization Using Vision and Language Transformer Models." International Journal of Research Publication and Reviews 6, no. 6 (January 2025): 5217–21. https://doi.org/10.55248/gengpi.6.0125.0654.

2

Moutik, Oumaima, Hiba Sekkat, Smail Tigani, Abdellah Chehri, Rachid Saadane, Taha Ait Tchakoucht, and Anand Paul. "Convolutional Neural Networks or Vision Transformers: Who Will Win the Race for Action Recognitions in Visual Data?" Sensors 23, no. 2 (January 9, 2023): 734. http://dx.doi.org/10.3390/s23020734.

Abstract:

Understanding actions in videos remains a significant challenge in computer vision, which has been the subject of several pieces of research in the last decades. Convolutional neural networks (CNN) are a significant component of this topic and play a crucial role in the renown of Deep Learning. Inspired by the human vision system, CNN has been applied to visual data exploitation and has solved various challenges in various computer vision tasks and video/image analysis, including action recognition (AR). However, not long ago, along with the achievement of the transformer in natural language processing (NLP), it began to set new trends in vision tasks, which has created a discussion around whether the Vision Transformer models (ViT) will replace CNN in action recognition in video clips. This paper examines this trending topic in detail, studying CNN and Transformer for action recognition separately and comparing their accuracy-complexity trade-off. Finally, based on the outcome of the performance analysis, the question of whether CNN or Vision Transformers will win the race is discussed.

3

Yuan, Hongchun, Zhenyu Cai, Hui Zhou, Yue Wang, and Xiangzhi Chen. "TransAnomaly: Video Anomaly Detection Using Video Vision Transformer." IEEE Access 9 (2021): 123977–86. http://dx.doi.org/10.1109/access.2021.3109102.

4

Sarraf, Saman, and Milton Kabia. "Optimal Topology of Vision Transformer for Real-Time Video Action Recognition in an End-To-End Cloud Solution." Machine Learning and Knowledge Extraction 5, no. 4 (September 29, 2023): 1320–39. http://dx.doi.org/10.3390/make5040067.

Abstract:

This study introduces an optimal topology of vision transformers for real-time video action recognition in a cloud-based solution. Although model performance is a key criterion for real-time video analysis use cases, inference latency plays a more crucial role in adopting such technology in real-world scenarios. Our objective is to reduce the inference latency of the solution while admissibly maintaining the vision transformer’s performance. Thus, we employed the optimal cloud components as the foundation of our machine learning pipeline and optimized the topology of vision transformers. We utilized UCF101, including more than one million action recognition video clips. The modeling pipeline consists of a preprocessing module to extract frames from video clips, training two-dimensional (2D) vision transformer models, and deep learning baselines. The pipeline also includes a postprocessing step to aggregate the frame-level predictions to generate the video-level predictions at inference. The results demonstrate that our optimal vision transformer model with an input dimension of 56 × 56 × 3 with eight attention heads produces an F1 score of 91.497% for the testing set. The optimized vision transformer reduces the inference latency by 40.70%, measured through a batch-processing approach, with a 55.63% faster training time than the baseline. Lastly, we developed an enhanced skip-frame approach to improve the inference latency by finding an optimal ratio of frames for prediction at inference, where we could further reduce the inference latency by 57.15%. This study reveals that the vision transformer model is highly optimizable for inference latency while maintaining the model performance.
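
The postprocessing and skip-frame steps described in this abstract can be illustrated with a minimal sketch. The abstract does not state the aggregation rule, so averaging per-frame class probabilities is an assumption here (majority voting would be another option), and the names `video_level_prediction` and `frame_step` are hypothetical.

```python
from typing import Sequence


def video_level_prediction(frame_probs: Sequence[Sequence[float]],
                           frame_step: int = 1) -> int:
    """Return the class index with the highest mean probability.

    frame_probs: one probability vector per frame (assumed model output).
    frame_step:  use only every `frame_step`-th frame (skip-frame inference).
    """
    if frame_step < 1:
        raise ValueError("frame_step must be >= 1")
    kept = frame_probs[::frame_step]  # frames that are actually scored
    if not kept:
        raise ValueError("no frames to aggregate")
    num_classes = len(kept[0])
    # Mean probability per class over the kept frames.
    mean = [sum(p[c] for p in kept) / len(kept) for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: mean[c])


frames = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]
print(video_level_prediction(frames))                # all three frames
print(video_level_prediction(frames, frame_step=2))  # frames 0 and 2 only
```

A larger `frame_step` scores fewer frames per video, which is the kind of latency versus accuracy trade-off the skip-frame approach tunes.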

5

Zhao, Hong, Zhiwen Chen, Lan Guo, and Zeyu Han. "Video captioning based on vision transformer and reinforcement learning." PeerJ Computer Science 8 (March 16, 2022): e916. http://dx.doi.org/10.7717/peerj-cs.916.

Abstract:

Global encoding of visual features in video captioning is important for improving the description accuracy. In this paper, we propose a video captioning method that combines Vision Transformer (ViT) and reinforcement learning. Firstly, ResNet-152 and ResNeXt-101 are used to extract features from videos. Secondly, the encoding block of the ViT network is applied to encode video features. Thirdly, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a video content description. Finally, the accuracy of video content description is further improved by fine-tuning with reinforcement learning. We conducted experiments on the benchmark dataset MSR-VTT used for video captioning. The results show that compared with the current mainstream methods, the model in this paper has improved by 2.9%, 1.4%, 0.9% and 4.8% under the four evaluation indicators of BLEU-4, METEOR, ROUGE-L and CIDEr-D, respectively.

6

Im, Heeju, and Yong Suk Choi. "A Full Transformer Video Captioning Model via Vision Transformer." KIISE Transactions on Computing Practices 29, no. 8 (August 31, 2023): 378–83. http://dx.doi.org/10.5626/ktcp.2023.29.8.378.

7

Ugile, Tukaram, and Nilesh Uke. "Transformer Architectures for Computer Vision: A Comprehensive Review and Future Research Directions." Journal of Dynamics and Control 9, no. 3 (March 15, 2025): 70–79. https://doi.org/10.71058/jodac.v9i3005.

Abstract:

Transformers have had a revolutionary impact in Natural Language Processing (NLP) and have started making significant contributions to Computer Vision problems. This paper provides a comprehensive review of Transformer architectures in Computer Vision, giving a detailed view of their evolution from Vision Transformers (ViTs) to more advanced variants such as Swin Transformer, Transformer-XL, and hybrid CNN-Transformer models. We study the advantages of Transformers over traditional Convolutional Neural Networks (CNNs), their applications in Object Detection, Image Classification, and Video Analysis, and their computational challenges. Finally, we discuss future research directions, including self-attention mechanisms, multi-modal learning, and lightweight architectures for Edge Computing.

8

Wu, Pengfei, Le Wang, Sanping Zhou, Gang Hua, and Changyin Sun. "Temporal Correlation Vision Transformer for Video Person Re-Identification." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 6 (March 24, 2024): 6083–91. http://dx.doi.org/10.1609/aaai.v38i6.28424.

Abstract:

Video Person Re-Identification (Re-ID) is a task of retrieving persons from multi-camera surveillance systems. Despite the progress made in leveraging spatio-temporal information in videos, occlusion in dense crowds still hinders further progress. To address this issue, we propose a Temporal Correlation Vision Transformer (TCViT) for video person Re-ID. TCViT consists of a Temporal Correlation Attention (TCA) module and a Learnable Temporal Aggregation (LTA) module. The TCA module is designed to reduce the impact of non-target persons by relative state, while the LTA module is used to aggregate frame-level features based on their completeness. Specifically, TCA is a parameter-free module that first aligns frame-level features to restore semantic coherence in videos and then enhances the features of the target person according to temporal correlation. Additionally, unlike previous methods that treat each frame equally with a pooling layer, LTA introduces a lightweight learnable module to weigh and aggregate frame-level features under the guidance of a classification score. Extensive experiments on four prevalent benchmarks demonstrate that our method achieves state-of-the-art performance in video Re-ID.

9

Jin, Yanxiu, and Rulin Ma. "Applications of transformers in computer vision." Applied and Computational Engineering 16, no. 1 (October 23, 2023): 234–41. http://dx.doi.org/10.54254/2755-2721/16/20230898.

Abstract:

Recently, research based on transformers has become a hot topic. Owing to their ability to capture long-range dependencies, transformers have been rapidly adopted in the field of computer vision for processing image and video data. Despite this widespread adoption, applications of transformers to computer vision tasks such as semantic segmentation, image generation and image repair are still lacking. To address this gap, this paper provides a thorough review and summary of the latest research findings on the applications of transformers in these areas, with a focus on the mechanism of transformers and using ViT (Vision Transformer) as an example. The paper further highlights recent or popular discoveries of transformers in medical scenarios, image generation, and image inpainting. Based on the research, this work also provides insights on future developments and expectations.

10

Pei, Pengfei, Xianfeng Zhao, Jinchuan Li, Yun Cao, and Xuyuan Lai. "Vision Transformer-Based Video Hashing Retrieval for Tracing the Source of Fake Videos." Security and Communication Networks 2023 (June 28, 2023): 1–16. http://dx.doi.org/10.1155/2023/5349392.

Abstract:

With the increasing negative impact of fake videos on individuals and society, it is crucial to detect different types of forgeries. Existing forgery detection methods often output a probability value, which lacks interpretability and reliability. In this paper, we propose a source-tracing-based solution to find the original real video of a fake video, which can provide more reliable results in practical situations. However, directly applying retrieval methods to traceability tasks is infeasible since traceability tasks require finding the unique source video from a large number of real videos, while retrieval methods are typically used to find similar videos. In addition, training an effective hashing center to distinguish similar real videos is challenging. To address the above issues, we introduce a novel loss function, hash triplet loss, to capture fine-grained features with subtle differences. Extensive experiments show that our method outperforms state-of-the-art methods on multiple datasets of object removal (video inpainting), object addition (video splicing), and object swapping (face swapping), demonstrating excellent robustness and cross-dataset performance. The effectiveness of the hash triplet loss for nondifferentiable optimization problems is validated through experiments in similar video scenes.

11

Wang, Hao, Wenjia Zhang, and Guohua Liu. "TSNet: Token Sparsification for Efficient Video Transformer." Applied Sciences 13, no. 19 (September 24, 2023): 10633. http://dx.doi.org/10.3390/app131910633.

Abstract:

In the domain of video recognition, video transformers have demonstrated remarkable performance, albeit at significant computational cost. This paper introduces TSNet, an innovative approach for dynamically selecting informative tokens from given video samples. The proposed method involves a lightweight prediction module that assigns importance scores to each token in the video. Tokens with top scores are then utilized for self-attention computation. We apply the Gumbel-softmax technique to sample from the output of the prediction module, enabling end-to-end optimization of the prediction module. We aim to extend our method to hierarchical vision transformers rather than single-scale vision transformers. We use a simple linear module to project the pruned tokens, and the projected result is then concatenated with the output of the self-attention network to maintain the same number of tokens while capturing interactions with the selected tokens. Since feedforward networks (FFNs) contribute significant computation, we also propose linear projection for the pruned tokens to accelerate the model, while the existing FFN layer processes the selected tokens. Finally, in order to ensure that the structure of the output remains unchanged, the two groups of tokens are reassembled based on their spatial positions in the original feature map. The experiments conducted primarily focus on the Kinetics-400 dataset using UniFormer, a hierarchical video transformer backbone that incorporates convolution in its self-attention block. Our model demonstrates comparable results to the original model while reducing computation by over 13%. Notably, by hierarchically pruning 70% of input tokens, our approach reduces FLOPs by 55.5%, while the decline in accuracy is confined to 2%. Additional testing of wide applicability and adaptability with other transformers such as the Video Swin Transformer was also performed and indicated its potential in video recognition benchmarks. By implementing our token sparsification framework, video vision transformers can achieve a remarkable balance between enhanced computational speed and a slight reduction in accuracy.
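
The select-then-reassemble idea in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: it shows deterministic top-k selection on given importance scores, not the Gumbel-softmax sampling the authors use for training, and the function names `select_tokens` and `reassemble` are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_tokens(scores: Sequence[float],
                  keep_ratio: float) -> Tuple[List[int], List[int]]:
    """Split token indices into (kept, pruned) by importance score.

    The top `keep_ratio` fraction of tokens is kept; both index lists are
    returned in their original order so positions can be restored later.
    """
    k = max(1, int(len(scores) * keep_ratio))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    kept = sorted(top)
    kept_set = set(kept)
    pruned = [i for i in range(len(scores)) if i not in kept_set]
    return kept, pruned


def reassemble(kept: Sequence[int], pruned: Sequence[int],
               kept_out: Sequence, pruned_out: Sequence) -> list:
    """Put the two processed token groups back at their original positions."""
    out = [None] * (len(kept) + len(pruned))
    for i, value in zip(kept, kept_out):
        out[i] = value
    for i, value in zip(pruned, pruned_out):
        out[i] = value
    return out


kept, pruned = select_tokens([0.1, 0.9, 0.4, 0.8, 0.2], keep_ratio=0.4)
# Placeholders stand in for the attention/FFN output (kept tokens)
# and the cheap linear projection (pruned tokens).
print(reassemble(kept, pruned, ["attn"] * len(kept), ["proj"] * len(pruned)))
```

Only the kept indices would go through the expensive branch; the pruned ones take the cheap branch, and the output keeps the original token count and layout.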

12

Kim, Dahyun, and Myung Hwan Na. "Rice yield prediction and self-attention visualization using Video Vision Transformer." Korean Data Analysis Society 25, no. 4 (August 31, 2023): 1249–59. http://dx.doi.org/10.37727/jkdas.2023.25.4.1249.

Abstract:

The government and farmers' organizations are paying much attention to the problem of predicting how much rice can be produced each year. However, it is difficult to accurately predict the yield of rice due to variable factors such as extreme climate change and various pests and diseases that change every year. In this study, images were collected several times during the growing season of rice through a multi-spectral sensor mounted on an unmanned aerial vehicle, and rice yield was predicted using a deep learning algorithm. Multispectral images can be viewed as a kind of image data taken several times at regular intervals, and rice yield was predicted using the Video Vision Transformer (ViViT) model, which applies the Transformer structure to image computer vision among deep learning algorithms. The ViViT model generates patches by dividing the input image into a certain size, and as a result of learning the model by setting the size of these patches differently, it was found that the smaller the patch size, the better the predictive power. In addition, as a result of comparing prediction performance with a 3D CNN model that receives an image as an input in a CNN (Convolutional Neural Network) structure used in the image processing field, it was found that the ViViT model using a small patch size performed better. As a result of visualizing the weight matrix of the ViViT model as a heat map, images taken in mid- to late August appear to be important in yield prediction, making it possible to predict yield about two months before rice harvest.
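
The effect of patch size on sequence length, which this abstract relates to predictive power, can be made concrete with a small calculation. The sketch below assumes non-overlapping square patches and, optionally, ViViT-style tubelets spanning several time steps; the 224 × 224 input size and the helper name `num_patches` are illustrative assumptions, not values from the paper.

```python
def num_patches(height: int, width: int, patch_size: int,
                num_frames: int = 1, tubelet_frames: int = 1) -> int:
    """Number of tokens produced by non-overlapping patch (or tubelet) embedding."""
    if height % patch_size or width % patch_size or num_frames % tubelet_frames:
        raise ValueError("input dimensions must be divisible by the patch/tubelet size")
    return ((height // patch_size) * (width // patch_size)
            * (num_frames // tubelet_frames))


# Halving the patch size quadruples the number of tokens per frame.
for patch in (32, 16, 8):
    print(patch, num_patches(224, 224, patch))
```

Smaller patches give the model a finer view of each image, but with standard full self-attention the cost grows quadratically with the number of tokens, so the accuracy gain comes at a compute price.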

13

Lee, Jaewoo, Sungjun Lee, Wonki Cho, Zahid Ali Siddiqui, and Unsang Park. "Vision Transformer-Based Tailing Detection in Videos." Applied Sciences 11, no. 24 (December 7, 2021): 11591. http://dx.doi.org/10.3390/app112411591.

Abstract:

Tailing is defined as an event where a suspicious person follows someone closely. We define the problem of tailing detection from videos as an anomaly detection problem, where the goal is to find abnormalities in the walking pattern of the pedestrians (victim and follower). We, therefore, propose a modified Time-Series Vision Transformer (TSViT), a method for anomaly detection in video, specifically for tailing detection with a small dataset. We introduce an effective way to train TSViT with a small dataset by regularizing the prediction model. To do so, we first encode the spatial information of the pedestrians into 2D patterns and then pass them as tokens to the TSViT. Through a series of experiments, we show that the tailing detection on a small dataset using TSViT outperforms popular CNN-based architectures, as the CNN architectures tend to overfit with a small dataset of time-series images. We also show that when using time-series images, the performance of CNN-based architecture gradually drops, as the network depth is increased, to increase its capacity. On the other hand, a decreasing number of heads in Vision Transformer architecture shows good performance on time-series images, and the performance is further increased as the input resolution of the images is increased. Experimental results demonstrate that the TSViT performs better than the handcrafted rule-based method and CNN-based method for tailing detection. TSViT can be used in many applications for video anomaly detection, even with a small dataset.

14

Abdlrazg, Bassma A. Awad, Sumaia Masoud, and Mnal M. Ali. "Human Action Detection Using a Hybrid Architecture of CNN and Transformer." International Science and Technology Journal 34, no. 1 (January 25, 2024): 1–15. http://dx.doi.org/10.62341/bsmh2119.

Abstract:

This work presents a hybrid deep learning and Vision Transformer sequence model for the classification and identification of human motion actions. The deep learning model extracts spatial-temporal features from every video; a CNN model then takes these inputs as spatial feature maps and outputs them as a sequence of features. These sequences are fed in temporal order into the Vision Transformer (ViT), which classifies the videos into 7 different classes: Jump, Walk, Wave1, Wave2, Bend, Jack, and Powerful Jump. The model was trained and tested on the Weizmann dataset, and the results showed that it was capable of accurately identifying the human actions. Keywords: Deep Learning, Vision Transformer, Human Motion Action Detection, Spatial features, CNN.

15

Li, Xue, Huibo Zhou, and Ming Zhao. "Transformer-based cascade networks with spatial and channel reconstruction convolution for deepfake detection." Mathematical Biosciences and Engineering 21, no. 3 (2024): 4142–64. http://dx.doi.org/10.3934/mbe.2024183.

Abstract:

The threat posed by forged video technology has gradually grown to include individuals, society, and the nation. The technology behind fake videos is getting more advanced and modern. Fake videos are appearing everywhere on the internet. Consequently, addressing the challenge posed by frequent updates in various deepfake detection models is imperative. The substantial volume of data essential for their training adds to this urgency. For the deepfake detection problem, we suggest a cascade network based on spatial and channel reconstruction convolution (SCConv) and vision transformer. Our network model's front portion, which uses SCConv and regular convolution to detect fake videos in conjunction with vision transformer, comprises these two types of convolution. We enhance the feed-forward layer of the vision transformer, which can increase detection accuracy while lowering the model's computing burden. We processed the dataset by splitting frames and extracting faces to obtain many images of real and fake faces. Examinations conducted on the DFDC, FaceForensics++, and Celeb-DF datasets resulted in accuracies of 87.92, 99.23 and 99.98%, respectively. Finally, the video was tested for authenticity and good results were obtained, including excellent visualization results. Numerous studies also confirm the efficacy of the model presented in this study.

16

Zhou, Siyuan, Chunru Zhan, Biao Wang, Tiezheng Ge, Yuning Jiang, and Li Niu. "Video Object of Interest Segmentation." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 3 (June 26, 2023): 3805–13. http://dx.doi.org/10.1609/aaai.v37i3.25493.

Abstract:

In this work, we present a new computer vision task named video object of interest segmentation (VOIS). Given a video and a target image of interest, our objective is to simultaneously segment and track all objects in the video that are relevant to the target image. This problem combines the traditional video object segmentation task with an additional image indicating the content that users are concerned with. Since no existing dataset is perfectly suitable for this new task, we specifically construct a large-scale dataset called LiveVideos, which contains 2418 pairs of target images and live videos with instance-level annotations. In addition, we propose a transformer-based method for this task. We revisit Swin Transformer and design a dual-path structure to fuse video and image features. Then, a transformer decoder is employed to generate object proposals for segmentation and tracking from the fused features. Extensive experiments on LiveVideos dataset show the superiority of our proposed method.

17

Huo, Hua, and Bingjie Li. "MgMViT: Multi-Granularity and Multi-Scale Vision Transformer for Efficient Action Recognition." Electronics 13, no. 5 (February 29, 2024): 948. http://dx.doi.org/10.3390/electronics13050948.

Abstract:

Nowadays, the field of video-based action recognition is rapidly developing. Although Vision Transformers (ViT) have made great progress in static image processing, they are not yet fully optimized for dynamic video applications. Convolutional Neural Networks (CNN) and related models perform exceptionally well in video action recognition. However, there are still some issues that cannot be ignored, such as high computational costs and large memory consumption. In the face of these issues, current research focuses on finding effective methods to improve model performance and overcome current limits. Therefore, we present a unique Vision Transformer model based on multi-granularity and multi-scale fusion to accomplish efficient action recognition, which is designed for action recognition in videos to effectively reduce computational costs and memory usage. Firstly, we devise a multi-scale, multi-granularity module that integrates with Transformer blocks. Secondly, a hierarchical structure is utilized to manage information at various scales, and we introduce multi-granularity on top of multi-scale, which allows for a selective choice of the number of tokens to enter the next computational step, thereby reducing redundant tokens. Thirdly, a coarse-fine granularity fusion layer is introduced to reduce the sequence length of tokens with lower information content. The above two mechanisms are combined to optimize the allocation of resources in the model, further emphasizing critical information and reducing redundancy, thereby minimizing computational costs. To assess our proposed approach, comprehensive experiments are conducted by using benchmark datasets in the action recognition domain. The experimental results demonstrate that our method has achieved state-of-the-art performance in terms of accuracy and efficiency.

18

Kumar, Pavan. "Revolutionizing Deepfake Detection and Realtime Video Vision with CNN-based Deep Learning Model." International Journal of Innovative Research in Information Security 10, no. 04 (May 8, 2024): 173–77. http://dx.doi.org/10.26562/ijiris.2024.v1004.10.

Abstract:

The rapid advancement of deep learning models that can generate and synthesize hyper-realistic videos, known as Deepfakes, and their ease of access have raised concern about possible malicious use. Deep learning techniques can now generate faces, swap faces between two subjects in a video, alter facial expressions, change gender, and alter facial features, to list a few. These powerful video manipulation methods have potential use in many fields. However, they also pose a looming threat to everyone if used for harmful purposes such as identity theft, phishing, and scams. In this work, we propose a Convolutional Vision Transformer for the detection of Deepfakes. The Convolutional Vision Transformer has two components: a Convolutional Neural Network (CNN) and a Vision Transformer (ViT). The CNN extracts learnable features while the ViT takes in the learned features as input and categorizes them using an attention mechanism. We trained our model on the DeepFake Detection Challenge Dataset (DFDC) and have achieved 91.5 percent accuracy, an AUC value of 0.91, and a loss value of 0.32. Our contribution is that we have added a CNN module to the ViT architecture and have achieved a competitive result on the DFDC dataset.

19

Reddy, Sai Krishna. "Advancements in Video Deblurring: A Comprehensive Review." International Journal of Scientific Research in Engineering and Management 08, no. 05 (May 7, 2024): 1–5. http://dx.doi.org/10.55041/ijsrem32759.

Abstract:

Video deblurring is a critical task in computer vision, aimed at restoring the clarity of videos distorted by motion blur or other factors. It holds immense importance across various domains, including surveillance, entertainment, medical imaging, and autonomous driving, where clear visual information is crucial for decision-making and analysis. This article provides an overview of recent advancements in video deblurring techniques, ranging from convolutional neural network (CNN)-based methods to Transformer-based models and event-based reconstruction approaches. By synthesizing insights from recent research, this review delves into the applications and methodologies of these techniques, showcasing their effectiveness in real-world scenarios. By fostering knowledge exchange and inspiring further advancements in the field, this review aims to contribute to the continuous improvement of video processing technologies for enhanced visual quality and analysis. Keywords: Video Deblurring, Video Restoration, deep learning, Convolutional Neural Networks, Event Based Reconstruction, Transformers, Image Processing

20

Im, Heeju, and Yong-Suk Choi. "UAT: Universal Attention Transformer for Video Captioning." Sensors 22, no. 13 (June 25, 2022): 4817. http://dx.doi.org/10.3390/s22134817.

Abstract:

Video captioning via encoder–decoder structures is a successful sentence generation method. In addition, using various feature extraction networks for extracting multiple features to obtain multiple kinds of visual features in the encoding process is a standard method for improving model performance. Such feature extraction networks are weight-freezing states and are based on convolution neural networks (CNNs). However, these traditional feature extraction methods have some problems. First, when the feature extraction model is used in conjunction with freezing, additional learning of the feature extraction model is not possible by exploiting the backpropagation of the loss obtained from the video captioning training. Specifically, this blocks feature extraction models from learning more about spatial information. Second, the complexity of the model is further increased when multiple CNNs are used. Additionally, the author of Vision Transformers (ViTs) pointed out the inductive bias of CNN called the local receptive field. Therefore, we propose the full transformer structure that uses an end-to-end learning method for video captioning to overcome this problem. As a feature extraction model, we use a vision transformer (ViT) and propose feature extraction gates (FEGs) to enrich the input of the captioning model through that extraction model. Additionally, we design a universal encoder attraction (UEA) that uses all encoder layer outputs and performs self-attention on the outputs. The UEA is used to address the lack of information about the video’s temporal relationship because our method uses only the appearance feature. We will evaluate our model against several recent models on two benchmark datasets and show its competitive performance on MSRVTT/MSVD datasets. We show that the proposed model performed captioning using only a single feature, but in some cases, it was better than the others, which used several features.

21

Yamazaki, Kashu, Khoa Vo, Quang Sang Truong, Bhiksha Raj, and Ngan Le. "VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 3 (June 26, 2023): 3081–90. http://dx.doi.org/10.1609/aaai.v37i3.25412.

Abstract:

Video Paragraph Captioning aims to generate a multi-sentence description of an untrimmed video with multiple temporal event locations in a coherent storytelling. Following the human perception process, where the scene is effectively understood by decomposing it into visual (e.g. human, animal) and non-visual components (e.g. action, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities including (i) a global visual environment; (ii) local visual main agents; (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to guarantee the learnt embedding features are consistent with the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms previous state-of-the-art methods in terms of accuracy and diversity. The source code is made publicly available at: https://github.com/UARK-AICV/VLTinT.

22

Choksi, Sarah, Sanjeev Narasimhan, Mattia Ballo, Mehmet Turkcan, Yiran Hu, Chengbo Zang, Alex Farrell, et al. "Automatic assessment of robotic suturing utilizing computer vision in a dry-lab simulation." Artificial Intelligence Surgery 5, no. 2 (April 1, 2025): 160–9. https://doi.org/10.20517/ais.2024.84.

Abstract:

Aim: Automated surgical skill assessment is poised to become an invaluable asset in surgical residency training. In our study, we aimed to create deep learning (DL) computer vision artificial intelligence (AI) models capable of automatically assessing trainee performance and determining proficiency on robotic suturing tasks. Methods: Participants performed two robotic suturing tasks on a bench-top model created by our lab. Videos were recorded of each surgeon performing a backhand suturing task and a railroad suturing task at 30 frames per second (FPS) and downsampled to 15 FPS for the study. Each video was segmented into four sub-stitch phases: needle positioning, targeting, driving, and withdrawal. Each sub-stitch was annotated with a binary technical score (ideal or non-ideal), reflecting the operator’s skill while performing the suturing action. For DL analysis, 16-frame overlapping clips were sampled from the videos with a stride of 1. To extract the features useful for classification, two pretrained Video Swin Transformer models were fine-tuned using these clips: one to classify the sub-stitch phase and another to predict the technical score. The model outputs were then combined and used to train a Random Forest Classifier to predict the surgeon's proficiency level. Results: A total of 102 videos from 27 surgeons were evaluated using 3-fold cross-validation, 51 videos for the backhand suturing task and 51 videos for the railroad suturing task. Performance was assessed on sub-stitch classification accuracy, technical score accuracy, and surgeon proficiency prediction. The clip-based Video Swin Transformer models achieved an average classification accuracy of 70.23% for sub-stitch classification and 68.4% for technical score prediction on the test folds. Combining the model outputs, the Random Forest Classifier achieved an average accuracy of 66.7% in predicting surgeon proficiency. 
Conclusion: This study shows the feasibility of creating a DL-based automatic assessment tool for robotic-assisted surgery. Using machine learning models, we predicted the proficiency level of a surgeon with 66.7% accuracy. Our dry lab model proposes a standardized training and assessment tool for suturing tasks using computer vision.
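
Illustrative note (not part of the cited abstract): the clip sampling described above, 16-frame overlapping clips taken with a stride of 1, can be sketched as a sliding window. The function name `sample_clips` and the integer stand-ins for frames are assumptions made for this example and do not come from the paper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_clips(frames: Sequence[T], clip_len: int = 16, stride: int = 1) -> List[Sequence[T]]:
    """Return every window of `clip_len` consecutive frames, advancing by `stride`."""
    if clip_len <= 0 or stride <= 0:
        raise ValueError("clip_len and stride must be positive")
    last_start = len(frames) - clip_len
    # If the video is shorter than one clip, the range is empty and no clip is returned.
    return [frames[s:s + clip_len] for s in range(0, last_start + 1, stride)]


# Hypothetical 2-second video at 15 FPS; integers stand in for decoded frames.
video = list(range(30))
clips = sample_clips(video)
print(len(clips))  # number of overlapping 16-frame clips
```

With a stride of 1, consecutive clips share 15 of their 16 frames, so a video of N frames yields N - 15 clips whenever N is at least 16.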

23

Narsina, Deekshith, Nicholas Richardson, Arjun Kamisetty, Jaya Chandra Srikanth Gummadi, and Krishna Devarapu. "Neural Network Architectures for Real-Time Image and Video Processing Applications." Engineering International 10, no. 2 (2022): 131–44. https://doi.org/10.18034/ei.v10i2.735.

Abstract:

This research optimizes neural network topologies for real-time image and video processing to achieve high-speed, accurate performance in dynamic contexts. The project aims to find efficient optimization methodologies, track neural network model progress, and highlight visual media applications. A secondary data review synthesizes peer-reviewed literature, technical reports, neural network design, and optimization advances. The research found that lightweight neural network architectures like MobileNet and Transformer-based Vision Transformers (ViTs) improve computational efficiency without losing accuracy. Real-time applications need model pruning, quantization, knowledge distillation, and hardware-aware design. From real-time object identification in surveillance and autonomous driving to medical imaging and creative media creation, neural networks have transformed many applications. Despite these advances, balancing accuracy and efficiency, addressing hardware variability, and assuring ethical usage in face recognition remain open issues. The report emphasizes the need for privacy-friendly and egalitarian AI rules. These results may help future research improve real-time visual processing systems and help legislators regulate their responsible use in real-world applications.

24

Han, Xiao, Yongbin Wang, Shouxun Liu, and Cong Jin. "Online Multiplayer Tracking by Extracting Temporal Contexts with Transformer." Wireless Communications and Mobile Computing 2022 (October 11, 2022): 1–10. http://dx.doi.org/10.1155/2022/6177973.

Abstract:

Sports competition is one of the most popular programs for many audiences. Tracking the players in sports game videos from broadcasts is a nontrivial challenge for computer vision researchers. In sports videos, the direction of an athlete’s movement changes quickly and unpredictably. Mutual occlusion between athletes is also more frequent in team competitions. However, the rich temporal contexts among the adjacent frames have been excluded from consideration. To address this dilemma, we propose an online transformer-based learnable framework in an end-to-end fashion. We use a transformer architecture to extract the temporal contexts between the successive frames and add them to the network training, which is robust to occlusion and complex direction changes in multiplayer tracking. We demonstrate the effectiveness of our method on three sports video datasets by comparing them with recently advanced multiplayer trackers.

25

Zhang, Fan, Jiawei Tian, Jianhao Wang, Guanyou Liu, and Ying Liu. "ECViST: Mine Intelligent Monitoring Based on Edge Computing and Vision Swin Transformer-YOLOv5." Energies 15, no. 23 (November 29, 2022): 9015. http://dx.doi.org/10.3390/en15239015.

Abstract:

Mine video surveillance has a key role in ensuring the production safety of intelligent mining. However, existing mine intelligent monitoring technology mainly processes the video data in the cloud, which has problems such as network congestion, large memory consumption, and untimely response to regional emergencies. In this paper, we address these limitations by utilizing the edge-cloud collaborative optimization framework. First, we obtained a coarse model using the edge-cloud collaborative architecture and updated this to realize the continuous improvement of the detection model. Second, we further proposed a target detection model based on the Vision Swin Transformer-YOLOv5 (ViST-YOLOv5) algorithm and improved the model for edge device deployment. The experimental results showed that the object detection model based on ViST-YOLOv5, with a model size of only 27.057 MB, improved the average detection accuracy by 25% compared to the state-of-the-art model, which makes it suitable for edge-end deployment in mining workfaces. For the actual mine surveillance video, the edge-cloud collaborative architecture can achieve better performance and robustness in typical application scenarios, such as weak lighting and occlusion, which verifies the feasibility of the designed architecture.

26

Mardani, Konstantina, Nicholas Vretos, and Petros Daras. "Transformer-Based Fire Detection in Videos." Sensors 23, no. 6 (March 11, 2023): 3035. http://dx.doi.org/10.3390/s23063035.

Abstract:

Fire detection in videos forms a valuable feature in surveillance systems, as its utilization can prevent hazardous situations. A model that is both accurate and fast is necessary to handle this significant task effectively. In this work, a transformer-based network for the detection of fire in videos is proposed. It is an encoder–decoder architecture that consumes the current frame that is under examination, in order to compute attention scores. These scores denote which parts of the input frame are more relevant for the expected fire detection output. The model can recognize fire in video frames and, as the experimental results show, specify its exact location in the image plane in real time in the form of a segmentation mask. The proposed methodology has been trained and evaluated for two computer vision tasks, the full-frame classification task (fire/no fire in frames) and the fire localization task. In comparison with the state-of-the-art models, the proposed method achieves outstanding results in both tasks, with 97% accuracy, 20.4 fps processing time, 0.02 false positive rate for fire localization, and 97% for f-score and recall metrics in the full-frame classification task.

27

Peng, Pengfei, Guoqing Liang, and Tao Luan. "Multi-View Inconsistency Analysis for Video Object-Level Splicing Localization." International Journal of Emerging Technologies and Advanced Applications 1, no. 3 (April 24, 2024): 1–5. http://dx.doi.org/10.62677/ijetaa.2403111.

Abstract:

In the digital era, the widespread use of video content has led to the rapid development of video editing technologies. However, it has also raised concerns about the authenticity and integrity of multimedia content. Video splicing forgery has emerged as a challenging and deceptive technique used to create fake video objects, potentially for malicious purposes such as deception, defamation, and fraud. Therefore, the detection of video splicing forgery has become critically important. Nevertheless, due to the complexity of video data and a lack of relevant datasets, research on video splicing forgery detection remains relatively limited. This paper introduces a novel method for detecting video object splicing forgery, which enhances detection performance by deeply exploring inconsistent features between different source videos. We incorporate various feature types, including edge luminance, texture, and video quality information, and utilize a joint learning approach with Convolutional Neural Network (CNN) and Vision Transformer (ViT) models. Experimental results demonstrate that our method excels in detecting video object splicing forgery, offering promising prospects for further advancements in this field.

28

Wang, Jing, and ZongJu Yang. "Transformer-Guided Video Inpainting Algorithm Based on Local Spatial-Temporal joint." EAI Endorsed Transactions on e-Learning 8, no. 4 (August 15, 2023): e2. http://dx.doi.org/10.4108/eetel.3156.

Abstract:

INTRODUCTION: Video inpainting is a very important task in computer vision, and it’s a key component of various practical applications. It also plays an important role in video occlusion removal, traffic monitoring and old movie restoration technology. Video inpainting is to obtain reasonable content from the video sequence to fill the missing region, and maintain time continuity and spatial consistency. OBJECTIVES: In previous studies, due to the complexity of the scene of video inpainting, there are often cases of fast motion of objects in the video or motion of background objects, which will lead to optical flow failure. So the current video inpainting algorithm hasn’t met the requirements of practical applications. In order to avoid the problem of optical flow failure, this paper proposes a transformer-guided video inpainting model based on local spatial-temporal joint. METHODS: First, considering the rich spatial-temporal relationship between local flows, a Local Spatial-Temporal Joint Network (LSTN) including encoder, decoder and transformer module is designed to roughly inpaint the local corrupted frames, and the Deep Flow Network is used to calculate the local bidirectional corrupted flows. Then, the local corrupted optical flow map is input into the Local Flow Completion Network (LFCN) with pseudo 3D convolution and attention mechanism to obtain a complete set of bidirectional local optical flow maps. Finally, the roughly inpainted local frame and the complete bidirectional local optical flow map are sent to the spatial-temporal transformer and the inpainted video frame is output. RESULTS: Experiments show that the algorithm achieves high quality results in the video target removal task, and has a certain improvement in indicators compared with advanced technologies. CONCLUSION: Transformer-Guided Video Inpainting Algorithm Based on Local Spatial-Temporal joint can obtain high-quality optical flow information and inpainted result video.

29

Le, Viet-Tuan, Kiet Tran-Trung, and Vinh Truong Hoang. "A Comprehensive Review of Recent Deep Learning Techniques for Human Activity Recognition." Computational Intelligence and Neuroscience 2022 (April 20, 2022): 1–17. http://dx.doi.org/10.1155/2022/8323962.

Abstract:

Human action recognition is an important field in computer vision that has attracted remarkable attention from researchers. This survey aims to provide a comprehensive overview of recent human action recognition approaches based on deep learning using RGB video data. Our work divides recent deep learning-based methods into five different categories to provide a comprehensive overview for researchers who are interested in this field of computer vision. Moreover, a pure-transformer architecture (convolution-free) has outperformed its convolutional counterparts in many fields of computer vision recently. Our work also provides recent convolution-free-based methods which replaced convolution networks with the transformer networks that achieved state-of-the-art results on many human action recognition datasets. Firstly, we discuss proposed methods based on a 2D convolutional neural network. Then, methods based on a recurrent neural network which is used to capture motion information are discussed. 3D convolutional neural network-based methods are used in many recent approaches to capture both spatial and temporal information in videos. However, with long action videos, multistream approaches with different streams to encode different features are reviewed. We also compare the performance of recently proposed methods on four popular benchmark datasets. We review 26 benchmark datasets for human action recognition. Some potential research directions are discussed to conclude this survey.

30

Hong, Jiuk, Chaehyeon Lee, and Heechul Jung. "Late Fusion-Based Video Transformer for Facial Micro-Expression Recognition." Applied Sciences 12, no. 3 (January 23, 2022): 1169. http://dx.doi.org/10.3390/app12031169.

Abstract:

In this article, we propose a novel model for facial micro-expression (FME) recognition. The proposed model is built on a transformer, which has recently been used for computer vision but has never been used for FME recognition. A transformer requires a huge amount of data compared to a convolutional neural network. Therefore, we use motion features, such as optical flow, and late fusion to compensate for the small size of FME datasets. The proposed method was verified and evaluated using the SMIC and CASME II datasets. Our approach achieved state-of-the-art (SOTA) performance of 0.7447 and 73.17% in SMIC in terms of unweighted F1 score (UF1) and accuracy (Acc.), respectively, which are 0.31 and 1.8% higher than previous SOTA. Furthermore, UF1 of 0.7106 and Acc. of 70.68% were shown in the CASME II experiment, which are comparable with SOTA.
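
Illustrative note (not part of the cited abstract): the unweighted F1 score (UF1) is usually understood as the plain average of the per-class F1 scores, so that rare classes count as much as frequent ones. The sketch below assumes that definition and takes the class set from the ground-truth labels. The labels and predictions are made-up toy data, not results from the paper.

```python
from typing import Hashable, Sequence


def unweighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Average the per-class F1 scores with equal weight per class (macro F1)."""
    classes = list(dict.fromkeys(y_true))  # unique ground-truth classes, order preserved
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy three-class example (hypothetical data).
y_true = ["neg", "neg", "pos", "pos", "sur", "sur"]
y_pred = ["neg", "pos", "pos", "pos", "sur", "neg"]
uf1 = unweighted_f1(y_true, y_pred)
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(uf1, 4), round(acc, 4))
```

Reporting UF1 next to accuracy, as the abstract does, helps when the class distribution is imbalanced, because accuracy alone can be dominated by the largest class.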

31

D, Mrs Srivalli, and Divya Sri V. "Video Inpainting with Local and Global Refinement." INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT 08, no. 03 (March 17, 2024): 1–5. http://dx.doi.org/10.55041/ijsrem29385.

Abstract:

Video inpainting is a crucial task in computer vision and video editing, involving the removal of unnecessary objects and the restoration of missing or corrupted regions within a video sequence. One approach that has gained prominence in recent years is the combination of local and global refinement techniques. This innovative strategy leverages the strengths of both local and global information to produce high-quality inpainted videos. In the context of video inpainting, local refinement focuses on accurately restoring missing or damaged regions by considering nearby pixels or frames. An encoder-decoder network can be employed to fill in gaps with content that seamlessly blends with the surrounding context. On the other hand, global refinement seeks to ensure temporal consistency and smooth transitions between frames, preventing noticeable artifacts or jittering in the inpainted video. In conclusion, by seamlessly blending local and global inpainting strategies, these methods can effectively remove unwanted elements from videos while preserving both spatial and temporal coherence. This technology finds applications in video editing, restoration of damaged archival footage, and even in the entertainment industry for special effects and scene corrections, ultimately contributing to the improvement of video quality and aesthetics. Keywords: Video Inpainting, Local and Global Refinement, Encoder-Decoder network, Recurrent Flow Completion, mask-guided sparse video Transformer, dual-domain propagation.

32

Habeb, Mohamed H., May Salama, and Lamiaa A. Elrefaei. "Enhancing Video Anomaly Detection Using a Transformer Spatiotemporal Attention Unsupervised Framework for Large Datasets." Algorithms 17, no. 7 (July 1, 2024): 286. http://dx.doi.org/10.3390/a17070286.

Abstract:

This work introduces an unsupervised framework for video anomaly detection, leveraging a hybrid deep learning model that combines a vision transformer (ViT) with a convolutional spatiotemporal relationship (STR) attention block. The proposed model addresses the challenges of anomaly detection in video surveillance by capturing both local and global relationships within video frames, a task that traditional convolutional neural networks (CNNs) often struggle with due to their localized field of view. We have utilized a pre-trained ViT as an encoder for feature extraction, which is then processed by the STR attention block to enhance the detection of spatiotemporal relationships among objects in videos. The novelty of this work is utilizing the ViT with the STR attention to detect video anomalies effectively in large and heterogeneous datasets, which is important given the diverse environments and scenarios encountered in real-world surveillance. The framework was evaluated on three benchmark datasets, i.e., the UCSD-Ped2, CUHK Avenue, and ShanghaiTech. The results demonstrate the model’s superior performance in detecting anomalies compared to state-of-the-art methods, showcasing its potential to significantly enhance automated video surveillance systems by achieving area under the receiver operating characteristic curve (AUC ROC) values of 95.6, 86.8, and 82.1. To show the effectiveness of the proposed framework in detecting anomalies in extra-large datasets, we trained the model on a subset of the huge contemporary CHAD dataset that contains over 1 million frames, achieving AUC ROC values of 71.8 and 64.2 for CHAD-Cam 1 and CHAD-Cam 2, respectively, which outperforms the state-of-the-art techniques.
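
Illustrative note (not part of the cited abstract): the AUC ROC metric reported above can be read as the probability that a randomly chosen anomalous frame receives a higher anomaly score than a randomly chosen normal frame. The sketch below computes it by direct pairwise comparison, which is only practical for small inputs. The labels and scores are toy values, not data from the paper, and the abstract reports the metric as a percentage rather than a fraction.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (anomalous, normal) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy frame-level labels (1 = anomalous) and anomaly scores (hypothetical values).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_roc(labels, scores)
print(f"{100 * auc:.1f}")  # expressed as a percentage, as in the abstract
```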

33

Usmani, Shaheen, Sunil Kumar, and Debanjan Sadhya. "Spatio-temporal knowledge distilled video vision transformer (STKD-VViT) for multimodal deepfake detection." Neurocomputing 620 (March 2025): 129256. https://doi.org/10.1016/j.neucom.2024.129256.

34

Kumar, Yulia, Kuan Huang, Chin-Chien Lin, Annaliese Watson, J. Jenny Li, Patricia Morreale, and Justin Delgado. "Applying Swin Architecture to Diverse Sign Language Datasets." Electronics 13, no. 8 (April 16, 2024): 1509. http://dx.doi.org/10.3390/electronics13081509.

Abstract:

In an era where artificial intelligence (AI) bridges crucial communication gaps, this study extends AI’s utility to American and Taiwan Sign Language (ASL and TSL) communities through advanced models like the hierarchical vision transformer with shifted windows (Swin). This research evaluates Swin’s adaptability across sign languages, aiming for a universal platform for the unvoiced. Utilizing deep learning and transformer technologies, it has developed prototypes for ASL-to-English translation, supported by an educational framework to facilitate learning and comprehension, with the intention to include more languages in the future. This study highlights the efficacy of the Swin model, along with other models such as the vision transformer with deformable attention (DAT), ResNet-50, and VGG-16, in ASL recognition. The Swin model’s accuracy across various datasets underscores its potential. Additionally, this research explores the challenges of balancing accuracy with the need for real-time, portable language recognition capabilities and introduces the use of cutting-edge transformer models like Swin, DAT, and video Swin transformers for diverse datasets in sign language recognition. This study explores the integration of multimodality and large language models (LLMs) to promote global inclusivity. Future efforts will focus on enhancing these models and expanding their linguistic reach, with an emphasis on real-time translation applications and educational frameworks. These achievements not only advance the technology of sign language recognition but also provide more effective communication tools for the deaf and hard-of-hearing community.

35

Li, Yixiao, Lixiang Li, Zirui Zhuang, Yuan Fang, Haipeng Peng, and Nam Ling. "Transformer-Based Data-Driven Video Coding Acceleration for Industrial Applications." Mathematical Problems in Engineering 2022 (September 27, 2022): 1–11. http://dx.doi.org/10.1155/2022/1440323.

Abstract:

With the exploding development of edge intelligence and smart industry, deep learning-based intelligent industrial solutions are promptly applied in the manufacturing process. Many intelligent industrial solutions such as automatic manufacturing inspection are computer vision based and require fast and efficient video encoding techniques so that video streams can be processed as quickly as possible either at the edge cluster or over the cloud. As one of the most popular video coding standards, the high efficiency video coding (HEVC) standard has been applied to various industrial scenes. However, HEVC brings not only a higher compression rate but also a significant increase in encoding complexity, which hinders its practical application in industrial scenarios. Fortunately, a large amount of video coding data makes it possible to accelerate the encoding process in the industry. To speed up the video coding process in some industrial scenes, this paper proposes a data-driven fast approach for coding tree unit (CTU) partitioning in HEVC intracoding. First, we propose a method to represent the partition result of a CTU as a column vector of length 21. Then, we employ lots of encoding data produced in normal industry scenes to train transformer models used to predict the partitioning vector of the CTU. Finally, the final partitioning structure of the CTU is generated from the partitioning vector after a postprocessing operation and used by an industrial encoder. Compared with the original HEVC encoder used by some industrial applications, experiment results show that our approach achieves 58.77% encoding time reduction with 3.9% bit rate loss, which indicates that our data-driven approach for video coding has great capacity working in industrial applications.
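
Illustrative note (not part of the cited abstract): the abstract does not say how the 21 entries of the partitioning vector are laid out. One reading that matches the length, assumed here purely for illustration, is a list of quadtree split flags for a 64×64 CTU: 1 flag for the CTU itself, 4 for its 32×32 blocks, and 16 for its 16×16 blocks (1 + 4 + 16 = 21). The function `partition_vector`, the 4×4 depth-map input, and the flag ordering are all hypothetical.

```python
from typing import List


def partition_vector(depth: List[List[int]]) -> List[int]:
    """Encode a 4x4 map of per-16x16-block quadtree depths (0..3) as 21 split flags.

    Assumed layout: [split 64x64] + [split each 32x32] + [split each 16x16],
    with blocks visited in raster order. The depth map must describe a valid quadtree.
    """
    def region_max(r0: int, c0: int, size: int) -> int:
        return max(depth[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size))

    flags = [1 if region_max(0, 0, 4) >= 1 else 0]            # 64x64 level: 1 flag
    for r0 in (0, 2):
        for c0 in (0, 2):
            flags.append(1 if region_max(r0, c0, 2) >= 2 else 0)  # 32x32 level: 4 flags
    for r in range(4):
        for c in range(4):
            flags.append(1 if depth[r][c] >= 3 else 0)            # 16x16 level: 16 flags
    return flags


# Hypothetical CTU: split once, top-right 32x32 split again, one 16x16 block split to 8x8.
depth_map = [
    [1, 1, 2, 2],
    [1, 1, 2, 3],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
vec = partition_vector(depth_map)
print(len(vec), vec)
```

Under this assumed layout, predicting the partition reduces to predicting 21 binary values, which is the kind of fixed-length target a transformer classifier can be trained on.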

36

Nikulina, Olena, Valerii Severyn, Oleksii Kondratov, and Oleksii Olhovoy. "MODELS OF REMOTE IDENTIFICATION OF PARAMETERS OF DYNAMIC OBJECTS USING DETECTION TRANSFORMERS AND OPTICAL FLOW." Bulletin of National Technical University "KhPI". Series: System Analysis, Control and Information Technologies, no. 1 (11) (July 30, 2024): 52–57. http://dx.doi.org/10.20998/2079-0023.2024.01.08.

Abstract:

The tasks of remote identification of parameters of dynamic objects are important for various fields, including computer vision, robotics, autonomous vehicles, video surveillance systems, and many others. Traditional methods for these tasks suffer from insufficient accuracy and efficiency when determining dynamic parameters in rapidly changing environments and complex dynamic scenarios. Modern methods of identifying parameters of dynamic objects using detection transformers and optical flow are considered. The detection transformer is one of the newest approaches in computer vision that uses the transformer architecture for object detection tasks. It integrates the object detection and boundary detection processes into a single end-to-end model, which greatly improves the accuracy and speed of processing. The use of transformers allows the model to effectively process information from the entire image at the same time, which contributes to better recognition of objects even in difficult conditions. Optical flow is a motion analysis method that determines the speed and direction of pixel movement between successive video frames. This method allows obtaining detailed information about the dynamics of the scene, which is critical for accurate tracking and identification of parameters of moving objects. The integration of detection transformers and optical flow is proposed to increase the accuracy of identification of parameters of dynamic objects. The combination of these two methods makes it possible to use the advantages of both approaches: high accuracy of object detection and detailed information about their movement. The conducted experiments show that the proposed model significantly outperforms traditional methods both in the accuracy of determining the parameters of objects and in the speed of data processing.
The key results of the study indicate that the integration of detection transformers and optical flow provides reliable and fast determination of parameters of moving objects in real time, which can be applied in various practical scenarios. The conducted research also showed the potential for further improvement of data processing methods and their application in complex dynamic environments. The obtained results open new perspectives for the development of intelligent monitoring and control systems capable of adapting to rapidly changing environmental conditions, increasing the efficiency and safety of their work.

37

El Moaqet, Hisham, Rami Janini, Tamer Abdulbaki Alshirbaji, Nour Aldeen Jalal, and Knut Möller. "Using Vision Transformers for Classifying Surgical Tools in Computer Aided Surgeries." Current Directions in Biomedical Engineering 10, no. 4 (December 1, 2024): 232–35. https://doi.org/10.1515/cdbme-2024-2056.

Abstract:

Automated laparoscopic video analysis is essential for assisting surgeons during computer aided medical procedures. Nevertheless, it faces challenges due to complex surgical scenes and limited annotated data. Most of the existing methods for classifying surgical tools in laparoscopic surgeries rely on conventional deep learning methods such as convolutional and recurrent neural networks. This paper explores the use of pure self-attention based models, Vision Transformers, for classifying both single-label (SL) and multi-label (ML) frames in laparoscopic surgeries. The proposed SL and ML models were comprehensively evaluated on the Cholec80 surgical workflow dataset using 5-fold cross validation. Experimental results showed an excellent classification performance with a mean average precision mAP=95.8% that outperforms conventional deep learning multi-label models developed in previous studies. Our results open new avenues for further research on the use of deep transformer models for surgical tool detection in modern operating theaters.

38

Jang, Hee-Deok, Seokjoon Kwon, Hyunwoo Nam, and Dong Eui Chang. "Chemical Gas Source Localization with Synthetic Time Series Diffusion Data Using Video Vision Transformer." Applied Sciences 14, no. 11 (May 23, 2024): 4451. http://dx.doi.org/10.3390/app14114451.

Abstract:

Gas source localization is vital in emergency scenarios to enable swift and effective responses. In this study, we introduce a gas source localization model leveraging the video vision transformer (ViViT). Utilizing synthetic time series diffusion data, the source grid is predicted by classifying the grid with the highest probability of gas occurrence within the diffusion data coverage. Through extensive experimentation using the NBC-RAMS simulator, we generate large datasets of gas diffusion under varied experimental conditions and meteorological environments, enabling comprehensive model training and evaluation. Our findings demonstrate that the ViViT outperforms other deep learning models in processing time series gas data, showcasing a superior estimation performance. Leveraging a transformer architecture, the ViViT exhibits a robust classification performance even in scenarios influenced by weather conditions or incomplete observations. Furthermore, we conduct an analysis of accuracy and parameter count across various input sequence lengths, revealing the ability of the ViViT to maintain high computational efficiency while achieving accurate source localization. These results underscore the effectiveness of the ViViT as a model for gas source localization, particularly in situations demanding a rapid response in real-world environments, such as gas leaks or attacks.

39

Mozaffari, M. Hamed, Yuchuan Li, Niloofar Hooshyaripour, and Yoon Ko. "Vision-Based Prediction of Flashover Using Transformers and Convolutional Long Short-Term Memory Model." Electronics 13, no. 23 (December 3, 2024): 4776. https://doi.org/10.3390/electronics13234776.

Abstract:

The prediction of fire growth is crucial for effective firefighting and rescue operations. Recent advancements in vision-based techniques using RGB vision and infrared (IR) thermal imaging data, coupled with artificial intelligence and deep learning techniques, have shown promising solutions to be applied in the detection of fire and the prediction of its behavior. This study introduces the use of Convolutional Long Short-term Memory (ConvLSTM) network models for predicting room fire growth by analyzing spatiotemporal IR thermal imaging data acquired from full-scale room fire tests. Our findings revealed that SwinLSTM, an enhanced version of ConvLSTM combined with transformers (a deep learning architecture based on a new mechanism called multi-head attention) for computer vision purposes, can be used for the prediction of room fire flashover occurrence. Notably, transformer-based ConvLSTM deep learning models, such as SwinLSTM, demonstrate superior prediction capability, which suggests a new vision-based smart solution for future fire growth prediction tasks. The main focus of this work is to perform a feasibility study on the use of a pure vision-based deep learning model for analysis of future video data to anticipate behavior of fire growth in room fire incidents.

40

Geng, Xiaozhong, Cheng Chen, Ping Yu, Baijin Liu, Weixin Hu, Qipeng Liang, and Xintong Zhang. "OM-VST: A video action recognition model based on optimized downsampling module combined with multi-scale feature fusion." PLOS ONE 20, no. 3 (March 6, 2025): e0318884. https://doi.org/10.1371/journal.pone.0318884.

Abstract:

Video classification, as an essential task in computer vision, aims to identify and label video content using computer technology automatically. However, the current mainstream video classification models face two significant challenges in practical applications: first, the classification accuracy is not high, which is mainly attributed to the complexity and diversity of video data, including factors such as subtle differences between different categories, background interference, and illumination variations; and second, the number of model training parameters is too high resulting in longer training time and increased energy consumption. To solve these problems, we propose the OM-Video Swin Transformer (OM-VST) model. This model adds a multi-scale feature fusion module with an optimized downsampling module based on a Video Swin Transformer (VST) to improve the model’s ability to perceive and characterize feature information. To verify the performance of the OM-VST model, we conducted comparison experiments between it and mainstream video classification models, such as VST, SlowFast, and TSM, on a public dataset. The results show that the accuracy of the OM-VST model is improved by 2.81% while the number of parameters is reduced by 54.7%. This improvement significantly enhances the model’s accuracy in video classification tasks and effectively reduces the number of parameters during model training.

41

Kim, Nayeon, Sukhee Cho, and Byungjun Bae. "SMaTE: A Segment-Level Feature Mixing and Temporal Encoding Framework for Facial Expression Recognition." Sensors 22, no. 15 (August 1, 2022): 5753. http://dx.doi.org/10.3390/s22155753.

Abstract:

Despite advanced machine learning methods, the implementation of emotion recognition systems based on real-world video content remains challenging. Videos may contain data such as images, audio, and text. However, the application of multimodal models using two or more types of data to real-world video media (CCTV, illegally filmed content, etc.) lacking sound or subtitles is difficult. Although facial expressions in image sequences can be utilized in emotion recognition, the diverse identities of individuals in real-world content limit computational models of relationships between facial expressions. This study proposed a transformation model which employed a video vision transformer to focus on facial expression sequences in videos. It effectively understood and extracted facial expression information from the identities of individuals, instead of fusing multimodal models. The design entailed capture of higher-quality facial expression information through mixed-token embedding of facial expression sequences, augmented via various methods, into a single data representation, and comprised two modules: spatial and temporal encoders. Further, temporal position embedding, focusing on relationships between video frames, was proposed and subsequently applied to the temporal encoder module. The performance of the proposed algorithm was compared with that of conventional methods on two emotion recognition datasets of video content, with results demonstrating its superiority.

42

Lai, Derek Ka-Hei, Ethan Shiu-Wang Cheng, Bryan Pak-Hei So, Ye-Jiao Mao, Sophia Ming-Yan Cheung, Daphne Sze Ki Cheung, Duo Wai-Chi Wong, and James Chung-Wai Cheung. "Transformer Models and Convolutional Networks with Different Activation Functions for Swallow Classification Using Depth Video Data." Mathematics 11, no. 14 (July 12, 2023): 3081. http://dx.doi.org/10.3390/math11143081.

Abstract:

Dysphagia is a common geriatric syndrome that might induce serious complications and death. Standard diagnostics using the Videofluoroscopic Swallowing Study (VFSS) or Fiberoptic Evaluation of Swallowing (FEES) are expensive and expose patients to risks, while bedside screening is subjective and might lack reliability. An affordable and accessible instrumented screening is necessary. This study aimed to evaluate the classification performance of Transformer models and convolutional networks in identifying swallowing and non-swallowing tasks through depth video data. Different activation functions (ReLU, LeakyReLU, GELU, ELU, SiLU, and GLU) were then evaluated on the best-performing model. Sixty-five healthy participants (n = 65) were invited to perform swallowing (eating a cracker and drinking water) and non-swallowing tasks (a deep breath and pronouncing vowels: “/eɪ/”, “/iː/”, “/aɪ/”, “/oʊ/”, “/u:/”). Swallowing and non-swallowing were classified by Transformer models (TimeSFormer, Video Vision Transformer (ViViT)), and convolutional neural networks (SlowFast, X3D, and R(2+1)D), respectively. In general, convolutional neural networks outperformed the Transformer models. X3D was the best model with good-to-excellent performance (F1-score: 0.920; adjusted F1-score: 0.885) in classifying swallowing and non-swallowing conditions. Moreover, X3D with its default activation function (ReLU) produced the best results, although LeakyReLU performed better in deep breathing and pronouncing “/aɪ/” tasks. Future studies shall consider collecting more data for pretraining and developing a hyperparameter tuning strategy for activation functions and the high dimensionality video data for Transformer models.

43

Liu, Yuqi, Luhui Xu, Pengfei Xiong, and Qin Jin. "Token Mixing: Parameter-Efficient Transfer Learning from Image-Language to Video-Language." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 2 (June 26, 2023): 1781–89. http://dx.doi.org/10.1609/aaai.v37i2.25267.

Abstract:

Applying large-scale pre-trained image-language models to video-language tasks has recently become a trend, which brings two challenges. One is how to effectively transfer knowledge from static images to dynamic videos, and the other is how to deal with the prohibitive cost of fully fine-tuning due to growing model size. Existing works that attempt to realize parameter-efficient image-language to video-language transfer learning can be categorized into two types: 1) appending a sequence of temporal transformer blocks after the 2D Vision Transformer (ViT), and 2) inserting a temporal block into the ViT architecture. While these two types of methods only require fine-tuning the newly added components, there are still many parameters to update, and they are only validated on a single video-language task. In this work, based on our analysis of the core ideas of different temporal modeling components in existing approaches, we propose a token mixing strategy to enable cross-frame interactions, which enables transferring from the pre-trained image-language model to video-language tasks through selecting and mixing a key set and a value set from the input video samples. As token mixing does not require the addition of any components or modules, we can directly partially fine-tune the pre-trained image-language model to achieve parameter-efficiency. We carry out extensive experiments to compare our proposed token mixing method with other parameter-efficient transfer learning methods. Our token mixing method outperforms other methods on both understanding tasks and generation tasks. Besides, our method achieves new records on multiple video-language tasks. The code is available at https://github.com/yuqi657/video_language_model.

44

Lorenzo, Javier, Ignacio Parra Alonso, Rubén Izquierdo, Augusto Luis Ballardini, Álvaro Hernández Saz, David Fernández Llorca, and Miguel Ángel Sotelo. "CAPformer: Pedestrian Crossing Action Prediction Using Transformer." Sensors 21, no. 17 (August 24, 2021): 5694. http://dx.doi.org/10.3390/s21175694.

Abstract:

Anticipating pedestrian crossing behavior in urban scenarios is a challenging task for autonomous vehicles. Early this year, a benchmark comprising the JAAD and PIE datasets was released. In the benchmark, several state-of-the-art methods have been ranked. However, most of the ranked temporal models rely on recurrent architectures. In our case, we propose, to the best of our knowledge, the first self-attention alternative, based on the transformer architecture, which has had enormous success in natural language processing (NLP) and recently in computer vision. Our architecture is composed of various branches which fuse video and kinematic data. The video branch is based on two possible architectures: RubiksNet and TimeSformer. The kinematic branch is based on different configurations of the transformer encoder. Several experiments have been performed, mainly focusing on pre-processing input data and highlighting problems with two kinematic data sources: pose keypoints and ego-vehicle speed. Our proposed model's results are comparable to those of PCPA, the best-performing model in the benchmark, reaching an F1 score of nearly 0.78 against 0.77. Furthermore, by using only bounding box coordinates and image data, our model surpasses PCPA by a larger margin (F1=0.75 vs. F1=0.72). Our model has proven to be a valid alternative to recurrent architectures, providing advantages such as parallelization and whole sequence processing, learning relationships between samples not possible with recurrent architectures.

45

Guo, Zizhao, and Sancong Ying. "Whole-Body Keypoint and Skeleton Augmented RGB Networks for Video Action Recognition." Applied Sciences 12, no. 12 (June 18, 2022): 6215. http://dx.doi.org/10.3390/app12126215.

Abstract:

Incorporating multi-modality data is an effective way to improve action recognition performance. Based on this idea, we investigate a new data modality in which Whole-Body Keypoint and Skeleton (WKS) labels are used to capture refined body information. Unlike directly aggregated multi-modality, we leverage distillation to adapt an RGB network to classify action with the feature-extraction ability of the WKS network, which is only fed with RGB clips. Inspired by the success of transformers for vision tasks, we design an architecture that takes advantage of both three-dimensional (3D) convolutional neural networks (CNNs) and the Swin transformer to extract spatiotemporal features, resulting in advanced performance. Furthermore, considering the unequal discrimination among clips of a video, we also present a new method for aggregating the clip-level classification results, further improving the performance. The experimental results demonstrate that our framework achieves advanced accuracy of 93.4% with only RGB input on the UCF-101 dataset.

46

Zhang, Renhong, Tianheng Cheng, Shusheng Yang, Haoyi Jiang, Shuai Zhang, Jiancheng Lyu, Xin Li, et al. "MobileInst: Video Instance Segmentation on the Mobile." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 7 (March 24, 2024): 7260–68. http://dx.doi.org/10.1609/aaai.v38i7.28555.

Abstract:

Video instance segmentation on mobile devices is an important yet very challenging edge AI problem. It mainly suffers from (1) heavy computation and memory costs for frame-by-frame pixel-level instance perception and (2) complicated heuristics for tracking objects. To address these issues, we present MobileInst, a lightweight and mobile-friendly framework for video instance segmentation on mobile devices. Firstly, MobileInst adopts a mobile vision transformer to extract multi-level semantic features and presents an efficient query-based dual-transformer instance decoder for mask kernels and a semantic-enhanced mask decoder to generate instance segmentation per frame. Secondly, MobileInst exploits simple yet effective kernel reuse and kernel association to track objects for video instance segmentation. Further, we propose temporal query passing to enhance the tracking ability for kernels. We conduct experiments on COCO and YouTube-VIS datasets to demonstrate the superiority of MobileInst and evaluate the inference latency on one single CPU core of the Snapdragon 778G Mobile Platform, without other methods of acceleration. On the COCO dataset, MobileInst achieves 31.2 mask AP and 433 ms on the mobile CPU, which reduces the latency by 50% compared to the previous SOTA. For video instance segmentation, MobileInst achieves 35.0 AP and 30.1 AP on YouTube-VIS 2019 & 2021.

47

Zang, Chengbo, Mehmet Kerem Turkcan, Sanjeev Narasimhan, Yuqing Cao, Kaan Yarali, Zixuan Xiang, Skyler Szot, et al. "Surgical Phase Recognition in Inguinal Hernia Repair—AI-Based Confirmatory Baseline and Exploration of Competitive Models." Bioengineering 10, no. 6 (May 27, 2023): 654. http://dx.doi.org/10.3390/bioengineering10060654.

Abstract:

Video-recorded robotic-assisted surgeries allow the use of automated computer vision and artificial intelligence/deep learning methods for quality assessment and workflow analysis in surgical phase recognition. We considered a dataset of 209 videos of robotic-assisted laparoscopic inguinal hernia repair (RALIHR) collected from 8 surgeons, defined rigorous ground-truth annotation rules, then pre-processed and annotated the videos. We deployed seven deep learning models to establish the baseline accuracy for surgical phase recognition and explored four advanced architectures. For rapid execution of the studies, we initially engaged three dozen MS-level engineering students in a competitive classroom setting, followed by focused research. We unified the data processing pipeline in a confirmatory study, and explored a number of scenarios which differ in how the DL networks were trained and evaluated. For the scenario with 21 validation videos of all surgeons, the Video Swin Transformer model achieved ~0.85 validation accuracy, and the Perceiver IO model achieved ~0.84. Our studies affirm the necessity of close collaborative research between medical experts and engineers for developing automated surgical phase recognition models deployable in clinical settings.

48

Liu, Hao, Jiwen Lu, Jianjiang Feng, and Jie Zhou. "Two-Stream Transformer Networks for Video-Based Face Alignment." IEEE Transactions on Pattern Analysis and Machine Intelligence 40, no. 11 (November 1, 2018): 2546–54. http://dx.doi.org/10.1109/tpami.2017.2734779.

49

Khan, Salman, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. "Transformers in Vision: A Survey." ACM Computing Surveys, January 6, 2022. http://dx.doi.org/10.1145/3505244.

Abstract:

Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks, e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.

50

Hsu, Tzu-Chun, Yi-Sheng Liao, and Chun-Rong Huang. "Video Summarization With Spatiotemporal Vision Transformer." IEEE Transactions on Image Processing, 2023, 1. http://dx.doi.org/10.1109/tip.2023.3275069.
