Scientific literature on the topic "Video Vision Transformer"
Create a correct reference in APA, MLA, Chicago, Harvard, and several other styles.
Browse the thematic lists of journal articles, books, theses, conference reports, and other academic sources on the topic "Video Vision Transformer".
Next to each source in the list of references there is an "Add to bibliography" button. Click on it, and we will automatically generate the bibliographic reference for the chosen source in your preferred citation style: APA, MLA, Harvard, Vancouver, Chicago, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online when this information is included in the metadata.
Journal articles on the topic "Video Vision Transformer"
Naikwadi, Sanket Shashikant. "Video Summarization Using Vision and Language Transformer Models." International Journal of Research Publication and Reviews 6, no. 6 (January 2025): 5217–21. https://doi.org/10.55248/gengpi.6.0125.0654.
Moutik, Oumaima, Hiba Sekkat, Smail Tigani, Abdellah Chehri, Rachid Saadane, Taha Ait Tchakoucht, and Anand Paul. "Convolutional Neural Networks or Vision Transformers: Who Will Win the Race for Action Recognitions in Visual Data?" Sensors 23, no. 2 (January 9, 2023): 734. http://dx.doi.org/10.3390/s23020734.
Yuan, Hongchun, Zhenyu Cai, Hui Zhou, Yue Wang, and Xiangzhi Chen. "TransAnomaly: Video Anomaly Detection Using Video Vision Transformer." IEEE Access 9 (2021): 123977–86. http://dx.doi.org/10.1109/access.2021.3109102.
Sarraf, Saman, and Milton Kabia. "Optimal Topology of Vision Transformer for Real-Time Video Action Recognition in an End-To-End Cloud Solution." Machine Learning and Knowledge Extraction 5, no. 4 (September 29, 2023): 1320–39. http://dx.doi.org/10.3390/make5040067.
Zhao, Hong, Zhiwen Chen, Lan Guo, and Zeyu Han. "Video captioning based on vision transformer and reinforcement learning." PeerJ Computer Science 8 (March 16, 2022): e916. http://dx.doi.org/10.7717/peerj-cs.916.
Im, Heeju, and Yong Suk Choi. "A Full Transformer Video Captioning Model via Vision Transformer." KIISE Transactions on Computing Practices 29, no. 8 (August 31, 2023): 378–83. http://dx.doi.org/10.5626/ktcp.2023.29.8.378.
Ugile, Tukaram, and Nilesh Uke. "TRANSFORMER ARCHITECTURES FOR COMPUTER VISION: A COMPREHENSIVE REVIEW AND FUTURE RESEARCH DIRECTIONS." Journal of Dynamics and Control 9, no. 3 (March 15, 2025): 70–79. https://doi.org/10.71058/jodac.v9i3005.
Wu, Pengfei, Le Wang, Sanping Zhou, Gang Hua, and Changyin Sun. "Temporal Correlation Vision Transformer for Video Person Re-Identification." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 6 (March 24, 2024): 6083–91. http://dx.doi.org/10.1609/aaai.v38i6.28424.
Jin, Yanxiu, and Rulin Ma. "Applications of transformers in computer vision." Applied and Computational Engineering 16, no. 1 (October 23, 2023): 234–41. http://dx.doi.org/10.54254/2755-2721/16/20230898.
Pei, Pengfei, Xianfeng Zhao, Jinchuan Li, Yun Cao, and Xuyuan Lai. "Vision Transformer-Based Video Hashing Retrieval for Tracing the Source of Fake Videos." Security and Communication Networks 2023 (June 28, 2023): 1–16. http://dx.doi.org/10.1155/2023/5349392.
Theses on the topic "Video Vision Transformer"
Zhang, Yujing. "Deep learning-assisted video list decoding in error-prone video transmission systems." Electronic Thesis or Diss., Valenciennes, Université Polytechnique Hauts-de-France, 2024. http://www.theses.fr/2024UPHF0028.
In recent years, video applications have developed rapidly. At the same time, the video quality experience has improved considerably with the advent of HD video and the emergence of 4K content. As a result, video streams tend to represent a larger amount of data. To reduce the size of these video streams, new video compression solutions such as HEVC have been developed. However, transmission errors that may occur over networks can cause unwanted visual artifacts that significantly degrade the user experience. Various approaches have been proposed in the literature to find efficient and low-complexity solutions to repair video packets containing binary errors, thus avoiding costly retransmission that is incompatible with the low-latency constraints of many emerging applications (immersive video, tele-operation). Error correction based on the cyclic redundancy check (CRC) is a promising approach that uses readily available information without throughput overhead. However, in practice it can only correct a limited number of errors. Depending on the generator polynomial used, the size of the packets, and the maximum number of errors considered, this method can lead not to a single corrected packet but rather to a list of possibly corrected packets. In this case, list decoding becomes relevant in combination with CRC-based error correction, as well as with methods exploiting information on the reliability of the received bits. However, this raises the question of how to select among the candidate videos. After the ranked candidates have been generated by the state-of-the-art list decoding process, the final selection often takes the first valid candidate in the final list as the reconstructed video. This simple selection is arbitrary and not optimal: the candidate video sequence at the top of the list is not necessarily the one with the best visual quality. It is therefore necessary to develop a new method to automatically select the video with the highest quality from the list of candidates. We propose to select the best candidate based on the visual quality determined by a deep learning (DL) system. Since distortions are assessed on each frame, we rely on image quality assessment rather than video quality assessment. More specifically, each candidate is processed by a deep learning-based no-reference image quality assessment (IQA) method to obtain a score, and the system then selects the candidate with the highest IQA score. To do this, our system evaluates the quality of videos subject to transmission errors without eliminating lost packets or concealing lost regions. Distortions caused by transmission errors differ from those accounted for by traditional visual quality measures, which typically deal with global, uniform image distortions. Thus, these metrics fail to distinguish the repaired version from different corrupted video versions when local, non-uniform errors occur. Our approach revisits and optimizes the classic list decoding technique by combining it first with a CNN architecture and then with a Transformer to evaluate the visual quality and identify the best candidate; this combination is novel and offers excellent performance. In particular, we show that when transmission errors occur within an intra frame, our CNN- and Transformer-based architectures achieve 100% decision accuracy. For errors in an inter frame, the accuracy is 93% and 95%, respectively.
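The selection stage described in this abstract reduces to scoring each candidate reconstruction with a no-reference IQA model and keeping the highest-scoring one. The sketch below illustrates that loop in PyTorch under stated assumptions: `TinyFrameIQA` is a hypothetical placeholder for the CNN- or Transformer-based scorer, and `select_best_candidate` is an illustrative helper, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn


class TinyFrameIQA(nn.Module):
    """Placeholder no-reference IQA network: one quality score per frame.

    Stands in for the CNN- or Transformer-based scorer described in the
    thesis; the layer sizes here are purely illustrative.
    """

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, 3, H, W) -> (T,), one score per frame
        return self.head(self.features(frames).flatten(1)).squeeze(1)


def select_best_candidate(candidates: list[torch.Tensor], iqa_model: nn.Module) -> int:
    """Return the index of the candidate video with the highest mean frame IQA score."""
    scores = []
    with torch.no_grad():
        for video in candidates:          # each video: (T, 3, H, W)
            scores.append(iqa_model(video).mean().item())
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    model = TinyFrameIQA().eval()
    # Three decoded candidate reconstructions of the same 8-frame sequence.
    candidates = [torch.rand(8, 3, 64, 64) for _ in range(3)]
    print("best candidate index:", select_best_candidate(candidates, model))
```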
Filali Razzouki, Anas. "Deep learning-based video face-based digital markers for early detection and analysis of Parkinson disease." Electronic Thesis or Diss., Institut polytechnique de Paris, 2025. http://www.theses.fr/2025IPPAS002.
This thesis aims to develop robust digital biomarkers for early detection of Parkinson's disease (PD) by analyzing facial videos to identify changes associated with hypomimia. In this context, we introduce new contributions to the state of the art: one based on shallow machine learning and the other on deep learning. The first method employs machine learning models that use manually extracted facial features, particularly derivatives of facial action units (AUs). These models incorporate interpretability mechanisms that explain their decision-making process for stakeholders, highlighting the most distinctive facial features for PD. We examine the influence of biological sex on these digital biomarkers, compare them against neuroimaging data and clinical scores, and use them to predict PD severity. The second method leverages deep learning to automatically extract features from raw facial videos and optical flow using foundation models based on Video Vision Transformers. To address the limited training data, we propose advanced adaptive transfer learning techniques, utilizing foundation models trained on large-scale video classification datasets. Additionally, we integrate interpretability mechanisms to clarify the relationship between the automatically extracted features and the manually extracted facial AUs, enhancing the comprehensibility of the model's decisions. Finally, our generated facial features are derived from both cross-sectional and longitudinal data, which provides a significant advantage over existing work. We use these recordings to analyze the progression of hypomimia over time with these digital markers, and its correlation with the progression of clinical scores. Combining these two approaches achieves a classification AUC (Area Under the Curve) of over 90%, demonstrating the efficacy of machine learning and deep learning models in detecting hypomimia in early-stage PD patients through facial videos. This research could enable continuous monitoring of hypomimia outside hospital settings via telemedicine.
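For readers unfamiliar with the Video Vision Transformer backbones mentioned in this abstract, the following minimal sketch shows the general shape of such a model (tubelet embedding, Transformer encoder, classification token) for a binary video classification task. `MiniViViT` and all of its dimensions are illustrative assumptions, not the architecture used in the thesis.

```python
import torch
import torch.nn as nn


class MiniViViT(nn.Module):
    """Greatly simplified Video Vision Transformer for binary video classification.

    Tubelet embedding (3-D convolution) followed by a standard Transformer
    encoder and a classification token, in the spirit of the ViViT family
    cited above; all dimensions are illustrative only.
    """

    def __init__(self, dim=128, depth=4, heads=4, num_classes=2,
                 frames=16, size=112, tubelet=(2, 16, 16)):
        super().__init__()
        self.embed = nn.Conv3d(3, dim, kernel_size=tubelet, stride=tubelet)
        n_tokens = (frames // tubelet[0]) * (size // tubelet[1]) * (size // tubelet[2])
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_tokens + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, 3, T, H, W) -> logits: (B, num_classes)
        x = self.embed(video).flatten(2).transpose(1, 2)   # (B, N, dim) tubelet tokens
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos
        x = self.encoder(x)
        return self.head(x[:, 0])                          # classify from the CLS token


if __name__ == "__main__":
    model = MiniViViT()
    clip = torch.rand(2, 3, 16, 112, 112)   # two 16-frame face clips
    print(model(clip).shape)                 # torch.Size([2, 2])
```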
Cedernaes, Erasmus. "Runway detection in LWIR video: Real time image processing and presentation of sensor data." Thesis, Uppsala universitet, Avdelningen för visuell information och interaktion, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-300690.
Saravi, Sara. "Use of Coherent Point Drift in computer vision applications." Thesis, Loughborough University, 2013. https://dspace.lboro.ac.uk/2134/12548.
Leoputra, Wilson Suryajaya. "Video foreground extraction for mobile camera platforms." Thesis, Curtin University, 2009. http://hdl.handle.net/20.500.11937/1384.
Ali, Abid. "Analyse vidéo à l'aide de réseaux de neurones profonds : une application pour l'autisme." Electronic Thesis or Diss., Université Côte d'Azur, 2024. http://www.theses.fr/2024COAZ4066.
Understanding actions in videos is a crucial element of computer vision with significant implications across various fields. As our dependence on visual data grows, comprehending and interpreting human actions in videos becomes essential for advancing technologies in surveillance, healthcare, autonomous systems, and human-computer interaction. The accurate interpretation of actions in videos is fundamental for creating intelligent systems that can effectively navigate and respond to the complexities of the real world. In this context, advances in action understanding push the boundaries of computer vision and play a crucial role in shaping the landscape of cutting-edge applications that impact our daily lives. Computer vision has made significant progress with the rise of deep learning methods such as convolutional neural networks (CNNs), which have enabled the community to advance in many domains, including image segmentation, object detection, scene understanding, and more. However, video processing remains limited compared to static images. In this thesis, we focus on action understanding, dividing it into two main parts, action recognition and action detection, and their application in the medical domain for autism analysis. We explore the various aspects and challenges of video understanding from both a general and an application-specific perspective, and then present our contributions and solutions to address these challenges. In addition, we introduce the ACTIVIS dataset, designed to diagnose autism in young children. Our work is divided into two main parts: generic modeling and applied models. Initially, we focus on adapting image models for action recognition tasks by incorporating temporal modeling using parameter-efficient fine-tuning (PEFT) techniques. We also address real-time action detection and anticipation by proposing a new joint model for action anticipation and online action detection in real-life scenarios. Furthermore, we introduce a new task called 'loose interaction' in dyadic situations and its applications in autism analysis. Finally, we concentrate on the applied aspect of video understanding by proposing an action recognition model for repetitive behaviors in videos of autistic individuals. We conclude by proposing a weakly supervised method to estimate the severity score of autistic children in long videos.
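The first contribution mentioned in this abstract (adapting image models to video with parameter-efficient fine-tuning) can be illustrated by freezing a per-frame image encoder and training only a small temporal module and a classification head. The sketch below is a generic illustration of that idea under assumed names (`FrozenFrameEncoder`, `TemporalAdapterClassifier`) and with a GRU as the trainable temporal module; it does not reproduce the thesis's PEFT architecture.

```python
import torch
import torch.nn as nn


class FrozenFrameEncoder(nn.Module):
    """Stand-in for a pretrained image backbone whose weights stay frozen."""

    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        for p in self.parameters():
            p.requires_grad = False          # parameter-efficient: backbone frozen

    def forward(self, frames):               # (B*T, 3, H, W) -> (B*T, dim)
        return self.net(frames)


class TemporalAdapterClassifier(nn.Module):
    """Frozen image encoder + small trainable temporal module + linear head."""

    def __init__(self, dim=256, num_classes=10):
        super().__init__()
        self.encoder = FrozenFrameEncoder(dim)
        self.adapter = nn.GRU(dim, dim, batch_first=True)   # only trainable temporal part
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                 # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.encoder(video.flatten(0, 1)).view(b, t, -1)
        _, h = self.adapter(feats)             # temporal aggregation over frames
        return self.head(h[-1])


if __name__ == "__main__":
    model = TemporalAdapterClassifier()
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable}/{total}")
    print(model(torch.rand(2, 8, 3, 64, 64)).shape)   # torch.Size([2, 10])
```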
Burger, Thomas. "Reconnaissance automatique des gestes de la langue française parlée complétée." PhD thesis, Grenoble INPG, 2007. http://tel.archives-ouvertes.fr/tel-00203360.
Books on the topic "Video Vision Transformer"
Korsgaard, Mathias Bonde. Music Video Transformed. Edited by John Richardson, Claudia Gorbman, and Carol Vernallis. Oxford University Press, 2013. http://dx.doi.org/10.1093/oxfordhb/9780199733866.013.015.
Book chapters on the topic "Video Vision Transformer"
Gabeur, Valentin, Chen Sun, Karteek Alahari, and Cordelia Schmid. "Multi-modal Transformer for Video Retrieval." In Computer Vision – ECCV 2020, 214–29. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-58548-8_13.
Kim, Hannah Halin, Shuzhi Yu, Shuai Yuan, and Carlo Tomasi. "Cross-Attention Transformer for Video Interpolation." In Computer Vision – ACCV 2022 Workshops, 325–42. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-27066-6_23.
Kim, Tae Hyun, Mehdi S. M. Sajjadi, Michael Hirsch, and Bernhard Schölkopf. "Spatio-Temporal Transformer Network for Video Restoration." In Computer Vision – ECCV 2018, 111–27. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-030-01219-9_7.
Xue, Tong, Qianrui Wang, Xinyi Huang, and Dengshi Li. "Self-guided Transformer for Video Super-Resolution." In Pattern Recognition and Computer Vision, 186–98. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-8549-4_16.
Li, Zutong, and Lei Yang. "DCVQE: A Hierarchical Transformer for Video Quality Assessment." In Computer Vision – ACCV 2022, 398–416. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-26316-3_24.
Courant, Robin, Maika Edberg, Nicolas Dufour, and Vicky Kalogeiton. "Transformers and Visual Transformers." In Machine Learning for Brain Disorders, 193–229. New York, NY: Springer US, 2012. http://dx.doi.org/10.1007/978-1-0716-3195-9_6.
Huo, Shuwei, Yuan Zhou, and Haiyang Wang. "YFormer: A New Transformer Architecture for Video-Query Based Video Moment Retrieval." In Pattern Recognition and Computer Vision, 638–50. Cham: Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-18913-5_49.
Li, Li, Liansheng Zhuang, Shenghua Gao, and Shafei Wang. "HaViT: Hybrid-Attention Based Vision Transformer for Video Classification." In Computer Vision – ACCV 2022, 502–17. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-26316-3_30.
Zhang, Hui, Jiewen Yang, Xingbo Dong, Xingguo Lv, Wei Jia, Zhe Jin, and Xuejun Li. "A Video Face Recognition Leveraging Temporal Information Based on Vision Transformer." In Pattern Recognition and Computer Vision, 29–43. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-8469-5_3.
Wu, Jinlin, Lingxiao He, Wu Liu, Yang Yang, Zhen Lei, Tao Mei, and Stan Z. Li. "CAViT: Contextual Alignment Vision Transformer for Video Object Re-identification." In Lecture Notes in Computer Science, 549–66. Cham: Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-19781-9_32.
Conference papers on the topic "Video Vision Transformer"
Kobayashi, Takumi, and Masataka Seo. "Efficient Compression Method in Video Reconstruction Using Video Vision Transformer." In 2024 IEEE 13th Global Conference on Consumer Electronics (GCCE), 724–25. IEEE, 2024. https://doi.org/10.1109/gcce62371.2024.10760444.
Yokota, Haruto, Mert Bozkurtlar, Benjamin Yen, Katsutoshi Itoyama, Kenji Nishida, and Kazuhiro Nakadai. "A Video Vision Transformer for Sound Source Localization." In 2024 32nd European Signal Processing Conference (EUSIPCO), 106–10. IEEE, 2024. http://dx.doi.org/10.23919/eusipco63174.2024.10715427.
Ojaswee, R. Sreemathy, Mousami Turuk, Jayashree Jagdale, and Mohammad Anish. "Indian Sign Language Recognition Using Video Vision Transformer." In 2024 3rd International Conference for Advancement in Technology (ICONAT), 1–7. IEEE, 2024. https://doi.org/10.1109/iconat61936.2024.10774678.
Thuan, Pham Minh, Bui Thu Lam, and Pham Duy Trung. "Spatial Vision Transformer: A Novel Approach to Deepfake Video Detection." In 2024 1st International Conference On Cryptography And Information Security (VCRIS), 1–6. IEEE, 2024. https://doi.org/10.1109/vcris63677.2024.10813391.
Kumari, Supriya, Prince Kumar, Pooja Verma, Rajitha B, and Sarsij Tripathi. "Hybrid Vision Transformer and Convolutional Neural Network for Sports Video Classification." In 2024 International Conference on Intelligent Computing and Emerging Communication Technologies (ICEC), 1–5. IEEE, 2024. https://doi.org/10.1109/icec59683.2024.10837289.
Isogawa, Junya, Fumihiko Sakaue, and Jun Sato. "Simultaneous Estimation of Driving Intentions for Multiple Vehicles Using Video Transformer." In 20th International Conference on Computer Vision Theory and Applications, 471–77. SCITEPRESS - Science and Technology Publications, 2025. https://doi.org/10.5220/0013232100003912.
Gupta, Anisha, and Vidit Kumar. "A Hybrid U-Net and Vision Transformer approach for Video Anomaly detection." In 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), 1–6. IEEE, 2024. http://dx.doi.org/10.1109/icccnt61001.2024.10725860.
Ansari, Khustar, and Priyanka Srivastava. "Hybrid Attention Vision Transformer-based Deep Learning Model for Video Caption Generation." In 2025 International Conference on Electronics and Renewable Systems (ICEARS), 1238–45. IEEE, 2025. https://doi.org/10.1109/icears64219.2025.10940922.
Zhou, Xingyu, Leheng Zhang, Xiaorui Zhao, Keze Wang, Leida Li, and Shuhang Gu. "Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention." In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 25399–408. IEEE, 2024. http://dx.doi.org/10.1109/cvpr52733.2024.02400.
Choi, Joonmyung, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, and Hyunwoo J. Kim. "vid-TLDR: Training Free Token merging for Light-Weight Video Transformer." In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18771–81. IEEE, 2024. http://dx.doi.org/10.1109/cvpr52733.2024.01776.