Journal articles on the topic "Deep Video Representations"

To see the other types of publications on this topic, follow the link: Deep Video Representations.

Create a correct reference in APA, MLA, Chicago, Harvard, and several other styles.

Consult the top 50 journal articles for your research on the topic "Deep Video Representations".

Next to every source in the list of references there is an "Add to bibliography" button. Press it, and we will automatically generate the bibliographic reference for the chosen source in your preferred citation style: APA, MLA, Harvard, Vancouver, Chicago, etc.

You can also download the full text of the scholarly publication as a PDF and read its abstract online whenever that information is included in the metadata.

Browse journal articles on a wide variety of disciplines and organize your bibliography correctly.

1

Feichtenhofer, Christoph, Axel Pinz, Richard P. Wildes, and Andrew Zisserman. "Deep Insights into Convolutional Networks for Video Recognition." International Journal of Computer Vision 128, no. 2 (29 October 2019): 420–37. http://dx.doi.org/10.1007/s11263-019-01225-w.

Abstract:
As the success of deep models has led to their deployment in all areas of computer vision, it is increasingly important to understand how these representations work and what they are capturing. In this paper, we shed light on deep spatiotemporal representations by visualizing the internal representation of models that have been trained to recognize actions in video. We visualize multiple two-stream architectures to show that local detectors for appearance and motion objects arise to form distributed representations for recognizing human actions. Key observations include the following. First, cross-stream fusion enables the learning of true spatiotemporal features rather than simply separate appearance and motion features. Second, the networks can learn local representations that are highly class specific, but also generic representations that can serve a range of classes. Third, throughout the hierarchy of the network, features become more abstract and show increasing invariance to aspects of the data that are unimportant to desired distinctions (e.g. motion patterns across various speeds). Fourth, visualizations can be used not only to shed light on learned representations, but also to reveal idiosyncrasies of training data and to explain failure cases of the system.
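
As a rough illustration of the cross-stream fusion idea discussed in this abstract, the following Python sketch fuses a small RGB stream and an optical-flow stream with a 1x1 convolution so that later layers can learn joint spatiotemporal filters. The layer sizes and class count are assumptions for illustration, not the authors' implementation.

# Minimal sketch of cross-stream (appearance + motion) fusion, assuming
# pre-computed RGB frames and optical-flow maps; not the authors' code.
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    def __init__(self, num_classes=101):
        super().__init__()
        self.rgb_stream = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.flow_stream = nn.Sequential(nn.Conv2d(2, 32, 3, padding=1), nn.ReLU())
        # 1x1 convolution mixes channels from both streams -> joint features
        self.fuse = nn.Conv2d(64, 64, kernel_size=1)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(64, num_classes))

    def forward(self, rgb, flow):
        x = torch.cat([self.rgb_stream(rgb), self.flow_stream(flow)], dim=1)
        return self.head(torch.relu(self.fuse(x)))

logits = TwoStreamFusion()(torch.randn(2, 3, 112, 112), torch.randn(2, 2, 112, 112))
print(logits.shape)  # torch.Size([2, 101])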
2

Pandeya, Yagya Raj, Bhuwan Bhattarai, and Joonwhoan Lee. "Deep-Learning-Based Multimodal Emotion Classification for Music Videos." Sensors 21, no. 14 (20 July 2021): 4927. http://dx.doi.org/10.3390/s21144927.

Abstract:
Music videos contain a great deal of visual and acoustic information. Each information source within a music video influences the emotions conveyed through the audio and video, suggesting that only a multimodal approach is capable of achieving efficient affective computing. This paper presents an affective computing system that relies on music, video, and facial expression cues, making it useful for emotional analysis. We applied the audio–video information exchange and boosting methods to regularize the training process and reduced the computational costs by using a separable convolution strategy. In sum, our empirical findings are as follows: (1) Multimodal representations efficiently capture all acoustic and visual emotional clues included in each music video, (2) the computational cost of each neural network is significantly reduced by factorizing the standard 2D/3D convolution into separate channels and spatiotemporal interactions, and (3) information-sharing methods incorporated into multimodal representations are helpful in guiding individual information flow and boosting overall performance. We tested our findings across several unimodal and multimodal networks against various evaluation metrics and visual analyzers. Our best classifier attained 74% accuracy, an f1-score of 0.73, and an area under the curve score of 0.926.
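
The abstract above mentions factorizing standard 2D/3D convolutions into separate channel and spatiotemporal interactions. The sketch below shows one common way to do this (a depthwise spatiotemporal filter followed by a pointwise channel-mixing filter); the exact factorization used in the paper may differ.

# Sketch of a "separable" 3D convolution: depthwise spatiotemporal filtering
# followed by a pointwise (1x1x1) channel interaction. Illustrative only.
import torch
import torch.nn as nn

class Separable3dConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3)):
        super().__init__()
        pad = tuple(k // 2 for k in kernel)
        # groups=in_ch -> each channel gets its own spatiotemporal kernel
        self.depthwise = nn.Conv3d(in_ch, in_ch, kernel, padding=pad, groups=in_ch)
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1)  # channel interaction

    def forward(self, x):          # x: (batch, channels, time, height, width)
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 16, 8, 56, 56)
print(Separable3dConv(16, 32)(x).shape)  # torch.Size([1, 32, 8, 56, 56])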
3

Ljubešić, Nikola. "‟Deep lexicography” – Fad or Opportunity?" Rasprave Instituta za hrvatski jezik i jezikoslovlje 46, no. 2 (30 October 2020): 839–52. http://dx.doi.org/10.31724/rihjj.46.2.21.

Abstract:
In recent years, we are witnessing staggering improvements in various semantic data processing tasks due to the developments in the area of deep learning, ranging from image and video processing to speech processing, and natural language understanding. In this paper, we discuss the opportunities and challenges that these developments pose for the area of electronic lexicography. We primarily focus on the concept of representation learning of the basic elements of language, namely words, and the applicability of these word representations to lexicography. We first discuss well-known approaches to learning static representations of words, the so-called word embeddings, and their usage in lexicography-related tasks such as semantic shift detection, and cross-lingual prediction of lexical features such as concreteness and imageability. We wrap up the paper with the most recent developments in the area of word representation learning in form of learning dynamic, context-aware representations of words, showcasing some dynamic word embedding examples, and discussing improvements on lexicography-relevant tasks of word sense disambiguation and word sense induction.
4

Kumar, Vidit, Vikas Tripathi, and Bhaskar Pant. "Learning Unsupervised Visual Representations using 3D Convolutional Autoencoder with Temporal Contrastive Modeling for Video Retrieval." International Journal of Mathematical, Engineering and Management Sciences 7, no. 2 (14 March 2022): 272–87. http://dx.doi.org/10.33889/ijmems.2022.7.2.018.

Abstract:
The rapid growth of tag-free user-generated videos (on the Internet), surgical recorded videos, and surveillance videos has necessitated the need for effective content-based video retrieval systems. Earlier methods for video representation were based on hand-crafted features, which hardly performed well on video retrieval tasks. Subsequently, deep learning methods have successfully demonstrated their effectiveness in both image and video-related tasks, but at the cost of creating massively labeled datasets. Thus, the economic solution is to use freely available unlabeled web videos for representation learning. In this regard, most of the recently developed methods are based on solving a single pretext task using a 2D or 3D convolutional network. However, this paper designs and studies a 3D convolutional autoencoder (3D-CAE) for video representation learning (since it does not require labels). Further, this paper proposes a new unsupervised video feature learning method based on joint learning of past and future prediction using 3D-CAE with temporal contrastive learning. The experiments are conducted on the UCF-101 and HMDB-51 datasets, where the proposed approach achieves better retrieval performance than the state of the art. In the ablation study, the action recognition task is performed by fine-tuning the unsupervised pre-trained model, where it outperforms other methods, which further confirms the superiority of our method in learning underlying features. Such an unsupervised representation learning approach could also benefit the medical domain, where it is expensive to create large labeled datasets.
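
The combination of a 3D convolutional autoencoder with a temporal contrastive objective can be sketched as follows. This is a toy illustration under assumed architecture sizes (an InfoNCE-style loss between embeddings of two clips from the same video), not the paper's exact model.

# Toy 3D convolutional autoencoder with an InfoNCE-style temporal contrastive
# term between embeddings of two clips of the same video (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAE3D(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv3d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                                 nn.ConvTranspose3d(16, 3, 4, stride=2, padding=1))

    def forward(self, clip):
        z = self.enc(clip)
        return self.dec(z), z.mean(dim=(2, 3, 4))   # reconstruction, clip embedding

def info_nce(z1, z2, tau=0.1):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                      # positives on the diagonal
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

model = CAE3D()
clip_a, clip_b = torch.randn(4, 3, 8, 64, 64), torch.randn(4, 3, 8, 64, 64)
rec, za = model(clip_a)
_, zb = model(clip_b)
loss = F.mse_loss(rec, clip_a) + info_nce(za, zb)
loss.backward()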
5

Vihlman, Mikko, and Arto Visala. "Optical Flow in Deep Visual Tracking." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (3 April 2020): 12112–19. http://dx.doi.org/10.1609/aaai.v34i07.6890.

Abstract:
Single-target tracking of generic objects is a difficult task since a trained tracker is given information present only in the first frame of a video. In recent years, increasingly many trackers have been based on deep neural networks that learn generic features relevant for tracking. This paper argues that deep architectures are often fit to learn implicit representations of optical flow. Optical flow is intuitively useful for tracking, but most deep trackers must learn it implicitly. This paper is among the first to study the role of optical flow in deep visual tracking. The architecture of a typical tracker is modified to reveal the presence of implicit representations of optical flow and to assess the effect of using the flow information more explicitly. The results show that the considered network learns implicitly an effective representation of optical flow. The implicit representation can be replaced by an explicit flow input without a notable effect on performance. Using the implicit and explicit representations at the same time does not improve tracking accuracy. The explicit flow input could allow constructing lighter networks for tracking.
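
An explicit optical-flow input of the kind this abstract refers to can be computed with OpenCV's Farneback algorithm, as in the short sketch below. The video path is a placeholder and the parameter values are standard defaults, not taken from the paper.

# Computing an explicit dense optical-flow input with OpenCV's Farneback
# algorithm; "video.mp4" is a placeholder path, not from the paper.
import cv2

cap = cv2.VideoCapture("video.mp4")
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # flow has shape (H, W, 2): per-pixel horizontal and vertical displacement
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    prev_gray = gray
    # `flow` could now be stacked with the RGB frame and fed to a tracker
cap.release()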
6

Rouast, Philipp V., and Marc T. P. Adam. "Learning Deep Representations for Video-Based Intake Gesture Detection." IEEE Journal of Biomedical and Health Informatics 24, no. 6 (June 2020): 1727–37. http://dx.doi.org/10.1109/jbhi.2019.2942845.
7

Li, Jialu, Aishwarya Padmakumar, Gaurav Sukhatme, and Mohit Bansal. "VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (24 March 2024): 18517–26. http://dx.doi.org/10.1609/aaai.v38i17.29813.

Abstract:
Outdoor Vision-and-Language Navigation (VLN) requires an agent to navigate through realistic 3D outdoor environments based on natural language instructions. The performance of existing VLN methods is limited by insufficient diversity in navigation environments and limited training data. To address these issues, we propose VLN-Video, which utilizes the diverse outdoor environments present in driving videos in multiple cities in the U.S. augmented with automatically generated navigation instructions and actions to improve outdoor VLN performance. VLN-Video combines the best of intuitive classical approaches and modern deep learning techniques, using template infilling to generate grounded non-repetitive navigation instructions, combined with an image rotation similarity based navigation action predictor to obtain VLN style data from driving videos for pretraining deep learning VLN models. We pre-train the model on the Touchdown dataset and our video-augmented dataset created from driving videos with three proxy tasks: Masked Language Modeling, Instruction and Trajectory Matching, and Next Action Prediction, so as to learn temporally-aware and visually-aligned instruction representations. The learned instruction representation is adapted to the state-of-the-art navigation agent when fine-tuning on the Touchdown dataset. Empirical results demonstrate that VLN-Video significantly outperforms previous state-of-the-art models by 2.1% in task completion rate, achieving a new state-of-the-art on the Touchdown dataset.
8

Hu, Yueyue, Shiliang Sun, Xin Xu, and Jing Zhao. "Multi-View Deep Attention Network for Reinforcement Learning (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 10 (3 April 2020): 13811–12. http://dx.doi.org/10.1609/aaai.v34i10.7177.

Abstract:
The representation approximated by a single deep network is usually limited for reinforcement learning agents. We propose a novel multi-view deep attention network (MvDAN), which introduces multi-view representation learning into the reinforcement learning task for the first time. The proposed model approximates a set of strategies from multiple representations and combines these strategies based on attention mechanisms to provide a comprehensive strategy for a single agent. Experimental results on eight Atari video games show that the MvDAN achieves more effective and competitive performance than single-view reinforcement learning methods.
9

Dong, Zhen, Chenchen Jing, Mingtao Pei, and Yunde Jia. "Deep CNN based binary hash video representations for face retrieval." Pattern Recognition 81 (September 2018): 357–69. http://dx.doi.org/10.1016/j.patcog.2018.04.014.
10

Psallidas, Theodoros, and Evaggelos Spyrou. "Video Summarization Based on Feature Fusion and Data Augmentation." Computers 12, no. 9 (15 September 2023): 186. http://dx.doi.org/10.3390/computers12090186.

Abstract:
During the last few years, several technological advances have led to an increase in the creation and consumption of audiovisual multimedia content. Users are overexposed to videos via several social media or video sharing websites and mobile phone applications. For efficient browsing, searching, and navigation across several multimedia collections and repositories, e.g., for finding videos that are relevant to a particular topic or interest, this ever-increasing content should be efficiently described by informative yet concise content representations. A common solution to this problem is the construction of a brief summary of a video, which could be presented to the user, instead of the full video, so that she/he could then decide whether to watch or ignore the whole video. Such summaries are ideally more expressive than other alternatives, such as brief textual descriptions or keywords. In this work, the video summarization problem is approached as a supervised classification task, which relies on feature fusion of audio and visual data. Specifically, the goal of this work is to generate dynamic video summaries, i.e., compositions of parts of the original video, which include its most essential video segments, while preserving the original temporal sequence. This work relies on annotated datasets on a per-frame basis, wherein parts of videos are annotated as being “informative” or “noninformative”, with the latter being excluded from the produced summary. The novelties of the proposed approach are, (a) prior to classification, a transfer learning strategy to use deep features from pretrained models is employed. These models have been used as input to the classifiers, making them more intuitive and robust to objectiveness, and (b) the training dataset was augmented by using other publicly available datasets. The proposed approach is evaluated using three datasets of user-generated videos, and it is demonstrated that deep features and data augmentation are able to improve the accuracy of video summaries based on human annotations. Moreover, it is domain independent, could be used on any video, and could be extended to rely on richer feature representations or include other data modalities.
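
The transfer-learning strategy described in this abstract (deep features from a pretrained model feeding a classifier) can be sketched as follows. The backbone choice, feature size, and classifier head are assumptions for illustration, and a recent torchvision is assumed; this is not the authors' pipeline.

# Sketch of the transfer-learning idea: a frozen pretrained CNN supplies
# per-frame deep features, and a small classifier labels each frame as
# informative (1) or noninformative (0). Illustrative, not the paper's code.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()            # expose the 512-d features
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False            # frozen feature extractor

classifier = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 2))

frames = torch.randn(16, 3, 224, 224)  # a batch of video frames
with torch.no_grad():
    feats = backbone(frames)           # (16, 512) deep features
logits = classifier(feats)             # per-frame informative/noninformative scores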
11

Liu, Shangdong, Puming Cao, Yujian Feng, Yimu Ji, Jiayuan Chen, Xuedong Xie, and Longji Wu. "NRVC: Neural Representation for Video Compression with Implicit Multiscale Fusion Network." Entropy 25, no. 8 (4 August 2023): 1167. http://dx.doi.org/10.3390/e25081167.

Abstract:
Recently, end-to-end deep models for video compression have made steady advancements. However, this resulted in a lengthy and complex pipeline containing numerous redundant parameters. The video compression approaches based on implicit neural representation (INR) allow videos to be directly represented as a function approximated by a neural network, resulting in a more lightweight model, whereas the singularity of the feature extraction pipeline limits the network’s ability to fit the mapping function for video frames. Hence, we propose a neural representation approach for video compression with an implicit multiscale fusion network (NRVC), utilizing normalized residual networks to improve the effectiveness of INR in fitting the target function. We propose the multiscale representations for video compression (MSRVC) network, which effectively extracts features from the input video sequence to enhance the degree of overfitting in the mapping function. Additionally, we propose the feature extraction channel attention (FECA) block to capture interaction information between different feature extraction channels, further improving the effectiveness of feature extraction. The results show that compared to the NeRV method with similar bits per pixel (BPP), NRVC has a 2.16% increase in the decoded peak signal-to-noise ratio (PSNR). Moreover, NRVC outperforms the conventional HEVC in terms of PSNR.
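
The implicit-neural-representation idea that NRVC builds on can be illustrated with a minimal NeRV-style sketch: a small network overfits the mapping from (positionally encoded) frame index to the frame itself, so the network weights become the compressed representation. The multiscale fusion and attention blocks of NRVC are not reproduced; all sizes are assumed.

# Minimal NeRV-style implicit representation of a toy video (illustrative only).
import torch
import torch.nn as nn

def encode_t(t, n_freq=8):              # sinusoidal positional encoding of frame index
    freqs = 2.0 ** torch.arange(n_freq) * torch.pi
    return torch.cat([torch.sin(t * freqs), torch.cos(t * freqs)], dim=-1)

class FrameDecoder(nn.Module):
    def __init__(self, h=64, w=64):
        super().__init__()
        self.h, self.w = h, w
        self.net = nn.Sequential(nn.Linear(16, 256), nn.GELU(),
                                 nn.Linear(256, 3 * h * w), nn.Sigmoid())

    def forward(self, t):                # t: (batch, 1) normalized frame indices
        return self.net(encode_t(t)).view(-1, 3, self.h, self.w)

video = torch.rand(32, 3, 64, 64)        # toy "video" of 32 frames
t = torch.linspace(0, 1, 32).unsqueeze(1)
model = FrameDecoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):                     # overfitting: the weights become the "codec"
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(t), video)
    loss.backward()
    opt.step()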
12

Pan, Haixia, Jiahua Lan, Hongqiang Wang, Yanan Li, Meng Zhang, Mojie Ma, Dongdong Zhang, and Xiaoran Zhao. "UWV-Yolox: A Deep Learning Model for Underwater Video Object Detection." Sensors 23, no. 10 (18 May 2023): 4859. http://dx.doi.org/10.3390/s23104859.

Abstract:
Underwater video object detection is a challenging task due to the poor quality of underwater videos, including blurriness and low contrast. In recent years, Yolo series models have been widely applied to underwater video object detection. However, these models perform poorly for blurry and low-contrast underwater videos. Additionally, they fail to account for the contextual relationships between the frame-level results. To address these challenges, we propose a video object detection model named UWV-Yolox. First, the Contrast Limited Adaptive Histogram Equalization method is used to augment the underwater videos. Then, a new CSP_CA module is proposed by adding Coordinate Attention to the backbone of the model to augment the representations of objects of interest. Next, a new loss function is proposed, including regression and jitter loss. Finally, a frame-level optimization module is proposed to optimize the detection results by utilizing the relationship between neighboring frames in videos, improving the video detection performance. To evaluate the performance of our model, we conduct experiments on the UVODD dataset built in this paper and select mAP@0.5 as the evaluation metric. The mAP@0.5 of the UWV-Yolox model reaches 89.0%, which is 3.2% better than the original Yolox model. Furthermore, compared with other object detection models, the UWV-Yolox model has more stable predictions for objects, and our improvements can be flexibly applied to other models.
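
The Contrast Limited Adaptive Histogram Equalization (CLAHE) preprocessing mentioned above can be sketched with OpenCV as follows, applied to the luminance channel of each frame. The clip limit, tile size, and file name are assumed values, not those used in the paper.

# CLAHE applied to the luminance channel of a frame, as a preprocessing step
# for low-contrast underwater video. Parameter values are assumptions.
import cv2

def enhance_frame(frame_bgr):
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)

frame = cv2.imread("underwater_frame.png")   # placeholder input image
enhanced = enhance_frame(frame)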
13

Gad, Gad, Eyad Gad, Korhan Cengiz, Zubair Fadlullah, and Bassem Mokhtar. "Deep Learning-Based Context-Aware Video Content Analysis on IoT Devices." Electronics 11, no. 11 (4 June 2022): 1785. http://dx.doi.org/10.3390/electronics11111785.

Abstract:
Integrating machine learning with the Internet of Things (IoT) enables many useful applications. For IoT applications that incorporate video content analysis (VCA), deep learning models are usually used due to their capacity to encode the high-dimensional spatial and temporal representations of videos. However, limited energy and computation resources present a major challenge. Video captioning is one type of VCA that describes a video with a sentence or a set of sentences. This work proposes an IoT-based deep learning-based framework for video captioning that can (1) Mine large open-domain video-to-text datasets to extract video-caption pairs that belong to a particular domain. (2) Preprocess the selected video-caption pairs including reducing the complexity of the captions’ language model to improve performance. (3) Propose two deep learning models: A transformer-based model and an LSTM-based model. Hyperparameter tuning is performed to select the best hyperparameters. Models are evaluated in terms of accuracy and inference time on different platforms. The presented framework generates captions in standard sentence templates to facilitate extracting information in later stages of the analysis. The two developed deep learning models offer a trade-off between accuracy and speed. While the transformer-based model yields a high accuracy of 97%, the LSTM-based model achieves near real-time inference.
14

Lin, Jie, Ling-Yu Duan, Shiqi Wang, Yan Bai, Yihang Lou, Vijay Chandrasekhar, Tiejun Huang, Alex Kot, and Wen Gao. "HNIP: Compact Deep Invariant Representations for Video Matching, Localization, and Retrieval." IEEE Transactions on Multimedia 19, no. 9 (September 2017): 1968–83. http://dx.doi.org/10.1109/tmm.2017.2713410.
15

Zhang, Huijun, Ling Feng, Ningyun Li, Zhanyu Jin, and Lei Cao. "Video-Based Stress Detection through Deep Learning." Sensors 20, no. 19 (28 September 2020): 5552. http://dx.doi.org/10.3390/s20195552.

Abstract:
Stress has become an increasingly serious problem in the current society, threatening mankind’s well-beings. With the ubiquitous deployment of video cameras in surroundings, detecting stress based on the contact-free camera sensors becomes a cost-effective and mass-reaching way without interference of artificial traits and factors. In this study, we leverage users’ facial expressions and action motions in the video and present a two-leveled stress detection network (TSDNet). TSDNet firstly learns face- and action-level representations separately, and then fuses the results through a stream weighted integrator with local and global attention for stress identification. To evaluate the performance of TSDNet, we constructed a video dataset containing 2092 labeled video clips, and the experimental results on the built dataset show that: (1) TSDNet outperformed the hand-crafted feature engineering approaches with detection accuracy 85.42% and F1-Score 85.28%, demonstrating the feasibility and effectiveness of using deep learning to analyze one’s face and action motions; and (2) considering both facial expressions and action motions could improve detection accuracy and F1-Score of that considering only face or action method by over 7%.
16

Jiang, Pin, and Yahong Han. "Reasoning with Heterogeneous Graph Alignment for Video Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (3 April 2020): 11109–16. http://dx.doi.org/10.1609/aaai.v34i07.6767.

Abstract:
The dominant video question answering methods are based on fine-grained representations or model-specific attention mechanisms. They usually process video and question separately, then feed the representations of different modalities into subsequent late fusion networks. Although these methods use information of one modality to boost the other, they neglect to integrate correlations of both inter- and intra-modality in a uniform module. We propose a deep heterogeneous graph alignment network over the video shots and question words. Furthermore, we explore the network architecture in four steps: representation, fusion, alignment, and reasoning. Within our network, the inter- and intra-modality information can be aligned and interacted simultaneously over the heterogeneous graph and used for cross-modal reasoning. We evaluate our method on three benchmark datasets and conduct an extensive ablation study of the effectiveness of the network architecture. Experiments show the network to be superior in quality.
17

Mumtaz, Nadia, Naveed Ejaz, Suliman Aladhadh, Shabana Habib, and Mi Young Lee. "Deep Multi-Scale Features Fusion for Effective Violence Detection and Control Charts Visualization." Sensors 22, no. 23 (1 December 2022): 9383. http://dx.doi.org/10.3390/s22239383.

Abstract:
The study of automated video surveillance systems using computer vision techniques is a hot research topic, and such systems have been deployed in many real-world CCTV environments. The main focus of current systems is higher accuracy, while the assistance of surveillance experts in effective data analysis and instant decision making using efficient computer vision algorithms needs researchers’ attention. In this research, to the best of our knowledge, we are the first to introduce a process control technique, control charts, for surveillance video data analysis. The control charts concept is merged with a novel deep learning-based violence detection framework. Different from the existing methods, the proposed technique considers the importance of spatial information, as well as temporal representations of the input video data, to detect human violence. The spatial information is fused with the temporal dimension of the deep learning model using a multi-scale strategy to ensure that the temporal information is properly assisted by the spatial representations at multiple levels. The proposed framework’s results are kept in the history-maintaining module of the control charts to validate the level of risk involved in the live input surveillance video. The detailed experimental results on the existing datasets and real-world video data demonstrate that the proposed approach is a prominent solution towards automated surveillance with pre- and post-analyses of violent events.
18

Wu, Lin, Yang Wang, Ling Shao, and Meng Wang. "3-D PersonVLAD: Learning Deep Global Representations for Video-Based Person Reidentification." IEEE Transactions on Neural Networks and Learning Systems 30, no. 11 (November 2019): 3347–59. http://dx.doi.org/10.1109/tnnls.2019.2891244.
19

Meshchaninov, Viacheslav Pavlovich, Ivan Andreevich Molodetskikh, Dmitriy Sergeevich Vatolin, and Alexey Gennadievich Voloboy. "Combining contrastive and supervised learning for video super-resolution detection." Keldysh Institute Preprints, no. 80 (2022): 1–13. http://dx.doi.org/10.20948/prepr-2022-80.

Abstract:
Upscaled video detection is a helpful tool in multimedia forensics, but it is a challenging task that involves various upscaling and compression algorithms. There are many resolution-enhancement methods, including interpolation and deep-learning-based super-resolution, and they leave unique traces. This paper proposes a new upscaled-resolution-detection method based on learning visual representations using contrastive and cross-entropy losses. To explain how the method detects videos, the major components of our framework are systematically reviewed; in particular, it is shown that most data-augmentation approaches hinder the learning of the method. Through extensive experiments on various datasets, our method has been shown to effectively detect upscaling even in compressed videos and to outperform the state-of-the-art alternatives. The code and models are publicly available at https://github.com/msu-video-group/SRDM.
20

Huang, Shaonian, Dongjun Huang, and Xinmin Zhou. "Learning Multimodal Deep Representations for Crowd Anomaly Event Detection." Mathematical Problems in Engineering 2018 (2018): 1–13. http://dx.doi.org/10.1155/2018/6323942.

Abstract:
Anomaly event detection in crowd scenes is extremely important; however, the majority of existing studies merely use hand-crafted features to detect anomalies. In this study, a novel unsupervised deep learning framework is proposed to detect anomaly events in crowded scenes. Specifically, low-level visual features, energy features, and motion map features are simultaneously extracted based on spatiotemporal energy measurements. Three convolutional restricted Boltzmann machines are trained to model the mid-level feature representation of normal patterns. Then a multimodal fusion scheme is utilized to learn the deep representation of crowd patterns. Based on the learned deep representation, a one-class support vector machine model is used to detect anomaly events. The proposed method is evaluated using two available public datasets and compared with state-of-the-art methods. The experimental results show its competitive performance for anomaly event detection in video surveillance.
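
The final stage described in this abstract, a one-class SVM fitted on deep representations of normal crowd patterns, can be sketched with scikit-learn as follows. The deep feature extraction is stubbed with random vectors, and the hyperparameters are assumptions.

# One-class SVM on deep representations of *normal* samples; low decision
# scores flag anomalies. Feature extraction is stubbed for illustration.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal_features = rng.normal(size=(500, 128))      # fused deep features (normal only)
test_features = rng.normal(size=(10, 128))

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(normal_features)
scores = ocsvm.decision_function(test_features)    # lower = more anomalous
labels = ocsvm.predict(test_features)              # +1 normal, -1 anomaly
print(labels)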
21

Kumar, Vidit, Vikas Tripathi, Bhaskar Pant, Sultan S. Alshamrani, Ankur Dumka, Anita Gehlot, Rajesh Singh, Mamoon Rashid, Abdullah Alshehri, and Ahmed Saeed AlGhamdi. "Hybrid Spatiotemporal Contrastive Representation Learning for Content-Based Surgical Video Retrieval." Electronics 11, no. 9 (24 April 2022): 1353. http://dx.doi.org/10.3390/electronics11091353.

Abstract:
In the medical field, due to their economic and clinical benefits, there is a growing interest in minimally invasive surgeries and microscopic surgeries. These types of surgeries are often recorded during operations, and these recordings have become a key resource for education, patient disease analysis, surgical error analysis, and surgical skill assessment. However, manual searching in this collection of long-term surgical videos is an extremely labor-intensive and long-term task, requiring an effective content-based video analysis system. In this regard, previous methods for surgical video retrieval are based on handcrafted features which do not represent the video effectively. On the other hand, deep learning-based solutions were found to be effective in both surgical image and video analysis, where CNN-, LSTM- and CNN-LSTM-based methods were proposed in most surgical video analysis tasks. In this paper, we propose a hybrid spatiotemporal embedding method to enhance spatiotemporal representations using an adaptive fusion layer on top of the LSTM and temporal causal convolutional modules. To learn surgical video representations, we propose exploring the supervised contrastive learning approach to leverage label information in addition to augmented versions. By validating our approach to a video retrieval task on two datasets, Surgical Actions 160 and Cataract-101, we significantly improve on previous results in terms of mean average precision, 30.012 ± 1.778 vs. 22.54 ± 1.557 for Surgical Actions 160 and 81.134 ± 1.28 vs. 33.18 ± 1.311 for Cataract-101. We also validate the proposed method’s suitability for surgical phase recognition task using the benchmark Cholec80 surgical dataset, where our approach outperforms (with 90.2% accuracy) the state of the art.
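
The supervised contrastive objective this abstract refers to can be illustrated with a SupCon-style loss over L2-normalized clip embeddings: clips sharing a label are pulled together and all others pushed apart. The temperature and the source of the embeddings are assumptions; this is a sketch, not the authors' implementation.

# SupCon-style loss over clip embeddings (illustrative only).
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, tau=0.1):
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / tau
    self_mask = torch.eye(z.size(0), dtype=torch.bool)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))            # drop self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos_mask.sum(1)
    mean_pos = log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_count.clamp(min=1)
    return -mean_pos[pos_count > 0].mean()                     # ignore anchors w/o positives

emb = torch.randn(8, 128, requires_grad=True)
lab = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
supervised_contrastive_loss(emb, lab).backward()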
22

Xu, Ming, Xiaosheng Yu, Dongyue Chen, Chengdong Wu, and Yang Jiang. "An Efficient Anomaly Detection System for Crowded Scenes Using Variational Autoencoders." Applied Sciences 9, no. 16 (14 August 2019): 3337. http://dx.doi.org/10.3390/app9163337.

Abstract:
Anomaly detection in crowded scenes is an important and challenging part of the intelligent video surveillance system. As the deep neural networks make success in feature representation, the features extracted by a deep neural network represent the appearance and motion patterns in different scenes more specifically, comparing with the hand-crafted features typically used in the traditional anomaly detection approaches. In this paper, we propose a new baseline framework of anomaly detection for complex surveillance scenes based on a variational auto-encoder with convolution kernels to learn feature representations. Firstly, the raw frames series are provided as input to our variational auto-encoder without any preprocessing to learn the appearance and motion features of the receptive fields. Then, multiple Gaussian models are used to predict the anomaly scores of the corresponding receptive fields. Our proposed two-stage anomaly detection system is evaluated on the video surveillance dataset for a large scene, UCSD pedestrian datasets, and yields competitive performance compared with state-of-the-art methods.
23

Bohunicky, Kyle Matthew. "Dear Punchy." Animal Crossing Special Issue 13, no. 22 (16 February 2021): 39–58. http://dx.doi.org/10.7202/1075262ar.

Abstract:
This article explores how the Animal Crossing series represents and invites players to practice writing. Adopting several frameworks including media speleology, affect theory, and writing studies, this article argues that the representation of writing in the first game in the Animal Crossing series, Animal Forest, resists both the technological and gendered histories typically ascribed to writing and video games. Turning to the ways that players actually practice writing, this article suggests that affect plays a key role in the deep connections that players develop with fellow villagers through the act of letter writing. Ultimately, this article calls for further examination of writing’s role in the cultural significance of Animal Crossing and careful study of its representations in other video games.
24

Rezaei, Fariba, and Mehran Yazdi. "A New Semantic and Statistical Distance-Based Anomaly Detection in Crowd Video Surveillance." Wireless Communications and Mobile Computing 2021 (15 May 2021): 1–9. http://dx.doi.org/10.1155/2021/5513582.

Abstract:
Recently, attention toward autonomous surveillance has been intensified and anomaly detection in crowded scenes is one of those significant surveillance tasks. Traditional approaches include the extraction of handcrafted features that need the subsequent task of model learning. They are mostly used to extract low-level spatiotemporal features of videos, neglecting the effect of semantic information. Recently, deep learning (DL) methods have been emerged in various domains, especially CNN for visual problems, with the ability to extract high-level information at higher layers of their architectures. On the other side, topic modeling-based approaches like NMF can extract more semantic representations. Here, we investigate a new hybrid visual embedding method based on deep features and a topic model for anomaly detection. Features per frame are computed hierarchically through a pretrained deep model, and in parallel, topic distributions are learned through multilayer nonnegative matrix factorization entangling information from extracted deep features. Training is accomplished through normal samples. Thereafter, K -means is applied to find typical normal clusters. At test time, after achieving feature representation through deep model and topic distribution for test frames, a statistical earth mover distance (EMD) metric is evaluated to measure the difference between normal cluster centroids and test topic distributions. High difference versus a threshold is detected as an anomaly. Experimental results on the benchmark Ped1 and Ped2 UCSD datasets demonstrate the effectiveness of our proposed method in anomaly detection.
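
The statistical side of the pipeline described above can be sketched with scikit-learn and SciPy: topic distributions from NMF, K-means centroids on normal frames, and a one-dimensional earth mover's distance to the nearest centroid as the anomaly score. Deep feature extraction is omitted, the data is synthetic, and the threshold is an arbitrary assumption.

# NMF topics + K-means centroids + EMD anomaly score (illustrative only).
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
normal = np.abs(rng.normal(size=(200, 64)))          # stand-in for deep features
test = np.abs(rng.normal(size=(5, 64)))

nmf = NMF(n_components=10, init="nndsvda", max_iter=500).fit(normal)

def to_topics(x):
    w = nmf.transform(x)
    return w / w.sum(axis=1, keepdims=True)          # normalized topic distributions

centroids = KMeans(n_clusters=4, n_init=10, random_state=0).fit(to_topics(normal)).cluster_centers_
support = np.arange(10)                              # topic indices as EMD support

for dist in to_topics(test):
    score = min(wasserstein_distance(support, support, dist, c) for c in centroids)
    print("anomalous" if score > 0.1 else "normal", round(score, 3))   # assumed threshold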
25

Dong, Wenkai, Zhaoxiang Zhang, and Tieniu Tan. "Attention-Aware Sampling via Deep Reinforcement Learning for Action Recognition." Proceedings of the AAAI Conference on Artificial Intelligence 33 (17 July 2019): 8247–54. http://dx.doi.org/10.1609/aaai.v33i01.33018247.

Abstract:
Deep learning based methods have achieved remarkable progress in action recognition. Existing works mainly focus on designing novel deep architectures to achieve video representation learning for action recognition. Most methods treat sampled frames equally and average all the frame-level predictions at the testing stage. However, within a video, discriminative actions may occur sparsely in a few frames, and most other frames are irrelevant to the ground truth and may even lead to a wrong prediction. As a result, we think that the strategy of selecting relevant frames would be a further important key to enhance existing deep learning based action recognition. In this paper, we propose an attention-aware sampling method for action recognition, which aims to discard the irrelevant and misleading frames and preserve the most discriminative frames. We formulate the process of mining key frames from videos as a Markov decision process and train the attention agent through deep reinforcement learning without extra labels. The agent takes features and predictions from the baseline model as input and generates importance scores for all frames. Moreover, our approach is extensible and can be applied to different existing deep learning based action recognition models. We achieve very competitive action recognition performance on two widely used action recognition datasets.
26

Nida, Nudrat, Muhammad Haroon Yousaf, Aun Irtaza, and Sergio A. Velastin. "Instructor Activity Recognition through Deep Spatiotemporal Features and Feedforward Extreme Learning Machines." Mathematical Problems in Engineering 2019 (30 April 2019): 1–13. http://dx.doi.org/10.1155/2019/2474865.

Abstract:
Human action recognition has the potential to predict the activities of an instructor within the lecture room. Evaluation of lecture delivery can help teachers analyze shortcomings and plan lectures more effectively. However, manual or peer evaluation is time-consuming, tedious and sometimes it is difficult to remember all the details of the lecture. Therefore, automation of lecture delivery evaluation significantly improves teaching style. In this paper, we propose a feedforward learning model for instructor’s activity recognition in the lecture room. The proposed scheme represents a video sequence in the form of a single frame to capture the motion profile of the instructor by observing the spatiotemporal relation within the video frames. First, we segment the instructor silhouettes from input videos using graph-cut segmentation and generate a motion profile. These motion profiles are centered by obtaining the largest connected components and normalized. Then, these motion profiles are represented in the form of feature maps by a deep convolutional neural network. Then, an extreme learning machine (ELM) classifier is trained over the obtained feature representations to recognize eight different activities of the instructor within the classroom. For the evaluation of the proposed method, we created an instructor activity video (IAVID-1) dataset and compared our method against different state-of-the-art activity recognition methods. Furthermore, two standard datasets, MuHAVI and IXMAS, were also considered for the evaluation of the proposed scheme.
27

He, Dongliang, Zhichao Zhou, Chuang Gan, Fu Li, Xiao Liu, Yandong Li, Limin Wang, and Shilei Wen. "StNet: Local and Global Spatial-Temporal Modeling for Action Recognition." Proceedings of the AAAI Conference on Artificial Intelligence 33 (17 July 2019): 8401–8. http://dx.doi.org/10.1609/aaai.v33i01.33018401.

Abstract:
Despite the success of deep learning for static image understanding, it remains unclear what the most effective network architectures for spatial-temporal modeling in videos are. In this paper, in contrast to the existing CNN+RNN or pure 3D convolution based approaches, we explore a novel spatial-temporal network (StNet) architecture for both local and global modeling in videos. Particularly, StNet stacks N successive video frames into a super-image which has 3N channels and applies 2D convolution on super-images to capture local spatial-temporal relationships. To model global spatial-temporal structure, we apply temporal convolution on the local spatial-temporal feature maps. Specifically, a novel temporal Xception block is proposed in StNet, which employs a separate channel-wise and temporal-wise convolution over the feature sequence of a video. Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches in action recognition and can strike a satisfying trade-off between recognition accuracy and model complexity. We further demonstrate the generalization performance of the learned video representations on the UCF101 dataset.
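
The "super-image" construction described in this abstract is easy to sketch: N consecutive RGB frames are stacked along the channel axis into a 3N-channel image so that ordinary 2D convolutions see local temporal context. The batch size, resolution, and output channels below are arbitrary illustrative choices.

# StNet-style super-image: stack N frames into one 3N-channel image.
import torch
import torch.nn as nn

B, N, H, W = 2, 5, 112, 112
frames = torch.randn(B, N, 3, H, W)            # (batch, frames, channels, H, W)

super_image = frames.reshape(B, N * 3, H, W)   # stack along the channel axis
conv2d = nn.Conv2d(in_channels=N * 3, out_channels=64, kernel_size=3, padding=1)
local_st_features = conv2d(super_image)        # local spatial-temporal features
print(local_st_features.shape)                 # torch.Size([2, 64, 112, 112])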
28

Swinney, Carolyn J., and John C. Woods. "Unmanned Aerial Vehicle Operating Mode Classification Using Deep Residual Learning Feature Extraction." Aerospace 8, no. 3 (16 March 2021): 79. http://dx.doi.org/10.3390/aerospace8030079.

Abstract:
Unmanned Aerial Vehicles (UAVs) undoubtedly pose many security challenges. We need only look to the December 2018 Gatwick Airport incident for an example of the disruption UAVs can cause. In total, 1000 flights were grounded for 36 h over the Christmas period which was estimated to cost over 50 million pounds. In this paper, we introduce a novel approach which considers UAV detection as an imagery classification problem. We consider signal representations Power Spectral Density (PSD); Spectrogram, Histogram and raw IQ constellation as graphical images presented to a deep Convolution Neural Network (CNN) ResNet50 for feature extraction. Pre-trained on ImageNet, transfer learning is utilised to mitigate the requirement for a large signal dataset. We evaluate performance through machine learning classifier Logistic Regression. Three popular UAVs are classified in different modes; switched on; hovering; flying; flying with video; and no UAV present, creating a total of 10 classes. Our results, validated with 5-fold cross validation and an independent dataset, show PSD representation to produce over 91% accuracy for 10 classifications. Our paper treats UAV detection as an imagery classification problem by presenting signal representations as images to a ResNet50, utilising the benefits of transfer learning and outperforming previous work in the field.
29

Zhao, Hu, Yanyun Shen, Zhipan Wang, and Qingling Zhang. "MFACNet: A Multi-Frame Feature Aggregating and Inter-Feature Correlation Framework for Multi-Object Tracking in Satellite Videos." Remote Sensing 16, no. 9 (30 April 2024): 1604. http://dx.doi.org/10.3390/rs16091604.

Abstract:
Efficient multi-object tracking (MOT) in satellite videos is crucial for numerous applications, ranging from surveillance to environmental monitoring. Existing methods often struggle with effectively exploring the correlation and contextual cues inherent in the consecutive features of video sequences, resulting in redundant feature inference and unreliable motion estimation for tracking. To address these challenges, we propose the MFACNet, a novel multi-frame features aggregating and inter-feature correlation framework for enhancing MOT in satellite videos with the idea of utilizing the features of consecutive frames. The MFACNet integrates multi-frame feature aggregation techniques with inter-feature correlation mechanisms to improve tracking accuracy and robustness. Specifically, our framework leverages temporal information across the features of consecutive frames to capture contextual cues and refine object representations over time. Moreover, we introduce a mechanism to explicitly model the correlations between adjacent features in video sequences, facilitating a more accurate motion estimation and trajectory associations. We evaluated the MFACNet using benchmark datasets for satellite-based video MOT tasks and demonstrated its superiority in terms of tracking accuracy and robustness over state-of-the-art performance by 2.0% in MOTA and 1.6% in IDF1. Our experimental results highlight the potential of precisely utilizing deep features from video sequences.
30

Kulvinder Singh, et al. "Enhancing Multimodal Information Retrieval Through Integrating Data Mining and Deep Learning Techniques." International Journal on Recent and Innovation Trends in Computing and Communication 11, no. 9 (30 October 2023): 560–69. http://dx.doi.org/10.17762/ijritcc.v11i9.8844.

Abstract:
Multimodal information retrieval, the task of retrieving relevant information from heterogeneous data sources such as text, images, and videos, has gained significant attention in recent years due to the proliferation of multimedia content on the internet. This paper proposes an approach to enhance multimodal information retrieval by integrating data mining and deep learning techniques. Traditional information retrieval systems often struggle to effectively handle multimodal data due to the inherent complexity and diversity of such data sources. In this study, we leverage data mining techniques to preprocess and structure multimodal data efficiently. Data mining methods enable us to extract valuable patterns, relationships, and features from different modalities, providing a solid foundation for subsequent retrieval tasks. To further enhance the performance of multimodal information retrieval, deep learning techniques are employed. Deep neural networks have demonstrated their effectiveness in various multimedia tasks, including image recognition, natural language processing, and video analysis. By integrating deep learning models into our retrieval framework, we aim to capture complex intermodal dependencies and semantically rich representations, enabling more accurate and context-aware retrieval.
31

Govender, Divina, and Jules-Raymond Tapamo. "Spatio-Temporal Scale Coded Bag-of-Words." Sensors 20, no. 21 (9 November 2020): 6380. http://dx.doi.org/10.3390/s20216380.

Abstract:
The Bag-of-Words (BoW) framework has been widely used in action recognition tasks due to its compact and efficient feature representation. Various modifications have been made to this framework to increase its classification power. This often results in an increased complexity and reduced efficiency. Inspired by the success of image-based scale coded BoW representations, we propose a spatio-temporal scale coded BoW (SC-BoW) for video-based recognition. This involves encoding extracted multi-scale information into BoW representations by partitioning spatio-temporal features into sub-groups based on the spatial scale from which they were extracted. We evaluate SC-BoW in two experimental setups. We first present a general pipeline to perform real-time action recognition with SC-BoW. Secondly, we apply SC-BoW onto the popular Dense Trajectory feature set. Results showed SC-BoW representations to successfully improve performance by 2–7% with low added computational cost. Notably, SC-BoW on Dense Trajectories outperformed more complex deep learning approaches. Thus, scale coding is a low-cost and low-level encoding scheme that increases classification power of the standard BoW without compromising efficiency.
32

Huang, Haofeng, Wenhan Yang, Lingyu Duan, and Jiaying Liu. "Seeing Dark Videos via Self-Learned Bottleneck Neural Representation." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 3 (24 March 2024): 2321–29. http://dx.doi.org/10.1609/aaai.v38i3.28006.

Abstract:
Enhancing low-light videos in a supervised style presents a set of challenges, including limited data diversity, misalignment, and the domain gap introduced through the dataset construction pipeline. Our paper tackles these challenges by constructing a self-learned enhancement approach that gets rid of the reliance on any external training data. The challenge of self-supervised learning lies in fitting high-quality signal representations solely from input signals. Our work designs a bottleneck neural representation mechanism that extracts those signals. More in detail, we encode the frame-wise representation with a compact deep embedding and utilize a neural network to parameterize the video-level manifold consistently. Then, an entropy constraint is applied to the enhanced results based on the adjacent spatial-temporal context to filter out the degraded visual signals, e.g. noise and frame inconsistency. Last, a novel Chromatic Retinex decomposition is proposed to effectively align the reflectance distribution temporally. It benefits the entropy control on different components of each frame and facilitates noise-to-noise training, successfully suppressing the temporal flicker. Extensive experiments demonstrate the robustness and superior effectiveness of our proposed method. Our project is publicly available at: https://huangerbai.github.io/SLBNR/.
33

Dhar, Moloy. "Object Detection using Deep Learning Approach." International Journal for Research in Applied Science and Engineering Technology 10, no. 6 (30 June 2022): 2963–69. http://dx.doi.org/10.22214/ijraset.2022.44417.

Abstract:
The most often utilized strategies for current deep learning models to accomplish a multitude of activities on devices are mobile networks and binary neural networks. In this research, we propose a method for identifying an object using the pretrained deep learning model MobileNet for Single Shot Multi-Box Detector (SSD). This technique is utilized for real-time detection as well as webcams to detect the object in a video feed. To construct the module, we use the MobileNet and SSD frameworks to provide a faster and more effective deep learning-based object detection approach. Deep learning has evolved into a powerful machine learning technology that incorporates multiple layers of features or representations of data to get cutting-edge results. Deep learning has demonstrated outstanding performance in a variety of fields, including picture classification, segmentation, and object detection. Deep learning approaches have recently made significant progress in fine-grained picture categorization, which tries to differentiate subordinate-level categories. The major goal of our study is to investigate the accuracy of an object identification method called SSD, as well as the significance of a pre-trained deep learning model called MobileNet. To perform this challenge of detecting an item in an image or video, I used OpenCV libraries, Python, and NumPy. This enhances the accuracy of behavior recognition at a processing speed required for real-time detection and daily monitoring requirements indoors and outdoors.
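
A MobileNet-backed SSD detector similar in spirit to the setup described above is available in torchvision, as sketched below. This assumes a recent torchvision release and a random tensor as a stand-in for a video frame; the paper itself works with OpenCV, so this is an approximation, not the author's pipeline.

# SSDLite with a MobileNetV3 backbone applied to one frame (illustrative only).
import torch
from torchvision.models.detection import ssdlite320_mobilenet_v3_large

model = ssdlite320_mobilenet_v3_large(weights="DEFAULT").eval()

frame = torch.rand(3, 320, 320)                 # stand-in for a video frame in [0, 1]
with torch.no_grad():
    detections = model([frame])[0]              # dict with boxes, labels, scores

keep = detections["scores"] > 0.5               # confidence threshold (assumed)
print(detections["boxes"][keep], detections["labels"][keep])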
34

Mishra, Vaishnavi. "Synthetic Media Analysis Using Deep Learning." International Journal of Scientific Research in Engineering and Management 08, no. 05 (7 May 2024): 1–5. http://dx.doi.org/10.55041/ijsrem32494.

Abstract:
In an era characterized by the rapid evolution of digital content creation, synthetic media, particularly deepfake videos, present a formidable challenge to the veracity and integrity of online information. Addressing this challenge requires sophisticated analytical techniques capable of discerning between authentic and manipulated media. This research paper presents a comprehensive study on synthetic media analysis leveraging deep learning methodologies. The suggested method integrates advanced deep learning models, including Convolutional Neural Networks (CNNs) like VGG (Visual Geometry Group), alongside recurrent structures such as LSTM (Long Short-Term Memory). These models are trained and evaluated on a meticulously curated dataset, ensuring diversity and relevance in the synthetic media samples. To facilitate experimentation and reproducibility, the dataset is securely hosted on a reliable platform such as Google Drive. Prior to model training, preprocessing steps including frame extraction are employed to extract essential visual features from the video data. The VGG model serves as a feature extractor, capturing high-level representations of visual content, while the LSTM model learns temporal dependencies and contextual information across frames. Following comprehensive experimentation, the proposed method's ability to detect synthetic media is thoroughly assessed, utilizing metrics such as accuracy. This research contributes to the ongoing discourse on digital media forensics by providing insights into the efficacy of deep learning techniques for synthetic media analysis. The findings underscore the importance of continuous research and development in combating the proliferation of synthetic media, thereby safeguarding the authenticity and trustworthiness of online content. Index Terms—CNN, LSTM, VGG
35

Thakur, Amey. "Generative Adversarial Networks." International Journal for Research in Applied Science and Engineering Technology 9, no. 8 (31 August 2021): 2307–25. http://dx.doi.org/10.22214/ijraset.2021.37723.

Abstract:
Abstract: Deep learning's breakthrough in the field of artificial intelligence has resulted in the creation of a slew of deep learning models. One of these is the Generative Adversarial Network, which has only recently emerged. The goal of GAN is to use unsupervised learning to analyse the distribution of data and create more accurate results. The GAN allows the learning of deep representations in the absence of substantial labelled training information. Computer vision, language and video processing, and image synthesis are just a few of the applications that might benefit from these representations. The purpose of this research is to get the reader conversant with the GAN framework as well as to provide the background information on Generative Adversarial Networks, including the structure of both the generator and discriminator, as well as the various GAN variants along with their respective architectures. Applications of GANs are also discussed with examples. Keywords: Generative Adversarial Networks (GANs), Generator, Discriminator, Supervised and Unsupervised Learning, Discriminative and Generative Modelling, Backpropagation, Loss Functions, Machine Learning, Deep Learning, Neural Networks, Convolutional Neural Network (CNN), Deep Convolutional GAN (DCGAN), Conditional GAN (cGAN), Information Maximizing GAN (InfoGAN), Stacked GAN (StackGAN), Pix2Pix, Wasserstein GAN (WGAN), Progressive Growing GAN (ProGAN), BigGAN, StyleGAN, CycleGAN, Super-Resolution GAN (SRGAN), Image Synthesis, Image-to-Image Translation.
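
The adversarial objective surveyed in this abstract can be illustrated with a minimal generator/discriminator training step on toy 2-D data. The network sizes, data distribution, and learning rates are arbitrary choices for illustration.

# Minimal adversarial training step: D learns to separate real from generated
# samples, G learns to fool it. Toy 2-D data only.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))        # generator
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))         # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(200):
    real = torch.randn(64, 2) * 0.5 + 2.0                                # "real" data
    fake = G(torch.randn(64, 16))
    # Discriminator update: real -> 1, fake -> 0
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator update: make D believe fakes are real
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()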
36

Wang, Bokun, Caiqian Yang, and Yaojing Chen. "Detection Anomaly in Video Based on Deep Support Vector Data Description." Computational Intelligence and Neuroscience 2022 (4 May 2022): 1–6. http://dx.doi.org/10.1155/2022/5362093.

Abstract:
Video surveillance systems have been widely deployed in public places such as shopping malls, hospitals, banks, and streets to improve the safety of public life and assets. In most cases, how to detect video abnormal events in a timely and accurate manner is the main goal of social public safety risk prevention and control. Due to the ambiguity of anomaly definition, the scarcity of anomalous data, as well as the complex environmental background and human behavior, video anomaly detection is a major problem in the field of computer vision. Existing anomaly detection methods based on deep learning often use trained networks to extract features. These methods are based on existing network structures, instead of designing networks for the goal of anomaly detection. This paper proposed a method based on Deep Support Vector Data Description (DSVDD). By learning a deep neural network, the input normal sample space can be mapped to the smallest hypersphere. Through DSVDD, not only can the smallest size data hypersphere be found to establish SVDD but also useful data feature representations and normal models can be learned. In the test, the samples mapped inside the hypersphere are judged as normal, while the samples mapped outside the hypersphere are judged as abnormal. The proposed method achieves 86.84% and 73.2% frame-level AUC on the CUHK Avenue and ShanghaiTech Campus datasets, respectively. By comparison, the detection results achieved by the proposed method are better than those achieved by the existing state-of-the-art methods.
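
The core Deep SVDD objective described above (mapping normal samples into the smallest hypersphere around a center and scoring test samples by their distance to that center) can be sketched as follows. Toy feature vectors stand in for video frames, and the tiny encoder is an assumption, not the paper's network.

# Deep SVDD sketch: minimize the mean squared distance of normal samples to a
# fixed center c; use the distance as the anomaly score at test time.
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 16, bias=False))
normal_data = torch.randn(256, 64)

with torch.no_grad():
    c = enc(normal_data).mean(0)                 # center = mean of initial embeddings

opt = torch.optim.Adam(enc.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    dist = ((enc(normal_data) - c) ** 2).sum(1)  # squared distance to the center
    dist.mean().backward()                       # shrink the hypersphere around c
    opt.step()

test = torch.randn(5, 64)
anomaly_score = ((enc(test) - c) ** 2).sum(1)    # larger = more likely abnormal
print(anomaly_score)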
37

Chen, Shuang, Zengcai Wang, and Wenxin Chen. "Driver Drowsiness Estimation Based on Factorized Bilinear Feature Fusion and a Long-Short-Term Recurrent Convolutional Network." Information 12, no. 1 (22 December 2020): 3. http://dx.doi.org/10.3390/info12010003.

Abstract:
The effective detection of driver drowsiness is an important measure to prevent traffic accidents. Most existing drowsiness detection methods only use a single facial feature to identify fatigue status, ignoring the complex correlation between fatigue features and the time information of fatigue features, and this reduces the recognition accuracy. To solve these problems, we propose a driver sleepiness estimation model based on factorized bilinear feature fusion and a long- short-term recurrent convolutional network to detect driver drowsiness efficiently and accurately. The proposed framework includes three models: fatigue feature extraction, fatigue feature fusion, and driver drowsiness detection. First, we used a convolutional neural network (CNN) to effectively extract the deep representation of eye and mouth-related fatigue features from the face area detected in each video frame. Then, based on the factorized bilinear feature fusion model, we performed a nonlinear fusion of the deep feature representations of the eyes and mouth. Finally, we input a series of fused frame-level features into a long-short-term memory (LSTM) unit to obtain the time information of the features and used the softmax classifier to detect sleepiness. The proposed framework was evaluated with the National Tsing Hua University drowsy driver detection (NTHU-DDD) video dataset. The experimental results showed that this method had better stability and robustness compared with other methods.
Styles APA, Harvard, Vancouver, ISO, etc.
38

Rezaei, Behnaz, Yiorgos Christakis, Bryan Ho, Kevin Thomas, Kelley Erb, Sarah Ostadabbas et Shyamal Patel. « Target-Specific Action Classification for Automated Assessment of Human Motor Behavior from Video ». Sensors 19, no 19 (1 octobre 2019) : 4266. http://dx.doi.org/10.3390/s19194266.

Texte intégral
Résumé :
Objective monitoring and assessment of human motor behavior can improve the diagnosis and management of several medical conditions. Over the past decade, significant advances have been made in the use of wearable technology for continuously monitoring human motor behavior in free-living conditions. However, wearable technology remains ill-suited for applications that require monitoring and interpretation of complex motor behaviors (e.g., those involving interactions with the environment). Recent advances in computer vision and deep learning have opened up new possibilities for extracting information from video recordings. In this paper, we present a hierarchical vision-based behavior phenotyping method for classifying basic human actions in video recordings made with a single RGB camera. Our method addresses the challenges of tracking multiple human actors and classifying actions in videos recorded in changing environments with different fields of view. We implement a cascaded pose tracker that uses temporal relationships between detections for short-term tracking and appearance-based tracklet fusion for long-term tracking. For action classification, we use pose evolution maps derived from the cascaded pose tracker as low-dimensional, interpretable representations of the movement sequences to train a convolutional neural network. The cascaded pose tracker achieves an average accuracy of 88% in tracking the target human actor in our video recordings, and the overall system achieves an average test accuracy of 84% for target-specific action classification in untrimmed video recordings.
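As an illustration of this kind of representation, the following sketch rasterizes a tracked pose sequence into an image-like map in which the channel encodes the joint and the pixel intensity encodes temporal order. The resolution and encoding scheme are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def pose_evolution_map(keypoints, height=64, width=64):
    """Rasterize a tracked pose sequence into a compact image-like representation.

    keypoints: array of shape (T, J, 2) with per-frame (x, y) joint positions
    normalized to [0, 1]. Each joint gets one channel; pixel intensity encodes
    when in the sequence the joint visited that location (later frames brighter).
    """
    T, J, _ = keypoints.shape
    maps = np.zeros((J, height, width), dtype=np.float32)
    for t in range(T):
        weight = (t + 1) / T                       # encode temporal order as intensity
        for j in range(J):
            x, y = keypoints[t, j]
            col = min(int(x * (width - 1)), width - 1)
            row = min(int(y * (height - 1)), height - 1)
            maps[j, row, col] = max(maps[j, row, col], weight)
    return maps

# Usage: 30 frames of 17 COCO-style joints -> a (17, 64, 64) input for a 2D CNN.
demo = np.random.rand(30, 17, 2)
print(pose_evolution_map(demo).shape)
```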
Styles APA, Harvard, Vancouver, ISO, etc.
39

Bourai, Nour, Hayet Farida Merouani et Akila Djebbar. « Advanced Image Compression Techniques for Medical Applications : Survey ». All Sciences Abstracts 1, no 1 (16 avril 2023) : 1. http://dx.doi.org/10.59287/as-abstracts.444.

Texte intégral
Résumé :
The field of artificial intelligence has grown significantly in the past decade, with deep learning proving particularly promising due to its ability to learn complex feature representations from data. One area where deep learning has shown promise is image compression, which is important for applications such as medical imaging, remote sensing, and video streaming. Traditional compression methods such as JPEG, JPEG 2000, and PNG have been used for decades, but recent advances in deep learning have led to techniques that train deep neural networks to learn a compressed representation of the image data. This survey focuses on research applications of deep learning in image compression and reviews recent work on using deep learning techniques to compress and accelerate deep neural networks, including pruning, quantization, and low-rank factorization methods. In addition, we discuss the use of deep learning techniques to minimize compression defects, such as block artifacts and ringing, which can degrade image quality. The paper provides an overview of popular methods and recent work in the field, highlighting their characteristics, advantages, and shortcomings. We also discuss challenges and open research questions, such as the trade-off between compression efficiency and reconstruction quality and the need for standardized evaluation metrics for comparing compression methods. Overall, this survey aims to provide a comprehensive understanding of current trends in image compression using deep learning techniques and their potential to revolutionize the field. By exploring the advantages and limitations of these methods, we hope to facilitate further research and development in this area.
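To make the learned-compression idea concrete, here is a minimal sketch of a toy convolutional autoencoder with straight-through quantization in the bottleneck. The layer sizes, number of quantization levels, and the omission of an explicit rate term are simplifications for illustration, not a description of any surveyed method.

```python
import torch
import torch.nn as nn

class TinyCompressor(nn.Module):
    """Toy learned image codec: conv encoder -> coarse quantization -> conv decoder."""
    def __init__(self, levels=16):
        super().__init__()
        self.levels = levels
        self.enc = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 4, 3, stride=2, padding=1), nn.Sigmoid(),   # codes in [0, 1]
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(4, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(8, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.enc(x)
        # Straight-through quantization: rounded in the forward pass,
        # identity gradient in the backward pass.
        q = code + (torch.round(code * (self.levels - 1)) / (self.levels - 1) - code).detach()
        return self.dec(q), q

# Usage: reconstruct a batch of 64x64 grayscale images (rate term omitted here).
model = TinyCompressor()
x = torch.rand(2, 1, 64, 64)
recon, code = model(x)
loss = nn.functional.mse_loss(recon, x)
```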
Styles APA, Harvard, Vancouver, ISO, etc.
40

Mai Magdy, Fahima A. Maghraby et Mohamed Waleed Fakhr. « A 4D Convolutional Neural Networks for Video Violence Detection ». Journal of Advanced Research in Applied Sciences and Engineering Technology 36, no 1 (24 décembre 2023) : 16–25. http://dx.doi.org/10.37934/araset.36.1.1625.

Texte intégral
Résumé :
As global crime has escalated, surveillance cameras have become widespread and will continue to proliferate. Given the large volume of video they produce, systems are needed that automatically look for suspicious activity and send an online alert when they find it. This paper presents a deep learning architecture based on video-level four-dimensional convolutional neural networks. The proposed architecture consists of residual blocks combined with three-dimensional convolutional neural networks (3D CNNs). It aims to learn short-term and long-term spatiotemporal representations from video, as well as the interactions between clips. ResNet50 serves as the backbone of the three-dimensional convolutional networks, with dense optical flow applied to the region of concern. The proposed architecture is tested on the RWF2000 dataset and achieves a test accuracy of 94.75%, higher than other state-of-the-art methods.
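A minimal sketch of the clip-level idea follows: a shared 3D CNN encodes each clip, and a convolution across the clip axis models interactions between clips (the "fourth" dimension). All layer sizes are illustrative and do not reproduce the paper's RWF2000 architecture.

```python
import torch
import torch.nn as nn

class Clip3DEncoder(nn.Module):
    """Shared 3D CNN that encodes one clip (C, T, H, W) into a feature vector."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(16, out_dim)

    def forward(self, clip):
        return self.fc(self.conv(clip).flatten(1))

class VideoLevel4DNet(nn.Module):
    """Encode each clip with a 3D CNN, then convolve across the clip axis."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.clip_enc = Clip3DEncoder()
        self.across_clips = nn.Conv1d(64, 64, kernel_size=3, padding=1)
        self.head = nn.Linear(64, num_classes)

    def forward(self, video):
        # video: (B, N_clips, C, T, H, W)
        b, n = video.shape[:2]
        feats = self.clip_enc(video.flatten(0, 1)).view(b, n, -1)       # (B, N, 64)
        feats = torch.relu(self.across_clips(feats.transpose(1, 2)))    # conv over clips
        return self.head(feats.mean(dim=2))                             # video-level logits

# Usage: a batch of 2 videos, each split into 4 clips of 8 RGB frames at 32x32.
logits = VideoLevel4DNet()(torch.randn(2, 4, 3, 8, 32, 32))
```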
Styles APA, Harvard, Vancouver, ISO, etc.
41

Choi, Jinsoo, et Tae-Hyun Oh. « Joint Video Super-Resolution and Frame Interpolation via Permutation Invariance ». Sensors 23, no 5 (24 février 2023) : 2529. http://dx.doi.org/10.3390/s23052529.

Texte intégral
Résumé :
We propose a joint super-resolution (SR) and frame interpolation framework that can perform both spatial and temporal super-resolution. We identify performance variation according to the permutation of inputs in video super-resolution and video frame interpolation. We postulate that favorable features extracted from multiple frames should be consistent regardless of input order if the features are optimally complementary for the respective frames. With this motivation, we propose a permutation-invariant deep architecture that makes use of multi-frame SR principles by virtue of our order (permutation) invariant network. Specifically, given two adjacent frames, our model employs a permutation-invariant convolutional neural network module to extract “complementary” feature representations facilitating both the SR and temporal interpolation tasks. We demonstrate the effectiveness of our end-to-end joint method against various combinations of competing SR and frame interpolation methods on challenging video datasets, thereby verifying our hypothesis.
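The property at the heart of this design can be illustrated with a small sketch: per-frame features are extracted with shared weights and combined only through symmetric operations, so the fused representation is unchanged when the input order is swapped. The module below is a toy construction of the invariance idea, not the authors' network.

```python
import torch
import torch.nn as nn

class PermutationInvariantFusion(nn.Module):
    """Extract per-frame features with shared weights, then combine them with
    symmetric operations (mean and max), so swapping the two input frames
    yields the same fused representation."""
    def __init__(self, channels=16):
        super().__init__()
        self.frame_enc = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, frame_a, frame_b):
        fa, fb = self.frame_enc(frame_a), self.frame_enc(frame_b)
        sym_mean = (fa + fb) / 2
        sym_max = torch.maximum(fa, fb)
        return self.fuse(torch.cat([sym_mean, sym_max], dim=1))

# Usage: the output is identical (up to floating-point precision) for either order.
net = PermutationInvariantFusion()
a, b = torch.randn(1, 3, 32, 32), torch.randn(1, 3, 32, 32)
print(torch.allclose(net(a, b), net(b, a), atol=1e-6))
```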
Styles APA, Harvard, Vancouver, ISO, etc.
42

Kulkarni, Dr Shrinivasrao B., Abhishek Kuppelur, Akash Shetty, Shashank Bidarakatti et Taranath Sangresakoppa. « Analysis of Physiotherapy Practices using Deep Learning ». International Journal for Research in Applied Science and Engineering Technology 12, no 4 (30 avril 2024) : 5084–89. http://dx.doi.org/10.22214/ijraset.2024.61194.

Texte intégral
Résumé :
Abstract: The proposed physiotherapy assessment system, utilizing deep learning, aims to enhance the accuracy and efficiency of assessments. Traditional manual methods used by physiotherapists are often time-consuming and prone to errors, potentially leading to incorrect diagnoses and treatment plans. This system tackles these challenges by employing deep learning algorithms to detect joint angles and provide personalized audio feedback to patients based on their posture. The system begins by capturing the patient's video with a webcam and extracting frames using OpenCV. These frames are then analyzed with the MediaPipe library, which identifies key body points. These points are used to connect the body parts relevant to specific exercises and to calculate the angles between them. The system then evaluates posture correctness and delivers tailored audio feedback, counting repetitions if the posture is correct or providing guidance if it is not. Each exercise receives unique audio feedback, offering precise guidance for improving posture. Moreover, the system tracks patient progress and displays visual representations of improvement over time, which helps patients monitor their progress and fosters motivation to adhere to therapy. By leveraging deep learning algorithms and the MediaPipe library, this system presents a precise, efficient, and economical approach to physiotherapy assessment.
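The angle computation at the core of such a system reduces to simple vector geometry on pose keypoints. The sketch below assumes keypoints from any pose estimator (e.g., MediaPipe Pose); the landmark choice and angle thresholds are hypothetical.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at point b formed by points a-b-c (e.g., shoulder-elbow-wrist).

    a, b, c are (x, y) keypoints, for instance as produced by a pose estimator;
    the landmark names and thresholds below are illustrative assumptions.
    """
    a, b, c = np.asarray(a, float), np.asarray(b, float), np.asarray(c, float)
    v1, v2 = a - b, c - b
    cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))

def check_bicep_curl(shoulder, elbow, wrist, up_thresh=50.0, down_thresh=160.0):
    """Classify the elbow angle into 'up', 'down', or 'between' for repetition counting."""
    angle = joint_angle(shoulder, elbow, wrist)
    if angle < up_thresh:
        return "up", angle
    if angle > down_thresh:
        return "down", angle
    return "between", angle

# Usage with made-up normalized keypoints:
print(check_bicep_curl((0.5, 0.3), (0.55, 0.5), (0.5, 0.68)))
```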
Styles APA, Harvard, Vancouver, ISO, etc.
43

Liu, Daizong, Dongdong Yu, Changhu Wang et Pan Zhou. « F2Net : Learning to Focus on the Foreground for Unsupervised Video Object Segmentation ». Proceedings of the AAAI Conference on Artificial Intelligence 35, no 3 (18 mai 2021) : 2109–17. http://dx.doi.org/10.1609/aaai.v35i3.16308.

Texte intégral
Résumé :
Although deep-learning-based methods have achieved great progress in unsupervised video object segmentation, difficult scenarios (e.g., visual similarity, occlusions, and appearance changes) are still not well handled. To alleviate these issues, we propose a novel Focus on Foreground Network (F2Net), which delves into intra- and inter-frame details of the foreground objects and thereby effectively improves segmentation performance. Specifically, our proposed network consists of three main parts: a Siamese Encoder Module, a Center Guiding Appearance Diffusion Module, and a Dynamic Information Fusion Module. First, we use a Siamese encoder to extract the feature representations of paired frames (reference frame and current frame). Then, the Center Guiding Appearance Diffusion Module is designed to capture the inter-frame features (dense correspondences between the reference and current frames), the intra-frame features (dense correspondences within the current frame), and the original semantic features of the current frame. Unlike the Anchor Diffusion Network, we establish a Center Prediction Branch to predict the center location of the foreground object in the current frame and leverage this center point as a spatial guidance prior to enhance inter-frame and intra-frame feature extraction, so that the feature representations focus considerably on the foreground objects. Finally, we propose a Dynamic Information Fusion Module to automatically select the most relevant features from the three aforementioned feature levels. Extensive experiments on the DAVIS, YouTube-Objects, and FBMS datasets show that F2Net achieves state-of-the-art performance with significant improvements.
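A simplified view of the inter-frame correspondence step is sketched below: features of the current frame attend over features of the reference frame via a dense affinity matrix. This omits the center-guidance prior and the intra-frame branch and is only an illustration of the general mechanism, not the paper's module.

```python
import torch
import torch.nn.functional as F

def appearance_diffusion(ref_feat, cur_feat):
    """Propagate reference-frame features to the current frame via dense
    feature affinities (a simplified view of inter-frame correspondence).

    ref_feat, cur_feat: (B, C, H, W) feature maps from a shared (Siamese) encoder.
    Returns features for the current frame aggregated from the reference frame.
    """
    b, c, h, w = ref_feat.shape
    ref = ref_feat.flatten(2)                                       # (B, C, HW)
    cur = cur_feat.flatten(2)                                       # (B, C, HW)
    affinity = torch.bmm(cur.transpose(1, 2), ref) / (c ** 0.5)     # (B, HW_cur, HW_ref)
    attn = F.softmax(affinity, dim=-1)
    diffused = torch.bmm(attn, ref.transpose(1, 2))                 # (B, HW_cur, C)
    return diffused.transpose(1, 2).view(b, c, h, w)

# Usage: aggregate reference appearance onto the current frame's feature grid.
ref = torch.randn(1, 64, 16, 16)
cur = torch.randn(1, 64, 16, 16)
print(appearance_diffusion(ref, cur).shape)  # torch.Size([1, 64, 16, 16])
```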
Styles APA, Harvard, Vancouver, ISO, etc.
44

Sun, Zheng, Andrew W. Sumsion, Shad A. Torrie et Dah-Jye Lee. « Learning Facial Motion Representation with a Lightweight Encoder for Identity Verification ». Electronics 11, no 13 (22 juin 2022) : 1946. http://dx.doi.org/10.3390/electronics11131946.

Texte intégral
Résumé :
Deep learning became an important technique for image classification and object detection more than a decade ago and has since achieved human-like performance on many computer vision tasks. Some of these tasks involve the analysis of the human face, for applications such as facial recognition, expression recognition, and facial landmark detection. In recent years, researchers have generated and made publicly available many valuable datasets that allow for the development of more accurate and robust models for these tasks. Exploiting the information contained inside these pretrained deep structures can open the door to many new applications and provide a quick path to their success. This research focuses on a unique application that analyzes short facial-motion videos for identity verification. Our proposed solution leverages the rich information in those deep structures to provide an accurate face representation for facial motion analysis. We developed two strategies to employ the information contained in existing models for image-based face analysis to learn facial motion representations for our application. Combined with those pretrained spatial feature extractors for face-related analyses, our customized sequence encoder is capable of generating accurate facial motion embeddings for identity verification. The experimental results show that the facial geometry information from those feature extractors is valuable and helps our model achieve an impressive average precision of 98.8% for identity verification using facial motion.
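One plausible reading of this design is a frozen pretrained image backbone applied per frame, followed by a lightweight recurrent sequence encoder. The sketch below uses ResNet-18 and a GRU purely as stand-ins; the actual extractor and encoder in the paper may differ.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FacialMotionEncoder(nn.Module):
    """Frozen pretrained image backbone per frame + a small GRU over time.
    The backbone choice (ResNet-18) and embedding size are illustrative assumptions."""
    def __init__(self, emb_dim=128):
        super().__init__()
        backbone = models.resnet18(weights=None)   # load pretrained weights in practice
        backbone.fc = nn.Identity()                # keep the 512-d pooled features
        for p in backbone.parameters():
            p.requires_grad = False                # the spatial extractor stays frozen
        self.backbone = backbone
        self.gru = nn.GRU(512, emb_dim, batch_first=True)

    def forward(self, frames):
        # frames: (B, T, 3, H, W), a short facial-motion clip
        b, t = frames.shape[:2]
        with torch.no_grad():
            feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        _, h = self.gru(feats)
        return nn.functional.normalize(h[-1], dim=1)   # clip-level motion embedding

# Verification could then compare two clip embeddings with cosine similarity.
emb = FacialMotionEncoder()(torch.randn(2, 8, 3, 112, 112))
print(emb.shape)  # torch.Size([2, 128])
```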
Styles APA, Harvard, Vancouver, ISO, etc.
45

Wagner, Travis L., et Ashley Blewer. « “The Word Real Is No Longer Real” : Deepfakes, Gender, and the Challenges of AI-Altered Video ». Open Information Science 3, no 1 (1 janvier 2019) : 32–46. http://dx.doi.org/10.1515/opis-2019-0003.

Texte intégral
Résumé :
Abstract It is near-impossible for casual consumers of images to authenticate digitally altered images without a keen understanding of how to “read” the digital image. As Photoshop did for photographic alteration, so too have advances in artificial intelligence and computer graphics made seamless video alteration seem real to the untrained eye. The colloquialism used to describe these videos is “deepfakes”: a portmanteau of deep-learning AI and faked imagery. The implications of these videos serving as authentic representations matter, especially within rhetorics around “fake news.” Yet this alteration software, deployable both through high-end editing software and free mobile apps, remains critically underexamined. One troubling example of deepfakes is the superimposing of women’s faces onto pornographic videos. The implication is a reification of women’s bodies as things to be visually consumed, circumventing consent. This use is confounding considering that the very bodies used to perfect deepfakes were men’s. This paper explores how the emergence and distribution of deepfakes continues to enforce gendered disparities within visual information. The paper, however, rejects the inevitability of deepfakes, arguing that feminist-oriented approaches to building artificial intelligence and critical approaches to visual information literacy can stifle the distribution of violently sexist deepfakes.
Styles APA, Harvard, Vancouver, ISO, etc.
46

Sharif, Md Haidar, Lei Jiao et Christian W. Omlin. « CNN-ViT Supported Weakly-Supervised Video Segment Level Anomaly Detection ». Sensors 23, no 18 (7 septembre 2023) : 7734. http://dx.doi.org/10.3390/s23187734.

Texte intégral
Résumé :
Video anomaly event detection (VAED) is one of the key technologies in computer vision for smart surveillance systems. With the advent of deep learning, contemporary advances in VAED have achieved substantial success. Recently, weakly supervised VAED (WVAED) has become a popular research direction. WVAED methods do not depend on a supplementary self-supervised substitute task, yet they can estimate anomaly scores directly. However, their performance depends on pretrained feature extractors. In this paper, we first take advantage of two kinds of pretrained feature extractors, CNNs (e.g., C3D and I3D) and ViTs (e.g., CLIP), to effectively extract discriminative representations. We then consider long-range and short-range temporal dependencies and identify video snippets of interest by leveraging our proposed temporal self-attention network (TSAN). We design a multiple-instance learning (MIL)-based generalized architecture named CNN-ViT-TSAN, which uses CNN- and/or ViT-extracted features together with the TSAN to specify a series of models for the WVAED problem. Experimental results on publicly available crowd datasets demonstrate the effectiveness of our CNN-ViT-TSAN.
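The multiple-instance learning component can be illustrated with the widely used MIL ranking loss, sketched below with its common smoothness and sparsity regularizers; this is the generic formulation, not necessarily the exact loss used by CNN-ViT-TSAN.

```python
import torch

def mil_ranking_loss(normal_scores, anomalous_scores, margin=1.0):
    """Multiple-instance ranking loss for weakly supervised anomaly detection.

    normal_scores, anomalous_scores: (B, S) anomaly scores for S snippets of a
    normal and an anomalous video (bag). Only video-level labels are available,
    so the loss pushes the highest-scoring snippet of the anomalous bag above
    the highest-scoring snippet of the normal bag by a margin.
    """
    top_anomalous = anomalous_scores.max(dim=1).values
    top_normal = normal_scores.max(dim=1).values
    ranking = torch.clamp(margin - top_anomalous + top_normal, min=0).mean()
    # Common regularizers: temporal smoothness and sparsity of anomalous scores.
    smoothness = ((anomalous_scores[:, 1:] - anomalous_scores[:, :-1]) ** 2).mean()
    sparsity = anomalous_scores.mean()
    return ranking + 8e-4 * smoothness + 8e-4 * sparsity

# Usage with snippet scores produced by any scoring head over CNN/ViT features:
loss = mil_ranking_loss(torch.rand(4, 32), torch.rand(4, 32))
```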
Styles APA, Harvard, Vancouver, ISO, etc.
47

Jeon, DaeHyeon, et Min-Suk Kim. « Deep-Learning-Based Sequence Causal Long-Term Recurrent Convolutional Network for Data Fusion Using Video Data ». Electronics 12, no 5 (24 février 2023) : 1115. http://dx.doi.org/10.3390/electronics12051115.

Texte intégral
Résumé :
The purpose of AI-based schemes in intelligent systems is to advance and optimize system performance. Most intelligent systems operate on sequential data. Real-time video data, for example, arrive as a continuously updated sequence from which predictions must be made for efficient system performance. Deep-learning-based architectures for sequence data fusion, such as long short-term memory (LSTM), data fusion, two-stream, and temporal convolutional networks (TCNs), are generally used to enhance robust system efficiency. In this paper, we propose a deep-learning-based neural network architecture for non-fixed data that uses both a causal convolutional neural network (CNN) and a long-term recurrent convolutional network (LRCN). Causal CNNs and LRCNs incorporate convolutional layers for feature extraction, so both architectures can process sequential data such as time series or video in a variety of applications. Both architectures extract features from the input sequence to reduce the dimensionality of the data, capture the important information, and learn hierarchical representations for effective sequence-processing tasks. We also adopt the concept of a series compact convolutional recurrent neural network (SCCRNN), a neural network architecture that compactly combines convolutional and recurrent layers for processing sequential data, reducing the number of parameters and the memory usage while maintaining high accuracy. The architecture is well suited to continuously arriving video sequences and brings together the advantages of LSTM-based and CNN-based networks. To verify this method, we evaluated it through a sequence learning model, with the network parameters and memory required in real environments, on the UCF-101 dataset, an action recognition dataset of realistic action videos collected from YouTube with 101 action categories. The results show that the proposed sequence causal long-term recurrent convolutional network (SCLRCN) provides a performance improvement of approximately 12% or more compared with the existing models (LRCN and TCN).
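The causal convolution ingredient is easy to show in isolation: the input is padded only on the left, so each output step depends solely on past frames. The sketch below is a generic building block over per-frame feature vectors, not the SCLRCN itself.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1D convolution that only looks at past time steps: the input is padded
    on the left so output[t] never depends on input[t+1:]."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (B, C, T)
        x = nn.functional.pad(x, (self.pad, 0))  # pad only on the left (the past)
        return self.conv(x)

# A tiny causal stack over per-frame feature vectors, e.g. produced by a CNN.
stack = nn.Sequential(
    CausalConv1d(64, 64, dilation=1), nn.ReLU(),
    CausalConv1d(64, 64, dilation=2), nn.ReLU(),   # dilation grows the receptive field
)
out = stack(torch.randn(2, 64, 30))                # (B, C, T) in, (B, C, T) out
print(out.shape)
```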
Styles APA, Harvard, Vancouver, ISO, etc.
48

Wu, Sijie, Kai Zhang, Shaoyi Li et Jie Yan. « Learning to Track Aircraft in Infrared Imagery ». Remote Sensing 12, no 23 (6 décembre 2020) : 3995. http://dx.doi.org/10.3390/rs12233995.

Texte intégral
Résumé :
Airborne target tracking in infrared imagery remains a challenging task. The airborne target usually has a low signal-to-noise ratio and shows varying visual patterns. The features adopted in visual tracking algorithms are usually deep features pretrained on ImageNet, which are not tightly coupled with the current video domain and therefore might not be optimal for infrared target tracking. To this end, we propose a new approach for learning domain-specific features that can be adapted to the current video online, without pretraining on large datasets. Considering that only a few samples from the initial frame can be used for online training, general feature representations are encoded into the network for better initialization. The feature learning module is flexible and can be integrated into correlation-filter-based tracking frameworks to improve the baseline method. Experiments on airborne infrared imagery demonstrate the effectiveness of our tracking algorithm.
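The correlation-filter baseline into which such a feature module would be integrated can be sketched in a few lines: a MOSSE-style filter is solved in closed form in the Fourier domain and applied to new patches. The patch size and Gaussian target below are illustrative, and the learned features described in the abstract are not modeled here.

```python
import numpy as np

def train_correlation_filter(patch, target_response, lam=1e-2):
    """Closed-form correlation filter (MOSSE-style) in the Fourier domain:
    H* = (G . conj(F)) / (F . conj(F) + lambda)."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(target_response)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def detect(H_conj, patch):
    """Correlate a new patch with the learned filter and return the response map."""
    F = np.fft.fft2(patch)
    return np.real(np.fft.ifft2(H_conj * F))

# Usage: the desired response is a Gaussian peaked on the target centre of a 64x64 patch.
ys, xs = np.mgrid[0:64, 0:64]
gauss = np.exp(-((ys - 32) ** 2 + (xs - 32) ** 2) / (2 * 3.0 ** 2))
template = np.random.rand(64, 64)              # stand-in for an infrared target patch
H = train_correlation_filter(template, gauss)
response = detect(H, template)
print(np.unravel_index(response.argmax(), response.shape))  # peak near (32, 32)
```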
Styles APA, Harvard, Vancouver, ISO, etc.
49

Kong, Weiqi. « Research Advanced in Multimodal Emotion Recognition Based on Deep Learning ». Highlights in Science, Engineering and Technology 85 (13 mars 2024) : 602–8. http://dx.doi.org/10.54097/p3yprn36.

Texte intégral
Résumé :
The field of computer science has long been intrigued by emotion recognition, which seeks to decode the emotional content hidden within data. Initial approaches to sentiment analysis were predominantly based on single-modality data sources such as textual sentiment analysis, speech-based emotion detection, or the study of facial expressions. In recent years, with increasingly abundant data representations, researchers have gradually turned their attention to multimodal emotion recognition. Multimodal emotion recognition involves not only text but also audio, images, and video, and it is of great significance for enhancing human-computer interaction, improving user experience, and supporting emotion-aware applications. This paper thoroughly discusses the research advancements and primary techniques of multimodal emotion recognition. Specifically, it first introduces representative methods for single-modality emotion recognition based on visual data, including their basic pipelines, advantages, and disadvantages. Second, it reviews pertinent studies on multimodal emotion recognition and offers a quantitative comparison of how different approaches perform on standard multimodal datasets. Lastly, it addresses the complexities inherent in multimodal emotion recognition research and suggests potential areas for future study.
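A minimal baseline for combining modalities, for readers new to the area, is decision-level (late) fusion of per-modality predictions, sketched below with hypothetical three-class logits; the methods surveyed in the paper are typically far more sophisticated than this.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(logits_by_modality, weights=None):
    """Decision-level fusion: average (optionally weighted) per-modality class
    probabilities. A minimal baseline for combining text, audio, and visual cues."""
    probs = [softmax(np.asarray(l)) for l in logits_by_modality]
    if weights is None:
        weights = [1.0 / len(probs)] * len(probs)
    fused = sum(w * p for w, p in zip(weights, probs))
    return fused.argmax(axis=-1), fused

# Usage: hypothetical logits over {negative, neutral, positive} from three modalities.
text_logits = [2.0, 0.1, 0.5]
audio_logits = [0.3, 1.2, 0.9]
video_logits = [0.2, 0.4, 1.8]
label, probs = late_fusion([text_logits, audio_logits, video_logits])
print(label, np.round(probs, 3))
```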
Styles APA, Harvard, Vancouver, ISO, etc.
50

Tøttrup, Daniel, Stinus Lykke Skovgaard, Jonas le Fevre Sejersen et Rui Pimentel de Figueiredo. « A Fast and Accurate Approach to Multiple-Vehicle Localization and Tracking from Monocular Aerial Images ». Journal of Imaging 7, no 12 (8 décembre 2021) : 270. http://dx.doi.org/10.3390/jimaging7120270.

Texte intégral
Résumé :
In this work we present a novel end-to-end solution for tracking objects (i.e., vessels) in dynamic maritime environments using video streams from aerial drones. Our method relies on deep features, learned from realistic simulation data, for robust object detection, segmentation, and tracking. Furthermore, we propose the use of rotated bounding-box representations, computed by taking advantage of pixel-level object segmentation, which improve tracking accuracy by reducing erroneous data associations when combined with the appearance-based features. A thorough set of experiments in a realistic shipyard simulation environment demonstrates that our method can accurately and quickly detect and track dynamic objects seen from a top-down view.
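The rotated-bounding-box step can be reproduced with standard OpenCV operations on a segmentation mask, as sketched below; the synthetic elliptical mask stands in for a real vessel segmentation, and the return-value unpacking assumes the OpenCV 4.x API.

```python
import cv2
import numpy as np

def rotated_box_from_mask(mask):
    """Compute a rotated bounding box (center, size, angle) from a binary
    segmentation mask using the minimum-area rectangle of its largest contour."""
    contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)  # OpenCV 4.x signature
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    rect = cv2.minAreaRect(largest)            # ((cx, cy), (w, h), angle)
    corners = cv2.boxPoints(rect)              # 4 corner points for drawing or IoU
    return rect, corners

# Usage: a synthetic elongated blob standing in for a segmented vessel.
mask = np.zeros((200, 200), dtype=np.uint8)
cv2.ellipse(mask, (100, 100), (60, 15), 30, 0, 360, 255, -1)
rect, corners = rotated_box_from_mask(mask)
print(rect)   # centre near (100, 100), elongated box rotated by about 30 degrees
```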
Styles APA, Harvard, Vancouver, ISO, etc.