Academic literature on the topic 'Deep Video Representations'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Deep Video Representations.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Journal articles on the topic "Deep Video Representations"

1

Feichtenhofer, Christoph, Axel Pinz, Richard P. Wildes, and Andrew Zisserman. "Deep Insights into Convolutional Networks for Video Recognition." International Journal of Computer Vision 128, no. 2 (October 29, 2019): 420–37. http://dx.doi.org/10.1007/s11263-019-01225-w.

Full text
Abstract:
As the success of deep models has led to their deployment in all areas of computer vision, it is increasingly important to understand how these representations work and what they are capturing. In this paper, we shed light on deep spatiotemporal representations by visualizing the internal representation of models that have been trained to recognize actions in video. We visualize multiple two-stream architectures to show that local detectors for appearance and motion objects arise to form distributed representations for recognizing human actions. Key observations include the following. First, cross-stream fusion enables the learning of true spatiotemporal features rather than simply separate appearance and motion features. Second, the networks can learn local representations that are highly class specific, but also generic representations that can serve a range of classes. Third, throughout the hierarchy of the network, features become more abstract and show increasing invariance to aspects of the data that are unimportant to desired distinctions (e.g. motion patterns across various speeds). Fourth, visualizations can be used not only to shed light on learned representations, but also to reveal idiosyncrasies of training data and to explain failure cases of the system.
APA, Harvard, Vancouver, ISO, and other styles
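To make the cross-stream fusion idea from entry 1 concrete, here is a minimal, hedged PyTorch sketch of one common way to fuse appearance and motion feature maps (channel concatenation followed by a 1x1x1 convolution). All layer sizes and tensor shapes are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class CrossStreamFusion(nn.Module):
    """Fuse spatial-stream and temporal-stream feature maps into one
    spatiotemporal representation (illustrative sketch)."""
    def __init__(self, channels=256):
        super().__init__()
        # Learnable fusion of the concatenated streams at a single layer.
        self.fuse = nn.Conv3d(2 * channels, channels, kernel_size=1)

    def forward(self, appearance, motion):
        # appearance, motion: (batch, channels, time, height, width)
        x = torch.cat([appearance, motion], dim=1)
        return self.fuse(x)

rgb_feat = torch.randn(2, 256, 8, 14, 14)   # spatial-stream features (assumed shape)
flow_feat = torch.randn(2, 256, 8, 14, 14)  # temporal-stream features (assumed shape)
fused = CrossStreamFusion()(rgb_feat, flow_feat)
print(fused.shape)  # torch.Size([2, 256, 8, 14, 14])
```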
2

Pandeya, Yagya Raj, Bhuwan Bhattarai, and Joonwhoan Lee. "Deep-Learning-Based Multimodal Emotion Classification for Music Videos." Sensors 21, no. 14 (July 20, 2021): 4927. http://dx.doi.org/10.3390/s21144927.

Full text
Abstract:
Music videos contain a great deal of visual and acoustic information. Each information source within a music video influences the emotions conveyed through the audio and video, suggesting that only a multimodal approach is capable of achieving efficient affective computing. This paper presents an affective computing system that relies on music, video, and facial expression cues, making it useful for emotional analysis. We applied the audio–video information exchange and boosting methods to regularize the training process and reduced the computational costs by using a separable convolution strategy. In sum, our empirical findings are as follows: (1) Multimodal representations efficiently capture all acoustic and visual emotional clues included in each music video, (2) the computational cost of each neural network is significantly reduced by factorizing the standard 2D/3D convolution into separate channels and spatiotemporal interactions, and (3) information-sharing methods incorporated into multimodal representations are helpful in guiding individual information flow and boosting overall performance. We tested our findings across several unimodal and multimodal networks against various evaluation metrics and visual analyzers. Our best classifier attained 74% accuracy, an f1-score of 0.73, and an area under the curve score of 0.926.
APA, Harvard, Vancouver, ISO, and other styles
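The separable-convolution strategy mentioned in entry 2 can be illustrated with a small PyTorch sketch that factorizes a full 3x3x3 convolution into a spatial step followed by a temporal step; the exact factorization and channel counts used by the authors are not specified here and are assumed for illustration.

```python
import torch
import torch.nn as nn

class FactorizedConv3d(nn.Module):
    """Replace a full 3x3x3 convolution with a spatial 1x3x3 convolution
    followed by a temporal 3x1x1 convolution, reducing parameters and FLOPs."""
    def __init__(self, in_ch, out_ch, mid_ch=None):
        super().__init__()
        mid_ch = mid_ch or out_ch
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.act(self.temporal(self.act(self.spatial(x))))

video = torch.randn(1, 3, 16, 112, 112)   # assumed clip size
out = FactorizedConv3d(3, 64)(video)
print(out.shape)  # torch.Size([1, 64, 16, 112, 112])
```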
3

Ljubešić, Nikola. "‟Deep lexicography” – Fad or Opportunity?" Rasprave Instituta za hrvatski jezik i jezikoslovlje 46, no. 2 (October 30, 2020): 839–52. http://dx.doi.org/10.31724/rihjj.46.2.21.

Full text
Abstract:
In recent years, we are witnessing staggering improvements in various semantic data processing tasks due to the developments in the area of deep learning, ranging from image and video processing to speech processing, and natural language understanding. In this paper, we discuss the opportunities and challenges that these developments pose for the area of electronic lexicography. We primarily focus on the concept of representation learning of the basic elements of language, namely words, and the applicability of these word representations to lexicography. We first discuss well-known approaches to learning static representations of words, the so-called word embeddings, and their usage in lexicography-related tasks such as semantic shift detection, and cross-lingual prediction of lexical features such as concreteness and imageability. We wrap up the paper with the most recent developments in the area of word representation learning in form of learning dynamic, context-aware representations of words, showcasing some dynamic word embedding examples, and discussing improvements on lexicography-relevant tasks of word sense disambiguation and word sense induction.
APA, Harvard, Vancouver, ISO, and other styles
4

Kumar, Vidit, Vikas Tripathi, and Bhaskar Pant. "Learning Unsupervised Visual Representations using 3D Convolutional Autoencoder with Temporal Contrastive Modeling for Video Retrieval." International Journal of Mathematical, Engineering and Management Sciences 7, no. 2 (March 14, 2022): 272–87. http://dx.doi.org/10.33889/ijmems.2022.7.2.018.

Full text
Abstract:
The rapid growth of tag-free user-generated videos (on the Internet), recorded surgical videos, and surveillance videos has created the need for effective content-based video retrieval systems. Earlier methods for video representation were based on hand-crafted features, which hardly performed well on video retrieval tasks. Subsequently, deep learning methods have successfully demonstrated their effectiveness in both image and video-related tasks, but at the cost of creating massively labeled datasets. Thus, the economic solution is to use freely available unlabeled web videos for representation learning. In this regard, most of the recently developed methods are based on solving a single pretext task using a 2D or 3D convolutional network. However, this paper designs and studies a 3D convolutional autoencoder (3D-CAE) for video representation learning (since it does not require labels). Further, this paper proposes a new unsupervised video feature learning method based on joint learning of past and future prediction using 3D-CAE with temporal contrastive learning. The experiments are conducted on the UCF-101 and HMDB-51 datasets, where the proposed approach achieves better retrieval performance than the state of the art. In the ablation study, the action recognition task is performed by fine-tuning the unsupervised pre-trained model, where it outperforms other methods, which further confirms the superiority of our method in learning underlying features. Such an unsupervised representation learning approach could also benefit the medical domain, where it is expensive to create large labeled datasets.
APA, Harvard, Vancouver, ISO, and other styles
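A rough PyTorch sketch of the general recipe behind entry 4: a tiny 3D convolutional autoencoder trained with a reconstruction loss plus an InfoNCE-style contrastive term between past and future clip embeddings. The architecture, clip sizes, and loss weighting are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tiny3DCAE(nn.Module):
    """Minimal 3D convolutional autoencoder with a clip-embedding head."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, 3, 4, stride=2, padding=1),
        )

    def embed(self, clip):
        h = self.encoder(clip)                             # (B, 64, T', H', W')
        return F.normalize(h.mean(dim=(2, 3, 4)), dim=1)   # clip-level embedding

    def forward(self, clip):
        return self.decoder(self.encoder(clip))

def info_nce(past_emb, future_emb, temperature=0.1):
    # Past/future clips of the same video are positives; other videos in
    # the batch act as negatives.
    logits = past_emb @ future_emb.t() / temperature
    targets = torch.arange(past_emb.size(0))
    return F.cross_entropy(logits, targets)

model = Tiny3DCAE()
past = torch.randn(4, 3, 8, 64, 64)     # "past" clips (assumed size)
future = torch.randn(4, 3, 8, 64, 64)   # "future" clips of the same videos
recon_loss = F.mse_loss(model(past), past)
contrastive = info_nce(model.embed(past), model.embed(future))
loss = recon_loss + contrastive          # equal weighting is an assumption
```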
5

Vihlman, Mikko, and Arto Visala. "Optical Flow in Deep Visual Tracking." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 12112–19. http://dx.doi.org/10.1609/aaai.v34i07.6890.

Full text
Abstract:
Single-target tracking of generic objects is a difficult task since a trained tracker is given information present only in the first frame of a video. In recent years, increasingly many trackers have been based on deep neural networks that learn generic features relevant for tracking. This paper argues that deep architectures are often fit to learn implicit representations of optical flow. Optical flow is intuitively useful for tracking, but most deep trackers must learn it implicitly. This paper is among the first to study the role of optical flow in deep visual tracking. The architecture of a typical tracker is modified to reveal the presence of implicit representations of optical flow and to assess the effect of using the flow information more explicitly. The results show that the considered network learns implicitly an effective representation of optical flow. The implicit representation can be replaced by an explicit flow input without a notable effect on performance. Using the implicit and explicit representations at the same time does not improve tracking accuracy. The explicit flow input could allow constructing lighter networks for tracking.
APA, Harvard, Vancouver, ISO, and other styles
6

Rouast, Philipp V., and Marc T. P. Adam. "Learning Deep Representations for Video-Based Intake Gesture Detection." IEEE Journal of Biomedical and Health Informatics 24, no. 6 (June 2020): 1727–37. http://dx.doi.org/10.1109/jbhi.2019.2942845.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Li, Jialu, Aishwarya Padmakumar, Gaurav Sukhatme, and Mohit Bansal. "VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (March 24, 2024): 18517–26. http://dx.doi.org/10.1609/aaai.v38i17.29813.

Full text
Abstract:
Outdoor Vision-and-Language Navigation (VLN) requires an agent to navigate through realistic 3D outdoor environments based on natural language instructions. The performance of existing VLN methods is limited by insufficient diversity in navigation environments and limited training data. To address these issues, we propose VLN-Video, which utilizes the diverse outdoor environments present in driving videos in multiple cities in the U.S. augmented with automatically generated navigation instructions and actions to improve outdoor VLN performance. VLN-Video combines the best of intuitive classical approaches and modern deep learning techniques, using template infilling to generate grounded non-repetitive navigation instructions, combined with an image rotation similarity based navigation action predictor to obtain VLN style data from driving videos for pretraining deep learning VLN models. We pre-train the model on the Touchdown dataset and our video-augmented dataset created from driving videos with three proxy tasks: Masked Language Modeling, Instruction and Trajectory Matching, and Next Action Prediction, so as to learn temporally-aware and visually-aligned instruction representations. The learned instruction representation is adapted to the state-of-the-art navigation agent when fine-tuning on the Touchdown dataset. Empirical results demonstrate that VLN-Video significantly outperforms previous state-of-the-art models by 2.1% in task completion rate, achieving a new state-of-the-art on the Touchdown dataset.
APA, Harvard, Vancouver, ISO, and other styles
8

Hu, Yueyue, Shiliang Sun, Xin Xu, and Jing Zhao. "Multi-View Deep Attention Network for Reinforcement Learning (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 10 (April 3, 2020): 13811–12. http://dx.doi.org/10.1609/aaai.v34i10.7177.

Full text
Abstract:
The representation approximated by a single deep network is usually limited for reinforcement learning agents. We propose a novel multi-view deep attention network (MvDAN), which introduces multi-view representation learning into the reinforcement learning task for the first time. The proposed model approximates a set of strategies from multiple representations and combines these strategies based on attention mechanisms to provide a comprehensive strategy for a single agent. Experimental results on eight Atari video games show that the MvDAN achieves more effective and competitive performance than single-view reinforcement learning methods.
APA, Harvard, Vancouver, ISO, and other styles
9

Dong, Zhen, Chenchen Jing, Mingtao Pei, and Yunde Jia. "Deep CNN based binary hash video representations for face retrieval." Pattern Recognition 81 (September 2018): 357–69. http://dx.doi.org/10.1016/j.patcog.2018.04.014.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Psallidas, Theodoros, and Evaggelos Spyrou. "Video Summarization Based on Feature Fusion and Data Augmentation." Computers 12, no. 9 (September 15, 2023): 186. http://dx.doi.org/10.3390/computers12090186.

Full text
Abstract:
During the last few years, several technological advances have led to an increase in the creation and consumption of audiovisual multimedia content. Users are overexposed to videos via several social media or video sharing websites and mobile phone applications. For efficient browsing, searching, and navigation across several multimedia collections and repositories, e.g., for finding videos that are relevant to a particular topic or interest, this ever-increasing content should be efficiently described by informative yet concise content representations. A common solution to this problem is the construction of a brief summary of a video, which could be presented to the user, instead of the full video, so that she/he could then decide whether to watch or ignore the whole video. Such summaries are ideally more expressive than other alternatives, such as brief textual descriptions or keywords. In this work, the video summarization problem is approached as a supervised classification task, which relies on feature fusion of audio and visual data. Specifically, the goal of this work is to generate dynamic video summaries, i.e., compositions of parts of the original video, which include its most essential video segments, while preserving the original temporal sequence. This work relies on annotated datasets on a per-frame basis, wherein parts of videos are annotated as being “informative” or “noninformative”, with the latter being excluded from the produced summary. The novelties of the proposed approach are, (a) prior to classification, a transfer learning strategy to use deep features from pretrained models is employed. These models have been used as input to the classifiers, making them more intuitive and robust to objectiveness, and (b) the training dataset was augmented by using other publicly available datasets. The proposed approach is evaluated using three datasets of user-generated videos, and it is demonstrated that deep features and data augmentation are able to improve the accuracy of video summaries based on human annotations. Moreover, it is domain independent, could be used on any video, and could be extended to rely on richer feature representations or include other data modalities.
APA, Harvard, Vancouver, ISO, and other styles
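The feature-fusion classification step described in entry 10 reduces, in its simplest form, to concatenating per-segment audio and visual descriptors and feeding them to a binary classifier. The sketch below assumes hypothetical feature dimensions (2048-d visual, 128-d audio) purely for illustration.

```python
import torch
import torch.nn as nn

class SegmentClassifier(nn.Module):
    """Classify each video segment as informative (keep in summary) or not,
    from fused deep audio and visual features."""
    def __init__(self, visual_dim=2048, audio_dim=128, hidden=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(visual_dim + audio_dim, hidden), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden, 2),  # "informative" vs. "noninformative"
        )

    def forward(self, visual, audio):
        return self.head(torch.cat([visual, audio], dim=1))

clf = SegmentClassifier()
visual = torch.randn(10, 2048)   # pretrained-CNN features per segment (assumed)
audio = torch.randn(10, 128)     # audio features per segment (assumed)
keep = clf(visual, audio).argmax(dim=1)  # 1 = include segment in the dynamic summary
```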

Dissertations / Theses on the topic "Deep Video Representations"

1

Yang, Yang. "Learning Hierarchical Representations for Video Analysis Using Deep Learning." Doctoral diss., University of Central Florida, 2013. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/5892.

Full text
Abstract:
With the exponential growth of digital data, video content analysis (e.g., action and event recognition) has been drawing increasing attention from computer vision researchers. Effective modeling of the objects, scenes, and motions is critical for visual understanding. Recently there has been a growing interest in bio-inspired deep learning models, which have shown impressive results in speech and object recognition. Deep learning models are formed by the composition of multiple non-linear transformations of the data, with the goal of yielding more abstract and ultimately more useful representations. The advantages of deep models are threefold: 1) they learn features directly from the raw signal, in contrast to hand-designed features; 2) the learning can be unsupervised, which is suitable for large data where labeling all the data is expensive and impractical; 3) they learn a hierarchy of features one level at a time, and the layer-wise stacking of feature extraction often yields better representations. However, not many deep learning models have been proposed to solve the problems in video analysis, especially videos "in the wild". Most of them either deal with simple datasets or are limited to low-level local spatial-temporal feature descriptors for action recognition. Moreover, as the learning algorithms are unsupervised, the learned features preserve generative properties rather than the discriminative ones that are more favorable in classification tasks. In this context, the thesis makes two major contributions. First, we propose several formulations and extensions of deep learning methods which learn hierarchical representations for three challenging video analysis tasks, including complex event recognition, object detection in videos, and measuring action similarity. The proposed methods are extensively demonstrated for each task on state-of-the-art challenging datasets. Besides learning the low-level local features, higher-level representations are further designed to be learned in the context of applications. The data-driven concept representations and sparse representation of the events are learned for complex event recognition; the representations for object body parts and structures are learned for object detection in videos; and the relational motion features and similarity metrics between video pairs are learned simultaneously for action verification. Second, in order to learn discriminative and compact features, we propose a new feature learning method using a deep neural network based on autoencoders. It differs from existing unsupervised feature learning methods in two ways: first, it optimizes both discriminative and generative properties of the features simultaneously, which gives our features a better discriminative ability; second, our learned features are more compact, while unsupervised feature learning methods usually learn a redundant set of over-complete features. Extensive experiments with quantitative and qualitative results on the tasks of human detection and action verification demonstrate the superiority of our proposed models.
Ph.D. dissertation, Electrical Engineering and Computer Science.
APA, Harvard, Vancouver, ISO, and other styles
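The "discriminative autoencoder" idea in the second contribution of entry 1 can be sketched as an autoencoder whose bottleneck code feeds both a decoder (generative term) and a classifier (discriminative term); the dimensions below are placeholders, not the thesis configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscriminativeAutoencoder(nn.Module):
    """Autoencoder whose code is optimized jointly for reconstruction
    (generative) and classification (discriminative)."""
    def __init__(self, in_dim=4096, code_dim=256, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, in_dim))
        self.classifier = nn.Linear(code_dim, num_classes)

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), self.classifier(code)

model = DiscriminativeAutoencoder()
x = torch.randn(8, 4096)                      # assumed input descriptor size
labels = torch.randint(0, 10, (8,))
recon, logits = model(x)
loss = F.mse_loss(recon, x) + F.cross_entropy(logits, labels)  # generative + discriminative
```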
2

Sudhakaran, Swathikiran. "Deep Neural Architectures for Video Representation Learning." Doctoral thesis, Università degli studi di Trento, 2019. https://hdl.handle.net/11572/369191.

Full text
Abstract:
Automated analysis of videos for content understanding is one of the most challenging and well researched areas in computer vision and multimedia. This thesis addresses the problem of video content understanding in the context of action recognition. The major challenge faced by this research problem is the variations of the spatio-temporal patterns that constitute each action category and the difficulty in generating a succinct representation encapsulating these patterns. This thesis considers two important aspects of videos for addressing this problem: (1) a video is a sequence of images with an inherent temporal dependency that defines the actual pattern to be recognized; (2) not all spatial regions of the video frame are equally important for discriminating one action category from another. The first aspect shows the importance of aggregating frame level features in a sequential manner while the second aspect signifies the importance of selective encoding of frame level features. The first problem is addressed by analyzing popular Convolutional Neural Network (CNN)-Recurrent Neural Network (RNN) architectures for video representation generation and concludes that Convolutional Long Short-Term Memory (ConvLSTM), a variant of the popular Long Short-Term Memory (LSTM) RNN unit, is suitable for encoding spatio-temporal patterns occurring in a video sequence. The second problem is tackled by developing a spatial attention mechanism for the selective encoding of spatial features by weighting spatial regions in the feature tensor that are relevant for identifying the action category. Detailed experimental analysis carried out on two video recognition tasks showed that spatially selective encoding is indeed beneficial. Inspired from the two aforementioned findings, a new recurrent neural unit, called Long Short-Term Attention (LSTA), is developed by augmenting LSTM with built-in spatial attention and a revised output gating. The first enables LSTA to attend to the relevant spatial regions while maintaining a smooth tracking of the attended regions and the latter allows the network to propagate a filtered version of the memory localized on the most discriminative components of the video. LSTA surpasses the recognition accuracy of existing state-of-the-art techniques on popular egocentric activity recognition benchmarks, showing its effectiveness in video representation generation.
APA, Harvard, Vancouver, ISO, and other styles
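A minimal PyTorch sketch of the spatially selective encoding discussed in entries 2 and 3: a learned attention map weights each spatial location of a frame-level feature map before pooling. This is a generic attention-pooling layer, not the LSTA unit itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionPool(nn.Module):
    """Weight each spatial location of a CNN feature map before pooling,
    so the frame descriptor focuses on discriminative regions."""
    def __init__(self, channels=512):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat):                       # feat: (batch, channels, H, W)
        b, c, h, w = feat.shape
        scores = self.score(feat).view(b, 1, h * w)
        attn = F.softmax(scores, dim=-1).view(b, 1, h, w)
        return (feat * attn).sum(dim=(2, 3))       # attended frame descriptor

frame_feat = torch.randn(4, 512, 7, 7)             # assumed backbone feature map
pooled = SpatialAttentionPool()(frame_feat)
print(pooled.shape)  # torch.Size([4, 512])
```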
3

Sudhakaran, Swathikiran. "Deep Neural Architectures for Video Representation Learning." Doctoral thesis, University of Trento, 2019. http://eprints-phd.biblio.unitn.it/3731/1/swathi_thesis_rev1.pdf.

Full text
Abstract:
Automated analysis of videos for content understanding is one of the most challenging and well researched areas in computer vision and multimedia. This thesis addresses the problem of video content understanding in the context of action recognition. The major challenge faced by this research problem is the variations of the spatio-temporal patterns that constitute each action category and the difficulty in generating a succinct representation encapsulating these patterns. This thesis considers two important aspects of videos for addressing this problem: (1) a video is a sequence of images with an inherent temporal dependency that defines the actual pattern to be recognized; (2) not all spatial regions of the video frame are equally important for discriminating one action category from another. The first aspect shows the importance of aggregating frame level features in a sequential manner while the second aspect signifies the importance of selective encoding of frame level features. The first problem is addressed by analyzing popular Convolutional Neural Network (CNN)-Recurrent Neural Network (RNN) architectures for video representation generation and concludes that Convolutional Long Short-Term Memory (ConvLSTM), a variant of the popular Long Short-Term Memory (LSTM) RNN unit, is suitable for encoding spatio-temporal patterns occurring in a video sequence. The second problem is tackled by developing a spatial attention mechanism for the selective encoding of spatial features by weighting spatial regions in the feature tensor that are relevant for identifying the action category. Detailed experimental analysis carried out on two video recognition tasks showed that spatially selective encoding is indeed beneficial. Inspired from the two aforementioned findings, a new recurrent neural unit, called Long Short-Term Attention (LSTA), is developed by augmenting LSTM with built-in spatial attention and a revised output gating. The first enables LSTA to attend to the relevant spatial regions while maintaining a smooth tracking of the attended regions and the latter allows the network to propagate a filtered version of the memory localized on the most discriminative components of the video. LSTA surpasses the recognition accuracy of existing state-of-the-art techniques on popular egocentric activity recognition benchmarks, showing its effectiveness in video representation generation.
APA, Harvard, Vancouver, ISO, and other styles
4

Sun, Shuyang. "Designing Motion Representation in Videos." Thesis, The University of Sydney, 2018. http://hdl.handle.net/2123/19724.

Full text
Abstract:
Motion representation plays a vital role in the vision-based human action recognition in videos. Generally, the information of a video could be divided into spatial information and temporal information. While the spatial information could be easily described by the RGB images, the design of the motion representation is yet a challenging problem. In order to design a motion representation that is efficient and effective, we design the feature according to two principles. First, to guarantee the robustness, the temporal information should be highly related to the informative modalities, e.g., the optical flow. Second, only basic operations could be applied to make the computational cost affordable when extracting the temporal information. Based on these principles, we introduce a novel compact motion representation for video action recognition, named Optical Flow guided Feature (OFF), which enables the network to distil temporal information through a fast and robust approach. The OFF is derived from the definition of optical flow and is orthogonal to the optical flow. The derivation also provides theoretical support for using the difference between two frames. By directly calculating pixel-wise spatiotemporal gradients of the deep feature maps, the OFF could be embedded in any existing CNN based video action recognition framework with only a slight additional cost. It enables the CNN to extract spatiotemporal information. This simple but powerful idea is validated by experimental results. The network with OFF fed only by RGB inputs achieves a competitive accuracy of 93.3% on UCF-101, which is comparable with the result obtained by two streams (RGB and optical flow), but is 15 times faster in speed. Experimental results also show that OFF is complementary to other motion modalities such as optical flow. When the proposed method is plugged into the state-of-the-art video action recognition framework, it has 96.0% and 74.2% accuracy on UCF-101 and HMDB-51 respectively.
APA, Harvard, Vancouver, ISO, and other styles
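The Optical Flow guided Feature (OFF) described in entry 4 is built from pixel-wise spatial gradients of deep feature maps plus the temporal difference between adjacent frames. The sketch below follows that description, using Sobel filters for the spatial gradients; the kernel choice and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def optical_flow_guided_feature(feat_t, feat_t1):
    """Sketch of OFF: spatial gradients of a deep feature map plus the
    temporal difference between two frames.
    feat_t, feat_t1: (batch, channels, H, W) feature maps at frames t, t+1."""
    c = feat_t.size(1)
    # Sobel filters applied depthwise to obtain per-channel spatial gradients.
    sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    sobel_y = sobel_x.t()
    kx = sobel_x.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    ky = sobel_y.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    grad_x = F.conv2d(feat_t, kx, padding=1, groups=c)
    grad_y = F.conv2d(feat_t, ky, padding=1, groups=c)
    grad_t = feat_t1 - feat_t                     # temporal difference
    return torch.cat([grad_x, grad_y, grad_t], dim=1)

f_t = torch.randn(2, 64, 28, 28)                  # assumed feature-map shape
f_t1 = torch.randn(2, 64, 28, 28)
off = optical_flow_guided_feature(f_t, f_t1)
print(off.shape)  # torch.Size([2, 192, 28, 28])
```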
5

Mazari, Ahmed. "Apprentissage profond pour la reconnaissance d’actions en vidéos." Electronic Thesis or Diss., Sorbonne université, 2020. http://www.theses.fr/2020SORUS171.

Full text
Abstract:
Nowadays, video content is ubiquitous through the popular use of the internet and smartphones, as well as social media. Many daily life applications such as video surveillance and video captioning, as well as scene understanding, require sophisticated technologies to process video data. It is of crucial importance to develop automatic means to analyze and interpret the large amount of available video data. In this thesis, we are interested in video action recognition, i.e. the problem of assigning action categories to video sequences. This can be seen as a key ingredient for building the next generation of vision systems. It is tackled with AI frameworks, mainly machine learning and deep convolutional networks (ConvNets). Current ConvNets are increasingly deep and data-hungry, which makes their success dependent on the abundance of labeled training data. ConvNets also rely on (max or average) pooling, which reduces the dimensionality of output layers (and hence attenuates their sensitivity to the availability of labeled data); however, this process may dilute the information of upstream convolutional layers and thereby affect the discriminative power of the trained video representations, especially when the learned action categories are fine-grained.
APA, Harvard, Vancouver, ISO, and other styles
6

"Video2Vec: Learning Semantic Spatio-Temporal Embedding for Video Representations." Master's thesis, 2016. http://hdl.handle.net/2286/R.I.40765.

Full text
Abstract:
High-level inference tasks in video applications such as recognition, video retrieval, and zero-shot classification have become an active research area in recent years. One fundamental requirement for such applications is to extract high-quality features that maintain high-level information in the videos. Many video feature extraction algorithms have been proposed, such as STIP, HOG3D, and Dense Trajectories. These algorithms are often referred to as "handcrafted" features, as they were deliberately designed based on some reasonable considerations. However, these algorithms may fail when dealing with high-level tasks or complex scene videos. Due to the success of using deep convolutional neural networks (CNNs) to extract global representations for static images, researchers have been using similar techniques to tackle video contents. Typical techniques first extract spatial features by processing raw images using deep convolutional architectures designed for static image classification. Then simple average, concatenation, or classifier-based fusion/pooling methods are applied to the extracted features. I argue that features extracted in such ways do not acquire enough representative information, since videos, unlike images, should be characterized as a temporal sequence of semantically coherent visual contents and thus need to be represented in a manner considering both semantic and spatio-temporal information. In this thesis, I propose a novel architecture to learn semantic spatio-temporal embedding for videos to support high-level video analysis. The proposed method encodes video spatial and temporal information separately by employing a deep architecture consisting of two channels of convolutional neural networks (capturing appearance and local motion) followed by their corresponding Fully Connected Gated Recurrent Unit (FC-GRU) encoders for capturing the longer-term temporal structure of the CNN features. The resultant spatio-temporal representation (a vector) is used to learn a mapping via a Fully Connected Multilayer Perceptron (FC-MLP) to the word2vec semantic embedding space, leading to a semantic interpretation of the video vector that supports high-level analysis. I evaluate the usefulness and effectiveness of this new video representation by conducting experiments on action recognition, zero-shot video classification, and semantic (word-to-video) video retrieval, using the UCF101 action recognition dataset.
Master's thesis, Computer Science, 2016.
APA, Harvard, Vancouver, ISO, and other styles
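A hedged PyTorch sketch of the pipeline summarized in entry 6: frame-level CNN features are encoded by a recurrent unit (a plain GRU here, standing in for the FC-GRU) and mapped by an MLP into a word2vec-like semantic space, where a cosine objective aligns video vectors with class word vectors. All dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Video2VecSketch(nn.Module):
    """Frame-level CNN features -> GRU over time -> MLP into a 300-d
    word-embedding space (dimensions are assumptions, not from the thesis)."""
    def __init__(self, feat_dim=2048, hidden=512, embed_dim=300):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden, 512), nn.ReLU(),
                                 nn.Linear(512, embed_dim))

    def forward(self, frame_feats):           # (batch, time, feat_dim)
        _, h_n = self.gru(frame_feats)        # h_n: (1, batch, hidden)
        return F.normalize(self.mlp(h_n[-1]), dim=1)

model = Video2VecSketch()
feats = torch.randn(2, 30, 2048)              # 30 frames of CNN features (assumed)
video_vec = model(feats)
target = F.normalize(torch.randn(2, 300), dim=1)   # stand-in for word2vec class vectors
loss = 1 - F.cosine_similarity(video_vec, target).mean()
```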
7

Khanuja, Gagandeep Singh. "A Study of Real Time Search in Flood Scenes from UAV Videos Using Deep Learning Techniques." Thesis, 2019.

Find full text
Abstract:
Following a natural disaster, one of the most important factors that influence a person's chances of survival and of being found is the time within which they are rescued. Traditional means of search, involving dogs, ground robots, and humanitarian intervention, are time-intensive and can be a major bottleneck in search operations. The main aim of these operations is to rescue victims without critical delay, in the shortest time possible, which can be realized in real time by using UAVs. With advancements in computational devices and the ability to learn from complex data, deep learning can be leveraged in real-time environments for search and rescue operations. This research aims to improve on traditional search operations by using deep learning for real-time object detection and photogrammetry for precise geo-location mapping of the objects (person, car) in real time. In order to do so, various pre-trained algorithms such as Mask-RCNN, SSD300, and YOLOv3, as well as a custom-trained YOLOv3, have been deployed, and their results compared as means of addressing the search operation in real time.

APA, Harvard, Vancouver, ISO, and other styles
8

Souček, Tomáš. "Detekce střihů a vyhledávání známých scén ve videu s pomocí metod hlubokého učení." Master's thesis, 2020. http://www.nusl.cz/ntk/nusl-434967.

Full text
Abstract:
Video retrieval represents a challenging problem with many caveats and sub-problems. This thesis focuses on two of these sub-problems, namely shot transition detection and text-based search. In the case of shot detection, many solutions have been proposed over the last decades. Recently, deep learning-based approaches improved the accuracy of shot transition detection using 3D convolutional architectures and artificially created training data, but one hundred percent accuracy is still an unreachable ideal. In this thesis we present a deep network for shot transition detection, TransNet V2, that reaches state-of-the-art performance on respected benchmarks. In the second case of text-based search, deep learning models projecting textual queries and video frames into a joint space have proved to be effective for text-based video retrieval. We investigate these query representation learning models in a setting of known-item search and propose improvements for the text encoding part of the model.
APA, Harvard, Vancouver, ISO, and other styles
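The text-based known-item search setting in entry 8 relies on projecting a text query and video frames into a joint embedding space and ranking frames by cosine similarity. The sketch below is a generic joint-embedding retrieval skeleton with assumed feature dimensions, not the thesis model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Project a text-query vector and video-frame features into a shared
    space; retrieval then ranks frames by cosine similarity."""
    def __init__(self, text_dim=768, frame_dim=2048, joint_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.frame_proj = nn.Linear(frame_dim, joint_dim)

    def forward(self, text_feat, frame_feats):
        q = F.normalize(self.text_proj(text_feat), dim=-1)      # (1, joint_dim)
        f = F.normalize(self.frame_proj(frame_feats), dim=-1)   # (N, joint_dim)
        return f @ q.t()                                        # similarity per frame

model = JointEmbedding()
query = torch.randn(1, 768)        # sentence-encoder output (assumed)
frames = torch.randn(1000, 2048)   # pre-extracted frame features (assumed)
scores = model(query, frames).squeeze(1)
top5 = scores.topk(5).indices      # best-matching frames for known-item search
```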

Books on the topic "Deep Video Representations"

1

Aguayo, Angela J. Documentary Resistance. Oxford University Press, 2019. http://dx.doi.org/10.1093/oso/9780190676216.001.0001.

Full text
Abstract:
The potential of documentary moving images to foster democratic exchange has been percolating within media production culture for the last century, and now, with mobile cameras at our fingertips and broadcasts circulating through unpredictable social networks, the documentary impulse is coming into its own as a political force of social change. The exploding reach and power of audio and video are multiplying documentary modes of communication. Once considered an outsider media practice, documentary is finding mass appeal in the allure of moving images, collecting participatory audiences that create meaningful challenges to the social order. Documentary is adept at collecting frames of human experience, challenging those insights, and turning these stories into public knowledge that is palpable for audiences. Generating pathways of exchange between unlikely interlocutors, collective identification forged with documentary discourse constitutes a mode of political agency that is directing energy toward acting in the world. Reflecting experiences of life unfolding before the camera, documentary representations help order social relationships that deepen our public connections and generate collective roots. As digital culture creates new pathways through which information can flow, the connections generated from social change documentary constitute an emerging public commons. Considering the deep ideological divisions that are fracturing U.S. democracy, it is of critical significance to understand how communities negotiate power and difference by way of an expanding documentary commons. Investment in the force of documentary resistance helps cultivate an understanding of political life from the margins, where documentary production practices are a form of survival.
APA, Harvard, Vancouver, ISO, and other styles
2

Anderson, Crystal S. Soul in Seoul. University Press of Mississippi, 2020. http://dx.doi.org/10.14325/mississippi/9781496830098.001.0001.

Full text
Abstract:
Soul in Seoul: African American Popular Music and K-pop examines how K-pop cites musical and performative elements of Black popular music culture as well as the ways that fans outside of Korea understand these citations. K-pop represents a hybridized mode of Korean popular music that emerged in the 1990s with global aspirations. Its hybridity combines musical elements from Korean and foreign cultures, particularly rhythm and blues-based genres (R&B) of African American popular music. Korean pop, R&B and hip-hop solo artists and groups engage in citational practices by simultaneously emulating R&B’s instrumentation and vocals and enhancing R&B by employing Korean musical strategies to such an extent that K-pop becomes part of a global R&B tradition. Korean pop groups use dynamic images and quality musical production to engage in cultural work that culminates the kind of global form of crossover pioneered by Black American music producers. Korean R&B artists, with a focus on vocals, take the R&B tradition beyond the Black-white binary, and Korean hip-hop practitioners use sampling and live instrumentation to promote R&B’s innovative music aesthetics. K-pop artists also cite elements of African American performance in Korean music videos that disrupt limiting representations. K-pop’s citational practices reveal diverse musical aesthetics driven by the interplay of African American popular music and Korean music strategies. As a transcultural fandom, global fans function as part of K-pop’s music press and deem these citational practices authentic. Citational practices also challenge homogenizing modes of globalization by revealing the multiple cultural forces that inform K-pop.
APA, Harvard, Vancouver, ISO, and other styles

Book chapters on the topic "Deep Video Representations"

1

Loban, Rhett. "Designing to produce deep representations." In Embedding Culture into Video Games and Game Design, 140–52. London: Chapman and Hall/CRC, 2023. http://dx.doi.org/10.1201/9781003276289-10.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Yao, Yuan, Zhiyuan Liu, Yankai Lin, and Maosong Sun. "Cross-Modal Representation Learning." In Representation Learning for Natural Language Processing, 211–40. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-1600-9_7.

Full text
Abstract:
Cross-modal representation learning is an essential part of representation learning, which aims to learn semantic representations for different modalities including text, audio, image and video, etc., and their connections. In this chapter, we introduce the development of cross-modal representation learning from shallow to deep, and from respective to unified in terms of model architectures and learning mechanisms for different modalities and tasks. After that, we review how cross-modal capabilities can contribute to complex real-world applications.
APA, Harvard, Vancouver, ISO, and other styles
3

Mao, Feng, Xiang Wu, Hui Xue, and Rong Zhang. "Hierarchical Video Frame Sequence Representation with Deep Convolutional Graph Network." In Lecture Notes in Computer Science, 262–70. Cham: Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-11018-5_24.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Becerra-Riera, Fabiola, Annette Morales-González, and Heydi Méndez-Vázquez. "Exploring Local Deep Representations for Facial Gender Classification in Videos." In Progress in Artificial Intelligence and Pattern Recognition, 104–12. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-030-01132-1_12.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Zhao, Kemeng, Liangrui Peng, Ning Ding, Gang Yao, Pei Tang, and Shengjin Wang. "Deep Representation Learning for License Plate Recognition in Low Quality Video Images." In Advances in Visual Computing, 202–14. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-47966-3_16.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Chen, Yixiong, Chunhui Zhang, Li Liu, Cheng Feng, Changfeng Dong, Yongfang Luo, and Xiang Wan. "USCL: Pretraining Deep Ultrasound Image Diagnosis Model Through Video Contrastive Representation Learning." In Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, 627–37. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-87237-3_60.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Dhurgadevi, M., D. Vimal Kumar, R. Senthilkumar, and K. Gunasekaran. "Detection of Video Anomaly in Public With Deep Learning Algorithm." In Advances in Psychology, Mental Health, and Behavioral Studies, 81–95. IGI Global, 2024. http://dx.doi.org/10.4018/979-8-3693-4143-8.ch004.

Full text
Abstract:
For traffic control and public safety, predicting the movement of people is crucial. The presented scheme entails the development of a wider network that can better satisfy created synthetic images by connecting spatial representations to temporal ones. The authors exclusively use the frames from those occurrences to create the dense optical flow for their corresponding normal events. In order to eliminate false-positive detection findings, they determine the local pixel reconstruction error. A particle prediction model and a likelihood model for weighting these particles are both suggested. These models effectively use the variable-sized cell structure to produce sceneries with variable-sized sub-regions. The framework also successfully extracts and utilizes the video frame's size, motion, and position information. On the UCSD and LIVE datasets, the proposed framework is evaluated against the most recent algorithms reported in the literature. With a significantly shorter processing time, the suggested technique surpasses state-of-the-art techniques in terms of equal error rate.
APA, Harvard, Vancouver, ISO, and other styles
8

Asma, Stephen T. "Drama In The Diorama: The Confederation & Art and Science." In Stuffed Animals & pickled Heads, 240–88. Oxford University PressNew York, NY, 2001. http://dx.doi.org/10.1093/oso/9780195130508.003.0007.

Full text
Abstract:
The museums that we've studied throughout this journey reveal the tremendous diversity of goals and motives for collecting and displaying elements of the natural world. Yet underneath all these various constructions of nature, there has been a continuous dialogue between image-making activities and knowledge-producing activities. Unlike texts, natural history museums are inherently aesthetic representations of science in particular and conceptual ideas in general. The fact that a roulette wheel at the Field could touch the central nerves of our deep metaphysical convictions is an indication of a museum's epistemic potential. After spending long stretches in many natural history museums, one begins to see that a display's potential for education and transformation is largely a function of its artistic, nondiscursive character. Three-dimensional representations of nature (dioramas), two-dimensional and three-dimensional representations of concepts (such as the roulette wheel), and visual images generally are not just candy coatings on the real educational process of textual information transmission. This chapter explores how and why visual communication works on museum visitors. And this requires an examination of the more general issue of how images themselves can be pedagogical, an issue that extends from da Vinci's anatomy drawings to the latest video edutainment technology. These issues lead to a survey of some of the most recent trends in museology, followed by some reflections on the museum at the millennium.
APA, Harvard, Vancouver, ISO, and other styles
9

Verma, Gyanendra K. "Emotions Modelling in 3D Space." In Multimodal Affective Computing: Affective Information Representation, Modelling, and Analysis, 128–47. BENTHAM SCIENCE PUBLISHERS, 2023. http://dx.doi.org/10.2174/9789815124453123010013.

Full text
Abstract:
In this study, we have discussed emotion representation in two- and three-dimensional space. The three-dimensional space is based on the three emotion primitives, i.e., valence, arousal, and dominance. The multimodal cues used in this study are EEG, physiological signals, and video (under limitations). Due to the limited emotional content in videos from the DEAP database, we have considered only three classes of emotions, i.e., happy, sad, and terrible. The wavelet transform, a classical transform, was employed for multi-resolution analysis of the signals to extract features. We have evaluated the proposed emotion model with the standard multimodal dataset DEAP. The experimental results show that SVM and MLP can predict emotions in single and multimodal cues.
APA, Harvard, Vancouver, ISO, and other styles
10

Nandal, Priyanka. "Motion Imitation for Monocular Videos." In Examining the Impact of Deep Learning and IoT on Multi-Industry Applications, 118–35. IGI Global, 2021. http://dx.doi.org/10.4018/978-1-7998-7511-6.ch008.

Full text
Abstract:
This work presents a simple method for motion transfer (i.e., given a source video of a subject [person] performing some movements or in motion, that movement/motion is transferred to an amateur target in a different motion). The pose is used as an intermediate representation to perform this translation. To transfer the motion of the source subject to the target subject, the pose is extracted from the source subject, and then the target subject is generated by applying the learned pose-to-appearance mapping. To perform this translation, the video is considered as a set of images consisting of all the frames. Generative adversarial networks (GANs) are used to transfer the motion from the source subject to the target subject. GANs are an evolving field of deep learning.
APA, Harvard, Vancouver, ISO, and other styles

Conference papers on the topic "Deep Video Representations"

1

Morere, Olivier, Hanlin Goh, Antoine Veillard, Vijay Chandrasekhar, and Jie Lin. "Co-regularized deep representations for video summarization." In 2015 IEEE International Conference on Image Processing (ICIP). IEEE, 2015. http://dx.doi.org/10.1109/icip.2015.7351387.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Yu, Feiwu, Xinxiao Wu, Yuchao Sun, and Lixin Duan. "Exploiting Images for Video Recognition with Hierarchical Generative Adversarial Networks." In Twenty-Seventh International Joint Conference on Artificial Intelligence {IJCAI-18}. California: International Joint Conferences on Artificial Intelligence Organization, 2018. http://dx.doi.org/10.24963/ijcai.2018/154.

Full text
Abstract:
Existing deep learning methods of video recognition usually require a large number of labeled videos for training. But for a new task, videos are often unlabeled and it is also time-consuming and labor-intensive to annotate them. Instead of human annotation, we try to make use of existing fully labeled images to help recognize those videos. However, due to the problem of domain shifts and heterogeneous feature representations, the performance of classifiers trained on images may be dramatically degraded for video recognition tasks. In this paper, we propose a novel method, called Hierarchical Generative Adversarial Networks (HiGAN), to enhance recognition in videos (i.e., target domain) by transferring knowledge from images (i.e., source domain). The HiGAN model consists of a low-level conditional GAN and a high-level conditional GAN. By taking advantage of these two levels of adversarial learning, our method is capable of learning a domain-invariant feature representation of source images and target videos. Comprehensive experiments on two challenging video recognition datasets (i.e. UCF101 and HMDB51) demonstrate the effectiveness of the proposed method when compared with the existing state-of-the-art domain adaptation methods.
APA, Harvard, Vancouver, ISO, and other styles
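A generic, hedged sketch of the adversarial feature-alignment step underlying image-to-video transfer methods such as HiGAN (entry 2): a domain discriminator distinguishes image features from video features while the feature extractors are trained to fool it. This is a single-level stand-in, not the paper's two-level architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim = 512  # assumed feature dimensionality
discriminator = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                              nn.Linear(256, 1))

# In a real pipeline these would come from the image/video feature extractors.
img_feat = torch.randn(8, feat_dim, requires_grad=True)   # source domain (images)
vid_feat = torch.randn(8, feat_dim, requires_grad=True)   # target domain (videos)

# Discriminator step: label image features 1, video features 0.
d_logits = torch.cat([discriminator(img_feat.detach()),
                      discriminator(vid_feat.detach())])
d_labels = torch.cat([torch.ones(8, 1), torch.zeros(8, 1)])
d_loss = F.binary_cross_entropy_with_logits(d_logits, d_labels)

# Extractor step: make video features indistinguishable from image features.
g_loss = F.binary_cross_entropy_with_logits(discriminator(vid_feat),
                                             torch.ones(8, 1))
```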
3

Pernici, Federico, Federico Bartoli, Matteo Bruni, and Alberto Del Bimbo. "Memory Based Online Learning of Deep Representations from Video Streams." In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018. http://dx.doi.org/10.1109/cvpr.2018.00247.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Jung, Ilchae, Minji Kim, Eunhyeok Park, and Bohyung Han. "Online Hybrid Lightweight Representations Learning: Its Application to Visual Tracking." In Thirty-First International Joint Conference on Artificial Intelligence {IJCAI-22}. California: International Joint Conferences on Artificial Intelligence Organization, 2022. http://dx.doi.org/10.24963/ijcai.2022/140.

Full text
Abstract:
This paper presents a novel hybrid representation learning framework for streaming data, where an image frame in a video is modeled by an ensemble of two distinct deep neural networks; one is a low-bit quantized network and the other is a lightweight full-precision network. The former learns coarse primary information with low cost while the latter conveys residual information for high fidelity to original representations. The proposed parallel architecture is effective to maintain complementary information since fixed-point arithmetic can be utilized in the quantized network and the lightweight model provides precise representations given by a compact channel-pruned network. We incorporate the hybrid representation technique into an online visual tracking task, where deep neural networks need to handle temporal variations of target appearances in real-time. Compared to the state-of-the-art real-time trackers based on conventional deep neural networks, our tracking algorithm demonstrates competitive accuracy on the standard benchmarks with a small fraction of computational cost and memory footprint.
APA, Harvard, Vancouver, ISO, and other styles
5

Garcia-Gonzalez, Jorge, Rafael M. Luque-Baena, Juan M. Ortiz-de-Lazcano-Lobato, and Ezequiel Lopez-Rubio. "Moving Object Detection in Noisy Video Sequences Using Deep Convolutional Disentangled Representations." In 2022 IEEE International Conference on Image Processing (ICIP). IEEE, 2022. http://dx.doi.org/10.1109/icip46576.2022.9897305.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Parchami, Mostafa, Saman Bashbaghi, Eric Granger, and Saif Sayed. "Using deep autoencoders to learn robust domain-invariant representations for still-to-video face recognition." In 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2017. http://dx.doi.org/10.1109/avss.2017.8078553.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Bueno-Benito, Elena, Biel Tura, and Mariella Dimiccoli. "Leveraging Triplet Loss for Unsupervised Action Segmentation." In LatinX in AI at Computer Vision and Pattern Recognition Conference 2023. Journal of LatinX in AI Research, 2023. http://dx.doi.org/10.52591/lxai202306185.

Full text
Abstract:
In this paper, we propose a novel fully unsupervised framework that learns action representations suitable for the action segmentation task from the single input video itself, without requiring any training data. Our method is a deep metric learning approach rooted in a shallow network with a triplet loss operating on similarity distributions and a novel triplet selection strategy that effectively models temporal and semantic priors to discover actions in the new representational space. Under these circumstances, we successfully recover temporal boundaries in the learned action representations with higher quality compared with existing unsupervised approaches. The proposed method is evaluated on two widely used benchmark datasets for the action segmentation task and it achieves competitive performance by applying a generic clustering algorithm on the learned representations.
APA, Harvard, Vancouver, ISO, and other styles
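The core mechanism of entry 7, a triplet loss over frame embeddings of a single video, can be sketched with the standard PyTorch triplet margin loss and a naive temporal positive/negative selection; the paper's actual triplet selection strategy and similarity distributions are more elaborate.

```python
import torch
import torch.nn as nn

# Shallow embedding network over per-frame features (dimensions assumed).
embed = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 64))
triplet = nn.TripletMarginLoss(margin=1.0)

frame_feats = torch.randn(300, 2048)            # per-frame features of one video
z = embed(frame_feats)

anchor_idx = torch.arange(1, 299)
positive = z[anchor_idx + 1]                    # temporally adjacent frame as positive
negative = z[torch.randint(0, 300, (298,))]     # random, likely distant frame as negative
loss = triplet(z[anchor_idx], positive, negative)
```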
8

Kich, Victor Augusto, Junior Costa de Jesus, Ricardo Bedin Grando, Alisson Henrique Kolling, Gabriel Vinícius Heisler, and Rodrigo da Silva Guerra. "Deep Reinforcement Learning Using a Low-Dimensional Observation Filter for Visual Complex Video Game Playing." In Anais Estendidos do Simpósio Brasileiro de Games e Entretenimento Digital. Sociedade Brasileira de Computação, 2021. http://dx.doi.org/10.5753/sbgames_estendido.2021.19659.

Full text
Abstract:
Deep Reinforcement Learning (DRL) has produced great achievements since it was proposed, including the possibility of processing raw vision input data. However, training an agent to perform tasks based on image feedback remains a challenge. It requires the processing of large amounts of data from high-dimensional observation spaces, frame by frame, and the agent's actions are computed according to deep neural network policies, end-to-end. Image pre-processing is an effective way of reducing these high dimensional spaces, eliminating unnecessary information present in the scene, supporting the extraction of features and their representations in the agent's neural network. Modern video-games are examples of this type of challenge for DRL algorithms because of their visual complexity. In this paper, we propose a low-dimensional observation filter that allows a deep Q-network agent to successfully play in a visually complex and modern video-game, called Neon Drive.
APA, Harvard, Vancouver, ISO, and other styles
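A minimal sketch of a low-dimensional observation filter in the spirit of entry 8: a raw game frame is converted to grayscale and downsampled before being passed to the Q-network. The output size and downsampling scheme are assumptions.

```python
import numpy as np

def observation_filter(frame, out_size=(84, 84)):
    """Reduce a raw RGB game frame to a small grayscale observation:
    convert to luminance, then downsample by striding."""
    gray = frame @ np.array([0.299, 0.587, 0.114])          # (H, W) luminance
    step_h = gray.shape[0] // out_size[0]
    step_w = gray.shape[1] // out_size[1]
    small = gray[::step_h, ::step_w][:out_size[0], :out_size[1]]
    return (small / 255.0).astype(np.float32)

frame = np.random.randint(0, 256, (720, 1280, 3)).astype(np.float32)  # assumed frame size
obs = observation_filter(frame)
print(obs.shape)  # (84, 84) -- fed to the deep Q-network instead of the raw frame
```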
9

Fan, Tingyu, Linyao Gao, Yiling Xu, Zhu Li, and Dong Wang. "D-DPCC: Deep Dynamic Point Cloud Compression via 3D Motion Prediction." In Thirty-First International Joint Conference on Artificial Intelligence {IJCAI-22}. California: International Joint Conferences on Artificial Intelligence Organization, 2022. http://dx.doi.org/10.24963/ijcai.2022/126.

Full text
Abstract:
The non-uniformly distributed nature of the 3D Dynamic Point Cloud (DPC) brings significant challenges to its high-efficient inter-frame compression. This paper proposes a novel 3D sparse convolution-based Deep Dynamic Point Cloud Compression (D-DPCC) network to compensate and compress the DPC geometry with 3D motion estimation and motion compensation in the feature space. In the proposed D-DPCC network, we design a Multi-scale Motion Fusion (MMF) module to accurately estimate the 3D optical flow between the feature representations of adjacent point cloud frames. Specifically, we utilize a 3D sparse convolution-based encoder to obtain the latent representation for motion estimation in the feature space and introduce the proposed MMF module for fused 3D motion embedding. Besides, for motion compensation, we propose a 3D Adaptively Weighted Interpolation (3DAWI) algorithm with a penalty coefficient to adaptively decrease the impact of distant neighbours. We compress the motion embedding and the residual with a lossy autoencoder-based network. To our knowledge, this paper is the first work proposing an end-to-end deep dynamic point cloud compression framework. The experimental result shows that the proposed D-DPCC framework achieves an average 76% BD-Rate (Bjontegaard Delta Rate) gains against state-of-the-art Video-based Point Cloud Compression (V-PCC) v13 in inter mode.
APA, Harvard, Vancouver, ISO, and other styles
10

Li, Yang, Kan Li, and Xinxin Wang. "Deeply-Supervised CNN Model for Action Recognition with Trainable Feature Aggregation." In Twenty-Seventh International Joint Conference on Artificial Intelligence {IJCAI-18}. California: International Joint Conferences on Artificial Intelligence Organization, 2018. http://dx.doi.org/10.24963/ijcai.2018/112.

Full text
Abstract:
In this paper, we propose a deeply-supervised CNN model for action recognition that fully exploits powerful hierarchical features of CNNs. In this model, we build multi-level video representations by applying our proposed aggregation module at different convolutional layers. Moreover, we train this model in a deep supervision manner, which brings improvement in both performance and efficiency. Meanwhile, in order to capture the temporal structure as well as preserve more details about actions, we propose a trainable aggregation module. It models the temporal evolution of each spatial location and projects them into a semantic space using the Vector of Locally Aggregated Descriptors (VLAD) technique. This deeply-supervised CNN model integrating the powerful aggregation module provides a promising solution to recognize actions in videos. We conduct experiments on two action recognition datasets: HMDB51 and UCF101. Results show that our model outperforms the state-of-the-art methods.
APA, Harvard, Vancouver, ISO, and other styles
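The trainable aggregation module in entry 10 encodes the temporal evolution of features with a VLAD-style projection. Below is a hedged NetVLAD-like sketch (soft assignment to learned centroids, residual aggregation, and normalization); the cluster count and feature size are assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrainableVLAD(nn.Module):
    """Soft-assignment VLAD pooling over per-frame descriptors; a rough
    stand-in for a trainable aggregation module."""
    def __init__(self, feat_dim=512, num_clusters=16):
        super().__init__()
        self.assign = nn.Linear(feat_dim, num_clusters)
        self.centroids = nn.Parameter(torch.randn(num_clusters, feat_dim))

    def forward(self, x):                          # x: (batch, time, feat_dim)
        a = F.softmax(self.assign(x), dim=-1)      # soft assignment (batch, time, K)
        residual = x.unsqueeze(2) - self.centroids             # (batch, time, K, D)
        vlad = (a.unsqueeze(-1) * residual).sum(dim=1)         # (batch, K, D)
        vlad = F.normalize(vlad, dim=-1).flatten(1)            # intra-normalize + flatten
        return F.normalize(vlad, dim=-1)

feats = torch.randn(2, 25, 512)                    # 25 frames of CNN features (assumed)
video_desc = TrainableVLAD()(feats)
print(video_desc.shape)  # torch.Size([2, 8192])
```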
