Journal articles on the topic "Visual and semantic embedding"


Listed below are the top 50 journal articles for research on the topic "Visual and semantic embedding".


Browse journal articles from a wide range of scientific fields and compile an accurate bibliography.

1

Zhang, Yuanpeng, Jingye Guan, Haobo Wang, Kaiming Li, Ying Luo, and Qun Zhang. "Generalized Zero-Shot Space Target Recognition Based on Global-Local Visual Feature Embedding Network". Remote Sensing 15, no. 21 (October 28, 2023): 5156. http://dx.doi.org/10.3390/rs15215156.

Abstract:
Existing deep learning-based space target recognition methods rely on abundantly labeled samples and are not capable of recognizing samples from unseen classes without training. In this article, based on generalized zero-shot learning (GZSL), we propose a space target recognition framework to simultaneously recognize space targets from both seen and unseen classes. First, we defined semantic attributes to describe the characteristics of different categories of space targets. Second, we constructed a dual-branch neural network, termed the global-local visual feature embedding network (GLVFENet), which jointly learns global and local visual features to obtain discriminative feature representations, thereby achieving GZSL for space targets with higher accuracy. Specifically, the global visual feature embedding subnetwork (GVFE-Subnet) calculates the compatibility score by measuring the cosine similarity between the projection of global visual features in the semantic space and various semantic vectors, thereby obtaining global visual embeddings. The local visual feature embedding subnetwork (LVFE-Subnet) introduces soft space attention, and an encoder discovers the semantic-guided local regions in the image to then generate local visual embeddings. Finally, the visual embeddings from both branches were combined and matched with semantics. The calibrated stacking method is introduced to achieve GZSL recognition of space targets. Extensive experiments were conducted on an electromagnetic simulation dataset of nine categories of space targets, and the effectiveness of our GLVFENet is confirmed.
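For readers unfamiliar with the two inference ingredients named in this abstract, here is a minimal NumPy sketch of cosine-similarity compatibility scoring combined with calibrated stacking. It is an illustration under assumed inputs (the names `W`, `A`, `seen_mask`, and `gamma` are hypothetical), not the authors' implementation.

```python
import numpy as np

def gzsl_predict(visual_feat, W, A, seen_mask, gamma=0.5):
    """Toy GZSL inference: cosine compatibility plus calibrated stacking.

    visual_feat: (d,) image feature; W: (k, d) projection into the semantic
    space; A: (C, k) per-class semantic/attribute vectors; seen_mask: (C,)
    boolean array marking seen classes; gamma: calibration constant
    subtracted from seen-class scores (calibrated stacking).
    """
    v = W @ visual_feat                          # project into semantic space
    v = v / (np.linalg.norm(v) + 1e-12)
    A_n = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    scores = A_n @ v                             # cosine compatibility per class
    scores = scores - gamma * seen_mask          # penalize seen classes
    return int(np.argmax(scores))
```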
2

Yeh, Mei-Chen, and Yi-Nan Li. "Multilabel Deep Visual-Semantic Embedding". IEEE Transactions on Pattern Analysis and Machine Intelligence 42, no. 6 (June 1, 2020): 1530–36. http://dx.doi.org/10.1109/tpami.2019.2911065.
3

Merkx, Danny, and Stefan L. Frank. "Learning semantic sentence representations from visually grounded language without lexical knowledge". Natural Language Engineering 25, no. 4 (July 2019): 451–66. http://dx.doi.org/10.1017/s1351324919000196.

Abstract:
Current approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence level semantics without the need for word embeddings. We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings. Deep Neural Networks are trained to map the two modalities to a common embedding space such that for an image the corresponding caption can be retrieved and vice versa. We show that our model achieves results comparable to the current state of the art on two popular image-caption retrieval benchmark datasets: Microsoft Common Objects in Context (MSCOCO) and Flickr8k. We evaluate the semantic content of the resulting sentence embeddings using the data from the Semantic Textual Similarity (STS) benchmark task and show that the multimodal embeddings correlate well with human semantic similarity judgements. The system achieves state-of-the-art results on several of these benchmarks, which shows that a system trained solely on multimodal data, without assuming any word representations, is able to capture sentence level semantics. Importantly, this result shows that we do not need prior knowledge of lexical level semantics in order to model sentence level semantics. These findings demonstrate the importance of visual information in semantics.
4

Zhou, Mo, Zhenxing Niu, Le Wang, Zhanning Gao, Qilin Zhang, and Gang Hua. "Ladder Loss for Coherent Visual-Semantic Embedding". Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 13050–57. http://dx.doi.org/10.1609/aaai.v34i07.7006.

Abstract:
For visual-semantic embedding, the existing methods normally treat the relevance between queries and candidates in a bipolar way – relevant or irrelevant, and all “irrelevant” candidates are uniformly pushed away from the query by an equal margin in the embedding space, regardless of their various proximity to the query. This practice disregards relatively discriminative information and could lead to suboptimal ranking in the retrieval results and poorer user experience, especially in the long-tail query scenario where a matching candidate may not necessarily exist. In this paper, we introduce a continuous variable to model the relevance degree between queries and multiple candidates, and propose to learn a coherent embedding space, where candidates with higher relevance degrees are mapped closer to the query than those with lower relevance degrees. In particular, the new ladder loss is proposed by extending the triplet loss inequality to a more general inequality chain, which implements variable push-away margins according to respective relevance degrees. In addition, a proper Coherent Score metric is proposed to better measure the ranking results including those “irrelevant” candidates. Extensive experiments on multiple datasets validate the efficacy of our proposed method, which achieves significant improvement over existing state-of-the-art methods.
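As a rough illustration of how a triplet margin can be extended to an inequality chain over several relevance levels, the following Python sketch computes a ladder-style loss for a single query; the margin schedule and the inputs are assumptions made for exposition, not the paper's exact formulation.

```python
import numpy as np

def ladder_loss(sims, levels, margins):
    """Toy ladder loss: a candidate at a higher relevance level should be
    more similar to the query than any candidate at a lower level, by a
    margin that depends on the gap between their levels.

    sims: (N,) query-candidate similarities; levels: (N,) integer relevance
    levels (higher = more relevant); margins: dict mapping a level gap
    (1, 2, ...) to its margin.
    """
    loss = 0.0
    for i in range(len(sims)):
        for j in range(len(sims)):
            gap = levels[i] - levels[j]
            if gap > 0:  # candidate i is more relevant than candidate j
                m = margins.get(gap, max(margins.values()))
                loss += max(0.0, m + sims[j] - sims[i])
    return loss
```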
5

Ge, Jiannan, Hongtao Xie, Shaobo Min, and Yongdong Zhang. "Semantic-guided Reinforced Region Embedding for Generalized Zero-Shot Learning". Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 2 (May 18, 2021): 1406–14. http://dx.doi.org/10.1609/aaai.v35i2.16230.

Abstract:
Generalized zero-shot Learning (GZSL) aims to recognize images from either seen or unseen domain, mainly by learning a joint embedding space to associate image features with the corresponding category descriptions. Recent methods have proved that localizing important object regions can effectively bridge the semantic-visual gap. However, these are all based on one-off visual localizers, lacking of interpretability and flexibility. In this paper, we propose a novel Semantic-guided Reinforced Region Embedding (SR2E) network that can localize important objects in the long-term interests to construct semantic-visual embedding space. SR2E consists of Reinforced Region Module (R2M) and Semantic Alignment Module (SAM). First, without the annotated bounding box as supervision, R2M encodes the semantic category guidance into the reward and punishment criteria to teach the localizer serialized region searching. Besides, R2M explores different action spaces during the serialized searching path to avoid local optimal localization, which thereby generates discriminative visual features with less redundancy. Second, SAM preserves the semantic relationship into visual features via semantic-visual alignment and designs a domain detector to alleviate the domain confusion. Experiments on four public benchmarks demonstrate that the proposed SR2E is an effective GZSL method with reinforced embedding space, which obtains averaged 6.1% improvements.
6

Nguyen, Huy Manh, Tomo Miyazaki, Yoshihiro Sugaya, and Shinichiro Omachi. "Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence". Applied Sciences 11, no. 7 (April 3, 2021): 3214. http://dx.doi.org/10.3390/app11073214.

Abstract:
Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances are located close to each other. Most existing methods put instances in a single embedding space. However, they struggle to embed instances due to the difficulty of matching visual dynamics in videos to textual features in sentences. A single space is not enough to accommodate various videos and sentences. In this paper, we propose a novel framework that maps instances into multiple individual embedding spaces so that we can capture multiple relationships between instances, leading to compelling video retrieval. We propose to produce a final similarity between instances by fusing similarities measured in each embedding space using a weighted sum strategy. We determine the weights according to a sentence. Therefore, we can flexibly emphasize an embedding space. We conducted sentence-to-video retrieval experiments on a benchmark dataset. The proposed method achieved superior performance, and the results are competitive to state-of-the-art methods. These experimental results demonstrated the effectiveness of the proposed multiple embedding approach compared to existing methods.
7

Matsubara, Takashi. "Target-Oriented Deformation of Visual-Semantic Embedding Space". IEICE Transactions on Information and Systems E104.D, no. 1 (January 1, 2021): 24–33. http://dx.doi.org/10.1587/transinf.2020mup0003.
8

Tang, Qi, Yao Zhao, Meiqin Liu, Jian Jin, and Chao Yao. "Semantic Lens: Instance-Centric Semantic Alignment for Video Super-resolution". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 6 (March 24, 2024): 5154–61. http://dx.doi.org/10.1609/aaai.v38i6.28321.

Abstract:
As a critical clue of video super-resolution (VSR), inter-frame alignment significantly impacts overall performance. However, accurate pixel-level alignment is a challenging task due to the intricate motion interweaving in the video. In response to this issue, we introduce a novel paradigm for VSR named Semantic Lens, predicated on semantic priors drawn from degraded videos. Specifically, video is modeled as instances, events, and scenes via a Semantic Extractor. Those semantics assist the Pixel Enhancer in understanding the recovered contents and generating more realistic visual results. The distilled global semantics embody the scene information of each frame, while the instance-specific semantics assemble the spatial-temporal contexts related to each instance. Furthermore, we devise a Semantics-Powered Attention Cross-Embedding (SPACE) block to bridge the pixel-level features with semantic knowledge, composed of a Global Perspective Shifter (GPS) and an Instance-Specific Semantic Embedding Encoder (ISEE). Concretely, the GPS module generates pairs of affine transformation parameters for pixel-level feature modulation conditioned on global semantics. After that the ISEE module harnesses the attention mechanism to align the adjacent frames in the instance-centric semantic space. In addition, we incorporate a simple yet effective pre-alignment module to alleviate the difficulty of model training. Extensive experiments demonstrate the superiority of our model over existing state-of-the-art VSR methods.
9

Keller, Patrick, Abdoul Kader Kaboré, Laura Plein, Jacques Klein, Yves Le Traon, and Tegawendé F. Bissyandé. "What You See is What it Means! Semantic Representation Learning of Code based on Visualization and Transfer Learning". ACM Transactions on Software Engineering and Methodology 31, no. 2 (April 30, 2022): 1–34. http://dx.doi.org/10.1145/3485135.

Abstract:
Recent successes in training word embeddings for Natural Language Processing (NLP) tasks have encouraged a wave of research on representation learning for source code, which builds on similar NLP methods. The overall objective is then to produce code embeddings that capture the maximum of program semantics. State-of-the-art approaches invariably rely on a syntactic representation (i.e., raw lexical tokens, abstract syntax trees, or intermediate representation tokens) to generate embeddings, which are criticized in the literature as non-robust or non-generalizable. In this work, we investigate a novel embedding approach based on the intuition that source code has visual patterns of semantics. We further use these patterns to address the outstanding challenge of identifying semantic code clones. We propose the WySiWiM ("What You See Is What It Means") approach, where visual representations of source code are fed into powerful pre-trained image classification neural networks from the field of computer vision to benefit from the practical advantages of transfer learning. We evaluate the proposed embedding approach on the task of vulnerable code prediction in source code and on two variations of the task of semantic code clone identification: code clone detection (a binary classification problem), and code classification (a multi-classification problem). We show with experiments on BigCloneBench (Java) and Open Judge (C) that, although simple, our WySiWiM approach performs as effectively as state-of-the-art approaches such as ASTNN or TBCNN. We also show with data from NVD and SARD that the WySiWiM representation can be used to learn a vulnerable code detector with reasonable performance (accuracy ∼90%). We further explore the influence of different steps in our approach, such as the choice of visual representations or the classification algorithm, to eventually discuss the promises and limitations of this research direction.
10

He, Hai, and Haibo Yang. "Deep Visual Semantic Embedding with Text Data Augmentation and Word Embedding Initialization". Mathematical Problems in Engineering 2021 (May 28, 2021): 1–8. http://dx.doi.org/10.1155/2021/6654071.

Abstract:
Language and vision are the two most essential parts of human intelligence for interpreting the real world around us. How to make connections between language and vision is the key point in current research. Multimodality methods like visual semantic embedding have been widely studied recently, which unify images and corresponding texts into the same feature space. Inspired by the recent development of text data augmentation and a simple but powerful technique proposed called EDA (easy data augmentation), we can expand the information with given data using EDA to improve the performance of models. In this paper, we take advantage of the text data augmentation technique and word embedding initialization for multimodality retrieval. We utilize EDA for text data augmentation, word embedding initialization for text encoder based on recurrent neural networks, and minimizing the gap between the two spaces by triplet ranking loss with hard negative mining. On two Flickr-based datasets, we achieve the same recall with only 60% of the training dataset as the normal training with full available data. Experiment results show the improvement of our proposed model; and, on all datasets in this paper (Flickr8k, Flickr30k, and MS-COCO), our model performs better on image annotation and image retrieval tasks; the experiments also demonstrate that text data augmentation is more suitable for smaller datasets, while word embedding initialization is suitable for larger ones.
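The triplet ranking loss with hard negative mining mentioned here follows the widely used max-of-hinges form (popularized by VSE++); below is a compact NumPy sketch over an in-batch image-caption similarity matrix, assuming the matching pairs lie on the diagonal. It is a generic sketch, not the paper's code.

```python
import numpy as np

def hard_negative_ranking_loss(S, margin=0.2):
    """Max-of-hinges triplet ranking loss over an in-batch similarity
    matrix S, where S[i, j] is the similarity of image i and caption j
    and the diagonal holds the matching pairs; the hardest in-batch
    negative is used in each direction (image-to-text and text-to-image)."""
    n = S.shape[0]
    pos = np.diag(S)
    mask = np.eye(n, dtype=bool)
    S_neg = np.where(mask, -np.inf, S)       # exclude the positive pair
    hardest_text = S_neg.max(axis=1)         # hardest caption for each image
    hardest_img = S_neg.max(axis=0)          # hardest image for each caption
    loss_i2t = np.maximum(0.0, margin + hardest_text - pos)
    loss_t2i = np.maximum(0.0, margin + hardest_img - pos)
    return float((loss_i2t + loss_t2i).mean())
```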
11

Chen, Shiming, Ziming Hong, Yang Liu, Guo-Sen Xie, Baigui Sun, Hao Li, Qinmu Peng, Ke Lu, and Xinge You. "TransZero: Attribute-Guided Transformer for Zero-Shot Learning". Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 1 (June 28, 2022): 330–38. http://dx.doi.org/10.1609/aaai.v36i1.19909.

Abstract:
Zero-shot learning (ZSL) aims to recognize novel classes by transferring semantic knowledge from seen classes to unseen ones. Semantic knowledge is learned from attribute descriptions shared between different classes, which are strong prior for localization of object attribute for representing discriminative region features enabling significant visual-semantic interaction. Although few attention-based models have attempted to learn such region features in a single image, the transferability and discriminative attribute localization of visual features are typically neglected. In this paper, we propose an attribute-guided Transformer network to learn the attribute localization for discriminative visual-semantic embedding representations in ZSL, termed TransZero. Specifically, TransZero takes a feature augmentation encoder to alleviate the cross-dataset bias between ImageNet and ZSL benchmarks and improve the transferability of visual features by reducing the entangled relative geometry relationships among region features. To learn locality-augmented visual features, TransZero employs a visual-semantic decoder to localize the most relevant image regions to each attributes from a given image under the guidance of attribute semantic information. Then, the locality-augmented visual features and semantic vectors are used for conducting effective visual-semantic interaction in a visual-semantic embedding network. Extensive experiments show that TransZero achieves a new state-of-the-art on three ZSL benchmarks. The codes are available at: https://github.com/shiming-chen/TransZero.
12

Seo, Sanghyun, and Juntae Kim. "Hierarchical Semantic Loss and Confidence Estimator for Visual-Semantic Embedding-Based Zero-Shot Learning". Applied Sciences 9, no. 15 (August 2, 2019): 3133. http://dx.doi.org/10.3390/app9153133.

Abstract:
Traditional supervised learning is dependent on the label of the training data, so there is a limitation that the class label which is not included in the training data cannot be recognized properly. Therefore, zero-shot learning, which can recognize unseen-classes that are not used in training, is gaining research interest. One approach to zero-shot learning is to embed visual data such as images and rich semantic data related to text labels of visual data into a common vector space to perform zero-shot cross-modal retrieval on newly input unseen-class data. This paper proposes a hierarchical semantic loss and confidence estimator to more efficiently perform zero-shot learning on visual data. Hierarchical semantic loss improves learning efficiency by using hierarchical knowledge in selecting a negative sample of triplet loss, and the confidence estimator estimates the confidence score to determine whether it is seen-class or unseen-class. These methodologies improve the performance of zero-shot learning by adjusting distances from a semantic vector to visual vector when performing zero-shot cross-modal retrieval. Experimental results show that the proposed method can improve the performance of zero-shot learning in terms of hit@k accuracy.
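The idea of using hierarchical knowledge to choose triplet negatives can be pictured with a small hypothetical helper that prefers sibling classes under the same parent node; the `parent` mapping and the sibling preference are assumptions made for illustration and do not reproduce the paper's exact sampling rule.

```python
import random

def pick_negative(anchor_class, classes, parent):
    """Prefer a semantically 'hard' negative that shares a parent with the
    anchor class in the hierarchy; fall back to any other class."""
    siblings = [c for c in classes
                if c != anchor_class and parent.get(c) == parent.get(anchor_class)]
    if siblings:
        return random.choice(siblings)
    others = [c for c in classes if c != anchor_class]
    return random.choice(others)
```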
13

Liu, Huixia, and Zhihong Qin. "Deep quantization network with visual-semantic alignment for zero-shot image retrieval". Electronic Research Archive 31, no. 7 (2023): 4232–47. http://dx.doi.org/10.3934/era.2023215.

Abstract:
Approximate nearest neighbor (ANN) search has become an essential paradigm for large-scale image retrieval. Conventional ANN search requires the categories of query images to be seen in the training set. However, facing the rapid evolution of newly-emerging concepts on the web, it is too expensive to retrain the model by collecting labeled data with the new (unseen) concepts. Existing zero-shot hashing methods choose the semantic space or an intermediate space as the embedding space, which ignores the inconsistency between the visual space and the semantic space and suffers from the hubness problem on the zero-shot image retrieval task. In this paper, we present a novel deep quantization network with visual-semantic alignment for efficient zero-shot image retrieval. Specifically, we adopt a multi-task architecture that is capable of 1) learning discriminative and polymeric image representations for facilitating the visual-semantic alignment; 2) learning discriminative semantic embeddings for knowledge transfer; and 3) learning compact binary codes for aligning the visual space and the semantic space. We compare the proposed method with several state-of-the-art methods on several benchmark datasets, and the experimental results validate the superiority of the proposed method.
14

Gorniak, P., and D. Roy. "Grounded Semantic Composition for Visual Scenes". Journal of Artificial Intelligence Research 21 (April 1, 2004): 429–70. http://dx.doi.org/10.1613/jair.1327.

Abstract:
We present a visually-grounded language understanding model based on a study of how people verbally describe objects in scenes. The emphasis of the model is on the combination of individual word meanings to produce meanings for complex referring expressions. The model has been implemented, and it is able to understand a broad range of spatial referring expressions. We describe our implementation of word level visually-grounded semantics and their embedding in a compositional parsing framework. The implemented system selects the correct referents in response to natural language expressions for a large percentage of test cases. In an analysis of the system's successes and failures we reveal how visual context influences the semantics of utterances and propose future extensions to the model that take such context into account.
15

Ma, Peirong, and Xiao Hu. "A Variational Autoencoder with Deep Embedding Model for Generalized Zero-Shot Learning". Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11733–40. http://dx.doi.org/10.1609/aaai.v34i07.6844.

Abstract:
Generalized zero-shot learning (GZSL) is a challenging task that aims to recognize not only unseen classes unavailable during training, but also seen classes used at training stage. It is achieved by transferring knowledge from seen classes to unseen classes via a shared semantic space (e.g. attribute space). Most existing GZSL methods usually learn a cross-modal mapping between the visual feature space and the semantic space. However, the mapping model learned only from the seen classes will produce an inherent bias when used in the unseen classes. In order to tackle such a problem, this paper integrates a deep embedding network (DE) and a modified variational autoencoder (VAE) into a novel model (DE-VAE) to learn a latent space shared by both image features and class embeddings. Specifically, the proposed model firstly employs DE to learn the mapping from the semantic space to the visual feature space, and then utilizes VAE to transform both original visual features and the features obtained by the mapping into latent features. Finally, the latent features are used to train a softmax classifier. Extensive experiments on four GZSL benchmark datasets show that the proposed model significantly outperforms the state of the arts.
16

K. Dinesh Kumar, et al. "Visual Storytelling: A Generative Adversarial Networks (GANs) and Graph Embedding Framework". International Journal on Recent and Innovation Trends in Computing and Communication 11, no. 9 (November 5, 2023): 1899–906. http://dx.doi.org/10.17762/ijritcc.v11i9.9184.

Abstract:
Visual storytelling is a powerful educational tool, using image sequences to convey complex ideas and establish emotional connections with the audience. A study at the Chinese University of Hong Kong found that 92.7% of students prefer visual storytelling through animation over text alone [21]. Our approach integrates dual coding and propositional theory to generate visual representations of text, such as graphs and images, thereby enhancing students' memory retention and visualization skills. We use Generative Adversarial Networks (GANs) with graph data to generate images while preserving semantic consistency across objects, encompassing their attributes and relationships. By incorporating graph embedding, which includes node and relation embedding, we further enhance the semantic consistency of the generated high-quality images, improving the effectiveness of visual storytelling in education.
17

Liu, Fangyu, Rongtian Ye, Xun Wang, and Shuaipeng Li. "HAL: Improved Text-Image Matching by Mitigating Visual Semantic Hubs". Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11563–71. http://dx.doi.org/10.1609/aaai.v34i07.6823.

Abstract:
The hubness problem widely exists in high-dimensional embedding space and is a fundamental source of error for cross-modal matching tasks. In this work, we study the emergence of hubs in Visual Semantic Embeddings (VSE) with application to text-image matching. We analyze the pros and cons of two widely adopted optimization objectives for training VSE and propose a novel hubness-aware loss function (HAL) that addresses previous methods' defects. Unlike (Faghri et al. 2018) which simply takes the hardest sample within a mini-batch, HAL takes all samples into account, using both local and global statistics to scale up the weights of “hubs”. We experiment our method with various configurations of model architectures and datasets. The method exhibits exceptionally good robustness and brings consistent improvement on the task of text-image matching across all settings. Specifically, under the same model architectures as (Faghri et al. 2018) and (Lee et al. 2018), by switching only the learning objective, we report a maximum R@1 improvement of 7.4% on MS-COCO and 8.3% on Flickr30k.
18

Wan, Ziyu, Yan Li, Min Yang, and Junge Zhang. "Transductive Zero-Shot Learning via Visual Center Adaptation". Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 10059–60. http://dx.doi.org/10.1609/aaai.v33i01.330110059.

Abstract:
In this paper, we propose a Visual Center Adaptation Method (VCAM) to address the domain shift problem in zero-shot learning. For the seen classes in the training data, VCAM builds an embedding space by learning the mapping from semantic space to some visual centers. While for unseen classes in the test data, the construction of embedding space is constrained by a symmetric Chamfer-distance term, aiming to adapt the distribution of the synthetic visual centers to that of the real cluster centers. Therefore the learned embedding space can generalize the unseen classes well. Experiments on two widely used datasets demonstrate that our model significantly outperforms state-of-the-art methods.
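The symmetric Chamfer-distance term used as a constraint here has a standard form; a short NumPy sketch for two point sets (e.g., synthetic visual centers versus real cluster centers) is shown below as a reference, not as the authors' code.

```python
import numpy as np

def symmetric_chamfer(A, B):
    """Symmetric Chamfer distance between point sets A (n, d) and B (m, d):
    each point is matched to its nearest neighbor in the other set and the
    squared distances are averaged in both directions."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # (n, m) pairwise squared distances
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```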
19

Deutsch, Shay, Andrea Bertozzi, and Stefano Soatto. "Zero Shot Learning with the Isoperimetric Loss". Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 10704–12. http://dx.doi.org/10.1609/aaai.v34i07.6698.

Abstract:
We introduce the isoperimetric loss as a regularization criterion for learning the map from a visual representation to a semantic embedding, to be used to transfer knowledge to unknown classes in a zero-shot learning setting. We use a pre-trained deep neural network model as a visual representation of image data, a Word2Vec embedding of class labels, and linear maps between the visual and semantic embedding spaces. However, the spaces themselves are not linear, and we postulate the sample embedding to be populated by noisy samples near otherwise smooth manifolds. We exploit the graph structure defined by the sample points to regularize the estimates of the manifolds by inferring the graph connectivity using a generalization of the isoperimetric inequalities from Riemannian geometry to graphs. Surprisingly, this regularization alone, paired with the simplest baseline model, outperforms the state-of-the-art among fully automated methods in zero-shot learning benchmarks such as AwA and CUB. This improvement is achieved solely by learning the structure of the underlying spaces by imposing regularity.
20

Zhang, Weifeng, Hua Hu, and Haiyang Hu. "Training Visual-Semantic Embedding Network for Boosting Automatic Image Annotation". Neural Processing Letters 48, no. 3 (January 11, 2018): 1503–19. http://dx.doi.org/10.1007/s11063-017-9753-9.
21

An, Rongqiao, Zhenjiang Miao, Qingyu Li, Wanru Xu, and Qiang Zhang. "Spatiotemporal visual-semantic embedding network for zero-shot action recognition". Journal of Electronic Imaging 28, no. 02 (March 8, 2019): 1. http://dx.doi.org/10.1117/1.jei.28.2.023007.
22

Yang, Guan, Ayou Han, Xiaoming Liu, Yang Liu, Tao Wei, and Zhiyuan Zhang. "Enhancing Semantic-Consistent Features and Transforming Discriminative Features for Generalized Zero-Shot Classifications". Applied Sciences 12, no. 24 (December 9, 2022): 12642. http://dx.doi.org/10.3390/app122412642.

Abstract:
Generalized zero-shot learning (GZSL) aims to classify classes that do not appear during training. Recent state-of-the-art approaches rely on generative models, which use correlating semantic embeddings to synthesize unseen classes visual features; however, these approaches ignore the semantic and visual relevance, and visual features synthesized by generative models do not represent their semantics well. Although existing GZSL methods based on generative model disentanglement consider consistency between visual and semantic models, these methods consider semantic consistency only in the training phase and ignore semantic consistency in the feature synthesis and classification phases. The absence of such constraints may lead to an unrepresentative synthesized visual model with respect to semantics, and the visual and semantic features are not modally well aligned, thus causing the bias between visual and semantic features. Therefore, an approach for GZSL is proposed to enhance semantic-consistent features and discriminative features transformation (ESTD-GZSL). The proposed method can enhance semantic-consistent features at all stages of GZSL. A semantic decoder module is first added to the VAE to map synthetic and real features to the corresponding semantic embeddings. This regularization method allows synthesizing unseen classes for a more representative visual representation, and synthetic features can better represent their semantics. Then, the semantic-consistent features decomposed by the disentanglement module and the features output by the semantic decoder are transformed into enhanced semantic-consistent discriminative features and used in classification to reduce the ambiguity between categories. The experimental results show that our proposed method achieves more competitive results on four benchmark datasets (AWA2, CUB, FLO, and APY) of GZSL.
23

Zhang, Linhai, Deyu Zhou, Yulan He, and Zeng Yang. "MERL: Multimodal Event Representation Learning in Heterogeneous Embedding Spaces". Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 16 (May 18, 2021): 14420–27. http://dx.doi.org/10.1609/aaai.v35i16.17695.

Abstract:
Previous work has shown the effectiveness of using event representations for tasks such as script event prediction and stock market prediction. It is however still challenging to learn the subtle semantic differences between events based solely on textual descriptions of events often represented as (subject, predicate, object) triples. As an alternative, images offer a more intuitive way of understanding event semantics. We observe that event described in text and in images show different abstraction levels and therefore should be projected onto heterogeneous embedding spaces, as opposed to what have been done in previous approaches which project signals from different modalities onto a homogeneous space. In this paper, we propose a Multimodal Event Representation Learning framework (MERL) to learn event representations based on both text and image modalities simultaneously. Event textual triples are projected as Gaussian density embeddings by a dual-path Gaussian triple encoder, while event images are projected as point embeddings by a visual event component-aware image encoder. Moreover, a novel score function motivated by statistical hypothesis testing is introduced to coordinate two embedding spaces. Experiments are conducted on various multimodal event-related tasks and results show that MERL outperforms a number of unimodal and multimodal baselines, demonstrating the effectiveness of the proposed framework.
24

Bi, Bei, Yaojun Wang, Haicang Zhang, and Yang Gao. "Microblog-HAN: A micro-blog rumor detection model based on heterogeneous graph attention network". PLOS ONE 17, no. 4 (April 12, 2022): e0266598. http://dx.doi.org/10.1371/journal.pone.0266598.

Abstract:
Although social media has highly facilitated people’s daily communication and dissemination of information, it has unfortunately been an ideal hotbed for the breeding and dissemination of Internet rumors. Therefore, automatically monitoring rumor dissemination in the early stage is of great practical significance. However, the existing detection methods fail to take full advantage of the semantics of the microblog information propagation graph. To address this shortcoming, this study models the information transmission network of a microblog as a heterogeneous graph with a variety of semantic information and then constructs a Microblog-HAN, which is a graph-based rumor detection model, to capture and aggregate the semantic information using attention layers. Specifically, after the initial textual and visual features of posts are extracted, the node-level attention mechanism combines neighbors of the microblog nodes to generate three groups of node embeddings with specific semantics. Moreover, semantic-level attention fuses different semantics to obtain the final node embedding of the microblog, which is then used as a classifier’s input. Finally, the classification results of whether the microblog is a rumor or not are obtained. The experimental results on two real-world microblog rumor datasets, Weibo2016 and Weibo2021, demonstrate that the proposed Microblog-HAN can detect microblog rumors with an accuracy of over 92%, demonstrating its superiority over the most existing methods in identifying rumors from the view of the whole information transmission graph.
25

Bai, Haoyue, Haofeng Zhang, and Qiong Wang. "Dual discriminative auto-encoder network for zero shot image recognition". Journal of Intelligent & Fuzzy Systems 40, no. 3 (March 2, 2021): 5159–70. http://dx.doi.org/10.3233/jifs-201920.

Abstract:
Zero Shot learning (ZSL) aims to use the information of seen classes to recognize unseen classes, which is achieved by transferring knowledge of the seen classes from the semantic embeddings. Since the domains of the seen and unseen classes do not overlap, most ZSL algorithms often suffer from domain shift problem. In this paper, we propose a Dual Discriminative Auto-encoder Network (DDANet), in which visual features and semantic attributes are self-encoded by using the high dimensional latent space instead of the feature space or the low dimensional semantic space. In the embedded latent space, the features are projected to both preserve their original semantic meanings and have discriminative characteristics, which are realized by applying dual semantic auto-encoder and discriminative feature embedding strategy. Moreover, the cross modal reconstruction is applied to obtain interactive information. Extensive experiments are conducted on four popular datasets and the results demonstrate the superiority of this method.
26

Huang, Yan, Yang Long, and Liang Wang. "Few-Shot Image and Sentence Matching via Gated Visual-Semantic Embedding". Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 8489–96. http://dx.doi.org/10.1609/aaai.v33i01.33018489.

Abstract:
Although image and sentence matching has been widely studied, its intrinsic few-shot problem is commonly ignored, which has become a bottleneck for further performance improvement. In this work, we focus on this challenging problem of few-shot image and sentence matching, and propose a Gated Visual-Semantic Embedding (GVSE) model to deal with it. The model consists of three corporative modules in terms of uncommon VSE, common VSE, and gated metric fusion. The uncommon VSE exploits external auxiliary resources to extract generic features for representing uncommon instances and words in images and sentences, and then integrates them by modeling their semantic relation to obtain global representations for association analysis. To better model other common instances and words in rest content of images and sentences, the common VSE learns their discriminative representations directly from scratch. After obtaining two similarity metrics from the two VSE modules with different advantages, the gated metric fusion module adaptively fuses them by automatically balancing their relative importance. Based on the fused metric, we perform extensive experiments in terms of few-shot and conventional image and sentence matching, and demonstrate the effectiveness of the proposed model by achieving the state-of-the-art results on two public benchmark datasets.
27

Luo, Minnan, Xiaojun Chang, and Chen Gong. "Reliable shot identification for complex event detection via visual-semantic embedding". Computer Vision and Image Understanding 213 (December 2021): 103300. http://dx.doi.org/10.1016/j.cviu.2021.103300.
28

Yu, Beibei, Cheng Xie, Peng Tang, and Bin Li. "Semantic-visual shared knowledge graph for zero-shot learning". PeerJ Computer Science 9 (March 22, 2023): e1260. http://dx.doi.org/10.7717/peerj-cs.1260.

Abstract:
Almost all existing zero-shot learning methods work only on benchmark datasets (e.g., CUB, SUN, AwA, FLO and aPY) which have already provided pre-defined attributes for all the classes. These methods thus are hard to apply on real-world datasets (like ImageNet) since there are no such pre-defined attributes in the data environment. The latest works have explored using semantic-rich knowledge graphs (such as WordNet) to substitute pre-defined attributes. However, these methods encounter a serious “domain shift” problem because such a knowledge graph cannot provide detailed enough semantics to describe fine-grained information. To this end, we propose a semantic-visual shared knowledge graph (SVKG) to enhance the detailed information for zero-shot learning. SVKG represents high-level information by using semantic embedding but describes fine-grained information by using visual features. These visual features can be directly extracted from real-world images to substitute pre-defined attributes. A multi-modal graph convolution network is also proposed to transfer SVKG into graph representations that can be used for downstream zero-shot learning tasks. Experimental results on real-world datasets without pre-defined attributes demonstrate the effectiveness of our method and show the benefits of the proposed approach. Our method obtains a +2.8%, +0.5%, and +0.2% increase compared with the state-of-the-art in the 2-hops, 3-hops, and all divisions, respectively.
29

Li, Qiaozhe, Xin Zhao, Ran He, and Kaiqi Huang. "Visual-Semantic Graph Reasoning for Pedestrian Attribute Recognition". Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 8634–41. http://dx.doi.org/10.1609/aaai.v33i01.33018634.

Abstract:
Pedestrian attribute recognition in surveillance is a challenging task due to poor image quality, significant appearance variations and diverse spatial distribution of different attributes. This paper treats pedestrian attribute recognition as a sequential attribute prediction problem and proposes a novel visual-semantic graph reasoning framework to address this problem. Our framework contains a spatial graph and a directed semantic graph. By performing reasoning using the Graph Convolutional Network (GCN), one graph captures spatial relations between regions and the other learns potential semantic relations between attributes. An end-to-end architecture is presented to perform mutual embedding between these two graphs to guide the relational learning for each other. We verify the proposed framework on three large scale pedestrian attribute datasets including PETA, RAP, and PA100k. Experiments show superiority of the proposed method over state-of-the-art methods and effectiveness of our joint GCN structures for sequential attribute prediction.
30

Suo, Xinhua, Bing Guo, Yan Shen, Wei Wang, Yaosen Chen, and Zhen Zhang. "Embodying the Number of an Entity’s Relations for Knowledge Representation Learning". International Journal of Software Engineering and Knowledge Engineering 31, no. 10 (October 2021): 1495–515. http://dx.doi.org/10.1142/s0218194021500509.

Abstract:
Knowledge representation learning (knowledge graph embedding) plays a critical role in knowledge graph construction. Multi-source knowledge representation learning, one of the most promising classes of knowledge representation learning at present, mainly focuses on encoding useful additional information about entities and relations in the knowledge graph into their embeddings, such as text descriptions, entity types, visual information, and graph structure. However, a simple but very common kind of information has been ignored: the number of an entity's relations, which reflects the number of an entity's semantic types. This work proposes a multi-source knowledge representation learning model, KRL-NER, which embodies the number of an entity's relations into the entities' embeddings through an attention mechanism. Specifically, we first design and construct a submodel of KRL-NER, LearnNER, which learns an embedding that includes information on the number of an entity's relations; then, we obtain a new embedding by exerting attention onto the embedding learned by models such as TransE using this embedding; finally, translation is performed on the new embedding. Experiments on related knowledge graph tasks, namely entity prediction, entity prediction under different relation types, and triple classification, are carried out to verify our model. The results show that our model is effective on large-scale knowledge graphs, e.g., FB15K.
31

Ye, Jingwen, Ruonan Yu, Songhua Liu, and Xinchao Wang. "Mutual-Modality Adversarial Attack with Semantic Perturbation". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 7 (March 24, 2024): 6657–65. http://dx.doi.org/10.1609/aaai.v38i7.28488.

Abstract:
Adversarial attacks constitute a notable threat to machine learning systems, given their potential to induce erroneous predictions and classifications. However, within real-world contexts, the essential specifics of the deployed model are frequently treated as a black box, consequently mitigating the vulnerability to such attacks. Thus, enhancing the transferability of the adversarial samples has become a crucial area of research, which heavily relies on selecting appropriate surrogate models. To address this challenge, we propose a novel approach that generates adversarial attacks in a mutual-modality optimization scheme. Our approach is accomplished by leveraging the pre-trained CLIP model. Firstly, we conduct a visual attack on the clean image that causes semantic perturbations on the aligned embedding space with the other textual modality. Then, we apply the corresponding defense on the textual modality by updating the prompts, which forces the re-matching on the perturbed embedding space. Finally, to enhance the attack transferability, we utilize the iterative training strategy on the visual attack and the textual defense, where the two processes optimize from each other. We evaluate our approach on several benchmark datasets and demonstrate that our mutual-modal attack strategy can effectively produce high-transferable attacks, which are stable regardless of the target networks. Our approach outperforms state-of-the-art attack methods and can be readily deployed as a plug-and-play solution.
32

Bai, Jing, Mengjie Wang, and Dexin Kong. "Deep Common Semantic Space Embedding for Sketch-Based 3D Model Retrieval". Entropy 21, no. 4 (April 4, 2019): 369. http://dx.doi.org/10.3390/e21040369.

Abstract:
Sketch-based 3D model retrieval has become an important research topic in many applications, such as computer graphics and computer-aided design. Although sketches and 3D models have huge interdomain visual perception discrepancies, and sketches of the same object have remarkable intradomain visual perception diversity, the 3D models and sketches of the same class share common semantic content. Motivated by these findings, we propose a novel approach for sketch-based 3D model retrieval by constructing a deep common semantic space embedding using triplet network. First, a common data space is constructed by representing every 3D model as a group of views. Second, a common modality space is generated by translating views to sketches according to cross entropy evaluation. Third, a common semantic space embedding for two domains is learned based on a triplet network. Finally, based on the learned features of sketches and 3D models, four kinds of distance metrics between sketches and 3D models are designed, and sketch-based 3D model retrieval results are achieved. The experimental results using the Shape Retrieval Contest (SHREC) 2013 and SHREC 2014 datasets reveal the superiority of our proposed method over state-of-the-art methods.
33

Xiao, Linlin, Huahu Xu, Junsheng Xiao, and Yuzhe Huang. "Few-Shot Object Detection with Memory Contrastive Proposal Based on Semantic Priors". Electronics 12, no. 18 (September 11, 2023): 3835. http://dx.doi.org/10.3390/electronics12183835.

Abstract:
Few-shot object detection (FSOD) aims to detect objects belonging to novel classes with few training samples. With the small number of novel class samples, the visual information extracted is insufficient to accurately represent the object itself, presenting significant intra-class variance and confusion between classes of similar samples, resulting in large errors in the detection results of the novel class samples. We propose a few-shot object detection framework to achieve effective classification and detection by embedding semantic information and contrastive learning. Firstly, we introduced a semantic fusion (SF) module, which projects semantic spatial information into visual space for interaction, to compensate for the lack of visual information and further enhance the representation of feature information. To further improve the classification performance, we embed the memory contrastive proposal (MCP) module to adjust the distribution of the feature space by calculating the contrastive loss between the class-centered features of previous samples and the current input features to obtain a more discriminative embedding space for better intra-class aggregation and inter-class separation for subsequent classification and detection. Extensive experiments on the PASCAL VOC and MS-COCO datasets show that the performance of our proposed method is effectively improved. Our proposed method improves nAP50 over the baseline model by 4.5% and 3.5%.
34

Wei, Longhui, Lingxi Xie, Jianzhong He, Xiaopeng Zhang, and Qi Tian. "Can Semantic Labels Assist Self-Supervised Visual Representation Learning?" Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 3 (June 28, 2022): 2642–50. http://dx.doi.org/10.1609/aaai.v36i3.20166.

Abstract:
Recently, contrastive learning has largely advanced the progress of unsupervised visual representation learning. Pre-trained on ImageNet, some self-supervised algorithms reported higher transfer learning performance compared to fully-supervised methods, seeming to deliver the message that human labels hardly contribute to learning transferrable visual features. In this paper, we defend the usefulness of semantic labels but point out that fully-supervised and self-supervised methods are pursuing different kinds of features. To alleviate this issue, we present a new algorithm named Supervised Contrastive Adjustment in Neighborhood (SCAN) that maximally prevents the semantic guidance from damaging the appearance feature embedding. In a series of downstream tasks, SCAN achieves superior performance compared to previous fully-supervised and self-supervised methods, and sometimes the gain is significant. More importantly, our study reveals that semantic labels are useful in assisting self-supervised methods, opening a new direction for the community.
35

Cai, Jiyan, Libing Wu, Dan Wu, Jianxin Li, and Xianfeng Wu. "Multi-Dimensional Information Alignment in Different Modalities for Generalized Zero-Shot and Few-Shot Learning". Information 14, no. 3 (February 24, 2023): 148. http://dx.doi.org/10.3390/info14030148.

Abstract:
Generalized zero-shot learning (GZSL) aims to solve the category recognition tasks for unseen categories under the setting that training samples only contain seen classes while unseen classes are not available. This research is vital as there are always existing new categories and large amounts of unlabeled data in realistic scenarios. Previous work for GZSL usually maps the visual information of the visible classes and the semantic description of the invisible classes into the identical embedding space to bridge the gap between the disjointed visible and invisible classes, while ignoring the intrinsic features of visual images, which are sufficiently discriminative to classify themselves. To better use discriminative information from visual classes for GZSL, we propose the n-CADA-VAE. In our approach, we map the visual feature of seen classes to a high-dimensional distribution while mapping the semantic description of unseen classes to a low-dimensional distribution under the same latent embedding space, thus projecting information of different modalities to corresponding space positions more accurately. We conducted extensive experiments on four benchmark datasets (CUB, SUN, AWA1, and AWA2). The results show our model’s superior performance in generalized zero-shot as well as few-shot learning.
36

Chen, J., X. Du, J. Zhang, Y. Wan, and W. Zhao. "Semantic Knowledge Embedding Deep Learning Network for Land Cover Classification". International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLVIII-1/W2-2023 (December 13, 2023): 85–90. http://dx.doi.org/10.5194/isprs-archives-xlviii-1-w2-2023-85-2023.

Abstract:
Land cover classification provides essential basic information and key parameters for environmental change research, geographical and national monitoring, and sustainable development planning. Deep learning can automatically extract multi-level features of complex objects and has been proven to be an effective method for information extraction. However, one of the major challenges of deep learning is its poor interpretability, which makes it difficult to understand and explain the reasoning behind its classification results. This paper proposes a deep cross-modal coupling model (CMCM) for integrating semantic features and visual features. The representation of the knowledge graph is introduced into remote sensing image classification. Compared to previous studies, the proposed method provides accurate descriptions of the complex semantic objects within a complex land cover environment. The results showed that the integration of semantic knowledge improved the accuracy and interpretability of land cover classification.
37

Gong, Yan, Georgina Cosma, and Hui Fang. "On the Limitations of Visual-Semantic Embedding Networks for Image-to-Text Information Retrieval". Journal of Imaging 7, no. 8 (July 26, 2021): 125. http://dx.doi.org/10.3390/jimaging7080125.

Abstract:
Visual-semantic embedding (VSE) networks create joint image–text representations to map images and texts in a shared embedding space to enable various information retrieval-related tasks, such as image–text retrieval, image captioning, and visual question answering. The most recent state-of-the-art VSE-based networks are: VSE++, SCAN, VSRN, and UNITER. This study evaluates the performance of those VSE networks for the task of image-to-text retrieval and identifies and analyses their strengths and limitations to guide future research on the topic. The experimental results on Flickr30K revealed that the pre-trained network, UNITER, achieved 61.5% on average Recall@5 for the task of retrieving all relevant descriptions. The traditional networks, VSRN, SCAN, and VSE++, achieved 50.3%, 47.1%, and 29.4% on average Recall@5, respectively, for the same task. An additional analysis was performed on image–text pairs from the top 25 worst-performing classes using a subset of the Flickr30K-based dataset to identify the limitations of the performance of the best-performing models, VSRN and UNITER. These limitations are discussed from the perspective of image scenes, image objects, image semantics, and basic functions of neural networks. This paper discusses the strengths and limitations of VSE networks to guide further research into the topic of using VSE networks for cross-modal information retrieval tasks.
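Recall@K, the metric reported throughout this study, simply asks whether a relevant item appears among the top K retrieved results. A minimal single-ground-truth NumPy sketch for image-to-text retrieval is given below for reference; the evaluated papers use dataset-specific variants (e.g., multiple relevant captions per image on Flickr30K), so this is an illustrative simplification.

```python
import numpy as np

def recall_at_k(S, k=5):
    """Fraction of images whose ground-truth caption (assumed to sit on the
    diagonal of the image-by-caption similarity matrix S) is ranked within
    the top k retrieved captions."""
    ranks = (S > np.diag(S)[:, None]).sum(axis=1)  # captions scored above the true one
    return float((ranks < k).mean())
```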
38

Yang, Guang, Manling Li, Jiajie Zhang, Xudong Lin, Heng Ji, and Shih-Fu Chang. "Video Event Extraction via Tracking Visual States of Arguments". Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 3 (June 26, 2023): 3136–44. http://dx.doi.org/10.1609/aaai.v37i3.25418.

Abstract:
Video event extraction aims to detect salient events from a video and identify the arguments for each event as well as their semantic roles. Existing methods focus on capturing the overall visual scene of each frame, ignoring fine-grained argument-level information. Inspired by the definition of events as changes of states, we propose a novel framework to detect video events by tracking the changes in the visual states of all involved arguments, which are expected to provide the most informative evidence for the extraction of video events. In order to capture the visual state changes of arguments, we decompose them into changes in pixels within objects, displacements of objects, and interactions among multiple arguments. We further propose Object State Embedding, Object Motion-aware Embedding and Argument Interaction Embedding to encode and track these changes respectively. Experiments on various video event extraction tasks demonstrate significant improvements compared to state-of-the-art models. In particular, on verb classification, we achieve 3.49% absolute gains (19.53% relative gains) in F1@5 on Video Situation Recognition. Our Code is publicly available at https://github.com/Shinetism/VStates for research purposes.
Estilos ABNT, Harvard, Vancouver, APA, etc.
39

Eyharabide, Victoria, Imad Eddine Ibrahim Bekkouch e Nicolae Dragoș Constantin. "Knowledge Graph Embedding-Based Domain Adaptation for Musical Instrument Recognition". Computers 10, n.º 8 (3 de agosto de 2021): 94. http://dx.doi.org/10.3390/computers10080094.

Texto completo da fonte
Resumo:
Convolutional neural networks raised the bar for machine learning and artificial intelligence applications, mainly due to the abundance of data and computation. However, there is not always enough data for training, especially when it comes to historical collections of cultural heritage where the original artworks have been destroyed or damaged over time. Transfer learning and domain adaptation techniques are possible solutions to tackle the issue of data scarcity. This article presents a new method for domain adaptation based on knowledge graph embeddings. Knowledge graph embedding projects a knowledge graph into a lower-dimensional space in which entities and relations are represented as continuous vectors. Our method incorporates these semantic vector spaces as a key ingredient to guide the domain adaptation process. We combined knowledge graph embeddings with visual embeddings from the images and trained a neural network with the combined embeddings as anchors using an extension of Fisher’s linear discriminant. We evaluated our approach on two cultural heritage datasets of images containing medieval and renaissance musical instruments. The experimental results showed a significant improvement over the baselines and over state-of-the-art domain adaptation methods.
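A minimal sketch of the general idea of combining knowledge graph embeddings with visual embeddings is given below; the entity vocabulary, layer sizes, and simple concatenation are assumptions and do not reflect the authors' Fisher-discriminant-based training.

```python
# Sketch under assumptions: combining a knowledge-graph entity embedding
# with a visual embedding before classification. Shapes and the
# concatenation scheme are illustrative, not the authors' method.
import torch
import torch.nn as nn

visual_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))
kg_embeddings = nn.Embedding(num_embeddings=1000, embedding_dim=64)  # KG entities
head = nn.Linear(256 + 64, 12)                                       # 12 classes

image = torch.randn(8, 3, 64, 64)
entity_id = torch.randint(0, 1000, (8,))         # KG entity linked to each image
combined = torch.cat([visual_encoder(image), kg_embeddings(entity_id)], dim=1)
print(head(combined).shape)                      # torch.Size([8, 12])
```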
Estilos ABNT, Harvard, Vancouver, APA, etc.
40

Jadhav, Mrunal, e Matthew Guzdial. "Tile Embedding: A General Representation for Level Generation". Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 17, n.º 1 (4 de outubro de 2021): 34–41. http://dx.doi.org/10.1609/aiide.v17i1.18888.

Texto completo da fonte
Resumo:
In recent years, Procedural Level Generation via Machine Learning (PLGML) techniques have been applied to generate game levels automatically. These approaches rely on human-annotated representations of game levels. Creating annotated datasets for games requires domain knowledge and is time-consuming; hence, although a large number of video games exist, annotated datasets have been curated only for a small handful. As a result, current PLGML techniques have been explored in limited domains, with Super Mario Bros. as the most common example. To address this problem, we present tile embeddings, a unified, affordance-rich representation for tile-based 2D games. To learn this embedding, we employ autoencoders trained on the visual and semantic information of tiles from a set of existing, human-annotated games. We evaluate this representation on its ability to predict affordances for unseen tiles and to serve as a PLGML representation for both annotated and unannotated games.
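Below is a minimal sketch, under assumed tile and affordance dimensions, of an autoencoder that compresses a tile's pixels together with its affordance vector into a latent "tile embedding"; it is an illustration of the idea, not the paper's architecture.

```python
# Minimal sketch, not the paper's architecture: an autoencoder over a
# tile's pixels concatenated with its multi-hot affordance vector, so the
# latent code ("tile embedding") reflects both visual and semantic cues.
import torch
import torch.nn as nn

TILE_PIXELS, AFFORDANCES, LATENT = 16 * 16 * 3, 13, 32

class TileAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(TILE_PIXELS + AFFORDANCES, 128),
                                 nn.ReLU(), nn.Linear(128, LATENT))
        self.dec = nn.Sequential(nn.Linear(LATENT, 128), nn.ReLU(),
                                 nn.Linear(128, TILE_PIXELS + AFFORDANCES))

    def forward(self, x):
        z = self.enc(x)              # the tile embedding
        return self.dec(z), z

tile = torch.rand(1, TILE_PIXELS + AFFORDANCES)
recon, embedding = TileAutoencoder()(tile)
print(embedding.shape)               # torch.Size([1, 32])
```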
Estilos ABNT, Harvard, Vancouver, APA, etc.
41

Li, Wei, Haiyu Song, Hongda Zhang, Houjie Li e Pengjie Wang. "The Image Annotation Refinement in Embedding Feature Space based on Mutual Information". International Journal of Circuits, Systems and Signal Processing 16 (10 de janeiro de 2022): 191–201. http://dx.doi.org/10.46300/9106.2022.16.23.

Texto completo da fonte
Resumo:
The ever-increasing number of images has made automatic image annotation one of the most important tasks in the fields of machine learning and computer vision. Despite continuous efforts in inventing new annotation algorithms and new models, the results of state-of-the-art image annotation methods are often unsatisfactory. In this paper, to further improve annotation refinement performance, a novel approach based on weighted mutual information is proposed to automatically refine the original annotations of images. Unlike traditional refinement models that use only visual features, the proposed model uses semantic embedding to properly map labels and visual features to a meaningful semantic space. To accurately measure the relevance between a particular image and its original annotations, the proposed model utilizes all available information, including image-to-image, label-to-label, and image-to-label relations. Experimental results conducted on three typical datasets show not only the validity of the refinement but also the superiority of the proposed algorithm over existing ones. The improvement largely benefits from the proposed mutual information method and from utilizing all available information.
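The sketch below illustrates one possible (assumed) formulation of combining label-to-label pointwise mutual information with image-to-label visual relevance to re-score candidate annotations; it is not the paper's weighted mutual information model.

```python
# Illustrative sketch (assumed formulation): ranking candidate labels for
# an image by combining label-to-label pointwise mutual information with
# an image-to-label visual relevance score. Not the paper's exact model.
import numpy as np

def pmi_matrix(cooc):
    """cooc[i, j]: how often labels i and j co-occur in the training set."""
    total = cooc.sum()
    p_ij = cooc / total
    p_i = cooc.sum(axis=1, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i @ p_i.T))
    return np.nan_to_num(np.maximum(pmi, 0.0))        # positive PMI

def refine(original_labels, visual_scores, pmi):
    """Score each label by visual relevance plus mean PMI with original labels."""
    return visual_scores + pmi[:, original_labels].mean(axis=1)

cooc = np.array([[50, 10, 2], [10, 40, 5], [2, 5, 30]], float)
scores = refine(original_labels=[0],
                visual_scores=np.array([0.9, 0.4, 0.1]),
                pmi=pmi_matrix(cooc))
print(scores)
```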
Estilos ABNT, Harvard, Vancouver, APA, etc.
42

Monka, Sebastian, Lavdim Halilaj e Achim Rettinger. "A survey on visual transfer learning using knowledge graphs". Semantic Web 13, n.º 3 (6 de abril de 2022): 477–510. http://dx.doi.org/10.3233/sw-212959.

Texto completo da fonte
Resumo:
The information perceived via visual observations of real-world phenomena is unstructured and complex. Computer vision (CV) is the field of research that attempts to make use of that information. Recent approaches in CV utilize deep learning (DL) methods, as they perform quite well when the training and testing domains follow the same underlying data distribution. However, it has been shown that minor variations in the images that occur when these methods are used in the real world can lead to unpredictable and catastrophic errors. Transfer learning is the area of machine learning that tries to prevent these errors. In particular, approaches that augment image data using auxiliary knowledge encoded in language embeddings or knowledge graphs (KGs) have achieved promising results in recent years. This survey focuses on visual transfer learning approaches using KGs, as we believe that KGs are well suited to store and represent any kind of auxiliary knowledge. KGs can represent auxiliary knowledge either in an underlying graph-structured schema or in a vector-based knowledge graph embedding. To enable the reader to solve visual transfer learning problems with the help of specific KG-DL configurations, we start with a description of the relevant modeling structures of a KG and its various expressions, such as directed labeled graphs, hypergraphs, and hyper-relational graphs. We explain the notion of a feature extractor, referring specifically to visual and semantic features. We provide a broad overview of knowledge graph embedding methods and describe several joint training objectives suitable for combining them with high-dimensional visual embeddings. The main section introduces four categories of how a KG can be combined with a DL pipeline: 1) Knowledge Graph as a Reviewer; 2) Knowledge Graph as a Trainee; 3) Knowledge Graph as a Trainer; and 4) Knowledge Graph as a Peer. To help researchers find meaningful evaluation benchmarks, we provide an overview of generic KGs and a set of image processing datasets and benchmarks that include various types of auxiliary knowledge. Finally, we summarize related surveys and give an outlook on challenges and open issues for future research.
Estilos ABNT, Harvard, Vancouver, APA, etc.
43

Liu, Bo, Qiulei Dong e Zhanyi Hu. "Zero-Shot Learning from Adversarial Feature Residual to Compact Visual Feature". Proceedings of the AAAI Conference on Artificial Intelligence 34, n.º 07 (3 de abril de 2020): 11547–54. http://dx.doi.org/10.1609/aaai.v34i07.6821.

Texto completo da fonte
Resumo:
Recently, many zero-shot learning (ZSL) methods have focused on learning discriminative object features in an embedding feature space; however, the distributions of the unseen-class features learned by these methods tend to partly overlap, resulting in inaccurate object recognition. To address this problem, we propose a novel adversarial network to synthesize compact semantic visual features for ZSL, consisting of a residual generator, a prototype predictor, and a discriminator. The residual generator generates the visual feature residual, which is integrated with a visual prototype predicted via the prototype predictor to synthesize the visual feature. The discriminator distinguishes the synthetic visual features from the real ones extracted from an existing categorization CNN. Since the generated residuals are generally numerically much smaller than the distances among all the prototypes, the distributions of the unseen-class features synthesized by the proposed network are less overlapped. In addition, considering that the visual features from categorization CNNs are generally inconsistent with their semantic features, a simple feature selection strategy is introduced for extracting more compact semantic visual features. Extensive experimental results on six benchmark datasets demonstrate that our method achieves significantly better performance than existing state-of-the-art methods, by ∼1.2-13.2% in most cases.
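The core "prototype plus residual" idea can be sketched as follows; the network sizes, the semantic attribute dimension, and the absence of the adversarial training loop are all simplifications of the method described above.

```python
# Hedged sketch of the core idea: synthesising an unseen-class visual
# feature as "predicted prototype + small generated residual", so synthetic
# features stay close to their prototype and overlap less across classes.
# Network sizes are assumptions; the GAN training loop is omitted.
import torch
import torch.nn as nn

sem_dim, vis_dim, noise_dim = 85, 2048, 64
prototype_predictor = nn.Linear(sem_dim, vis_dim)        # semantic -> prototype
residual_generator = nn.Sequential(nn.Linear(sem_dim + noise_dim, 512),
                                   nn.ReLU(), nn.Linear(512, vis_dim))

semantic = torch.randn(16, sem_dim)                      # class attribute vectors
noise = torch.randn(16, noise_dim)
prototype = prototype_predictor(semantic)
residual = residual_generator(torch.cat([semantic, noise], dim=1))
synthetic_feature = prototype + residual                 # would be fed to a discriminator
print(synthetic_feature.shape)                           # torch.Size([16, 2048])
```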
Estilos ABNT, Harvard, Vancouver, APA, etc.
44

Wang, Zhecheng, Haoyuan Li e Ram Rajagopal. "Urban2Vec: Incorporating Street View Imagery and POIs for Multi-Modal Urban Neighborhood Embedding". Proceedings of the AAAI Conference on Artificial Intelligence 34, n.º 01 (3 de abril de 2020): 1013–20. http://dx.doi.org/10.1609/aaai.v34i01.5450.

Texto completo da fonte
Resumo:
Understanding intrinsic patterns and predicting spatiotemporal characteristics of cities require a comprehensive representation of urban neighborhoods. Existing works have relied on either inter- or intra-region connectivities to generate neighborhood representations but have failed to fully utilize the informative yet heterogeneous data within neighborhoods. In this work, we propose Urban2Vec, an unsupervised multi-modal framework which incorporates both street view imagery and point-of-interest (POI) data to learn neighborhood embeddings. Specifically, we use a convolutional neural network to extract visual features from street view images while preserving geospatial similarity. Furthermore, we model each POI as a bag-of-words containing its category, rating, and review information. Analogous to document embedding in natural language processing, we establish the semantic similarity between a neighborhood (the “document”) and the words from its surrounding POIs in the vector space. By jointly encoding visual, textual, and geospatial information into the neighborhood representation, Urban2Vec achieves better performance than baseline models and is comparable to fully supervised methods in downstream prediction tasks. Extensive experiments on three U.S. metropolitan areas also demonstrate the model's interpretability, generalization capability, and value in neighborhood similarity analysis.
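A hedged sketch of the document-embedding intuition follows: the neighborhood vector is pulled toward word vectors from its surrounding POIs and pushed away from random negatives via a triplet-style loss, which is an illustrative stand-in for Urban2Vec's actual objective.

```python
# Sketch under assumptions: a triplet-style objective that attracts a
# neighborhood ("document") embedding to words from its surrounding POIs
# and repels randomly sampled words. Not Urban2Vec's exact loss.
import torch
import torch.nn.functional as F

neigh = torch.randn(32, 128, requires_grad=True)         # neighborhood embeddings
pos_words = torch.randn(32, 128)                         # words from nearby POIs
neg_words = torch.randn(32, 128)                         # randomly sampled words

loss = F.triplet_margin_loss(neigh, pos_words, neg_words, margin=0.2)
loss.backward()                                          # gradients flow to `neigh`
print(float(loss))
```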
Estilos ABNT, Harvard, Vancouver, APA, etc.
45

Wang, Chaoqun, Xuejin Chen, Shaobo Min, Xiaoyan Sun e Houqiang Li. "Task-Independent Knowledge Makes for Transferable Representations for Generalized Zero-Shot Learning". Proceedings of the AAAI Conference on Artificial Intelligence 35, n.º 3 (18 de maio de 2021): 2710–18. http://dx.doi.org/10.1609/aaai.v35i3.16375.

Texto completo da fonte
Resumo:
Generalized Zero-Shot Learning (GZSL) targets recognizing new categories by learning transferable image representations. Existing methods find that, by aligning image representations with corresponding semantic labels, the semantic-aligned representations can be transferred to unseen categories. However, supervised by only seen-category labels, the learned semantic knowledge is highly task-specific, which makes image representations biased towards seen categories. In this paper, we propose a novel Dual-Contrastive Embedding Network (DCEN) that simultaneously learns task-specific and task-independent knowledge via semantic alignment and instance discrimination. First, DCEN leverages task labels to cluster representations of the same semantic category through cross-modal contrastive learning and by exploring semantic-visual complementarity. Beyond this task-specific knowledge, DCEN then introduces task-independent knowledge by attracting representations of different views of the same image and repelling representations of different images. Compared to high-level seen-category supervision, this instance-discrimination supervision encourages DCEN to capture low-level visual knowledge, which is less biased toward seen categories and alleviates the representation bias. Consequently, the task-specific and task-independent knowledge jointly yield transferable representations, with which DCEN obtains an average improvement of 4.1% on four public benchmarks.
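The task-independent, instance-discrimination component described above can be illustrated with a standard contrastive loss over two augmented views of the same image, as sketched below; the temperature and batch handling are simplifications, not DCEN's exact formulation.

```python
# Minimal sketch of an instance-discrimination (contrastive) loss that
# attracts two augmented views of the same image and repels other images.
import torch
import torch.nn.functional as F

def instance_discrimination_loss(view1, view2, temperature=0.1):
    z1, z2 = F.normalize(view1, dim=1), F.normalize(view2, dim=1)
    logits = z1 @ z2.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))        # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = instance_discrimination_loss(torch.randn(64, 256), torch.randn(64, 256))
print(float(loss))
```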
Estilos ABNT, Harvard, Vancouver, APA, etc.
46

Wang, Yiqi, e Yingjie Tian. "Exploring Zero-Shot Semantic Segmentation with No Supervision Leakage". Electronics 12, n.º 16 (15 de agosto de 2023): 3452. http://dx.doi.org/10.3390/electronics12163452.

Texto completo da fonte
Resumo:
Zero-shot semantic segmentation (ZS3), the task of classifying unseen classes without explicit training samples, poses a significant challenge. Despite notable progress made by pre-trained vision-language models, they suffer from “supervision leakage” on the unseen classes because of their large-scale pre-training data. For example, CLIP is trained on 400M image–text pairs that cover a large label space, so it is not convincing as truly “zero-shot” learning in the machine learning sense. This paper introduces SwinZS3, an innovative framework that explores “no-supervision-leakage” zero-shot semantic segmentation with an image encoder that is not pre-trained on the seen classes. SwinZS3 integrates the strengths of both visual and semantic embeddings within a unified joint embedding space, unifying a transformer-based image encoder with a language encoder. A distinguishing feature of SwinZS3 is the use of four specialized loss functions during training: cross-entropy loss, semantic-consistency loss, regression loss, and pixel-text score loss. These functions guide the optimization process based on dense semantic prototypes derived from the language encoder, making the encoder adept at recognizing unseen classes during inference without retraining. We evaluated SwinZS3 on standard ZS3 benchmarks, including PASCAL VOC and PASCAL Context. The outcomes affirm the effectiveness of our method, marking a new milestone in “no-supervision-leakage” ZS3 performance.
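A small sketch of a pixel-text score is shown below: per-pixel visual embeddings are matched against class text embeddings by dot product to produce dense class logits; the shapes are assumed, and the combination with the other three losses is omitted.

```python
# Hedged sketch of a pixel-text score: per-pixel visual embeddings are
# compared with class text embeddings by dot product to produce dense
# class logits. Shapes are illustrative; SwinZS3's full loss is not shown.
import torch
import torch.nn.functional as F

B, D, H, W, C = 2, 256, 32, 32, 21
pixel_emb = F.normalize(torch.randn(B, D, H, W), dim=1)   # image encoder output
text_emb = F.normalize(torch.randn(C, D), dim=1)          # language encoder output

logits = torch.einsum("bdhw,cd->bchw", pixel_emb, text_emb)  # pixel-text scores
pred = logits.argmax(dim=1)                                  # per-pixel class
print(pred.shape)                                            # torch.Size([2, 32, 32])
```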
Estilos ABNT, Harvard, Vancouver, APA, etc.
47

Wang, Feng, Wan-Lei Zhao, Chong-Wah Ngo e Bernard Merialdo. "A Hamming Embedding Kernel with Informative Bag-of-Visual Words for Video Semantic Indexing". ACM Transactions on Multimedia Computing, Communications, and Applications 10, n.º 3 (abril de 2014): 1–20. http://dx.doi.org/10.1145/2535938.

Texto completo da fonte
Estilos ABNT, Harvard, Vancouver, APA, etc.
48

Xu, Tong, Peilun Zhou, Linkang Hu, Xiangnan He, Yao Hu e Enhong Chen. "Socializing the Videos: A Multimodal Approach for Social Relation Recognition". ACM Transactions on Multimedia Computing, Communications, and Applications 17, n.º 1 (16 de abril de 2021): 1–23. http://dx.doi.org/10.1145/3416493.

Texto completo da fonte
Resumo:
As a crucial task for video analysis, social relation recognition for characters not only provides semantically rich description of video content but also supports intelligent applications, e.g., video retrieval and visual question answering. Unfortunately, due to the semantic gap between visual and semantic features, traditional solutions may fail to reveal the accurate relations among characters. At the same time, the development of social media platforms has now promoted the emergence of crowdsourced comments, which may enhance the recognition task with semantic and descriptive cues. To that end, in this article, we propose a novel multimodal-based solution to deal with the character relation recognition task. Specifically, we capture the target character pairs via a search module and then design a multistream architecture for jointly embedding the visual and textual information, in which feature fusion and attention mechanism are adapted for better integrating the multimodal inputs. Finally, supervised learning is applied to classify character relations. Experiments on real-world data sets validate that our solution outperforms several competitive baselines.
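The sketch below shows one simple way to fuse a visual stream with a textual (comment) stream using a learned gate before relation classification; it is only an illustration of multimodal fusion, not the paper's multistream architecture or attention design.

```python
# Illustrative sketch only: gated (attention-style) fusion of a visual
# stream and a textual (comment) stream before relation classification.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim=256, num_relations=8):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.classifier = nn.Linear(dim, num_relations)

    def forward(self, visual, textual):
        g = self.gate(torch.cat([visual, textual], dim=-1))   # per-feature weights
        fused = g * visual + (1 - g) * textual
        return self.classifier(fused)

print(GatedFusion()(torch.randn(4, 256), torch.randn(4, 256)).shape)  # (4, 8)
```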
Estilos ABNT, Harvard, Vancouver, APA, etc.
49

Chang, Doo Soo, Gun Hee Cho e Yong Suk Choi. "Zero-Shot Recognition Enhancement by Distance-Weighted Contextual Inference". Applied Sciences 10, n.º 20 (16 de outubro de 2020): 7234. http://dx.doi.org/10.3390/app10207234.

Texto completo da fonte
Resumo:
Zero-shot recognition (ZSR) aims to perform visual classification by category in the absence of training samples. Most traditional ZSR models focus on using semantic knowledge about familiar categories to represent unfamiliar categories, relying only on the visual appearance of an unseen object. In this research, we consider not only visual information but also context, to enhance the classifier’s cognitive ability in a multi-object scene. We propose a novel method, contextual inference, that uses external resources such as knowledge graphs and semantic embedding spaces to obtain similarity measures between an unseen object and its surrounding objects. Based on the intuition that close context involves more related associations than distant context, we apply distance weighting to each piece of surrounding information using a newly defined distance calculation formula. We integrated contextual inference into traditional ZSR models to calibrate their visual predictions and performed extensive experiments on two different datasets for comparative evaluation. The experimental results demonstrate the effectiveness of our method through significant enhancements in performance.
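A minimal sketch of distance-weighted contextual calibration follows, in which each surrounding object's semantic similarity to the candidate class is weighted by inverse distance and blended with the visual score; the weighting and blending formulas are assumptions, not the paper's definitions.

```python
# Sketch under assumptions: calibrating a zero-shot visual score with
# context, where each surrounding object's semantic similarity to the
# candidate class is weighted by the inverse of its distance.
import numpy as np

def contextual_score(visual_score, context_sims, distances, alpha=0.5, eps=1e-6):
    weights = 1.0 / (np.asarray(distances) + eps)     # closer objects count more
    weights /= weights.sum()
    context = float(np.dot(weights, context_sims))    # distance-weighted similarity
    return (1 - alpha) * visual_score + alpha * context

# candidate class "laptop": a nearby "keyboard" counts more than a distant "sofa"
print(contextual_score(0.42, context_sims=[0.9, 0.2], distances=[30.0, 400.0]))
```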
Estilos ABNT, Harvard, Vancouver, APA, etc.
50

Yan, Shipeng, Songyang Zhang e Xuming He. "A Dual Attention Network with Semantic Embedding for Few-Shot Learning". Proceedings of the AAAI Conference on Artificial Intelligence 33 (17 de julho de 2019): 9079–86. http://dx.doi.org/10.1609/aaai.v33i01.33019079.

Texto completo da fonte
Resumo:
Despite the recent success of deep neural networks, it remains challenging to efficiently learn new visual concepts from limited training data. To address this problem, a prevailing strategy is to build a meta-learner that learns prior knowledge about learning from a small set of annotated data. However, most existing meta-learning approaches rely on a global representation of images and on meta-learners with complex model structures, which are sensitive to background clutter and difficult to interpret. We propose a novel meta-learning method for few-shot classification based on two simple attention mechanisms: a spatial attention that localizes relevant object regions and a task attention that selects similar training data for label prediction. We implement our method via a dual-attention network and design a semantic-aware meta-learning loss to train the meta-learner network in an end-to-end manner. We validate our model on three few-shot image classification datasets with an extensive ablative study, and our approach shows competitive performance on these datasets with fewer parameters. To facilitate future research, the code and data splits are available at: https://github.com/tonysy/STANet-PyTorch
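The spatial-attention idea described above can be sketched as an attention-weighted pooling over a feature map, as below; the layer sizes are assumptions, and the task-attention branch and meta-learning loss are not reproduced.

```python
# Minimal sketch of spatial-attention pooling: a 1x1 convolution scores
# each location, and features are averaged with those weights so that
# background clutter contributes less. Layer sizes are assumptions.
import torch
import torch.nn as nn

class SpatialAttentionPool(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat):                                       # feat: (B, C, H, W)
        attn = torch.softmax(self.score(feat).flatten(2), dim=-1)  # (B, 1, H*W)
        return (feat.flatten(2) * attn).sum(dim=-1)                # (B, C)

print(SpatialAttentionPool()(torch.randn(5, 64, 10, 10)).shape)    # torch.Size([5, 64])
```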
Estilos ABNT, Harvard, Vancouver, APA, etc.