Academic literature on the topic 'Multi-modal image translation'

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Multi-modal image translation.'

Journal articles on the topic "Multi-modal image translation"

1

Yang, Pengcheng, Boxing Chen, Pei Zhang, and Xu Sun. "Visual Agreement Regularized Training for Multi-Modal Machine Translation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 9418–25. http://dx.doi.org/10.1609/aaai.v34i05.6484.

Abstract:
Multi-modal machine translation aims at translating the source sentence into a different language in the presence of the paired image. Previous work suggests that additional visual information only provides dispensable help to translation, which is needed in several very special cases such as translating ambiguous words. To make better use of visual information, this work presents visual agreement regularized training. The proposed approach jointly trains the source-to-target and target-to-source translation models and encourages them to share the same focus on the visual information when generating semantically equivalent visual words (e.g. “ball” in English and “ballon” in French). Besides, a simple yet effective multi-head co-attention model is also introduced to capture interactions between visual and textual features. The results show that our approaches can outperform competitive baselines by a large margin on the Multi30k dataset. Further analysis demonstrates that the proposed regularized training can effectively improve the agreement of attention on the image, leading to better use of visual information.
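As a rough illustration of the agreement idea described in this abstract, the sketch below penalizes disagreement between the image attention of a source-to-target model and that of a target-to-source model when each generates one of a pair of equivalent visual words. The abstract does not state the loss, so the squared L2 distance, the function names, and the toy attention scores are all assumptions made for illustration, not the authors' formulation.

```python
import math


def softmax(scores):
    """Turn raw attention scores into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def agreement_penalty(attn_fwd, attn_bwd):
    """Squared L2 distance between two attention distributions.

    Hypothetical stand-in for an agreement regularizer: both inputs are
    attention weights over the same image regions, one from each
    translation direction.
    """
    if len(attn_fwd) != len(attn_bwd):
        raise ValueError("both models must attend over the same regions")
    return sum((a - b) ** 2 for a, b in zip(attn_fwd, attn_bwd))


# Toy attention over four image regions (invented scores).
fwd = softmax([2.0, 0.5, 0.1, 0.1])         # source-to-target model
bwd_agree = softmax([1.8, 0.6, 0.2, 0.1])   # similar focus
bwd_differ = softmax([0.1, 0.1, 0.5, 2.0])  # opposite focus

print(agreement_penalty(fwd, bwd_agree))
print(agreement_penalty(fwd, bwd_differ))
```

In a training loop, a term of this kind would be added to the two translation losses with a weighting factor, so that lowering it pulls the two attention maps toward the same regions.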
2

Kaur, Jagroop, and Gurpreet Singh Josan. "English to Hindi Multi Modal Image Caption Translation." Journal of Scientific Research 64, no. 02 (2020): 274–81. http://dx.doi.org/10.37398/jsr.2020.640238.

3

Guo, Xiaobin. "Image Visual Attention Mechanism-based Global and Local Semantic Information Fusion for Multi-modal English Machine Translation." 電腦學刊 33, no. 2 (April 2022): 37–50. http://dx.doi.org/10.53106/199115992022043302004.

Abstract:
Machine translation is a hot research topic at present. Traditional machine translation methods are not effective because they require a large number of training samples. Image visual semantic information can improve the effect of the text machine translation model. Most existing works fuse the visual semantic information of the whole image into the translation model, but the image may contain different semantic objects. These local semantic objects have different effects on the decoder's word prediction. Therefore, this paper proposes a multi-modal machine translation model based on the image visual attention mechanism via global and local semantic information fusion. The global semantic information in the image and the local semantic information are fused into the text attention weight as the image attention. Thus, the alignment information between the hidden state of the decoder and the text of the source language is further enhanced. Experimental results on the English-German translation pair and the Indonesian-Chinese translation pair on the Multi30K dataset show that the proposed model performs better than state-of-the-art multi-modal machine translation models: the BLEU values of the English-German and Indonesian-Chinese translation results exceed 43% and 29%, respectively, demonstrating the effectiveness of the proposed model.
5

Shi, Xiayang, Jiaqi Yuan, Yuanyuan Huang, Zhenqiang Yu, Pei Cheng, and Xinyi Liu. "Reference Context Guided Vector to Achieve Multimodal Machine Translation." Journal of Physics: Conference Series 2171, no. 1 (January 1, 2022): 012076. http://dx.doi.org/10.1088/1742-6596/2171/1/012076.

Abstract:
Traditional machine translation mainly realizes the introduction of static images from other modal information to improve translation quality. In processing, a variety of methods are combined to improve the data and features, so that the translation result is close to the upper limit, and some even need to rely on the sensitivity of the sample distance algorithm to the data. At the same time, multi-modal MT will cause problems such as lack of semantic interaction in the attention mechanism in the same corpus, or excessive encoding of the same text image information and corpus irrelevant information, resulting in excessive noise. In order to solve these problems, this article proposes a new input port that adds visual image processing to the decoder. The core idea is to combine visual image information with traditional attention mechanisms at each time step specific to decoding. The dynamic router extracts the relevant visual features, integrates the multi-modal visual features into the decoder, and predicts the target word by introducing the visual image process. At the same time, experiments were carried out on more than 30K datasets translated in the United Kingdom, France and the Czech Republic, which proved the superiority of adding visual images to the decoder to extract features.
6

Calixto, Iacer, and Qun Liu. "An error analysis for image-based multi-modal neural machine translation." Machine Translation 33, no. 1-2 (April 8, 2019): 155–77. http://dx.doi.org/10.1007/s10590-019-09226-9.

7

Gómez, Jose L., Gabriel Villalonga, and Antonio M. López. "Co-Training for Deep Object Detection: Comparing Single-Modal and Multi-Modal Approaches." Sensors 21, no. 9 (May 4, 2021): 3185. http://dx.doi.org/10.3390/s21093185.

Abstract:
Top-performing computer vision models are powered by convolutional neural networks (CNNs). Training an accurate CNN highly depends on both the raw sensor data and their associated ground truth (GT). Collecting such GT is usually done through human labeling, which is time-consuming and does not scale as we wish. This data-labeling bottleneck may be intensified due to domain shifts among image sensors, which could force per-sensor data labeling. In this paper, we focus on the use of co-training, a semi-supervised learning (SSL) method, for obtaining self-labeled object bounding boxes (BBs), i.e., the GT to train deep object detectors. In particular, we assess the goodness of multi-modal co-training by relying on two different views of an image, namely, appearance (RGB) and estimated depth (D). Moreover, we compare appearance-based single-modal co-training with multi-modal. Our results suggest that in a standard SSL setting (no domain shift, a few human-labeled data) and under virtual-to-real domain shift (many virtual-world labeled data, no human-labeled data) multi-modal co-training outperforms single-modal. In the latter case, by performing GAN-based domain translation both co-training modalities are on par, at least when using an off-the-shelf depth estimation model not specifically trained on the translated images.
8

Rodrigues, Ana, Bruna Sousa, Amílcar Cardoso, and Penousal Machado. "“Found in Translation”: An Evolutionary Framework for Auditory–Visual Relationships." Entropy 24, no. 12 (November 22, 2022): 1706. http://dx.doi.org/10.3390/e24121706.

Abstract:
The development of computational artifacts to study cross-modal associations has been a growing research topic, as they allow new degrees of abstraction. In this context, we propose a novel approach to the computational exploration of relationships between music and abstract images, grounded by findings from cognitive sciences (emotion and perception). Due to the problem’s high-level nature, we rely on evolutionary programming techniques to evolve this audio–visual dialogue. To articulate the complexity of the problem, we develop a framework with four modules: (i) vocabulary set, (ii) music generator, (iii) image generator, and (iv) evolutionary engine. We test our approach by evolving a given music set to a corresponding set of images, steered by the expression of four emotions (angry, calm, happy, sad). Then, we perform preliminary user tests to evaluate if the user’s perception is consistent with the system’s expression. Results suggest an agreement between the user’s emotional perception of the music–image pairs and the system outcomes, favoring the integration of cognitive science knowledge. We also discuss the benefit of employing evolutionary strategies, such as genetic programming on multi-modal problems of a creative nature. Overall, this research contributes to a better understanding of the foundations of auditory–visual associations mediated by emotions and perception.
9

Lu, Chien-Yu, Min-Xin Xue, Chia-Che Chang, Che-Rung Lee, and Li Su. "Play as You Like: Timbre-Enhanced Multi-Modal Music Style Transfer." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 1061–68. http://dx.doi.org/10.1609/aaai.v33i01.33011061.

Abstract:
Style transfer of polyphonic music recordings is a challenging task when considering the modeling of diverse, imaginative, and reasonable music pieces in the style different from their original one. To achieve this, learning stable multi-modal representations for both domain-variant (i.e., style) and domain-invariant (i.e., content) information of music in an unsupervised manner is critical. In this paper, we propose an unsupervised music style transfer method without the need for parallel data. Besides, to characterize the multi-modal distribution of music pieces, we employ the Multi-modal Unsupervised Image-to-Image Translation (MUNIT) framework in the proposed system. This allows one to generate diverse outputs from the learned latent distributions representing contents and styles. Moreover, to better capture the granularity of sound, such as the perceptual dimensions of timbre and the nuance in instrument-specific performance, cognitively plausible features including mel-frequency cepstral coefficients (MFCC), spectral difference, and spectral envelope, are combined with the widely-used mel-spectrogram into a timbre-enhanced multi-channel input representation. The Relativistic average Generative Adversarial Networks (RaGAN) is also utilized to achieve fast convergence and high stability. We conduct experiments on bilateral style transfer tasks among three different genres, namely piano solo, guitar solo, and string quartet. Results demonstrate the advantages of the proposed method in music style transfer with improved sound quality and in allowing users to manipulate the output.
10

Islam, Kh Tohidul, Sudanthi Wijewickrema, and Stephen O’Leary. "A rotation and translation invariant method for 3D organ image classification using deep convolutional neural networks." PeerJ Computer Science 5 (March 4, 2019): e181. http://dx.doi.org/10.7717/peerj-cs.181.

Abstract:
Three-dimensional (3D) medical image classification is useful in applications such as disease diagnosis and content-based medical image retrieval. It is a challenging task due to several reasons. First, image intensity values are vastly different depending on the image modality. Second, intensity values within the same image modality may vary depending on the imaging machine and artifacts may also be introduced in the imaging process. Third, processing 3D data requires high computational power. In recent years, significant research has been conducted in the field of 3D medical image classification. However, most of these make assumptions about patient orientation and imaging direction to simplify the problem and/or work with the full 3D images. As such, they perform poorly when these assumptions are not met. In this paper, we propose a method of classification for 3D organ images that is rotation and translation invariant. To this end, we extract a representative two-dimensional (2D) slice along the plane of best symmetry from the 3D image. We then use this slice to represent the 3D image and use a 20-layer deep convolutional neural network (DCNN) to perform the classification task. We show experimentally, using multi-modal data, that our method is comparable to existing methods when the assumptions of patient orientation and viewing direction are met. Notably, it shows similarly high accuracy even when these assumptions are violated, where other methods fail. We also explore how this method can be used with other DCNN models as well as conventional classification approaches.

Dissertations / Theses on the topic "Multi-modal image translation"

1

Liu, Yahui. "Exploring Multi-Domain and Multi-Modal Representations for Unsupervised Image-to-Image Translation." Doctoral thesis, Università degli studi di Trento, 2022. http://hdl.handle.net/11572/342634.

Abstract:
Unsupervised image-to-image translation (UNIT) is a challenging task in the image manipulation field, where input images in a visual domain are mapped into another domain with desired visual patterns (also called styles). An ideal direction in this field is to build a model that can map an input image in a domain to multiple target domains and generate diverse outputs in each target domain, which is termed multi-domain and multi-modal unsupervised image-to-image translation (MMUIT). Recent studies have shown remarkable results in UNIT but they suffer from four main limitations: (1) State-of-the-art UNIT methods are either built from several two-domain mappings that are required to be learned independently or they generate low-diversity results, a phenomenon also known as mode collapse. (2) Most of the manipulation is with the assistance of visual maps or digital labels without exploring natural languages, which could be more scalable and flexible in practice. (3) In an MMUIT system, the style latent space is usually disentangled between every two image domains. While interpolations within domains are smooth, interpolations between two different domains often result in unrealistic images with artifacts when interpolating between two randomly sampled style representations from two different domains. Improving the smoothness of the style latent space can lead to gradual interpolations between any two style latent representations even between any two domains. (4) It is expensive to train MMUIT models from scratch at high resolution. Interpreting the latent space of pre-trained unconditional GANs can achieve pretty good image translations, especially high-quality synthesized images (e.g., 1024x1024 resolution). However, few works explore building an MMUIT system with such pre-trained GANs. In this thesis, we focus on these vital issues and propose several techniques for building better MMUIT systems.
First, we build on the content-style disentangled framework and propose to fit the style latent space with Gaussian Mixture Models (GMMs). It allows a well-trained network using a shared disentangled style latent space to model multi-domain translations. Meanwhile, we can randomly sample different style representations from a Gaussian component or use a reference image for style transfer. Second, we show how the GMM-modeled latent style space can be combined with a language model (e.g., a simple LSTM network) to manipulate multiple styles by using textual commands. Then, we not only propose easy-to-use constraints to improve the smoothness of the style latent space in MMUIT models, but also design a novel metric to quantitatively evaluate the smoothness of the style latent space. Finally, we build a new model to use pretrained unconditional GANs to do MMUIT tasks.
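The GMM-based style space described in this abstract can be pictured with a small sketch: each target domain owns one Gaussian component, a style code is drawn from the component of the desired domain, and two codes can be linearly interpolated. The domain names, the two-dimensional style space, and the means and standard deviations below are invented for illustration; in the thesis these quantities would be learned, and the actual model is not reproduced here.

```python
import random

# Hypothetical 2-D style space with one diagonal Gaussian per domain.
# All parameters are made up; a trained model would learn them.
STYLE_COMPONENTS = {
    "summer": {"mean": [1.0, 0.0], "std": [0.2, 0.2]},
    "winter": {"mean": [-1.0, 0.5], "std": [0.3, 0.1]},
}


def sample_style(domain, rng):
    """Draw a style code from the Gaussian component of one domain."""
    comp = STYLE_COMPONENTS[domain]
    return [rng.gauss(m, s) for m, s in zip(comp["mean"], comp["std"])]


def interpolate(style_a, style_b, t):
    """Linear interpolation between two style codes, with t in [0, 1]."""
    return [(1 - t) * a + t * b for a, b in zip(style_a, style_b)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
summer_style = sample_style("summer", rng)
winter_style = sample_style("winter", rng)
midpoint = interpolate(summer_style, winter_style, 0.5)
print(summer_style, winter_style, midpoint)
```

A generator conditioned on such a code would receive `midpoint` to produce an image between the two domains; whether that image looks realistic is exactly the smoothness question the abstract raises.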

Book chapters on the topic "Multi-modal image translation"

1

Gobeill, Julien, Henning Müller, and Patrick Ruch. "Translation by Text Categorisation: Medical Image Retrieval in ImageCLEFmed 2006." In Evaluation of Multilingual and Multi-modal Information Retrieval, 706–10. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007. http://dx.doi.org/10.1007/978-3-540-74999-8_88.

2

Ren, Mengwei, Heejong Kim, Neel Dey, and Guido Gerig. "Q-space Conditioned Translation Networks for Directional Synthesis of Diffusion Weighted Images from Multi-modal Structural MRI." In Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, 530–40. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-87234-2_50.


Conference papers on the topic "Multi-modal image translation"

1

Chen, Zekang, Jia Wei, and Rui Li. "Unsupervised Multi-Modal Medical Image Registration via Discriminator-Free Image-to-Image Translation." In Thirty-First International Joint Conference on Artificial Intelligence IJCAI-22. California: International Joint Conferences on Artificial Intelligence Organization, 2022. http://dx.doi.org/10.24963/ijcai.2022/117.

Abstract:
In clinical practice, well-aligned multi-modal images, such as Magnetic Resonance (MR) and Computed Tomography (CT), together can provide complementary information for image-guided therapies. Multi-modal image registration is essential for the accurate alignment of these multi-modal images. However, it remains a very challenging task due to complicated and unknown spatial correspondence between different modalities. In this paper, we propose a novel translation-based unsupervised deformable image registration approach to convert the multi-modal registration problem to a mono-modal one. Specifically, our approach incorporates a discriminator-free translation network to facilitate the training of the registration network and a patchwise contrastive loss to encourage the translation network to preserve object shapes. Furthermore, we propose to replace an adversarial loss, that is widely used in previous multi-modal image registration methods, with a pixel loss in order to integrate the output of translation into the target modality. This leads to an unsupervised method requiring no ground-truth deformation or pairs of aligned images for training. We evaluate four variants of our approach on the public Learn2Reg 2021 datasets. The experimental results demonstrate that the proposed architecture achieves state-of-the-art performance. Our code is available at https://github.com/heyblackC/DFMIR.
2

Vishnu Kumar, V. H., and N. Lalithamani. "English to Tamil Multi-Modal Image Captioning Translation." In 2022 IEEE World Conference on Applied Intelligence and Computing (AIC). IEEE, 2022. http://dx.doi.org/10.1109/aic55036.2022.9848810.

3

Huang, Ping, Shiliang Sun, and Hao Yang. "Image-Assisted Transformer in Zero-Resource Multi-Modal Translation." In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021. http://dx.doi.org/10.1109/icassp39728.2021.9413389.

4

Arar, Moab, Yiftach Ginger, Dov Danon, Amit H. Bermano, and Daniel Cohen-Or. "Unsupervised Multi-Modal Image Registration via Geometry Preserving Image-to-Image Translation." In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020. http://dx.doi.org/10.1109/cvpr42600.2020.01342.

5

Laskar, Sahinur Rahman, Rohit Pratap Singh, Partha Pakray, and Sivaji Bandyopadhyay. "English to Hindi Multi-modal Neural Machine Translation and Hindi Image Captioning." In Proceedings of the 6th Workshop on Asian Translation. Stroudsburg, PA, USA: Association for Computational Linguistics, 2019. http://dx.doi.org/10.18653/v1/d19-5205.

6

Cortinhal, Tiago, Fatih Kurnaz, and Eren Erdal Aksoy. "Semantics-aware Multi-modal Domain Translation: From LiDAR Point Clouds to Panoramic Color Images." In 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). IEEE, 2021. http://dx.doi.org/10.1109/iccvw54120.2021.00338.
