Journal articles on the topic 'Convolutional transformer'

To see the other types of publications on this topic, follow the link: Convolutional transformer.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 journal articles for your research on the topic 'Convolutional transformer.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of each academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles across a wide variety of disciplines and organise your bibliography correctly.

1

Li, Pengfei, Peixiang Zhong, Kezhi Mao, Dongzhe Wang, Xuefeng Yang, Yunfeng Liu, Jianxiong Yin, and Simon See. "ACT: an Attentive Convolutional Transformer for Efficient Text Classification." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 15 (May 18, 2021): 13261–69. http://dx.doi.org/10.1609/aaai.v35i15.17566.

Abstract:
Recently, the Transformer has demonstrated promising performance in many NLP tasks and is showing a trend of replacing the Recurrent Neural Network (RNN). Meanwhile, less attention is paid to the Convolutional Neural Network (CNN) due to its weak ability to capture sequential and long-distance dependencies, although it has excellent local feature extraction capability. In this paper, we introduce an Attentive Convolutional Transformer (ACT) that takes advantage of both the Transformer and the CNN for efficient text classification. Specifically, we propose a novel attentive convolution mechanism that utilizes the semantic meaning of convolutional filters attentively to transform text from the complex word space to a more informative convolutional filter space where important n-grams are captured. ACT is able to capture both local and global dependencies effectively while preserving sequential information. Experiments on various text classification tasks and detailed analyses show that ACT is a lightweight, fast, and effective universal text classifier, outperforming CNNs, RNNs, and attentive models including the Transformer.
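
The attentive convolution mechanism described in this abstract can be pictured with a small sketch: 1-D convolutions produce n-gram features, and token representations attend over that filter space. The module below is only an illustration of that general idea, not the authors' ACT implementation; all layer sizes are invented.

```python
# Minimal, illustrative sketch (not the authors' code) of attending over
# convolutional n-gram features. Layer sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveConvBlock(nn.Module):
    def __init__(self, embed_dim=128, num_filters=64, kernel_size=3):
        super().__init__()
        # 1-D convolution extracts local n-gram features from the token sequence
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size, padding=kernel_size // 2)
        self.query = nn.Linear(embed_dim, num_filters)

    def forward(self, x):                      # x: (batch, seq_len, embed_dim)
        ngram = self.conv(x.transpose(1, 2))   # (batch, num_filters, seq_len)
        ngram = ngram.transpose(1, 2)          # (batch, seq_len, num_filters)
        q = self.query(x)                      # project tokens into the filter space
        attn = F.softmax(q @ ngram.transpose(1, 2) / ngram.size(-1) ** 0.5, dim=-1)
        return attn @ ngram                    # tokens re-expressed in "filter space"

if __name__ == "__main__":
    tokens = torch.randn(2, 20, 128)           # fake batch of embedded sentences
    print(AttentiveConvBlock()(tokens).shape)  # torch.Size([2, 20, 64])
```
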
2

He, Ping, Yong Li, Shoulong Chen, Hoghua Xu, Lei Zhu, and Lingyan Wang. "Core looseness fault identification model based on Mel spectrogram-CNN." Journal of Physics: Conference Series 2137, no. 1 (December 1, 2021): 012060. http://dx.doi.org/10.1088/1742-6596/2137/1/012060.

Abstract:
In order to realize transformer voiceprint recognition, a transformer voiceprint recognition model based on a Mel-spectrogram convolutional neural network is proposed. Firstly, the transformer core looseness fault is simulated by setting different preloads, and the sound signals under different preloads are collected. Secondly, the sound signal is converted into a spectrogram that can be trained by a convolutional neural network, and its dimensionality is reduced with a Mel filter bank to produce a Mel spectrogram, which allows spectrogram datasets under different preloads to be generated in batches. Finally, the dataset is fed into a convolutional neural network for training, and the transformer voiceprint fault recognition model is obtained. The results show that the training accuracy of the proposed Mel-spectrogram convolutional neural network transformer identification model is 99.91%, and that it can identify core loosening faults well.
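
As a rough illustration of the pipeline in this abstract (audio signal → Mel spectrogram → CNN classifier), a minimal sketch is given below. The paper does not publish code; the sampling rate, number of Mel bands, and network layout here are assumptions.

```python
# Illustrative sketch only: convert a recorded transformer sound signal to a
# Mel spectrogram and feed it to a small CNN classifier.
import numpy as np
import librosa
import torch
import torch.nn as nn

def to_mel_spectrogram(signal: np.ndarray, sr: int = 16000, n_mels: int = 64) -> torch.Tensor:
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)          # log scale for training stability
    return torch.from_numpy(mel_db).float().unsqueeze(0)   # (1, n_mels, frames)

class LoosenessCNN(nn.Module):
    def __init__(self, num_classes: int = 4):               # e.g. 4 assumed preload levels
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):                                    # x: (batch, 1, n_mels, frames)
        return self.classifier(self.features(x).flatten(1))

# usage: spec = to_mel_spectrogram(audio); logits = LoosenessCNN()(spec.unsqueeze(0))
```
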
3

Li, Xiaopeng, and Shuqin Li. "Transformer Help CNN See Better: A Lightweight Hybrid Apple Disease Identification Model Based on Transformers." Agriculture 12, no. 6 (June 19, 2022): 884. http://dx.doi.org/10.3390/agriculture12060884.

Abstract:
The complex backgrounds of crop disease images and the small contrast between the disease area and the background can easily cause confusion, which seriously affects the robustness and accuracy of apple disease identification models. To solve these problems, this paper proposes a Vision Transformer-based lightweight apple leaf disease identification model, ConvViT, to extract effective features of crop disease spots and identify crop diseases. Our ConvViT includes convolutional structures and Transformer structures; the convolutional structure is used to extract the global features of the image, and the Transformer structure is used to obtain the local features of the disease region to help the CNN see better. The patch embedding method is improved to retain more edge information of the image and promote the information exchange between patches in the Transformer. The parameters and FLOPs (floating-point operations) of the model are significantly reduced by using depthwise separable convolution and linear-complexity multi-head attention operations. Experimental results on a self-built apple leaf disease dataset with complex backgrounds show that ConvViT achieves identification results (96.85%) comparable to the current state-of-the-art Swin-Tiny. Its parameters and FLOPs are only 32.7% and 21.7% of Swin-Tiny's, and it is significantly ahead of MobilenetV3, Efficientnet-b0, and other models, which indicates that the proposed model is an effective disease identification model with practical application value.
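
The abstract attributes much of ConvViT's parameter and FLOP savings to depthwise separable convolution. The block below is a generic PyTorch version of that standard building block, with placeholder channel sizes rather than the paper's configuration.

```python
# Generic depthwise separable convolution block (illustrative, not ConvViT's exact layers).
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        # depthwise: one filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        # pointwise: 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# A standard 3x3 conv from 64 to 128 channels uses 64*128*9 = 73,728 weights;
# the separable version uses 64*9 + 64*128 = 8,768, roughly 8x fewer.
x = torch.randn(1, 64, 56, 56)
print(DepthwiseSeparableConv(64, 128)(x).shape)  # torch.Size([1, 128, 56, 56])
```
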
4

Zhou, Li, Tongqin Shi, Songquan Huang, Fangchao Ke, Zhenxi Huang, Zhaoyang Zhang, and Jinzheng Liang. "Convolutional neural network for real-time main transformer detection." Journal of Physics: Conference Series 2229, no. 1 (March 1, 2022): 012021. http://dx.doi.org/10.1088/1742-6596/2229/1/012021.

Abstract:
For substation construction, the main transformer is the dominant piece of electrical equipment, and its arrival and operation directly affect the progress of the project. In the context of smart grid construction, and in order to improve the efficiency of real-time main transformer detection, this paper proposes an identification and detection method based on the SSD algorithm. The SSD algorithm is able to extract the target device (such as the main transformer) accurately, and the LeNet module can analyse the features contained in the image. To improve the accuracy of the detection method, the image migration algorithm of VGG-Net is used to expand the negative samples of main transformers and improve the generalisation of the algorithm. Finally, an image set collected from real substation projects is used for validation, and the results show that the method identifies main transformers more accurately, with high effectiveness and feasibility.
5

Zhang, Zhiwen, Teng Li, Xuebin Tang, Xiang Hu, and Yuanxi Peng. "CAEVT: Convolutional Autoencoder Meets Lightweight Vision Transformer for Hyperspectral Image Classification." Sensors 22, no. 10 (May 20, 2022): 3902. http://dx.doi.org/10.3390/s22103902.

Abstract:
Convolutional neural networks (CNNs) have been prominent in most hyperspectral image (HSI) processing applications due to their advantages in extracting local information. Despite their success, the locality of the convolutional layers within CNNs leads to heavyweight models and time-consuming processing. In this study, inspired by the excellent performance of transformers used for long-range representation learning in computer vision tasks, we built a lightweight vision transformer for HSI classification that can extract local and global information simultaneously, thereby facilitating accurate classification. Moreover, as traditional dimensionality reduction methods are limited in their linear representation ability, a three-dimensional convolutional autoencoder was adopted to capture the nonlinear characteristics between spectral bands. Based on the aforementioned three-dimensional convolutional autoencoder and lightweight vision transformer, we designed an HSI classification network, namely the “convolutional autoencoder meets lightweight vision transformer” (CAEVT). Finally, we validated the performance of the proposed CAEVT network using four widely used hyperspectral datasets. Our approach showed superiority, especially in the absence of sufficient labeled samples, which demonstrates the effectiveness and efficiency of the CAEVT network.
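
A minimal sketch of the idea of compressing the spectral dimension of a hyperspectral cube with a 3-D convolutional autoencoder, as the abstract describes, is shown below. It is not the CAEVT release; band counts and layer widths are illustrative.

```python
# Illustrative 3-D convolutional autoencoder that compresses the spectral axis of an HSI patch.
import torch
import torch.nn as nn

class SpectralConvAE(nn.Module):
    def __init__(self):
        super().__init__()
        # input: (batch, 1, bands, height, width); stride on the band axis compresses spectra
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(7, 3, 3), stride=(2, 1, 1), padding=(3, 1, 1)),
            nn.ReLU(),
            nn.Conv3d(8, 16, kernel_size=(7, 3, 3), stride=(2, 1, 1), padding=(3, 1, 1)),
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(16, 8, kernel_size=(7, 3, 3), stride=(2, 1, 1),
                               padding=(3, 1, 1), output_padding=(1, 0, 0)),
            nn.ReLU(),
            nn.ConvTranspose3d(8, 1, kernel_size=(7, 3, 3), stride=(2, 1, 1),
                               padding=(3, 1, 1), output_padding=(1, 0, 0)),
        )

    def forward(self, x):
        z = self.encoder(x)          # compact nonlinear spectral representation
        return self.decoder(z), z    # reconstruction for the AE loss, code for the classifier

cube = torch.randn(2, 1, 200, 9, 9)  # 200 spectral bands, 9x9 spatial patch (assumed sizes)
recon, code = SpectralConvAE()(cube)
print(recon.shape, code.shape)       # (2, 1, 200, 9, 9) and (2, 16, 50, 9, 9)
```
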
6

Xu, Jun, Zi-Xuan Chen, Hao Luo, and Zhe-Ming Lu. "An Efficient Dehazing Algorithm Based on the Fusion of Transformer and Convolutional Neural Network." Sensors 23, no. 1 (December 21, 2022): 43. http://dx.doi.org/10.3390/s23010043.

Abstract:
The purpose of image dehazing is to remove the interference of weather factors from degraded images and enhance the clarity and color saturation of images so as to maximize the restoration of useful features. Single image dehazing is one of the most important tasks in the field of image restoration. In recent years, thanks to the progress of deep learning, single image dehazing has advanced greatly. With the success of Transformers in advanced computer vision tasks, some studies have also begun to apply Transformers to image dehazing and have obtained surprising results. However, convolutional-neural-network-based dehazing algorithms and Transformer-based dehazing algorithms each have their own distinct advantages and disadvantages. Therefore, this paper proposes a novel Transformer–Convolution fusion dehazing network (TCFDN), which uses the Transformer's global modeling ability and the convolutional neural network's local modeling ability to improve dehazing. The Transformer–Convolution fusion dehazing network uses a classic autoencoder structure. This paper proposes a Transformer–Convolution hybrid layer, which uses an adaptive fusion strategy to make full use of the Swin Transformer and the convolutional neural network to extract and reconstruct image features. Building on previous research, this layer further improves the network's ability to remove haze. A series of comparison and ablation experiments not only proved that the proposed Transformer–Convolution fusion dehazing network outperforms more advanced dehazing algorithms, but also provided solid evidence for the basic theory on which it depends.
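
The abstract's Transformer–Convolution hybrid layer fuses the two branches adaptively. One common way to realize such fusion, a learned content-dependent gate between a CNN branch and a transformer branch, is sketched below as a stand-in for (not a reproduction of) the TCFDN layer.

```python
# Learned gating between a convolutional branch and a transformer branch (illustrative).
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # predict a per-pixel, per-channel mixing weight from both branches
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, conv_feat: torch.Tensor, trans_feat: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([conv_feat, trans_feat], dim=1))
        return g * conv_feat + (1.0 - g) * trans_feat   # convex combination of the two branches

fused = AdaptiveFusion(64)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```
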
7

Yang, Liming, Yihang Yang, Jinghui Yang, Ningyuan Zhao, Ling Wu, Liguo Wang, and Tianrui Wang. "FusionNet: A Convolution–Transformer Fusion Network for Hyperspectral Image Classification." Remote Sensing 14, no. 16 (August 19, 2022): 4066. http://dx.doi.org/10.3390/rs14164066.

Abstract:
In recent years, deep-learning-based hyperspectral image (HSI) classification networks have become one of the most dominant implementations in HSI classification tasks. Among these networks, convolutional neural networks (CNNs) and attention-based networks have prevailed over other HSI classification networks. While convolutional neural networks, with their local receptive fields, can effectively extract local features in the spatial dimension of HSI, they are poor at capturing the global and sequential features of spectral–spatial information; networks based on attention mechanisms, for example the Transformer, usually have a better ability to capture global features but are relatively weak at discriminating local features. This paper proposes a fusion network of convolution and Transformer for HSI classification, known as FusionNet, in which convolution and Transformer are fused in both serial and parallel mechanisms to achieve the full utilization of HSI features. Experimental results demonstrate that the proposed network has superior classification results compared to previous similar networks, and performs relatively well even on a small amount of training data.
8

Ibrahem, Hatem, Ahmed Salem, and Hyun-Soo Kang. "RT-ViT: Real-Time Monocular Depth Estimation Using Lightweight Vision Transformers." Sensors 22, no. 10 (May 19, 2022): 3849. http://dx.doi.org/10.3390/s22103849.

Abstract:
The latest research in computer vision has highlighted the effectiveness of vision transformers (ViT) in performing several computer vision tasks; they can efficiently understand and process the image globally, unlike convolution, which processes the image locally. ViTs outperform convolutional neural networks in terms of accuracy in many computer vision tasks, but the speed of ViTs is still an issue due to the excessive use of transformer layers that include many fully connected layers. Therefore, we propose a real-time ViT-based monocular depth estimation (depth estimation from a single RGB image) method with encoder-decoder architectures for indoor and outdoor scenes. The main architecture of the proposed method consists of a vision transformer encoder and a convolutional neural network decoder. We started by training the base vision transformer (ViT-b16) with 12 transformer layers, then reduced the transformer layers to six layers, namely ViT-s16 (the Small ViT), and four layers, namely ViT-t16 (the Tiny ViT), to obtain real-time processing. We also tried four different configurations of the CNN decoder network. The proposed architectures can learn the task of depth estimation efficiently and can produce more accurate depth predictions than fully convolutional methods, taking advantage of the multi-head self-attention module. We trained the proposed encoder-decoder architecture end-to-end on the challenging NYU-depthV2 and CITYSCAPES benchmarks and then evaluated the trained models on the validation and test sets of the same benchmarks, showing that it outperforms many state-of-the-art methods on depth estimation while performing the task in real time (∼20 fps). We also present a fast 3D reconstruction experiment (∼17 fps) based on the depth estimated by our method, which constitutes a real-world application of our method.
9

Wu, Jiajing, Zhiqiang Wei, Jinpeng Zhang, Yushi Zhang, Dongning Jia, Bo Yin, and Yunchao Yu. "Full-Coupled Convolutional Transformer for Surface-Based Duct Refractivity Inversion." Remote Sensing 14, no. 17 (September 3, 2022): 4385. http://dx.doi.org/10.3390/rs14174385.

Abstract:
A surface-based duct (SBD) is an abnormal atmospheric structure with a low probability of occurrence but a strong ability to trap electromagnetic waves. However, existing research is based on the assumption that the range direction of the surface duct is homogeneous, which leads to low productivity and large errors when applied in a real marine environment. To alleviate these issues, we propose a framework for the inversion of inhomogeneous SBD M-profiles based on a full-coupled convolutional Transformer (FCCT) deep learning network. We first designed a one-dimensional residual dilated causal convolution autoencoder to extract feature representations from the high-dimensional, range-direction-inhomogeneous M-profile. Second, to improve efficiency and precision, we proposed the full-coupled convolutional Transformer (FCCT), which incorporates dilated causal convolutional layers to gain exponential receptive-field growth over the M-profile and to help Transformer-like models improve the receptive field over each range-direction-inhomogeneous SBD M-profile. We tested the proposed method on two sets of simulated sea clutter power data, where the inversion accuracy reached 96.99% and 97.69%, outperforming the existing baseline methods.
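
The dilated causal convolutions mentioned in the abstract grow the receptive field exponentially with depth. The following generic TCN-style stack illustrates that property; it is not the FCCT code, and the channel count and number of levels are arbitrary.

```python
# Generic stack of 1-D dilated causal convolutions with residual connections.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation           # pad only on the left => causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                 # x: (batch, channels, length)
        out = self.conv(F.pad(x, (self.pad, 0)))
        return torch.relu(out) + x                        # residual connection

class DilatedCausalStack(nn.Module):
    def __init__(self, channels: int = 32, levels: int = 5):
        super().__init__()
        self.blocks = nn.Sequential(
            *[CausalConv1d(channels, dilation=2 ** i) for i in range(levels)]
        )

    def forward(self, x):
        return self.blocks(x)

# With kernel 3 and 5 levels the receptive field is 1 + 2*(1+2+4+8+16) = 63 steps.
seq = torch.randn(4, 32, 128)
print(DilatedCausalStack()(seq).shape)  # torch.Size([4, 32, 128])
```
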
10

Sowndarya, S., and Sujatha Balaraman. "Diagnosis of Partial Discharge in Power Transformer using Convolutional Neural Network." March 2022 4, no. 1 (April 30, 2022): 29–38. http://dx.doi.org/10.36548/jscp.2022.1.004.

Abstract:
In an electric power system, power transformers are essential. Transformer failures can degrade power quality and create power outages. Partial Discharge (PD) is a condition that, if not adequately monitored, can cause power transformer failures. This project addresses the diagnosis of PD in power transformers using the Phase Amplitude (PA) response of PRPD (Phase-Resolved Partial Discharge) patterns recorded with PD detectors, a widely used pattern for analysing partial discharge. A Convolutional Neural Network (CNN) is used to classify the type of PD defect. PRPD patterns of 240 PA sample images were taken from power transformers rated 132/11 kV and 132/25 kV for training and testing the network. Feature extraction is also performed using the CNN. In this work, the classification of PD faults is carried out with a supervised machine learning technique: three classes of PD faults, Floating PD, Surface PD, and Void PD, are considered and predicted using a Support Vector Machine (SVM) classifier. The simulation study is carried out in MATLAB. Based on the results obtained, the CNN model achieved greater classification accuracy, thereby helping to extend the life span of power transformers.
11

Zhu, Wenhao, Yujun Xie, Qun Huang, Zehua Zheng, Xiaozhao Fang, Yonghui Huang, and Weijun Sun. "Graph Transformer Collaborative Filtering Method for Multi-Behavior Recommendations." Mathematics 10, no. 16 (August 16, 2022): 2956. http://dx.doi.org/10.3390/math10162956.

Abstract:
Graph convolutional networks are widely used in recommendation tasks owing to their ability to learn user and item embeddings using collaborative signals from high-order neighborhoods. Most of the graph convolutional recommendation tasks in existing studies have specialized in modeling a single type of user–item interaction preference. Meanwhile, graph-convolution-network-based recommendation models are prone to over-smoothing problems when stacking increased numbers of layers. Therefore, in this study we propose a multi-behavior recommendation method based on graph transformer collaborative filtering. This method utilizes an unsupervised subgraph generation model that divides users with similar preferences and their interaction items into subgraphs. Furthermore, it fuses multi-headed attention layers with temporal coding strategies based on the user–item interaction graphs in the subgraphs such that the learned embeddings can reflect multiple user–item relationships and the potential for dynamic interactions. Finally, multi-behavior recommendation is performed by uniting multi-layer embedding representations. The experimental results on two real-world datasets show that the proposed method performs better than previously developed systems.
12

Pu, Shilin, Liang Chu, Jincheng Hu, Shibo Li, Jihao Li, and Wen Sun. "SGGformer: Shifted Graph Convolutional Graph-Transformer for Traffic Prediction." Sensors 22, no. 22 (November 21, 2022): 9024. http://dx.doi.org/10.3390/s22229024.

Abstract:
Accurate traffic prediction is significant for the safe and stable development of intelligent cities. However, due to the complex spatiotemporal correlation of traffic flow data, establishing an accurate traffic prediction model remains challenging. To meet this challenge, this paper proposes SGGformer, an advanced traffic grade prediction model which combines a shifted window operation, a multi-channel graph convolutional network, and a graph Transformer network. Firstly, the shifted window operation is used to coarsen the time series data, thus reducing the computational complexity. Then, a multi-channel graph convolutional network is adopted to capture and aggregate the spatial correlations of the roads in multiple dimensions. Finally, an improved graph Transformer based on the advanced Transformer model is proposed to extract the long-term temporal correlation of traffic data effectively. The prediction performance is evaluated using actual traffic datasets, and the test results show that the proposed SGGformer outperforms the state-of-the-art baseline.
13

Bai, Yuanchao, Xu Yang, Xianming Liu, Junjun Jiang, Yaowei Wang, Xiangyang Ji, and Wen Gao. "Towards End-to-End Image Compression and Analysis with Transformers." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 1 (June 28, 2022): 104–12. http://dx.doi.org/10.1609/aaai.v36i1.19884.

Abstract:
We propose an end-to-end image compression and analysis model with Transformers, targeting cloud-based image classification applications. Instead of placing an existing Transformer-based image classification model directly after an image codec, we redesign the Vision Transformer (ViT) model to perform image classification from the compressed features and to facilitate image compression with long-term information from the Transformer. Specifically, we first replace the patchify stem (i.e., image splitting and embedding) of the ViT model with a lightweight image encoder modelled by a convolutional neural network. The compressed features generated by the image encoder are injected with convolutional inductive bias and are fed to the Transformer for image classification, bypassing image reconstruction. Meanwhile, we propose a feature aggregation module to fuse the compressed features with selected intermediate features of the Transformer, and we feed the aggregated features to a deconvolutional neural network for image reconstruction. The aggregated features can obtain long-term information from the self-attention mechanism of the Transformer and improve the compression performance. The rate-distortion-accuracy optimization problem is finally solved by a two-step training strategy. Experimental results demonstrate the effectiveness of the proposed model in both the image compression and the classification tasks.
14

Le, Viet-Tuan, Kiet Tran-Trung, and Vinh Truong Hoang. "A Comprehensive Review of Recent Deep Learning Techniques for Human Activity Recognition." Computational Intelligence and Neuroscience 2022 (April 20, 2022): 1–17. http://dx.doi.org/10.1155/2022/8323962.

Abstract:
Human action recognition is an important field in computer vision that has attracted remarkable attention from researchers. This survey aims to provide a comprehensive overview of recent human action recognition approaches based on deep learning using RGB video data. Our work divides recent deep-learning-based methods into five different categories to provide a comprehensive overview for researchers who are interested in this field of computer vision. Moreover, pure-transformer (convolution-free) architectures have recently outperformed their convolutional counterparts in many fields of computer vision. Our work therefore also covers recent convolution-free methods that replace convolutional networks with transformer networks and have achieved state-of-the-art results on many human action recognition datasets. Firstly, we discuss proposed methods based on a 2D convolutional neural network. Then, methods based on a recurrent neural network, which is used to capture motion information, are discussed. 3D convolutional neural network-based methods are used in many recent approaches to capture both spatial and temporal information in videos. For long action videos, multistream approaches that use different streams to encode different features are reviewed. We also compare the performance of recently proposed methods on four popular benchmark datasets, and we review 26 benchmark datasets for human action recognition. Some potential research directions are discussed to conclude this survey.
15

Jiang, Yun, Jing Liang, Tongtong Cheng, Xin Lin, Yuan Zhang, and Jinkun Dong. "MTPA_Unet: Multi-Scale Transformer-Position Attention Retinal Vessel Segmentation Network Joint Transformer and CNN." Sensors 22, no. 12 (June 17, 2022): 4592. http://dx.doi.org/10.3390/s22124592.

Abstract:
Retinal vessel segmentation is extremely important for risk prediction and treatment of many major diseases. Therefore, accurate segmentation of blood vessel features from retinal images can help assist physicians in diagnosis and treatment. Convolutional neural networks are good at extracting local feature information, but the receptive field of the convolutional block is limited. The Transformer, on the other hand, performs well in modeling long-distance dependencies. Therefore, in this paper a new network model, MTPA_Unet, is designed from the perspective of extracting connections between local detailed features and complementing them with long-distance dependency information, and it is applied to the retinal vessel segmentation task. MTPA_Unet uses multi-resolution image input to enable the network to extract information at different levels. The proposed TPA module not only captures long-distance dependencies, but also focuses on the location information of the vessel pixels to facilitate capillary segmentation. The Transformer is combined with the convolutional neural network in a serial approach, and the original MSA module is replaced by the TPA module to achieve finer segmentation. Finally, the network model is evaluated and analyzed on three recognized retinal image datasets: DRIVE, CHASE DB1, and STARE. The evaluation metrics were 0.9718, 0.9762, and 0.9773 for accuracy; 0.8410, 0.8437, and 0.8938 for sensitivity; and 0.8318, 0.8164, and 0.8557 for Dice coefficient. Compared with existing retinal image segmentation methods, the proposed method achieved better vessel segmentation performance and results on all of the publicly available fundus datasets tested.
16

Yuan, Wei, and Wenbo Xu. "MSST-Net: A Multi-Scale Adaptive Network for Building Extraction from Remote Sensing Images Based on Swin Transformer." Remote Sensing 13, no. 23 (November 23, 2021): 4743. http://dx.doi.org/10.3390/rs13234743.

Abstract:
The segmentation of remote sensing images by deep learning technology is the main method for remote sensing image interpretation. However, segmentation models based on convolutional neural networks cannot capture global features very well. A transformer, whose self-attention mechanism can supply each pixel with a global feature, makes up for this deficiency of the convolutional neural network. Therefore, a multi-scale adaptive segmentation network model (MSST-Net) based on a Swin Transformer is proposed in this paper. Firstly, a Swin Transformer is used as the backbone to encode the input image. Then, the feature maps of different levels are decoded separately. Thirdly, convolution is used for fusion, so that the network can automatically learn the weight of the decoding result of each level. Finally, we adjust the channels with a convolution with a kernel of 1 × 1 to obtain the final prediction map. Compared with other segmentation network models on the WHU building dataset, the evaluation metrics mIoU, F1-score, and accuracy are all improved. The network model proposed in this paper is a multi-scale adaptive network model that pays more attention to global features for remote sensing segmentation.
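
The fusion step the abstract describes (decode each backbone level separately, fuse with convolution so the network learns per-level weights, then predict with a 1 × 1 convolution) can be sketched as follows. Channel widths and the Swin-like feature shapes are assumptions, not the MSST-Net release.

```python
# Multi-level decoding and convolutional fusion head (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelFusionHead(nn.Module):
    def __init__(self, in_channels=(96, 192, 384, 768), num_classes=2):
        super().__init__()
        # per-level decoders reduce every feature map to a common channel width
        self.decoders = nn.ModuleList([nn.Conv2d(c, 64, 1) for c in in_channels])
        self.fuse = nn.Conv2d(64 * len(in_channels), 64, 3, padding=1)  # learns level weighting
        self.classify = nn.Conv2d(64, num_classes, 1)                   # final 1x1 prediction conv

    def forward(self, feats):                      # feats: list of (B, C_i, H_i, W_i)
        target = feats[0].shape[-2:]               # upsample everything to the finest level
        ups = [F.interpolate(dec(f), size=target, mode="bilinear", align_corners=False)
               for dec, f in zip(self.decoders, feats)]
        return self.classify(torch.relu(self.fuse(torch.cat(ups, dim=1))))

feats = [torch.randn(1, c, s, s) for c, s in zip((96, 192, 384, 768), (128, 64, 32, 16))]
print(MultiLevelFusionHead()(feats).shape)  # torch.Size([1, 2, 128, 128])
```
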
17

Zhang, Jianrong, Hongwei Zhao, and Jiao Li. "TRS: Transformers for Remote Sensing Scene Classification." Remote Sensing 13, no. 20 (October 16, 2021): 4143. http://dx.doi.org/10.3390/rs13204143.

Abstract:
Remote sensing scene classification remains challenging due to the complexity and variety of scenes. With the development of attention-based methods, Convolutional Neural Networks (CNNs) have achieved competitive performance in remote sensing scene classification tasks. As an important attention-based model, the Transformer has achieved great success in the field of natural language processing. Recently, the Transformer has been used for computer vision tasks. However, most existing methods divide the original image into multiple patches and encode the patches as the input of the Transformer, which limits the model's ability to learn the overall features of the image. In this paper, we propose a new remote sensing scene classification method, the Remote Sensing Transformer (TRS), a powerful “pure CNNs → Convolution + Transformer → pure Transformers” structure. First, we integrate self-attention into ResNet in a novel way, using our proposed Multi-Head Self-Attention layer instead of the 3 × 3 spatial convolutions in the bottleneck. Then we connect multiple pure Transformer encoders to further improve the representation learning performance completely depending on attention. Finally, we use a linear classifier for classification. We train our model on four public remote sensing scene datasets: UC-Merced, AID, NWPU-RESISC45, and OPTIMAL-31. The experimental results show that TRS exceeds the state-of-the-art methods and achieves higher accuracy.
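
The key architectural move in this abstract, replacing the 3 × 3 convolution inside a ResNet bottleneck with multi-head self-attention, is illustrated below in the spirit of BoTNet-style blocks. This is not the TRS code, and positional encoding is omitted for brevity.

```python
# Bottleneck block whose 3x3 spatial convolution is replaced by multi-head self-attention.
import torch
import torch.nn as nn

class MHSABottleneck(nn.Module):
    def __init__(self, in_ch=256, mid_ch=64, heads=4):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, 1)
        self.attn = nn.MultiheadAttention(mid_ch, heads, batch_first=True)
        self.expand = nn.Conv2d(mid_ch, in_ch, 1)
        self.norm = nn.BatchNorm2d(in_ch)

    def forward(self, x):
        b, _, h, w = x.shape
        y = self.reduce(x)                               # 1x1 conv: channel reduction
        seq = y.flatten(2).transpose(1, 2)               # (B, H*W, mid_ch) token sequence
        attn_out, _ = self.attn(seq, seq, seq)           # replaces the 3x3 spatial conv
        y = attn_out.transpose(1, 2).reshape(b, -1, h, w)
        return torch.relu(self.norm(self.expand(y)) + x) # 1x1 expansion + residual

print(MHSABottleneck()(torch.randn(2, 256, 14, 14)).shape)  # torch.Size([2, 256, 14, 14])
```
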
18

AlBadani, Barakat, Ronghua Shi, Jian Dong, Raeed Al-Sabri, and Oloulade Babatounde Moctard. "Transformer-Based Graph Convolutional Network for Sentiment Analysis." Applied Sciences 12, no. 3 (January 26, 2022): 1316. http://dx.doi.org/10.3390/app12031316.

Abstract:
Sentiment Analysis is an essential research topic in the field of natural language processing (NLP) and has attracted the attention of many researchers in the last few years. Recently, deep neural network (DNN) models have been used for sentiment analysis tasks, achieving promising results. Although these models can analyze sequences of arbitrary length, utilizing them in the feature extraction layer of a DNN increases the dimensionality of the feature space. More recently, graph neural networks (GNNs) have achieved a promising performance in different NLP tasks. However, previous models cannot be transferred to a large corpus and neglect the heterogeneity of textual graphs. To overcome these difficulties, we propose a new Transformer-based graph convolutional network for heterogeneous graphs called Sentiment Transformer Graph Convolutional Network (ST-GCN). To the best of our knowledge, this is the first study to model the sentiment corpus as a heterogeneous graph and learn document and word embeddings using the proposed sentiment graph transformer neural network. In addition, our model offers an easy mechanism to fuse node positional information for graph datasets using Laplacian eigenvectors. Extensive experiments on four standard datasets show that our model outperforms the existing state-of-the-art models.
19

Zheng, Yuping, and Weiwei Jiang. "Evaluation of Vision Transformers for Traffic Sign Classification." Wireless Communications and Mobile Computing 2022 (June 4, 2022): 1–14. http://dx.doi.org/10.1155/2022/3041117.

Abstract:
Traffic sign recognition is one of the most important tasks in autonomous driving. Camera-based computer vision techniques have been proposed for this task, and various convolutional neural network structures have been used and validated with multiple open datasets. Recently, novel Transformer-based models have been proposed for various computer vision tasks and have achieved state-of-the-art performance, outperforming convolutional neural networks in several tasks. In this study, our goal is to investigate whether the success of Vision Transformers can be replicated in the traffic sign recognition area. Based on existing resources, we first extract and contribute three open traffic sign classification datasets. Based on these datasets, we experiment with seven convolutional neural networks and five Vision Transformers. We find that Transformers are not as competitive as convolutional neural networks for the traffic sign classification task. Specifically, performance gaps of up to 12.81%, 2.01%, and 4.37% exist for the German, Indian, and Chinese traffic sign datasets, respectively. Furthermore, we propose some suggestions to improve the performance of Transformers.
20

Shao, Ran, Xiao-Jun Bi, and Zheng Chen. "A novel hybrid transformer-CNN architecture for environmental microorganism classification." PLOS ONE 17, no. 11 (November 11, 2022): e0277557. http://dx.doi.org/10.1371/journal.pone.0277557.

Abstract:
The success of vision transformers (ViTs) has given rise to their application in classification tasks on small environmental microorganism (EM) datasets. However, due to the lack of multi-scale feature maps and local feature extraction capabilities, the pure transformer architecture cannot achieve good results on small EM datasets. In this work, a novel hybrid model is proposed by combining the transformer with a convolutional neural network (CNN). Compared to traditional ViTs and CNNs, the proposed model achieves state-of-the-art performance when trained on small EM datasets. This is accomplished in two ways. 1) Instead of the original fixed-size feature maps of transformer-based designs, a hierarchical structure is adopted to obtain multi-scale feature maps. 2) Two new blocks are introduced into the transformer's two core sections, namely the convolutional parameter sharing multi-head attention block and the local feed-forward network block. These changes allow the model to extract more local features than traditional transformers. In particular, for classification on the sixth version of the EM dataset (EMDS-6), the proposed model outperforms the baseline Xception by 6.7 percentage points, while being 60 times smaller in parameter size. In addition, the proposed model also generalizes well on the WHOI dataset (accuracy of 99%) and constitutes a fresh approach to the use of transformers for visual classification tasks based on small EM datasets.
21

Ayana, Gelan, Kokeb Dese, Yisak Dereje, Yonas Kebede, Hika Barki, Dechassa Amdissa, Nahimiya Husen, Fikadu Mulugeta, Bontu Habtamu, and Se-Woon Choe. "Vision-Transformer-Based Transfer Learning for Mammogram Classification." Diagnostics 13, no. 2 (January 4, 2023): 178. http://dx.doi.org/10.3390/diagnostics13020178.

Abstract:
Breast mass identification is a crucial procedure during mammogram-based early breast cancer diagnosis. However, it is difficult to determine whether a breast lump is benign or cancerous at early stages. Convolutional neural networks (CNNs) have been used to solve this problem and have provided useful advancements. However, CNNs focus only on a certain portion of the mammogram while ignoring the rest, and they incur computational complexity because of the multiple convolutions involved. Recently, vision transformers have been developed as a technique to overcome such limitations of CNNs, ensuring better or comparable performance in natural image classification. However, the utility of this technique has not been thoroughly investigated in the medical image domain. In this study, we developed a transfer learning technique based on vision transformers to classify breast mass mammograms. The area under the receiver operating curve of the new model was estimated as 1 ± 0, thus outperforming the CNN-based transfer-learning models and vision transformer models trained from scratch. The technique can, hence, be applied in a clinical setting to improve the early diagnosis of breast cancer.
22

Xu, Yufen, Shangbo Zhou, and Yuhui Huang. "Transformer-Based Model with Dynamic Attention Pyramid Head for Semantic Segmentation of VHR Remote Sensing Imagery." Entropy 24, no. 11 (November 6, 2022): 1619. http://dx.doi.org/10.3390/e24111619.

Abstract:
Convolutional neural networks have long dominated semantic segmentation of very-high-resolution (VHR) remote sensing (RS) images. However, restricted by the fixed receptive field of the convolution operation, convolution-based models cannot directly obtain contextual information. Meanwhile, the Swin Transformer possesses great potential for modeling long-range dependencies. Nevertheless, the Swin Transformer breaks images into patches that are single-dimension sequences without considering the position loss problem inside patches. Therefore, inspired by the Swin Transformer and Unet, we propose SUD-Net (Swin transformer-based Unet-like with Dynamic attention pyramid head Network), a new U-shaped architecture composed of Swin Transformer blocks and convolution layers simultaneously through a dual encoder and an upsampling decoder with a Dynamic Attention Pyramid Head (DAPH) attached to the backbone. First, we propose a dual encoder structure combining Swin Transformer blocks and reslayers in reverse order to complement global semantics with detailed representations. Second, aiming at the spatial loss problem inside each patch, we design a Multi-Path Fusion Model (MPFM) with specially devised Patch Attention (PA) to encode position information of patches and adaptively fuse features of different scales through attention mechanisms. Third, a Dynamic Attention Pyramid Head is constructed with deformable convolution to dynamically aggregate effective and important semantic information. SUD-Net achieves exceptional results on the ISPRS Potsdam and Vaihingen datasets with 92.51% mF1, 86.4% mIoU, 92.98% OA and 89.49% mF1, 81.26% mIoU, 90.95% OA, respectively.
23

Li, Jiaju, Hanfa Xing, Zurui Ao, Hefeng Wang, Wenkai Liu, and Anbing Zhang. "Convolution-Transformer Adaptive Fusion Network for Hyperspectral Image Classification." Applied Sciences 13, no. 1 (December 30, 2022): 492. http://dx.doi.org/10.3390/app13010492.

Abstract:
Hyperspectral image (HSI) classification is an important but challenging topic in the field of remote sensing and earth observation. By coupling the advantages of convolutional neural network (CNN) and Transformer model, the CNN–Transformer hybrid model can extract local and global features simultaneously and has achieved outstanding performance in HSI classification. However, most of the existing CNN–Transformer hybrid models use artificially specified hybrid strategies, which have poor generalization ability and are difficult to meet the requirements of recognizing fine-grained objects in HSI of complex scenes. To overcome this problem, we proposed a convolution–Transformer adaptive fusion network (CTAFNet) for pixel-wise HSI classification. A local–global fusion feature extraction unit, called the convolution–Transformer adaptive fusion kernel, was designed and integrated into the CTAFNet. The kernel captures the local high-frequency features using a convolution module and extracts the global and sequential low-frequency information using a Transformer module. We developed an adaptive feature fusion strategy to fuse the local high-frequency and global low-frequency features to obtain a robust and discriminative representation of the HSI data. An encoder–decoder structure was adopted in the CTAFNet to improve the flow of fused local–global information between different stages, thus ensuring the generalization ability of the model. Experimental results conducted on three large-scale and challenging HSI datasets demonstrate that the proposed network is superior to nine state-of-the-art approaches. We highlighted the effectiveness of adaptive CNN–Transformer hybrid strategy in HSI classification.
24

Kumar, Dinesh, and Dharmendra Sharma. "Feature Map Augmentation to Improve Scale Invariance in Convolutional Neural Networks." Journal of Artificial Intelligence and Soft Computing Research 13, no. 1 (November 28, 2022): 51–74. http://dx.doi.org/10.2478/jaiscr-2023-0004.

Abstract:
Introducing variation in the training dataset through data augmentation has been a popular technique to make Convolutional Neural Networks (CNNs) spatially invariant, but it leads to increased dataset volume and computation cost. Instead of data augmentation, augmentation of feature maps is proposed to introduce variations in the features extracted by a CNN. To achieve this, a rotation transformer layer called the Rotation Invariance Transformer (RiT) is developed, which applies a rotation transformation to augment CNN features. The RiT layer can be used to augment output features from any convolution layer within a CNN. However, its maximum effectiveness is shown when placed at the output end of the final convolution layer. We test RiT in the application of scale invariance, where we attempt to classify scaled images from benchmark datasets. Our results show promising improvements in the network's ability to be scale invariant whilst keeping the model computation cost low.
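
A hedged sketch of the feature-map augmentation idea in this abstract: rather than rotating input images, rotated copies of a convolution layer's output are produced. The exact transformation and placement are the paper's design choices; 90-degree rotations are used here because they are lossless on a tensor grid.

```python
# Illustrative feature-map rotation augmentation (not the RiT implementation).
import torch
import torch.nn as nn

class FeatureRotationAugment(nn.Module):
    """Appends rotated copies of a feature map along the batch axis (training-time only)."""
    def forward(self, feat: torch.Tensor) -> torch.Tensor:   # feat: (B, C, H, W)
        if not self.training:
            return feat
        rotations = [torch.rot90(feat, k, dims=(2, 3)) for k in range(4)]  # 0/90/180/270 deg
        # note: training labels must be tiled 4x to match the enlarged batch
        return torch.cat(rotations, dim=0)

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
augment = FeatureRotationAugment()
features = augment(backbone(torch.randn(2, 3, 32, 32)))   # (8, 16, 32, 32) during training
print(features.shape)
```
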
25

Xu, Shizhuo, Vibekananda Dutta, Xin He, and Takafumi Matsumaru. "A Transformer-Based Model for Super-resolution of Anime Image." Sensors 22, no. 21 (October 24, 2022): 8126. http://dx.doi.org/10.3390/s22218126.

Abstract:
Image super-resolution (ISR) technology aims to enhance resolution and improve image quality. It is widely applied to various real-world applications related to image processing, especially medical images, while it has been applied relatively little to anime image production. Furthermore, contemporary ISR tools are often based on convolutional neural networks (CNNs), while few methods attempt to use transformers, which perform well in other advanced vision tasks. In this work we propose a so-called anime image super-resolution (AISR) method based on the Swin Transformer. The work was carried out in several stages. First, a shallow feature extraction approach was employed to obtain a feature map of the input image's low-frequency information, which mainly approximates the distribution of detailed information in a spatial structure (shallow features). Next, we applied deep feature extraction to extract the image's semantic information (deep features). Finally, the image reconstruction method combines shallow and deep features to upsample the feature size and performs sub-pixel convolution to obtain many feature map channels. The novelty of the proposal is the enhancement of the low-frequency information using a Gaussian filter and the introduction of different window sizes to replace the patch merging operations in the Swin Transformer. A high-quality anime dataset was constructed to curb the effects on model robustness in the online regime. We trained our model on this dataset and tested the model quality. We implement anime image super-resolution tasks at different magnifications (2×, 4×, 8×). The results were compared numerically and graphically with those delivered by conventional convolutional-neural-network-based and transformer-based methods, using the standard peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) metrics. The series of experiments and an ablation study show that our proposal outperforms the others.
26

Hussain, Altaf, Tanveer Hussain, Waseem Ullah, and Sung Wook Baik. "Vision Transformer and Deep Sequence Learning for Human Activity Recognition in Surveillance Videos." Computational Intelligence and Neuroscience 2022 (April 4, 2022): 1–10. http://dx.doi.org/10.1155/2022/3454167.

Abstract:
Human Activity Recognition (HAR) is an active research area with several Convolutional Neural Network (CNN) based feature extraction and classification methods employed for surveillance and other applications. However, accurate recognition of activities from a sequence of frames is a challenging task due to cluttered backgrounds, different viewpoints, low resolution, and partial occlusion. Current CNN-based techniques use large-scale computational classifiers along with convolutional operators having local receptive fields, limiting their ability to capture long-range temporal information. Therefore, in this work we introduce a convolution-free approach for accurate HAR, which overcomes the above-mentioned problems and accurately encodes relative spatial information. In the proposed framework, frame-level features are extracted via a pretrained Vision Transformer; next, these features are passed to a multilayer long short-term memory network to capture the long-range dependencies of the actions in the surveillance videos. To validate the performance of the proposed framework, we carried out extensive experiments on the UCF50 and HMDB51 benchmark HAR datasets and improved accuracy by 0.944% and 1.414%, respectively, compared to state-of-the-art deep models.
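
The two-stage pipeline in this abstract, per-frame features from a pretrained Vision Transformer followed by a multilayer LSTM, can be sketched as below. The timm model name and hidden sizes are illustrative choices, not necessarily the authors' configuration.

```python
# Pretrained ViT frame encoder + multilayer LSTM temporal model (illustrative sketch).
import torch
import torch.nn as nn
import timm

class ViTLSTMActionRecognizer(nn.Module):
    def __init__(self, num_classes=51, hidden=512, lstm_layers=2):
        super().__init__()
        # num_classes=0 makes timm return pooled 768-d features instead of logits
        self.frame_encoder = timm.create_model("vit_base_patch16_224",
                                               pretrained=True, num_classes=0)
        self.temporal = nn.LSTM(input_size=768, hidden_size=hidden,
                                num_layers=lstm_layers, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip):                                 # clip: (B, T, 3, 224, 224)
        b, t = clip.shape[:2]
        feats = self.frame_encoder(clip.flatten(0, 1))       # (B*T, 768), frames independently
        feats = feats.view(b, t, -1)
        _, (h, _) = self.temporal(feats)                     # last hidden state summarizes the clip
        return self.head(h[-1])                              # (B, num_classes)

# logits = ViTLSTMActionRecognizer()(torch.randn(2, 16, 3, 224, 224))
```
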
27

Yang, Zhiqiang, Chong Xu, and Lei Li. "Landslide Detection Based on ResU-Net with Transformer and CBAM Embedded: Two Examples with Geologically Different Environments." Remote Sensing 14, no. 12 (June 16, 2022): 2885. http://dx.doi.org/10.3390/rs14122885.

Abstract:
An efficient method of landslide detection can provide basic scientific data for emergency command and landslide susceptibility mapping. Compared to traditional landslide detection approaches, convolutional neural networks (CNNs) have been proven to have a powerful capability for reducing the time consumed in selecting appropriate features for landslides. Currently, the success of transformers in natural language processing (NLP) demonstrates the strength of self-attention in global semantic information acquisition. How to effectively integrate transformers into CNNs, alleviate the limitation of the receptive field, and improve model generalization are hot topics in remote sensing image processing based on deep learning (DL). Inspired by the vision transformer (ViT), this paper first attempts to integrate a transformer into ResU-Net for landslide detection tasks with small datasets, aiming to enhance the network's ability to model the global context of feature maps and drive the model to recognize landslides with a small dataset. Besides, a spatial and channel attention module was introduced into the decoder to effectively suppress the noise in the feature maps from the convolution and transformer. By selecting two landslide datasets with different geological characteristics, the feasibility of the proposed model was validated. Finally, the standard ResU-Net was chosen as the benchmark to evaluate the rationality of the proposed model. The results indicated that the proposed model obtained the highest mIoU and F1-score on both datasets, demonstrating that the ResU-Net with a transformer embedded can be used as a robust landslide detection method and thus realize the generation of accurate regional landslide inventories and emergency rescue.
28

Shen, Li, and Yangzhu Wang. "TCCT: Tightly-coupled convolutional transformer on time series forecasting." Neurocomputing 480 (April 2022): 131–45. http://dx.doi.org/10.1016/j.neucom.2022.01.039.

29

Dai, Yin, Yifan Gao, and Fayu Liu. "TransMed: Transformers Advance Multi-Modal Medical Image Classification." Diagnostics 11, no. 8 (July 31, 2021): 1384. http://dx.doi.org/10.3390/diagnostics11081384.

Abstract:
Over the past decade, convolutional neural networks (CNN) have shown very competitive performance in medical image analysis tasks, such as disease classification, tumor segmentation, and lesion detection. CNN has great advantages in extracting local features of images. However, due to the locality of convolution operation, it cannot deal with long-range relationships well. Recently, transformers have been applied to computer vision and achieved remarkable success in large-scale datasets. Compared with natural images, multi-modal medical images have explicit and important long-range dependencies, and effective multi-modal fusion strategies can greatly improve the performance of deep models. This prompts us to study transformer-based structures and apply them to multi-modal medical images. Existing transformer-based network architectures require large-scale datasets to achieve better performance. However, medical imaging datasets are relatively small, which makes it difficult to apply pure transformers to medical image analysis. Therefore, we propose TransMed for multi-modal medical image classification. TransMed combines the advantages of CNN and transformer to efficiently extract low-level features of images and establish long-range dependencies between modalities. We evaluated our model on two datasets, parotid gland tumors classification and knee injury classification. Combining our contributions, we achieve an improvement of 10.1% and 1.9% in average accuracy, respectively, outperforming other state-of-the-art CNN-based models. The results of the proposed method are promising and have tremendous potential to be applied to a large number of medical image analysis tasks. To our best knowledge, this is the first work to apply transformers to multi-modal medical image classification.
30

Jiang, Shen, Jinjiang Li, and Zhen Hua. "Transformer with progressive sampling for medical cellular image segmentation." Mathematical Biosciences and Engineering 19, no. 12 (2022): 12104–26. http://dx.doi.org/10.3934/mbe.2022563.

Abstract:
The convolutional neural network, as the backbone network for medical image segmentation, has shown good performance in the past years. However, its drawbacks cannot be ignored: convolutional neural networks focus on local regions and have difficulty modeling global contextual information. For this reason, the transformer, originally used for text processing, was introduced into the field of medical segmentation, and thanks to its expertise in modelling global relationships, the accuracy of medical segmentation was further improved. However, the transformer-based network structure requires a certain training set size to achieve satisfactory segmentation results, and most medical segmentation datasets are small in size. Therefore, in this paper we introduce a gated position-sensitive axial attention mechanism in the self-attention module, so that the transformer-based network structure can also be adapted to the case of small datasets. The common practice when the visual transformer is applied to segmentation tasks is to divide the input image into equal patches of the same size and then perform visual processing on each patch, but this simple division may destroy the structure of the original image, and there may be large unimportant regions in the divided grid, causing attention to stay on uninteresting regions and affecting segmentation performance. Therefore, in this paper we add iterative sampling to update the sampling positions, so that the attention stays on the region to be segmented, reducing the interference of irrelevant regions and further improving segmentation performance. In addition, we introduce the strip convolution module (SCM) and the pyramid pooling module (PPM) to capture global contextual information. The proposed network is evaluated on several datasets and shows some improvement in segmentation accuracy compared to networks from recent years.
31

Xie, Fei, Dalong Zhang, and Chengming Liu. "Global–Local Self-Attention Based Transformer for Speaker Verification." Applied Sciences 12, no. 19 (October 10, 2022): 10154. http://dx.doi.org/10.3390/app121910154.

Abstract:
Transformer models are now widely used for speech processing tasks due to their powerful sequence modeling capabilities. Previous work determined an efficient way to model speaker embeddings using the Transformer model by combining transformers with convolutional networks. However, traditional global self-attention mechanisms lack the ability to capture local information. To alleviate these problems, we proposed a novel global–local self-attention mechanism. Instead of using local or global multi-head attention alone, this method performs local and global attention in two parallel groups to enhance local modeling and reduce computational cost. To better handle local location information, we introduced locally enhanced location encoding in the speaker verification task. The experimental results on the VoxCeleb1 test set and the VoxCeleb2 dev set demonstrated the improved effect of our proposed global–local self-attention mechanism. Compared with the Transformer-based Robust Embedding Extractor Baseline System, the proposed speaker Transformer network exhibited better performance in the speaker verification task.
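
One straightforward way to realize a global–local self-attention of the kind described above is to let some attention heads see the full sequence while the rest are masked to a local window. The module below sketches that idea; head counts and window size are invented, and it is not the paper's implementation.

```python
# Self-attention whose first `local_heads` heads are band-masked to a local window
# while the remaining heads attend globally (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalSelfAttention(nn.Module):
    def __init__(self, dim=256, heads=8, local_heads=4, window=15):
        super().__init__()
        assert dim % heads == 0
        self.h, self.local_heads, self.window = heads, local_heads, window
        self.d = dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                               # x: (B, T, dim)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape into heads: (B, heads, T, d)
        q, k, v = (z.view(b, t, self.h, self.d).transpose(1, 2) for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d ** 0.5          # (B, heads, T, T)
        # band mask keeps only a +/- window//2 neighbourhood for the "local" heads
        idx = torch.arange(t, device=x.device)
        band = (idx[None, :] - idx[:, None]).abs() <= self.window // 2
        mask = torch.zeros(self.h, t, t, dtype=torch.bool, device=x.device)
        mask[:self.local_heads] = ~band                 # local heads may not look outside the band
        scores = scores.masked_fill(mask, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v                        # (B, heads, T, d)
        return self.proj(out.transpose(1, 2).reshape(b, t, -1))

print(GlobalLocalSelfAttention()(torch.randn(2, 100, 256)).shape)  # torch.Size([2, 100, 256])
```
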
32

Fan, Yiming, Hewei Wang, Xiaoyu Zhu, Xiangming Cao, Chuanjian Yi, Yao Chen, Jie Jia, and Xiaofeng Lu. "FER-PCVT: Facial Expression Recognition with Patch-Convolutional Vision Transformer for Stroke Patients." Brain Sciences 12, no. 12 (November 28, 2022): 1626. http://dx.doi.org/10.3390/brainsci12121626.

Abstract:
Early rehabilitation with the right intensity contributes to the physical recovery of stroke survivors. In clinical practice, physicians determine whether the training intensity is suitable for rehabilitation based on patients’ narratives, training scores, and evaluation scales, which puts tremendous pressure on medical resources. In this study, a lightweight facial expression recognition algorithm is proposed to diagnose stroke patients’ training motivation automatically. First, the properties of convolution are introduced into the Vision Transformer’s structure, allowing the model to extract both local and global features of facial expressions. Second, the pyramid-shaped feature output mode of Convolutional Neural Networks is also introduced to reduce the model’s parameters and calculation costs significantly. Moreover, a classifier that can better classify facial expressions of stroke patients is designed to improve performance further. We verified the proposed algorithm on the Real-world Affective Faces Database (RAF-DB), the Face Expression Recognition Plus Dataset (FER+), and a private dataset for stroke patients. Experiments show that the backbone network of the proposed algorithm achieves better performance than the Pyramid Vision Transformer (PvT) and the Convolutional Vision Transformer (CvT) with fewer parameters and floating-point operations (FLOPs). In addition, the algorithm reaches an 89.44% accuracy on the RAF-DB dataset, which is higher than other recent studies. In particular, it obtains an accuracy of 99.81% on the private dataset, with only 4.10M parameters.
33

Moutik, Oumaima, Hiba Sekkat, Smail Tigani, Abdellah Chehri, Rachid Saadane, Taha Ait Tchakoucht, and Anand Paul. "Convolutional Neural Networks or Vision Transformers: Who Will Win the Race for Action Recognitions in Visual Data?" Sensors 23, no. 2 (January 9, 2023): 734. http://dx.doi.org/10.3390/s23020734.

Abstract:
Understanding actions in videos remains a significant challenge in computer vision, which has been the subject of extensive research over the last decades. Convolutional neural networks (CNN) are a significant component of this topic and play a crucial role in the renown of Deep Learning. Inspired by the human vision system, CNN has been applied to visual data exploitation and has solved various challenges in computer vision tasks and video/image analysis, including action recognition (AR). However, not long ago, with the achievements of the transformer in natural language processing (NLP), it began to set new trends in vision tasks, which has created a discussion around whether Vision Transformer models (ViT) will replace CNN in action recognition in video clips. This paper treats this trending topic in detail: it studies CNN and the Transformer for Action Recognition separately and provides a comparative study of the accuracy–complexity trade-off. Finally, based on the outcome of the performance analysis, the question of whether CNN or Vision Transformers will win the race is discussed.
34

Xu, Xiangkai, Zhejun Feng, Changqing Cao, Mengyuan Li, Jin Wu, Zengyan Wu, Yajie Shang, and Shubing Ye. "An Improved Swin Transformer-Based Model for Remote Sensing Object Detection and Instance Segmentation." Remote Sensing 13, no. 23 (November 25, 2021): 4779. http://dx.doi.org/10.3390/rs13234779.

Full text
Abstract:
Remote sensing image object detection and instance segmentation are widely valued research fields. Convolutional neural networks (CNNs) have shown shortcomings in object detection for remote sensing images. In recent years, the number of studies on transformer-based models has increased, and these studies have achieved good results. However, transformers still suffer from poor small-object detection and unsatisfactory edge-detail segmentation. In order to solve these problems, we improved the Swin transformer based on the advantages of transformers and CNNs, and designed a local perception Swin transformer (LPSW) backbone to enhance the local perception of the network and to improve the detection accuracy of small-scale objects. We also designed a spatial attention interleaved execution cascade (SAIEC) network framework, which helps to strengthen the segmentation accuracy of the network. Due to the lack of remote sensing mask datasets, the MRS-1800 remote sensing mask dataset was created. Finally, we combined the proposed backbone with the new network framework and conducted experiments on this MRS-1800 dataset. Compared with the Swin transformer, the proposed model improved the mask AP by 1.7%, mask APS by 3.6%, AP by 1.1% and APS by 4.6%, demonstrating its effectiveness and feasibility.
APA, Harvard, Vancouver, ISO, and other styles
35

Lang, Bin, and Jian Gong. "JR-TFViT: A Lightweight Efficient Radar Jamming Recognition Network Based on Global Representation of the Time–Frequency Domain." Electronics 11, no. 17 (September 5, 2022): 2794. http://dx.doi.org/10.3390/electronics11172794.

Full text
Abstract:
Efficient jamming recognition capability is a prerequisite for radar anti-jamming and can enhance the survivability of radar in electronic warfare. Traditional recognition methods based on manually designed feature parameters have found it difficult to cope with the increasingly complex electromagnetic environment, and research combining deep learning to achieve jamming recognition is gradually increasing. However, existing deep learning-based research on radar jamming recognition tends to ignore the global representation in the jamming time–frequency domain data, while not paying enough attention to making the recognition network itself lightweight. Therefore, this paper proposes a lightweight jamming recognition network (JR-TFViT) that can fuse the global representation of jamming time–frequency domain data while combining the advantages of the Vision Transformer and a convolutional neural network (CNN). The global representation and local information in the jamming time–frequency data are fused with the assistance of the multi-head self-attention (MSA) mechanism in the transformer to improve the feature extraction capability of the recognition network. The model’s parameters are further decreased by modifying the standard convolution operation and substituting the convolutions needed by the network with Ghost convolution, which has fewer parameters. The experimental results show that the JR-TFViT requires fewer model parameters while maintaining higher recognition performance than mainstream convolutional neural networks and lightweight CNNs. For 12 types of radar jamming, the JR-TFViT achieves 99.5% recognition accuracy at JNR = −6 dB with only 3.66 M model parameters. In addition, 98.9% recognition accuracy is maintained when the JR-TFViT parameter count is further compressed to 0.67 M.
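
For readers unfamiliar with the Ghost convolution mentioned here, the following is a rough PyTorch sketch of the general Ghost-module idea (a small primary convolution plus cheap depthwise operations that generate the remaining feature maps). It is not the JR-TFViT implementation; the channel ratio and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Ghost convolution sketch: a small 'primary' convolution produces a
    fraction of the output channels, and cheap depthwise operations generate
    the remaining 'ghost' feature maps, cutting parameters versus a full conv."""
    def __init__(self, in_ch, out_ch, ratio=2, kernel=1, cheap_kernel=3):
        super().__init__()
        primary_ch = out_ch // ratio                 # channels from the costly conv
        ghost_ch = out_ch - primary_ch               # channels from cheap depthwise ops
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, kernel, padding=kernel // 2, bias=False),
            nn.BatchNorm2d(primary_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, ghost_ch, cheap_kernel, padding=cheap_kernel // 2,
                      groups=primary_ch, bias=False),   # depthwise, very few parameters
            nn.BatchNorm2d(ghost_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)  # (B, out_ch, H, W)

y = GhostConv(16, 32)(torch.randn(1, 16, 64, 64))    # -> (1, 32, 64, 64)
```
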
APA, Harvard, Vancouver, ISO, and other styles
36

Huang, Lei, Feng Mao, Kai Zhang, and Zhiheng Li. "Spatial-Temporal Convolutional Transformer Network for Multivariate Time Series Forecasting." Sensors 22, no. 3 (January 22, 2022): 841. http://dx.doi.org/10.3390/s22030841.

Full text
Abstract:
Multivariate time series forecasting has long been a research hotspot because of its wide range of application scenarios. However, the dynamics and multiple patterns of spatiotemporal dependencies make this problem challenging. Most existing methods suffer from two major shortcomings: (1) They ignore the local context semantics when modeling temporal dependencies. (2) They lack the ability to capture the spatial dependencies of multiple patterns. To tackle such issues, we propose a novel Transformer-based model for multivariate time series forecasting, called the spatial–temporal convolutional Transformer network (STCTN). STCTN mainly consists of two novel attention mechanisms to respectively model temporal and spatial dependencies. A local-range convolutional attention mechanism is proposed in STCTN to simultaneously focus on both global and local context temporal dependencies at the sequence level, which addresses the first shortcoming. A group-range convolutional attention mechanism is designed to model multiple spatial dependency patterns at the graph level, as well as to reduce the computation and memory complexity, which addresses the second shortcoming. Continuous positional encoding is proposed to link the historical observations and predicted future values in positional encoding, which also improves the forecasting performance. Extensive experiments on six real-world datasets show that the proposed STCTN outperforms the state-of-the-art methods and is more robust to nonsmooth time series data.
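
To make the notion of convolutional attention over a time series concrete, here is a hedged PyTorch sketch in which queries and keys are produced by 1-D convolutions, so each position attends with awareness of its local temporal context. It illustrates only the general idea, not the STCTN local-range or group-range mechanisms; the dimensions and names are assumptions.

```python
import math
import torch
import torch.nn as nn

class LocalRangeConvAttention(nn.Module):
    """Sketch of convolutional self-attention over a sequence: queries and
    keys come from 1-D convolutions, so each time step attends using a
    summary of its local context rather than a single observation."""
    def __init__(self, d_model=64, kernel=3):
        super().__init__()
        pad = kernel // 2
        self.q_conv = nn.Conv1d(d_model, d_model, kernel, padding=pad)
        self.k_conv = nn.Conv1d(d_model, d_model, kernel, padding=pad)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = math.sqrt(d_model)

    def forward(self, x):                                 # x: (B, T, D)
        q = self.q_conv(x.transpose(1, 2)).transpose(1, 2)  # local-context queries
        k = self.k_conv(x.transpose(1, 2)).transpose(1, 2)  # local-context keys
        v = self.v_proj(x)
        attn = torch.softmax(q @ k.transpose(1, 2) / self.scale, dim=-1)
        return attn @ v                                    # (B, T, D)

out = LocalRangeConvAttention()(torch.randn(2, 24, 64))    # -> (2, 24, 64)
```
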
APA, Harvard, Vancouver, ISO, and other styles
37

Chen, Guang, and Yi Shang. "Transformer for Tree Counting in Aerial Images." Remote Sensing 14, no. 3 (January 20, 2022): 476. http://dx.doi.org/10.3390/rs14030476.

Full text
Abstract:
The number of trees and their spatial distribution are key information for forest management. In recent years, deep learning-based approaches have been proposed and have shown promising results in lowering the expensive labor cost of a forest inventory. In this paper, we propose a new efficient deep learning model called the density transformer, or DENT, for automatic tree counting from aerial images. The architecture of DENT contains a multi-receptive field convolutional neural network to extract visual feature representations from local patches and their wide context, a transformer encoder to transfer contextual information across correlated positions, a density map generator to generate a spatial distribution map of trees, and a fast tree counter to estimate the number of trees in each input image. We compare DENT with a variety of state-of-the-art methods, including one-stage and two-stage, anchor-based and anchor-free deep neural detectors, and different types of fully convolutional regressors for density estimation. The methods are evaluated on a new large dataset we built and an existing cross-site dataset. DENT achieves top accuracy on both datasets, significantly outperforming most of the other methods. We have released our new dataset, called the Yosemite Tree Dataset, containing a 10 km² rectangular study area with around 100k trees annotated, as a benchmark for public access.
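
The density-map formulation that DENT builds on can be illustrated with a generic counting-by-density sketch: a fully convolutional regressor predicts a non-negative density map whose sum approximates the object count. The PyTorch sketch below shows only this standard formulation, not the DENT architecture itself; the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class TinyDensityCounter(nn.Module):
    """Density-map counting sketch: a small fully convolutional regressor
    predicts a non-negative density map; summing the map gives the count."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1), nn.ReLU())       # final ReLU keeps density >= 0

    def forward(self, img):                        # img: (B, 3, H, W)
        density = self.net(img)                    # (B, 1, H, W)
        count = density.sum(dim=(1, 2, 3))         # estimated objects per image
        return density, count

density, count = TinyDensityCounter()(torch.randn(2, 3, 128, 128))
```
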
APA, Harvard, Vancouver, ISO, and other styles
38

Li, Siyuan, Zongxu Pan, and Yuxin Hu. "Multi-Aspect Convolutional-Transformer Network for SAR Automatic Target Recognition." Remote Sensing 14, no. 16 (August 12, 2022): 3924. http://dx.doi.org/10.3390/rs14163924.

Full text
Abstract:
In recent years, synthetic aperture radar (SAR) automatic target recognition (ATR) has been widely used in both military and civilian fields. Due to the sensitivity of SAR images to the observation azimuth, a multi-aspect SAR image sequence contains more information for recognition than a single-aspect one. Current multi-aspect SAR target recognition methods mainly use recurrent neural networks (RNNs), which rely on the order between images and thus suffer from information loss. At the same time, training a deep learning model requires a lot of training data, but multi-aspect SAR images are expensive to obtain. Therefore, this paper proposes a multi-aspect SAR recognition method based on self-attention, which is used to find the correlations between the semantic information of the images. In addition, in order to improve the noise robustness of the proposed method and reduce its dependence on large amounts of data, a convolutional autoencoder (CAE) is designed to pretrain the feature extraction part of the method. The experimental results on the MSTAR dataset show that the proposed multi-aspect SAR target recognition method is superior under various working conditions, performs well with few samples and is also strongly robust to noise.
APA, Harvard, Vancouver, ISO, and other styles
39

Zhu, Weidong, Jun Sun, Simin Wang, Jifeng Shen, Kaifeng Yang, and Xin Zhou. "Identifying Field Crop Diseases Using Transformer-Embedded Convolutional Neural Network." Agriculture 12, no. 8 (July 22, 2022): 1083. http://dx.doi.org/10.3390/agriculture12081083.

Full text
Abstract:
Grain yield and security are seriously threatened by crop diseases, which are a critical factor hindering the green and high-quality development of agriculture. Existing crop disease identification models find it difficult to focus on the diseased spot area, and crops with similar disease characteristics are easily misidentified. To address the above problems, this paper proposed an accurate and efficient disease identification model, which not only incorporates the local and global features of images for feature analysis, but also improves the separability between similar diseases. First, a Transformer Encoder was embedded into the convolutional model, so as to establish dependencies between long-distance features and extract the global features of the disease images. Then, Center loss was introduced as a penalty term to optimize the common cross-entropy loss, so as to expand the inter-class difference of crop disease characteristics and narrow their intra-class gap. Finally, according to the characteristics of the datasets, a more appropriate evaluation index was used to carry out experiments on different datasets. An identification accuracy of 99.62% was obtained on Plant Village, and a balanced accuracy of 96.58% was obtained on Dataset1 with a complex background. The model showed good generalization ability when facing disease images from different sources and also balanced the trade-off between identification accuracy and parameter quantity. Compared with pure CNN and Transformer models, the leaf disease identification model proposed in this paper not only focuses more on the disease regions of leaves, but also better distinguishes different diseases with similar characteristics.
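
The Center loss penalty mentioned here follows a standard formulation: each class keeps a learnable center, and features are pulled toward their class center while cross-entropy keeps classes apart. Below is a minimal PyTorch sketch of that standard formulation, not the paper's exact training code; the weighting factor and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center-loss sketch: one learnable center per class; penalizes the
    squared distance between each feature and its class center, shrinking
    intra-class spread while cross-entropy widens inter-class gaps."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):            # features: (B, D), labels: (B,)
        return ((features - self.centers[labels]) ** 2).sum(dim=1).mean() / 2

# Hypothetical usage: total objective = cross-entropy + lambda * center loss
ce, center = nn.CrossEntropyLoss(), CenterLoss(num_classes=10, feat_dim=128)
feats, logits, labels = torch.randn(4, 128), torch.randn(4, 10), torch.randint(0, 10, (4,))
loss = ce(logits, labels) + 0.01 * center(feats, labels)
```
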
APA, Harvard, Vancouver, ISO, and other styles
40

Li, Xiaopeng, Yuyun Xiang, and Shuqin Li. "Combining convolutional and vision transformer structures for sheep face recognition." Computers and Electronics in Agriculture 205 (February 2023): 107651. http://dx.doi.org/10.1016/j.compag.2023.107651.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Mogan, Jashila Nair, Chin Poo Lee, Kian Ming Lim, and Kalaiarasi Sonai Muthu. "Gait-ViT: Gait Recognition with Vision Transformer." Sensors 22, no. 19 (September 28, 2022): 7362. http://dx.doi.org/10.3390/s22197362.

Full text
Abstract:
Identifying an individual based on their physical/behavioral characteristics is known as biometric recognition. Gait is one of the most reliable biometrics due to its advantages, such as being perceivable at a long distance and difficult to replicate. Existing works mostly leverage Convolutional Neural Networks for gait recognition. Convolutional Neural Networks perform well in image recognition tasks; however, they lack an attention mechanism to emphasize the significant regions of the image. The attention mechanism encodes information in the image patches, which facilitates the model to learn the substantial features in the specific regions. In light of this, this work employs the Vision Transformer (ViT) with an attention mechanism for gait recognition, referred to as Gait-ViT. In the proposed Gait-ViT, the gait energy image is first obtained by averaging the series of images over the gait cycle. The image is then split into patches and transformed into a sequence by flattening and patch embedding. Position embedding, along with patch embedding, is applied to the sequence of patches to restore the positional information of the patches. Subsequently, the sequence of vectors is fed to the Transformer encoder to produce the final gait representation. As for classification, the first element of the sequence is sent to the multi-layer perceptron to predict the class label. The proposed method obtained 99.93% on CASIA-B, 100% on OU-ISIR D and 99.51% on OU-LP, demonstrating the ability of the Vision Transformer model to outperform the state-of-the-art methods.
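
The pipeline described in this abstract (gait energy image, patch and position embeddings, Transformer encoder, class-token classification) maps closely onto a vanilla ViT. The following is a compact PyTorch sketch of that generic pipeline, not the authors' Gait-ViT code; the image size, depth, and the 124-class head (CASIA-B, for instance, has 124 subjects) are illustrative choices.

```python
import torch
import torch.nn as nn

def gait_energy_image(frames):          # frames: (T, 1, H, W) silhouettes of one cycle
    """Average the silhouette frames over the gait cycle to obtain one GEI."""
    return frames.float().mean(dim=0)   # (1, H, W)

class TinyGaitViT(nn.Module):
    """ViT-style classifier sketch: patchify the GEI, prepend a class token,
    add position embeddings, run a Transformer encoder, classify from token 0."""
    def __init__(self, img=64, patch=8, dim=64, classes=124):
        super().__init__()
        n = (img // patch) ** 2
        self.embed = nn.Conv2d(1, dim, patch, stride=patch)            # patch embedding
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, classes)

    def forward(self, gei):                                            # (B, 1, H, W)
        x = self.embed(gei).flatten(2).transpose(1, 2)                 # (B, N, D)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], 1) + self.pos
        x = self.encoder(x)
        return self.head(x[:, 0])                                      # class-token logits

gei = gait_energy_image(torch.rand(30, 1, 64, 64)).unsqueeze(0)        # (1, 1, 64, 64)
logits = TinyGaitViT()(gei)                                            # (1, 124)
```
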
APA, Harvard, Vancouver, ISO, and other styles
42

Pan, Zizheng, Bohan Zhuang, Haoyu He, Jing Liu, and Jianfei Cai. "Less Is More: Pay Less Attention in Vision Transformers." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 2 (June 28, 2022): 2035–43. http://dx.doi.org/10.1609/aaai.v36i2.20099.

Full text
Abstract:
Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works can be prohibitively expensive due to the quadratic complexity of self-attention over a long sequence of representations, especially for high-resolution dense prediction tasks. To this end, we present a novel Less attention vIsion Transformer (LIT), building upon the fact that the early self-attention layers in Transformers still focus on local patterns and bring minor benefits in recent hierarchical vision Transformers. Specifically, we propose a hierarchical Transformer where we use pure multi-layer perceptrons (MLPs) to encode rich local patterns in the early stages while applying self-attention modules to capture longer dependencies in deeper layers. Moreover, we further propose a learned deformable token merging module to adaptively fuse informative patches in a non-uniform manner. The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation, serving as a strong backbone for many vision tasks. Code is available at https://github.com/zip-group/LIT.
APA, Harvard, Vancouver, ISO, and other styles
43

Yao, Chao, Shuo Jin, Meiqin Liu, and Xiaojuan Ban. "Dense Residual Transformer for Image Denoising." Electronics 11, no. 3 (January 29, 2022): 418. http://dx.doi.org/10.3390/electronics11030418.

Full text
Abstract:
Image denoising is an important low-level computer vision task, which aims to reconstruct a noise-free and high-quality image from a noisy image. With the development of deep learning, convolutional neural networks (CNNs) have gradually been applied and achieved great success in image denoising, image compression, image enhancement, etc. Recently, the Transformer has become a popular technique that is widely used to tackle computer vision tasks. However, few Transformer-based methods have been proposed for low-level vision tasks. In this paper, we propose an image denoising network structure based on the Transformer, named DenSformer. DenSformer consists of three modules: a preprocessing module, a local-global feature extraction module, and a reconstruction module. Specifically, the local-global feature extraction module consists of several Sformer groups, each of which has several ETransformer layers and a convolution layer, together with a residual connection. These Sformer groups are densely skip-connected to fuse the features of different layers, and they jointly capture the local and global information from the given noisy images. We evaluate our model in comprehensive experiments. In synthetic noise removal, DenSformer outperforms other state-of-the-art methods by up to 0.06–0.28 dB on gray-scale images and 0.57–1.19 dB on color images. In real noise removal, DenSformer achieves comparable performance while reducing the number of parameters by up to 40%. The experimental results show that our DenSformer improves upon several state-of-the-art methods, for both synthetic and real noise data, in objective and subjective evaluations.
APA, Harvard, Vancouver, ISO, and other styles
44

Li, Shuping, Qianhao Yuan, Yeming Zhang, Baozhan Lv, and Feng Wei. "Image Dehazing Algorithm Based on Deep Learning Coupled Local and Global Features." Applied Sciences 12, no. 17 (August 26, 2022): 8552. http://dx.doi.org/10.3390/app12178552.

Full text
Abstract:
To address the problems that most convolutional neural network-based image defogging models capture incomplete global feature information and produce incomplete defogging, this paper proposes an end-to-end hybrid image defogging algorithm combining a convolutional neural network and a vision transformer. First, the shallow features of the haze image are extracted by a preprocessing module. Then, a symmetric network structure including a convolutional neural network (CNN) branch and a vision transformer branch is used to capture the local features and global features of the haze image, respectively. The mixed features are fused using convolutional layers to cover the global representation while retaining the local features. Finally, the features obtained by the encoder and decoder are fused to obtain richer feature information. The experimental results show that the proposed defogging algorithm achieves better defogging results on both uniform and non-uniform haze datasets, solves the problems of dark and distorted colors after defogging, and recovers images with more natural detail.
APA, Harvard, Vancouver, ISO, and other styles
45

Jamali, Ali, and Masoud Mahdianpari. "Swin Transformer for Complex Coastal Wetland Classification Using the Integration of Sentinel-1 and Sentinel-2 Imagery." Water 14, no. 2 (January 10, 2022): 178. http://dx.doi.org/10.3390/w14020178.

Full text
Abstract:
The emergence of deep learning techniques has revolutionized the use of machine learning algorithms to classify complicated environments, notably in remote sensing. Convolutional Neural Networks (CNNs) have shown considerable promise in classifying challenging high-dimensional remote sensing data, particularly in the classification of wetlands. State-of-the-art Natural Language Processing (NLP) algorithms, on the other hand, are transformers. Despite the fact that transformers have been utilized for a few remote sensing applications, they have not been compared to other well-known CNN networks in complex wetland classification. As such, for the classification of complex coastal wetlands in the study area of Saint John city, located in New Brunswick, Canada, we modified and employed the Swin Transformer algorithm. Moreover, the developed transformer classifier's results were compared with two well-known deep CNNs, AlexNet and VGG-16. In terms of average accuracy, the proposed Swin Transformer algorithm outperformed the AlexNet and VGG-16 techniques by 14.3% and 44.28%, respectively. The proposed Swin Transformer classifier obtained F-1 scores of 0.65, 0.71, 0.73, 0.78, 0.82, 0.84, and 0.84 for the recognition of coastal marsh, shrub, bog, fen, aquatic bed, forested wetland, and freshwater marsh, respectively. The results achieved in this study suggest the high capability of transformers over very deep CNN networks for the classification of complex landscapes in remote sensing.
APA, Harvard, Vancouver, ISO, and other styles
46

Xing, Mengda, Weilong Ding, Han Li, and Tianpu Zhang. "A Power Transformer Fault Prediction Method through Temporal Convolutional Network on Dissolved Gas Chromatography Data." Security and Communication Networks 2022 (April 11, 2022): 1–11. http://dx.doi.org/10.1155/2022/5357412.

Full text
Abstract:
The power transformer is a key piece of power grid equipment, and its potential faults limit system availability and enterprise security. However, fault prediction for power transformers is limited by low data quality, weak binary classification performance, and small-sample learning. We propose a fault prediction method for power transformers based on dissolved gas chromatography data: after preprocessing of the defective raw data, fault classification is performed based on the predictive regression results. Here, a Mish-SN Temporal Convolutional Network (MSTCN) is introduced to improve the accuracy during the regression step. Several experiments are conducted using a data set from China State Grid, and a discussion of the experimental results is provided.
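
The temporal convolutional network underlying MSTCN builds on dilated causal 1-D convolutions with residual connections. A generic PyTorch sketch of one such block is shown below; it illustrates only the standard TCN building block, not the Mish-SN modifications described in the paper, and the channel and sequence sizes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalTCNBlock(nn.Module):
    """TCN building-block sketch: a dilated, causal 1-D convolution with a
    residual connection; stacking blocks with growing dilation widens the
    receptive field over the gas-concentration history without recurrence."""
    def __init__(self, channels, kernel=3, dilation=1):
        super().__init__()
        self.pad = (kernel - 1) * dilation                  # left-pad so no future leaks in
        self.conv = nn.Conv1d(channels, channels, kernel, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x):                                   # x: (B, C, T)
        y = self.conv(F.pad(x, (self.pad, 0)))              # causal padding on the left
        return self.act(y) + x                              # residual connection

seq = torch.randn(2, 8, 128)                                # 8 gas features, 128 time steps
out = CausalTCNBlock(8, dilation=2)(seq)                    # same length: (2, 8, 128)
```
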
APA, Harvard, Vancouver, ISO, and other styles
47

Hu, Xiang, Wenjing Yang, Hao Wen, Yu Liu, and Yuanxi Peng. "A Lightweight 1-D Convolution Augmented Transformer with Metric Learning for Hyperspectral Image Classification." Sensors 21, no. 5 (March 3, 2021): 1751. http://dx.doi.org/10.3390/s21051751.

Full text
Abstract:
Hyperspectral image (HSI) classification is the subject of intense research in remote sensing. The tremendous success of deep learning in computer vision has recently sparked interest in applying deep learning to hyperspectral image classification. However, most deep learning methods for hyperspectral image classification are based on convolutional neural networks (CNNs), which require substantial GPU memory and run time. Recently, another deep learning model, the transformer, has been applied to image recognition, and the results demonstrate the great potential of the transformer network for computer vision tasks. In this paper, we propose a model for hyperspectral image classification based on the transformer, which is widely used in natural language processing. In addition, we believe we are the first to combine metric learning and the transformer model in hyperspectral image classification. Moreover, to improve the model’s classification performance when the available training samples are limited, we use a 1-D convolution and the Mish activation function. The experimental results on three widely used hyperspectral image data sets demonstrate the proposed model’s advantages in accuracy, GPU memory cost, and running time.
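
The two ingredients highlighted for the small-sample regime, the Mish activation and a 1-D convolution over the spectral axis, can be sketched generically as follows (PyTorch). The 200-band spectral length and embedding sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mish(x):
    """Mish activation: x * tanh(softplus(x)) -- smooth and non-monotonic."""
    return x * torch.tanh(F.softplus(x))

class SpectralConvEmbed(nn.Module):
    """Sketch of a 1-D convolution over the spectral axis of a hyperspectral
    pixel, producing a short token sequence for a downstream Transformer."""
    def __init__(self, out_dim=64, kernel=7, stride=3):
        super().__init__()
        self.conv = nn.Conv1d(1, out_dim, kernel, stride=stride)

    def forward(self, spectrum):                  # spectrum: (B, bands), e.g. 200 bands
        x = self.conv(spectrum.unsqueeze(1))      # (B, out_dim, tokens)
        return mish(x).transpose(1, 2)            # (B, tokens, out_dim)

tokens = SpectralConvEmbed()(torch.randn(4, 200))
print(tokens.shape)                               # torch.Size([4, 65, 64])
```
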
APA, Harvard, Vancouver, ISO, and other styles
48

Bao, Shuai, Jiping Liu, Liang Wang, and Xizhi Zhao. "Application of Transformer Models to Landslide Susceptibility Mapping." Sensors 22, no. 23 (November 23, 2022): 9104. http://dx.doi.org/10.3390/s22239104.

Full text
Abstract:
Landslide susceptibility mapping (LSM) is of great significance for the identification and prevention of geological hazards. Most existing LSM approaches are based on convolutional neural networks (CNNs); CNNs use fixed convolutional kernels, focus more on local information, and do not retain spatial information, which is a property of the CNN itself and results in low LSM accuracy. Based on the above problems, we use the Vision Transformer (ViT) and its derivative model, the Swin Transformer (Swin), to conduct LSM for the selected study area. Machine learning and a CNN model are used for comparison. Fourier transform amplitude, feature similarity and other indicators were used to compare and analyze the differences in the results. The results show that the Swin model has the best accuracy, F1-score and AUC. When the LSM results are analyzed together with landslide points, faults and other data, the ViT model's results are the most consistent with the actual situation, showing the strongest generalization ability. In this paper, we argue that the advantages of ViT and its derived models in global feature extraction ensure that ViT is more accurate than CNNs and machine learning in predicting landslide probability in the study area.
APA, Harvard, Vancouver, ISO, and other styles
49

Liu, Xinyi, Baofeng Zhang, and Na Liu. "CAST-YOLO: An Improved YOLO Based on a Cross-Attention Strategy Transformer for Foggy Weather Adaptive Detection." Applied Sciences 13, no. 2 (January 15, 2023): 1176. http://dx.doi.org/10.3390/app13021176.

Full text
Abstract:
Both transformer and one-stage detectors have shown promising object detection results and have attracted increasing attention. However, effective domain-adaptive techniques for transformer and one-stage detectors have still not been widely developed or used. In this paper, we investigate this issue and propose a novel improved You Only Look Once (YOLO) model based on a cross-attention strategy transformer, called CAST-YOLO. This detector is a Teacher–Student knowledge transfer-based detector. We design a transformer encoder layer (TE-Layer) and a convolutional block attention module (CBAM) to capture global and rich contextual information. Then, the detector implements cross-domain object detection through the knowledge distillation method. Specifically, we propose a cross-attention strategy transformer to align domain-invariant features between the source and target domains. This strategy consists of three transformers with shared weights, identified as the source branch, target branch, and cross branch. The feature alignment uses knowledge distillation to achieve better knowledge transfer from the source domain to the target domain. The above strategy provides better robustness for a model with noisy input. Extensive experiments show that our method outperforms existing methods in foggy weather adaptive detection, significantly improving the detection results.
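
The Teacher–Student knowledge distillation referred to here is commonly implemented as a temperature-softened KL term blended with the ordinary hard-label loss. The sketch below shows that generic formulation in PyTorch; it is not CAST-YOLO's cross-attention alignment scheme, and the temperature, weighting, and class count are placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Generic Teacher-Student distillation sketch: KL divergence between
    temperature-softened teacher and student outputs, blended with the usual
    hard-label cross-entropy on the student."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Hypothetical call with random logits for a 20-class classification head
s, t, y = torch.randn(8, 20), torch.randn(8, 20), torch.randint(0, 20, (8,))
print(distillation_loss(s, t, y))
```
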
APA, Harvard, Vancouver, ISO, and other styles
50

Huo, Junliang, and Jiankun Ling. "Study on a visual coder acceleration algorithm for image classification applying dynamic scaling training techniques." Journal of Physics: Conference Series 2083, no. 4 (November 1, 2021): 042027. http://dx.doi.org/10.1088/1742-6596/2083/4/042027.

Full text
Abstract:
Nowadays, image classification techniques are used in the field of autonomous vehicles, where Convolutional Neural Networks (CNNs) are used extensively and Vision Transformer (ViT) networks are increasingly used instead of deep convolutional networks in order to compress the network size and improve model accuracy. Since training a ViT requires a large dataset to reach sufficient accuracy, a variant of ViT, Data-Efficient Image Transformers (DEIT), is used in this paper. In addition, in order to greatly reduce the computing memory and shorten the computing time in practical use, the network is flexibly scaled in size and training speed through both adaptive width and adaptive depth. In this paper, we introduce DEIT, adaptive-width, and adaptive-depth techniques and combine them for image classification examples. Experiments are conducted on the Cifar100 dataset and demonstrate the superiority of the algorithm in image classification scenarios.
APA, Harvard, Vancouver, ISO, and other styles
