
Convolutional Neural Networks (CNNs) have dominated the field of computer vision for the past decade. From object detection to image classification, they have been the go-to method for virtually every computer vision task.
Things started to change with the introduction of the Vision Transformer (ViT). It was the first work to show how Transformers, the state-of-the-art framework in natural language processing, could be moved into the realm of computer vision, surpassing their CNN counterparts in image classification. A series of incremental improvements to vision transformers followed.
Despite its enormous success in image classification, ViT has drawbacks that make it impractical for other tasks. It produces only a single-scale, low-resolution feature map, so it cannot be used directly for dense prediction tasks such as object detection and segmentation. Moreover, although its efficiency at low input resolutions is impressive compared to CNNs, the cost of its global self-attention grows quadratically with the number of input tokens. This becomes a problem on common detection and segmentation benchmarks, where input resolutions are much higher than in image classification.
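To see how quickly this blows up, here is a back-of-the-envelope sketch (the resolutions and the 16×16 patch size are illustrative assumptions, not figures from the paper):

```python
# Back-of-the-envelope: how ViT-style self-attention cost grows with input size.
# Assumes 16x16 patches; the chosen resolutions are illustrative.

def attention_cost(height: int, width: int, patch: int = 16) -> tuple[int, int]:
    """Return (num_tokens, attention_matrix_entries) for a given input size."""
    tokens = (height // patch) * (width // patch)
    return tokens, tokens ** 2

for h, w in [(224, 224), (512, 512), (800, 800)]:
    tokens, cost = attention_cost(h, w)
    print(f"{h}x{w}: {tokens:5d} tokens -> {cost:,} attention entries")

# 224x224:   196 tokens -> 38,416 attention entries
# 512x512:  1024 tokens -> 1,048,576 attention entries
# 800x800:  2500 tokens -> 6,250,000 attention entries
```

Going from classification-sized inputs to detection-sized inputs multiplies the attention matrix by over two orders of magnitude, which is exactly the regime where plain ViT becomes impractical.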
One wonders what could be done to address these limitations. How can the problems of ViT be solved so that it can be used for tasks beyond image classification? One answer is the Pyramid Vision Transformer (PVT).


The authors of PVT explore a clean, convolution-free Transformer backbone to solve the aforementioned problems. PVT can serve as an alternative to CNN backbones in many computer vision tasks, covering both image-level and pixel-level prediction.
PVT has a few tricks up its sleeve to overcome the disadvantages of ViT. It takes smaller image patches (4×4 pixels per patch) as input to learn high-resolution representations. It then introduces a feature pyramid that progressively reduces the sequence length as the network deepens, which greatly reduces the computational cost. Finally, it uses a Spatial Reduction Attention (SRA) layer to cut the cost of the attention operation itself.
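Below is a minimal PyTorch sketch of how such an SRA layer can work: keys and values are spatially downsampled by a reduction ratio before attention, so the attention matrix shrinks by the square of that ratio. The strided-convolution reduction and the stage-1 hyperparameters (64 channels, 1 head, ratio 8) are assumptions based on the official PVT implementation, not details spelled out in this summary.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Sketch of PVT's SRA: keys/values are spatially downsampled by
    `sr_ratio` before multi-head attention, shrinking the K/V sequence
    (and hence the attention cost) by a factor of sr_ratio**2."""

    def __init__(self, dim: int, num_heads: int = 8, sr_ratio: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # Strided conv that reduces the spatial size of the K/V tokens.
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        B, N, C = x.shape  # N == H * W tokens
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        if self.sr_ratio > 1:
            # Fold tokens back into a feature map, downsample, flatten again.
            x_ = x.transpose(1, 2).reshape(B, C, H, W)
            x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)
            x_ = self.norm(x_)
        else:
            x_ = x

        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N', head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N')
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Example: stage-1 of a 224x224 input with 4x4 patches -> 56x56 = 3136 tokens.
x = torch.randn(2, 56 * 56, 64)
sra = SpatialReductionAttention(dim=64, num_heads=1, sr_ratio=8)
print(sra(x, H=56, W=56).shape)  # torch.Size([2, 3136, 64])
```

With a reduction ratio of 8, the 3,136 queries attend to only 49 key/value tokens instead of 3,136, which is what makes global attention affordable at high resolution.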

CNNs have local receptive fields and therefore must increase network depth to extract features at different levels. PVT, by contrast, provides a global receptive field at every stage, which is useful for detection and segmentation tasks. Compared with ViT, PVT's pyramid structure allows easy integration into dense prediction pipelines, and it can be combined with task-specific transformer decoders to build convolution-free pipelines for computer vision tasks.
Combined with such a decoder, PVT forms the first fully convolution-free object detection pipeline, and it produces state-of-the-art results in various tasks, from object detection to segmentation. This is still a preliminary study of transformers in computer vision, but it is a good step toward a future dominated by them.
This was a brief summary of PVT. You can find more information in the links below.
Check out the Paper and GitHub. All credit for this research goes to the researchers on this project. Also don't forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.