Detailed Information

Centralized Position Embeddings for Vision Transformers (open access)

Authors
Shin, Chanyong; Yun, Ilwi; Lee, Hyunku; Rhee, Chae Eun
Issue Date
Nov-2025
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Keywords
Transformers; Semantics; Computer vision; Feature extraction; Encoding; Convolution; Visualization; Data mining; Computer architecture; Attention mechanisms; position embedding; vision transformer
Citation
IEEE ACCESS, v.13, pp 190122 - 190135
Pages
14
Indexed
SCIE
SCOPUS
Journal Title
IEEE ACCESS
Volume
13
Start Page
190122
End Page
190135
URI
https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/212077
DOI
10.1109/ACCESS.2025.3629376
ISSN
2169-3536
Abstract
Vision Transformers (ViTs) have achieved remarkable success across various vision tasks. However, ViTs inherently lack spatial inductive biases, necessitating explicit position embedding (PE) schemes. Recently, many studies have adopted non-fixed-length position embeddings (nFPEs) over traditional absolute or relative PEs. These nFPEs, typically implemented using inductive modules such as convolutional layers, offer advantages such as adaptability to varying token sequence lengths and the potential for translation equivariance. However, our analysis reveals that prevalent nFPE methods often yield positional information that is significantly skewed by feature content, an issue that has not yet been discussed. In this paper, we argue that nFPEs in prior works have two common limitations. First, nFPEs exhibit a significant semantic bias: they are strongly affected and distorted by the semantic content of input feature maps, leading to indistinct positional information. Second, although the intrinsic token order remains constant throughout the network, nFPEs redundantly recompute positional information within each transformer block, leading to inefficiency and potentially inconsistent PE application. To overcome these drawbacks, we propose Centralized Position Embedding (CPE). The core idea of CPE is to replace the scattered PE module in each transformer block with a unified PE network per stage, whose output is broadcast to all transformer blocks within that stage. This centralized design allows a significantly larger receptive field for the PE network at negligible computational overhead, facilitating the extraction of less biased and more consistent positional information and thus addressing the aforementioned limitations of nFPEs. By applying the proposed CPE to various ViTs on several vision tasks, we show that CPE yields more precise positional information, leading to consistent performance improvements over existing PE strategies and supporting our arguments.
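
The contrast the abstract draws between scattered per-block PE modules and a single per-stage PE network can be sketched roughly as follows. This is an illustrative NumPy sketch, not the authors' implementation: the abstract does not specify the PE network, so the depthwise convolution, the 3x3 and 7x7 kernel sizes, the additive way the PE is applied, and the names `depthwise_conv2d`, `placeholder_block`, `scattered_stage`, and `centralized_stage` are all assumptions made for illustration.

```python
import numpy as np


def depthwise_conv2d(x, kernels):
    """Zero-padded 'same' depthwise convolution (cross-correlation form).

    x: (C, H, W) feature map; kernels: (C, k, k) with odd k.
    Works for any H and W, so it adapts to varying token grid sizes.
    """
    _, h, w = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for dy in range(k):
        for dx in range(k):
            out += kernels[:, dy, dx, None, None] * xp[:, dy:dy + h, dx:dx + w]
    return out


def placeholder_block(tokens, weight):
    """Hypothetical stand-in for a transformer block: residual channel mixing."""
    return tokens + np.tensordot(weight, tokens, axes=([1], [0]))


def scattered_stage(x, block_weights, pe_kernels_per_block):
    """Scattered style: every block recomputes a PE from its own input."""
    pe_calls = 0
    for weight, kernels in zip(block_weights, pe_kernels_per_block):
        x = x + depthwise_conv2d(x, kernels)  # PE recomputed per block
        pe_calls += 1
        x = placeholder_block(x, weight)
    return x, pe_calls


def centralized_stage(x, block_weights, pe_kernels):
    """Centralized style: one PE per stage, broadcast to all blocks.

    Adding the shared PE to each block's input is an assumption here.
    """
    pe = depthwise_conv2d(x, pe_kernels)  # computed once from the stage input
    pe_calls = 1
    for weight in block_weights:
        x = placeholder_block(x + pe, weight)  # same PE reused in every block
    return x, pe_calls


rng = np.random.default_rng(0)
C, H, W, depth = 4, 8, 8, 3
x = rng.standard_normal((C, H, W))
block_weights = [0.1 * rng.standard_normal((C, C)) for _ in range(depth)]
small = [0.1 * rng.standard_normal((C, 3, 3)) for _ in range(depth)]
large = 0.1 * rng.standard_normal((C, 7, 7))  # one larger kernel per stage

y_s, n_s = scattered_stage(x, block_weights, small)
y_c, n_c = centralized_stage(x, block_weights, large)
print("scattered PE evaluations:", n_s, "centralized PE evaluations:", n_c)
```

Because the centralized PE is evaluated once per stage rather than once per block, a larger kernel (and hence a larger receptive field) can be used without multiplying its cost by the stage depth, which is the trade-off the abstract describes.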
Appears in Collections
Seoul College of Engineering > Seoul School of Electronic Engineering > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Rhee, Chae Eun
COLLEGE OF ENGINEERING (SCHOOL OF ELECTRONIC ENGINEERING)