V-PRUNE: Semantic-Aware Patch Pruning Before Tokenization in Vision-Language Model Inference

Seo, Hyein; Choi, Yong Suk

doi:10.3390/app15179463

V-PRUNE: Semantic-Aware Patch Pruning Before Tokenization in Vision–Language Model Inference (open access)

Authors: Seo, Hyein; Choi, Yong Suk

Issue Date: Aug-2025

Publisher: MDPI

Keywords: vision-language models; efficient vision transformers; feature pruning; visual question answering

Citation: Applied Sciences-Basel, v.15, no.17, pp. 1-15

Pages: 15

Indexed: SCIE; SCOPUS

Journal Title: Applied Sciences-Basel

Volume: 15

Number: 17

Start Page: 1

End Page: 15

URI: https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/208860

DOI: 10.3390/app15179463

ISSN: 2076-3417

Abstract: Recent vision-language models (VLMs) achieve strong performance across multimodal benchmarks but suffer from high inference costs due to the large number of visual tokens. Prior studies have shown that many image tokens receive consistently low attention scores during inference, indicating that a substantial portion of visual content contributes little to final predictions. These observations raise questions about the efficiency of conventional token pruning strategies, which are typically applied after all attention operations and depend on late-emerging attention scores. To address this, we propose V-PRUNE, a semantic-aware patch-level pruning framework for vision-language models that removes redundant content before tokenization. By evaluating local similarity via color and histogram statistics, our method enables lightweight and interpretable pruning without architectural changes. Applied to CLIP-based models, our approach reduces FLOPs and inference time across vision-language understanding tasks, while maintaining or improving accuracy. Qualitative results further confirm that essential regions are preserved and the pruning behavior is human-aligned, making our method a practical solution for efficient VLM inference.
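The abstract describes pruning patches by local similarity in color and histogram statistics, before any tokens are produced. The sketch below illustrates that general idea only; it is not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names (`patch_stats`, `is_similar`, `select_patches`), the use of grayscale intensity as a stand-in for color, the choice to compare each patch with its left and top neighbours, histogram intersection as the similarity measure, and the threshold values.

```python
from typing import List, Tuple

Image = List[List[int]]            # grayscale pixels in 0..255 (assumed stand-in for color)
Stats = Tuple[float, List[float]]  # (mean intensity, normalized histogram)


def patch_stats(img: Image, top: int, left: int, size: int, bins: int) -> Stats:
    """Compute the mean intensity and a normalized histogram for one square patch."""
    pixels = [img[r][c]
              for r in range(top, top + size)
              for c in range(left, left + size)]
    mean = sum(pixels) / len(pixels)
    hist = [0.0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1.0 / len(pixels)
    return mean, hist


def is_similar(a: Stats, b: Stats, mean_tol: float, hist_min: float) -> bool:
    """Two patches count as similar if their means are close and their histograms overlap."""
    mean_gap = abs(a[0] - b[0])
    overlap = sum(min(x, y) for x, y in zip(a[1], b[1]))  # histogram intersection in [0, 1]
    return mean_gap <= mean_tol and overlap >= hist_min


def select_patches(img: Image, size: int = 2, bins: int = 4,
                   mean_tol: float = 8.0, hist_min: float = 0.9) -> List[Tuple[int, int]]:
    """Return (row, col) indices of patches to keep.

    Hypothetical rule: a patch is dropped when it is similar to its left or top neighbour.
    """
    rows, cols = len(img) // size, len(img[0]) // size
    stats = [[patch_stats(img, r * size, c * size, size, bins) for c in range(cols)]
             for r in range(rows)]
    kept = []
    for r in range(rows):
        for c in range(cols):
            neighbours = []
            if c > 0:
                neighbours.append(stats[r][c - 1])
            if r > 0:
                neighbours.append(stats[r - 1][c])
            if not any(is_similar(stats[r][c], n, mean_tol, hist_min) for n in neighbours):
                kept.append((r, c))
    return kept
```

Under these assumptions, only the kept patch indices would be passed on to the vision encoder's patch embedding, so the encoder processes fewer tokens. For a 4x4 image whose left half is dark and right half is bright, `select_patches` with 2x2 patches keeps the two top patches and drops the two below them, since each lower patch matches the patch above it.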

Appears in Collections: College of Engineering (Seoul) > School of Computer Science (Seoul) > 1. Journal Articles

Related Researcher

Choi, Yong Suk: COLLEGE OF ENGINEERING (SCHOOL OF COMPUTER SCIENCE)
