TY - JOUR
T1 - Efficient Scopeformer: Toward Scalable and Rich Feature Extraction for Intracranial Hemorrhage Detection
AU - Barhoumi, Yassine
AU - Bouaynaya, Nidhal Carla
AU - Rasool, Ghulam
N1 - Funding Information:
This work was supported in part by the National Science Foundation under Award ECCS-1903466 and Award OAC-2008690.
Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - The quality and richness of feature maps extracted by convolutional neural networks (CNNs) and vision transformers (ViTs) directly relate to robust model performance. In medical computer vision, these information-rich features are crucial for detecting rare cases within large datasets. This work presents the 'Scopeformer,' a novel multi-CNN-ViT model for intracranial hemorrhage classification in computed tomography (CT) images. The Scopeformer architecture is scalable and modular, allowing various CNN architectures to serve as the backbone with diversified output features and pre-training strategies. We propose effective feature projection methods to reduce redundancies among CNN-generated features and to control the input size of ViTs. Extensive experiments with various Scopeformer models show that model performance is proportional to the number of convolutional blocks employed in the feature extractor. Using multiple strategies, including diversified pre-training paradigms for CNNs, different pre-training datasets, and style transfer techniques, we demonstrate an overall improvement in model performance at various computational budgets. We then propose smaller, compute-efficient Scopeformer versions with three different input and output ViT configurations. Efficient Scopeformers use four different pre-trained CNN architectures as feature extractors to increase feature richness. Our best Efficient Scopeformer model achieved an accuracy of 96.94% and a weighted logarithmic loss of 0.083 with an eightfold reduction in the number of trainable parameters compared with the base Scopeformer. Another version of the Efficient Scopeformer reduced the parameter space further, by almost a factor of 17, with negligible performance loss. In summary, our work shows that hybrid architectures combining CNNs and ViTs may provide the feature richness needed to develop accurate medical computer vision models.
UR - http://www.scopus.com/inward/record.url?scp=85166747837&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85166747837&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2023.3301160
DO - 10.1109/ACCESS.2023.3301160
M3 - Article
AN - SCOPUS:85166747837
SN - 2169-3536
VL - 11
SP - 81656
EP - 81671
JO - IEEE Access
JF - IEEE Access
ER -