Real-Time Instance Segmentation Using Lightweight CNN-Transformer Hybrids

Authors

  • Sajud Hamza Elinjulliparambil, Pace University

DOI:

https://doi.org/10.63282/3050-9246.IJETCSIT-V4I4P117

Keywords:

Instance segmentation, real-time vision, CNN-Transformer hybrid, attention mechanisms, model efficiency, autonomous systems, robotics, computer vision

Abstract

Instance segmentation is a fundamental computer vision problem in which individual object instances must be localized, classified, and delineated at the pixel level simultaneously. The trade-off between computational cost and segmentation quality is a major obstacle to achieving real-time performance, especially on resource-constrained platforms such as edge devices and embedded systems. Lightweight CNN-Transformer hybrid networks have emerged as a promising solution, combining the efficient local feature extraction of convolutional networks with the global context modeling of Transformers. This literature review provides an in-depth examination of real-time instance segmentation methods, focusing on CNN-based pipelines, Transformer models, and their hybrid variants. Lightweight design strategies, such as model compression, efficient attention mechanisms, and backbone optimization, are discussed alongside standard datasets and benchmarking protocols that support consistent evaluation. Finally, we examine real-world deployments in autonomous driving, robotics, and industrial vision, and identify current challenges and future research directions concerning the accuracy, efficiency, and robustness of real-time implementations.
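To make the hybrid idea concrete, the following is a minimal, hypothetical sketch (not the architecture surveyed in this paper): a convolution extracts local features from an image patch, and a single-head scaled dot-product self-attention layer then mixes those features globally. All layer sizes, weights, and function names here are illustrative assumptions.

```python
# Hypothetical CNN-Transformer hybrid block in plain NumPy:
# local 3x3 convolution followed by global self-attention over the
# resulting feature map. Toy sizes; not the model from the paper.
import numpy as np

def conv3x3(x, w):
    """Valid (no-padding) 3x3 convolution over an (H, W) map with a (3, 3) kernel."""
    H, W = x.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * w)
    return out

def self_attention(tokens, wq, wk, wv):
    """Scaled dot-product self-attention over (N, d) token embeddings."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax rows
    return attn @ v

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))                    # toy single-channel input
feat = conv3x3(img, rng.standard_normal((3, 3)))     # local features: (6, 6)
tokens = feat.reshape(-1, 1) @ np.ones((1, 4))       # flatten to 36 tokens, d=4
d = tokens.shape[1]
out = self_attention(tokens,
                     rng.standard_normal((d, d)),
                     rng.standard_normal((d, d)),
                     rng.standard_normal((d, d)))    # globally mixed: (36, 4)
```

The design point this sketch illustrates is the division of labor discussed in the review: the convolution is cheap and captures local texture, while the attention step, whose cost grows quadratically in the number of tokens, is applied only to the smaller downsampled feature map — the motivation for the efficient-attention and backbone-optimization strategies surveyed above.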



Published

2023-12-30

Section

Articles

How to Cite

Elinjulliparambil SH. Real-Time Instance Segmentation Using Lightweight CNN-Transformer Hybrids. IJETCSIT [Internet]. 2023 Dec. 30 [cited 2026 Jan. 28];4(4):159-67. Available from: https://ijetcsit.org/index.php/ijetcsit/article/view/527
