Data augmentation for UAV-captured vessel images in maritime surveillance using multimodal language and diffusion models

Authors

  • Le Thi Thu Hong, Institute of Information Technology and Electronics, Academy of Military Science and Technology
  • Pham Thu Huong, Institute of Information Technology and Electronics, Academy of Military Science and Technology
  • Doan Quang Tu, Institute of Information Technology and Electronics, Academy of Military Science and Technology
  • Nguyen Chi Thanh (Corresponding Author), Institute of Information Technology and Electronics, Academy of Military Science and Technology

DOI:

https://doi.org/10.54939/1859-1043.j.mst.IITE.2025.160-168

Keywords:

Diffusion; Image synthesis; Data augmentation; Vessel detection.

Abstract

In maritime surveillance, UAV-based vessel detection is essential for ensuring security and safety at sea. However, annotated data are often scarce and insufficiently diverse, which restricts model performance in complex maritime environments. This study introduces a data augmentation pipeline that uses multimodal generative models to enrich training datasets with realistic synthetic images. Scene descriptions are generated automatically from UAV imagery by Gemma, a lightweight multimodal language model, and then used to guide FLUX, a text-to-image diffusion model, in creating diverse vessel-centric scenes under varying environmental conditions. A hybrid annotation strategy combines YOLO-World for initial object proposals with manual refinement to ensure label accuracy. The synthetic images are then merged with the original data to train a vessel detection model. Experiments on the VESSELImg benchmark show that the proposed approach raises the YOLOv11 detector’s mean average precision averaged over IoU thresholds from 0.50 to 0.95 (mAP@0.50:0.95) from 0.775 to 0.805. These results support the effectiveness of combining multimodal language models with diffusion models for domain-specific data synthesis, with improved generalization and robustness in UAV-based maritime vessel detection.
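To make the reported metric concrete, the following is a minimal sketch of what the IoU range 0.50:0.95 means and of the size of the reported gain. It assumes the abstract follows the common COCO-style convention of ten thresholds in steps of 0.05, which the abstract does not state explicitly. The function names and the example boxes are illustrative and are not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed convention: thresholds 0.50, 0.55, ..., 0.95 (ten values).
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def matched_thresholds(pred_box, gt_box):
    """Thresholds at which a prediction would count as a match to gt_box."""
    overlap = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if overlap >= t]


# mAP values reported in the abstract.
baseline_map = 0.775
augmented_map = 0.805
absolute_gain = augmented_map - baseline_map   # about 0.030
relative_gain = absolute_gain / baseline_map   # about 3.9 %

if __name__ == "__main__":
    # Hypothetical boxes: a prediction slightly shorter than the ground truth.
    print(matched_thresholds((0, 0, 10, 7.2), (0, 0, 10, 10)))
    print(f"absolute gain: {absolute_gain:.3f}, relative gain: {relative_gain:.1%}")
```

Under this convention, a prediction that overlaps a ground-truth vessel box loosely counts only at the lower thresholds, so the averaged metric rewards tight localization as well as detection.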

Published

30-10-2025

How to Cite

Le Thi Thu Hong, Pham Thu Huong, Doan Quang Tu, and Nguyen Chi Thanh, “Data augmentation for UAV-captured vessel images in maritime surveillance using multimodal language and diffusion models”, JMST, no. IITE, pp. 160–168, Oct. 2025.

Section

Information Technology
