MarineGPT: Advancing Marine Vision-Language AI

In an article recently submitted to the arXiv* preprint server, researchers proposed MarineGPT, the first vision-language model designed specifically for the marine domain, and demonstrated its feasibility.

Study: MarineGPT: Advancing Marine Vision-Language AI. Image credit: Generated using DALL·E 3

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

Background

Large language models (LLMs), such as GPT-4 and ChatGPT, can comprehend human intent and execute diverse real-world tasks as general-purpose artificial intelligence (AI) assistants. However, existing LLMs primarily process unimodal text inputs, making text-only chatbots suboptimal as effective AI assistants.

Multi-modal LLMs (MLLMs) empower LLMs to align with human intent, follow multi-modal instructions, and perceive inputs from several modalities by constructing a joint visual-text semantic space in which different real-world tasks can be completed. A vision-language multi-modal model connects a vision encoder to an LLM for general-purpose visual and language understanding.
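The article itself contains no code, but the idea of a joint visual-text space can be illustrated with a minimal, CLIP-style sketch: embeddings from both modalities are projected into one shared space, where matched image-text pairs should score highest. All names and dimensions below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def joint_space_scores(image_feats: torch.Tensor,
                       text_feats: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every image and text embedding."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return image_feats @ text_feats.T

# Toy batch: 4 images and 4 captions projected into a shared 512-d space.
image_feats = torch.randn(4, 512)  # stand-in for vision-encoder outputs
text_feats = torch.randn(4, 512)   # stand-in for text-encoder outputs
scores = joint_space_scores(image_feats, text_feats)  # (4, 4); diagonal = matched pairs
```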

Although significant progress has been achieved in MLLMs by leveraging large numbers of image-text pairs from the public web, such general-purpose vision-language models lack the concept coverage and sophistication needed to understand and deliver domain-specific knowledge, particularly in the marine domain. Marine-specific MLLMs must yield more scientific, informative, and sensitive responses.

The proposed MarineGPT

In this study, researchers proposed MarineGPT, the first vision-language model designed specifically for the marine domain. The model could identify marine objects from given visual inputs and effectively yield corresponding scientific, informative, and sensitive responses as a robust marine AI assistant.

They also introduced the Marine-5M dataset, containing over five million marine image-text pairs covering an abundance of marine domain concepts, to incorporate domain-specific marine knowledge into the proposed model and achieve improved marine vision-language alignment.

The Marine-5M dataset was used for marine-specific continuous pre-training, which effectively adapted the general-purpose MLLM into a domain-specific expert model by aligning images with domain expertise that is flexibly managed and defined through language descriptions.
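A hedged sketch of what such continued pre-training might look like with Hugging Face-style components follows; the frozen vision encoder and LLM, the trainable projector, and all shapes are assumptions about the setup rather than the paper's published code.

```python
import torch

def marine_pretrain_step(vision_encoder, projector, llm, images, caption_ids,
                         optimizer):
    """One schematic step of marine-specific continued pre-training.

    `vision_encoder` and `llm` stand in for the frozen general-purpose
    components; only `projector` (the vision-to-language bridge) learns.
    """
    with torch.no_grad():
        patch_feats = vision_encoder(images)           # frozen visual features
    visual_tokens = projector(patch_feats)             # (B, Q, D), trainable

    # Prepend visual tokens to the caption embeddings and train with the
    # usual next-token objective; -100 masks the loss on visual positions.
    text_embeds = llm.get_input_embeddings()(caption_ids)
    inputs = torch.cat([visual_tokens, text_embeds], dim=1)
    ignore = torch.full(visual_tokens.shape[:2], -100,
                        dtype=torch.long, device=caption_ids.device)
    labels = torch.cat([ignore, caption_ids], dim=1)

    loss = llm(inputs_embeds=inputs, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```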

Subsequently, researchers designed 50 distinct marine-specific instructions based on the requirements and expertise of marine biologists so that MarineGPT could effectively understand user intent. Instruction-following training data were then generated at scale using ChatGPT/GPT-4 following this instruction design. Additionally, researchers summarized 129 comprehensive, hierarchical, and diverse attributes of marine objects, including reproduction, feeding diet, shape, color, size, morphology, habitat, and distribution.
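As an illustration of how such instruction-following data might be generated at scale, the sketch below pairs invented marine-instruction templates (the paper's actual 50 instructions are not reproduced in the article) with the OpenAI Python client; the model name, prompts, and helper are all assumptions.

```python
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()

# Invented examples in the spirit of the paper's 50 marine-specific
# instructions; the real templates are not listed in the article.
TEMPLATES = [
    "Describe the habitat and distribution of the {species} in this image.",
    "What are the feeding diet and social behavior of the {species}?",
    "List the physical characteristics that identify this {species}.",
]

def generate_qa_pairs(species: str, facts: str) -> list[tuple[str, str]]:
    """Turn curated marine facts into instruction-following QA pairs."""
    pairs = []
    for template in TEMPLATES:
        question = template.format(species=species)
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system",
                 "content": "Answer concisely using only the facts provided."},
                {"role": "user",
                 "content": f"Facts: {facts}\n\nQuestion: {question}"},
            ],
        )
        pairs.append((question, response.choices[0].message.content))
    return pairs
```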

They generated various attribute descriptions for images sharing the same category annotation, based on texts crawled from reliable marine websites, including Reeflex and FishDB. After marine-specific continuous pre-training, MarineGPT offered more reasonable and accurate responses to human inquiries.

Researchers then constructed 1.12 million high-quality marine image-text pairs spanning a wide range of instruction-following templates and question-answer pairs, covering different tasks for describing the marine organisms in a given image. These pairs were used for marine knowledge foundation model training, improving MarineGPT's ability to generate more scientific and fine-grained responses.
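A minimal sketch of how such image-instruction-answer triples could be packaged for fine-tuning is shown below; the JSON manifest format and the MiniGPT-4-style prompt token are assumptions, as the article does not describe the on-disk format of the 1.12 million pairs.

```python
import json
from PIL import Image
from torch.utils.data import Dataset

class MarineInstructionDataset(Dataset):
    """Pairs marine images with instruction-following QA for fine-tuning.

    Assumes a JSON manifest like
    [{"image": "clownfish.jpg", "question": "...", "answer": "..."}, ...].
    """

    def __init__(self, manifest_path: str, transform=None):
        with open(manifest_path) as f:
            self.records = json.load(f)
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = Image.open(rec["image"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        # MiniGPT-4-style prompt with an image placeholder token (assumed).
        prompt = f"###Human: <Img><ImageHere></Img> {rec['question']} ###Assistant:"
        return image, prompt, rec["answer"]
```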

Experimental evaluation and findings

Researchers developed MarineGPT to achieve cross-modal vision-language alignment between visual observations and LLMs. They utilized the same visual encoder employed in BLIP-2, a ViT backbone with a pre-trained Q-Former, to achieve effective visual perception.

Additionally, LLaMA-13B served as the LLM decoder for generating responses. The gap between the visual encoder and the LLM was bridged by the Q-Former and additional linear layers, which compute the similarity between captions and visual content.
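The sketch below shows schematically how such a ViT, Q-Former, linear projection, and LLM could be chained; the modules are stand-ins, and the dimensions are illustrative (5120 matches LLaMA-13B's hidden size, while the Q-Former width is assumed).

```python
import torch
import torch.nn as nn

class VisionLanguageBridge(nn.Module):
    """Schematic ViT -> Q-Former -> linear -> LLM pipeline.

    `vit`, `qformer`, and `llm` stand in for the frozen BLIP-2 visual
    encoder, its pre-trained Q-Former, and LLaMA-13B.
    """

    def __init__(self, vit, qformer, llm, vis_dim=768, llm_dim=5120, n_query=32):
        super().__init__()
        self.vit, self.qformer, self.llm = vit, qformer, llm
        self.query_tokens = nn.Parameter(torch.zeros(1, n_query, vis_dim))
        self.proj = nn.Linear(vis_dim, llm_dim)  # maps Q-Former output to LLM space

    def forward(self, images, prompt_embeds):
        patches = self.vit(images)                       # (B, P, vis_dim)
        queries = self.query_tokens.expand(images.size(0), -1, -1)
        fused = self.qformer(queries, patches)           # queries attend to patches
        visual_prefix = self.proj(fused)                 # (B, n_query, llm_dim)
        inputs = torch.cat([visual_prefix, prompt_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```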

During continuous pre-training, the parameters of the language model and the frozen ViT were converted to FP16 to improve computational efficiency. Finally, the performance of MarineGPT was compared with that of MiniGPT-4 and GPT-4V.
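The half-precision conversion mentioned above is a standard PyTorch operation; a minimal sketch, assuming the frozen ViT and language model are ordinary `nn.Module` objects, is:

```python
import torch

def freeze_and_halve(modules):
    """Cast frozen components to FP16 and disable their gradients.

    Trainable bridge parameters would typically stay in FP32 for stability.
    """
    for module in modules:
        module.half()                    # parameters become torch.float16
        for p in module.parameters():
            p.requires_grad = False

# e.g., freeze_and_halve([vit, llm]) for the frozen ViT and language model.
```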

The comparative analysis demonstrated that MarineGPT could generate longer and more detailed responses than MiniGPT-4 and GPT-4V, including corresponding biological information such as the common and scientific names of the recognized marine objects.

MarineGPT also generated more diverse and relevant information about the recognized objects than GPT-4V and MiniGPT-4. However, it produced an incorrect scientific name for the dog-faced puffer fish, which was attributed to the limitations of the frozen LLaMA-13B LLM used in this study.

The proposed model successfully identified many different marine creatures, provided their scientific and common names, and generated comprehensive and diverse descriptions of the recognized objects. MarineGPT also supplied corresponding references to offer additional informative responses, such as the benefits a recognized object derives from its physical appearance, steps that can be taken to protect recognized threatened marine species, the physical characteristics of the blacktip reef shark, the social behavior and feeding diet of the bottlenose dolphin, and the spatial distribution of Sarpa salpa.

Moreover, MarineGPT effectively differentiated between very similar marine organisms and generated distinct responses for each. This fine-grained object recognition ability, introduced by the Marine-5M dataset, could prove effective for biodiversity monitoring.

In multi-round conversations, MarineGPT successfully recognized the marine objects in different images uploaded by users and generated responses aligned with user intent, indicating the model's potential to raise public awareness of the significance of marine biodiversity. However, more research is required to address several limitations of the proposed MarineGPT; for instance, the model failed to generate long, informative responses for images containing multiple object instances.
