In an article recently submitted to the arXiv* preprint server, researchers proposed MarineGPT, the first vision-language model designed specifically for the marine domain, and demonstrated its feasibility.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Large language models (LLMs), such as GPT-4/ChatGPT, can comprehend human intent and execute diverse real-world tasks as general-purpose artificial intelligence (AI) assistants. However, existing LLMs primarily accept unimodal text inputs, which makes text-only chatbots less than optimal as effective AI assistants.
Multi-modal LLMs (MLLMs) empower LLMs to align with human intent, follow multi-modal instructions, and perceive inputs from several modalities by constructing a joint visual-text semantic space in which to complete different real-world tasks. A vision-language multi-modal model connects an LLM with a vision encoder for general-purpose language and visual understanding.
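To make the idea of a joint visual-text space concrete, the following minimal PyTorch sketch (our illustration, not code from the paper) projects pre-extracted image and text features into a shared embedding space and aligns matched pairs with a CLIP-style contrastive loss; all dimensions are toy values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingSpace(nn.Module):
    def __init__(self, img_dim=768, txt_dim=512, joint_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)  # vision features -> joint space
        self.txt_proj = nn.Linear(txt_dim, joint_dim)  # text features -> joint space

    def forward(self, img_feats, txt_feats):
        # L2-normalize so that dot products become cosine similarities
        v = F.normalize(self.img_proj(img_feats), dim=-1)
        t = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return v, t

def contrastive_loss(v, t, temperature=0.07):
    # Matched image-text pairs sit on the diagonal of the similarity matrix
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy batch of 4 pre-extracted image and text feature vectors
v, t = JointEmbeddingSpace()(torch.randn(4, 768), torch.randn(4, 512))
print(contrastive_loss(v, t))
```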
Although significant progress has been achieved in the field of MLLMs by leveraging substantial numbers of image-text pairs from the public web, such general-purpose vision-language models lack the concept coverage and sophistication needed to understand and deliver domain-specific knowledge, particularly in the marine domain. Marine-specific MLLMs must yield more scientific, informative, and sensitive responses.
The proposed MarineGPT
In this study, researchers proposed MarineGPT, the first vision-language model designed specifically for the marine domain. The model could identify marine objects from given visual inputs and effectively yield corresponding scientific, informative, and sensitive responses as a robust marine AI assistant.
They also introduced the Marine-5M dataset, which contains over five million marine image-text pairs covering abundant marine-domain concepts, to incorporate domain-specific marine knowledge into the proposed model and achieve improved marine vision-language alignment.
The Marine-5M dataset was used for marine-specific continuous pre-training, which effectively adapted the general-purpose MLLM into a domain-specific expert model by aligning images with domain knowledge that is flexibly defined and managed through language descriptions.
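The article does not detail the implementation of this stage, but continuous pre-training in comparable MLLMs (e.g., BLIP-2, MiniGPT-4) typically freezes the vision backbone and the LLM and updates only a lightweight connector on image-text pairs. The sketch below illustrates that general recipe with stand-in modules and a placeholder loss.

```python
import torch
import torch.nn as nn

# Stand-in modules with toy dimensions; the real model uses a ViT encoder
# and a large LLM, both kept frozen during this stage.
vision_encoder = nn.Linear(1024, 768)   # frozen vision backbone (stand-in)
connector = nn.Linear(768, 5120)        # trainable projection into the LLM space

vision_encoder.requires_grad_(False)    # continuous pre-training leaves the backbone intact
optimizer = torch.optim.AdamW(connector.parameters(), lr=1e-4)

# One illustrative step on a toy batch of image features from image-text pairs
image_feats = vision_encoder(torch.randn(8, 1024))   # (batch, vision_dim)
visual_tokens = connector(image_feats)               # projected into the LLM embedding space
# In a real pipeline, these tokens would be fed to the frozen LLM together with
# the paired caption, and the LLM's next-token loss would train `connector`.
loss = visual_tokens.pow(2).mean()                   # placeholder loss for this sketch
loss.backward()
optimizer.step()
```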
Subsequently, researchers designed 50 marine-specific instructions based on the requirements and expertise of marine biologists to enable the MarineGPT model to understand user intent effectively. Instruction-following training data were generated scalably using ChatGPT/GPT-4, following the design of these marine-specific instructions. Additionally, researchers summarized 129 comprehensive, hierarchical, and diverse attributes of marine objects, including reproduction, feeding diet, shape, color, size, morphology, habitat, and distribution.
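As an illustration of how such attributes and instructions could drive data generation, the hypothetical sketch below assembles a generation prompt from a structured attribute record; the record values, template wording, and helper names are invented for this example.

```python
from dataclasses import dataclass

@dataclass
class MarineAttributes:
    common_name: str
    scientific_name: str
    habitat: str
    distribution: str
    feeding_diet: str
    reproduction: str
    # ... the article reports 129 attributes in total

INSTRUCTION_TEMPLATES = [
    "Describe the {attribute} of the organism in this image.",
    "What is known about the {attribute} of {common_name}?",
]

def build_generation_prompt(record: MarineAttributes, attribute: str) -> str:
    # The filled prompt would be sent to ChatGPT/GPT-4, which rewrites the
    # crawled ground-truth text into a natural question-answer pair.
    question = INSTRUCTION_TEMPLATES[1].format(
        attribute=attribute.replace("_", " "), common_name=record.common_name
    )
    return f"Q: {question}\nGround truth: {getattr(record, attribute)}"

clownfish = MarineAttributes(
    common_name="clownfish",
    scientific_name="Amphiprion ocellaris",
    habitat="coral reefs",
    distribution="Indo-Pacific",
    feeding_diet="omnivorous; algae and zooplankton",
    reproduction="sequential hermaphrodite",
)
print(build_generation_prompt(clownfish, "feeding_diet"))
```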
They generated various attribute descriptions for images sharing the same category annotation, based on texts crawled from reliable marine websites, including Reeflex and FishDB. After marine-specific continuous pre-training, MarineGPT offered more reasonable and accurate responses to human inquiries.
Researchers then constructed 1.12 million high-quality marine image-text pairs from a wide range of instruction-following templates, yielding instruction-following question-answer pairs that covered different tasks for describing the marine organisms in a given image. These pairs were used to train the marine knowledge foundation model and improve MarineGPT's ability to generate more scientific and fine-grained responses.
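The released data format is not described in the article, so the following sketch only suggests what one template-expanded instruction-following sample for an image-text pair might look like; the field names, file path, and templates are assumptions.

```python
import json
import random

QUESTION_TEMPLATES = [
    "What species is shown in this image?",
    "Provide the common and scientific names of this marine organism.",
    "Give a detailed description of the organism in the photo.",
]

def make_sample(image_path: str, answer: str) -> dict:
    # Pair a randomly chosen question template with the grounded answer text
    return {
        "image": image_path,
        "conversation": [
            {"role": "user", "content": random.choice(QUESTION_TEMPLATES)},
            {"role": "assistant", "content": answer},
        ],
    }

sample = make_sample(
    "images/blacktip_reef_shark_0001.jpg",
    "This is a blacktip reef shark (Carcharhinus melanopterus), a reef-associated shark...",
)
print(json.dumps(sample, indent=2))
```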
Experimental evaluation and findings
Researchers developed MarineGPT to achieve cross-modality vision-language alignment between LLMs and visual observations. In this study, they utilized the same visual encoder as that employed in BLIP-2, a ViT backbone paired with a pre-trained Q-Former, to achieve effective visual perception.
Additionally, LLaMA-13B was used as the LLM decoder to generate responses. The gap between the visual encoder and the LLM was bridged using the Q-Former and additional linear layers, which compute the similarity between captions and visual content.
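The wiring described above can be sketched as follows: a frozen ViT yields patch features, a Q-Former distills them into a fixed set of query tokens, and a linear projection maps those tokens into the LLM's embedding space (5,120 dimensions for LLaMA-13B). The modules below are toy stand-ins, not the actual BLIP-2 or LLaMA weights.

```python
import torch
import torch.nn as nn

class ToyQFormer(nn.Module):
    def __init__(self, num_queries=32, dim=768):
        super().__init__()
        # Learnable query tokens that cross-attend to the ViT patch features
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patch_feats):  # (batch, num_patches, dim)
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, patch_feats, patch_feats)
        return out  # (batch, num_queries, dim)

vit_dim, llm_dim = 768, 5120            # 5120 = LLaMA-13B hidden size
qformer = ToyQFormer(dim=vit_dim)
proj = nn.Linear(vit_dim, llm_dim)      # the "additional linear layers"

patch_feats = torch.randn(2, 257, vit_dim)   # stand-in for frozen-ViT output
visual_tokens = proj(qformer(patch_feats))   # ready to prepend to LLM text embeddings
print(visual_tokens.shape)                   # torch.Size([2, 32, 5120])
```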
During continuous pre-training, the parameters of the language model and the frozen ViT were converted to FP16 to improve computational efficiency. Finally, the performance of MarineGPT was compared with that of MiniGPT-4 and GPT-4V.
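This FP16 detail can be expressed in a few lines of PyTorch: frozen modules are cast to half precision to cut memory use and speed up compute, while trainable parts remain in FP32 for stable gradient updates. The module names below are illustrative stand-ins.

```python
import torch.nn as nn

vit = nn.Linear(1024, 768)        # stand-in for the frozen ViT
llm = nn.Linear(5120, 32000)      # stand-in for the frozen language model
connector = nn.Linear(768, 5120)  # trainable connector kept in FP32

for frozen in (vit, llm):
    frozen.requires_grad_(False)  # no gradients flow into frozen parameters
    frozen.half()                 # cast parameters to FP16
```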
The comparative analysis demonstrated that MarineGPT could generate longer and more detailed responses than MiniGPT-4 and GPT-4V, with corresponding biological information, including the common and scientific names of the recognized marine objects.
MarineGPT also generated more diverse and relevant information about the recognized objects than GPT-4V and MiniGPT-4. However, it generated an incorrect scientific name for the dog-faced puffer fish, which the researchers attributed to limitations of the frozen LLM (LLaMA-13B) used in this study.
The proposed model successfully identified many different marine creatures, provided the corresponding scientific and common names, and generated comprehensive and diverse image descriptions of the recognized marine objects. MarineGPT also generated corresponding references to offer additional information, and its informative responses covered, for example, the benefits of a recognized object's physical appearance, the steps that can be implemented to protect recognized threatened marine species, the physical characteristics of the black-tip reef shark, the social behavior and feeding diet of the bottlenose dolphin, and the spatial distribution of Sarpa salpa.
Moreover, MarineGPT effectively differentiated between very similar marine organisms and generated distinct responses for each. This fine-grained object recognition ability, introduced into the model through the Marine-5M dataset, can be effective for diversity monitoring.
In multi-round conversations with users, MarineGPT successfully recognized the marine objects present in different user-uploaded marine images and generated responses aligned with user intent, indicating the model's potential to increase public awareness of the significance of marine biodiversity. However, more research is required to address several limitations of the proposed MarineGPT. For instance, the model failed to generate long, informative responses for images containing multiple object instances.
Journal reference:
- Preliminary scientific report.
Zheng, Z., Zhang, J., Vu, T., Diao, S., Tim, Y. H., & Yeung, S. (2023). MarineGPT: Unlocking Secrets of Ocean to the Public. arXiv. https://doi.org/10.48550/arXiv.2310.13596, https://arxiv.org/abs/2310.13596