A new AI model, TaxaBind, combines six data modalities (images, audio, text, and more) to improve species classification and ecological prediction, helping scientists track biodiversity and environmental change.
Figure caption: Species image to satellite image retrieval task. For each example, the top four most similar satellite images retrieved by the model from a gallery of 100,000 satellite images in the iSatNat test set.

Have you ever seen an image of an animal and wondered, "What is that?" TaxaBind, a new tool developed by computer scientists at the McKelvey School of Engineering at Washington University in St. Louis, can satisfy that curiosity and more.
TaxaBind addresses the need for more robust and unified approaches to ecological problems by combining multiple models to perform species classification (what kind of bear is this?), distribution mapping (where are the cardinals?), and other ecological tasks. The tool can also serve as a starting point for larger studies in ecological modeling, which scientists might use to predict shifts in plant and animal populations, the effects of climate change, or the impacts of human activities on ecosystems.
Srikumar Sastry, the project's lead author, presented TaxaBind on March 2-3 at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) in Tucson, AZ.
"With TaxaBind we're unlocking the potential of multiple modalities in the ecological domain," Sastry said. "Unlike existing models that only focus on one task at a time, we combine six modalities – ground-level images of species, geographic location, satellite images, text, audio and other environmental features – into one cohesive framework. This enables our models to address a diverse range of ecological tasks."
Sastry, a graduate student working with Nathan Jacobs, a professor of computer science and engineering, used an innovative technique known as multimodal patching to distill information from different modalities into one binding modality. Sastry describes this binding modality as the "mutual friend" that connects and maintains synergy among the other five modalities.
For TaxaBind, the binding modality is ground-level images of species. The tool captures unique features from each of the other five modalities and condenses them into the binding modality, enabling the AI to learn simultaneously from images, text, sound, geography, and environmental context.
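In broad strokes, this kind of binding can be pictured as contrastive alignment: each auxiliary modality's encoder is trained so that its embeddings land near the embedding of the paired ground-level image. The sketch below illustrates that general idea with placeholder encoders, dimensions, and a symmetric InfoNCE loss; it is an assumption-laden illustration, not TaxaBind's actual training code.

```python
# Minimal sketch of binding several modalities to a shared embedding space
# via contrastive alignment with ground-level images (hypothetical encoders,
# feature sizes, and modality names; not the authors' implementation).
import torch
import torch.nn.functional as F

EMBED_DIM = 512

def contrastive_loss(anchor: torch.Tensor, other: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    anchor = F.normalize(anchor, dim=-1)
    other = F.normalize(other, dim=-1)
    logits = anchor @ other.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(anchor.size(0))               # matched pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Stand-ins for per-modality encoders (in practice, pretrained vision/audio/text backbones).
modalities = ["ground_image", "satellite", "text", "audio", "location", "environment"]
encoders = {name: torch.nn.Linear(128, EMBED_DIM) for name in modalities}

# One illustrative training step: every non-binding modality is pulled toward
# the ground-level image embedding, which acts as the shared "mutual friend".
batch = {name: torch.randn(32, 128) for name in modalities}   # dummy input features
ground = encoders["ground_image"](batch["ground_image"])
loss = sum(contrastive_loss(ground, encoders[name](batch[name]))
           for name in modalities if name != "ground_image")
loss.backward()
```

Because every modality is only ever aligned to the binding modality during training, pairs such as satellite imagery and audio end up comparable to each other indirectly, through that shared anchor.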
When the team assessed the tool's performance across various ecological tasks, TaxaBind demonstrated superior capabilities in zero-shot classification, which is the ability to classify a species not present in its training dataset. The demo version of the tool was trained on roughly 450,000 species and can classify a given image by the species it shows, including previously unseen species.
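Zero-shot classification in a shared embedding space is typically done by embedding a query image and a list of candidate species names, then picking the closest name. The following sketch shows that nearest-neighbor step with random stand-in embeddings; the dimensionality and species list are hypothetical and no real encoders are involved.

```python
# Hypothetical sketch of zero-shot species classification: the query image is
# assigned to whichever candidate species name embeds closest to it, even if
# that species never appeared in training.
import numpy as np

rng = np.random.default_rng(0)

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for embeddings that the image and text encoders would produce.
image_embedding = normalize(rng.normal(size=(1, 512)))
candidate_species = ["Ursus arctos", "Ursus americanus", "Cardinalis cardinalis"]
text_embeddings = normalize(rng.normal(size=(len(candidate_species), 512)))

# Cosine similarity reduces to a dot product once vectors are unit length;
# the highest-scoring species name is the zero-shot prediction.
scores = (image_embedding @ text_embeddings.T).ravel()
prediction = candidate_species[int(scores.argmax())]
print(prediction, scores.round(3))
```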
"During training we only need to maintain the synergy between ground-level images and other modalities," Sastry said. "That bridge then creates emergent synergies between the modalities – for example, between satellite images and audio – when TaxaBind is applied to retrieval tasks, even though those modes were not trained together."
This cross-modal retrieval was another area where TaxaBind outperformed state-of-the-art methods. For example, combining satellite images with ground-level species images allowed TaxaBind to retrieve habitat characteristics and climate data tied to species' locations. It also returned relevant satellite images based on species images, demonstrating the tool's ability to link fine-grained ecological data with real-world environmental information.
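Cross-modal retrieval in a shared embedding space reduces to ranking a gallery by similarity to the query embedding. The sketch below shows a top-4 lookup over a 100,000-item satellite gallery, mirroring the setup described in the figure caption above; the embeddings here are random placeholders rather than model outputs.

```python
# Hedged sketch of species-image-to-satellite-image retrieval: rank a gallery
# of satellite embeddings by cosine similarity to the query embedding and
# return the closest matches (gallery size and dimensions are illustrative).
import numpy as np

rng = np.random.default_rng(1)

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

query = normalize(rng.normal(size=(512,)))              # embedding of one species photo
gallery = normalize(rng.normal(size=(100_000, 512)))    # embeddings of 100k satellite images

scores = gallery @ query                                 # cosine similarity against every gallery item
top_4 = np.argsort(-scores)[:4]                          # indices of the four closest satellite images
print(top_4, scores[top_4].round(3))
```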
The implications of TaxaBind extend far beyond species classification. Sastry notes that the models are general-purpose and could potentially be used as foundational models for other ecology and climate-related applications, such as deforestation monitoring and habitat mapping. He also envisions future iterations of the technology that can make sense of natural language text inputs to respond to user queries.

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Source: Washington University in St. Louis
Journal reference:
- Sastry, S., Khanal, S., Dhakal, A., Ahmad, A., & Jacobs, N. (2024). TaxaBind: A Unified Embedding Space for Ecological Applications. arXiv. https://arxiv.org/abs/2411.00683 (preliminary scientific report)