Summary

This project presents an automated biodiversity monitoring pipeline based on camera traps in Costa Rica, using records collected at the CEMEDE watering site of the National University. The work introduces the cemede-redbioma-ct dataset, a new resource of local species images, and compares two classification strategies: one that uses MegaDetector as a detection and cropping stage before fine-tuning Vision Transformers, and another that fine-tunes MegaDetector itself for direct species classification.

Costa Rica hosts an extraordinary share of global biodiversity, yet monitoring that richness remains costly and time-consuming when images must be reviewed manually. In this case study, camera traps were deployed from 2022 to 2024 in a dry tropical forest fragment next to El Cornizuelo Trail, within the Yaguarundí Forests Biological Corridor, to evaluate the ecological contribution of a watering site built to support local wildlife.

The images reflect real field conditions: rain, low light, night scenes, motion blur, limited resolution, and animals that appear small or only partially visible. These constraints, together with severe class imbalance, make automatic species classification especially difficult and, at the same time, especially valuable for accelerating tropical biodiversity monitoring.

cemede-redbioma-ct

New open dataset for tropical biodiversity monitoring with camera traps.

9,000+ processed videos

29,498 images generated after re-sampling

11,674 final images across train, validation, and test splits

26 species classes

Available on Hugging Face.

The proposed method follows two branches. In the first one, MegaDetector is used to detect animals and generate cropped regions of interest; those crops are then used to fine-tune three transformer classifiers: DeiT, Swin, and EfficientViT. In the second branch, MegaDetector itself is fine-tuned to classify species from the original images and their annotations. To avoid leakage, the split into training, validation, and test sets was performed at the video level.
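The video-level split can be illustrated with a minimal sketch. It assumes each extracted frame is stored as a (frame path, video id) pair; the function name, the pair layout, and the 70/15/15 ratios are hypothetical and are not taken from the study. The point it shows is that all frames of one video land in the same split, so near-duplicate frames cannot appear in both training and test data.

```python
import random
from collections import defaultdict


def split_by_video(frames, ratios=(0.7, 0.15, 0.15), seed=0):
    """Assign every frame of a video to the same split.

    frames: list of (frame_path, video_id) pairs (hypothetical layout).
    ratios: train/validation/test fractions of *videos* (assumed values).
    """
    # Group frame paths by the video they were extracted from.
    by_video = defaultdict(list)
    for frame_path, video_id in frames:
        by_video[video_id].append(frame_path)

    # Shuffle video ids, not frames, so a video is never divided.
    video_ids = sorted(by_video)
    random.Random(seed).shuffle(video_ids)

    n_train = int(len(video_ids) * ratios[0])
    n_val = int(len(video_ids) * ratios[1])
    groups = {
        "train": video_ids[:n_train],
        "validation": video_ids[n_train:n_train + n_val],
        "test": video_ids[n_train + n_val:],
    }
    # Expand each group of video ids back into its frame paths.
    return {
        name: [path for vid in vids for path in by_video[vid]]
        for name, vids in groups.items()
    }
```

Because the ratios apply to videos rather than frames, the resulting image counts per split only approximate the requested fractions when videos contribute different numbers of frames.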

The study also evaluates model robustness on interaction images containing multiple animals, a relevant scenario for ecological interpretation. Beyond classification, the pipeline includes an individual counting stage for these interaction images, increasing the ecological value of the results for monitoring and conservation.
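This summary does not detail how individuals are counted. One simple approach, sketched here as an assumption rather than as the authors' method, is to count the animal detections that exceed a confidence threshold. The sketch relies on the MegaDetector output format, in which each detection carries a "category" ("1" for animals), a "conf" score, and a "bbox" given as normalized [x_min, y_min, width, height]. The helper names and the 0.2 threshold are hypothetical. The same boxes can be converted to pixel coordinates to produce the crops used by the first branch.

```python
ANIMAL_CATEGORY = "1"  # MegaDetector category id for animals


def animal_detections(detections, min_conf=0.2):
    """Keep animal detections at or above min_conf (assumed threshold)."""
    return [
        d for d in detections
        if d["category"] == ANIMAL_CATEGORY and d["conf"] >= min_conf
    ]


def count_individuals(detections, min_conf=0.2):
    """Estimate the number of animals in one image by counting boxes."""
    return len(animal_detections(detections, min_conf))


def to_pixel_box(bbox, image_width, image_height):
    """Convert normalized [x_min, y_min, width, height] to pixel corners.

    Returns (left, top, right, bottom), the order expected by common
    image-cropping functions.
    """
    x, y, w, h = bbox
    left = round(x * image_width)
    top = round(y * image_height)
    right = round((x + w) * image_width)
    bottom = round((y + h) * image_height)
    return left, top, right, bottom
```

Counting boxes in this way treats overlapping detections of the same animal as separate individuals, so a real pipeline would likely need a more careful rule for crowded scenes.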

Multi-animal interaction image with MegaDetector annotation

The results show that DeiT achieved the best overall performance, with 82% accuracy, outperforming Swin (76.2%), EfficientViT (74.1%), and the fine-tuned MegaDetector used for direct classification (75.1%). On multi-animal images, DeiT also ranked first with 82.80% accuracy, followed by EfficientViT (78.50%), Swin (77.42%), and MegaDetector (75.56%). These results are competitive with other reported studies, especially given the adverse tropical conditions and the difficulty of the dataset.

The paper concludes that Vision Transformers are a strong option for automated biodiversity monitoring in Costa Rica, although major challenges remain, including class imbalance, the scarcity of samples for rare species, and the low quality of some of the imagery. Future work includes advanced data augmentation, semi-supervised learning, and the extraction of behavioral information directly from video.

F1-score comparison by species across models
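A per-species F1 comparison like the one in this figure rests on the standard definition F1 = 2TP / (2TP + FP + FN), computed separately for each class. The following is a minimal sketch of that computation, assuming true and predicted labels are available as two parallel lists; the function name and input layout are hypothetical and are not the study's evaluation code.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} using F1 = 2TP / (2TP + FP + FN) per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_label, predicted_label in zip(y_true, y_pred):
        if true_label == predicted_label:
            tp[true_label] += 1
        else:
            fp[predicted_label] += 1  # predicted class gains a false positive
            fn[true_label] += 1       # true class gains a false negative

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denominator = 2 * tp[label] + fp[label] + fn[label]
        # A class with no true or predicted samples gets 0.0 by convention.
        scores[label] = 2 * tp[label] / denominator if denominator else 0.0
    return scores
```

Reporting F1 per class is useful under severe class imbalance, because overall accuracy can stay high even when rare species are almost never recognized.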

More information is available on the cemede-redbioma-ct dataset page on Hugging Face.

References

Biarreta-Portillo, M., Mora-Cross, M., Morataya-Sandoval, P., Salinas-Acosta, A., Víquez-Mora, E., López-Venegas, M., Gomez-Solís, W. & Bautista-Solís, P. (2025). Deep Learning Pipelines for Biodiversity Monitoring: MegaDetector and Vision Transformer Approaches with Camera Traps in Costa Rica.
Salinas-Acosta, A., López-Venegas, M., Biarreta-Portillo, M., Morataya-Sandoval, P., Víquez-Mora, E., Mora-Cross, M., Gomez-Solís, W. & Bautista-Solís, P. (2025). redbioma/cemede-redbioma-ct. Hugging Face. https://huggingface.co/datasets/redbioma/cemede-redbioma-ct/.
Sánchez Brenes, R. J., Salinas Acosta, A. & López Venegas, M. F. (2025). Richness, abundance, and activity of wild mammals captured at a watering site on El Cornizuelo trail, Costa Rica. Revista Iberoamericana Ambiente & Sustentabilidad, 8, e474. https://doi.org/10.46380/rias.v8.e474.
Gadot, T., Istrate, Ș., Kim, H., Morris, D., Beery, S., Birch, T. & Ahumada, J. (2024). To crop or not to crop: Comparing whole-image and cropped classification on a large dataset of camera trap images. IET Computer Vision. https://doi.org/10.1049/cvi2.12318.
Kyathanahally, S. P., Hardeman, T., Reyes, M. et al. (2022). Ensembles of data-efficient vision transformers as a new paradigm for automated classification in ecology. Scientific Reports, 12, 18590. https://doi.org/10.1038/s41598-022-21910-0.