phyloforfun/VoucherVision · Apply for community grant: Academic project (gpu and storage)

Greetings!

My name is Will and I am a PhD candidate at the University of Michigan, where I study biodiversity informatics. For the past few months, I have been developing a software tool called VoucherVision, which blends OCR, NLP, and LLM techniques to transcribe and database the labels of museum specimens in natural history collections. These specimens hold immense value for scientific research, offering unparalleled insights into the biological diversity and evolutionary processes of our planet. They are critical for disciplines ranging from evolutionary biology and botany to agronomy and climate change studies, providing essential data that underpin advancements in understanding and conserving our natural world.

Vouchered specimens are akin to time capsules, preserving the essence of past ecosystems and biodiversity. They play a pivotal role in unraveling long-term ecological changes and shaping conservation strategies. However, despite their immense value, accessing this treasure trove of information is often hampered by traditional, manual transcription methods. These methods are not only labor-intensive but also create significant backlogs, leaving potentially transformative data locked away on shelves. Currently, a staff member must manually transcribe each label into a spreadsheet before the record can be made publicly available.

In an unprecedented collaboration, I have partnered with researchers at a growing list of institutions including the Smithsonian National Museum of Natural History, Oregon State University, University of Colorado Boulder, Botanical Research Institute of Texas, South African National Biodiversity Institute, Botanischer Garten Berlin, Freie Universität Berlin, Morton Arboretum, University of Chicago, Florida Museum, iDigBio, and the University of Cambridge.
Our collective goal is to liberate millions of preserved plant specimens from obscurity, making their valuable data accessible for research and conservation efforts.

We are reaching out for support through the Hugging Face Space community grant to enhance our ability to refine and expand VoucherVision. Our project stands at a critical juncture where the right tools and resources could dramatically accelerate our progress. Each institution has specific workflow requirements, so it is crucial that we have a centralized testing platform – Hugging Face Spaces gives easy access to the most current iteration of VoucherVision and facilitates rapid development.

The potential impact of VoucherVision within the Natural History community is immense. By making specimen data readily accessible, we not only advance scientific research but also foster a deeper understanding and appreciation of biodiversity. Our Hugging Face Space presents an unparalleled opportunity for us to bring this vision to life, offering an intuitive platform for our diverse team of researchers to test and improve our software.

We hope you will join us in this groundbreaking endeavor to catalog, preserve, and make accessible the invaluable data held within natural history museum collections. Thank you for considering our application and the opportunity to make a lasting impact on science and conservation. We recently published a short paper related to this topic, which you can find here:
https://bsapubs.onlinelibrary.wiley.com/doi/10.1002/ajb2.16256

The figure below outlines our workflow and appears in our publication.

Schematic of LLM-assisted label transcription for a batch of herbarium specimens. (A) An OCR algorithm identifies labels from a specimen image and generates unformatted text. (B) The unformatted OCR text forms part of the dynamically assembled prompt, which consists of several blocks. The first block (red) contains grounding instructions that define the LLM's task, including the desired response format (JSON object) and text arrangement per Darwin Core Archive standards. The second block (green) contains all the unformatted OCR text. The third block (blue) contains example JSON objects sourced through a semantic similarity search, providing the LLM with greater context, also called domain knowledge. The fourth block (gray) defines field-specific rules, such as date or GPS formatting, and ends with an empty JSON object. The empty JSON object increases the likelihood of receiving a JSON-formatted LLM response. The blocks are combined to form a single text prompt. (C) The prompt is submitted to the chosen LLM. A structured object parser wraps the prompt, guiding the LLM toward a valid JSON output. (D) If the LLM's response is a valid JSON object, it is converted to a spreadsheet row and appended to the project's spreadsheet. For invalid responses, recursive prompting directs the LLM to correct the invalid JSON. (E) After processing a batch of images, spreadsheet entries are manually edited using the VoucherVision Editor (https://github.com/Gene-Weaver/VoucherVisionEditor) before submission to the database of record.
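
For readers who prefer code to diagrams, steps (B) through (D) of the caption can be sketched roughly as follows. This is a minimal illustration and not VoucherVision's actual code: the function names `build_prompt` and `transcribe`, the `llm` callable, the retry limit, and the wording of the correction request are all assumptions made for this example. The field names `catalogNumber` and `country` are Darwin Core terms chosen only for illustration. The caption describes recursive prompting; the sketch uses a simple bounded loop to the same effect.

```python
import json
from typing import Callable, Iterable, Optional


def build_prompt(instructions: str, ocr_text: str, examples: list,
                 rules: str, empty_record: dict) -> str:
    """Combine the four blocks from step (B) into a single text prompt."""
    blocks = [
        instructions,                                     # block 1: grounding instructions
        "OCR text:\n" + ocr_text,                         # block 2: unformatted OCR text
        "Examples:\n" + "\n".join(json.dumps(e) for e in examples),  # block 3
        "Rules:\n" + rules + "\n" + json.dumps(empty_record),        # block 4, ends with empty JSON
    ]
    return "\n\n".join(blocks)


def transcribe(prompt: str, llm: Callable[[str], str],
               required_keys: Iterable[str],
               max_attempts: int = 3) -> Optional[dict]:
    """Steps (C) and (D): query the LLM and re-prompt when the reply is invalid.

    `llm` is a hypothetical stand-in for whichever model is chosen: it takes
    a prompt string and returns the raw reply string.
    """
    required = set(required_keys)
    current = prompt
    for _ in range(max_attempts):
        reply = llm(current)
        try:
            record = json.loads(reply)
        except json.JSONDecodeError:
            record = None
        # Accept only a JSON object that carries every expected field.
        if isinstance(record, dict) and required <= set(record):
            return record
        # Otherwise ask the model to correct its previous reply.
        current = (prompt + "\n\nYour previous reply was not a valid JSON "
                   "object with the required fields. Correct it:\n" + reply)
    return None  # give up after max_attempts; the label needs manual review
```

A returned record could then be appended as one spreadsheet row, for example with `csv.DictWriter`, before manual review in the editor described in step (E).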