Project Information
- The project name is LibRAG (Retrieval Augmented Generation)
- https://github.com/BU-Spark/ml-bpl-rag/tree/main
- Google Drive
- This project added natural language querying to the Digital Commonwealth project.
- Client: Boston Public Library
- Contact: Eben English
- Class: DS549
Dataset Information
- Our data is contained on the SCC at /projectnb/sparkgrp/ml-bpl-rag-data
- /vectorstore/final_embeddings/metadata_index - faiss index for the metadata
- /vectorstore/final_embeddings/fulltext_index - faiss index for the OCR text
- /full_data/bpl_data.json - metadata
- /full_data/clean_ft.json - fulltext
- We did not have formal datasets. Instead, we used the Digital Commonwealth API and created embeddings from its records. No data dictionary is needed beyond the Digital Commonwealth API documentation.
- What keywords or tags would you attach to the data set?
- Domain(s) of Application: Natural Language Processing, Library Science
- Civic tech
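The file layout listed above can be wrapped in a small helper so that scripts do not hard-code paths. This is a minimal sketch, not the project's own loader: the dictionary keys and the function names `dataset_path`, `missing_artifacts`, and `load_json` are hypothetical, and it assumes the JSON files can be parsed in one pass in memory, which may not hold for the full export.

```python
import json
from pathlib import Path

# Root directory on the SCC, as stated in this document.
DATA_ROOT = Path("/projectnb/sparkgrp/ml-bpl-rag-data")

# Relative locations taken from the list above; the keys are hypothetical labels.
DATASET_FILES = {
    "metadata_index": "vectorstore/final_embeddings/metadata_index",
    "fulltext_index": "vectorstore/final_embeddings/fulltext_index",
    "metadata": "full_data/bpl_data.json",
    "fulltext": "full_data/clean_ft.json",
}


def dataset_path(name, root=DATA_ROOT):
    """Return the full path for one of the named dataset artifacts."""
    return Path(root) / DATASET_FILES[name]


def missing_artifacts(root=DATA_ROOT):
    """List the artifact names whose paths do not exist under root."""
    return [name for name in DATASET_FILES if not dataset_path(name, root).exists()]


def load_json(name, root=DATA_ROOT):
    """Parse one of the JSON files (assumes it fits in memory)."""
    with open(dataset_path(name, root), encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `missing_artifacts()` first is a cheap way to confirm that the SCC project directory is mounted before attempting to load anything.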
The following questions pertain to the datasets you used in your project.
Motivation
- We needed to create embeddings of the Digital Commonwealth's data in order to perform retrieval
Composition
- Each entry in the Digital Commonwealth API represents an object in its repository; the objects vary in format.
- There were ~1.3 million total objects when we last checked, about 147,000 of which contained full text from OCR'd documents.
- Our data is a comprehensive snapshot taken at collection time; the API continues to be updated.
- Each field returned by the API represents a metadata classification.
- Data is publicly accessible and non-confidential
Collection Process
- We collected data from an API endpoint.
- No sampling was performed
- This data was collected in October 2024
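Collecting every record without sampling usually means walking a paginated endpoint until it runs out of results. The sketch below illustrates that pattern only. It does not reproduce the project's ingestion script: `collect_all` and the `fetch_page` callback are hypothetical names, and the assumption that an empty page marks the end of the data should be checked against the Digital Commonwealth API's actual pagination behavior.

```python
from typing import Callable, Iterator, List


def collect_all(
    fetch_page: Callable[[int], List[dict]], start_page: int = 1
) -> Iterator[dict]:
    """Yield every record by requesting pages until an empty page comes back."""
    page = start_page
    while True:
        records = fetch_page(page)
        if not records:
            # Assumption: an empty page means there is nothing left to fetch.
            break
        yield from records
        page += 1


# Usage with an in-memory stand-in for the real HTTP call.
fake_pages = {1: [{"id": "a"}, {"id": "b"}], 2: [{"id": "c"}]}
all_records = list(collect_all(lambda page: fake_pages.get(page, [])))
```

Passing the HTTP call in as a function keeps the loop testable offline and leaves the choice of client library and rate limiting to the caller.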
Preprocessing/cleaning/labeling
- Very limited character correction was performed on the fulltext data.
- No transformations were applied outside of embedding.
- The raw data is saved in ml-bpl-rag-data/full_data/ as bpl_data.json (metadata) and clean_ft.json (fulltext).
Uses
- Embedding for retrieval
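To illustrate what embedding-based retrieval means here, the sketch below ranks a handful of toy vectors against a query vector. It is a brute-force stand-in for the FAISS indexes named in this document, not the project's retrieval code: the function names, the two-dimensional vectors, and the choice of cosine similarity as the scoring measure are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, document_vectors, k=3):
    """Return the ids of the k documents most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in document_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings; real ones would come from an embedding model.
toy_docs = {"map": [1.0, 0.0], "letter": [0.0, 1.0], "photo": [0.7, 0.7]}
best = top_k([1.0, 0.1], toy_docs, k=2)
```

A vector index such as FAISS serves the same purpose as `top_k` but avoids scoring every document, which matters at the scale of roughly 1.3 million objects.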
Distribution
- This data is free for subsequent students of our project to access and use.
Maintenance
- There is currently no system in place for cleanly updating the data, though the instructions in WRITEUP.md include a way to ingest your own data from the API and embed it.