
Project Information

The project name is LibRAG (Retrieval Augmented Generation)
https://github.com/BU-Spark/ml-bpl-rag/tree/main
Google Drive
This project added natural language querying to the Digital Commonwealth project.
Client: Boston Public Library
Contact: Eben English
Class: DS549

Dataset Information

Our data is contained on the SCC at /projectnb/sparkgrp/ml-bpl-rag-data
- /vectorstore/final_embeddings/metadata_index - faiss index for the metadata
- /vectorstore/final_embeddings/fulltext_index - faiss index for the OCR text
- /full_data/bpl_data.json - metadata
- /full_data/clean_ft.json - fulltext
We did not have formal datasets; instead, we pulled records from the Digital Commonwealth API and created embeddings from them. No data dictionary is needed beyond that of the Digital Commonwealth API.
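As a convenience for later students, the file locations listed above can be gathered in one place. The following is a minimal sketch: the root path and file names come from this document, while the dictionary labels and the helper names `dataset_paths` and `load_json` are hypothetical. Loading the FAISS indexes is not shown, because this document does not say how they were serialized.

```python
import json
from pathlib import Path

# Root directory on the SCC, as stated in this document.
DATA_ROOT = Path("/projectnb/sparkgrp/ml-bpl-rag-data")


def dataset_paths(root: Path = DATA_ROOT) -> dict[str, Path]:
    """Map a short label (hypothetical names) to each artifact listed above."""
    embeddings = root / "vectorstore" / "final_embeddings"
    return {
        "metadata_index": embeddings / "metadata_index",  # FAISS index, metadata
        "fulltext_index": embeddings / "fulltext_index",  # FAISS index, OCR text
        "metadata_json": root / "full_data" / "bpl_data.json",
        "fulltext_json": root / "full_data" / "clean_ft.json",
    }


def load_json(path: Path):
    """Parse one JSON file; its top-level structure is not documented here."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Passing a different `root` lets the same code run against a local copy of the data, for example `dataset_paths(Path("./ml-bpl-rag-data"))`.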
What keywords or tags would you attach to the data set?
- Domain(s) of Application: Natural Language Processing, Library Science
- Civic tech

The following questions pertain to the datasets you used in your project.
Motivation

We needed to create embeddings of the Digital Commonwealth's data in order to perform retrieval

Composition

Each entry in the Digital Commonwealth API represents one object in their repository; objects vary in format.
There were ~1.3 million total objects when we last checked, about 147,000 of which (roughly 11%) contained full text from OCR'd documents.
Our data was a comprehensive snapshot at the time of collection; the API continues to be updated, so the snapshot may no longer match it.
Each field returned by the API represents a metadata classification.
Data is publicly accessible and non-confidential

Collection Process

We collected data from an API endpoint.
No sampling was performed
This data was collected in October 2024
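Collecting every object without sampling amounts to walking the endpoint page by page until nothing more comes back. The sketch below illustrates that loop under stated assumptions: the query parameter names `page` and `per_page`, the example URL, and the helpers `build_page_url` and `collect_all` are hypothetical, not the project's actual ingestion code (see WRITEUP.md for that). The HTTP call is injected as a function so the loop can be exercised without network access.

```python
from typing import Callable, Iterator
from urllib.parse import urlencode


def build_page_url(base_url: str, page: int, per_page: int = 100) -> str:
    """Build the URL for one page; `page`/`per_page` are assumed parameter names."""
    return f"{base_url}?{urlencode({'page': page, 'per_page': per_page})}"


def collect_all(
    fetch_page: Callable[[int], list[dict]],
    max_pages: int = 100_000,
) -> Iterator[dict]:
    """Yield every record from pages 1, 2, ... and stop at the first empty page."""
    for page in range(1, max_pages + 1):
        records = fetch_page(page)
        if not records:
            break
        yield from records


# Usage with a fake fetcher standing in for the real HTTP request.
fake_pages = {1: [{"id": "a"}, {"id": "b"}], 2: [{"id": "c"}]}
records = list(collect_all(lambda page: fake_pages.get(page, [])))
```

The `max_pages` cap is a safety limit so that a misbehaving endpoint cannot keep the loop running indefinitely.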

Preprocessing/cleaning/labeling

Very limited character correction was performed on the fulltext data.
No transformations were applied outside of embedding.
The raw data is saved in ml-bpl-rag-data/full_data/ as bpl_data.json (metadata) and clean_ft.json (fulltext).
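This document does not record which character corrections were applied, so the following is purely illustrative of what a light cleanup pass over OCR text can look like. The function name `light_clean` and the three specific rules are assumptions, not the project's actual preprocessing.

```python
import re
import unicodedata


def light_clean(text: str) -> str:
    """Illustrative light cleanup for OCR text (not the project's actual rules)."""
    # Compose decomposed characters, e.g. "e" + combining accent -> "é".
    text = unicodedata.normalize("NFC", text)
    # Drop control characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Rules of this kind change characters only and leave the wording of the OCR text untouched.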

Uses

Embedding for retrieval
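To make "embedding for retrieval" concrete: each object is stored as a vector, a query is embedded the same way, and the nearest vectors are returned. The project stores its vectors in FAISS indexes; the sketch below is a pure-Python brute-force stand-in that does not use FAISS. The toy two-dimensional vectors, the document ids, and the names `cosine` and `top_k` are made up for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 3):
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = sorted(
        ((cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()),
        reverse=True,
    )
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Toy example: three "documents" embedded in two dimensions.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
hits = top_k([1.0, 0.0], docs, k=2)
```

Brute-force scoring like this compares the query against every stored vector, which is why a dedicated index is the practical choice for a collection of this size.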

Distribution

This data is free for subsequent students of our project to access and use.

Maintenance

There is currently no system in place for cleanly updating the data, though in our instructions within WRITEUP.md we include a way to ingest your own data from the API and embed it.