# Sparrow Data
## Description
This module implements the data structure for Sparrow ML model fine-tuning. A list of invoices is used to build a Hugging Face dataset.
## Install
1. Install Python dependencies
```
pip install -r requirements.txt
```
2. Install Poppler, required by pdf2image (macOS example)
```
brew install poppler
```
3. Install Mindee docTR OCR with its dependencies
```
pip install torch torchvision torchaudio
pip install python-doctr
```
## Usage
1. Run OCR on invoices with PDF conversion to JPG
```
python run_ocr.py
```
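Under the hood, the OCR step presumably converts each PDF page to an image with pdf2image and runs docTR's `ocr_predictor` on it. A minimal sketch of flattening the predictor's exported result into plain text is shown below; the nested pages/blocks/lines/words layout follows docTR's `export()` format, while the helper name and sample data are hypothetical:

```python
# Sketch: flatten a docTR export() dict into recognized text.
# The nested pages -> blocks -> lines -> words layout matches docTR's
# Document.export() output; the sample data below is made up.

def words_from_export(export: dict) -> list[str]:
    """Collect recognized word values in reading order."""
    words = []
    for page in export.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                for word in line.get("words", []):
                    words.append(word["value"])
    return words

sample = {
    "pages": [{
        "blocks": [{
            "lines": [{
                "words": [
                    {"value": "Invoice"},
                    {"value": "no:"},
                    {"value": "40378170"},
                ]
            }]
        }]
    }]
}

print(" ".join(words_from_export(sample)))  # Invoice no: 40378170
```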
2. Run data conversion to Sparrow format
```
python run_converter.py
```
Then run the Sparrow UI to annotate the documents and create key/value pairs.
3. Run the data preparation task for Donut model fine-tuning. This task creates the metadata and builds a Hugging Face dataset with train, validation, and test splits for Donut model fine-tuning
```
python run_donut.py
```
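Donut fine-tuning data is typically stored as one `metadata.jsonl` entry per image, where the annotation is wrapped as a JSON string under a `gt_parse` key. A minimal sketch of building such an entry follows; the helper name and the field names in the sample parse are hypothetical, not taken from this repo:

```python
import json

# Sketch: one metadata.jsonl entry in the shape Donut fine-tuning
# commonly expects - the key/value annotation is serialized as a JSON
# string under "gt_parse". Field names here are illustrative only.

def make_metadata_entry(file_name: str, gt_parse: dict) -> dict:
    return {
        "file_name": file_name,
        "ground_truth": json.dumps({"gt_parse": gt_parse}),
    }

entry = make_metadata_entry(
    "invoice_10.jpg",
    {"invoice_no": "40378170", "total_gross_worth": "$8,25"},
)
print(json.dumps(entry))
```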
4. Push the dataset to the Hugging Face Hub. You need a Hugging Face account and a Hugging Face Hub token. Read more: https://huggingface.co/docs/datasets/main/en/image_dataset
```
python run_donut_upload.py
```
5. Test the dataset by calling load_dataset and fetching the data from the Hugging Face Hub
```
python run_donut_test.py
```
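A smoke test along these lines could verify that all three splits made it to the Hub. This is a hedged sketch, not the actual `run_donut_test.py`: the repo id is a placeholder, and the import sits inside the function because the call needs the `datasets` library and network access:

```python
# Sketch: verify the pushed dataset exposes train/validation/test
# splits. The repo id is a placeholder for your own Hub dataset.

def check_dataset(repo_id: str = "your-account/sparrow-donut") -> dict:
    from datasets import load_dataset  # pip install datasets

    ds = load_dataset(repo_id)
    for split in ("train", "validation", "test"):
        assert split in ds, f"missing split: {split}"
    return {name: ds[name].num_rows for name in ds}
```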
## FastAPI Service
Set environment variables in **set_env_vars.sh**
1. Run
```
cd api
```
```
RUN_LOCALLY=true ./start.sh
```
2. FastAPI Swagger
```
http://127.0.0.1:8000/api/v1/sparrow-data/docs
```
**Run in Docker container**
1. Build Docker image
```
docker build --tag katanaml/sparrow-data .
```
2. Run Docker container
```
docker run -e RUN_LOCALLY=true -it --name sparrow-data -p 7860:7860 katanaml/sparrow-data:latest
```
## Endpoints
1. Info
```
curl -X 'GET' \
'https://katanaml-org-sparrow-data.hf.space/api-dataset/v1/sparrow-data/dataset_info' \
-H 'accept: application/json'
```
Replace URL with your own
2. Ground truth
```
curl -X 'GET' \
'https://katanaml-org-sparrow-data.hf.space/api-dataset/v1/sparrow-data/ground_truth' \
-H 'accept: application/json'
```
Replace URL with your own
3. OCR service
```
curl -X 'POST' \
'https://katanaml-org-sparrow-data.hf.space/api-ocr/v1/sparrow-data/ocr' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'file=' \
-F 'image_url=https://raw.githubusercontent.com/katanaml/sparrow/main/sparrow-data/docs/input/invoices/processed/images/invoice_10.jpg' \
-F 'post_processing=false' \
-F 'sparrow_key=your_key'
```
Replace URL with your own
4. OCR statistics
```
curl -X 'GET' \
'https://katanaml-org-sparrow-data.hf.space/api-ocr/v1/sparrow-data/statistics' \
-H 'accept: application/json'
```
Replace URL with your own
## Endpoints - ChatGPT Plugin
1. Get OCR content for receipt
```
curl -X 'GET' \
'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_by_id?receipt_id=34563&sparrow_key=your_key' \
-H 'accept: application/json'
```
Replace URL with your own
2. Post Receipt JSON content to DB
```
curl -X 'POST' \
'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/store_receipt_db' \
-H 'accept: application/json' \
-H 'Content-Type: application/x-www-form-urlencoded' \
-d 'chatgpt_user=user&receipt_id=12345&receipt_content=%7Breceipt%7D&sparrow_key=your_key'
```
Replace URL with your own
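The `%7B`/`%7D` sequences in the `-d` payload above are just the URL-encoded `{` and `}` around the receipt JSON placeholder. For reference, the same `application/x-www-form-urlencoded` body can be built in Python (the values mirror the curl example):

```python
from urllib.parse import urlencode

# Build the same form-urlencoded body as the curl call above;
# urlencode percent-encodes { and } as %7B and %7D.
body = urlencode({
    "chatgpt_user": "user",
    "receipt_id": "12345",
    "receipt_content": "{receipt}",
    "sparrow_key": "your_key",
})
print(body)
# chatgpt_user=user&receipt_id=12345&receipt_content=%7Breceipt%7D&sparrow_key=your_key
```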
3. Get receipt JSON from DB by ID
```
curl -X 'GET' \
'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_db_by_id?chatgpt_user=user&receipt_id=12345&sparrow_key=your_key' \
-H 'accept: application/json'
```
Replace URL with your own
4. Delete receipt JSON from DB by ID
```
curl -X 'DELETE' \
'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_db_by_id?chatgpt_user=user&receipt_id=13456&sparrow_key=your_key' \
-H 'accept: application/json'
```
Replace URL with your own
5. Get all IDs for receipts stored in DB
```
curl -X 'GET' \
'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_db_ids_by_user?chatgpt_user=user&sparrow_key=your_key' \
-H 'accept: application/json'
```
Replace URL with your own
6. Get all receipts content stored in DB
```
curl -X 'GET' \
'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_db_content_by_user?chatgpt_user=user&sparrow_key=your_key' \
-H 'accept: application/json'
```
Replace URL with your own
## CLI
Navigate to the `cli` folder and run `chmod +x sparrowdata`. Add it to your system path to make it executable globally.
1. OCR
```
./sparrowdata --api_url https://katanaml-org-sparrow-data.hf.space/api-ocr/v1/sparrow-data/ocr \
--file_path ../docs/models/donut/data/img/test/invoice_2.jpg \
--post_processing false \
--sparrow_key your_key
```
## Deploy to Hugging Face Spaces
1. Create a new Space - https://huggingface.co/spaces. Follow the instructions from the readme doc
2. Create a huggingface_key secret in the Space settings
3. In config.py, replace the huggingface_key variable with this line of code
```
huggingface_key: str = os.environ.get("huggingface_key")
```
4. Commit and push the code to the Space, following the readme instructions. The Docker container will be deployed automatically. Example:
```
https://huggingface.co/spaces/katanaml-org/sparrow-data
```
5. The Sparrow Data API will be accessible at a URL you can get from the Space info. Example:
```
https://katanaml-org-sparrow-data.hf.space/api/v1/sparrow-data/docs
```
## MongoDB connection
If post_processing is set to True, OCR results will be saved to MongoDB. You need a MongoDB Atlas account and a MongoDB Atlas token. Read more: https://docs.atlas.mongodb.com/configure-api-access/
1. Set the environment variable for the MongoDB Atlas connection before starting the FastAPI service
```
export MONGODB_URL="mongodb+srv://sparrow:<password>@<url>/?retryWrites=true&w=majority"
```
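The `<password>` and `<url>` parts of the connection string are account-specific placeholders. If the password contains characters such as `@` or `:`, it must be percent-encoded before being embedded in the URL. A small helper sketch (the function name is hypothetical, not from this repo):

```python
from urllib.parse import quote_plus

# Sketch: assemble the Atlas connection string from its parts.
# quote_plus matters when the password contains characters like
# '@' or ':' that would otherwise break the URL.

def build_mongodb_url(user: str, password: str, host: str) -> str:
    return (
        f"mongodb+srv://{user}:{quote_plus(password)}@{host}/"
        "?retryWrites=true&w=majority"
    )

print(build_mongodb_url("sparrow", "p@ss:word", "cluster0.example.net"))
# mongodb+srv://sparrow:p%40ss%3Aword@cluster0.example.net/?retryWrites=true&w=majority
```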
## Dataset info
- [Samples of electronic invoices](https://data.mendeley.com/datasets/tnj49gpmtz)
- [Receipts](https://www.kaggle.com/jenswalter/receipts)
- [SROIE](https://github.com/zzzDavid/ICDAR-2019-SROIE)
## Author
[Katana ML](https://katanaml.io), [Andrej Baranovskij](https://github.com/abaranovskis-redsamurai)
## License
Licensed under the Apache License, Version 2.0. Copyright 2020-2023 Katana ML, Andrej Baranovskij. [Copy of the license](https://github.com/katanaml/sparrow/blob/main/LICENSE).