Spaces: strickvl/redaction-detector

Alex Strick van Linschoten committed on May 5, 2022

Commit 64c717a (1 parent: ef4decc)

upload app

Files changed (8)

README.md +6 -5
app.py +107 -0
article.md +45 -0
packages.txt +1 -0
requirements.txt +10 -0
test1.jpg +0 -0
test1.pdf +0 -0
test2.pdf +0 -0

README.md CHANGED

@@ -1,13 +1,14 @@
 ---
 title: Redaction Detector
-emoji: 🔥
-colorFrom: pink
-colorTo: red
+emoji: 📄
+colorFrom: blue
+colorTo: yellow
 sdk: gradio
 sdk_version: 2.9.4
 app_file: app.py
-pinned: false
+pinned: true
 license: apache-2.0
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
+Check out the configuration reference at
+https://huggingface.co/docs/hub/spaces#reference

app.py ADDED

@@ -0,0 +1,107 @@
+import gradio as gr
+import skimage
+from fastai.learner import load_learner
+from fastai.vision.all import *
+from huggingface_hub import hf_hub_download
+import fitz
+import tempfile
+import os
+from fpdf import FPDF
+learn = load_learner(
+    hf_hub_download("strickvl/redaction-classifier-fastai", "model.pkl")
+)
+labels = learn.dls.vocab
+def predict(pdf, confidence, generate_file):
+    document = fitz.open(pdf.name)
+    results = []
+    images = []
+    tmp_dir = tempfile.mkdtemp()  # fresh directory per call, so page images from earlier runs are not reused
+    for page_num, page in enumerate(document, start=1):
+        image_pixmap = page.get_pixmap()
+        image = image_pixmap.tobytes()
+        _, _, probs = learn.predict(image)
+        results.append(
+            {labels[i]: float(probs[i]) for i in range(len(labels))}
+        )
+        if probs[0] > (confidence / 100):
+            redaction_count = len(images)
+            image_pixmap.save(os.path.join(tmp_dir, f"page-{page_num}.png"))
+            images.append(
+                [
+                    f"Redacted page #{redaction_count + 1} on page {page_num}",
+                    os.path.join(tmp_dir, f"page-{page_num}.png"),
+                ]
+            )
+    redacted_pages = [
+        str(page + 1)
+        for page in range(len(results))
+        if results[page]["redacted"] > (confidence / 100)
+    ]
+    report = os.path.join(tmp_dir, "redacted_pages.pdf")
+    if generate_file:
+        pdf = FPDF()
+        pdf.set_auto_page_break(0)
+        # Use the pages collected above, in document order; sorting filenames
+        # as strings would place page-10.png before page-2.png.
+        for _, image_path in images:
+            pdf.add_page()
+            pdf.image(image_path, w=190, h=280)
+        pdf.output(report, "F")
+    text_output = f"A total of {len(redacted_pages)} pages were redacted. \n\n The redacted page numbers were: {', '.join(redacted_pages)}."
+    if generate_file:
+        return text_output, images, report
+    else:
+        return text_output, images, None
+title = "Redaction Detector"
+description = "A classifier trained on publicly released redacted (and unredacted) FOIA documents, using [fastai](https://github.com/fastai/fastai)."
+with open("article.md") as f:
+    article = f.read()
+examples = [["test1.pdf", 80, False], ["test2.pdf", 80, False]]
+interpretation = "default"
+enable_queue = True
+theme = "grass"
+allow_flagging = "never"
+demo = gr.Interface(
+    fn=predict,
+    inputs=[
+        "file",
+        gr.inputs.Slider(
+            minimum=0,
+            maximum=100,
+            step=None,
+            default=80,
+            label="Confidence",
+            optional=False,
+        ),
+        "checkbox",
+    ],
+    outputs=[
+        gr.outputs.Textbox(label="Document Analysis"),
+        gr.outputs.Carousel(["text", "image"], label="Redacted pages"),
+        gr.outputs.File(label="Download redacted pages"),
+    ],
+    title=title,
+    description=description,
+    article=article,
+    theme=theme,
+    allow_flagging=allow_flagging,
+    examples=examples,
+    interpretation=interpretation,
+)
+demo.launch(
+    cache_examples=True,
+    enable_queue=enable_queue,
+)
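
Note on the page-selection logic in `predict`: the slider value is a percentage, and a page counts as redacted when its "redacted" probability exceeds `confidence / 100`. A minimal standalone sketch of that step follows. It is not part of the commit; the function name `redacted_page_numbers` and the "unredacted" label in the usage lines are assumptions made for illustration.

```python
from typing import Dict, List


def redacted_page_numbers(
    page_probs: List[Dict[str, float]],
    confidence: float,
    label: str = "redacted",
) -> List[int]:
    """Return 1-based page numbers whose `label` probability exceeds the threshold.

    `confidence` is a percentage between 0 and 100, like the slider in app.py.
    """
    threshold = confidence / 100
    return [
        page_num
        for page_num, probs in enumerate(page_probs, start=1)
        if probs[label] > threshold
    ]


# Hypothetical per-page probabilities for a three-page document.
example_probs = [
    {"redacted": 0.95, "unredacted": 0.05},
    {"redacted": 0.10, "unredacted": 0.90},
    {"redacted": 0.81, "unredacted": 0.19},
]
pages = redacted_page_numbers(example_probs, confidence=80)
summary = f"A total of {len(pages)} pages were redacted."
```

Keeping this step free of model and file handling makes it easy to check the threshold behaviour on its own, for example that raising the slider can only shrink the list of flagged pages.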

article.md ADDED

@@ -0,0 +1,45 @@
+I've been working through the first two lessons of
+[the fastai course](https://course.fast.ai/). For lesson one I trained a model
+to recognise my cat, Mr Blupus. For lesson two the emphasis is on getting those
+models out in the world as some kind of demo or application.
+[Gradio](https://gradio.app) and
+[Huggingface Spaces](https://huggingface.co/spaces) make it super easy to get a
+prototype of your model on the internet.
+This model has an accuracy of ~96% on the validation dataset.
+## The Dataset
+I downloaded a few thousand publicly-available FOIA documents from a government
+website. I split the PDFs up into individual `.jpg` files and then used
+[Prodigy](https://prodi.gy/) to annotate the data. (This process was described
+in
+[a blogpost written last year](https://mlops.systems/fastai/redactionmodel/computervision/datalabelling/2021/09/06/redaction-classification-chapter-2.html).)
+## Training the model
+I trained the model with fastai's flexible `vision_learner`, fine-tuning
+`resnet18`, which was both smaller than `resnet34` (no surprises there) and less
+liable to early overfitting. I trained the model for 10 epochs.
+## Further Reading
+This initial dataset spurred an ongoing interest in the domain and I've since
+been working on the problem of object detection, i.e. identifying exactly which
+parts of the image contain redactions.
+Some of the key blogs I've written about this project:
+- How to annotate data for an object detection problem with Prodigy
+  ([link](https://mlops.systems/redactionmodel/computervision/datalabelling/2021/11/29/prodigy-object-detection-training.html))
+- How to create synthetic images to supplement a small dataset
+  ([link](https://mlops.systems/redactionmodel/computervision/python/tools/2022/02/10/synthetic-image-data.html))
+- How to use error analysis and visual tools like FiftyOne to improve model
+  performance
+  ([link](https://mlops.systems/redactionmodel/computervision/tools/debugging/jupyter/2022/03/12/fiftyone-computervision.html))
+- Creating more synthetic data focused on the tasks my model finds hard
+  ([link](https://mlops.systems/tools/redactionmodel/computervision/2022/04/06/synthetic-data-results.html))
+- Data validation for object detection / computer vision (a three part series —
+  [part 1](https://mlops.systems/tools/redactionmodel/computervision/datavalidation/2022/04/19/data-validation-great-expectations-part-1.html),
+  [part 2](https://mlops.systems/tools/redactionmodel/computervision/datavalidation/2022/04/26/data-validation-great-expectations-part-2.html),
+  [part 3](https://mlops.systems/tools/redactionmodel/computervision/datavalidation/2022/04/28/data-validation-great-expectations-part-3.html))

packages.txt ADDED

@@ -0,0 +1 @@
+python3-opencv

requirements.txt ADDED

@@ -0,0 +1,10 @@
+--find-links https://download.openmmlab.com/mmcv/dist/cpu/torch1.10.0/index.html
+mmcv-full==1.3.17
+mmdet==2.17.0
+gradio==2.7.5
+icevision[all]==0.12.0
+fastai
+scikit-image
+pymupdf
+fpdf

test1.jpg ADDED

test1.pdf ADDED

Binary file (921 kB).

test2.pdf ADDED

Binary file (740 kB).