Spaces: seanpedrickcase/document_redaction

seanpedrickcase committed 16 days ago

Commit 04d80a1 • 1 Parent(s): 542c252

Improved time taken reporting and readme

Files changed (4)

README.md +2 -6
app.py +4 -8
tools/file_conversion.py +0 -2
tools/file_redaction.py +2 -11

README.md CHANGED

@@ -12,15 +12,11 @@ license: agpl-3.0
 # Document redaction
 Redact personally identifiable information (PII) from documents (pdf, images), open text, or tabular data (xlsx/csv/parquet). Documents/images can be redacted using 'Quick' image analysis that works fine for typed text, but not handwriting/signatures. On the Redaction settings tab, choose 'Complex image analysis' OCR using AWS Textract (if you are using AWS) to redact these more complex elements (this service has a cost). Addtionally you can choose the method for PII identification. 'Local' gives quick, lower quality results, AWS Comprehend gives better results but has a cost.
-See the 'Redaction settings' tab to choose which pages to redact, the type of information to redact (e.g. people, places), or terms to exclude from redaction.
-You can also review suggested redactions on the 'Review redactions' tab using a point and click visual interface. Please see the [User Guide](https://github.com/seanpedrick-case/doc_redaction/blob/main/README.md) for a walkthrough on how to use this and all other features in the app.
+Review suggested redactions on the 'Review redactions' tab using a point and click visual interface. See the 'Redaction settings' tab to choose which pages to redact, the type of information to redact (e.g. people, places), or terms to exclude from redaction. Please see the [User Guide](https://github.com/seanpedrick-case/doc_redaction/blob/main/README.md) for a walkthrough on how to use this and all other features in the app. The app accepts a maximum file size of 100mb. Please consider giving feedback for the quality of the answers underneath the redact buttons when the option appears, this will help to improve the app in future.
 NOTE: In testing the app seems to find about 60% of personal information on a given (typed) page of text. It is essential that all outputs are checked **by a human** to ensure that all personal information has been removed.
-This app accepts a maximum file size of 100mb. Please consider giving feedback for the quality of the answers underneath the redact buttons when the option appears, this will help to improve the app.
 # USER GUIDE
 Please refer to these example files to follow this guide:

app.py CHANGED

@@ -126,20 +126,16 @@ with app:
     Redact personally identifiable information (PII) from documents (pdf, images), open text, or tabular data (xlsx/csv/parquet). Documents/images can be redacted using 'Quick' image analysis that works fine for typed text, but not handwriting/signatures. On the Redaction settings tab, choose 'Complex image analysis' OCR using AWS Textract (if you are using AWS) to redact these more complex elements (this service has a cost). Addtionally you can choose the method for PII identification. 'Local' gives quick, lower quality results, AWS Comprehend gives better results but has a cost.
-    See the 'Redaction settings' tab to choose which pages to redact, the type of information to redact (e.g. people, places), or terms to exclude from redaction.
-    You can also review suggested redactions on the 'Review redactions' tab using a point and click visual interface. Please see the [User Guide](https://github.com/seanpedrick-case/doc_redaction/blob/main/README.md) for a walkthrough on how to use this and all other features in the app.
-    NOTE: In testing the app seems to find about 60% of personal information on a given (typed) page of text. It is essential that all outputs are checked **by a human** to ensure that all personal information has been removed.
-    This app accepts a maximum file size of 100mb. Please consider giving feedback for the quality of the answers underneath the redact buttons when the option appears, this will help to improve the app.""")
+    Review suggested redactions on the 'Review redactions' tab using a point and click visual interface. See the 'Redaction settings' tab to choose which pages to redact, the type of information to redact (e.g. people, places), or terms to exclude from redaction. Please see the [User Guide](https://github.com/seanpedrick-case/doc_redaction/blob/main/README.md) for a walkthrough on how to use this and all other features in the app. The app accepts a maximum file size of 100mb. Please consider giving feedback for the quality of the answers underneath the redact buttons when the option appears, this will help to improve the app in future.
+    NOTE: In testing the app seems to find about 60% of personal information on a given (typed) page of text. It is essential that all outputs are checked **by a human** to ensure that all personal information has been removed.""")
     # PDF / IMAGES TAB
     with gr.Tab("PDFs/images"):
         with gr.Accordion("Redact document", open = True):
             in_doc_files = gr.File(label="Choose a document or image file (PDF, JPG, PNG)", file_count= "single", file_types=['.pdf', '.jpg', '.png', '.json'])
-            in_redaction_method = gr.Radio(label="Choose document redaction method. AWS Textract has a cost per page so please only use when needed.", value = default_ocr_val, choices=[text_ocr_option, tesseract_ocr_option, textract_option])
-            pii_identification_method_drop = gr.Radio(label = "Choose PII detection method", value = default_pii_detector, choices=[local_pii_detector, aws_pii_detector])
+            in_redaction_method = gr.Radio(label="Choose text extract method. AWS Textract has a cost per page.", value = default_ocr_val, choices=[text_ocr_option, tesseract_ocr_option, textract_option])
+            pii_identification_method_drop = gr.Radio(label = "Choose PII detection method. AWS Comprehend has a cost per 100 characters.", value = default_pii_detector, choices=[local_pii_detector, aws_pii_detector])
             gr.Markdown("""If you only want to redact certain pages, or certain entities (e.g. just email addresses), please go to the redaction settings tab.""")
             document_redact_btn = gr.Button("Redact document(s)", variant="primary")

tools/file_conversion.py CHANGED

@@ -113,8 +113,6 @@ def process_file(file_path):
         # Run your function for processing PDF files here
         img_object = convert_pdf_to_images(file_path)
-        print("img_object has length", len(img_object), "and contains", img_object)
     else:
         print(f"{file_path} is not an image or PDF file.")
         img_object = ['']

tools/file_redaction.py CHANGED

@@ -309,7 +309,7 @@ def choose_and_run_redactor(file_paths:List[str],
             latest_file_completed += 1
             current_loop_page = 999
-            if latest_file_completed != len(file_paths):
+            if latest_file_completed != len(file_paths_list):
                 print("Completed file number:", str(latest_file_completed), "there are more files to do")
             # Save file
@@ -384,15 +384,6 @@ def choose_and_run_redactor(file_paths:List[str],
             #     if isinstance(out_message, list):
             #         out_message.append(out_message_new)  # Ensure out_message is a list of strings
-            if latest_file_completed != len(file_paths):
-                print("Completed file number:", str(latest_file_completed), " there are more files to do")
-            # Make a combined message for the file
-            if isinstance(out_message, list):
-                combined_out_message = '\n'.join(out_message)  # Ensure out_message is a list of strings
-            else: combined_out_message = out_message
    # If textract requests made, write to logging file
     if all_request_metadata:
@@ -409,7 +400,7 @@ def choose_and_run_redactor(file_paths:List[str],
     if combined_out_message: out_message = combined_out_message
-    print("\nout_message at choose_and_run_redactor end is:", out_message)
+    #print("\nout_message at choose_and_run_redactor end is:", out_message)
     # Ensure no duplicated output files
     log_files_output_paths = list(set(log_files_output_paths))
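
Review note on tools/file_redaction.py: the commit does not say why the progress check now compares against `len(file_paths_list)` instead of `len(file_paths)`. One plausible reason, offered only as an assumption, is that `file_paths` can arrive as a single string, in which case `len()` counts characters rather than files. The sketch below illustrates that failure mode and the effect of the final `list(set(...))` dedupe, which does not preserve order. The helpers `normalise_file_paths`, `more_files_to_do` and `dedupe_keep_order` are hypothetical names and do not appear in the repository.

```python
from typing import List, Union


def normalise_file_paths(file_paths: Union[str, List[str]]) -> List[str]:
    """Return the input as a list of paths (hypothetical helper, not from the repo)."""
    if isinstance(file_paths, str):
        # A bare string is treated as one file, not as a sequence of characters.
        return [file_paths]
    return list(file_paths)


def more_files_to_do(latest_file_completed: int, file_paths_list: List[str]) -> bool:
    """Mirror the diff's check: True while the completed count differs from the file count."""
    return latest_file_completed != len(file_paths_list)


def dedupe_keep_order(paths: List[str]) -> List[str]:
    """Drop duplicates but keep first-seen order, unlike list(set(paths))."""
    return list(dict.fromkeys(paths))


if __name__ == "__main__":
    single = "report.pdf"
    # len() of the raw string counts characters, so a check against it misreports progress.
    print(len(single), len(normalise_file_paths(single)))
    print(more_files_to_do(1, normalise_file_paths(single)))
    print(dedupe_keep_order(["b.log", "a.log", "b.log"]))
```

If the order of the log file paths matters to downstream consumers, an order-preserving dedupe such as `dict.fromkeys` may be worth considering; if order is irrelevant, the existing `list(set(...))` is sufficient.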