---
title: Commit Rewriting Visualization
sdk: gradio
sdk_version: 4.25.0
app_file: change_visualizer.py
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
## Description

This project is the main artifact of the "Research on evaluation for AI Commit Message Generation" research project.
## Structure (important components)

- **Configuration:** `config.py`
  - The Grazie API JWT token and the Hugging Face token must be stored as environment variables.
- **Visualization app:** `change_visualizer.py` -- a Gradio application that is currently deployed at https://huggingface.co/spaces/JetBrains-Research/commit-rewriting-visualization. It shows (see the loading sketch after this list):
  - the "golden" dataset of manually collected samples, downloaded on startup from https://huggingface.co/datasets/JetBrains-Research/commit-msg-rewriting;
  - the entire dataset, which also includes the synthetic samples, downloaded on startup from https://huggingface.co/datasets/JetBrains-Research/synthetic-commit-msg-rewriting;
  - some statistics for the dataset (and its parts), computed on startup.

  Note: if the datasets are updated, the app needs to be restarted to pick up the changes.
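For illustration, here is a minimal sketch of what the startup loading might look like. This is not the actual code of `change_visualizer.py`: the `HF_TOKEN` environment-variable name, the `train` split, and the Gradio layout are assumptions.

```python
# Minimal sketch of the startup loading; NOT the actual change_visualizer.py code.
# The HF_TOKEN variable name and the "train" split are assumptions.
import os

import gradio as gr
from datasets import load_dataset

HF_TOKEN = os.environ.get("HF_TOKEN")  # assumed environment variable name

# Both datasets are downloaded from the Hugging Face Hub on startup.
golden = load_dataset(
    "JetBrains-Research/commit-msg-rewriting", split="train", token=HF_TOKEN
)
full = load_dataset(
    "JetBrains-Research/synthetic-commit-msg-rewriting", split="train", token=HF_TOKEN
)


def show_sample(index: float) -> str:
    """Render one sample of the golden dataset as plain text."""
    return str(golden[int(index) % len(golden)])


demo = gr.Interface(
    fn=show_sample,
    inputs=gr.Number(value=0, label="Sample index"),
    outputs=gr.Textbox(label="Sample", lines=10),
    title="Commit Rewriting Visualization (sketch)",
)

if __name__ == "__main__":
    demo.launch()
```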
## Files

### Data processing pipeline

Note: the dataset and file names can be changed in the configuration file.

- Run the whole pipeline by running `run_pipeline.py` (a sketch of the step order is shown after this list).
- All intermediate results are stored as files defined in the configuration.
- The intermediate steps can also be run separately by running the corresponding files from `generation_steps`; each step then takes its input from the previous step's artifact.
  - **Generate the synthetic samples**
    - Files: `generation_steps/synthetic_end_to_start.py` and `generation_steps/synthetic_start_to_end.py`.
    - The first generation step (end to start) downloads the JetBrains-Research/commit-msg-rewriting and JetBrains-Research/lca-commit-message-generation datasets from the Hugging Face Hub.
  - **Compute metrics**
    - File: `generation_steps/metrics_analysis.py`.
    - Includes the functions for all metrics.
    - Downloads the JetBrains-Research/lca-commit-message-generation Hugging Face dataset.
- The resulting artifact (the dataset with golden and synthetic samples, attached reference messages, and computed metrics) is saved to `output/synthetic.csv`. It should be uploaded to https://huggingface.co/datasets/JetBrains-Research/synthetic-commit-msg-rewriting manually.
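As a rough illustration of the step order described above, `run_pipeline.py` could be structured like the sketch below. The `main()` entry points are assumptions about the `generation_steps` modules, not their actual API.

```python
# Hypothetical sketch of the pipeline orchestration; the main() entry points
# are assumed, not the real API of the generation_steps modules.
from generation_steps import (
    metrics_analysis,
    synthetic_end_to_start,
    synthetic_start_to_end,
)


def run_pipeline() -> None:
    # Step 1: generate synthetic samples going from the end message back to the start;
    # this step downloads the commit-msg-rewriting and lca-commit-message-generation datasets.
    synthetic_end_to_start.main()

    # Step 2: generate synthetic samples going from the start message to the end,
    # reading the previous step's artifact as defined in config.py.
    synthetic_start_to_end.main()

    # Step 3: compute all metrics and write the final output/synthetic.csv.
    metrics_analysis.main()


if __name__ == "__main__":
    run_pipeline()
```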
### Data analysis

- `analysis_util.py` -- functions used for data analysis, e.g., computing correlations (an illustrative sketch is shown after this list).
- `analysis.ipynb` -- computes the correlations and builds the resulting tables.
- `chart_processing.ipynb` -- a Jupyter notebook that draws the charts used in the presentation/thesis.
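The snippet below is only an illustrative sketch of the kind of correlation computation done in the analysis, assuming the `output/synthetic.csv` artifact as input; the column selection and the choice of Spearman correlation are assumptions, not the actual `analysis_util.py` API.

```python
# Illustrative sketch only; not the actual analysis_util.py implementation.
# Assumes the pipeline artifact output/synthetic.csv with numeric metric columns.
import pandas as pd


def metric_correlations(csv_path: str = "output/synthetic.csv") -> pd.DataFrame:
    """Return the pairwise Spearman correlation matrix of the numeric metric columns."""
    df = pd.read_csv(csv_path)
    numeric = df.select_dtypes(include="number")
    return numeric.corr(method="spearman")


if __name__ == "__main__":
    print(metric_correlations().round(2))
```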