---
title: Commit Rewriting Visualization
sdk: gradio
sdk_version: 4.25.0
app_file: change_visualizer.py
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Description

This project is one of the main artifacts of the "Research on evaluation for AI Commit Message Generation" research project.

# Structure (important components)

- ### Configuration: [config.py](config.py)
    - The Grazie API JWT token and the Hugging Face token must be stored as environment variables.
- ### Visualization app -- a Gradio application that is currently deployed
  at https://huggingface.co/spaces/JetBrains-Research/commit-rewriting-visualization.
    - Shows
        - The "golden" dataset of manually collected samples; the dataset is downloaded on startup
          from https://huggingface.co/datasets/JetBrains-Research/commit-msg-rewriting
        - The entire dataset that includes the synthetic samples; the dataset is downloaded on startup
          from https://huggingface.co/datasets/JetBrains-Research/synthetic-commit-msg-rewriting
        - Some statistics collected for the dataset (and its parts); computed on startup

      _Note: after the datasets are updated, the app must be restarted to show the changes._
    - Files
        - [change_visualizer.py](change_visualizer.py)
- ### Data processing pipeline (_note: dataset and file names can be changed in the configuration file_)
    - Run the whole pipeline with [run_pipeline.py](run_pipeline.py)
        - All intermediate results are stored in the files defined in the configuration
    - Intermediate steps (each can be run separately by running the corresponding file
      from [generation_steps](generation_steps)). The input is then taken from the previous step's artifact.
    - Generate the synthetic samples
        - Files [generation_steps/synthetic_end_to_start.py](generation_steps/synthetic_end_to_start.py)
          and [generation_steps/synthetic_start_to_end.py](generation_steps/synthetic_start_to_end.py)
        - The first generation step (end to start) downloads the `JetBrains-Research/commit-msg-rewriting`
          and `JetBrains-Research/lca-commit-message-generation` datasets from
          Hugging Face.
    - Compute metrics
        - File [generation_steps/metrics_analysis.py](generation_steps/metrics_analysis.py)
        - Includes the functions for all metrics
        - Downloads the `JetBrains-Research/lca-commit-message-generation` dataset from Hugging Face.
    - The resulting artifact (dataset with golden and synthetic samples, attached reference messages and computed
      metrics) is saved to the file [output/synthetic.csv](output/synthetic.csv). It should be uploaded
      to https://huggingface.co/datasets/JetBrains-Research/synthetic-commit-msg-rewriting **manually**.
- ### Data analysis
    - [analysis_util.py](analysis_util.py) -- helper functions used for data analysis, e.g., computing correlations.
    - [analysis.ipynb](analysis.ipynb) -- computes the correlations and produces the resulting tables.
    - [chart_processing.ipynb](chart_processing.ipynb) -- a Jupyter notebook that draws the charts used in the
      presentation/thesis.
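
The manual upload of `output/synthetic.csv` described in the data processing pipeline could be scripted roughly as follows. This is only a sketch, not part of the project: the environment variable name `HF_TOKEN`, the helper names, and the choice to store the file in the dataset repository as `synthetic.csv` are all assumptions, so check [config.py](config.py) and the dataset repository for the names actually used. It relies on `HfApi.upload_file` from the `huggingface_hub` package.

```python
import os
from pathlib import Path

# Assumption: the Hugging Face token is exposed under this variable name;
# config.py may use a different one.
HF_TOKEN_ENV_VAR = "HF_TOKEN"

DATASET_REPO_ID = "JetBrains-Research/synthetic-commit-msg-rewriting"
ARTIFACT_PATH = Path("output/synthetic.csv")


def build_upload_kwargs(artifact_path=ARTIFACT_PATH, environ=None):
    """Collect the arguments for HfApi.upload_file without touching the network."""
    environ = os.environ if environ is None else environ
    token = environ.get(HF_TOKEN_ENV_VAR)
    if not token:
        raise RuntimeError(f"Set the {HF_TOKEN_ENV_VAR} environment variable first")
    return {
        "path_or_fileobj": str(artifact_path),
        # Assumption: the file keeps its local name inside the dataset repository.
        "path_in_repo": Path(artifact_path).name,
        "repo_id": DATASET_REPO_ID,
        "repo_type": "dataset",
        "token": token,
    }


def upload_artifact(artifact_path=ARTIFACT_PATH):
    """Upload the pipeline artifact to the dataset repository."""
    # Imported lazily so build_upload_kwargs works without huggingface_hub installed.
    from huggingface_hub import HfApi

    if not Path(artifact_path).is_file():
        raise FileNotFoundError(f"Run the pipeline first: {artifact_path} is missing")
    return HfApi().upload_file(**build_upload_kwargs(artifact_path))
```

Calling `upload_artifact()` after the pipeline has finished would push the file to the dataset repository. As with any dataset update, the visualization app would still need a restart to pick up the new data.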