
Gupshup

GupShup: Summarizing Open-Domain Code-Switched Conversations, EMNLP 2021

Paper: https://aclanthology.org/2021.emnlp-main.499.pdf

GitHub: https://github.com/midas-research/gupshup

Dataset

Please request the Gupshup data using this Google form.

The dataset is available for two tasks: Hinglish Dialogues to English Summarization (h2e) and English Dialogues to English Summarization (e2e). For each task, dialogues (conversations) are stored in files with the .source extension (for example, train.source), and summaries in files with the .target extension (for example, train.target). In the scripts, pass the .source file to the input_path argument and the .target file to the reference_path argument.
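The pairing of dialogues and summaries can be sketched in Python as follows. This is a minimal sketch that assumes line i of the .source file holds one dialogue and line i of the .target file holds its summary, which this README does not state explicitly. `read_pairs` is a hypothetical helper, not part of the repository.

```python
def _read_lines(path):
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_pairs(source_path, target_path):
    """Return (dialogue, summary) tuples from a .source/.target file pair.

    Assumption: the two files are line-aligned (one example per line).
    """
    sources = _read_lines(source_path)
    targets = _read_lines(target_path)
    if len(sources) != len(targets):
        raise ValueError(
            f"{len(sources)} dialogues but {len(targets)} summaries; "
            "the files do not appear to be aligned"
        )
    return list(zip(sources, targets))
```

Calling `read_pairs("data/h2e/test.source", "data/h2e/test.target")` before an evaluation run is a cheap way to catch a truncated or mismatched download.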

Models

All model weights are available on the Hugging Face model hub. You can either download the weights locally and pass their path to the model_name argument in the scripts, or pass one of the aliases listed below to model_name directly, in which case the scripts download the weights automatically.
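For readers who prefer to call a model from Python rather than through the scripts, a rough sketch using the transformers library is given below. It assumes the checkpoints load through the generic `AutoTokenizer` and `AutoModelForSeq2SeqLM` classes, that PyTorch is installed, and that default generation settings are acceptable; run_eval.py may use different settings, so outputs and scores need not match. `batched` and `summarize` are hypothetical helper names.

```python
def batched(items, bs):
    """Yield consecutive chunks of at most `bs` items from a list."""
    for start in range(0, len(items), bs):
        yield items[start:start + bs]


def summarize(dialogues, model_name="midas/gupshup_h2e_mbart", bs=8):
    """Generate one summary per dialogue with a model from the Hugging Face hub.

    Assumption: default generation settings, which may differ from run_eval.py.
    `model_name` may be a hub alias or a local directory containing the weights.
    """
    # Imported lazily so that `batched` works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    summaries = []
    for batch in batched(dialogues, bs):
        # Pad to the longest dialogue in the batch and truncate overlong inputs.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        output_ids = model.generate(**inputs)
        summaries.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return summaries
```

The first call downloads the weights from the hub unless `model_name` points to a local copy.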

Model aliases follow the pattern "gupshup_TASK_MODEL", where TASK is either h2e or e2e and MODEL is mbart, pegasus, etc., as listed below.

1. Hinglish Dialogues to English Summary (h2e)

2. English Dialogues to English Summary (e2e)
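The alias pattern can be expressed as a small helper. This is only an illustration: the `midas/` organization prefix is taken from the example commands in the Inference section, `model_alias` is a hypothetical name, and the function does not check that a given alias actually exists on the hub.

```python
TASKS = ("h2e", "e2e")  # the two tasks described in this README


def model_alias(task, model, org="midas"):
    """Build a hub alias of the form <org>/gupshup_<task>_<model>."""
    if task not in TASKS:
        raise ValueError(f"task must be one of {TASKS}, got {task!r}")
    return f"{org}/gupshup_{task}_{model}"
```

For instance, `model_alias("h2e", "mbart")` gives `midas/gupshup_h2e_mbart`, the alias used in the first example command.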

Inference

Using command line

  1. Clone this repo and create a Python virtual environment (https://docs.python.org/3/library/venv.html). Then install the required packages from inside the cloned directory:
git clone https://github.com/midas-research/gupshup.git
cd gupshup
pip install -r requirements.txt
  2. The run_eval script takes the following arguments.
  • model_name : Path to, or alias of, one of our models available on Hugging Face, as listed above.
  • input_path : Path to the source file containing the conversations to be summarized.
  • save_path : File path where the summaries generated by the model are saved.
  • reference_path : Path to the target file containing the reference summaries, used to calculate metrics.
  • score_path : File path where the scores are saved.
  • bs : Batch size.
  • device : CUDA devices to use.
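When evaluation is driven from Python, the arguments above can be assembled into a command list, as in the sketch below. The flag spellings follow the example commands in this section, except `--device`, whose spelling and accepted values are assumed from the argument name. `build_run_eval_command` is a hypothetical helper, not part of the repository.

```python
import sys


def build_run_eval_command(model_name, input_path, save_path,
                           reference_path, score_path, bs=8, device=None):
    """Return the run_eval.py invocation as a list suitable for subprocess.run."""
    cmd = [
        sys.executable, "run_eval.py",
        "--model_name", model_name,
        "--input_path", input_path,
        "--save_path", save_path,
        "--reference_path", reference_path,
        "--score_path", score_path,
        "--bs", str(bs),
    ]
    if device is not None:
        # Assumption: the device argument is exposed as --device.
        cmd += ["--device", str(device)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Passing a list instead of a shell string avoids quoting problems with paths that contain spaces.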

Please make sure you have downloaded the Gupshup dataset using the Google form above, and provide the correct paths to these files in the input_path and reference_path arguments. Alternatively, you can simply put test.source and test.target in the data/h2e/ (Hinglish to English) or data/e2e/ (English to English) folder. For example, to generate English summaries from Hinglish dialogues using the mbart model, run the following command:

python run_eval.py \
    --model_name midas/gupshup_h2e_mbart \
    --input_path  data/h2e/test.source \
    --save_path generated_summary.txt \
    --reference_path data/h2e/test.target \
    --score_path scores.txt \
    --bs 8

As another example, to generate English summaries from English dialogues using the Pegasus model, run:

python run_eval.py \
    --model_name midas/gupshup_e2e_pegasus \
    --input_path  data/e2e/test.source \
    --save_path generated_summary.txt \
    --reference_path data/e2e/test.target \
    --score_path scores.txt \
    --bs 8
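Before launching either command above, it can help to confirm that the test files are where the commands expect them. The sketch below assumes the data/<task>/ layout described earlier; `default_test_files` and `missing_files` are hypothetical helpers, not part of the repository.

```python
from pathlib import Path


def default_test_files(task, data_dir="data"):
    """Return the (source, target) test paths for a task, e.g. data/h2e/test.source."""
    folder = Path(data_dir) / task
    return folder / "test.source", folder / "test.target"


def missing_files(task, data_dir="data"):
    """List the expected test files for a task that are not present on disk."""
    return [str(path) for path in default_test_files(task, data_dir)
            if not path.is_file()]
```

An empty list from `missing_files("h2e")` means both files are in place. Any path it returns still needs to be copied over from the downloaded dataset.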

Please create an issue if you face any difficulties in replicating the results.

References

Please cite [1] if you find the resources in this repository useful.

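[1] Mehnaz, Laiba, Debanjan Mahata, Rakesh Gosangi, Uma Sushmitha Gunturi, Riya Jain, Gauri Gupta, Amardeep Kumar, Isabelle G. Lee, Anish Acharya, and Rajiv Shah. GupShup: Summarizing Open-Domain Code-Switched Conversations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6177-6192, 2021.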

@inproceedings{mehnaz2021gupshup,
  title={GupShup: Summarizing Open-Domain Code-Switched Conversations},
  author={Mehnaz, Laiba and Mahata, Debanjan and Gosangi, Rakesh and Gunturi, Uma Sushmitha and Jain, Riya and Gupta, Gauri and Kumar, Amardeep and Lee, Isabelle G and Acharya, Anish and Shah, Rajiv},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  pages={6177--6192},
  year={2021}
}