---
license: apache-2.0
datasets:
- yuan-tian/chartgpt-dataset
language:
- en
metrics:
- rouge
pipeline_tag: text2text-generation
base_model:
- google/flan-t5-xl
new_version: yuan-tian/chartgpt-llama3
---

# Model Card for ChartGPT

## Model Details

### Model Description

This model is used to generate charts from natural language. For more information, please refer to the paper.

* **Model type:** Language model
* **Language(s) (NLP):** English
* **License:** Apache 2.0
* **Finetuned from model:** [FLAN-T5-XL](https://huggingface.co/google/flan-t5-xl)
* **Research paper:** [ChartGPT: Leveraging LLMs to Generate Charts from Abstract Natural Language](https://ieeexplore.ieee.org/document/10443572)

### Model Input Format
The model input at step `x` has the following format, where `<...>` serves as a separation token.

```
{table name}
{column names}
{column types}
{data row 1}
{data row 2}
{NL utterance}
{Step 1 prompt}
{Answer 1}
...
{Step x-1 prompt}
{Answer x-1}
{Step x prompt}
```

The model should then output the answer for step `x`. The prompts for steps 1-6 are as follows:

```
Step 1. Select columns:
Step 2. Add filter:
Step 3. Add aggregations:
Step 4. Select chart type:
Step 5. Choose encoding:
Step 6. Add sort:
```
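As a rough illustration of the format above, the input for a given step can be assembled from the table metadata, the utterance, and the answers to earlier steps. The helper name `build_step_input` and its parameters are hypothetical and not part of this repository. The `sep` argument is a placeholder for the separation token, which this card does not spell out, so the single-space default is an assumption. Joining column names, types, and row values with commas follows the movie example in this card.

```python
from typing import Sequence

# Step prompts copied from the list above.
STEP_PROMPTS = [
    "Step 1. Select columns:",
    "Step 2. Add filter:",
    "Step 3. Add aggregations:",
    "Step 4. Select chart type:",
    "Step 5. Choose encoding:",
    "Step 6. Add sort:",
]


def build_step_input(
    table_name: str,
    column_names: Sequence[str],
    column_types: Sequence[str],
    rows: Sequence[Sequence[object]],
    utterance: str,
    previous_answers: Sequence[str],
    sep: str = " ",  # assumption: stand-in for the card's separation token
) -> str:
    """Build the input for step len(previous_answers) + 1 (hypothetical helper)."""
    step = len(previous_answers) + 1
    if not 1 <= step <= len(STEP_PROMPTS):
        raise ValueError("expected between 0 and 5 previous answers")

    parts = [table_name, ",".join(column_names), ",".join(column_types)]
    # The format lists two data rows, so only the first two are used.
    parts.extend(",".join(str(value) for value in row) for row in rows[:2])
    parts.append(utterance)
    # Interleave each earlier step prompt with its answer.
    for prompt, answer in zip(STEP_PROMPTS, previous_answers):
        parts.extend([prompt, answer])
    # Finish with the prompt of the step the model should answer next.
    parts.append(STEP_PROMPTS[step - 1])
    return sep.join(parts)
```

Calling the helper with an empty `previous_answers` list yields a step 1 input; passing the step 1 answer yields a step 2 input, and so on.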
## How to Get Started with the Model

### Running the Model on a GPU

The example below uses a movie dataset with the utterance "What kinds of movies are the most popular?". The model should return the answer to step 1 (select columns). You can use this code to check that the model runs successfully.
```python
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
)

# Load the tokenizer and model; device_map="auto" places the weights on the available device(s).
tokenizer = AutoTokenizer.from_pretrained("yuan-tian/chartgpt")
model = AutoModelForSeq2SeqLM.from_pretrained("yuan-tian/chartgpt", device_map="auto")

# Table name, column names, column types, two data rows, the utterance, and the step 1 prompt.
input_text = (
    "movies "
    "Title,Worldwide_Gross,Production_Budget,Release_Year,Content_Rating,Running_Time,Major_Genre,Creative_Type,Rotten_Tomatoes_Rating,IMDB_Rating "
    "nominal,quantitative,quantitative,temporal,nominal,quantitative,nominal,nominal,quantitative,quantitative "
    "From Dusk Till Dawn,25728961,20000000,1996,R,107,Horror,Fantasy,63,7.1 "
    "Broken Arrow,148345997,65000000,1996,R,108,Action,Contemporary Fiction,55,5.8 "
    "What kinds of movies are the most popular? "
    "Step 1. Select the columns:"
)

# Move the inputs to the model's device rather than hard-coding "cuda".
inputs = tokenizer(input_text, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
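The example above answers only step 1. Since each later step expects the earlier prompts and answers in its input, all six steps can be chained in a loop. The sketch below is one possible way to do that, not the authors' pipeline: `run_all_steps` and `make_generate` are hypothetical names, `sep=" "` is an assumed stand-in for the separation token, and `max_new_tokens=128` is an arbitrary choice. The prompts for steps 2-6 are copied from the Model Input Format section, and the step 1 prompt is taken from whatever input text you pass in.

```python
from typing import Callable, List

# Prompts for steps 2-6, copied from the Model Input Format section.
LATER_PROMPTS = [
    "Step 2. Add filter:",
    "Step 3. Add aggregations:",
    "Step 4. Select chart type:",
    "Step 5. Choose encoding:",
    "Step 6. Add sort:",
]


def run_all_steps(
    step1_input: str,
    generate: Callable[[str], str],
    sep: str = " ",  # assumption: stand-in for the card's separation token
) -> List[str]:
    """Return the six step answers, feeding each answer back into the next input."""
    answers: List[str] = []
    context = step1_input  # must already end with the step 1 prompt
    for step_index in range(len(LATER_PROMPTS) + 1):
        answer = generate(context).strip()
        answers.append(answer)
        if step_index < len(LATER_PROMPTS):
            # Append the answer just produced and the next step's prompt.
            context = sep.join([context, answer, LATER_PROMPTS[step_index]])
    return answers


def make_generate(tokenizer, model, max_new_tokens: int = 128) -> Callable[[str], str]:
    """Wrap an already loaded tokenizer and model as a text-to-text callable."""

    def generate(text: str) -> str:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        return tokenizer.decode(outputs[0], skip_special_tokens=True)

    return generate
```

With the `tokenizer`, `model`, and `input_text` from the example above, `run_all_steps(input_text, make_generate(tokenizer, model))` would return a list of six answer strings, one per step.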
## Training Details

### Training Data

This model is fine-tuned from [FLAN-T5-XL](https://huggingface.co/google/flan-t5-xl) on the [chartgpt-dataset](https://huggingface.co/datasets/yuan-tian/chartgpt-dataset).

### Training Procedure

The preprocessing and training procedure will be documented in a future update.

## Citation

**BibTeX:**

```
@article{tian2024chartgpt,
  title={ChartGPT: Leveraging LLMs to Generate Charts from Abstract Natural Language},
  author={Tian, Yuan and Cui, Weiwei and Deng, Dazhen and Yi, Xinjing and Yang, Yurun and Zhang, Haidong and Wu, Yingcai},
  journal={IEEE Transactions on Visualization and Computer Graphics},
  year={2024},
  pages={1-15},
  doi={10.1109/TVCG.2024.3368621}
}
```