LT3

LT3 is a novel Conditional Transformer designed to generate synthetic medical instructions as an alternative to real medical data, addressing data privacy restrictions and scarcity issues. It has demonstrated better generation quality and diversity than Large Language Models (LLMs), and its synthetic data can train an NER model to performance comparable to that achieved with real data. In addition, our research proposes a new Beam Search Decoding algorithm (B2SD), which outperformed state-of-the-art methods on our task.

This work was presented at NeurIPS 2023's Workshop on Synthetic Data Generation with Generative AI.

Our pre-print can be found here: https://arxiv.org/abs/2310.19727.

Authors: Samuel Belkadi, Nicolo Micheletti, Lifeng Han, Warren Del-Pinto, Goran Nenadic.

Usage

To generate synthetic data, follow the instructions in our GitHub repository: https://github.com/SamySam0/LT3

Evaluation results

Lexical Similarity Evaluation against References

The results below show that LT3’s generations are the closest match to the reference samples. We used multi-reference evaluation to consolidate our results. Higher scores are better.

Models	BLEU	ROUGE-1	ROUGE-2	ROUGE-L	BERTScore
T5 Small	71.75	76.16	66.24	75.55	0.70
T5 Base	71.98	76.28	66.30	75.45	0.70
T5 Large	69.89	75.07	65.19	74.22	0.68
LT3	78.52	78.16	68.72	77.55	0.72
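To illustrate what multi-reference scoring means, here is a minimal sketch of a ROUGE-1-style unigram F1 in which each generation is scored against several references and the best match is kept. This is not our evaluation code: the function names, the whitespace tokenisation, the max-over-references aggregation, and the example sentences are all assumptions made for illustration, and the score is on a 0 to 1 scale rather than the 0 to 100 scale used in the table.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between one candidate and one reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def multi_ref_rouge1(candidate: str, references: list[str]) -> float:
    """Score against every reference and keep the best match (an assumed convention)."""
    return max(rouge1_f1(candidate, ref) for ref in references)


# Hypothetical example sentences, not taken from our dataset.
candidate = "take one tablet daily"
references = [
    "take one tablet twice daily",
    "apply cream twice daily",
]
best = multi_ref_rouge1(candidate, references)
print(f"multi-reference ROUGE-1 F1: {best:.3f}")
```

With several valid phrasings per label, scoring against the closest reference avoids penalising a generation for differing from one arbitrary reference.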

Lexical Diversity Evaluation within Generated Outputs

The results below measure the diversity within each model's generated outputs. For each label, we measured the Jaccard similarity score of the generations of our models. A higher Jaccard score indicates more similarity between the two populations, so a lower score indicates better diversity in our task.

Models	Median Jaccard Score	Average Jaccard Score
LT3	0.650	0.652
T5 Base	0.658	0.660
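As a rough illustration of the Jaccard similarity described above, the sketch below compares generations pairwise on their sets of lowercase tokens and reports the median and average. The helper names, the whitespace tokenisation, the pairwise comparison scheme, and the example sentences are assumptions for illustration only; the exact procedure we used may differ.

```python
from itertools import combinations
from statistics import mean, median


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def pairwise_jaccard(generations: list[str]) -> list[float]:
    """Jaccard score for every unordered pair of generations."""
    return [jaccard(a, b) for a, b in combinations(generations, 2)]


# Hypothetical generations for a single label.
generations = [
    "take one tablet twice daily",
    "take one tablet daily",
    "apply cream to affected area",
]
scores = pairwise_jaccard(generations)
print(f"median: {median(scores):.3f}, average: {mean(scores):.3f}")
```

Lower values mean the generations share fewer tokens, which is the sense in which a lower score indicates better diversity.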

Downstream NER Evaluation

The results below demonstrate how effectively our generated synthetic dataset trains an NER model compared with real data.

Thank you

Feel free to use LT3 for any research purpose.

Please contact us if you have any questions, and cite our work if you use it.