---
library_name: transformers
tags:
- e-commerce
- query-generation
datasets:
- smartcat/Amazon-2023-GenQ
language:
- en
metrics:
- rouge
base_model:
- BeIR/query-gen-msmarco-t5-base-v1
pipeline_tag: text2text-generation
license: mit
---

# Model Card for T5-GenQ-TDC-v1 🤖 ✨

🔍 Generate precise, realistic user-focused search queries from product text 🛒 🚀 📊

### Model Description

- **Model Name:** Fine-Tuned Query-Generation Model
- **Model type:** Text-to-Text Transformer
- **Finetuned from model:** [BeIR/query-gen-msmarco-t5-base-v1](https://huggingface.co/BeIR/query-gen-msmarco-t5-base-v1)
- **Dataset:** [smartcat/Amazon-2023-GenQ](https://huggingface.co/datasets/smartcat/Amazon-2023-GenQ)
- **Primary Use Case:** Generating accurate and relevant search queries from item descriptions
- **Repository:** [smartcat-labs/product2query](https://github.com/smartcat-labs/product2query)

### Model variations
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
|---|---|---|---|---|
| T5-GenQ-T-v1 | 75.2151 | 54.8735 | 74.5142 | 74.5262 |
| T5-GenQ-TD-v1 | 78.2570 | 58.9586 | 77.5308 | 77.5466 |
| T5-GenQ-TDE-v1 | 76.9075 | 57.0980 | 76.1464 | 76.1502 |
| **T5-GenQ-TDC-v1 (best)** | **80.0754** | **61.5974** | **79.3557** | **79.3427** |
### Uses

This model is designed to improve e-commerce search functionality by generating user-friendly search queries based on product descriptions. It is particularly suited for applications where product descriptions are the primary input, and the goal is to create concise, descriptive queries that align with user search intent.

### Examples of Use:
- Generating search queries for product indexing.
- Enhancing product discoverability in e-commerce search engines.
- Automating query generation for catalog management.
### Comparison of ROUGE scores

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
|---|---|---|---|---|
| T5-GenQ-TDC-v1 | 76.14 | 56.11 | 75.47 | 75.47 |
| query-gen-msmarco-t5-base-v1 | 34.94 | 15.29 | 34.18 | 34.18 |
**Note:** This evaluation was performed after training, on the test split of [smartcat/Amazon-2023-GenQ](https://huggingface.co/datasets/smartcat/Amazon-2023-GenQ/viewer/default/test?views%5B%5D=test).

### Examples
| Input Text | Target Query | Before Fine-tuning | After Fine-tuning |
|---|---|---|---|
| AC/DC Men's Back in Black T-Shirt Black This youth T-shirt is an officially licensed AC/DC product and features our cool Back In Black design printed on 100% cotton. | AC/DC Back In Black T-Shirt | what is ac dc back in black | AC/DC Back in Black T-Shirt |
| AISYBB Women's Sweet Japanese Strawberry Embroidered Knit Cardigan Top Korean Retro Elegant Cute Kawaii Coat BrandName: AISYBBSleeveStyle: RegularMaterialComposition: AcrylicPatternType: SolidSleeveLength(cm): FullStyle: SweetSeason: Spring/AutumnThickness: STANDARDDecoration: ButtonClothingLength: RegularMaterial: AcrylicClosureType: SingleBreastedGender: WOMENCollar: V-NeckPattern: Loose-fittingAge: Ages16-28YearsOldPercentageofMaterial: 31%-50%YarnThickness: RegularyarnColor: Red,PinkSize: M,L,XLFabricName: AcrylicPopularElements: Patchwork,ContrastColor,ButtonAge-appropriate: 18-29SweaterCraft: FlatKnittingFeature1: LeisureFeature2: ChicFeature3: ElegantFeature4: Fashion | AISYBB Women's Strawberry Cardigan | aisybb cardigans | Japanese Strawberry Cardigan |
| Flying Fisherman Square Wrap Polarized Protection Oval Sand Bank Matte Black Amber, One Size As stunning as the crystal blue water over a shallow, The Flying Fisherman sand bank polarized sunglasses will have you seeing clearer with increased visual perception. This light weight wrap style has a sturdy build ready for whatever comes your way. Medium fit features non-slip temples and built in Rubberized nose pads adding to solid performance and comfort for all day wear. Polarized Triacetate lenses are impact and scratch resistant, lightweight and durable. Acutint lens coloring system adds color contrast without distorting natural colors, allowing you to see more clearly. 100% protection from harmful UVA and UVB rays. | Polarized sunglasses for fishing | what is polarized sand bank | Flying Fisherman polarized sunglasses |
## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("smartcat/T5-GenQ-TDC-v1")
tokenizer = AutoTokenizer.from_pretrained("smartcat/T5-GenQ-TDC-v1")

description = "Silver-colored cuff with embossed braid pattern. Made of brass, flexible to fit wrist."

# Tokenize the product description, then generate a query with beam search
inputs = tokenizer(description, return_tensors="pt", padding=True, truncation=True)
generated_ids = model.generate(inputs["input_ids"], max_length=30, num_beams=4, early_stopping=True)
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
```

## Training Details

### Training Data

The model was trained on the [smartcat/Amazon-2023-GenQ](https://huggingface.co/datasets/smartcat/Amazon-2023-GenQ) dataset, which consists of user-like queries generated from product descriptions. The dataset was created using Claude 3 Haiku, incorporating key product attributes such as the title, description, and images to ensure relevant and realistic queries. For more information, read the [Dataset Card](https://huggingface.co/datasets/smartcat/Amazon-2023-GenQ). 😊

### Preprocessing

- Trained on product titles + descriptions, plus a subset of title-only products whose similarity score against short queries was above 85%.
- Tokenized using T5's default tokenizer, with truncation to handle long text.

### Training Hyperparameters

### Train time: 16.79 hrs

### Hardware

A6000 GPU:
- Memory Size: 48 GB
- Memory Type: GDDR6
- CUDA Compute Capability: 8.6

### Metrics

**[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric))** (**R**ecall-**O**riented **U**nderstudy for **G**isting **E**valuation) is a set of metrics for evaluating automatic summarization and machine translation in NLP. The metrics compare an automatically produced summary or translation against a reference (human-produced) summary or translation, or a set of references.
ROUGE metrics range between 0 and 1, with higher scores indicating higher similarity between the automatically produced summary and the reference. In our evaluation, ROUGE scores are scaled to resemble percentages for better interpretability. The metric used during training was ROUGE-L.
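For intuition, ROUGE-L can be sketched as an F-measure over the longest common subsequence (LCS) of the candidate and reference token sequences. The snippet below is a minimal illustration, not the evaluation code used for the scores reported in this card:

```python
# Minimal ROUGE-L sketch: F1 over the longest common subsequence (LCS)
# of whitespace-tokenized candidate and reference strings.

def lcs_length(a, b):
    # Classic dynamic-programming LCS table
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)  # F1 in [0, 1]

score = rouge_l("ac dc back in black t-shirt", "what is ac dc back in black")
```

Multiplying the resulting F1 by 100 gives scores on the percentage-like scale used in the tables in this card.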
| Epoch | Step | Loss | Grad Norm | Learning Rate | Eval Loss | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 5181 | 0.6783 | 1.4479 | 0.000049 | 0.5469 | 78.3314 | 59.1425 | 77.7077 | 77.6942 |
| 2 | 10362 | 0.5621 | 1.5229 | 0.000042 | 0.5191 | 79.3411 | 60.5070 | 78.6538 | 78.6498 |
| 3 | 15543 | 0.5113 | 1.7697 | 0.000035 | 0.5109 | 79.4785 | 60.7267 | 78.8132 | 78.8043 |
| 4 | 20724 | 0.4758 | 1.5957 | 0.000028 | 0.5113 | 79.8056 | 61.0473 | 79.0640 | 79.0545 |
| 5 | 25905 | 0.4480 | 1.5304 | 0.000021 | 0.5102 | 79.9302 | 61.4653 | 79.2436 | 79.2355 |
| 6 | 31086 | 0.4277 | 1.7578 | 0.000014 | 0.5130 | 79.8888 | 61.2644 | 79.1514 | 79.1433 |
| 7 | 36267 | 0.4119 | 1.8958 | 0.000007 | 0.5157 | 80.0362 | 61.5594 | 79.3187 | 79.3069 |
| 8 | 41448 | 0.4018 | 1.4958 | 0 | 0.5190 | 80.0754 | 61.5974 | 79.3557 | 79.3427 |
### Model Analysis

#### Average scores by model

The checkpoint-41448 (T5-GenQ-TDC-v1) model outperforms query-gen-msmarco-t5-base-v1 across all ROUGE metrics. The largest performance gap is in ROUGE-2, where checkpoint-41448 achieves 56.11 while query-gen-msmarco-t5-base-v1 scores 15.29. ROUGE-1, ROUGE-L, and ROUGE-Lsum follow very similar trends, with checkpoint-41448 consistently scoring above 72 and query-gen-msmarco-t5-base-v1 staying below 41.
#### Density comparison

- `T5-GenQ-TDC-v1`: higher concentration of high ROUGE scores, especially near 100%, indicating strong text overlap with references.
- `query-gen-msmarco-t5-base-v1`: more spread-out distribution, with multiple peaks at 10-40%, suggesting greater variability but lower precision.
- ROUGE-1 & ROUGE-L: `T5-GenQ-TDC-v1` peaks at 100%, while `query-gen-msmarco-t5-base-v1` has lower, broader peaks.
- ROUGE-2: `query-gen-msmarco-t5-base-v1` has a high density at 0%, indicating many low-overlap outputs.
#### Histogram comparison

- `T5-GenQ-TDC-v1`: higher concentration of high ROUGE scores, especially near 100%, indicating strong text overlap with references.
- `query-gen-msmarco-t5-base-v1`: more spread-out distribution, with peaks in the 10-40% range, suggesting greater variability but lower precision.
- ROUGE-1 & ROUGE-L: `T5-GenQ-TDC-v1` shows a rising trend toward higher scores, while `query-gen-msmarco-t5-base-v1` has multiple peaks at lower scores.
- ROUGE-2: `query-gen-msmarco-t5-base-v1` has a high concentration of low-score outputs, whereas `T5-GenQ-TDC-v1` generates more accurate text.
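The per-example scores behind histograms like these can be bucketed with a few lines of standard-library Python. This is a hypothetical sketch of the binning step only; the card does not include the actual plotting code:

```python
# Bucket per-example ROUGE scores (expressed as percentages, 0-100)
# into 10%-wide histogram bins keyed by each bin's lower edge.
from collections import Counter

def histogram(scores, bin_width=10):
    # A score of exactly 100 is folded into the top (90-100) bin.
    bins = Counter(min(int(s // bin_width) * bin_width, 100 - bin_width) for s in scores)
    return {b: bins.get(b, 0) for b in range(0, 100, bin_width)}

counts = histogram([12.5, 35.0, 88.0, 100.0, 97.2])
```

Plotting `counts` per metric for each model reproduces the kind of side-by-side comparison described above.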
#### Scores by generated query length

This visualization analyzes average ROUGE scores and score differences across generated query lengths (in words).

- Consistently high ROUGE scores (lengths 2-8): ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum remain stable and above 60% across most lengths.
- Significant drop at length 9: a sharp decline in scores at length 9, followed by a spike at length 10, indicates a possible edge case in the dataset.
- Stable score differences (lengths 3-9): after the initial spike at length 2, score differences stay close to zero, indicating consistent performance across phrase lengths.
#### Semantic similarity distribution

This histogram visualizes the distribution of cosine similarity scores, which measure the semantic similarity between paired texts. The majority of similarity scores cluster near 1.0, indicating that most text pairs are highly similar. Frequency increases gradually as similarity scores rise, with a sharp peak at 1.0. Lower similarity scores (0.0-0.4) are rare, suggesting few instances of dissimilar text pairs.
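The cosine similarity underlying this distribution is a simple operation on embedding vectors. Below is a minimal sketch; the toy vectors stand in for sentence embeddings, and the embedding model itself is not specified in this section:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for the embeddings of two queries;
# parallel vectors give a similarity of 1.0.
sim = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```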
#### Semantic similarity score against ROUGE scores

This scatter plot matrix compares semantic similarity (cosine similarity) with ROUGE scores, showing their correlation. Higher similarity corresponds to higher ROUGE scores, indicating strong n-gram overlap in semantically similar texts. ROUGE-1 & ROUGE-L show the strongest correlation, while ROUGE-2 has more variability. Some low-similarity outliers still achieve moderate ROUGE scores, suggesting surface-level overlap without deep semantic alignment. These plots help assess how semantic similarity aligns with n-gram overlap metrics in text evaluation.
### Performance on Another Dataset

To assess the performance of our fine-tuned query generation model, we conducted evaluations on a dataset containing real user queries, which was not part of the fine-tuning data. The goal was to verify the model's generalizability and effectiveness in generating high-quality queries for e-commerce products.

#### Average scores by model

The figure presents a comparison of ROUGE scores between T5-GenQ-TDC-v1 (checkpoint-41448) and the base model (query-gen-msmarco-t5-base-v1). ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum) measure the similarity between generated queries and ground-truth user queries.

    The fine-tuned model outperforms the base model across all ROUGE metrics, with significant improvements in ROUGE-1, ROUGE-2, and ROUGE-LSUM.

    These improvements indicate that the fine-tuned model generates queries that are more aligned with real user queries, making it a better fit for practical e-commerce applications.
#### Semantic similarity distribution

The figure shows the distribution of semantic similarity scores (cosine similarity) between the generated queries and real user queries.

The distribution shows a peak around 0.1-0.2, with a secondary peak near 0.6-0.7.

    The bimodal nature of the distribution suggests that while a portion of generated queries are highly aligned with real user queries, some still exhibit lower similarity, likely due to domain-specific variations.

    A significant number of generated queries achieve a similarity score of 0.6 or higher, confirming that the fine-tuned model effectively captures user intent better than a generic, pre-trained model.

    Better Generalization: Even though the fine-tuned model was not trained on this dataset, it generalizes better than the original model on real user queries.

Improved Query Quality: The fine-tuned model produces more relevant, structured, and user-aligned queries, which is critical for enhancing search and recommendation performance in e-commerce.

    Robust Semantic Alignment: Higher semantic similarity scores indicate that queries generated by the fine-tuned model better match user intent, leading to improved search and retrieval performance.

## More Information

Please visit the [GitHub Repository](https://github.com/smartcat-labs/product2query).

## Authors

- Mentor: [Milutin Studen](https://www.linkedin.com/in/milutin-studen/)
- Engineers: [Petar Surla](https://www.linkedin.com/in/petar-surla-6448b6269/), [Andjela Radojevic](https://www.linkedin.com/in/an%C4%91ela-radojevi%C4%87-936197196/)

## Model Card Contact

For questions, please open an issue on the [GitHub Repository](https://github.com/smartcat-labs/product2query).