---
library_name: transformers
tags:
- e-commerce
- query-generation
datasets:
- smartcat/Amazon-2023-GenQ
language:
- en
metrics:
- rouge
base_model:
- BeIR/query-gen-msmarco-t5-base-v1
pipeline_tag: text2text-generation
license: mit
---

# Model Card for T5-GenQ-TDC-v1 🤖 ✨

🔍 Generate precise, realistic user-focused search queries from product text 🛒 🚀 📊

### Model Description

- **Model Name:** Fine-Tuned Query-Generation Model
- **Model type:** Text-to-Text Transformer
- **Finetuned from model:** [BeIR/query-gen-msmarco-t5-base-v1](https://huggingface.co/BeIR/query-gen-msmarco-t5-base-v1)
- **Dataset:** [smartcat/Amazon-2023-GenQ](https://huggingface.co/datasets/smartcat/Amazon-2023-GenQ)
- **Primary Use Case:** Generating accurate and relevant search queries from item descriptions
- **Repository:** [smartcat-labs/product2query](https://github.com/smartcat-labs/product2query)

### Model variations
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
|---|---|---|---|---|
| T5-GenQ-T-v1 | 75.2151 | 54.8735 | 74.5142 | 74.5262 |
| T5-GenQ-TD-v1 | 78.2570 | 58.9586 | 77.5308 | 77.5466 |
| T5-GenQ-TDE-v1 | 76.9075 | 57.0980 | 76.1464 | 76.1502 |
| **T5-GenQ-TDC-v1 (best)** | **80.0754** | **61.5974** | **79.3557** | **79.3427** |
### Uses

This model is designed to improve e-commerce search functionality by generating user-friendly search queries based on product descriptions. It is particularly suited for applications where product descriptions are the primary input, and the goal is to create concise, descriptive queries that align with user search intent.

### Examples of Use:
- Generating search queries for product indexing.
- Enhancing product discoverability in e-commerce search engines.
- Automating query generation for catalog management.
### Comparison of ROUGE scores

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
|---|---|---|---|---|
| T5-GenQ-TDC-v1 | 76.14 | 56.11 | 75.47 | 75.47 |
| query-gen-msmarco-t5-base-v1 | 34.94 | 15.29 | 34.18 | 34.18 |
**Note:** This evaluation was performed after training, on the test split of [smartcat/Amazon-2023-GenQ](https://huggingface.co/datasets/smartcat/Amazon-2023-GenQ/viewer/default/test?views%5B%5D=test).

### Examples
| Input Text | Target Query | Before Fine-tuning | After Fine-tuning |
|---|---|---|---|
| AC/DC Men's Back in Black T-Shirt Black This youth T-shirt is an officially licensed AC/DC product and features our cool Back In Black design printed on 100% cotton. | AC/DC Back In Black T-Shirt | what is ac dc back in black | AC/DC Back in Black T-Shirt |
| AISYBB Women's Sweet Japanese Strawberry Embroidered Knit Cardigan Top Korean Retro Elegant Cute Kawaii Coat BrandName: AISYBBSleeveStyle: RegularMaterialComposition: AcrylicPatternType: SolidSleeveLength(cm): FullStyle: SweetSeason: Spring/AutumnThickness: STANDARDDecoration: ButtonClothingLength: RegularMaterial: AcrylicClosureType: SingleBreastedGender: WOMENCollar: V-NeckPattern: Loose-fittingAge: Ages16-28YearsOldPercentageofMaterial: 31%-50%YarnThickness: RegularyarnColor: Red,PinkSize: M,L,XLFabricName: AcrylicPopularElements: Patchwork,ContrastColor,ButtonAge-appropriate: 18-29SweaterCraft: FlatKnittingFeature1: LeisureFeature2: ChicFeature3: ElegantFeature4: Fashion | AISYBB Women's Strawberry Cardigan | aisybb cardigans | Japanese Strawberry Cardigan |
| Flying Fisherman Square Wrap Polarized Protection Oval Sand Bank Matte Black Amber, One Size As stunning as the crystal blue water over a shallow, The Flying Fisherman sand bank polarized sunglasses will have you seeing clearer with increased visual perception. This light weight wrap style has a sturdy build ready for whatever comes your way. Medium fit features non-slip temples and built in Rubberized nose pads adding to solid performance and comfort for all day wear. Polarized Triacetate lenses are impact and scratch resistant, lightweight and durable. Acutint lens coloring system adds color contrast without distorting natural colors, allowing you to see more clearly. 100% protection from harmful UVA and UVB rays. | Polarized sunglasses for fishing | what is polarized sand bank | Flying Fisherman polarized sunglasses |
## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("smartcat/T5-GenQ-TDC-v1")
tokenizer = AutoTokenizer.from_pretrained("smartcat/T5-GenQ-TDC-v1")

description = "Silver-colored cuff with embossed braid pattern. Made of brass, flexible to fit wrist."

# Tokenize the product description, then generate a query with beam search
inputs = tokenizer(description, return_tensors="pt", padding=True, truncation=True)
generated_ids = model.generate(inputs["input_ids"], max_length=30, num_beams=4, early_stopping=True)
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
```

## Training Details

### Training Data

The model was trained on the [smartcat/Amazon-2023-GenQ](https://huggingface.co/datasets/smartcat/Amazon-2023-GenQ) dataset, which consists of user-like queries generated from product descriptions. The dataset was created using Claude 3 Haiku, incorporating key product attributes such as the title, description, and images to ensure relevant and realistic queries. For more information, read the [Dataset Card](https://huggingface.co/datasets/smartcat/Amazon-2023-GenQ). 😊

### Preprocessing

- Trained on product titles + descriptions, plus a subset of title-only products whose similarity score against short queries was above 85%.
- Tokenized using T5's default tokenizer, with truncation to handle long text.

### Training Hyperparameters

### Train time: 16.79 hrs

### Hardware

A6000 GPU:
- Memory Size: 48 GB
- Memory Type: GDDR6
- CUDA Compute Capability: 8.6

### Metrics

**[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric))** (**R**ecall-**O**riented **U**nderstudy for **G**isting **E**valuation) is a set of metrics for evaluating automatic summarization and machine translation in NLP. The metrics compare an automatically produced summary or translation against a reference (human-produced) summary or translation, or a set of references.
ROUGE metrics range between 0 and 1, with higher scores indicating higher similarity between the automatically produced summary and the reference. In our evaluation, ROUGE scores are scaled to resemble percentages for better interpretability. The metric used during training was ROUGE-L.
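For intuition, ROUGE-L can be sketched as an F-measure over the longest common subsequence (LCS) of the candidate and reference token sequences. The snippet below is a minimal illustration, not the evaluation code used for the scores reported in this card:

```python
# Minimal ROUGE-L sketch: F1 over the longest common subsequence (LCS)
# of whitespace-tokenized candidate and reference strings.

def lcs_length(a, b):
    # Classic dynamic-programming LCS table
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)  # F1 in [0, 1]

score = rouge_l("ac dc back in black t-shirt", "what is ac dc back in black")
```

Multiplying the resulting F1 by 100 gives scores on the percentage-like scale used in the tables in this card.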
| Epoch | Step | Loss | Grad Norm | Learning Rate | Eval Loss | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 5181 | 0.6783 | 1.4479 | 0.000049 | 0.5469 | 78.3314 | 59.1425 | 77.7077 | 77.6942 |
| 2 | 10362 | 0.5621 | 1.5229 | 0.000042 | 0.5191 | 79.3411 | 60.5070 | 78.6538 | 78.6498 |
| 3 | 15543 | 0.5113 | 1.7697 | 0.000035 | 0.5109 | 79.4785 | 60.7267 | 78.8132 | 78.8043 |
| 4 | 20724 | 0.4758 | 1.5957 | 0.000028 | 0.5113 | 79.8056 | 61.0473 | 79.0640 | 79.0545 |
| 5 | 25905 | 0.4480 | 1.5304 | 0.000021 | 0.5102 | 79.9302 | 61.4653 | 79.2436 | 79.2355 |
| 6 | 31086 | 0.4277 | 1.7578 | 0.000014 | 0.5130 | 79.8888 | 61.2644 | 79.1514 | 79.1433 |
| 7 | 36267 | 0.4119 | 1.8958 | 0.000007 | 0.5157 | 80.0362 | 61.5594 | 79.3187 | 79.3069 |
| 8 | 41448 | 0.4018 | 1.4958 | 0 | 0.5190 | 80.0754 | 61.5974 | 79.3557 | 79.3427 |
### Model Analysis

#### Average scores by model

The checkpoint-41448 (T5-GenQ-TDC-v1) model outperforms query-gen-msmarco-t5-base-v1 across all ROUGE metrics. The largest performance gap is in ROUGE-2, where checkpoint-41448 achieves 56.11 while query-gen-msmarco-t5-base-v1 scores 15.29. ROUGE-1, ROUGE-L, and ROUGE-Lsum follow very similar trends, with checkpoint-41448 consistently scoring above 72 and query-gen-msmarco-t5-base-v1 staying below 41.
#### Density comparison

- `T5-GenQ-TDC-v1`: higher concentration of high ROUGE scores, especially near 100%, indicating strong text overlap with references.
- `query-gen-msmarco-t5-base-v1`: more spread-out distribution, with multiple peaks at 10-40%, suggesting greater variability but lower precision.
- ROUGE-1 & ROUGE-L: `T5-GenQ-TDC-v1` peaks at 100%, while `query-gen-msmarco-t5-base-v1` has lower, broader peaks.
- ROUGE-2: `query-gen-msmarco-t5-base-v1` has a high density at 0%, indicating many low-overlap outputs.
#### Histogram comparison

- `T5-GenQ-TDC-v1`: higher concentration of high ROUGE scores, especially near 100%, indicating strong text overlap with references.
- `query-gen-msmarco-t5-base-v1`: more spread-out distribution, with peaks in the 10-40% range, suggesting greater variability but lower precision.
- ROUGE-1 & ROUGE-L: `T5-GenQ-TDC-v1` shows a rising trend toward higher scores, while `query-gen-msmarco-t5-base-v1` has multiple peaks at lower scores.
- ROUGE-2: `query-gen-msmarco-t5-base-v1` has a high concentration of low-score outputs, whereas `T5-GenQ-TDC-v1` generates more accurate text.
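The per-example scores behind histograms like these can be bucketed with a few lines of standard-library Python. This is a hypothetical sketch of the binning step only; the card does not include the actual plotting code:

```python
# Bucket per-example ROUGE scores (expressed as percentages, 0-100)
# into 10%-wide histogram bins keyed by each bin's lower edge.
from collections import Counter

def histogram(scores, bin_width=10):
    # A score of exactly 100 is folded into the top (90-100) bin.
    bins = Counter(min(int(s // bin_width) * bin_width, 100 - bin_width) for s in scores)
    return {b: bins.get(b, 0) for b in range(0, 100, bin_width)}

counts = histogram([12.5, 35.0, 88.0, 100.0, 97.2])
```

Plotting `counts` per metric for each model reproduces the kind of side-by-side comparison described above.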
#### Scores by generated query length

This visualization analyzes average ROUGE scores and score differences across generated query lengths (in words).

- Consistently high ROUGE scores (lengths 2-8): ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum remain stable and above 60% across most lengths.
- Significant drop at length 9: a sharp decline in scores at length 9, followed by a spike at length 10, indicates a possible edge case in the dataset.
- Stable score differences (lengths 3-9): after the initial spike at length 2, score differences stay close to zero, indicating consistent performance across phrase lengths.
#### Semantic similarity distribution

This histogram visualizes the distribution of cosine similarity scores, which measure the semantic similarity between paired texts. The majority of similarity scores cluster near 1.0, indicating that most text pairs are highly similar. Frequency increases gradually as similarity scores rise, with a sharp peak at 1.0. Lower similarity scores (0.0-0.4) are rare, suggesting few instances of dissimilar text pairs.
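The cosine similarity underlying this distribution is a simple operation on embedding vectors. Below is a minimal sketch; the toy vectors stand in for sentence embeddings, and the embedding model itself is not specified in this section:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for the embeddings of two queries;
# parallel vectors give a similarity of 1.0.
sim = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```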
#### Semantic similarity score against ROUGE scores

This scatter plot matrix compares semantic similarity (cosine similarity) with ROUGE scores, showing their correlation. Higher similarity corresponds to higher ROUGE scores, indicating strong n-gram overlap in semantically similar texts. ROUGE-1 & ROUGE-L show the strongest correlation, while ROUGE-2 has more variability. Some low-similarity outliers still achieve moderate ROUGE scores, suggesting surface-level overlap without deep semantic alignment. These plots help assess how semantic similarity aligns with n-gram overlap metrics in text evaluation.
### Performance on Another Dataset

To assess the performance of our fine-tuned query generation model, we conducted evaluations on a dataset containing real user queries, which was not part of the fine-tuning data. The goal was to verify the model's generalizability and effectiveness in generating high-quality queries for e-commerce products.

#### Average scores by model

The figure presents a comparison of ROUGE scores between T5-GenQ-TDC-v1 (checkpoint-41448) and the base model (query-gen-msmarco-t5-base-v1). ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum) measure the similarity between generated queries and ground-truth user queries.

    The fine-tuned model outperforms the base model across all ROUGE metrics, with significant improvements in ROUGE-1, ROUGE-2, and ROUGE-LSUM.

    These improvements indicate that the fine-tuned model generates queries that are more aligned with real user queries, making it a better fit for practical e-commerce applications.
#### Semantic similarity distribution

The figure shows the distribution of semantic similarity scores (cosine similarity) between the generated queries and real user queries.

The distribution shows a peak around 0.1-0.2, with a secondary peak near 0.6-0.7.

    The bimodal nature of the distribution suggests that while a portion of generated queries are highly aligned with real user queries, some still exhibit lower similarity, likely due to domain-specific variations.

    A significant number of generated queries achieve a similarity score of 0.6 or higher, confirming that the fine-tuned model effectively captures user intent better than a generic, pre-trained model.

    Better Generalization: Even though the fine-tuned model was not trained on this dataset, it generalizes better than the original model on real user queries.

Improved Query Quality: The fine-tuned model produces more relevant, structured, and user-aligned queries, which is critical for enhancing search and recommendation performance in e-commerce.

    Robust Semantic Alignment: Higher semantic similarity scores indicate that queries generated by the fine-tuned model better match user intent, leading to improved search and retrieval performance.

## More Information

Please visit the [GitHub Repository](https://github.com/smartcat-labs/product2query).

## Authors

- Mentor: [Milutin Studen](https://www.linkedin.com/in/milutin-studen/)
- Engineers: [Petar Surla](https://www.linkedin.com/in/petar-surla-6448b6269/), [Andjela Radojevic](https://www.linkedin.com/in/an%C4%91ela-radojevi%C4%87-936197196/)

## Model Card Contact

For questions, please open an issue on the [GitHub Repository](https://github.com/smartcat-labs/product2query).