Model Card for logicsct-phi4

logicsct-phi4 is a QLoRA 4-bit fine-tuned version of microsoft/phi-4. This model has been adapted with domain-specific knowledge to serve as a support chatbot for Connect-Transport, our transport management system developed at Logics Software GmbH.

While tailored for our internal use, the training principles and techniques we employed can also be applied by others interested in developing their own chatbot assistants.

We are continuously evaluating and refining our models to enhance the performance of our support chatbot for Connect-Transport.

Finding a Good Base Model – Proficient in German and Following Instructions

We have evaluated over 70 models for basic technical instruction tasks in German. The evaluation was carried out manually by reviewing the responses to the following questions:

  • Wie kann ich in Chrome machen dass meine Downloads immer am gleichen Ort gespeichert werden? (How can I make Chrome always save my downloads to the same location?)
  • Wie kann ich in Outlook meine Mail Signatur anpassen und einen Link und Bild dort einfügen? (How can I customize my email signature in Outlook and insert a link and an image there?)

The best models according to our subjective rating scale (1 = poor, 5 = excellent) are:

5-Star Rating:

4-Star Rating:

Models rated 3 stars or lower are not listed here. We have tested dozens of models in the sub-10B and sub-20B parameter ranges, but most either do not understand and produce German well enough or do not perform adequately when answering technical support questions.

Some models also have smaller versions that are not listed above because they did not achieve a 4+ rating. Additionally, some models (e.g., Hermes 3) have larger versions available that are not included, as their performance relative to model size was not impressive, making their massive 405B versions less interesting for our purposes.

Given our goal of training, exporting, and running inference on our dedicated server hardware, we primarily focus on models with fewer than 20B parameters while comparing their performance with that of some larger models.

How We Fine-Tune Our Base Model

Due to our small training dataset and GPU VRAM constraints, we exclusively use QLoRA fine-tuning. After testing our own scripts, we ultimately settled on LLaMA-Factory, which meets our needs for straightforward training, inference, and export functionality across a wide range of models.

Training Data

Our training data currently consists of approximately 220 prompt-response pairs in a carefully curated dataset.

To collect this training data, we developed an internal web app for our employees. In this app:

  • Employees enter both a prompt and a response — each pair forms a single training record.
  • Gamification is implemented via daily and weekly high score systems that highlight the employee who contributes the most training data.
  • The web app is also connected to a selection of current evaluation models to facilitate internal testing of how the models respond both to prompts from the training data and to new prompts.

Here are three example records from our training data:

Prompt: Wie kann ich eine Tour umbenennen?
Response: 1. Öffnen Sie die Stoppliste der Tour und klicken Sie auf das Tourmenü (graues Zahnradsymbol ⚙️) rechts oben in der Stoppliste.\n2. Wählen Sie im Tourmenü "Tour umbenennen".\n3. Folgen Sie der Anweisung im Dialog und klicken Sie auf "OK".\nBeachten Sie: Fahrer können die Tour nicht umbenennen, sondern nur ein Disponent (Admin Account).

Prompt: Wie werden Daten in Connect-Transport importiert?
Response: Der Flex Importer liest Auftragsdaten aus Excel-Dateien ein. Alternativ können Daten per Schnittstelle direkt aus ERP- oder Warenwirtschaftssystemen importiert werden.

Prompt: Gibt es eine Möglichkeit zur Sendungsverfolgung für Endkunden?
Response: Ja, mit dem Avisierungs-Modul können Disponenten per SMS oder E-Mail voraussichtliche Lieferzeiten an Empfänger senden, die ihre Sendungen live verfolgen können.
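
For LLaMA-Factory, records like these are stored in the default alpaca-style JSON format and registered under the dataset name referenced in the training config (dataset: logicsct). The following is a minimal sketch of that layout; the file names and field mapping are assumptions and may differ from our actual internal setup.

data/dataset_info.json (excerpt):

{
  "logicsct": {
    "file_name": "logicsct.json"
  }
}

data/logicsct.json (first example record from above):

[
  {
    "instruction": "Wie kann ich eine Tour umbenennen?",
    "input": "",
    "output": "1. Öffnen Sie die Stoppliste der Tour und klicken Sie auf das Tourmenü (graues Zahnradsymbol ⚙️) rechts oben in der Stoppliste.\n2. Wählen Sie im Tourmenü \"Tour umbenennen\".\n3. Folgen Sie der Anweisung im Dialog und klicken Sie auf \"OK\".\nBeachten Sie: Fahrer können die Tour nicht umbenennen, sondern nur ein Disponent (Admin Account)."
  }
]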

QLoRA Settings

Full settings for logicsct_train_Phi4_qlora_sft_otfq.yaml:

### model
model_name_or_path: microsoft/phi-4
quantization_bit: 4
quantization_method: bitsandbytes
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 128 # we are still experimenting with this value
#lora_alpha    # not set; defaults to lora_rank * 2
lora_target: all

### dataset
dataset: logicsct
template: phi4
cutoff_len: 512
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/logicsct-phi4/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 2.0e-4   # we are still experimenting with this value
num_train_epochs: 4.0   # we are still experimenting with this value
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.2                 # use 20% of the dataset as the validation split
per_device_eval_batch_size: 1 # evaluation batch size per device
eval_strategy: steps          # or "epoch" to evaluate at the end of each epoch
eval_steps: 500               # evaluation frequency when eval_strategy is "steps"
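
For context on these settings: with roughly 220 records and val_size: 0.2, about 176 examples remain for training. At an effective batch size of per_device_train_batch_size × gradient_accumulation_steps = 1 × 8 = 8, that is roughly 22 optimizer steps per epoch, or about 88 steps over 4 epochs, so with save_steps and eval_steps at 500, checkpoints and evaluation effectively only occur at the end of training.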

Training, Inference, and Export

We follow the instructions provided in the LLaMA-Factory Quickstart Guide:

llamafactory-cli train logicsct_train_Phi4_qlora_sft_otfq.yaml       # VRAM used: 11093MiB for 4-bit QLoRA training
llamafactory-cli chat logicsct_inference_Phi4_qlora_sft_otfq.yaml    # VRAM used: 30927MiB for inference of the base model + QLoRA adapter
llamafactory-cli export logicsct_export_Phi4_qlora_sft.yaml          # VRAM used:   665MiB + about 29 GB of system RAM for exporting a merged version of the model with its adapter
llamafactory-cli export logicsct_export_Phi4_qlora_sft_Q4.yaml       # VRAM used: 38277MiB for a 4-bit quantized export of the merged model
llamafactory-cli chat logicsct_inference_Phi4_qlora_sft_otfq_Q4.yaml # VRAM used:  9255MiB-11405MiB for inference of the 4-bit quantized merged model (increases with context length)
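
The inference and export configs referenced above are not reproduced here; the sketches below are based on standard LLaMA-Factory options, and their exact contents are assumptions that may differ from the files we actually use.

logicsct_inference_Phi4_qlora_sft_otfq.yaml (base model + adapter, quantized on the fly):

model_name_or_path: microsoft/phi-4
adapter_name_or_path: saves/logicsct-phi4/lora/sft
template: phi4
finetuning_type: lora
quantization_bit: 4
quantization_method: bitsandbytes
trust_remote_code: true

logicsct_export_Phi4_qlora_sft.yaml (merge the adapter into the base model):

model_name_or_path: microsoft/phi-4
adapter_name_or_path: saves/logicsct-phi4/lora/sft
template: phi4
finetuning_type: lora
trust_remote_code: true
export_dir: models/logicsct-phi4
export_size: 2
export_device: cpu
export_legacy_format: false

The Q4 export config additionally quantizes the merged model, using options such as export_quantization_bit: 4 together with a small calibration dataset (export_quantization_dataset).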

Comparison of Our Open-Source Fine-Tuned Models with OpenAI Proprietary Fine-Tuning

We have fine-tuned both OpenAI GPT-4o and GPT-4o-mini and compared their performance to that of our best small models. After some initial runs with unsatisfactory results, we significantly adjusted the hyperparameters and focused primarily on experimenting with GPT-4o-mini.

With our current training data, both GPT-4o and GPT-4o-mini appear to need 5 epochs at the default learning rate before the training loss approaches zero. With fewer epochs, the models do not seem to learn enough, perhaps because of the small size of our training dataset. Significant overfitting sets in at around 7 epochs for both models.

Our best settings so far are:

  • Epochs: 5
  • Batch Size: 3
  • Learning Rate: Automatically determined
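
Expressed as the body of an OpenAI fine-tuning job request (POST /v1/fine_tuning/jobs), these settings would look roughly as follows; the model snapshot name and training file ID are placeholders, and omitting learning_rate_multiplier lets the API pick the learning rate automatically:

{
  "model": "gpt-4o-mini-2024-07-18",
  "training_file": "file-XXXXXXXXXXXX",
  "hyperparameters": {
    "n_epochs": 5,
    "batch_size": 3
  }
}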

Currently, our small open-source models perform comparably to, or even better than, the fine-tuned GPT-4o-mini. We will revisit OpenAI fine-tuning once we have a larger training dataset.

Next Steps

Our top priority at the moment is to collect more training data.
