---
tags:
- generated_from_trainer
model-index:
- name: fusion_gttbsc_phi-3-best
  results:
  - task:
      type: dialogue act classification
    dataset:
      name: asapp/slue-phase-2
      type: hvb
    metrics:
    - name: F1 macro E2E
      type: F1 macro
      value: 70.09
    - name: F1 macro GT
      type: F1 macro
      value: 71.86
datasets:
- asapp/slue-phase-2
language:
- en
metrics:
- f1-macro
---
# fusion_gttbsc_phi-3-best

Multi-label dialogue act classification (DAC) on ground-truth text, fused with prosody and ASR encodings via residual cross-attention.
## Model description

- ASR encoder: Whisper small encoder
- Prosody encoder: 2-layer transformer encoder with an initial dense projection
- Backbone: Phi-3 mini
- Fusion: 2 residual cross-attention fusion layers (F_asr x F_text and F_prosody x F_text) with a dense layer on top (see the sketch after this list)
- Pooling: self-attention
- Multi-label classification head: 2 dense layers with two dropouts (p = 0.3) and a Tanh activation in between
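
A minimal PyTorch sketch of how these pieces could fit together. The module names (`ResidualCrossAttentionFusion`, `SelfAttentionPooling`, `FusionDACHead`), the shared hidden size `dim`, the head count, and the order of the dense projection relative to pooling are illustrative assumptions, not details taken from the card.

```python
import torch
import torch.nn as nn


class ResidualCrossAttentionFusion(nn.Module):
    """Text features attend over a second modality; a residual connection
    keeps the text stream intact (hypothetical layout)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_text: torch.Tensor, f_other: torch.Tensor) -> torch.Tensor:
        # Query = text features; key/value = ASR or prosody features.
        fused, _ = self.attn(query=f_text, key=f_other, value=f_other)
        return self.norm(f_text + fused)  # residual connection


class SelfAttentionPooling(nn.Module):
    """Collapse a (batch, time, dim) sequence into one vector per example
    using learned attention scores."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.score(x), dim=1)  # (B, T, 1)
        return (weights * x).sum(dim=1)                # (B, D)


class FusionDACHead(nn.Module):
    """Two fusion layers, a dense projection, self-attention pooling, and the
    2-layer multi-label head with two 0.3 dropouts and a Tanh in between."""

    def __init__(self, dim: int, num_labels: int, dropout: float = 0.3):
        super().__init__()
        self.fuse_asr = ResidualCrossAttentionFusion(dim)      # F_asr x F_text
        self.fuse_prosody = ResidualCrossAttentionFusion(dim)  # F_prosody x F_text
        self.proj = nn.Linear(2 * dim, dim)  # dense layer on top of the fusion outputs
        self.pool = SelfAttentionPooling(dim)
        self.head = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(dim, dim),
            nn.Tanh(),
            nn.Dropout(dropout),
            nn.Linear(dim, num_labels),  # raw logits; pair with BCEWithLogitsLoss
        )

    def forward(self, f_text, f_asr, f_prosody):
        fused = torch.cat(
            [self.fuse_asr(f_text, f_asr), self.fuse_prosody(f_text, f_prosody)],
            dim=-1,
        )
        return self.head(self.pool(self.proj(fused)))
```

Because the head emits raw logits, training for multi-label DAC would pair it with `BCEWithLogitsLoss` rather than a softmax cross-entropy.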
## Training and evaluation data

Trained on ground-truth transcripts. Evaluated both on ground-truth transcripts (GT) and on normalized Whisper small transcripts (E2E).
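
For reference, macro F1 over multi-hot label vectors can be computed as below; the sigmoid with a fixed 0.5 threshold and the toy arrays are illustrative assumptions, not details from the card.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-hot targets and logits for 2 utterances x 3 dialogue acts.
labels = np.array([[1, 0, 1],
                   [0, 1, 0]])
logits = np.array([[2.1, -0.3, 0.8],
                   [-1.5, 0.9, 0.2]])

# Sigmoid + fixed 0.5 threshold (an assumed decision rule).
preds = (1 / (1 + np.exp(-logits)) >= 0.5).astype(int)

# Macro F1: unweighted mean of per-label F1 scores.
print(f1_score(labels, preds, average="macro"))
```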
## Training hyperparameters

The following hyperparameters were used during training (mirrored in the `TrainingArguments` sketch after this list):
- learning_rate: 0.0002
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 20
- mixed_precision_training: Native AMP
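
A `transformers.TrainingArguments` configuration mirroring these values might look as follows; `output_dir` is a placeholder, and the listed Adam betas and epsilon are the Trainer's optimizer defaults, so they need no explicit flags.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="fusion_gttbsc_phi-3-best",  # placeholder path
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    seed=42,
    gradient_accumulation_steps=4,  # effective train batch size: 2 x 4 = 8
    lr_scheduler_type="linear",
    num_train_epochs=20,
    fp16=True,  # native AMP mixed-precision training
)
```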
## Framework versions
- Transformers 4.41.2
- Pytorch 2.3.0+cu121
- Datasets 2.19.2
- Tokenizers 0.19.1