johntsi
/

ZeroSwot-Medium_asr-mustc_mt-mustc_en-to-8

+---
+language:
+- ace
+- acm
+- acq
+- aeb
+- af
+- ajp
+- ak
+- als
+- am
+- apc
+- ar
+- ars
+- ary
+- arz
+- as
+- ast
+- awa
+- ayr
+- azb
+- azj
+- ba
+- bm
+- ban
+- be
+- bem
+- bn
+- bho
+- bjn
+- bo
+- bs
+- bug
+- bg
+- ca
+- ceb
+- cs
+- cjk
+- ckb
+- crh
+- cy
+- da
+- de
+- dik
+- dyu
+- dz
+- el
+- en
+- eo
+- et
+- eu
+- ee
+- fo
+- fj
+- fi
+- fon
+- fr
+- fur
+- fuv
+- gaz
+- gd
+- ga
+- gl
+- gn
+- gu
+- ht
+- ha
+- he
+- hi
+- hne
+- hr
+- hu
+- hy
+- ig
+- ilo
+- id
+- is
+- it
+- jv
+- ja
+- kab
+- kac
+- kam
+- kn
+- ks
+- ka
+- kk
+- kbp
+- kea
+- khk
+- km
+- ki
+- rw
+- ky
+- kmb
+- kmr
+- knc
+- kg
+- ko
+- lo
+- lij
+- li
+- ln
+- lt
+- lmo
+- ltg
+- lb
+- lua
+- lg
+- luo
+- lus
+- lvs
+- mag
+- mai
+- ml
+- mar
+- min
+- mk
+- mt
+- mni
+- mos
+- mi
+- my
+- nl
+- nn
+- nb
+- npi
+- nso
+- nus
+- ny
+- oc
+- ory
+- pag
+- pa
+- pap
+- pbt
+- pes
+- plt
+- pl
+- pt
+- prs
+- quy
+- ro
+- rn
+- ru
+- sg
+- sa
+- sat
+- scn
+- shn
+- si
+- sk
+- sl
+- sm
+- sn
+- sd
+- so
+- st
+- es
+- sc
+- sr
+- ss
+- su
+- sv
+- swh
+- szl
+- ta
+- taq
+- tt
+- te
+- tg
+- tl
+- th
+- ti
+- tpi
+- tn
+- ts
+- tk
+- tum
+- tr
+- tw
+- tzm
+- ug
+- uk
+- umb
+- ur
+- uzn
+- vec
+- vi
+- war
+- wo
+- xh
+- ydd
+- yo
+- yue
+- zh
+- zsm
+- zu
+language_details: >-
+  ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab,
+  aka_Latn, amh_Ethi, apc_Arab, arb_Arab, ars_Arab, ary_Arab, arz_Arab,
+  asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl,
+  bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn,
+  bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn,
+  cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn,
+  dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn,
+  ewe_Latn, fao_Latn, pes_Arab, fij_Latn, fin_Latn, fon_Latn, fra_Latn,
+  fur_Latn, fuv_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr,
+  hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn,
+  hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn, jav_Latn,
+  jpn_Jpan, kab_Latn, kac_Latn, kam_Latn, kan_Knda, kas_Arab, kas_Deva,
+  kat_Geor, knc_Arab, knc_Latn, kaz_Cyrl, kbp_Latn, kea_Latn, khm_Khmr,
+  kik_Latn, kin_Latn, kir_Cyrl, kmb_Latn, kon_Latn, kor_Hang, kmr_Latn,
+  lao_Laoo, lvs_Latn, lij_Latn, lim_Latn, lin_Latn, lit_Latn, lmo_Latn,
+  ltg_Latn, ltz_Latn, lua_Latn, lug_Latn, luo_Latn, lus_Latn, mag_Deva,
+  mai_Deva, mal_Mlym, mar_Deva, min_Latn, mkd_Cyrl, plt_Latn, mlt_Latn,
+  mni_Beng, khk_Cyrl, mos_Latn, mri_Latn, zsm_Latn, mya_Mymr, nld_Latn,
+  nno_Latn, nob_Latn, npi_Deva, nso_Latn, nus_Latn, nya_Latn, oci_Latn,
+  gaz_Latn, ory_Orya, pag_Latn, pan_Guru, pap_Latn, pol_Latn, por_Latn,
+  prs_Arab, pbt_Arab, quy_Latn, ron_Latn, run_Latn, rus_Cyrl, sag_Latn,
+  san_Deva, sat_Beng, scn_Latn, shn_Mymr, sin_Sinh, slk_Latn, slv_Latn,
+  smo_Latn, sna_Latn, snd_Arab, som_Latn, sot_Latn, spa_Latn, als_Latn,
+  srd_Latn, srp_Cyrl, ssw_Latn, sun_Latn, swe_Latn, swh_Latn, szl_Latn,
+  tam_Taml, tat_Cyrl, tel_Telu, tgk_Cyrl, tgl_Latn, tha_Thai, tir_Ethi,
+  taq_Latn, taq_Tfng, tpi_Latn, tsn_Latn, tso_Latn, tuk_Latn, tum_Latn,
+  tur_Latn, twi_Latn, tzm_Tfng, uig_Arab, ukr_Cyrl, umb_Latn, urd_Arab,
+  uzn_Latn, vec_Latn, vie_Latn, war_Latn, wol_Latn, xho_Latn, ydd_Hebr,
+  yor_Latn, yue_Hant, zho_Hans, zho_Hant, zul_Latn
+license: mit
+metrics:
+- bleu
+datasets:
+- mozilla-foundation/common_voice_8_0
+pipeline_tag: automatic-speech-recognition
+tags:
+- zeroswot
+- speech translation
+- zero-shot
+- end-to-end
+- nllb
+- wav2vec2
+---
+# ZeroSwot ✨🤖✨
+<!-- <div style='display:flex; gap: 0.25rem; '>
+<a href='https://arxiv.org/abs/2402.10422'><img src='https://img.shields.io/badge/paper-PDF-green'></a>
+<a href='https://github.com/mt-upc/ZeroSwot/blob/main/LICENSE'><img src='https://img.shields.io/badge/License-MIT-blue.svg'></a>
+<a href='https://github.com/mt-upc/ZeroSwot'><img src='https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white'></a>
+</div> -->
+ZeroSwot is a state-of-the-art zero-shot end-to-end Speech Translation system.
+<div align=center><img src="resources/intro.png" height="65%" width="65%"/></div>
+The model is created by adapting a wav2vec2.0-based encoder to the embedding space of NLLB, using a novel subword compression module and Optimal Transport, while only utilizing ASR data. It thus enables **Zero-shot E2E Speech Translation to all the 200 languages supported by NLLB**.
+For more details, please refer to our [paper](https://arxiv.org/abs/2402.10422) and the [original repo](https://github.com/mt-upc/ZeroSwot) built on fairseq.
+## Architecture
+The compression module is a lightweight transformer that takes as input the hidden states of wav2vec2.0 and the corresponding CTC predictions, and compresses them into subword-like embeddings similar to those NLLB expects. These embeddings are aligned to the NLLB embedding space using Optimal Transport. At inference, we simply pass the output of the speech encoder to the NLLB encoder.
+<div align=center><img src="resources/methodology.png" height="120%" width="120%"/></div>
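As a rough illustration of the compression idea described above, the pure-Python sketch below collapses frame-level vectors in two stages: runs of identical CTC predictions are averaged into one vector per character (blanks are dropped), and the character vectors are then grouped into tokens at each separator. This is not the repository's implementation. Plain lists stand in for tensors, mean pooling stands in for the learned transformer pooling, and the helper names `mean` and `compress` are hypothetical. The `BLANK` and `SEP` values are the `blank_idx` and `sep_idx` from this repo's `config.json`.

```python
from itertools import groupby

BLANK, SEP = 0, 4  # blank_idx and sep_idx as listed in config.json


def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def compress(frames, preds):
    """Collapse per-frame vectors into one vector per separator-delimited token."""
    # Stage 1: average each run of identical predictions, skipping blank runs.
    chars, start = [], 0
    for label, run in groupby(preds):
        length = len(list(run))
        if label != BLANK:
            chars.append((label, mean(frames[start:start + length])))
        start += length
    # Stage 2: group character vectors into tokens; a separator closes a token.
    tokens, current = [], []
    for label, vec in chars:
        current.append(vec)
        if label == SEP:
            tokens.append(mean(current))  # assumption: mean instead of learned pooling
            current = []
    if current:  # trailing characters without a closing separator
        tokens.append(mean(current))
    return tokens


# Eight frames with 2-dimensional features; labels 5 and 6 are arbitrary characters.
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 0.0], [2.0, 2.0],
          [9.0, 9.0], [4.0, 4.0], [6.0, 2.0], [8.0, 8.0]]
preds = [5, 5, 0, 6, 4, 0, 5, 4]
tokens = compress(frames, preds)
print(len(tokens))  # 2
```

Eight input frames become two token vectors here, which is the length reduction that lets the speech sequence resemble a text sequence.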
+## Version
+This version of ZeroSwot is trained with ASR data from MuST-C v1.0, and adapts [wav2vec2.0-large](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self) to the [nllb-200-distilled-600M_mustc_en-to-8](https://huggingface.co/johntsi/nllb-200-distilled-600M_mustc_en-to-8) model, an nllb-200-distilled-600M model finetuned with MuST-C.
+We have more versions available:
+| Models | ASR data | NLLB version |
+|:------:|:--------:|:------------:|
+| [ZeroSwot-Medium_asr-mustc](https://huggingface.co/johntsi/ZeroSwot-Medium_asr-mustc_en-to-200) | MuST-C v1.0 | [distilled-600M original](https://huggingface.co/facebook/nllb-200-distilled-600M)|
+| [ZeroSwot-Medium_asr-mustc_mt-mustc](https://huggingface.co/johntsi/ZeroSwot-Medium_asr-mustc_mt-mustc_en-to-8)  | MuST-C v1.0 | [distilled-600M finetuned w/ MuST-C](https://huggingface.co/johntsi/nllb-200-distilled-600M_mustc_en-to-8) |
+| [ZeroSwot-Large_asr-mustc](https://huggingface.co/johntsi/ZeroSwot-Large_asr-mustc_en-to-200)  | MuST-C v1.0 | [distilled-1.3B original](https://huggingface.co/facebook/nllb-200-distilled-1.3B) |
+| [ZeroSwot-Large_asr-mustc_mt-mustc](https://huggingface.co/johntsi/ZeroSwot-Large_asr-mustc_mt-mustc_en-to-8) | MuST-C v1.0 | [distilled-1.3B finetuned w/ MuST-C](https://huggingface.co/johntsi/nllb-200-distilled-1.3B_mustc_en-to-8) |
+| [ZeroSwot-Medium_asr-cv](https://huggingface.co/johntsi/ZeroSwot-Medium_asr-cv_en-to-200) | CommonVoice | [distilled-600M original](https://huggingface.co/facebook/nllb-200-distilled-600M)|
+| [ZeroSwot-Medium_asr-cv_mt-covost2](https://huggingface.co/johntsi/ZeroSwot-Medium_asr-cv_mt-covost2_en-to-15) | CommonVoice  | [distilled-600M finetuned w/ CoVoST2](https://huggingface.co/johntsi/nllb-200-distilled-600M_covost2_en-to-15) |
+| [ZeroSwot-Large_asr-cv](https://huggingface.co/johntsi/ZeroSwot-Large_asr-cv_en-to-200) | CommonVoice  | [distilled-1.3B original](https://huggingface.co/facebook/nllb-200-distilled-1.3B) |
+| [ZeroSwot-Large_asr-cv_mt-covost2](https://huggingface.co/johntsi/ZeroSwot-Large_asr-cv_mt-covost2_en-to-15) | CommonVoice  | [distilled-1.3B finetuned w/ CoVoST2](https://huggingface.co/johntsi/nllb-200-distilled-1.3B_covost2_en-to-15) |
+## Usage
+The model is tested with Python 3.9.16 and Transformers v4.41.2. Also install torchaudio and sentencepiece for processing.
+```bash
+pip install transformers torchaudio sentencepiece
+```
+```python
+from transformers import Wav2Vec2Processor, NllbTokenizer, AutoModel, AutoModelForSeq2SeqLM
+import torchaudio
+def load_and_resample_audio(audio_path, target_sr=16000):
+    audio, orig_freq = torchaudio.load(audio_path)
+    if orig_freq != target_sr:
+        audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=target_sr)
+    audio = audio.squeeze(0).numpy()
+    return audio
+# Load processors and tokenizers
+processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
+tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
+# Load ZeroSwot Encoder
+commit_hash = "eafabee295ea1c8b45483d1fd26bd747d9a7d937"
+zeroswot_encoder = AutoModel.from_pretrained(
+    "johntsi/ZeroSwot-Medium_asr-cv_en-to-200", trust_remote_code=True, revision=commit_hash,
+)
+zeroswot_encoder.eval()
+zeroswot_encoder.to("cuda")
+# Load NLLB Model
+nllb_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
+nllb_model.eval()
+nllb_model.to("cuda")
+# Load audio file
+path_to_audio_file = "resources/sample.wav"  # sample file for testing; replace with your own audio
+audio = load_and_resample_audio(path_to_audio_file)
+input_values = processor(audio, sampling_rate=16000, return_tensors="pt").to("cuda")
+# translation to German
+compressed_embeds, attention_mask = zeroswot_encoder(**input_values)
+predicted_ids = nllb_model.generate(
+    inputs_embeds=compressed_embeds,
+    attention_mask=attention_mask,
+    forced_bos_token_id=tokenizer.lang_code_to_id["deu_Latn"],
+    num_beams=5,
+)
+translation = tokenizer.decode(predicted_ids[0], skip_special_tokens=True)
+print(translation)
+```
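The snippet above translates to German by passing `"deu_Latn"` as the forced BOS language. For a model named `en-to-8`, a small lookup table can make the choice of target language less error-prone. The sketch below assumes that the eight targets are the MuST-C v1.0 target languages (German, Spanish, French, Italian, Dutch, Portuguese, Romanian, Russian); the table `MUSTC_TARGETS` and the helper `nllb_code` are hypothetical names, not part of this repository.

```python
# Assumption: "en-to-8" refers to the eight MuST-C v1.0 target languages.
# Values are NLLB language codes, all of which appear in `language_details`.
MUSTC_TARGETS = {
    "de": "deu_Latn",
    "es": "spa_Latn",
    "fr": "fra_Latn",
    "it": "ita_Latn",
    "nl": "nld_Latn",
    "pt": "por_Latn",
    "ro": "ron_Latn",
    "ru": "rus_Cyrl",
}


def nllb_code(lang):
    """Map a two-letter language code to its NLLB code (hypothetical helper)."""
    try:
        return MUSTC_TARGETS[lang.lower()]
    except KeyError:
        supported = ", ".join(sorted(MUSTC_TARGETS))
        raise ValueError(f"unsupported target {lang!r}; expected one of: {supported}") from None


print(nllb_code("de"))  # deu_Latn
```

The returned string can be used wherever the usage snippet has `"deu_Latn"`, for example `tokenizer.lang_code_to_id[nllb_code("fr")]`.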
+## Results
+BLEU scores on CoVoST-2 test compared to supervised SOTA models [XLS-R-1B](https://huggingface.co/facebook/wav2vec2-xls-r-1b) and [SeamlessM4T-Medium](https://huggingface.co/facebook/seamless-m4t-medium). You can refer to Table 5 of the Results section in the paper for more details.
+|     Models     |  ZS  |  Size (B)  |  Ar  |  Ca  |  Cy  |  De  |  Et  |  Fa  |  Id  |  Ja  |  Lv  |  Mn  |  Sl  |  Sv  |  Ta  |  Tr  |  Zh  | Average |
+|:--------------:|:----:|:----------:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:-------:|
+|    [XLS-R-1B](https://huggingface.co/facebook/wav2vec2-xls-r-1b)    |  ✗   |    1.0     | 19.2 | 32.1 | **31.8** | 26.2 | 22.4 | 21.3 | 30.3 | 39.9 | 22.0 | 14.9 | 25.4 | 32.3 | 18.1 | 17.1 | 36.7 |   26.0  |
+| [SeamlessM4T-Medium](https://huggingface.co/facebook/seamless-m4t-medium)  |  ✗   |    1.2     | 20.8 | 37.3 | 29.9 | **31.4** | 23.3 | 17.2 | 34.8 | 37.5 | 19.5 | 12.9 | 29.0 | 37.3 | 18.9 | **19.8** | 30.0 |   26.6  |
+| [ZeroSwot-M_asr-cv](https://huggingface.co/johntsi/ZeroSwot-Medium_asr-cv_en-to-200) |  ✓   | 0.35/0.95  | 17.6 | 32.5 | 18.0 | 29.9 | 20.4 | 16.3 | 32.4 | 32.0 | 13.3 | 10.0 | 25.2 | 34.4 | 17.8 | 15.6 | 30.5 |   23.1  |
+| [ZeroSwot-M_asr-cv_mt-covost2](https://huggingface.co/johntsi/ZeroSwot-Medium_asr-cv_mt-covost2_en-to-15) |  ✓   | 0.35/0.95  | **24.4** | **38.7** | 28.8 | 31.2 | **26.2** | **26.0** | **36.0** | **46.0** | **24.8** | **19.0** | **31.6** | **37.8** | **24.4** | 18.6 | **39.0** |   **30.2**  |
+## Citation
+If you find ZeroSwot useful for your research, please cite our paper :)
+```
+@misc{tsiamas2024pushing,
+      title={{Pushing the Limits of Zero-shot End-to-End Speech Translation}},
+      author={Ioannis Tsiamas and Gerard I. Gállego and José A. R. Fonollosa and Marta R. Costa-jussà},
+      year={2024},
+      eprint={2402.10422},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+```

config.json ADDED

	@@ -0,0 +1,27 @@

+{
+  "_name_or_path": "johntsi/ZeroSwot-Medium_asr-mustc_mt-mustc_en-to-8/model.safetensors",
+  "architectures": [
+    "ZeroSwotEncoderModel"
+  ],
+  "auto_map": {
+    "AutoConfig": "model.ZeroSwotEncoderConfig",
+    "AutoModel": "model.ZeroSwotEncoderModel"
+  },
+  "compression_adapter": {
+    "blank_idx": 0,
+    "dropout": 0.1,
+    "embed_dim": 1024,
+    "sep_idx": 4,
+    "transformer_layers": 3
+  },
+  "embed_dim": 1024,
+  "model_type": "zero_swot_encoder",
+  "nllb_model_name_or_path": "johntsi/nllb-200-distilled-600M_mustc_en-to-8",
+  "speech_embedder": {
+    "nllb_eng_id": 256047,
+    "nllb_eos_id": 2
+  },
+  "torch_dtype": "float32",
+  "transformers_version": "4.41.2",
+  "wav2vec2_model_name_or_path": "facebook/wav2vec2-large-960h-lv60-self"
+}
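The fields above have to agree with each other for the encoder to work: the adapter output is concatenated with BOS/EOS embeddings of size `embed_dim`, and the blank and separator indices play different roles in compression. The following sketch parses an excerpt of the config with the standard `json` module and runs a few sanity checks. `check_config` is a hypothetical helper, and the checks are assumptions drawn from reading `model.py`, not validation performed by the library.

```python
import json

# Excerpt of the config.json shown above.
CONFIG_EXCERPT = """
{
  "compression_adapter": {"blank_idx": 0, "dropout": 0.1, "embed_dim": 1024,
                          "sep_idx": 4, "transformer_layers": 3},
  "embed_dim": 1024,
  "nllb_model_name_or_path": "johntsi/nllb-200-distilled-600M_mustc_en-to-8",
  "wav2vec2_model_name_or_path": "facebook/wav2vec2-large-960h-lv60-self"
}
"""


def check_config(cfg):
    """Return a list of human-readable problems (hypothetical helper)."""
    adapter = cfg["compression_adapter"]
    problems = []
    if adapter["embed_dim"] != cfg["embed_dim"]:
        problems.append("adapter embed_dim differs from top-level embed_dim")
    if adapter["blank_idx"] == adapter["sep_idx"]:
        problems.append("blank_idx and sep_idx must be different")
    if adapter["transformer_layers"] < 1:
        problems.append("transformer_layers must be at least 1")
    if not 0.0 <= adapter["dropout"] < 1.0:
        problems.append("dropout must be in [0, 1)")
    return problems


cfg = json.loads(CONFIG_EXCERPT)
print(check_config(cfg))  # []
```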

model.py ADDED

	@@ -0,0 +1,366 @@

+from transformers import PreTrainedModel, PretrainedConfig, Wav2Vec2ForCTC
+import json
+import torch
+from torch import nn
+from torch.nn.utils.rnn import pad_sequence
+import math
+from typing import Optional
+# x: torch.FloatTensor [T, B, D]
+# mask: torch.BoolTensor [B, T], where True indicates padding
+# returns: torch.LongTensor [B]
+def get_lengths(x, mask=None):
+    if mask is not None:
+        return (~mask).long().sum(dim=1)
+    else:
+        return torch.LongTensor([x.size(0)] * x.size(1)).to(x.device)
+# lens: torch.LongTensor [B]
+# returns: torch.BoolTensor [B, max_lens], where True indicates padding
+def lengths_to_padding_mask(lens):
+    bsz, max_lens = lens.size(0), torch.max(lens).item()
+    mask = torch.arange(max_lens).to(lens.device).view(1, max_lens)
+    mask = mask.expand(bsz, -1) >= lens.view(bsz, 1).expand(-1, max_lens)
+    return mask
+# input_lengths: torch.LongTensor [B]
+def get_output_lengths(input_lengths):
+    # wav2vec2.0 feature-extractor layers as (channels, kernel_size, stride)
+    conv_cfg_list = [(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512, 2, 2)] * 2
+    def _conv_out_length(input_length, kernel_size, stride):
+        return torch.floor((input_length - kernel_size) / stride + 1)
+    for i in range(len(conv_cfg_list)):
+        input_lengths = _conv_out_length(
+            input_lengths, conv_cfg_list[i][1], conv_cfg_list[i][2]
+        )
+    return input_lengths.to(torch.long)
+class ZeroSwotEncoderConfig(PretrainedConfig):
+    model_type = "zero_swot_encoder"
+    def __init__(
+        self,
+        wav2vec2_model_name_or_path="",
+        compression_adapter=None,
+        embed_dim=1024,
+        **kwargs
+    ):
+        super().__init__(**kwargs)
+        self.wav2vec2_model_name_or_path = wav2vec2_model_name_or_path
+        self.compression_adapter = compression_adapter
+        self.embed_dim = embed_dim
+    @classmethod
+    def from_json_file(cls, json_file):
+        with open(json_file, "r") as reader:
+            text = reader.read()
+        config_dict = json.loads(text)
+        return cls(**config_dict)
+class ZeroSwotEncoderModel(PreTrainedModel):
+    config_class = ZeroSwotEncoderConfig
+    model_type = "zero_swot_encoder"
+    def __init__(self, config):
+        super().__init__(config)
+        self.wav2vec2 = Wav2Vec2ForCTC.from_pretrained(config.wav2vec2_model_name_or_path)
+        self.compression_adapter = CompressionAdapter(config.compression_adapter)
+        self.speech_embedder = SpeechEmbedder(config.embed_dim)
+    def forward(self, input_values, attention_mask=None):
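+        if attention_mask is None:
+            # no mask given: treat every example in the batch as unpadded
+            attention_mask = torch.ones_like(input_values, dtype=torch.long)
+        # number of valid (non-padded) samples per example
+        input_lens = attention_mask.long().sum(dim=1)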
+        # Forward pass through wav2vec2 encoder
+        x = self.wav2vec2.wav2vec2(input_values, attention_mask)[0]  # [B, T, D]
+        # CTC predictions
+        preds = self.wav2vec2.lm_head(x).argmax(-1)  # [B, T]
+        # Get output lengths for x
+        output_lens = get_output_lengths(input_lens)
+        # Compression
+        x, mask, _ = self.compression_adapter(x, preds, output_lens) # [B, N, D] with N << T
+        # BOS and EOS embeddings
+        x, mask = self.speech_embedder(x, mask) # [B, N+2, D]
+        return x, ~mask
+class SpeechEmbedder(nn.Module):
+    def __init__(self, embed_dim):
+        super().__init__()
+        self.embed_dim = embed_dim
+        self.bos_emb = nn.Parameter(torch.empty(embed_dim))
+        self.eos_emb = nn.Parameter(torch.empty(embed_dim))
+        self.scale = self.embed_dim ** 0.5
+    def forward(self, x, padding_mask=None):
+        """Prepend a BOS embedding, append an EOS embedding, and scale by sqrt(embed_dim).
+        Args:
+            x (FloatTensor): (B, T, C)
+            padding_mask (BoolTensor): (B, T), True indicates padding
+        Outputs:
+            x (FloatTensor): (B, T+2, C)
+            padding_mask (BoolTensor): (B, T+2), True indicates padding
+        """
+        B = x.size(0)
+        lengths = get_lengths(x.transpose(0, 1), padding_mask)
+        assert B == len(lengths)
+        if padding_mask is not None:
+            x = x * (1 - padding_mask.unsqueeze(-1).type_as(x))
+        # prepend bos
+        x = torch.cat([self.bos_emb.view(1, 1, -1).expand(B, 1, -1), x], dim=1)
+        lengths += 1
+        # append padding (zeros) and then convert first padding to eos
+        x = torch.cat([x, torch.zeros(B, 1, x.size(-1), device=x.device, dtype=x.dtype)], dim=1)
+        for i in range(B):
+            x[i, lengths[i], :] = self.eos_emb
+        lengths += 1
+        padding_mask = lengths_to_padding_mask(lengths)
+        x = x * self.scale
+        return x, padding_mask
+class PositionalEmbedding(nn.Module):
+    def __init__(self, num_embeddings, embedding_dim, padding_idx):
+        super().__init__()
+        self.embedding_dim = embedding_dim
+        self.padding_idx = padding_idx if padding_idx is not None else 0
+        num_embeddings += self.padding_idx + 1  # use the resolved value so padding_idx=None does not fail
+        self.weights = PositionalEmbedding.get_embedding(
+            num_embeddings, embedding_dim, padding_idx
+        )
+        self.register_buffer("_float_tensor", torch.FloatTensor(1))
+        self.max_positions = int(1e5)
+    @staticmethod
+    def get_embedding(
+        num_embeddings: int, embedding_dim: int, padding_idx: Optional[int] = None
+    ):
+        half_dim = embedding_dim // 2
+        emb = math.log(10000) / (half_dim - 1)
+        emb = torch.exp(torch.arange(half_dim, dtype=torch.float) * -emb)
+        emb = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0)
+        emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1).view(num_embeddings, -1)
+        if embedding_dim % 2 == 1:
+            # zero pad
+            emb = torch.cat([emb, torch.zeros(num_embeddings, 1)], dim=1)
+        if padding_idx is not None:
+            emb[padding_idx, :] = 0
+        return emb
+    def make_positions(self, x, padding_idx: int):
+        mask = x.ne(padding_idx).int()
+        return (torch.cumsum(mask, dim=1).type_as(mask) * mask).long() + padding_idx
+    def forward(self, input):
+        """Input is expected to be of size [bsz x seqlen]."""
+        bsz, seq_len = input.size()
+        max_pos = self.padding_idx + 1 + seq_len
+        if self.weights is None or max_pos > self.weights.size(0):
+            # recompute/expand embeddings if needed
+            self.weights = PositionalEmbedding.get_embedding(
+                max_pos, self.embedding_dim, self.padding_idx
+            )
+        self.weights = self.weights.to(self._float_tensor)
+        positions = self.make_positions(input, self.padding_idx)
+        return (
+            self.weights.index_select(0, positions.view(-1))
+            .view(bsz, seq_len, -1)
+            .detach()
+        )
+class CLSPooling(nn.Module):
+    def __init__(self, embed_dim, num_transformer_layers, dropout_rate):
+        super().__init__()
+        self.cls_token = nn.Parameter(torch.empty(1, 1, embed_dim))
+        nn.init.normal_(self.cls_token, mean=0.0, std=0.25)
+        self.transformer = nn.TransformerEncoder(
+            nn.TransformerEncoderLayer(
+                embed_dim,
+                nhead=16 if embed_dim == 1024 else 8,
+                dim_feedforward=4*embed_dim,
+                dropout=dropout_rate,
+                activation="relu",
+                batch_first=True,
+                norm_first=True
+            ),
+            num_layers=num_transformer_layers,
+        )
+        self.pos_emb = PositionalEmbedding(512, embed_dim, 1)
+        self.scale = math.sqrt(embed_dim)
+    def forward(self, x, lens):
+        # x: [B, N, D]
+        # lens: [B]
+        # prepend cls token
+        x = torch.cat(
+            [
+                self.cls_token.to(dtype=x.dtype, device=x.device).repeat(x.size(0), 1, 1), # B x 1 x D
+                x
+            ],
+        dim=1) # [B, N+1, D]
+        mask = lengths_to_padding_mask(lens+1)
+        x = x + self.pos_emb(mask.long()) / self.scale
+        x = self.transformer(x, src_key_padding_mask=mask) # [B, N+1, D]
+        x = x[:, 0] # [B, D]
+        return x
+class CompressionAdapter(nn.Module):
+    def __init__(self, cfg):
+        super().__init__()
+        self.embed_dim = cfg["embed_dim"]
+        self.transformer_layers = cfg["transformer_layers"]
+        self.dropout = cfg["dropout"]
+        self.blank_idx = cfg["blank_idx"]
+        self.sep_idx = cfg["sep_idx"]
+        self.token_pooling_module = CLSPooling(
+            self.embed_dim, self.transformer_layers, self.dropout
+        )
+    def char_compression(self, x, preds, lens):
+        # x: B x T x D
+        # preds: B x T
+        # lens: B
+        B, T, D = x.size()
+        device = x.device
+        dtype = x.dtype
+        # zero-out the padding
+        mask = lengths_to_padding_mask(lens) # B x T
+        x = x.masked_fill(mask.unsqueeze(-1), 0)
+        preds = preds.masked_fill(mask, self.blank_idx)
+        # add a vector of -1 to know where each example ends after flattening the batch
+        preds = torch.cat([-torch.ones(B, 1, device=device, dtype=torch.long), preds], dim=1).view(-1)
+        x = torch.cat([torch.zeros(B, 1, D, device=device, dtype=dtype), x], dim=1).view(-1, D)
+        # get points of consecutive preds
+        preds, counts = preds.unique_consecutive(return_counts=True)
+        # split in representations of same chars
+        x = torch.split(x, counts.tolist())
+        # remove blanks
+        valid_mask = preds != self.blank_idx
+        preds = preds[valid_mask]
+        counts = counts[valid_mask] # [N]
+        x = [x_i for x_i, v_i in zip(x, valid_mask) if v_i]
+        # pack into tensor
+        x = pad_sequence(x, batch_first=True, padding_value=0)
+        # char pooling
+        x = torch.sum(x, dim=1) / counts.to(dtype=x.dtype).unsqueeze(1) # [N, max_count, D] -> [N, D]
+        # find split points for retrieving the examples
+        split_points = (preds == -1).nonzero(as_tuple=True)[0]
+        split_points = torch.cat([split_points, torch.tensor([len(preds)], device=device)])
+        split_points = (split_points[1:] - split_points[:-1]).tolist()
+        # split into examples
+        x = torch.split(x, split_points)
+        preds = torch.split(preds, split_points)
+        lens = torch.tensor([len(x_i) for x_i in x], device=device)
+        # pack into tensors
+        x = pad_sequence(x, batch_first=True, padding_value=0)
+        preds = pad_sequence(preds, batch_first=True, padding_value=self.blank_idx)
+        # remove the parts we add to identify the bounds for each example
+        x = x[:, 1:]
+        preds = preds[:, 1:]
+        lens -= 1
+        mask = lengths_to_padding_mask(lens)
+        # account for empty examples (just a sep token)
+        empty_examples = lens == 0
+        num_empty_examples = empty_examples.sum()
+        if num_empty_examples > 0:
+            mask[empty_examples, 0] = True
+            lens[empty_examples] = 1
+            preds[empty_examples, 0] = self.sep_idx
+        return x, mask, lens, preds, num_empty_examples
+    def token_compression(self, x, preds, lens):
+        # x: B x T x D
+        # preds: B x T
+        # lens: B
+        B, T, D = x.size()
+        device = x.device
+        dtype = x.dtype
+        # new lengths after compression
+        new_lens = preds.eq(self.sep_idx).sum(dim=1)
+        # unpad and unpack to list of tensors
+        preds = [preds[i, :lens[i]] for i in range(B)]
+        x = [x[i, :lens[i]] for i in range(B)]
+        # make sure every example ends with a separator
+        num_examples_without_ending_sep = torch.tensor(0, device=device, dtype=torch.long)
+        for i in range(B):
+            if preds[i][-1] != self.sep_idx:
+                preds[i] = torch.cat([preds[i], torch.tensor([self.sep_idx], device=device, dtype=torch.long)])
+                x[i] = torch.cat([x[i], torch.zeros(1, D, device=device, dtype=dtype)])
+                new_lens[i] += 1
+                num_examples_without_ending_sep += 1
+        # flatten
+        preds = torch.cat(preds)
+        x = torch.cat(x)
+        # split points according to separators
+        split_points = preds.eq(self.sep_idx).nonzero(as_tuple=True)[0] + 1
+        split_points = torch.cat([torch.tensor([0], device=device, dtype=torch.long), split_points])
+        split_points = (split_points[1:] - split_points[:-1]).tolist()
+        # re-arrange in 3d [total_num_tokens x max(count) x D]
+        x = torch.split(x, split_points) # Tuple[2d tensor]
+        counts = torch.tensor([len(x_i) for x_i in x], device=device, dtype=torch.long)
+        x = pad_sequence(x, batch_first=True, padding_value=0)
+        # reduce dim 1
+        x = self.token_pooling_module(x, counts)
+        # reconstruct the batch
+        split_points = new_lens.cumsum(dim=0)
+        split_points = torch.cat([torch.tensor([0], device=device, dtype=torch.long), split_points])
+        split_points = (split_points[1:] - split_points[:-1]).tolist()
+        x = torch.split(x, split_points)
+        x = pad_sequence(x, batch_first=True, padding_value=0) # B x ? x D
+        mask = lengths_to_padding_mask(new_lens)
+        return x, mask, new_lens, num_examples_without_ending_sep
+    def forward(self, x, preds, lens):
+        x, mask, lens, preds, _ = self.char_compression(x, preds, lens)
+        x, mask, lens, _ = self.token_compression(x, preds, lens)
+        return x, mask, lens
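To make the length bookkeeping in `get_output_lengths` concrete, here is a small worked example in plain Python. It applies the same per-layer formula, `floor((length - kernel_size) / stride) + 1`, with the kernel sizes and strides listed in `model.py`, but on plain integers instead of tensors. The function name `frames_for_samples` is hypothetical.

```python
# (channels, kernel_size, stride) per convolutional layer, as listed in model.py
CONV_LAYERS = [(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512, 2, 2)] * 2


def frames_for_samples(num_samples):
    """Number of encoder frames produced for `num_samples` raw audio samples."""
    length = num_samples
    for _, kernel_size, stride in CONV_LAYERS:
        # integer floor division matches torch.floor on the same expression
        length = (length - kernel_size) // stride + 1
    return length


print(frames_for_samples(16000))  # 49 frames for one second of 16 kHz audio
print(frames_for_samples(32000))  # 99 frames for two seconds
```

One second of 16 kHz audio therefore yields 49 frames, roughly one frame every 20 ms, before the compression adapter shortens the sequence further.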

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b7bf47ed603d355b7b4b8f7d23d0331cdf262c2a2e0ef1018320d3d57abe0ceb
+size 1413115412