	---
	license: mit
	datasets:
	- mozilla-foundation/common_voice_4_0
	- google/fleurs
	- llm-lab/SpeechBrown
	pipeline_tag: automatic-speech-recognition
	---

	[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2412.13071) [![GitHub](https://img.shields.io/badge/GitHub-Code-181717?logo=github)](https://github.com/language-modeling-lab/CLASP)

	CLASP (Contrastive Language-Speech Pretraining) is a novel, lightweight, multilingual, multimodal representation designed for audio-text retrieval.
	To learn more about our proposed model, please refer to this [paper](https://arxiv.org/abs/2412.13071). All code is available on this [GitHub page](https://github.com/language-modeling-lab/CLASP).
	The newly introduced dataset, SpeechBrown, which we created for training this model, can be found on [this page](https://huggingface.co/datasets/llm-lab/SpeechBrown).

	CLASP encodes raw speech into semantic embeddings in a 768-dimensional multilingual representation space. These embeddings can be used for tasks such as speech retrieval and classification.

	Here, we have uploaded different versions of the CLASP and LASP models we trained:

	- `CLASP_Concat_Final_Fusion_Encoder.pt`: The best model we trained based on retrieval and classification metrics. It uses the concatenation fusion encoder strategy and is trained with contrastive loss.
	- `CLASP_Gating.pt`: Trained with contrastive loss and employs a gating algorithm.
	- `LASP_Concat.pt`: Trained with Huber loss and employs the concatenation strategy.
	- `LASP_Gating.pt`: Trained with Huber loss and employs the gating algorithm.
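
The checkpoints above differ in their training objective, so a small illustration of the two loss families may help. The sketch below is a generic, dependency-free version of a symmetric contrastive (InfoNCE-style) loss and a Huber loss between paired speech and text embeddings. It is not taken from the CLASP code base: the function names, the `temperature` and `delta` defaults, and the symmetric averaging are assumptions made for illustration, and the exact formulation used in the paper may differ.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def _log_softmax_at(row, idx):
    # Numerically stable log-softmax of row, evaluated at position idx.
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return row[idx] - log_sum


def contrastive_loss(speech, text, temperature=0.07):
    """Symmetric InfoNCE-style loss; pair i of speech/text is the positive.

    The temperature value is an assumed default, not a CLASP setting.
    """
    speech = [_normalize(s) for s in speech]
    text = [_normalize(t) for t in text]
    n = len(speech)
    logits = [[_dot(s, t) / temperature for t in text] for s in speech]
    # Speech-to-text direction: each row should peak on the diagonal.
    s2t = -sum(_log_softmax_at(logits[i], i) for i in range(n)) / n
    # Text-to-speech direction: same idea over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2s = -sum(_log_softmax_at(columns[j], j) for j in range(n)) / n
    return 0.5 * (s2t + t2s)


def huber_loss(speech, text, delta=1.0):
    """Mean element-wise Huber loss between paired embeddings."""
    total, count = 0.0, 0
    for s, t in zip(speech, text):
        for a, b in zip(s, t):
            err = abs(a - b)
            if err <= delta:
                total += 0.5 * err * err          # quadratic region
            else:
                total += delta * (err - 0.5 * delta)  # linear region
            count += 1
    return total / count


# Toy 2-dimensional embeddings; real CLASP embeddings are 768-dimensional.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The contrastive loss only rewards the correct pair for scoring higher than the other pairs in the batch, while the Huber loss pulls each speech embedding directly toward its paired text embedding.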

	To use these models or train your own on custom datasets, please refer to our [GitHub page](https://github.com/language-modeling-lab/CLASP). The `clasp-inference.ipynb` notebook provides an example of loading and using the model.

	## Steps for Inference with CLASP
	1. Load our model with the specified architecture.
	2. Load the [EfficientNet](https://pytorch.org/vision/main/models/generated/torchvision.models.efficientnet_b7.html) encoder for spectrogram encoding.
	3. Load the [HuBERT](https://huggingface.co/facebook/hubert-large-ls960-ft) encoder for self-supervised speech encoding.
	4. Use our scripts to generate embeddings for your audio files.
	5. Use these embeddings for tasks like classification or speech retrieval. For speech retrieval, you can load the [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) sentence transformer to compute the cosine similarity between query and speech embeddings.
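
The retrieval part of step 5 can be sketched as follows. This is a minimal illustration in plain Python, assuming the speech and text embeddings have already been computed; it does not call CLASP, HuBERT, EfficientNet, or LaBSE. The helper names `cosine_similarity` and `rank_speech` are hypothetical, and the toy 3-dimensional vectors stand in for the 768-dimensional embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def rank_speech(query_embedding, speech_embeddings):
    """Return (indices sorted from most to least similar, raw scores)."""
    scores = [cosine_similarity(query_embedding, s) for s in speech_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Placeholder speech embeddings (one row per audio file).
speech_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
# Placeholder for the text embedding of a query.
query_embedding = [0.0, 2.0, 0.0]

order, scores = rank_speech(query_embedding, speech_embeddings)
print(order)
```

Because cosine similarity ignores vector length, the query does not need to be normalized beforehand; only the direction of each embedding affects the ranking.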

	## Architecture Overview

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/64ba58d377dd483716aba098/3Eb-6SXQ6c48jJNZedrsZ.png)

	## Contributions
	1. We introduce CLASP (Contrastive Language-Speech Pretraining), a novel, lightweight, multilingual, multimodal representation designed for audio-text retrieval.
	2. We present a diverse paired speech-text dataset (SpeechBrown) in 15 categories, covering a wide range of topics from fiction to religion.
	3. We demonstrate that combining audio spectrograms with a pre-trained self-supervised speech model enhances audio encoding in retrieval applications.
	4. Evaluations in multiple languages show that CLASP sets new benchmarks on the HITS@1, Mean Reciprocal Rank (MRR), and Mean Rank (meanR) metrics.
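
For readers unfamiliar with the three metrics named above, here is a small sketch of how they are commonly computed from a query-by-candidate similarity matrix. It assumes that the correct candidate for query `i` sits at index `i` and that ties are resolved optimistically; the function name `retrieval_metrics` and the toy scores are illustrative and not taken from the CLASP evaluation code.

```python
def retrieval_metrics(similarity):
    """Compute HITS@1, MRR, and mean rank.

    similarity[i][j] is the score between query i and candidate j.
    The correct candidate for query i is assumed to be candidate i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        # Rank = 1 + number of candidates scored strictly higher than the
        # correct one (ties therefore count in favor of the correct item).
        rank = 1 + sum(1 for score in row if score > row[i])
        ranks.append(rank)
    n = len(ranks)
    return {
        "hits@1": sum(1 for r in ranks if r == 1) / n,  # higher is better
        "mrr": sum(1.0 / r for r in ranks) / n,         # higher is better
        "mean_rank": sum(ranks) / n,                    # lower is better
    }


# Toy scores: the correct items land at ranks 1, 2, and 3 respectively.
similarity = [
    [0.9, 0.1, 0.3],
    [0.8, 0.5, 0.2],
    [0.7, 0.6, 0.4],
]
metrics = retrieval_metrics(similarity)
print(metrics)
```

HITS@1 and MRR improve as they increase, whereas mean rank improves as it decreases, so the three numbers should be read together.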

	## Citations
	If you find our paper, code, data, or models useful, please cite the paper:
	```
	@misc{abootorabi2024claspcontrastivelanguagespeechpretraining,
	title={CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval},
	author={Mohammad Mahdi Abootorabi and Ehsaneddin Asgari},
	year={2024},
	eprint={2412.13071},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2412.13071},
	}
	```

	## Contact
	If you have questions, please email mahdi.abootorabi2@gmail.com or asgari@berkeley.edu.