Aismantas committed on
Commit d6eff3d
1 Parent(s): 0e1b264

Update README.md

Files changed (1)
  1. README.md +111 -0
README.md CHANGED
@@ -1,3 +1,114 @@
  ---
+ language:
+ - lt
+ tags:
+ - audio
+ - automatic-speech-recognition
+ widget:
+ - example_title: Librispeech sample 1
+   src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
+ - example_title: Librispeech sample 2
+   src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
+ pipeline_tag: automatic-speech-recognition
  license: apache-2.0
  ---
+ # Whisper-lithuanian
+
+ Whisper-lithuanian is a fine-tuned Whisper model for automatic speech recognition (ASR). It was trained for about 2 hours
+ on the mozilla-foundation/common_voice_13_0 dataset, filtered to Lithuanian-language samples only.
+
+ # Training log
+
+ [9309/9309 1:53:12, Epoch 3/3]
+
+ | Epoch | Training Loss | Validation Loss |
+ |-------|---------------|-----------------|
+ | 1     | 0.030600      | 0.034302        |
+ | 2     | 0.013200      | 0.030458        |
+ | 3     | 0.004100      | 0.029847        |
+
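The progress line and loss table above can be unpacked with a little arithmetic. The sketch below is illustrative only: it assumes `1:53:12` is elapsed wall-clock time in `h:mm:ss` and that `9309` counts optimizer steps. The variable names are hypothetical and do not come from the training script.

```python
# Values copied from the training log above.
total_steps = 9309
epochs = 3
elapsed = "1:53:12"  # assumed to be h:mm:ss of wall-clock time

hours, minutes, seconds = (int(part) for part in elapsed.split(":"))
elapsed_seconds = hours * 3600 + minutes * 60 + seconds

steps_per_epoch = total_steps // epochs
steps_per_second = total_steps / elapsed_seconds

# (training loss, validation loss) per epoch, from the table above.
losses = {1: (0.030600, 0.034302), 2: (0.013200, 0.030458), 3: (0.004100, 0.029847)}

# Gap between validation and training loss; a widening gap can hint at overfitting.
gaps = {epoch: round(val - train, 6) for epoch, (train, val) in losses.items()}

print(f"{steps_per_epoch} steps per epoch, {steps_per_second:.2f} steps/s")
for epoch, gap in gaps.items():
    print(f"epoch {epoch}: validation - training loss = {gap:.6f}")
```

In this log the validation loss still improves slightly at every epoch, while the training loss falls much faster.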
+ The original code repository can be found [here](https://github.com/openai/whisper).
+
+ # Whisper model list
+
+ | Size     | Parameters | English-only                                         | Multilingual                                        |
+ |----------|------------|------------------------------------------------------|-----------------------------------------------------|
+ | tiny     | 39 M       | [✓](https://huggingface.co/openai/whisper-tiny.en)   | [✓](https://huggingface.co/openai/whisper-tiny)     |
+ | base     | 74 M       | [✓](https://huggingface.co/openai/whisper-base.en)   | [✓](https://huggingface.co/openai/whisper-base)     |
+ | small    | 244 M      | [✓](https://huggingface.co/openai/whisper-small.en)  | [✓](https://huggingface.co/openai/whisper-small)    |
+ | medium   | 769 M      | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium)   |
+ | large    | 1550 M     | x                                                    | [✓](https://huggingface.co/openai/whisper-large)    |
+ | large-v2 | 1550 M     | x                                                    | [✓](https://huggingface.co/openai/whisper-large-v2) |
+ | large-v3 | 1550 M     | x                                                    | [✓](https://huggingface.co/openai/whisper-large-v3) |
+
+ ## Usage
+
+ ```python
+ import torch
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
+ from datasets import load_dataset
+
+ # Use the GPU with half precision when available, otherwise fall back to CPU.
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+
+ # Base Whisper checkpoint; substitute this repository's model id to load the fine-tuned weights.
+ model_id = "openai/whisper-large-v3"
+
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
+ )
+ model.to(device)
+
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ # Long audio is split into 30-second chunks that are transcribed in batches.
+ pipe = pipeline(
+     "automatic-speech-recognition",
+     model=model,
+     tokenizer=processor.tokenizer,
+     feature_extractor=processor.feature_extractor,
+     max_new_tokens=128,
+     chunk_length_s=30,
+     batch_size=16,
+     return_timestamps=True,
+     torch_dtype=torch_dtype,
+     device=device,
+ )
+
+ # Transcribe one sample from a long-form LibriSpeech validation set.
+ dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
+ sample = dataset[0]["audio"]
+
+ result = pipe(sample)
+ print(result["text"])
+ ```
+
+ ## Fine-Tuning
+
+ The pre-trained Whisper model demonstrates a strong ability to generalise to different datasets and domains. However,
+ its predictive capabilities can be improved further for certain languages and tasks through *fine-tuning*. The blog
+ post [Fine-Tune Whisper with 🤗 Transformers](https://huggingface.co/blog/fine-tune-whisper) provides a step-by-step
+ guide to fine-tuning the Whisper model with as little as 5 hours of labelled data.
+
+ ### Evaluated Use
+
+ The primary intended users of these models are AI researchers studying robustness, generalization, capabilities, biases, and constraints of the current model. However, Whisper is also potentially quite useful as an ASR solution for developers, especially for English speech recognition. We recognize that once models are released, it is impossible to restrict access to only “intended” uses or to draw reasonable guidelines around what is or is not research.
+
+ The models are primarily trained and evaluated on ASR and speech translation to English tasks. They show strong ASR results in ~10 languages. They may exhibit additional capabilities, particularly if fine-tuned on certain tasks like voice activity detection, speaker classification, or speaker diarization, but have not been robustly evaluated in these areas. We strongly recommend that users perform robust evaluations of the models in a particular context and domain before deploying them.
+
+ In particular, we caution against using Whisper models to transcribe recordings of individuals taken without their consent or purporting to use these models for any kind of subjective classification. We recommend against use in high-risk domains like decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes. The models are intended to transcribe and translate speech; use of the models for classification is not only not evaluated but also not appropriate, particularly to infer human attributes.
+
+
+ ## Training Data
+
+ The models are trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper `large-v2`.
+
+ As discussed in [the accompanying paper](https://cdn.openai.com/papers/whisper.pdf), we see that performance on transcription in a given language is directly correlated with the amount of training data we employ in that language.
+
+
+ ## Performance and Limitations
+
+ Our studies show that, compared with many existing ASR systems, the models exhibit improved robustness to accents, background noise, and technical language, as well as zero-shot translation from multiple languages into English, and that accuracy on speech recognition and translation is near the state-of-the-art level.
+
+ However, because the models are trained in a weakly supervised manner using large-scale noisy data, the predictions may include texts that are not actually spoken in the audio input (i.e. hallucination). We hypothesize that this happens because, given their general knowledge of language, the models combine trying to predict the next word in audio with trying to transcribe the audio itself.
+
+ Our models perform unevenly across languages, and we observe lower accuracy on low-resource and/or low-discoverability languages or languages where we have less training data. The models also exhibit disparate performance on different accents and dialects of particular languages, which may include higher word error rate across speakers of different genders, races, ages, or other demographic criteria. Our full evaluation results are presented in [the paper accompanying this release](https://cdn.openai.com/papers/whisper.pdf).
+
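Word error rate (WER), mentioned above, is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. A minimal pure-Python sketch follows. The function name `word_error_rate` and the sample sentences are illustrative assumptions, and the text normalization that usually precedes WER scoring is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("labas rytas visiems", "labas vakaras visiems"))
```

Comparing this score per accent, dialect, or speaker group on a held-out set is one way to surface the uneven performance described above.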
+ In addition, the sequence-to-sequence architecture of the model makes it prone to generating repetitive text, which can be mitigated to some degree, though not entirely, by beam search and temperature scheduling. Further analysis of these limitations is provided in [the paper](https://cdn.openai.com/papers/whisper.pdf). It is likely that this behavior and hallucinations may be worse on lower-resource and/or lower-discoverability languages.
+
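As a rough illustration of temperature scheduling, the sketch below retries decoding at increasing temperatures whenever the output looks too repetitive, using the zlib compression ratio as the repetition signal. It is a simplified sketch, not the reference implementation: `fake_decode` is a hypothetical stand-in for a model call, the temperature ladder and the `2.4` threshold are assumed values that should be checked against the Whisper repository, and other checks such as average log-probability are left out.

```python
import zlib
from typing import Callable, Sequence, Tuple


def compression_ratio(text: str) -> float:
    """Ratio of raw to zlib-compressed size; highly repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(
    decode: Callable[[float], str],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    max_ratio: float = 2.4,
) -> Tuple[str, float]:
    """Retry decoding at increasing temperatures until the output is not too repetitive."""
    text, used = "", temperatures[0]
    for used in temperatures:
        text = decode(used)
        if compression_ratio(text) <= max_ratio:
            break
    return text, used


def fake_decode(temperature: float) -> str:
    # Hypothetical stand-in for a model call: loops at low temperature, recovers from 0.4 upward.
    if temperature < 0.4:
        return "labas " * 50
    return "Labas rytas, kaip sekasi?"


text, temperature = decode_with_fallback(fake_decode)
print(temperature, text)
```

If every temperature still yields repetitive text, the function returns the last attempt, which matches the point above that the mitigation is only partial.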
+
+ ## Broader Implications
+
+ We anticipate that Whisper models’ transcription capabilities may be used for improving accessibility tools. While Whisper models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build applications on top of them that allow for near-real-time speech recognition and translation. The real value of beneficial applications built on top of Whisper models suggests that the disparate performance of these models may have real economic implications.
+
+ There are also potential dual use concerns that come with releasing Whisper. While we hope the technology will be used primarily for beneficial purposes, making ASR technology more accessible could enable more actors to build capable surveillance technologies or scale up existing surveillance efforts, as the speed and accuracy allow for affordable automatic transcription and translation of large volumes of audio communication. Moreover, these models may have some capabilities to recognize specific individuals out of the box, which in turn presents safety concerns related both to dual use and disparate performance. In practice, we expect that the cost of transcription is not the limiting factor of scaling up surveillance projects.