---
license: mit
language:
- ko
metrics:
- accuracy
pipeline_tag: automatic-speech-recognition
---

# Speech Verification Repository

This repository shows how to train and use a speaker recognition model based on speech data. The model was trained on the AIHub speaker recognition speech dataset, a Korean speech dataset.

## Model Overview

- Model name: wav2vec2-large-960h-lv60-contrastive

- Model description: A speaker recognition model based on the Wav2Vec 2.0 architecture. Through contrastive learning, it can effectively learn the features of speech data from the same speaker.

- Use cases: Can be applied to tasks such as speech recognition and speaker classification.

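
The model card does not say which contrastive objective was used for training. As a rough illustration of the general idea only, the sketch below shows a classic pairwise contrastive loss on cosine distance: embeddings of the same speaker are pulled together, and embeddings of different speakers are pushed apart until they are at least `margin` away. The function names, the `margin` value, and the choice of this particular loss are all assumptions made for illustration, not the author's training code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (plain Python lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pairwise_contrastive_loss(emb_a, emb_b, same_speaker, margin=0.5):
    """Hypothetical pairwise contrastive loss on cosine distance.

    Same-speaker pairs are penalized by their squared distance.
    Different-speaker pairs are penalized only while they are closer
    than `margin` (an assumed value, not taken from the model card).
    """
    distance = 1.0 - cosine_similarity(emb_a, emb_b)
    if same_speaker:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


# Toy 2-D "embeddings" purely for illustration.
loss_same_close = pairwise_contrastive_loss([1.0, 0.0], [1.0, 0.0], same_speaker=True)
loss_diff_close = pairwise_contrastive_loss([1.0, 0.0], [1.0, 0.0], same_speaker=False)
loss_diff_far = pairwise_contrastive_loss([1.0, 0.0], [0.0, 1.0], same_speaker=False)
```

In this toy setup, identical embeddings cost nothing when they belong to the same speaker but are penalized when they belong to different speakers, which is the behavior a contrastive objective is meant to encourage.
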
The original model is available at [wav2vec2-large-960h-lv60](https://huggingface.co/facebook/wav2vec2-large-960h-lv60).

## Training Data

- Uses the AIHub speaker recognition speech dataset.

- Consists of Korean speech data and includes voice samples from a variety of speakers.

- Original data link: [AIHub speaker recognition dataset](https://aihub.or.kr/aihubdata/data/view.do?currMenu=&topMenu=&aihubDataSe=data&dataSetSn=537)

## Usage

1. Library import

```python
import librosa
import torch
import torch.nn.functional as F
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
```

2. Load Model

```python
# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "Songhun/wav2vec2-large-960h-lv60-contrastive"
model = Wav2Vec2Model.from_pretrained(model_name).to(device)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
```

3. Calculate Voice Similarity

```python
file_path1 = './sample_data/voice1.mp3'
file_path2 = './sample_data/voice2.mp3'


def load_and_process_audio(file_path, feature_extractor, max_length=4.0):
    """Load audio at 16 kHz and pad or truncate it to max_length seconds."""
    audio, sampling_rate = librosa.load(file_path, sr=16000)
    inputs = feature_extractor(
        audio,
        sampling_rate=sampling_rate,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=int(max_length * sampling_rate),
    )
    return inputs.input_values


# Reuses the feature_extractor loaded in step 2.
audio_input1 = load_and_process_audio(file_path1, feature_extractor).to(device)
audio_input2 = load_and_process_audio(file_path2, feature_extractor).to(device)

# Mean-pool the last hidden states over time to get one embedding per file.
with torch.no_grad():
    embedding1 = model(audio_input1).last_hidden_state.mean(dim=1)
    embedding2 = model(audio_input2).last_hidden_state.mean(dim=1)

similarity = F.cosine_similarity(embedding1, embedding2).item()
print(f"Similarity between the two audio files: {similarity}")
```

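Threshold: 0.3728 is the optimal threshold according to Youden's J statistic. A similarity at or above this value suggests that the two recordings come from the same speaker.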
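
To make the threshold concrete, here is a minimal sketch of how a Youden's J threshold can be picked from labeled similarity scores and then used for a same-speaker decision. Youden's J is defined as J = TPR - FPR, and the chosen threshold is the one that maximizes it. The helper names `youden_threshold` and `is_same_speaker`, the `>=` decision rule, and the toy scores are assumptions for illustration. They are not the evaluation code or data behind the 0.3728 value.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing Youden's J = TPR - FPR.

    scores: similarity scores, one per pair.
    labels: 1 for same-speaker pairs, 0 for different-speaker pairs.
    A pair is predicted "same speaker" when score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative pair")

    best_threshold, best_j = None, -1.0
    # Every observed score is a candidate threshold.
    for candidate in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= candidate)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= candidate)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_threshold, best_j = candidate, j
    return best_threshold, best_j


def is_same_speaker(similarity, threshold=0.3728):
    """Assumed decision rule: accept the pair when similarity >= threshold."""
    return similarity >= threshold


# Made-up scores purely for illustration (not results from this model).
toy_scores = [0.10, 0.20, 0.35, 0.40, 0.60, 0.80]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_threshold, toy_j = youden_threshold(toy_scores, toy_labels)
```

With the `similarity` value computed in step 3, `is_same_speaker(similarity)` would give a yes/no verification decision under these assumptions.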