---
license: mit
language:
- ko
metrics:
- accuracy
pipeline_tag: automatic-speech-recognition
---
# Speaker Verification Repository
This repository provides a speaker recognition model trained on voice data, along with instructions for using it. The model was trained on the speaker recognition speech dataset from AIHub, a Korean speech data portal.
## λͺ¨λΈ κ°œμš”
- λͺ¨λΈ 이름: wav2vec2-large-960h-lv60-contrastive
- λͺ¨λΈ μ„€λͺ…: 이 λͺ¨λΈμ€ Wav2Vec 2.0 μ•„ν‚€ν…μ²˜λ₯Ό 기반으둜 ν•œ ν™”μž 인식 λͺ¨λΈμž…λ‹ˆλ‹€. λŒ€μ‘° ν•™μŠ΅(Contrastive Learning)을 톡해 동일 ν™”μžμ— λŒ€ν•œ μŒμ„± λ°μ΄ν„°μ˜ νŠΉμ§•μ„ 효과적으둜 ν•™μŠ΅ν•  수 μžˆμŠ΅λ‹ˆλ‹€.
- ν™œμš© λΆ„μ•Ό: μŒμ„± 인식, ν™”μž λΆ„λ₯˜λ“±μ˜ νƒœμŠ€ν¬μ— ν™œμš©λ  수 μžˆμŠ΅λ‹ˆλ‹€.
The original model can be found [wav2vec2-large-960h-lv60](https://huggingface.co/facebook/wav2vec2-large-960h-lv60)
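The exact training objective is not published in this card; the following is a minimal sketch of a pair-based contrastive loss of the kind described above, assuming binary same-speaker labels. The function name `pairwise_contrastive_loss` and the margin value are illustrative assumptions, not taken from the actual training code.

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(emb1, emb2, same_speaker, margin=0.5):
    # emb1, emb2: (batch, dim) utterance embeddings
    # same_speaker: (batch,) float tensor, 1.0 for same-speaker pairs, 0.0 otherwise
    sim = F.cosine_similarity(emb1, emb2)
    # Pull same-speaker pairs toward similarity 1.0 ...
    pos = same_speaker * (1.0 - sim)
    # ... and push different-speaker pairs below the margin (illustrative value).
    neg = (1.0 - same_speaker) * F.relu(sim - margin)
    return (pos + neg).mean()
```

At inference time the same cosine similarity is used to score pairs of utterance embeddings, as shown in the usage section below.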
## Training Data
- Uses the speaker recognition speech dataset from AIHub
- Consists of Korean speech data with voice samples from a wide range of speakers
- Original data link: [AIHub speaker recognition dataset](https://aihub.or.kr/aihubdata/data/view.do?currMenu=&topMenu=&aihubDataSe=data&dataSetSn=537)
## Usage
1. Import libraries
```python
import librosa
import torch
import torch.nn.functional as F
from transformers import Wav2Vec2Model, AutoFeatureExtractor
```
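All of the imports above are available from PyPI; assuming a standard pip environment, the dependencies can be installed with:

```shell
pip install torch transformers librosa
```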
2. Load the model
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "Songhun/wav2vec2-large-960h-lv60-contrastive"
model = Wav2Vec2Model.from_pretrained(model_name).to(device)
model.eval()  # inference mode: disables dropout
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
```
3. Calculate voice similarity
```python
file_path1 = './sample_data/voice1.mp3'
file_path2 = './sample_data/voice2.mp3'

def load_and_process_audio(file_path, feature_extractor, max_length=4.0):
    # Resample to 16 kHz, the rate the model expects
    audio, sampling_rate = librosa.load(file_path, sr=16000)
    # Pad or truncate every clip to a fixed length (max_length seconds)
    inputs = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt",
                               padding="max_length", truncation=True,
                               max_length=int(max_length * sampling_rate))
    return inputs.input_values

audio_input1 = load_and_process_audio(file_path1, feature_extractor).to(device)
audio_input2 = load_and_process_audio(file_path2, feature_extractor).to(device)

# Mean-pool the frame-level hidden states into a single utterance embedding
with torch.no_grad():
    embedding1 = model(audio_input1).last_hidden_state.mean(dim=1)
    embedding2 = model(audio_input2).last_hidden_state.mean(dim=1)

similarity = F.cosine_similarity(embedding1, embedding2).item()
print(f"Similarity between the two audio files: {similarity:.4f}")
```
Threshold: 0.3728 is the optimal decision threshold found with Youden's J statistic; similarity scores at or above this value can be treated as same-speaker matches.
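As a usage example, the threshold turns the similarity score into a binary same-speaker decision. The second half of the snippet sketches how such a threshold can be derived with Youden's J statistic (J = TPR - FPR, maximized over candidate thresholds), assuming labeled trial pairs and scikit-learn; `labels`, `scores`, and `youden_j_threshold` are illustrative names, not part of this repository.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Binary decision using the published threshold
THRESHOLD = 0.3728
is_same_speaker = similarity >= THRESHOLD
print(f"Same speaker: {is_same_speaker}")

# Sketch: deriving a Youden's J threshold from labeled trial pairs.
# labels: 1 for same-speaker pairs, 0 otherwise; scores: cosine similarities.
def youden_j_threshold(labels, scores):
    fpr, tpr, thresholds = roc_curve(labels, scores)
    j = tpr - fpr  # Youden's J statistic at each ROC operating point
    return thresholds[np.argmax(j)]
```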