---
license: mit
language:
- ko
metrics:
- accuracy
pipeline_tag: automatic-speech-recognition
---
# Speaker Verification Repository
This repository provides instructions for training and using a speaker recognition model based on speech data. The model was trained on the speaker recognition speech dataset from AIHub, a Korean speech data hub.
## Model Overview
- Model name: wav2vec2-large-960h-lv60-contrastive
- Model description: A speaker recognition model built on the Wav2Vec 2.0 architecture. Through contrastive learning, it learns effective representations of speech from the same speaker (a hypothetical loss sketch follows below).
- Applications: Can be used for tasks such as speech recognition and speaker classification.
The original base model can be found at [facebook/wav2vec2-large-960h-lv60](https://huggingface.co/facebook/wav2vec2-large-960h-lv60).
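The exact training objective is not published in this card. As a point of reference, here is a minimal sketch of one way a pairwise contrastive loss over mean-pooled embeddings could look; the function name, the margin hyperparameter, and its value of 0.5 are illustrative assumptions, not the repository's verified training code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1: torch.Tensor, emb2: torch.Tensor,
                     same_speaker: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    """Hypothetical pairwise contrastive objective on speaker embeddings.

    emb1, emb2: (batch, dim) mean-pooled embeddings of two clips per pair.
    same_speaker: (batch,) float tensor, 1.0 if the pair shares a speaker else 0.0.
    """
    sim = F.cosine_similarity(emb1, emb2)  # (batch,) values in [-1, 1]
    # Pull same-speaker pairs toward similarity 1; push different-speaker
    # pairs below the margin (illustrative assumption).
    pos_loss = same_speaker * (1.0 - sim)
    neg_loss = (1.0 - same_speaker) * torch.clamp(sim - margin, min=0.0)
    return (pos_loss + neg_loss).mean()
```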
## Training Data
- Uses the speaker recognition speech dataset from AIHub
- Composed of Korean speech data, including speech samples from a variety of speakers (a hypothetical pair-construction sketch follows below)
- Source data: AIHub speaker recognition dataset
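The card does not describe how training pairs were formed from the dataset. The following is a hypothetical sketch of sampling positive (same-speaker) and negative (different-speaker) pairs from a speaker-labelled file list; `make_pairs`, the file layout, and the 50/50 sampling ratio are assumptions for illustration only.

```python
import random
from collections import defaultdict

def make_pairs(utterances, n_pairs=1000, seed=0):
    """utterances: list of (file_path, speaker_id) tuples.

    Returns (path_a, path_b, label) triples with label 1 for same-speaker
    pairs and 0 for different-speaker pairs.
    """
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for path, spk in utterances:
        by_speaker[spk].append(path)
    # Only speakers with at least two clips can yield a positive pair.
    multi = [s for s, files in by_speaker.items() if len(files) >= 2]
    pairs = []
    for _ in range(n_pairs):
        if rng.random() < 0.5:  # positive pair: two clips from one speaker
            spk = rng.choice(multi)
            a, b = rng.sample(by_speaker[spk], 2)
            pairs.append((a, b, 1))
        else:                   # negative pair: clips from two different speakers
            s1, s2 = rng.sample(list(by_speaker), 2)
            pairs.append((rng.choice(by_speaker[s1]), rng.choice(by_speaker[s2]), 0))
    return pairs
```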
## Usage
- Library import

```python
import librosa
import torch
import torch.nn.functional as F
from transformers import Wav2Vec2Model, AutoFeatureExtractor
```
- Load Model

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "Songhun/wav2vec2-large-960h-lv60-contrastive"
model = Wav2Vec2Model.from_pretrained(model_name).to(device)
model.eval()  # inference mode; disables dropout
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
```
- Calculate Voice Similarity

```python
file_path1 = './sample_data/voice1.mp3'
file_path2 = './sample_data/voice2.mp3'

def load_and_process_audio(file_path, feature_extractor, max_length=4.0):
    # Load audio at 16 kHz, the sampling rate the model expects.
    audio, sampling_rate = librosa.load(file_path, sr=16000)
    # Pad/truncate every clip to max_length seconds so inputs share one shape.
    inputs = feature_extractor(
        audio,
        sampling_rate=sampling_rate,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=int(max_length * sampling_rate),
    )
    return inputs.input_values

audio_input1 = load_and_process_audio(file_path1, feature_extractor).to(device)
audio_input2 = load_and_process_audio(file_path2, feature_extractor).to(device)

# Mean-pool the final hidden states into one embedding per clip.
with torch.no_grad():
    embedding1 = model(audio_input1).last_hidden_state.mean(dim=1)
    embedding2 = model(audio_input2).last_hidden_state.mean(dim=1)

similarity = F.cosine_similarity(embedding1, embedding2).item()
print(f"Similarity between the two audio files: {similarity}")
```
Decision threshold: 0.3728 is the optimal threshold according to Youden's J statistic; similarity scores above it can be treated as same-speaker pairs.
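For context, the sketch below shows how such a threshold can be obtained with Youden's J statistic (J = TPR - FPR) from labelled verification pairs, and how the published value could then be applied. The labelled arrays and the sklearn-based recipe are illustrative assumptions, not the repository's evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Placeholder labelled pairs: label 1 = same speaker, 0 = different speakers.
labels = np.array([1, 0, 1, 0, 1, 0])
scores = np.array([0.62, 0.21, 0.55, 0.34, 0.48, 0.30])

fpr, tpr, thresholds = roc_curve(labels, scores)
j_best = np.argmax(tpr - fpr)  # index maximising Youden's J = TPR - FPR
print(f"Optimal threshold by Youden's J: {thresholds[j_best]:.4f}")

# Applying the published threshold to a new similarity score:
similarity = 0.51  # e.g. the value computed in the usage example above
print("Same speaker" if similarity > 0.3728 else "Different speakers")
```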