
Speech Verification Repository

This repository shows how to train and use a speaker recognition model on speech data. The model was trained on the Korean speaker recognition speech dataset from AIHub.

λͺ¨λΈ κ°œμš”

  • λͺ¨λΈ 이름: wav2vec2-base-960h-contrastive
  • λͺ¨λΈ μ„€λͺ…: 이 λͺ¨λΈμ€ Wav2Vec 2.0 μ•„ν‚€ν…μ²˜λ₯Ό 기반으둜 ν•œ ν™”μž 인식 λͺ¨λΈμž…λ‹ˆλ‹€. λŒ€μ‘° ν•™μŠ΅(Contrastive Learning)을 톡해 동일 ν™”μžμ— λŒ€ν•œ μŒμ„± λ°μ΄ν„°μ˜ νŠΉμ§•μ„ 효과적으둜 ν•™μŠ΅ν•  수 μžˆμŠ΅λ‹ˆλ‹€.
  • ν™œμš© λΆ„μ•Ό: μŒμ„± 인식, ν™”μž λΆ„λ₯˜λ“±μ˜ νƒœμŠ€ν¬μ— ν™œμš©λ  수 μžˆμŠ΅λ‹ˆλ‹€.
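
The exact training objective is not spelled out in this card, so the following is only a hypothetical sketch of a pairwise contrastive loss over mean-pooled Wav2Vec 2.0 embeddings; the margin value and the choice of cosine distance are assumptions, not the repository's actual training code.

import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(emb1, emb2, same_speaker, margin=0.5):
    # emb1, emb2: (batch, hidden) mean-pooled utterance embeddings
    # same_speaker: (batch,) float tensor, 1.0 for same-speaker pairs, 0.0 otherwise
    # margin: hypothetical hyperparameter; the real training setup may differ
    dist = 1.0 - F.cosine_similarity(emb1, emb2)               # cosine distance per pair
    pos = same_speaker * dist.pow(2)                           # pull same-speaker pairs together
    neg = (1.0 - same_speaker) * F.relu(margin - dist).pow(2)  # push different speakers apart
    return (pos + neg).mean()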

The original model can be found at facebook/wav2vec2-base-960h.

Training Data

  • Uses the speaker recognition speech dataset from AIHub
  • Consists of Korean speech data and includes speech samples from a variety of speakers
  • Original data link: AIHub speaker recognition dataset

Usage

  1. Import libraries
import librosa
import torch
import torch.nn.functional as F
from transformers import Wav2Vec2Model, AutoFeatureExtractor
  2. Load Model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "Songhun/wav2vec2-base-960h-contrastive"
model = Wav2Vec2Model.from_pretrained(model_name).to(device)
model.eval()  # disable dropout so embeddings are deterministic at inference time
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
  3. Calculate Voice Similarity
file_path1 = './test1.wav'
file_path2 = './test2.wav'

def load_and_process_audio(file_path, feature_extractor, max_length=4.0):
    # Resample to 16 kHz and pad/truncate each clip to max_length seconds
    audio, sampling_rate = librosa.load(file_path, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt",
                               padding="max_length", truncation=True,
                               max_length=int(max_length * sampling_rate))
    return inputs.input_values

audio_input1 = load_and_process_audio(file_path1, feature_extractor).to(device)
audio_input2 = load_and_process_audio(file_path2, feature_extractor).to(device)

# Mean-pool the final hidden states to get one fixed-size embedding per clip
with torch.no_grad():
    embedding1 = model(audio_input1).last_hidden_state.mean(dim=1)
    embedding2 = model(audio_input2).last_hidden_state.mean(dim=1)

similarity = F.cosine_similarity(embedding1, embedding2).item()
print(f"Similarity between the two audio files: {similarity}")

Threshold: 0.3331 is the optimal decision threshold selected with Youden's J statistic.
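
To turn the similarity score into a same-speaker decision, compare it with the threshold above. The sketch below reuses the similarity value from the usage example; the youden_threshold helper is a hypothetical illustration of how such a threshold can be re-derived from labeled validation pairs with scikit-learn's roc_curve, not the authors' exact script.

THRESHOLD = 0.3331  # optimal threshold reported above

same_speaker = similarity >= THRESHOLD
print(f"Same speaker: {same_speaker} (similarity={similarity:.4f}, threshold={THRESHOLD})")

# Hypothetical helper: re-derive a threshold on your own labeled validation pairs
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(labels, scores):
    # labels: 1 for same-speaker pairs, 0 otherwise; scores: cosine similarities
    fpr, tpr, thresholds = roc_curve(labels, scores)
    j = tpr - fpr  # Youden's J = sensitivity + specificity - 1
    return thresholds[np.argmax(j)]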
