---
license: mit
language:
  - ko
metrics:
  - accuracy
pipeline_tag: automatic-speech-recognition
---

# Speech Verification Repository

This repository shows how to train and use a speaker verification model on speech data. The model was trained on the Korean speaker recognition speech dataset from AIHub.

λͺ¨λΈ κ°œμš”

  • λͺ¨λΈ 이름: wav2vec2-large-960h-lv60-contrastive
  • λͺ¨λΈ μ„€λͺ…: 이 λͺ¨λΈμ€ Wav2Vec 2.0 μ•„ν‚€ν…μ²˜λ₯Ό 기반으둜 ν•œ ν™”μž 인식 λͺ¨λΈμž…λ‹ˆλ‹€. λŒ€μ‘° ν•™μŠ΅(Contrastive Learning)을 톡해 동일 ν™”μžμ— λŒ€ν•œ μŒμ„± λ°μ΄ν„°μ˜ νŠΉμ§•μ„ 효과적으둜 ν•™μŠ΅ν•  수 μžˆμŠ΅λ‹ˆλ‹€.
  • ν™œμš© λΆ„μ•Ό: μŒμ„± 인식, ν™”μž λΆ„λ₯˜λ“±μ˜ νƒœμŠ€ν¬μ— ν™œμš©λ  수 μžˆμŠ΅λ‹ˆλ‹€.

The original model can be found at [wav2vec2-large-960h-lv60](https://huggingface.co/facebook/wav2vec2-large-960h-lv60).
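
The exact training objective is not documented in this card. As a rough illustration only, a pairwise cosine-similarity contrastive loss over mean-pooled embeddings might look like the sketch below; the function name and margin value are hypothetical, not the repository's actual loss.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, same_speaker, margin=0.5):
    # Illustrative pairwise contrastive loss, NOT the repository's actual objective.
    # emb1, emb2: (batch, dim) mean-pooled Wav2Vec 2.0 embeddings.
    # same_speaker: (batch,) float tensor, 1.0 = same speaker, 0.0 = different.
    sim = F.cosine_similarity(emb1, emb2)
    # Pull same-speaker pairs toward similarity 1; push different-speaker
    # pairs' similarity below the margin.
    pos = same_speaker * (1.0 - sim)
    neg = (1.0 - same_speaker) * torch.clamp(sim - margin, min=0.0)
    return (pos + neg).mean()
```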

## Training Data

- Uses AIHub's speaker recognition speech dataset
- Composed of Korean speech data, including samples from a variety of speakers
- Original data link: AIHub speaker recognition dataset

## Usage

1. Library Import

```python
import librosa
import torch
import torch.nn.functional as F
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
```

2. Load Model

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "Songhun/wav2vec2-large-960h-lv60-contrastive"
model = Wav2Vec2Model.from_pretrained(model_name).to(device)
model.eval()  # inference mode: disables dropout
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
```
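
As an optional sanity check (not part of the original instructions), you can run a silent one-second waveform through the model to confirm it loads and produces frame-level hidden states:

```python
# Hypothetical smoke test with a silent 1-second, 16 kHz waveform.
dummy = torch.zeros(1, 16000, device=device)
with torch.no_grad():
    hidden = model(dummy).last_hidden_state
print(hidden.shape)  # (batch, frames, hidden_size)
```
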
3. Calculate Voice Similarity

```python
file_path1 = './sample_data/voice1.mp3'
file_path2 = './sample_data/voice2.mp3'

def load_and_process_audio(file_path, feature_extractor, max_length=4.0):
    # Resample to the 16 kHz rate expected by Wav2Vec 2.0.
    audio, sampling_rate = librosa.load(file_path, sr=16000)
    # Pad or truncate every clip to a fixed max_length (in seconds).
    inputs = feature_extractor(
        audio,
        sampling_rate=sampling_rate,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=int(max_length * sampling_rate),
    )
    return inputs.input_values

audio_input1 = load_and_process_audio(file_path1, feature_extractor).to(device)
audio_input2 = load_and_process_audio(file_path2, feature_extractor).to(device)

# Mean-pool the frame-level hidden states into one embedding per clip.
with torch.no_grad():
    embedding1 = model(audio_input1).last_hidden_state.mean(dim=1)
    embedding2 = model(audio_input2).last_hidden_state.mean(dim=1)

similarity = F.cosine_similarity(embedding1, embedding2).item()
print(f"Similarity between the two audio files: {similarity}")
```

Threshold: 0.3728 is the optimal decision threshold under Youden's J statistic; similarities above it can be treated as same-speaker pairs.
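
For reference, here is a minimal sketch of applying this threshold, followed by how such a threshold can be derived with Youden's J statistic via scikit-learn's `roc_curve`. The labeled evaluation pairs below are placeholders, not data from this repository.

```python
import numpy as np
from sklearn.metrics import roc_curve

THRESHOLD = 0.3728
print("Same speaker" if similarity >= THRESHOLD else "Different speakers")

# Deriving a Youden's J optimal threshold from labeled verification pairs.
scores = np.array([0.91, 0.12, 0.55, 0.30])  # placeholder cosine similarities
labels = np.array([1, 0, 1, 0])              # 1 = same speaker, 0 = different
fpr, tpr, thresholds = roc_curve(labels, scores)
best = thresholds[np.argmax(tpr - fpr)]      # J = TPR - FPR, maximized
print(f"Optimal threshold (Youden's J): {best:.4f}")
```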