	---
	language: ja
	license: apache-2.0
	tags:
	- speech
	- speaker-diarization
	datasets:
	- callhome
	---

	# Fine-tuned XLSR-53 large model for speaker diarization in Japanese phone calls

	A two-speaker diarization model, obtained by fine-tuning [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Japanese phone-call data from [CallHome](https://media.talkbank.org/ca/CallHome/jpn/).

	## Usage
	The model can be used directly as follows.

	```python
	import numpy as np
	import torch
	from pydub import AudioSegment

	from transformers import Wav2Vec2ForAudioFrameClassification, Wav2Vec2FeatureExtractor


	def _make_timegrid(sound_duration: float, total_len: int):
	    # Split [0, sound_duration] into total_len equal frames and
	    # return the start and end time (in seconds) of each frame.
	    start_timegrid = np.linspace(0, sound_duration, total_len + 1)
	    dt = start_timegrid[1] - start_timegrid[0]
	    end_timegrid = start_timegrid + dt
	    return start_timegrid[:total_len], end_timegrid[:total_len]

	feature_extractor = Wav2Vec2FeatureExtractor(
	    feature_size=1,
	    sampling_rate=16_000,
	    padding_value=0.0,
	    do_normalize=True,
	    return_attention_mask=True,
	)
	model = Wav2Vec2ForAudioFrameClassification.from_pretrained("Ivydata/wav2vec2-large-speech-diarization-jp")
	filepath = "/path/to/file.wav"
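	sound = AudioSegment.from_file(filepath)
	# Resample to 16 kHz and downmix to mono; stereo samples would otherwise be interleaved.
	sound = sound.set_frame_rate(16_000).set_channels(1)
	sound_duration = sound.duration_seconds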

	feature = feature_extractor(np.array(sound.get_array_of_samples()), sampling_rate=16_000).input_values[0]
	input_values = torch.tensor(feature, dtype=torch.float32).unsqueeze(0)

	with torch.no_grad():
	    logits = model(input_values).logits
	    pred = logits.argmax(dim=-1).squeeze(0)
	start_timegrid, end_timegrid = _make_timegrid(sound_duration, len(pred))

	print("sec speaker_label")
	for p, start_time in zip(pred, start_timegrid):
	    print(f"{start_time:.4f} {p}")
	```
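
The snippet above prints one label per frame. For downstream use it is often handier to merge runs of identical labels into segments. The following is a minimal sketch of one way to do that, not part of the model's own tooling: `frames_to_segments` is a hypothetical helper, it assumes the same uniform time grid as `_make_timegrid`, and the input here is a toy label sequence rather than real model output.

```python
import numpy as np


def frames_to_segments(pred, sound_duration):
    """Merge consecutive frames with equal labels into (start, end, label) tuples.

    Hypothetical helper for illustration; times are in seconds.
    """
    pred = np.asarray(pred)
    if pred.size == 0:
        return []
    # Uniform grid: frame i spans [edges[i], edges[i + 1]].
    edges = np.linspace(0.0, sound_duration, len(pred) + 1)
    # Frame indices at which the label differs from the previous frame.
    change = np.flatnonzero(pred[1:] != pred[:-1]) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(pred)]))
    return [
        (float(edges[s]), float(edges[e]), int(pred[s]))
        for s, e in zip(starts, ends)
    ]


# Toy input: 6 frames over 3 seconds (0.5 s per frame).
segments = frames_to_segments([0, 0, 1, 1, 1, 0], sound_duration=3.0)
print("start end speaker_label")
for start, end, label in segments:
    print(f"{start:.2f} {end:.2f} {label}")
```

With the usage code above, one would pass `pred.tolist()` and `sound_duration` instead of the toy values. What each integer label means depends on how the model was trained, so this sketch makes no assumption about that.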

	## Training

	The model was trained on the Japanese phone-call corpus [CallHome](https://media.talkbank.org/ca/CallHome/jpn/).

	## License

	[The Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0)