chenhaodev committed on
Commit ca430b9 · 1 Parent(s): 377e4e1
Files changed (5)
  1. Dockerfile +47 -0
  2. Example.jpg +0 -0
  3. README.md +111 -6
  4. app.py +333 -0
  5. requirements.txt +4 -0
Dockerfile ADDED
@@ -0,0 +1,47 @@
+ # Use Ubuntu as the base image
+ FROM ubuntu:22.04
+
+ # Prevent interactive prompts during package installation
+ ENV DEBIAN_FRONTEND=noninteractive
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     python3 \
+     python3-pip \
+     curl \
+     wget \
+     git \
+     net-tools \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Install Ollama
+ RUN curl -fsSL https://ollama.com/install.sh | sh
+
+ # Set working directory
+ WORKDIR /app
+
+ # Copy requirements and install Python dependencies
+ COPY requirements.txt .
+ RUN pip3 install --no-cache-dir -r requirements.txt
+
+ # Copy application code
+ COPY . .
+
+ # Create startup script
+ RUN echo '#!/bin/bash\n\
+ # Start the Ollama server in the background\n\
+ ollama serve &\n\
+ sleep 5\n\
+ \n\
+ # Pull the model if it is not already present\n\
+ ollama pull deepseek-r1:1.5b\n\
+ \n\
+ # Start the Gradio app\n\
+ exec python3 -u app.py\n\
+ ' > start.sh && chmod +x start.sh
+
+ # Expose the port for the Gradio web interface
+ EXPOSE 7860
+
+ # Run the application
+ ENTRYPOINT ["./start.sh"]
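With the Dockerfile above, the image can also be built and run locally; the image name below is illustrative, and the port mapping matches the `EXPOSE 7860` directive:

```shell
# Build the image from the repository root (image name is arbitrary)
docker build -t asr-eval-tool .

# Run it, mapping the Gradio port to the host
docker run --rm -p 7860:7860 asr-eval-tool
```

Note that the first run pulls the deepseek-r1:1.5b weights inside the container, so startup takes noticeably longer than subsequent runs.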
Example.jpg ADDED
README.md CHANGED
@@ -1,11 +1,116 @@
  ---
- title: Evaluate ASR
- emoji: 🏃
  colorFrom: blue
- colorTo: green
- sdk: docker
  pinned: false
- short_description: version 1.1
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: ASR Evaluation Tool
+ emoji: 🎯
  colorFrom: blue
+ colorTo: red
+ sdk: gradio
+ sdk_version: 5.16.0
+ app_file: app.py
  pinned: false
  ---
 
+ # ASR Evaluation Tool (Ver 1.1)
+
+ This Gradio app provides a user-friendly interface for calculating Word Error Rate (WER) and related metrics between reference and hypothesis texts. It is particularly useful for evaluating speech recognition or machine translation outputs.
+
+ ## Features
+
+ - Calculate WER, MER, WIL, and WIP metrics
+ - Text normalization options
+ - Custom word filtering
+ - Detailed error analysis
+ - Example inputs for testing
+
+ ## How to Use
+
+ 1. Enter or paste your reference text
+ 2. Enter or paste your hypothesis text
+ 3. Configure options (normalization, word filtering)
+ 4. Click "Calculate WER" to see results
+
+ NOTE: Results may take around 30 seconds to appear, because the deepseek-r1:1.5b model is called to compute the medical term recall.
+
+ ![Example](./Example.jpg)
+
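The normalization option roughly amounts to lowercasing, stripping punctuation, and collapsing whitespace. A minimal sketch of the idea (the app itself uses jiwer's transformation pipeline, not this function):

```python
import re

def normalize(text: str) -> str:
    """Rough equivalent of the app's normalization: lowercase,
    drop punctuation (keeping apostrophes), collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", "", text)
    return " ".join(text.split())

print(normalize("The patient, shows  signs!"))  # the patient shows signs
```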
+ ## Local Development
+
+ 1. Clone the repository:
+ ```bash
+ git clone https://github.com/yourusername/wer-evaluation-tool.git
+ cd wer-evaluation-tool
+ ```
+
+ 2. Create and activate a virtual environment using `uv`:
+ ```bash
+ uv venv
+ source .venv/bin/activate  # On Unix/macOS
+ # or
+ .venv\Scripts\activate     # On Windows
+ ```
+
+ 3. Install dependencies:
+ ```bash
+ uv pip install -r requirements.txt
+ ```
+
+ 4. Run the app locally:
+ ```bash
+ uv run python app.py
+ ```
+
+ ## Installation
+
+ You can install the package directly from PyPI:
+
+ ```bash
+ uv pip install wer-evaluation-tool
+ ```
+
+ ## Testing
+
+ Run the test suite using pytest:
+
+ ```bash
+ uv run pytest tests/
+ ```
+
+ ## Contributing
+
+ 1. Fork the repository
+ 2. Create a new branch (`git checkout -b feature/improvement`)
+ 3. Make your changes
+ 4. Run tests to ensure everything works
+ 5. Commit your changes (`git commit -am 'Add new feature'`)
+ 6. Push to the branch (`git push origin feature/improvement`)
+ 7. Create a Pull Request
+
+ ## License
+
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+ ## Acknowledgments
+
+ - Thanks to all contributors who have helped with the development
+ - Inspired by the need for better speech recognition evaluation tools
+ - Built with [Gradio](https://gradio.app/)
+
+ ## Contact
+
+ For questions or feedback, please:
+ - Open an issue in the GitHub repository
+ - Contact the maintainers at [email/contact information]
+
+ ## Citation
+
+ If you use this tool in your research, please cite:
+
+ ```bibtex
+ @software{wer_evaluation_tool,
+   title  = {WER Evaluation Tool},
+   author = {Your Name},
+   year   = {2024},
+   url    = {https://github.com/yourusername/wer-evaluation-tool}
+ }
+ ```
app.py ADDED
@@ -0,0 +1,333 @@
+ import gradio as gr
+ import jiwer
+ import pandas as pd
+ import logging
+ from typing import List, Optional, Tuple, Dict
+ from ollama import Client
+
+ # Set up logging configuration
+ logging.basicConfig(
+     level=logging.INFO,
+     format='%(asctime)s - %(levelname)s - %(message)s',
+     force=True,
+     handlers=[
+         logging.StreamHandler(),
+     ]
+ )
+ logger = logging.getLogger(__name__)
+
+ def calculate_wer_metrics(
+     hypothesis: str,
+     reference: str,
+     normalize: bool = True,
+     words_to_filter: Optional[List[str]] = None
+ ) -> Dict:
+     """
+     Calculate WER metrics between hypothesis and reference texts.
+
+     Args:
+         hypothesis (str): The hypothesis text
+         reference (str): The reference text
+         normalize (bool): Whether to normalize texts before comparison
+         words_to_filter (List[str], optional): Words to filter out before comparison
+
+     Returns:
+         dict: Dictionary containing WER metrics
+
+     Raises:
+         ValueError: If inputs are invalid or result in empty text after processing
+     """
+     logger.info(f"Calculating WER metrics with inputs - Hypothesis: {hypothesis}, Reference: {reference}")
+
+     # Validate inputs
+     if not hypothesis.strip() or not reference.strip():
+         raise ValueError("Both hypothesis and reference texts must contain non-empty strings")
+
+     if normalize:
+         # Define basic transformations
+         basic_transform = jiwer.Compose([
+             jiwer.ExpandCommonEnglishContractions(),
+             jiwer.ToLowerCase(),
+             jiwer.RemoveMultipleSpaces(),
+             jiwer.RemovePunctuation(),
+             jiwer.Strip(),
+             jiwer.ReduceToListOfListOfWords()
+         ])
+
+         if words_to_filter and any(words_to_filter):
+             filter_set = {w.lower() for w in words_to_filter}
+
+             # basic_transform ends with ReduceToListOfListOfWords, so this
+             # transform receives a list of sentences, each a list of words.
+             def filter_words_transform(sentences: List[List[str]]) -> List[List[str]]:
+                 filtered = [[word for word in sentence if word.lower() not in filter_set]
+                             for sentence in sentences]
+                 if not any(filtered):
+                     raise ValueError("Text is empty after filtering words")
+                 return filtered
+
+             transformation = jiwer.Compose([
+                 basic_transform,
+                 filter_words_transform
+             ])
+         else:
+             transformation = basic_transform
+
+         # Pre-check the transformed text
+         try:
+             transformed_ref = transformation(reference)
+             transformed_hyp = transformation(hypothesis)
+             if not transformed_ref or not transformed_hyp:
+                 raise ValueError("Text is empty after normalization")
+             logger.debug(f"Transformed reference: {transformed_ref}")
+             logger.debug(f"Transformed hypothesis: {transformed_hyp}")
+         except Exception as e:
+             logger.error(f"Transformation error: {str(e)}")
+             raise ValueError(f"Error during text transformation: {str(e)}")
+
+         measures = jiwer.compute_measures(
+             truth=reference,
+             hypothesis=hypothesis,
+             truth_transform=transformation,
+             hypothesis_transform=transformation
+         )
+     else:
+         measures = jiwer.compute_measures(
+             truth=reference,
+             hypothesis=hypothesis
+         )
+
+     return measures
+
+ # Initialize Ollama client
+ client = Client(host='http://localhost:11434')
+
+ def extract_medical_terms(text: str) -> List[str]:
+     """
+     Extract medical terms from text using the deepseek-r1:1.5b model via Ollama.
+
+     Args:
+         text (str): Input text
+
+     Returns:
+         List[str]: List of extracted medical terms
+     """
+     prompt = f"""Extract all medical terms from the following text.
+     Return only the medical terms as a comma-separated list.
+     Text: {text}"""
+
+     try:
+         response = client.generate(
+             model='deepseek-r1:1.5b',
+             prompt=prompt,
+             stream=False
+         )
+
+         # Get the response text
+         response_text = response['response']
+
+         # Remove the model's thinking process, if present
+         if '<think>' in response_text and '</think>' in response_text:
+             # Keep everything after </think>
+             medical_terms_text = response_text.split('</think>')[-1].strip()
+         else:
+             medical_terms_text = response_text
+
+         # Parse the comma-separated response
+         medical_terms = [term.strip() for term in medical_terms_text.split(',')]
+         # Drop empty terms and leftover tags
+         return [term for term in medical_terms if term and not term.startswith('<') and not term.endswith('>')]
+
+     except Exception as e:
+         logger.error(f"Error in medical term extraction: {str(e)}")
+         return []
+
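The `</think>` stripping above is needed because deepseek-r1 emits its chain of thought before the answer. The parsing step in isolation (the sample response text below is made up):

```python
def parse_terms(response_text: str) -> list:
    """Parse a comma-separated model answer, dropping any
    deepseek-style <think>...</think> preamble and leftover tags."""
    if '<think>' in response_text and '</think>' in response_text:
        response_text = response_text.split('</think>')[-1].strip()
    terms = [t.strip() for t in response_text.split(',')]
    return [t for t in terms if t and not t.startswith('<') and not t.endswith('>')]

print(parse_terms("<think>scanning for drug names...</think> aspirin, warfarin"))  # ['aspirin', 'warfarin']
```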
+ def calculate_medical_recall(
+     hypothesis_terms: List[str],
+     reference_terms: List[str]
+ ) -> float:
+     """
+     Calculate medical term recall rate.
+
+     Args:
+         hypothesis_terms (List[str]): Medical terms from hypothesis
+         reference_terms (List[str]): Medical terms from reference
+
+     Returns:
+         float: Recall rate
+     """
+     if not reference_terms:
+         return 1.0 if not hypothesis_terms else 0.0
+
+     correct_terms = set(hypothesis_terms) & set(reference_terms)
+     return len(correct_terms) / len(set(reference_terms))
+
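Recall here is the fraction of unique reference terms that the hypothesis recovers; restated standalone for illustration:

```python
def recall(hyp_terms, ref_terms):
    """Fraction of unique reference terms present in the hypothesis."""
    if not ref_terms:
        return 1.0 if not hyp_terms else 0.0
    ref = set(ref_terms)
    return len(set(hyp_terms) & ref) / len(ref)

print(recall(["hypertension", "aspirin"], ["hypertension", "diabetes"]))  # 0.5
```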
+ def process_inputs(
+     reference: str,
+     hypothesis: str,
+     normalize: bool,
+     words_to_filter: str
+ ) -> Tuple[str, str, str, str]:
+     """
+     Process inputs and calculate both WER and medical term recall metrics.
+
+     Args:
+         reference (str): Reference text
+         hypothesis (str): Hypothesis text
+         normalize (bool): Whether to normalize text
+         words_to_filter (str): Comma-separated words to filter
+
+     Returns:
+         Tuple[str, str, str, str]: HTML-formatted main metrics, error analysis,
+             metric explanations, and an error message (empty on success)
+     """
+     if not reference or not hypothesis:
+         return "Please provide both reference and hypothesis texts.", "", "", ""
+
+     try:
+         # Extract medical terms
+         reference_terms = extract_medical_terms(reference)
+         hypothesis_terms = extract_medical_terms(hypothesis)
+
+         # Calculate medical recall
+         med_recall = calculate_medical_recall(hypothesis_terms, reference_terms)
+
+         # Calculate WER metrics
+         filter_words = [word.strip() for word in words_to_filter.split(",")] if words_to_filter else None
+         measures = calculate_wer_metrics(
+             hypothesis=hypothesis,
+             reference=reference,
+             normalize=normalize,
+             words_to_filter=filter_words
+         )
+
+         # Format metrics
+         metrics_df = pd.DataFrame({
+             'Metric': ['WER', 'MER', 'WIL', 'WIP', 'Medical Term Recall'],
+             'Value': [
+                 f"{measures['wer']:.3f}",
+                 f"{measures['mer']:.3f}",
+                 f"{measures['wil']:.3f}",
+                 f"{measures['wip']:.3f}",
+                 f"{med_recall:.3f}"
+             ]
+         })
+
+         # Format error analysis
+         error_df = pd.DataFrame({
+             'Metric': ['Substitutions', 'Deletions', 'Insertions', 'Hits'],
+             'Count': [
+                 measures['substitutions'],
+                 measures['deletions'],
+                 measures['insertions'],
+                 measures['hits']
+             ]
+         })
+
+         # Format medical terms comparison
+         med_terms_df = pd.DataFrame({
+             'Source': ['Reference', 'Hypothesis'],
+             'Medical Terms': [
+                 ', '.join(reference_terms),
+                 ', '.join(hypothesis_terms)
+             ]
+         })
+
+         metrics_html = metrics_df.to_html(index=False)
+         error_html = error_df.to_html(index=False)
+         med_terms_html = med_terms_df.to_html(index=False)
+
+         explanation = f"""
+         <h3>Metrics Explanation:</h3>
+         <ul>
+             <li><b>WER (Word Error Rate)</b>: Substitutions, deletions, and insertions divided by the number of reference words</li>
+             <li><b>MER (Match Error Rate)</b>: Errors divided by the total number of word matches and errors</li>
+             <li><b>WIL (Word Information Lost)</b>: The proportion of word information that was lost</li>
+             <li><b>WIP (Word Information Preserved)</b>: The proportion of word information that was preserved</li>
+             <li><b>Medical Term Recall</b>: The proportion of reference medical terms that were correctly identified in the hypothesis</li>
+         </ul>
+         <h3>Extracted Medical Terms:</h3>
+         {med_terms_html}
+         """
+
+         return metrics_html, error_html, explanation, ""
+
+     except Exception as e:
+         error_msg = f"Error in processing: {str(e)}"
+         logger.error(error_msg)
+         return "", "", "", error_msg
+
+ def load_example() -> Tuple[str, str]:
+     """Load example texts for demonstration."""
+     return (
+         "The patient shows signs of heart attack and hypertension.",
+         "The patient shows signs of heart attack and high blood pressure."
+     )
+
+ def create_interface() -> gr.Blocks:
+     """Create the Gradio interface."""
+     with gr.Blocks(title="WER Evaluation Tool") as interface:
+         gr.Markdown("# Word Error Rate (WER) Evaluation Tool")
+         gr.Markdown(
+             "This tool helps you evaluate the Word Error Rate (WER) between a reference "
+             "text and a hypothesis text. WER is commonly used in speech recognition and "
+             "machine translation evaluation."
+         )
+
+         with gr.Row():
+             with gr.Column():
+                 reference = gr.Textbox(
+                     label="Reference Text",
+                     placeholder="Enter the reference text here...",
+                     lines=5
+                 )
+             with gr.Column():
+                 hypothesis = gr.Textbox(
+                     label="Hypothesis Text",
+                     placeholder="Enter the hypothesis text here...",
+                     lines=5
+                 )
+
+         with gr.Row():
+             normalize = gr.Checkbox(
+                 label="Normalize text (lowercase, remove punctuation)",
+                 value=True
+             )
+             words_to_filter = gr.Textbox(
+                 label="Words to filter (comma-separated)",
+                 placeholder="e.g., um, uh, ah"
+             )
+
+         with gr.Row():
+             example_btn = gr.Button("Load Example")
+             calculate_btn = gr.Button("Calculate WER", variant="primary")
+
+         with gr.Row():
+             metrics_output = gr.HTML(label="Main Metrics")
+             error_output = gr.HTML(label="Error Analysis")
+
+         explanation_output = gr.HTML()
+         error_msg_output = gr.HTML()
+
+         # Event handlers
+         example_btn.click(
+             load_example,
+             outputs=[reference, hypothesis]
+         )
+
+         calculate_btn.click(
+             process_inputs,
+             inputs=[reference, hypothesis, normalize, words_to_filter],
+             outputs=[metrics_output, error_output, explanation_output, error_msg_output]
+         )
+
+     return interface
+
+ if __name__ == "__main__":
+     logger.info("Application started")
+     app = create_interface()
+     # Explicitly configure Gradio to be reachable from outside the container
+     app.launch(
+         server_name="0.0.0.0",  # Bind to all interfaces
+         server_port=7860,
+         share=False,  # Don't create a public URL
+         debug=True    # Enable debug mode for more information
+     )
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ gradio==5.16.0
+ jiwer==3.1.0
+ pandas==2.2.0
+ ollama==0.4.5