# Adding a New Dataset Parser

This guide explains how to add a new dataset parser to the llmdataparser library. The library is designed to make it easy to add support for new datasets while maintaining consistent interfaces and functionality.

## Step-by-Step Guide

### 1. Create a New Parser Class

Create a new file `your_dataset_parser.py` in the `llmdataparser` folder. Your parser should inherit from `HuggingFaceDatasetParser[T]`, where `T` is your custom entry type.

```python
from dataclasses import dataclass

from llmdataparser.base_parser import (
    DatasetDescription,
    EvaluationMetric,
    HuggingFaceDatasetParser,
    HuggingFaceParseEntry,
)
from llmdataparser.prompts import YOUR_SYSTEM_PROMPT


@dataclass(frozen=True, kw_only=True, slots=True)
class YourDatasetParseEntry(HuggingFaceParseEntry):
    """Custom entry class for your dataset."""

    # Add any additional fields specific to your dataset
    custom_field: str

    @classmethod
    def create(
        cls,
        prompt: str,
        answer: str,
        raw_question: str,
        raw_answer: str,
        task_name: str,
        custom_field: str,
    ) -> "YourDatasetParseEntry":
        return cls(
            prompt=prompt,
            answer=answer,
            raw_question=raw_question,
            raw_answer=raw_answer,
            task_name=task_name,
            custom_field=custom_field,
        )


class YourDatasetParser(HuggingFaceDatasetParser[YourDatasetParseEntry]):
    """Parser for your dataset."""

    # Required class variables
    _data_source = "huggingface/your-dataset"
    _default_task = "default"
    _task_names = ["task1", "task2", "task3"]
    _default_system_prompt = YOUR_SYSTEM_PROMPT
```
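
A note on the dataclass flags: `frozen=True` makes parsed entries immutable, `kw_only=True` forces keyword arguments so fields cannot be passed in the wrong order, and `slots=True` avoids a per-instance `__dict__`, which keeps memory overhead down when parsing large datasets.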

### 2. Define System Prompt

Add your system prompt to `llmdataparser/prompts.py`:

```python
import textwrap
from typing import Final

YOUR_SYSTEM_PROMPT: Final[str] = textwrap.dedent(
    """\
    You are an expert in [your domain]. Your task is to [describe the task].

    Instructions:
    1. [First instruction]
    2. [Second instruction]
    ...
    """
)
```
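
`textwrap.dedent` strips the common leading whitespace, so the prompt body can be indented to match the surrounding code, and the `Final` annotation signals that the constant is not meant to be reassigned.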

### 3. Implement Required Methods

Your parser needs to implement these key methods (they live on `YourDatasetParser`; the snippets are shown unindented for readability):

```python
from typing import Any


def process_entry(
    self,
    row: dict[str, Any],
    task_name: str | None = None,
    **kwargs: Any,
) -> YourDatasetParseEntry:
    """Process a single dataset entry."""
    # Extract data from the row
    raw_question = row["question"]
    raw_answer = row["answer"]
    task = task_name or self._get_current_task(row)

    # Format the prompt
    prompt = f"{self._system_prompt}\nQuestion: {raw_question}\nAnswer:"

    return YourDatasetParseEntry.create(
        prompt=prompt,
        answer=raw_answer,
        raw_question=raw_question,
        raw_answer=raw_answer,
        task_name=task,
        custom_field=row["custom_field"],
    )


def get_dataset_description(self) -> DatasetDescription:
    """Returns a description of your dataset."""
    return DatasetDescription.create(
        name="Your Dataset Name",
        purpose="Purpose of the dataset",
        source="Dataset source/URL",
        language="Dataset language",
        format="Data format (e.g., multiple choice, free text)",
        characteristics="Key characteristics of the dataset",
        citation="Dataset citation if available",
    )


def get_evaluation_metrics(self) -> list[EvaluationMetric]:
    """Returns recommended evaluation metrics."""
    return [
        EvaluationMetric.create(
            name="metric_name",
            type="metric_type",
            description="Metric description",
            implementation="implementation_details",
            primary=True,
        )
    ]
```
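
The prompt template above covers the simplest free-text case. For a multiple-choice dataset you would typically fold the options into the prompt inside `process_entry`; a minimal sketch, where `choices` is a hypothetical field name to adapt to your dataset's schema (see `mmlu_parser.py` for a real example):

```python
# Sketch only: "choices" is a hypothetical row field; adapt to your schema.
options = "\n".join(
    f"{letter}. {text}" for letter, text in zip("ABCD", row["choices"])
)
prompt = f"{self._system_prompt}\nQuestion: {raw_question}\n{options}\nAnswer:"
```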

### 4. Add Example Usage

Add example usage at the bottom of your parser file:

```python
if __name__ == "__main__":
    # Example usage
    parser = YourDatasetParser()
    parser.load()
    parser.parse()

    # Get the parsed data (get_parsed_data is a property, not a method)
    parsed_data = parser.get_parsed_data

    # Print an example entry
    if parsed_data:
        example = parsed_data[0]
        print("\nExample parsed entry:")
        print(f"Question: {example.raw_question}")
        print(f"Answer: {example.answer}")
```
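
To exercise a non-default task, pass the task name when loading. The exact keyword is defined by `HuggingFaceDatasetParser.load`, so treat the argument name below as an assumption to verify against the base class:

```python
# Hypothetical keyword argument; confirm against HuggingFaceDatasetParser.load.
parser = YourDatasetParser()
parser.load(task_name="task1")
parser.parse()
```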

### 5. Create Tests

Create a test file `tests/test_your_dataset_parser.py`:

```python
import pytest

from llmdataparser.your_dataset_parser import (
    YourDatasetParseEntry,
    YourDatasetParser,
)


def test_parser_initialization():
    parser = YourDatasetParser()
    assert parser._data_source == "huggingface/your-dataset"
    assert parser._default_task == "default"
    assert "task1" in parser._task_names


def test_process_entry():
    parser = YourDatasetParser()
    sample_row = {
        "question": "Sample question",
        "answer": "Sample answer",
        "custom_field": "Custom value",
    }

    entry = parser.process_entry(sample_row)
    assert isinstance(entry, YourDatasetParseEntry)
    assert entry.raw_question == "Sample question"
    assert entry.custom_field == "Custom value"
```

## Best Practices

1. **Type Safety**: Use type hints consistently and ensure your parser is fully typed.
1. **Documentation**: Add clear docstrings and comments explaining your parser's functionality.
1. **Error Handling**: Include appropriate error checking and validation (see the sketch after this list).
1. **Testing**: Write comprehensive tests covering different scenarios.
1. **System Prompt**: Design your system prompt carefully to guide the model effectively.
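
For error handling, one option is to validate required fields at the top of `process_entry`, so a malformed row fails fast with a descriptive message rather than a bare `KeyError`; a minimal sketch, assuming the row fields from step 3:

```python
from typing import Any


def process_entry(
    self,
    row: dict[str, Any],
    task_name: str | None = None,
    **kwargs: Any,
) -> YourDatasetParseEntry:
    """Process a single dataset entry, failing fast on malformed rows."""
    # Report all missing fields at once for easier debugging.
    missing = [
        field for field in ("question", "answer", "custom_field")
        if field not in row
    ]
    if missing:
        raise KeyError(f"Row is missing required fields: {missing}")
    ...  # continue as in step 3
```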

## Examples

Look at existing parsers for reference:

- `mmlu_parser.py` for multiple-choice questions
- `gsm8k_parser.py` for math word problems
- `humaneval_parser.py` for code generation tasks

## Common Patterns

1. **Parse Entry Class**: Create a custom parse entry class if you need additional fields.
1. **Task Names**: Define all available tasks in `_task_names`.
1. **System Prompt**: Write clear instructions in the system prompt.
1. **Process Entry**: Handle data extraction and formatting in `process_entry`.
1. **Dataset Description**: Provide comprehensive dataset information.
1. **Evaluation Metrics**: Define appropriate metrics for your dataset.

## Testing Your Parser

1. Run the example usage code to verify basic functionality
1. Run pytest to execute your test cases
1. Try different dataset splits and tasks
1. Verify the parsed output format
1. Check error handling with invalid inputs (see the example test below)
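
For item 5, a small test of the failure path can live in the same test file. This assumes the step-3 implementation, which indexes `row` directly and therefore raises `KeyError` on a missing field (the validation sketched under Best Practices behaves the same way):

```python
import pytest

from llmdataparser.your_dataset_parser import YourDatasetParser


def test_process_entry_missing_field():
    parser = YourDatasetParser()
    # The step-3 implementation indexes row["answer"] directly,
    # so a row without it should raise KeyError.
    with pytest.raises(KeyError):
        parser.process_entry({"question": "Only a question"})
```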