Logging
EvaluationTracker
class lighteval.logging.evaluation_tracker.EvaluationTracker
< source >( output_dir: str save_details: bool = True push_to_hub: bool = False push_to_tensorboard: bool = False hub_results_org: str | None = '' tensorboard_metric_prefix: str = 'eval' public: bool = False nanotron_run_info: GeneralArgs = None )
Parameters
- output_dir (str) — Local folder path where you want results to be saved.
- save_details (bool, defaults to True) — If True, details are saved to the output_dir.
- push_to_hub (bool, defaults to False) — If True, details are pushed to the hub. Results are pushed to {hub_results_org}/details__{sanitized model_name} for the model model_name, as a public dataset if public is True, else to {hub_results_org}/details__{sanitized model_name}_private, a private dataset.
- push_to_tensorboard (bool, defaults to False) — If True, creates and pushes the results to a tensorboard folder on the hub.
- hub_results_org (str, optional) — The organisation to push the results to. See more details about the dataset organisation in EvaluationTracker.save.
- tensorboard_metric_prefix (str, defaults to "eval") — Prefix for the metrics in the tensorboard logs.
- public (bool, defaults to False) — If True, results and details are pushed to public orgs.
- nanotron_run_info (~nanotron.config.GeneralArgs, optional) — Reference to information about Nanotron model runs.
Keeps track of the overall evaluation process and relevant information.
The EvaluationTracker contains specific loggers for experiment details (DetailsLogger), metrics (MetricsLogger), and task versions (VersionsLogger), as well as for the general configuration of both the specific task (TaskConfigLogger) and the overall evaluation run (GeneralConfigLogger). It compiles the data from these loggers and writes it to files, which can be published to the Hugging Face hub if requested.
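As a minimal construction sketch (the output directory and organisation name below are illustrative placeholders, not defaults):

```python
from lighteval.logging.evaluation_tracker import EvaluationTracker

# Sketch: track results locally and optionally push details to the hub.
# "./eval_results" and "my-org" are hypothetical placeholders.
tracker = EvaluationTracker(
    output_dir="./eval_results",  # local folder for results and details
    save_details=True,            # write per-sample details under output_dir
    push_to_hub=False,            # set to True to push details datasets to the hub
    hub_results_org="my-org",     # organisation that would receive the datasets
    public=False,                 # keep any pushed datasets private
)
```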
Attributes:
- details_logger (DetailsLogger) — Logger for experiment details.
- metrics_logger (MetricsLogger) — Logger for experiment metrics.
- versions_logger (VersionsLogger) — Logger for task versions.
- general_config_logger (GeneralConfigLogger) — Logger for general configuration.
- task_config_logger (TaskConfigLogger) — Logger for task configuration.
Aggregates and returns all the loggers' experiment information in a dictionary.
This function should be used to gather and display said information at the end of an evaluation run.
Pushes the experiment details (all the model predictions for every step) to the hub.
recreate_metadata_card
< source >( repo_id: str )
Fully updates the details repository metadata card for the currently evaluated model.
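A hedged usage sketch, reusing the tracker from the construction example above; the repo_id is a placeholder for an existing details repository:

```python
# "my-org/details_my-model" is a hypothetical details repository id.
tracker.recreate_metadata_card(repo_id="my-org/details_my-model")
```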
Saves the experiment information and results to files, and to the hub if requested.
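At the end of a run, saving is typically a single call. This sketch assumes save() takes no arguments (as in recent lighteval versions, where the destinations are fixed at construction time); check your installed version's signature:

```python
# Assumption: save() takes no arguments; output_dir, push_to_hub, etc.
# were already provided to the EvaluationTracker constructor.
tracker.save()
```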
GeneralConfigLogger
class lighteval.logging.info_loggers.GeneralConfigLogger
< source >( )
Parameters
- lighteval_sha (str) — Current commit sha of lighteval used for the evaluation (for reproducibility purposes)
- num_fewshot_seeds (int) — Number of seeds for the few-shot sampling. If equal to or below 1, the experiment is done once only, with a single few-shot seed (equal to 0). If above, the experiment is reproduced several times, with a different sampling/shuffling for the few-shot examples, which follows what is done in HELM for example.
- override_batch_size (int) — Manages the batch size. If strictly positive, its value is used as the batch size for all experiments. Else, the batch size is automatically inferred depending on what fits in memory.
- max_samples (int) — If set, cuts the number of samples per task to max_samples. Note: this should only be used for debugging purposes!
- job_id (int) — If the evaluation suite is launched as a slurm job, stores the current job id. Purely informative parameter used to retrieve scheduler logs.
- start_time (float) — Start time of the experiment. Logged at class init.
- end_time (float) — End time of the experiment. Logged when calling GeneralConfigLogger.log_end_time.
- total_evaluation_time_secondes (str) — Inferred total evaluation time in seconds (from the start and end times).
- model_name (str) — Name of the currently evaluated model.
- model_sha (str) — Commit hash of the currently evaluated model on the hub if available.
- model_dtype (str) — Dtype of the model weights, as obtained when loading the model config.
- model_size (str) — Model size as obtained when loading the model config.
Logger for the evaluation parameters.
log_args_info
< source >( num_fewshot_seeds: int override_batch_size: typing.Optional[int] max_samples: typing.Optional[int] job_id: str config: Config = None )
Parameters
- num_fewshot_seeds (int) — number of few-shot seeds.
- override_batch_size (Union[None, int]) — overridden batch size. If strictly positive, its value is used as the batch size for all experiments. Else, the batch size is automatically inferred depending on what fits in memory.
- max_samples (Union[None, int]) — maximum number of samples, if None, use all the samples available.
- job_id (str) — job ID, used to retrieve logs.
- config (optional) — Nanotron Config
Logs the information about the arguments passed to the method.
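For illustration, a call with typical values, following the signature above (the job id is a placeholder):

```python
from lighteval.logging.info_loggers import GeneralConfigLogger

logger = GeneralConfigLogger()
logger.log_args_info(
    num_fewshot_seeds=1,       # single few-shot seed
    override_batch_size=None,  # infer the batch size from available memory
    max_samples=None,          # evaluate on every available sample
    job_id="12345",            # placeholder slurm job id, purely informative
)
```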
log_model_info
< source >( model_info: ModelInfo )
Logs the model information.
DetailsLogger
class lighteval.logging.info_loggers.DetailsLogger
< source >( hashes: dict = <factory> compiled_hashes: dict = <factory> details: dict = <factory> compiled_details: dict = <factory> compiled_details_over_all_tasks: DetailsLogger.CompiledDetailOverAllTasks = <factory> )
Parameters
- hashes (dict[str, list[Hash]]) — Maps each task name to the list of all its samples' Hash.
- compiled_hashes (dict[str, CompiledHash]) — Maps each task name to its CompiledHash, an aggregation of all the individual sample hashes.
- details (dict[str, list[Detail]]) — Maps each task name to the list of its samples' details. Example: winogrande: [sample1_details, sample2_details, …]
- compiled_details (dict[str, CompiledDetail]) — Maps each task name to its samples' compiled details.
- compiled_details_over_all_tasks (CompiledDetailOverAllTasks) — Aggregated details over all the tasks.
Logger for the experiment details.
Stores and logs experiment information both at the task and at the sample level.
Aggregates the details and hashes for each task and then for all tasks. We end up with a dict of compiled details for each task and a dict of compiled details for all tasks.
log
< source >( task_name: str task: LightevalTask doc: Doc outputs: list metrics: dict llm_as_prompt_judgement: typing.Optional[tuple[str, str]] = None )
Parameters
- task_name (str) — Name of the current task of interest.
- task (LightevalTask) — Current task of interest.
- doc (Doc) — Current sample that we want to store.
- outputs (list[ModelResponse]) — Model outputs for the current sample.
- metrics (dict) — Model scores for said sample on the current task’s metrics.
- llm_as_prompt_judgement (tuple[str, str]) — Tuple containing the prompt passed to the judge and the judgement for the current sample when using llm-as-judge metric.
Stores the relevant information for one sample of one task to the total list of samples stored in the DetailsLogger.
MetricsLogger
class lighteval.logging.info_loggers.MetricsLogger
< source >( metrics_values: dict = <factory> metric_aggregated: dict = <factory> )
Parameters
- metrics_values (dict[str, dict[str, list[float]]]) — Maps each task to its dictionary of metrics to scores for all the examples of the task. Example: {"winogrande|winogrande_xl": {"accuracy": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]}}
- metric_aggregated (dict[str, dict[str, float]]) — Maps each task to its dictionary of metrics to aggregated scores over all the examples of the task. Example: {"winogrande|winogrande_xl": {"accuracy": 0.5}}
Logs the actual scores for each metric of each task.
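To make the two dictionaries concrete, here is their illustrative shape, using the placeholder task and metric names from the examples above:

```python
# Per-sample scores, as stored in metrics_values.
metrics_values = {
    "winogrande|winogrande_xl": {"accuracy": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]},
}
# Aggregated scores, as stored in metric_aggregated after aggregation.
metric_aggregated = {
    "winogrande|winogrande_xl": {"accuracy": 0.5},
}
```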
aggregate
< source >( task_dict: dict bootstrap_iters: int = 1000 )
Aggregates the metrics for each task and then for all tasks.
VersionsLogger
class lighteval.logging.info_loggers.VersionsLogger
< source >( versions: dict = <factory> )
Logger of the tasks versions.
Tasks can have a version number/date, which indicates the precise metric definition and dataset version used for an evaluation.
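As a hypothetical illustration, the versions mapping simply associates each task name with its version:

```python
# Hypothetical contents of VersionsLogger.versions; the task name is a placeholder.
versions = {"winogrande|winogrande_xl": 0}
```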
TaskConfigLogger
class lighteval.logging.info_loggers.TaskConfigLogger
< source >( tasks_configs: dict = <factory> )
Logs the different parameters of the current LightevalTask of interest.