Forecast evaluation leaderboard

TITLE = """<h1 align="center" id="space-title">Forecast evaluation leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
This space hosts evaluation results for time series forecasting models.

The results are obtained using [fev](https://github.com/autogluon/fev), a lightweight library for evaluating time series forecasting models.
"""

ABOUT_LEADERBOARD = """
## What is `fev`?

[`fev`](https://github.com/autogluon/fev) is a lightweight wrapper around the 🤗 [`datasets`](https://huggingface.co/docs/datasets/en/index) library that makes it easy to benchmark time series forecasting models.

For more information about `fev`, please check out [github.com/autogluon/fev](https://github.com/autogluon/fev).

Currently, the results in this space are a minimal proof of concept. We plan to add new benchmark datasets and tasks in the future.

## How is `fev` different from other benchmarking tools?
Existing forecasting benchmarks usually fall into one of two categories:

- Standalone datasets without any supporting infrastructure. These provide no guarantee that the results obtained by different users are comparable. For example, changing the start date or the length of the forecast horizon changes what the scores measure.
- Bespoke end-to-end systems that combine models, datasets and forecasting tasks. Such packages usually come with many dependencies and assumptions, which makes them difficult to extend or integrate into existing systems.

`fev` aims for the middle ground: it provides the core benchmarking functionality without introducing unnecessary constraints or bloated dependencies. The library supports point and probabilistic forecasting, different types of covariates, and all popular forecasting metrics.


## Submitting your model
To evaluate your model using `fev` and contribute your results to the leaderboard, please follow the [instructions in the GitHub repo](https://github.com/autogluon/fev/blob/main/docs/04-models.ipynb).
"""

CHRONOS_BENCHMARK = """
## Chronos Benchmark II results

This tab contains results for various forecasting models on the 28 datasets used in Benchmark II of the publication [Chronos: Learning the Language of Time Series](https://arxiv.org/abs/2403.07815).

These datasets were used for zero-shot evaluation of Chronos models (i.e., Chronos models were not trained on these datasets), but some of the other models included some of these datasets in their training corpus.

Each table contains the following information:

* **Average relative error**: Geometric mean of the relative errors across all tasks. The relative error for each task is computed as `model_error / baseline_error`.
* **Average rank**: Arithmetic mean of the ranks achieved by each model on each task.
* **Median inference time (s)**: Median, across tasks, of the time required to make predictions for the entire dataset (in seconds).
* **Training corpus overlap (%)**: Percentage of the datasets used in the benchmark that were included in the model's training corpus. Zero-shot models are highlighted in <span style="color:green; font-weight:bold;">green</span>.

Lower values are better for all of the above metrics.
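
As a rough illustration of how the first two columns are aggregated, here is a minimal sketch in plain Python. The error values, the model names and the helper functions `average_relative_error` and `average_rank` are hypothetical and are not part of `fev`. Averaging the ranks of tied models is an assumption made for this example.

```python
from statistics import geometric_mean, mean

# Hypothetical per-task errors (lower is better); the numbers are made up.
errors = {
    "model_a": {"task_1": 0.8, "task_2": 1.2},
    "model_b": {"task_1": 1.0, "task_2": 0.9},
    "baseline": {"task_1": 1.0, "task_2": 1.5},
}


def average_relative_error(errors, model, baseline="baseline"):
    # Relative error per task is model_error / baseline_error.
    ratios = [errors[model][task] / errors[baseline][task] for task in errors[baseline]]
    # Aggregate the per-task ratios with a geometric mean.
    return geometric_mean(ratios)


def average_rank(errors, model):
    ranks = []
    for task, score in errors[model].items():
        better = sum(1 for m in errors if errors[m][task] < score)
        tied = sum(1 for m in errors if errors[m][task] == score)
        # Assumption: tied models share the mean of the ranks they span.
        ranks.append(better + (tied + 1) / 2)
    # Aggregate the per-task ranks with an arithmetic mean.
    return mean(ranks)


for name in errors:
    print(name, round(average_relative_error(errors, name), 3), average_rank(errors, name))
```

The geometric mean is a natural fit for ratios: a model that halves the baseline error on one task and doubles it on another ends up with an average relative error of 1.0.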

Task definitions and detailed results are available on [GitHub](https://github.com/autogluon/fev/tree/main/benchmarks/chronos_zeroshot). More information about the datasets is available in [Table 3 of the paper](https://arxiv.org/abs/2403.07815).

"""