Model Card for QuadConnect2.5-0.5B-v0.0.8b

Model Details

Model Description

  • These are still very early training experiments; the reward functions are still being revised.

  • This model was created with GRPO and Unsloth. It was trained to reason about Connect Four and to play it strategically.

  • It was made for a specific project task.

  • Developed by: Lyte

  • Model type: Small Language Model

  • Language(s) (NLP): English

  • License: TBD

  • Finetuned from model: unsloth/qwen2.5-0.5b-instruct-unsloth-bnb-4bit

  • Trained using: TRL's GRPO trainer.

Demo:

  • Example from the HF Space (version 0.0.6b): [screenshot omitted]

Quick start

  • Solution #1: run the model with the transformers pipeline:
from transformers import pipeline

SYSTEM_PROMPT = """You are a master Connect Four strategist whose goal is to win while preventing your opponent from winning. The game is played on a 6x7 grid (columns a–g, rows 1–6 with 1 at the bottom) where pieces drop to the lowest available spot.

Board:
- Represented as a list of occupied cells in the format: <column><row>(<piece>), e.g., 'a1(O)'.
- For example: 'a1(O), a2(X), b1(O)' indicates that cell a1 has an O, a2 has an X, and b1 has an O.
- An empty board is shown as 'Empty Board'.
- Win by connecting 4 pieces in any direction (horizontal, vertical, or diagonal).

Strategy:
1. Identify occupied and empty positions.
2. Find and execute winning moves.
3. If there is no winning move, block your opponent's potential wins.
4. Control the center and set up future moves.

Respond in XML:
<reasoning>
Explain your thought process, focusing on your winning move, how you block your opponent, and your strategic plans.
</reasoning>
<move>
Specify the column letter (a–g) for your next move.
</move>
"""

board = {
    "empty": "Game State:\n- You are playing as: X\n- Your previous moves: \n- Opponent's moves: \n- Current board state: Empty Board\n- Next available position per column:\n  a: a1\n  b: b1\n  c: c1\n  d: d1\n  e: e1\n  f: f1\n  g: g1\n\nMake your move.",
    "one_move": "Game State:\n- You are playing as: X\n- Your previous moves: \n- Opponent's moves: b1\n- Current board state: b1(O)\n- Next available position per column:\n  a: a1\n  b: b2\n  c: c1\n  d: d1\n  e: e1\n  f: f1\n  g: g1\n\nMake your move.",
    "four_moves": "Game State:\n- You are playing as: X\n- Your previous moves: a2(X)\n- Opponent's moves: a1 a2 b1\n- Current board state: a1(O), b1(O)\n- Next available position per column:\n  a: a3\n  b: b2\n  c: c1\n  d: d1\n  e: e1\n  f: f1\n  g: g1\n\nMake your move.",
}

generator = pipeline("text-generation", model="Lyte/QuadConnect2.5-0.5B-v0.0.8b", device="cuda")

# choose one of the example states: 'empty', 'one_move', or 'four_moves'
output = generator([{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": board['empty']}], max_new_tokens=10245, return_full_text=False)[0]
print(output["generated_text"])
  • Solution #2: GGUF Q8: Download the quantized GGUF and load it in your favorite GGUF inference engine (e.g. LM Studio).

  • Solution #3: Hugging Face Space: You can duplicate the Space, or download its code and run it locally.

Training procedure

This model was trained with GRPO, a method introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.

Preprocessing

  • I first searched for Connect Four datasets, found three candidates, and selected Leon-LLM/Connect-Four-Datasets-Collection. I filtered out empty or broken entries and uploaded the result as Lyte/ConnectFour-clean, then removed games longer than 10 turns and split the remainder into train and validation sets (the validation split was not used during training).
  • The final dataset is Lyte/ConnectFour-T10
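The length filter described above can be sketched as a simple list comprehension. This is an illustrative sketch: it assumes each game record carries its moves as a space-separated string under a `"moves"` key, which may not match the actual column names in the dataset.

```python
def filter_short_games(games, max_turns=10):
    """Keep only non-empty games decided within `max_turns` turns.

    Assumes each game is a dict with a space-separated "moves" string
    (hypothetical field name; the real schema may differ).
    """
    return [
        g for g in games
        if g["moves"].strip() and len(g["moves"].split()) <= max_turns
    ]
```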

Evaluation

  • Evaluations were conducted on the validation split of the Lyte/ConnectFour-T10 dataset to test whether the model has learned to win: each board is presented with only the winning move remaining.
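The metrics reported for this evaluation are accuracy (predicted move matches the single winning move) and the per-column move distribution. A minimal sketch of that computation, under the assumption that predictions and ground-truth moves are equal-length lists of column letters:

```python
from collections import Counter


def summarize_predictions(predicted, correct):
    """Compute accuracy and per-column move distribution for an eval run.

    `predicted` and `correct` are equal-length lists of column letters a-g.
    """
    total = len(predicted)
    hits = sum(p == c for p, c in zip(predicted, correct))
    dist = Counter(predicted)
    return {
        "accuracy": hits / total,
        # fraction of all predictions that landed in each column
        "distribution": {col: dist.get(col, 0) / total for col in "abcdefg"},
    }
```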

Summary Metrics Comparison

| Metric                | Lyte/QuadConnect2.5-0.5B-v0.0.6b | Lyte/QuadConnect2.5-0.5B-v0.0.8b |
|-----------------------|----------------------------------|----------------------------------|
| Total games evaluated | 5082                             | 5082                             |
| Correct predictions   | 518                              | 394                              |
| Accuracy              | 10.19%                           | 7.75%                            |
| Most common move      | d (41.14%)                       | d (67.61%)                       |
| Middle column usage   | 75.05%                           | 99.53%                           |

Move Distribution Comparison

| Column | Lyte/QuadConnect2.5-0.5B-v0.0.6b (Count, %) | Lyte/QuadConnect2.5-0.5B-v0.0.8b (Count, %) |
|--------|---------------------------------------------|---------------------------------------------|
| a      | 603 (19.02%)                                | 3 (0.12%)                                   |
| b      | 111 (3.50%)                                 | 4 (0.16%)                                   |
| c      | 785 (24.76%)                                | 463 (17.96%)                                |
| d      | 1304 (41.14%)                               | 1743 (67.61%)                               |
| e      | 290 (9.15%)                                 | 360 (13.96%)                                |
| f      | 50 (1.58%)                                  | 3 (0.12%)                                   |
| g      | 27 (0.85%)                                  | 2 (0.08%)                                   |

Framework versions

  • TRL: 0.15.1
  • Transformers: 4.49.0
  • Pytorch: 2.5.1+cu121
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citations

Cite GRPO as:

@article{zhihong2024deepseekmath,
    title        = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
    author       = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
    year         = 2024,
    eprint       = {arXiv:2402.03300},
}

Cite TRL as:

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}