This model serves as a language policy in the Natural Language Reinforcement Learning (NLRL) framework and is trained specifically for the game of TicTacToe. It reasons step by step (chain of thought) about the current board and then outputs a move decision.
This model can be used as a TicTacToe player that explains its strategic thinking in natural language before making a move. Each output contains both a reasoning chain and a final move decision.
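Because each output mixes free-form reasoning with a final move, a caller needs some way to turn a board into text and to pull the move back out of the generated text. The following is a minimal sketch of such glue code, not the prompt format the model was trained on. The board rendering, the `build_prompt` wording, and the `Move: <cell>` output convention are all assumptions made for illustration; check them against the model's actual inputs and outputs before relying on them.

```python
import re

def render_board(board):
    """Render a 9-cell board as a 3x3 grid; empty cells show their number 1-9."""
    cells = [c if c in ("X", "O") else str(i + 1) for i, c in enumerate(board)]
    rows = [" | ".join(cells[r * 3:r * 3 + 3]) for r in range(3)]
    return "\n---------\n".join(rows)

def legal_moves(board):
    """Cell numbers (1-9) that are still empty."""
    return [i + 1 for i, c in enumerate(board) if c not in ("X", "O")]

def build_prompt(board, player):
    """Hypothetical prompt wording; the real training prompt may differ."""
    return (
        f"You are playing TicTacToe as {player}.\n"
        f"Current board:\n{render_board(board)}\n"
        f"Legal moves: {legal_moves(board)}\n"
        "Think step by step, then finish with 'Move: <cell>'."
    )

def parse_move(text, board):
    """Return the last 'Move: N' in the text if N is legal, else None.

    The 'Move: N' marker is an assumed output convention.
    """
    matches = re.findall(r"move\s*:\s*([1-9])", text, flags=re.IGNORECASE)
    if not matches:
        return None
    move = int(matches[-1])
    return move if move in legal_moves(board) else None
```

In use, the string returned by `build_prompt` would be passed to the model's text-generation call, and `parse_move` applied to the generated text. Returning `None` for a missing or illegal move lets the caller decide whether to retry or fall back to another policy.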
This model is specifically trained for TicTacToe and should not be used for other games or tasks.
The training data consists of state-action pairs collected through the NLRL actor-critic learning process, in which language-based Monte Carlo value estimates are used for policy improvement.
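To give a feel for what a Monte Carlo estimate of a state-action pair involves, here is a small numeric stand-in. It is not the NLRL procedure: this card does not describe how the language-based estimates are produced, so the sketch below simply plays uniformly random rollouts after a candidate move, counts the outcomes, and fills a fixed text template. The function names (`mc_evaluate`, `describe`) and the random rollout policy are assumptions for illustration only.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def rollout(board, to_move, rng):
    """Play uniformly random moves to the end; return 'X', 'O', or 'draw'."""
    board = list(board)
    while True:
        w = winner(board)
        if w:
            return w
        empties = [i for i, c in enumerate(board) if c == " "]
        if not empties:
            return "draw"
        board[rng.choice(empties)] = to_move
        to_move = "O" if to_move == "X" else "X"

def mc_evaluate(board, action, player, n=200, seed=0):
    """Count rollout outcomes after `player` plays 0-based cell `action`."""
    rng = random.Random(seed)
    nxt = list(board)
    nxt[action] = player
    opponent = "O" if player == "X" else "X"
    counts = {"X": 0, "O": 0, "draw": 0}
    for _ in range(n):
        counts[rollout(nxt, opponent, rng)] += 1
    return counts

def describe(counts, action, player):
    """Template summary; a stand-in for a language-based evaluation."""
    opponent = "O" if player == "X" else "X"
    n = sum(counts.values())
    return (f"After {player} plays cell {action + 1}: {counts[player]} wins, "
            f"{counts[opponent]} losses, {counts['draw']} draws "
            f"in {n} random rollouts.")
```

Comparing these summaries across the legal moves of a state is the numeric analogue of the policy-improvement signal: moves whose rollouts end well for the player are the ones a policy would be trained to prefer.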
@misc{nlrl,
title={Natural Language Reinforcement Learning},
author={Xidong Feng and Ziyu Wan and Haotian Fu and Bo Liu and Mengyue Yang and Girish A. Koushik and Zhiyuan Hu and Ying Wen and Jun Wang},
year={2024},
eprint={2411.14251},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2411.14251},
}
Base model: meta-llama/Llama-3.1-8B