Are there any methods to evaluate LLMs on planning & decision-making?

#74
by ke99L - opened

I noticed this topic while following a LLaMA tokenizer issue.
Natural-language benchmarks on human exams and common-sense reasoning are not always useful, and recent leaderboard trends mostly reward roleplay models.
Those are not useful or visionary evals for:

  • Robustness: which outstanding models can handle noisy prompts, varied reading formats, and multiple-choice distractors (arXiv:2306.09479; see the sketch after this list)
  • Self-improvement from rationale data
  • Automated agents, tool use, and skill mining
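
To make the robustness point concrete, here is a toy sketch (the function names and the `ask_model` backend are mine, not from the paper): perturb the answer order and label format of a multiple-choice item, then measure how often the model still picks the gold answer.

```python
import random

def perturb_mc_prompt(question, choices, seed):
    """Build one prompt variant: shuffle answer order and vary label format."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    style = rng.choice(["{}.", "({})", "{})"])          # e.g. "A.", "(A)", "A)"
    lines = [question] + [
        f"{style.format(chr(ord('A') + i))} {choices[j]}" for i, j in enumerate(order)
    ]
    return "\n".join(lines) + "\nAnswer:", order

def consistency(ask_model, question, choices, gold_idx, n_variants=8):
    """Fraction of variants where the model still picks the gold answer.
    `ask_model(prompt) -> int` is a stand-in for any scoring backend."""
    hits = 0
    for seed in range(n_variants):
        prompt, order = perturb_mc_prompt(question, choices, seed)
        pred = ask_model(prompt)          # position in the shuffled list
        hits += order[pred] == gold_idx   # map back to the original choice index
    return hits / n_variants
```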

Possibly related:
language models in reinforcement-learning game environments
natural-language feedback for code generation
GPT-4 text agents solving games

SPRING has the LLM read the game's rules and follow them:
https://arxiv.org/pdf/2305.15486.pdf

SwiftSage combines fast and slow thinking to aid LM agents:
https://arxiv.org/pdf/2305.17390.pdf

An RL setup where the pretrained language-modeling head is kept and a value head is newly initialized on top:
https://arxiv.org/pdf/2302.02662.pdf
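
For reference, here is roughly what that setup looks like with the trl library's value-head wrapper (a minimal sketch with gpt2 as a placeholder backbone, not the paper's actual code):

```python
# pip install torch transformers trl
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead

# The wrapper keeps the pretrained language-modeling head and adds a freshly
# initialized scalar value head on top of the last hidden states.
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("go to the kitchen and", return_tensors="pt")
with torch.no_grad():
    lm_logits, _, values = model(**inputs)  # trl returns (logits, loss, value)

print(lm_logits.shape)  # (1, seq_len, vocab_size): next-token distribution
print(values.shape)     # (1, seq_len): per-token value estimates for PPO
```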

Symbolic / pseudo-code reasoning executors
VOYAGER-style long-form system prompts / initial logits (logprobs) and state
https://arxiv.org/pdf/2305.11790.pdf
https://arxiv.org/pdf/2305.05383.pdf
https://arxiv.org/pdf/2201.11473.pdf
https://arxiv.org/pdf/2305.19213.pdf
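
On the logprobs point, a small sketch of scoring candidate agent actions by sequence log-likelihood with Hugging Face transformers (the model name and example state are placeholders, not from any of these papers):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder backbone
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def sequence_logprob(prompt, continuation):
    """Total log-probability the model assigns to `continuation` given `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    full_ids = torch.cat([prompt_ids, cont_ids], dim=1)
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(dim=-1)
    # Logits at position i predict token i+1, so score only the continuation span.
    n = prompt_ids.shape[1]
    step = logprobs[0, n - 1:-1].gather(1, cont_ids[0].unsqueeze(1))
    return step.sum().item()

# e.g. rank candidate actions for an agent state:
state = "You are in a forest. Next action:"
actions = [" chop the tree", " open the door"]
best = max(actions, key=lambda a: sequence_logprob(state, a))
```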

Open LLM Leaderboard org

Hi! We are mostly going to focus on evaluations already available in the EleutherAI Harness, so if your suggestions are available there, I'll add them to the list of evals we are considering for next steps!

@clefourrier

Back to this discussion:
you may already know AgentBench, and the textual-game evals in https://github.com/Ber666/llm-reasoners

https://arxiv.org/pdf/2308.04592.pdf
Meta just released a model fine-tuned for evaluation; this seems good.

And some topics I found recently:

Task embeddings, soft prompts, and task relationships
HELM https://arxiv.org/pdf/2306.10062
MosaicML's explainer: https://www.mosaicml.com/llm-evaluation
https://aclanthology.org/2021.emnlp-main.827.pdf
https://arxiv.org/pdf/2110.07904.pdf
Soft-prompt interpolation:
https://arxiv.org/pdf/2210.03029.pdf
https://arxiv.org/pdf/2205.11961.pdf
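
The soft-prompt interpolation idea in a toy PyTorch sketch (shapes and names are mine, assuming prompt-tuning-style virtual tokens):

```python
import torch
import torch.nn as nn

# Two soft prompts learned on different tasks (toy shapes; in practice these
# come from prompt tuning, e.g. 20 virtual tokens x model hidden size).
hidden, n_virtual = 768, 20
prompt_task_a = nn.Parameter(torch.randn(n_virtual, hidden))
prompt_task_b = nn.Parameter(torch.randn(n_virtual, hidden))

def interpolate_prompts(alpha):
    """Convex combination of two task prompts; alpha=0 -> task A, alpha=1 -> task B."""
    return (1 - alpha) * prompt_task_a + alpha * prompt_task_b

def prepend_prompt(soft_prompt, input_embeds):
    """Prepend virtual tokens to a batch of input embeddings of shape (B, T, H)."""
    batch = input_embeds.shape[0]
    virtual = soft_prompt.unsqueeze(0).expand(batch, -1, -1)
    return torch.cat([virtual, input_embeds], dim=1)

# e.g. probe task relatedness by sweeping alpha and measuring downstream loss:
mixed = interpolate_prompts(alpha=0.5)
```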

/// Roughly: ARC = a longer MMLU, language modeling = HellaSwag, MMLU = human exams, TruthfulQA = zero-shot
In-context roleplay needs common sense & grammar
sense: PIQA, OpenBookQA, BIG-bench Strange Stories, BIG-bench Novel Concepts
grammar: LAMBADA, HellaSwag, Winogrande
///

Fine-tuning evals / studying out-of-domain behavior
https://arxiv.org/pdf/2308.04014
https://arxiv.org/pdf/2308.04430
https://wandb.ai/novelaix/basedformer-tests/reports/NovelAI-LM-13B-402k-pretrain--Vmlldzo0Nzk5OTE0?accessToken=xo28vvfdusi2qny2m5vfgxsj3tsxe4qxsjkl8nsxgz0u852k5i7qae3bgze2hyei

ICL / rationales: loss of context
Anthropic on influence functions https://arxiv.org/pdf/2308.03296.pdf
GPT copy-paste https://arxiv.org/pdf/2304.08637.pdf
HELM multiple-choice https://arxiv.org/pdf/2306.09479.pdf

Cross-lingual / coding evals / out-of-vocabulary handling / clustering and vocabulary allocation
https://www.reddit.com/r/LocalLLaMA/comments/157khlg/google_sheets_link_to_huggingface_data_i_scraped/
https://arxiv.org/pdf/2112.06598.pdf https://arxiv.org/pdf/2301.09626.pdf
https://github.com/malteos/clp-transfer https://arxiv.org/pdf/2202.12312.pdf
title "As Good as New. How to Successfully Recycle English GPT-2 to Make Models for Other Languages" lol
https://arxiv.org/pdf/2110.13434.pdf https://arxiv.org/pdf/2109.07460.pdf
Domain Tokenization Transfer
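
The recycling trick in these papers roughly amounts to: keep the transformer body, and initialize the new tokenizer's embedding matrix from the old one wherever token strings overlap. A rough sketch with transformers (both tokenizers are gpt2 stand-ins here; non-overlapping tokens get random init, whereas CLP-transfer initializes them more cleverly from similar tokens):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src_tok = AutoTokenizer.from_pretrained("gpt2")  # source-language tokenizer
tgt_tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for a new-language tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")

old_emb = model.get_input_embeddings().weight.detach()
# Random init matched to the old embeddings' statistics for unseen tokens.
new_emb = torch.normal(
    old_emb.mean().item(), old_emb.std().item(),
    size=(len(tgt_tok), old_emb.shape[1]),
)

# Copy embeddings for tokens whose string form exists in both vocabularies.
src_vocab = src_tok.get_vocab()
copied = 0
for token, new_id in tgt_tok.get_vocab().items():
    old_id = src_vocab.get(token)
    if old_id is not None:
        new_emb[new_id] = old_emb[old_id]
        copied += 1

model.resize_token_embeddings(len(tgt_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
print(f"recycled {copied}/{len(tgt_tok)} token embeddings")
```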

ke99L changed discussion status to closed
