Argunauts: Open LLMs that Master Argument Analysis with Argdown

This is the kick-off announcement of my Argunauts project. I'm currently trying to teach LLMs logical argument analysis and argument mapping with Argdown, and will share progress and lessons learned in a series of articles, starting today with a brief description of the overall goals, background, and preliminary plan.
Goal: Mastering Argdown
Argdown is a simple markup language for structuring complex argumentation. It's especially useful for mapping comprehensive debates (e.g., about geoengineering or eating animals) and for noting detailed logical analyses of individual arguments.
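To give a flavor of the notation, here is a minimal, made-up Argdown sketch (the titles and claims are purely illustrative): a tiny map relating a claim to one supporting and one attacking argument, followed by a premise-conclusion reconstruction of the supporting argument:

```argdown
[No meat]: We should not eat animals.
  <+ <Suffering>: Eating animals causes unnecessary suffering.
  <- <Nutrition>: Some nutrients are hard to obtain without animal products.

<Suffering>

(1) Animals can suffer.
(2) We ought not to cause unnecessary suffering.
-----
(3) We should not eat animals.
```

Square and angle brackets name statements and arguments, `<+`/`<-` mark support and attack relations, and the hyphen line separates premises from the conclusion.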
Argunauts are LLMs that master these methods.
The aim of this project is to build Argunauts by teaching LLMs
- 1️⃣ to carry out comprehensive and in-depth argumentative analysis,
- 2️⃣ to document their reconstructions in a standardized form,
- 3️⃣ to effectively use external tools (parsers, theorem provers) for those purposes,
- 4️⃣ without sacrificing any other skills the base model possesses.
Training will focus on producing semantically good Argdown snippets, but Argunauts should be able to reliably annotate source texts with XML or write Z3 code for formalization and validity checks, too.
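To illustrate the kind of validity check meant here, a toy brute-force truth-table checker (a hypothetical stand-in for a Z3 encoding, not actual Z3 code; the example arguments are made up):

```python
from itertools import product

def is_valid(premises, conclusion, atoms):
    """Propositional validity via brute-force truth tables: valid iff no
    assignment makes every premise true and the conclusion false."""
    for values in product([False, True], repeat=len(atoms)):
        env = dict(zip(atoms, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # found a countermodel
    return True

# Modus ponens: p, p -> q, therefore q  (valid)
mp_premises = [lambda e: e["p"], lambda e: (not e["p"]) or e["q"]]
print(is_valid(mp_premises, lambda e: e["q"], ["p", "q"]))   # True

# Affirming the consequent: q, p -> q, therefore p  (invalid)
ac_premises = [lambda e: e["q"], lambda e: (not e["p"]) or e["q"]]
print(is_valid(ac_premises, lambda e: e["p"], ["p", "q"]))   # False
```

A real setup would hand the formalized premises and conclusion to an SMT solver instead of enumerating assignments, but the success criterion is the same: no countermodel exists.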
What are Argunauts good for? They could, for example,
- assist, as forbearing AI tutors, students in learning logical argument analysis;
- supercharge, as powerful copilots, expert argumentation analysts;
- carry out self-critique tasks in AI workflows and agentic AI systems;
- power LLM-based critical thinking tools as expert AIs;
- serve as expert models that bring specialist domain-knowledge to MoE models.
Challenges
1. LLMs see virtually no Argdown code during pretraining.
Argdown is a fringe project.
You find hardly any Argdown code examples on GitHub (according to GitHub code search):
| language | search string | hits (# files) |
|---|---|---|
| YAML | `` ```yaml `` | 1.8M |
| Mermaid.js | `` ```mermaid `` | 257K |
| Argdown | `` ```argdown `` | 266 |
And that shows, e.g., in this shared chat with Llama-3.3-70B. (I'm somewhat surprised and pleased, however, to see that Llama-3.3-70B seems to know what Argdown is about.)
2. LLMs don't see how to analyse individual arguments or how to map an entire controversy during pretraining, either.
Carrying out logical analysis and presenting arguments in standard form is, considering the entirety of our natural language text corpora, an absolutely rare exception. Most speakers and readers will never come across an argument map or see the presentation of a logical inference in standard form. (Rendering arguments as premise-conclusion structures is somewhat common in philosophy, but then, mainly restricted to some of its branches.)
That's confirmed by Google Books Ngrams: characteristic terms that indicate explicit logical argument analysis are several orders of magnitude less frequent than unspecific reason or argument talk.
| n-gram | frequency (2020) |
|---|---|
| pros and cons | ~2e-4 |
| political debate | ~6e-5 |
| controversial issue | ~2e-5 |
| inference rule | ~4e-6 |
| argumentation scheme | ~7e-7 |
So, while LLMs may very well learn summarization, translation, arithmetic or JavaScript on the fly during pretraining, that's not the case for deep logical argument analysis.
3. It's tricky to find sufficient amounts of training data for logical argument analysis finetuning.
Moreover, despite a substantial literature on critical thinking, including textbooks and other training materials that introduce the language of, and describe the methods for, logical analysis, there exist, in my opinion, relatively few texts that explicitly teach logical analysis through step-by-step demonstration: Students are hardly ever shown how to gradually apply an abstract theory of argumentation or a general method of analysis in order to reconstruct or map a messy source text.
So it's far from clear that explicit and targeted corpus scraping and literature research will yield a sufficiently large and diverse finetuning dataset.
4. Argdown argument analysis is a complex problem that requires the coordinated execution of diverse and difficult subtasks.
Argument analysis is not a single, well-defined NLP task.
If we liken summarization, NLI and RTE, classification, multistep Q&A, paraphrasing, formalization etc. to footing, roofing, plumbing, electrical installation, insulating, or painting, then analysing an argumentation is more like building an entire house.
A proficient argument analyst
- decomposes the comprehensive argumentative analysis into clearly delineated tasks with specific success criteria and subgoals,
- reliably executes linguistic and logical subtasks (ranging from text annotation to FOL formalization),
- effectively coordinates subtasks involved in analysing an argumentation through adaptive planning,
- while assessing and monitoring the overall quality of the reconstruction throughout the analysis.
So, LLMs will have to be trained on a large variety of elementary critical thinking tasks, and must be enabled to critically assess a given state of analysis, as well as to plan ahead.
5. There is no clear right or wrong.
Argumentation analysis is rational reconstruction. An analyst extracts, clarifies and renders the text's reasoning as a transparent and correct argumentation. The quality of a reconstruction is broadly determined by two questions:
- Is the argumentative reconstruction faithful to the source text (exegetic adequacy)?
- Is the argumentative reconstruction logically correct and epistemically plausible (systematic adequacy)?
We can now see why there is no gold standard or ground truth when reconstructing arguments:
- We lack a scholarly consensus about what it means for an interpretation to be faithful, or for an argument to be logically correct and epistemically plausible.
- Accordingly, there is no agreed-upon operationalisation of exegetic and systematic adequacy.
- Even if exegetic and systematic adequacy are carefully specified and can be unequivocally assessed, both standards will typically pull in opposite directions (more faithful means less plausible, and vice versa), leaving room for discretion when balancing the criteria.
- Similarly, both exegetic and systematic adequacy include a bunch of different sub-criteria (e.g., relevance, inferential validity, clarity) which are applied to different parts of a reconstruction when assessing its quality (each premise, for example, might or might not be logically relevant; might be confirmed by, be consistent with, or contradict a scientific consensus). This demands weighing the criteria, and increases further the evaluative wiggle room and the overall underdetermination of logical analysis.
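The wiggle room can be made vivid with a small, purely hypothetical sketch: two analysts who weigh exegetic against systematic adequacy differently will prefer different reconstructions of the very same text (all scores below are invented):

```python
# Hypothetical adequacy scores (0..1) for two candidate reconstructions.
candidates = {
    "literal reading":    {"exegetic": 0.9, "systematic": 0.4},
    "charitable reading": {"exegetic": 0.5, "systematic": 0.9},
}

def overall(scores, w_exegetic):
    """Weighted mean; the weight encodes an analyst's discretionary
    trade-off between faithfulness and logical/epistemic quality."""
    return (w_exegetic * scores["exegetic"]
            + (1 - w_exegetic) * scores["systematic"])

for w in (0.7, 0.3):  # two analysts with different priorities
    best = max(candidates, key=lambda c: overall(candidates[c], w))
    print(f"weight {w}: prefers the {best}")
```

With weight 0.7 the literal reading wins; with weight 0.3 the charitable one does. Neither analyst makes a mistake, which is exactly the underdetermination at issue.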
6. Humans are not really good at it, either.
Cognitive scientists disagree about how irrational humans really are, which is what the Great Rationality Debate is about.
Teaching critical thinking and logical argument analysis for more than 20 years has made me a sceptic: Students and scholars alike (myself included) do not excel at producing sound arguments, struggle in controversial debates, and typically fail to evaluate, reflect on, clarify and improve their argumentation. (Fun fact: The Dunning-Kruger effect seems to have been found, originally, for critical thinking tasks.)
Empirical data from critical thinking tests seems to confirm my personal experience. If interested, consider Fabio Paglieri's discussion for further reading.
So, collecting human feedback at large scale will not necessarily help to improve the argumentative skills of LLMs, either.
In sum, the challenge when building Argunauts is to teach LLMs a method
- which involves a semantically rich markup language they are totally unfamiliar with,
- whose execution they have barely witnessed, and
- which requires complex coordination and planning
— all without:
- relying on an existing large and diverse corpus of textbooks and demonstrations,
- assuming a ground truth for any given reconstruction task, and
- resorting to (crowd-sourced) human feedback.
Some Background and Related Work
I'm taking up previous research on argument analysis with LLMs, especially the DeepA2 project, where Kyle Richardson and I described a general framework for implementing complex argumentation analysis with multi-purpose sequence-to-sequence models. You'll find a lot of pointers and references relevant for building Argunauts in the DeepA2 paper.
Argunauts seek to bridge the gap between natural-language and symbolic reasoning. This has become a fascinating field in recent years, in particular with respect to mathematical reasoning, proofs, and Lean. My focus here will be on teaching LLMs formalization and logico-semantic analysis by framing this problem as a subtask required for interpreting, understanding and rationally reconstructing argumentative texts.
And, of course, there's the powerful AI and argumentation community, which has been cultivating the field of argument mining over the last decade. I see argument mining, understood as argumentative text annotation, as a specific and potentially useful subtask involved in analysing and reconstructing complex argumentation.
Masterplan
Given the goals and challenges, my initial strategy is:
- Take multi-step post-training pipelines like Tülu 3 as conceptual starting point.
- Follow Textbooks are all you need and generate massive synthetic datasets with coordinated, multi-step demonstrations of the diverse critical thinking and logical analysis tasks.
- Conceptualize the first SFT post-training step as continual pretraining, aimed at making LLMs familiar with Argdown syntax and semantics.
- Repeatedly learn from high-quality examples without memorization by (online) DPO.
- Build tools and diverse verifiers based on Argdown parser to further iterate with RLVR.
- Keep training methods straightforward, deviating from and simplifying existing designs where necessary.
- When training AIs, try to learn as much as possible from the experience with teaching critical thinking at university and school.
- Generously mix in general-purpose instruction-following and reasoning data to retain present skills.
- Set up a benchmark for higher-order critical thinking / argument analysis to track performance.
- Open science and open source.
- Invite others to join (but don't be frustrated if no one follows).
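To illustrate the verifier idea in the plan above, here is a minimal, purely syntactic sketch of a binary reward check for premise-conclusion snippets (a hypothetical stand-in; a real verifier would call the actual Argdown parser and check semantics, not just surface form):

```python
import re

def verify_standard_form(snippet: str) -> bool:
    """Toy binary verifier: accept a snippet iff it has consecutively
    numbered statements, an inference line (four or more hyphens),
    and ends with a numbered conclusion."""
    lines = [ln.strip() for ln in snippet.strip().splitlines() if ln.strip()]
    numbers = [int(m.group(1)) for ln in lines
               if (m := re.match(r"\((\d+)\)", ln))]
    has_inference_line = any(re.fullmatch(r"-{4,}", ln) for ln in lines)
    return (has_inference_line
            and len(numbers) >= 2
            and numbers == list(range(1, len(numbers) + 1))
            and bool(re.match(r"\(\d+\)", lines[-1])))

good = """\
(1) Animals can suffer.
(2) We ought not to cause unnecessary suffering.
-----
(3) We should not eat animals.
"""
print(verify_standard_form(good))                 # True
print(verify_standard_form("(1) Lone premise."))  # False
```

Even such crude checks yield the kind of automatic, binary reward signal RLVR needs, without assuming a ground-truth reconstruction.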
Resources
For further reading, some of this has been linked above:
- Argdown.org: Examples, Documentation, Tutorial, Sandbox
- "Analysing Practical Argumentation" (G Brun and G Betz), in: The Argumentative Turn in Policy Analysis. Reasoning About Uncertainty. Springer.
- Blog post introducing DeepA2
- "A Plea for Ecological Argument Technologies" (F Paglieri) Philosophy & Technology.