TL;DR This page collects curated insights on improving LLM reasoning via post-training (like reinforcement learning) and test-time compute (like search and sampling). In short, post-training empowers LLMs with reasoning ability, and test-time compute scales it.

OpenAI, DeepSeek, and More

<aside>

The discussion of reasoning ability went viral with the release of OpenAI's o-series and DeepSeek's R1 models. This section collects thoughts on these two and other SOTA reasoning models.

</aside>

Reasoning Models

Post-Training: Gaining Reasoning Ability

<aside>

While reinforcement learning-based fine-tuning is emerging as a post-training approach to enhance LLMs' reasoning ability, its effectiveness relative to supervised fine-tuning is still unclear. This section collects thoughts on both methods for LLM reasoning.

</aside>
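To make the contrast concrete, here is a minimal sketch (not any particular paper's recipe): SFT maximizes the likelihood of reference reasoning traces, while RL fine-tuning samples traces from the model itself and reweights them by a reward, e.g. from a verifier. The toy policy, the `verifier` callable, and all hyperparameters are placeholders.

```python
# Minimal sketch: SFT (likelihood of reference traces) vs. REINFORCE-style
# RL fine-tuning (reward-weighted log-probs of the model's own samples).
# The tiny policy below is a stand-in for a real LLM; `verifier` is hypothetical.
import torch
import torch.nn.functional as F

vocab, hidden = 100, 32
policy = torch.nn.Sequential(
    torch.nn.Embedding(vocab, hidden),   # toy "LLM": token -> hidden
    torch.nn.Linear(hidden, vocab),      # hidden -> next-token logits
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def sft_step(prompt_ids: torch.Tensor, target_ids: torch.Tensor) -> float:
    """Supervised fine-tuning: cross-entropy against a reference trace."""
    logits = policy(prompt_ids)                      # (seq, vocab)
    loss = F.cross_entropy(logits, target_ids)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def rl_step(prompt_ids: torch.Tensor, verifier) -> float:
    """RL fine-tuning: sample a trace, score it, reweight its log-probs."""
    logits = policy(prompt_ids)
    dist = torch.distributions.Categorical(logits=logits)
    sample = dist.sample()                           # the model's own trace
    reward = verifier(sample)                        # e.g., 1.0 if the answer checks out
    loss = -(reward * dist.log_prob(sample).sum())   # push up rewarded traces
    opt.zero_grad(); loss.backward(); opt.step()
    return reward
```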

Post-Training

Test-Time Compute: Scaling Reasoning Ability

<aside>

Test-time compute is an emerging field where folks are trying many different methods (e.g., search and sampling) and extra components (e.g., verifiers). This section classifies them based on the optimization targets for LLMs. Part of this idea comes from "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters".

</aside>
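As one concrete way of spending extra test-time compute, here is a minimal best-of-N sampling sketch; `generate` and `score` are hypothetical stand-ins for an LLM sampler and a verifier, not any specific library's API.

```python
# Minimal best-of-N sketch: sample n candidates, return the one the
# verifier scores highest. More samples = more test-time compute.
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],           # hypothetical LLM sampler
              score: Callable[[str, str], float],       # hypothetical verifier
              n: int = 16) -> str:
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))
```

Self-consistency (majority voting over final answers) can be seen as the special case where the "verifier" simply counts how many samples agree on each answer.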

Test-Time Compute

Verification: The Key to Reasoning

<aside>

Original line: “Verification, The Key to AI” by Rich Sutton. Verifiers serve as a key component in both post-training (e.g., as reward models for reinforcement learning) and test-time compute (e.g., as signals to guide search). This section tries to collect thoughts on process-based verification, outcome-based verification, and more.

</aside>
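A minimal sketch of the two granularities mentioned above: an outcome reward model (ORM) scores only the final answer, while a process reward model (PRM) scores each intermediate step. `orm` and `prm` are hypothetical callables, not a specific implementation.

```python
# Minimal sketch of outcome-based vs. process-based verification.
from typing import Callable, List

def outcome_score(answer: str, orm: Callable[[str], float]) -> float:
    """Outcome-based: score only the final answer."""
    return orm(answer)

def process_score(steps: List[str], prm: Callable[[str], float]) -> float:
    """Process-based: score every intermediate step, then aggregate.
    Taking the min is a common conservative aggregation choice."""
    return min(prm(step) for step in steps)
```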

Verifiers

Other Artifacts

<aside>

This section collects survey, evaluation, benchmark, and application papers, as well as online resources like blogs, posts, videos, code, and data.

</aside>

Other Artifacts

Citation