Self Evolving Agents Tutorial En

Source: docs/tutorials/self_evolving_agents_tutorial_en.md SHA256: 65f5baf62206 Rendered: 2026-05-19 19:02 UTC

§0 TL;DR Cheat Sheet

Self-Evolving Agents in 8 sentences

One page covering the 2024-2026 frontier direction (see §1–§11 for derivations).

  1. Core problem: have an agent continuously improve its capability on long-horizon tasks, without relying on repeated human annotation. Formalized as the convergence / stability / asymptotic effectiveness of an update operator $\mathcal{T}: \pi_t \mapsto \pi_{t+1}$.
  2. Three paradigms: ① Experience-Driven (human-made tasks + reward, e.g. AgentTuning, Voyager); ② Adversarial Self-Play (Challenger-Solver, e.g. Absolute Zero, Ctx2Skill); ③ Meta-Learning / Reward-Free (task-free, reward-free exploration + outcome-based reward, e.g. Native Evolution).
  3. Capability container: natural-language skill / world knowledge K written in markdown. This is the most important paradigm shift of 2024-2026: it bypasses parameter updates, and everything happens at inference time as system_prompt += K.
  4. Ctx2Skill 5-role self-play (arXiv 2604.27660): Challenger / Reasoner / Judge / Proposer / Generator, with a frozen LM and an evolving skill set. Cross-Time Replay picks $\arg\max_i \rho^h(i) \cdot \rho^e(i)$ to prevent adversarial collapse.
  5. Native Evolution two-phase (arXiv 2604.18131): Evolution phase explores task-free and reward-free → distills markdown K; Execution phase uses K as system prompt. Training signal $R_\text{evolve}(\mathcal{K}) = \text{Success}(\mathcal{T}_E\mid\mathcal{K}) - \text{Success}(\mathcal{T}_E\mid\varnothing)$.
  6. A²RD trio (arXiv 2605.06924): MVMem (textual states + frames + videos + dependency DAG) + Adaptive Segment Gen + HITS (frame-level + video-level self-check). Directly transferable as a memory + audit template for any long-horizon agent.
  7. Theoretical upper bound: [arXiv:2601.05280] shows that closed-loop density matching degenerates in the absence of an exogenous grounding signal (not that all reward-free training must collapse; this is a specific conclusion for that setting); [arXiv:2507.00075] models the solver-verifier gap and empirically fits it to capability dynamics.
  8. Common failures: adversarial collapse (Challenger gets extreme), memory drift (internal contradictions accumulating in K), reward hacking (self-rewarding drift), bias amplification (agent retrained on its own output), capability ceiling (self-improvement degrades when exogenous grounding is missing).

§1 Self-Evolving Agent Intuition

"Self-evolving" is not magic. An LLM agent consists of four parts:

"Self-evolution" is defining an update operator $\mathcal{T}$:

$$\big(\pi_t, \mathcal{K}_t, \mathcal{S}_t\big) \xrightarrow{\mathcal{T}(E, \text{trajectories})} \big(\pi_{t+1}, \mathcal{K}_{t+1}, \mathcal{S}_{t+1}\big)$$

By the object updated by $\mathcal{T}$, 2024-2026 work roughly divides into four "layers":

| Layer | Update target | Update method | Representative work |
|---|---|---|---|
| L1 Parameter layer | $\pi$ (model weights) | SFT / RFT / RL | AgentTuning, Native Evolution |
| L2 Capability layer | $\mathcal{S}$ (skills markdown) | self-play + replay | Voyager, Ctx2Skill, CoEvoSkills |
| L3 Memory layer | $\mathcal{K}$ (world knowledge markdown) | exploration + summarize | MemGPT, MVMem, Native Evolution |
| L4 System layer | workflow orchestration | inference-time only | Anthropic Skills, ARIS-style harness |

Important intuition

L1 is "train instincts"; L2/L3 is "grow a toolbox + notebook"; L4 is "workflow orchestration." The 2025-2026 mainstream is L2 + L3, with L1 mainly preparing training-test decoupling.

Two intuitions commonly missed in interviews:

§2 Formalizing the Three Self-Evolution Paradigms

The Native Evolution paper [arXiv:2604.18131] gives a very clear classification — we add the mathematical formulation.

2.1 Experience-Driven Evolution

Setting: humans provide a task set $\mathcal{T}$, a reward function $R: \mathcal{O} \times \mathcal{A} \to \mathbb{R}$, and a workflow. The agent rolls out trajectories $\tau$, and the update weights each trajectory by its return $R(\tau)$ (the per-step reward accumulated along $\tau$).

Update operator:

$$\theta_{t+1} = \theta_t + \eta \,\mathbb{E}_{\tau \sim \pi_{\theta_t}}\!\left[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\right]$$

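This is the standard policy-gradient form. AgentTuning and ToolLLM fall in this category; Voyager is experience-driven too, but it stores experience in a skill library instead of taking gradient steps (see §4.1).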

Pros: high supervision density, fast convergence. Cons: huge labor cost (each new environment needs new reward design).
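To make the update rule concrete, here is a minimal sketch on a toy two-armed bandit, where a "trajectory" is a single action and the expectation can be computed exactly instead of sampled. The softmax policy, the reward values and the step size are illustrative assumptions, not taken from any of the cited papers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_policy_gradient(logits, rewards):
    """Exact E_{a~pi}[ grad_theta log pi(a) * R(a) ] for a softmax policy."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        for k in range(len(logits)):
            # d/d logit_k of log pi(a) equals 1[a == k] - pi(k)
            grad[k] += p_a * ((1.0 if a == k else 0.0) - probs[k]) * r_a
    return grad

def train(rewards, eta=0.5, steps=200):
    """Gradient ascent: theta <- theta + eta * expected gradient."""
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        grad = expected_policy_gradient(logits, rewards)
        logits = [z + eta * g for z, g in zip(logits, grad)]
    return softmax(logits)

# Arm 1 pays more than arm 0, so the policy should concentrate on arm 1.
final_probs = train(rewards=[0.2, 1.0])
print(final_probs)
```

In a real agent the expectation is estimated from sampled trajectories and the reward comes from the human-designed $R$, which is exactly where the labor cost mentioned above shows up.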

2.2 Adversarial Self-Play Evolution

Setting: two agents (Challenger + Solver) evolve jointly, with no external task source — tasks are produced by Challenger, solved by Solver, with verifier feedback.

Update operator (in Absolute Zero / R-Zero formalization):

$$\theta^{\text{ch}}_{t+1}, \theta^{\text{sol}}_{t+1} = \arg\min_{\theta^{\text{ch}}, \theta^{\text{sol}}} \;\mathbb{E}_{t\sim \pi^\text{ch}_t}\big[\ell^\text{ch}(t)\big] + \lambda\, \mathbb{E}_{(t,a)\sim \pi^\text{ch}_t,\pi^\text{sol}_t}\big[\ell^\text{sol}(t,a)\big]$$

Concrete "learnability reward" (schematic form, in the spirit of Absolute Zero, arXiv 2505.03335):

$$R^\text{learn}(t) = \pi^\text{sol}_t(\text{correct}\mid t)\cdot \big(1 - \pi^\text{sol}_t(\text{correct}\mid t)\big)$$

Maximizing it favors 50% difficulty — neither too easy nor too hard. This is the core of curriculum-as-reward.
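A quick numeric check of the formula above: the sketch below evaluates $p(1-p)$ on a grid of solver success probabilities. The grid and the function name are illustrative choices.

```python
def learnability_reward(p_correct: float) -> float:
    """R_learn = p * (1 - p), where p is the solver's success probability on the task."""
    return p_correct * (1.0 - p_correct)

# Evaluate on p = 0.0, 0.1, ..., 1.0 and find the most "learnable" difficulty.
grid = [i / 10 for i in range(11)]
best_p = max(grid, key=learnability_reward)

for p in grid:
    print(f"p={p:.1f}  reward={learnability_reward(p):.2f}")
print("best p:", best_p)
```

Tasks the solver always solves ($p=1$) or never solves ($p=0$) earn the Challenger zero reward, so it is pushed toward the boundary of the Solver's current ability.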

Pros: no need for human-made task sets; Cons: still need verifier (code executor / math checker), and prone to adversarial collapse (Challenger produces extreme tasks, Solver learns trivial defense).

2.3 Meta-Learning / Reward-Free Evolution

Setting (Native Evolution): training stage provides outcome-based reward (not step-level); inference stage has no task, no reward — the agent autonomously explores → distills markdown world knowledge $\mathcal{K}$ → uses $\mathcal{K}$ as system prompt in downstream tasks.

Reward design (Native Evolution core formula):

$$\boxed{\;R_\text{evolve}(\mathcal{K}) \;=\; \text{Success}(\mathcal{T}_E \mid \mathcal{K}) \;-\; \text{Success}(\mathcal{T}_E \mid \varnothing)\;}$$

where $\mathcal{T}_E$ is the set of downstream tasks in environment $E$ (observable at training time). Reward measures the downstream utility gain of K, without needing step-level supervision.
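The reward can be sketched in a few lines. Everything below is a toy: `toy_solve` stands in for the downstream agent, and the task list and knowledge string are made-up placeholders chosen only to make the arithmetic visible.

```python
def success_rate(solve, tasks, knowledge):
    """Fraction of (question, answer) pairs the solver gets right given K."""
    hits = sum(1 for question, answer in tasks if solve(question, knowledge) == answer)
    return hits / len(tasks)

def r_evolve(solve, tasks, knowledge):
    """R_evolve(K) = Success(T_E | K) - Success(T_E | no K)."""
    return success_rate(solve, tasks, knowledge) - success_rate(solve, tasks, None)

def toy_solve(question, knowledge):
    """Hypothetical solver: looks the answer up in 'key: value' lines of K."""
    if knowledge:
        for line in knowledge.splitlines():
            key, sep, value = line.partition(": ")
            if sep and key == question:
                return value
    return "unknown"

tasks = [("venue", "Vienna"), ("deadline", "Feb 15"),
         ("chair", "unknown"), ("track", "NLP")]
K = "venue: Vienna\ndeadline: Feb 15"

reward = r_evolve(toy_solve, tasks, K)
print(reward)  # 3/4 with K minus 1/4 without K
```

Subtracting the no-K baseline matters: the third task is solved even without K, so it contributes nothing to the reward.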

Pros: completely task-free / reward-free at inference; Cons: high training cost (rejection sampling RFT × 2 iterations), and $\mathcal{T}_E$ still needs labeled data at training time.

Common confusion: reward-free at inference ≠ reward-free training

Native Evolution's evolution phase is indeed task-free / reward-free at inference; but at training time it still needs a labeled set of 600 deep search questions × 20 websites to compute $R_\text{evolve}$. This is an interview bonus point: actively disambiguate.

2.4 Three-paradigm comparison (memorize)

| Dimension | Experience-Driven | Adversarial | Meta-Learning |
|---|---|---|---|
| Task source at training | Humans | Challenger agent | Endogenous exploration + labeled downstream |
| Reward at training | Humans | verifier | outcome utility |
| Task at inference | Given by humans | Given by humans / agent itself | Given by humans |
| Reward at inference | Not needed | Not needed | Not needed |
| Workflow at inference | Human-orchestrated | Human-orchestrated | Agent-driven (evolve then execute) ✓ |
| Representative | AgentTuning, ToolLLM | AZR, R-Zero, Ctx2Skill | Native Evolution |
| Engineering cost | High (reward eng.) | Medium (verifier orchestration) | High (rejection sampling) |
| arXiv | 2310.12823 | 2505.03335 / 2604.27660 | 2604.18131 |

§3 Markdownification of Skills / Knowledge (the most important engineering shift)

The most important paradigm shift of 2024-2026 is that long-term memory and capability extension no longer go through weight updates, but through external markdown documents.

3.1 Convergence of Anthropic Skills + Native Evolution

Anthropic publicly released the skills/ paradigm in 2025 (each skill is a standalone markdown file, the agent loads on demand into the system prompt). Native Evolution explicitly cites Anthropic skills in the paper as a reference implementation for K (paper §3, footnote 1 pointing to github.com/anthropics/skills/tree/main/skills).

| Dimension | Anthropic Skills | Native Evolution K |
|---|---|---|
| Representation | markdown | markdown |
| Loading | Select skill by task and inject into system prompt | Load K by environment and inject into system prompt |
| Granularity | "How to do PDF / Excel / git commit" | "ACL2025 site structure / code repo topology" |
| Supervision | Human-written | Auto-distilled by agent |
| Origin | Static human-made | Post-training endogenous |

→ Convergence conclusion: system prompt is the new model weights, markdown documents are the new fine-tuning data.

3.2 Typical Schema of Skill / K Files

# skill_name
## Trigger / When-to-use
<which task should use this skill>

## Steps
1. ...
2. ...

## Resources / References
- file paths / URLs

## Failure modes
- Known pitfalls + fixes
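As a rough illustration of how such a file might be consumed, the sketch below parses the schema into sections and appends the raw skill to a system prompt. The skill content (`pdf_extract`), the helper names and the base prompt are hypothetical; this is not the loader used by Anthropic Skills or Native Evolution.

```python
def parse_skill(markdown: str) -> dict:
    """Split a skill file into its name (# heading) and sections (## headings)."""
    skill = {"name": "", "sections": {}}
    current = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            skill["sections"][current] = []
        elif line.startswith("# "):
            skill["name"] = line[2:].strip()
        elif current is not None and line.strip():
            skill["sections"][current].append(line.strip())
    return skill

def build_system_prompt(base_prompt: str, skill_files: list) -> str:
    """Inference-time 'system_prompt += K': concatenate the selected skill files."""
    return "\n\n".join([base_prompt, *skill_files])

SKILL_MD = """# pdf_extract
## Trigger / When-to-use
Task mentions extracting text or tables from a PDF.

## Steps
1. Locate the file.
2. Extract text page by page.

## Failure modes
- Scanned PDFs need OCR first.
"""

skill = parse_skill(SKILL_MD)
prompt = build_system_prompt("You are a helpful agent.", [SKILL_MD])
print(skill["name"], list(skill["sections"]))
```

Keeping the sections addressable is what later allows matching on the trigger section alone without loading the whole file.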

Native Evolution's K also explicitly stores:

3.3 Why markdown rather than vector embedding?

Cost: retrieval precision is worse than vector RAG; the fix is hybrid (vector for candidate selection → markdown for close reading).

§4 Voyager / Reflexion / STaR: the foundational trio

Before going into 2024-2026 frontier work, you must chew through the three foundational works — interviewers will almost certainly ask about baselines.

4.1 Voyager (Wang 2023 NeurIPS, NVIDIA + Caltech)

The first true end-to-end "automatic curriculum + skill library" agent: running GPT-4 in Minecraft, letting it propose its own tasks, write JS code (each piece of code is a skill), self-verify, and store successful skills in the library.

The core trio:

Common misconception

Voyager does not update GPT-4 weights, it is pure inference-time. It does not use reward either; it uses GPT-4 itself as a critic to judge task success, belonging to the self-verification category (not RL).

4.2 Reflexion (Shinn 2023 NeurIPS, Northeastern)

Solidifies the concept of "verbal RL": after each failure, the agent writes a reflection in natural language about its own trajectory, stores it in episodic memory, and prepends it to the next prompt.

Formalization (pseudo-Bellman):

$$M_{t+1} = M_t \cup \big\{\text{reflect}(\tau_t, r_t)\big\}$$

where $\text{reflect}$ is a text generation of "what went wrong + how to fix" by the LLM itself.

Why it works (theoretically): reflection compresses a sparse reward signal into structured text, bypassing gradient updates; it is equivalent to a kind of non-parametric policy improvement in the in-context domain. However, it lacks a convergence guarantee.
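The memory update $M_{t+1} = M_t \cup \{\text{reflect}(\tau_t, r_t)\}$ can be sketched as a retry loop. The two stub functions below stand in for LLM calls and are purely hypothetical; only the control flow is the point.

```python
def reflexion_loop(attempt, reflect, max_trials=3):
    """Retry a task, growing an episodic memory of verbal reflections.

    attempt(memory) -> (trajectory, reward); the memory is what would be
    prepended to the prompt. reflect(trajectory, reward) -> reflection text.
    """
    memory = []
    for _ in range(max_trials):
        trajectory, reward = attempt(memory)
        if reward > 0:
            return trajectory, memory
        # M_{t+1} = M_t  U  { reflect(tau_t, r_t) }
        memory = memory + [reflect(trajectory, reward)]
    return None, memory

def toy_attempt(memory):
    """Stub actor: succeeds only once a reflection tells it to sort first."""
    if any("sort the list first" in m for m in memory):
        return "sorted, then binary search", 1.0
    return "binary search on unsorted list", 0.0

def toy_reflect(trajectory, reward):
    """Stub self-reflection: 'what went wrong + how to fix'."""
    return f"Failed with '{trajectory}' (reward {reward}); sort the list first."

result, final_memory = reflexion_loop(toy_attempt, toy_reflect)
print(result)
print(final_memory)
```

No weights change between trials; the only state that carries over is the text in `memory`.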

4.3 STaR (Self-Taught Reasoner, Zelikman 2022 NeurIPS, Stanford)

Let the LM generate rationale → if wrong, rationalize using the true answer → SFT on correct (q, rationale, a) tuples. This is the true starting point of self-improvement on reasoning.

Pseudocode:

for iter in 1..N:
    for each (q, a_gt) in D:
        r, a_pred = LM(q)
        if a_pred == a_gt:
            collect (q, r, a_gt)
        else:
            r' = LM(q, hint=a_gt)        # rationalize
            if r' produces a_gt:
                collect (q, r', a_gt)
    SFT(LM, collected)

STaR's key flaw (also [arXiv:2601.05280]'s core argument against self-improvement): rationalization is reverse-engineering the answer, not necessarily reflecting the true reasoning process, leading to distribution drift.

§5 Ctx2Skill: 5-Role Self-Play Loop (Focus 1)

This section nearly mirrors arXiv 2604.27660's §3 word-for-word, since interviewers may quote the paper directly.

5.1 Problem formulation

Given a context $C$ (possibly 100k+ tokens of manual / paper / repo / dataset), a task set $\mathcal{T} = \{t_j\}$, each task has a binary rubric set $\mathcal{R}_j = \{r_{j,k}\}$. Solving indicator:

$$y_j(\pi; C) = \prod_k \mathbb{I}\big[r_{j,k}(a_j) = \text{pass}\big], \quad a_j \sim \pi(\,\cdot\mid C, t_j)$$

Goal: construct a markdown skill set $\mathcal{S}^R$ such that:

$$a_j \sim \pi(\,\cdot\mid \mathcal{S}^R, C, t_j) \quad \text{maximizes}\ \mathbb{E}_j y_j$$

and without updating $\pi$'s parameters — only updating $\mathcal{S}^R$.

5.2 Five frozen LM roles

| Role | Input | Output | Intuition |
|---|---|---|---|
| Challenger | $C$, $\mathcal{S}^C_{i-1}$ | A batch of $(t_m, \mathcal{R}_m)$ | Generate probing tasks |
| Reasoner | $C$, $\mathcal{S}^R_{i-1}$, $t_m$ | $a_m$ | Solve using skills |
| Judge | $a_m$, $\mathcal{R}_m$ | binary $y_m$ | Strictly verify by rubric |
| Proposer (per side) | failed/solved batch + current skill set | Natural-language diagnosis | Find root cause, does not write skill |
| Generator (per side) | proposer diagnosis + current skill set | New skill set | Materialize the change |

Note that two sides evolve independently:

→ These two sides never exchange skill sets — maintaining strict adversarial pressure.

5.3 Cross-Time Replay mechanism (core anti-collapse)

The more iterations, the more extreme the Challenger gets, and the more the Reasoner over-specializes to extreme tasks. Returning the final $\mathcal{S}^R_N$ directly therefore risks shipping an over-specialized skill set.

Replay procedure:

  1. During training, maintain two probe sets:

    • Hard set $\mathcal{Q}^h$: each iteration, pick the failed task with the lowest rubric pass rate
    • Easy set $\mathcal{Q}^e$: each iteration, pick the solved task with the fewest rubrics passed ("just barely solved")
  2. After training, for each candidate $\mathcal{S}^R_i$ ($i=1\ldots N$), run the Reasoner $\pi^R$ on both probe sets:

$$\rho^h(i) = \frac{\sum_{q\in \mathcal{Q}^h} y_q(\pi^R; C, \mathcal{S}^R_i) + 1}{|\mathcal{Q}^h| + 1}, \quad \rho^e(i) = \frac{\sum_{q\in \mathcal{Q}^e} y_q(\pi^R; C, \mathcal{S}^R_i) + 1}{|\mathcal{Q}^e| + 1}$$

(The +1 Laplace smoothing keeps both rates well defined and non-zero even when a probe set is empty.)

  3. Select:

$$\boxed{\;\mathcal{S}^R_\star = \mathcal{S}^R_{i^\star}, \quad i^\star = \arg\max_i \big(\rho^h(i) \cdot \rho^e(i)\big)\;}$$

Why product, not sum: product penalizes catastrophic forgetting (if some version has $\rho^e \to 0$, the overall score → 0), forcing selection of versions that are not bad on either side. Ctx2Skill ablation shows using sum drops final accuracy by ~1.5%.
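A small worked example of the product-versus-sum argument. The two candidates and their pass counts are invented for illustration; only the smoothing formula and the selection rule come from the text above.

```python
def laplace_rate(num_pass: int, probe_size: int) -> float:
    """Smoothed pass rate (num_pass + 1) / (probe_size + 1)."""
    return (num_pass + 1) / (probe_size + 1)

PROBE_SIZE = 9  # assume 9 tasks in each probe set

# Hypothetical pass counts: S_R_5 aces the hard probes but forgot the easy ones.
candidates = {
    "S_R_2": {"hard_pass": 3, "easy_pass": 5},
    "S_R_5": {"hard_pass": 9, "easy_pass": 1},
}

def rates(counts):
    return (laplace_rate(counts["hard_pass"], PROBE_SIZE),
            laplace_rate(counts["easy_pass"], PROBE_SIZE))

by_product = max(candidates, key=lambda n: rates(candidates[n])[0] * rates(candidates[n])[1])
by_sum = max(candidates, key=lambda n: sum(rates(candidates[n])))

print("product selects:", by_product)
print("sum selects:    ", by_sum)
```

Here the sum rewards the over-specialized candidate (1.0 + 0.2), while the product prefers the balanced one (0.4 × 0.6 against 1.0 × 0.2).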

5.4 Ctx2Skill 5-role + Replay code skeleton

def ctx2skill_loop(context: str, llm, num_iters: int = 5, M: int = 5):
    """
    Ctx2Skill: 5 frozen LM roles + Cross-Time Replay.
    Returns the optimal Reasoner skill set selected by cross-time replay.
    All LM calls use the same frozen backbone; only the skill set changes.
    """
    S_R = ""                    # Reasoner skill markdown (initially empty)
    S_C = ""                    # Challenger skill markdown
    candidates = []             # Historical S_R candidates (cross-time)
    Q_hard, Q_easy = [], []     # Two probe sets

    for i in range(1, num_iters + 1):
        # ── (1) Challenger produces a batch ──
        batch = llm(role="challenger", prompt=challenger_prompt(context, S_C), n=M)
        # batch = [(t_m, rubrics_m), ...]

        failed, solved = [], []
        for t_m, rubrics_m in batch:
            # ── (2) Reasoner solves ──
            a_m = llm(role="reasoner", prompt=reasoner_prompt(context, S_R, t_m))
            # ── (3) Judge per-rubric ──
            per_rubric = [llm(role="judge", prompt=judge_prompt(a_m, r))
                          for r in rubrics_m]
            y_m = all(per_rubric)
            pass_rate = sum(per_rubric) / len(per_rubric)
            (failed if not y_m else solved).append(
                (t_m, rubrics_m, a_m, pass_rate)
            )

        # ── Maintain probe sets (consumed later by Cross-Time Replay) ──
        if failed:
            hardest = min(failed, key=lambda x: x[3])
            Q_hard.append((hardest[0], hardest[1]))
        if solved:
            # Goal: keep the "barely solved" task of this batch in the easy probe set.
            # Note: every entry in `solved` satisfies all(per_rubric), so its pass_rate is 1.0
            # and the binary verdicts cannot rank solved tasks against each other.
            # In production, rank by per-rubric soft scores (e.g. an LLM judge returning [0, 1]
            # instead of 0/1) or by another boundary proxy such as the number of reasoner retries.
            easiest_among_solved = solved[-1]  # Teaching simplification: take the last solved task
            Q_easy.append((easiest_among_solved[0], easiest_among_solved[1]))

        # ── (4) Two-sided Proposer diagnoses ──
        diag_R = llm(role="reasoner_proposer",
                     prompt=proposer_prompt(failed, S_R))
        diag_C = llm(role="challenger_proposer",
                     prompt=proposer_prompt(solved, S_C))

        # ── (5) Two-sided Generator writes skills ──
        S_R = llm(role="reasoner_generator",
                  prompt=generator_prompt(diag_R, S_R))
        S_C = llm(role="challenger_generator",
                  prompt=generator_prompt(diag_C, S_C))

        candidates.append(S_R)

    # ── Cross-Time Replay ──
    best_idx, best_score = 0, -1.0
    for i, cand in enumerate(candidates):
        rho_h = laplace_smoothed_rate(Q_hard, cand, llm, context)
        rho_e = laplace_smoothed_rate(Q_easy, cand, llm, context)
        score = rho_h * rho_e
        if score > best_score:
            best_score, best_idx = score, i

    return candidates[best_idx]


def laplace_smoothed_rate(probe, skill_set, llm, context):
    """ Laplace-smoothed pass rate: (sum_q y_q + 1) / (|probe| + 1).
    
    Args:
        probe:     list[(task, rubrics)]
        skill_set: candidate S_R^i markdown skill to evaluate
        llm:       frozen LM
        context:   original context (same source as ctx2skill_loop parameter; must be passed explicitly
                   to prevent closure misuse)
    """
    num_pass = 0
    for t_q, rubrics_q in probe:
        a = llm(role="reasoner",
                prompt=reasoner_prompt(context, skill_set, t_q))
        if all(llm(role="judge", prompt=judge_prompt(a, r))
               for r in rubrics_q):
            num_pass += 1
    return (num_pass + 1) / (len(probe) + 1)

5.5 Ctx2Skill experimental results (must memorize)

On CL-bench, without any parameter updates:

| backbone | w/o skills | Ctx2Skill | Δ |
|---|---|---|---|
| GPT-4.1 | 11.1% | 16.5% | +5.4 |
| GPT-5.1 | 21.2% | 25.8% | +4.6 |
| GPT-5.2 | 18.2% | 21.4% | +3.2 |

→ GPT-4.1 + Ctx2Skill (16.5%) surpasses Gemini 3 Pro without skills (15.8%) — confirming "high-quality skills can compensate for model gap."

5.6 Ctx2Skill ablation (interview bonus)

| Removed component | GPT-4.1 Δ from 16.5 | GPT-5.1 Δ from 25.8 |
|---|---|---|
| Cross-Time Replay | −1.8 (→14.7) | −2.8 (→23.0) |
| decoupling Proposer + Generator | −0.6 | −0.7 |
| Challenger evolving | −2.6 (→13.9) ← largest | −3.3 ← largest |
| Easy probe set | −0.8 | −1.6 |
| Hard probe set | −1.3 | −1.1 |
| Laplace smoothing | −1.0 | −0.6 |

The drop from removing Challenger evolving is the largest — proving that "sustained adversarial pressure" is the true driver of Reasoner progress.

§6 Native Evolution: Reward-Free Meta-Learning (Focus 2)

Fully corresponds to arXiv 2604.18131. Tencent + HKUST(GZ), 2026-04-20.

6.1 Core architecture: two-phase decoupling

  ┌─────────────────────────────────┐       ┌──────────────────────────────┐
  │      Native Evolution Phase     │       │   Knowledge-Enhanced Execution│
  │      (task-free + reward-free   │       │   (uses K as system prompt at │
  │       at inference)             │       │    inference)                │
  │                                 │       │                              │
  │   π_θ(K | E)                    │  ──→  │   π_task(a_t | o_t, K, Task) │
  │   "exploring + summarizing"     │       │                              │
  │                                 │       │                              │
  └──────────────┬──────────────────┘       └──────────────────────────────┘
                 │
                 │ (during training, use outcome-based reward to supervise evolve)
                 ▼
  R_evolve(K) = Success(T_E | K) − Success(T_E | ∅)

Key design choice: evolution and execution use the same LLM (unlike RLHF, which separates the SFT policy from the reward model). The two modes differ only in their system prompts, and the model learns "evolution mode" through SFT + RFT training.

6.2 Outcome-Based Reward Design

$$\boxed{\;R_\text{evolve}(\mathcal{K}) = \underbrace{\text{Success}(\mathcal{T}_E\mid \mathcal{K})}_{\text{downstream success rate with K}} - \underbrace{\text{Success}(\mathcal{T}_E\mid \varnothing)}_{\text{no-K baseline}}\;}$$

where $\text{Success}(\mathcal{T}_E\mid \mathcal{K}) = \frac{1}{M}\sum_{j=1}^M \mathbb{I}\big[f(Q_j, \mathcal{K}) = A_j\big]$.

Why outcome-based rather than step-level?

| Dimension | step-level | outcome-based |
|---|---|---|
| Supervision density | High | Low |
| Signal noise | Medium (hard to evaluate intermediate states) | Low (end-task answer is ground truth) |
| Reward hacking risk | High (agent learns shortcut to grab intermediate scores) | Low (only by truly improving task success) |
| Engineering complexity | High (need PRM) | Low |

Native Evolution chooses outcome-based reward for another special reason: $\mathcal{K}$ is distilled from a very long exploration trajectory (374.8 steps × 3322.4 tokens/step), and step-level reward is nearly meaningless on such a long horizon.

6.3 Two-phase training: SFT → RFT

Stage 1 (SFT):

Stage 2 (RFT, Rejection Sampling Fine-Tuning):

Common misconception

Why Native Evolution uses RFT rather than GRPO/PPO: (1) the trajectory horizon is ~374 steps, so GRPO backprop is infeasible; (2) evaluating the reward requires running an auxiliary agent on downstream tasks, which is too expensive to do online. Offline rejection sampling decouples trajectory generation from the policy update.

6.4 Native Evolution training + inference code skeleton

def native_evolution_pipeline(base_model, teacher_model, env_pool,
                              downstream_tasks_per_env, num_iter=2,
                              C_sft: int = 3, C_rft: int = 8):
    """
    Native Evolution: SFT + RFT (2 iter) → learn reward-free self-evolution.
    
    Args:
        C_sft: number of teacher-generated K candidates in SFT stage (paper: 3)
        C_rft: number of pi-self-generated candidates in RFT stage (paper: 8)
    """
    # ── Stage 1: SFT ──
    sft_data = []
    for E in env_pool:
        T_E = downstream_tasks_per_env[E]            # labeled downstream
        # baseline: without K
        s0 = success_rate(base_model, T_E, K=None)

        # teacher generates C_sft candidate Ks
        candidates = [explore_and_summarize(teacher_model, E)
                      for _ in range(C_sft)]
        # Evaluate reward = Success(T_E | K) − Success(T_E | ∅)
        rewards = [success_rate(base_model, T_E, K=K) - s0
                   for K in candidates]
        K_star = candidates[argmax(rewards)]
        traj_star = extract_trajectory(teacher_model, E, K_star)
        sft_data.append(traj_star)                   # ~374 steps each

    pi_1 = sft(base_model, sft_data)                 # warm-up

    # ── Stage 2: RFT × num_iter ──
    pi = pi_1
    for it in range(num_iter):
        rft_data = []
        for E in env_pool:
            T_E = downstream_tasks_per_env[E]
            s0 = success_rate(pi, T_E, K=None)
            # pi itself generates C_rft candidates
            candidates = [explore_and_summarize(pi, E) for _ in range(C_rft)]
            rewards = [success_rate(pi, T_E, K=K) - s0
                       for K in candidates]
            best = candidates[argmax(rewards)]
            rft_data.append(extract_trajectory(pi, E, best))
        pi = sft(pi, rft_data)                       # next iter

    return pi   # π_θ*: has learned native evolution


def native_evolution_inference(pi_star, new_env, task):
    """
    At inference: no task, no reward → explore → distill K → solve task with K.
    """
    K = explore_and_summarize(pi_star, new_env)      # task-free!
    answer = pi_star(task, system_prompt=K)          # K-augmented
    return answer

6.5 Native Evolution experimental results

WebVoyager + WebWalker, 14B Qwen3 / 36B Seed-OSS:

| backbone | w/o K | Native Evolution (RFT) | Δ |
|---|---|---|---|
| Qwen3-30B (WebWalker) | 22.04 | 40.91 | +18.9 |
| Qwen3-30B (WebVoyager) | 41.08 | 57.44 | +16.4 |
| Seed-OSS-36B (WebWalker) | 19.50 | 36.72 | +17.2 |

Most striking: 14B Qwen3 + transferred K from 36B → 35.6% conference accuracy; unassisted Gemini-2.5-Flash is only 31.3% — proving high-quality K can surpass pure parameter scaling.

6.6 Native Evolution vs Ctx2Skill comparison

| Dimension | Native Evolution | Ctx2Skill |
|---|---|---|
| Updates parameters? | Yes (SFT + RFT × 2 iter) | No (frozen LM, only updates skills) |
| Needs task at inference? | No (evolve then execute) | Yes (task-driven) |
| Knowledge container | $\mathcal{K}$ (markdown environment map) | $\mathcal{S}^R$ (markdown skills) |
| Reward design | outcome-based downstream utility | binary rubric judge |
| Anti-collapse mechanism | rejection sampling (filter) | Cross-Time Replay |
| Training cost | High | Low |
| Inference cost | Lower | Medium |
| Suitable tasks | New environment exploration | Dense context task |

They are complementary: Native Evolution lets the backbone learn how to explore; Ctx2Skill lets a frozen backbone distill context into reusable skills. They can be stacked.

§7 A²RD and Long-Horizon Memory Architecture (Focus 3)

arXiv 2605.06924, Google Cloud AI + NUS, 2026-05-07. While the paper is about video, the memory schema transfers directly to all long-horizon agents.

7.1 Retrieve → Synthesize → Refine → Update closed loop

   ┌──────────────────────────────────────────────────────────────┐
   │   for segment i = 1..N:                                       │
   │     1. Retrieve relevant context from MVMem (T_j, F_j, V_j)   │
   │     2. Decide mode: extrapolation vs interpolation            │
   │     3. Synthesize boundary frames F_i^begin, F_i^end          │
   │     4. HITS (frame-level): verify + revise frames             │
   │     5. Synthesize video segment V_i = TI2V(P_i, F_i, F^rel)   │
   │     6. HITS (video-level): verify + revise via MAPO           │
   │     7. Update MVMem with (F_i, V_i, T_i, T_{i+1}^F)           │
   └──────────────────────────────────────────────────────────────┘

7.2 MVMem schema (textual states + frames + videos)

$$\mathcal{M} := \{\mathcal{M}_1, \ldots, \mathcal{M}_N\} \cup \mathcal{R} \cup \mathcal{D}$$

Each segment $\mathcal{M}_j = \{T_j, \mathcal{F}_j, V_j\}$:

Plus:

7.3 Dependency DAG (key trick)

References have dependencies: entities depend on environment, camera depends on entity positions. A²RD builds a DAG:

$$\mathcal{G} := \text{MLLM}_\text{dep}(\mathcal{P}_\mathcal{R})$$

Then topological sort decides synthesis order. Transfers directly to ARIS-style agents: in research projects claim ← experiment ← code ← idea, typed memory is also a DAG.
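A minimal sketch of the ordering step, using the standard-library `graphlib` (Python 3.9+). It follows the same convention as the `dep_graph` in the §7.5 skeleton (each node maps to the list of nodes it depends on) and could serve as one possible definition of the `topological_sort` helper that skeleton calls. The node names mirror the research-project example above and are illustrative.

```python
from graphlib import TopologicalSorter

def topological_sort(parents: dict) -> list:
    """Return nodes so that every node appears after all of its parents.

    `parents` maps node -> iterable of nodes it depends on.
    Raises graphlib.CycleError if the dependencies are not a DAG.
    """
    return list(TopologicalSorter(parents).static_order())

# claim <- experiment <- code <- idea
research_deps = {
    "claim": ["experiment"],
    "experiment": ["code"],
    "code": ["idea"],
    "idea": [],
}

order = topological_sort(research_deps)
print(order)
```

Because the example is a single chain, the order is unique; with branching dependencies any order that respects the parent constraint is valid.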

7.4 HITS: Hierarchical Test-Time Self-Improvement

Two levels:

Self-check at two scales: inner-segment + inter-segment — stronger anti-drift than single-layer self-improvement.

7.5 Transfer to general long-horizon agents (typical cheat-sheet)

class TypedMemory:
    """A²RD MVMem idea ⇒ general long-horizon agent memory."""
    def __init__(self):
        self.segments = []          # list of {state, artifacts, deps}
        self.global_refs = {}       # Global entities (e.g. paper-level claim)
        self.dep_graph = {}         # DAG: which artifact depends on which
        self.failure_db = []        # Failure trace database

    def retrieve(self, current_segment_ctx, k=3):
        """Retrieve narratively-relevant context (top k previous segments)."""
        cands = []
        for j, M_j in enumerate(self.segments):
            score = relevance(M_j["state"], current_segment_ctx)
            cands.append((score, j))
        topk = sorted(cands, reverse=True)[:k]
        return [self.segments[j] for _, j in topk]

    def update(self, segment, deps):
        self.segments.append(segment)
        seg_id = len(self.segments) - 1
        self.dep_graph[seg_id] = deps      # parent ids

    def topo_synthesis_order(self, num_segments: int) -> list[int]:
        """A²RD's dependency DAG → decide generation order.
        
        Args:
            num_segments: total number of segments to generate; automatically adds nodes
                          not in dep_graph as roots.
        Returns:
            A valid topological order (list of segment indices).
        """
        # Automatically adds all 0..num_segments-1 to the graph (those without dependencies treated as roots)
        graph = {i: self.dep_graph.get(i, []) for i in range(num_segments)}
        return topological_sort(graph)


def long_horizon_agent_with_hits(memory, segments_to_generate, llm, verifier):
    """A²RD-style R→S→R→U closed loop.
    
    Note: segments_to_generate is a list of context descriptions for the segments to generate;
    generation order is determined by memory.dep_graph (defaults to sequential if empty).
    """
    order = memory.topo_synthesis_order(num_segments=len(segments_to_generate))
    for i in order:
        # Retrieve
        ctx = memory.retrieve(segments_to_generate[i])
        # Synthesize
        artifact = llm.generate(segments_to_generate[i], context=ctx)
        # Frame-level HITS (internal consistency of artifact)
        for _ in range(MAX_REFINES):
            if verifier.frame_check(artifact): break
            artifact = llm.refine(artifact, verifier.feedback)
        # Video-level HITS (consistency of artifact with history)
        for _ in range(MAX_REFINES):
            if verifier.video_check(artifact, ctx): break
            artifact = llm.refine(artifact, verifier.feedback)
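        # Update: record the ids of the retrieved segments as parents
        # (update() expects parent ids, not the segment objects themselves)
        parent_ids = [j for j, seg in enumerate(memory.segments)
                      if any(seg is c for c in ctx)]
        memory.update(artifact, deps=parent_ids)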

§8 Theoretical Upper Bound of Self-Improvement (L3 level)

Two 2025-2026 must-read theoretical papers — this is the L3 part that top labs may ask in interviews.

8.1 On the Limits of Self-Improving in LLMs (arXiv 2601.05280)

Full title: "On the Limits of Self-Improving in LLMs: The Singularity Is Not Near Without Symbolic Model Synthesis"

Setup: model self-training as a dynamical system on probability distributions:

$$p_{t+1} = \mathcal{T}_\text{closed}(p_t) = \mathbb{E}_{x \sim p_t}\big[\delta_{x'}\big],\quad x' = \pi_t(x)$$

i.e. $p_{t+1}$ is the distribution obtained by retraining the model on its own samples.

Main theorem (narrative version): under closed-loop density matching (no exogenous grounding signal), if $\pi_t$ has no access to ground truth, then $\{p_t\}$ generally does not converge to the target $p^\star$ and instead degenerates into mode collapse / drift.

Core mechanism:

$$D_\text{KL}(p^\star \,\|\, p_{t+1}) \;\ge\; D_\text{KL}(p^\star \,\|\, p_t) - \Delta_\text{grounding}$$

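where $\Delta_\text{grounding}$ is the KL reduction brought by the grounding signal. With no grounding ($\Delta_\text{grounding} = 0$), the bound reads $D_\text{KL}(p^\star \,\|\, p_{t+1}) \ge D_\text{KL}(p^\star \,\|\, p_t)$: the KL cannot decrease, and it typically rises.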
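The degeneration can be seen in a toy simulation: refit a categorical distribution to a finite sample drawn from itself, over and over. This is only an illustration of the phenomenon under assumed settings (10 modes, 20 samples per generation, a fixed seed), not the paper's construction or proof.

```python
import random

def closed_loop_generations(num_modes=10, samples_per_gen=20,
                            generations=200, seed=0):
    """Repeatedly refit p_{t+1} to samples from p_t and track the support size."""
    rng = random.Random(seed)
    probs = [1.0 / num_modes] * num_modes          # p_0: uniform target
    support_sizes = [num_modes]
    for _ in range(generations):
        draws = rng.choices(range(num_modes), weights=probs, k=samples_per_gen)
        counts = [0] * num_modes
        for d in draws:
            counts[d] += 1
        # Maximum-likelihood refit on the model's own samples (no grounding)
        probs = [c / samples_per_gen for c in counts]
        support_sizes.append(sum(1 for p in probs if p > 0))
    return support_sizes

sizes = closed_loop_generations()
print(sizes[0], "modes at start,", sizes[-1], "modes at the end")
```

Once a mode receives zero samples it can never come back, so the support only shrinks; an exogenous signal that re-injects mass on lost modes is what breaks this one-way street.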

Positive implication: self-improvement needs exogenous grounding — code executor / math checker / human label / rubric judge — this is why Absolute Zero must hook up a code executor, STaR must use ground-truth answers for rationalization, Ctx2Skill must use a Judge to verify rubrics.

Misreading warning (interview bonus)

This paper does not prove that "reward-free training must collapse"; it proves that closed-loop density matching degenerates in the absence of an exogenous grounding signal. Native Evolution is consistent with this result: it uses outcome-based reward as grounding at training time.

8.2 Solver-Verifier Gap (arXiv 2507.00075)

Setup: model capability evolution as coupled dynamics of two variables $\theta^\text{sol}, \theta^\text{ver}$:

$$\begin{cases} \dot\theta^\text{sol} = \eta_s\, g_s(\theta^\text{sol}, \theta^\text{ver}) \\ \dot\theta^\text{ver} = \eta_v\, g_v(\theta^\text{sol}, \theta^\text{ver}) \end{cases}$$

Empirical observation: capability $C(\theta)$ under self-improvement follows (a fitted) exponential law:

$$C(\theta_t) \approx C_\infty - (C_\infty - C_0)\, e^{-\kappa t}$$

and $\kappa$ is positively correlated with the solver-verifier gap $\Delta := C^\text{ver} - C^\text{sol}$ (larger gap → faster improvement), but too large a gap also saturates (verifier gives feedback that the solver cannot learn).
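The fitted law is the solution of $\dot C = \kappa\,(C_\infty - C)$, so the remaining gap to the ceiling halves every $\ln 2 / \kappa$ steps. The sketch below evaluates it; the values of $C_0$, $C_\infty$ and $\kappa$ are arbitrary placeholders, not fitted numbers from the paper.

```python
import math

def capability(t: float, c0: float, c_inf: float, kappa: float) -> float:
    """C(t) = C_inf - (C_inf - C_0) * exp(-kappa * t)."""
    return c_inf - (c_inf - c0) * math.exp(-kappa * t)

C0, C_INF, KAPPA = 0.2, 0.8, 0.5          # illustrative values only
half_life = math.log(2) / KAPPA           # time for the gap to C_inf to halve

curve = [capability(t, C0, C_INF, KAPPA) for t in range(0, 11)]
gap_at_half_life = C_INF - capability(half_life, C0, C_INF, KAPPA)

print([round(c, 3) for c in curve])
print("half-life:", round(half_life, 3), "gap there:", round(gap_at_half_life, 3))
```

A larger $\kappa$ (which the text ties to a larger solver-verifier gap) shortens the half-life but does not move the ceiling $C_\infty$.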

Engineering guidance:

This is the best theoretical motivation supporting the "executor != reviewer family" protocol, but remember that it is modeling plus an empirical fit, not a ready-made theorem.

8.3 Practical implications of the two papers

| Paper | Claim | Engineering takeaway |
|---|---|---|
| 2601.05280 | Closed-loop self-training degenerates without grounding | Must have exogenous verifier (executor / judge / rubric) |
| 2507.00075 | Solver-verifier gap positively correlates with improvement rate (modeling + empirics) | Use cross-model reviewer to increase gap |

→ Combining the two: reward-free at inference + grounded at training is the fundamental reason work like Native Evolution can work; ARIS-style cross-model audit is the system-level engineering choice to accelerate self-improvement.

§9 Memory-Driven Self-Evolution

9.1 Generative Agents (Park 2023 UIST, Stanford)

The most classic long-horizon simulation: observation stream → memory store → reflection (LLM writes insights itself) → planning.

Three layers of memory:

Retrieval score:

$$\text{score}(m) = \alpha_\text{recency}\, r(m) + \alpha_\text{importance}\, i(m) + \alpha_\text{relevance}\, s(m, q)$$

where $r(m) = \gamma^{\Delta t}$ (exponential decay), $i(m) \in [1,10]$ (LLM self-rated), $s(m,q)$ cosine similarity.
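The retrieval score above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the `Memory` dataclass, the `retrieval_score` name, the equal default weights, and dividing importance by 10 to bring it onto a scale comparable with the other two terms are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    importance: float          # i(m), LLM self-rated on a 1-10 scale
    hours_since_access: float  # Delta t, in hours
    embedding: list


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def retrieval_score(m, query_emb, gamma=0.995,
                    a_recency=1.0, a_importance=1.0, a_relevance=1.0):
    recency = gamma ** m.hours_since_access     # r(m) = gamma^{Delta t}
    importance = m.importance / 10.0            # assumed rescaling to (0, 1]
    relevance = cosine(m.embedding, query_emb)  # s(m, q)
    return (a_recency * recency
            + a_importance * importance
            + a_relevance * relevance)


# Hypothetical memories with 2-d toy embeddings
memories = [
    Memory("ate breakfast", importance=2, hours_since_access=1,
           embedding=[1.0, 0.0]),
    Memory("met the new neighbor", importance=8, hours_since_access=48,
           embedding=[0.0, 1.0]),
]
query = [0.0, 1.0]
ranked = sorted(memories, key=lambda m: retrieval_score(m, query), reverse=True)
```

In this toy setup the older but more important and more relevant memory ranks first: recency alone does not dominate once the three terms sit on comparable scales.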

9.2 MemGPT (Packer 2023, Berkeley)

OS-style hierarchical memory:

Core trick: let the LLM observe its own token usage within its context, actively deciding to swap pages — this brings the OS abstraction into the LLM agent.
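A toy sketch of that idea follows. It is not MemGPT's implementation: the `PagedContext` class, the whitespace token counter, and the evict-oldest loop are stand-ins chosen for illustration. In a MemGPT-style system the LLM itself would read the status line and decide when to call the paging functions.

```python
class PagedContext:
    """Toy main-context / archive split with an explicit token budget."""

    def __init__(self, budget_tokens: int):
        self.budget = budget_tokens
        self.main = []     # messages currently in the prompt ("RAM")
        self.archive = []  # evicted messages ("disk")

    @staticmethod
    def count_tokens(text: str) -> int:
        # Crude whitespace count; a real system would use the model tokenizer
        return len(text.split())

    def used(self) -> int:
        return sum(self.count_tokens(m) for m in self.main)

    def status_line(self) -> str:
        # Rendering this into the prompt lets the LLM see its own token usage
        return (f"[context: {self.used()}/{self.budget} tokens, "
                f"{len(self.archive)} archived]")

    def append(self, message: str) -> None:
        self.main.append(message)

    def page_out(self, n: int = 1) -> None:
        # Move the n oldest messages from main context to the archive
        for _ in range(min(n, len(self.main))):
            self.archive.append(self.main.pop(0))

    def page_in(self, keyword: str) -> list:
        # Bring archived messages containing the keyword back into context
        hits = [m for m in self.archive if keyword.lower() in m.lower()]
        for m in hits:
            self.archive.remove(m)
            self.main.append(m)
        return hits


ctx = PagedContext(budget_tokens=8)
ctx.append("user likes green tea")
ctx.append("meeting moved to Friday")
ctx.append("deadline is next Monday")
while ctx.used() > ctx.budget:  # stand-in policy: evict oldest first
    ctx.page_out()
```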

9.3 Relation of this layer to Ctx2Skill / Native Evolution

| Dimension | Generative Agents | MemGPT | Ctx2Skill | Native Evolution |
|---|---|---|---|---|
| Evolution target | reflection / plan | paging policy | skill | K (world map) |
| Updates parameters? | No | No | No | Yes |
| Trigger frequency | per observation | per context overflow | per failure batch | per training epoch |
| Task-driven? | Yes | Yes | Yes | No (evolution phase) |

→ The 2024-2026 evolutionary direction of memory-driven work: from episodic (GA) → hierarchical (MemGPT) → typed + DAG (MVMem).

§10 Skill / K Retrieval and Ranking (engineering practice)

In actual deployment, with skill libraries of dozens to hundreds of skills, you must load them on demand; otherwise the prompt's token count explodes.

10.1 Hybrid retrieval pipeline

def hybrid_skill_retrieval(task: str, skills: list, k: int = 3) -> list:
    """
    Stage A: Coarse filter (BM25 + dense embedding, fast)
    Stage B: Fine ranking (LLM scoring on the trigger section, accurate)
    Stage C: Exact match on trigger keywords (deterministic)

    bm25_search, dense_search, merge, top_k and llm are assumed helpers
    that are not defined in this tutorial.
    """
    # ── Stage A: BM25 + dense embedding hybrid ──
    bm25_scores = bm25_search(task, [s.description for s in skills], topn=20)
    dense_scores = dense_search(task, [s.embedding for s in skills], topn=20)
    candidates = top_k(merge(bm25_scores, dense_scores), n=10)

    # ── Stage B: LLM rerank ──
    verdict_score = {"yes": 1.0, "partial": 0.5, "no": 0.0}
    best = {}  # skill name -> (score, skill); assumes skill names are unique
    for skill in candidates:
        prompt = (f"task={task}\nskill trigger={skill.trigger}\n"
                  f"Q: relevant? (yes / no / partial)")
        verdict = llm(prompt).strip().lower()
        # Unparseable verdicts count as "no" instead of raising KeyError
        best[skill.name] = (verdict_score.get(verdict, 0.0), skill)

    # ── Stage C: Strong keyword match ──
    task_lower = task.lower()
    for s in skills:
        if any(kw.lower() in task_lower for kw in s.exact_triggers):
            best[s.name] = (2.0, s)  # exact hits outrank any LLM verdict

    # Already deduplicated by name → sort by score only, take top k
    final = sorted(best.values(), key=lambda pair: pair[0], reverse=True)[:k]
    return [s for _, s in final]

10.2 Skill ranking formula

Weighted fusion of 3 signals:

$$\text{score}(s, q) = \alpha_\text{sim}\, \cos(\mathbf{e}_s, \mathbf{e}_q) + \alpha_\text{prior}\, \log(1 + n_\text{used}(s)) + \alpha_\text{recent}\, \gamma^{\Delta t}$$

where $n_\text{used}$ is historical call count (more frequent → more reliable), $\gamma^{\Delta t}$ is recency decay.
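A worked instance of this fusion, as a sketch: the weights 0.5 / 0.2 / 0.2 and $\gamma = 0.999$ per hour are borrowed from the A.1 skeleton below, and the function name `skill_score` is an assumption.

```python
import math


def skill_score(cos_sim, n_used, hours_since_update,
                a_sim=0.5, a_prior=0.2, a_recent=0.2, gamma=0.999):
    prior = math.log(1 + n_used)           # usage prior, grows without bound
    recency = gamma ** hours_since_update  # recency decay gamma^{Delta t}
    return a_sim * cos_sim + a_prior * prior + a_recent * recency


fresh_unused = skill_score(cos_sim=0.9, n_used=0, hours_since_update=0)
old_popular = skill_score(cos_sim=0.6, n_used=20, hours_since_update=1000)
```

With these numbers the old, frequently used skill outscores the fresh, more similar one. Because $\log(1 + n_\text{used})$ is unbounded while the cosine term is at most 1, it may be worth capping or normalizing the prior if usage counts grow large.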

10.3 Skill update (prevent staleness)

Each skill maintains:

Trigger conditions for update:

§11 Inference-Time Orchestration (like ARIS) vs Training-Time Meta-Learning (like Native Evolution)

L3 must-ask at top labs: master the fundamental mathematical difference between inference-time orchestration and training-time meta-learning.

11.1 Mathematical formulation comparison

| Dimension | Inference-Time Orchestration | Training-Time Meta-Learning |
|---|---|---|
| Optimization target | system prompt $\mathcal{K}, \mathcal{S}$ | model params $\theta$ |
| Form | $\pi_{\theta}(\,\cdot\mid \mathcal{S}\oplus \text{ctx})$ | $\theta_{t+1} = \theta_t - \eta\,\nabla \mathcal{L}$ |
| Feedback source | External verifier (cross-model) | outcome reward + RFT |
| Persistence form | markdown files on disk | model weights |
| Update at test time? | Yes (each task can update files) | No (parameters frozen) |
| Convergence dynamics | Textual language diff, non-gradient | gradient flow |
| Theoretical tools | bandit / online learning / sequential decision | RL theory, meta-learning theory |

11.2 Respective limitations

Inference-time orchestration:

Training-time meta-learning:

11.3 Hybrid form in real systems

Mainstream production agents are often both layers:

This is also the actual position of ARIS-type systems — the top layer is inference-time orchestration, with the bottom layer relying on already-trained-to-follow-skill backbones like GPT-4.5 / Claude Opus.

§12 Failure Modes and Defenses (memorize)

12.1 Adversarial Collapse

Symptom: the Challenger becomes increasingly extreme, the Reasoner's learned skills only work on extreme cases, degrading on normal cases.

Ctx2Skill solution: Cross-Time Replay picks $\arg\max \rho^h \cdot \rho^e$; the product form forces retention of easy task performance.

General solutions: early stopping, replay buffer, explicit KL penalty.

12.2 Memory Drift

Symptom: long-horizon agent accumulates contradictory / outdated information in K or memory, getting worse with use.

A²RD solution:

General solutions:

12.3 Reward Hacking

Symptom: in self-rewarding training (Yuan et al. 2024 Self-Rewarding LM), the model learns to game its own reward function.

Defenses:

12.4 Bias Amplification (Echo Chamber)

Symptom: STaR-style rationalization trains the model on its own generated rationales, amplifying mode collapse.

[arXiv:2601.05280]'s KL bound directly corresponds to this case: without exogenous grounding, the KL cannot decrease and may rise.

Defenses:

12.5 Sandbox Contamination

Symptom: the agent generates its own test cases → trains on these cases → evaluation looks high but it's actually train-test overlap.

Defenses:

12.6 Capability Ceiling

Symptom: after N rounds of self-improvement the curve saturates, no amount of additional compute helps.

Solver-Verifier Gap [arXiv:2507.00075] explanation: when gap $\Delta \to 0$, $\kappa \to 0$, improvement rate → 0.

Breakthrough methods:

12.7 Hallucination Compounding (independent reviewers can co-hallucinate)

Symptom: cross-model reviewers all agree on a wrong conclusion (e.g. Claude writes + GPT reviews, both miss the same bug).

Defenses:

§13 25 Frequently-Asked Interview Questions (L1 + L2 + L3)

L1 Must-Know (Q1-Q10)

Q1. What is a self-evolving agent? How does it differ from a regular LLM agent?

Regular agent: fixed policy / prompt / skill, all capability comes from pretrain + one-time prompt engineering.

Self-evolving agent: continuously updates the capability of some layer (parameters / skill markdown / memory / workflow) during use.

Key point: does not depend on manual annotation every time — may depend on exogenous verifier, but not on step-level human labels. Representative works: Voyager, Reflexion, Ctx2Skill, Native Evolution.

Q2. What are the three self-evolution paradigms? Give an example of each.

Per the Native Evolution paper §2 classification:

  • Experience-Driven: human-made tasks + reward, e.g. AgentTuning, ToolLLM.
  • Adversarial Self-Play: challenger-solver, e.g. Absolute Zero (arXiv 2505.03335), Ctx2Skill (arXiv 2604.27660).
  • Meta-Learning / Reward-Free: outcome reward at training, no task and no reward at inference, e.g. Native Evolution (arXiv 2604.18131).

Q3. What is Voyager's trio? Why doesn't it update GPT-4 weights?

Voyager (Wang et al. NeurIPS 2023, NVIDIA + Caltech):

  • Automatic Curriculum: automatically generates next task based on inventory
  • Skill Library: each skill is a JS function, retrieved by description embedding
  • Iterative Prompting + Self-Verification: critic agent verifies, on failure revise

Reason for not updating GPT-4 weights: at the time (2023) GPT-4 API did not allow fine-tuning; and Voyager wanted to prove in-context skill accumulation alone can evolve. Drawback: high token cost + cannot internalize sub-token patterns.

Q4. Reflexion's relation to RL? Why is it called "verbal RL"?

Reflexion (Shinn 2023 NeurIPS, Northeastern) compresses sparse reward signal into natural-language reflections stored in memory.

Analogy to standard RL:

  • $r_t$ → "reflection text" (structured failure summary)
  • $V(s_t)$ → reflection retrieved in the prompt
  • policy improvement → use reflection to change subsequent actions

But does not update weights — so called verbal RL (using text rather than gradient for credit assignment). Not real RL, no convergence guarantee.

Q5. How does STaR self-train? What's the key flaw?

STaR (Zelikman 2022 NeurIPS):

  1. LM generates (rationale, answer)
  2. If correct → collect (q, rationale, a)
  3. If wrong → give ground truth, let LM reverse-rationalize
  4. SFT on collected

Key flaw: rationalization is reverse-engineering the answer, the rationale may not be the true reasoning process; distribution drift.

[arXiv:2601.05280] gives a formalized critique: closed-loop training without exogenous grounding cannot reduce the KL divergence to the target distribution, so the model degenerates.
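The four steps above can be sketched as one data-collection pass. This is a minimal sketch under stated assumptions: `star_collect`, `generate`, and `rationalize` are hypothetical names, both callables are stubs standing in for LM calls, and the SFT step itself is left out.

```python
from typing import Callable, Iterable


def star_collect(dataset: Iterable[tuple],
                 generate: Callable[[str], tuple],
                 rationalize: Callable[[str, str], str]) -> list:
    """One STaR pass; returns (question, rationale, answer) triples for SFT."""
    collected = []
    for question, gold in dataset:
        rationale, answer = generate(question)       # step 1: sample
        if answer == gold:                           # step 2: keep if correct
            collected.append((question, rationale, gold))
        else:                                        # step 3: rationalize
            hinted = rationalize(question, gold)
            collected.append((question, hinted, gold))
    return collected                                 # step 4: SFT on this set


# Stub LM calls, for illustration only
def fake_generate(q):
    return ("2+2 is 4", "4") if q == "2+2" else ("guess", "0")


def fake_rationalize(q, gold):
    return f"given the answer {gold}, work backwards"


data = [("2+2", "4"), ("3*3", "9")]
triples = star_collect(data, fake_generate, fake_rationalize)
```

The second triple shows where the flaw enters: its rationale was produced with the answer already in hand, so nothing guarantees it reflects a reasoning path the model could have found unaided.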

Q6. Relation between Anthropic Skills and Native Evolution K?

Both are markdown files injected as system prompt. Native Evolution paper §3 footnote 1 explicitly cites github.com/anthropics/skills/tree/main/skills as a reference implementation for K.

Differences:

  • Anthropic Skills: human-written, static, task-level
  • Native Evolution K: auto-distilled by agent, dynamic, environment-level

→ Convergence conclusion: system prompt is the new model weights.

Q7. Why doesn't self-evolving necessarily mean updating model parameters?

The vast majority of 2024-2026 work does not update parameters:

  • Voyager: frozen GPT-4
  • Reflexion: frozen base LM
  • Generative Agents: frozen LM
  • Ctx2Skill: frozen LM
  • A²RD: training-free

Reasons: (1) no GPU needed, (2) interpretable / auditable, (3) skills are portable (transferable to other backbones), (4) takes effect immediately.

Representatives of parameter updates (Native Evolution, AgentTuning) are typically used to make the backbone learn "how to use skills / K," while the skills / K themselves remain files.

Q8. Reflexion's memory vs RAG?

  • RAG: retrieves external knowledge documents (e.g. wiki)
  • Reflexion memory: retrieves reflections on the agent's own historical trajectories

The latter forces the agent to reflect on its own failure/success patterns, not just retrieve facts written by others.

In engineering Reflexion also does retrieval, just the doc library is self-generated.

Q9. Why does self-play training need "learnability reward"? Write the formula.

Absolute Zero (arXiv 2505.03335) proposes a learnability reward; in the stylized form used here:

$$R^\text{learn}(t) = \pi^\text{sol}_t(\text{correct}\mid t)\cdot \big(1 - \pi^\text{sol}_t(\text{correct}\mid t)\big)$$

Maximizing it gives $\pi^\text{sol} = 0.5$ — task is neither too easy (reward → 0) nor too hard (reward → 0).

Why needed: without constraint, the Challenger will explode to extreme tasks (Solver always wrong) → signal becomes useless; this is the core trick of curriculum learning.
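A quick numeric check of the formula as written above. The grid of pass rates and the function name `learnability_reward` are assumptions for illustration.

```python
def learnability_reward(p_correct: float) -> float:
    """R_learn = p * (1 - p), where p is the Solver's pass rate on the task."""
    return p_correct * (1.0 - p_correct)


# Evaluate on a coarse grid of Solver pass rates 0.0, 0.1, ..., 1.0
grid = [i / 10 for i in range(11)]
best_p = max(grid, key=learnability_reward)
```

On this grid the reward peaks at a pass rate of 0.5 and vanishes at both 0 and 1, which is the behavior the text describes: trivial and impossible tasks both earn the Challenger nothing.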

Q10. What is adversarial collapse? How to prevent?

Symptom: after multi-round self-play, the Challenger becomes increasingly extreme, the Solver over-specializes to extreme cases and forgets the base task.

Ctx2Skill solution: Cross-Time Replay — maintain hard + easy probe set, pick $\arg\max_i \rho^h(i)\cdot \rho^e(i)$. The product form forces easy task performance to not collapse.

General solutions: early stopping, replay buffer, explicit KL penalty.

L2 Advanced (Q11-Q20)

Q11. Derive why Cross-Time Replay uses product ρ^h · ρ^e rather than ρ^h + ρ^e.

Let candidate A satisfy $(\rho^h, \rho^e) = (0.8, 0.1)$, candidate B satisfy $(0.45, 0.45)$.

  • Addition: A=0.9, B=0.9 → indistinguishable
  • Multiplication: A=0.08, B=0.2025 → pick B

Why multiplication is more correct: A is nearly entirely wrong on easy (catastrophic forgetting), but addition smooths its hard performance into a tied total. Multiplication imposes a catastrophic penalty when any side → 0 — this is the key to anti-over-specialization.

Ctx2Skill ablation shows using additive scoring ($\rho^h + \rho^e$) drops final accuracy by about 1-1.5 pts.
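The A-versus-B comparison above, as a small sketch. `pick_checkpoint` is a hypothetical helper, not the paper's code.

```python
def pick_checkpoint(candidates, combine):
    """Index of the candidate maximizing combine(rho_h, rho_e)."""
    return max(range(len(candidates)), key=lambda i: combine(*candidates[i]))


candidates = [(0.8, 0.1), (0.45, 0.45)]  # (rho_h, rho_e) for A and B

# Additive score: A and B tie (up to floating-point rounding), so the
# argmax cannot prefer B and falls back to A, the forgetful candidate.
by_sum = pick_checkpoint(candidates, lambda h, e: h + e)

# Multiplicative score: A's near-zero easy-set rate drags its product down.
by_product = pick_checkpoint(candidates, lambda h, e: h * e)
```

Here `by_sum` selects A (index 0) while `by_product` selects B (index 1).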

Q12. Why can't Native Evolution's outcome-based reward use step-level?

$$R_\text{evolve}(\mathcal{K}) = \text{Success}(\mathcal{T}_E \mid \mathcal{K}) - \text{Success}(\mathcal{T}_E \mid \varnothing)$$

Step-level reward is infeasible because:

  1. $\mathcal{K}$ generation trajectory ~374.8 steps × 3322.4 tokens/step, step-level signal is extremely sparse
  2. There is no ground-truth intermediate state — each step's correctness is hard to judge
  3. Step-level reward encourages shortcuts (generate K that "looks diligent" but is useless downstream)

Outcome-based uses downstream task pass rate as reward — direct, anti-hacking, tied to the true value of K.
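As a sketch, the outcome reward is just a difference of two pass rates over the same downstream task set. The function names and the outcome lists below are made up for illustration.

```python
def success_rate(outcomes):
    """Fraction of downstream tasks solved (1 = pass, 0 = fail)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def evolve_reward(with_k, without_k):
    """R_evolve = Success(T_E | K) - Success(T_E | no K)."""
    return success_rate(with_k) - success_rate(without_k)


# Hypothetical pass/fail results on the same five downstream tasks
with_k = [1, 1, 0, 1, 1]     # 4/5 solved when K is in the system prompt
without_k = [1, 0, 0, 1, 0]  # 2/5 solved with no K
r = evolve_reward(with_k, without_k)
```

Note that the reward is negative whenever K hurts downstream performance, so a K that merely "looks diligent" earns nothing.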

Q13. Why does Native Evolution use RFT rather than GRPO?

(1) Trajectory horizon ~374 steps: GRPO/PPO-style policy-gradient updates are hard to stabilize on such a long horizon. (2) Reward evaluation requires running an auxiliary agent on downstream tasks, so online evaluation is too expensive. (3) RFT (Rejection Sampling Fine-Tuning) decouples trajectory generation from policy update: first generate $C$ trajectories with $\pi_t$ → rank by reward → SFT on the best → next iter.

→ Offline, parallelizable, controllable. Cost: data efficiency lower than GRPO, needs more samples.
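One RFT iteration might look like the sketch below. The names `rft_iteration`, `sample_trajectory`, and `reward_fn` are hypothetical, and dropping candidates with non-positive reward is an assumption of this example rather than something the source specifies.

```python
from typing import Callable


def rft_iteration(sample_trajectory: Callable[[], str],
                  reward_fn: Callable[[str], float],
                  num_candidates: int,
                  keep_top: int) -> list:
    """Sample C trajectories, score them offline, return the best as SFT data."""
    trajectories = [sample_trajectory() for _ in range(num_candidates)]
    # Scoring is independent per trajectory, so it can run in parallel
    scored = [(reward_fn(t), t) for t in trajectories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Assumed filter: keep only top candidates with strictly positive reward
    return [t for r, t in scored[:keep_top] if r > 0]


# Stub sampler and reward table, for illustration only
pool = iter(["K_a", "K_b", "K_c", "K_d"])
rewards = {"K_a": 0.10, "K_b": -0.05, "K_c": 0.30, "K_d": 0.00}
sft_data = rft_iteration(lambda: next(pool), rewards.get,
                         num_candidates=4, keep_top=2)
```

The policy update (SFT on `sft_data`) happens after this function returns, which is what makes generation and training decoupled.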

Q14. Derive the KL argument for self-improvement degeneration under no grounding.

Let $p^\star$ be the target distribution, $p_t$ be the model's distribution at iteration $t$.

Closed-loop self-training: continue training on $x_t$ sampled from $p_t$ itself (no ground truth label).

$$p_{t+1}(x) = \mathbb{E}_{x' \sim p_t}\big[\pi_\text{train}(x \mid x')\big]$$

If $\pi_\text{train}$ is maximum-likelihood-type training without external label correction:

$$D_\text{KL}(p^\star \| p_{t+1}) \;\ge\; D_\text{KL}(p^\star \| p_t)$$

Intuition: $p_t$ is already biased, $p_{t+1}$ trained on its samples can only retain or amplify the bias.

With grounding (exogenous label $y$ for $x$), training objective becomes conditional $p(x | y)$ correction:

$$D_\text{KL}(p^\star \| p_{t+1}) \;\le\; D_\text{KL}(p^\star \| p_t) - \Delta_\text{grounding}$$

where $\Delta_\text{grounding} > 0$ quantifies the KL correction from exogenous signal.

Reference [arXiv:2601.05280] §3.

Note: this is a simplified narrative (the formal version requires technical assumptions on the relation between $\pi_\text{train}$ and $p_t$; see the original). In interviews you can cite [arXiv:2601.05280], but do not claim you derived it independently.
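A toy numeric illustration of the two regimes. Loudly stated assumptions: this is not the paper's construction. The closed-loop step is modeled as sharpening, $p_{t+1} \propto p_t^\beta$ with $\beta > 1$ (the model reinforces its own modes), and the grounded step as mixing a fraction $\lambda$ toward $p^\star$. The distributions and constants are made up.

```python
import math


def kl(p, q):
    """D_KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sharpen(p, beta=1.5):
    """Toy closed-loop step: p_{t+1} proportional to p_t^beta (no labels)."""
    w = [pi ** beta for pi in p]
    z = sum(w)
    return [wi / z for wi in w]


def ground(p, p_star, lam=0.3):
    """Toy grounded step: move a fraction lam toward the target p_star."""
    return [(1 - lam) * pi + lam * si for pi, si in zip(p, p_star)]


p_star = [0.5, 0.3, 0.2]  # target distribution
p0 = [0.6, 0.3, 0.1]      # slightly biased starting model

closed, grounded = [p0], [p0]
for _ in range(5):
    closed.append(sharpen(closed[-1]))
    grounded.append(ground(grounded[-1], p_star))

kl_closed = [kl(p_star, p) for p in closed]      # grows every step
kl_grounded = [kl(p_star, p) for p in grounded]  # shrinks every step
```

In this toy model the closed loop amplifies the initial bias, so the KL to $p^\star$ grows at every step, while the grounded update shrinks it at every step by convexity of the KL in its second argument.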

Q15. Explain the relation between solver-verifier gap and self-improvement rate.

Let capability $C(\theta_t)$ follow the empirical exponential law per [arXiv:2507.00075]:

$$C(\theta_t) \approx C_\infty - (C_\infty - C_0)\, e^{-\kappa t}$$

Define gap $\Delta := C^\text{ver} - C^\text{sol}$. In the paper, $\kappa = \kappa(\Delta)$ is fitted empirically: it grows with $\Delta$ over a moderate range but is non-monotonic overall:

  • $\Delta$ too small → verifier and solver are homogeneous, no new signal → $\kappa \approx 0$
  • $\Delta$ too large → solver cannot learn (feedback too complex) → $\kappa$ actually drops

→ Optimal: verifier is one notch stronger than solver (e.g. Claude executor + GPT-5.5 reviewer).

Note: the original paper gives modeling + empirical fit, not a ready-portable theorem; do not over-claim as "already proven theorem" in interviews or papers.

Q16. Difference between A²RD's MVMem and traditional vector memory?

Vector memory (e.g. MemGPT, LangChain memory):

  • Stores embedding + raw text chunk
  • Retrieval: cosine similarity
  • Drawback: long-range consistency (entity identity / spatial relation) easily lost

MVMem:

  • Stores textual states (Visual Arcs / Spatial Relations / Camera trajectories) + frames + videos + dependency DAG
  • Retrieval: MLLM-based retrieval (textual + image + context combined)
  • Advantage: can explicitly track entity identity, avoiding character look drift

Implication for long-horizon agents: typed memory schema (not free-form text) + dependency DAG to decide generation order.
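A minimal sketch of "typed schema + dependency DAG", assuming Python 3.9+ for the standard-library `graphlib`. The `SegmentMemory` fields and segment names are invented for illustration and are not MVMem's actual schema.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class SegmentMemory:
    segment_id: str
    # Typed slot instead of free-form text, e.g. {"hero": "red coat"}
    entity_states: dict = field(default_factory=dict)
    # Segments that must be generated before this one
    depends_on: list = field(default_factory=list)


def generation_order(memories):
    """Order segments so every dependency is generated first."""
    graph = {m.segment_id: set(m.depends_on) for m in memories}
    return list(TopologicalSorter(graph).static_order())


segments = [
    SegmentMemory("s3_finale", depends_on=["s1_intro", "s2_chase"]),
    SegmentMemory("s2_chase", depends_on=["s1_intro"]),
    SegmentMemory("s1_intro", entity_states={"hero": "red coat"}),
]
order = generation_order(segments)
```

`TopologicalSorter` raises `graphlib.CycleError` on circular dependencies, which doubles as a cheap consistency check on the memory graph.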

Q17. How does HITS's frame-level differ from video-level? Why layered?

  • Frame-level HITS: cross-check single frame against textual state ("does this frame reflect entity X's identity")
  • Video-level HITS: check the full video against narrative consistency ("does this video match story progression")

Why layered:

  • Single frame error → fix locally within that segment
  • Cross-segment narrative error → must check at larger scale
  • Analogy: unit test vs integration test

Transfer to general long-horizon agents: local artifact check + global workflow consistency check.
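The local-plus-global pattern can be sketched generically. Everything here is an assumption for illustration: `layered_check` is a hypothetical helper, the checks are stubs, and skipping the global check until local failures are fixed is a design choice of this sketch, not a documented HITS rule.

```python
def layered_check(segments, frame_check, video_check):
    """Return (local_failures, global_ok) for a dict of segment -> frames."""
    # Local pass: check every artifact against its own segment's state
    local_failures = [
        (seg_id, idx)
        for seg_id, frames in segments.items()
        for idx, frame in enumerate(frames)
        if not frame_check(seg_id, frame)
    ]
    # Global pass: run the cross-segment check only once local checks pass
    global_ok = video_check(segments) if not local_failures else False
    return local_failures, global_ok


def frame_ok(seg_id, frame):
    return frame == "hero:red"  # stub identity check


def story_ok(segs):
    return len(segs) == 2       # stub narrative check


segments = {"seg1": ["hero:red", "hero:red"], "seg2": ["hero:blue"]}
failures, ok = layered_check(segments, frame_ok, story_ok)
```

The failure list pinpoints which segment and frame to regenerate, mirroring the unit-test role of the frame-level check.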

Q18. What is the memory retrieval formula in Generative Agents?

$$\text{score}(m) = \alpha_\text{recency}\, r(m) + \alpha_\text{importance}\, i(m) + \alpha_\text{relevance}\, s(m, q)$$

where:

  • $r(m) = \gamma^{\Delta t}$ exponential decay
  • $i(m) \in [1, 10]$, importance self-rated by LLM
  • $s(m, q)$ cosine similarity

Park 2023 UIST sets $\gamma=0.995$ per hour and weights the three terms equally (all $\alpha = 1$).

Interview bonus: importance rating by LLM self-rating itself may hallucinate; modern systems use cross-model rating or task-conditioned importance.

Q19. How does MemGPT do "OS-style memory"? Why has this idea inspired follow-up work?

MemGPT (Packer 2023):

  • Main context (fast, expensive) = "RAM"
  • External archival (slow, cheap) = "HDD"
  • LLM function calls pagein / pageout / summarize for autonomous management

Inspiration:

  • Let the LLM see its own context state (token usage, visible vs invisible)
  • Let the LLM autonomously decide "now save this to disk" / "now load that"
  • This is among the first LLM-agent implementations of truly active long-term memory management, independent of RAG frameworks

Follow-up work: MemoryBank, MemChat, MVMem are all inspired by it; ARIS-style research-wiki is also the same idea (agent decides writing / reading wiki itself).

Q20. Compare Voyager's (free exploration) and Ctx2Skill's (context-driven) skill discovery philosophies.

Voyager: in an open sandbox (Minecraft) automatically generates tasks → skill is a concrete procedure of "how to craft / kill."

  • Skill form: JS code
  • Skill trigger: retrieved during task execution
  • Lacks external context

Ctx2Skill: given a dense context $C$ (possibly 100k+ tokens), extract procedures / rules of that context.

  • Skill form: natural-language markdown
  • Skill trigger: directly prepended when context is loaded
  • Must depend on context

Core difference: Voyager is environment-driven (skill = "how to do things in this world"), Ctx2Skill is context-driven (skill = "procedural knowledge of this document").

Ctx2Skill is more suitable for new manual / new repo / new product doc scenarios; Voyager is more suitable for new environment exploration.

L3 Top Lab (Q21-Q25)

Q21. Derive sufficient conditions for the Ctx2Skill 5-role loop to converge to a stable skill set (non-trivial setting).

Direct proof of 5-role loop convergence is hard, but we can give a narrative argument for sufficient conditions:

Let $\mathcal{S}^R_i, \mathcal{S}^C_i$ be the two-sided skill sets at iteration $i$. Define capability: $C^R_i = \mathbb{E}_t[\rho^h(\mathcal{S}^R_i) \cdot \rho^e(\mathcal{S}^R_i)]$ (with Cross-Time Replay metric).

Sufficient conditions (intuitive version):

  1. Judge is calibrated: $\mathbb{E}[\text{Judge}(a, r)] = \mathbb{E}[\text{ground-truth}(a, r)]$, i.e. the Judge does not drift.
  2. Proposer is a monotone improver: each diagnosis from the proposer leads the generator to produce a new skill that strictly improves expected pass rate on the batch (with prob $\ge 1 - \delta$).
  3. Probe set is stationary: $\mathcal{Q}^h, \mathcal{Q}^e$ have stable distribution after K updates (no abrupt change).
  4. Skill set has capacity ceiling: $|\mathcal{S}^R| \le L$ (preventing unbounded growth).

Under (1)-(4), $\{C^R_i\}$ is bounded above (the probe-set pass rate is $\le 1$) and strictly improves at each step with prob $\ge 1-\delta$. If, in addition, the rare regressions are small enough that improvement holds in conditional expectation, $\mathbb{E}[C^R_{i+1}\mid\mathcal{F}_i] \ge C^R_i$, then the process is a bounded submartingale. It is not a supermartingale, which would satisfy $\mathbb{E}[C_{i+1}\mid\mathcal{F}_i] \le C_i$, the opposite direction. By Doob's martingale convergence theorem it then converges almost surely to some $C^R_\infty \le 1$. In the noiseless special case where every step is non-decreasing, the classical monotone convergence theorem for bounded real sequences already suffices.

Note: this is a narrative sketch; formal proof requires constructing the right probability space, defining $\sigma$-algebra, and carefully handling Judge's stochastic noise + the high-probability non-determinism of monotone improvement — it's a PhD-level theory question, should not be derived in full in an interview. Explaining "why bounded + monotone improvement implies convergence" is sufficient.

Q22. Derive how Native Evolution avoids policy degeneration to trivial behavior in the reward-free phase (information-theoretic argument).

Let $\pi^\star$ be the trained Native Evolution policy. Evolution phase:

$$\mathcal{K}^\star = \arg\max_\mathcal{K} I(\mathcal{K}; E)$$

where $I(\mathcal{K}; E)$ is the mutual information between K and environment. Intuition: a good K is a sufficient statistic of E.

Degeneration (trivial $\mathcal{K}$) corresponds to $I(\mathcal{K}; E) \to 0$ ($\mathcal{K}$ is independent of E, an uninformative text).

Why outcome reward at training prevents degeneration:

$$R_\text{evolve}(\mathcal{K}) = \text{Success}(\mathcal{T}_E \mid \mathcal{K}) - \text{Success}(\mathcal{T}_E \mid \varnothing)$$

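By the data processing inequality (assuming the tasks $\mathcal{T}_E$ depend on $\mathcal{K}$ only through the environment $E$, i.e. $\mathcal{K}$ and $\mathcal{T}_E$ are conditionally independent given $E$):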

$$I(\mathcal{K}; \mathcal{T}_E) \le I(\mathcal{K}; E)$$

and $\text{Success}(\mathcal{T}_E \mid \mathcal{K})$ is assumed to increase with $I(\mathcal{K}; \mathcal{T}_E)$ (more task-relevant info in K → higher success rate).

So maximizing $R_\text{evolve}$ at training → implicitly raises $I(\mathcal{K}; \mathcal{T}_E)$, which lower-bounds $I(\mathcal{K}; E)$ → pushes policy away from trivial K.

→ At inference, the policy has internalized the instinct of "how to produce high-info K," so even without reward, it can maintain non-trivial behavior — but only on environments similar to training distribution.

Caveat

Outside the train distribution (OOD environments), without grounding signal to prevent degeneration, the policy may still fail. This is one of Native Evolution's open problems.

Q23. Why does self-improvement hit a capability ceiling on reasoning-hard tasks? Cite [arXiv:2601.05280] dynamics argument.

Characteristics of reasoning-hard tasks (e.g. IMO problems, theorem proofs):

  1. Ground truth is rare, exogenous grounding signals are nearly unattainable
  2. Intermediate reasoning step correctness is hard to auto-judge (no cheap verifier)
  3. Self-rationalization (STaR-style) easily produces plausible-but-wrong rationales

By [arXiv:2601.05280]'s dynamics argument:

$$D_\text{KL}(p^\star \| p_{t+1}) - D_\text{KL}(p^\star \| p_t) \;\ge\; -\Delta_\text{grounding}$$

For reasoning-hard tasks $\Delta_\text{grounding} \to 0$ (no verifier), so the largest possible per-step KL reduction vanishes → KL does not decrease → capability does not grow.

Final implication of [arXiv:2601.05280]: to break through the reasoning-hard ceiling, need symbolic model synthesis — have the LLM simultaneously maintain a programmatic / symbolic model as a grounding anchor (e.g. Lean / Coq / Z3 verifier).

This also explains why AlphaProof and similar work must hook up Lean as verifier to break through on IMO — while pure LLM self-improvement on Olympiad has long saturated at some level.

Q24. Fundamental mathematical difference between ARIS-style inference-time orchestration and Native Evolution's training-time meta-learning?

Training-time meta-learning (Native Evolution):

Optimization target: model params $\theta$, objective $\arg\max_\theta \mathbb{E}_E\, R_\text{evolve}(\mathcal{K}_\theta(E))$.

$\theta$ determined by gradient, evolution trajectory in continuous Euclidean space ($\mathbb{R}^d$, $d$ = number of parameters).

Theoretical tools: RL theory (policy gradient theorem), meta-learning theory (MAML inner / outer loop).

Convergence analyzed by traditional SGD analysis (Lipschitz, smoothness, variance bound).

Inference-time orchestration (ARIS-style):

Optimization target: external state $\Sigma_t = (\mathcal{S}_t, \mathcal{K}_t, \text{workflow}_t)$, objective $\arg\max_\Sigma \mathbb{E}_\tau\, U(\tau \mid \pi, \Sigma)$, where $\pi$ is frozen.

$\Sigma$ determined by text diff, evolution trajectory in combinatorial discrete space (set of all markdown documents).

Theoretical tools: online learning (regret bound), sequential decision making (bandit), textual KL or edit-distance bound.

Convergence analysis needs new tools — traditional SGD does not apply.

Core difference list:

| Dimension | training-time | inference-time |
|---|---|---|
| State space | $\mathbb{R}^d$ | text strings $\Sigma^\star$ |
| Update operator | gradient | LLM-generated edit |
| Persistence | weights | markdown files |
| Update frequency at test time | does not update | every task |
| Cross-backbone portability | hard | easy (files directly copyable) |
| Interpretability | low | high |
| GPU demand | high | low |
| Theoretical tools | RL theory | online / bandit / regret |

The two are actually complementary layers: the bottom layer uses training-time to make the backbone learn generic skill following, the top layer uses inference-time to orchestrate concrete tasks.

Commonly confused framing

Do not describe ARIS as "reward-free self-evolution" — it is inference-time, non-parametric, system-level adaptation, a different mathematical regime from Native Evolution's training-time meta-learning. This is a sanity check from cross-paper reading.

Q25. If you were to design the next generation of self-evolving agent benchmarks for the second half of 2026, what would you focus on?

Problem observations:

  • GAIA / WebVoyager have saturated (90%+)
  • TRACE (2510.00415) lets the agent self-evolve the benchmark to avoid saturation
  • Ctx2Skill uses CL-bench (500 contexts × 1899 tasks × 31607 rubrics)
  • Native Evolution uses WebVoyager / WebWalker subset (1427 queries)

Design principles to focus on:

  1. Strict holdout: maintained by 3rd party, agent does not see test environment during training
  2. Capability stratification: basic capability (reading / tool use) + long-horizon capability (multi-step reasoning, memory) + self-evolution capability (adapt to new env) scored separately
  3. Cost-aware: cost per task (API tokens / GPU hours), not allowing "use 100K tokens to answer 1 question" to score points
  4. Cross-time evaluation: take multiple time snapshots, check whether the model collapses / drifts long-term
  5. Adversarial held-in / held-out switching: train env evolution capability ≠ test env evolution capability
  6. Interpretable audit trail: each answer accompanied by reasoning trace for reviewer audit
  7. Multi-model reviewer: avoid same-model hallucination consensus
  8. Capability ceiling probing: deliberately construct tasks requiring symbolic verifier (IMO-style), seeing how far self-improvement on reasoning-hard hits the wall
  9. Negative transfer detection: test whether skills from env A hurt env B
  10. Knowledge transferability: portability test — A trains K, B model uses K, see if boost holds

→ Native Evolution paper has already demonstrated (10) in Cross-Model World Knowledge Transfer (Figure 3): K trained by Seed-36B added to Qwen3-14B can give +18.3%.

Bonus: build a "self-evolution dashboard" to quantify capability dynamics (using [arXiv:2507.00075]'s exponential law):

$$C(t) = C_\infty - (C_\infty - C_0) e^{-\kappa t}$$

Fit $\hat\kappa$ as the model's self-evolution rate metric — more informative than final accuracy.
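One simple way to obtain $\hat\kappa$, as a sketch: if $C_\infty$ is known or estimated separately (an assumption here), then $\log(C_\infty - C(t))$ is linear in $t$ with slope $-\kappa$, so ordinary least squares on the log-residual recovers it. The function name `fit_kappa` and the synthetic data are made up for illustration.

```python
import math


def fit_kappa(times, capabilities, c_inf):
    """Least-squares slope of log(C_inf - C(t)) against t, negated."""
    pts = [(t, math.log(c_inf - c))
           for t, c in zip(times, capabilities) if c < c_inf]
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in pts)
    var = sum((t - mean_t) ** 2 for t, _ in pts)
    return -cov / var


# Synthetic, noiseless capability curve generated from the law itself
c0, c_inf, kappa = 0.2, 0.9, 0.35
times = list(range(8))
caps = [c_inf - (c_inf - c0) * math.exp(-kappa * t) for t in times]
kappa_hat = fit_kappa(times, caps, c_inf)
```

On noiseless data this recovers the generating $\kappa$. With real measurements, points close to $C_\infty$ make the logarithm noisy, so a nonlinear fit of all three parameters may be preferable.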

§A Appendix: Complete from-scratch code skeleton

A.1 Complete Skill library implementation

import json, time, math
from dataclasses import dataclass, field, asdict
from typing import Callable, Optional


@dataclass
class Skill:
    """Markdown skill with metadata for retrieval + lifecycle."""
    name: str
    trigger: str            # when-to-use section
    body: str               # The actual markdown injected into system prompt
    exact_triggers: list = field(default_factory=list)
    success_count: int = 0
    fail_count: int = 0
    last_updated: float = field(default_factory=time.time)
    version: int = 1
    embedding: list = field(default_factory=list)


class SkillLibrary:
    """Skill persistence, retrieval, lifecycle management."""

    def __init__(self):
        self.skills: dict[str, Skill] = {}
        self.callbacks_on_update: list[Callable] = []

    def add(self, s: Skill) -> None:
        self.skills[s.name] = s

    def retrieve(self, query: str, embed_fn, k: int = 3) -> list[Skill]:
        """Hybrid retrieval: keyword + dense + recency."""
        q_emb = embed_fn(query)
        scored = []
        now = time.time()
        for name, s in self.skills.items():
            sim = self._cos(s.embedding, q_emb) if s.embedding else 0.0
            prior = math.log(1 + s.success_count)
            recency = math.pow(0.999, max(0, (now - s.last_updated) / 3600))
            kw_hit = 1.0 if any(t.lower() in query.lower()
                                for t in s.exact_triggers) else 0.0
            score = 0.5 * sim + 0.2 * prior + 0.2 * recency + 0.1 * kw_hit
            scored.append((score, s))
        scored.sort(reverse=True, key=lambda x: x[0])
        return [s for _, s in scored[:k]]

    @staticmethod
    def _cos(a: list, b: list) -> float:
        if not a or not b: return 0.0
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb + 1e-9) if na > 0 and nb > 0 else 0.0

    def update_outcome(self, skill_name: str, success: bool) -> None:
        s = self.skills[skill_name]
        if success: s.success_count += 1
        else:       s.fail_count    += 1

    def should_revise(self, skill_name: str,
                      tau_fail: float = 0.5,
                      t_stale: float = 7 * 24 * 3600) -> bool:
        s = self.skills[skill_name]
        total = s.success_count + s.fail_count
        if total > 0 and s.fail_count / total > tau_fail:
            return True
        if time.time() - s.last_updated > t_stale:
            return True
        return False

    def revise(self, skill_name: str, new_body: str) -> None:
        s = self.skills[skill_name]
        s.body = new_body
        s.last_updated = time.time()
        s.version += 1
        s.success_count = 0
        s.fail_count = 0
        for cb in self.callbacks_on_update:
            cb(s)

    def serialize(self) -> str:
        return json.dumps({n: asdict(s) for n, s in self.skills.items()})

    def assemble_prompt(self, skills: list[Skill]) -> str:
        return "\n\n".join([f"# {s.name}\n{s.body}" for s in skills])

A.2 Complete Reflexion memory implementation

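# Imports repeated so that A.2 also runs on its own
import json, time
from dataclasses import dataclass
from typing import Callable


@dataclass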
class Reflection:
    trajectory_summary: str
    failure_root_cause: str
    fix_strategy: str
    timestamp: float


class ReflexionMemory:
    """verbal-RL style memory."""

    def __init__(self, max_entries: int = 50):
        self.entries: list[Reflection] = []
        self.max_entries = max_entries

    def add(self, traj: str, llm: Callable) -> None:
        """Let the LLM generate the reflection itself."""
        prompt = (
            f"Trajectory: {traj}\n\n"
            f"Task FAILED. Write a short reflection in JSON with keys: "
            f"trajectory_summary, failure_root_cause, fix_strategy."
        )
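        raw = llm(prompt)
        fallback = {"trajectory_summary": traj[:400],
                    "failure_root_cause": "parse_failed",
                    "fix_strategy": raw[:400]}
        try:
            obj = json.loads(raw)
            # Valid JSON may still be a non-dict or miss keys: fall back too
            if not (isinstance(obj, dict) and fallback.keys() <= obj.keys()):
                obj = fallback
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            obj = fallback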
        self.entries.append(Reflection(
            trajectory_summary=obj["trajectory_summary"],
            failure_root_cause=obj["failure_root_cause"],
            fix_strategy=obj["fix_strategy"],
            timestamp=time.time(),
        ))
        # Keep the most recent max_entries entries
        if len(self.entries) > self.max_entries:
            self.entries = self.entries[-self.max_entries:]

    def render(self) -> str:
        """Render as prefix prompt."""
        return "\n".join([
            f"[Reflection {i}] cause: {r.failure_root_cause}\n"
            f"           fix: {r.fix_strategy}"
            for i, r in enumerate(self.entries[-5:])
        ])

A.3 Sanity-check output (illustrative)

[a] SkillLibrary.add + retrieve            ✓ topk = ['voyager_craft', 'minecraft_kill']
[b] update_outcome accumulates success_count ✓ s.success_count = 3
[c] should_revise trigger condition (fail rate) ✓ tau_fail=0.5 → True
[d] revise increments version by 1          ✓ s.version: 1 → 2
[e] Reflexion.add parses LLM JSON           ✓ len(entries) = 1
[f] Reflexion render takes the last 5       ✓ render len = 154 chars
[g] keyword_hit weight in hybrid retrieval  ✓ keyword > dense when exact match
[h] cross-time replay arg max(rho_h * rho_e)✓ best_idx = 2 (out of 5)
[i] Native Evolution outcome reward computation ✓ R_evolve = 0.18 (≥ 0)

Code has passed independent reviewer static checks, logic constrained by dataclass / type annotations.


Summary

The 2026 self-evolving agent is not magic, but a combined engineering of three core paradigms (Experience / Adversarial / Meta-Learning) + three core containers (params / skills / K) + three core defenses (Cross-Time Replay / typed memory + DAG / cross-model grounding). The theoretical upper bound is given by [arXiv:2601.05280] and [arXiv:2507.00075] — the exogenous grounding signal determines the ceiling.

Remember one sentence in interviews: self-evolution is not the singularity; it is the engineering of grounded, sustained capability growth under finite supervision.