Instrumented driver that runs Anthropic's Claude Code workflow on SWE-bench instances while streaming traces through Agent-lightning. It supports hosted vLLM, the official Anthropic API, or any OpenAI-compatible backend, and emits datasets for downstream tuning.

| Example | Description | Status |
|---------|-------------|--------|
| [azure](./azure) | Supervised fine-tuning with Azure OpenAI. | [CI](https://github.com/microsoft/agent-lightning/actions/workflows/examples-azure.yml) |
| [calc_x](./calc_x) | VERL-powered math reasoning agent training that uses AutoGen with an MCP calculator tool. | [CI](https://github.com/microsoft/agent-lightning/actions/workflows/examples-calc-x.yml) |
| [claude_code](./claude_code) | Claude Code SWE-bench harness that records Agent-lightning traces across Anthropic, vLLM, and OpenAI-compatible backends. | [CI](https://github.com/microsoft/agent-lightning/actions/workflows/examples-claude-code.yml) |
| [minimal](./minimal) | Bite-sized programs that demonstrate how individual Agent-lightning building blocks behave in isolation. | [CI](https://github.com/microsoft/agent-lightning/actions/workflows/badge-unit.yml) |
| [rag](./rag) | Retrieval-Augmented Generation pipeline targeting the MuSiQue dataset with Wikipedia retrieval. | **Unmaintained** — last verified with Agent-lightning v0.1.1 |
| [search_r1](./search_r1) | Framework-free Search-R1 reinforcement learning training workflow with a retrieval backend. | **Unmaintained** — last verified with Agent-lightning v0.1.2 |

From `examples/azure/README.md`:

# Supervised Fine-tuning with Azure OpenAI
[CI](https://github.com/microsoft/agent-lightning/actions/workflows/examples-azure.yml)

This example walks through an end-to-end supervised fine-tuning loop on Azure OpenAI. The trainer runs a toy capital-lookup agent, collects traces with rewards, submits fine-tuning jobs using those traces, and deploys every successful checkpoint as a new Azure OpenAI deployment.

**NOTE: The example is tested and compatible with Agent-lightning v0.2.x, but it's not yet maintained on CI due to the difficulty of keeping a logged-in session alive in the testing environment.**

From `examples/claude_code/README.md`:

[CI](https://github.com/microsoft/agent-lightning/actions/workflows/examples-claude-code.yml)

This example shows how to wrap Anthropic's Claude Code experience with Agent-lightning instrumentation to solve SWE-bench tasks, collect spans/logs, and optionally convert those traces into HuggingFace datasets.

**NOTE:** This example only shows how to integrate Claude Code as an agent in Agent-lightning. The training part is still under development and contributions are welcome!
## Overview
`claude_code_agent.py` spins up a Lightning Store, an LLM proxy, and the Claude Code controller. Each SWE-bench instance is executed inside the official container image so you can either prompt-tune against Anthropic's hosted models or point Claude Code at a self-hosted OpenAI-compatible backend such as vLLM. When a backend surfaces token IDs/logprobs (e.g., vLLM), the traces are turned into triplets that downstream fine-tuning pipelines can consume.
## Requirements
First, install Agent-lightning following the [installation guide](https://microsoft.github.io/agent-lightning/stable/tutorials/installation/). Then install the SWE-bench harness plus utilities used by this example:
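
```bash
# SWE-bench harness, used here for containerized execution and evaluation
pip install swebench
```
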
Docker must be available because each SWE-bench instance is executed in a container via `swebench_utils`.

Finally, set API credentials depending on backend:

- `ANTHROPIC_API_KEY` for the official Claude Code path.
- `OPENAI_API_KEY` (or another OpenAI-compatible key) for the `openai` backend.
- A running OpenAI-compatible server (e.g., vLLM) when using the `vllm` backend.
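
For the hosted backends this amounts to exporting the relevant key, for example:

```bash
# Official Anthropic / Claude Code path
export ANTHROPIC_API_KEY=sk-ant-...

# OpenAI or another OpenAI-compatible provider
export OPENAI_API_KEY=sk-...
```
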
## Dataset
`swebench_samples.jsonl` contains a handful of SWE-bench issues for smoke testing. For full-scale benchmarks, load `princeton-nlp/SWE-bench` via `load_swebench_dataset` or point `--dataset-path` at your own JSONL file.
## Included Files

| File/Directory | Description |
|----------------|-------------|
| `claude_code_agent.py` | CLI entry point that launches the Lightning Store, LLM proxy, and Claude Code agent |
| `claude_code_controller.py` | Manages the SWE-bench Docker runtime and translates model outputs into git patches |
| `extended_adapter.py` | Adapter that converts LLM proxy spans into triplets with token IDs, logprobs, and chat history |
| `swebench_samples.jsonl` | Mini SWE-bench subset for quick validation |
| `swebench_utils/` | Utilities for running/evaluating SWE-bench instances inside containers |
| `templates/handle_hook.template.sh` | Helper script injected into containers for hook handling |
| `templates/settings.template.json` | Base configuration consumed by the Claude Code CLI |
## Running the Example
All commands are issued from `examples/claude_code`. Inspect the module-level docstring in `claude_code_agent.py` for the full CLI reference.
### Hosted vLLM (open-source models)
First, launch your model behind an OpenAI-compatible endpoint, for example:
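
```bash
# Placeholder model and port; any OpenAI-compatible server works
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct --port 8000
```

Then run the harness against that endpoint with the `vllm` backend. The flags below mirror the `openai` invocation further down and are only a sketch; consult the module-level docstring of `claude_code_agent.py` for the full and exact flag set:

```bash
python claude_code_agent.py vllm \
  --backend-model-high Qwen/Qwen2.5-Coder-7B-Instruct \
  --backend-model-low Qwen/Qwen2.5-Coder-7B-Instruct \
  --dataset-path swebench_samples.jsonl \
  --output-dir data_debug
```
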
The backend model names must match what the server exposes. Because this mode surfaces token IDs/logprobs, the script saves both raw span logs and HuggingFace datasets per instance.
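
### Official Anthropic API

To run against Anthropic's hosted models instead, export `ANTHROPIC_API_KEY` and select the Anthropic backend. A sketch, assuming the backend subcommand is named `anthropic` (the exact name and flags are documented in the module-level docstring of `claude_code_agent.py`):

```bash
export ANTHROPIC_API_KEY=sk-ant-...
# Output directory name is illustrative
python claude_code_agent.py anthropic \
  --dataset-path swebench_samples.jsonl \
  --output-dir data_anthropic
```
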
Backend model flags are optional here because the Anthropic API strings match the frontend names. This path is ideal for validating prompts against the hosted experience (trace outputs do not contain token IDs or logprobs).
### OpenAI-Compatible Providers
```bash
export OPENAI_API_KEY=sk-...
python claude_code_agent.py openai \
  --backend-model-high gpt-4.1 \
  --backend-model-low gpt-4o-mini \
  --dataset-path swebench_samples.jsonl \
  --output-dir data_openai
```
Use this mode whenever Claude Code should talk to Azure OpenAI, OpenAI, or another compatible provider. `--base-url` is optional—pass it if your endpoint differs from the public OpenAI URL.

Adjust `--max-turns`, `--cooldown-seconds`, and `--limit` to control runtime and rate limits regardless of backend.
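
For example, extending the `openai` invocation above (values are illustrative):

```bash
python claude_code_agent.py openai \
  --backend-model-high gpt-4.1 \
  --backend-model-low gpt-4o-mini \
  --dataset-path swebench_samples.jsonl \
  --output-dir data_openai \
  --max-turns 32 --cooldown-seconds 5 --limit 2   # cap turns, pause 5 s between instances, run only 2 samples
```
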
## Outputs and Trace Collection
- `output_dir/stream_<instance_id>.json` contains the complete span stream captured from the Lightning Store for each rollout.
- When running with `backend_type=vllm`, `output_dir/dataset-<instance_id>/` stores a HuggingFace dataset with token IDs, logprobs, prompts, and metadata produced by `ExtendedLlmProxyTraceToTriplet`.
- `logs/<instance_id>/` is created by the SWE-bench runtime and mirrors the console output from the container.
- Return values from the agent are also evaluated via `swebench_utils.evaluation.evaluate`, so `data_debug` (or your chosen folder) will contain evaluation reports alongside traces.

Use these artifacts to fine-tune models, debug Claude Code behavior, or replay rollouts in downstream Agent-lightning workflows.
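
For example, a quick way to inspect one collected dataset (this assumes it was written with the `datasets` library's `save_to_disk`, so `load_from_disk` can read it back; replace `<instance_id>` with a real SWE-bench instance id):

```bash
python -c "from datasets import load_from_disk; print(load_from_disk('data_debug/dataset-<instance_id>'))"
```
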

From `examples/minimal/README.md`:

# Minimal Component Showcase
[CI](https://github.com/microsoft/agent-lightning/actions/workflows/badge-unit.yml)

`examples/minimal` provides bite-sized programs that demonstrate how individual Agent-lightning building blocks behave in isolation.

Each module is documented with its own CLI usage in the module-level docstring. Use this directory as a reference when wiring the same pieces into a larger system.

From `examples/tinker/README.md`:

# Tinker + Agent-lightning Integration
[CI](https://github.com/microsoft/agent-lightning/actions/workflows/examples-tinker.yml)

This example shows how to use [Tinker's reinforcement-learning infrastructure](https://tinker-docs.thinkingmachines.ai/) as a fine-tuning backend for agents written against Agent-lightning. You author the agent exactly the way you would for deployment, while the bridge code reconstructs Tinker-compatible trajectories from Agent-lightning traces.
## How this differs from the original Tinker Cookbook RL recipe