Commit 63b6d42

Claude Code Example README update (#348)

1 parent 8c21917 commit 63b6d42

File tree

9 files changed, +171 -62 lines changed
Lines changed: 29 additions & 0 deletions

@@ -0,0 +1,29 @@
+name: Badge - Claude Code
+
+on:
+  workflow_run:
+    workflows:
+      - Examples - Claude Code
+    types: [completed]
+
+  workflow_dispatch:
+
+permissions:
+  actions: read
+  contents: read
+
+jobs:
+  badge:
+    if: ${{ github.event_name == 'workflow_dispatch' || (github.event_name == 'workflow_run' && github.event.workflow_run.head_branch == 'main') }}
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/github-script@v8
+        with:
+          github-token: ${{ secrets.GITHUB_TOKEN }}
+          script: |
+            const badgeAggregation = require('./scripts/badge_aggregation.js');
+            const dependencies = [
+              { workflow: 'examples-claude-code.yml', label: 'claude-code', variants: ['stable'] },
+            ];
+            await badgeAggregation({ github, context, core, dependencies });

.github/workflows/badge-examples.yml

Lines changed: 2 additions & 0 deletions

@@ -9,6 +9,7 @@ on:
       - Examples - Unsloth
       - Examples - Tinker
       - Examples - Azure
+      - Examples - Claude Code
     types: [completed]

   workflow_dispatch:
@@ -35,5 +36,6 @@ jobs:
       { workflow: 'examples-unsloth.yml', label: 'examples-unsloth.stable', variants: ['stable'] },
       { workflow: 'examples-tinker.yml', label: 'examples-tinker.stable', variants: ['stable'] },
       { workflow: 'examples-azure.yml', label: 'examples-azure.stable', variants: ['stable'] },
+      { workflow: 'examples-claude-code.yml', label: 'examples-claude-code.stable', variants: ['stable'] },
     ];
     await badgeAggregation({ github, context, core, dependencies });
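Both badge workflows above delegate to `scripts/badge_aggregation.js`, which maps each dependency's latest workflow conclusion onto a badge label. As a rough illustration of that aggregation pattern (a Python sketch with hypothetical data; the real script queries the GitHub Actions API through `github-script`):

```python
def aggregate_badges(dependencies, conclusions):
    """Map each dependency's latest workflow conclusion to a badge status.

    `conclusions` is a plain dict of workflow file -> conclusion string,
    standing in for the GitHub Actions API lookup the real script performs.
    """
    badges = {}
    for dep in dependencies:
        conclusion = conclusions.get(dep["workflow"], "unknown")
        # Anything other than an explicit success is reported as failing.
        badges[dep["label"]] = "passing" if conclusion == "success" else "failing"
    return badges

deps = [{"workflow": "examples-claude-code.yml", "label": "claude-code", "variants": ["stable"]}]
print(aggregate_badges(deps, {"examples-claude-code.yml": "success"}))
# → {'claude-code': 'passing'}
```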

docs/how-to/examples-catalog.md

Lines changed: 8 additions & 0 deletions

@@ -30,6 +30,14 @@

     [:octicons-repo-24: Browse source]({{ src("examples/calc_x") }})

+-   :material-code-braces:{ .lg .middle } __Claude Code SWE-bench__
+
+    ---
+
+    Instrumented driver that runs Anthropic's Claude Code workflow on SWE-bench instances while streaming traces through Agent-lightning—supports hosted vLLM, official Anthropic, or any OpenAI-compatible backend and emits datasets for downstream tuning.
+
+    [:octicons-repo-24: Browse source]({{ src("examples/claude_code") }})
+
 -   :material-view-grid:{ .lg .middle } __Minimal building blocks__

     ---

examples/README.md

Lines changed: 1 addition & 0 deletions

@@ -7,6 +7,7 @@ This catalog highlights the examples shipped with Agent-lightning.
 | [apo](./apo) | Automatic Prompt Optimization tutorials covering built-in, custom, and debugging workflows. | [![apo workflow status](https://github.com/microsoft/agent-lightning/actions/workflows/badge-apo.yml/badge.svg)](https://github.com/microsoft/agent-lightning/actions/workflows/examples-apo.yml) |
 | [azure](./azure) | Supervised fine-tuning with Azure OpenAI. | [![azure workflow status](https://github.com/microsoft/agent-lightning/actions/workflows/badge-azure.yml/badge.svg)](https://github.com/microsoft/agent-lightning/actions/workflows/examples-azure.yml) |
 | [calc_x](./calc_x) | VERL-powered math reasoning agent training that uses AutoGen with an MCP calculator tool. | [![calc_x workflow status](https://github.com/microsoft/agent-lightning/actions/workflows/badge-calc-x.yml/badge.svg)](https://github.com/microsoft/agent-lightning/actions/workflows/examples-calc-x.yml) |
+| [claude_code](./claude_code) | Claude Code SWE-bench harness that records Agent-lightning traces across Anthropic, vLLM, and OpenAI-compatible backends. | [![claude_code workflow status](https://github.com/microsoft/agent-lightning/actions/workflows/badge-claude-code.yml/badge.svg)](https://github.com/microsoft/agent-lightning/actions/workflows/examples-claude-code.yml) |
 | [minimal](./minimal) | Bite-sized programs that demonstrate how individual Agent-lightning building blocks behave in isolation. | [![minimal workflow status](https://github.com/microsoft/agent-lightning/actions/workflows/badge-unit.yml/badge.svg)](https://github.com/microsoft/agent-lightning/actions/workflows/badge-unit.yml) |
 | [rag](./rag) | Retrieval-Augmented Generation pipeline targeting the MuSiQue dataset with Wikipedia retrieval. | **Unmaintained** — last verified with Agent-lightning v0.1.1 |
 | [search_r1](./search_r1) | Framework-free Search-R1 reinforcement learning training workflow with a retrieval backend. | **Unmaintained** — last verified with Agent-lightning v0.1.2 |

examples/azure/README.md

Lines changed: 2 additions & 0 deletions

@@ -1,5 +1,7 @@
 # Supervised Fine-tuning with Azure OpenAI

+[![azure CI status](https://github.com/microsoft/agent-lightning/actions/workflows/examples-azure.yml/badge.svg)](https://github.com/microsoft/agent-lightning/actions/workflows/examples-azure.yml)
+
 This example walks through an end-to-end supervised fine-tuning loop on Azure OpenAI. The trainer runs a toy capital-lookup agent, collects traces with rewards, submits fine-tuning jobs using those traces, and deploys every successful checkpoint as a new Azure OpenAI deployment.

 **NOTE: The example is tested and compatible with Agent-lightning v0.2.x, but it's not yet maintained on CI due to the difficulty of maintaining a logged-in status in the testing environment.**

examples/claude_code/README.md

Lines changed: 76 additions & 49 deletions

@@ -1,46 +1,54 @@
 # Training Claude Code with Agent-lightning

-This example demonstrates how to train a Claude Code agent with Agent-lightning. **The example is still under development.**
+[![claude-code CI status](https://github.com/microsoft/agent-lightning/actions/workflows/examples-claude-code.yml/badge.svg)](https://github.com/microsoft/agent-lightning/actions/workflows/examples-claude-code.yml)

-It wraps Claude Code as the agent to:
+This example shows how to wrap Anthropic's Claude Code experience with Agent-lightning instrumentation to solve SWE-bench tasks, collect spans/logs, and optionally convert those traces into HuggingFace datasets.

-1. collect traces from agent execution on coding tasks;
-2. train a hosted LLM with the traces ***🔨 Under development***
+**NOTE:** This example only shows how to integrate Claude Code as an agent in Agent-lightning. The training part is still under development and welcomes contributions!
+
+## Overview
+
+`claude_code_agent.py` spins up a Lightning Store, an LLM proxy, and the Claude Code controller. Each SWE-bench instance is executed inside the official container image, so you can either prompt-tune against Anthropic's hosted models or point Claude Code at a self-hosted OpenAI-compatible backend such as vLLM. When a backend surfaces token IDs/logprobs (e.g., vLLM), the traces are turned into triplets that downstream fine-tuning pipelines can consume.

 ## Requirements

-1. Install agentlightning following [installation instructions](https://microsoft.github.io/agent-lightning/stable/tutorials/installation/);
-2. `(uv) pip install swebench` for evaluation.
+First, install Agent-lightning following the [installation guide](https://microsoft.github.io/agent-lightning/stable/tutorials/installation/). Then install the SWE-bench harness plus the utilities used by this example:

-## Dataset
+```bash
+(uv) pip install swebench transformers datasets python-dotenv
+```

-We provide a small dataset `swebench_samples.jsonl` which is a subset of [SWE-bench](https://huggingface.co/datasets/SWE-bench/SWE-bench) for sanity check.
+Docker must be available because each SWE-bench instance is executed in a container via `swebench_utils`.

-The instruction to prepare the full dataset is still underway.
+Finally, set API credentials depending on the backend:

-## Included Files
+- `ANTHROPIC_API_KEY` for the official Claude Code path.
+- `OPENAI_API_KEY` (or another OpenAI-compatible key) for the `openai` backend.
+- A running OpenAI-compatible server (e.g., vLLM) when using the `vllm` backend.
+
+## Dataset

-| Filename | Description |
-|--------------------------------|-------------|
-| `cc_agent.py` | Main entry point for running Claude Code agent on coding tasks with trace collection capabilities |
-| `claude_code_controller.py` | Controller implementation for managing Claude Code agent interactions and execution |
-| `custom_adapter.py` | Custom adapter for integrating with Claude Code's interface and communication protocols |
-| `custom_callbacks.py` | Callback handlers for customizing agent behavior and responses during execution |
-| `handle_hook.template.sh` | Template script for handling hooks during agent execution |
-| `settings.template.json` | Template configuration file with default settings for Claude Code agent |
-| `swe_debug.jsonl` | Debug dataset containing a subset of SWE-bench samples for testing and verification |
-| `swebench_utils/` | Utility module with helper functions for SWE-bench dataset containerized exeuction and evaluation |
+`swebench_samples.jsonl` contains a handful of SWE-bench issues for smoke testing. For full-scale benchmarks, load `princeton-nlp/SWE-bench` via `load_swebench_dataset` or point `--dataset-path` to your own JSONL file.

-## Trace collection
+## Included Files

-We support running Claude Code via two ways:
+| File/Directory | Description |
+|----------------|-------------|
+| `claude_code_agent.py` | CLI entry point that launches the Lightning store, LLM proxy, and Claude Code agent |
+| `claude_code_controller.py` | Manages the SWE-bench Docker runtime and translates model outputs into git patches |
+| `extended_adapter.py` | Adapter that converts LLM proxy spans into triplets with token IDs, logprobs, and chat history |
+| `swebench_samples.jsonl` | Mini SWE-bench subset for quick validation |
+| `swebench_utils/` | Utilities for running/evaluating SWE-bench instances inside containers |
+| `templates/handle_hook.template.sh` | Helper script injected into containers for hook handling |
+| `templates/settings.template.json` | Base configuration consumed by the Claude Code CLI |

-- Hosted LLM servers (i.e., vLLM), useful for fine-tuning the LLM;
-- Official Claude Code (i.e., via Anthropic API), useful for prompt tuning.
+## Running the Example

-### From Hosted LLM server
+All commands are issued from `examples/claude_code`. Inspect the module-level docstring in `claude_code_agent.py` for the full CLI reference.

-1. Prepare an OpenAI-compatible server:
+### Hosted vLLM (open-source models)
+
+First, launch your model behind an OpenAI-compatible endpoint, for example:

 ```bash
 vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
@@ -49,37 +57,56 @@ vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
   --tool-call-parser qwen3_coder
 ```

-2. Sanity check:
+Run the Agent-lightning harness and point it at the server:

 ```bash
-# Suppose the vllm server is running at localhost:8000
-python cc_agent \
-  --model_name_or_path Qwen/Qwen3-Coder-30B-A3B-Instruct \
-  --server_address http://localhost:8000/v1 \
-  --dataset_path swe_debug.jsonl \
-  --max_step 32 \
-  --output_dir data_debug
+python claude_code_agent.py vllm \
+  --backend-model-high Qwen/Qwen3-Coder-30B-A3B-Instruct \
+  --backend-model-low Qwen/Qwen3-Coder-30B-A3B-Instruct \
+  --frontend-model-high claude-sonnet-4-5-20250929 \
+  --frontend-model-low claude-haiku-4-5-20251001 \
+  --base-url http://localhost:8000/v1 \
+  --dataset-path swebench_samples.jsonl \
+  --output-dir data_debug \
+  --max-turns 5 \
+  --limit 2
 ```

-The above commands will generate a `data_debug` dir, which contains two targets: (1) a Huggingface Dataset named `dataset-<instance_id>` and (2) a trace file named `stream_<instance_id>.jsonl`, where `instance_id` is a unique key of the SWE-bench samples.
-The dataset showcases the versatile customization capability of agent-lightning. In particular, we support extracting **prompt/response ids**, **logprobs** from the vllm server.
-The trace file is the conversation logs for claude code to tackle the SWE-bench instance.
+The backend model names must match what the server exposes. Because this mode surfaces token IDs/logprobs, the script saves both raw span logs and HuggingFace datasets per instance.

-In addition, there will be a `logs` dir, which is the output of the docker container executing agent calls.
+### Official Claude Code (Anthropic API)

-### From official Claude Code
-1. Prepare ANTHROPIC_API_KEY
 ```bash
-export ANTHROPIC_API_KEY=sk-<your private key>
+export ANTHROPIC_API_KEY=sk-...
+python claude_code_agent.py anthropic \
+  --dataset-path swebench_samples.jsonl \
+  --output-dir data_anthropic \
+  --frontend-model-high claude-sonnet-4-5-20250929 \
+  --frontend-model-low claude-haiku-4-5-20251001
 ```

-2. Sanity check
+Backend model flags are optional here because the Anthropic API strings match the frontend names. This path is ideal for validating prompts against the hosted experience (trace outputs do not contain token IDs or logprobs).
+
+### OpenAI-Compatible Providers
+
 ```bash
-cd examples/cc
-python cc_agent \
-  --official \
-  --dataset_path swe_debug.jsonl \
-  --max_step 32 \
-  --output_dir data_debug
+export OPENAI_API_KEY=sk-...
+python claude_code_agent.py openai \
+  --backend-model-high gpt-4.1 \
+  --backend-model-low gpt-4o-mini \
+  --dataset-path swebench_samples.jsonl \
+  --output-dir data_openai
 ```
-As the underlying model is provided by Anthropic, we cannot obtain prompt/response ids and logprobs. However, we can still obtain a trace file named `<instance_id>.json` under `data_debug`.
+
+Use this mode whenever Claude Code should talk to Azure OpenAI, OpenAI, or another compatible provider. `--base-url` is optional—pass it if your endpoint differs from the public OpenAI URL.
+
+Adjust `--max-turns`, `--cooldown-seconds`, and `--limit` to control runtime and rate limits regardless of backend.
+
+## Outputs and Trace Collection
+
+- `output_dir/stream_<instance_id>.json` contains the complete span stream captured from the Lightning Store for each rollout.
+- When running with `backend_type=vllm`, `output_dir/dataset-<instance_id>/` stores a HuggingFace dataset with token IDs, logprobs, prompts, and metadata produced by `ExtendedLlmProxyTraceToTriplet`.
+- `logs/<instance_id>/` is created by the SWE-bench runtime and mirrors the console output from the container.
+- Return values from the agent are also evaluated via `swebench_utils.evaluation.evaluate`, so `data_debug` (or your chosen folder) will contain evaluation reports alongside traces.
+
+Use these artifacts to fine-tune models, debug Claude Code behavior, or replay rollouts in downstream Agent-lightning workflows.
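As a quick way to sanity-check the per-instance artifacts after a run, here is a small stdlib-only sketch (the directory layout follows the README's description; `summarize_outputs` is a hypothetical helper, not part of the example):

```python
from pathlib import Path


def summarize_outputs(output_dir: str) -> dict:
    """List the per-instance artifacts described in the README.

    Assumes the layout produced by claude_code_agent.py:
    stream_<instance_id>.json files and dataset-<instance_id>/ directories.
    """
    root = Path(output_dir)
    streams = sorted(p.name for p in root.glob("stream_*.json"))
    datasets = sorted(p.name for p in root.glob("dataset-*") if p.is_dir())
    return {"streams": streams, "datasets": datasets}


if __name__ == "__main__":
    import tempfile

    # Simulate an output directory containing one rollout's artifacts.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "stream_demo-instance.json").write_text("[]")
        (Path(tmp) / "dataset-demo-instance").mkdir()
        print(summarize_outputs(tmp))
```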

examples/claude_code/claude_code_agent.py

Lines changed: 49 additions & 11 deletions

@@ -1,16 +1,54 @@
 # Copyright (c) Microsoft. All rights reserved.

-"""Main module for the Claude Code Agent implementation.
-
-This module provides the core functionality for running Claude Code agent experiments
-on SWE-bench datasets. It includes the ClaudeCodeAgent class that implements the agent logic,
-functions for loading datasets, and asynchronous execution functions for running experiments.
-
-Key components:
-
-- Dataset loading utilities
-- ClaudeCodeAgent: Main agent implementation that handles rollout logic
-- Asynchronous execution functions for dry runs and full datasets
+"""Instrumented driver for running Claude Code on SWE-bench with Agent-lightning.
+
+This script wires together the Lightning Store, LLM proxy, and Claude Code controller so
+that every SWE-bench instance is executed inside the official Claude container while
+capturing full Agent-lightning traces. It supports three backend modes:
+
+- `vllm`: wrap an OpenAI-compatible endpoint (e.g., vLLM) for hosted OSS models while
+  collecting prompt/response token ids and logprobs.
+- `anthropic`: call the official Claude Code API via `ANTHROPIC_API_KEY` for prompt
+  tuning. Backend model defaults to the provided frontend names.
+- `openai`: route through any OpenAI-compatible provider using `OPENAI_API_KEY`.
+
+Typical usage: hosted vLLM (requires model paths and --base-url)
+
+```bash
+# Run vLLM in the background
+vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
+  --max-model-len 131072 \
+  --enable-auto-tool-choice \
+  --tool-call-parser qwen3_coder \
+  --port 45993 &
+
+python claude_code_agent.py vllm \
+  --backend-model-high Qwen/Qwen3-Coder-30B-A3B-Instruct \
+  --backend-model-low Qwen/Qwen3-Coder-30B-A3B-Instruct \
+  --base-url http://localhost:45993/v1 \
+  --dataset-path swebench_samples.jsonl
+```
+
+Official Claude Code via Anthropic:
+
+```bash
+export ANTHROPIC_API_KEY=sk-...
+python claude_code_agent.py anthropic \
+  --dataset-path swebench_samples.jsonl \
+  --output-dir data_anthropic
+```
+
+Any OpenAI-compatible backend:
+
+```bash
+export OPENAI_API_KEY=sk-...
+python claude_code_agent.py openai \
+  --backend-model-high gpt-5.1-codex-mini \
+  --backend-model-low gpt-4.1-mini \
+  --dataset-path swebench_samples.jsonl
+```
+
+Use `--debug` to enable debug logging.
 """

 import asyncio
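The docstring above describes collecting prompt/response token ids and logprobs into triplets. As a rough illustration of what one such record might hold (a hypothetical sketch; the example's actual schema comes from `ExtendedLlmProxyTraceToTriplet` and may differ):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Triplet:
    """Hypothetical shape of one prompt/response/reward record.

    Mirrors the kind of data a vLLM-backed rollout can surface (token ids
    and per-token logprobs); NOT the example's actual triplet schema.
    """
    prompt_ids: List[int] = field(default_factory=list)
    response_ids: List[int] = field(default_factory=list)
    response_logprobs: List[float] = field(default_factory=list)
    reward: float = 0.0


# One response token per logprob, plus a scalar reward from evaluation.
t = Triplet(prompt_ids=[1, 2], response_ids=[3], response_logprobs=[-0.5], reward=1.0)
assert len(t.response_ids) == len(t.response_logprobs)
```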

examples/minimal/README.md

Lines changed: 2 additions & 0 deletions

@@ -1,5 +1,7 @@
 # Minimal Component Showcase

+[![minimal CI status](https://github.com/microsoft/agent-lightning/actions/workflows/badge-unit.yml/badge.svg)](https://github.com/microsoft/agent-lightning/actions/workflows/badge-unit.yml)
+
 `examples/minimal` provides bite-sized programs that demonstrate how individual Agent-lightning building blocks behave in isolation.

 Each module have been documented with its own CLI usage in the module-level docstring. Use this directory as a reference when wiring the same pieces into a larger system.

examples/tinker/README.md

Lines changed: 2 additions & 2 deletions

@@ -1,8 +1,8 @@
 # Tinker + Agent-lightning Integration

-This example shows how to use [Tinker's reinforcement-learning infrastructure](https://tinker-docs.thinkingmachines.ai/) as a fine-tuning backend for agents written against Agent-lightning. You author the agent exactly the way you would for deployment, while the bridge code reconstructs Tinker-compatible trajectories from Agent-lightning traces.
+[![tinker CI status](https://github.com/microsoft/agent-lightning/actions/workflows/examples-tinker.yml/badge.svg)](https://github.com/microsoft/agent-lightning/actions/workflows/examples-tinker.yml)

-**NOTE: The example is tested and compatible with Agent-lightning v0.2.x, but it's not yet maintained on CI due to the cost of running the Tinker training service.**
+This example shows how to use [Tinker's reinforcement-learning infrastructure](https://tinker-docs.thinkingmachines.ai/) as a fine-tuning backend for agents written against Agent-lightning. You author the agent exactly the way you would for deployment, while the bridge code reconstructs Tinker-compatible trajectories from Agent-lightning traces.

 ## How this differs from the original Tinker Cookbook RL recipe
88
