Instrumented driver that runs Anthropic's Claude Code workflow on SWE-bench instances while streaming traces through Agent-lightning. It supports hosted vLLM, the official Anthropic API, or any OpenAI-compatible backend, and emits datasets for downstream tuning.

| Example | Description | Status |
|---------|-------------|--------|
| [azure](./azure) | Supervised fine-tuning with Azure OpenAI. | [CI](https://github.com/microsoft/agent-lightning/actions/workflows/examples-azure.yml) |
| [calc_x](./calc_x) | VERL-powered math reasoning agent training that uses AutoGen with an MCP calculator tool. | [CI](https://github.com/microsoft/agent-lightning/actions/workflows/examples-calc-x.yml) |
| [claude_code](./claude_code) | Claude Code SWE-bench harness that records Agent-lightning traces across Anthropic, vLLM, and OpenAI-compatible backends. | [CI](https://github.com/microsoft/agent-lightning/actions/workflows/examples-claude-code.yml) |
| [minimal](./minimal) | Bite-sized programs that demonstrate how individual Agent-lightning building blocks behave in isolation. | [CI](https://github.com/microsoft/agent-lightning/actions/workflows/badge-unit.yml) |
| [rag](./rag) | Retrieval-Augmented Generation pipeline targeting the MuSiQue dataset with Wikipedia retrieval. | **Unmaintained** — last verified with Agent-lightning v0.1.1 |
| [search_r1](./search_r1) | Framework-free Search-R1 reinforcement learning training workflow with a retrieval backend. | **Unmaintained** — last verified with Agent-lightning v0.1.2 |

From `examples/azure/README.md`:

# Supervised Fine-tuning with Azure OpenAI
[CI](https://github.com/microsoft/agent-lightning/actions/workflows/examples-azure.yml)

This example walks through an end-to-end supervised fine-tuning loop on Azure OpenAI. The trainer runs a toy capital-lookup agent, collects traces with rewards, submits fine-tuning jobs using those traces, and deploys every successful checkpoint as a new Azure OpenAI deployment.

**NOTE: The example is tested and compatible with Agent-lightning v0.2.x, but it's not yet maintained on CI due to the difficulty of keeping a logged-in session alive in the testing environment.**

From `examples/claude_code/README.md`:

[CI](https://github.com/microsoft/agent-lightning/actions/workflows/examples-claude-code.yml)

This example shows how to wrap Anthropic's Claude Code experience with Agent-lightning instrumentation to solve SWE-bench tasks, collect spans/logs, and optionally convert those traces into HuggingFace datasets.

**NOTE:** This example only shows how to integrate Claude Code as an agent in Agent-lightning. The training part is still under development and contributions are welcome!
## Overview
`claude_code_agent.py` spins up a Lightning Store, an LLM proxy, and the Claude Code controller. Each SWE-bench instance is executed inside the official container image so you can either prompt-tune against Anthropic's hosted models or point Claude Code at a self-hosted OpenAI-compatible backend such as vLLM. When a backend surfaces token IDs/logprobs (e.g., vLLM), the traces are turned into triplets that downstream fine-tuning pipelines can consume.
## Requirements
First, install Agent-lightning following the [installation guide](https://microsoft.github.io/agent-lightning/stable/tutorials/installation/). Then install the SWE-bench harness plus utilities used by this example:
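
```bash
# SWE-bench harness, used here for containerized execution and evaluation
pip install swebench
```
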
Docker must be available because each SWE-bench instance is executed in a container via `swebench_utils`.

Finally, set API credentials depending on backend:

- `ANTHROPIC_API_KEY` for the official Claude Code path.
- `OPENAI_API_KEY` (or another OpenAI-compatible key) for the `openai` backend.
- A running OpenAI-compatible server (e.g., vLLM) when using the `vllm` backend.
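
For the hosted backends this amounts to exporting the relevant key, for example:

```bash
# Official Anthropic / Claude Code path
export ANTHROPIC_API_KEY=sk-ant-...

# OpenAI or another OpenAI-compatible provider
export OPENAI_API_KEY=sk-...
```
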
## Dataset
`swebench_samples.jsonl` contains a handful of SWE-bench issues for smoke testing. For full-scale benchmarks, load `princeton-nlp/SWE-bench` via `load_swebench_dataset` or point `--dataset-path` at your own JSONL file.
## Included Files

| File/Directory | Description |
|----------------|-------------|
| `claude_code_agent.py` | CLI entry point that launches the Lightning Store, LLM proxy, and Claude Code agent |
| `claude_code_controller.py` | Manages the SWE-bench Docker runtime and translates model outputs into git patches |
| `extended_adapter.py` | Adapter that converts LLM proxy spans into triplets with token IDs, logprobs, and chat history |
| `swebench_samples.jsonl` | Mini SWE-bench subset for quick validation |
| `swebench_utils/` | Utilities for running/evaluating SWE-bench instances inside containers |
| `templates/handle_hook.template.sh` | Helper script injected into containers for hook handling |
| `templates/settings.template.json` | Base configuration consumed by the Claude Code CLI |
## Running the Example
All commands are issued from `examples/claude_code`. Inspect the module-level docstring in `claude_code_agent.py` for the full CLI reference.
### Hosted vLLM (open-source models)
First, launch your model behind an OpenAI-compatible endpoint, for example:
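
```bash
# Placeholder model and port; any OpenAI-compatible server works
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct --port 8000
```

Then run the harness against that endpoint with the `vllm` backend. The flags below mirror the `openai` invocation further down and are only a sketch; consult the module-level docstring of `claude_code_agent.py` for the full and exact flag set:

```bash
python claude_code_agent.py vllm \
  --backend-model-high Qwen/Qwen2.5-Coder-7B-Instruct \
  --backend-model-low Qwen/Qwen2.5-Coder-7B-Instruct \
  --dataset-path swebench_samples.jsonl \
  --output-dir data_debug
```
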
The backend model names must match what the server exposes. Because this mode surfaces token IDs/logprobs, the script saves both raw span logs and HuggingFace datasets per instance.
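
### Official Anthropic API

To run against Anthropic's hosted models instead, export `ANTHROPIC_API_KEY` and select the Anthropic backend. A sketch, assuming the backend subcommand is named `anthropic` (the exact name and flags are documented in the module-level docstring of `claude_code_agent.py`):

```bash
export ANTHROPIC_API_KEY=sk-ant-...
# Output directory name is illustrative
python claude_code_agent.py anthropic \
  --dataset-path swebench_samples.jsonl \
  --output-dir data_anthropic
```
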
Backend model flags are optional here because the Anthropic API strings match the frontend names. This path is ideal for validating prompts against the hosted experience (trace outputs do not contain token IDs or logprobs).
### OpenAI-Compatible Providers
```bash
export OPENAI_API_KEY=sk-...
python claude_code_agent.py openai \
  --backend-model-high gpt-4.1 \
  --backend-model-low gpt-4o-mini \
  --dataset-path swebench_samples.jsonl \
  --output-dir data_openai
```
Use this mode whenever Claude Code should talk to Azure OpenAI, OpenAI, or another compatible provider. `--base-url` is optional—pass it if your endpoint differs from the public OpenAI URL.

Adjust `--max-turns`, `--cooldown-seconds`, and `--limit` to control runtime and rate limits regardless of backend.
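
For example, extending the `openai` invocation above (values are illustrative):

```bash
python claude_code_agent.py openai \
  --backend-model-high gpt-4.1 \
  --backend-model-low gpt-4o-mini \
  --dataset-path swebench_samples.jsonl \
  --output-dir data_openai \
  --max-turns 32 --cooldown-seconds 5 --limit 2   # cap turns, pause 5 s between instances, run only 2 samples
```
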
## Outputs and Trace Collection
- `output_dir/stream_<instance_id>.json` contains the complete span stream captured from the Lightning Store for each rollout.
- When running with `backend_type=vllm`, `output_dir/dataset-<instance_id>/` stores a HuggingFace dataset with token IDs, logprobs, prompts, and metadata produced by `ExtendedLlmProxyTraceToTriplet`.
- `logs/<instance_id>/` is created by the SWE-bench runtime and mirrors the console output from the container.
- Return values from the agent are also evaluated via `swebench_utils.evaluation.evaluate`, so `data_debug` (or your chosen folder) will contain evaluation reports alongside traces.

Use these artifacts to fine-tune models, debug Claude Code behavior, or replay rollouts in downstream Agent-lightning workflows.
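
For example, a quick way to inspect one collected dataset (this assumes it was written with the `datasets` library's `save_to_disk`, so `load_from_disk` can read it back; replace `<instance_id>` with a real SWE-bench instance id):

```bash
python -c "from datasets import load_from_disk; print(load_from_disk('data_debug/dataset-<instance_id>'))"
```
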

From `examples/minimal/README.md`:

# Minimal Component Showcase
[CI](https://github.com/microsoft/agent-lightning/actions/workflows/badge-unit.yml)

`examples/minimal` provides bite-sized programs that demonstrate how individual Agent-lightning building blocks behave in isolation.

Each module is documented with its own CLI usage in the module-level docstring. Use this directory as a reference when wiring the same pieces into a larger system.

From `examples/tinker/README.md`:

# Tinker + Agent-lightning Integration
[CI](https://github.com/microsoft/agent-lightning/actions/workflows/examples-tinker.yml)

This example shows how to use [Tinker's reinforcement-learning infrastructure](https://tinker-docs.thinkingmachines.ai/) as a fine-tuning backend for agents written against Agent-lightning. You author the agent exactly the way you would for deployment, while the bridge code reconstructs Tinker-compatible trajectories from Agent-lightning traces.
## How this differs from the original Tinker Cookbook RL recipe