@CatherineSue (Contributor) commented Dec 6, 2025

Purpose

Add gRPC server support to vLLM, enabling the community to integrate vLLM via the gRPC protocol from any upstream application or routing layer.

Key Benefits:

  1. Native gRPC Protocol Support

    • Enables upstream applications to connect via gRPC/Protobuf instead of HTTP/JSON
    • Binary protocol reduces serialization overhead
    • HTTP/2 multiplexing improves connection efficiency
    • Expands vLLM's integration options beyond HTTP/REST APIs
  2. Integration with sgl-model-gateway

    • Enables vLLM workers to operate as gRPC backends
    • Bypasses the Python GIL bottleneck by moving tokenization logic to Rust
    • Provides production-grade features: advanced routing, secured MCP and database management, and the Responses API
    • Measured performance gains at high concurrency (see Test Results)

Changed Files

Protocol & Codegen:

  • vllm_scheduler.proto - Protocol buffer definition (source)
  • vllm_scheduler_pb2.py - Generated protobuf messages (auto-generated)
  • vllm_scheduler_pb2_grpc.py - Generated gRPC service (auto-generated)
  • compile_protos.py - Script to compile proto files
  • __init__.py - Module initialization

Server Implementation:

  • vllm/grpc/grpc_request_manager.py - Request manager (GrpcRequestManager class)
  • vllm/entrypoints/grpc_server.py - Server entrypoint (VllmSchedulerServicer + main)

Compilation

To regenerate the Python code from the .proto file:

python vllm/grpc/compile_protos.py

Requirements: pip install grpcio-tools

This generates:

  • vllm_scheduler_pb2.py - Message classes
  • vllm_scheduler_pb2_grpc.py - Service stubs and servicers
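
For reference, a minimal sketch of what such a compile step can look like with grpcio-tools, assuming the proto file sits next to the script in `vllm/grpc/` (the actual contents of `compile_protos.py` are not shown here):

```python
# Hedged sketch: regenerate the *_pb2 modules with grpcio-tools.
# Assumes vllm_scheduler.proto lives alongside this script in vllm/grpc/.
from pathlib import Path

from grpc_tools import protoc  # provided by `pip install grpcio-tools`

proto_dir = Path(__file__).parent
exit_code = protoc.main([
    "grpc_tools.protoc",
    f"-I{proto_dir}",
    f"--python_out={proto_dir}",       # emits vllm_scheduler_pb2.py
    f"--grpc_python_out={proto_dir}",  # emits vllm_scheduler_pb2_grpc.py
    str(proto_dir / "vllm_scheduler.proto"),
])
if exit_code != 0:
    raise SystemExit(f"protoc failed with exit code {exit_code}")
```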

Test Plan

Run the gRPC server with Llama-3.1-8B-Instruct:

  python3 -m vllm.entrypoints.grpc_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8080 \
    --host 0.0.0.0 \
    --tensor-parallel-size 1

Test gateway integration with sgl-model-gateway.

Verify:

  1. Health check endpoint responds correctly
  2. Streaming generation returns token IDs (not text)
  3. gRPC reflection is available for introspection
  4. Request abort/cancellation works properly
  5. GetModelInfo and GetServerInfo return correct metadata
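
To smoke-test items 1 and 5 without the gateway, a minimal Python client can call the server directly. The stub and message names below (VllmSchedulerStub, HealthCheckRequest, GetModelInfoRequest, GetServerInfoRequest) are assumptions inferred from the generated `vllm_scheduler_pb2*` modules and may differ from the actual proto definitions:

```python
# Hypothetical smoke-test client; stub/message names are assumptions, not the
# confirmed schema from vllm_scheduler.proto.
import grpc

from vllm.grpc import vllm_scheduler_pb2 as pb2
from vllm.grpc import vllm_scheduler_pb2_grpc as pb2_grpc

with grpc.insecure_channel("localhost:8080") as channel:
    stub = pb2_grpc.VllmSchedulerStub(channel)

    # 1. Health check
    print(stub.HealthCheck(pb2.HealthCheckRequest()))

    # 5. Model / server metadata
    print(stub.GetModelInfo(pb2.GetModelInfoRequest()))
    print(stub.GetServerInfo(pb2.GetServerInfoRequest()))
```

Reflection (item 3) can also be spot-checked from the command line with grpcurl, e.g. `grpcurl -plaintext localhost:8080 list`.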

Test Result

We used genai-bench to compare the HTTP server against the gRPC server fronted by sgl-model-gateway, serving Llama-3.3-70B-Instruct on 4xH100.

Performance Results (Llama-3.3-70B, D100_100, Concurrency 256):

At high concurrency, the gRPC path delivers higher throughput with lower and more consistent tail latency:

| Metric                  | gRPC        | HTTP        | Improvement |
|-------------------------|-------------|-------------|-------------|
| Throughput              | 9,068 tok/s | 6,629 tok/s | +37%        |
| Requests/sec            | 45.7        | 33.4        | +37%        |
| p99 TTFT                | 1,792 ms    | 2,434 ms    | -26%        |
| p90 TTFT                | 1,728 ms    | 2,188 ms    | -21%        |
| TTFT variance (stddev)  | 428 ms      | 651 ms      | -34%        |

Key Value Proposition:

  • Processes 37% more requests in the same time with 26% lower p99 TTFT
  • 34% more consistent performance (lower variance)

(Benchmark plots: D100_100 and D100_1000 results, grouped by server version.)

CatherineSue and others added 2 commits December 6, 2025 12:29

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a gRPC server entrypoint for vLLM, providing an alternative to the existing HTTP/REST API. This is a significant feature that enables more efficient communication through binary protocols and HTTP/2 multiplexing. The implementation is well-structured, with a dedicated GrpcRequestManager to handle the interaction with the vLLM engine, and a clean server implementation in grpc_server.py. The code includes graceful shutdown handling and client cancellation, which are important for a production-ready server.

My review focuses on improving robustness and security. I've identified a potential security vulnerability related to unlimited gRPC message sizes and several places where logging could be improved to include full tracebacks for easier debugging of production issues. These are important for maintaining a reliable and secure service.

    yield self._complete_response(request_id, output)

except Exception as e:
    logger.error("Error in Generate for %s: %s", request_id, e)
Severity: high

Using logger.error with the exception object as a formatted argument might not capture the full stack trace, which is crucial for debugging. It's better to use logger.exception to ensure the full traceback is always logged when an exception occurs.

Suggested change:
- logger.error("Error in Generate for %s: %s", request_id, e)
+ logger.exception("Error in Generate for %s", request_id)
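
As a quick illustration of the difference (standard library behavior, not code from this PR): `logger.exception` logs at ERROR level and appends the active traceback, whereas `logger.error` with a formatted exception only records the message text.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("demo")

try:
    raise RuntimeError("engine failure")
except Exception:
    # The message is logged at ERROR level *and* the traceback is appended.
    logger.exception("Error in Generate for %s", "req-123")
```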

Comment on lines +421 to +422
("grpc.max_send_message_length", -1),
("grpc.max_receive_message_length", -1),
Severity: high

Setting grpc.max_receive_message_length to -1 (unlimited) can expose the server to Denial of Service (DoS) attacks. A malicious client could send a very large gRPC message, potentially causing the server to run out of memory. It is recommended to set a reasonable limit, for example 100MB, to mitigate this risk. The same applies to grpc.max_send_message_length for consistency and to prevent accidental large responses.

Suggested change:
- ("grpc.max_send_message_length", -1),
- ("grpc.max_receive_message_length", -1),
+ ("grpc.max_send_message_length", 100 * 1024 * 1024),
+ ("grpc.max_receive_message_length", 100 * 1024 * 1024),

Comment on lines +109 to +110
except Exception as e:
    logger.error("Error in generate for %s: %s", request_id, e)
Severity: high

Using logger.error here can hide the full stack trace of the exception. Using logger.exception would provide more context for debugging by including the traceback. Since e is not used in the log message, it can be removed from the except clause.

Suggested change:
- except Exception as e:
-     logger.error("Error in generate for %s: %s", request_id, e)
+ except Exception:
+     logger.exception("Error in generate for %s", request_id)

    await self.async_llm.engine_core.add_request_async(request)

except Exception as e:
    logger.error("Error submitting request %s: %s", request.request_id, e)
Severity: high

To ensure full stack traces are logged for exceptions, which is critical for debugging, it's better to use logger.exception instead of logger.error.

Suggested change:
- logger.error("Error submitting request %s: %s", request.request_id, e)
+ logger.exception("Error submitting request %s", request.request_id)

Comment on lines +185 to +186
except Exception as e:
    logger.error("Error aborting request %s: %s", request_id, e)
Severity: high

Using logger.exception instead of logger.error will provide a full stack trace, which is invaluable for debugging unexpected errors during request abortion. Since e is not used, it can be removed from the except clause.

Suggested change:
- except Exception as e:
-     logger.error("Error aborting request %s: %s", request_id, e)
+ except Exception:
+     logger.exception("Error aborting request %s", request_id)

    return True, "Healthy"

except Exception as e:
    logger.error("Health check error: %s", e)
Severity: high

Using logger.exception is preferred for logging exceptions as it includes the stack trace, which is very helpful for debugging. logger.error does not include it by default.

Suggested change:
- logger.error("Health check error: %s", e)
+ logger.exception("Health check error")

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +316 to +319
structured_outputs=structured_outputs,
detokenize=False,
output_kind=RequestOutputKind.DELTA if stream else RequestOutputKind.CUMULATIVE,

P1: Stop strings unusable with generated sampling params

The gRPC sampler factory always constructs SamplingParams with detokenize=False even though it forwards any stop strings from the request. SamplingParams validation rejects stop strings when detokenization is disabled, so any Generate RPC that sets the stop field will throw a ValueError before the request reaches the engine. This makes the advertised stop support in vllm_engine.proto unusable for gRPC clients.
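
A minimal reproduction of the conflict described above, assuming current SamplingParams validation (stop strings require detokenization):

```python
from vllm import SamplingParams

# Expected to raise ValueError, since stop strings are only supported when
# detokenize=True.
try:
    SamplingParams(stop=["###"], detokenize=False)
except ValueError as err:
    print(f"Rejected as described above: {err}")
```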



mergify bot commented Dec 6, 2025

Hi @CatherineSue, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@CatherineSue requested a review from hmellor as a code owner December 6, 2025 21:33
