@CatherineSue (Contributor) commented Dec 6, 2025

Purpose

Add gRPC server support to vLLM, enabling the community to integrate vLLM via the gRPC protocol from any upstream application or routing layer.

Key Benefits:

  1. Native gRPC Protocol Support

    • Enables upstream applications to connect via gRPC/Protobuf instead of HTTP/JSON
    • Binary protocol reduces serialization overhead
    • HTTP/2 multiplexing improves connection efficiency
    • Expands vLLM's integration options beyond HTTP/REST APIs
  2. Integration with sgl-model-gateway

    • Enables vLLM workers to operate as gRPC backends
    • Bypasses the Python GIL bottleneck by moving tokenization logic to Rust
    • Provides production-grade features: advanced routing, secured MCP and database management, and the Responses API
    • Measured performance gains at high concurrency (see Test Results)

Changed Files

Protocol & Codegen:

  • vllm_scheduler.proto - Protocol buffer definition (source)
  • vllm_scheduler_pb2.py - Generated protobuf messages (auto-generated)
  • vllm_scheduler_pb2_grpc.py - Generated gRPC service (auto-generated)
  • compile_protos.py - Script to compile proto files
  • __init__.py - Module initialization

Server Implementation:

  • vllm/grpc/grpc_request_manager.py - Request manager (GrpcRequestManager class)
  • vllm/entrypoints/grpc_server.py - Server entrypoint (VllmSchedulerServicer + main)

Compilation

To regenerate the Python code from the .proto file:

python vllm/grpc/compile_protos.py

Requirements: pip install grpcio-tools

This generates:

  • vllm_scheduler_pb2.py - Message classes
  • vllm_scheduler_pb2_grpc.py - Service stubs and servicers
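
For reference, a minimal sketch of what such a compile step can look like with grpcio-tools, assuming the proto file sits next to the script in `vllm/grpc/` (the actual contents of `compile_protos.py` are not shown here):

```python
# Hedged sketch: regenerate the *_pb2 modules with grpcio-tools.
# Assumes vllm_scheduler.proto lives alongside this script in vllm/grpc/.
from pathlib import Path

from grpc_tools import protoc  # provided by `pip install grpcio-tools`

proto_dir = Path(__file__).parent
exit_code = protoc.main([
    "grpc_tools.protoc",
    f"-I{proto_dir}",
    f"--python_out={proto_dir}",       # emits vllm_scheduler_pb2.py
    f"--grpc_python_out={proto_dir}",  # emits vllm_scheduler_pb2_grpc.py
    str(proto_dir / "vllm_scheduler.proto"),
])
if exit_code != 0:
    raise SystemExit(f"protoc failed with exit code {exit_code}")
```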

Test Plan

Run the gRPC server with Llama-3.1-8B-Instruct:

  python3 -m vllm.entrypoints.grpc_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8080 \
    --host 0.0.0.0 \
    --tensor-parallel-size 1

Test gateway integration with sgl-model-gateway.

Verify:

  1. Health check endpoint responds correctly
  2. Streaming generation returns token IDs (not text)
  3. gRPC reflection is available for introspection
  4. Request abort/cancellation works properly
  5. GetModelInfo and GetServerInfo return correct metadata
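
To smoke-test items 1 and 5 without the gateway, a minimal Python client can call the server directly. The stub and message names below (VllmSchedulerStub, HealthCheckRequest, GetModelInfoRequest, GetServerInfoRequest) are assumptions inferred from the generated `vllm_scheduler_pb2*` modules and may differ from the actual proto definitions:

```python
# Hypothetical smoke-test client; stub/message names are assumptions, not the
# confirmed schema from vllm_scheduler.proto.
import grpc

from vllm.grpc import vllm_scheduler_pb2 as pb2
from vllm.grpc import vllm_scheduler_pb2_grpc as pb2_grpc

with grpc.insecure_channel("localhost:8080") as channel:
    stub = pb2_grpc.VllmSchedulerStub(channel)

    # 1. Health check
    print(stub.HealthCheck(pb2.HealthCheckRequest()))

    # 5. Model / server metadata
    print(stub.GetModelInfo(pb2.GetModelInfoRequest()))
    print(stub.GetServerInfo(pb2.GetServerInfoRequest()))
```

Reflection (item 3) can also be spot-checked from the command line with grpcurl, e.g. `grpcurl -plaintext localhost:8080 list`.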

Test Result

We used genai-bench to compare the HTTP server against the gRPC server fronted by sgl-model-gateway, serving Llama-3.3-70B-Instruct on 4xH100.

Performance Results (Llama-3.3-70B, D100_100, Concurrency 256):

At high concurrency, the gRPC path delivers higher throughput with lower and more consistent tail latency:

| Metric                  | gRPC        | HTTP        | Improvement |
|-------------------------|-------------|-------------|-------------|
| Throughput              | 9,068 tok/s | 6,629 tok/s | +37%        |
| Requests/sec            | 45.7        | 33.4        | +37%        |
| p99 TTFT                | 1,792 ms    | 2,434 ms    | -26%        |
| p90 TTFT                | 1,728 ms    | 2,188 ms    | -21%        |
| TTFT variance (stddev)  | 428 ms      | 651 ms      | -34%        |

Key Value Proposition:

  • Processes 37% more requests in the same time with 26% lower p99 TTFT
  • 34% more consistent performance (lower variance)

(Benchmark plots: D100_100 and D100_1000 results, grouped by server version.)

CatherineSue and others added 2 commits December 6, 2025 12:29

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a gRPC server entrypoint for vLLM, providing an alternative to the existing HTTP/REST API. This is a significant feature that enables more efficient communication through binary protocols and HTTP/2 multiplexing. The implementation is well-structured, with a dedicated GrpcRequestManager to handle the interaction with the vLLM engine, and a clean server implementation in grpc_server.py. The code includes graceful shutdown handling and client cancellation, which are important for a production-ready server.

My review focuses on improving robustness and security. I've identified a potential security vulnerability related to unlimited gRPC message sizes and several places where logging could be improved to include full tracebacks for easier debugging of production issues. These are important for maintaining a reliable and secure service.

    yield self._complete_response(request_id, output)

except Exception as e:
    logger.error("Error in Generate for %s: %s", request_id, e)
Severity: high

Using logger.error with the exception object as a formatted argument might not capture the full stack trace, which is crucial for debugging. It's better to use logger.exception to ensure the full traceback is always logged when an exception occurs.

Suggested change:
- logger.error("Error in Generate for %s: %s", request_id, e)
+ logger.exception("Error in Generate for %s", request_id)
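
As a quick illustration of the difference (standard library behavior, not code from this PR): `logger.exception` logs at ERROR level and appends the active traceback, whereas `logger.error` with a formatted exception only records the message text.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("demo")

try:
    raise RuntimeError("engine failure")
except Exception:
    # The message is logged at ERROR level *and* the traceback is appended.
    logger.exception("Error in Generate for %s", "req-123")
```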

Comment on lines +421 to +422
("grpc.max_send_message_length", -1),
("grpc.max_receive_message_length", -1),
Severity: high

Setting grpc.max_receive_message_length to -1 (unlimited) can expose the server to Denial of Service (DoS) attacks. A malicious client could send a very large gRPC message, potentially causing the server to run out of memory. It is recommended to set a reasonable limit, for example 100MB, to mitigate this risk. The same applies to grpc.max_send_message_length for consistency and to prevent accidental large responses.

Suggested change:
- ("grpc.max_send_message_length", -1),
- ("grpc.max_receive_message_length", -1),
+ ("grpc.max_send_message_length", 100 * 1024 * 1024),
+ ("grpc.max_receive_message_length", 100 * 1024 * 1024),

Comment on lines +109 to +110
except Exception as e:
    logger.error("Error in generate for %s: %s", request_id, e)
Severity: high

Using logger.error here can hide the full stack trace of the exception. Using logger.exception would provide more context for debugging by including the traceback. Since e is not used in the log message, it can be removed from the except clause.

Suggested change:
- except Exception as e:
-     logger.error("Error in generate for %s: %s", request_id, e)
+ except Exception:
+     logger.exception("Error in generate for %s", request_id)

    await self.async_llm.engine_core.add_request_async(request)

except Exception as e:
    logger.error("Error submitting request %s: %s", request.request_id, e)
Severity: high

To ensure full stack traces are logged for exceptions, which is critical for debugging, it's better to use logger.exception instead of logger.error.

Suggested change:
- logger.error("Error submitting request %s: %s", request.request_id, e)
+ logger.exception("Error submitting request %s", request.request_id)

Comment on lines +185 to +186
except Exception as e:
    logger.error("Error aborting request %s: %s", request_id, e)
Severity: high

Using logger.exception instead of logger.error will provide a full stack trace, which is invaluable for debugging unexpected errors during request abortion. Since e is not used, it can be removed from the except clause.

Suggested change:
- except Exception as e:
-     logger.error("Error aborting request %s: %s", request_id, e)
+ except Exception:
+     logger.exception("Error aborting request %s", request_id)

    return True, "Healthy"

except Exception as e:
    logger.error("Health check error: %s", e)
Severity: high

Using logger.exception is preferred for logging exceptions as it includes the stack trace, which is very helpful for debugging. logger.error does not include it by default.

Suggested change:
- logger.error("Health check error: %s", e)
+ logger.exception("Health check error")

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +316 to +319
structured_outputs=structured_outputs,
detokenize=False,
output_kind=RequestOutputKind.DELTA if stream else RequestOutputKind.CUMULATIVE,

P1: Stop strings unusable with generated sampling params

The gRPC sampler factory always constructs SamplingParams with detokenize=False even though it forwards any stop strings from the request. SamplingParams validation rejects stop strings when detokenization is disabled, so any Generate RPC that sets the stop field will throw a ValueError before the request reaches the engine. This makes the advertised stop support in vllm_engine.proto unusable for gRPC clients.
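
A minimal reproduction of the conflict described above, assuming current SamplingParams validation (stop strings require detokenization):

```python
from vllm import SamplingParams

# Expected to raise ValueError, since stop strings are only supported when
# detokenize=True.
try:
    SamplingParams(stop=["###"], detokenize=False)
except ValueError as err:
    print(f"Rejected as described above: {err}")
```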



mergify bot commented Dec 6, 2025

Hi @CatherineSue, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@CatherineSue requested a review from hmellor as a code owner December 6, 2025 21:33
