[grpc] Support gRPC server entrypoint #30190
Conversation
# Conflicts:
#   docker/Dockerfile
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Force-pushed from d0cdccb to e97ee77.
Code Review
This pull request introduces a gRPC server entrypoint for vLLM, providing an alternative to the existing HTTP/REST API. This is a significant feature that enables more efficient communication through binary protocols and HTTP/2 multiplexing. The implementation is well-structured, with a dedicated GrpcRequestManager to handle the interaction with the vLLM engine, and a clean server implementation in grpc_server.py. The code includes graceful shutdown handling and client cancellation, which are important for a production-ready server.
My review focuses on improving robustness and security. I've identified a potential security vulnerability related to unlimited gRPC message sizes and several places where logging could be improved to include full tracebacks for easier debugging of production issues. These are important for maintaining a reliable and secure service.
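To ground the review comments that follow, here is a minimal, hypothetical sketch of what a streaming Generate handler with client-cancellation handling and full-traceback logging can look like. The servicer and request-manager names are illustrative assumptions, not the PR's actual classes.

```python
# Hypothetical sketch only: class and method names are assumptions, not the PR's API.
import asyncio
import logging

import grpc

logger = logging.getLogger(__name__)


class GenerateServicer:
    """Illustrative servicer; a real one would derive from the generated base class."""

    def __init__(self, request_manager):
        # Assumed GrpcRequestManager-like object exposing generate()/abort() coroutines.
        self.request_manager = request_manager

    async def Generate(self, request, context: grpc.aio.ServicerContext):
        request_id = request.request_id
        try:
            # Stream engine outputs back to the client as they are produced.
            async for output in self.request_manager.generate(request):
                yield output
        except asyncio.CancelledError:
            # Client cancelled the RPC: make sure the engine stops working on it.
            await self.request_manager.abort(request_id)
            raise
        except Exception:
            # logger.exception records the full traceback, unlike logger.error.
            logger.exception("Error in Generate for %s", request_id)
            await context.abort(grpc.StatusCode.INTERNAL, "generation failed")
```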
yield self._complete_response(request_id, output)
...
except Exception as e:
    logger.error("Error in Generate for %s: %s", request_id, e)
Using logger.error with the exception object as a formatted argument might not capture the full stack trace, which is crucial for debugging. It's better to use logger.exception to ensure the full traceback is always logged when an exception occurs.
| logger.error("Error in Generate for %s: %s", request_id, e) | |
| logger.exception("Error in Generate for %s", request_id) |
| ("grpc.max_send_message_length", -1), | ||
| ("grpc.max_receive_message_length", -1), |
Setting grpc.max_receive_message_length to -1 (unlimited) can expose the server to Denial of Service (DoS) attacks. A malicious client could send a very large gRPC message, potentially causing the server to run out of memory. It is recommended to set a reasonable limit, for example 100MB, to mitigate this risk. The same applies to grpc.max_send_message_length for consistency and to prevent accidental large responses.
| ("grpc.max_send_message_length", -1), | |
| ("grpc.max_receive_message_length", -1), | |
| ("grpc.max_send_message_length", 100 * 1024 * 1024), | |
| ("grpc.max_receive_message_length", 100 * 1024 * 1024), |
except Exception as e:
    logger.error("Error in generate for %s: %s", request_id, e)
Using logger.error here can hide the full stack trace of the exception. Using logger.exception would provide more context for debugging by including the traceback. Since e is not used in the log message, it can be removed from the except clause.
Suggested change:
- except Exception as e:
-     logger.error("Error in generate for %s: %s", request_id, e)
+ except Exception:
+     logger.exception("Error in generate for %s", request_id)
await self.async_llm.engine_core.add_request_async(request)
...
except Exception as e:
    logger.error("Error submitting request %s: %s", request.request_id, e)
To ensure full stack traces are logged for exceptions, which is critical for debugging, it's better to use logger.exception instead of logger.error.
| logger.error("Error submitting request %s: %s", request.request_id, e) | |
| logger.exception("Error submitting request %s", request.request_id) |
except Exception as e:
    logger.error("Error aborting request %s: %s", request_id, e)
Using logger.exception instead of logger.error will provide a full stack trace, which is invaluable for debugging unexpected errors during request abortion. Since e is not used, it can be removed from the except clause.
Suggested change:
- except Exception as e:
-     logger.error("Error aborting request %s: %s", request_id, e)
+ except Exception:
+     logger.exception("Error aborting request %s", request_id)
| return True, "Healthy" | ||
|
|
||
| except Exception as e: | ||
| logger.error("Health check error: %s", e) |
💡 Codex Review
Here are some automated review suggestions for this pull request.
structured_outputs=structured_outputs,
detokenize=False,
output_kind=RequestOutputKind.DELTA if stream else RequestOutputKind.CUMULATIVE,
Stop strings unusable with generated sampling params
The gRPC sampler factory always constructs SamplingParams with detokenize=False even though it forwards any stop strings from the request. SamplingParams validation rejects stop strings when detokenization is disabled, so any Generate RPC that sets the stop field will throw a ValueError before the request reaches the engine. This makes the advertised stop support in vllm_engine.proto unusable for gRPC clients.
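A minimal reproduction of the conflict Codex describes, assuming SamplingParams validation rejects stop strings when detokenize=False (the exact error text may vary across vLLM versions):

```python
from vllm import SamplingParams

# With detokenization disabled, passing stop strings is rejected during validation
# (assumed behavior of SamplingParams._verify_args; error text may differ by version).
try:
    SamplingParams(stop=["</s>"], detokenize=False)
except ValueError as err:
    print(f"rejected: {err}")

# One possible server-side fix: keep detokenization enabled whenever the request
# carries stop strings, and only disable it otherwise.
params = SamplingParams(stop=["</s>"], detokenize=True)
print(params.stop)
```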
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Hi @CatherineSue, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch. For future commits,
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Hi @CatherineSue, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch. For future commits,
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Purpose
Add gRPC server support to vLLM, enabling the community to integrate vLLM over the gRPC protocol from any upstream application or routing layer.
Key Benefits:
- Native gRPC Protocol Support
- Integration with sgl-model-gateway

Changed Files
Protocol & Codegen:
- vllm_scheduler.proto - Protocol buffer definition (source)
- vllm_scheduler_pb2.py - Generated protobuf messages (auto-generated)
- vllm_scheduler_pb2_grpc.py - Generated gRPC service (auto-generated)
- compile_protos.py - Script to compile proto files
- __init__.py - Module initialization

Server Implementation:
- vllm/grpc/grpc_request_manager.py - Request manager (GrpcRequestManager class)
- vllm/entrypoints/grpc_server.py - Server entrypoint (VllmSchedulerServicer + main)

Compilation
To regenerate the Python code from the .proto file:

Requirements:
pip install grpcio-tools

This generates:
- vllm_scheduler_pb2.py - Message classes
- vllm_scheduler_pb2_grpc.py - Service stubs and servicers
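The PR's compile_protos.py covers this step; its exact invocation isn't shown here, but regeneration with grpcio-tools conventionally looks something like the sketch below (paths and output directories are assumptions).

```python
# Illustrative regeneration script; paths are assumptions, adjust to the repo layout.
from grpc_tools import protoc

protoc.main([
    "grpc_tools.protoc",
    "-I.",                  # directory containing vllm_scheduler.proto
    "--python_out=.",       # emits vllm_scheduler_pb2.py
    "--grpc_python_out=.",  # emits vllm_scheduler_pb2_grpc.py
    "vllm_scheduler.proto",
])
```

The equivalent one-liner is `python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. vllm_scheduler.proto`, run from the directory containing the proto file.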
Test Plan
Run the gRPC server with Llama-3.1-8B-Instruct:

python3 -m vllm.entrypoints.grpc_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8080 \
    --host 0.0.0.0 \
    --tensor-parallel-size 1
Test with gateway integration with sgl-model-gateway.

Verify:
- GetModelInfo and GetServerInfo return correct metadata
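A rough client-side check for that metadata might look like the sketch below; the stub and request message names (VllmSchedulerStub, GetModelInfoRequest, GetServerInfoRequest) and the import path are guesses based on the naming in this PR, so the actual generated names in vllm_scheduler_pb2_grpc.py may differ.

```python
# Hypothetical client sketch; generated names and module path are assumptions
# based on vllm_scheduler.proto / VllmSchedulerServicer, not verified against the PR.
import asyncio

import grpc

from vllm.grpc import vllm_scheduler_pb2, vllm_scheduler_pb2_grpc  # assumed location


async def check_metadata(target: str = "localhost:8080") -> None:
    async with grpc.aio.insecure_channel(target) as channel:
        stub = vllm_scheduler_pb2_grpc.VllmSchedulerStub(channel)
        model_info = await stub.GetModelInfo(vllm_scheduler_pb2.GetModelInfoRequest())
        server_info = await stub.GetServerInfo(vllm_scheduler_pb2.GetServerInfoRequest())
        print(model_info)
        print(server_info)


if __name__ == "__main__":
    asyncio.run(check_metadata())
```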
Test Result
We use genai-bench to measure the http_server vs (grpc_server + sgl-model-gateway) with Llama-3.3-70B-Instruct on 4x H100.

Performance Results (Llama-3.3-70B, D100_100, Concurrency 256):
At high concurrency, gRPC demonstrates superior production characteristics:
Key Value Proposition: