
Lorax stops responding after non-concurrent/concurrent requests #779

@MAHMUTGOKSU

Description

System Info

Hi, I'm currently working on using Lorax to test, benchmark, and serve my LoRA adapters, but I keep running into the same problem. After roughly 195 requests, whether sent concurrently or one at a time, the OpenAI chat completion endpoints stop responding. This happens every time I run the Docker image with these parameters for the Lorax launcher:
Model: meta-llama/Meta-Llama-3-8B-Instruct
max_batch_prefill_tokens: 27000
max_batch_total_tokens: 32000
max_input_length: 27000
max_total_tokens: 32000
adapter_memory_fraction: 0.2
The /health and /info endpoints still work, but the generation endpoints stop responding. I have watched GPU utilization for activity: the GPU is utilized when a request is made from Swagger, but no response ever comes back. I use 8 different adapters to test Lorax. Also, the logger does not even appear to see the request, which suggests some kind of deadlock before the request is processed. The requests are simple and are at most 10,000 tokens each.

NOTE: In some cases I could not even reach the Swagger UI, but this did not happen often.
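
For reference, this is roughly how the hang shows up from the client side. This is only a minimal sketch; the base URL/port and the model name are placeholders for my setup, not exact values:

import requests

BASE = "http://localhost:8080"  # placeholder: host/port where my Lorax container listens

# /health (and /info) still respond normally
print("health:", requests.get(f"{BASE}/health", timeout=5).status_code)

# ...but a chat completion request never returns and eventually times out
payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # or one of my adapter IDs
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32,
}
try:
    r = requests.post(f"{BASE}/v1/chat/completions", json=payload, timeout=60)
    print("chat:", r.status_code)
except requests.exceptions.Timeout:
    print("chat completion request timed out")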

nvidia-smi (summarized):
    NVIDIA-SMI 550.163.01 | Driver Version: 550.163.01 | CUDA Version: 12.4
    GPU 0: NVIDIA H100 80GB HBM3 | Persistence-M: On | 38C | 285W / 700W | 60146MiB / 81559MiB | GPU-Util: 64% | Compute M.: Default | MIG: Disabled
    No processes listed.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Launch Lorax using these parameters; everything else is default (I'm using a pre-downloaded model locally):
    Model: meta-llama/Meta-Llama-3-8B-Instruct
    max_batch_prefill_tokens: 27000
    max_batch_total_tokens: 32000
    max_input_length: 27000
    max_total_tokens: 32000
    adapter_memory_fraction: 0.2
  2. Constantly send requests to the chat completion endpoint. After a while, requests get no response and time out (a load-loop sketch is shown below).
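
For step 2, I send requests in a loop like the sketch below. The port, adapter IDs, prompt, and concurrency level are just examples from my setup; the hang appears after roughly 195 requests either way:

# Sketch of the load loop that triggers the hang. Assumes the server listens
# on localhost:8080 and that ADAPTERS lists placeholder IDs for my adapters.
from concurrent.futures import ThreadPoolExecutor

import requests

BASE = "http://localhost:8080"
ADAPTERS = ["adapter-1", "adapter-2"]  # placeholders for my 8 LoRA adapters

def one_request(i):
    payload = {
        "model": ADAPTERS[i % len(ADAPTERS)],
        "messages": [{"role": "user", "content": f"Request {i}: summarize a short paragraph."}],
        "max_tokens": 128,
    }
    try:
        r = requests.post(f"{BASE}/v1/chat/completions", json=payload, timeout=120)
        return r.status_code
    except requests.exceptions.Timeout:
        return "timeout"

# Send requests with a small amount of concurrency; sending them one at a
# time eventually produces the same result: no response, then timeouts.
with ThreadPoolExecutor(max_workers=4) as pool:
    for i, status in enumerate(pool.map(one_request, range(400))):
        print(i, status)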

Expected behavior

Lorax should not hang when it receives concurrent requests.
