
Lorax stops responding after non-concurrent/concurrent requests #779

@MAHMUTGOKSU

Description

System Info

Hi, I'm currently working on using Lorax to test, benchmark, and serve my LoRA adapters, but I keep running into the same problem. After roughly 195 requests, whether sent concurrently or one at a time, the OpenAI chat completion endpoints stop responding. This happens every time I run the Docker image with these parameters for the Lorax launcher:
Model: meta-llama/Meta-Llama-3-8B-Instruct
max_batch_prefill_tokens: 27000
max_batch_total_tokens: 32000
max_input_length: 27000
max_total_tokens: 32000
adapter_memory_fraction: 0.2
The /health and /info endpoints still work, but the generation endpoints stop responding. I have watched GPU utilization for activity: the GPU is utilized when a request is made from Swagger, but no response ever comes back. I use 8 different adapters to test Lorax. Also, the logger does not even appear to see the request, which suggests some kind of deadlock before the request is processed. The requests are simple and are at most 10,000 tokens each.

NOTE: In some cases I could not even reach the Swagger UI, but this did not happen often.
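
For reference, this is roughly how the hang shows up from the client side. This is only a minimal sketch; the base URL/port and the model name are placeholders for my setup, not exact values:

import requests

BASE = "http://localhost:8080"  # placeholder: host/port where my Lorax container listens

# /health (and /info) still respond normally
print("health:", requests.get(f"{BASE}/health", timeout=5).status_code)

# ...but a chat completion request never returns and eventually times out
payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # or one of my adapter IDs
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32,
}
try:
    r = requests.post(f"{BASE}/v1/chat/completions", json=payload, timeout=60)
    print("chat:", r.status_code)
except requests.exceptions.Timeout:
    print("chat completion request timed out")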

nvidia-smi (summarized):
    NVIDIA-SMI 550.163.01 | Driver Version: 550.163.01 | CUDA Version: 12.4
    GPU 0: NVIDIA H100 80GB HBM3 | Persistence-M: On | 38C | 285W / 700W | 60146MiB / 81559MiB | GPU-Util: 64% | Compute M.: Default | MIG: Disabled
    No processes listed.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Launch Lorax using these parameters; everything else is default (I'm using a pre-downloaded model locally):
    Model: meta-llama/Meta-Llama-3-8B-Instruct
    max_batch_prefill_tokens: 27000
    max_batch_total_tokens: 32000
    max_input_length: 27000
    max_total_tokens: 32000
    adapter_memory_fraction: 0.2
  2. Constantly send requests to the chat completion endpoint. After a while, requests get no response and time out (a load-loop sketch is shown below).
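
For step 2, I send requests in a loop like the sketch below. The port, adapter IDs, prompt, and concurrency level are just examples from my setup; the hang appears after roughly 195 requests either way:

# Sketch of the load loop that triggers the hang. Assumes the server listens
# on localhost:8080 and that ADAPTERS lists placeholder IDs for my adapters.
from concurrent.futures import ThreadPoolExecutor

import requests

BASE = "http://localhost:8080"
ADAPTERS = ["adapter-1", "adapter-2"]  # placeholders for my 8 LoRA adapters

def one_request(i):
    payload = {
        "model": ADAPTERS[i % len(ADAPTERS)],
        "messages": [{"role": "user", "content": f"Request {i}: summarize a short paragraph."}],
        "max_tokens": 128,
    }
    try:
        r = requests.post(f"{BASE}/v1/chat/completions", json=payload, timeout=120)
        return r.status_code
    except requests.exceptions.Timeout:
        return "timeout"

# Send requests with a small amount of concurrency; sending them one at a
# time eventually produces the same result: no response, then timeouts.
with ThreadPoolExecutor(max_workers=4) as pool:
    for i, status in enumerate(pool.map(one_request, range(400))):
        print(i, status)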

Expected behavior

Lorax should not hang when it receives concurrent requests.
