Additional contributors include Hossein Sarshar and Ashish Narasimham.
Large Language Models (LLMs) are revolutionizing how we interact with technology, but serving these powerful models efficiently can be a challenge. vLLM has rapidly become the primary choice for serving open source large language models at scale, but using vLLM is not a silver bullet. Teams that are serving LLMs for downstream applications have stringent latency and throughput requirements that necessitate a thorough analysis of which accelerator to run on and what configuration offers the best possible performance.
This guide provides a bottom-up approach to determining the best accelerator for your use case and optimizing your vLLM configuration to achieve the best, most cost-effective results possible.
Note: This guide assumes that you are familiar with GPUs, TPUs, vLLM, and the underlying features that make it such an effective serving framework.
Prerequisites
Before we begin, ensure you have:
A Google Cloud Project with billing enabled.
The gcloud command-line tool installed and authenticated.
Basic familiarity with Linux commands and Docker.
A Hugging Face account, a read token, and access to the Gemma 3 27B model.
Gathering Information on Your Use Case
Choosing the right accelerator can feel like an intimidating process because each inference use case is unique. There is no a priori ideal setup from a cost/performance perspective; we can’t say model X should always be run on accelerator Y.
The following considerations need to be taken into account to determine how best to proceed:
What model are you using?
Our example model is google/gemma-3-27b-it. This is a 27-billion-parameter instruction-tuned model from Google's Gemma 3 family.
What is the precision of the model you’re using?
We will use bfloat16 (BF16).
Note: Model precision determines the number of bytes used to store each model weight. Common options are float32 (4 bytes), float16 (2 bytes), and bfloat16 (2 bytes). Many models are now also available in quantized formats like 8-bit, 4-bit (e.g., GPTQ, AWQ), or even lower. Lower precision reduces memory requirements and can increase speed, but may come with a slight trade-off in accuracy.
Workload characteristics: How many requests/second are you expecting?
We are targeting support for 100 requests/second.
What is the average sequence length per request?
Input Length : 1500 tokens
Output Length: 200 tokens
The total sequence length per request is therefore 1500 + 200 = 1700 tokens on average.
What is the maximum total sequence length we will need to be able to handle?
Let’s say in this case it is 2,000 total tokens.
What GPU memory utilization will you be using?
The gpu_memory_utilization parameter in vLLM controls how much of the GPU’s VRAM vLLM is allowed to use; after the model weights and activations are accounted for, the remainder is pre-allocated for the KV cache. By default this is 90% in vLLM, but we generally want to set it as high as possible to maximize performance without causing OOM issues - which is how our auto_tune.sh script works (as described in the "Benchmarking, Tuning and Finalizing Your vLLM Configuration" section of this post).
What is your prefix cache hit rate?
This will be determined from application logs, but we'll estimate 50% for our calculations.
Note: Prefix caching is a powerful vLLM optimization that reuses the computed KV cache for shared prefixes across different requests. For example, if many requests share the same lengthy system prompt, the KV cache for that prompt is calculated once and shared, saving significant computation and memory. The hit rate is highly application-specific. You can estimate it by analyzing your request logs for common instruction patterns or system prompts.
What is your latency requirement?
The end-to-end latency from request to final token should not exceed 10 seconds (P99 E2E). This is our primary performance constraint.
Selecting Accelerators (GPU/TPU)
We live in a world of resource scarcity! What does this mean for your use case? You could almost certainly get the best possible latency and throughput by using the most up-to-date hardware - but as an engineer it makes no sense to do this when you can achieve your requirements at a better price/performance point.
Identifying Candidate Accelerators
We can refer to our Accelerator-Optimized Machine Family of Google Cloud Instances to determine which GPUs are viable candidates.
We can refer to our Cloud TPU offerings to determine which TPUs are viable candidates.
The following are examples of accelerators that can be used for our workloads, as we will see in the following Calculate Memory Requirements section.
The following options have different Tensor Parallelism (TP) configurations required depending on the total VRAM. Please see the next section for an explanation of Tensor Parallelism.
GPU Options
L4 GPUs
g2-standard-48 instance provides 4xL4 GPUs with 96 GB of GDDR6
TP = 4
A100 GPUs
a2-ultragpu-1g instance provides 1xA100 GPU with 80 GB of HBM
TP = 1
H100 GPUs
a3-highgpu-1g instance provides 1xH100 GPU with 80 GB of HBM
TP = 1
TPU Options
TPU v5e (16 GB of HBM per chip)
v5litepod-8 provides 8 v5e TPU chips with 128GB of total HBM
TP = 8
TPU v6e aka Trillium (32 GB of HBM per chip)
v6e-4 provides 4 v6e TPU chips with 128GB of total HBM
TP = 4
Calculate Memory Requirements
We must estimate the total minimum VRAM needed. This will tell us if the model can fit on a single accelerator or if we need to use parallelism. Memory utilization can be broken down into two main components: static memory (model weights, activations, and overhead) and KV cache memory.
The following tool was created to answer this question: Colab: HBM Calculator
You can enter the information we determined above to estimate the minimum required VRAM to run our model.
Hugging Face API Key
Model Name from Hugging Face
Number of Active Parameters (billions)
The average input and output length (in tokens) for your workload.
A batch size of 1
The calculation itself is generally out of scope for this discussion, but it can be determined from the following equation:
Required GPU/TPU memory = (model_weight + non_torch_memory + pytorch_activation_peak_memory) + (kv_cache_memory_per_batch × batch_size),
where
model_weight is equal to the number of parameters x a constant depending on parameter data type/precision
non_torch_memory is a buffer for memory overhead (estimated ~1GB)
pytorch_activation_peak_memory is the memory required for intermediate activations
kv_cache_memory_per_batch is the memory required for the KV cache per batch
batch_size is the number of sequences that will be processed simultaneously by the engine
A batch size of one is not a realistic value, but it does provide us with the minimum VRAM we will need for the engine to get off the ground. You can vary this parameter in the calculator to see just how much VRAM we will need to support our larger batch sizes of 128 - 512 sequences.
In our case, we find that we need a minimum of ~57 GB of VRAM to run gemma-3-27b-it on vLLM for our specific workload.
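As a rough sanity check (the calculator's exact accounting will differ slightly): 27 billion parameters × 2 bytes (BF16) is roughly 54 GB for the weights alone, plus the ~1 GB non_torch_memory buffer and a few additional GB for peak activations and the KV cache of a single ~2,000-token sequence, which lands in the ~57 GB range.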
Is Tensor Parallelism Required?
In this case, the answer is that parallelism is not necessarily required, but we could and should consider our options from a price/performance perspective. Why does it matter?
Very quickly - what is Tensor Parallelism? At the highest level, Tensor Parallelism is a method of splitting a large model across multiple accelerators (GPUs/TPUs) so that a model too large for any single device can still fit on the available hardware. See here for more information.
vLLM supports Tensor Parallelism (TP). With tensor parallelism, accelerators must constantly communicate and synchronize with each other over the network for the model to work. This inter-accelerator communication can add overhead, which has a negative impact on latency. This means we have a tradeoff between cost and latency in our case.
Note: Tensor parallelism is required for TPUs because of the size of this model. v5e and v6e have 16 GB and 32 GB of HBM per chip respectively, as mentioned above, so multiple chips are required to hold the model. In this guide, v6e-4 does pay a slight performance penalty for this communication overhead while our 1xH100 instance will not.
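For reference, tensor parallelism is a single flag when launching a vLLM server directly; a minimal sketch for the 4xL4 case (the auto_tune.sh script used below sets the equivalent value for you) might look like this:

```bash
# Sketch: serve Gemma 3 27B sharded across 4 GPUs with tensor parallelism.
export HF_TOKEN=<your-hugging-face-read-token>
vllm serve google/gemma-3-27b-it \
  --tensor-parallel-size 4 \
  --max-model-len 2000
```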
Benchmarking, Tuning and Finalizing Your vLLM Configuration
Now that you have your short list of accelerator candidates (4xL4, 1xA100-80GB, 1xH100-80GB, TPU v5e-8, TPU v6e-4), it is time to see the best level of performance we can achieve on each potential setup. We will only cover the H100 and Trillium (v6e) benchmarking and tuning in this section - but the process would be nearly identical for the other accelerators:
Launch, SSH, Update VMs
Pull vLLM Docker Image
Update and Launch Auto Tune Script
Analyze Results
H100 80GB
In your project, open the Cloud Shell and enter the following command to launch an a3-highgpu-1g instance. Be sure to update your project ID accordingly and select a zone that supports the a3-highgpu-1g machine type for which you have quota.
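A sketch of the command is shown below; the instance name, zone, image family, and disk size are illustrative placeholders:

```bash
# Sketch: create a single a3-highgpu-1g VM (1x H100 80GB).
gcloud compute instances create vllm-h100-benchmark \
  --project=<your-project-id> \
  --zone=us-central1-a \
  --machine-type=a3-highgpu-1g \
  --image-family=common-cu121 \
  --image-project=deeplearning-platform-release \
  --metadata="install-nvidia-driver=True" \
  --boot-disk-size=500GB \
  --maintenance-policy=TERMINATE
```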
SSH into the instance.
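Using the instance name and zone from the previous step:

```bash
gcloud compute ssh vllm-h100-benchmark \
  --project=<your-project-id> \
  --zone=us-central1-a
```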
Now that we’re in our running instance, we can go ahead and pull the latest vLLM Docker image and then run it interactively. A final detail - if we are using a gated model (and we are in this demo) we will need to provide our HF_TOKEN in the container:
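A minimal sketch using the official vllm/vllm-openai image (the tag and run flags may need adjusting for your environment):

```bash
# Pull the latest vLLM image and open an interactive shell inside it,
# passing the Hugging Face token so gated models can be downloaded.
export HF_TOKEN=<your-hugging-face-read-token>
sudo docker pull vllm/vllm-openai:latest
sudo docker run --gpus all -it --rm --ipc=host \
  -e HF_TOKEN=$HF_TOKEN \
  --entrypoint /bin/bash \
  vllm/vllm-openai:latest
```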
In our running container, we can now find a file called vllm-workspace/benchmarks/auto_tune/auto_tune.sh which we will need to update with the information we determined above to tune our vLLM configuration for the best possible throughput and latency.
In the auto_tune.sh script, you will need to make the following updates:
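A sketch of the parameter block follows; the exact variable set can vary slightly across vLLM versions, and the BASE and DOWNLOAD_DIR values here are assumptions about the container layout:

```bash
# Key variables near the top of auto_tune.sh for the 1x H100 run.
BASE="/vllm-workspace"                  # assumption: workspace root inside the container
MODEL="google/gemma-3-27b-it"
SYSTEM="GPU"
TP=1
DOWNLOAD_DIR="/vllm-workspace/models"   # assumption: any writable path works
INPUT_LEN=1500
OUTPUT_LEN=200
MAX_MODEL_LEN=2000
MIN_CACHE_HIT_PCT=50
MAX_LATENCY_ALLOWED_MS=10000
NUM_SEQS_LIST="128 256"
NUM_BATCHED_TOKENS_LIST="512 1024 2048"
```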
Specify the model we will be using.
Specify that we are leveraging GPU in this case.
Tensor Parallelism is set to 1.
Specify our inputs and outputs.
Specify our 50% min_cache_hit_pct.
Specify our latency requirement.
Update our num_seqs_list to reflect a range of common values for high performance.
Update num_batched_tokens_list if necessary
This likely will not be necessary, but may be worth adjusting if a use case has particularly small or particularly large inputs/outputs.
Be sure to specify the BASE, DOWNLOAD_DIR, and cd “$BASE” statement exactly as shown.
Once the parameters have been updated, launch the auto_tune.sh script:
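From inside the container, assuming the script path shown earlier:

```bash
cd /vllm-workspace/benchmarks/auto_tune
bash auto_tune.sh
```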
The following processes occur:
Our auto_tune.sh script downloads the required model and attempts to start a vLLM server at the highest possible gpu_utilization (0.98 by default). If a CUDA OOM occurs, the script steps down 1% at a time until it finds a stable configuration.
Troubleshooting Note: In rare cases, a vLLM server may be able to start during the initial gpu_utilization test but then fail due to a CUDA OOM at the start of the next benchmark. Alternatively, the initial test may fail and then not spawn a follow-up server, resulting in what appears to be a hang. If either happens, edit auto_tune.sh near the very end of the file so that gpu_utilization begins at 0.95 or a lower value rather than at 0.98.
Troubleshooting Note: By default, profiling is passed to the benchmarking_server.py script. In some cases this may cause the process to hang if the GPU profiler is not capable of handling the large number of requests for that specific model. You can confirm this by reviewing the logs for the current run: if the logs show that profiling has started and then nothing further is logged, you've run into this problem.
If that is the case, simply remove the --profile flag from the benchmarking_server.py calls in the auto_tune.sh script under the run_benchmark() function:
```bash
    --profile &> "$bm_log"   # Remove this flag, making sure to keep the &> "$bm_log" on the argument above
```
Then, for each permutation of num_seqs_list and num_batched_tokens, a server is spun up and our workload is simulated.
A benchmark is first run with an infinite request rate.
If the resulting P99 E2E Latency is within the MAX_LATENCY_ALLOWED_MS limit, this throughput is considered the maximum for this configuration.
If the latency is too high, the script performs a search by iteratively decreasing the request rate until the latency constraint is met. This finds the highest sustainable throughput for the given parameters and latency requirement.
In our result.txt file at /vllm-workspace/auto-benchmark/$TAG/result.txt, we will find which combination of parameters is most efficient, and we can then take a closer look at that run:
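The exact formatting of result.txt depends on the script version, but the best entry for this run contained values along these lines (the same numbers are broken down below):

```
max_num_seqs: 256, max_num_batched_tokens: 512, request_rate: 6, e2el: 7612.31, throughput: 4.17
```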
Let's look at the best-performing result to understand our position:
max_num_seqs: 256, max_num_batched_tokens: 512
These were the settings for the vLLM server during this specific test run.
request_rate: 6
This is the final input from the script's loop. It means your script determined that sending 6 requests per second was the highest rate this server configuration could handle while keeping latency below 10,000 ms. If it tried 7 req/s, the latency was too high.
e2el: 7612.31
This is the P99 latency that was measured when the server was being hit with 6 req/s. Since 7612.31 is less than 10000, the script accepted this as a successful run.
throughput: 4.17
This is the actual, measured output. Even though you were sending requests at a rate of 6 per second, the server could only successfully process them at a rate of 4.17 per second.
TPU v6e (aka Trillium)
Let’s do the same optimization process for TPU now. You will find that vLLM has a robust ecosystem for supporting TPU-based inference and that there is little difference between how we execute our benchmarking script for GPU and TPU.
First we’ll need to launch and configure networking for our TPU instance - in this case we can use Queued Resources. Back in our Cloud Shell, use the following command to deploy a v6e-4 instance. Be sure to select a zone where v6e is available.
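A sketch of the request is shown below; the resource names and zone are placeholders, and you should confirm the current v6e runtime version in the Cloud TPU documentation:

```bash
# Request a v6e-4 TPU VM via queued resources.
gcloud compute tpus queued-resources create vllm-v6e-qr \
  --node-id=vllm-v6e-node \
  --project=<your-project-id> \
  --zone=us-east5-b \
  --accelerator-type=v6e-4 \
  --runtime-version=v2-alpha-tpuv6e
```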
To monitor the status of your request:
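For example, using the queued-resource name from the previous step:

```bash
gcloud compute tpus queued-resources describe vllm-v6e-qr \
  --project=<your-project-id> \
  --zone=us-east5-b
```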
Wait for the TPU VM to become active (status will update from PROVISIONING to ACTIVE). This might take some time depending on resource availability in the selected zone.
SSH directly into the instance with the following command:
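Using the node ID created by the queued resource:

```bash
gcloud compute tpus tpu-vm ssh vllm-v6e-node \
  --project=<your-project-id> \
  --zone=us-east5-b
```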
Now that we’re in, pull the vLLM-TPU Docker image, launch our container, and exec into the container:
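A sketch assuming the vllm/vllm-tpu image (the image tag and run flags may differ for your setup; TPU containers typically need privileged access and the host network):

```bash
# Pull the vLLM TPU image, start it with access to the TPU devices, and open a shell.
export HF_TOKEN=<your-hugging-face-read-token>
sudo docker pull vllm/vllm-tpu:nightly
sudo docker run -it --rm --privileged --net=host \
  --shm-size=16G \
  -e HF_TOKEN=$HF_TOKEN \
  --entrypoint /bin/bash \
  vllm/vllm-tpu:nightly
```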
Again, we will need to install a dependency, provide our HF_TOKEN, and update our auto_tune.sh script as we did above with the H100.
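For example, setting the token inside the container (the specific extra dependency installed in the original walkthrough is not reproduced here):

```bash
export HF_TOKEN=<your-hugging-face-read-token>
```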
We will want to make the following updates to the vllm/benchmarks/auto_tune.sh file:
```bash
BASE="/workspace"
MODEL="google/gemma-3-27b-it"
SYSTEM="TPU"
TP=4
DOWNLOAD_DIR="/workspace/models"
INPUT_LEN=1500
OUTPUT_LEN=200
MAX_MODEL_LEN=2000
MIN_CACHE_HIT_PCT=50
MAX_LATENCY_ALLOWED_MS=10000
NUM_SEQS_LIST="128 256"
NUM_BATCHED_TOKENS_LIST="512 1024 2048"
```
And then execute:
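Assuming the script lives at vllm/benchmarks/auto_tune.sh under the BASE directory set above:

```bash
cd /workspace/vllm/benchmarks
bash auto_tune.sh
```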
As our auto_tune.sh executes, we determine the largest possible gpu_utilization value our server can run with and then cycle through the different num_batched_tokens parameters to determine which is most efficient.
Troubleshooting Note: It can take longer to start up a vLLM engine on TPU than on GPU due to a series of compilation steps that are required. In some cases this can exceed 10 minutes, and when that happens the auto_tune.sh script may kill the process. If this occurs, update the start_server() function so that the for loop sleeps for 30 seconds rather than 10 seconds, as shown here:
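A sketch of the change (the surrounding readiness loop in start_server() may look slightly different in your copy of the script):

```bash
# Inside the wait loop of start_server() in auto_tune.sh:
sleep 30   # was: sleep 10 - gives the TPU engine time to finish compilation
```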
The outputs are printed as our program executes and we can also find them in log files at $BASE/auto-benchmark/TAG. We can see in these logs that our current configurations are still able to achieve our latency requirements.
Again, we can observe our result.txt file:
And the corresponding metrics for our best run:
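As before, the exact formatting may differ, but the best entry for this run contained values along these lines (broken down below):

```
max_num_seqs: 256, max_num_batched_tokens: 512, request_rate: 9, e2el: 8423.40, throughput: 5.63
```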
Let's look at the best-performing result to understand our position:
max_num_seqs: 256, max_num_batched_tokens: 512
These were the settings for the vLLM server during this specific test run.
request_rate: 9
This is the final input from the script's loop. It means your script determined that sending 9 requests per second was the highest rate this server configuration could handle while keeping latency below 10,000 ms. If it tried 10 req/s, the latency was too high.
e2el: 8423.40
This is the P99 latency that was measured when the server was being hit with 9 req/s. Since 8423.40 is less than 10,000, the script accepted this as a successful run.
throughput: 5.63
This is the actual, measured output. Even though you were sending requests at a rate of 9 per second, the server could only successfully process them at a rate of 5.63 per second.
Calculating Performance-Cost Ratio
Now that we have tuned and benchmarked our two primary accelerator candidates, we can bring the data together to make a final, cost-based decision. The goal is to find the most economical configuration that can meet our workload requirement of 100 requests per second while staying under our P99 end-to-end latency limit of 10,000 ms.
We will analyze the cost to meet our 100 req/s target using the best-performing configuration for both the H100 GPU and the TPU v6e.
NVIDIA H100 80GB (a3-highgpu-1g)
Measured Throughput: The benchmark showed a single H100 vLLM engine achieved a throughput of 4.17 req/s.
Instances Required: To meet our 100 req/s goal, we would need to run multiple instances. The calculation is:
Target Throughput ÷ Throughput per Instance = 100 req/s ÷ 4.17 req/s ≈ 23.98
Since we can't provision a fraction of an instance, we must round up to 24 instances.
Estimated Cost: As of July 2025, the spot price for an a3-highgpu-1g machine type in us-central1 is approximately $2.25 per hour. The total hourly cost for our cluster would be: 24 instances × $2.25/hr = $54.00/hr
Note: We are using Spot instance pricing to keep the cost figures simple; this would not be a typical provisioning pattern for this type of workload.
Google Cloud TPU v6e (v6e-4)
Measured Throughput: The benchmark showed a single v6e-4 vLLM engine achieved a higher throughput of 5.63 req/s.
Instances Required: We perform the same calculation for the TPU cluster:
Target Throughput ÷ Throughput per Instance = 100 req/s ÷ 5.63 req/s ≈ 17.76
Again, we must round up to 18 instances to strictly meet the 100 req/s requirement.
Estimated Cost: As of July 2025, the spot price for a v6e-4 queued resource in us-central1 is approximately $0.56 per chip per hour. The total hourly cost for this cluster would be:
18 instances × 4 chips × $0.56/chip-hr = $40.32/hr
Conclusion: The Most Cost-Effective Choice
Let's summarize our findings in a table to make the comparison clear.
| Metric | H100 (a3-highgpu-1g) | TPU (v6e-4) |
| --- | --- | --- |
| Throughput per Instance | 4.17 req/s | 5.63 req/s |
| Instances Needed (100 req/s) | 24 | 18 |
| Spot Instance Cost Per Hour | $2.25 / hour | $0.56 × 4 chips = $2.24 / hour |
| Spot Cost Total | $54.00 / hour | $40.32 / hour |
| Total Monthly Cost (730h) | ~$39,400 | ~$29,400 |
The results are definitive. For this specific workload (serving the gemma-3-27b-it model with long contexts), the v6e-4 configuration is the winner.
Not only does the v6e-4 instance provide higher throughput than the a3-highgpu-1g instance, but it does so at a significantly reduced cost. This translates to massive savings at higher scales.
Looking at the performance-per-dollar, the advantage is clear:
H100: 100 req/s ÷ $54.00/hr ≈ 1.85 req/s per dollar-hour
TPU v6e: 100 req/s ÷ $40.32/hr ≈ 2.48 req/s per dollar-hour
At the target load, the v6e-4 configuration delivers roughly a third more throughput for every dollar spent, making it the superior, efficient choice for deploying this workload.
Final Reminder
This benchmarking and tuning process demonstrates the critical importance of evaluating different hardware options to find the optimal balance of performance and cost for your specific AI workload. We need to keep the following in mind when sizing these workloads:
If our workload changed (e.g., input length, output length, prefix-caching percentage, or our requirements) the outcome of this guide may be different - H100 could outperform v6e in several scenarios depending on the workload.
If we considered the other possible accelerators mentioned above, we may find a more cost effective approach that meets our requirements.
Finally, we covered a relatively small parameter space in our auto_tune.sh script for this example - if we searched a larger space, we may have found a configuration with even greater cost-savings potential.
Additional Resources
The following is a collection of additional resources to help you complete the guide and better understand the concepts described.
Auto Tune ReadMe in Github
TPU Optimization Tips
Currently Supported Models for TPU on vLLM
More on Optimization and Parallelism from vLLM
Great Article on KV Cache