Poster is on display and will be presented at the poster pitch session.
Large Language Models (LLMs) deployed as guardrails form a distinct serving regime characterized by prefill dominance: long input contexts with minimal output. In this setting, Time-to-First-Token (TTFT) is the critical latency metric. While FP8 inference on NVIDIA H100 GPUs significantly accelerates dense matrix computation (GEMM), it shifts the bottleneck toward inter-GPU synchronization.
This study evaluates the trade-off between scale-up (monolithic Tensor Parallelism, TP=8) and scale-out (replication as two disjoint TP=4 instances) for Llama-3.3-70B-Instruct on a single 8×H100 NVSwitch node. Using vLLM with FP8 quantization, we show that replication (TP=4×2) delivers higher throughput than the monolithic deployment in prefill-heavy workloads.
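The two deployments compared above can be sketched as vLLM launch commands. This is a minimal illustration, not the authors' configuration: the Hugging Face model ID, the port numbers, the GPU-to-replica assignment via `CUDA_VISIBLE_DEVICES`, and the use of on-the-fly `--quantization fp8` are all assumptions, and the helper names `launch_plan` and `render` are hypothetical.

```python
import shlex

# Assumed Hugging Face model ID; the abstract only names the model.
MODEL = "meta-llama/Llama-3.3-70B-Instruct"


def launch_plan(tp_size, num_gpus=8, base_port=8000):
    """Return one (env, argv) pair per vLLM instance for a given TP degree.

    tp_size=8 yields a single monolithic instance; tp_size=4 yields two
    replicas pinned to disjoint GPU sets. Ports are hypothetical.
    """
    if num_gpus % tp_size != 0:
        raise ValueError("tp_size must divide num_gpus")
    plan = []
    for replica in range(num_gpus // tp_size):
        first = replica * tp_size
        gpus = ",".join(str(g) for g in range(first, first + tp_size))
        env = {"CUDA_VISIBLE_DEVICES": gpus}  # keep replicas on disjoint GPUs
        argv = [
            "vllm", "serve", MODEL,
            "--tensor-parallel-size", str(tp_size),
            "--quantization", "fp8",  # assumption: on-the-fly FP8 quantization
            "--port", str(base_port + replica),
        ]
        plan.append((env, argv))
    return plan


def render(plan):
    """Format each (env, argv) pair as a single shell command line."""
    lines = []
    for env, argv in plan:
        prefix = " ".join(f"{key}={value}" for key, value in env.items())
        lines.append(f"{prefix} {shlex.join(argv)}")
    return lines


if __name__ == "__main__":
    for tp in (8, 4):
        print(f"# TP={tp}")
        for line in render(launch_plan(tp)):
            print(line)
```

In the replicated case a request router or load balancer in front of the two ports would also be needed; it is omitted here.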
At an input length of 2048 tokens, replication improves throughput by 27.3% over TP=8 under FP8, compared with 18.1% under BF16; the gap widens because faster FP8 GEMMs expose synchronization overheads. Profiling shows that TP=8 becomes collective-bound, spending 35.1% of prefill time in NCCL All-Reduce, whereas TP=4 reduces this share to 25.5%.
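One way to read the throughput figures above is as a scaling efficiency for doubling the TP degree. The sketch below assumes the reported replication throughput is the aggregate of both TP=4 replicas; the function names are hypothetical, and only the percentages come from the text.

```python
def tp_doubling_efficiency(replication_gain):
    """Throughput of one TP=8 instance relative to two TP=4 replicas combined.

    replication_gain is the fractional improvement of replication over TP=8,
    e.g. 0.273 for 27.3%. A value of 1.0 would mean TP=8 scales perfectly.
    """
    return 1.0 / (1.0 + replication_gain)


def allreduce_share_gap(share_tp8, share_tp4):
    """Difference in All-Reduce share of prefill time, in percentage points."""
    return share_tp8 - share_tp4


if __name__ == "__main__":
    for label, gain in (("FP8", 0.273), ("BF16", 0.181)):
        eff = tp_doubling_efficiency(gain)
        print(f"{label}: TP=8 reaches {eff:.1%} of the TP=4x2 aggregate")
    gap = allreduce_share_gap(35.1, 25.5)
    print(f"All-Reduce share gap: {gap:.1f} percentage points")
```

Under that assumption, a single TP=8 instance reaches roughly 79% of the replicated aggregate with FP8 and roughly 85% with BF16, which is consistent with FP8 making synchronization cost more visible.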
Operationally, while monolithic TP=8 offers lower mean TTFT under light load, replication provides superior robustness at saturation, sustaining higher request rates under a fixed P99 latency budget. We conclude that, for FP8-enabled guardrails on H100 nodes, scaling out with moderate TP degrees is more effective than maximizing TP depth for prefill-dominant workloads.
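The notion of a sustainable request rate under a fixed P99 budget can be made concrete as follows. This is a sketch of one plausible procedure, not the authors' measurement method: it assumes TTFT samples have been collected at several offered request rates, uses a nearest-rank percentile, and the names `p99` and `max_rate_within_budget` are hypothetical.

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def max_rate_within_budget(ttft_by_rate, budget_s):
    """Highest offered rate reached before P99 TTFT first exceeds the budget.

    ttft_by_rate maps request rate (req/s) to a list of TTFT samples in
    seconds. Rates are swept in ascending order and the sweep stops at the
    first violation. Returns None if even the lowest rate misses the budget.
    """
    best = None
    for rate in sorted(ttft_by_rate):
        if p99(ttft_by_rate[rate]) > budget_s:
            break
        best = rate
    return best
```

Running the same sweep for the TP=8 and TP=4×2 deployments and comparing the returned rates gives the saturation comparison described above.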