Gemma-4-26B on v6e-4 TPU Benchmarks
🇺🇸 United States • May 7, 2026

Originally published by Dev.to

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

model: google/gemma-4-26B-A4B-it

gemini "run the context benchmark with concurrency from 1, 2, 4, 8, 16, 32,64, 128, 256, 512, 1024 up to 16384 tokens"

Extreme Stress Test at 1024 Concurrency: Execute an extreme-load benchmark sweep up to concurrency 1024 to find the breaking point.

╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ✓ Shell Execute an extreme-load multi-dimensional benchmark sweep up to concurrency 1024. Steps reduced to 2 (0, 8192, 16384) to manage t… │
│                                                                                                                                            │
│ ... first 89 lines hidden (Ctrl+O to show) ...                                                                                             │
│ |   128 |  256 | 1.14308 | 18554.9 |                                                                                                       │
│ |  8192 |  256 | 3.15647 |  419771 |                                                                                                       │
│ | 16384 |  256 | 5.58021 |  457101 |                                                                                                       │
│ |   128 |  512 | 3.80498 | 11249.3 |                                                                                                       │
│ |  8192 |  512 | 6.69551 |  374758 |                                                                                                       │
│ | 16384 |  512 | 10.7299 |  443335 |                                                                                                       │
│ |   128 | 1024 | 14.7126 | 5878.47 |                                                                                                       │
│ |  8192 | 1024 | 19.6215 |  267038 |                                                                                                       │
│ | 16384 | 1024 | 24.6752 |  388684 |                                                                                                       │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

✦ The extreme-load benchmark sweep up to 1024 concurrency is complete. We have successfully identified the breaking point of the current TPU v6e vLLM stack.

📊 Extreme Load Performance Summary

┌─────────────┬──────────────────┬──────────────┬───────────────────┐
│ concurrency │ context (tokens) │ avg_ttft (s) │ batch prefill_tps │
├─────────────┼──────────────────┼──────────────┼───────────────────┤
│ 128         │ 16,384           │ 3.30s        │ 439,833           │
│ 256         │ 16,384           │ 5.58s        │ 457,101           │
│ 512         │ 16,384           │ 10.73s       │ 443,335           │
│ 1024        │ 16,384           │ 24.68s       │ 388,684           │
└─────────────┴──────────────────┴──────────────┴───────────────────┘
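
The summary rows for concurrency 256 and above can be traced back to the 16,384-token rows in the visible part of the raw tool output. A minimal sketch of that extraction follows. The column order (context, concurrency, avg_ttft, prefill_tps) is an assumption inferred from the summary table, since the header row is hidden in the transcript, and `parse_rows` is a hypothetical helper, not part of the project's code.

```python
# Raw rows copied from the visible part of the tool output.
# ASSUMPTION: columns are (context, concurrency, avg_ttft_s, prefill_tps);
# the real header row is hidden in the transcript.
RAW = """
| 128 | 256 | 1.14308 | 18554.9 |
| 8192 | 256 | 3.15647 | 419771 |
| 16384 | 256 | 5.58021 | 457101 |
| 128 | 512 | 3.80498 | 11249.3 |
| 8192 | 512 | 6.69551 | 374758 |
| 16384 | 512 | 10.7299 | 443335 |
| 128 | 1024 | 14.7126 | 5878.47 |
| 8192 | 1024 | 19.6215 | 267038 |
| 16384 | 1024 | 24.6752 | 388684 |
"""


def parse_rows(text):
    """Turn pipe-delimited rows into dictionaries (hypothetical helper)."""
    rows = []
    for line in text.strip().splitlines():
        # Drop the outer pipes, then split the remaining cells.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        context, concurrency, ttft, tps = cells
        rows.append({
            "context": int(context),
            "concurrency": int(concurrency),
            "avg_ttft_s": float(ttft),
            "prefill_tps": float(tps),
        })
    return rows


rows = parse_rows(RAW)
# Keep only the longest prompts, which is what the summary table reports.
long_context = [r for r in rows if r["context"] == 16384]
for r in long_context:
    print(f'{r["concurrency"]:>5} {r["avg_ttft_s"]:6.2f}s {r["prefill_tps"]:>9,.0f}')
```

The 128-concurrency summary row comes from the hidden part of the output, so it does not appear in this extraction.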

📉 Breaking Point Analysis

  • Saturation Point: Aggregate throughput peaked at 457,101 tokens/second at 256 concurrency with 16,384-token prompts. Beyond this point throughput declines, which suggests the system becomes bottlenecked by request management overhead and scheduling.
  • The "Knee" of the Curve:
    • Throughput Decline: Moving from 256 to 1024 concurrency resulted in a 15% drop in aggregate throughput (457k to 389k).
    • Latency Growth: The Time to First Token (TTFT) grew slightly faster than concurrency, rising about 4.4x from ~5.6s at concurrency 256 to nearly 25 seconds at concurrency 1024, a 4x increase in load.
  • Small Context Collapse: At 1024 concurrency, even short 128-token requests took 14.7 seconds to start generating, compared to just 0.3s at low concurrency. This suggests the server's internal queues are saturated.
  • Robustness: Remarkably, the server did not crash or OOM even under 1024 concurrent 16k requests (about 16.8 million prompt tokens submitted at once). This speaks to the robustness of the vLLM-TPU implementation and the underlying Trillium hardware.
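
The figures in this analysis can be re-derived from the 16,384-token measurements. The short check below is only arithmetic on numbers already shown in the output; the variable names are illustrative.

```python
# Figures taken from the 16,384-token rows of the benchmark output.
peak_tps, tps_1024 = 457_101, 388_684
ttft_256, ttft_512, ttft_1024 = 5.58021, 10.7299, 24.6752

# Relative throughput loss between the peak (256) and 1024 concurrency.
drop = (peak_tps - tps_1024) / peak_tps
# How much TTFT grows when concurrency grows 4x (256 -> 1024).
ttft_ratio = ttft_1024 / ttft_256
# Total prompt tokens when 1024 requests of 16,384 tokens arrive together.
tokens_requested = 1024 * 16_384

print(f"throughput drop 256 -> 1024: {drop:.1%}")
print(f"TTFT growth 256 -> 1024:     {ttft_ratio:.2f}x")
print(f"TTFT growth 256 -> 512:      {ttft_512 / ttft_256:.2f}x")
print(f"TTFT growth 512 -> 1024:     {ttft_1024 / ttft_512:.2f}x")
print(f"prompt tokens submitted:     {tokens_requested:,}")
```

Each doubling of concurrency roughly doubles TTFT, which is the pattern expected once requests queue behind a saturated prefill stage.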

๐Ÿ Final Recommendation
For this google/gemma-4-26B-A4B-it deployment on an 4-chip TPU v6e pod:

  • Optimal High-Throughput: Target 128-256 concurrency. This yields the highest efficiency (~440k-457k tps) with acceptable latency (3s-6s).
  • Optimal Interactive: Target 1-16 concurrency. This keeps TTFT under 1.2s while still processing up to 200k tokens per second.
  • Avoid: Concurrencies above 512, as latency becomes prohibitive (>10s) and throughput starts to degrade significantly.
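
One way to turn these recommendations into a repeatable rule is to pick the measured concurrency with the highest throughput that still fits a TTFT budget. The sketch below uses only the four 16,384-token points from the summary table; `best_concurrency` is a hypothetical helper, and a real capacity plan would need the full sweep, including shorter contexts.

```python
# (concurrency, avg_ttft_s, prefill_tps) at 16,384-token context,
# copied from the summary table.
MEASURED = [
    (128, 3.30, 439_833),
    (256, 5.58, 457_101),
    (512, 10.73, 443_335),
    (1024, 24.68, 388_684),
]


def best_concurrency(ttft_budget_s):
    """Return the highest-throughput measured point whose TTFT fits the budget.

    Returns None when no measured point is fast enough.
    """
    candidates = [m for m in MEASURED if m[1] <= ttft_budget_s]
    return max(candidates, key=lambda m: m[2]) if candidates else None


for budget in (2.0, 4.0, 6.0, 30.0):
    print(budget, best_concurrency(budget))
```

With a 30-second budget the rule still selects 256 rather than 1024, because throughput is already past its peak at the higher settings.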

✦ The visualization of our extreme stress test (up to 1024 concurrency) provides a clear picture of the TPU v6e's performance boundaries:

📈 Visual Summary

  1. Avg TTFT (s) vs. Context Length
    The plot shows several distinct curves corresponding to the different concurrency levels.

    • Low-Middle Curves: For concurrencies 1–128, the lines remain flat and clustered near the bottom (sub-4s), indicating the system is well within its operational limits.
    • Top Curves: The lines representing 512 and 1024 concurrency show a sharp vertical shift, climbing to roughly 11 seconds and 25 seconds respectively at 16,384 tokens. This illustrates the "queue wait" effect, where requests spend significantly more time waiting for compute resources than being processed.
  2. Prefill Throughput (tok/s) vs. Context Length
    This plot displays the aggregate system efficiency:

    • Saturation Peak: The curves bunch together as aggregate throughput approaches the measured ceiling of roughly 440,000–457,000 tokens/second.
    • Performance Inversion: At the highest concurrency (1024), the throughput curve actually dips below the 128/256 curves. This visual "crossover" confirms that the system has moved past its efficiency peak and is now incurring a penalty for managing too many simultaneous requests.

๐Ÿ Final Project State
The benchmark suite is now a robust, multi-dimensional performance analysis engine:

  1. Enhanced context_benchmark.py: Supports deep sweeps of prompt length × concurrency.
  2. Integrated MCP Tool: The run_context_benchmark tool allows triggering these complex stress tests via a single agent command.
  3. Advanced Visualization: plot_context_benchmark.py handles high-density data and provides ASCII visualization for immediate analysis.
  4. Optimized Reporting: Automatically generates Markdown, CSV, and JSON outputs for every run.
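
The source of context_benchmark.py is not shown, so the following is only a rough sketch of how a prompt length × concurrency sweep could be structured. Everything in it is an assumption: `fake_request` stands in for a real streaming request to the server, and the throughput formula (total prompt tokens divided by batch wall time) is one plausible definition, not necessarily the one the project uses.

```python
import asyncio
import itertools
import time


async def fake_request(context_len: int) -> float:
    """Hypothetical stand-in for one request; returns seconds to 'first token'."""
    start = time.perf_counter()
    # Pretend prefill cost grows with prompt length (1 microsecond per token).
    await asyncio.sleep(context_len * 1e-6)
    return time.perf_counter() - start


async def run_cell(context_len: int, concurrency: int) -> dict:
    """Launch `concurrency` requests at once and summarise the batch."""
    start = time.perf_counter()
    ttfts = await asyncio.gather(
        *(fake_request(context_len) for _ in range(concurrency))
    )
    wall = time.perf_counter() - start
    return {
        "context": context_len,
        "concurrency": concurrency,
        "avg_ttft_s": sum(ttfts) / len(ttfts),
        # ASSUMPTION: aggregate rate = total prompt tokens / batch wall time.
        "prefill_tps": context_len * concurrency / wall,
    }


async def sweep(contexts, concurrencies):
    """Run every (concurrency, context) cell one after another."""
    return [
        await run_cell(ctx, c)
        for c, ctx in itertools.product(concurrencies, contexts)
    ]


results = asyncio.run(sweep([128, 8192, 16384], [1, 4, 16]))
for r in results:
    print(r)
```

Running the cells sequentially keeps each measurement isolated, at the cost of a longer total run, which matches the transcript's note that the number of context steps was reduced to manage the sweep.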

The project is fully prepared for production capacity planning and further infrastructure tuning.

xbill@penguin:~/aisprintapr2026/gemma4-benchmark$ python plot_context_benchmark.py

--- Context Length vs. Avg TTFT (s) ---
[ASCII plot: avg TTFT (s), 0 to 25, against context length, 0 to 18,000 tokens]

--- Context Length vs. Prefill Throughput (tok/s) ---
[ASCII plot: prefill throughput (tok/s), 0 to 500,000, against context length, 0 to 18,000 tokens]
