
Benchmarking 30-Qubit Statevector Simulation on NVIDIA HGX B300

By Manan Narang · 11 min read

The most common question we get from research groups planning their first serious quantum simulation cluster is some version of: what does a 30-qubit statevector workload actually cost in real hardware terms? This post is our practitioner's answer, calibrated against production deployments on NVIDIA HGX B300 systems.

It is intentionally not a marketing piece. The numbers below reflect what we have seen in actual customer environments, with the usual caveats about workload mix, gate composition, and the fact that no two simulation campaigns look exactly alike.

Why 30 qubits is the interesting threshold

Twenty-eight to thirty-two qubits is the band where statevector simulation transitions from "fits comfortably on a single high-end accelerator" to "requires deliberate multi-GPU engineering." It is also the band where most non-trivial near-term applied quantum research — variational chemistry, QAOA on industrially relevant graphs, applied QSVM — lives.

Below 28 qubits, almost any modern GPU works. Above 32 qubits, the conversation shifts to tensor-network methods, sampling-based approaches, or much larger clusters. In between sits the band where engineering choices most directly determine what a research team can actually do this quarter.

The memory ceiling

A complex-128 statevector at $n$ qubits requires $2^n \times 16$ bytes:

  • 28 qubits → 4 GB
  • 30 qubits → 16 GB
  • 32 qubits → 64 GB

This is just the state. Realistic simulations also need workspace for intermediate gate fusion, noise-channel application, and Kraus operators. As a rule of thumb, plan for at least 2–3× the raw statevector size in available HBM for headroom.

A 30-qubit complex-128 simulation therefore calls for ~32–48 GB of usable HBM to avoid spilling to host memory or disk. Complex-64 — acceptable for many workloads — halves that.
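That arithmetic is quick to script. A minimal sketch in plain Python — the `headroom` default encodes the 2–3× rule of thumb above, and nothing here is hardware-specific:

```python
def statevector_bytes(n_qubits: int, bytes_per_amp: int = 16) -> int:
    """Raw statevector size: 2^n amplitudes at 16 B (complex-128) or 8 B (complex-64)."""
    return (1 << n_qubits) * bytes_per_amp

def hbm_plan_gib(n_qubits: int, headroom: float = 3.0, bytes_per_amp: int = 16) -> float:
    """HBM to budget for one job, applying the 2-3x workspace headroom rule."""
    return statevector_bytes(n_qubits, bytes_per_amp) * headroom / 2**30

for n in (28, 30, 32):
    print(f"{n} qubits: state {statevector_bytes(n) / 2**30:.0f} GB, "
          f"budget ~{hbm_plan_gib(n):.0f} GB")
```

Dropping `bytes_per_amp` to 8 gives the complex-64 numbers directly.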

Where HGX B300 helps

The HGX B300 architecture is, at its core, a high-bandwidth, high-NVLink-throughput platform optimised for exactly the access patterns that statevector simulation exhibits. Three properties matter most:

  1. Per-GPU HBM headroom. B300-class accelerators provide enough on-package memory to hold a 30-qubit complex-128 statevector comfortably on a single device, with workspace.
  2. NVLink fabric. When you cross the single-device boundary — for 32+ qubits, or for noise-heavy circuits with deep Kraus expansions — NVLink determines whether multi-GPU execution stays compute-bound or becomes communication-bound. B300's NVLink throughput is the difference between linear and sublinear scaling for many circuit topologies.
  3. CUDA-Q maturity. The cuQuantum and cuStateVec libraries are now mature enough that idiomatic CUDA-Q code performs within striking distance of hand-tuned implementations for most circuit shapes. This matters because it means scientific users — not just systems engineers — can extract production-grade performance.
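To see why gate access patterns map so directly onto memory and interconnect bandwidth, it helps to write the core statevector update out by hand. The following is a NumPy sketch of my own, not cuStateVec's API: applying a single-qubit gate to qubit k pairs amplitudes 2^k apart, so gates on high-index qubits sweep the entire state at large strides — and when the state is sharded across devices, those are exactly the pairs that cross NVLink.

```python
import numpy as np

def apply_1q_gate(state: np.ndarray, gate: np.ndarray, k: int) -> np.ndarray:
    """Apply a 2x2 gate to qubit k of an n-qubit statevector.

    Amplitudes are paired at stride 2^k. For a state sharded across GPUs,
    gates on high-index qubits pair amplitudes living on different devices,
    which is the traffic that lands on the NVLink fabric.
    """
    n = state.shape[0].bit_length() - 1
    # Reshape so axis 1 indexes qubit k; the contraction then touches
    # pairs of amplitudes 2^k apart in the flat array.
    s = state.reshape(2 ** (n - k - 1), 2, 2 ** k)
    return np.einsum("ab,xbi->xai", gate, s).reshape(-1)

# Hadamard on qubit 0 of |000>: equal superposition of |000> and |001>.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
state = np.zeros(8, dtype=np.complex128)
state[0] = 1.0
out = apply_1q_gate(state, H, 0)
```

The same call with `k=2` spreads amplitude between indices 0 and 4 — a stride of four — which is the miniature version of the "global qubit" communication problem.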

What we measured

Across a recent set of customer engagements, on HGX B300 hardware with CUDA-Q and cuStateVec:

  • 30-qubit dense random circuits, depth 30, single-GPU: end-to-end statevector evolution in the single-digit minutes range, dominated by gate-fusion bookkeeping rather than raw FLOPs.
  • 30-qubit VQE chemistry circuits (UCCSD ansatz, small molecules), single-GPU: per-iteration cost in the seconds-to-low-minutes range; full optimisation campaigns complete in hours, not days.
  • 32-qubit circuits distributed across two B300-class GPUs over NVLink: 1.7–1.9× speedup over single-GPU equivalent, depending on entanglement structure. The gap from ideal 2× is almost entirely communication.
  • Noise-modelled simulation of 28-qubit circuits with realistic depolarising and amplitude-damping channels: 4–6× the cost of noiseless equivalents, but stable and within the same wall-clock envelope as a single-shot simulation.
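The two-GPU figures above can be turned into a rough communication estimate by inverting Amdahl's law — a back-of-envelope sketch, not a measured model of the fabric:

```python
def comm_fraction(speedup: float, n_gpus: int = 2) -> float:
    """Invert Amdahl's law: the fraction of runtime that does not parallelise
    (dominated by inter-GPU communication), given the observed speedup."""
    # speedup = 1 / (f + (1 - f) / n)  =>  f = (n / speedup - 1) / (n - 1)
    return (n_gpus / speedup - 1) / (n_gpus - 1)

for s in (1.7, 1.9):
    print(f"{s}x on 2 GPUs -> ~{comm_fraction(s):.0%} non-parallel time")
```

By this crude model, a 1.7–1.9× speedup corresponds to roughly 5–18% of wall-clock time spent in communication, which is consistent with the observation that the gap from ideal 2× is almost entirely interconnect-bound.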

These numbers will not surprise anyone who has done this work. They are worth publishing because so much of the quantum-simulation literature reports machine-independent FLOP counts, which obscure the real engineering question: can my graduate student get an answer this week?

For 30-qubit-class research workloads on B300 hardware, the answer is yes — comfortably.

What changes the picture

Three workload characteristics most affect whether the above numbers hold for your specific case:

  1. Entanglement structure of the circuit. Block-diagonal or low-entanglement circuits are dramatically cheaper than fully dense circuits because cuStateVec's gate fusion is highly effective in those regimes.
  2. Noise-channel choice. Full Kraus-operator expansions are expensive; approximations such as Pauli-channel sampling or quantum trajectories recover much of the wall-clock budget.
  3. Memory layout assumptions. Custom gates that violate cuStateVec's preferred layouts can flip a workload from compute-bound to memory-bound. This is the single most common reason a "should be fast" circuit runs slowly.
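The trade in item 2 is easy to demonstrate at toy scale. Below is a plain-NumPy sketch of trajectory sampling for a single-qubit depolarising channel — illustrative only, not how cuStateVec implements it. Each shot stays a pure statevector (2^n memory) rather than a density matrix (4^n memory), and averaging shots recovers the channel output:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = [X, Y, Z]

def trajectory(state: np.ndarray, p: float, rng) -> np.ndarray:
    """One shot of a single-qubit depolarising channel: with probability p,
    apply a uniformly chosen Pauli. The state stays pure throughout."""
    if rng.random() < p:
        return PAULIS[rng.integers(3)] @ state
    return state

rng = np.random.default_rng(7)
p, shots = 0.3, 20_000
psi = np.array([1, 0], dtype=complex)          # |0>
rho = np.outer(psi, psi.conj())

# Monte Carlo average over trajectories approximates the exact channel.
rho_avg = np.zeros((2, 2), dtype=complex)
for _ in range(shots):
    s = trajectory(psi, p, rng)
    rho_avg += np.outer(s, s.conj())
rho_avg /= shots

rho_exact = (1 - p) * rho + (p / 3) * sum(P @ rho @ P.conj().T for P in PAULIS)
```

The Monte Carlo estimate converges to the exact Kraus-sum result at a statistical rate set by the shot count — which is precisely the knob that trades accuracy back for wall-clock time.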

Practical guidance

If you are sizing infrastructure for a research group whose target window is the 28–32 qubit band:

  • Specify HBM-rich GPUs first, NVLink fabric second, networking third. The first two determine what you can simulate; networking determines how you scale.
  • Plan for noise-aware workloads. A noiseless-only simulator is a teaching tool; institutional research requires noise-channel work, and noise-channel work is where the budget disappears.
  • Invest in benchmarking discipline. A documented baseline — measured on your actual hardware, with your actual circuits — is worth more than any vendor benchmark.
  • Plan for multi-tenancy. Most institutional deployments support multiple PIs. The scheduling and resource-quota story matters as much as the raw FLOPs.
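The multi-tenancy point deserves the same back-of-envelope treatment as the memory ceiling. A capacity-planning sketch — the per-GPU HBM figure is a parameter, not a claim about any specific part; take it from your accelerator's spec sheet:

```python
import math

def concurrent_jobs(n_gpus: int, hbm_per_gpu_gib: float, n_qubits: int,
                    headroom: float = 3.0, bytes_per_amp: int = 16) -> int:
    """How many statevector jobs of a given size a GPU pool can run at once,
    assuming each job pins whole GPUs (no HBM oversubscription)."""
    job_gib = (1 << n_qubits) * bytes_per_amp * headroom / 2**30
    gpus_per_job = max(1, math.ceil(job_gib / hbm_per_gpu_gib))
    return n_gpus // gpus_per_job

# e.g. an 8-GPU node with a hypothetical 192 GiB per GPU:
print(concurrent_jobs(8, 192.0, 30))              # 30-qubit jobs, one GPU each
print(concurrent_jobs(8, 64.0, 32, headroom=2.0)) # 32-qubit jobs spanning GPUs
```

A few lines like this, parameterised by your actual hardware and your PIs' actual circuit sizes, answers most of the scheduling-capacity questions before the scheduler ever enters the conversation.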

Beyond the numbers

The deeper point is that infrastructure decisions in this band are not really about peak performance. They are about consistency of throughput — how reliably your scientists get a sensible answer in a sensible time, week after week. The HGX B300 platform, paired with a hardened CUDA-Q stack and a sensible operations layer, gets a research team to that consistency faster than any other configuration we have deployed.

For institutions sizing this infrastructure now, the practical question is rarely "which platform" — by 2026, the answer is overwhelmingly NVIDIA HGX-class. The harder questions are about the integration: which CUDA-Q version, which cuStateVec patches, which scheduler, which observability stack, which on-call posture. Those are engineering decisions, not procurement decisions.

We make them every week, and we are happy to share what we have learned.
