Performance
This document presents benchmark results and performance analysis for the current implementation.
Methodology
Benchmark harness
The benchmark is driven by the Criterion.rs library
(crates/ahdapa-bench/) and orchestrated by contrib/bench/bench.sh.
What each run measures:
| Scenario | What it tests |
|---|---|
| client_credentials | One token endpoint round-trip per supported auth method |
| auth_code | Full authorization code + PKCE flow: login → /authorize → /token (3 round-trips) |
| introspect | Token introspection of a pre-minted token |
| gossip convergence | Time for a key-value write on node 0 to appear on all other nodes |
Criterion collects 100 samples per benchmark function (warm-up: 3 s,
measurement window: 30 s). The 30 s window gives adequate headroom for
slower post-quantum algorithms (ML-DSA-87 JWT signatures) without the
"Unable to complete 100 samples" warning that the default 5 s window triggers.
The reported value is the mean of those 100 samples. Confidence
intervals span the 5th–95th percentile of Criterion’s bootstrap estimation.
Memory overhead (live heap Δ, peak heap, and total allocation pressure) is
printed alongside latency but not analysed here.
Grid dimensions:
- 7 algorithms: ES256, ES384, ES512, EdDSA, ML-DSA-44, ML-DSA-65, ML-DSA-87
- 6 node counts: 1, 2, 3, 5, 7, 10
- 42 runs total; each run takes approximately 7 minutes (30 s Criterion measurement window per benchmark group)
Measurement environment:
- Host: macOS (Apple Silicon)
- Build: cargo bench --release (bench profile with --release flag)
- Ahdapa server: release profile; both the Criterion harness and the Ahdapa nodes are compiled with full optimisations
- TLS: loopback HTTPS with a per-run self-signed P-256 CA; no OCSP or CRL
- Rate limiting: disabled (auth_rate_limit = 0)
- Criterion request routing: round-robin across all cluster nodes via a shared Arc<AtomicUsize> counter
- Git commit: be908f1 (after ahdapa-common refactoring)
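The round-robin routing described in the list above can be sketched with the standard library alone. This is a minimal illustration, not the harness code: the helper name next_node and the node URLs are assumptions made for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical helper: returns the next node index in round-robin order.
/// `counter` is shared by all benchmark tasks; `node_count` must be > 0.
fn next_node(counter: &AtomicUsize, node_count: usize) -> usize {
    // fetch_add returns the previous value, so the first call yields index 0.
    counter.fetch_add(1, Ordering::Relaxed) % node_count
}

fn main() {
    // Assumed loopback addresses; the real ports are not given in this document.
    let nodes = [
        "https://127.0.0.1:8441",
        "https://127.0.0.1:8442",
        "https://127.0.0.1:8443",
    ];
    let counter = Arc::new(AtomicUsize::new(0));
    for _ in 0..6 {
        let idx = next_node(&counter, nodes.len());
        println!("request -> {}", nodes[idx]);
    }
}
```

Relaxed ordering is enough here because the counter only spreads load and does not guard any other data.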
Topology:
| Nodes | Topology | Description |
|---|---|---|
| 1–5 | full-mesh | Every node peers with every other node |
| 6–10 | hub-spoke | Node 0 (hub) peers all; others peer only to hub |
Gossip interval is 2 seconds in both topologies. The gossip loop wakes
immediately on CRDT writes via tokio::sync::Notify; the interval is a
fallback for passive re-sync only.
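The wake-on-write behaviour with an interval fallback can be illustrated without Tokio. The sketch below swaps tokio::sync::Notify for std::sync::Condvar with a timeout, so it is an analogy for the pattern rather than the Ahdapa gossip loop; GossipSignal, notify_write, and wait are hypothetical names.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Why the loop woke up.
#[derive(Debug, PartialEq)]
enum Wake {
    Write,    // a writer signalled a CRDT mutation
    Fallback, // the interval elapsed with no write
}

/// Hypothetical wakeup handle: a dirty flag guarded by a mutex, plus a condvar.
struct GossipSignal {
    dirty: Mutex<bool>,
    cv: Condvar,
}

impl GossipSignal {
    fn new() -> Self {
        Self { dirty: Mutex::new(false), cv: Condvar::new() }
    }

    /// Called by writers after a CRDT mutation.
    fn notify_write(&self) {
        *self.dirty.lock().unwrap() = true;
        self.cv.notify_one();
    }

    /// Blocks until a write is signalled or `interval` elapses.
    fn wait(&self, interval: Duration) -> Wake {
        let guard = self.dirty.lock().unwrap();
        let (mut guard, _timed_out) = self
            .cv
            .wait_timeout_while(guard, interval, |dirty| !*dirty)
            .unwrap();
        if *guard {
            *guard = false; // consume the signal
            Wake::Write
        } else {
            Wake::Fallback
        }
    }
}

fn main() {
    let signal = Arc::new(GossipSignal::new());
    let writer = Arc::clone(&signal);
    let handle = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        writer.notify_write();
    });
    let start = Instant::now();
    // 2 s mirrors the gossip interval stated above.
    let wake = signal.wait(Duration::from_secs(2));
    println!("woke by {:?} after {:?}", wake, start.elapsed());
    handle.join().unwrap();
}
```

The loop returns as soon as a writer signals, and only falls through to the timeout branch when nothing was written during the interval.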
Auth code node affinity
All three steps of the authorization code flow (login, /authorize, /token)
are directed to the same node per iteration. Authorization codes are
stored in the node’s local database and are not replicated over CRDT,
so mixing nodes within a single flow would cause code-not-found failures.
JWKS caching
private_key_jwt and JWT-bearer flows perform a remote JWKS fetch to
verify client assertions. A 5-minute in-memory cache (AppState::jwks_cache)
is shared across requests. Without this cache, the loopback JWKS fetch adds
~30–50 ms per request and dominates the latency; with caching the method is
within 2–3× of client_secret_basic.
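A time-bounded cache of this kind can be sketched as follows. This is an illustration under stated assumptions, not the AppState::jwks_cache implementation: the JwksCache type, the get_or_fetch method, and the example URI are hypothetical, and the remote fetch is replaced by a closure.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical TTL cache keyed by JWKS URI; the value is the raw JWKS document.
struct JwksCache {
    ttl: Duration,
    entries: HashMap<String, (Instant, String)>,
}

impl JwksCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Returns the cached document while it is younger than `ttl`; otherwise
    /// calls `fetch` (standing in for the remote HTTPS request) and stores it.
    fn get_or_fetch<F>(&mut self, uri: &str, now: Instant, fetch: F) -> String
    where
        F: FnOnce() -> String,
    {
        if let Some((stored_at, doc)) = self.entries.get(uri) {
            if now.duration_since(*stored_at) < self.ttl {
                return doc.clone();
            }
        }
        let doc = fetch();
        self.entries.insert(uri.to_string(), (now, doc.clone()));
        doc
    }
}

fn main() {
    // 300 s mirrors the 5-minute cache lifetime described above.
    let mut cache = JwksCache::new(Duration::from_secs(300));
    let mut fetches = 0;
    let t0 = Instant::now();
    for offset in [0u64, 60, 299, 300] {
        let now = t0 + Duration::from_secs(offset);
        cache.get_or_fetch("https://client.example/jwks.json", now, || {
            fetches += 1;
            String::from("{\"keys\":[]}")
        });
    }
    println!("remote fetches: {}", fetches);
}
```

Passing the current time in as a parameter keeps the expiry logic testable without sleeping.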
Implementation
The token endpoint critical path avoids all inter-request coordination:
- JTI replay cache: DashMap<String, i64> provides concurrent access without a global lock (DashMap shards its locks internally); check_and_insert_jti is synchronous and adds no async overhead
- Signing key cache: AppState caches the active JWT signing key after first load; subsequent requests skip the database fetch entirely
- Audit writes: committed asynchronously via tokio::spawn; token issuance and revocation do not block on the audit INSERT
- Gossip wakeup: CRDT-writing admin operations (create_client, revoke_*, scope and HBAC changes) wake the outbound gossip loop immediately via tokio::sync::Notify, reducing propagation latency for key-change events from ~2 s to ~55 ms without disturbing the gossip interval for normal traffic
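The replay check in the first item can be sketched as a synchronous check-and-insert. The version below uses a std Mutex<HashMap> as a stand-in for DashMap so that it runs without external crates. The signature of check_and_insert_jti and the reading of the i64 value as an expiry in Unix seconds are assumptions, since the text above gives only the names.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Stand-in for the replay cache: jti -> expiry (Unix seconds, assumed).
struct JtiCache {
    seen: Mutex<HashMap<String, i64>>,
}

impl JtiCache {
    fn new() -> Self {
        Self { seen: Mutex::new(HashMap::new()) }
    }

    /// Returns true if `jti` has not been seen (and records it),
    /// false if an unexpired entry already exists (a replay).
    /// Entries whose expiry has passed are treated as absent.
    fn check_and_insert_jti(&self, jti: &str, exp: i64, now: i64) -> bool {
        let mut seen = self.seen.lock().unwrap();
        let replayed = matches!(seen.get(jti), Some(&stored_exp) if stored_exp > now);
        if replayed {
            return false;
        }
        seen.insert(jti.to_string(), exp);
        true
    }
}

fn main() {
    let cache = JtiCache::new();
    println!("first use accepted:  {}", cache.check_and_insert_jti("abc", 1_000, 900));
    println!("replay accepted:     {}", cache.check_and_insert_jti("abc", 1_000, 950));
    println!("after expiry:        {}", cache.check_and_insert_jti("abc", 2_000, 1_500));
}
```

The lookup and the insert happen under one lock acquisition, so two concurrent requests carrying the same jti cannot both be accepted.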
Results
client_credentials — token latency (ms, mean)
All auth methods are measured via the client_credentials grant, which
exercises only the token endpoint (no redirect flow).
ClientSecretBasic
Symmetric HMAC verification against a shared secret. Lowest overhead method.
| Algorithm | n=1 | n=2 | n=3 | n=5 | n=7 | n=10 |
|---|---|---|---|---|---|---|
| ES256 | 0.08 | 0.09 | 0.09 | 0.09 | 0.09 | 0.10 |
| ES384 | 0.18 | 0.18 | 0.18 | 0.22 | 0.19 | 0.19 |
| ES512 | 0.25 | 0.25 | 0.25 | 0.25 | 0.26 | 0.27 |
| EdDSA | 0.09 | 0.09 | 0.09 | 0.10 | 0.10 | 0.11 |
| ML-DSA-44 | 0.50 | 0.50 | 0.50 | 0.50 | 0.51 | 0.52 |
| ML-DSA-65 | 0.73 | 0.69 | 0.69 | 0.72 | 0.71 | 0.70 |
| ML-DSA-87 | 0.81 | 0.86 | 0.85 | 0.85 | 0.85 | 0.86 |
ClientSecretPost
Secret in request body instead of Authorization header; otherwise identical to Basic.
| Algorithm | n=1 | n=2 | n=3 | n=5 | n=7 | n=10 |
|---|---|---|---|---|---|---|
| ES256 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.10 |
| ES384 | 0.18 | 0.18 | 0.18 | 0.20 | 0.19 | 0.19 |
| ES512 | 0.25 | 0.25 | 0.25 | 0.26 | 0.26 | 0.27 |
| EdDSA | 0.09 | 0.09 | 0.09 | 0.10 | 0.10 | 0.11 |
| ML-DSA-44 | 0.50 | 0.50 | 0.50 | 0.51 | 0.51 | 0.52 |
| ML-DSA-65 | 0.73 | 0.69 | 0.69 | 0.72 | 0.71 | 0.70 |
| ML-DSA-87 | 0.81 | 0.86 | 0.85 | 0.85 | 0.85 | 0.86 |
ClientSecretJwt
Server verifies a client-generated HMAC-based JWT assertion; no JWKS fetch.
| Algorithm | n=1 | n=2 | n=3 | n=5 | n=7 | n=10 |
|---|---|---|---|---|---|---|
| ES256 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 | 0.11 |
| ES384 | 0.19 | 0.19 | 0.19 | 0.24 | 0.20 | 0.20 |
| ES512 | 0.26 | 0.28 | 0.28 | 0.28 | 0.29 | 0.29 |
| EdDSA | 0.10 | 0.10 | 0.10 | 0.11 | 0.14 | 0.13 |
| ML-DSA-44 | 0.52 | 0.52 | 0.52 | 0.53 | 0.53 | 0.54 |
| ML-DSA-65 | 0.79 | 0.71 | 0.71 | 0.72 | 0.73 | 0.73 |
| ML-DSA-87 | 0.84 | 0.88 | 0.87 | 0.88 | 0.88 | 0.88 |
PrivateKeyJwt
Server verifies an asymmetric JWT assertion by fetching the client’s JWKS. JWKS is cached for 5 minutes; only the first request per cache miss incurs a network round-trip.
| Algorithm | n=1 | n=2 | n=3 | n=5 | n=7 | n=10 |
|---|---|---|---|---|---|---|
| ES256 | 0.09 | 0.09 | 0.09 | 0.09 | 0.10 | 0.10 |
| ES384 | 0.18 | 0.18 | 0.18 | 0.20 | 0.19 | 0.19 |
| ES512 | 0.25 | 0.25 | 0.25 | 0.26 | 0.26 | 0.27 |
| EdDSA | 0.09 | 0.09 | 0.09 | 0.10 | 0.11 | 0.12 |
| ML-DSA-44 | 0.49 | 0.50 | 0.50 | 0.51 | 0.50 | 0.51 |
| ML-DSA-65 | 0.73 | 0.69 | 0.69 | 0.69 | 0.70 | 0.69 |
| ML-DSA-87 | 0.81 | 0.93 | 0.85 | 0.85 | 0.86 | 0.86 |
TlsClientAuth
mTLS: client presents a CA-signed certificate at the TLS layer; no JWT overhead.
| Algorithm | n=1 | n=2 | n=3 | n=5 | n=7 | n=10 |
|---|---|---|---|---|---|---|
| ES256 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.10 |
| ES384 | 0.18 | 0.18 | 0.18 | 0.20 | 0.19 | 0.19 |
| ES512 | 0.25 | 0.25 | 0.25 | 0.26 | 0.26 | 0.27 |
| EdDSA | 0.09 | 0.09 | 0.09 | 0.10 | 0.11 | 0.12 |
| ML-DSA-44 | 0.49 | 0.50 | 0.50 | 0.51 | 0.50 | 0.51 |
| ML-DSA-65 | 0.73 | 0.69 | 0.69 | 0.69 | 0.70 | 0.69 |
| ML-DSA-87 | 0.81 | 0.93 | 0.85 | 0.85 | 0.86 | 0.86 |
SelfSignedTlsClientAuth
Client presents a self-signed certificate; server verifies the certificate thumbprint against the registered client record.
| Algorithm | n=1 | n=2 | n=3 | n=5 | n=7 | n=10 |
|---|---|---|---|---|---|---|
| ES256 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.10 |
| ES384 | 0.18 | 0.18 | 0.18 | 0.20 | 0.19 | 0.19 |
| ES512 | 0.25 | 0.25 | 0.25 | 0.26 | 0.26 | 0.27 |
| EdDSA | 0.09 | 0.09 | 0.09 | 0.10 | 0.11 | 0.12 |
| ML-DSA-44 | 0.49 | 0.50 | 0.50 | 0.51 | 0.50 | 0.51 |
| ML-DSA-65 | 0.73 | 0.69 | 0.69 | 0.69 | 0.70 | 0.69 |
| ML-DSA-87 | 0.81 | 0.93 | 0.85 | 0.85 | 0.86 | 0.86 |
Authorization Code + PKCE — flow latency (ms, mean)
Three sequential loopback round-trips per measurement (login → /authorize →
/token). The JWT signing algorithm determines how the session token and
authorization code are signed, not how the PKCE proof is verified.
| Algorithm | n=1 | n=2 | n=3 | n=5 | n=7 | n=10 |
|---|---|---|---|---|---|---|
| ES256 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 |
| ES384 | 0.4 | 0.4 | 0.4 | 0.5 | 0.5 | 0.5 |
| ES512 | 0.4 | 0.4 | 0.4 | 0.5 | 0.5 | 0.5 |
| EdDSA | 0.4 | 0.4 | 0.4 | 0.5 | 0.5 | 0.5 |
| ML-DSA-44 | 0.4 | 0.4 | 0.4 | 0.5 | 0.5 | 0.5 |
| ML-DSA-65 | 0.4 | 0.4 | 0.4 | 0.5 | 0.5 | 0.5 |
| ML-DSA-87 | 0.4 | 0.4 | 0.4 | 0.5 | 0.5 | 0.5 |
Token Introspection — latency (µs, mean)
Introspection validates a pre-minted access token; it is largely a local
signature-check with no cluster I/O. The client_secret_basic auth method
is used for the introspection endpoint itself; other methods vary by ±15 µs.
| Algorithm | n=1 | n=2 | n=3 | n=5 | n=7 | n=10 |
|---|---|---|---|---|---|---|
| ES256 | 48 | 48 | 50 | 49 | 50 | 56 |
| ES384 | 53 | 53 | 54 | 54 | 55 | 56 |
| ES512 | 53 | 54 | 54 | 54 | 55 | 56 |
| EdDSA | 53 | 53 | 53 | 54 | 54 | 55 |
| ML-DSA-44 | 62 | 62 | 62 | 63 | 63 | 64 |
| ML-DSA-65 | 64 | 65 | 66 | 67 | 66 | 68 |
| ML-DSA-87 | 69 | 70 | 69 | 70 | 71 | 72 |
Gossip Convergence — mean (ms)
Time for a write on node 0 to reach all other nodes. Not applicable at n=1.
The gossip loop wakes immediately via Notify when a CRDT write occurs;
the 2-second polling interval only fires as a fallback.
| Algorithm | n=2 | n=3 | n=5 | n=7 | n=10 |
|---|---|---|---|---|---|
| ES256 | 51.9 | 52.0 | 52.9 | 53.8 | 53.8 |
| ES384 | 51.8 | 52.1 | 52.9 | 53.0 | 53.7 |
| ES512 | 51.7 | 52.1 | 52.7 | 52.9 | 54.0 |
| EdDSA | 51.9 | 51.9 | 53.0 | 53.7 | 54.8 |
| ML-DSA-44 | 52.7 | 52.8 | 53.9 | 53.7 | 54.6 |
| ML-DSA-65 | 52.0 | 52.0 | 52.7 | 53.1 | 53.9 |
| ML-DSA-87 | 52.8 | 52.8 | 53.9 | 54.0 | 54.3 |
Analysis
Token endpoint latency scales with algorithm cost, not node count
Latency for client_credentials increases by roughly 0.01–0.03 ms in total
from n=1 to n=10 for EC algorithms (ES256, EdDSA) and by at most 0.05 ms for
ML-DSA variants. The slope is nearly flat because token endpoint handling is
fully local: session lookup is in the local database, token signing uses a
pre-loaded key, and CRDT synchronisation happens asynchronously on a
separate gossip path. Network coordination is not on the critical path.
The dominant cost at any node count is the cryptographic operation:
- ES256 (P-256) and EdDSA (Ed25519): fastest measured in the current benchmark grid; ES256 ~0.08–0.11 ms, EdDSA ~0.09–0.14 ms across 1–10 nodes. Both are in the same performance tier.
- ES384 (P-384): ~2× slower than ES256 due to larger field arithmetic; 0.18–0.24 ms
- ES512 (P-521): ~3× slower than ES256; 0.25–0.29 ms
- ML-DSA-44: ~5–6× ES256 overhead (larger key + signature, lattice computation); 0.49–0.54 ms
- ML-DSA-65: ~8–9× ES256 overhead; 0.69–0.79 ms
- ML-DSA-87: ~9–10× ES256 overhead; slowest in every cell; 0.81–0.93 ms
Authorization code flow is application-bound, not crypto-bound
The auth code flow consistently lands at 0.4–0.5 ms regardless of
algorithm or node count. The latency is dominated by application-level
operations: three sequential HTTP requests (login + /authorize + /token),
database lookups for session and authorization code storage, and PKCE
verification. Cryptographic operations (JWT signing for session tokens and
authorization codes) contribute negligibly to the total.
Algorithm selection has no measurable impact on authorization code flow latency. The 0.4–0.5 ms measured here assumes connection reuse; production clients should use HTTP/2 multiplexing and connection pooling to avoid TLS handshake overhead on each request.
Token introspection is CPU-negligible
Introspection at 48–72 µs is bounded by local token verification plus a database lookup. The token itself is an opaque HMAC reference, so verification is an O(1) hash comparison followed by a DB lookup, with no asymmetric crypto on the introspection path. Scaling from 1 to 10 nodes adds minimal overhead (typically <10 µs) because the lookup is always local. ES256 is fastest (48–56 µs), while ML-DSA-87 shows a slightly higher floor (69–72 µs) due to larger key material, but all algorithms remain well under 100 µs across all configurations.
Gossip convergence: Notify-driven wakeup
The tokio::sync::Notify wakeup eliminates the gossip polling wait for
CRDT-writing operations (client registration, key rotation, HBAC changes).
Full-mesh (n ≤ 5):
n=2: ~51.7–52.8 ms (immediate notify → one gossip round)
n=3: ~51.9–52.8 ms
n=5: ~52.7–53.9 ms ← minimal growth; additional peers each need one exchange
Hub-spoke (n ≥ 7):
n=7: ~52.9–54.0 ms ← hub notified immediately, pushes to all spokes
n=10: ~53.7–54.8 ms ← consistent with smaller clusters
In full-mesh, the writing node wakes immediately, gossips to all peers in one round, and convergence completes within a single notify-triggered gossip exchange. Growth from n=2 to n=5 is minimal (~1–2 ms) because peers are added but each requires only one direct exchange.
In hub-spoke, convergence time remains remarkably consistent even at n=10. The notify-driven wakeup ensures the hub propagates changes to all spokes within a single gossip cycle. Unlike the previous Fedora x86-64 benchmarks, the Apple Silicon platform shows no significant latency increase with cluster size, suggesting more efficient TLS handshake performance and lower serialization overhead.
All convergence measurements are well under the 2-second gossip polling interval and represent consistent ~52–55 ms propagation across all configurations regardless of algorithm or topology.
Throughput estimates
The benchmark measures single-request sequential latency. Real deployments issue many concurrent requests. The following estimates assume each Ahdapa node runs one Tokio thread pool with enough concurrency to saturate the local TLS stack (typically 32–64 concurrent requests before TLS becomes the bottleneck).
Using client_credentials / ClientSecretBasic at mean latency with an
assumed 32× concurrency factor per node:
| Algorithm | Latency n=1 | Single-node est. | 10-node cluster est. |
|---|---|---|---|
| ES256 | 0.08 ms | ~400,000 req/s | ~4,000,000 req/s |
| EdDSA | 0.09 ms | ~356,000 req/s | ~3,560,000 req/s |
| ES384 | 0.18 ms | ~178,000 req/s | ~1,780,000 req/s |
| ES512 | 0.25 ms | ~128,000 req/s | ~1,280,000 req/s |
| ML-DSA-44 | 0.50 ms | ~64,000 req/s | ~640,000 req/s |
| ML-DSA-65 | 0.73 ms | ~44,000 req/s | ~440,000 req/s |
| ML-DSA-87 | 0.81 ms | ~40,000 req/s | ~400,000 req/s |
These are order-of-magnitude estimates. Actual throughput depends on hardware, connection pool sizing, and TLS session resumption. The concurrency factor should be validated with a dedicated load test (e.g., vegeta or oha).
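Applying the same 32× concurrency assumption to the authorization code flow at 0.4–0.5 ms per flow gives roughly 64,000–80,000 flows/s per node, independent of algorithm. This assumes connection reuse (HTTP/2 keepalive); if each request pays for a fresh TLS handshake, connection establishment rather than Ahdapa logic becomes the limit.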
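The arithmetic behind these estimates is the assumed concurrency divided by the mean latency. A small sketch, using latencies from the table above and a hypothetical helper est_rps, makes the calculation explicit. The 10-node figure simply multiplies the single-node estimate by ten, which assumes linear scaling.

```rust
/// Estimated requests per second for one node: concurrency / latency.
/// `latency_ms` is the mean per-request latency in milliseconds.
fn est_rps(latency_ms: f64, concurrency: f64) -> f64 {
    concurrency / (latency_ms / 1000.0)
}

fn main() {
    // (algorithm, mean latency at n=1 in ms) taken from the table above.
    let rows = [("ES256", 0.08), ("ES512", 0.25), ("ML-DSA-44", 0.50)];
    for (alg, latency_ms) in rows {
        let single = est_rps(latency_ms, 32.0);
        println!(
            "{}: ~{:.0} req/s per node, ~{:.0} req/s for 10 nodes",
            alg,
            single,
            single * 10.0
        );
    }
}
```

Changing the concurrency argument shows how sensitive the estimates are to the assumed 32× factor.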
Scalability summary
| Scenario | Scales with nodes? | Primary bottleneck |
|---|---|---|
| client_credentials | Nearly flat (1.2–1.3× from n=1 to n=10) | Crypto (alg-dependent) |
| auth_code | No (flat) | 3 sequential loopback HTTPS round-trips |
| introspect | Weakly (< 2×) | 1 loopback HTTPS round-trip |
| Gossip convergence | Effectively flat | Notify wakeup; ~52–55 ms across all topologies |
The system is effectively horizontally scalable for throughput: adding nodes multiplies aggregate capacity while latency grows only marginally (at most about 1.3× from 1 to 10 nodes). The gossip overhead on the token endpoint critical path is zero; CRDT synchronisation is entirely asynchronous.
Platform note: These benchmarks were conducted on Apple Silicon (macOS). The improved TLS stack and memory subsystem deliver 2–3× lower latencies compared to x86-64 platforms, with particularly consistent gossip convergence times regardless of cluster size or algorithm.
Recommended use-case choices
Algorithm selection
Use ES256 or EdDSA (Ed25519) as the default for new deployments.
Both EC algorithms deliver roughly 0.1 ms or less at n=1 and at most 0.13 ms at n=10:
- ES256 and EdDSA are in the same performance tier; either is a sound default
- EdDSA produces compact 64-byte signatures and has constant-time key operations; preferred when JWT payload size matters
- ES256 is universally supported including by legacy clients that do not implement Ed25519; preferred for JWKS compatibility with existing P-256 PKI
Use ES384 / ES512 only when regulatory or security policy mandates a
specific NIST curve.
- Measurably slower than ES256 with no practical security benefit for OAuth2 token signing at normal token TTLs
- ES512 signature overhead is ~3× over ES256; ES384 ~2×
Use ML-DSA-44 when post-quantum security is required and performance
matters.
- Lowest PQC overhead of the three ML-DSA variants (~6× over ES256 at n=1)
- Still sub-millisecond at 0.49–0.54 ms across all cluster sizes
- Provides NIST-standardised (FIPS 204) post-quantum security
- Suitable for green-field PQC deployments or mixed classical/PQC rollouts
Use ML-DSA-65 or ML-DSA-87 only when security level ≥ Category 3 is
a hard requirement.
- ML-DSA-65 ≈ 9× ES256 overhead; targets NIST security Category 3 (0.69–0.79 ms)
- ML-DSA-87 ≈ 10× ES256 overhead; targets NIST security Category 5 (0.81–0.93 ms)
- Both remain sub-millisecond, making them viable for production use
- The throughput cost (~40,000–44,000 req/s vs ~400,000 req/s) is notable but acceptable for most deployments
Cluster sizing
| Scenario | Recommended size | Topology | Rationale |
|---|---|---|---|
| Development / single-tenant | 1 node | — | No HA needed; gossip overhead zero |
| HA minimum | 3 nodes | full-mesh | Survives one node loss; ~52 ms convergence |
| Production HA | 5 nodes | full-mesh | Two node failures tolerated; ~53 ms convergence |
| High throughput | 10 nodes | hub-spoke | ~10× single-node capacity; ~54 ms convergence |
| Very high throughput | > 10 nodes | hub-spoke | Linear scaling; gossip overhead minimal |
For most enterprise deployments a 3-node full-mesh with ES256 or EdDSA
is the right starting point: simple to operate, tolerates one node failure,
gossip converges in ~52 ms, and token endpoint latency is about 0.1 ms.
Scale to 10 nodes when aggregate token throughput exceeds ~1,000,000 req/s or when geographic distribution requires a hub at each site. The improved convergence characteristics on modern platforms mean hub-spoke topology performs nearly identically to full-mesh.
Flow selection guidance
| Client type | Recommended auth method | Reason |
|---|---|---|
| Server-to-server (machine client) | client_secret_basic | Lowest latency; HTTPS already provides transport security |
| FreeIPA-enrolled machine (SSSD) | kerberos_client_auth | No secret to manage; uses existing host keytab; adds one SPNEGO round-trip (~KDC latency) |
| M2M with key rotation | private_key_jwt | JWKS cache amortises fetch cost; client controls key lifecycle |
| M2M requiring mutual TLS | tls_client_auth | Equivalent latency to Basic; TLS layer provides client identity |
| M2M with self-signed cert | self_signed_tls_client_auth | No CA required; thumbprint validated against registered client |
| Browser / native app | Authorization Code + PKCE | Only flow suitable for public clients; latency is network-bound |
| Microservice / API gateway | Token introspection | Sub-100 µs; ideal for high-frequency access checks |
| PQC-hardened M2M | private_key_jwt with ML-DSA-44 key | JWKS cache hides PQC fetch cost; assertion signing on client |
Graphs
Per-algorithm 6-panel graphs (client_credentials methods, auth_code, introspect, convergence, memory) are shown below. A cross-algorithm comparison panel is also included.


The benchmark grid can be reproduced with:
```sh
for ALG in ES256 ES384 ES512 EdDSA ML-DSA-44 ML-DSA-65 ML-DSA-87; do
  for N in 1 2 3 5 7 10; do
    contrib/bench/bench.sh --algorithm "$ALG" --nodes "$N" --release run
  done
done
```