# Epoch Boundary Investigation Notes

> **Status**: Deep-dive completed 2026-05-23. Five divergence dimensions fully mapped
> across all five consensus clients (Prysm, Lighthouse, Grandine, Teku, Lodestar).

This document captures a comprehensive cross-client audit of **epoch-boundary handling**: specifically, how each client manages the transition between epochs, refreshes duties, tolerates head-staleness, and recovers from reorgs.

---

## Why epoch-boundary handling is worth prioritizing

Epoch boundaries are where many clients refresh or invalidate:

- proposer duties
- attester duties
- aggregate duties
- dependent roots
- head-freshness assumptions
- EL/CL coordination for the next slot

These paths are often not consensus-invalid by themselves, but they are rich in **behavioral divergence** that can affect:

- **liveness**: missed proposal / missed attestation / delayed recovery
- **reorg recovery**: whether duties are refreshed after head changes
- **duty freshness**: whether a duty is still considered safe/fresh near the boundary

In practice, this is usually a **medium-value audit direction**: it often produces real cross-client asymmetries, but it does **not automatically imply a severe consensus-critical vulnerability** unless it can be extended into a clear exploit path.

---

## Dimension 1: Proposal epoch-gap tolerance

The original anchor example: what happens when `head_epoch` and `proposal_epoch` differ by **2**?

### Code-level findings

| Client | Mechanism | Code Anchor | gap=2 tolerated? | Configurable? |
|--------|-----------|-------------|:---:|:---:|
| **Lighthouse** | `head_epoch + sync_tolerance_epochs < proposal_epoch` | `beacon_chain.rs:4724` | ✅ Yes | `--sync-tolerance-epochs` (default 2) |
| **Prysm** | No explicit epoch-gap gate; duty-refresh & BN-side driven | `validator.go:562`, `duties.go:68` | ✅ Yes | N/A |
| **Teku** | `signingEpoch > currentEpoch + lookAheadEpochs + 1` | `AbstractDutyScheduler.java:139`, `BlockDutyScheduler.java:30` | ❌ **No** | Hardcoded `LOOKAHEAD_EPOCHS = 0` |
| **Grandine** | `head_slot + max_empty_slots < slot` | `validator.rs:734` | ❌ No (default) | `--max-empty-slots` (default 32) |
| **Lodestar** | No explicit epoch-gap gate; duty-polling driven | `blockDuties.ts:131-148` | ✅ Yes | N/A |

### Key code snippets

**Lighthouse** (`beacon_chain.rs:4724-4739`):

```rust
if head_epoch + self.config.sync_tolerance_epochs < proposal_epoch {
    warn!("Skipping proposer preparation"); // skips prep, still allows proposal
    Err(Error::SkipProposerPreparation)
}
```

**Teku** (`AbstractDutyScheduler.java:133-140`):

```java
// LOOKAHEAD_EPOCHS = 0 → signingEpoch <= currentEpoch + 1
return !signingEpoch.isGreaterThan(epoch.plus(lookAheadEpochs + 1));
```

**Prysm**: fetches `current+1` and `current+2` concurrently, no freshness gate:

```go
// validator.go:562 - pre-fetch current+1
epoch := slots.ToEpoch(slots.CurrentSlot(v.genesisTime) + 1)
// duties.go:68 - concurrently pull current+2
c.dutiesForEpoch(ctx, nextEpochDuties, in.Epoch+1, vals, fetchSyncDuties)
```

**Grandine** (`validator.rs:722-740`):

```rust
async fn slot_head(&self, slot: Slot) -> Result<Result<..., HeadFarBehind>> {
    if head_slot + max_empty_slots < slot {
        return Ok(Err(HeadFarBehind { head_slot, max_empty_slots, slot }));
    }
    // ... on failure, ALL duties (propose, attest, aggregate) are silently skipped
}
```

### Impact analysis

- **Liveness asymmetry**: Prysm, Lighthouse, and Lodestar keep proposing at gap=2; Teku and Grandine give up.
  This means Teku/Grandine validator slots are **systematically more likely to be skipped** during network instability.
- **Economic asymmetry**: Proposal rewards are orders of magnitude larger than attestation rewards. A Teku validator losing proposals but still attesting during turbulence suffers a disproportionate economic penalty compared to a Prysm validator.
- **Not a consensus split**: A skipped proposal is protocol-identical to an offline proposer. No fork is created.

---

## Dimension 2: Attestation future-window tolerance

How far ahead can attestation duties be requested and served?

### BN-side tolerance

| Client | Max Query Epoch | Mechanism | Unique behavior |
|--------|:--:|------|------|
| **Teku** | `current + 2` | `MIN_SEED_LOOKAHEAD(1) + DUTY_EPOCH_TOLERANCE(1)` | **Most permissive** |
| **Lighthouse** | `current + 1` | With `tolerant_current_epoch` clock-disparity correction | Boundary-smoothing via `MAXIMUM_GOSSIP_CLOCK_DISPARITY` |
| **Grandine** | `store_epoch + 1` (or +2 on last slot) | `MIN_SEED_LOOKAHEAD` with epoch-boundary extension | **Unique last-slot-of-epoch special case** |
| **Prysm** | `current + 1` | Strict check, no clock-disparity adjustment | Simplest implementation |
| **Lodestar** | `current + 1` | Strict check; code comments explain "epoch+2 has dependent root jitter" | Hardcoded rationale |

### VC-side pre-fetch

| Client | Fetches | Strategy |
|--------|---------|----------|
| **Lodestar** | Current ≈ epoch+2 (via `nextShuffling` precompute) | **Only client that precomputes 2 epochs ahead** |
| **Lighthouse** | `current` + `next` | Stores `current_epoch_shuffling_id` + `next_epoch_shuffling_id` on forkchoice blocks |
| **Teku** | `headEpoch`, `headEpoch+1` | Compares dependent roots, not epochs, to trigger refetch |
| **Grandine** | Query-time from head state | Stateless; `MIN_SEED_LOOKAHEAD` precompute boundary |
| **Prysm** | `current` + `next` (concurrent fetch) | Lazy, no precomputation |

### 🔑 Key asymmetry: Teku

Teku is
the only client where **attestation is more permissive than proposal**:

| Duty type | Teku tolerance | vs other clients |
|-----------|:---:|------|
| Proposal | `current + 1` (gap=2 rejected) | **Strictest** |
| Attestation | `current + 2` | **Most permissive** |

This means a Teku node under clock skew or network delay will **keep attesting but stop proposing**, an asymmetric liveness profile unique to Teku.

---

## Dimension 3: Dependent-root refresh model

How does each client detect and react to dependent-root changes (which indicate a reorg or head change affecting duty validity)?

### Architecture comparison

| Client | Cache/State Key | Invalidation Trigger | Latency | Pattern |
|--------|----------------|---------------------|---------|---------|
| **Teku** | `dependentRoot` direct comparison | `onHeadUpdate` → immediate `recalculate()` | ~0ms | **Push + Pending** |
| **Lodestar** | `dependentRoot` per epoch in VC | `onNewHead` → immediate (epoch+1), deferred (epoch+2) | 0ms ~ 1 epoch | **Push + Pending-root** |
| **Prysm** | Randao seed (not dependent root) | Head event → `checkDependentRoots()` → `UpdateDuties()` | ~0ms | **Push** |
| **Lighthouse** | `AttestationShufflingId` (block root + epoch) | Next slot poll (up to 12s) | ≤12s | **Pull (polling)** |
| **Grandine** | No cache; queries head state at execution time | N/A (always fresh) | 0ms | **Stateless** |

### 🔑 Critical finding: Lighthouse "send-then-verify" pattern

Lighthouse is unique in using a **poll-based** refresh model. On each slot:

```rust
// 1. Send immediate notification from cache (potentially stale)
notify_block_production_service(cached_duties).await;
// 2. Then poll for latest duties
beacon_node.get_validator_duties_proposer(current_epoch).await;
// 3.
// Only warn if dependent_root changed
if dependent_root != prior_dependent_root {
    warn!("Proposer duties re-org"); // no retry, just log
}
```

This creates a **≤12 second window** where a Lighthouse VC operates on stale duties after a reorg. The stale duty is already dispatched before the poll completes.

### Lodestar's deferred epoch+2 refresh

```typescript
// attestationDuties.ts:321-331
if (head !== nextTwoEpochDependentRoot) {
  if (isLastSlotOfEpoch) {
    await this.handleAttesterDutiesReorg(...); // immediate
  } else {
    this.pendingDependentRootByEpoch.set(epoch, dependentRoot); // deferred
  }
}
```

Epoch+2 duties can remain stale for **up to one full epoch** (~6.4 minutes) if a reorg occurs early in the current epoch.

### Teku's pending mechanism

```java
// PendingDuties.java:160-169
public synchronized void onHeadUpdate(Bytes32 dependentRoot) {
  getCurrentDuties().ifPresentOrElse(
      duties -> { if (duties.requiresRecalculation(dependentRoot)) recalculate(); },
      () -> pendingHeadUpdate = Optional.of(dependentRoot) // store for later
  );
}
```

If duties are still loading when a head update arrives, the new dependent root is stored and checked when loading completes. This is elegant but can create a gap if `recalculate()` cancels the old Future before the new one resolves.

---

## Dimension 4: Head-gap stop-work thresholds

At what lag does each client stop performing duties entirely?

### Threshold comparison

| Client | Threshold | Unit | Configurable? | Applies To |
|--------|-----------|------|:---:|------|
| **Lighthouse BN** | 2 epochs (`sync_tolerance_epochs`) | Epochs | ✅ `--sync-tolerance-epochs` | All duty endpoints (HTTP 503) |
| **Lighthouse VC** | 8 slots | Slots | ❌ Hardcoded `DEFAULT_SYNC_TOLERANCE` | VC fallback health check |
| **Grandine** | 32 slots (`max_empty_slots`) | Slots | ✅ `--max-empty-slots` | **All duties uniformly** (silent skip) |
| **Lodestar** | 1 epoch (`SYNC_TOLERANCE_EPOCHS`) | Epochs | ❌ Hardcoded | All duty endpoints (HTTP 503) |
| **Teku** | 32 slots (`HEAD_TOO_OLD_THRESHOLD`) | Slots | ❌ Hardcoded | **Sync committee only** |
| **Prysm** | 1 epoch (`headEpoch + 1 < currentEpoch`) | Epochs | ❌ Hardcoded | Sync status binary gate |

### Key code anchors

**Grandine**: unified check for ALL duty types (`validator.rs:722-740`):

```rust
async fn slot_head(&self, slot: Slot) -> Result<Result<..., HeadFarBehind>> {
    if head_slot + max_empty_slots < slot {
        return Ok(Err(HeadFarBehind { head_slot, max_empty_slots, slot }));
    }
    // ...
}

// Called once per tick; if it fails, propose/attest/aggregate are ALL skipped
let Some(slot_head) = slot_head else { return Ok(()); };
```

**Lighthouse**: separate proposer-prep guard that skips prep but still allows proposal:

```rust
// beacon_chain.rs:4724 - skips proposer preparation only
if head_epoch + self.config.sync_tolerance_epochs < proposal_epoch {
    Err(Error::SkipProposerPreparation) // proposal itself still proceeds
}

// beacon_chain.rs:5979 - skips prepare_beacon_proposer call
if head_slot + tolerance_slots < current_slot {
    return Ok(None);
}
```

**Lodestar**: hardcoded 1 epoch with explicit rationale (`validator/index.ts:98-104`):

```typescript
// "Lighthouse uses 8. However, 8 kills Lodestar since validators can trigger
// regen to fast-forward head state 8 epochs..."
export const SYNC_TOLERANCE_EPOCHS = 1;
```

### 🔑 Key difference: Lighthouse's dual tolerance

Lighthouse has **two different thresholds**:

- BN HTTP API: 2 epochs (configurable) for sync status
- VC fallback health check: 8 slots (hardcoded) for BN selection

Because 8 slots is far stricter than 2 epochs (64 slots), a Lighthouse VC may treat a BN as unsynced (head more than 8 slots behind) while that same BN still serves duty requests (head within 2 epochs), creating a confusing failure mode.

### 🔑 Key difference: Grandine's silent skip

Grandine is the only client that **silently skips all duties** when the head is stale. All other clients either return an explicit error (503/gRPC Unavailable) or log a warning. This makes Grandine's behavior harder to observe and debug.

---

## Dimension 5: Reorg-triggered duty invalidation

### Detection mechanism

| Client | Detection | Mode |
|--------|-----------|------|
| **Prysm** | Head event from BN → `checkDependentRoots()` | Push |
| **Teku** | `onHeadUpdate()` via `ValidatorTimingChannel` | Push |
| **Lodestar** | SSE `head` event → `onNewHead()` | Push |
| **Grandine** | `ValidatorMessage::Head` → N/A (stateless) | Push (no action needed) |
| **Lighthouse** | Slot polling → `poll_beacon_proposers()` | **Pull** |

### Stale-duty window per scenario

| Reorg scenario | Lighthouse | Prysm | Teku | Grandine | Lodestar |
|---|---|---|---|---|---|
| Mid-epoch (all duties) | ≤1 slot (~12s) | ~0ms | ~0ms | 0ms | ~0ms |
| Epoch boundary | ≤1 slot | ~0ms | ~0ms | 0ms | ≤1 epoch (epoch+2 duties) |
| During duty fetch | ≤1 slot | ~0ms | Gap on cancel | 0ms | ~0ms |

### 🔑 Most fragile: Lighthouse

Lighthouse's poll-based model means:

1. Reorg detected → duties stale for up to 12 seconds
2. Attestation in the reorg slot uses **wrong dependent root**
3. Code explicitly sends cached notification before verifying

This is a deliberate design choice (simplicity over latency), but it means Lighthouse validators lose at least one attestation per reorg that other clients would handle correctly.
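
The ordering issue can be illustrated with a minimal Rust sketch. It is a toy model, not Lighthouse code: `Duty`, `Order`, and `run_slot` are hypothetical names, and roots are shortened to `u64`. It only shows that dispatching from cache before polling sends the stale duty in the slot where the dependent root changes, even though the re-org is still detected afterwards.

```rust
/// Hypothetical, simplified duty record (not a Lighthouse type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Duty {
    dependent_root: u64, // stand-in for a 32-byte root
    proposer_index: u64,
}

/// The two possible orderings of "dispatch" and "poll" within a slot.
enum Order {
    SendThenVerify,
    VerifyThenSend,
}

/// Returns the duty that gets dispatched this slot and whether a
/// dependent-root change (re-org) was observed by the poll.
fn run_slot(cached: Duty, fresh: Duty, order: Order) -> (Duty, bool) {
    let reorg_seen = cached.dependent_root != fresh.dependent_root;
    let dispatched = match order {
        // Dispatch happens before the poll result is known.
        Order::SendThenVerify => cached,
        // Dispatch waits for the poll and uses its result.
        Order::VerifyThenSend => fresh,
    };
    (dispatched, reorg_seen)
}

fn main() {
    let cached = Duty { dependent_root: 0xA, proposer_index: 7 };
    let fresh = Duty { dependent_root: 0xB, proposer_index: 9 };

    let (sent, reorg_seen) = run_slot(cached, fresh, Order::SendThenVerify);
    println!(
        "send-then-verify: proposer {} dispatched, re-org seen: {}",
        sent.proposer_index, reorg_seen
    );

    let (sent, reorg_seen) = run_slot(cached, fresh, Order::VerifyThenSend);
    println!(
        "verify-then-send: proposer {} dispatched, re-org seen: {}",
        sent.proposer_index, reorg_seen
    );
}
```

In both orderings the re-org is noticed; the difference is only which duty has already left the building by then.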
### 🔑 Most robust: Grandine

Grandine's stateless model means every duty execution reads from the current head state. A reorg 50ms before attestation is transparently reflected. The trade-off: computing committees and proposer indices at execution time adds latency and memory pressure.

### Prysm's indirect invalidation

Prysm uses **randao seed** (not dependent root) as the committee cache key. This means:

- Reorg changes dependent root but NOT randao seed → cache hit → **stale committee used** (edge case, low probability)
- Reorg changes randao seed → cache miss → recompute (correct behavior)

The probability depends on reorg depth relative to `EPOCHS_PER_HISTORICAL_VECTOR - MIN_SEED_LOOKAHEAD - 1`.

---

## Impact Analysis

### Impact 1: Teku's proposal/attestation asymmetry

Teku is the only client where proposal and attestation have **opposite** tolerance directions:

```
Proposal:    LOOKAHEAD_EPOCHS = 0 → max gap = 1 epoch (STRICT)
Attestation: MIN_SEED_LOOKAHEAD(1) + DUTY_EPOCH_TOLERANCE(1) = +2 (PERMISSIVE)
```

**Scenario**: Network instability causes head to lag 2 epochs.

| Action | Teku | Prysm/Lighthouse/Lodestar |
|--------|------|---------------------------|
| Attest | ✅ Continues | ✅ or ❌ (client-dependent) |
| Propose | ❌ **Skips** | ✅ Continues |

**Consequences**:

- Proposal is worth ~100× more than an attestation in rewards
- Teku validators suffer **disproportionate economic penalty** during turbulence
- Teku-occupied proposal slots are **systematically more likely to be skipped**
- This could, in extreme market conditions, create an incentive to avoid Teku

### Impact 2: Lighthouse's "blind flight" after reorg

**Scenario**: A 2-block reorg occurs at an epoch boundary.
```
Slot N:   Reorg - head switches from chain_A to chain_B
Slot N:   Lighthouse sends cached (chain_A) attestation duty → WRONG
Slot N+1: Poll detects dependent_root change → finally refreshes
```

- One attestation lost per reorg (other clients lose 0)
- At ~0.1% reorg probability per slot, a validator attesting once per epoch would lose roughly 0.1% of its attestations this way: about 80 per year (≈82,000 epochs × 0.001). A figure of ~3 per year corresponds to a 0.1% rate per epoch with one affected slot in 32.
- The "send-then-verify" pattern is explicit in code: not a bug, a design trade-off

### Impact 3: Lodestar's epoch+2 staleness

**Scenario**: Reorg at epoch N, slot 0. Validator has duties in epoch N+2.

```
Epoch N, Slot 0:     Reorg, dependent_root for epoch N+2 changes
                     → Stored in pendingDependentRootByEpoch, NO refresh
Epoch N, Slot 1-30:  Subnet subscriptions and precomputes use STALE data
Epoch N, Slot 31:    prepareForNextEpoch finally detects mismatch
                     → Total stale window: ~6.4 minutes (one full epoch)
```

Actual attestation signing still uses correct BN-side state at epoch N+2, but the VC's preparation (subnet subscriptions, distributed aggregation selection) runs on stale committee assignments for nearly the entire epoch.

### Impact 4: Grandine's performance/reliability trade-off

Grandine's stateless model eliminates stale-duty risk entirely: a reorg 50ms before attestation is transparent. However:

- Every duty execution reads full beacon state (several MB)
- Committee lookup and proposer index computed at execution time
- No precomputation possible → higher per-duty latency
- Nodes with many validators may experience cumulative delays

### Impact 5: Prysm's indirect cache invalidation

Prysm's randao-seed-based caching creates a subtle edge case: a reorg that changes the dependent root but not the randao seed leaves stale committee data in cache. Probability is extremely low (requires reorg depth outside the seed computation window), but the failure mode, a wrong committee assignment, is severe when triggered.

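
The seed-keyed edge case can be sketched in a few lines of Rust. Everything here is an assumption made for illustration: `SeedKeyedCache`, `Committees`, and `is_stale` are hypothetical names, seeds and roots are shortened to `u64`, and the real Prysm cache is more involved. The sketch only demonstrates that a cache keyed by seed alone cannot distinguish two branches that happen to share a seed.

```rust
use std::collections::HashMap;

/// Hypothetical stand-ins for 32-byte values (not Prysm types).
type Seed = u64;
type Root = u64;

#[derive(Clone, Debug, PartialEq, Eq)]
struct Committees {
    /// Dependent root of the branch this entry was computed on.
    computed_at: Root,
    members: Vec<u64>,
}

/// Cache keyed by seed only, mirroring the keying choice described above.
struct SeedKeyedCache {
    entries: HashMap<Seed, Committees>,
}

impl SeedKeyedCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Returns the entry for `seed`, running `compute` only on a cache miss.
    fn get_or_compute(
        &mut self,
        seed: Seed,
        dependent_root: Root,
        compute: impl FnOnce() -> Vec<u64>,
    ) -> &Committees {
        self.entries.entry(seed).or_insert_with(|| Committees {
            computed_at: dependent_root,
            members: compute(),
        })
    }
}

/// True when a cached entry was computed on a different branch than the current head.
fn is_stale(entry: &Committees, current_dependent_root: Root) -> bool {
    entry.computed_at != current_dependent_root
}

fn main() {
    let mut cache = SeedKeyedCache::new();
    // Branch A populates the cache.
    cache.get_or_compute(42, 0xA, || vec![1, 2, 3]);
    // Contrived reorg to branch B: dependent root changes, seed does not.
    let hit = cache.get_or_compute(42, 0xB, || vec![4, 5, 6]);
    println!("members: {:?}, stale: {}", hit.members, is_stale(hit, 0xB));
}
```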

---

## Risk Matrix

| Scenario | Lighthouse | Prysm | Teku | Grandine | Lodestar |
|----------|:--:|:--:|:--:|:--:|:--:|
| **1-epoch reorg** | ⚠️ 1 slot stale | ✅ | ✅ | ✅ | ✅ |
| **2-epoch reorg** | ⚠️ 1 slot stale | ✅ | ⚠️ proposal skipped | ✅ | ⚠️ epoch+2 stale |
| **Deep reorg (3+ epochs)** | ❌ Stops | ✅ | ❌ All duties | ⚠️ Stops | ❌ Stops |
| **VC clock skew (+2 epochs)** | ❌ Rejected | ✅ Tolerated | ❌ proposal / ✅ attest | ❌ Stops | ❌ Rejected |
| **Epoch boundary reorg** | ⚠️ "send-then-verify" | ✅ | ✅ | ✅ | ⚠️ deferred epoch+2 |

---

## Concrete Code Anchors (Complete Reference)

### Lighthouse

- `beacon_node/beacon_chain/src/chain_config.rs:22`: `DEFAULT_SYNC_TOLERANCE_EPOCHS = 2`
- `beacon_node/beacon_chain/src/beacon_chain.rs:4724-4739`: Proposal gap check: `head_epoch + sync_tolerance_epochs < proposal_epoch`
- `beacon_node/beacon_chain/src/beacon_chain.rs:5979-5992`: Proposer prep slot check
- `beacon_node/beacon_chain/src/beacon_chain.rs:6690-6715`: `with_committee_cache` using `AttestationShufflingId`
- `beacon_node/http_api/src/attester_duties.rs:44-47`: Attestation duty epoch tolerance
- `beacon_node/http_api/src/lib.rs:470-479`: HTTP sync filter
- `validator_client/beacon_node_fallback/src/beacon_node_health.rs:17`: VC-side 8-slot tolerance
- `validator_client/validator_services/src/duties_service.rs:1438-1517`: **Poll-based duty refresh** with "send-then-verify"

### Prysm

- `validator/client/validator.go:539-585`: Duty fetch: `current_epoch + 1`
- `validator/client/validator.go:1172-1221`: `checkDependentRoots()` on head event
- `validator/client/beacon-api/duties.go:47-81`: Concurrent fetch of `in.Epoch` and `in.Epoch + 1`
- `validator/client/propose.go:45-101`: `ProposeBlock`, no freshness gate
- `beacon-chain/cache/committee.go`: Randao-seed-based committee cache (4–32 entries)
- `beacon-chain/sync/service.go:333-341`: Sync status: `headEpoch + 1 < currentEpoch`

### Teku

- `validator/client/.../BlockDutyScheduler.java:30`:
  `LOOKAHEAD_EPOCHS = 0`
- `validator/client/.../AbstractDutyScheduler.java:133-140`: `isAbleToVerifyEpoch()`, `signingEpoch <= currentEpoch + 1`
- `validator/client/.../AbstractDutyScheduler.java:148-167`: `onProductionDue()`, rejects gap=2
- `validator/client/.../PendingDuties.java:107-168`: `onHeadUpdate()` → immediate `recalculate()` or `pendingHeadUpdate`
- `validator/client/.../BlockDutyScheduler.java:60-72`: `getExpectedDependentRoot()`
- `beacon/validator/.../ValidatorApiHandler.java:225-233`: Attestation: `MIN_SEED_LOOKAHEAD + DUTY_EPOCH_TOLERANCE = +2`

### Grandine

- `validator/src/validator_config.rs:15-16`: `max_empty_slots = 32` (default)
- `validator/src/validator.rs:105-115`: `HeadFarBehind` error type
- `validator/src/validator.rs:722-740`: `slot_head()`, unified freshness check
- `validator/src/validator.rs:599-615`: `handle_tick` → `slot_head` failure → all duties silently skipped
- `fork_choice_control/src/queries.rs:327-339`: `store_epoch + MIN_SEED_LOOKAHEAD` attestation tolerance
- `fork_choice_control/src/storage.rs:807-818`: `dependent_root()` from state at query time
- `validator/src/slot_head.rs:68-73`: Stateless proposer/committee query

### Lodestar

- `packages/beacon-node/src/api/impl/validator/index.ts:98-104`: `SYNC_TOLERANCE_EPOCHS = 1` (hardcoded with rationale)
- `packages/beacon-node/src/api/impl/validator/index.ts:340-365`: `notWhileSyncing()` uniform gate
- `packages/beacon-node/src/api/impl/validator/index.ts:1139-1141`: Attestation: `currentEpoch + 1` max
- `packages/validator/src/services/attestationDuties.ts:131-148`: `prepareForNextEpoch` boundary check
- `packages/validator/src/services/attestationDuties.ts:306-357`: `onNewHead()` with **deferred epoch+2** refresh
- `packages/validator/src/services/blockDuties.ts:197-206`: `dependentRoot` comparison on poll
- `packages/state-transition/src/cache/epochCache.ts:630-684`: `nextShuffling` precompute (2 epochs ahead)

---

## Suggested Default Hypothesis

When
starting from epoch boundaries, the default hypothesis should be:

> Different clients are likely to agree on the final consensus rules, but differ
> on **when** they decide a duty is still fresh, **when** they refresh it after
> a reorg, and **when** they stop acting on stale head context.

---

## How to Classify Findings

Use the following labels unless stronger evidence appears:

- **Primary label**: `liveness`, `reorg_recovery`, or `duty_freshness`
- **Primary workflow**: usually `block_generate` or `attestation_generate`
- **Severity default**: low-to-medium until a concrete exploit path is shown

**Escalate only if** there is evidence of:

- slashable signing
- invalid block acceptance / rejection split
- lasting fork-choice corruption
- recoverable issue turning into a persistent halt under attacker control
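
As a rough illustration, the labelling rule above can be encoded as a small data structure. The type and field names (`Finding`, `Evidence`, `Severity`) are hypothetical and not part of any client or existing tooling; the sketch only captures the default severity and the four escalation conditions listed.

```rust
/// Primary labels from the classification list.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Label {
    Liveness,
    ReorgRecovery,
    DutyFreshness,
}

/// Primary workflows from the classification list.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Workflow {
    BlockGenerate,
    AttestationGenerate,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    LowToMedium,
    Escalated,
}

/// Escalation evidence; one flag per bullet (field names are hypothetical).
#[derive(Debug, Default, Clone, Copy)]
struct Evidence {
    slashable_signing: bool,
    acceptance_rejection_split: bool,
    lasting_fork_choice_corruption: bool,
    attacker_controlled_persistent_halt: bool,
}

struct Finding {
    label: Label,
    workflow: Workflow,
    evidence: Evidence,
}

impl Finding {
    /// Low-to-medium by default; escalated if any escalation evidence is present.
    fn severity(&self) -> Severity {
        let e = &self.evidence;
        if e.slashable_signing
            || e.acceptance_rejection_split
            || e.lasting_fork_choice_corruption
            || e.attacker_controlled_persistent_halt
        {
            Severity::Escalated
        } else {
            Severity::LowToMedium
        }
    }
}

fn main() {
    let finding = Finding {
        label: Label::DutyFreshness,
        workflow: Workflow::BlockGenerate,
        evidence: Evidence::default(),
    };
    println!(
        "{:?} / {:?} -> {:?}",
        finding.label,
        finding.workflow,
        finding.severity()
    );
}
```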