# Epoch Boundary Investigation Notes

> **Status**: Deep-dive completed 2026-05-23. Five divergence dimensions fully mapped
> across all five consensus clients (Prysm, Lighthouse, Grandine, Teku, Lodestar).

This document captures a comprehensive cross-client audit of **epoch-boundary
handling** — specifically how each client manages the transition between epochs,
refreshes duties, tolerates head-staleness, and recovers from reorgs.

---

## Why epoch-boundary handling is worth prioritizing

Epoch boundaries are where many clients refresh or invalidate:

- proposer duties
- attester duties
- aggregate duties
- dependent roots
- head-freshness assumptions
- EL/CL coordination for the next slot

Divergence on these paths rarely violates consensus rules by itself, but it
produces **behavioral differences** that can affect:

- **liveness**: missed proposal / missed attestation / delayed recovery
- **reorg recovery**: whether duties are refreshed after head changes
- **duty freshness**: whether a duty is still considered safe/fresh near the boundary

In practice, this is usually a **medium-value audit direction**: it often
produces real cross-client asymmetries, but it does **not automatically imply a
severe consensus-critical vulnerability** unless it can be extended into a
clear exploit path.

---

## Dimension 1: Proposal epoch-gap tolerance

The original anchor example: what happens when `head_epoch` and
`proposal_epoch` differ by **2**?

### Code-level findings

| Client | Mechanism | Code Anchor | gap=2 tolerated? | Configurable? |
|--------|-----------|-------------|:---:|:---:|
| **Lighthouse** | `head_epoch + sync_tolerance_epochs < proposal_epoch` | `beacon_chain.rs:4724` | ✅ Yes | `--sync-tolerance-epochs` (default 2) |
| **Prysm** | No explicit epoch-gap gate; duty-refresh & BN-side driven | `validator.go:562`, `duties.go:68` | ✅ Yes | N/A |
| **Teku** | `signingEpoch > currentEpoch + lookAheadEpochs + 1` | `AbstractDutyScheduler.java:139`, `BlockDutyScheduler.java:30` | ❌ **No** | Hardcoded `LOOKAHEAD_EPOCHS = 0` |
| **Grandine** | `head_slot + max_empty_slots < slot` | `validator.rs:734` | ❌ No (default) | `--max-empty-slots` (default 32) |
| **Lodestar** | No explicit epoch-gap gate; duty-polling driven | `blockDuties.ts:131-148` | ✅ Yes | N/A |

### Key code snippets

**Lighthouse** (`beacon_chain.rs:4724-4739`):
```rust
if head_epoch + self.config.sync_tolerance_epochs < proposal_epoch {
    warn!("Skipping proposer preparation");  // skips prep, still allows proposal
    Err(Error::SkipProposerPreparation)
}
```

**Teku** (`AbstractDutyScheduler.java:133-140`):
```java
// LOOKAHEAD_EPOCHS = 0  →  signingEpoch <= currentEpoch + 1
return !signingEpoch.isGreaterThan(epoch.plus(lookAheadEpochs + 1));
```

**Prysm** (`validator.go:562`, `duties.go:68`) fetches two consecutive epochs concurrently, with no freshness gate:
```go
// validator.go:562: epoch of the *next* slot, i.e. the current epoch,
// or current+1 when called on the last slot of an epoch
epoch := slots.ToEpoch(slots.CurrentSlot(v.genesisTime) + 1)
// duties.go:68: concurrently pull the epoch after that (in.Epoch+1)
c.dutiesForEpoch(ctx, nextEpochDuties, in.Epoch+1, vals, fetchSyncDuties)
```

**Grandine** (`validator.rs:722-740`):
```rust
async fn slot_head(&self, slot: Slot) -> Result<Result<SlotHead<P>, HeadFarBehind>> {
    if head_slot + max_empty_slots < slot {
        return Ok(Err(HeadFarBehind { head_slot, max_empty_slots, slot }));
    }
    // ... on failure, ALL duties (propose, attest, aggregate) are silently skipped
}
```

### Impact analysis

- **Liveness asymmetry**: Prysm, Lighthouse, and Lodestar keep proposing at gap=2; Teku and Grandine give up. This means Teku/Grandine validator slots are **systematically more likely to be skipped** during network instability.
- **Economic asymmetry**: Proposal rewards are orders of magnitude larger than attestation rewards. A Teku validator losing proposals but still attesting during turbulence suffers a disproportionate economic penalty compared to a Prysm validator.
- **Not a consensus split**: A skipped proposal is protocol-identical to an offline proposer. No fork is created.
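
The three explicit gates from the table can be compared side by side. The sketch below is illustrative only: the function names are hypothetical, the predicates are transcribed from the snippets above rather than taken from any client's API, and `SLOTS_PER_EPOCH = 32` assumes mainnet. It also shows why Grandine's default of 32 empty slots always rejects a gap of 2 epochs: the closest such pair of slots is 33 slots apart.

```rust
/// Mainnet value (assumption; other presets differ).
const SLOTS_PER_EPOCH: u64 = 32;

/// Lighthouse-style check: passes unless the head lags the proposal epoch
/// by more than `sync_tolerance_epochs`.
fn lighthouse_gate_passes(head_epoch: u64, proposal_epoch: u64, sync_tolerance_epochs: u64) -> bool {
    !(head_epoch + sync_tolerance_epochs < proposal_epoch)
}

/// Teku-style check: passes while signing_epoch <= current_epoch + look_ahead + 1.
fn teku_gate_passes(current_epoch: u64, signing_epoch: u64, look_ahead_epochs: u64) -> bool {
    !(signing_epoch > current_epoch + look_ahead_epochs + 1)
}

/// Grandine-style check, expressed in slots rather than epochs.
fn grandine_gate_passes(head_slot: u64, slot: u64, max_empty_slots: u64) -> bool {
    !(head_slot + max_empty_slots < slot)
}

/// Smallest possible slot distance between a head in epoch E and a slot in
/// epoch E + gap_epochs (last slot of E to first slot of E + gap_epochs).
fn min_slot_distance(gap_epochs: u64) -> u64 {
    assert!(gap_epochs >= 1, "gap must be at least one epoch");
    (gap_epochs - 1) * SLOTS_PER_EPOCH + 1
}
```

With a head in epoch 10 and a proposal in epoch 12, `lighthouse_gate_passes(10, 12, 2)` holds, `teku_gate_passes(10, 12, 0)` does not, and `min_slot_distance(2)` is 33, which already exceeds Grandine's default of 32.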

---

## Dimension 2: Attestation future-window tolerance

How far ahead can attestation duties be requested and served?

### BN-side tolerance

| Client | Max Query Epoch | Mechanism | Unique behavior |
|--------|:--:|------|------|
| **Teku** | `current + 2` | `MIN_SEED_LOOKAHEAD(1) + DUTY_EPOCH_TOLERANCE(1)` | **Most permissive** |
| **Lighthouse** | `current + 1` | With `tolerant_current_epoch` clock-disparity correction | Boundary-smoothing via `MAXIMUM_GOSSIP_CLOCK_DISPARITY` |
| **Grandine** | `store_epoch + 1` (or +2 on last slot) | `MIN_SEED_LOOKAHEAD` with epoch-boundary extension | **Unique last-slot-of-epoch special case** |
| **Prysm** | `current + 1` | Strict check, no clock-disparity adjustment | Simplest implementation |
| **Lodestar** | `current + 1` | Strict check; code comments explain "epoch+2 has dependent root jitter" | Hardcoded rationale |
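
As a compact restatement of the table, the function below returns the furthest epoch for which each BN would serve attester duties. It is a simplified sketch: the `Client` enum and function name are hypothetical, Grandine's `store_epoch` is treated as equal to the current epoch, and Lighthouse's clock-disparity correction is ignored.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Client {
    Teku,
    Lighthouse,
    Grandine,
    Prysm,
    Lodestar,
}

/// Furthest epoch for which attester duties are served, per the table above.
/// Simplification: Grandine's store epoch is assumed equal to `current_epoch`.
fn max_attester_duty_epoch(client: Client, current_epoch: u64, is_last_slot_of_epoch: bool) -> u64 {
    match client {
        // MIN_SEED_LOOKAHEAD (1) + DUTY_EPOCH_TOLERANCE (1)
        Client::Teku => current_epoch + 2,
        // Grandine's last-slot-of-epoch special case
        Client::Grandine if is_last_slot_of_epoch => current_epoch + 2,
        Client::Grandine | Client::Lighthouse | Client::Prysm | Client::Lodestar => {
            current_epoch + 1
        }
    }
}
```

For example, at epoch 100 a request for epoch 102 is within bounds for Teku at any slot, for Grandine only on the last slot of the epoch, and out of bounds for the other three.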

### VC-side pre-fetch

| Client | Fetches | Strategy |
|--------|---------|----------|
| **Lodestar** | Up to `current + 2` (via `nextShuffling` precompute) | **Only client that precomputes 2 epochs ahead** |
| **Lighthouse** | `current` + `next` | Stores `current_epoch_shuffling_id` + `next_epoch_shuffling_id` on forkchoice blocks |
| **Teku** | `headEpoch`, `headEpoch+1` | Compares dependent roots, not epochs, to trigger refetch |
| **Grandine** | Query-time from head state | Stateless; `MIN_SEED_LOOKAHEAD` precompute boundary |
| **Prysm** | `current` + `next` (concurrent fetch) | Lazy, no precomputation |

### 🔑 Key asymmetry: Teku

Teku is the only client where **attestation is more permissive than proposal**:

| Duty type | Teku tolerance | vs other clients |
|-----------|:---:|------|
| Proposal | `current + 1` (gap=2 rejected) | **Strictest** |
| Attestation | `current + 2` | **Most permissive** |

This means a Teku node under clock skew or network delay will **keep attesting but stop proposing** — an asymmetric liveness profile unique to Teku.

---

## Dimension 3: Dependent-root refresh model

How does each client detect and react to dependent-root changes (which indicate
a reorg or head change affecting duty validity)?

### Architecture comparison

| Client | Cache/State Key | Invalidation Trigger | Latency | Pattern |
|--------|----------------|---------------------|---------|---------|
| **Teku** | `dependentRoot` direct comparison | `onHeadUpdate` → immediate `recalculate()` | ~0ms | **Push + Pending** |
| **Lodestar** | `dependentRoot` per epoch in VC | `onNewHead` → immediate (epoch+1), deferred (epoch+2) | 0ms ~ 1 epoch | **Push + Pending-root** |
| **Prysm** | Randao seed (not dependent root) | Head event → `checkDependentRoots()` → `UpdateDuties()` | ~0ms | **Push** |
| **Lighthouse** | `AttestationShufflingId` (block root + epoch) | Next slot poll (up to 12s) | ≤12s | **Pull (polling)** |
| **Grandine** | No cache — queries head state at execution time | N/A (always fresh) | 0ms | **Stateless** |

### 🔑 Critical finding: Lighthouse "send-then-verify" pattern

Lighthouse is unique in using a **poll-based** refresh model. On each slot:

```rust
// 1. Send immediate notification from cache (potentially stale)
notify_block_production_service(cached_duties).await;

// 2. Then poll for latest duties
beacon_node.get_validator_duties_proposer(current_epoch).await;

// 3. Only warn if dependent_root changed
if dependent_root != prior_dependent_root {
    warn!("Proposer duties re-org");  // no retry, just log
}
```

This creates a **≤12 second window** where a Lighthouse VC operates on stale
duties after a reorg. The stale duty is already dispatched before the poll
completes.
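
The ordering problem can be reproduced with a toy model. This is a minimal sketch and not Lighthouse code: `DutiesCache`, `ProposerDuties`, and `on_slot` are hypothetical names, and a `u64` stands in for a 32-byte dependent root. The only point it makes is that whatever is dispatched in step 1 cannot reflect what step 2 later fetches.

```rust
#[derive(Clone, Debug, PartialEq, Eq)]
struct ProposerDuties {
    dependent_root: u64, // stand-in for a 32-byte root
    proposer_index: u64,
}

struct DutiesCache {
    cached: ProposerDuties,
}

impl DutiesCache {
    /// One slot tick. Returns the duties that were dispatched for this slot
    /// and whether the poll observed a dependent-root change.
    fn on_slot(&mut self, fetch_latest: impl FnOnce() -> ProposerDuties) -> (ProposerDuties, bool) {
        // 1. Dispatch from cache first (possibly stale).
        let dispatched = self.cached.clone();
        // 2. Poll the beacon node for the latest duties.
        let latest = fetch_latest();
        // 3. Compare roots; the dispatch in step 1 is not revisited.
        let reorged = latest.dependent_root != self.cached.dependent_root;
        self.cached = latest;
        (dispatched, reorged)
    }
}
```

If the beacon node returns a new dependent root during a tick, that tick still dispatches the old proposer index; the refreshed duties are only used from the following tick onward.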

### Lodestar's deferred epoch+2 refresh

```typescript
// attestationDuties.ts:321-331
if (head !== nextTwoEpochDependentRoot) {
    if (isLastSlotOfEpoch) {
        await this.handleAttesterDutiesReorg(...);  // immediate
    } else {
        this.pendingDependentRootByEpoch.set(epoch, dependentRoot);  // deferred
    }
}
```

Epoch+2 duties can remain stale for **up to one full epoch** (~6.4 minutes) if
a reorg occurs early in the current epoch.
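
The size of the window follows from simple slot arithmetic. The sketch below assumes mainnet timing (32 slots per epoch, 12 seconds per slot) and assumes the deferred root is picked up at the start of the last slot of the epoch, as the `isLastSlotOfEpoch` branch suggests; the function names are hypothetical.

```rust
const SLOTS_PER_EPOCH: u64 = 32; // mainnet (assumption)
const SECONDS_PER_SLOT: u64 = 12; // mainnet (assumption)

/// Slots during which epoch+2 data stays stale when a reorg lands at
/// `reorg_slot_in_epoch` and the refresh only happens on the last slot.
fn deferred_stale_slots(reorg_slot_in_epoch: u64) -> u64 {
    assert!(reorg_slot_in_epoch < SLOTS_PER_EPOCH);
    (SLOTS_PER_EPOCH - 1) - reorg_slot_in_epoch
}

/// Same window expressed in seconds.
fn deferred_stale_seconds(reorg_slot_in_epoch: u64) -> u64 {
    deferred_stale_slots(reorg_slot_in_epoch) * SECONDS_PER_SLOT
}
```

Under these assumptions a reorg at slot 0 leaves the data stale for 31 slots (372 s), just under the one-epoch upper bound of 32 × 12 s = 384 s, while a reorg on the last slot is handled immediately.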

### Teku's pending mechanism

```java
// PendingDuties.java:160-169
public synchronized void onHeadUpdate(Bytes32 dependentRoot) {
    getCurrentDuties().ifPresentOrElse(
        duties -> {
            if (duties.requiresRecalculation(dependentRoot)) recalculate();
        },
        () -> pendingHeadUpdate = Optional.of(dependentRoot)  // store for later
    );
}
```

If duties are still loading when a head update arrives, the new dependent root
is stored and checked when loading completes. This is elegant but can create a
gap if `recalculate()` cancels the old Future before the new one resolves.
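
The pending pattern itself is small enough to model outside Teku. The following is a hedged sketch in Rust, not a port: all names are hypothetical, duty loading is reduced to an explicit `on_duties_loaded` call, and `recalculate` only drops the current duties and bumps a counter instead of scheduling a new request.

```rust
type Root = [u8; 32];

struct Duties {
    dependent_root: Root,
}

impl Duties {
    fn requires_recalculation(&self, new_root: &Root) -> bool {
        &self.dependent_root != new_root
    }
}

#[derive(Default)]
struct PendingDuties {
    current: Option<Duties>,
    pending_head_update: Option<Root>,
    recalculations: u32, // counter used here only to observe behavior
}

impl PendingDuties {
    /// Head changed: recalculate if duties are loaded and stale,
    /// otherwise remember the root until loading completes.
    fn on_head_update(&mut self, dependent_root: Root) {
        let stale = self
            .current
            .as_ref()
            .map(|d| d.requires_recalculation(&dependent_root));
        match stale {
            Some(true) => self.recalculate(),
            Some(false) => {}
            None => self.pending_head_update = Some(dependent_root),
        }
    }

    /// Loading finished: apply any head update that arrived in the meantime.
    fn on_duties_loaded(&mut self, duties: Duties) {
        let stale = self
            .pending_head_update
            .take()
            .map_or(false, |root| duties.requires_recalculation(&root));
        self.current = Some(duties);
        if stale {
            self.recalculate();
        }
    }

    fn recalculate(&mut self) {
        self.current = None; // duties are absent until the reload completes
        self.recalculations += 1;
    }
}
```

The window in which `current` is `None` after `recalculate()` is the model's equivalent of the gap described above: any duty that comes due in that window has nothing to act on.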

---

## Dimension 4: Head-gap stop-work thresholds

At what lag does each client stop performing duties entirely?

### Threshold comparison

| Client | Threshold | Unit | Configurable? | Applies To |
|--------|-----------|------|:---:|------|
| **Lighthouse BN** | 2 epochs (`sync_tolerance_epochs`) | Epochs | ✅ `--sync-tolerance-epochs` | All duty endpoints (HTTP 503) |
| **Lighthouse VC** | 8 slots | Slots | ❌ Hardcoded `DEFAULT_SYNC_TOLERANCE` | VC fallback health check |
| **Grandine** | 32 slots (`max_empty_slots`) | Slots | ✅ `--max-empty-slots` | **All duties uniformly** (silent skip) |
| **Lodestar** | 1 epoch (`SYNC_TOLERANCE_EPOCHS`) | Epochs | ❌ Hardcoded | All duty endpoints (HTTP 503) |
| **Teku** | 32 slots (`HEAD_TOO_OLD_THRESHOLD`) | Slots | ❌ Hardcoded | **Sync committee only** |
| **Prysm** | 1 epoch (`headEpoch + 1 < currentEpoch`) | Epochs | ❌ Hardcoded | Sync status binary gate |

### Key code anchors

**Grandine** — unified check for ALL duty types (`validator.rs:722-740`):
```rust
async fn slot_head(&self, slot: Slot) -> Result<Result<SlotHead<P>, HeadFarBehind>> {
    if head_slot + max_empty_slots < slot {
        return Ok(Err(HeadFarBehind { head_slot, max_empty_slots, slot }));
    }
    // ...
}
// Called once per tick; if it fails, propose/attest/aggregate are ALL skipped
let Some(slot_head) = slot_head else { return Ok(()); };
```

**Lighthouse** — separate proposer-prep guard that skips prep but still allows proposal:
```rust
// beacon_chain.rs:4724 — skips proposer preparation only
if head_epoch + self.config.sync_tolerance_epochs < proposal_epoch {
    Err(Error::SkipProposerPreparation)  // proposal itself still proceeds
}
// beacon_chain.rs:5979 — skips prepare_beacon_proposer call
if head_slot + tolerance_slots < current_slot {
    return Ok(None);
}
```

**Lodestar** — hardcoded 1 epoch with explicit rationale (`validator/index.ts:98-104`):
```typescript
// "Lighthouse uses 8. However, 8 kills Lodestar since validators can trigger
//  regen to fast-forward head state 8 epochs..."
export const SYNC_TOLERANCE_EPOCHS = 1;
```

### 🔑 Key difference: Lighthouse's dual tolerance

Lighthouse has **two different thresholds**:
- BN HTTP API: 2 epochs (configurable) for sync status
- VC fallback health check: 8 slots (hardcoded) for BN selection

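Because 8 slots is far stricter than 2 epochs (64 slots), the two checks
disagree whenever the head lags by more than 8 slots but not more than
2 epochs: the VC's fallback logic already treats the BN as unsynced while the
BN itself would still serve duty requests. The reverse case (VC considers the
BN healthy while the BN returns 503) cannot occur with these defaults, which
makes the resulting failure mode confusing to diagnose.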
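
Putting both thresholds on the same footing makes it possible to check, for any head lag, which of the two checks trips. The sketch below uses the default values from the table; the function names are hypothetical, mainnet `SLOTS_PER_EPOCH = 32` is assumed, and both checks are assumed to measure the same head lag.

```rust
const SLOTS_PER_EPOCH: u64 = 32; // mainnet (assumption)
const VC_SYNC_TOLERANCE_SLOTS: u64 = 8; // VC fallback health check
const BN_SYNC_TOLERANCE_EPOCHS: u64 = 2; // BN default, configurable

/// VC-side view: the BN counts as synced if its head is within 8 slots.
fn vc_considers_synced(head_slot: u64, current_slot: u64) -> bool {
    head_slot + VC_SYNC_TOLERANCE_SLOTS >= current_slot
}

/// BN-side view: duty endpoints are served if the head is within 2 epochs.
fn bn_serves_duties(head_slot: u64, current_slot: u64) -> bool {
    let head_epoch = head_slot / SLOTS_PER_EPOCH;
    let current_epoch = current_slot / SLOTS_PER_EPOCH;
    head_epoch + BN_SYNC_TOLERANCE_EPOCHS >= current_epoch
}
```

With a head at slot 100 and the clock at slot 120, `vc_considers_synced` is false while `bn_serves_duties` is still true; at slot 200 both are false.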

### 🔑 Key difference: Grandine's silent skip

Grandine is the only client that **silently skips all duties** when the head is
stale. All other clients either return an explicit error (503/gRPC Unavailable)
or log a warning. This makes Grandine's behavior harder to observe and debug.

---

## Dimension 5: Reorg-triggered duty invalidation

### Detection mechanism

| Client | Detection | Mode |
|--------|-----------|------|
| **Prysm** | Head event from BN → `checkDependentRoots()` | Push |
| **Teku** | `onHeadUpdate()` via `ValidatorTimingChannel` | Push |
| **Lodestar** | SSE `head` event → `onNewHead()` | Push |
| **Grandine** | `ValidatorMessage::Head` → N/A (stateless) | Push (no action needed) |
| **Lighthouse** | Slot polling → `poll_beacon_proposers()` | **Pull** |

### Stale-duty window per scenario

| Reorg scenario | Lighthouse | Prysm | Teku | Grandine | Lodestar |
|---|---|---|---|---|---|
| Mid-epoch (all duties) | ≤1 slot (~12s) | ~0ms | ~0ms | 0ms | ~0ms |
| Epoch boundary | ≤1 slot | ~0ms | ~0ms | 0ms | ≤1 epoch (epoch+2 duties) |
| During duty fetch | ≤1 slot | ~0ms | Gap on cancel | 0ms | ~0ms |

### 🔑 Most fragile: Lighthouse

Lighthouse's poll-based model means:
1. Reorg occurs → duties stay stale for up to 12 seconds, until the next slot poll detects it
2. Attestation in the reorg slot uses **wrong dependent root**
3. Code explicitly sends cached notification before verifying

This is a deliberate design choice (simplicity over latency), but it means a
Lighthouse validator whose duty falls in the reorg slot can lose that
attestation, where the push-based clients would have refreshed in time.

### 🔑 Most robust: Grandine

Grandine's stateless model means every duty execution reads from the current
head state. A reorg 50ms before attestation is transparently reflected. The
trade-off: computing committees and proposer indices at execution time adds
latency and memory pressure.

### Prysm's indirect invalidation

Prysm uses **randao seed** (not dependent root) as the committee cache key.
This means:
- Reorg changes dependent root but NOT randao seed → cache hit → **stale committee used** (edge case, low probability)
- Reorg changes randao seed → cache miss → recompute (correct behavior)

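The probability depends on whether the reorg alters the RANDAO mix that feeds the seed: in the consensus spec, `get_seed` for an epoch reads the mix stored at index `epoch + EPOCHS_PER_HISTORICAL_VECTOR - MIN_SEED_LOOKAHEAD - 1` (modulo `EPOCHS_PER_HISTORICAL_VECTOR`), i.e. the mix from epoch `epoch - MIN_SEED_LOOKAHEAD - 1`.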
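
The cache-hit case can be illustrated with a toy cache keyed only by seed. This is a hypothetical sketch, not Prysm's cache: `SeedKeyedCache` and `Committees` are invented names, and `u64` values stand in for seeds and roots. It shows only that a seed-keyed lookup cannot distinguish two branches that share a seed.

```rust
use std::collections::HashMap;

type Seed = u64; // stand-in for a 32-byte seed
type Root = u64; // stand-in for a 32-byte dependent root

struct Committees {
    /// Dependent root of the branch the entry was computed on.
    computed_from: Root,
}

struct SeedKeyedCache {
    by_seed: HashMap<Seed, Committees>,
}

impl SeedKeyedCache {
    /// Returns the cached entry for `seed`, computing it only on a miss.
    /// The dependent root is not part of the key.
    fn get_or_compute(&mut self, seed: Seed, dependent_root: Root) -> &Committees {
        self.by_seed
            .entry(seed)
            .or_insert(Committees { computed_from: dependent_root })
    }
}
```

After a first lookup with seed 7 on root `0xA`, a second lookup with seed 7 on root `0xB` returns the entry computed on `0xA`; only a different seed forces a recompute.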

---

## Impact Analysis

### Impact 1: Teku's proposal/attestation asymmetry

Teku is the only client where proposal and attestation have **opposite**
tolerance directions:

```
Proposal:  LOOKAHEAD_EPOCHS = 0  →  max gap = 1 epoch  (STRICT)
Attestation: MIN_SEED_LOOKAHEAD(1) + DUTY_EPOCH_TOLERANCE(1) = +2  (PERMISSIVE)
```

**Scenario**: Network instability causes head to lag 2 epochs.

| Action | Teku | Prysm/Lighthouse/Lodestar |
|--------|------|---------------------------|
| Attest | ✅ Continues | ✅ or ❌ (client-dependent) |
| Propose | ❌ **Skips** | ✅ Continues |

**Consequences**:
- Proposal is worth ~100× more than an attestation in rewards
- Teku validators suffer **disproportionate economic penalty** during turbulence
- Teku-occupied proposal slots are **systematically more likely to be skipped**
- This could, in extreme market conditions, create an incentive to avoid Teku

### Impact 2: Lighthouse's "blind flight" after reorg

**Scenario**: A 2-block reorg occurs at an epoch boundary.

```
Slot N:    Reorg — head switches from chain_A to chain_B
Slot N:    Lighthouse sends cached (chain_A) attestation duty → WRONG
Slot N+1:  Poll detects dependent_root change → finally refreshes
```

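- Up to one attestation lost per affected validator per reorg (other clients lose 0)
- At ~0.1% reorg probability per slot, a validator attesting once per epoch (~82,000 attestations/year) would see roughly 80 of its attesting slots coincide with a reorg each year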
- The "send-then-verify" pattern is explicit in code — not a bug, a design trade-off

### Impact 3: Lodestar's epoch+2 staleness

**Scenario**: Reorg at epoch N, slot 0. Validator has duties in epoch N+2.

```
Epoch N, Slot 0:   Reorg, dependent_root for epoch N+2 changes
                   → Stored in pendingDependentRootByEpoch, NO refresh
Epoch N, Slot 1-30: Subnet subscriptions and precomputes use STALE data
Epoch N, Slot 31:  prepareForNextEpoch finally detects mismatch → refresh

Total stale window: ~6.2 minutes (31 of 32 slots, just under one full epoch)
```

Actual attestation signing still uses correct BN-side state at epoch N+2, but
the VC's preparation (subnet subscriptions, distributed aggregation selection)
runs on stale committee assignments for nearly the entire epoch.

### Impact 4: Grandine's performance/reliability trade-off

Grandine's stateless model eliminates stale-duty risk entirely — a reorg 50ms
before attestation is transparent. However:
- Every duty execution reads full beacon state (several MB)
- Committee lookup and proposer index computed at execution time
- No precomputation possible → higher per-duty latency
- Nodes with many validators may experience cumulative delays

### Impact 5: Prysm's indirect cache invalidation

Prysm's randao-seed-based caching creates a subtle edge case: a reorg that
changes the dependent root but not the randao seed leaves stale committee data
in cache. Probability is extremely low (requires reorg depth outside the seed
computation window) but the failure mode — wrong committee assignment — is
severe when triggered.

---

## Risk Matrix

| Scenario | Lighthouse | Prysm | Teku | Grandine | Lodestar |
|----------|:--:|:--:|:--:|:--:|:--:|
| **1-epoch reorg** | ⚠️ 1 slot stale | ✅ | ✅ | ✅ | ✅ |
| **2-epoch reorg** | ⚠️ 1 slot stale | ✅ | ⚠️ proposal skipped | ✅ | ⚠️ epoch+2 stale |
| **Deep reorg (3+ epochs)** | ❌ Stops | ✅ | ❌ All duties | ⚠️ Stops | ❌ Stops |
| **VC clock skew (+2 epochs)** | ❌ Rejected | ✅ Tolerated | ❌ proposal / ✅ attest | ❌ Stops | ❌ Rejected |
| **Epoch boundary reorg** | ⚠️ "send-then-verify" | ✅ | ✅ | ✅ | ⚠️ deferred epoch+2 |

---

## Concrete Code Anchors (Complete Reference)

### Lighthouse
- `beacon_node/beacon_chain/src/chain_config.rs:22` — `DEFAULT_SYNC_TOLERANCE_EPOCHS = 2`
- `beacon_node/beacon_chain/src/beacon_chain.rs:4724-4739` — Proposal gap check: `head_epoch + sync_tolerance_epochs < proposal_epoch`
- `beacon_node/beacon_chain/src/beacon_chain.rs:5979-5992` — Proposer prep slot check
- `beacon_node/beacon_chain/src/beacon_chain.rs:6690-6715` — `with_committee_cache` using `AttestationShufflingId`
- `beacon_node/http_api/src/attester_duties.rs:44-47` — Attestation duty epoch tolerance
- `beacon_node/http_api/src/lib.rs:470-479` — HTTP sync filter
- `validator_client/beacon_node_fallback/src/beacon_node_health.rs:17` — VC-side 8-slot tolerance
- `validator_client/validator_services/src/duties_service.rs:1438-1517` — **Poll-based duty refresh** with "send-then-verify"

### Prysm
- `validator/client/validator.go:539-585` — Duty fetch: `current_epoch + 1`
- `validator/client/validator.go:1172-1221` — `checkDependentRoots()` on head event
- `validator/client/beacon-api/duties.go:47-81` — Concurrent fetch of `in.Epoch` and `in.Epoch + 1`
- `validator/client/propose.go:45-101` — `ProposeBlock` — no freshness gate
- `beacon-chain/cache/committee.go` — Randao-seed-based committee cache (4–32 entries)
- `beacon-chain/sync/service.go:333-341` — Sync status: `headEpoch + 1 < currentEpoch`

### Teku
- `validator/client/.../BlockDutyScheduler.java:30` — `LOOKAHEAD_EPOCHS = 0`
- `validator/client/.../AbstractDutyScheduler.java:133-140` — `isAbleToVerifyEpoch()` — `signingEpoch <= currentEpoch + 1`
- `validator/client/.../AbstractDutyScheduler.java:148-167` — `onProductionDue()` — rejects gap=2
- `validator/client/.../PendingDuties.java:107-168` — `onHeadUpdate()` → immediate `recalculate()` or `pendingHeadUpdate`
- `validator/client/.../BlockDutyScheduler.java:60-72` — `getExpectedDependentRoot()`
- `beacon/validator/.../ValidatorApiHandler.java:225-233` — Attestation: `MIN_SEED_LOOKAHEAD + DUTY_EPOCH_TOLERANCE = +2`

### Grandine
- `validator/src/validator_config.rs:15-16` — `max_empty_slots = 32` (default)
- `validator/src/validator.rs:105-115` — `HeadFarBehind` error type
- `validator/src/validator.rs:722-740` — `slot_head()` — unified freshness check
- `validator/src/validator.rs:599-615` — `handle_tick` → `slot_head` failure → all duties silently skipped
- `fork_choice_control/src/queries.rs:327-339` — `store_epoch + MIN_SEED_LOOKAHEAD` attestation tolerance
- `fork_choice_control/src/storage.rs:807-818` — `dependent_root()` from state at query time
- `validator/src/slot_head.rs:68-73` — Stateless proposer/committee query

### Lodestar
- `packages/beacon-node/src/api/impl/validator/index.ts:98-104` — `SYNC_TOLERANCE_EPOCHS = 1` (hardcoded with rationale)
- `packages/beacon-node/src/api/impl/validator/index.ts:340-365` — `notWhileSyncing()` uniform gate
- `packages/beacon-node/src/api/impl/validator/index.ts:1139-1141` — Attestation: `currentEpoch + 1` max
- `packages/validator/src/services/attestationDuties.ts:131-148` — `prepareForNextEpoch` boundary check
- `packages/validator/src/services/attestationDuties.ts:306-357` — `onNewHead()` with **deferred epoch+2** refresh
- `packages/validator/src/services/blockDuties.ts:197-206` — `dependentRoot` comparison on poll
- `packages/state-transition/src/cache/epochCache.ts:630-684` — `nextShuffling` precompute (2 epochs ahead)

---

## Suggested Default Hypothesis

When starting from epoch boundaries, the default hypothesis should be:

> Different clients are likely to agree on the final consensus rules, but differ
> on **when** they decide a duty is still fresh, **when** they refresh it after
> a reorg, and **when** they stop acting on stale head context.

---

## How to Classify Findings

Use the following labels unless stronger evidence appears:

- **Primary label**: `liveness`, `reorg_recovery`, or `duty_freshness`
- **Primary workflow**: usually `block_generate` or `attestation_generate`
- **Severity default**: low-to-medium until a concrete exploit path is shown

**Escalate only if** there is evidence of:
- slashable signing
- invalid block acceptance / rejection split
- lasting fork-choice corruption
- recoverable issue turning into a persistent halt under attacker control
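
For tooling that records findings, the labels and the escalation rule above could be encoded roughly as follows. This is only one possible shape: every type and variant name is hypothetical and simply mirrors the lists in this section.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Label {
    Liveness,
    ReorgRecovery,
    DutyFreshness,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Workflow {
    BlockGenerate,
    AttestationGenerate,
}

/// Evidence kinds that justify escalation, one per bullet above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Evidence {
    SlashableSigning,
    BlockValiditySplit,
    LastingForkChoiceCorruption,
    AttackerControlledPersistentHalt,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    LowToMedium,
    Escalated,
}

#[allow(dead_code)]
struct Finding {
    label: Label,
    workflow: Workflow,
    evidence: Vec<Evidence>,
}

impl Finding {
    /// Default is low-to-medium; any escalation evidence raises it.
    fn severity(&self) -> Severity {
        if self.evidence.is_empty() {
            Severity::LowToMedium
        } else {
            Severity::Escalated
        }
    }
}
```

A finding with an empty `evidence` list stays at the default severity; adding any single evidence item escalates it.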
