Skip to content

runner: optimize away BaseRunner.mtx#2771

Open
iurii-ssv wants to merge 21 commits intostagefrom
runner-optimize-away-base-runner-mtx
Open

runner: optimize away BaseRunner.mtx#2771
iurii-ssv wants to merge 21 commits intostagefrom
runner-optimize-away-base-runner-mtx

Conversation

@iurii-ssv
Copy link
Copy Markdown
Contributor

@iurii-ssv iurii-ssv commented Apr 2, 2026

This PR removes the need for BaseRunner.mtx that complicated the code quite a bit,

the only real reason to have it were onTimeout func(s) that would check HasRunningDuty() and skip the issuance of corresponding timeout even if necessary (as per @momosh-ssv discovery) ... now we do this instead:

  • we do not perform the HasRunningDuty() check in the onTimeout func(s) themselves, we just issue the timeout event every time (it's pretty rare, so the overhead from the added events is negligible)
  • these timeout events are processed within the same for-loop (StartQueueConsumer and ConsumeQueue funcs) with every other duty-related message/event, they are all processed sequentially (due to reading from same thread-safe queue, in the same for-loop)
  • the HasRunningDuty() check is performed during the actual processing of those timeout events (see case types.Timeout:)

Additional notes:

  • Validator/Committee Runners/Queues are managed somewhat differently - hence why there are certain discrepancies with respect to how those are handled in this PR
  • this PR also simplifies & standardizes runner-state / qbft-instance-state related checks, preserving old behavior (pure refactor) - figured it's a good time to do it, since we are changing the code in the area anyways

Related: F-ssv-186 F-ssv-054 https://github.com/ssvlabs/ssv-node-board/issues/764

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.5%. Comparing base (fb579a0) to head (ac230fd).

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 2, 2026

Greptile Summary

This PR removes BaseRunner.mtx by moving the HasRunningDuty() guard out of onTimeout callbacks and into the sequential message-processing loop (OnTimeoutQBFT), relying on single-goroutine sequential execution for concurrent safety. It also standardises state-predicate helpers (hasDutyRunning, hasDutyAssigned, HasStartedQBFTInstance) and inlines baseSetupForNewDuty directly — a clean refactor with clear documentation of the new concurrency contract.

Confidence Score: 4/5

Safe to merge after addressing the nil-safety regression in HasRunningQBFTInstance

One P2 finding: HasRunningQBFTInstance now directly dereferences RunningInstance.State.Decided without the nil-check that the old IsDecided() provided, introducing a potential panic in deserialization edge cases where Instance.State could be nil. All other changes are correct refactors or positive behavioural improvements (e.g., removing the silent nil-CurrentDuty marshal path in runner_state.go, correctly returning nil instead of an error when a pruned committee runner's timeout arrives).

protocol/v2/ssv/runner/runner.go (HasRunningQBFTInstance nil-safety)

Important Files Changed

Filename Overview
protocol/v2/ssv/runner/runner.go Core refactor: removes mtx, inlines state assignment, adds HasStartedQBFTInstance helper, moves HasRunningDuty to Getters interface, adds OnTimeoutQBFT with slot/height guard; HasRunningQBFTInstance loses nil-safety from old IsDecided()
protocol/v2/ssv/validator/timer.go Simplifies both Validator and Committee onTimeout: drops HasRunningDuty and DutyRunner nil checks (now handled in the processing loop), promotes several log levels from Warn/Debug to Error
protocol/v2/ssv/validator/committee.go Changes timeout-runner-not-found from a returned error to a debug log + nil return, correctly handling the pruned-runner race
protocol/v2/ssv/runner/runner_state.go Removes nil-guard on CurrentDuty before type assertions in MarshalJSON; now returns an error immediately for nil duty instead of silently serialising an incomplete struct
protocol/v2/ssv/runner/runner_validations.go Replaces bare b.State.RunningInstance nil-check with HasStartedQBFTInstance(), implicitly guarding against nil State
protocol/v2/ssv/spectest/util.go Replaces manual State/RunningInstance nil-checks with HasStartedQBFTInstance() in test setup utilities
protocol/v2/ssv/validator/validator.go Minor: improves error message wording for timeout data parsing failure

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Slot Timer Fires] --> B[onTimeout callback]
    B --> C{Queue initialised?}
    C -- No --> D[Log Error/Debug and return]
    C -- Yes --> E[Push TimeoutData msg to queue]
    E --> F[Sequential message-processing loop]
    F --> G{Message type?}
    G -- SSVEventMsgType/Timeout --> H[OnTimeoutQBFT]
    H --> I{hasDutyRunning?}
    I -- No --> J[return nil - duty already ended]
    I -- Yes --> K{height == currentDutySlot?}
    K -- No --> L[return nil - stale timeout for old slot]
    K -- Yes --> M[QBFTController.OnTimeout]
    G -- Other --> N[ProcessPreConsensus / ProcessConsensus / ProcessPostConsensus]
Loading

Reviews (4): Last reviewed commit: "revert the removal of queue filter" | Re-trigger Greptile

Comment thread protocol/v2/ssv/runner/runner.go Outdated
nkryuchkov
nkryuchkov previously approved these changes Apr 2, 2026
Comment thread protocol/v2/ssv/runner/runner.go Outdated
@iurii-ssv iurii-ssv requested a review from nkryuchkov April 3, 2026 10:54
@iurii-ssv
Copy link
Copy Markdown
Contributor Author

@greptile pls re-review

@iurii-ssv
Copy link
Copy Markdown
Contributor Author

@greptile could you re-review ?

@iurii-ssv
Copy link
Copy Markdown
Contributor Author

@greptile pls re-review

@iurii-ssv
Copy link
Copy Markdown
Contributor Author

One P2 finding: HasRunningQBFTInstance now directly dereferences RunningInstance.State.Decided without the nil-check that the old IsDecided() provided, introducing a potential panic in deserialization edge cases where Instance.State could be nil. All other changes are correct refactors or positive behavioural improvements (e.g., removing the silent nil-CurrentDuty marshal path in runner_state.go, correctly returning nil instead of an error when a pruned committee runner's timeout arrives).

protocol/v2/ssv/runner/runner.go (HasRunningQBFTInstance nil-safety)

Regarding that concern, the behavior is intentional - the HasRunningQBFTInstance has an explicit comment about it:

func (b *BaseRunner) HasRunningQBFTInstance() bool {
	// Note: RunningInstance.State cannot be nil for existing RunningInstance by construction.
	return b.hasDutyRunning() && b.HasStartedQBFTInstance() && !b.State.RunningInstance.State.Decided
}

I think panic is better in this case than any of:

  • retuning an implicit false that would hide/mislead on the actual root cause (of the bug being present for some RunningInstance.State initialization we have somewhere)
  • returning an error that needs to be handled - this would complicate a lot of call-sights ... so it's just not worth it (and it will also make it harder to spot the root cause - a bug in our code that needs fixing)

Comment thread protocol/v2/ssv/runner/runner.go
nkryuchkov
nkryuchkov previously approved these changes Apr 7, 2026
olegshmuelov
olegshmuelov previously approved these changes Apr 13, 2026
Copy link
Copy Markdown
Contributor

@olegshmuelov olegshmuelov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@iurii-ssv iurii-ssv dismissed stale reviews from olegshmuelov and nkryuchkov via 86f7b67 April 16, 2026 09:18
ljuba-ssv
ljuba-ssv previously approved these changes Apr 16, 2026
Copy link
Copy Markdown

@ljuba-ssv ljuba-ssv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean PR — the mutex removal is well-reasoned and the timeout-via-queue approach is solid. A few minor nits inline.

Comment thread protocol/v2/ssv/validator/timer.go
Comment thread protocol/v2/ssv/runner/runner_state.go
Comment thread protocol/v2/ssv/runner/runner.go Outdated
Comment thread protocol/v2/ssv/validator/timer.go
Comment thread protocol/v2/ssv/runner/runner.go
ljuba-ssv
ljuba-ssv previously approved these changes Apr 16, 2026
ljuba-ssv
ljuba-ssv previously approved these changes Apr 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants