* impl guide: Update Collator Generation
* Address review comments
* Fix compile errors
I don't remember why I did this. Maybe it only made sense with the async backing
changes.
* Remove leftover glossary
* PVF: Remove `rayon` and some uses of `tokio`
1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed.
2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1]
[^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.
3. Since `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop.
4. The order of thread selection was flipped to make (3) sound (see note in code).
I left some TODOs related to panics which I'm going to address soon as part of https://github.com/paritytech/polkadot/issues/7045.
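A minimal sketch of points 2 and 3 above, not the actual worker code: a job is spawned through `std::thread::Builder` with an explicit stack size and then polled in a naive loop instead of being `select!`-ed on as a future. The names `EXECUTE_THREAD_STACK_SIZE` and `run_with_stack_size`, the stack size value, and the polling interval are all assumptions made for illustration.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical stack size for the execution thread (assumed value).
const EXECUTE_THREAD_STACK_SIZE: usize = 2 * 1024 * 1024;

/// Spawns `job` on a dedicated thread with an explicit stack size and polls
/// until it has finished, then returns its result (or the panic payload).
fn run_with_stack_size<T, F>(job: F) -> thread::Result<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let handle = thread::Builder::new()
        .stack_size(EXECUTE_THREAD_STACK_SIZE)
        .spawn(job)
        .expect("spawning a thread should succeed in this sketch");

    // Naive loop: check for completion, sleep briefly, repeat.
    while !handle.is_finished() {
        thread::sleep(Duration::from_millis(1));
    }

    // `join` returns `Err` with the panic payload if the job panicked.
    handle.join()
}
```

With this shape a panic in the job surfaces as an `Err` from `join`, which is where the panic-related TODOs would hook in.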
* PVF: Vote invalid on panics in execution thread (after a retry)
Also make sure we kill the worker process on panic errors and internal errors to
potentially clear any error states independent of the candidate.
* Address a couple of TODOs
Addresses a couple of follow-up TODOs from
https://github.com/paritytech/polkadot/pull/7153.
* Add some documentation to implementer's guide
* Fix compile error
* Fix compile errors
* Fix compile error
* Update roadmap/implementers-guide/src/node/utility/candidate-validation.md
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
* Address comments + couple other changes (see message)
- Measure the CPU time in the prepare thread, so the observed time is not
affected by any delays in joining on the thread.
- Measure the full CPU time in the execute thread.
* Implement proper thread synchronization
Use condvars i.e. `Arc::new((Mutex::new(true), Condvar::new()))` as per the std
docs.
Considered also using a condvar to signal the CPU thread to end, in place of an
mpsc channel. This was not done because `Condvar::wait_timeout_while` is
documented as being imprecise, and `mpsc::Receiver::recv_timeout` is not
documented as such. Also, we would need a separate condvar, to avoid this case:
the worker thread finishes its job, notifies the condvar, the CPU thread returns
first, and we join on it and not the worker thread. So it was simpler to leave
this part as is.
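For reference, the condvar pattern from the std docs looks roughly like the sketch below. The helper name `run_and_wait` is hypothetical, and the flag is read as "still pending" to match the `Mutex::new(true)` initial value above; the real worker code may differ.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Runs `job` on a worker thread and blocks until the worker signals
/// completion through the shared condvar.
fn run_and_wait<T, F>(job: F) -> T
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    // `true` means "still pending" in this sketch.
    let pair = Arc::new((Mutex::new(true), Condvar::new()));
    let worker_pair = Arc::clone(&pair);

    let worker = thread::spawn(move || {
        let result = job();
        let (lock, cvar) = &*worker_pair;
        *lock.lock().unwrap() = false;
        cvar.notify_all();
        result
    });

    let (lock, cvar) = &*pair;
    // `wait_while` re-checks the flag, so spurious wakeups are harmless.
    let guard = cvar.wait_while(lock.lock().unwrap(), |pending| *pending).unwrap();
    drop(guard);

    worker.join().expect("the job does not panic in this sketch")
}
```

Note that if `job` panics here, the flag is never cleared and the waiter blocks forever, so the notification has to happen on every exit path of the worker.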
* Catch panics in threads so we always notify condvar
* Use `WaitOutcome` enum instead of bool condition variable
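A rough sketch of how catching panics and an outcome enum can fit together. The variant names, the `Cond` alias, and the helpers `spawn_notifying` and `wait_for_outcome` are assumptions for illustration and may not match the worker code.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical outcome shared between threads (variant names assumed).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum WaitOutcome {
    Finished,
    TimedOut,
    Pending,
}

type Cond = Arc<(Mutex<WaitOutcome>, Condvar)>;

/// Spawns `job`, catching any panic so the condvar is notified on every path.
fn spawn_notifying<T, F>(
    cond: Cond,
    outcome: WaitOutcome,
    job: F,
) -> thread::JoinHandle<thread::Result<T>>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    thread::spawn(move || {
        let result = panic::catch_unwind(AssertUnwindSafe(job));
        let (lock, cvar) = &*cond;
        *lock.lock().unwrap() = outcome;
        cvar.notify_all();
        result
    })
}

/// Blocks until some thread has replaced `Pending` with another outcome.
fn wait_for_outcome(cond: &Cond) -> WaitOutcome {
    let (lock, cvar) = &**cond;
    let guard = cvar
        .wait_while(lock.lock().unwrap(), |o| *o == WaitOutcome::Pending)
        .unwrap();
    *guard
}
```

A timer thread would pass `WaitOutcome::TimedOut` and the job thread `WaitOutcome::Finished`; the waiter then knows which thread to join first.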
* Fix retry timeouts to depend on exec timeout kind
* Address review comments
* Make the API for condvars in workers nicer
* Add a doc
* Use condvar for memory stats thread
* Small refactor
* Enumerate internal validation errors in an enum
* Fix comment
* Add a log
* Fix test
* Update variant naming
* Address a missed TODO
---------
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
* PVF: Don't dispute on missing artifact
A dispute should never be raised if the local cache doesn't provide a certain
artifact. You cannot dispute for this reason, as it is a local hardware
issue and not related to the candidate being checked.
Design:
Currently we assume that if we prepared an artifact, it remains there on-disk
until we prune it, i.e. we never check again if it's still there.
We can change it so that instead of artifact-not-found triggering a dispute, we
retry once (like we do for AmbiguousWorkerDeath, except we don't dispute if it
still doesn't work). And when enqueuing an execute job, we check for the
artifact on-disk, and start preparation if not found.
Changes:
- [x] Integration test (should fail without the following changes)
- [x] Check if artifact exists when executing, prepare if not
- [x] Return an internal error when file is missing
- [x] Retry once on internal errors
- [x] Document design (update impl guide)
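The retry rule from the design above can be sketched as follows. This is an illustration only: `ExecOutcome`, `Verdict`, `validate_with_retry` and `MAX_INTERNAL_RETRIES` are hypothetical names, and the real candidate-validation code distinguishes more error kinds.

```rust
/// Result of one execution attempt (hypothetical, simplified).
#[derive(Debug, PartialEq, Eq)]
enum ExecOutcome {
    Valid,
    Invalid,
    /// Local problem, e.g. the artifact file is missing on disk.
    InternalError,
}

/// What the validator concludes about the candidate.
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    Valid,
    /// The candidate itself is bad; this may lead to a dispute.
    Invalid,
    /// Local failure; no vote is cast and no dispute is raised.
    Abstain,
}

/// Retries once on internal errors and never maps them to `Invalid`.
fn validate_with_retry(mut attempt: impl FnMut() -> ExecOutcome) -> Verdict {
    const MAX_INTERNAL_RETRIES: usize = 1;
    let mut retries = 0;
    loop {
        match attempt() {
            ExecOutcome::Valid => return Verdict::Valid,
            ExecOutcome::Invalid => return Verdict::Invalid,
            ExecOutcome::InternalError if retries < MAX_INTERNAL_RETRIES => retries += 1,
            ExecOutcome::InternalError => return Verdict::Abstain,
        }
    }
}
```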
* Add some context to wasm error message (it is quite long)
* Fix impl guide
* Add check for missing/inaccessible file
* Add comment referencing Substrate issue
* Add test for retrying internal errors
---------
Co-authored-by: parity-processbot <>
* Setting up new ChainSelectionMessage
* Partial first pass
* Got dispute conclusion data to provisioner
* Finished first draft for 4804 code
* A bit of polish and code comments
* cargo fmt
* Implementers guide and code comments
* More formatting, and naming issues
* Wrote test for ChainSelection side of change
* Added dispute coordinator side test
* FMT
* Addressing Marcin's comments
* fmt
* Addressing further Marcin comment
* Removing unnecessary test line
* Rough draft addressing Robert changes
* Clean up and test modification
* Majorly refactored scraper change
* Minor fixes for ChainSelection
* Polish and fmt
* Condensing inclusions per candidate logic
* Addressing Tsveto's comments
* Addressing Robert's Comments
* Altered inclusions struct to use nested BTreeMaps
* Naming fix
* Fixing inclusions struct comments
* Update node/core/dispute-coordinator/src/scraping/mod.rs
Add comment to split_off() use
Co-authored-by: Marcin S. <marcin@bytedude.com>
* Optimizing removal at block height for inclusions
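A sketch of how `split_off()` can prune a nested `BTreeMap` by block height. The struct layout, the `u64` stand-ins for hashes, and the method names are assumptions for illustration, not the scraper's actual types.

```rust
use std::collections::BTreeMap;

type BlockNumber = u32;
type BlockHash = u64; // stand-in for a real hash type
type CandidateHash = u64; // stand-in for a real hash type

/// Candidate -> block number -> hashes of blocks that included it.
#[derive(Default)]
struct Inclusions {
    inner: BTreeMap<CandidateHash, BTreeMap<BlockNumber, Vec<BlockHash>>>,
}

impl Inclusions {
    fn insert(&mut self, candidate: CandidateHash, number: BlockNumber, block: BlockHash) {
        self.inner.entry(candidate).or_default().entry(number).or_default().push(block);
    }

    /// Removes every inclusion recorded at a height strictly below `height`.
    fn remove_below(&mut self, height: BlockNumber) {
        for by_number in self.inner.values_mut() {
            // `split_off` returns the entries with key >= `height`;
            // keeping only those drops everything below in one step.
            *by_number = by_number.split_off(&height);
        }
        // Forget candidates that have no inclusions left.
        self.inner.retain(|_, by_number| !by_number.is_empty());
    }

    fn contains(&self, candidate: &CandidateHash) -> bool {
        self.inner.contains_key(candidate)
    }
}
```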
* fmt
* Using copy trait
Co-authored-by: Marcin S. <marcin@bytedude.com>
* pre-checking: Reject failed PVFs
* paras: immediately reject any PVF that cannot reach a supermajority
* Make the `quorum` reject condition a bit more clear semantically
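The reject condition can be illustrated with a small tally function. This is a sketch under assumptions: the threshold helpers use the usual BFT bound of at most `(n - 1) / 3` faulty validators, and `PreCheckOutcome` and `tally` are hypothetical names rather than the runtime's API.

```rust
/// Maximum number of faulty validators tolerated among `n` (assumed bound).
fn byzantine_threshold(n: usize) -> usize {
    n.saturating_sub(1) / 3
}

/// Number of votes needed for a supermajority among `n` validators.
fn supermajority_threshold(n: usize) -> usize {
    n - byzantine_threshold(n)
}

#[derive(Debug, PartialEq, Eq)]
enum PreCheckOutcome {
    Accepted,
    Rejected,
    Undecided,
}

/// Accepts once the ayes reach a supermajority; rejects as soon as the nays
/// make a supermajority of ayes impossible, without waiting for more votes.
fn tally(n_validators: usize, ayes: usize, nays: usize) -> PreCheckOutcome {
    let accept_threshold = supermajority_threshold(n_validators);
    let reject_threshold = n_validators - accept_threshold;
    if ayes >= accept_threshold {
        PreCheckOutcome::Accepted
    } else if nays > reject_threshold {
        PreCheckOutcome::Rejected
    } else {
        PreCheckOutcome::Undecided
    }
}
```

With 10 validators the accept threshold is 7, so a fourth nay already rules out acceptance.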
* Add comment
* Update implementer's guide
* Update a link
Not related to the rest of the PR, but I randomly noticed and fixed this.
* Update runtime/parachains/src/paras/tests.rs
Co-authored-by: s0me0ne-unkn0wn <48632512+s0me0ne-unkn0wn@users.noreply.github.com>
* Remove unneeded loop
* Log PVF retries using `info!`
* Change retry logs to `warn!` and add preparation failure log
* Log PVF execution failure
* Clarify why we reject failed PVFs
* Fix PVF reject runtime benchmarks
Co-authored-by: s0me0ne-unkn0wn <48632512+s0me0ne-unkn0wn@users.noreply.github.com>
* disputes pallet: Filter disputes with votes less than supermajority threshold
* Remove `max_spam_slots` usages
* Remove `SpamSlots`
* Remove `SpamSlotChange`
* Remove `Error<T>::PotentialSpam` and stale comments
* `create_disputes_with_no_spam` -> `create_disputes`
* Make tests compile - wip commit
* Rework `test_dispute_timeout`. Rename `update_spam_slots` to `filter_dispute_set`
* Remove `dispute_statement_becoming_onesided_due_to_spamslots_is_accepted` and `filter_correctly_accounts_spam_slots` -> they bring no value with removed spam slots
* Fix `test_provide_multi_dispute_success_and_other`
* Remove an old comment
* Remove spam slots from tests - clean todo comments
* Remove test - `test_decrement_spam`
* todo comments
* Update TODO comments
* Extract `test_unconfirmed_are_ignored` as separate test case
* Remove dead code
* Fix `test_unconfirmed_are_ignored`
* Remove dead code in `filter_dispute_data`
* Fix weights (related to commit "Remove `SpamSlots`")
* Disputes migration - first try
* Remove `dispute_max_spam_slots` + storage migration
* Fix `HostConfig` migration tests
* Deprecate `SpamSlots`
* Code review feedback
* add weight for storage version update
* fix bound for clear()
* Fix weights in disputes migration
* Revert "Deprecate `SpamSlots`"
This reverts commit 8c4d967c7b061abd76ba8b551223918c0b9e6370.
* Make mod migration public
* Remove `SpamSlots` from disputes pallet and use `storage_alias` in the migration
* Fix call to `clear()` for `SpamSlots` in migration
* Update migration and add a `try-runtime` test
* Add `pre_upgrade` `try-runtime` test
* Fix some test names in `HostConfiguration` migration
* Link spamslots migration in all runtimes
* Add `test_unconfirmed_disputes_cause_block_import_error`
* Update guide
- Remove `SpamSlots` related information from roadmap/implementers-guide/src/runtime/disputes.md
- Add 'Disputes filtering' to Runtime section of the Implementor's guide
* Update runtime/parachains/src/configuration/migration.rs
Co-authored-by: Marcin S. <marcin@bytedude.com>
* Code review feedback - update logs
* Code review feedback: fix weights
* Update runtime/parachains/src/disputes.rs
Co-authored-by: s0me0ne-unkn0wn <48632512+s0me0ne-unkn0wn@users.noreply.github.com>
* Additional logs in disputes migration
* Fix merge conflicts
* Add version checks in try-runtime tests
* Fix a compilation warning
Co-authored-by: Marcin S. <marcin@bytedude.com>
Co-authored-by: s0me0ne-unkn0wn <48632512+s0me0ne-unkn0wn@users.noreply.github.com>
* Passed candidate events from scraper to participation
* First draft PR 5875
* Added support for timestamp in changes
* Some necessary refactoring
* Removed SessionIndex from unconfirmed_disputes key
* Removed duplicate logic in import statements
* Replaced queue_participation call with re-prio
* Simplifying refactor. Backed were already handled
* Removed unneeded spam slots logic
* Implementers guide edits
* Undid the spam slots refactor
* Added comments and implementers guide edit
* Added test for participation upon backing
* Round of fixes + ran fmt
* Round of changes + fmt
* Error handling draft
* Changed errors to bubble up from reprioritization
* Starting to construct new test
* Clarifying participation function rename
* Reprio test draft
* Very rough bump to priority queue test draft
* Improving logging
* Most concise reproduction of error on third import
* Add `handle_approval_vote_request`
* Removing reprioritization on included event test
* Removing unneeded test config
* cargo fmt
* Test works
* Fixing final nits
* Tweaks to test Tsveto figured out
Co-authored-by: eskimor <eskimor@no-such-url.com>
Co-authored-by: Tsvetomir Dimitrov <tsvetomir@parity.io>
* Put in skeleton logic for CPU-time-preparation
Still needed:
- Flesh out logic
- Refactor some spots
- Tests
* Continue filling in logic for prepare worker CPU time changes
* Fix compiler errors
* Update lenience factor
* Fix some clippy lints for PVF module
* Fix compilation errors
* Address some review comments
* Add logging
* Add another log
* Address some review comments; change Mutex to AtomicBool
* Refactor handling response bytes
* Add CPU clock timeout logic for execute jobs
* Properly handle AtomicBool flag
* Use `Ordering::Relaxed`
* Refactor thread coordination logic
* Fix bug
* Add some timing information to execute tests
* Add section about the mitigation to the IG
* minor: Change more `Ordering`s to `Relaxed`
* candidate-validation: Fix build errors
* Add PVF module documentation
TODO (once the PRs land):
- [ ] Document executor parametrization.
- [ ] Document CPU time measurement of timeouts.
* Update node/core/pvf/src/lib.rs
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
* Clarify meaning of PVF acronym
* Move PVF doc to implementer's guide
* Clean up implementer's guide a bit
* Add page for PVF types
* pvf: Better separation between crate docs and implementer's guide
* ci: Add "prevalidating" to the dictionary
* ig: Remove types/chain.md
The types contained therein did not exist and the file was not referenced
anywhere.
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
* Change best effort queue behaviour in `dispute-coordinator`
Use the same type of queue (`BTreeMap<CandidateComparator,
ParticipationRequest>`) for best effort and priority in
`dispute-coordinator`.
Rework `CandidateComparator` to handle unavailable parent
block numbers.
Best effort queue will order disputes the same way as priority does - by
parent's block height. Disputes on candidates for which the parent's
block number can't be obtained will be treated with the lowest priority.
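The ordering described above can be sketched with a comparator whose block number is optional. The field names, the `u64` stand-in for the candidate hash, and the choice of ascending height are assumptions for illustration.

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;

/// Hypothetical, simplified comparator used as the queue key.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CandidateComparator {
    /// `None` when the parent's block number could not be obtained.
    relay_parent_block_number: Option<u32>,
    /// Stand-in for the candidate hash, used as a tie-breaker.
    candidate_hash: u64,
}

impl Ord for CandidateComparator {
    fn cmp(&self, other: &Self) -> Ordering {
        let by_height = match (self.relay_parent_block_number, other.relay_parent_block_number) {
            (Some(a), Some(b)) => a.cmp(&b),
            // A known height always sorts before an unknown one.
            (Some(_), None) => Ordering::Less,
            (None, Some(_)) => Ordering::Greater,
            (None, None) => Ordering::Equal,
        };
        by_height.then(self.candidate_hash.cmp(&other.candidate_hash))
    }
}

impl PartialOrd for CandidateComparator {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Returns the queued values in the order they would be dequeued.
fn dequeue_order<'a>(queue: &BTreeMap<CandidateComparator, &'a str>) -> Vec<&'a str> {
    queue.values().copied().collect()
}
```

Because `BTreeMap` iterates in key order, both queues get the same ordering for free once they share this key type.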
* Fix tests: Handle `ChainApiMessage::BlockNumber` in `handle_sync_queries`
* Some tests are deadlocking on sending messages via overseer so change `SingleItemSink` to `mpsc::Sender` with a buffer of 1
* Fix a race in test after adding a buffered queue for overseer messages
* Fix the rest of the tests
* Guide update - best-effort queue
* Guide update: clarification about spam votes
* Fix tests in `availability-distribution`
* Update comments
* Add `make_buffered_subsystem_context` in `subsystem-test-helpers`
* Code review feedback
* Code review feedback
* Code review feedback
* Don't add best effort candidate if it is already in priority queue
* Remove an old comment
* Fix insert in best_effort
* Scraper processes CandidateBacked events
* Change definition of best-effort
* Fix `dispute-coordinator` tests
* Unit test for dispute filtering
* Clarification comment
* Add tests
* Fix logic
If a dispute is not backed, not included and not confirmed we
don't participate but we do import votes.
* Add metrics for refrained participations
* Revert "Add tests"
This reverts commit 7b8391a087922ced942cde9cd2b50ff3f633efc0.
* Revert "Unit test for dispute filtering"
This reverts commit 92ba5fe678214ab360306313a33c781338e600a0.
* fix dispute-coordinator tests
* Fix scraping
* new tests
* Small fixes in guide
* Apply suggestions from code review
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
* Fix some comments and remove a pointless test
* Code review feedback
* Clarification comment in tests
* Some tests
* Reference counted `CandidateHash` in scraper
* Proper handling for Backed and Included candidates in scraper
Backed candidates which are not included should be kept for a
predetermined window of finalized blocks. E.g. if a candidate is backed
but not included in block 2, and the window size is 2, the same
candidate should be cleaned after block 4 is finalized.
Add reference counting for candidates in scraper. A candidate can be
added on multiple block heights so we have to make sure we don't clean
it prematurely from the scraper.
Add tests.
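A sketch of the reference counting and pruning window described above. The type `ScrapedCandidates`, its methods, and the integer stand-ins for hashes and block numbers are hypothetical; the example only reproduces the rule that a candidate seen at block 2 with a window of 2 is pruned once block 4 is finalized.

```rust
use std::collections::hash_map::Entry;
use std::collections::{BTreeMap, HashMap, HashSet};

#[derive(Default)]
struct ScrapedCandidates {
    /// How many block heights currently reference each candidate.
    refcounts: HashMap<u64, usize>,
    /// Candidates seen at each block height.
    by_block: BTreeMap<u32, HashSet<u64>>,
}

impl ScrapedCandidates {
    fn insert(&mut self, block: u32, candidate: u64) {
        // Count each (block, candidate) pair only once.
        if self.by_block.entry(block).or_default().insert(candidate) {
            *self.refcounts.entry(candidate).or_insert(0) += 1;
        }
    }

    fn contains(&self, candidate: &u64) -> bool {
        self.refcounts.contains_key(candidate)
    }

    /// Prunes everything recorded at heights `<= finalized - window`.
    fn prune_finalized(&mut self, finalized: u32, window: u32) {
        let cutoff = match finalized.checked_sub(window) {
            Some(cutoff) => cutoff,
            None => return,
        };
        let kept = self.by_block.split_off(&cutoff.saturating_add(1));
        let removed = std::mem::replace(&mut self.by_block, kept);
        for candidate in removed.into_values().flatten() {
            if let Entry::Occupied(mut entry) = self.refcounts.entry(candidate) {
                *entry.get_mut() -= 1;
                // Forget the candidate only when no height references it.
                if *entry.get() == 0 {
                    entry.remove();
                }
            }
        }
    }
}
```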
* Update comments in tests
* Guide update
* Fix cleanup logic for `backed_candidates_by_block_number`
* Simplify cleanup
* Make spellcheck happy
* Update tests
* Extract candidate backing logic in separate struct
* Code review feedback
* Treat backed and included candidates in the same fashion
* Update some comments
* Small improvements in test
* spell check
* Fix some more comments
* clean -> prune
* Code review feedback
* Reword comment
* spelling
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
* Add a `last change` footer to the implementers guide
Some of the newcomers were noticing outdated pages in the implementer's guide.
This idea came up as a heuristic for how up-to-date an individual page is.
* Update `build-implementers-guide` CI job
* Rename timeout consts and timeout parameter; bump leniency
* Update implementor's guide with info about PVFs
* Make glossary a bit easier to read
* Add a note to LENIENT_PREPARATION_TIMEOUT
* Remove PVF-specific section from glossary
* Fix some typos
* Don't import backing statements directly
into the dispute coordinator. This also gets rid of a redundant
signature check. Both should have some impact on backing performance.
In general this PR should make us scale better in the number of parachains.
Reasoning (aka why this is fine):
For the signature check: As mentioned, it is a redundant check. The
signature has already been checked at this point. This is even made
obvious by the used types. The smart constructor is not perfect as
discussed [here](https://github.com/paritytech/polkadot/issues/3455),
but still provides reasonable security.
For not importing to the dispute-coordinator: This should be good as the
dispute coordinator does scrape backing votes from chain. This suffices
in practice as a supermajority of validators must have seen a backing
fork in order for a candidate to get included and only included
candidates pose a threat to our system. The import from chain is
preferable over direct import of backing votes for two reasons:
1. The import is batched, greatly improving import performance. All
backing votes for a candidate are imported with a single import.
And indeed we were able to see in metrics that importing votes
from chain is fast.
2. We do less work in general as not every candidate for which
statements are gossiped might actually make it on a chain. With the
current implementation, the dispute coordinator would still
import and keep those votes around for six sessions.
While redundancy is good for reliability in the event of bugs, this also
comes at a non-negligible cost. The dispute-coordinator right now is the
subsystem with the highest load, despite the fact that it should not be
doing much during normal operation, and it is only getting worse
with more parachains as the load is a direct function of the number of statements.
We'll see on Versi how much of a performance improvement this PR brings.
* Get rid of dead code.
* Don't send approval vote
* Make it pass CI
* Bring back tests for fixing them later.
* Explicit signature check.
* Resurrect approval-voting tests (not fixed yet)
* Send out approval votes in dispute-distribution.
Use BTreeMap for ordered dispute votes.
* Bring back an important warning.
* Fix approval voting tests.
* Don't send out dispute message on import + test
+ Some cleanup.
* Guide changes.
Note that the introduced complexity is actually redundant.
* WIP: guide changes.
* Finish guide changes about dispute-coordinator
conceptually. Still requires more proofreading.
Also removed obsolete implementation details, where the code is better
suited as the source of truth.
* Finish guide changes for now.
* Remove own approval vote import logic.
* Implement logic for retrieving approval-votes
into approval-voting and approval-distribution subsystems.
* Update roadmap/implementers-guide/src/node/disputes/dispute-coordinator.md
Co-authored-by: asynchronous rob <rphmeier@gmail.com>
* Review feedback.
In particular: Add note about disputes of non included candidates.
* Incorporate Review Remarks
* Get rid of superfluous space.
* Tidy up import logic a bit.
Logical vote import is now separated, making the code more readable and
maintainable.
Also: Accept import if there is at least one invalid signer that has not
exceeded its spam slots, instead of requiring all of them to not exceed
their limits. This is more correct and a preparation for vote batching.
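The difference between the old and new condition comes down to `all` versus `any`. The sketch below models only that check; `InvalidSigner` and its `within_spam_limit` field are hypothetical, and how the limit is computed is left out.

```rust
/// Hypothetical view of a validator that signed an `invalid` vote.
struct InvalidSigner {
    /// Whether this signer is still within its spam slot limit.
    within_spam_limit: bool,
}

/// Old rule: every invalid signer must be within its limit.
fn accept_import_strict(signers: &[InvalidSigner]) -> bool {
    signers.iter().all(|s| s.within_spam_limit)
}

/// New rule: a single invalid signer within its limit is enough.
/// Returns `false` for an empty slice in this sketch.
fn accept_import(signers: &[InvalidSigner]) -> bool {
    signers.iter().any(|s| s.within_spam_limit)
}
```

Under the new rule a batch of votes is no longer rejected just because one of its signers has used up its slots.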
* We don't need/have empty imports.
* Fix tests and bugs.
* Remove error prone redundancy.
* Import approval votes on dispute initiated/concluded.
* Add test for approval vote import.
* Make guide checker happy (hopefully)
* Another sanity check + better logs.
* Reasoning about boundedness.
* Use `CandidateIndex` as opposed to `CoreIndex`.
* Remove redundant import.
* Review remarks.
* Add metric for calls to request signatures
* More review remarks.
* Add metric on imported approval votes.
* Include candidate hash in logs.
* More trace log
* Break cycle.
* Add some tracing.
* Cleanup allowed messages.
* fmt
* Tracing + timeout for get inherent data.
* Better error.
* Break cycle in all places.
* Clarified comment some more.
* Typo.
* Break cycle approval-distribution - approval-voting.
Co-authored-by: asynchronous rob <rphmeier@gmail.com>
* explicitly tag network requests with version
* fmt
* make PeerSet more aware of versioning
* some generalization of the network bridge to support upgrades
* walk back some renaming
* walk back some version stuff
* extract version from fallback
* remove V1 from NetworkBridgeUpdate
* add accidentally-removed timer
* implement focusing for versioned messages
* fmt
* fix up network bridge & tests
* remove inaccurate version check in bridge
* remove some TODO [now]s
* fix fallout in statement distribution
* fmt
* fallout in gossip-support
* fix fallout in collator-protocol
* fix fallout in bitfield-distribution
* fix fallout in approval-distribution
* fmt
* use never!
* fmt
* gossip-support: be explicit about dimensions
* some guide updates
* update network-bridge to distinguish x and y dimensions
* get everything to compile
* beginnings
* some TODOs
* polkadot runtime: use relevant_authorities
* make gossip topologies per-session
* better formatting
* gossip support: use current session validators
* expand in comment
* adjust tests and fix index bug
* add past/present/future connection test and clean up code
* fmt
* network bridge: updated types
* update protocols to new gossip topology message
* guide updates
* add session to BlockApprovalMeta
* add session to block info
* refactor knowledge and remove most unify logic
* start replacing gossip_peers with new SessionTopologies
* add routing information to message state
* add some utilities to SessionTopology
* implement new gossip topology logic
* re-implement unify_with_peer
* distribute assignments according to topology
* finish grid topology implementation
* refactor network bridge slightly
* issue connection requests on all past/present/future
* fmt
* address grumbles
* tighten invariants in unify_with_peer
* implement random propagation
* refactor: extract required routing adjustment logic
* some block-age logic
* aggressively propagate messages when finality is slow
* overhaul aggression system to have 3 levels
* add aggression metrics
* remove aggression L3
* reduce random circulation
* remove PeerData
* get approval tests compiling
* use btree_map in known_by to make deterministic
* Revert "use btree_map in known_by to make deterministic"
This reverts commit 330d65343a7bb6fe4dd0f24bd8dbc15c0cbdbd9d.
* test XY grid propagation
* remove stray println
* test unshared dimension propagation
* add random gossip check
* test unify_with_peer better
* test sending after getting gossip topology
* test L1 aggression on originator
* test L1 aggression for non-originators
* test non-originator aggression L2
* fmt
* ~spellcheck
* fix statement-distribution tests
* fix flaky test
* fix metrics typo
* re-send periodically
* test resending
* typo
Co-authored-by: Bernhard Schuster <bernhard@ahoi.io>
* add more metrics about apd messages
* add back unify_with_peer logs
* make Resend an enum
* be more explicit when resending
* fmt
* fix error
* add a TODO for refactoring
* remove debug metrics
* add some guide stuff
* fmt
* update runtime API in test-runtime
Co-authored-by: Bernhard Schuster <bernhard@ahoi.io>
* remove v0 primitives from polkadot-primitives
* first pass: remove v0
* fix fallout in erasure-coding
* remove v1 primitives, consolidate to v2
* the great import update
* update runtime_api_impl_v1 to v2 as well
* guide: add `Version` request for runtime API
* add version query to runtime API
* reintroduce OldV1SessionInfo in a limited way
* First step in implementing https://github.com/paritytech/polkadot/issues/4386
This PR:
- Reduces MAX_UNSHARED_UPLOAD_TIME to 150ms
- Increases timeout on collation fetching to 1200ms
- Reduces limit on needed backing votes in the runtime
This PR does not yet reduce the number of needed backing votes on the
node as this can only be meaningfully enacted once the changed limit in
the runtime is live.
* Fix tests.
* Guide updates.
* Review remarks.
* Bump minimum required backing votes to 2 in runtime.
* Make sure node side code won't make runtime vomit.
* cargo +nightly fmt