* PVF: Refactor workers into separate crates, remove host dependency
* Fix compile error
* Remove some leftover code
* Fix compile errors
* Update Cargo.lock
* Remove worker main.rs files
I accidentally copied these from the other PR. This PR isn't intended to
introduce standalone workers yet.
* Address review comments
* cargo fmt
* Update a couple of comments
* Update log targets
* PVF: Remove `rayon` and some uses of `tokio`
1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed.
2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1]
[^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.
3. Since `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop.
4. The order of thread selection was flipped to make (3) sound (see note in code).
I left some TODO's related to panics which I'm going to address soon as part of https://github.com/paritytech/polkadot/issues/7045.
* PVF: Vote invalid on panics in execution thread (after a retry)
Also make sure we kill the worker process on panic errors and internal errors to
potentially clear any error states independent of the candidate.
* Address a couple of TODOs
Addresses a couple of follow-up TODOs from
https://github.com/paritytech/polkadot/pull/7153.
* Add some documentation to implementer's guide
* Fix compile error
* Fix compile errors
* Fix compile error
* Update roadmap/implementers-guide/src/node/utility/candidate-validation.md
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
* Address comments + couple other changes (see message)
- Measure the CPU time in the prepare thread, so the observed time is not
affected by any delays in joining on the thread.
- Measure the full CPU time in the execute thread.
* Implement proper thread synchronization
Use condvars i.e. `Arc::new((Mutex::new(true), Condvar::new()))` as per the std
docs.
Considered also using a condvar to signal the CPU thread to end, in place of an
mpsc channel. This was not done because `Condvar::wait_timeout_while` is
documented as being imprecise, and `mpsc::Receiver::recv_timeout` is not
documented as such. Also, we would need a separate condvar, to avoid this case:
the worker thread finishes its job, notifies the condvar, the CPU thread returns
first, and we join on it and not the worker thread. So it was simpler to leave
this part as is.
* Catch panics in threads so we always notify condvar
* Use `WaitOutcome` enum instead of bool condition variable
* Fix retry timeouts to depend on exec timeout kind
* Address review comments
* Make the API for condvars in workers nicer
* Add a doc
* Use condvar for memory stats thread
* Small refactor
* Enumerate internal validation errors in an enum
* Fix comment
* Add a log
* Fix test
* Update variant naming
* Address a missed TODO
---------
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
* PVF: Don't dispute on missing artifact
A dispute should never be raised if the local cache doesn't provide a certain
artifact. You can not dispute based on this reason, as it is a local hardware
issue and not related to the candidate to check.
Design:
Currently we assume that if we prepared an artifact, it remains there on-disk
until we prune it, i.e. we never check again if it's still there.
We can change it so that instead of artifact-not-found triggering a dispute, we
retry once (like we do for AmbiguousWorkerDeath, except we don't dispute if it
still doesn't work). And when enqueuing an execute job, we check for the
artifact on-disk, and start preparation if not found.
Changes:
- [x] Integration test (should fail without the following changes)
- [x] Check if artifact exists when executing, prepare if not
- [x] Return an internal error when file is missing
- [x] Retry once on internal errors
- [x] Document design (update impl guide)
* Add some context to wasm error message (it is quite long)
* Fix impl guide
* Add check for missing/inaccessible file
* Add comment referencing Substrate issue
* Add test for retrying internal errors
---------
Co-authored-by: parity-processbot <>
* Happy New Year!
* Remove year entierly
Co-authored-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>
* Remove years from copyright notice in the entire repo
---------
Co-authored-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>
* Move version check to `worker_event_loop`
* More minor refactors
- More consistent use of `format_invalid` and `format_internal`.
- Fix a doc error.
- Fix some poorly-named local variables.
* Check spawned worker version vs node version before PVF preparation
* Address discussions
* Propagate errors and shutdown preparation and execution pipelines properly
* Add logs; Fix execution worker checks
* Revert "Propagate errors and shutdown preparation and execution pipelines properly"
This reverts commit b96cc3160ff58db5ff001d8ca0bfea9bd4bdd0f2.
* Don't try to shut down; report the condition and exit worker
* Get rid of `VersionMismatch` preparation error
* Merge master
* Add docs; Fix tests
* Update Cargo.lock
* Kill again, but only the main node process
* Move unsafe code to a common safe function
* Fix libc dependency error on MacOS
* pvf spawning: Add some logging, add a small integration test
* Minor fixes
* Restart CI
---------
Co-authored-by: Marcin S <marcin@realemail.net>
* Change `MaxMemorySize` to `MaxMemoryPages`
We should set the max memory for the executor in pages (64KiB) and not in bytes.
The wasm memory is always a multiple of a page and we should use the
same terminology.
* FMT
* Fix warning
* Re-apply changes without Diener, rebase to the lastest master
* Cache pruning
* Bit-pack InstantiationStrategy
* Move ExecutorParams version inside the structure itself
* Rework runtime API and executor parameters storage
* Pass executor parameters through backing subsystem
* Update Cargo.lock
* Introduce `ExecutorParams` to approval voting subsys
* Introduce `ExecutorParams` to dispute coordinator
* `cargo fmt`
* Simplify requests from backing subsys
* Fix tests
* Replace manual config cloning with `.clone()`
* Move constants to module
* Parametrize executor performing PVF pre-check
* Fix Malus
* Fix test runtime
* Introduce session executor params as a constant defined by session info
pallet
* Use Parity SCALE codec instead of hand-crafted binary encoding
* Get rid of constants; Add docs
* Get rid of constants
* Minor typo
* Fix Malus after rebase
* `cargo fmt`
* Use transparent SCALE encoding instead of explicit
* Clean up
* Get rid of relay parent to session index mapping
* Join environment type and version in a single enum element
* Use default execution parameters if running an old runtime
* `unwrap()` -> `expect()`
* Correct API version
* Constants are back in town
* Use constants for execution environment types
* Artifact separation, first try
* Get rid of explicit version
* PVF execution queue worker separation
* Worker handshake
* Global renaming
* Minor fixes resolving discussions
* Two-stage requesting of executor params to make use of runtime API cache
* Proper error handling in pvf-checker
* Executor params storage bootstrapping
* Propagate migration to v3 network runtimes
* Fix storage versioning
* Ensure `ExecutorParams` serialization determinism; Add comments
* Rename constants to make things a bit more deterministic
Get rid of stale code
* Tidy up a structure of active PVFs
* Minor formatting
* Fix comment
* Add try-runtime hooks
* Add storage version write on upgrade
Co-authored-by: Andronik <write@reusable.software>
* Add pre- and post-upgrade assertions
* Require to specify environment type; Remove redundant `impl`s
* Add `ExecutorParamHash` creation from `H256`
* Fix candidate validation subsys tests
* Return splittable error from executor params request fn
* Revert "Return splittable error from executor params request fn"
This reverts commit a0b274177d8bb2f6e13c066741892ecd2e72a456.
* Decompose approval voting metrics
* Use more relevant errors
* Minor formatting fix
* Assert a valid environment type instead of checking
* Fix `try-runtime` hooks
* After-merge fixes
* Add migration logs
* Remove dead code
* Fix tests
* Fix tests
* Back to the strongly typed implementation
* Promote strong types to executor interface
* Remove stale comment
* Move executor params to `SessionInfo`: primitives and runtime
* Move executor params to `SessionInfo`: node
* Try to bump primitives and API version
* Get rid of `MallocSizeOf`
* Bump target API version to v4
* Make use of session index already in place
* Back to v3
* Fix all the tests
* Add migrations to all the runtimes
* Make use of existing `SessionInfo` in approval voting subsys
* Rename `TARGET` -> `LOG_TARGET`
* Bump all the primitives to v3
* Fix Rococo ParachainHost API version
* Use `RollingSessionWindow` to acquire `ExecutorParams` in disputes
* Fix nits from discussions; add comments
* Re-evaluate queue logic
* Rework job assignment in execution queue
* Add documentation
* Use `RuntimeInfo` to obtain `SessionInfo` (with blackjack and caching)
* Couple `Pvf` with `ExecutorParams` wherever possible
* Put members of `PvfWithExecutorParams` under `Arc` for cheap cloning
* Fix comment
* Fix CI tests
* Fix clippy warnings
* Address nits from discussions
* Add a placeholder for raw data
* Fix non exhaustive match
* Remove redundant reexports and fix imports
* Keep only necessary semantic features, as discussed
* Rework `RuntimeInfo` to support mock implementation for tests
* Remove unneeded bound
* `cargo fmt`
* Revert "Remove unneeded bound"
This reverts commit 932463f26b00ce290e1e61848eb9328632ef8a61.
* Fix PVF host tests
* Fix PVF checker tests
* Fix overseer declarations
* Simplify tests
* `MAX_KEEP_WAITING` timeout based on `BACKGING_EXECUTION_TIMEOUT`
* Add a unit test for varying executor parameters
* Minor fixes from discussions
* Add prechecking max. memory parameter (see paritytech/srlabs_findings#110)
* Fix and improve a test
* Remove `ExecutionEnvironment` and `RawData`
* New primitives versioning in parachain host API
* `disputes()` implementation for Kusama and Polkadot
* Move `ExecutorParams` from `vstaging` to stable primitives
* Move disputes from `vstaging` to stable implementation
* Fix `try-runtime`
* Fixes after merge
* Move `ExecutorParams` to the bottom of `SessionInfo`
* Revert "Move executor params to `SessionInfo`: primitives and runtime"
This reverts commit dfcfb85fefd1c5be6c8a8f72dc09fd1809cfa9ce.
* Always use fresh activated live hash in pvf precheck
(re-apply 34b09a4c20de17e7926ed942cd0d657d18f743fa)
* Fixing tests (broken commit)
* Fix candidate validation tests
* Fix PVF host test
* Minor fixes
* Address discussions
* Restore migration
* Fix `use` to only include what is needed instead of `*`
* Add comment to never touch `DEFAULT_CONFIG`
* Update migration to set default `ExecutorParams` for `dispute_period`
sessions back
* Use `earliest_stored_session` instead of calculations
* Nit
* Add logs
* Treat any runtime error as `NotSupported` again
* Always return default executor params if not available
* Revert "Always return default executor params if not available"
This reverts commit b58ac4482ef444c67a9852d5776550d08e312f30.
* Add paritytech/substrate#9997 workaround
* `cargo fmt`
* Remove migration (again!)
* Bump executor params to API v4 (backport from #6698)
---------
Co-authored-by: Andronik <write@reusable.software>
* Refactor PVF preparation memory stats
The original purpose of this change was to gate metrics that are unsupported by
some systems behind conditional compilation directives (#[cfg]); see
https://github.com/paritytech/polkadot/pull/6675#discussion_r1099996209.
Then I started doing some random cleanups and simplifications and got a bit
carried away. 🙈 The code should be overall tidier than before.
Changes:
- Don't register unsupported metrics (e.g. `max_rss` on non-Linux systems)
- Introduce `PrepareStats` struct as an abstraction over the `Ok` values of
`PrepareResult`. It is cleaner, and can be easily modified in the future.
- Other small changes
* Minor fixes to comments
* Fix compile errors
* Try to fix some Linux errors
* Mep
* Fix candidate-validation tests
* Update docstring
* Introduce jemalloc-stats feature flag
* remove unneeded space
* Update node/overseer/src/lib.rs
Co-authored-by: Marcin S. <marcin@bytedude.com>
* Update Cargo.toml
Co-authored-by: Marcin S. <marcin@bytedude.com>
* revert making tikv-jemallocator depend on jemalloc-stats
* conditionally import memory_stats instead of using dead_code
* fix test via expllicit import
* Add jemalloc-stats feature to crates, propagate it from root
* Apply `jemalloc-stats` feature to prepare mem stats; small refactor
* effect changes recommended on PR
* Update node/overseer/src/metrics.rs
Co-authored-by: Marcin S. <marcin@bytedude.com>
* fix compile error on in pipeline for linux. missing import
* Update node/overseer/src/lib.rs
Co-authored-by: Bastian Köcher <git@kchr.de>
* revert to defining collect_memory_stats inline
---------
Co-authored-by: Marcin S. <marcin@bytedude.com>
Co-authored-by: Marcin S <marcin@realemail.net>
Co-authored-by: Bastian Köcher <git@kchr.de>
* Add getrusage and memory tracker for precheck preparation
* Log memory stats metrics after prechecking
* Fix tests
* Try to fix errors (linux-only so I'm relying on CI here)
* Try to fix CI
* Add module docs for `prepare/memory_stats.rs`; fix CI error
* Report memory stats for all preparation jobs
* Use `RUSAGE_SELF` instead of `RUSAGE_THREAD`
Not sure why I did that -- was a brainfart on my end.
* Revert last commit (RUSAGE_THREAD is correct)
* Use exponential buckets
* Use `RUSAGE_SELF` for `getrusage`; enable `max_rss` metric for MacOS
* Increase poll interval
* Revert "Use `RUSAGE_SELF` for `getrusage`; enable `max_rss` metric for MacOS"
This reverts commit becf7a815409ab530fc61370abffcd1b97b9a777.
* pre-checking: Reject failed PVFs
* paras: immediately reject any PVF that cannot reach a supermajority
* Make the `quorum` reject condition a bit more clear semantically
* Add comment
* Update implementer's guide
* Update a link
Not related to the rest of the PR, but I randomly noticed and fixed this.
* Update runtime/parachains/src/paras/tests.rs
Co-authored-by: s0me0ne-unkn0wn <48632512+s0me0ne-unkn0wn@users.noreply.github.com>
* Remove unneeded loop
* Log PVF retries using `info!`
* Change retry logs to `warn!` and add preparation failure log
* Log PVF execution failure
* Clarify why we reject failed PVFs
* Fix PVF reject runtime benchmarks
Co-authored-by: s0me0ne-unkn0wn <48632512+s0me0ne-unkn0wn@users.noreply.github.com>
* Replace async-std with tokio in PVF subsystem
* Rework workers to use `select!` instead of a mutex
The improvement in code readability is more important than the thread overhead.
* Remove unnecessary `fuse`
* Add explanation for `expect()`
* Update node/core/pvf/src/worker_common.rs
Co-authored-by: Bastian Köcher <info@kchr.de>
* Update node/core/pvf/src/worker_common.rs
Co-authored-by: Bastian Köcher <info@kchr.de>
* Address some review comments
* Shutdown tokio runtime
* Run cargo fmt
* Add a small note about retries
* Fix up merge
* Rework `cpu_time_monitor_loop` to return when other thread finishes
* Add error string to PrepareError::IoErr variant
* Log when artifacts fail to prepare
* Fix `cpu_time_monitor_loop`; fix test
* Fix text
* Fix a couple of potential minor data races.
First data race was due to logging in the CPU monitor thread even if the
job (other thread) finished. It can technically finish before or after the log.
Maybe best would be to move this log to the `select!`s, where we are guaranteed
to have chosen the timed-out branch, although there would be a bit of
duplication.
Also, it was possible for this thread to complete before we executed
`finished_tx.send` in the other thread, which would trigger an error as the
receiver has already been dropped. And right now, such a spurious error from
`send` would be returned even if the job otherwise succeeded.
* Update Cargo.lock
Co-authored-by: Bastian Köcher <info@kchr.de>
* PVF preparation: do not conflate errors
+ Adds some more granularity to the prepare errors.
+ Better distinguish whether errors occur on the host side or the worker.
+ Do not kill the worker if the error happened on the host side.
+ Do not retry preparation if the error was `Panic`.
+ Removes unnecessary indirection with `Selected` type.
* Add missing docs, resolve TODOs
* Address review comments and remove TODOs
* Fix error in CI
* Undo unnecessary change
* Update couple of comments
* Don't return error for stream shutdown
* Update node/core/pvf/src/worker_common.rs
* rust 1.64 enables workspace properties
* add edition, repository and authors.
* of course, update the version in one place.
Co-authored-by: Andronik <write@reusable.software>
* Put in skeleton logic for CPU-time-preparation
Still needed:
- Flesh out logic
- Refactor some spots
- Tests
* Continue filling in logic for prepare worker CPU time changes
* Fix compiler errors
* Update lenience factor
* Fix some clippy lints for PVF module
* Fix compilation errors
* Address some review comments
* Add logging
* Add another log
* Address some review comments; change Mutex to AtomicBool
* Refactor handling response bytes
* Add CPU clock timeout logic for execute jobs
* Properly handle AtomicBool flag
* Use `Ordering::Relaxed`
* Refactor thread coordination logic
* Fix bug
* Add some timing information to execute tests
* Add section about the mitigation to the IG
* minor: Change more `Ordering`s to `Relaxed`
* candidate-validation: Fix build errors
* Add clippy config and remove .cargo from gitignore
* first fixes
* Clippyfied
* Add clippy CI job
* comment out rusty-cachier
* minor
* fix ci
* remove DAG from check-dependent-project
* add DAG to clippy
Co-authored-by: alvicsam <alvicsam@gmail.com>
* westend: update transaction version
* polkadot: update transaction version
* kusama: update transaction version
* Bump spec_version to 9330
* bump versions to 0.9.33
* Add PVF module documentation
TODO (once the PRs land):
- [ ] Document executor parametrization.
- [ ] Document CPU time measurement of timeouts.
* Update node/core/pvf/src/lib.rs
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
* Clarify meaning of PVF acronym
* Move PVF doc to implementer's guide
* Clean up implementer's guide a bit
* Add page for PVF types
* pvf: Better separation between crate docs and implementer's guide
* ci: Add "prevalidating" to the dictionary
* ig: Remove types/chain.md
The types contained therein did not exist and the file was not referenced
anywhere.
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
* Fix a couple of typos
* Retry failed PVF execution
PVF execution that fails due to AmbiguousWorkerDeath should be retried once.
This should reduce the occurrence of failures due to transient conditions.
Closes#6195
* Address a couple of nits
* Write tests; refactor (add `validate_candidate_with_retry`)
* Update node/core/candidate-validation/src/lib.rs
Co-authored-by: Andronik <write@reusable.software>
Co-authored-by: eskimor <eskimor@users.noreply.github.com>
Co-authored-by: Andronik <write@reusable.software>
* Rename timeout consts and timeout parameter; bump leniency
* Update implementor's guide with info about PVFs
* Make glossary a bit easier to read
* Add a note to LENIENT_PREPARATION_TIMEOUT
* Remove PVF-specific section from glossary
* Fix some typos