* Re-apply changes without Diener, rebase to the lastest master
* Cache pruning
* Bit-pack InstantiationStrategy
* Move ExecutorParams version inside the structure itself
* Rework runtime API and executor parameters storage
* Pass executor parameters through backing subsystem
* Update Cargo.lock
* Introduce `ExecutorParams` to approval voting subsys
* Introduce `ExecutorParams` to dispute coordinator
* `cargo fmt`
* Simplify requests from backing subsys
* Fix tests
* Replace manual config cloning with `.clone()`
* Move constants to module
* Parametrize executor performing PVF pre-check
* Fix Malus
* Fix test runtime
* Introduce session executor params as a constant defined by session info
pallet
* Use Parity SCALE codec instead of hand-crafted binary encoding
* Get rid of constants; Add docs
* Get rid of constants
* Minor typo
* Fix Malus after rebase
* `cargo fmt`
* Use transparent SCALE encoding instead of explicit
* Clean up
* Get rid of relay parent to session index mapping
* Join environment type and version in a single enum element
* Use default execution parameters if running an old runtime
* `unwrap()` -> `expect()`
* Correct API version
* Constants are back in town
* Use constants for execution environment types
* Artifact separation, first try
* Get rid of explicit version
* PVF execution queue worker separation
* Worker handshake
* Global renaming
* Minor fixes resolving discussions
* Two-stage requesting of executor params to make use of runtime API cache
* Proper error handling in pvf-checker
* Executor params storage bootstrapping
* Propagate migration to v3 network runtimes
* Fix storage versioning
* Ensure `ExecutorParams` serialization determinism; Add comments
* Rename constants to make things a bit more deterministic
Get rid of stale code
* Tidy up a structure of active PVFs
* Minor formatting
* Fix comment
* Add try-runtime hooks
* Add storage version write on upgrade
Co-authored-by: Andronik <write@reusable.software>
* Add pre- and post-upgrade assertions
* Require to specify environment type; Remove redundant `impl`s
* Add `ExecutorParamHash` creation from `H256`
* Fix candidate validation subsys tests
* Return splittable error from executor params request fn
* Revert "Return splittable error from executor params request fn"
This reverts commit a0b274177d8bb2f6e13c066741892ecd2e72a456.
* Decompose approval voting metrics
* Use more relevant errors
* Minor formatting fix
* Assert a valid environment type instead of checking
* Fix `try-runtime` hooks
* After-merge fixes
* Add migration logs
* Remove dead code
* Fix tests
* Fix tests
* Back to the strongly typed implementation
* Promote strong types to executor interface
* Remove stale comment
* Move executor params to `SessionInfo`: primitives and runtime
* Move executor params to `SessionInfo`: node
* Try to bump primitives and API version
* Get rid of `MallocSizeOf`
* Bump target API version to v4
* Make use of session index already in place
* Back to v3
* Fix all the tests
* Add migrations to all the runtimes
* Make use of existing `SessionInfo` in approval voting subsys
* Rename `TARGET` -> `LOG_TARGET`
* Bump all the primitives to v3
* Fix Rococo ParachainHost API version
* Use `RollingSessionWindow` to acquire `ExecutorParams` in disputes
* Fix nits from discussions; add comments
* Re-evaluate queue logic
* Rework job assignment in execution queue
* Add documentation
* Use `RuntimeInfo` to obtain `SessionInfo` (with blackjack and caching)
* Couple `Pvf` with `ExecutorParams` wherever possible
* Put members of `PvfWithExecutorParams` under `Arc` for cheap cloning
* Fix comment
* Fix CI tests
* Fix clippy warnings
* Address nits from discussions
* Add a placeholder for raw data
* Fix non exhaustive match
* Remove redundant reexports and fix imports
* Keep only necessary semantic features, as discussed
* Rework `RuntimeInfo` to support mock implementation for tests
* Remove unneeded bound
* `cargo fmt`
* Revert "Remove unneeded bound"
This reverts commit 932463f26b00ce290e1e61848eb9328632ef8a61.
* Fix PVF host tests
* Fix PVF checker tests
* Fix overseer declarations
* Simplify tests
* `MAX_KEEP_WAITING` timeout based on `BACKGING_EXECUTION_TIMEOUT`
* Add a unit test for varying executor parameters
* Minor fixes from discussions
* Add prechecking max. memory parameter (see paritytech/srlabs_findings#110)
* Fix and improve a test
* Remove `ExecutionEnvironment` and `RawData`
* New primitives versioning in parachain host API
* `disputes()` implementation for Kusama and Polkadot
* Move `ExecutorParams` from `vstaging` to stable primitives
* Move disputes from `vstaging` to stable implementation
* Fix `try-runtime`
* Fixes after merge
* Move `ExecutorParams` to the bottom of `SessionInfo`
* Revert "Move executor params to `SessionInfo`: primitives and runtime"
This reverts commit dfcfb85fefd1c5be6c8a8f72dc09fd1809cfa9ce.
* Always use fresh activated live hash in pvf precheck
(re-apply 34b09a4c20de17e7926ed942cd0d657d18f743fa)
* Fixing tests (broken commit)
* Fix candidate validation tests
* Fix PVF host test
* Minor fixes
* Address discussions
* Restore migration
* Fix `use` to only include what is needed instead of `*`
* Add comment to never touch `DEFAULT_CONFIG`
* Update migration to set default `ExecutorParams` for `dispute_period`
sessions back
* Use `earliest_stored_session` instead of calculations
* Nit
* Add logs
* Treat any runtime error as `NotSupported` again
* Always return default executor params if not available
* Revert "Always return default executor params if not available"
This reverts commit b58ac4482ef444c67a9852d5776550d08e312f30.
* Add paritytech/substrate#9997 workaround
* `cargo fmt`
* Remove migration (again!)
* Bump executor params to API v4 (backport from #6698)
---------
Co-authored-by: Andronik <write@reusable.software>
* Refactor PVF preparation memory stats
The original purpose of this change was to gate metrics that are unsupported by
some systems behind conditional compilation directives (#[cfg]); see
https://github.com/paritytech/polkadot/pull/6675#discussion_r1099996209.
Then I started doing some random cleanups and simplifications and got a bit
carried away. 🙈 The code should be overall tidier than before.
Changes:
- Don't register unsupported metrics (e.g. `max_rss` on non-Linux systems)
- Introduce `PrepareStats` struct as an abstraction over the `Ok` values of
`PrepareResult`. It is cleaner, and can be easily modified in the future.
- Other small changes
* Minor fixes to comments
* Fix compile errors
* Try to fix some Linux errors
* Mep
* Fix candidate-validation tests
* Update docstring
* Add getrusage and memory tracker for precheck preparation
* Log memory stats metrics after prechecking
* Fix tests
* Try to fix errors (linux-only so I'm relying on CI here)
* Try to fix CI
* Add module docs for `prepare/memory_stats.rs`; fix CI error
* Report memory stats for all preparation jobs
* Use `RUSAGE_SELF` instead of `RUSAGE_THREAD`
Not sure why I did that -- was a brainfart on my end.
* Revert last commit (RUSAGE_THREAD is correct)
* Use exponential buckets
* Use `RUSAGE_SELF` for `getrusage`; enable `max_rss` metric for MacOS
* Increase poll interval
* Revert "Use `RUSAGE_SELF` for `getrusage`; enable `max_rss` metric for MacOS"
This reverts commit becf7a815409ab530fc61370abffcd1b97b9a777.
* pre-checking: Reject failed PVFs
* paras: immediately reject any PVF that cannot reach a supermajority
* Make the `quorum` reject condition a bit more clear semantically
* Add comment
* Update implementer's guide
* Update a link
Not related to the rest of the PR, but I randomly noticed and fixed this.
* Update runtime/parachains/src/paras/tests.rs
Co-authored-by: s0me0ne-unkn0wn <48632512+s0me0ne-unkn0wn@users.noreply.github.com>
* Remove unneeded loop
* Log PVF retries using `info!`
* Change retry logs to `warn!` and add preparation failure log
* Log PVF execution failure
* Clarify why we reject failed PVFs
* Fix PVF reject runtime benchmarks
Co-authored-by: s0me0ne-unkn0wn <48632512+s0me0ne-unkn0wn@users.noreply.github.com>
* Replace async-std with tokio in PVF subsystem
* Rework workers to use `select!` instead of a mutex
The improvement in code readability is more important than the thread overhead.
* Remove unnecessary `fuse`
* Add explanation for `expect()`
* Update node/core/pvf/src/worker_common.rs
Co-authored-by: Bastian Köcher <info@kchr.de>
* Update node/core/pvf/src/worker_common.rs
Co-authored-by: Bastian Köcher <info@kchr.de>
* Address some review comments
* Shutdown tokio runtime
* Run cargo fmt
* Add a small note about retries
* Fix up merge
* Rework `cpu_time_monitor_loop` to return when other thread finishes
* Add error string to PrepareError::IoErr variant
* Log when artifacts fail to prepare
* Fix `cpu_time_monitor_loop`; fix test
* Fix text
* Fix a couple of potential minor data races.
First data race was due to logging in the CPU monitor thread even if the
job (other thread) finished. It can technically finish before or after the log.
Maybe best would be to move this log to the `select!`s, where we are guaranteed
to have chosen the timed-out branch, although there would be a bit of
duplication.
Also, it was possible for this thread to complete before we executed
`finished_tx.send` in the other thread, which would trigger an error as the
receiver has already been dropped. And right now, such a spurious error from
`send` would be returned even if the job otherwise succeeded.
* Update Cargo.lock
Co-authored-by: Bastian Köcher <info@kchr.de>
* PVF preparation: do not conflate errors
+ Adds some more granularity to the prepare errors.
+ Better distinguish whether errors occur on the host side or the worker.
+ Do not kill the worker if the error happened on the host side.
+ Do not retry preparation if the error was `Panic`.
+ Removes unnecessary indirection with `Selected` type.
* Add missing docs, resolve TODOs
* Address review comments and remove TODOs
* Fix error in CI
* Undo unnecessary change
* Update couple of comments
* Don't return error for stream shutdown
* Update node/core/pvf/src/worker_common.rs
* Put in skeleton logic for CPU-time-preparation
Still needed:
- Flesh out logic
- Refactor some spots
- Tests
* Continue filling in logic for prepare worker CPU time changes
* Fix compiler errors
* Update lenience factor
* Fix some clippy lints for PVF module
* Fix compilation errors
* Address some review comments
* Add logging
* Add another log
* Address some review comments; change Mutex to AtomicBool
* Refactor handling response bytes
* Add CPU clock timeout logic for execute jobs
* Properly handle AtomicBool flag
* Use `Ordering::Relaxed`
* Refactor thread coordination logic
* Fix bug
* Add some timing information to execute tests
* Add section about the mitigation to the IG
* minor: Change more `Ordering`s to `Relaxed`
* candidate-validation: Fix build errors
* Rename timeout consts and timeout parameter; bump leniency
* Update implementor's guide with info about PVFs
* Make glossary a bit easier to read
* Add a note to LENIENT_PREPARATION_TIMEOUT
* Remove PVF-specific section from glossary
* Fix some typos
The PVF host is designed to avoid spawning tasks to minimize knowledge
of outer code. Using `async_std::task::spawn` (or Tokio's counterpart)
deemed unacceptable, `SpawnNamed` undesirable. Instead there is only one
task returned that is spawned by the candidate-validation subsystem.
The tasks from the sub-components are polled by that root task.
However, the way the tasks are bundled was incorrect. There was a giant
select that was polling those tasks. Particularly, that implies that as soon as
one of the arms of that select goes into await those sub-tasks stop
getting polled. This is a recipe for a deadlock which indeed happened
here.
Specifically, the deadlock happened during sending messages to the
execute queue by calling
[`send_execute`](https://github.com/paritytech/polkadot/blob/a68d9be35656dcd96e378fd9dd3d613af754d48a/node/core/pvf/src/host.rs#L601).
When the channel to the queue reaches the capacity, the control flow is
suspended until the queue handles those messages. Since this code is
essentially reached from [one of the select
arms](https://github.com/paritytech/polkadot/blob/a68d9be35656dcd96e378fd9dd3d613af754d48a/node/core/pvf/src/host.rs#L371),
the queue won't be given the control and thus no further progress can be
made.
This problem is solved by bundling the tasks one level higher instead,
by `selecting` over those long-running tasks.
We also stop treating returning from those long-running tasks as error
conditions, since that can happen during legit shutdown.
We wanted to change niceness to accomodate the fact that some of the
preparation tasks are low priority. For example, when a node sees that
there is a new para was onboarded the node may start preparing right
away. Since all other activities are more important, such as network I/O
or validation of the backed candidates and preparation of the
immediatelly needed PVFs.
However, it turned out that this approach does not work: generally
non-root processes can only decrease niceness and they cannot increase
it to the previous value, as was assumed by the code.
Apart from that, https://github.com/paritytech/polkadot/pull/4123
assumes all PVFs are prepared in the same way. Specifically, that if a
PVF preparation failed before, then PVF pre-checking will also report
that it was failed, even though it could happen that preparation failed
due to being low-priority. In order to avoid such cases, we decided to
simplify the whole preparation model. Preparation under low priority
does not work well with that.
Closes https://github.com/paritytech/polkadot/issues/4520
Closes https://github.com/paritytech/polkadot/issues/4293
This PR changes the way how we treat a certain subset of PVF preparation
errors. Specifically, now only the deterministic errors are treated as
invalid candidates. That is, the errors that are easily
attributable to either the the PVF contents or the wasmtime code, but
not e.g. I/O errors that could be triggered by the OS (insufficient
memory, disk failure, too much load, etc). The latter are treated as
internal errors and thus do not trigger the disputes.
* pvf host: store only compiled artifacts on disk
* Correctly handle failed artifacts
* Serialize result of PVF preparation uniquely
* Set the artifact state depending on the result
* Return the result of PVF preparation directly
* Move PrepareError to the error module
* Update doc comments
* Update misleading comment
* pvf host: turn off parallel compilation
* pvf host: implement precheck requests
* Fix warnings
* Unnecessary clone
* Add a note about timed out outcome
* Revert the pool outcome handling behavior
* Move the prepare result type into error mod
* Test prepare done
* fmt
* Add an explanation to wasmtime config
* Split pvf host test
* Add precheck to dictionary
Co-authored-by: Sergei Shulepov <sergei@parity.io>
* Limit the number of PVF workers
In particular, limit the number of preparation workers to 1 (soft &
hard) and limit the number of execution workers to 2.
The reason why we are doing this is that it seems many workers launched
at the same time can cause problems. I.e. if there are more than 2
preparation workers, the time for preparation rises significantly to the
point of reaching the timeout.
This was mostly observed with parallel_compilation=true, so each worker
used `numcpu` threads and now we are looking to flip that parameter to
`false`. That said, we want to err on the safe side here and gradually
enable it later if our measurements show that we can do that safely.
* Adjust the test to accomodate the changed config value
* pvf host: store only compiled artifacts on disk
* Correctly handle failed artifacts
* Serialize result of PVF preparation uniquely
* Set the artifact state depending on the result
* Return the result of PVF preparation directly
* Move PrepareError to the error module
* Update doc comments
* Update misleading comment
* Cleanup docs
* Conclude a test job with an error
Co-authored-by: Sergei Shulepov <sergei@parity.io>
* pvf: make execution timeout configurable
* guide: add timeouts to candidate validation params
* add timeouts to candidate validation messages
* fmt
* port backing to use the backing pvf timeout
* port approval-voting to use the execution timeout
* port dispute participation to use the correct timeout
* fmt
* address grumbles & test failure
* CI: add spellcheck
* revert me
* CI: explicit command for spellchecker
* spellcheck: edit misspells
* CI: run spellcheck on diff
* spellcheck: edits
* spellcheck: edit misspells
* spellcheck: add rules
* spellcheck: mv configs
* spellcheck: more edits
* spellcheck: chore
* spellcheck: one more thing
* spellcheck: and another one
* spellcheck: seems like it doesn't get to an end
* spellcheck: new words after rebase
* spellcheck: new words appearing out of nowhere
* chore
* review edits
* more review edits
* more edits
* wonky behavior
* wonky behavior 2
* wonky behavior 3
* change git behavior
* spellcheck: another bunch of new edits
* spellcheck: new words are koming out of nowhere
* CI: finding the master
* CI: fetching master implicitly
* CI: undebug
* new errors
* a bunch of new edits
* and some more
* Update node/core/approval-voting/src/approval_db/v1/mod.rs
Co-authored-by: Andronik Ordian <write@reusable.software>
* Update xcm/xcm-executor/src/assets.rs
Co-authored-by: Andronik Ordian <write@reusable.software>
* Apply suggestions from code review
Co-authored-by: Andronik Ordian <write@reusable.software>
* Suggestions from the code review
* CI: scan only changed files
Co-authored-by: Andronik Ordian <write@reusable.software>
* Implement PVF validation host
* WIP: Diener
* Increase the alloted compilation time
* Add more comments
* Minor clean up
* Apply suggestions from code review
Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
* Fix pruning artifact removal
* Fix formatting and newlines
* Fix the thread pool
* Update node/core/pvf/src/executor_intf.rs
Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
* Remove redundant test declaration
* Don't convert the path into an intermediate string
* Try to workaround the test failure
* Use the puppet_worker trick again
* Fix a blip
* Move `ensure_wasmtime_version` under the tests mod
* Add a macro for puppet_workers
* fix build for not real-overseer
* Rename the puppet worker for adder collator
* play it safe with the name of adder puppet worker
* Typo: triggered
* Add more comments
* Do not kill exec worker on every error
* Plumb Duration for timeouts
* typo: critical
* Add proofs
* Clean unused imports
* Revert "WIP: Diener"
This reverts commit b9f54e513366c7a6dfdd117ac19fbdc46b900b4d.
* Sync version of wasmtime
* Update cargo.lock
* Update Substrate
* Merge fixes still
* Update wasmtime version in test
* bastifmt
Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
* Squash spaces
* Trailing new line for testing.rs
* Remove controversial code
* comment about biasing
* Fix suggestion
* Add comments
* make it more clear why unwrap_err
* tmpfile retry
* proper proofs for claim_idle
* Remove mutex from ValidationHost
* Add some more logging
* Extract exec timeout into a constant
* Add some clarifying logging
* Use blake2_256
* Clean up the merge
Specifically the leftovers after removing real-overseer
* Update parachain/test-parachains/adder/collator/Cargo.toml
Co-authored-by: Andronik Ordian <write@reusable.software>
Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
Co-authored-by: Andronik Ordian <write@reusable.software>