The PVF host is designed to avoid spawning tasks to minimize knowledge
of outer code. Using `async_std::task::spawn` (or Tokio's counterpart)
deemed unacceptable, `SpawnNamed` undesirable. Instead there is only one
task returned that is spawned by the candidate-validation subsystem.
The tasks from the sub-components are polled by that root task.
However, the way the tasks are bundled was incorrect. There was a giant
select that was polling those tasks. Particularly, that implies that as soon as
one of the arms of that select goes into await those sub-tasks stop
getting polled. This is a recipe for a deadlock which indeed happened
here.
Specifically, the deadlock happened during sending messages to the
execute queue by calling
[`send_execute`](https://github.com/paritytech/polkadot/blob/a68d9be35656dcd96e378fd9dd3d613af754d48a/node/core/pvf/src/host.rs#L601).
When the channel to the queue reaches the capacity, the control flow is
suspended until the queue handles those messages. Since this code is
essentially reached from [one of the select
arms](https://github.com/paritytech/polkadot/blob/a68d9be35656dcd96e378fd9dd3d613af754d48a/node/core/pvf/src/host.rs#L371),
the queue won't be given the control and thus no further progress can be
made.
This problem is solved by bundling the tasks one level higher instead,
by `selecting` over those long-running tasks.
We also stop treating returning from those long-running tasks as error
conditions, since that can happen during legit shutdown.
* Rename to BagError
* Additional parameter for 'revert' command
* Set aux revert param to None
* Align to changes in how the WASM executor is configured in `substrate`
* update lockfile for {"substrate"}
* update lockfile for {"substrate"}
* Update substrate
* Update substrate
Co-authored-by: Keith Yeung <kungfukeith11@gmail.com>
Co-authored-by: Davide Galassi <davxy@datawok.net>
Co-authored-by: Shawn Tabrizi <shawntabrizi@gmail.com>
Co-authored-by: parity-processbot <>
* merge master (do not compile)
* fix
* lock
* update lock
* Update to refactoring.
* runtime version
* fmt
* remove trie patch
* remove patch
* No layout alias for bridge proof.
* update depupdate depss
* No switch until migration.
* master lock
* test
* test
* Revert "test"
This reverts commit 57325ef73332bf4b054aa4a667bb716fcf8a0d89.
* Revert "test"
This reverts commit ce74d0e2062806f72c0e9e9ca07b14165f43521e.
* rename feature
* state version as parameter, use the feature only on runtimes.
* update
* update to state version in runtime
* state version from storage
* update lockfile for substrate
Co-authored-by: parity-processbot <>
We wanted to change niceness to accomodate the fact that some of the
preparation tasks are low priority. For example, when a node sees that
there is a new para was onboarded the node may start preparing right
away. Since all other activities are more important, such as network I/O
or validation of the backed candidates and preparation of the
immediatelly needed PVFs.
However, it turned out that this approach does not work: generally
non-root processes can only decrease niceness and they cannot increase
it to the previous value, as was assumed by the code.
Apart from that, https://github.com/paritytech/polkadot/pull/4123
assumes all PVFs are prepared in the same way. Specifically, that if a
PVF preparation failed before, then PVF pre-checking will also report
that it was failed, even though it could happen that preparation failed
due to being low-priority. In order to avoid such cases, we decided to
simplify the whole preparation model. Preparation under low priority
does not work well with that.
Closes https://github.com/paritytech/polkadot/issues/4520
* Companion PR for removing Prometheus metrics prefix
* Was missing some metrics
* Fix missing renames
* Fix test
* Fixes
* Update test
* Update Substrate
* Second time
* remove prefix from intergration test for zombienet
* update zombienet image
* Update Substrate
Co-authored-by: Bastian Köcher <info@kchr.de>
Co-authored-by: Javier Viola <pepoviola@gmail.com>
Closes https://github.com/paritytech/polkadot/issues/4293
This PR changes the way how we treat a certain subset of PVF preparation
errors. Specifically, now only the deterministic errors are treated as
invalid candidates. That is, the errors that are easily
attributable to either the the PVF contents or the wasmtime code, but
not e.g. I/O errors that could be triggered by the OS (insufficient
memory, disk failure, too much load, etc). The latter are treated as
internal errors and thus do not trigger the disputes.
* pvf host: store only compiled artifacts on disk
* Correctly handle failed artifacts
* Serialize result of PVF preparation uniquely
* Set the artifact state depending on the result
* Return the result of PVF preparation directly
* Move PrepareError to the error module
* Update doc comments
* Update misleading comment
* pvf host: turn off parallel compilation
* pvf host: implement precheck requests
* Fix warnings
* Unnecessary clone
* Add a note about timed out outcome
* Revert the pool outcome handling behavior
* Move the prepare result type into error mod
* Test prepare done
* fmt
* Add an explanation to wasmtime config
* Split pvf host test
* Add precheck to dictionary
Co-authored-by: Sergei Shulepov <sergei@parity.io>
* Limit the number of PVF workers
In particular, limit the number of preparation workers to 1 (soft &
hard) and limit the number of execution workers to 2.
The reason why we are doing this is that it seems many workers launched
at the same time can cause problems. I.e. if there are more than 2
preparation workers, the time for preparation rises significantly to the
point of reaching the timeout.
This was mostly observed with parallel_compilation=true, so each worker
used `numcpu` threads and now we are looking to flip that parameter to
`false`. That said, we want to err on the safe side here and gradually
enable it later if our measurements show that we can do that safely.
* Adjust the test to accomodate the changed config value
* pvf host: store only compiled artifacts on disk
* Correctly handle failed artifacts
* Serialize result of PVF preparation uniquely
* Set the artifact state depending on the result
* Return the result of PVF preparation directly
* Move PrepareError to the error module
* Update doc comments
* Update misleading comment
* Cleanup docs
* Conclude a test job with an error
Co-authored-by: Sergei Shulepov <sergei@parity.io>