mirror of
https://github.com/pezkuwichain/pezkuwi-subxt.git
synced 2026-06-12 08:51:09 +00:00
PVF: Vote invalid on panics in execution thread (after a retry) (#7155)
* PVF: Remove `rayon` and some uses of `tokio`

  1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed.
  2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1]
  3. Since `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop.
  4. The order of thread selection was flipped to make (3) sound (see note in code).

  [^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.

  I left some TODO's related to panics which I'm going to address soon as part of https://github.com/paritytech/polkadot/issues/7045.

* PVF: Vote invalid on panics in execution thread (after a retry)

  Also make sure we kill the worker process on panic errors and internal errors to potentially clear any error states independent of the candidate.

* Address a couple of TODOs

  Addresses a couple of follow-up TODOs from https://github.com/paritytech/polkadot/pull/7153.

* Add some documentation to implementer's guide
* Fix compile error
* Fix compile errors
* Fix compile error
* Update roadmap/implementers-guide/src/node/utility/candidate-validation.md

  Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>

* Address comments + couple other changes (see message)

  - Measure the CPU time in the prepare thread, so the observed time is not affected by any delays in joining on the thread.
  - Measure the full CPU time in the execute thread.

* Implement proper thread synchronization

  Use condvars i.e. `Arc::new((Mutex::new(true), Condvar::new()))` as per the std docs.

  Considered also using a condvar to signal the CPU thread to end, in place of an mpsc channel. This was not done because `Condvar::wait_timeout_while` is documented as being imprecise, and `mpsc::Receiver::recv_timeout` is not documented as such. Also, we would need a separate condvar, to avoid this case: the worker thread finishes its job, notifies the condvar, the CPU thread returns first, and we join on it and not the worker thread. So it was simpler to leave this part as is.

* Catch panics in threads so we always notify condvar
* Use `WaitOutcome` enum instead of bool condition variable
* Fix retry timeouts to depend on exec timeout kind
* Address review comments
* Make the API for condvars in workers nicer
* Add a doc
* Use condvar for memory stats thread
* Small refactor
* Enumerate internal validation errors in an enum
* Fix comment
* Add a log
* Fix test
* Update variant naming
* Address a missed TODO

---------

Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
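The synchronization scheme described in the commit message can be sketched roughly as follows. This is a minimal illustration using only `std`, not the code from the PR: the `WaitOutcome` variants, the `run_with_timeout` function, and the use of a wall-clock timeout in place of the real CPU-time monitor are all assumptions made for the example. It shows the three ideas named above: a shared `(Mutex, Condvar)` pair holding an enum instead of a bool, catching panics in the worker thread so the condvar is always notified, and an mpsc channel with `recv_timeout` to tell the timer thread to stop.

```rust
use std::panic;
use std::sync::{mpsc, Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the `WaitOutcome` enum mentioned in the commit
/// message; the real variants may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum WaitOutcome {
    Pending,
    JobFinished,
    TimedOut,
}

/// Records `outcome` unless another thread already recorded one, then wakes waiters.
fn notify(shared: &(Mutex<WaitOutcome>, Condvar), outcome: WaitOutcome) {
    let (lock, cvar) = shared;
    let mut state = lock.lock().unwrap();
    if *state == WaitOutcome::Pending {
        *state = outcome;
    }
    cvar.notify_all();
}

/// Runs `job` on its own thread next to a timer thread and reports which one
/// finished first. Assumption: a wall-clock timeout stands in for CPU time.
pub fn run_with_timeout<F, R>(job: F, timeout: Duration) -> (WaitOutcome, Option<thread::Result<R>>)
where
    F: FnOnce() -> R + Send + 'static,
    R: Send + 'static,
{
    let shared = Arc::new((Mutex::new(WaitOutcome::Pending), Condvar::new()));
    let (stop_tx, stop_rx) = mpsc::channel::<()>();

    let worker_shared = Arc::clone(&shared);
    let worker = thread::spawn(move || {
        // Catch panics so the condvar is notified even if the job blows up.
        let result = panic::catch_unwind(panic::AssertUnwindSafe(job));
        notify(&worker_shared, WaitOutcome::JobFinished);
        result
    });

    let timer_shared = Arc::clone(&shared);
    let timer = thread::spawn(move || {
        // `recv_timeout` doubles as the timer: a message (or a dropped sender)
        // means the job is done and the timer should exit quietly.
        if let Err(mpsc::RecvTimeoutError::Timeout) = stop_rx.recv_timeout(timeout) {
            notify(&timer_shared, WaitOutcome::TimedOut);
        }
    });

    // Block until either thread has recorded an outcome.
    let outcome = {
        let (lock, cvar) = &*shared;
        let guard = lock.lock().unwrap();
        *cvar.wait_while(guard, |s| *s == WaitOutcome::Pending).unwrap()
    };

    let _ = stop_tx.send(());
    let _ = timer.join();

    match outcome {
        // Join only the thread that actually finished first.
        WaitOutcome::JobFinished => {
            (outcome, Some(worker.join().expect("panics are caught inside the thread")))
        }
        // On timeout the worker is left detached here; a real worker process
        // would be killed instead.
        _ => (outcome, None),
    }
}
```

A job that panics still yields `WaitOutcome::JobFinished`, with the panic payload in the `Err` side of the returned `thread::Result`, so the caller can distinguish a panic from a timeout.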
@@ -72,7 +72,7 @@ hopefully resolve. We use a more brief delay here (1 second as opposed to 15
 minutes for preparation (see above)), because a successful execution must happen
 in a short amount of time.

-We currently know of at least two specific cases that will lead to a retried
+We currently know of the following specific cases that will lead to a retried
 execution request:

 1. **OOM:** The host might have been temporarily low on memory due to other
@@ -80,7 +80,9 @@ execution request:
    voting against the candidate (and possibly a dispute) if the retry is still
    not successful.
 2. **Artifact missing:** The prepared artifact might have been deleted due to
-   operator error or some bug in the system. We will re-create it on retry.
+   operator error or some bug in the system.
+3. **Panic:** The worker thread panicked for some indeterminate reason, which
+   may or may not be independent of the candidate or PVF.

 #### Preparation timeouts

@@ -103,4 +105,25 @@ of a process is less variable under different system conditions. When the
 overall system is under heavy load, the wall clock time of a job is affected
 more than the CPU time.

+#### Internal errors
+
+In general, for errors not raising a dispute we have to be very careful. This is
+only sound, if we either:
+
+1. Ruled out that error in pre-checking. If something is not checked in
+   pre-checking, even if independent of the candidate and PVF, we must raise a
+   dispute.
+2. We are 100% confident that it is a hardware/local issue: Like corrupted file,
+   etc.
+
+Reasoning: Otherwise it would be possible to register a PVF where candidates can
+not be checked, but we don't get a dispute - so nobody gets punished. Second, we
+end up with a finality stall that is not going to resolve!
+
+There are some error conditions where we can't be sure whether the candidate is
+really invalid or some internal glitch occurred, e.g. panics. Whenever we are
+unsure, we can never treat an error as internal as we would abstain from voting.
+So we will first retry the candidate, and if the issue persists we are forced to
+vote invalid.
+
 [CVM]: ../../types/overseer-protocol.md#validationrequesttype
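The retry-then-vote policy that the documentation change describes can be sketched as below. This is an illustrative model only, not the node's actual code: `ExecOutcome`, `Vote`, and `vote_with_retry` are hypothetical names, and the handling of an artifact that is still missing after the retry (treated here as a local issue, so the validator abstains) is an assumption that the text above does not spell out. What the text does state is that OOM and panics are retried once and lead to an invalid vote if they persist, while errors known to be internal do not raise a dispute.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical result of one execution attempt.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ExecOutcome {
    Valid,
    Invalid,
    Oom,
    ArtifactMissing,
    Panic,
    /// An error we are confident is local, e.g. a corrupted file.
    Internal,
}

/// Hypothetical final decision of the validator.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Vote {
    Valid,
    Invalid,
    Abstain,
}

/// The three cases the guide lists as leading to a retried execution request.
fn is_retriable(outcome: ExecOutcome) -> bool {
    matches!(
        outcome,
        ExecOutcome::Oom | ExecOutcome::ArtifactMissing | ExecOutcome::Panic
    )
}

/// Executes once, retries at most once after `retry_delay`, then maps the
/// last outcome to a vote.
pub fn vote_with_retry(mut execute: impl FnMut() -> ExecOutcome, retry_delay: Duration) -> Vote {
    let mut outcome = execute();
    if is_retriable(outcome) {
        thread::sleep(retry_delay);
        outcome = execute();
    }
    match outcome {
        ExecOutcome::Valid => Vote::Valid,
        // Ambiguous failures that persist after the retry cannot be treated as
        // internal, so they become an invalid vote.
        ExecOutcome::Invalid | ExecOutcome::Oom | ExecOutcome::Panic => Vote::Invalid,
        // Assumption: a persistently missing artifact is a local issue.
        ExecOutcome::ArtifactMissing | ExecOutcome::Internal => Vote::Abstain,
    }
}
```

In the guide the execution retry delay is 1 second; the sketch takes the delay as a parameter so it can be shortened in tests.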