PVF: Vote invalid on panics in execution thread (after a retry) (#7155)

* PVF: Remove `rayon` and some uses of `tokio`

1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed.

2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1]

[^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.

3. Since `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop.

4. The order of thread selection was flipped to make (3) sound (see note in code).

I left some TODO's related to panics which I'm going to address soon as part of https://github.com/paritytech/polkadot/issues/7045.

* PVF: Vote invalid on panics in execution thread (after a retry)

Also make sure we kill the worker process on panic errors and internal errors to
potentially clear any error states independent of the candidate.

* Address a couple of TODOs

Addresses a couple of follow-up TODOs from
https://github.com/paritytech/polkadot/pull/7153.

* Add some documentation to implementer's guide

* Fix compile error

* Fix compile errors

* Fix compile error

* Update roadmap/implementers-guide/src/node/utility/candidate-validation.md

Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>

* Address comments + couple other changes (see message)

- Measure the CPU time in the prepare thread, so the observed time is not
  affected by any delays in joining on the thread.

- Measure the full CPU time in the execute thread.

* Implement proper thread synchronization

Use condvars i.e. `Arc::new((Mutex::new(true), Condvar::new()))` as per the std
docs.

Considered also using a condvar to signal the CPU thread to end, in place of an
mpsc channel. This was not done because `Condvar::wait_timeout_while` is
documented as being imprecise, and `mpsc::Receiver::recv_timeout` is not
documented as such. Also, we would need a separate condvar, to avoid this case:
the worker thread finishes its job, notifies the condvar, the CPU thread returns
first, and we join on it and not the worker thread. So it was simpler to leave
this part as is.

* Catch panics in threads so we always notify condvar

* Use `WaitOutcome` enum instead of bool condition variable

* Fix retry timeouts to depend on exec timeout kind

* Address review comments

* Make the API for condvars in workers nicer

* Add a doc

* Use condvar for memory stats thread

* Small refactor

* Enumerate internal validation errors in an enum

* Fix comment

* Add a log

* Fix test

* Update variant naming

* Address a missed TODO

---------

Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
This commit is contained in:
Marcin S
2023-05-16 17:01:02 -04:00
committed by GitHub
parent b75b137b0f
commit 82e4dbcc2d
9 changed files with 236 additions and 61 deletions
+5 -3
View File
@@ -334,15 +334,17 @@ fn handle_job_finish(
Err(ValidationError::InvalidCandidate(InvalidCandidate::WorkerReportedError(err))),
None,
),
Outcome::InternalError { err, idle_worker } =>
(Some(idle_worker), Err(ValidationError::InternalError(err)), None),
Outcome::InternalError { err } => (None, Err(ValidationError::InternalError(err)), None),
Outcome::HardTimeout =>
(None, Err(ValidationError::InvalidCandidate(InvalidCandidate::HardTimeout)), None),
// "Maybe invalid" errors (will retry).
Outcome::IoErr => (
None,
Err(ValidationError::InvalidCandidate(InvalidCandidate::AmbiguousWorkerDeath)),
None,
),
Outcome::Panic { err } =>
(None, Err(ValidationError::InvalidCandidate(InvalidCandidate::Panic(err))), None),
};
queue.metrics.execute_finished();
@@ -356,7 +358,7 @@ fn handle_job_finish(
err
);
} else {
gum::debug!(
gum::trace!(
target: LOG_TARGET,
?artifact_id,
?worker,
@@ -18,6 +18,7 @@
use crate::{
artifacts::ArtifactPathId,
error::InternalValidationError,
worker_common::{
framed_recv, framed_send, path_to_bytes, spawn_with_program_path, IdleWorker, SpawnErr,
WorkerHandle, JOB_TIMEOUT_WALL_CLOCK_FACTOR,
@@ -64,6 +65,8 @@ pub async fn spawn(
}
/// Outcome of PVF execution.
///
/// If the idle worker token is not returned, it means the worker must be terminated.
pub enum Outcome {
/// PVF execution completed successfully and the result is returned. The worker is ready for
/// another job.
@@ -73,18 +76,23 @@ pub enum Outcome {
InvalidCandidate { err: String, idle_worker: IdleWorker },
/// An internal error happened during the validation. Such an error is most likely related to
/// some transient glitch.
InternalError { err: String, idle_worker: IdleWorker },
///
/// Should only ever be used for errors independent of the candidate and PVF. Therefore it may
/// be a problem with the worker, so we terminate it.
InternalError { err: InternalValidationError },
/// The execution time exceeded the hard limit. The worker is terminated.
HardTimeout,
/// An I/O error happened during communication with the worker. This may mean that the worker
/// process already died. The token is not returned in any case.
IoErr,
/// An unexpected panic has occurred in the execution worker.
Panic { err: String },
}
/// Given the idle token of a worker and parameters of work, communicates with the worker and
/// returns the outcome.
///
/// NOTE: Returning the `HardTimeout` or `IoErr` errors will trigger the child process being killed.
/// NOTE: Not returning the idle worker token in `Outcome` will trigger the child process being killed.
pub async fn start_work(
worker: IdleWorker,
artifact: ArtifactPathId,
@@ -171,8 +179,8 @@ pub async fn start_work(
Response::InvalidCandidate(err) =>
Outcome::InvalidCandidate { err, idle_worker: IdleWorker { stream, pid } },
Response::TimedOut => Outcome::HardTimeout,
Response::InternalError(err) =>
Outcome::InternalError { err, idle_worker: IdleWorker { stream, pid } },
Response::Panic(err) => Outcome::Panic { err },
Response::InternalError(err) => Outcome::InternalError { err },
}
}
@@ -223,8 +231,10 @@ pub enum Response {
InvalidCandidate(String),
/// The job timed out.
TimedOut,
/// Some internal error occurred. Should only be used for errors independent of the candidate.
InternalError(String),
/// An unexpected panic has occurred in the execution worker.
Panic(String),
/// Some internal error occurred.
InternalError(InternalValidationError),
}
impl Response {
@@ -236,12 +246,4 @@ impl Response {
Self::InvalidCandidate(format!("{}: {}", ctx, msg))
}
}
/// Creates an internal response from a context `ctx` and a message `msg` (which can be empty).
pub fn format_internal(ctx: &'static str, msg: &str) -> Self {
if msg.is_empty() {
Self::InternalError(ctx.to_string())
} else {
Self::InternalError(format!("{}: {}", ctx, msg))
}
}
}