PVF: Vote invalid on panics in execution thread (after a retry) (#7155)

* PVF: Remove `rayon` and some uses of `tokio`

1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed.

2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1]

[^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.

3. Since the `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop.

4. The order of thread selection was flipped to make (3) sound (see note in code).
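
Points (2) and (3) above can be sketched roughly as follows. This is a minimal illustration rather than the code from this PR: `EXAMPLE_STACK_SIZE`, `spawn_with_stack` and `wait_for_first` are hypothetical names, and the stack size value is an assumption. Only `std::thread::Builder::stack_size` and `JoinHandle::is_finished` are real `std` APIs.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical stack size; the value used by the real workers is not given here.
const EXAMPLE_STACK_SIZE: usize = 8 * 1024 * 1024;

/// Spawns a named thread with an explicit per-thread stack size using `std::thread`.
pub fn spawn_with_stack<F, T>(name: &str, f: F) -> std::io::Result<thread::JoinHandle<T>>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .name(name.to_string())
        .stack_size(EXAMPLE_STACK_SIZE)
        .spawn(f)
}

/// A naive polling loop standing in for `select!`.
///
/// Returns `0` if `a` is observed finished first, `1` if `b` is. The order of
/// the two checks matters when both threads finish between two polls.
pub fn wait_for_first<A, B>(a: &thread::JoinHandle<A>, b: &thread::JoinHandle<B>) -> usize {
    loop {
        if a.is_finished() {
            return 0
        }
        if b.is_finished() {
            return 1
        }
        // Sleep briefly so the loop does not spin at full CPU.
        thread::sleep(Duration::from_millis(1));
    }
}
```

Because the loop checks the handles in a fixed order, whichever thread is checked first wins ties, which is presumably why the order of selection in (4) matters.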

I left some TODOs related to panics, which I'm going to address soon as part of https://github.com/paritytech/polkadot/issues/7045.

* PVF: Vote invalid on panics in execution thread (after a retry)

Also make sure we kill the worker process on panic errors and internal errors to
potentially clear any error states independent of the candidate.

* Address a couple of TODOs

Addresses a couple of follow-up TODOs from
https://github.com/paritytech/polkadot/pull/7153.

* Add some documentation to implementer's guide

* Fix compile error

* Fix compile errors

* Fix compile error

* Update roadmap/implementers-guide/src/node/utility/candidate-validation.md

Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>

* Address comments + couple other changes (see message)

- Measure the CPU time in the prepare thread, so the observed time is not
  affected by any delays in joining on the thread.

- Measure the full CPU time in the execute thread.

* Implement proper thread synchronization

Use condvars, i.e. `Arc::new((Mutex::new(true), Condvar::new()))`, as per the std
docs.

Considered also using a condvar to signal the CPU thread to end, in place of an
mpsc channel. This was not done because `Condvar::wait_timeout_while` is
documented as being imprecise, and `mpsc::Receiver::recv_timeout` is not
documented as such. Also, we would need a separate condvar, to avoid this case:
the worker thread finishes its job, notifies the condvar, the CPU thread returns
first, and we join on it and not the worker thread. So it was simpler to leave
this part as is.
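
The channel-based signalling kept here can be sketched as follows. This is an illustrative assumption about the shape of the code, not the worker implementation itself: `monitor_job` is a hypothetical helper, and only `mpsc::Receiver::recv_timeout` and `RecvTimeoutError` are real `std` APIs.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Runs `job` on a worker thread while a monitor thread waits for a
/// "finished" message with a timeout.
///
/// Returns `true` if the job ended before `timeout` elapsed, `false` otherwise.
pub fn monitor_job<F>(job: F, timeout: Duration) -> bool
where
    F: FnOnce() + Send + 'static,
{
    let (finished_tx, finished_rx) = mpsc::channel::<()>();

    // The monitor thread blocks on the channel instead of on a condvar.
    let monitor = thread::spawn(move || match finished_rx.recv_timeout(timeout) {
        // The worker told us it is done.
        Ok(()) => true,
        // Nothing arrived in time: treat the job as timed out.
        Err(RecvTimeoutError::Timeout) => false,
        // The sender was dropped without sending (e.g. the job panicked),
        // so the job has ended even though no message arrived.
        Err(RecvTimeoutError::Disconnected) => true,
    });

    // The worker is deliberately not joined here: on timeout it may still be running.
    let _worker = thread::spawn(move || {
        job();
        // The monitor may already have returned; a failed send is harmless.
        let _ = finished_tx.send(());
    });

    monitor.join().expect("monitor thread does not panic")
}
```

In this sketch a short job yields `true`, while a job that outlives the timeout yields `false` and is left running detached.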

* Catch panics in threads so we always notify condvar

* Use `WaitOutcome` enum instead of bool condition variable
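
A rough sketch of the two changes above (catching panics so the condvar is always notified, and replacing the bool with an enum) might look like this. The variant names, `run_with_timeout` and `notify` are assumptions made for illustration and are not taken from the diff.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Assumed shape of the shared condition; variant names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WaitOutcome {
    Pending,
    Finished,
    TimedOut,
}

/// Records `outcome` if nothing was recorded yet, then wakes the waiter.
fn notify(cond: &(Mutex<WaitOutcome>, Condvar), outcome: WaitOutcome) {
    let (lock, cvar) = cond;
    let mut flag = lock.lock().unwrap();
    if *flag == WaitOutcome::Pending {
        *flag = outcome;
    }
    cvar.notify_one();
}

/// Runs `job` against a timer and reports which of the two finished first.
pub fn run_with_timeout<F>(job: F, timeout: Duration) -> WaitOutcome
where
    F: FnOnce() + Send + 'static,
{
    let cond = Arc::new((Mutex::new(WaitOutcome::Pending), Condvar::new()));

    let worker_cond = Arc::clone(&cond);
    thread::spawn(move || {
        // Catch panics so that the condvar is notified even if `job` unwinds.
        let _ = catch_unwind(AssertUnwindSafe(job));
        notify(&worker_cond, WaitOutcome::Finished);
    });

    let timer_cond = Arc::clone(&cond);
    thread::spawn(move || {
        thread::sleep(timeout);
        notify(&timer_cond, WaitOutcome::TimedOut);
    });

    // Block until one of the threads has replaced `Pending`.
    let (lock, cvar) = &*cond;
    let guard = lock.lock().unwrap();
    let guard = cvar.wait_while(guard, |o| *o == WaitOutcome::Pending).unwrap();
    let outcome = *guard;
    outcome
}
```

Compared with a bool, the enum lets the waiting thread tell which thread woke it, so it knows which handle is safe to join.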

* Fix retry timeouts to depend on exec timeout kind

* Address review comments

* Make the API for condvars in workers nicer

* Add a doc

* Use condvar for memory stats thread

* Small refactor

* Enumerate internal validation errors in an enum

* Fix comment

* Add a log

* Fix test

* Update variant naming

* Address a missed TODO

---------

Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
Marcin S
2023-05-16 17:01:02 -04:00
committed by GitHub
parent b75b137b0f
commit 82e4dbcc2d
9 changed files with 236 additions and 61 deletions
+49 -5
@@ -78,9 +78,11 @@ impl fmt::Display for PrepareError {
#[derive(Debug, Clone)]
pub enum ValidationError {
/// The error was raised because the candidate is invalid.
///
/// Whenever we are unsure if the error was due to the candidate or not, we must vote invalid.
InvalidCandidate(InvalidCandidate),
/// This error is raised due to inability to serve the request.
InternalError(String),
/// Some internal error occurred.
InternalError(InternalValidationError),
}
/// A description of an error raised during executing a PVF and can be attributed to the combination
@@ -103,7 +105,7 @@ pub enum InvalidCandidate {
/// an `rlimit` (if set) or, again, invited OOM killer. Another possibility is a bug in
/// wasmtime allowed the PVF to gain control over the execution worker.
///
/// We attribute such an event to an invalid candidate in either case.
/// We attribute such an event to an *invalid candidate* in either case.
///
/// The rationale for this is that a glitch may lead to unfair rejecting candidate by a single
/// validator. If the glitch is somewhat more persistent the validator will reject all candidate
@@ -113,6 +115,48 @@ pub enum InvalidCandidate {
AmbiguousWorkerDeath,
/// PVF execution (compilation is not included) took more time than was allotted.
HardTimeout,
/// A panic occurred and we can't be sure whether the candidate is really invalid or some internal glitch occurred.
/// Whenever we are unsure, we can never treat an error as internal as we would abstain from voting. This is bad
/// because if the issue was due to the candidate, then all validators would abstain, stalling finality on the
/// chain. So we will first retry the candidate, and if the issue persists we are forced to vote invalid.
Panic(String),
}
/// Some internal error occurred.
///
/// Should only ever be used for validation errors independent of the candidate and PVF, or for errors we ruled out
/// during pre-checking (so preparation errors are fine).
#[derive(Debug, Clone, Encode, Decode)]
pub enum InternalValidationError {
/// Some communication error occurred with the host.
HostCommunication(String),
/// Could not find or open compiled artifact file.
CouldNotOpenFile(String),
/// An error occurred in the CPU time monitor thread. Should be totally unrelated to validation.
CpuTimeMonitorThread(String),
/// Some non-deterministic preparation error occurred.
NonDeterministicPrepareError(PrepareError),
}
impl fmt::Display for InternalValidationError {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
use InternalValidationError::*;
match self {
HostCommunication(err) =>
write!(f, "validation: some communication error occurred with the host: {}", err),
CouldNotOpenFile(err) =>
write!(f, "validation: could not find or open compiled artifact file: {}", err),
CpuTimeMonitorThread(err) =>
write!(f, "validation: an error occurred in the CPU time monitor thread: {}", err),
NonDeterministicPrepareError(err) => write!(f, "validation: prepare: {}", err),
}
}
}
impl From<InternalValidationError> for ValidationError {
fn from(error: InternalValidationError) -> Self {
Self::InternalError(error)
}
}
impl From<PrepareError> for ValidationError {
@@ -120,9 +164,9 @@ impl From<PrepareError> for ValidationError {
// Here we need to classify the errors into two errors: deterministic and non-deterministic.
// See [`PrepareError::is_deterministic`].
if error.is_deterministic() {
ValidationError::InvalidCandidate(InvalidCandidate::PrepareError(error.to_string()))
Self::InvalidCandidate(InvalidCandidate::PrepareError(error.to_string()))
} else {
ValidationError::InternalError(error.to_string())
Self::InternalError(InternalValidationError::NonDeterministicPrepareError(error))
}
}
}
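
The policy described in the `Panic` variant docs above (retry ambiguous errors, then vote invalid, and abstain only on errors known to be independent of the candidate) could be sketched as below. `ExampleError`, `Vote` and `validate_with_retry` are hypothetical names for illustration. The real retry logic lives in candidate validation and is not shown in this diff.

```rust
/// Simplified stand-in for the error classes discussed above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ExampleError {
    /// Definitely attributable to the candidate.
    Invalid(String),
    /// Ambiguous: could be the candidate or a local glitch (e.g. a panic).
    MaybeInvalid(String),
    /// Known to be independent of the candidate and PVF.
    Internal(String),
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Vote {
    Valid,
    Invalid,
    Abstain,
}

/// Calls `attempt` until it yields a conclusive result, retrying ambiguous
/// errors at most `max_retries` times.
pub fn validate_with_retry<F>(mut attempt: F, max_retries: u32) -> Vote
where
    F: FnMut() -> Result<(), ExampleError>,
{
    let mut retries_left = max_retries;
    loop {
        match attempt() {
            Ok(()) => return Vote::Valid,
            Err(ExampleError::Invalid(_)) => return Vote::Invalid,
            // Only abstain when the error cannot be the candidate's fault.
            Err(ExampleError::Internal(_)) => return Vote::Abstain,
            // Ambiguous errors are retried while retries remain...
            Err(ExampleError::MaybeInvalid(_)) if retries_left > 0 => retries_left -= 1,
            // ...and once they persist, we are forced to vote invalid.
            Err(ExampleError::MaybeInvalid(_)) => return Vote::Invalid,
        }
    }
}
```

Treating a persistent ambiguous error as invalid avoids the situation where every validator abstains on a candidate that is actually at fault.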
+5 -3
@@ -334,15 +334,17 @@ fn handle_job_finish(
Err(ValidationError::InvalidCandidate(InvalidCandidate::WorkerReportedError(err))),
None,
),
Outcome::InternalError { err, idle_worker } =>
(Some(idle_worker), Err(ValidationError::InternalError(err)), None),
Outcome::InternalError { err } => (None, Err(ValidationError::InternalError(err)), None),
Outcome::HardTimeout =>
(None, Err(ValidationError::InvalidCandidate(InvalidCandidate::HardTimeout)), None),
// "Maybe invalid" errors (will retry).
Outcome::IoErr => (
None,
Err(ValidationError::InvalidCandidate(InvalidCandidate::AmbiguousWorkerDeath)),
None,
),
Outcome::Panic { err } =>
(None, Err(ValidationError::InvalidCandidate(InvalidCandidate::Panic(err))), None),
};
queue.metrics.execute_finished();
@@ -356,7 +358,7 @@ fn handle_job_finish(
err
);
} else {
gum::debug!(
gum::trace!(
target: LOG_TARGET,
?artifact_id,
?worker,
@@ -18,6 +18,7 @@
use crate::{
artifacts::ArtifactPathId,
error::InternalValidationError,
worker_common::{
framed_recv, framed_send, path_to_bytes, spawn_with_program_path, IdleWorker, SpawnErr,
WorkerHandle, JOB_TIMEOUT_WALL_CLOCK_FACTOR,
@@ -64,6 +65,8 @@ pub async fn spawn(
}
/// Outcome of PVF execution.
///
/// If the idle worker token is not returned, it means the worker must be terminated.
pub enum Outcome {
/// PVF execution completed successfully and the result is returned. The worker is ready for
/// another job.
@@ -73,18 +76,23 @@ pub enum Outcome {
InvalidCandidate { err: String, idle_worker: IdleWorker },
/// An internal error happened during the validation. Such an error is most likely related to
/// some transient glitch.
InternalError { err: String, idle_worker: IdleWorker },
///
/// Should only ever be used for errors independent of the candidate and PVF. Therefore it may
/// be a problem with the worker, so we terminate it.
InternalError { err: InternalValidationError },
/// The execution time exceeded the hard limit. The worker is terminated.
HardTimeout,
/// An I/O error happened during communication with the worker. This may mean that the worker
/// process already died. The token is not returned in any case.
IoErr,
/// An unexpected panic has occurred in the execution worker.
Panic { err: String },
}
/// Given the idle token of a worker and parameters of work, communicates with the worker and
/// returns the outcome.
///
/// NOTE: Returning the `HardTimeout` or `IoErr` errors will trigger the child process being killed.
/// NOTE: Not returning the idle worker token in `Outcome` will trigger the child process being killed.
pub async fn start_work(
worker: IdleWorker,
artifact: ArtifactPathId,
@@ -171,8 +179,8 @@ pub async fn start_work(
Response::InvalidCandidate(err) =>
Outcome::InvalidCandidate { err, idle_worker: IdleWorker { stream, pid } },
Response::TimedOut => Outcome::HardTimeout,
Response::InternalError(err) =>
Outcome::InternalError { err, idle_worker: IdleWorker { stream, pid } },
Response::Panic(err) => Outcome::Panic { err },
Response::InternalError(err) => Outcome::InternalError { err },
}
}
@@ -223,8 +231,10 @@ pub enum Response {
InvalidCandidate(String),
/// The job timed out.
TimedOut,
/// Some internal error occurred. Should only be used for errors independent of the candidate.
InternalError(String),
/// An unexpected panic has occurred in the execution worker.
Panic(String),
/// Some internal error occurred.
InternalError(InternalValidationError),
}
impl Response {
@@ -236,12 +246,4 @@ impl Response {
Self::InvalidCandidate(format!("{}: {}", ctx, msg))
}
}
/// Creates an internal response from a context `ctx` and a message `msg` (which can be empty).
pub fn format_internal(ctx: &'static str, msg: &str) -> Self {
if msg.is_empty() {
Self::InternalError(ctx.to_string())
} else {
Self::InternalError(format!("{}: {}", ctx, msg))
}
}
}
+3 -1
@@ -99,7 +99,9 @@ mod pvf;
mod worker_common;
pub use artifacts::CompiledArtifact;
pub use error::{InvalidCandidate, PrepareError, PrepareResult, ValidationError};
pub use error::{
InternalValidationError, InvalidCandidate, PrepareError, PrepareResult, ValidationError,
};
pub use execute::{ExecuteHandshake, ExecuteResponse};
#[cfg(any(target_os = "linux", feature = "jemalloc-allocator"))]
pub use prepare::MemoryAllocationStats;