PVF: add landlock sandboxing (#7303)

* Begin adding landlock + test * Move PVF implementer's guide section to own page, document security * Implement test * Add some docs * Do some cleanup * Fix typo * Warn on host startup if landlock is not supported * Clarify docs a bit * Minor improvements * Add some docs about determinism * Address review comments (mainly add warning on landlock error) * Update node/core/pvf/src/host.rs Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com> * Update node/core/pvf/src/host.rs Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com> * Fix unused fn * Update ABI docs to reflect latest discussions * Remove outdated notes * Try to trigger new test-linux-oldkernel-stable job Job introduced in https://github.com/paritytech/polkadot/pull/7371. --------- Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
2026-06-21 07:31:03 +00:00 · 2023-07-05 12:57:53 -04:00
parent a40417da96
commit 2b9c4f82a7
10 changed files with 445 additions and 95 deletions
@@ -4029,6 +4029,17 @@ dependencies = [
 "kvdb",
 ]
 [[package]]
 name = "landlock"
 version = "0.2.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "520baa32708c4e957d2fc3a186bc5bd8d26637c33137f399ddfc202adb240068"
 dependencies = [
 "enumflags2",
 "libc",
 "thiserror",
 ]
 [[package]]
 name = "lazy_static"
 version = "1.4.0"
@@ -7424,8 +7435,10 @@ dependencies = [
 name = "polkadot-node-core-pvf-common"
 version = "0.9.43"
 dependencies = [
 "assert_matches",
 "cpu-time",
 "futures",
 "landlock",
 "libc",
 "parity-scale-codec",
 "polkadot-parachain",
@@ -7438,6 +7451,7 @@ dependencies = [
 "sp-io",
 "sp-tracing",
 "substrate-build-script-utils",
 "tempfile",
 "tokio",
 "tracing-gum",
 ]
@@ -25,5 +25,12 @@ sp-externalities = { git = "https://github.com/paritytech/substrate", branch = "
 sp-io = { git = "https://github.com/paritytech/substrate", branch = "master" }
 sp-tracing = { git = "https://github.com/paritytech/substrate", branch = "master" }
 [target.'cfg(target_os = "linux")'.dependencies]
 landlock = "0.2.0"
 [dev-dependencies]
 assert_matches = "1.4.0"
 tempfile = "3.3.0"
 [build-dependencies]
 substrate-build-script-utils = { git = "https://github.com/paritytech/substrate", branch = "master" }
@@ -16,6 +16,8 @@
 //! Functionality common to both prepare and execute workers.
 pub mod security;
 use crate::LOG_TARGET;
 use cpu_time::ProcessTime;
 use futures::never::Never;
@@ -203,7 +205,7 @@ pub mod thread {
 	};
 	/// Contains the outcome of waiting on threads, or `Pending` if none are ready.
-	#[derive(Clone, Copy)]
+	#[derive(Debug, Clone, Copy)]
 	pub enum WaitOutcome {
 		Finished,
 		TimedOut,
@@ -224,8 +226,14 @@ pub mod thread {
 		Arc::new((Mutex::new(WaitOutcome::Pending), Condvar::new()))
 	}
-	/// Runs a thread, afterwards notifying the threads waiting on the condvar. Catches panics and
+	/// Runs a worker thread. Will first enable security features, and afterwards notify the threads waiting on the
-	/// resumes them after triggering the condvar, so that the waiting thread is notified on panics.
+	/// condvar. Catches panics during execution and resumes the panics after triggering the condvar, so that the
 	/// waiting thread is notified on panics.
 	///
 	/// # Returns
 	///
 	/// Returns the thread's join handle. Calling `.join()` on it returns the result of executing
 	/// `f()`, as well as whether we were able to enable sandboxing.
 	pub fn spawn_worker_thread<F, R>(
 		name: &str,
 		f: F,
@@ -0,0 +1,188 @@
 // Copyright (C) Parity Technologies (UK) Ltd.
 // This file is part of Polkadot.
 // Polkadot is free software: you can redistribute it and/or modify
 // it under the terms of the GNU General Public License as published by
 // the Free Software Foundation, either version 3 of the License, or
 // (at your option) any later version.
 // Polkadot is distributed in the hope that it will be useful,
 // but WITHOUT ANY WARRANTY; without even the implied warranty of
 // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 // GNU General Public License for more details.
 // You should have received a copy of the GNU General Public License
 // along with Polkadot.  If not, see <http://www.gnu.org/licenses/>.
 //! Functionality for securing workers.
 //!
 //! This is needed because workers are used to compile and execute untrusted code (PVFs).
 /// To what degree landlock is enabled. It's a separate struct from `RulesetStatus` because that is
 /// only available on Linux, plus this has a nicer name.
 pub enum LandlockStatus {
 	FullyEnforced,
 	PartiallyEnforced,
 	NotEnforced,
 	/// Thread panicked, we don't know what the status is.
 	Unavailable,
 }
 impl LandlockStatus {
 	#[cfg(target_os = "linux")]
 	pub fn from_ruleset_status(ruleset_status: ::landlock::RulesetStatus) -> Self {
 		use ::landlock::RulesetStatus::*;
 		match ruleset_status {
 			FullyEnforced => LandlockStatus::FullyEnforced,
 			PartiallyEnforced => LandlockStatus::PartiallyEnforced,
 			NotEnforced => LandlockStatus::NotEnforced,
 		}
 	}
 }
 /// The	[landlock] docs say it best:
 ///
 /// > "Landlock is a security feature available since Linux 5.13. The goal is to enable to restrict
 /// ambient rights (e.g., global filesystem access) for a set of processes by creating safe security
 /// sandboxes as new security layers in addition to the existing system-wide access-controls. This
 /// kind of sandbox is expected to help mitigate the security impact of bugs, unexpected or
 /// malicious behaviors in applications. Landlock empowers any process, including unprivileged ones,
 /// to securely restrict themselves."
 ///
 /// [landlock]: https://docs.rs/landlock/latest/landlock/index.html
 #[cfg(target_os = "linux")]
 pub mod landlock {
 	use landlock::{Access, AccessFs, Ruleset, RulesetAttr, RulesetError, RulesetStatus, ABI};
 	/// Landlock ABI version. We use ABI V1 because:
 	///
 	/// 1. It is supported by our reference kernel version.
 	/// 2. Later versions do not (yet) provide additional security.
 	///
 	/// # Versions (June 2023)
 	///
 	/// - Polkadot reference kernel version: 5.16+
 	/// - ABI V1: 5.13 - introduces	landlock, including full restrictions on file reads
 	/// - ABI V2: 5.19 - adds ability to configure file renaming (not used by us)
 	///
 	/// # Determinism
 	///
 	/// You may wonder whether we could always use the latest ABI instead of only the ABI supported
 	/// by the reference kernel version. It seems plausible, since landlock provides a best-effort
 	/// approach to enabling sandboxing. For example, if the reference version only supported V1 and
 	/// we were on V2, then landlock would use V2 if it was supported on the current machine, and
 	/// just fall back to V1 if not.
 	///
 	/// The issue with this is indeterminacy. If half of validators were on V2 and half were on V1,
 	/// they may have different semantics on some PVFs. So a malicious PVF now has a new attack
 	/// vector: they can exploit this indeterminism between landlock ABIs!
 	///
 	/// On the other hand we do want validators to be as secure as possible and protect their keys
 	/// from attackers. And, the risk with indeterminacy is low and there are other indeterminacy
 	/// vectors anyway. So we will only upgrade to a new ABI if either the reference kernel version
 	/// supports it or if it introduces some new feature that is beneficial to security.
 	pub const LANDLOCK_ABI: ABI = ABI::V1;
 	// TODO: <https://github.com/landlock-lsm/rust-landlock/issues/36>
 	/// Returns to what degree landlock is enabled with the given ABI on the current Linux
 	/// environment.
 	pub fn get_status() -> Result<RulesetStatus, Box<dyn std::error::Error>> {
 		match std::thread::spawn(|| try_restrict_thread()).join() {
 			Ok(Ok(status)) => Ok(status),
 			Ok(Err(ruleset_err)) => Err(ruleset_err.into()),
 			Err(_err) => Err("a panic occurred in try_restrict_thread".into()),
 		}
 	}
 	/// Basaed on the given `status`, returns a single bool indicating whether the given landlock
 	/// ABI is fully enabled on the current Linux environment.
 	pub fn status_is_fully_enabled(
 		status: &Result<RulesetStatus, Box<dyn std::error::Error>>,
 	) -> bool {
 		matches!(status, Ok(RulesetStatus::FullyEnforced))
 	}
 	/// Runs a check for landlock and returns a single bool indicating whether the given landlock
 	/// ABI is fully enabled on the current Linux environment.
 	pub fn check_is_fully_enabled() -> bool {
 		status_is_fully_enabled(&get_status())
 	}
 	/// Tries to restrict the current thread with the following landlock access controls:
 	///
 	/// 1. all global filesystem access
 	/// 2. ... more may be supported in the future.
 	///
 	/// If landlock is not supported in the current environment this is simply a noop.
 	///
 	/// # Returns
 	///
 	/// The status of the restriction (whether it was fully, partially, or not-at-all enforced).
 	pub fn try_restrict_thread() -> Result<RulesetStatus, RulesetError> {
 		let status = Ruleset::new()
 			.handle_access(AccessFs::from_all(LANDLOCK_ABI))?
 			.create()?
 			.restrict_self()?;
 		Ok(status.ruleset)
 	}
 	#[cfg(test)]
 	mod tests {
 		use super::*;
 		use std::{fs, io::ErrorKind, thread};
 		#[test]
 		fn restricted_thread_cannot_access_fs() {
 			// TODO: This would be nice: <https://github.com/rust-lang/rust/issues/68007>.
 			if !check_is_fully_enabled() {
 				return
 			}
 			// Restricted thread cannot read from FS.
 			let handle = thread::spawn(|| {
 				// Write to a tmp file, this should succeed before landlock is applied.
 				let text = "foo";
 				let tmpfile = tempfile::NamedTempFile::new().unwrap();
 				let path = tmpfile.path();
 				fs::write(path, text).unwrap();
 				let s = fs::read_to_string(path).unwrap();
 				assert_eq!(s, text);
 				let status = try_restrict_thread().unwrap();
 				if !matches!(status, RulesetStatus::FullyEnforced) {
 					panic!("Ruleset should be enforced since we checked if landlock is enabled");
 				}
 				// Try to read from the tmp file after landlock.
 				let result = fs::read_to_string(path);
 				assert!(matches!(
 					result,
 					Err(err) if matches!(err.kind(), ErrorKind::PermissionDenied)
 				));
 			});
 			assert!(handle.join().is_ok());
 			// Restricted thread cannot write to FS.
 			let handle = thread::spawn(|| {
 				let text = "foo";
 				let tmpfile = tempfile::NamedTempFile::new().unwrap();
 				let path = tmpfile.path();
 				let status = try_restrict_thread().unwrap();
 				if !matches!(status, RulesetStatus::FullyEnforced) {
 					panic!("Ruleset should be enforced since we checked if landlock is enabled");
 				}
 				// Try to write to the tmp file after landlock.
 				let result = fs::write(path, text);
 				assert!(matches!(
 					result,
 					Err(err) if matches!(err.kind(), ErrorKind::PermissionDenied)
 				));
 			});
 			assert!(handle.join().is_ok());
 		}
 	}
 }
@@ -28,7 +28,9 @@ use polkadot_node_core_pvf_common::{
 	executor_intf::NATIVE_STACK_MAX,
 	framed_recv, framed_send,
 	worker::{
-		bytes_to_path, cpu_time_monitor_loop, stringify_panic_payload,
+		bytes_to_path, cpu_time_monitor_loop,
 		security::LandlockStatus,
 		stringify_panic_payload,
 		thread::{self, WaitOutcome},
 		worker_event_loop,
 	},
@@ -170,11 +172,22 @@ pub fn worker_entrypoint(socket_path: &str, node_version: Option<&str>) {
 			let execute_thread = thread::spawn_worker_thread_with_stack_size(
 				"execute thread",
 				move || {
-					validate_using_artifact(
+					// Try to enable landlock.
-						&compiled_artifact_blob,
+					#[cfg(target_os = "linux")]
-						&params,
+					let landlock_status = polkadot_node_core_pvf_common::worker::security::landlock::try_restrict_thread()
-						executor_2,
+						.map(LandlockStatus::from_ruleset_status)
-						cpu_time_start,
+						.map_err(|e| e.to_string());
 					#[cfg(not(target_os = "linux"))]
 					let landlock_status: Result<LandlockStatus, String> = Ok(LandlockStatus::NotEnforced);
 					(
 						validate_using_artifact(
 							&compiled_artifact_blob,
 							&params,
 							executor_2,
 							cpu_time_start,
 						),
 						landlock_status,
 					)
 				},
 				Arc::clone(&condvar),
@@ -187,9 +200,24 @@ pub fn worker_entrypoint(socket_path: &str, node_version: Option<&str>) {
 			let response = match outcome {
 				WaitOutcome::Finished => {
 					let _ = cpu_time_monitor_tx.send(());
-					execute_thread
+					let (result, landlock_status) = execute_thread.join().unwrap_or_else(|e| {
-						.join()
+						(
-						.unwrap_or_else(|e| Response::Panic(stringify_panic_payload(e)))
+							Response::Panic(stringify_panic_payload(e)),
 							Ok(LandlockStatus::Unavailable),
 						)
 					});
 					// Log if landlock threw an error.
 					if let Err(err) = landlock_status {
 						gum::warn!(
 							target: LOG_TARGET,
 							%worker_pid,
 							"error enabling landlock: {}",
 							err
 						);
 					}
 					result
 				},
 				// If the CPU thread is not selected, we signal it to end, the join handle is
 				// dropped and the thread will finish in the background.
@@ -35,7 +35,9 @@ use polkadot_node_core_pvf_common::{
 	prepare::{MemoryStats, PrepareJobKind, PrepareStats},
 	pvf::PvfPrepData,
 	worker::{
-		bytes_to_path, cpu_time_monitor_loop, stringify_panic_payload,
+		bytes_to_path, cpu_time_monitor_loop,
 		security::LandlockStatus,
 		stringify_panic_payload,
 		thread::{self, WaitOutcome},
 		worker_event_loop,
 	},
@@ -155,6 +157,14 @@ pub fn worker_entrypoint(socket_path: &str, node_version: Option<&str>) {
 			let prepare_thread = thread::spawn_worker_thread(
 				"prepare thread",
 				move || {
 					// Try to enable landlock.
 					#[cfg(target_os = "linux")]
 					let landlock_status = polkadot_node_core_pvf_common::worker::security::landlock::try_restrict_thread()
 						.map(LandlockStatus::from_ruleset_status)
 						.map_err(|e| e.to_string());
 					#[cfg(not(target_os = "linux"))]
 					let landlock_status: Result<LandlockStatus, String> = Ok(LandlockStatus::NotEnforced);
 					#[allow(unused_mut)]
 					let mut result = prepare_artifact(pvf, cpu_time_start);
@@ -173,7 +183,7 @@ pub fn worker_entrypoint(socket_path: &str, node_version: Option<&str>) {
 						});
 					}
-					result
+					(result, landlock_status)
 				},
 				Arc::clone(&condvar),
 				WaitOutcome::Finished,
@@ -186,13 +196,16 @@ pub fn worker_entrypoint(socket_path: &str, node_version: Option<&str>) {
 					let _ = cpu_time_monitor_tx.send(());
 					match prepare_thread.join().unwrap_or_else(|err| {
-						Err(PrepareError::Panic(stringify_panic_payload(err)))
+						(
 							Err(PrepareError::Panic(stringify_panic_payload(err))),
 							Ok(LandlockStatus::Unavailable),
 						)
 					}) {
-						Err(err) => {
+						(Err(err), _) => {
 							// Serialized error will be written into the socket.
 							Err(err)
 						},
-						Ok(ok) => {
+						(Ok(ok), landlock_status) => {
 							#[cfg(not(target_os = "linux"))]
 							let (artifact, cpu_time_elapsed) = ok;
 							#[cfg(target_os = "linux")]
@@ -208,6 +221,16 @@ pub fn worker_entrypoint(socket_path: &str, node_version: Option<&str>) {
 								max_rss: extract_max_rss_stat(max_rss, worker_pid),
 							};
 							// Log if landlock threw an error.
 							if let Err(err) = landlock_status {
 								gum::warn!(
 									target: LOG_TARGET,
 									%worker_pid,
 									"error enabling landlock: {}",
 									err
 								);
 							}
 							// Write the serialized artifact into a temp file.
 							//
 							// PVF host only keeps artifacts statuses in its memory, successfully
@@ -140,6 +140,7 @@ struct ExecutePvfInputs {
 }
 /// Configuration for the validation host.
 #[derive(Debug)]
 pub struct Config {
 	/// The root directory where the prepared artifacts can be stored.
 	pub cache_path: PathBuf,
@@ -189,6 +190,11 @@ impl Config {
 /// In that case all pending requests will be canceled, dropping the result senders and new ones
 /// will be rejected.
 pub fn start(config: Config, metrics: Metrics) -> (ValidationHost, impl Future<Output = ()>) {
 	gum::debug!(target: LOG_TARGET, ?config, "starting PVF validation host");
 	// Run checks for supported security features once per host startup.
 	warn_if_no_landlock();
 	let (to_host_tx, to_host_rx) = mpsc::channel(10);
 	let validation_host = ValidationHost { to_host_tx };
@@ -854,6 +860,30 @@ fn pulse_every(interval: std::time::Duration) -> impl futures::Stream<Item = ()>
 	.map(|_| ())
 }
 /// Check if landlock is supported and emit a warning if not.
 fn warn_if_no_landlock() {
 	#[cfg(target_os = "linux")]
 	{
 		use polkadot_node_core_pvf_common::worker::security::landlock;
 		let status = landlock::get_status();
 		if !landlock::status_is_fully_enabled(&status) {
 			let abi = landlock::LANDLOCK_ABI as u8;
 			gum::warn!(
 				target: LOG_TARGET,
 				?status,
 				%abi,
 				"Cannot fully enable landlock, a Linux kernel security feature. Running validation of malicious PVF code has a higher risk of compromising this machine. Consider upgrading the kernel version for maximum security."
 			);
 		}
 	}
 	#[cfg(not(target_os = "linux"))]
 	gum::warn!(
 		target: LOG_TARGET,
 		"Cannot enable landlock, a Linux kernel security feature. Running validation of malicious PVF code has a higher risk of compromising this machine. Consider running on Linux with landlock support for maximum security."
 	);
 }
 #[cfg(test)]
 pub(crate) mod tests {
 	use super::*;
@@ -60,6 +60,7 @@
  - [Utility Subsystems](node/utility/README.md)
    - [Availability Store](node/utility/availability-store.md)
    - [Candidate Validation](node/utility/candidate-validation.md)
      - [PVF Host and Workers](node/utility/pvf-host-and-workers.md)
    - [Provisioner](node/utility/provisioner.md)
    - [Network Bridge](node/utility/network-bridge.md)
    - [Gossip Support](node/utility/gossip-support.md)
@@ -44,86 +44,10 @@ Once we have all parameters, we can spin up a background task to perform the val
  * The collator signature is valid
  * The PoV provided matches the `pov_hash` field of the descriptor
 For more details please see [PVF Host and Workers](pvf-host-and-workers.md).
 ### Checking Validation Outputs
 If we can assume the presence of the relay-chain state (that is, during processing [`CandidateValidationMessage`][CVM]`::ValidateFromChainState`) we can run all the checks that the relay-chain would run at the inclusion time thus confirming that the candidate will be accepted.
 ### PVF Host
 The PVF host is responsible for handling requests to prepare and execute PVF
 code blobs.
 One high-level goal is to make PVF operations as deterministic as possible, to
 reduce the rate of disputes. Disputes can happen due to e.g. a job timing out on
 one machine, but not another. While we do not yet have full determinism, there
 are some dispute reduction mechanisms in place right now.
 #### Retrying execution requests
 If the execution request fails during **preparation**, we will retry if it is
 possible that the preparation error was transient (e.g. if the error was a panic
 or time out). We will only retry preparation if another request comes in after
 15 minutes, to ensure any potential transient conditions had time to be
 resolved. We will retry up to 5 times.
 If the actual **execution** of the artifact fails, we will retry once if it was
 a possibly transient error, to allow the conditions that led to the error to
 hopefully resolve. We use a more brief delay here (1 second as opposed to 15
 minutes for preparation (see above)), because a successful execution must happen
 in a short amount of time.
 We currently know of the following specific cases that will lead to a retried
 execution request:
 1. **OOM:** The host might have been temporarily low on memory due to other
   processes running on the same machine. **NOTE:** This case will lead to
   voting against the candidate (and possibly a dispute) if the retry is still
   not successful.
 2. **Artifact missing:** The prepared artifact might have been deleted due to
   operator error or some bug in the system.
 3. **Panic:** The worker thread panicked for some indeterminate reason, which
   may or may not be independent of the candidate or PVF.
 #### Preparation timeouts
 We use timeouts for both preparation and execution jobs to limit the amount of
 time they can take. As the time for a job can vary depending on the machine and
 load on the machine, this can potentially lead to disputes where some validators
 successfuly execute a PVF and others don't.
 One dispute mitigation we have in place is a more lenient timeout for
 preparation during execution than during pre-checking. The rationale is that the
 PVF has already passed pre-checking, so we know it should be valid, and we allow
 it to take longer than expected, as this is likely due to an issue with the
 machine and not the PVF.
 #### CPU clock timeouts
 Another timeout-related mitigation we employ is to measure the time taken by
 jobs using CPU time, rather than wall clock time. This is because the CPU time
 of a process is less variable under different system conditions. When the
 overall system is under heavy load, the wall clock time of a job is affected
 more than the CPU time.
 #### Internal errors
 In general, for errors not raising a dispute we have to be very careful. This is
 only sound, if we either:
 1. Ruled out that error in pre-checking. If something is not checked in
   pre-checking, even if independent of the candidate and PVF, we must raise a
   dispute.
 2. We are 100% confident that it is a hardware/local issue: Like corrupted file,
   etc.
 Reasoning: Otherwise it would be possible to register a PVF where candidates can
 not be checked, but we don't get a dispute - so nobody gets punished. Second, we
 end up with a finality stall that is not going to resolve!
 There are some error conditions where we can't be sure whether the candidate is
 really invalid or some internal glitch occurred, e.g. panics. Whenever we are
 unsure, we can never treat an error as internal as we would abstain from voting.
 So we will first retry the candidate, and if the issue persists we are forced to
 vote invalid.
 [CVM]: ../../types/overseer-protocol.md#validationrequesttype
@@ -0,0 +1,127 @@
 # PVF Host and Workers
 The PVF host is responsible for handling requests to prepare and execute PVF
 code blobs, which it sends to PVF workers running in their own child processes.
 This system has two high-levels goals that we will touch on here: *determinism*
 and *security*.
 ## Determinism
 One high-level goal is to make PVF operations as deterministic as possible, to
 reduce the rate of disputes. Disputes can happen due to e.g. a job timing out on
 one machine, but not another. While we do not have full determinism, there are
 some dispute reduction mechanisms in place right now.
 ### Retrying execution requests
 If the execution request fails during **preparation**, we will retry if it is
 possible that the preparation error was transient (e.g. if the error was a panic
 or time out). We will only retry preparation if another request comes in after
 15 minutes, to ensure any potential transient conditions had time to be
 resolved. We will retry up to 5 times.
 If the actual **execution** of the artifact fails, we will retry once if it was
 a possibly transient error, to allow the conditions that led to the error to
 hopefully resolve. We use a more brief delay here (1 second as opposed to 15
 minutes for preparation (see above)), because a successful execution must happen
 in a short amount of time.
 We currently know of the following specific cases that will lead to a retried
 execution request:
 1. **OOM:** The host might have been temporarily low on memory due to other
   processes running on the same machine. **NOTE:** This case will lead to
   voting against the candidate (and possibly a dispute) if the retry is still
   not successful.
 2. **Artifact missing:** The prepared artifact might have been deleted due to
   operator error or some bug in the system.
 3. **Panic:** The worker thread panicked for some indeterminate reason, which
   may or may not be independent of the candidate or PVF.
 ### Preparation timeouts
 We use timeouts for both preparation and execution jobs to limit the amount of
 time they can take. As the time for a job can vary depending on the machine and
 load on the machine, this can potentially lead to disputes where some validators
 successfuly execute a PVF and others don't.
 One dispute mitigation we have in place is a more lenient timeout for
 preparation during execution than during pre-checking. The rationale is that the
 PVF has already passed pre-checking, so we know it should be valid, and we allow
 it to take longer than expected, as this is likely due to an issue with the
 machine and not the PVF.
 ### CPU clock timeouts
 Another timeout-related mitigation we employ is to measure the time taken by
 jobs using CPU time, rather than wall clock time. This is because the CPU time
 of a process is less variable under different system conditions. When the
 overall system is under heavy load, the wall clock time of a job is affected
 more than the CPU time.
 ### Internal errors
 In general, for errors not raising a dispute we have to be very careful. This is
 only sound, if we either:
 1. Ruled out that error in pre-checking. If something is not checked in
   pre-checking, even if independent of the candidate and PVF, we must raise a
   dispute.
 2. We are 100% confident that it is a hardware/local issue: Like corrupted file,
   etc.
 Reasoning: Otherwise it would be possible to register a PVF where candidates can
 not be checked, but we don't get a dispute - so nobody gets punished. Second, we
 end up with a finality stall that is not going to resolve!
 There are some error conditions where we can't be sure whether the candidate is
 really invalid or some internal glitch occurred, e.g. panics. Whenever we are
 unsure, we can never treat an error as internal as we would abstain from voting.
 So we will first retry the candidate, and if the issue persists we are forced to
 vote invalid.
 ## Security
 With [on-demand parachains](https://github.com/orgs/paritytech/projects/67), it
 is much easier to submit PVFs to the chain for preparation and execution. This
 makes it easier for erroneous disputes and slashing to occur, whether
 intentional (as a result of a malicious attacker) or not (a bug or operator
 error occurred).
 Therefore, another goal of ours is to harden our security around PVFs, in order
 to protect the economic interests of validators and increase overall confidence
 in the system.
 ### Possible attacks / threat model
 Webassembly is already sandboxed, but there have already been reported multiple
 CVEs enabling remote code execution. See e.g. these two advisories from
 [Mar 2023](https://github.com/bytecodealliance/wasmtime/security/advisories/GHSA-ff4p-7xrq-q5r8)
 and [Jul 2022](https://github.com/bytecodealliance/wasmtime/security/advisories/GHSA-7f6x-jwh5-m9r4).
 So what are we actually worried about? Things that come to mind:
 1. **Consensus faults** - If an attacker can get some source of randomness they
   could vote against with 50% chance and cause unresolvable disputes.
 2. **Targeted slashes** - An attacker can target certain validators (e.g. some
   validators running on vulnerable hardware) and make them vote invalid and get
   them slashed.
 3. **Mass slashes** - With some source of randomness they can do an untargeted
   attack. I.e. a baddie can do significant economic damage by voting against
   with 1/3 chance, without even stealing keys or completely replacing the
   binary.
 4. **Stealing keys** - That would be pretty bad. Should not be possible with
   sandboxing. We should at least not allow filesystem-access or network access.
 5. **Taking control over the validator.** E.g. replacing the `polkadot` binary
   with a `polkadot-evil` binary. Should again not be possible with the above
   sandboxing in place.
 6. **Intercepting and manipulating packages** - Effect very similar to the
   above, hard to do without also being able to do 4 or 5.
 ### Restricting file-system access
 A basic security mechanism is to make sure that any thread directly interfacing
 with untrusted code does not have access to the file-system. This provides some
 protection against attackers accessing sensitive data or modifying data on the
 host machine.