PVF: Don't dispute on missing artifact (#7011)

* PVF: Don't dispute on missing artifact

A dispute should never be raised if the local cache doesn't provide a certain
artifact. You can not dispute based on this reason, as it is a local hardware
issue and not related to the candidate to check.

Design:

Currently we assume that if we prepared an artifact, it remains there on-disk
until we prune it, i.e. we never check again if it's still there.

We can change it so that instead of artifact-not-found triggering a dispute, we
retry once (like we do for AmbiguousWorkerDeath, except we don't dispute if it
still doesn't work). And when enqueuing an execute job, we check for the
artifact on-disk, and start preparation if not found.

Changes:

- [x] Integration test (should fail without the following changes)
- [x] Check if artifact exists when executing, prepare if not
- [x] Return an internal error when file is missing
- [x] Retry once on internal errors
- [x] Document design (update impl guide)

* Add some context to wasm error message (it is quite long)

* Fix impl guide

* Add check for missing/inaccessible file

* Add comment referencing Substrate issue

* Add test for retrying internal errors

---------

Co-authored-by: parity-processbot <>
This commit is contained in:
Marcin S
2023-04-20 15:38:31 +02:00
committed by GitHub
parent 023d459857
commit 0940cdd1d7
8 changed files with 286 additions and 99 deletions
@@ -23,7 +23,7 @@ Upon receiving a validation request, the first thing the candidate validation su
* The [`CandidateDescriptor`](../../types/candidate.md#candidatedescriptor).
* The [`ValidationData`](../../types/candidate.md#validationdata).
* The [`PoV`](../../types/availability.md#proofofvalidity).
The second category is for PVF pre-checking. This is primarly used by the [PVF pre-checker](pvf-prechecker.md) subsystem.
### Determining Parameters
@@ -67,8 +67,20 @@ or time out). We will only retry preparation if another request comes in after
resolved. We will retry up to 5 times.
If the actual **execution** of the artifact fails, we will retry once if it was
an ambiguous error after a brief delay, to allow any potential transient
conditions to clear.
a possibly transient error, to allow the conditions that led to the error to
hopefully resolve. We use a more brief delay here (1 second as opposed to 15
minutes for preparation (see above)), because a successful execution must happen
in a short amount of time.
We currently know of at least two specific cases that will lead to a retried
execution request:
1. **OOM:** The host might have been temporarily low on memory due to other
processes running on the same machine. **NOTE:** This case will lead to
voting against the candidate (and possibly a dispute) if the retry is still
not successful.
2. **Artifact missing:** The prepared artifact might have been deleted due to
operator error or some bug in the system. We will re-create it on retry.
#### Preparation timeouts