change prepare worker to use fork instead of threads (#1685)

Co-authored-by: Marcin S <marcin@realemail.net>
jserrat
2023-11-14 14:50:18 -03:00
committed by GitHub
parent 3a87390b30
commit 54f84285bf
24 changed files with 1468 additions and 534 deletions
@@ -1,7 +1,11 @@
# PVF Host and Workers
The PVF host is responsible for handling requests to prepare and execute PVF
code blobs, which it sends to PVF workers running in their own child processes.
code blobs, which it sends to PVF **workers** running in their own child
processes.
While the workers are generally long-living, they also spawn one-off secure
**job processes** that perform the jobs. See the "Job Processes" section below.
This system has two high-level goals that we will touch on here: *determinism*
and *security*.
@@ -36,8 +40,11 @@ execution request:
not successful.
2. **Artifact missing:** The prepared artifact might have been deleted due to
operator error or some bug in the system.
3. **Panic:** The worker thread panicked for some indeterminate reason, which
may or may not be independent of the candidate or PVF.
3. **Job errors:** For example, the worker thread panicked for some
indeterminate reason, which may or may not be independent of the candidate or
PVF.
4. **Internal errors:** See "Internal Errors" section. In this case, after the
retry we abstain from voting.
### Preparation timeouts
@@ -62,10 +69,16 @@ more than the CPU time.
### Internal errors
In general, for errors not raising a dispute we have to be very careful. This is
only sound, if we either:
An internal, or local, error is one that we treat as independent of the PVF
and/or candidate, i.e. local to the running machine. If this happens, then we
will first retry the job and if the error persists, then we simply do not vote.
This prevents slashes, since otherwise our vote may not agree with that of the
other validators.
1. Ruled out that error in pre-checking. If something is not checked in
In general, for errors not raising a dispute we have to be very careful. This is
only sound if either:
1. We ruled out that error in pre-checking. If something is not checked in
pre-checking, even if independent of the candidate and PVF, we must raise a
dispute.
2. We are 100% confident that it is a hardware/local issue: Like corrupted file,
@@ -75,11 +88,11 @@ Reasoning: Otherwise it would be possible to register a PVF where candidates can
not be checked, but we don't get a dispute - so nobody gets punished. Second, we
end up with a finality stall that is not going to resolve!
There are some error conditions where we can't be sure whether the candidate is
really invalid or some internal glitch occurred, e.g. panics. Whenever we are
unsure, we can never treat an error as internal as we would abstain from voting.
So we will first retry the candidate, and if the issue persists we are forced to
vote invalid.
Note that we cannot treat any error from the job process as internal. The job
runs untrusted code and an attacker can therefore return arbitrary errors. If
they were to return errors that we treat as internal, they could make us abstain
from voting. Since we are unsure if such errors are legitimate, we will first
retry the candidate, and if the issue persists we are forced to vote invalid.
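The retry-then-vote rule described in this section can be sketched roughly as follows. This is a minimal illustration, not the host's actual code: the `Attempt` and `Vote` enums and the `decide` function are hypothetical names, and the sketch assumes that only job errors and internal errors trigger a retry.

```rust
/// Hypothetical outcome of a single validation attempt (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Attempt {
    /// The candidate validated successfully.
    Valid,
    /// The candidate was found invalid.
    Invalid,
    /// Error reported by the job process; untrusted code may have forged it.
    JobError,
    /// Error we treat as local to this machine.
    InternalError,
}

/// Hypothetical final decision of this validator.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Vote {
    Valid,
    Invalid,
    Abstain,
}

/// Runs the attempt, retries once on a job or internal error (an assumption
/// of this sketch), and maps the final outcome to a vote.
fn decide(mut attempt: impl FnMut() -> Attempt) -> Vote {
    let first = attempt();
    let outcome = match first {
        // Both kinds of error get exactly one retry.
        Attempt::JobError | Attempt::InternalError => attempt(),
        other => other,
    };
    match outcome {
        Attempt::Valid => Vote::Valid,
        // A persisting job error cannot be trusted, so we vote invalid.
        Attempt::Invalid | Attempt::JobError => Vote::Invalid,
        // A persisting internal error means we do not vote at all.
        Attempt::InternalError => Vote::Abstain,
    }
}
```

For example, calling `decide` with a closure that always returns `Attempt::JobError` runs the closure twice and yields `Vote::Invalid`, while a closure that always returns `Attempt::InternalError` yields `Vote::Abstain`.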
## Security
@@ -119,6 +132,20 @@ So what are we actually worried about? Things that come to mind:
6. **Intercepting and manipulating packages** - Effect very similar to the
above, hard to do without also being able to do 4 or 5.
### Job Processes
As mentioned above, our architecture includes long-living **worker processes**
and one-off **job processes**. This separation is important so that the handling
of untrusted code can be limited to the job processes. A hijacked job process
can therefore not interfere with other jobs running in separate processes.
Furthermore, if an unexpected execution error occurred in the worker and not the
job, we generally can be confident that it has nothing to do with the candidate,
so we can abstain from voting. On the other hand, a hijacked job can send back
erroneous responses for candidates, so we know that we should not abstain from
voting on such errors from jobs. Otherwise, an attacker could trigger a finality
stall. (See "Internal Errors" section above.)
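As a rough sketch of the distinction drawn above, the origin of an unexpected error can determine whether it is even eligible to be treated as internal. The `Origin` and `ErrorClass` types and the `classify` function are hypothetical and only illustrate the reasoning; the real host may distinguish more cases.

```rust
/// Where an unexpected error was observed (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Origin {
    /// The long-living worker process, which does not run untrusted code.
    Worker,
    /// The one-off job process, which runs untrusted code.
    Job,
}

/// How the host treats the error (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ErrorClass {
    /// Local to this machine: retry, then abstain if it persists.
    Internal,
    /// Possibly attacker-controlled: retry, then vote invalid if it persists.
    JobError,
}

/// Errors from the job process are never classified as internal, because a
/// hijacked job could otherwise make us abstain from voting.
fn classify(origin: Origin) -> ErrorClass {
    match origin {
        Origin::Worker => ErrorClass::Internal,
        Origin::Job => ErrorClass::JobError,
    }
}
```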
### Restricting file-system access
A basic security mechanism is to make sure that any process directly interfacing