Oasis Network Architecture: Designed for Scaling
The only blockchain with native support for rollups at the consensus layer
Layer-2 scaling solutions have evolved from “sidechains” to “commitchains,” “rollups” and validating bridges. “Rollup” refers to running a smart contract virtual machine (VM) where the VM’s state is periodically verified and committed to an underlying blockchain to provide smart contract functionality at a reduced cost and a higher throughput rate, as opposed to running smart contracts directly on the underlying blockchain. Verification done at the underlying blockchain secures the state transitions, and the off-chain computation allows smart contract execution to scale.
We didn’t call them rollups, but that’s how the Oasis Network’s ParaTimes, its compute layer, work.
From its inception, the Oasis Network has separated computation from consensus as a modular design principle. This separation means that ParaTime layer entities only handle smart contract execution, and the consensus layer entities only handle consensus — both are greatly simplified. This has many benefits, including ease of auditing, fault isolation and reduced compute replication without sacrificing security. The separation, having compute done in a (rollup) virtual machine and the results verified and logged into a blockchain, is exactly what rollups are all about.
The Oasis Network isn’t just a network that natively supports rollups: its architecture is optimized for rollups and only rollups. It discourages putting general computation in the consensus layer and allows only built-in contracts to run there. These built-in contracts are the validating bridge contracts in rollup jargon. While the Oasis architecture’s design goal was a modular Layer-1 blockchain that supports smart contracts, one might argue that the result of the modularity is that the Oasis consensus layer is a blockchain that only supports rollups, since, viewed through the rollup lens, all ParaTimes are Layer 2 rollup virtual machines.
In particular, Oasis’s validating bridge in the consensus layer uses a fraud proof technique called “discrepancy detection” to validate the results from the compute layer. This technique, the product of co-designing fraud detection with the system architecture, uses “bare metal proofs” that are simple and more trustworthy because there are fewer things that can go wrong. The simplicity of bare metal proofs also gives ParaTime designers more headroom: implementing a smart contract execution environment where smart contracts are sandboxed native code becomes feasible, giving the system room for performance improvements beyond those of concurrent ParaTime execution.
One might also say that the Oasis Network is the first network that natively supports rollups, and that the Layer 1/Layer 2 nomenclature is insufficiently descriptive and precise.
Call it convergent design. Let’s unpack what this stuff means and go into some of the details.
Properties of Rollups
Rollups have mainly been designed to speed up Ethereum smart contract processing. More specifically, they were designed to execute Ethereum smart contracts but in a separate and independent virtual machine, distinct from the Ethereum Virtual Machine (EVM) on the Ethereum Layer 1 “base chain” to reduce the Layer 1 workload. All of the real contract execution is done in the Layer 2 rollup virtual machine. The only work that the underlying blockchain does is to validate the rollup virtual machine execution using a “validating bridge” smart contract. Normally, the rollup virtual machine is also an instance of the EVM, and in some designs, such as Arbitrum, the rollup virtual machine is tweaked to make validation in the base chain easier. As long as the validation check in the Layer 1 is cheaper than running the rollup smart contracts directly in Layer 1, we have an efficiency gain since Layer 2 machine execution is supposed to be cheaper. This is the heart of how rollups achieve transaction throughput scaling.
Note that on top of the Layer 1 blockchain, there could be many Layer 2 virtual machines. There are no architectural limits other than Layer 1 validation throughput. The more efficient the validating bridge contract and the less non-bridge smart contract execution load for the Layer 1 blockchain to run, the more rollups could be supported. Obviously, there could be a validating bridge-type contract running inside the rollup as well, but there are limits to this recursion.
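To make this division of labor concrete, here is a minimal sketch, in Go, of the relationship between off-chain execution and an on-chain validating bridge. All names (`validatingBridge`, `executeOffChain`, and so on) are hypothetical illustrations, not Oasis or Ethereum code, and the “state root” is a stand-in for a real Merkle commitment:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// stateRoot stands in for a cryptographic commitment to the full
// rollup VM state (in practice, a Merkle root).
type stateRoot [32]byte

// validatingBridge is a toy model of the on-chain contract: it only
// records claimed state roots; all heavy execution happens off-chain.
type validatingBridge struct {
	committed []stateRoot
}

// submit records an executor's claimed post-state. A real bridge would
// also open a challenge window (optimistic) or verify a proof (zk).
func (b *validatingBridge) submit(root stateRoot) {
	b.committed = append(b.committed, root)
}

// executeOffChain models the Layer 2 rollup VM: it applies transactions
// to the previous state and returns a commitment to the result. Only
// this commitment, never the execution itself, touches Layer 1.
func executeOffChain(prev stateRoot, txs []string) stateRoot {
	h := sha256.New()
	h.Write(prev[:])
	for _, tx := range txs {
		h.Write([]byte(tx))
	}
	var out stateRoot
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	var genesis stateRoot // zero value as the genesis commitment
	bridge := &validatingBridge{}
	root := executeOffChain(genesis, []string{"alice->bob:10", "bob->carol:3"})
	bridge.submit(root)
	fmt.Printf("committed %d state root(s), latest %x...\n", len(bridge.committed), root[:4])
}
```

A real bridge would of course also enforce a validation step (a challenge window for optimistic rollups, or proof verification for zk rollups) before treating a submitted root as final.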
Most rollup designs commit transaction data to the base chain first and then commit the resultant state of the rollup virtual machine in a subsequent transaction. This provides transaction order finality and, in a way, solves the data availability problem, since the rollup virtual machine state can be reconstructed from the transaction data at the cost of re-executing all transactions since the rollup virtual machine’s genesis state. Validating bridges can be more general, with off-chain state storage that is verifiable using Merkle proofs, so a separate data availability solution is possible. For example, transaction data can be cryptographically summarized in a similar way to the virtual machine state, reducing Layer 1 storage costs and separating availability, provided by the off-chain storage, from the authenticity of the data. This has a trade-off with validation: a temporary data availability throughput problem could delay validation, because accessing the input data needed for verification might be slow.
The way that rollups do the virtual machine validation comes in two forms:
- Optimistic rollups, where a claimed execution result is publicly posted and potentially challenged.
- Zk rollups, where SNARKs are used to construct a proof of correctness.
The key difference is that optimistic rollups use “fraud proofs,” where challengers assemble evidence to prove that a claimed execution result is incorrect or fraudulent, and zk rollups use “validity proofs,” which executors publish along with the rollup virtual machine state and which the validating bridge smart contract verifies for correctness. In both cases, time is needed to allow other participants to see the new state and either discover that the state is incorrect and construct a fraud proof, in the case of optimistic rollups, or execute a proof verification algorithm to reject a purported proof when it is incorrect, in the case of a zk rollup. While it may seem a minor distinction, in both cases work has to be done either to construct a fraud proof or to check whether the validity proof is correct. The difference is that fraud proofs typically involve re-execution of the transactions, whereas a validity proof check is supposed to be cheaper, using cryptographic techniques (probabilistically checkable proofs; SNARKs). In practice, the cryptographic techniques involve significant overhead and do not (yet) generalize to arbitrary smart contract execution.
Note that, in full generality, the rollup virtual machine doesn’t have to be Ethereum compatible nor does the validation mechanism have to run on top of Ethereum. Any decentralized computation substrate would do, extending its security to that of the rollup virtual machine. Of course, having the rollup virtual machine be Ethereum compatible makes it trivial to port existing EVM code.
Flavors of Fraud Proofs
To understand why Oasis network’s fraud detection and system architecture co-design resulted in a more efficient and general scheme, we first need to discuss the different kinds of fraud proofs.
Simulation Proofs
The way that optimistic rollup systems like Arbitrum and Optimism work is that challengers who claim that a computation is wrong must submit a fraud proof. The fraud proof is supposed to show that the computation that led to the incorrect resultant state diverged from a correct virtual machine execution in a single VM instruction, which can be “easily” verified using a rollup virtual machine simulator running on the base chain.
This is not easy; it is enormously complex, and it strongly couples the rollup virtual machine instruction set with the validating bridge contract implementation. The key to being able to check the execution is to not simulate the entire execution of the rollup virtual machine, since that would be enormously expensive in gas fees. Instead, the challenger provides incontrovertible evidence that the executor’s assertion of the output rollup VM state is wrong by pointing at the exact VM instruction that did not execute correctly, using bisection over execution states and Merkle proofs. The proof construction can be done offline; only the verification needs to be done online, by the validating bridge contract, which checks that the execution of the indicated VM instruction did indeed violate the VM semantics.
For this to work, the validating bridge contract must be able to simulate any rollup VM instruction. Ensuring that the simulator is correct and that it is semantically equivalent to the actual rollup VM is technically challenging. While the complexity of a simulator depends on the complexity of the VM instruction set, it is likely to require many person-years of effort to construct an accurate and robust simulator, as is the case for Arbitrum and Optimism.
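The bisection step at the heart of simulation proofs can be illustrated with a toy model. Assume both parties have committed to per-step state hashes and that the traces agree on a prefix and disagree thereafter (which holds for deterministic execution); binary search then isolates the single disputed instruction. This is a simplified illustration, not Arbitrum’s or Optimism’s actual protocol:

```go
package main

import "fmt"

// firstDivergence finds the earliest index at which two per-step state
// traces disagree. It assumes the traces agree on a prefix and disagree
// from the first faulty step onward, which holds for deterministic
// execution: once states diverge, they stay divergent.
func firstDivergence(honest, claimed []uint64) int {
	lo, hi := 0, len(honest)-1 // the disputed final states differ at hi
	for lo < hi {
		mid := (lo + hi) / 2
		if honest[mid] == claimed[mid] {
			lo = mid + 1 // still in agreement: fault lies later
		} else {
			hi = mid // already diverged: fault lies here or earlier
		}
	}
	return lo
}

func main() {
	// Toy VM whose state is a running sum; the fraudulent executor
	// mis-executed step 5, so the traces diverge from index 5 onward.
	honest := []uint64{1, 3, 6, 10, 15, 21, 28, 36}
	fraud := []uint64{1, 3, 6, 10, 15, 22, 29, 37}
	step := firstDivergence(honest, fraud)
	// Only this single instruction needs on-chain simulation.
	fmt.Println("simulate instruction", step, "on the base chain")
}
```

The logarithmic number of bisection rounds is what keeps on-chain costs low; the price is the tight coupling to the VM’s instruction-level semantics discussed above.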
A challenger claiming fraud is expected to submit a fraud proof along with a new claim for the correct result state. If a fraudulent execution trace F diverged from the correct execution trace at some time t_0, a challenger could construct an execution trace C that diverges from the correct trace at some later time step t_1 > t_0 and submit a fraud proof that would successfully overturn F but is itself a fraudulent result. Thus, a fraud proof against F does not prove that C is correct, and a successful challenge with a different result should reset the challenge submission clock so that C can itself be challenged. A successful challenge is not conclusive evidence that the result associated with C is correct without additional checking by other verifiers.
An advantage of simulation proofs is that they are precise: assuming everything works, we are sure when a result is incorrect because we know exactly where the VM semantics were violated. Note, however, that the validating bridge is a smart contract executing on a blockchain, and the correctness of its result is only probabilistic, whether consensus about its execution occurs on a PoW or PoS chain. (In the former case, there is the probability of a successful 50%+ε attack; in the latter, the probability of ⅔ of the total stake coming under Byzantine actors’ control. In both cases, there is also the probability of a common-mode failure, such as a zero-day vulnerability in the underlying blockchain code or the host operating system allowing widespread failures.) Since fraud proof checking is probabilistic, the advantage of having a deterministic proof is more theoretical than actual.
Bare Metal Proofs
What do we mean by “bare metal proofs”? A bare metal proof scheme:
- Must be robust. The proofs can be probabilistic in nature (like ZKPs) with the ability to choose security parameters to drive the probability of error to be as close to zero as necessary.
- Should work with any (deterministic) virtual machine without machine-specific adaptations, including VMs that are simple extensions of common instruction-set architectures such as x86-64. The scheme should allow running sandboxed native-code binaries (with restrictions to avoid non-deterministic behavior) at full, “bare metal” speed, with blockchain instruction extensions as function calls.
In particular, these requirements mean that a system based on bare metal fraud proofs will be much simpler than one based on simulation proofs. The validating bridge contract does not need access to the rollup virtual machine’s state: verifying that state requires knowledge of the particular Merklized data structures used and how to check them, which is rollup-VM specific, and granting such access would require extra interfaces and mechanisms for the underlying blockchain to reach into the rollup state, making the module interfaces more complex. Furthermore, no VM-specific instruction-level simulation is needed. Avoiding (unnecessary) complexity is an important security goal, since complexity engenders bugs.
Interestingly, requiring the fraud proof scheme to be more general also relaxes constraints on the rollup virtual machine design. While we can continue to write smart contracts in a language that compiles to bytecode and use bytecode interpreters as the virtual machine (which makes it easier to single-step and collect a Merklized execution trace), we are no longer constrained to this approach. More advanced and efficient techniques can be applied, so there is an enormous amount of headroom for smart contract efficiency improvements. For example, virtual machine instructions or bytecodes can be compiled to machine code that runs in a Software Fault Isolation (SFI) sandbox such as RLBox, achieving near-native code performance.
Oasis ParaTimes
In Oasis’s case, the separation of smart contract execution from consensus meant that we ended up with a rollup-style design. There is an EVM-compatible ParaTime, the Emerald ParaTime, and others that run Rust-based smart contracts, all of which use a single, built-in, object-oriented validation smart contract. No other blockchain restricts the base chain to running only rollup validation contracts.
Discrepancy Detection Fraud Proofs
We mentioned earlier that Oasis only runs built-in validation contracts in the consensus layer. Currently, this is a bare metal fraud-proof style validating bridge, where we use a technique called “discrepancy detection” to detect fraud and, if detected, “discrepancy resolution” to resolve it. The key idea behind discrepancy detection is that we can be more efficient at detecting fraud than at correcting it, similar to how redundant bits in coding theory can detect more errors than they can correct. Under the assumption of failure independence, that is, that compromising one execution node (by bribing operational staff, guessing their passwords, etc.) does not help an adversary compromise others, we can use smaller compute committees, requiring that all members arrive at the same smart contract execution result.
The practical security of this design is easier to understand. A committee-based design provides a publicly known security parameter: the committee size. The committee executes contracts in a semi-synchronized fashion, so users know both the number of cross-checks and when the cross-checks will be completed; committee members have an SLA and can be slashed for lack of availability. This contrasts with optimistic rollups, where users wait a fixed amount of time for an unknown number of potential challengers or verify the computations themselves. Note that the design does not preclude non-committee members from submitting fraud proofs, and even “late” fraud proofs from non-committee members can be handled (see the checkpointing section below), so validation is not closed.
When a discrepancy occurs, a discrepancy resolution/recovery phase executes: for the semi-synchronous committee results, we immediately run a larger committee to figure out which result is correct. This resolution committee is much larger (resolution could be run by the entire consensus committee, for example), so there will be confidence in its result. While this is more expensive in terms of both replicated computation and communication, that’s okay: it should be extremely infrequent, and the cost is amortized over the vast majority of cases where resolution is not needed.
The efficiency gain is due to applying a fast path/slow path optimization common in systems design. The likely case of no fraud/no discrepancy is handled efficiently, and the unlikely case of having to determine which of two or more discrepant results is correct can be slower. See here for details on the security parameters.
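The fast-path/slow-path split can be sketched as follows. This is an illustrative toy, not the Oasis implementation; `detectDiscrepancy` and `resolveDiscrepancy` are hypothetical names:

```go
package main

import "fmt"

type result string

// detectDiscrepancy is the fast path: every member of the small compute
// committee must report the identical result, or we escalate.
func detectDiscrepancy(committee []result) (result, bool) {
	for _, r := range committee[1:] {
		if r != committee[0] {
			return "", true // discrepancy detected: escalate
		}
	}
	return committee[0], false
}

// resolveDiscrepancy is the slow path: a much larger committee
// re-executes and the plurality result wins. Expensive, but expected to
// be extremely rare, so the cost amortizes away.
func resolveDiscrepancy(bigCommittee []result) result {
	counts := map[result]int{}
	var best result
	bestN := 0
	for _, r := range bigCommittee {
		counts[r]++
		if counts[r] > bestN {
			best, bestN = r, counts[r]
		}
	}
	return best
}

func main() {
	if r, discrepant := detectDiscrepancy([]result{"stateA", "stateA", "stateA"}); !discrepant {
		fmt.Println("fast path commits:", r)
	}
	if _, discrepant := detectDiscrepancy([]result{"stateA", "stateB", "stateA"}); discrepant {
		big := []result{"stateA", "stateA", "stateB", "stateA", "stateA"}
		fmt.Println("slow path resolves to:", resolveDiscrepancy(big))
	}
}
```

Note that detection needs only an equality check over a handful of results, which is why the consensus-layer bridge stays so cheap.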
The Oasis Network’s consensus layer does not run general smart contracts. Instead, Oasis has a validating bridge functionality that is otherwise instantiated as a smart contract in an Ethereum-based rollup. We use discrepancy detection to validate the results from the nodes in the compute (ParaTime) layer because it is both more efficient and more general.
To understand why discrepancy detection is more efficient, let us consider the expected amount of time required before a random selection of compute committee members might allow an adversary to compromise the computation. If out of a total of 100 candidates at most 33 are Byzantine and we randomly choose 20 to serve as the compute committee, even if new committee compositions were selected at a rate of 100,000 per hour, the adversary controlling those 33 Byzantine nodes would have to wait an expected 1,066 years before an all-Byzantine committee is chosen and could be used to attack the system. A common Byzantine fault tolerance scheme may use a 100-way replication to reach consensus; here, we replicate the computation only 20 times.
The details of the technical analysis for the calculations and why discrepancy detection is more efficient can be found in the appendix in our white paper.
Because discrepancy detection only compares the results of computation, it is more general than any scheme that requires comparing intermediate states via single-stepped execution, the approach used in most optimistic rollup designs. The work that the discrepancy detection validating bridge contract needs to do in the consensus layer is small, making it easy to scale the system arbitrarily by running multiple ParaTimes concurrently for parallel execution.
Extensibility for other fraud/validity proof schemes
Note that while the Oasis Network’s consensus layer bakes in discrepancy detection, it is nonetheless architecturally extensible, allowing other rollup execution validation techniques to be implemented. We do not yet see a need to do so.
If and when zk-rollup proof schemes mature enough for general computation, a zkSNARK verifier would be a great addition. Since the baked-in validating bridges are implemented in Go, it should also run much faster than a Solidity smart contract verifier would.
There are more variations/dimensions in the fraud proof design space that were not covered here. See here for more information on fraud proof design.
Checkpointing
Discrepancy detection allows us to set the committee size parameter to be confident that the likelihood of a whole committee compromise is negligible. There remain two issues:
- Discrepancy detection, if done only by the compute committee, is not fully open, since qualifying as a compute node candidate requires resources, such as stake, to prevent Sybil attacks.
- Common-Mode Failures, such as programming errors in the Oasis Network code or the Linux kernel, could allow an adversary to use a zero-day compromise to take over all the compute — and consensus — nodes of the network.
Note that the concern about programming errors exists for any network, of course, and is by no means unique to Oasis. As a matter of fact, we believe that the Oasis code quality and review processes are among the best in class.
Handling zero-day vulnerabilities is hard. Indeed, like all software systems, no blockchain has a general solution other than possibly doing a hard fork. In Oasis, we address catastrophic failures — extremely unlikely scenarios such as an all-committee compromise, zero-day vulnerability exploitation / common-mode failures, etc. — by distinguishing transaction order determination and transaction result determination and allowing anyone to challenge the transaction results.
What does this mean? We are adding secure logging of periodic state checkpoint hashes and transaction ordering. This data makes it easy to replay transactions from a known-good checkpoint after a catastrophic failure is identified via a challenge and addressed.
The logging needs to be monotonic and have finality, since we include long-range attacks and loss of control of cryptographic keys as potential failure modes. This property can be achieved by writing the log to a distributed append-only ledger such as another blockchain, e.g., Ethereum; by writing to physically append-only media, e.g., a continuous-feed printer; by writing to write-once media (CD-R/DVD-R); by writing to media that is periodically copied to off-line backup services; etc. Writing to Ethereum would use Ethereum’s finality to essentially timestamp the log record’s creation, whereas writing to other append-only mechanisms will require verification, e.g., by comparing independent copies, that the media is not a new but altered copy of the legitimate log. See the Shades of Finality paper (to be released soon) for a more in-depth discussion of transaction order finality and state value finality.
Once this is complete, anyone may challenge the results from the compute layer, even after discrepancy detection has accepted the results. This means that catastrophic failures such as an all-Byzantine committee or a zero-day vulnerability can be handled if detected in a timely fashion. As long as there is a valid state checkpoint that predates the catastrophic failure, we have a known-good state to use as the basis for recovery.
The recommended policy for handling catastrophic failures is to honor the transaction order. This means that if a challenger shows that an incorrect state transition occurred, we can replay the logged transactions submitted after a known-good state to compute the correct current state, using as much replication and verification as needed. Just as discrepancy detection allows an efficient fast path for the no-discrepancy case and a more expensive scheme for the infrequent case when a discrepancy is detected, checkpoint-and-replay recovery from catastrophic failures can be expensive, since such failures are expected to be extremely unlikely.
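Checkpoint-and-replay recovery can be sketched as follows, assuming deterministic transaction execution; `applyTx` and the hash-based state are hypothetical stand-ins for real ParaTime execution:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// state is a stand-in for a full ParaTime state; here it is just a hash
// chained over the transactions applied so far.
type state [32]byte

// applyTx models deterministic transaction execution: the same
// transaction applied to the same state always yields the same state.
func applyTx(s state, tx string) state {
	return sha256.Sum256(append(s[:], tx...))
}

// replay reconstructs the correct current state from a known-good
// checkpoint by re-executing the logged transactions in their
// finalized order.
func replay(checkpoint state, orderedLog []string) state {
	s := checkpoint
	for _, tx := range orderedLog {
		s = applyTx(s, tx)
	}
	return s
}

func main() {
	var checkpoint state // last checkpoint whose hash was securely logged
	orderedLog := []string{"tx1", "tx2", "tx3"} // order already finalized
	recovered := replay(checkpoint, orderedLog)
	fmt.Printf("recovered state: %x...\n", recovered[:4])
}
```

Determinism is what makes honoring the logged transaction order sufficient: any honest party replaying the same log from the same checkpoint arrives at the same state.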
Summary
The Oasis Network architecture resulted from a co-design between fraud detection and system architecture. There is a modular separation of duties between computation and consensus layers, with the interface between them consisting of a simple and efficient bare-metal fraud proof scheme.
The advantages of the Oasis design are:
- Clean and modular architecture, making the design easier to understand and reason about than monolithic designs. The modularity also translates into cleaner implementations, which makes security auditing easier.
- Efficient fraud detection, with explicit security parameters that allow ParaTime designers to choose appropriate parameters for the intended use cases.
- Bare-metal fraud proofs permit future development of high-performance smart contract execution environments such as sandboxed native code, making compute- or data-intensive smart contracts feasible in the future.
- Users know a lower bound on how much independent verification is done, rather than hoping that validators were available and operational during the challenge period, as is the case with optimistic rollups.
The result is that Oasis Network is a rollup-style blockchain, where the consensus layer only runs validating bridge contracts. Multiple ParaTimes — independent rollup virtual machines — have successfully been deployed on top of the Oasis consensus layer. Currently, the Emerald ParaTime provides EVM-compatible smart contract execution, and the Cipher ParaTime will provide a confidential smart contract execution environment, once released in Q1 2022.