Oasis Network validators and engineers have spent the last year preparing for and testing the Eden upgrade, which went live twice: first on Testnet, then on Mainnet. The features and improvements included in this network upgrade represent a significant milestone in the history of Oasis.
During the network upgrade on November 29, 2023, an issue with transitioning trust to the new chain was encountered. In this article, we explain this incident and the creative solutions developed by the Oasis Foundation team to overcome it. Uncompromising security is a core component of the engineering culture at Oasis, and the effort of executing this upgrade while preserving the safety and stability of the network despite surprising challenges was no exception.
What happened during the upgrade?
During the Eden upgrade, while the new consensus layer started quickly after the upgrade, the confidential runtimes (i.e., Sapphire, Cipher, and the key manager) refused to accept that the new network was actually a trustworthy canonical chain. Why do runtimes even need to check this? Why is it a problem if the check fails? Why did this happen in the first place? Before answering these questions, here is a brief overview of the Oasis architecture and the mechanics of completing the upgrade.
A confidential runtime can hold secrets that must not be disclosed, even to the node operator that is running the runtime. The secrets are managed inside a Trusted Execution Environment (TEE) and only properly attested enclaves may access them. The consensus layer represents the root of trust of the entire system as it stores the canonical state of all the runtimes. If a node operator were able to trick a runtime into accepting a malicious fork of the consensus layer as valid, this would open the runtime to a wide variety of attacks. This is why the runtime does not actually trust the node operator’s host environment.
Every Oasis confidential runtime internally runs a light client that is verifying all of the consensus layer blocks. When an upgrade like Eden happens, it is this light client that needs to be convinced that the new, upgraded consensus layer is actually a valid continuation approved by 2/3+ of the last known validator set and not a malicious fork. Even the Oasis Foundation cannot override this logic.
The consensus layer of the new chain starts running when more than two thirds of the voting power defined in its genesis document validates and signs the genesis block. This was also the case for the Eden upgrade where the genesis block was signed by ~67.3% of the voting power. Blocks built after the new genesis block have a higher percentage of signers, but as soon as the two-thirds threshold is reached, the new consensus starts.
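The two-thirds rule can be illustrated with a minimal sketch. The function name and the voting power figures are hypothetical and chosen only to mirror the ~67.3% mentioned above; this is not the Oasis Core implementation.

```python
def has_quorum(signed_power: int, total_power: int) -> bool:
    """Return True when signed_power is strictly more than 2/3 of total_power."""
    # Integer arithmetic avoids floating-point rounding right at the boundary.
    return 3 * signed_power > 2 * total_power


# Hypothetical network with 1000 units of total voting power.
total = 1000
print(has_quorum(673, total))  # ~67.3% signed -> True
print(has_quorum(666, total))  # just under two thirds -> False
```

The comparison is strict: exactly two thirds of the voting power is not enough.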
Yet somehow, these votes did not convince the light client inside the confidential runtimes to trust the new network. The light client reported that the voting power that signed the block was insufficient (the reason is explained below under “What was the root cause?”).
The consequence of this check failing was that the confidential runtimes refused to start and would not give access to any encrypted secrets. At that point, the Oasis team realized that the confidential runtimes would not be able to initialize and asked all validators to stop their nodes until the issue could be resolved.
How did Oasis engineers resolve the incident?
Readers may now ask themselves, “Why not simply change the verification logic of the confidential runtimes, for example, to temporarily reduce the required threshold?”
As briefly mentioned above, confidential runtimes run in Trusted Execution Environments (TEEs). A confidential runtime can generate special encryption keys that are derived using the runtime’s binary identity. Changed verification logic would result in a different runtime binary, which by design would result in different encryption keys being generated. So, the “fixed” logic would be unable to decrypt any previously encrypted secrets. This is by design, to ensure that nobody, not even the Oasis Foundation, can extract any secrets.
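A toy model shows why a patched binary cannot reach the old secrets. It assumes the runtime identity is a hash of the binary and that keys are derived from a hardware-held secret plus that identity; the names `enclave_identity`, `derive_key`, and `hardware_secret` are hypothetical, and real TEEs perform this derivation in hardware rather than with the HMAC used here.

```python
import hashlib
import hmac


def enclave_identity(runtime_binary: bytes) -> bytes:
    # Stand-in for a TEE measurement: a hash over the runtime binary.
    return hashlib.sha256(runtime_binary).digest()


def derive_key(hardware_secret: bytes, identity: bytes) -> bytes:
    # The key is bound to the identity; HMAC serves as a simple KDF here.
    return hmac.new(hardware_secret, identity, hashlib.sha256).digest()


hardware_secret = b"hypothetical-per-cpu-secret"
original = b"runtime with the original verification logic"
patched = b"runtime with a relaxed verification threshold"

key_original = derive_key(hardware_secret, enclave_identity(original))
key_patched = derive_key(hardware_secret, enclave_identity(patched))

# Any change to the binary changes the identity, and with it the key.
print(key_original == key_patched)  # False
```

In this model, data encrypted under `key_original` stays out of reach of the patched binary, which is the property described above.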
Instead, the Oasis team tackled the problem in a different way. While the genesis block did not carry enough signed voting power to satisfy the runtime’s light client, the network actually had enough voting power online. This is why the team decided to quickly collect the missing signatures out of band and bake them into another Oasis Core release.
When a confidential runtime asks the host node for the list of genesis block signatures, the additionally collected valid signatures from validators would be included without compromising security. To accomplish this, a small block signing tool was rapidly prepared just for the Eden genesis block and sent to the validators that didn’t sign the genesis block in time. The Oasis team then collected their signatures and incorporated them into the Oasis Core 23.0.8 patch release.
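The following sketch suggests why handing extra signatures to the light client does not weaken security: every signature is still checked, so only valid ones add voting power. All names and numbers are hypothetical, and `toy_is_valid` is a deliberately trivial stand-in for real digital signature verification.

```python
from typing import Callable, Dict

# Validator ID -> signature over the genesis block.
Signatures = Dict[str, bytes]


def merge_signatures(from_block: Signatures, collected: Signatures) -> Signatures:
    merged = dict(collected)
    merged.update(from_block)  # signatures already in the block take precedence
    return merged


def verified_power(
    signatures: Signatures,
    power: Dict[str, int],
    is_valid: Callable[[str, bytes], bool],
) -> int:
    # Count voting power only for known validators with a valid signature.
    return sum(
        power[validator]
        for validator, signature in signatures.items()
        if validator in power and is_valid(validator, signature)
    )


def toy_is_valid(validator: str, signature: bytes) -> bool:
    # Toy check for illustration only; NOT real cryptography.
    return signature == b"sig-" + validator.encode()


power = {"a": 40, "b": 25, "c": 20, "d": 15}  # total voting power: 100
from_block = {"a": b"sig-a", "b": b"sig-b"}
collected = {"c": b"sig-c", "d": b"forged"}

merged = merge_signatures(from_block, collected)
print(verified_power(from_block, power, toy_is_valid))  # 65, not more than 2/3
print(verified_power(merged, power, toy_is_valid))      # 85, forged entry ignored
```

The forged entry for `d` contributes nothing, so the host can only ever help the light client reach a threshold that honest validators actually signed for.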
With the additional signatures acquired, the trust transition was successfully performed and all of the confidential runtimes started operating normally after a two-epoch (approximately two-hour) initialization period. Importantly, collecting the additional signatures would have been impossible without the help of Oasis validators, underscoring the security and resilience of the network. The Oasis Network officially completed the process of upgrading to Eden on November 29.
What was the root cause?
Subsequent root cause analysis uncovered why there was a discrepancy in voting power in the first place. In short, this happened because of Oasis Network’s support for rotating validator keys. On the Oasis Network, stake and its voting power are bound to “entities”, and each entity can run multiple validator nodes. But only one of an entity’s nodes can actually be elected as a validator within any given epoch.
It turned out that two entities in the validator set switched to a different validator node right at the time of the Eden upgrade, so the runtime light client did not count the voting power of those two validators.
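A simplified model of this mismatch is sketched below. It assumes that the trusted validator set from the old chain is looked up by node key, which is an illustrative guess at the mechanism rather than a description of Oasis Core internals; all names and numbers are hypothetical.

```python
# Trusted validator set from the old chain: node key -> (entity, voting power).
trusted = {
    "node-a1": ("entity-a", 40),
    "node-b1": ("entity-b", 30),
    "node-c1": ("entity-c", 30),
}

# Which entity operates which node; entity-c also runs a second node.
node_owner = {
    "node-a1": "entity-a",
    "node-b1": "entity-b",
    "node-c1": "entity-c",
    "node-c2": "entity-c",
}

# Genesis block signers on the new chain: entity-c signed with its other node.
signers = ["node-a1", "node-c2"]


def power_by_node_key(signers, trusted):
    # Unknown node keys contribute nothing, even if the entity holds stake.
    return sum(trusted[node][1] for node in signers if node in trusted)


def power_by_entity(signers, trusted, node_owner):
    # Attribute each signature to the entity behind the node.
    entity_power = {entity: power for entity, power in trusted.values()}
    signed_entities = {node_owner[node] for node in signers if node in node_owner}
    return sum(entity_power.get(entity, 0) for entity in signed_entities)


print(power_by_node_key(signers, trusted))            # 40
print(power_by_entity(signers, trusted, node_owner))  # 70
```

With 100 units of total voting power, the per-node tally of 40 falls short of two thirds while the per-entity tally of 70 exceeds it. The per-entity tally is shown only to make the discrepancy visible; it is not one of the remedies the team describes.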
While no similar hard fork upgrades are currently planned for the future, this problem will be addressed in two ways.
- The light client will be able to additionally verify subsequent blocks from the first epoch in order to verify the chain transition.
- The genesis state will include an explicit encoding of the previous network’s validator set that will be in effect for the first epoch of the new network.
How does this impact the Oasis Network going forward?
The impromptu adjustments made during the upgrade process have no lasting effect on the security or stability of the Oasis Network. But the entire episode highlights the agile performance and uncompromising security standards of the Oasis Foundation’s engineering team. Even with a slightly prolonged deployment period, the upgrade was completed successfully and the network continues to function stably.
“Blockchain development is a very thorough and complex process,” said Jernej Kos, Director at the Oasis Foundation. “The stakes are always high, which is why all of our engineers are unwaveringly committed to ensuring the security and success of any new release even if that requires small periods of delay.”