“Data availability” and the “data availability problem” are terms used to refer to a specific problem faced in various blockchain scaling strategies. This problem asks: how can nodes be sure that when a new block is produced, all of the data in that block was actually published to the network? The dilemma is that if a block producer doesn’t release all of the data in a block, no one can detect whether a malicious transaction is hidden within that block.
In this article, we’ll do a deep dive on the data availability problem, why it’s important and what solutions exist for it.
How Blockchain Nodes Function
In a blockchain, each block consists of two pieces:
- A block header. This is the meta-data for the block, which consists of some basic information about the block, including the Merkle root of transactions.
- The transaction data. This makes up the majority of the block, and consists of the actual transactions.
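As a concrete (and deliberately simplified) sketch, here is what that two-part structure might look like in Python. The field names and the Bitcoin-style duplication of the last node on odd levels are illustrative assumptions, not any specific chain’s format:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(transactions):
    """Compute the Merkle root of a list of transactions, hashing pairs
    upward level by level (duplicating the last node when a level has
    an odd number of entries, as Bitcoin does)."""
    level = [h(tx) for tx in transactions]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Hypothetical transactions and a minimal header committing to them:
txs = [b"alice->bob:5", b"bob->carol:2", b"carol->dave:1"]
header = {
    "prev_hash": b"\x00" * 32,    # illustrative previous-block hash
    "tx_root": merkle_root(txs),  # ties the small header to the full tx data
}
```

The key property is that the tiny `tx_root` field commits the header to the entire transaction list: changing any transaction changes the root.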
There are also generally two types of nodes in a blockchain network:
- Full nodes (also known as fully validating nodes). These are nodes that download and check that every transaction in the blockchain is valid. This requires a lot of resources and hundreds of gigabytes of disk space, but these are the most secure nodes as they can’t be tricked into accepting blocks that have invalid transactions.
- Light clients. If your computer doesn’t have the resources to run a full node, then you can run a light client. A light client doesn’t download or validate any transactions. Instead, it only downloads the block header and assumes that the block contains only valid transactions, so light clients are less secure than full nodes.
Luckily, there’s a way to allow light clients to indirectly check that all the transactions in blocks are valid. Instead of checking the transactions themselves, they can rely on full nodes to send them a fraud proof if a block contains an invalid transaction. This is a small proof that a specific transaction in a block is invalid. We won’t cover how this proof works in this article, but this paper explains it in more detail.
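Fraud proofs build on Merkle inclusion proofs: a full node can point at a specific transaction in a block and prove to a light client, using only the block header, that the transaction really is in the block. As a minimal sketch (not the full fraud proof construction, which the referenced paper covers), verifying such an inclusion proof against a header’s Merkle root looks like this:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_inclusion(tx: bytes, proof, root: bytes) -> bool:
    """Walk from the transaction's leaf hash up to the root, hashing in
    each sibling along the way; the proof checks out iff we arrive at
    the root committed to in the block header."""
    node = h(tx)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root

# Tiny two-transaction tree: root = H(H(tx_a) + H(tx_b))
tx_a, tx_b = b"alice->bob:5", b"bob->carol:2"
root = h(h(tx_a) + h(tx_b))
proof_for_a = [(h(tx_b), False)]  # tx_a's sibling is on the right
assert verify_inclusion(tx_a, proof_for_a, root)
```

The proof is logarithmic in the number of transactions, which is why a light client holding only the header can still be convinced about individual transactions.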
There’s just one problem: in order for a full node to generate a fraud proof for a block, it needs to know the transaction data for that block. If a block producer publishes only the block header but not the transaction data, then full nodes can’t check whether the transactions are valid, and so can’t generate fraud proofs for those that aren’t. Block producers must therefore publish all the data for their blocks, and we need a way to enforce this.
To solve this problem, there needs to be some sort of way for light clients to check that the transaction data for a block was actually published to the network, so that full nodes can check it. However, we want to avoid requiring light clients to download the entire block itself to check that it’s been published, because that defeats the point of a light client.
How do we solve this? First, let’s discuss where else the data availability problem is relevant, and then we’ll dive into the solutions.
Where Is the Data Availability Problem Relevant?
In the first section, we introduced the data availability problem. Let’s discuss which scalability solutions it’s important for.
Increasing the Size of Blocks
In blockchains like Bitcoin, most standard laptops have the ability to run a full node and verify the entire chain, because there is an artificial block size limit to keep the blockchain small.
But what if we wanted to increase the block size limit? Fewer people would be able to afford to run full nodes and independently verify the chain, and more people would run less secure light clients. This is bad for decentralization, because it would become easier for block producers to change the protocol rules and insert invalid transactions that light clients would accept as valid. Adding fraud proof support for light clients therefore becomes very important, but, as discussed, light clients need a way to check that all the data in a block has been published for this to work.
Sharding
One way of increasing the throughput of a blockchain is to split the blockchain into multiple chains called shards. These shards have their own block producers, and can communicate with each other to transfer tokens between shards. The point of sharding is to split up the block producers in the network so that instead of every block producer processing every transaction, they split up their processing power into different shards that only process some transactions.
Typically, a node in a sharded blockchain will run a full node for only one or a few shards, and run a light client for every other shard. After all, anyone running a full node for every shard defeats the purpose of sharding, which is to split up the resources of the network across different nodes.
However, this method has its problems. What if the block producers in a shard become malicious and start accepting invalid transactions? This is more likely to happen in a sharded system than in a non-sharded one: because the block producers are split up across shards, each shard is secured by only a few of them, making it easier to attack.
In order to solve the problem of detecting if any shard accepted an invalid transaction, you need to be able to guarantee that all the data in that shard was published and made available, so that any invalid transaction can be proven with a fraud proof.
Optimistic Rollups
Optimistic rollups are a new scaling strategy based on sidechains called rollups, which can be thought of like shards. These sidechains have their own dedicated block producers, and can transfer assets to and from other chains.
But what if the block producers misbehave, produce blocks that include invalid transactions, and steal the money of the sidechain’s users? Fraud proofs can be used to detect this. But once again, sidechain users need some way of making sure that the data for all of the sidechain’s blocks was actually published, in order to be sure that any invalid transactions can be detected. Rollups on Ethereum deal with this by simply posting all of the rollup blocks onto the Ethereum chain and relying on it for data availability, in effect using Ethereum as a data availability layer to dump data on.
Zero-Knowledge Rollups
Zero-knowledge (ZK) rollups are similar to optimistic rollups, but instead of using fraud proofs to detect invalid blocks, they use a cryptographic proof called a validity proof to prove that a block is valid. Validity proofs themselves don’t require data availability. However, ZK rollups as a whole still do: if a block producer makes a valid block and proves it with a validity proof but doesn’t release the data for that block, users won’t know what the state of the blockchain is or what their balances are, and so won’t be able to interact with the chain.
The Lazy Blockchain
Rollups are a design that uses a blockchain only as a data availability layer to dump transactions on, while all the actual transaction processing and computation happens on the rollup itself. This leads to an interesting insight: a blockchain doesn’t actually need to do any computation, but at minimum it needs to order transactions into blocks and guarantee the data availability of those transactions.
This is the design philosophy of LazyLedger, which is a “lazy” blockchain that only does the two core things that a blockchain needs to do — order transactions and make them available, in a scalable way. This makes it useful as a minimal “pluggable” component for systems such as rollups.
What Solutions Are Available for the Data Availability Problem?
Downloading All The Data
As discussed, the most obvious way to solve the data availability problem is to simply require everyone (including light clients) to download all the data. This is what most blockchains, such as Bitcoin and Ethereum, currently do, but clearly, it doesn’t scale well.
Data Availability Proofs
Data availability proofs are a new technology that allows clients to check with very high probability that all the data for a block has been published, by only downloading a very small piece of that block.
Data availability proofs use a mathematical primitive called erasure codes, which are used everywhere in information technology from CD-ROMs to satellite communications to QR codes. Erasure coding allows you to take a block, say 1MB big, and “blow it up” to 2MB big, where the extra 1MB is a special piece of data called the erasure code. If any bytes from the block go missing, you can recover those bytes easily thanks to the code: you can recover the entire block even if up to 1MB of it goes missing. It’s the same technology that allows your computer to read all the data on a CD-ROM even if it’s scratched.
This means that in order for 100% of a block to be recoverable, only 50% of it needs to be published to the network by the block producer. If a malicious block producer wants to withhold even 1% of the block, they must withhold a full 50% of it, because the missing 1% could otherwise be recovered from the rest.
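The “blow up and recover” behavior can be sketched with a toy Reed-Solomon-style code over a small prime field. Real systems use optimized codes over binary fields; this is purely illustrative. The idea: treat the k data symbols as evaluations of a degree-below-k polynomial, extend it to 2k evaluations, and recover the data from any k of them:

```python
P = 65537  # a small prime; all arithmetic is done modulo P

def interpolate(points, x):
    """Lagrange interpolation: evaluate at x the unique polynomial of
    degree < len(points) passing through the given (xi, yi) points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P  # den^-1 mod P
    return total

def extend(data):
    """Blow k data symbols up to 2k coded symbols by evaluating the
    interpolating polynomial at points 0..2k-1 (the first k equal the data)."""
    k = len(data)
    points = list(enumerate(data))
    return [interpolate(points, x) for x in range(2 * k)]

def recover(shares, k):
    """Recover the original k symbols from ANY k surviving (index, value) pairs."""
    return [interpolate(shares, x) for x in range(k)]

data = [10, 20, 30, 40]                      # k = 4 original symbols
coded = extend(data)                         # 2k = 8 coded symbols
survivors = [(i, coded[i]) for i in (1, 3, 6, 7)]  # any half survives
assert recover(survivors, len(data)) == data
```

Because any k of the 2k symbols determine the polynomial uniquely, withholding less than half of the coded block withholds nothing at all.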
Armed with this knowledge, clients can do something clever to make sure that no parts of the block have been withheld. They can try to download some random chunks from the block, and if they are unsuccessful in downloading any of those chunks (i.e. the chunk is among the 50% of chunks that a malicious block producer didn’t publish), then they will reject the block as unavailable. After trying to download one random chunk, there’s a 50% chance that they will detect that the block is unavailable. After two chunks, there’s a 75% chance; after three chunks, an 87.5% chance; and so on, until after seven chunks there’s a 99% chance. This is very convenient, because it means that clients can check with high probability that the entire block was published, by only downloading a small portion of it.
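The detection probabilities above follow directly from each sample independently having a 50% chance of landing in the withheld half of the block. A few lines of Python confirm the numbers:

```python
def detection_probability(samples: int) -> float:
    """Probability that at least one of `samples` independent, uniformly
    random chunk requests lands in the withheld half of the block."""
    return 1 - 0.5 ** samples

# Reproduce the figures from the text: 50%, 75%, 87.5%, and ~99%.
for n in (1, 2, 3, 7):
    print(f"{n} chunk(s): {detection_probability(n):.2%}")
```

Each additional sample halves the chance of being fooled, which is why a handful of tiny downloads gives such strong assurance.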
The full details of data availability proofs are a bit more complicated and rely on other assumptions, such as requiring a minimum number of light clients in the network, so that enough sample requests are made that the clients can collectively recover the whole block. You can check out the original data availability proofs paper if you want to learn more.
Summary
In this article, we introduced the data availability problem, showed why it’s important for blockchain scalability, and described a solution.
To learn more, check out the following resources:
- John Adler’s whiteboard session about fraud and data availability proofs
- Original fraud and data availability proofs paper
- Coded Merkle Trees paper on an alternative data availability scheme
- Ethereum Research wiki post on the data availability problem