“The funny thing about blockchain data is that even though in theory everything is transparent and accessible, in practice it has historically been very opaque.”
-Hayden Adams, Uniswap founder
There are a few challenges that blockchains have in common when it comes to data querying. In this piece, we will try to describe these as well as the existing ways of solving issues with blockchain data access.
Challenges With Querying Blockchain Data
Blockchain technology appears to be a perfect solution for transparent and decentralized data storage. It also allows for open data access and is strengthened by its immutable nature. These features have helped the technology prove its usefulness in a number of industries. Yet a distributed ledger does not behave like a conventional database, so a few problems with blockchain data access occur.
In the usual sense, a database is a collection of data organized according to a conceptual structure that describes the characteristics of the data and the relationships within it. MySQL and PostgreSQL are commonly known examples. Each database typically comes with a query language (such as SQL, CQL or JPQL) that supports one or more fields of application.
Querying data from a blockchain runs into performance and bandwidth issues. Primarily, this happens because blockchains, unlike regular databases, do not have a native query language. The distributed nature of a blockchain also becomes an obstacle in this case.
Generally, we can define the following problems that prevent efficient querying on blockchains:
- Decentralization and data distribution: as good as it is, a decentralized infrastructure can make it inconvenient to find the right information quickly. Because blocks are sequential, the complete information about a subject is never stored in one block. As per the network’s rules, each block saves just the “hash” of the previous block and is populated with new information. That is why finding a single piece of information in a large dataset takes a lot of time and computing resources.
- Lack of query language: a typical database offers a query language for answering factual questions or returning the records that match a request. The way a blockchain stores data sadly does not fit any of the commonly used query languages, so it takes a lot of code to interact with a DApp’s data.
- Data confusion and entanglement: the previously mentioned issues contribute to the third problem with blockchain data access: data confusion within the internal system. Broadly speaking, in Ethereum-like blockchains, historical data (records) is distributed across events that are stored in a separate part of the node (separate from the blockchain and block storage). Retrieving those events frequently is inefficient.
Moreover, public node providers such as Infura often limit such requests, which considerably slows the query process. This is mainly due to the difficulty of interpreting the data and its general entanglement within the node structure.
- Limited APIs: current application programming interfaces are strongly platform-dependent and support only simple queries (such as range or top-k queries).
These issues are extremely painful in terms of attempting to store a long list of records or transactions.
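To make the event-retrieval problem concrete, here is a minimal sketch of how a client might split a long history into smaller block windows before asking a node for events. It only builds request bodies for the standard Ethereum JSON-RPC method eth_getLogs and sends nothing. The contract address, the topic hash and the window size of 2,000 blocks are assumptions made for illustration; actual limits differ between providers.

```python
import json
from typing import Iterator, Tuple


def block_ranges(start: int, end: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (from_block, to_block) windows covering start..end."""
    if step <= 0:
        raise ValueError("step must be positive")
    current = start
    while current <= end:
        yield current, min(current + step - 1, end)
        current += step


def get_logs_payload(address: str, topic0: str, from_block: int,
                     to_block: int, request_id: int) -> dict:
    """Build one eth_getLogs JSON-RPC request body for a single block window."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_getLogs",
        "params": [{
            "address": address,
            "topics": [topic0],
            "fromBlock": hex(from_block),
            "toBlock": hex(to_block),
        }],
    }


# Hypothetical contract address and event signature hash, for illustration only.
CONTRACT = "0x" + "ab" * 20
TOPIC0 = "0x" + "cd" * 32

payloads = [
    get_logs_payload(CONTRACT, TOPIC0, lo, hi, i)
    for i, (lo, hi) in enumerate(block_ranges(1_000_000, 1_004_999, 2_000))
]
print(len(payloads), "requests needed")
print(json.dumps(payloads[-1]["params"][0], indent=2))
```

Even this short history of 5,000 blocks needs several round trips, and every response still has to be decoded by application code before it can answer a question.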
For instance, blockchain can be successfully applied to securely store data for healthcare purposes. However, retrieving that data, say the records of a particular patient over a certain period of time, takes longer on a blockchain and makes the job really difficult.
The problem is not limited to blockchain applications in the healthcare industry. It can occur while handling transactions on decentralized exchanges, or processing data within betting or trading platforms.
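The retrieval cost can be illustrated with a toy model. The sketch below is not how any real blockchain client is implemented: it is a simplified, in-memory chain in which each block stores the hash of the previous one, and the record layout (patient, ts, note) is a hypothetical example. It shows that answering "which records belong to this patient in this time window" means visiting every block.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    index: int
    prev_hash: str
    records: List[dict]

    def hash(self) -> str:
        """SHA-256 over the block contents, including the previous block's hash."""
        body = json.dumps(
            {"index": self.index, "prev_hash": self.prev_hash, "records": self.records},
            sort_keys=True,
        )
        return hashlib.sha256(body.encode()).hexdigest()


def build_chain(batches: List[List[dict]]) -> List[Block]:
    """Link each batch of records to the hash of the block before it."""
    chain: List[Block] = []
    prev_hash = "0" * 64  # placeholder hash for the first block
    for i, records in enumerate(batches):
        block = Block(i, prev_hash, records)
        chain.append(block)
        prev_hash = block.hash()
    return chain


def find_records(chain: List[Block], patient_id: str, start: int, end: int) -> List[dict]:
    """Linear scan: every record in every block is inspected."""
    hits = []
    for block in chain:
        for record in block.records:
            if record["patient"] == patient_id and start <= record["ts"] <= end:
                hits.append(record)
    return hits


chain = build_chain([
    [{"patient": "p1", "ts": 10, "note": "admitted"}],
    [{"patient": "p2", "ts": 20, "note": "x-ray"}],
    [{"patient": "p1", "ts": 30, "note": "discharged"}],
])
print(find_records(chain, "p1", 0, 25))
```

With three blocks the scan is instant, but the work grows with the length of the chain rather than with the size of the answer.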
What Is the Ideal Way Out?
Following the above-mentioned characteristics, we can define the main features of the solution required for efficient blockchain data querying. Those are:
- A centralized approach to data storage, where each piece of information has its own place and can be easily reached;
- A proper query language for retrieving the information from the blockchain;
- Transparency and an orderly storage layout that make it easy to navigate the data within the system.
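As a rough illustration of the first feature, the sketch below builds a small off-chain index in which every record "has its own place": entries are grouped by key and sorted by timestamp, so a time-range lookup becomes a binary search instead of a scan over all blocks. The record tuples and field names are hypothetical, and a real indexer would persist this structure in a database rather than in memory.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical decoded records: (block_number, patient_id, timestamp, note).
RECORDS = [
    (0, "p1", 10, "admitted"),
    (1, "p2", 20, "x-ray"),
    (2, "p1", 30, "discharged"),
]

Entry = Tuple[int, int, str]  # (timestamp, block_number, note)


def build_index(records) -> Dict[str, List[Entry]]:
    """Group records by patient and sort each group by timestamp."""
    index: Dict[str, List[Entry]] = defaultdict(list)
    for block_number, patient, ts, note in records:
        index[patient].append((ts, block_number, note))
    for entries in index.values():
        entries.sort()
    return index


def query(index: Dict[str, List[Entry]], patient: str, start: int, end: int) -> List[Entry]:
    """Return entries with start <= timestamp <= end using binary search."""
    entries = index.get(patient, [])
    lo = bisect.bisect_left(entries, (start,))
    hi = bisect.bisect_right(entries, (end, float("inf")))
    return entries[lo:hi]


index = build_index(RECORDS)
print(query(index, "p1", 0, 25))
```

The index is built once, in a single pass over the chain, and each later lookup touches only the entries for the requested key.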
What Are the Current Solutions We Can Use?
There are currently a few ways to solve problems with blockchain data access. A blockchain-based DApp that needs to process a large amount of data can:
- Use one of various centralized services
There are a number of companies that provide centralized databases and APIs for blockchain data access. This might be a good solution that considerably speeds up the operation of your system, but it undermines the decentralized nature of the project. Another disadvantage is the risk of service disruption, or of losing access to the stored data because of problems on the provider's side.
- Build your own data storing service
Providing project users with private database storage would solve the problem of possible data loss or access limitations. Yet, on the other hand, such a system will require a huge investment for initial development and set up as well as ongoing maintenance.
- Use The Graph service
Using The Graph protocol for querying blockchains is a convenient solution that solves problems with bandwidth and data access. The Graph is an open source protocol that helps to ensure the full decentralization of data transmission.
A Deep Dive Into The Graph service
The Graph introduces an efficient service for querying data stored on blockchains or on the InterPlanetary File System (IPFS). It provides a high-performance, well-optimized solution that eliminates unnecessary event querying from nodes.
The Graph enables the creation of Subgraphs for secure and easily accessible data storage. For better understanding, you can see a subgraph as a graph within a larger graph.
The Subgraph is a way to index blockchain data and then query it through a simple GraphQL API. Once it is deployed, it becomes a part of the global graph of blockchain data.
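The indexing idea can be sketched conceptually as an event handler that turns raw events into queryable entities. Note that real subgraph mappings are written in AssemblyScript and run inside a Graph Node; the Python below is only an analogy, and the TransferEvent, Account and EntityStore names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TransferEvent:
    block_number: int
    sender: str
    recipient: str
    amount: int


@dataclass
class Account:
    id: str
    net_flow: int = 0        # tokens received minus tokens sent
    transfer_count: int = 0  # number of transfers the account took part in


class EntityStore:
    """In-memory stand-in for the store an indexer writes entities to."""

    def __init__(self) -> None:
        self.accounts: Dict[str, Account] = {}

    def load_or_create(self, account_id: str) -> Account:
        return self.accounts.setdefault(account_id, Account(id=account_id))


def handle_transfer(event: TransferEvent, store: EntityStore) -> None:
    """Update the two Account entities touched by one transfer event."""
    sender = store.load_or_create(event.sender)
    recipient = store.load_or_create(event.recipient)
    sender.net_flow -= event.amount
    recipient.net_flow += event.amount
    sender.transfer_count += 1
    recipient.transfer_count += 1


events: List[TransferEvent] = [
    TransferEvent(100, "alice", "bob", 5),
    TransferEvent(101, "bob", "carol", 2),
]
store = EntityStore()
for event in events:
    handle_transfer(event, store)

print(store.accounts["bob"])
```

Each event is processed once, at indexing time, so a later question such as "how many transfers involved bob" is a lookup in the store rather than a replay of the event history.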
The Subgraph provided by The Graph can be understood as another instance of the indexing service. Yet, despite the similarity in their nature, deployed Subgraphs do not cooperate or interact with each other. No matter how many Subgraphs you create for one DApp, they will not interfere with one another.
Subgraphs communicate with the main service using GraphQL. GraphQL is an open-source query and manipulation language for APIs, together with a runtime for executing those queries against existing data. It is designed so that the user can retrieve exactly the information needed in a single query, which preserves bandwidth and reduces waterfall requests.
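As an illustration, a single GraphQL request to a subgraph might look like the sketch below. The endpoint URL and the schema (a transfers entity with sender, recipient, amount and blockNumber fields) are hypothetical, so the query has to be adapted to the schema of the subgraph you actually use. The code only builds the HTTP request; run_query would send it and is not called here.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the query URL of a real subgraph.
SUBGRAPH_URL = "https://example.com/subgraphs/name/example/token"

# Hypothetical schema: a `transfers` entity with the fields selected below.
QUERY = """
query RecentTransfers($account: String!, $first: Int!) {
  transfers(
    first: $first
    orderBy: blockNumber
    orderDirection: desc
    where: { sender: $account }
  ) {
    id
    sender
    recipient
    amount
    blockNumber
  }
}
"""


def build_request(account: str, first: int = 5) -> urllib.request.Request:
    """Wrap the query and its variables in one JSON POST request."""
    body = json.dumps(
        {"query": QUERY, "variables": {"account": account, "first": first}}
    ).encode("utf-8")
    return urllib.request.Request(
        SUBGRAPH_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(account: str, first: int = 5) -> dict:
    """Send the request and decode the JSON response (requires network access)."""
    with urllib.request.urlopen(build_request(account, first)) as response:
        return json.load(response)


request = build_request("0x" + "ab" * 20)
print(request.get_method(), request.full_url)
```

Filtering, ordering and field selection all travel in one request, which is what lets the client avoid a chain of follow-up calls.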
How to Use The Graph
There are basically two ways to use The Graph service: run your own copy of The Graph node, or use the hosted service as a ready-made solution.
Running your own Graph instance makes you fully independent from the service and is totally fine as long as the code remains open source. On the other hand, it brings certain inconveniences, as you will have to budget additional hours for upgrading every time new features come in.
The second option is to use the hosted service provided by The Graph. The service relies on its own network of Curators, Delegators and Indexers, who operate within the system to ensure indexing and query processing. It’s worth noting that using the hosted solution comes with query fees paid to the service providers.
The more blockchain-based apps we have, the more information they need to store and process. Therefore, problems with data throughput and growing bandwidth demands will inevitably occur sooner or later.
The above-mentioned solutions have definitely found their application and can be further used to get the required result in blockchain data querying. Nevertheless, the technology offered by The Graph has undoubtedly simplified data indexing. Thus, building a truly decentralized application that runs entirely on a public infrastructure does not seem so onerous anymore.