Published 07 August, 2024 by @aperturecrypto

Serving AI with Data Infrastructure Fit for Web3

Web3 technology is perfectly positioned to ensure AI operates on trustworthy data while making AI accountable, transparent, and interconnected.

Blockchain can verify data through the network, guaranteeing that the inputs to and outputs from AI models are reliable. While the current Web3 landscape largely centers on financial data, blockchain technology has the potential to extend far beyond it, encompassing personal information, scientific data, and government records.

Currently, developers of AI agents, bots, and ML models in Web3 are working out the data infrastructure and data they need for training, inference, monitoring, and retraining their models. Before diving into the full data pipeline these processes require, let's look at a few examples of model types that can be supported:


  • Large Language Models (LLMs): this type of model is at the center of attention in AI. In Web3, LLMs can let users interact with the blockchain and perform actions without having to understand the complexities of the technology. On their own, LLMs are not the best fit for blockchain data, since they mostly require text data (vs. the transactional data on the blockchain). However, when transactions or wallets are labeled and given context through embeddings and Retrieval Augmented Generation (RAG) systems, blockchain data can be served back to users.

  • Document or Vector Search: this type of model uses embeddings to find similarities between documents or vectors. In a blockchain context, an embedding could represent a protocol or a wallet address and then be compared to other embeddings (see the sketch after this list). This type of model can be very useful for search engines, marketing and growth tooling, and analytics.

  • Prediction models: since most activity in Web3 revolves around financial transactions, predicting prices in order to speculate on them is a popular exercise. However, prediction models can also forecast other useful metrics like transaction activity, gas costs, user retention, and sybil users.
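
To make the vector-search idea concrete, here is a minimal sketch of comparing two wallet embeddings with cosine similarity. The feature set (transaction count, unique contracts, average gas, days active) and the wallet values are invented for illustration; in practice, embeddings would typically come from a trained model rather than raw features:

```python
# Hypothetical sketch: comparing wallet embeddings with cosine similarity.
# The feature names and wallet values below are made up for illustration.
import numpy as np

def embed_wallet(features: dict[str, float]) -> np.ndarray:
    """Turn a wallet's activity features into a fixed-order vector."""
    keys = ["tx_count", "unique_contracts", "avg_gas", "days_active"]
    return np.array([features.get(k, 0.0) for k in keys])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means identical direction, 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

wallet_a = embed_wallet({"tx_count": 120, "unique_contracts": 15,
                         "avg_gas": 31.0, "days_active": 200})
wallet_b = embed_wallet({"tx_count": 95, "unique_contracts": 12,
                         "avg_gas": 28.5, "days_active": 180})

print(f"similarity: {cosine_similarity(wallet_a, wallet_b):.3f}")
```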

This data comes with a few challenges in the current Web3 environment, the main ones being:

  • Data is scattered across multiple chains, and the number of chains continues to increase. Most data providers only serve a limited set of chains.

  • Data on the blockchain is unstructured, which requires custom transformations and feature engineering to make it useful for building AI.

  • When data served to developers is structured, it comes in a fixed format (dictated by APIs or query systems). This format or data schema is very unlikely to be the exact schema the models need, so developers have to build out infrastructure to load and transform the data before it can be used in their models.

  • Address-label data is scattered across multiple data providers, without standardization or automation for correct data labeling (like contract labels).


We will now look at the processes needed to put AI into production and how the unique data infrastructure from The Indexing Company can serve builders and AI in Web3.

Training

To train models, a vast amount of data is needed. Training often happens in a local environment with easy access to the data, which is fed to the models in batches so they can learn from those inputs. Historical on-chain data has to be fetched and can come from multiple chains. Ideally this data is transformed into a unified data schema, regardless of chain (EVM or non-EVM), and enriched with off-chain data like contract labels. Since the data pipelines built by The Indexing Company are chain agnostic and allow custom transformations, the data can be put into a unified schema before it hits the training database or data lake. Because the pipelines are highly configurable, data like contract labels or pricing data can be added to ensure a more complete feature set. The parallel processing network utilized by The Indexing Company ensures that this historical backfill reaches the target data infrastructure fast.
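
As a rough sketch of what such a transform step could look like, the snippet below maps an EVM-style and a Solana-style raw transaction into one unified record and enriches it with a contract label. The `UnifiedTx` schema and all field names are assumptions for illustration, not The Indexing Company's actual formats:

```python
# Hypothetical sketch of a chain-agnostic transform step. The unified schema
# and the per-chain raw payloads are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class UnifiedTx:
    chain: str
    block_number: int
    tx_hash: str
    from_address: str
    to_address: str
    value: float
    contract_label: str | None  # enrichment from an off-chain label source

def transform_evm(raw: dict, labels: dict[str, str]) -> UnifiedTx:
    """Map an EVM-style transaction into the unified schema."""
    return UnifiedTx(
        chain=raw.get("chain", "ethereum"),
        block_number=raw["blockNumber"],
        tx_hash=raw["hash"],
        from_address=raw["from"],
        to_address=raw["to"],
        value=int(raw["value"]) / 1e18,       # wei -> ETH
        contract_label=labels.get(raw["to"]),
    )

def transform_solana(raw: dict, labels: dict[str, str]) -> UnifiedTx:
    """Map a non-EVM (Solana-style) transaction into the same schema."""
    return UnifiedTx(
        chain="solana",
        block_number=raw["slot"],
        tx_hash=raw["signature"],
        from_address=raw["accounts"][0],
        to_address=raw["accounts"][1],
        value=raw["lamports"] / 1e9,          # lamports -> SOL
        contract_label=labels.get(raw["accounts"][1]),
    )
```

Whatever the source chain, downstream training code only ever sees `UnifiedTx` records, which is what keeps the feature set consistent.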

Inference

Inference is the process by which trained AI models make predictions and decisions based on new incoming data. Ideally this data reaches the model in the same schema and with the same features as in the training stage. The data needs to be updated frequently for the AI to serve users or act on its own. Data can be streamed in real time to a database, which can trigger the AI when certain thresholds are crossed. If the AI needs to pull data instead, it can query the database or call an API hosted on top of it. Since the pipelines from The Indexing Company can be configured so that it does not matter whether the data is historical or real time, the same infrastructure can be used both to train the AI and to serve data for inference. In other words, setting up these pipelines for historical data means the data pipelines for inference are already in place too. These pipelines can furthermore be optimized for low latency so the AI can act as fast as possible after blocks are confirmed on the blockchain.
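
A minimal sketch of that threshold-trigger pattern, assuming a streaming pipeline that calls a handler for each freshly indexed row; the row fields, the gas threshold, and the `model.predict` interface are placeholders, not a real API:

```python
# Hypothetical sketch: trigger inference when a streamed row crosses a
# threshold. Field names, threshold, and model interface are placeholders.

GAS_SPIKE_THRESHOLD = 100.0  # gwei, invented value for illustration

def act_on_prediction(prediction) -> None:
    """Stub for whatever the AI does with its output (alert, trade, etc.)."""
    print(f"model output: {prediction}")

def on_new_row(row: dict, model) -> None:
    """Called by the streaming pipeline whenever a new row lands."""
    if row["avg_gas_gwei"] > GAS_SPIKE_THRESHOLD:
        # Build the feature vector in the same order used during training,
        # so inference sees the exact schema the model was trained on.
        features = [row["avg_gas_gwei"], row["tx_count"], row["pending_txs"]]
        prediction = model.predict([features])
        act_on_prediction(prediction)
```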


Monitoring

Once an AI or a swarm of agents begins transacting on the blockchain, it should be monitored to ensure performance. The data resulting from the agents' actions can also be indexed and used for real-time alerts, monitoring, and analytics, giving users the ability to disable or reconfigure an agent in real time. We designed our infrastructure to be responsive (vs. a static approach to configurations), automatically indexing new data based on the data coming in and/or reconfigured logic (triggered either by events emitted on the blockchain or by a signal sent to the pipelines). This ensures that every new action by a bot, and every new bot added to the swarm, gets monitored.
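
As an illustration of that kill-switch idea, the sketch below checks each indexed agent action against simple rules and disables the agent when one fires. The rule values and the `disable_agent` hook are hypothetical:

```python
# Hypothetical monitoring sketch: evaluate indexed agent actions against
# simple rules. Thresholds and the disable mechanism are invented.

MAX_TX_VALUE_ETH = 5.0
MAX_TX_PER_MINUTE = 30

def disable_agent(address: str) -> None:
    """Stub: in practice this would flip a flag the agent checks before acting."""
    print(f"disabling agent {address}")

def check_agent_action(action: dict, recent_count: int) -> list[str]:
    """Return a list of alert messages for this action, empty if all is well."""
    alerts = []
    if action["value_eth"] > MAX_TX_VALUE_ETH:
        alerts.append(f"large transfer: {action['value_eth']} ETH")
    if recent_count > MAX_TX_PER_MINUTE:
        alerts.append(f"rate limit exceeded: {recent_count} tx/min")
    return alerts

def on_agent_action(action: dict, recent_count: int) -> None:
    """Called for each agent transaction as it is indexed."""
    alerts = check_agent_action(action, recent_count)
    if alerts:
        disable_agent(action["agent_address"])
        for a in alerts:
            print(f"ALERT [{action['agent_address']}]: {a}")
```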

One example of this responsive data infrastructure is Just In Time Indexing (JITI). In a previous article, we described how JITI can continuously backfill and index new transactions from new addresses. For example, when a new agent is registered to the network, it would do so through a Factory Contract. JITI would then be triggered to monitor this new address and all transactions related to it. This process ensures data completeness without manual intervention by developers.
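
A simplified sketch of that JITI flow, assuming a handler that fires on factory contract events; the event shape and the backfill/stream helpers are invented for illustration:

```python
# Hypothetical sketch of the JITI pattern: a factory contract event adds a
# new address to the set being indexed. Event shape is illustrative.

monitored_addresses: set[str] = set()

def backfill_history(address: str) -> None:
    """Stub: index the address's past transactions."""
    print(f"backfilling transactions for {address}")

def stream_new_activity(address: str) -> None:
    """Stub: follow the address's new transactions in real time."""
    print(f"streaming new transactions for {address}")

def on_factory_event(event: dict) -> None:
    """Triggered for each event emitted by the (assumed) factory contract."""
    if event["name"] == "AgentRegistered":
        new_address = event["args"]["agent"]
        if new_address not in monitored_addresses:
            monitored_addresses.add(new_address)
            backfill_history(new_address)
            stream_new_activity(new_address)
```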


Retraining

Models need to be retrained frequently to stay up to date with changes in the environment, to improve performance, or to add new chains the bots need to be active on. With new types of data coming in, the chance is high that this data arrives in a different schema and requires new transformations. This holds both when new protocols and when new chains are added, since the smart contract or event structure may differ. Because we designed our data pipelines to be highly configurable, these transformations can happen before the data hits the target data infrastructure. Even if data comes from different sources or chains (EVM vs. non-EVM), the resulting data schema can be unified. This unification ensures continuity in the schemas needed to calculate the features and reduces the additional data engineering needed to integrate new data.
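
One way to picture this is a registry of per-chain transforms that all emit the same unified record: adding a chain means writing one transform, while feature code downstream stays unchanged. This is a hypothetical sketch under those assumptions, not our actual pipeline configuration:

```python
# Hypothetical sketch: per-chain transforms registered under one interface,
# all emitting the same unified fields. Names are illustrative.
from typing import Callable

TRANSFORMS: dict[str, Callable[[dict], dict]] = {}

def register(chain: str):
    """Decorator that adds a transform to the registry under its chain name."""
    def wrapper(fn: Callable[[dict], dict]):
        TRANSFORMS[chain] = fn
        return fn
    return wrapper

@register("evm")
def evm_to_unified(raw: dict) -> dict:
    return {"tx_hash": raw["hash"], "value": int(raw["value"]) / 1e18}

@register("solana")
def solana_to_unified(raw: dict) -> dict:
    return {"tx_hash": raw["signature"], "value": raw["lamports"] / 1e9}

def ingest(chain: str, raw: dict) -> dict:
    # Downstream feature code always reads the same unified fields,
    # no matter which chain the raw payload came from.
    return TRANSFORMS[chain](raw)
```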

Conclusion

We welcome the opportunity AI brings to Web3. The potential is promising, both to improve UX for users and to automate tasks with settlement on a blockchain. The data infrastructure The Indexing Company provides is ready to help developers in AI and Web3 build the next generation of products. With fast and complete historical data, real-time data streaming, and responsive data pipelines, any type of model or AI can be (re)trained, served, and monitored.

We are happy to spar with developers and businesses on their data needs. If you want to chat or need support, reach out to us at The Indexing Company.
