Accessing Data 3.0: Indexing 101

9/14/2022 - originally posted on mirror.xyz

This is an entry in our long-running series, “Accessing Data 3.0”, where we talk about the “whats” and the “hows” of working with data in web3. Enjoy!

Remember libraries? The walls of books and the fearless librarians who somehow always know exactly where everything is? Well, two things: 1) libraries still exist, 2) those libraries are each indexed. Librarians around the world categorize all of the books under their purview into what are known as “library catalogs”. These catalogs serve as a means of quickly finding books by given keywords: genres, authors, titles, etc.

The internet works in much the same way. There are billions of websites on the internet (our books in the example above) and each contains some amount of content. In order to find anything online we ask our almighty catalogs for direction - we “Google” a question or search someone’s name on Facebook. And for any of that to be possible, our trusted librarians (Google, Facebook, etc) had to first sort through the sea of content, categorize it, and then build intelligent indexes (catalogs). Now when you search “weather today”, Google is aware of sites that are relevant to the keyword “weather” and presents those. Of course, Google is also aware of your physical location, today’s date, your previous search history, and which website is paying the most to be matched on the keyword “weather” … but we’ll leave the darker details of the indexing industry for another day.

The point is, indexing is everywhere - a book’s table of contents, your phone’s set of contacts, the grocery list pinned to the fridge … you get the gist. Any time a given data set is too large to be easily consumed, indexing of some sort is employed to aid in its digestion.

Let’s refocus with some definitions:

  1. Data: bits of information

  2. Decentralization: the distribution of control and ownership away from a central authority

  3. Index: a catalog to help find data more quickly

  4. Web3: the decentralized web; empowering individuals to own their data

From the above we can surmise that “indexing”, as it relates to web3, is the act of cataloging decentralized information. Simple as that.
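To make the “catalog” idea concrete, here is a minimal sketch of an index in Python. The book list and the function names (`build_catalog`, `lookup`) are made up for illustration. The idea is to do the sorting work once up front, so that later lookups by keyword don't have to walk the whole shelf.

```python
from collections import defaultdict

# Hypothetical sample data: (title, author, genre) tuples standing in for books.
BOOKS = [
    ("Dune", "Frank Herbert", "science fiction"),
    ("Emma", "Jane Austen", "romance"),
    ("Persuasion", "Jane Austen", "romance"),
    ("Neuromancer", "William Gibson", "science fiction"),
]


def build_catalog(books):
    """Map each lowercased keyword (author or genre) to the titles filed under it."""
    catalog = defaultdict(list)
    for title, author, genre in books:
        catalog[author.lower()].append(title)
        catalog[genre.lower()].append(title)
    return catalog


def lookup(catalog, keyword):
    """Return the titles filed under a keyword without scanning every book."""
    return catalog.get(keyword.lower(), [])


catalog = build_catalog(BOOKS)
print(lookup(catalog, "Jane Austen"))  # ['Emma', 'Persuasion']
```

Real indexes are far more elaborate, but the trade-off is the same: spend effort organizing the data once so that every later search is quick.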

Generally speaking, indexing, in all scenarios, is done to make data more easily searched, and therefore to make that data more accessible. When the data itself is decentralized though, it opens the door to entirely new models of indexing. Take for instance the basic flow of indexed data in web2:

Web2 Data Indexing

In the current web2 world, centralized authorities control the flow of data. They are responsible for discovering, aggregating, indexing, and ultimately serving data. When a user performs a “search”, for instance, Google chooses which information is relevant and provides it back to the end user. Historically this has been “okay”, but when those centralized authorities start to drop their “don’t be evil” mottos, it’s worth pausing to rethink this centralized model.

Enter the world of web3 and decentralized data:

Web3 Data Indexing

There are two important pieces to call out in this web3 scenario:

  1. Users are adding their data directly to decentralized networks (Ethereum, IPFS, Arweave, etc)

  2. Because these networks are decentralized, anybody can go through and index the data

That second point has some significant ramifications. For starters, that means that the individual user, whether that’s a single human or a company, can ultimately index and access their own data directly from the network; no middlemen deciding what information is “right”. Furthermore, this model doesn’t stop centralized authorities from also indexing that data and providing it to users - and that’s also good!
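As a rough sketch of what “anybody can index the data” might look like, the Python below treats an in-memory list as a stand-in for records read from an open network. Everything here is hypothetical: `PUBLIC_RECORDS`, its fields, and `index_by_owner` are invented for illustration and are not the API of any real network.

```python
from collections import defaultdict

# Hypothetical stand-in for records that anyone can read from an open network.
PUBLIC_RECORDS = [
    {"id": 1, "owner": "alice", "content": "first post"},
    {"id": 2, "owner": "bob", "content": "hello world"},
    {"id": 3, "owner": "alice", "content": "second post"},
]


def index_by_owner(records):
    """Build an owner -> record ids catalog from publicly readable records."""
    index = defaultdict(list)
    for record in records:
        index[record["owner"]].append(record["id"])
    return dict(index)


# Any reader can build the same index; no middleman is required.
my_index = index_by_owner(PUBLIC_RECORDS)
print(my_index["alice"])  # [1, 3]
```

Because the source records are open, a user, a startup, or an established provider could each run this same kind of pass and arrive at their own index.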

By enabling open data access, decentralized networks effectively create an incentive to provide what’s truly best for users. Because the centralized authorities no longer control the influx of data, they must cater to the needs of their users. Otherwise, a new competitor will come along to meet those needs. And the best part of all of this is that the new competitor could simply be the users themselves.

Although this kind of access isn’t always simple in many decentralized networks today, the barriers to entry are lowering and the potential continues to grow. For those interested in following along, this series on “Accessing Data 3.0” will dive deeper into the various aspects of data in web3 and how we can all start participating in it.
