Could our whole approach to managing data for processing & querying be fundamentally incorrect? Since the dawn of LLMs, it's a notion that has been gaining traction.
So, what about this notion that our current best approach to managing data & processing it through query pipelines could be fundamentally incorrect? What would better look like?
Ideally, we would prefer to index a log of activity or events.
Find the specific activity we are interested in & fetch that subset of data.
Perform analytics and visualize metrics on that small subset.
Tag or Label that combination of elements & monitor the situation until no longer interested.
Start over, but have the analytics agent remember my moves to tailor preferences in future research sessions.
– Jonathan Cachat
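To make that loop concrete, here is roughly what it looks like as code. This is a minimal, self-contained sketch with made-up events and stand-in storage (an in-memory list and plain dicts), not a proposal for any particular backend.

```python
# A minimal, self-contained sketch of the ideal research loop above
# (hypothetical data shapes, stdlib only; real backends would replace these).
from statistics import mean

event_log = [  # stand-in for an indexed log of activity/events
    {"id": 1, "text": "vendor A exhibition in Lisbon", "score": 12.0},
    {"id": 2, "text": "vendor B exhibition in Lisbon", "score": 88.0},
    {"id": 3, "text": "vendor A shipment to Rotterdam", "score": 3.5},
]
tags, preferences = {}, {}

def research_session(question: str) -> dict:
    # 1-2. Find the specific activity we care about & fetch only that subset
    subset = [e for e in event_log if question.lower() in e["text"].lower()]
    # 3. Analytics on the small subset only
    metrics = {"n_events": len(subset),
               "avg_score": mean(e["score"] for e in subset) if subset else None}
    # 4. Tag/label that combination of elements so it can be monitored
    tags[question] = [e["id"] for e in subset]
    # 5. Remember the move so future sessions can be tailored
    preferences.setdefault("past_questions", []).append(question)
    return metrics

print(research_session("exhibition in Lisbon"))
# {'n_events': 2, 'avg_score': 50.0}
```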
Are we doing that currently? Well, yeah, but it's a collaborative, iterative process in which we systematically engineer out the variability & flexibility that permits serendipitous AHA-moments, the very situations that make analytical research & discovery efforts valuable.
By vacuuming every morsel of raw data through algorithmically locked processing & purposing, we end up limiting the questions that can be asked about that data. All the data is treated the same, organized into key structures with predetermined metrics calculated en masse to (hopefully) answer a limited set of questions effectively.
We are currently expending enormous resources on storage, compute and retrieval pipelines that end up being great for specific questions, approached in specific ways, but historically little else after that.
The point I am trying to demonstrate here is that we are now on our third version of purpose-built tables, all of which start from the same raw vendor data. The ability of each version to answer particular questions depends on two things:
how the question is asked
what does TOP mean? MAX(COUNT(...)), MAX(SUM(...)), or MAX(COUNT(DISTINCT ...))? (see the sketch after this list)
what does UNUSUAL mean? In v1, score > 9 was high; in v2, score > 75,000 was high; in v3 we don't know the distribution of score values yet & we will have three distributions to compare
what is pre-calculated & provided
score average for (H3 LOCATION): how does the user input an H3 location? In v1 we don't have H3s; in v2 we have h3_cell_index VARCHAR (huh?); in v3 we have an H3-centered table but no H3 location names, no lat/long, no H3-specific identifiers, just a country ARRAY
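To make the TOP ambiguity concrete, here is a tiny illustration (the table and column names are invented, not our schema): the same data produces different "top location" answers depending on which reading of TOP you pick.

```python
# Three reasonable interpretations of "top location" over the same rows.
import pandas as pd

df = pd.DataFrame({
    "location":  ["Lisbon", "Lisbon", "Lisbon", "Rotterdam", "Rotterdam"],
    "vendor_id": ["A", "A", "A", "B", "C"],
    "score":     [10, 10, 10, 90, 40],
})

top_by_count    = df.groupby("location").size().idxmax()                 # most rows
top_by_sum      = df.groupby("location")["score"].sum().idxmax()         # highest total score
top_by_distinct = df.groupby("location")["vendor_id"].nunique().idxmax() # most distinct vendors

print(top_by_count, top_by_sum, top_by_distinct)
# Lisbon Rotterdam Rotterdam
```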
Despite some good wins achieved as we moved from v1 to v2 to v3 of exhibition_metrics, we have ultimately & fundamentally changed how we can query the same underlying data, & it's locked in that way forever.
Reminder that with each new version comes a required backfill of 12-18 months of data, something that costs $10k-$100ks & takes weeks to months, or longer, to complete.
We are essentially recording a cam-rip of an IMAX experience, making bootleg copies from VHS to VHS to VHS, digitally re-encoding to upscale to a 4K .mov file that then needs to be compressed to fit onto a CD-ROM, and then asking why we have 20 warehouses full of so many gosh-darn VHS tapes & boxes of jewel cases.
We are locking all of our data into a single format, processed as heavily as possible up front to serve what we believe the market wants.
Maybe not all of the raw vendor data needs to go into a columnar database. Maybe not all of our analytic operations need to be performed on all of the data before the questions are asked. Maybe my skeptical knee-jerk reaction to Darren's notion that, at the core, we are fundamentally managing and querying our data incorrectly was misplaced. Darren is on to something, something that is only now becoming possible.
We need the flexibility to change how metrics are calculated based on the order of questions that got us there.
We don't even need those numerical inputs to calculate metrics until we have narrowed to a smaller subset, one usually found through string similarity more than anything else.
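As a rough sketch of that ordering, narrow by string similarity first and only then touch the numbers. The matcher here is Python's stdlib difflib purely as a stand-in, and the records and threshold are invented.

```python
# Sketch: filter by string similarity first, compute numeric metrics only on
# the survivors. difflib stands in for whatever matcher we would actually pick.
from difflib import SequenceMatcher
from statistics import mean

records = [  # illustrative rows, not our real schema
    {"name": "Acme Trading GmbH",  "score": 12.0},
    {"name": "ACME Trading Gmbh.", "score": 14.5},
    {"name": "Bolt Logistics BV",  "score": 91.0},
]

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

query = "acme trading gmbh"
subset = [r for r in records if similar(r["name"], query)]

# Only now do we need the numeric columns, and only for this subset.
print(len(subset), mean(r["score"] for r in subset))
# 2 13.25
```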
We want our products to feel helpful & empowering by providing personalized accommodations & assistance.
The promise here, of having a way to more effectively & efficiently manage & query different types of data, really is a game-changer for us. Darren posed the question to me today: "perhaps we will find that there are better ways to store particular elements of our datasets – some in text docs, maybe numerical in BigQuery, but really pandas or polars dataframes are cheaper, lighter and all that would be needed on a subset".
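As a toy version of that last point, here is a polars frame standing in for a small subset exported from the warehouse (the columns and values are invented); everything needed for the metrics fits comfortably in a lightweight local dataframe.

```python
import polars as pl

# In practice this subset would be exported from the warehouse (e.g. to Parquet
# and loaded with pl.read_parquet); an inline frame keeps the sketch runnable.
subset = pl.DataFrame({
    "country": ["PT", "PT", "NL"],
    "score":   [12.0, 88.0, 40.0],
})

summary = (
    subset
    .filter(pl.col("country") == "PT")   # keep only the rows in play
    .select([
        pl.col("score").mean().alias("avg_score"),
        pl.col("score").max().alias("max_score"),
        pl.col("score").count().alias("n_rows"),
    ])
)
print(summary)  # avg_score 50.0, max_score 88.0, n_rows 2
```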
Beyond that, the promise of LLM-enabled data analytics is the ability to leave the data sources as they are, minimally processed, & provide the LLM with access to Tools: tools that reach different data sources, so that, based on the question being asked, different tools are called upon to best answer it. The order of tool use, the total number of tools that can be used, and the fact that those tools can hit Google News or Google Places for current information (we don't have to scrape, process, store and update it ourselves) really, really makes the notion of a flexible, leaner & more efficient analysis process possible.
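A rough sketch of that shape, with hypothetical tool names and a keyword stub standing in for the LLM's actual tool-calling decision (none of this calls a real LLM, Google News, or Google Places API):

```python
# Hypothetical tool registry + routing. A real agent would expose these tools
# through the LLM's tool-calling interface and let the model pick the order.
from typing import Callable, Dict, List

def query_warehouse(question: str) -> str:
    return f"[rows from the columnar store relevant to: {question}]"

def search_news(question: str) -> str:
    return f"[current headlines about: {question}]"

def lookup_places(question: str) -> str:
    return f"[place details for: {question}]"

TOOLS: Dict[str, Callable[[str], str]] = {
    "query_warehouse": query_warehouse,
    "search_news": search_news,
    "lookup_places": lookup_places,
}

def choose_tools(question: str) -> List[str]:
    # Stand-in for the LLM deciding which tools to call and in what order.
    picks = ["query_warehouse"]
    if "recent" in question or "news" in question:
        picks.append("search_news")
    if "where" in question or "location" in question:
        picks.append("lookup_places")
    return picks

def answer(question: str) -> List[str]:
    return [TOOLS[name](question) for name in choose_tools(question)]

print(answer("any recent news on exhibitions, and where are they held?"))
```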