Patterns in Data Store: OLTP -> OLAP -> Agentic AI
Abstracting analytic compute from the underlying storage through open innovation
My last post was about the patterns I observed across the enterprise stack, from semiconductors to the software apps with the platforms in the middle, and what I have learned to apply from one layer to the other for a durable product growth strategy.
Since then, I have been spending some time catching up on the latest developments in the foundation of the AI stack - the Data.
Having started my career at Oracle, data has always held special significance for me. Oracle pioneered and led transactional (OLTP) database technology for a very long time. Row-based storage in the underlying file system provided fast writes along with ACID integrity and reliability. Oracle databases since the 1990s, and later MySQL, became the stickiest foundation in the enterprise software stack, on top of which the durable ERP, CRM, and other transactional system-of-record apps were built, continuing through the emergence of SaaS.
With the focus advancing from systems of record to systems of insight, the next step of innovation was Analytics. ETL-based Data Warehouses became popular. While Oracle and others offered Data Warehouses for analytics workloads (OLAP), the real innovation came from Snowflake in 2015 with its cloud data warehouse, which successfully commercialized columnar database ideas that had come from Vertica, Google Bigtable and others a couple of decades back. Columnar databases offered much faster reads with reduced storage requirements compared to row-based databases, because an analytic query scans only the columns it needs and similar values stored together compress well.
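To make the row versus column trade-off concrete, here is a minimal sketch in Python. It is a toy in-memory illustration under my own assumptions, not how Oracle, Vertica, or Snowflake actually lay out data on disk; the `Order` record and every function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: int
    customer: str
    amount: float


# Row-oriented layout: all fields of one record sit together,
# so inserting or updating a single order touches one place.
row_store = [
    Order(1, "acme", 120.0),
    Order(2, "globex", 75.5),
    Order(3, "acme", 42.0),
]

# Column-oriented layout: all values of one attribute sit together,
# so an aggregate can read one column and skip the rest.
column_store = {
    "order_id": [1, 2, 3],
    "customer": ["acme", "globex", "acme"],
    "amount": [120.0, 75.5, 42.0],
}


def total_amount_rows(rows):
    """Sum amounts by walking every full record."""
    return sum(row.amount for row in rows)


def total_amount_columns(columns):
    """Sum amounts by reading only the 'amount' column."""
    return sum(columns["amount"])


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            encoded.append((value, 1))
    return encoded


if __name__ == "__main__":
    print(total_amount_rows(row_store))
    print(total_amount_columns(column_store))
    print(run_length_encode(sorted(column_store["customer"])))
```

Both totals agree; the difference is how much data each approach has to touch. The run-length encoding step hints at why a sorted column of repeated values takes less space than the same values scattered across rows.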
However, the real innovation behind modern analytics, and later Machine Learning, was the ability to draw insights from unstructured data. While cloud data warehouses were good for structured data, most of our data resides in unstructured files and objects. This is where open source projects - originally Hadoop (heavily influenced by Google's File System and MapReduce papers), and subsequently the Apache Spark engine built on top of Hadoop Data Lakes - brought in the speed of processing that data, which was necessary for accelerating Machine Learning commercially.
Subsequently, in 2020, Databricks, started by that same Apache Spark team, introduced the key innovation for the data foundation of ML - the Data Lakehouse. The ability to abstract structured and unstructured data alike from the underlying file storage, and to run queries and calculations in a standard way, has massively eased the adoption of ML for upstream apps. It has been only 5 years since then, but the Data Lakehouse has become the standard that every other leading vendor has adopted for the AI/ML foundation.
So, I was not surprised last week when I saw their latest innovation - the Lakebase - an OLTP database built on top of the Lakehouse, targeting Agentic AI developers. In a world where, by their account, 80% of new enterprise databases are being built by AI Agents, a database that scales in real time based on need, that lets developers manage branching in real time, and that supports real-time push/pull from the Lakehouse data store is nothing less than a revolution in database architecture.
The pattern continues - each wave delivers the benefit of abstracting compute from the underlying storage.
Structured Data, great for pre-ML:
Oracle / MySQL Database: Abstracted row-based file blocks - for on-prem OLTP.
PostgreSQL Database: Did the same for the SaaS applications.
Columnar block-based databases did it for OLAP: Snowflake accelerated it for Cloud Data Warehousing.
Unstructured Data, built for ML:
Open-source Hadoop to Apache Spark-powered Data Lake.
Structured + Unstructured, built for LLM and Agentic AI:
Databricks set the stage for the AI foundation, pre-LLM, through the critical Data Lakehouse innovation.
Databricks is now extending that open innovation pattern to transactional databases for Agentic AI, through Lakebase - a Serverless database built on Lakehouse architecture.
(Neon turned out to be an excellent investment for Databricks, followed by the acquisition. More patterns there on how truly innovative companies bring in external innovation to complement internal innovation - more on that later).
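The pattern summarized in the list above, compute written once against an abstraction rather than against a specific storage layout, can be sketched roughly as follows. This is an illustrative toy under my own assumptions; the `Storage` interface, both backends, and `average` are hypothetical names and do not correspond to any vendor's API.

```python
from typing import Iterable, Protocol


class Storage(Protocol):
    """Anything that can stream the values of one named column."""

    def scan(self, column: str) -> Iterable[float]: ...


class RowStorage:
    """Backend holding records row by row (OLTP-style layout)."""

    def __init__(self, rows: list[dict]):
        self._rows = rows

    def scan(self, column: str) -> Iterable[float]:
        return (row[column] for row in self._rows)


class ColumnarStorage:
    """Backend holding each column contiguously (OLAP-style layout)."""

    def __init__(self, columns: dict[str, list]):
        self._columns = columns

    def scan(self, column: str) -> Iterable[float]:
        return iter(self._columns[column])


def average(storage: Storage, column: str) -> float:
    """Compute logic that knows nothing about the physical layout."""
    values = list(storage.scan(column))
    return sum(values) / len(values)


if __name__ == "__main__":
    rows = RowStorage([{"amount": 10.0}, {"amount": 30.0}])
    cols = ColumnarStorage({"amount": [10.0, 30.0]})
    print(average(rows, "amount"), average(cols, "amount"))
```

The same `average` function runs unchanged on either backend, which is the essence of the abstraction: the storage layer can change or scale independently while the compute layer stays the same.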
I have watched the Databricks Data + AI Summit for 3 years now. The speed and scale of innovation from the founding team, still at the helm and still holding to the original open ecosystem principles, is nothing short of spectacular. There is a reason they are still growing 60% year over year with a $3B+ run rate, at a $60B+ valuation. Their success is a testament to the success of open source. They are perhaps the only such company to achieve this scale, other than Red Hat.
Snowflake and the cloud majors made many exciting recent announcements as well, focused on the data lifecycle: from ingestion to transport and Agentic AI application. Those would be for a subsequent post.
