The Data-Centric Revolution: An Interview with Dave McComb
What's really behind the lack of comprehensive business agility today?
If you do a root cause analysis, you eventually get to complexity as the primary driver of the lack of business agility. Our systems have become very complex and highly interdependent (which is just a further dimension of complexity), such that even the simplest change becomes economically unjustifiable. We have many clients for whom adding a few fields to a database can be a multi-hundred-thousand-dollar project. Because of this, many incremental changes are never attempted. Instead, business users rely on shadow IT to cook up something, which gives them a short-term win at the cost of making the environment even more complex.
Businesses and business information systems are inherently complex. What are you suggesting?
They aren't inherently complex. It's the way that we have been implementing them that has made them complex, and this is so pervasive it seems like it's inherent. Every time we look at a design we can reduce the complexity by a factor of ten. When you look at systems of systems, this is often a factor of 100.
Where have mainstream approaches to software development, change, and re-engineering led us astray?
The change processes have gotten better in the last few decades. The disciplines around DevOps and incremental change have really improved our ability to move changes into production. The issue is that most of the cost of change is not in making the change; it's not in testing the change; it's not in moving the change into production. Most of the cost of change is in impact analysis and determining how many other things are affected. The cost of change is more about the systems being changed than it is about the change itself.
Agile development has addressed some of the complexity issues. In agile, practitioners periodically set aside sprints to "refactor" their code. This is a recognition that code accumulates complexity, and that by continually pruning back the complexity that is creeping in, one can keep the cost of change manageable. But it is a local optimization. The unfortunate thing is that the focus on small agile teams in most environments means more applications, and, while each is simpler, the sum total of them is more complex.
So agile is not the solution?
Agile, by itself, is not the solution. It is certainly an improvement over the very large, waterfall-style development projects of yore. The problem with most implementations of agile is that they treat data as a second-class citizen. The sponsors describe the "stories" (use cases) they would like implemented, and the team figures out the most expeditious way to implement those stories. There is a tendency to implement the least amount of data that will solve the problem at hand. There is nothing in the agile movement to encourage shared data models across many agile projects.
We have seen a few instances where the data-centric discipline is firmly in place, and agile rides on top of that. This can work quite well.
Are today's economics of software projects and support inevitable?
No. They are a product of the fact that the industry has collectively chosen the application-centric route to implementing new functionality. When every business problem calls for a new application and every new application comes with its own database, what you really get is runaway complexity. Many of our clients have thousands of applications.
But it isn't inevitable. A few firms have shown us the way out: data-centric development.
What does data have to do with it?
Starting with a single, simple core data model is the best prescription for reducing overall complexity. Most application code, and therefore most complexity, derives from the complexity of the schema the code is written to.
Even a single complex data model is better than thousands of complex models, but the surprising thing is how simple a model can be and still capture all the requisite variety.
How simple is too simple? What size should be anticipated?
We talk about the complexity of the model as the number of concepts (tables + columns or classes + attributes) that the developers must deal with. Most enterprise applications have thousands of concepts in their schemas. Most packaged applications are ten times as complex.
We have been refining our approach to building these enterprise models. A typical enterprise model is 300-500 concepts (this is classes + properties). Specializing the enterprise model for a subdomain or application typically adds just a few dozen new concepts.
What does being data-centric mean?
A firm is data-centric to the extent that its applications are designed and built to a single, simple data model. The data itself may continue to reside in existing databases, as long as the model and the query mechanism can access them with reasonable latency.
In effect, are you talking about a virtual enterprise data model?
Yes, exactly. Historically, this has been very hard to do. Most virtual enterprise data models are either conceptual models that must be transformed and added to in order to be implemented, or they are views over data warehouses.
Up until now there wasn't the infrastructure to allow someone to query from one model over many different implementations.
The SPARQL query language, the W3C's standard for querying graph databases, supports federated querying over multiple repositories. The repositories needn't have identical schemas; they need only share some common concepts (which are called out in the query) or some common identifiers, which allow the query to link disparate data sets.
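The linking role of shared identifiers can be seen in a minimal sketch. This is not SPARQL itself, just plain Python modeling two repositories as sets of (subject, predicate, object) triples; all URIs, names, and data here are invented for illustration.

```python
# Two hypothetical repositories with different schemas. They share one
# identifier ("http://example.com/id/emp42"), which is what lets a
# federated query stitch their facts together.
repo_hr = {
    ("http://example.com/id/emp42", "ex:name", "John Smith"),
    ("http://example.com/id/emp42", "ex:worksFor", "ex:Acme"),
}

repo_sales = {
    ("http://example.com/id/emp42", "ex:closedDeal", "ex:deal-9"),
    ("ex:deal-9", "ex:amount", "125000"),
}

def match(triples, subject=None, predicate=None):
    """Return triples matching a simple subject/predicate pattern."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)]

# A "federated" query: ask both repositories about the same subject.
emp = "http://example.com/id/emp42"
facts = match(repo_hr, subject=emp) + match(repo_sales, subject=emp)
```

The real mechanism (SPARQL 1.1's `SERVICE` clause) dispatches sub-queries to remote endpoints, but the principle is the same: a globally unique identifier is the join key across data sets.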
By adding in another standard, R2RML, we can create maps from the shared model to relational database schemas. In so doing, a semantic query can combine information from graph databases with multiple relational databases.
In 'virtual enterprise data model' do you mean 'data model' in the traditional sense of ER or logical database design, or are you actually talking about semantically-enriched concept models or ontologies?
The main difference between these semantic data models (ontologies) and traditional ER data models is the separation of meaning from structure. In an ER model, the meaning and structure are inextricably linked. In an ontology, the structure is a graph. The shape of a graph can be different for two similar instances. For example, two inventory items — say, a chip with a serial number and a carton of milk with a use-by date — in an ontology can have different sets of properties and still be 'products'.
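The chip-and-milk example above can be sketched directly as triples. This is a hand-rolled illustration in Python, not a real RDF library; the identifiers and property names are invented.

```python
# Two instances of the same class carry different property sets: in a
# graph, the "shape" of an instance is not fixed by a table definition.
triples = {
    ("chip-1", "rdf:type", "Product"),
    ("chip-1", "serialNumber", "SN-0042"),
    ("milk-7", "rdf:type", "Product"),
    ("milk-7", "useByDate", "2024-06-01"),
}

# Both are products...
products = {s for (s, p, o) in triples if p == "rdf:type" and o == "Product"}

def properties_of(subject):
    """All non-type properties asserted about a subject."""
    return {p for (s, p, o) in triples if s == subject and p != "rdf:type"}
```

In an ER model, both items would need columns for `serialNumber` and `useByDate` (mostly null), or separate tables; in the graph, each instance carries only the properties that apply to it.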
In an ontology, we focus on the formal definition of the meaning of the concepts. While this is a bit of extra work initially, the payoff is that the overall structure of the model becomes simpler. This is because, in the act of establishing what is similar and what is different between two concepts in a way that a system can reason about them, closely-related concepts are brought into close proximity in the model, where designers can then consider whether they are different enough to warrant the distinction.
How would you deal with semantic inconsistencies between the 'virtual enterprise data model' and the schemas for local, existing databases? What about redundancies? What happens in such cases when you 'write' data?
Let me start with the third part of this question. To make a point, I often ask professionals to imagine what a data lake would be like if they could build transactional applications directly on, and write to, the data lake. This is essentially the goal of a data-centric architecture: to be able to read and write to a single, simple, shared data model, with all the data integrity and security required in enterprise applications.
The relationship between the overarching model and the domain- or application-specific models is that we want to have a large percentage of the domain-specific concepts defined using the shared concepts in the overarching model.
We did an ontology for the Washington State Department of Transportation. In the enterprise model were concepts such as: geospatial locations, roadways, contact information, organizations, physical substances (gravel, concrete, etc.), and something they called "roadway features" which is anything permanently attached to the earth close enough to a roadway that it could be hit by a car (guardrails, signs, trees, etc.).
We discovered a "fire hydrant" database. But looking at the concepts in the fire hydrant database, virtually all of them could be expressed as derivatives of the core. The fire hydrant itself was a roadway feature; it was owned by a fire department, which was an organization with a phone number (contact information); it had a location (expressed in latitude and longitude); and it even had a water pressure, where water is a substance and pressure could be expressed in concepts already present in the enterprise model. The net effect of modeling in this way is that it provides a seamless map. A query for any of the enterprise concepts would bring back the equivalent concepts from the fire hydrant system without any additional work.
Once we know how the fire hydrant data relates to the ontology, we can build a map. There is a W3C standard for building these maps as well, called R2RML (RDB to RDF Mapping Language). Once the R2RML maps are built, they can be used in either of two modes. You can take the existing database and run it through the maps to create triples representing all the source data that you mapped. This is called "ETL mode," as it works a lot like the Extract, Transform, and Load step in a traditional data warehouse pipeline.
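The "ETL mode" idea can be sketched in a few lines of Python. Real R2RML maps are themselves expressed as RDF, not as Python dictionaries; the table, column names, and URI scheme below are invented for illustration.

```python
# Hypothetical rows from a "fire hydrant" relational table.
rows = [
    {"hydrant_id": 101, "lat": 47.04, "lon": -122.90, "owner": "Olympia FD"},
    {"hydrant_id": 102, "lat": 47.61, "lon": -122.33, "owner": "Seattle FD"},
]

# Column -> shared-model property, standing in for an R2RML
# predicate-object map.
column_map = {
    "lat": "ex:latitude",
    "lon": "ex:longitude",
    "owner": "ex:ownedBy",
}

def rows_to_triples(rows, column_map):
    """Run relational rows through the map to emit triples ("ETL mode")."""
    triples = []
    for row in rows:
        # Subject template: mint a URI from the primary key.
        subject = f"ex:hydrant/{row['hydrant_id']}"
        # Type each hydrant with a class from the shared enterprise model.
        triples.append((subject, "rdf:type", "ex:RoadwayFeature"))
        for column, predicate in column_map.items():
            triples.append((subject, predicate, row[column]))
    return triples

triples = rows_to_triples(rows, column_map)
```

Each row yields one type assertion plus one triple per mapped column; the same map, interpreted at query time instead, is what gives the "leave the data in place" mode described next.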
But you can also leave the data in the relational database. At query time, the SPARQL query can reach through the R2RML maps to query just the subset of the relational data needed to satisfy the query. The data so extracted from the relational database looks, at query time, like semantic triples and can be combined with other triple stores in "federated queries." This ability to combine converted data with data that has been left in place provides a true virtual enterprise data model.
To the extent that the concepts are defined in terms of shared concepts, the inconsistencies are minimized. In a semantic system, redundancy is not a problem. Expressing the same fact twice doesn't change it.
How would you address the problem of the overarching data model being bypassed and applications continuing to update local, existing databases directly?
That is the biggest issue. Just because you have a shared model does not mean that everyone will use it or that they will derive their concepts from it. This is mostly a matter of education, communication, and encouraging sharing.
Is the idea new?
The idea of data-centric development isn't new. It was present almost at the dawn of the database industry. But as systems became more and more complex, each application tended to be centered on its own data model, which meant that an enterprise would have hundreds of models it was centered on. The ERP (Enterprise Resource Planning) industry was founded on the idea that a large part of an enterprise could be implemented on a single shared model. But the models became incredibly complex, inflexible, and hard to evolve. One of the leading ERP vendors has over 90,000 tables and a million attributes in the core offering.
What is different now is that we have the ability to directly implement much simpler models without losing fidelity.
What new technologies are you referring to?
The main technologies are graph databases and the formal specification of meaning. There is a very viable ecosystem of tool and platform support because of the standards the W3C has put in place. The key standards are:
- RDF (Resource Description Framework) for reducing all information to "triples" (node — edge — node) and for doing that with globally unique identifiers (URIs).
- RDFS (Resource Description Framework Schema) which adds a schema also expressed in triples.
- OWL (the dyslexically-named Web Ontology Language) for constructing formal definitions of class membership and other logical entailments.
- SPARQL (graph-based query language and support for federated querying).
- R2RML (RDB to RDF Mapping Language) for accessing relational data as if it were semantic data.
- SKOS (Simple Knowledge Organization System) for capturing thesaurus-style knowledge.
- SHACL (Shapes Constraint Language) for expressing integrity constraints.
- PROV (Provenance) for standardizing information about the source and quality of information.
- RIF (Rule Interchange Format) for expressing rules in a way compatible with semantic data structures.
That is a pretty long list of standards. What is most impressive about it is how well they all work together, as they are all based on a small number of well-thought-out primitive concepts. Vendor adherence to the standards is far higher than in the relational database stack, which means that it is fairly easy to port an implementation from one vendor to another.
Can this data-centric approach lead to a fundamental rethinking and retooling of the way large-scale software is engineered / re-engineered today?
One of the profound realizations we've come to is that a proper data-centric architecture would allow the development of the equivalent of the app store for the enterprise. We now have the tools to allow the replacement of the large, monolithic applications of the past with small, loosely-coupled enterprise 'applets'.
The data-centric approach (coupled with model-driven development and a good microservices environment) gives these small applets a natural way to coordinate, eliminating the need for monolithic applications.
What part will true, technology-agnostic business rules play?
By technology-agnostic business rules, I'm assuming you mean independent from specific database schema, application APIs, and even formal syntax. We have primarily been working on expressing rules in application-independent concepts — that is, using the shared concepts from the ontology. To be completely technology-agnostic, there would need to be a layer that allows rule writers to write their rules in natural language and have it converted to implementation-level rules by the system.
The data-centric approach provides a simple and well-documented vocabulary for the rules engine to operate on. The terms referenced in the rules must be defined somewhere. No better place than in the ontology that populates the shared data model.
What do you mean by "ontology that populates the shared data model"? Aren't they one and the same thing?
One way of seeing an ontology is as a conceptual model. But a traditional conceptual model must be transformed into a logical model and further transformed into a physical data model before any data can be persisted. You can persist data in a triple store (graph database) directly conforming to the ontology. We call doing that "populating the shared data model."
Is all of the ontology considered part of the graph database? Or only the triples representing domain data?
Everything is in the graph database. Everything is expressed as triples. There is no separation between metadata and data as there is in a relational system. This gives us some interesting properties. You can query the graph with very little prior knowledge of any schema. As you find instance data (say, a node representing John Smith), you can "follow your nose" through additional triples to find that this "John Smith" thing is a member of the class "Person," and perhaps at the same time "Employee" and/or "Accredited Investor," or any of the dozens of other classes he may have been asserted or inferred to belong to.
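The "follow your nose" discovery can be sketched as a simple graph walk. The data and identifiers below are invented; a real triple store would also apply OWL or RDFS inference rather than this hand-rolled traversal.

```python
# Schema and instance data live in the same graph of triples.
triples = {
    ("ex:john-smith", "ex:name", "John Smith"),
    ("ex:john-smith", "rdf:type", "ex:Person"),
    ("ex:john-smith", "rdf:type", "ex:Employee"),
    ("ex:Employee", "rdfs:subClassOf", "ex:Person"),
}

def classes_of(node):
    """Follow rdf:type edges, then rdfs:subClassOf edges, from a node."""
    direct = {o for (s, p, o) in triples if s == node and p == "rdf:type"}
    inferred = set(direct)
    frontier = set(direct)
    while frontier:
        parents = {o for (s, p, o) in triples
                   if s in frontier and p == "rdfs:subClassOf"}
        frontier = parents - inferred
        inferred |= parents
    return inferred

# classes_of("ex:john-smith") -> {"ex:Person", "ex:Employee"}
```

Nothing in the walk required knowing a schema up front: the class memberships, and the class hierarchy itself, are just more triples reachable from the starting node.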
This feature allows any consumer of ontology-based information to learn the ontology by degrees, on an as-needed basis. This could be a developer or a business user. At one level, it may not sound like much, but what it does is lower the barrier for getting started.
What new kinds of software tools and methods will be needed to support the rethinking and retooling you've described above?
We have observed companies implementing the data-centric approach with traditional technology — that is to say, with relational and object-oriented tools — but that is really the hard way to do it. I can't think of a scenario when I wouldn't recommend a graph database at the heart of this approach. They are now standards compliant and high performing, and there are many vendors to choose from. The graph structure provides flexibility. Most of the vendors support queries federated over many repositories and a standards-based mapping to relational databases and JSON. Semantic technology helps a lot to keep the complexity in check.
Most of the firms that have adopted this have also, in parallel, implemented model-driven development or, as Gartner is now calling it, "low code / no code" environments. This is really a natural match and goes a long way toward further reduction of complexity.
What kinds of new business-side methods, skills, and mindsets will be necessary to make the rethinking and retooling happen on the front-end?
We've been working a lot with the business owners of systems on this approach. The two biggest changes are, first, that business users need to get used to being more involved in domain model design and, second, that they will own the model. In the past, they contented themselves with creating requirements and throwing them over the transom. For most historical business applications, the data models were designed and structured by specialists in the IT organization. Going forward, for this to work, the business needs to own the model.
The other change is going to be fostering cooperation and sharing between subdomains within an enterprise. Silo-ization is a learned behavior that needs to be unlearned.
Where will these new business-side methods, skills, and mindsets come from?
There are a small number of practitioners who have been thinking deeply about how to model the meaning in information systems. This was hard to implement with traditional tools. Now that the technology is in place to do this more directly, we are starting to see the emergence of modeling and methodological approaches to take advantage of it.
Are you predicting a revolution?
We think this is a real paradigm shift and as such is revolutionary, but frankly it seems to be happening so slowly that it will appear to be evolutionary for the firms that are participating. It may appear to be more of a revolution for those that sit on the sidelines. Those that are implementing now will be at it for many years before anyone outside the firm is aware of it. From the outside this will appear rapid.
What would you call it?
The Data-Centric Revolution.
# # #