Solving the data silo conundrum

First know what you have. Then know what it is. Finally, know how to access it.

If you were born in the 70s, chances are your first smartphone was one of those that folded open and had a physical keyboard – roughly the size of a brick. It probably took you a couple of hours to read the manual and a few weeks to get used to it. A few years later, the iPhone came along. You probably never read a manual, and you probably never looked back.

Companies are the result of people and assets coming together. In the past, those assets consisted solely of physical objects – the most fluid being money, and even money was a tangible asset. As companies grow, those assets must be organized in a workable, manageable, and efficient manner. What happened when we added digital, immaterial assets to the mix? Essentially, nothing. We organized them the same way we organized physical assets. Whether your company was structured by verticals, by geography, or as a matrix, digital assets ‘followed’ their physical master.

In doing so, we unwittingly limited the full potential of data by creating barriers. These barriers limit data sharing across different structures in the organization, which in turn leads to duplicate and incorrect data being created, decisions being taken without the full picture, and so on. When I was involved in advanced AI projects, clients – even very sophisticated ones – admitted it was often easier to track down external information than to find out who to ask for internal data sources.

Ideally, and this is the origin of the data lake and similar initiatives, all corporate data should be accessible from a central repository. Companies launched such projects only to find out that they can be very long – and expensive. One of the reasons is data duplication. If data exists in different instances that should (!) be identical, which instance becomes the data master? More often than not, there are good reasons why the master may need to differ between departments (e.g., departments working on different time scales). This brings us to the following practical approach.


First, know what you have

We’ve been mapping our planet for centuries. Current GIS (Geographical Information Systems) allow for multi-layered mapping of geographic features. It’s about time we applied a similar approach to mapping our data. This should be no more complex than – and is actually very similar to – what an inventory management system does for your physical assets. There’s no need for all your inventory to be in one central location, and each inventory can consist of many different things (think of a retailer’s inventory, which ranges from food to white goods and may even include the shelves themselves). Such an inventory overview tells you what you have, how much of it, where it’s stored, and so on.
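To make the inventory analogy concrete, here is a minimal sketch of what a record in such a data inventory could look like. All field names and example values are illustrative assumptions, not a reference to any specific catalog product; the point is that only the index is centralized, while the assets stay where they are.

```python
from dataclasses import dataclass

# One entry in the "data inventory", analogous to an item record
# in a physical inventory management system.
@dataclass
class DataAsset:
    name: str        # what the asset is called
    system: str      # where it lives (a warehouse, a BI tool, ...)
    asset_type: str  # dataset, report, dashboard, ...
    owner: str       # who to ask about it

# The inventory itself is just a collection of such records.
inventory = [
    DataAsset("monthly_sales", "data_warehouse", "dataset", "finance"),
    DataAsset("churn_dashboard", "bi_tool", "report", "marketing"),
]

# Answering "what do we have, and where is it stored?"
by_system = {}
for asset in inventory:
    by_system.setdefault(asset.system, []).append(asset.name)

print(by_system)
```

Nothing here moves any data: the records only describe assets, which is exactly why building the map is cheaper than building a central repository.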


Then, know exactly what it is

This means that, just like in inventory management, when you know you have a certain quantity of item X, you should know exactly what item X is. In the data world, this means understanding the format and having an explanation of the data asset (loosely interpreted: this can be a data field, but also a dataset, a report, etc.). Correct and exact documentation is essential. When I ask someone to fetch me soap from the inventory, do I mean a soap bar (and which one?), floor-cleaning liquid, or shampoo? Because data is often the input to computation, errors on the input side get magnified as the data progresses through the pipeline. If we’re ever to deploy workable AI, trust in the source data is essential; otherwise, we can’t trust the AI outcome.
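The “which soap?” problem can be sketched in code: each inventoried asset carries documentation – a format plus a plain-language description – so that two similarly named assets are never confused, and an undocumented asset is flagged rather than trusted. The catalog structure and field names below are assumptions for illustration.

```python
# A tiny documentation layer on top of the inventory: every asset has a
# format and a human-readable description of what it actually means.
catalog = {
    "customer": {
        "format": "table",
        "fields": {"id": "int", "signup_date": "date"},
        "description": "One row per active customer, refreshed nightly.",
    },
    "customer_snapshot": {
        "format": "table",
        "fields": {"id": "int", "as_of": "date"},
        "description": "Historical month-end copies, kept for audit.",
    },
}

def describe(name: str) -> str:
    """Return the documented meaning of an asset, or flag it as undocumented."""
    entry = catalog.get(name)
    if entry is None:
        return f"{name}: UNDOCUMENTED - do not trust downstream computations."
    return f"{name} ({entry['format']}): {entry['description']}"

print(describe("customer"))
print(describe("orders"))
```

The explicit “UNDOCUMENTED” path is the key design choice: an input whose meaning is unknown should be surfaced as a risk before it feeds any pipeline or AI model.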

Ideally, in an empowered organization, everyone should be able to find out what data exists and what it means. There may be exceptions, but they should be just that: exceptions.


Finally, know how to access it

Creating overall inventory visibility and accessibility should not mean handing everyone the keys to the warehouse! Just as in the physical world, some people have direct access (e.g., to an SAP Analytics Cloud report), while for others, simply learning that a report exists, where it can be found, and who owns it is a huge leap forward from casually mentioning to a colleague that you might need such a report. If that colleague is also unaware of the report’s existence (highly likely), you’re back at square one. Being able to search the “digital inventory” as easily and efficiently as you’re used to searching the web or browsing the Amazon webshop removes guesswork, frustration, and duplicated work from running a modern, digitized business.
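Searching that digital inventory can be as simple as keyword matching over the metadata collected in the first two steps – name, description, owner – without touching the underlying data. This is a deliberately naive sketch under assumed field names; real catalogs add ranking and access control on top.

```python
# A minimal "webshop-style" search over asset metadata only.
# The asset records are illustrative; no actual data is opened or read.
assets = [
    {"name": "eu_sales_report", "owner": "finance",
     "description": "Quarterly sales per EU country"},
    {"name": "churn_model_scores", "owner": "data_science",
     "description": "Weekly customer churn predictions"},
]

def search(query: str) -> list[str]:
    """Return the names of assets whose metadata mentions the query term."""
    q = query.lower()
    return [
        a["name"] for a in assets
        if q in a["name"].lower()
        or q in a["description"].lower()
        or q in a["owner"].lower()
    ]

print(search("sales"))
print(search("churn"))
```

A colleague who has never heard of `eu_sales_report` can now discover it and its owner in seconds, instead of being stuck at the “maybe someone has this?” stage.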

A good inventory management system knows which boxes are on the shelves, what’s written on the labels, and whether MSDS (material safety data sheets) apply – but it doesn’t have to open every box to check the contents. Ultimately, we’d love a computerized system that can judge the contents, and the quality of the contents, on its own through AI. Someday. Be patient. For now, this is the playing field of the wealthy few. That doesn’t mean we should stand still while they get a head start, though. AI has recently received a lot of deserved attention – mainly around ChatGPT, although I believe Meta, Amazon, and Google have equally strong models. Some articles have asked whether we could overcome data silos by deploying AI. The funny thing is that the steps mentioned above would still be required. So I wonder whether, by going down the AI route from the start, we’re putting the cart before the horse.

Our dScribe solution addresses the three points above by automatically scraping your digital inventory – without “opening the boxes.” Our core approach is not technology-first but people-centric. There are obviously more advanced features that can be deployed over time, but essentially, you can be up and running within days. I began this post with my smartphone history; we aim to do the same for data catalogs. Why? Because usability matters, and too many projects fail for lack of user adoption – and because not only Fortune 100 companies deserve efficient data usage. If we want to break down data silos, an affordable solution and, above all, user adoption are the major catalysts for success – and barriers against failure.