Sometimes it takes a little while for things to sink in. When I first heard it, I knew there was something profound about it, but I was not sure what. The “it” was a comment made at the recent PPDM Conference in Perth by Doris Ross of Woodside Energy, about a seemingly trivial way of doing work that, with hindsight, has significant implications for our industry – SO LISTEN UP!
“Schema on read” were her words. Big deal, eh? Well yes – having now let it sink in, in my view it is a huge deal. Let me explain further.
“Schema on read” is a data analysis strategy in which a plan or schema is applied to the data as it is pulled out of storage (ref: www.techopedia.com). This is in contrast to the standard modus operandi of “schema on write”, the traditional approach of defining the schema and indexing the data as it is loaded into storage.
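To make the contrast concrete, here is a toy sketch in Python. The record layout and field names are invented purely for illustration; the point is where the schema decision happens, not the format itself.

```python
# Two raw "stored" records: line name, country, year shot, sample rate.
raw_records = [
    "LINE-001,Australia,1987,2ms",
    "LINE-002,Norway,1994,4ms",
]

# Schema on write: decide the columns up front and keep only those.
SCHEMA = ("line", "country")                      # chosen at ingest time
written = [dict(zip(SCHEMA, r.split(",")[:2])) for r in raw_records]
# Later questions about year or sample rate cannot be answered from
# `written` -- those fields were discarded at load time.

def read_with_schema(record, fields):
    """Schema on read: keep the raw record, project any combination of
    fields out of it on demand, at query time."""
    names = ("line", "country", "year", "sample_rate")
    parsed = dict(zip(names, record.split(",")))
    return {f: parsed[f] for f in fields}

print(read_with_schema(raw_records[0], ("line", "year")))
```

The raw records stay untouched in storage; each new question is just a different projection applied at read time.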
For the last 40 years in the oil sector, we have spent considerable time, money and energy trying to define, populate and maintain structured, relational databases containing metadata that describes the properties and locations of our digital assets. Somewhere along the line we decided that before an asset could be an “actual asset”, it needed to be catalogued in gory detail to be considered real and useful. This traditional approach made a lot of sense when digital assets were in one location and the metadata that described them was in another.
For example, the Excel spreadsheet catalogue I created for my collection of garden gnomes is separate from my self-lock storage unit containing the more than one thousand physical garden gnomes I have collected from 43 countries. When I attend the annual “GnomeFest” (yes, this is a serious hobby, folks), I usually use my Excel catalogue to find which box holds which gnome so that I can take a few with me to put on display. Without that catalogue, I would have to rummage through my entire self-lock storage unit, opening boxes until I found the right ones.
Now, if I had a display case big enough to keep them all at home with me, I probably would not need the digital catalogue at all – and even if I kept one, it would not need anywhere near as much detail, because the gnomes would be right in front of me. (My decision not to have them on display at home is solely the result of not wanting to be kicked out of my own house by my family.)
Back to reality… What if an oil company could have all of its digital assets addressable in an online platform, accessible any time the company wanted, rather than the typical model of metadata here and actual data stored elsewhere, off site or offline? What if the actual data served as the metadata as well as being the data itself? By that I mean: what if we didn’t need the catalogue and could immediately query the real dataset in real time to get the information we want? How much money would we save, how fast could we react, how nimble would we become?
In today’s world, the rapid expansion of ubiquitous public cloud storage has made it possible to have your cake, eat it, and even share the cake with others, without the need for in-depth data models or acutely described metadata. Plus, you can just get more cake if you run out. The old-school, deeply ingrained yet disjointed split between data storage and metadata management can now be reunited. Let data and metadata become one and the same, and instead of putting a query to a database, just ask the data itself – because it is all there waiting for you.
Here is an example of what I mean. A lot of oil and gas companies are currently spending significant sums of money on large-scale data indexing projects to describe their data in a predictable and approachable manner. Database standards for doing this are well established, and so is the collection process – usually painful and very time consuming, but nonetheless the way it has always been done. This traditional, well-established “Schema on Write” approach is believed to produce better access to the data, but in reality it now mostly serves to delay the actual use of the data while we wait for these large indexes to be built. Once they are built, we find ourselves confined to searching or querying the data through the limited, hard-to-change choices we made when we indexed it on ingest. Can you imagine if we had to wait for staff at Google to hand-enter the metadata for every document and webpage on the www before it could be indexed? This very article you are reading would only become available in the year 2070. Imagine having to live your life never having read this – unthinkable, right?
It would not be uncommon for a good-sized oil company to have a seismic line database containing 20,000 seismic lines covering data it shot in 20 countries over a period of 30 years. Imagine now that for every line in the database, there are 20 digital assets that relate to it. This many-to-one relationship between assets and lines would produce 400,000 database rows of metadata, each describing 30 characteristics of a digital asset. That turns into 12 million database entries across the 400,000 rows describing the 20,000 seismic lines.
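The arithmetic scales up quickly, and it is easy to sanity-check:

```python
seismic_lines = 20_000          # lines shot in 20 countries over 30 years
assets_per_line = 20            # digital assets relating to each line
attributes_per_asset = 30       # characteristics described per asset

metadata_rows = seismic_lines * assets_per_line          # catalogue rows
metadata_entries = metadata_rows * attributes_per_asset  # individual entries

print(metadata_rows, metadata_entries)
```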
Most of the information contained in these 12 million metadata entries is already present within the digital assets themselves (not always, but usually). Yet oil companies are currently picking through their datasets to accumulate this metadata in a separate table, even though it already sits inside the digital assets and could be addressed there directly.
One of our developers recently wrote a Python script that can view, query and display 90% of the typical fields in a seismic catalogue that users want to interrogate, plus 100% of the things they did not know they could interrogate, direct from the data itself. Search parameters no longer need to be limited to survey name, line name and shot-point range. They can be all of those, plus amplitude variations, water depth, velocity statistics, feature recognition and a host of other things we don’t yet even know we will want to ask of our data.
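The article doesn’t show what that script looks like, but a minimal sketch of the idea – pulling catalogue-style fields straight out of a seismic file instead of a separate metadata table – might look like this. I am assuming SEG-Y as the file format (the common exchange format for seismic lines); the byte offsets follow the SEG-Y rev 1 standard’s 400-byte binary header, and the synthetic in-memory header is my own stand-in for a real file.

```python
import struct

def read_segy_summary(buf: bytes) -> dict:
    """Pull key fields straight from a SEG-Y binary header (schema on read).

    `buf` is the start of a SEG-Y file: a 3200-byte textual header followed
    by a 400-byte binary header. Values are big-endian per the standard.
    """
    binhdr = buf[3200:3600]
    sample_interval_us, = struct.unpack(">H", binhdr[16:18])  # bytes 3217-3218
    samples_per_trace, = struct.unpack(">H", binhdr[20:22])   # bytes 3221-3222
    format_code, = struct.unpack(">H", binhdr[24:26])         # bytes 3225-3226
    return {
        "sample_interval_us": sample_interval_us,
        "samples_per_trace": samples_per_trace,
        "format_code": format_code,
    }

# Build a synthetic header in memory so the sketch is self-contained;
# with real data you would read the first 3600 bytes of the .sgy file.
hdr = bytearray(3600)
hdr[3216:3218] = (2000).to_bytes(2, "big")   # 2 ms sampling interval
hdr[3220:3222] = (1500).to_bytes(2, "big")   # 1500 samples per trace
hdr[3224:3226] = (5).to_bytes(2, "big")      # IEEE float sample format

summary = read_segy_summary(bytes(hdr))
print(summary)
```

No catalogue was consulted: every value came out of the asset itself, at read time, and the same approach extends to trace headers and the samples themselves for the richer questions (water depth, amplitudes) mentioned above.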
So, back to Doris’s words. “Schema on Read” was what she said. That means ask the data what you want and extract it into a meaningful view when you read it. It does NOT mean querying the data through a schema defined when the data was first put into a database. That old, stale approach restricts your ability to be creative, reduces your results to fixed points and can kill discovery and innovation. Schema on Read means accessing the raw, unstructured or native data and applying your own parameters or lens to the data when you read it back out. Ask the questions you need to ask, not just the questions a predetermined schema allows. In the O&G industry, where the time between creating data and cataloguing it can span many months – often many years – the schema we create today may not be adequate for future access.
Storing data in public cloud platforms and data lakes makes a “Schema on Read” approach both possible and extremely exciting. It provides massive flexibility and a real improvement in how your data can be accessed, consumed and shared.
I can hear a collective sigh from the librarians of the world right now. Actually, it is just three of them that are picketing in front of my office repetitively yelling, “Index, catalogue, keep it clean… Guy’s approach is obscene!” And who knows, they may be right – although I thought it was my garden gnome collection that made me obscene.