Michael Stonebraker Data Science

There are so many ways we can go to try to understand and then to make use of the Industrial Internet of Things. As my thinking coalesces, I’ve come to the conclusion that the IIoT is a tool. It is a tool to be used in the service of an overall manufacturing/production strategy.

In order to properly use this tool of connected devices serving real-time data, we are going to need advances in data science.

Two database types seem to dominate in manufacturing—at least as expounded by suppliers. One is the relational (SQL) database. The other is the data historian.

I remember talking with some of the tech guys at Opto 22 about exploring semi-structured data and open-source alternatives such as NoSQL. At the time they thought SQL would be all they needed. And maybe so. But that was a couple of years ago.
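To make that contrast concrete, here is a minimal sketch of how real-time tag data might sit in a plain relational store, using Python’s built-in sqlite3 module. The table layout and tag names are hypothetical (my example, not any vendor’s schema); a commercial historian would add compression, buffering, and time-based indexing on top of something like this.

```python
import sqlite3
from datetime import datetime, timezone

# In-memory SQLite stands in for any relational store; the schema and
# tag names are hypothetical, just to make the comparison concrete.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sensor_readings (
        tag     TEXT NOT NULL,   -- e.g. 'line1.pump3.pressure'
        ts      TEXT NOT NULL,   -- ISO-8601 timestamp
        value   REAL NOT NULL,
        quality TEXT DEFAULT 'good'
    )
""")

# A historian compresses and indexes this kind of append-only time-series
# data; in SQL it is simply rows in a table.
now = datetime.now(timezone.utc).isoformat()
conn.execute(
    "INSERT INTO sensor_readings (tag, ts, value) VALUES (?, ?, ?)",
    ("line1.pump3.pressure", now, 87.4),
)

# Typical query: the latest value for one tag.
row = conn.execute(
    "SELECT ts, value FROM sensor_readings "
    "WHERE tag = ? ORDER BY ts DESC LIMIT 1",
    ("line1.pump3.pressure",),
).fetchone()
print(row)
```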

All that discussion introduces an important podcast I just listened to. I subscribe to the O’Reilly Radar podcasts on iTunes. They’ve been cranking out about one per week—usually to promote an O’Reilly book or O’Reilly conference.

Data Science

Michael Stonebraker was awarded the 2014 ACM Turing Award for fundamental contributions to the concepts and practices underlying modern database systems. In this podcast, he discusses the future of data science and the importance—and difficulty—of data curation.

[Notes from the O’Reilly Website]

One size does not fit all

Stonebraker notes that since about 2000, everyone has realized they need a database system, across markets and across industries. “Now, it’s everybody who’s got a big data problem,” he says. “The business data processing solution simply doesn’t fit all of these other marketplaces.” Stonebraker talks about the future of data science — and data scientists — and the tools and skill sets that are going to be required:

It’s all going to move to data science as soon as enough data scientists get trained by our universities to do this stuff. It’s fairly clear to me that you’re probably not going to retread a business analyst to be a data scientist because you’ve got to know statistics, you’ve got to know machine learning. You’ve got to know what regression means, what Naïve Bayes means, what k-Nearest Neighbors means. It’s all statistics.

All of that stuff turns out to be defined on arrays. It’s not defined on tables. The tools of future data scientists are going to be array-based tools. Those may live on top of relational database systems. They may live on top of an array database system, or perhaps something else. It’s completely open.
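As an illustration of what “array-based” looks like in practice, here is a minimal sketch using NumPy arrays with scikit-learn’s k-Nearest Neighbors, Naive Bayes, and regression estimators. This is my example, not Stonebraker’s, and it assumes numpy and scikit-learn are installed; the toy data is invented.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

# Toy sensor-style data: rows are observations, columns are features.
X = np.array([[1.0, 20.0], [1.2, 21.5], [3.8, 40.0], [4.1, 42.3]])
y = np.array([0, 0, 1, 1])  # e.g. normal vs. abnormal operation

# The estimators below all consume and return arrays, not tables.
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
nb = GaussianNB().fit(X, y)
reg = LinearRegression().fit(X[:, :1], X[:, 1])  # regress feature 2 on feature 1

sample = np.array([[3.9, 41.0]])
print(knn.predict(sample), nb.predict(sample), reg.predict(sample[:, :1]))
```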

Getting meaning out of unstructured data

Gathering, processing, and analyzing unstructured data presents unique challenges. Stonebraker says the problem really is with semi-structured data, and that “relational database systems are doing just fine with that”:

When you say unstructured data, you mean one of two things. You either mean text or you mean semi-structured data. Mostly, the NoSQL guys are talking about semi-structured data. When you say unstructured data, I think text. … Everybody who’s trying to get meaning out of text has an application-specific parser because they’re not interested in general natural language processing. They’re interested in specific kinds of things. They’re all turning that into semi-structured data. The real problem is on semi-structured data. Text is converted to semi-structured data. … I think relational database systems are doing just fine on that. … Most any database system is happy to ingest that stuff. I don’t see that being a hard problem.
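A small sketch of that pipeline, assuming the pieces Stonebraker describes: an application-specific parser turns text into semi-structured data, which a relational database then ingests and queries. The regular expression, field names, and sample note are all hypothetical, and the query assumes a SQLite build with the JSON1 functions (standard in recent versions).

```python
import json
import re
import sqlite3

# Application-specific "parser": pull a few fields out of a free-text
# maintenance note. The pattern and field names are hypothetical.
def parse_note(text: str) -> dict:
    match = re.search(r"pump\s+(\d+).*?(\d+(?:\.\d+)?)\s*psi", text, re.I)
    return {
        "asset": f"pump{match.group(1)}" if match else None,
        "pressure_psi": float(match.group(2)) if match else None,
        "raw": text,
    }

record = parse_note("Operator noted pump 3 running rough at 87.4 psi")

# Ingest the semi-structured result into a relational store as JSON.
# Assumes SQLite with the JSON1 extension available.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (doc TEXT)")
conn.execute("INSERT INTO notes VALUES (?)", (json.dumps(record),))

print(conn.execute(
    "SELECT json_extract(doc, '$.asset'), json_extract(doc, '$.pressure_psi') "
    "FROM notes"
).fetchone())
```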

Data curation at scale

Data curation, on the other hand, is “the 800-pound gorilla in the corner,” says Stonebraker. “You can solve your volume problem with money. You can solve your velocity problem with money. Curation is just plain hard.” The traditional solution of extract, transform, and load (ETL) works for 10, 20, or 30 data sources, he says, but it doesn’t work for 500. To curate data at scale, you need automation and a human domain expert. Stonebraker explains:

If you want to do it at scale — 100s, to 1000s, to 10,000s — you cannot do it by manually sending a programmer out to look. You’ve got to pick the low-hanging fruit automatically, otherwise you’ll never get there; it’s just too expensive. Any product that wants to do it at scale has got to apply machine learning and statistics to make the easy decisions automatically.

The second thing it has to do is, go back to ETL. You send a programmer out to understand the data source. In the case of Novartis, some of the data they have is genomic data. Your programmer sees an ICU 50 and an ICE 50, those are genetic terms. He has no clue whether they’re the same thing or different things. You’re asking him to clean data where he has no clue what the data means. The cleaning has to be done by what we could call the business owner, somebody who understands the data, and not by an IT guy. … You need domain knowledge to do the cleaning — pick the low-hanging fruit automatically and when you can’t do that, ask a domain expert, who invariably is not a programmer. Ask a human domain expert. Those are the two things you’ve got to be able to do to get stuff done at scale.
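A toy sketch of that pattern (mine, not Stonebraker’s): score candidate matches automatically, merge the confident ones, and route only the ambiguous ones to a domain expert. Here difflib stands in for the statistical and machine-learning matchers a real curation product would use; the thresholds and term pairs are made up.

```python
from difflib import SequenceMatcher

# Candidate term pairs from two hypothetical data sources to be curated.
pairs = [
    ("IC50", "IC-50"),
    ("IC50", "EC50"),
    ("gene_symbol", "GeneSymbol"),
]

AUTO_MATCH = 0.9   # thresholds are illustrative, not tuned
AUTO_REJECT = 0.4

def similarity(a: str, b: str) -> float:
    # Cheap stand-in for the statistical/ML matchers a real product would use.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for a, b in pairs:
    score = similarity(a, b)
    if score >= AUTO_MATCH:
        print(f"auto-merge  {a!r} ~ {b!r}  ({score:.2f})")
    elif score <= AUTO_REJECT:
        print(f"auto-reject {a!r} vs {b!r} ({score:.2f})")
    else:
        # Escalate to the business owner / domain expert, not the programmer.
        print(f"ask expert  {a!r} vs {b!r} ({score:.2f})")
```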

Stonebraker discusses the problem of curating data at scale in more detail in his contributed chapter in a new free ebook, Getting Data Right.
