What Is Data Science, Really?
Not too long ago I attended a conference on data analytics and machine learning. I listened to one innovative and exciting session after another. The term 'data science' was sprinkled generously throughout. But were they all talking about the same thing? Or was it simply a code word offering membership in an imagined community?
Indeed, 'data science' could simply be a term of convenience for a broad and enticing new marketing space. The industry loves that sort of thing. And broad indeed! Consider the following statement from Wikipedia (and pardon a nit about 'business' coming last):
"Data science … incorporates skills from computer science, statistics, information science, mathematics, information visualization, data integration, graphic design, complex systems, communication and business."
I asked several times at the conference for a definition (asking for definitions is, I admit, a habit of mine, good or bad depending on your point of view). I was consistently disappointed. Perhaps 'data science' is a discipline that doesn't admit semantics?! That's a very interesting question, but I'll not digress.
The best response I got was something like, "It's more than statistics. More than business analytics. More than machine learning." To which was added, "You can't get an MBA to become a data scientist. Or get a degree in math. Or computer science."
Not an adequate definition at all! Not even a definition! Shared understanding does not arise from saying what something is not, or from explaining how you can't become one. Shared understanding arises from expressing what something is. And no, you're not allowed to say, "I know it when I see it." I might look at the very same thing and not see what you see.
Wikipedia does give lots of good insights but, in my view, falls short of a solid definition. It says:
"Data science … uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured."
- It seems rather obvious that something called a 'science' would have to use 'scientific methods'. What's the real point there?
- The phrase "uses processes, algorithms and systems to extract knowledge and insights from data in various forms" could equally be applied to 'data analytics' or 'statistical analysis'. What's the differentiation? It could probably also be applied to 'learning' (as in machine learning). What characteristic makes data science a new thing of its own, clearly differentiated from similar or pre-existing things?
Wikipedia also says:
"The field encompasses preparing data for analysis, formulating data science problems, analyzing data, developing data-driven solutions, and presenting findings to inform high-level decisions in a broad range of application domains."
It's certainly true that we are seeing all kinds of new 'formulations' these days to develop 'data-driven solutions'. But a list of activities is cherry-picking; it names what practitioners do, not what the field is. What's the essence of the concept?
On social media, William Brooks suggested this definition:
data science: the application of the scientific method and experimental design to the statistical analysis of data
Much better! He went on to say:
- "This definition differentiates data science from most of the data analysis that goes on in business today." (Yes, true, but correctly not part of the definition.)
- "Much in the same way that an Erlenmeyer flask might be used for scientific inquiry — or as a convenient vessel for a beer — machine learning uses tools of statistical analysis that may be used in data science or in other ways."
That second point provides an excellent insight. A good 'essence' definition might therefore be:
data science: the application of the scientific method in using the tools of data analysis
That leaves one additional question. Can you really have a science of data? Has the world become so digital that a science of data is possible even though the data itself can be about literally anything?
I suppose the answer depends on your definition of science. I hate to say it, but definitions provided by standard dictionaries support you either way. So, in the end (as always), the meaning is whatever the community says it is. I just wish the community would say it more clearly.