By now most people are familiar with the basics of big data, although we have to admit that it is too complex to be fully understood. Even so, the world of data science and the work that data scientists do remain more or less arcane.
Most of us have never given it much thought, but the few who have will have wondered what data scientists do on a day-to-day basis. We know that big data analysis involves collecting massive amounts of data to gain meaningful insights. But what about the cogs and springs behind the process of interpreting big data? Before looking into the work of data scientists, let’s try to understand what data science is and who data scientists are.
The explosion of data, the availability of myriad new types of it and the emergence of new technologies to interpret it led to the birth of data science, described as a hot new field by The New York Times. Data science basically involves three components: organizing, packaging and delivering data. Organizing is concerned with the physical location and structure of data. Packaging involves building prototypes, performing statistics and creating visualizations. Delivering is the part where the interpretations are communicated and the value is obtained. The rise of big data has created huge potential for data science to look deeper into physical and biological systems and to gain insights into human, social and economic behavior.
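The three components of organizing, packaging and delivering can be pictured as a tiny pipeline. What follows is only a minimal sketch in Python, not a description of how any real team works: the sales records, the region names and the function names `organize`, `package` and `deliver` are all invented for illustration.

```python
from statistics import mean, median

# Hypothetical raw records (region, monthly sales), invented for illustration.
RAW_ROWS = [
    ("north", 120.0),
    ("south", 95.0),
    ("north", 140.0),
    ("south", 105.0),
]


def organize(rows):
    """Organizing: give the raw rows a structure (sales grouped by region)."""
    grouped = {}
    for region, sales in rows:
        grouped.setdefault(region, []).append(sales)
    return grouped


def package(grouped):
    """Packaging: compute simple statistics for each group."""
    return {
        region: {"mean": mean(values), "median": median(values)}
        for region, values in grouped.items()
    }


def deliver(summary):
    """Delivering: turn the statistics into lines a reader can act on."""
    return [
        f"{region}: mean {stats['mean']:.1f}, median {stats['median']:.1f}"
        for region, stats in sorted(summary.items())
    ]


if __name__ == "__main__":
    for line in deliver(package(organize(RAW_ROWS))):
        print(line)
```

In practice each stage is far larger (storage systems, models, dashboards), but the division of labor is the same: structure the data first, then analyze it, then communicate the result.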
Following in the footsteps of Hal Varian, the chief economist at Google, who said that statistician would be the sexy job of the next decade, Harvard Business Review called data scientist the sexiest job of the 21st century.
A data scientist is someone equipped with the analytic, machine learning, data mining and statistical skills to make sense of huge amounts of data and to explain its significance effectively to others. He usually analyzes big data and plays a major part in the planning and marketing process of an organization. The insights he derives through statistical analysis are invaluable for planning, executing and monitoring an organization’s strategies.
The title data scientist was coined by D.J. Patil and Jeff Hammerbacher in 2008. In 2012, Patil co-wrote the Harvard Business Review article that called data scientist the sexiest job of the 21st century. However, the title has been criticized by many as nothing but a glorified synonym for data analyst. Besides, data scientists are not the only data-savvy professionals. There are data engineers, data analysts and statisticians who also work in this field. So it is only natural to ponder how a data scientist is different from a data analyst. Here’s how.
The role of a data scientist can be viewed as an evolution of the role played by a data analyst. The formal training is similar, with education and expertise in computer science, modeling, statistics and analytics. A traditional data analyst usually analyzes data from a single source, whereas data scientists explore multiple sources and examine all kinds of incoming data to discover hidden insights. Data scientists go beyond merely addressing organizational problems; they are capable of selecting the right problems, so that the solutions they offer are the most valuable ones for the organization at that moment. A data scientist’s work is not simply to collect and report on data but to look at it from multiple angles, perceive its meaning and then come up with recommendations on how best to apply it. The role of the data scientist is best described as part analyst, part artist. Now let’s look at how a data scientist achieves his goals through three simple steps or capabilities.
The fundamental duty of a data scientist sounds surprisingly simple: it is to acknowledge that data has some kind of meaning. Only if data means something does the effort we make to understand it start making sense. This has nothing to do with complex algorithms or engineering. Understanding that numbers mean more than they appear to, and trying to grasp that meaning, has to be seen as an art.
The actual work of the data scientist begins here. The data scientist has to identify the analytics approaches and algorithms that best match the data. These tools are not confined to machine learning alone; they also include operations research, decision theory, game theory and control theory. The data has to be made sense of in the context of the problem at hand, and the chosen algorithm has to lead toward a solution to that problem.
A data scientist should be able to comprehend the engineering or infrastructure required to perform and deliver the analysis. If there is no infrastructure that can deliver the solution effectively at the right time and in the right place, the analysis and the problem-solving process will be rendered useless.
Together, these three capabilities make a good data scientist. But each of them is complex in itself, and data scientist Dr. Steve Hanks claims that it is virtually impossible to be adept at all three at the same time. Each capability has enough subdivisions to leave room for specialization. But specialization doesn’t mean one can neglect any of the three areas. For instance, you can be primarily concerned with algorithms or engineering, but without understanding the problem and how to match it with the data, your work will remain defective.