Data Minimization In Big Data: Benefits and Risks

Author: Sahana Rajan

Data Minimization In Big Data

Towards the end of 2015, it was found that more data had been generated since 2013 than in all of human history before then. This data was marked by the three Vs: high volume, great variety and outstanding velocity of processing. Big Data has continued its reign over IT ever since. Predicting the future of Big Data, Jeremy Waite of Digital Strategy pointed out that design focused on user experience would be central to Big Data in 2016, and that by 2020 we can expect about 1.7 MB of data to be generated per second for almost every human on the planet. There is a cost tied to every byte of data stored. As more and more data is produced, we will shift from traditional forms of data storage to modes that offer greater performance in less physical space. But is all data equally relevant and useful? Are we storing data that is obsolete and immaterial to our purposes? Data minimization offers a way to ensure that we store only the data relevant to our purpose, thus reducing the cost of storing it.

Data minimization is the practice of gathering solely the data required to fulfil a particular purpose. Cloud storage has become the latest option for storing data. However, the fact that cloud storage is inexpensive does not mean that we need to record all the data we have. The European Union has written data minimization into its data protection law, which requires that stored personal data be adequate and relevant to the purpose for which it is held, and limited to what that purpose requires. Most fundamentally, this implies that we should collect and record only the minimum data needed to accomplish a purpose.
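As a rough illustration of collecting only what a purpose needs, the sketch below keeps a per-purpose list of permitted fields and drops everything else before a record is stored. The purposes, field names and the minimize function are hypothetical, chosen only for this example; they do not come from any law or library.

```python
# Hypothetical mapping from a processing purpose to the fields it needs.
REQUIRED_FIELDS = {
    "order_delivery": {"name", "address", "phone"},
    "newsletter": {"email"},
}


def minimize(record: dict, purpose: str) -> dict:
    """Return a copy of the record holding only the fields the purpose needs."""
    allowed = REQUIRED_FIELDS[purpose]  # raises KeyError for an undeclared purpose
    return {key: value for key, value in record.items() if key in allowed}


# Made-up sign-up form data; only the e-mail address is needed for a newsletter.
signup = {
    "name": "A. User",
    "email": "a.user@example.com",
    "birth_date": "1990-01-01",
    "phone": "555-0100",
}
stored = minimize(signup, "newsletter")
```

Because an undeclared purpose raises an error instead of storing the full record, nothing is kept unless someone has first stated why it is needed.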

When the age of Big Data began, the initial reaction was an overwhelming tendency to store all the data one could. This left data scientists drowning in a flood of data. With the Internet of Things taking over IT, and every wearable able to collect data, the category of personal and private information has taken on new importance. While some companies might feel that the information could be of use in the future, there is a critical need to regulate how much data is stored and for how long.

The biggest danger of hoarding data is that the data you need cannot be found at critical moments. Moreover, storing irrelevant data over a long period of time ends up being a financial burden on the company. In response to the dangers of the data explosion, data scientists are adopting a data minimization policy, under which only the relevant data is stored.


Merits of data minimization:

The optimum data storage policy is one that stores data on the basis of function. Every piece of data that is stored must be filtered through a series of objectives. If the data does not fit any of the intended purposes, it should be discarded. The company will also save the large sums of money previously spent on storing useless information. Just as a phone overloaded with apps and data begins to slow down, a company overflowing with unneeded data in storage begins to stagnate in the long run.
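The filtering step described above could be sketched as follows: each objective is a small test, and a record is retained only if at least one objective claims it. The two objectives and the record layout are assumptions made up for this example; a real policy would define its own.

```python
from typing import Callable, Iterable

Record = dict
Objective = Callable[[Record], bool]


# Hypothetical objectives: each returns True if the record serves that purpose.
def needed_for_billing(record: Record) -> bool:
    return record.get("type") == "invoice"


def needed_for_support(record: Record) -> bool:
    return record.get("type") == "ticket" and record.get("open", False)


def split_by_purpose(
    records: Iterable[Record], objectives: list[Objective]
) -> tuple[list[Record], list[Record]]:
    """Separate records that serve at least one objective from those that serve none."""
    keep, discard = [], []
    for record in records:
        if any(objective(record) for objective in objectives):
            keep.append(record)
        else:
            discard.append(record)
    return keep, discard


# Made-up sample data for the sketch.
records = [
    {"id": 1, "type": "invoice"},
    {"id": 2, "type": "ticket", "open": False},
    {"id": 3, "type": "clickstream"},
    {"id": 4, "type": "ticket", "open": True},
]
keep, discard = split_by_purpose(records, [needed_for_billing, needed_for_support])
```

Returning the discarded records separately, instead of deleting them inside the function, leaves room for a review step before anything is actually removed.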

The risk of data loss and theft is also reduced when only the necessary data is stored. If confidential data that the company does not need is stolen, the company could not only face a complete decline but could also be charged with criminal negligence. Such a consequence would be the product of carelessness and poor data management, all the more so because the company never needed the stolen data.


Challenges of data minimization:

Categorizing data according to relevance is a tricky task. In the report “The Risks of Data Minimization”, Lawrence Bowdish pointed out that a data minimization policy would treat all data the same. Large amounts of data collected by medical instruments could be discarded over time, which could hinder advances in health. As the report asks: “Would health characteristics reported by glucose monitors or pacemakers be treated the same as climate information reported by in-home thermostats or meters reporting water usage?” It is crucial that we define ‘sensitive’ so that we can classify data correctly, and that we make a practice of predicting the long-term utility of data.