Big data is the flavor of the season, with companies of every size and sector lining up to get on the bandwagon. However, it is at implementation time, and even later, that many companies come face to face with the harsh realities of big data. The potential and advantages it offers materialize only if the pain points that come along with it are resolved.
Here are the top 10 pain points associated with big data.
The first challenge facing any big data analyst worth her byte is consolidating data across the enterprise. Data is not just spread across multiple repositories but is invariably trapped in silos, many of which are inaccessible or not even online. In fact, some data may not even be digitized. For instance, a data analyst looking to mine insights into customer behavior may find the data for each store trapped in a separate stand-alone database, with some information still sitting in paper feedback forms at the store. A typical mix of data includes structured data from in-house systems and databases, and unstructured data from a plethora of sources, including emails, system logs, and social media.
Data scattered across different sources, tables, and databases makes it difficult to draw accurate conclusions that could give a competitive advantage. Big data analytics is only as good as the data it analyses. Unless all the data is brought into a common view, big data loses its effectiveness and, working from incomplete data, may even deliver distorted results.
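To make the idea of a common view concrete, here is a minimal Python sketch of consolidating two store silos that use different field names. The store extracts, the field names, and the helper functions `normalize` and `consolidate` are all hypothetical and exist only for illustration; a real consolidation effort would read from the actual databases and digitized forms.

```python
from collections import defaultdict

# Hypothetical per-store extracts; each silo uses its own field names.
store_a = [{"cust_id": 1, "spend": 120.0}, {"cust_id": 2, "spend": 35.5}]
store_b = [{"customer": 1, "amount": 80.0}, {"customer": 3, "amount": 15.0}]


def normalize(record, id_field, amount_field, source):
    """Map a silo-specific record onto one shared schema."""
    return {
        "customer_id": record[id_field],
        "amount": float(record[amount_field]),
        "source": source,
    }


def consolidate(*normalized_batches):
    """Combine normalized batches and total the spend per customer."""
    totals = defaultdict(float)
    for batch in normalized_batches:
        for row in batch:
            totals[row["customer_id"]] += row["amount"]
    return dict(totals)


batch_a = [normalize(r, "cust_id", "spend", "store_a") for r in store_a]
batch_b = [normalize(r, "customer", "amount", "store_b") for r in store_b]
totals = consolidate(batch_a, batch_b)
print(totals)
```

Analysing either store alone would understate what customer 1 spends, which is the kind of distortion incomplete data produces.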
The converse of insufficient data, or insufficient access to it, is data overload. Data is growing at an exponential pace in today’s highly digital world, and before an organization knows it, it is submerged in massive datasets that cost a bomb to store and analyse. As of today, more than 2.5 quintillion bytes of data are created on a daily basis, from sensors, social media, transaction-based data, mobile devices, and many more sources.
Collecting every bit of information the organization can lay its hands on, as most organizations do, simply loads up the data warehouse and analytical engine with large volumes of mostly useless data. Much of the data in organizational data sets may be irrelevant or duplicated. The more useless or irrelevant data there is, the more difficult it becomes to isolate the nuggets of information that can offer real, usable insights.
The challenge before organizations is to cull the chaff, that is, to know which data to collect and which to discard.
It is important to understand where each piece of data came from, and how it may best be used. For instance, if the data comes from social media, it is necessary to decipher the customer's needs first in order to understand the data in its proper context. Many organizations fail to place the data in its proper context, and pay the price at the end of the analysis.
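As a rough illustration of culling, the Python sketch below drops incomplete records and duplicates before they reach the warehouse. The `cull` function, the sample orders, and the choice of `order_id` as the duplicate key are assumptions made for this example; which fields count as required or identifying depends entirely on the organization's own data.

```python
def cull(records, key_fields, required_fields):
    """Drop records missing required fields, then drop duplicates by key."""
    seen = set()
    kept = []
    for record in records:
        # An empty or absent required field makes the record useless here.
        if any(record.get(f) in (None, "") for f in required_fields):
            continue
        key = tuple(record.get(f) for f in key_fields)
        if key in seen:
            continue  # already kept a record with the same identifying key
        seen.add(key)
        kept.append(record)
    return kept


# Hypothetical sample data for illustration only.
records = [
    {"order_id": "A1", "customer": "ann", "total": 20},
    {"order_id": "A1", "customer": "ann", "total": 20},  # duplicate
    {"order_id": "A2", "customer": "", "total": 5},      # missing customer
    {"order_id": "A3", "customer": "bob", "total": 12},
]
cleaned = cull(records, key_fields=("order_id",), required_fields=("customer", "total"))
print([r["order_id"] for r in cleaned])
```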
Data visualization, or the presentation of information in a graphical or pictorial format, makes information easier to understand and is a key tool for interpreting data.
Closely related to interpreting the data is cleansing it. Raw data, as it comes in, may lack appropriate headers, have incorrect data types, or contain unknown or unwanted character encodings. It is essential to modify the raw data to remove these discrepancies and make it consistent. Many organizations underestimate the magnitude of this task and fail to make adequate provision for it. In fact, cleaning the data often takes more time than performing statistical analysis on it.
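The three problems named above (headers, data types, and encoding) can be sketched with Python's standard library alone. The raw bytes, the column names, and the `clean` function are hypothetical; the sample assumes UTF-8 input with a byte-order mark and treats any non-numeric visit count as missing.

```python
import csv
import io

# Hypothetical raw export: byte-order mark, padded headers, a non-numeric value.
RAW = b"\xef\xbb\xbf Store ID ,Visits,Rating\n 7 ,12,4.5\n8,n/a,3\n"


def clean(raw_bytes):
    """Return rows with tidy headers and typed values."""
    # "utf-8-sig" strips a byte-order mark if one is present.
    text = raw_bytes.decode("utf-8-sig", errors="replace")
    reader = csv.reader(io.StringIO(text))
    # Normalize headers: trim, lower-case, and replace spaces.
    header = [h.strip().lower().replace(" ", "_") for h in next(reader)]
    rows = []
    for values in reader:
        row = dict(zip(header, (v.strip() for v in values)))
        row["store_id"] = int(row["store_id"])
        # Keep unparseable counts as None rather than guessing a value.
        row["visits"] = int(row["visits"]) if row["visits"].isdigit() else None
        row["rating"] = float(row["rating"])
        rows.append(row)
    return rows


rows = clean(RAW)
print(rows)
```

Even this toy case needs a separate rule for every column, which hints at why cleansing tends to consume more effort than expected.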
Big data invariably requires high processing power. GPUs, or graphics processing units, do the job better than traditional CPUs, which may simply not be able to withstand the load. GPUs can also cost less than CPUs for the same parallel workload, but the pain point is the difficulty of programming them. It is much harder to program GPUs than CPUs, and in practice impossible to do so without committing to a specific model. Unless AMD, Nvidia, or even Intel resolves this complexity, the difficulty associated with big data analytic programming is likely to stay.
Speed is a key driver of competitive advantage in today’s fast-paced business environment. Companies require a resilient IT infrastructure capable of reading data faster and delivering real-time insights.
Apache Hadoop is the most common framework in use for distributed storage and distributed processing of large data sets on computer clusters. Hadoop, however, presents challenges with scheduling, cluster management, resource sharing, and data sharing. Many standard commercial packages, such as IBM InfoSphere BigInsights, Cloudera, and Hortonworks, can resolve such challenges and ensure that parallel processing goes on smoothly. The challenge lies in identifying these tools and integrating them into the company’s IT ecosystem.
There are other tools that facilitate this end. MapReduce breaks applications into smaller fragments, each of which is then executed on a single node within a cluster.
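The MapReduce pattern can be imitated in a few lines of plain Python. This is a single-process sketch of the idea, not Hadoop's API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample fragments are assumptions for illustration, and on a real cluster each map call would run on its own node.

```python
from collections import defaultdict


def map_phase(fragment):
    """Emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(mapped):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input, already split into fragments.
fragments = ["big data big insights", "data silos", "big silos"]
mapped = [map_phase(f) for f in fragments]  # one call per fragment
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because each fragment is mapped independently, the map calls need no shared state, which is what lets a cluster spread them across nodes.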
The two critical infrastructure elements in big data analytics are storage and processing. It is important to get the interaction between these two right. Scaling multiple workloads, however, poses a challenge. At times, it may be necessary to expand and distribute storage on a temporary basis. In an ideal world, the system would deploy processing resources to whatever needs them at a given time, release them, and redeploy them to whatever needs them next. This is far from the reality today, and achieving it requires writing code.
With big data come big risks. Big data inputs come in from multiple sources, and it is important to ensure that all the data that comes in is secured. Trojans that slip in can subvert the entire system. It is easy to manipulate big data at the processing level, since the major big data processing tools, such as Hadoop and NoSQL databases, were not designed with security in mind.
Also, big data processing often takes place in the cloud, and the inherent risk of data theft as data moves back and forth between company servers and the cloud is ever-present. There are severe limitations on the available authentication solutions as well. For instance, basic authentication and authorization require two completely different stacks, each of which supports some parts of Hadoop but not others. The root cause of the problem is that every vendor makes LDAP integration part of its proprietary “enterprise” edition, preventing a single integrated mechanism.
Big data analytics is costly, and costs can very easily overshoot estimates.
Big data projects involve ETL (Extract-Transform-Load). More often than not, this is where the budget unravels. Implementing ETL may require the use of Flume to land data from high-throughput streams, Oozie for workflow scheduling, Kettle for a visual interface, Pig for complex data transformation, Sqoop for transferring bulk data sets, and Kettle again to load data. Writing code for all of these is a complex affair, and there is no seamless way around it.
Deciding on the approach to collecting, storing, and analysing data is one thing, and deploying suitable tools for analysis quite another. Organizations need to spend considerable time selecting an appropriate analysis tool, for it is difficult to move an application from one tool to another. Such care is rarely taken, and many organizations end up with perfect systems and an inadequate tool, rendering the whole effort unproductive.
The three important elements to consider when selecting the analytic tool are the volume of data, the volume of transactions, and legacy data management and applications. The volume of data and transactions may be handled by any Hadoop-based tool, such as IBM InfoSphere BigInsights or Cloudera. Most organizations do not go beyond this point to consider the third important requirement, legacy data management.
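The extract, transform, and load stages themselves are simple to state, as the minimal Python sketch below shows. It deliberately swaps the tools named above for the standard library's `csv` and `sqlite3` modules, and the source data, the `sales` table, and the function names are hypothetical. The expense in real projects comes from doing each stage at scale across many tools.

```python
import csv
import io
import sqlite3

# Hypothetical source extract for illustration only.
SOURCE = "sku,qty,unit_price\nA-1,2,9.50\nB-7,1,20.00\n"


def extract(text):
    """Read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Convert types and derive revenue per row."""
    return [(r["sku"], int(r["qty"]) * float(r["unit_price"])) for r in rows]


def load(conn, rows):
    """Write the transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (sku TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database stands in for the warehouse
load(conn, transform(extract(SOURCE)))
total = conn.execute("SELECT SUM(revenue) FROM sales").fetchone()[0]
print(total)
```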
Many well-developed applications go to waste because the deployment process does not factor in integration of the new system with the existing production system.
Being aware of the pain points related to big data, and resolving them up front, allows organizations to realize meaningful insights and ensure ROI on their big data investments.