Big-Data or Big-Mess?!

When it comes to dealing with raw data, it's often said that "the messier it is, the better"! I never really got this concept when I studied it in my Big-Data class, but now that I've had the opportunity to actually play around with and evaluate some of this huge data, which is the lifeblood of our business, it all makes total sense to me!

The term 'messiness' here refers to the simple fact that the likelihood of errors increases as you add more data points. Messiness also grows when you combine data from different sources, which don't always align perfectly. As a result, big data becomes something more probabilistic than precise.

In the world of small data, ensuring high data quality and reducing errors was a natural and essential impulse. Since we collected very little information, we made sure that our data was as accurate as possible! But the obsession with getting perfect data is an artifact of the information-deprived analog era, when data was sparse and each data point was critical.

Today, we're no longer living in that information-starved situation. We are exposed to so much data from so many sources that imprecision or messiness has become a positive feature rather than a shortcoming. It is a tradeoff: in return for relaxing the standards for allowable errors, we get hold of much more data, which gives us more scope to analyze it, process it, and be selective with our output.

But merely adding more data to the warehouse doesn't add any value unless there's a strong correlation between those data points. The data has to be leveraged in the right way, and for that, automated tools alone are not sufficient; we must add the 'human element' to them. After playing around with big volumes of data, I've realized that roughly 80% of the data scrubbing can be done easily with analytical tools, but the remaining 20% needs human intervention. We need to eyeball this data and do some spot checking before it goes out for further processing.
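As a rough illustration of that 80/20 split, here is a minimal pandas sketch: the column names, thresholds, and file name are hypothetical, but it shows the idea of automating the easy cleanup and routing the ambiguous rows to a human review queue.

```python
import pandas as pd

def scrub(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Automate the easy cleanup, then flag rows that need a human eye."""
    # Automated pass: drop exact duplicates and normalize obvious formatting.
    df = df.drop_duplicates()
    df["customer_name"] = df["customer_name"].str.strip().str.title()
    df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

    # Rows the tools can't judge (missing or implausible revenue)
    # go to a review queue for manual spot-checking.
    needs_review = df[df["revenue"].isna() | (df["revenue"] < 0)]
    clean = df.drop(needs_review.index)
    return clean, needs_review

# Hypothetical usage:
# clean, review_queue = scrub(pd.read_csv("raw_export.csv"))
# print(f"{len(review_queue)} rows flagged for manual review")
```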

There are times when it becomes difficult to reach a conclusion, or even to decide whether to use the data at all, because these days the data is far noisier and messier than one could imagine! But the bigger and messier the data, the more scope it gives us to analyze it and bring the best out of it. It becomes easier to identify the outliers and to understand the underlying behavior of the data collection and its distribution. Once this pre-processing is done, data cleaning removes the noise and handles the missing values. But we are not done here yet! Before this information gets packaged and sold to generate intelligent business insights or useful predictions, we need to process it.
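For instance, one simple way to surface outliers during that pre-processing step is an interquartile-range check. The sketch below is just one illustrative approach, with a made-up "revenue" column; the flagged rows would still be eyeballed before deciding to drop or keep them.

```python
import pandas as pd

def flag_outliers(series: pd.Series) -> pd.Series:
    """Mark values outside 1.5 * IQR as outliers (a common rule of thumb)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (series < lower) | (series > upper)

# Hypothetical usage:
# df = pd.read_csv("cleaned_export.csv")
# outliers = df[flag_outliers(df["revenue"])]   # review these by hand
# df = df.dropna(subset=["revenue"])            # then handle missing values
```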

Data processing may seem like pure magic, as if one just waves a wand or hits a button and the raw data gets converted into useful information, whereas it actually requires us to work through a set of complex canonical data mining tasks to find the true correlations between different variables. Various techniques are used to do so, such as classification and class probability estimation, regression, clustering, co-occurrence grouping, similarity matching, data profiling, link prediction, data reduction, and causal modeling. And none of this can be achieved solely with tools: computers may help us sift through this massive collection of data and even tell us which attributes are relevant for predicting a target, but human creativity, knowledge, and common sense add value in selecting the right data. Hence, data science is the sensible integration of human knowledge and computer-based techniques to achieve what neither could achieve alone!
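To make one of those canonical tasks concrete, here is a minimal scikit-learn sketch of classification together with class probability estimation. The dataset is a small public one standing in for real business data, so treat it purely as an illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative only: a toy dataset stands in for business data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Classification: predicted class labels for unseen rows.
labels = model.predict(X_test)

# Class probability estimation: the model's confidence in each class.
probabilities = model.predict_proba(X_test)

print(labels[:5])
print(probabilities[:5].round(3))
```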
