In the high-tech world, data mining and big data are widely used buzzwords that reflect the information age we live in. Indeed, we are living in exponential times: the amount of data generated by people these days is staggering.
The first commercial text message was sent in 1992, and today the number of text messages sent and received every day exceeds the total population of the planet.
It’s also estimated that 2.3 trillion gigabytes of data are created each day and that the amount of new technical information is doubling every two years, with 43 trillion gigabytes of data expected to be created by 2020. More recently, Japan successfully tested a fibre optic cable that pushes 14 trillion bits per second down a single strand of fibre, the equivalent of 2,660 CDs or 210 million phone calls every second.
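As a rough sanity check on the fibre comparison, the conversion can be sketched in a few lines of Python. The CD capacity (650 MB) and the bit rate of a phone call (64 kbit/s) are assumptions made here for illustration; the article does not say which values its figures were based on.

```python
# Rough sanity check of the fibre throughput comparison.
# Only the link speed comes from the article; the other two values are assumptions.
BITS_PER_SECOND = 14e12        # 14 trillion bits per second, from the text
CD_CAPACITY_BYTES = 650e6      # assumption: a 650 MB CD
CALL_BITS_PER_SECOND = 64e3    # assumption: one 64 kbit/s voice channel per call


def cds_per_second(bits_per_second, cd_bytes=CD_CAPACITY_BYTES):
    """Number of CDs' worth of data the link carries each second."""
    return bits_per_second / 8 / cd_bytes


def simultaneous_calls(bits_per_second, call_bps=CALL_BITS_PER_SECOND):
    """Number of voice calls that fit into the link at the same time."""
    return bits_per_second / call_bps


if __name__ == "__main__":
    print(f"CDs per second: {cds_per_second(BITS_PER_SECOND):,.0f}")
    print(f"Simultaneous calls: {simultaneous_calls(BITS_PER_SECOND):,.0f}")
```

Under these assumptions the result is roughly 2,700 CDs and about 219 million calls per second, the same order of magnitude as the quoted figures. Different assumptions about CD size or call bit rate would shift the numbers somewhat.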
This explosion of data is why data science is emerging as an attractive field of study, with many students venturing into big data. From traffic patterns and music downloads to web history and medical records, all of this data is recorded, stored and analysed to enable the technology and services that the world relies on every day.
It was estimated that 4.4 million IT jobs would be created globally by 2015 to support big data. Businesses today are therefore collecting massive volumes of both structured and unstructured data in the hope of gaining a competitive advantage.
But in many enterprise scenarios, this end result is more imagined than real. It’s true that big data is inherently disruptive. Twitter, for example, emerged from a hackathon as a way to send standard text messages to multiple users and went on to become a news and social networking service that is destabilizing everything from news and information to unpopular governments. But that doesn’t mean every business will follow the same path.
In his book “Numbersense”, Kaiser Fung, a professional statistician and adjunct statistics professor at New York University, rightly emphasises data analysis over big data mining.
What matters, he argues, is careful observation of the data and the good questions that come out of it, along with the ability to process, store and make sense of the data, not its size. He gives the example of the Gates Foundation, which some years ago made the mistake of assuming that smaller schools are better for student achievement, an assumption that was later shown to be untrue.
The unfortunate thing about the current hype around big data mining is that not enough attention is paid to reporting the accuracy of data mining and processing procedures. Most of the data being analyzed today is unstructured, poorly formatted, poorly documented and not designed with the data scientist in mind, which makes it more difficult to process.
The other problem Fung raises about big data is that it can actually move us backwards, since more data results in more time spent analyzing, arguing, validating and replicating results.
The same argument has been made by Nate Silver, America’s top statistics-driven writer and analyst. In his book “The Signal and the Noise”, he talks about the need to reduce the big data collected to its essence, which he describes as pulling the signal out of all the noise.
A great example is the Netflix Prize, which got a lot of attention. A few years ago, Netflix ran an open competition in which external researchers competed to improve its rating-prediction algorithm, and the winning entry blended hundreds of different component models into one complex ensemble.
The resulting algorithm improved the accuracy of Netflix’s rating predictions by ten per cent, suggesting that simply throwing more data at a problem doesn’t by itself get you better answers.
Therefore, data science is improved not necessarily by lots of data but by analysis that draws out insights leading to policies and actions that deliver the intended results. Big data mining does not always lead to better data analytics.