You can’t escape the words BIG DATA these days. If you’re like me, you might be wondering subconsciously, “Hasn’t data always been something companies have had to deal with? And wasn’t it always a massive amount?” The answer is “of course.” The difference now is that companies are pushing to make even more sense of their data piles because growth is harder and harder to come by. Any competitive advantage that can be obtained from analyzing raw, random information is sorely needed.

In 2005, Roger Mougalas of O’Reilly Media coined the term Big Data, only a year after the company coined the term Web 2.0. It refers to a set of data so large that it is almost impossible to manage and process using traditional business intelligence tools. 2005 is also the year Hadoop was created at Yahoo!, built on top of Google’s MapReduce. Its original goal was to index the entire World Wide Web; today the open-source Hadoop is used by many organizations to crunch through huge amounts of data.

Here’s an infographic describing the history of the modern term:

You can go back further, to the first major data project, created in 1937 by the US government under Franklin D. Roosevelt’s administration. After the Social Security Act became law in 1935, the government had to keep track of contributions from 26 million Americans and more than 3 million employers. IBM won the contract to develop a punch-card-reading machine for this massive project.

There are other examples going back hundreds, even thousands of years, but you get the point. Even though large data sets have been around for a long time, we’re only now starting to grasp how to extract insights from them and wield their true power. 90% of the data companies house in various server farms today was created within the last four years, which is a big reason data scientists are in such high demand.

Additionally, data analysis is a difficult process to master because few people can describe exactly how to do it. The methods by which we state a question, explore data, conduct formal modeling, interpret results, and communicate findings are difficult to generalize and abstract. Fundamentally, data analysis is an art; it is not yet something we can easily automate. Data analysts and scientists have many tools at their disposal, from linear regression to classification trees to random forests, and these tools have all been carefully implemented on computers. But ultimately, it takes a person to assemble those tools and apply them to data to answer questions that matter to people.
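The tools themselves are straightforward to mechanize. As a minimal sketch, here is the simplest one mentioned above, ordinary least-squares linear regression, written in plain Python with hypothetical ad-spend numbers (the data and variable names are illustrative, not from any real dataset). The mechanical part is a few lines of arithmetic; deciding which question the slope actually answers is the part that resists automation.

```python
def linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical example: monthly ad spend (in $1000s) vs. revenue.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
revenue = [2.1, 4.0, 6.2, 7.9, 10.1]
slope, intercept = linear_regression(spend, revenue)
```

Any statistics library will fit this model for you in one call; the point is that the computation is trivial compared with knowing whether ad spend and revenue are the right pair of variables to compare in the first place.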