Big data is a great quantity of diverse information

Simply put, big data is the knowledge domain that explores the techniques, skills and technology required to obtain valuable insights from massive quantities of data. Data isn’t new, however. Some of the earliest known records of a society using data date back to the Roman Empire, which collected data to measure crop yields and coordinate armies. But let’s take a look at the more recent history of the term.

A brief history

Data first became a problem for the U.S. Census Bureau in 1880, when it had to process that year’s census. At the time, it was estimated that it would take 8 years to process all the information, and that the data from the 1890 census would take 10 years. Luckily, a man working for the bureau in 1881 devised a new method of processing information: Herman Hollerith created the Hollerith Tabulating Machine. His invention was based on the punch cards used for controlling the patterns woven by mechanical looms. Put to work on the 1890 census, his machine is credited with reducing roughly ten years of projected work to a mere 3 months.

In 1927, an Austrian-German engineer by the name of Fritz Pfleumer designed a method for storing information on magnetic strips. After experimenting with a variety of materials, he decided to use thin paper, striped with iron oxide powder and coated with lacquer. He patented this technology in 1928.

In 1965, the U.S. government built the first data centre with the intent of storing millions of tax returns and fingerprints. Every record was stored on magnetic tape in this data centre, which is considered to be the first effort at large-scale data storage.

After the birth of the internet and the publication of the World Wide Web in 1991, the amount of information started to increase. The World Wide Web allowed for the transfer of audio, video, and pictures. In 1993, CERN announced that the World Wide Web would be free for everyone to develop and use. The free nature of the web meant that businesses and individuals everywhere began to create their own websites and online marketplaces. The 1990s saw substantial growth in internet usage, and personal computers rapidly became more powerful. Internet growth rested on three things: Tim Berners-Lee’s efforts, CERN’s free access, and access to individual personal computers.

But where did the term ‘big data’ come from?

The term ‘Big Data’ has been in use since the early 1990s. Although it is not exactly known who first used the term, most people credit John R. Mashey (who at the time worked at Silicon Graphics) for spreading the term and giving it attention. However, only in recent years has the term gained a lot of traction due to the various emerging technologies such as the internet-of-things, blockchain, artificial intelligence and machine learning, amongst others.

The Four V’s

The ability to draw insights from data is revolutionising entire industries and changing human culture and behaviour. It is a result of the information age and is changing how people form opinions, create, work, process emotions, and exercise.

When we say massive quantities of data, we don’t mean a spreadsheet with 1,000 rows or even 100,000. The term actually refers to data sets that are both very large and made up of a massive variety of data points. Normal computers don’t have the processing power required to efficiently process and analyse these large data sets. But these voluminous data sets offer great value to organisations in the form of useful information.

Big Data is embraced by all types of industries, ranging from healthcare, finance and insurance, to the academic and non-profit sectors.

[Diagram: the four V’s of big data]

Let’s go into further detail about the four V’s:

  • Volume – This refers to the scale of the data: the size of the data sets that need to be analysed and processed, which are now frequently measured in terabytes and petabytes. A frequently quoted estimate holds that roughly 90% of the world’s data was created in the past two years alone. The amount is expected to double every two years, and by 2025 the amount of data in the global data sphere is forecast to reach 175 zettabytes. This is equivalent to 175 trillion gigabytes. This growth will only accelerate as global access to the internet increases.
  • Velocity – This refers to the speed at which data is generated. High-velocity data is generated at such a pace that it requires specific distributed processing techniques. Examples of data generated with high velocity are tweets and YouTube videos. The internet sends a vast amount of information across the world every second. Enterprises have to run this information through their systems and networks, process it, scan it, and store it while making sure that it is retrievable later.
  • Variety – Variety is what makes big data really big. Information is generated from a wide variety of sources and generally falls into one of three types: structured, semi-structured and unstructured. Think of all the pictures, videos, emails, audio files, and documents that are transferred through the internet. Each of these file types is completely different from the others. Not all of them can be neatly structured into rows and columns in a database, which is one of the issues that enterprises encounter.
  • Veracity – This term is associated with the quality of the data that is being generated. Assessing veracity helps to filter what is important from what is not and, in the end, leads to a better understanding of the data and of how to categorise and sort it in order to take action. High-veracity data has many records that are valuable to analyse and that can be converted into meaningful reports from which better decisions can be made.
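
The volume figures above can be checked with a little arithmetic. The short Python sketch below is only an illustration: it assumes decimal (SI) units, where a gigabyte is 10^9 bytes and a zettabyte is 10^21 bytes, and the function names are made up for this example. The second function simply applies the "doubles every two years" rule of thumb to a hypothetical starting point; it is not a forecast.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption of this sketch).
GIGABYTE = 10**9
ZETTABYTE = 10**21


def zettabytes_to_gigabytes(zettabytes: int) -> int:
    """Convert a whole number of zettabytes to gigabytes."""
    return zettabytes * ZETTABYTE // GIGABYTE


def project_volume(start_zb: float, years: float, doubling_period: float = 2) -> float:
    """Project a data volume that doubles every `doubling_period` years."""
    return start_zb * 2 ** (years / doubling_period)


if __name__ == "__main__":
    # 175 zettabytes expressed in gigabytes: 175 trillion.
    print(f"175 ZB = {zettabytes_to_gigabytes(175):,} GB")

    # Hypothetical: 175 ZB that keeps doubling every two years for six years.
    print(f"After 6 more years: {project_volume(175, 6):,.0f} ZB")
```

Running the script shows that 175 zettabytes is 175,000,000,000,000 gigabytes, matching the "175 trillion gigabytes" figure, and that three successive doublings would take 175 zettabytes to 1,400.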