Wednesday, 3 December 2014

What's Big Data?

First published: The Star (Malaysia), May 2014


What does “Big Data” mean? Is there a “Small Data?” Don’t they mean “Lots of Data” rather than “Big Data?” And where’s it all coming from?  (I tell you, these IT folks are just plain weird sometimes…)

Actually, “Big Data” sounds a lot better than “Plenty of Data” or “Lots of Data”, both of which sound terribly mundane.  “Data” is not very useful on its own; it’s just a fact or a statistic, for example, a single record of a toaster sale. But data can become useful if we know how many toasters were sold, in which combination of colours, and where they were sold, because then we know which coloured toasters to make more of, or whether our pricing is right. At this stage, our data has become “information” because it informs us. In effect, collections of facts or statistics, meaningless in themselves, can be organized or assembled to provide useful ‘information’.
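To make the toaster example concrete, here is a minimal Python sketch of turning individual sale records into something that informs us. The records, cities and prices are made up purely for illustration, and `summarise` is a hypothetical helper, not anything from a real sales system.

```python
from collections import Counter

# Hypothetical sale records: each one on its own is just "data".
sales = [
    {"colour": "red", "city": "Kuala Lumpur", "price": 89.0},
    {"colour": "white", "city": "Penang", "price": 79.0},
    {"colour": "red", "city": "Penang", "price": 89.0},
    {"colour": "red", "city": "Kuala Lumpur", "price": 85.0},
]


def summarise(records):
    """Count sales by colour and by city, and work out the average price."""
    by_colour = Counter(r["colour"] for r in records)
    by_city = Counter(r["city"] for r in records)
    average_price = sum(r["price"] for r in records) / len(records)
    return by_colour, by_city, average_price


by_colour, by_city, average_price = summarise(sales)
print("Sales by colour:", dict(by_colour))
print("Sales by city:", dict(by_city))
print("Average price:", average_price)
```

A single record says almost nothing, but the counts tell us which colour is selling and where, which is the step from data to information.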


Astonishingly, it’s estimated that 90 percent of all the data in the world was created in the last 2 years! How can that be?  You wonder what the rest of human civilization, going all the way back to the earliest societies, was doing.  If you think about it, knowledge then was a scarce commodity, and in early human civilisations only a privileged elite, such as the priesthood or royalty, was educated. They wrote on papyri, or inscribed on stone, leaves or wood, so recorded information was very, very limited.

It wasn’t until the printing press, which arrived in Europe in the fifteenth century, and the later spread of education that knowledge could be more widely distributed, but everything was still recorded or printed on a medium such as paper. Every technological advance since then has increased the amount of data – electricity, industrialization, mass-production, radio, the TV, the computer.  In fact, it wasn’t until the mid-twentieth century that computers began to have a significant impact on society.  The early computers were behemoths, but they were able to process lots of information rapidly and tirelessly, so a job like a population census was ideal for a fast recording and transaction machine, which is what early computers were.

Suddenly, there was a lot of data, because these big, early computers could process hundreds or thousands of times faster than a roomful of clerks writing away with their pens.  This data was, and still is, in a certain form – the name was recorded in a certain number of fields, followed by the gender, the date of birth and so on.  In other words, all these data had a ‘structure’ – so logically, they’re called ‘structured data’.
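As a rough illustration of what ‘structure’ means here, the sketch below models a census-style record in Python. The field names and the two sample people are assumptions made up for this example; a real census form would have many more fields.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class CensusRecord:
    """Every record has the same fields, in the same order, with the same types."""
    name: str
    gender: str
    date_of_birth: date


# Hypothetical sample records, for illustration only.
records = [
    CensusRecord("Aminah", "F", date(1950, 3, 14)),
    CensusRecord("Raj", "M", date(1948, 11, 2)),
]

# Because the structure is fixed, the same question can be asked of every record.
field_names = [f.name for f in fields(CensusRecord)]
born_before_1950 = [r.name for r in records if r.date_of_birth.year < 1950]

print(field_names)
print(born_before_1950)
```

Because every record looks the same, a machine can sort, count and filter millions of them without ever having to guess what a given piece of data means.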

(Which raises the question – so what is ‘unstructured data’? Patience…)

Computers became smaller and more affordable and companies, and not just governments, could afford to use them – their efficiency launched humankind into a new era of civilisation that many of us lived through in the twentieth century, ushering in more prosperity and economic activity than could have been possible otherwise. 

Computers – Information Technology, or IT – became more pervasive with personal computers, or PCs – besides the fact that the man in the street could now own one, it also meant that there was lots and lots of digital data being generated – spreadsheets, documents, presentations, and so on, and when these PCs became connected or networked together – well!!

With the Internet, data traffic exploded, and all sorts of new ways of generating and using data emerged.  The cost of computing became dramatically cheaper – the smartphone in your pocket has more computing power than NASA did when it sent man to the moon in 1969, and it costs a minuscule fraction of what NASA had to pay for roomfuls of heat-breathing mainframe computers back then.  Astronauts used slide rules instead of digital calculators – and this was only back in 1969!

In the modern era, a large part of our lives is digitized. Our governments, commerce, banking systems, transport, factories, industries, public services, healthcare, even agriculture are computerized, busily churning out data and transactions.  Much of this computerization is invisible to the end-consumer, but supply chains, financial transactions, transport schedules, traffic systems and government machinery all depend on computers whirring silently away, largely unseen.

We, as individuals, churn out lots and lots of data ourselves – every time you send an email, make a phone call, text your boyfriend or girlfriend, take a photo of the food you’re eating, post an update on Facebook, download a video or a music clip, you’re generating data.  A large part of the data generated by humankind isn’t apparent, but satellites, CCTVs, sensors in machines and RFID tags are all generating and sending data as well, humming silently away 24 hours a day, every day, the unseen lubricants of modern human civilization.  No wonder 90 percent of all the data in the world was generated in the last two years. And, by the way, the rate of data generation is accelerating rapidly in an increasingly globalised, fast-paced world driven by commerce and human interaction.

All this stuff is by and large meaningless – your email makes sense to you, but it has no formal structure, and the next email will look completely different. Same with your Facebook posts, and the vacation pictures you took – they don’t conform to a template or form, and therefore have no ‘structure’.  Guess what all this messy stuff is called?

“Unstructured data”, of course.  And there’s so much ‘unstructured data’ generated by you and me and machines sending out messages that it’s estimated that 80 percent of all the data in the world is unstructured.

There are some seven billion people on this planet. Depending on which estimate you go by, there are going to be some fifty billion connected devices in the near future, far exceeding the number of humans. Many of these devices are sending signals indicating their condition, mundane things such as oil pressure and temperatures, revs per minute, fluid strength, and so on, and then there are the sorts of signals used for telecommunications, or for location, such as GPS, and a whole multitude of other signals emitted by machines. To be useful, these signals need to be transmitted, often to a controlling device that can make sense of them all.  That controlling device may be manned, as in a human operator who looks at a screen, but it can also be unmanned, a machine or system that detects anomalies and acts accordingly.
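For the curious, here is a minimal Python sketch of what an unmanned controlling device might do with such signals: compare each reading against a safe range and flag anything outside it. The signal names and the limits are assumptions invented for this example; real limits depend entirely on the machine being monitored.

```python
# Hypothetical safe operating ranges, for illustration only.
SAFE_RANGES = {
    "oil_pressure_psi": (25.0, 65.0),
    "temperature_c": (70.0, 105.0),
    "rpm": (600.0, 6500.0),
}


def find_anomalies(reading):
    """Return the names of the signals that fall outside their safe range."""
    anomalies = []
    for signal, (low, high) in SAFE_RANGES.items():
        value = reading.get(signal)
        # Signals missing from the reading are skipped rather than flagged.
        if value is not None and not (low <= value <= high):
            anomalies.append(signal)
    return anomalies


# One made-up reading sent by a machine.
reading = {"oil_pressure_psi": 18.0, "temperature_c": 92.0, "rpm": 7100.0}
alerts = find_anomalies(reading)
print("Signals needing attention:", alerts)
```

A real system would do far more than this, such as tracking trends over time, but the basic idea of checking a stream of readings against what is expected is the same.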

You may have noticed that all this data being generated is of limited use if it’s not communicated or transmitted, so Big Data doesn’t exist independently. Rather, it depends on the evolution of other technologies at the same time, such as Cloud as a business enabler and delivery platform, and the huge increases in computing power and storage, which have been accompanied by equally dramatic decreases in the associated costs.

Remember the difference, from early in this discussion, between ‘data’ and ‘information’, and how data needs to be organized and assembled to be useful? The value of data lies in the ability to make some sense out of it. In the ‘good old’ days (which aren’t that old, and not that ‘good’ either), data could be used for reporting: how much was sold, who sold more, which units did better, how the organization was doing against its targets, who its customers were, what the best-selling items were, and so on.

The proposition was really quite simple when the data was ‘structured’ and all conformed to some sort of standard form – it was easy to assemble similar records and extract useful bits of information. But when the majority of the data is ‘unstructured’, of varying lengths and forms, including pictures and sounds, that becomes a lot trickier.  How, in fact, do you make sense of what appear to be completely random and dissimilar pieces of information?

Now having established what “Big Data” is, as well as some of its qualities – volume, randomness, speed, lack of coherent structure, and so on, what can you do with it, and how do you make sense of it?


Now that’s a story for another day.
