When we talk about Big Data, we are talking about an evolution of databases on a much larger scale, or rather about a collection of data so extensive in volume, velocity and variety that extracting value from it requires specific technologies and analytical methods.
This is the definition a superficial web search turns up. In practice, we could be dealing with Big Data even if only one of these characteristics, say variety or velocity, were present.
The Data Scientist, i.e. someone with a background in Data Science, analyses and interprets data and, above all, must know how to retrieve and ‘clean’ or ‘prepare’ them for analysis.
To move in this area it is, in my opinion, essential to have a classical training in databases, so let us review some basic concepts.
We have said that Big Data are databases that have grown a bit too big, or that at least exhibit one of the three Vs seen above (volume, velocity, variety). But what is a database for? It puts information somewhere so that it is quick and easy to retrieve later.
The reason for making it retrievable in the future is simple: information, together with people, money, materials and real estate, constitutes the resource base of any organisation, where organisation is understood in the broadest possible sense.
In order to manage information, we need something that contains and manages it, enabling us to access and manipulate it. This something is the well-known information system, whose tasks are precisely collecting, storing, processing, transforming and distributing information. People often use ‘computer system’ and ‘information system’ as synonyms. The reason is simple: today’s information systems are almost all computerised, that is, realised through a computer system, but the definition we have given is independent of automation. Do you remember the classification and indexing systems used for books in libraries in the old days? Those are information systems, not computer systems.
So far we have talked about information, but what is information, really? And what is data? Let’s forget about bits for a moment and rely on dictionaries, or at least Wikipedia.
First of all, information has to be useful and comprehensible: it is something that produces a change in the cognitive heritage, i.e. the knowledge, of a subject called the information receiver.
Suppose I e-mail a friend who is in China on a study stay to ask what time it is there, and he answers in Chinese characters. If I do not know Chinese, my cognitive heritage does not change and the reply is of no use to me. The same happens if I open a JPEG file of a holiday photo with a hexadecimal editor: it certainly does not let me see the image of that pleasant holiday. I only see data, which need an interpretative context.
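The difference between raw data and interpreted data can be sketched in a few lines of Python. This is only an illustration: the reply text is a hypothetical example, and UTF-8 is assumed as the encoding. The hexadecimal view plays the role of the hex editor, while decoding supplies the interpretative context.

```python
# Hypothetical reply from the friend ("three in the afternoon" in Chinese).
reply = "下午三点"

# What is actually stored or transmitted is a sequence of bytes: data.
# Assumption: the message is encoded as UTF-8.
raw = reply.encode("utf-8")

# Roughly what a hexadecimal editor would show: data without context.
hex_view = raw.hex(" ")
print(hex_view)

# Knowing the encoding (the interpretative context) recovers the text.
decoded = raw.decode("utf-8")
print(decoded)
```

Even after decoding, the text becomes information only for a receiver who can read Chinese: the context needed may sit at more than one level.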
Databases are full of data; their task is to contain and manage them. Think of a column that contains 10, 20, 34, 38: if I do not know that the column is called Temperature, I lack the interpretative context needed to turn the data into information, that is, into something useful.
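As a minimal sketch of this idea, the snippet below uses Python's built-in sqlite3 module with an in-memory database. The table name readings is an assumption made for the example; only the column name Temperature and the four values come from the text.

```python
import sqlite3

# In-memory database, used only for illustration.
conn = sqlite3.connect(":memory:")

# "readings" is a hypothetical table name; Temperature is the column from the text.
conn.execute("CREATE TABLE readings (Temperature INTEGER)")
conn.executemany(
    "INSERT INTO readings (Temperature) VALUES (?)",
    [(10,), (20,), (34,), (38,)],
)

cursor = conn.execute("SELECT Temperature FROM readings ORDER BY rowid")

# The column name is part of the metadata the DBMS keeps alongside the data.
column_name = cursor.description[0][0]

# The bare values: data with no interpretative context.
values = [row[0] for row in cursor.fetchall()]
print(values)

# The same values with their column name: a first step towards information.
print(column_name, values)

conn.close()
```

The column name alone is still a thin context (it says nothing about units or where the readings were taken), but it already changes how the numbers can be read.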
Incidentally, a database is a collection of interrelated data, and to manage it we need a Database Management System, commonly known by the acronym DBMS, or RDBMS when we want to emphasise that the database is relational.
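To give a feel for what ‘interrelated’ means in a relational DBMS, here is a small sketch, again with Python's sqlite3 module. The tables city and reading, their columns and the sample rows are all invented for the example; the point is only that one table refers to the other and that the DBMS can combine them in a query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce the relation

# Two hypothetical tables: each reading refers to a city through city_id.
conn.execute("CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE reading (
           id INTEGER PRIMARY KEY,
           city_id INTEGER NOT NULL REFERENCES city(id),
           temperature INTEGER NOT NULL
       )"""
)

conn.executemany("INSERT INTO city (id, name) VALUES (?, ?)",
                 [(1, "Rome"), (2, "Oslo")])
conn.executemany("INSERT INTO reading (city_id, temperature) VALUES (?, ?)",
                 [(1, 34), (1, 38), (2, 10), (2, 20)])

# The join follows the relation to find the highest reading per city.
rows = conn.execute(
    """SELECT city.name, MAX(reading.temperature)
       FROM reading
       JOIN city ON city.id = reading.city_id
       GROUP BY city.name
       ORDER BY city.name"""
).fetchall()
print(rows)

conn.close()
```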
When we move into the world of Big Data, databases are often not relational for the reasons we shall see.