Written by Maurice Wolter, Lead Data Engineer
No means NoSQL
Meetrics Data Blog
The online advertising industry is growing rapidly, and with it the amount of data that needs to be handled. It is clear that it is not enough to just take some computers and an old-fashioned database to overcome this challenge.
Speaking of which, probably everybody has heard the word database and has a more or less precise idea of it. When we say database, we usually refer to the so-called relational database, which derives from a model introduced back in 1970 and which, put in very simple terms, describes an approach to structuring pieces of data based on their relationships to one another. Data is stored in a normalized form to reduce redundancy and ensure data integrity, and it can be queried using the famous Structured Query Language (SQL).
Relational databases are very good at the tasks they were made for, and they play a very important role in all kinds of data processing systems. SQL (often pronounced "sequel") is a powerful “tool” to manage and process structured data. Nonetheless, relational databases have weaknesses: they are not easily scalable and, by design, do not perform well when it comes to unstructured data.
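To make the idea of normalized data and SQL a bit more tangible, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns and rows (advertiser, campaign, 'Acme') are invented for illustration and are not taken from any real Meetrics schema.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Normalized layout (hypothetical schema): the advertiser name is stored
# once, and each campaign only references it through advertiser_id.
conn.executescript("""
CREATE TABLE advertiser (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE campaign (
    id            INTEGER PRIMARY KEY,
    advertiser_id INTEGER NOT NULL REFERENCES advertiser(id),
    title         TEXT NOT NULL
);
INSERT INTO advertiser VALUES (1, 'Acme');
INSERT INTO campaign VALUES (10, 1, 'Spring Sale');
INSERT INTO campaign VALUES (11, 1, 'Summer Sale');
""")

# A JOIN follows the relationship between the two tables at query time.
rows = conn.execute("""
    SELECT a.name, c.title
    FROM campaign AS c
    JOIN advertiser AS a ON a.id = c.advertiser_id
    ORDER BY c.id
""").fetchall()

for name, title in rows:
    print(name, title)
```

Because the advertiser name lives in exactly one row, renaming the advertiser means changing one value, which is the kind of redundancy reduction that normalization aims for.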
NoSQL stands for “non SQL”, meaning “non relational”.
With the dawn of the information age, the need for new concepts and technologies became obvious, and NoSQL was born. There are, as always, advantages and disadvantages to every approach, but clearly on the pro side are:
- “Simple” design: Commonly used data structures are simple key-value mappings or big tables that hold the data in a more-or-less structured way
- Horizontal scalability: Just add more hosts to your cluster and you are ready to go
Also, as disk space is getting cheaper, data redundancy is no longer a big issue. Data is even replicated deliberately, not only for reliability but also to increase efficiency, as it can be stored physically close to the place where it is processed.
There is no real point against NoSQL technologies when it comes to really large amounts of data, but there are some drawbacks, e.g. limited transaction handling and, last but not least, the lack of a standardized query language such as SQL.
Nowadays there are countless technologies and products in this area, which makes it impossible to mention them all. Some of the most important ones, which we also use at Meetrics, are Hadoop (MapReduce), Spark, Kafka and Hive; we may cover them in a future blog post.
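The points above (key-to-value mappings, spreading data over several hosts, and keeping replicated copies) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular NoSQL product works: the class name TinyCluster, the host names and the replicas parameter are all made up, and each "host" is just a dictionary in memory.

```python
import hashlib


class TinyCluster:
    """Toy key-value store spread over several in-memory 'hosts' (hypothetical)."""

    def __init__(self, hosts, replicas=2):
        # Each host is modelled as a plain dict mapping key -> value.
        self.hosts = {name: {} for name in hosts}
        self.replicas = replicas

    def _owners(self, key):
        # Hash the key to pick a starting host, then take the next hosts
        # in sorted order as additional replicas.
        names = sorted(self.hosts)
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        start = int.from_bytes(digest[:8], "big") % len(names)
        count = min(self.replicas, len(names))
        return [names[(start + i) % len(names)] for i in range(count)]

    def put(self, key, value):
        # Write the value to every replica that owns the key.
        for host in self._owners(key):
            self.hosts[host][key] = value

    def get(self, key):
        # Read from the first replica that has the key.
        for host in self._owners(key):
            if key in self.hosts[host]:
                return self.hosts[host][key]
        return None


cluster = TinyCluster(["host-a", "host-b", "host-c"])
cluster.put("impression:42", {"viewable": True})
value = cluster.get("impression:42")

# Count how many hosts hold a copy of the key.
copies = sum("impression:42" in store for store in cluster.hosts.values())
print(value, copies)
```

One simplification worth noting: with plain modulo hashing, adding a host changes where most keys belong, so real systems typically rely on schemes such as consistent hashing to keep "just add more hosts" cheap.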