NoSQL vs SQL - Who Is Who?
August 14, 2014
NoSQL has become a very fashionable buzzword: everyone has heard of it and everyone knows it’s the new big thing, yet very few know what it really is. To most, it’s a magical new technology that is just like SQL except friendly to parallel processing and thus scalable to very large datasets, something to do with the cloud, Hadoop and MapReduce. That’s partially true, but a clearer description is called for.
In fact, NoSQL isn’t one specific technology or paradigm. Instead, it’s a loosely grouped collection of data storage and retrieval technologies that extend or altogether replace the traditional relational database paradigm in one form or another, but always in a way that makes them horizontally scalable, versatile and suitable for large and fast data flows. Sometimes the paradigm is relational, sometimes not; sometimes there is SQL, sometimes there is no room for it. NoSQL is a very heterogeneous set of technologies, and a more accurate reading of “NoSQL” is “Not Only SQL”.
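To make "horizontally scalable" concrete, here is a minimal sketch of hash partitioning (sharding), one common way to spread keys across several nodes. It assumes four in-process dictionaries standing in for nodes; the names `shard_for`, `put` and `get` are hypothetical and not taken from any particular NoSQL product.

```python
import hashlib

N_SHARDS = 4
# Each dict stands in for one node of a cluster (an assumption for illustration).
shards = [dict() for _ in range(N_SHARDS)]


def shard_for(key, n_shards=N_SHARDS):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def put(key, value):
    """Store the value on the single shard that owns the key."""
    shards[shard_for(key)][key] = value


def get(key):
    """Read from the owning shard only; no other shard is consulted."""
    return shards[shard_for(key)].get(key)


for user_id in range(1000):
    put(f"user:{user_id}", {"id": user_id})

# Number of keys that landed on each shard.
sizes = [len(s) for s in shards]
```

Because every key maps to exactly one shard, reads and writes for different keys can be served by different machines in parallel. Real systems typically refine this idea (for example with consistent hashing) so that adding a node does not reshuffle most keys.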
Conventional SQL systems, that is, relational database management systems (RDBMSs), range from low-end solutions such as SQLite and MS Access to high-end “industrial grade” solutions like SQL Server and MySQL. These may differ in some syntactic details and in data processing and query optimization power, but underneath they all share the same well-entrenched standards: all store data in tables, all adhere to relations and relational algebra, and all provide the same structured query language (SQL) as the programmer’s interface for retrieving and manipulating data. Transactions in an RDBMS are designed to be ACID compliant: atomic, consistent, isolated and durable. In terms of the CAP theorem, a traditional RDBMS guarantees C (consistency) and A (availability) at the expense of P (partition tolerance).
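As an illustration of the "atomic" part of ACID, the following sketch uses Python's built-in `sqlite3` module with an in-memory SQLite database. The `accounts` table, the `transfer` helper and the balances are made up for the example.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single transaction."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if any statement raises an exception.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

transfer(conn, "alice", "bob", 30)  # both updates are committed together

try:
    # The credit to bob succeeds, but the debit would make alice's balance
    # negative, so the CHECK constraint fails and the credit is rolled back.
    transfer(conn, "alice", "bob", 500)
except sqlite3.IntegrityError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer the balances are the same as after the first one: either both updates take effect or neither does.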
On the other hand, the wide mix of systems collectively referred to as NoSQL contains a diversity of storage and retrieval paradigms, all designed for different types of business problems. You can have NoSQL systems that follow a relational paradigm (for example MySQL Cluster, ScaleDB or VoltDB), but there are also systems that abandon the relational paradigm in favor of simple key-value pairs (Memcached/Membase, Voldemort), document stores (Amazon’s SimpleDB, CouchDB, MongoDB), extensible records (BigTable, HBase, Accumulo, Cassandra), graph relations (OrientDB) or some kind of object-oriented large-scale data store. Some NoSQL technologies allow indexes, views and joins; others do not. In some cases SQL is available, but more often data is retrieved and processed by other means: directly through application code, via a MapReduce job on Hadoop, or via a higher-level “query” language such as Pig Latin that sits on top of Hadoop.
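The difference between these paradigms is easier to see side by side. The sketch below models the same made-up user records as key-value pairs, as documents and as extensible records, using plain Python dictionaries as stand-ins for the stores. None of the function names correspond to a real product's API.

```python
import json

# 1. Key-value: the value is an opaque blob; the only access path is the key.
kv_store = {"user:42": json.dumps({"name": "Ann", "city": "Oslo"})}


def kv_get(key):
    """Fetch by key; the application, not the store, decodes the value."""
    return json.loads(kv_store[key])


# 2. Document: the store understands the structure, so it can filter on
#    fields, and documents need not share the same set of fields.
doc_store = [
    {"_id": 42, "name": "Ann", "city": "Oslo", "tags": ["admin"]},
    {"_id": 43, "name": "Bob", "city": "Rome"},
]


def doc_find(**criteria):
    """Return the documents whose fields match all the given criteria."""
    return [d for d in doc_store if all(d.get(k) == v for k, v in criteria.items())]


# 3. Extensible record: row key -> column family -> columns; each row can
#    carry a different set of columns within a family.
wide_store = {
    "42": {"profile": {"name": "Ann", "city": "Oslo"}, "logins": {"2014-08-01": "web"}},
    "43": {"profile": {"name": "Bob"}},
}


def wide_get(row_key, family, column):
    """Fetch one cell, or None if the row, family or column is absent."""
    return wide_store.get(row_key, {}).get(family, {}).get(column)
```

For example, `doc_find(city="Rome")` can answer a question by field value, whereas the key-value store can only answer `kv_get("user:42")`; finding everyone in Rome there would mean scanning and decoding every value in application code.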
Different NoSQL systems also offer different tradeoffs between the C, A and P of the CAP theorem. In contrast to traditional relational databases, the P part is fixed: after all, what we are trying to achieve is horizontal scalability by means of partitioning and parallelization, so P is a must. Instead, the tradeoffs are made between availability (A) and consistency (C). As an example, consider a partitioned system that needs to be always available but can be relaxed to be only “eventually consistent”: if someone updates their Facebook status, the new status will be seen straightaway by some users and only a while later by the rest, but eventually by everyone. The ACID principles are also not always guaranteed in NoSQL; which of them is relaxed, and to what extent, depends on the particular NoSQL technology.
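The status-update example can be simulated in a few lines. This is a toy model under stated assumptions, not how any particular system is implemented: three in-process replicas, a write that is acknowledged after reaching one replica, a queue of pending updates, and last-writer-wins by version number. The class names `Replica` and `EventuallyConsistentStore` are hypothetical.

```python
class Replica:
    """One copy of the data, holding key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def apply(self, key, version, value):
        # Last-writer-wins: only accept an update newer than what we hold.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


class EventuallyConsistentStore:
    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # updates not yet delivered to every replica
        self.clock = 0     # simple version counter

    def write(self, key, value):
        """Acknowledge after one replica has the update; queue the rest."""
        self.clock += 1
        self.replicas[0].apply(key, self.clock, value)
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, self.clock, value))

    def read(self, key, replica):
        """Read from whichever replica the client happens to reach."""
        return self.replicas[replica].read(key)

    def propagate(self):
        """Deliver all queued updates, as background replication would."""
        for i, key, version, value in self.pending:
            self.replicas[i].apply(key, version, value)
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("status", "at the beach")

fresh = store.read("status", replica=0)  # this reader sees the new status
stale = store.read("status", replica=2)  # this reader does not see it yet

store.propagate()
converged = [store.read("status", replica=i) for i in range(3)]
```

Before `propagate()` runs, two readers can get different answers for the same key; afterwards all replicas agree. The store stayed available for reads and writes throughout, which is the availability gained by giving up immediate consistency.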
The diversity of the NoSQL world is explained by the fact that it’s simply a very young set of technologies that has sprung up and grown quickly in many directions in response to a diverse range of problems posed by Web 2.0. There simply hasn’t been enough trial time for these technologies to converge on one (or several) clear-cut de facto standards. It may be a Wild West compared to the standardized and seamless world of RDBMSs and SQL, but let’s not forget that even though RDBMSs have been the de facto standard for a few decades now, it wasn’t always the case. There was a time when the RDBMS was just one of several paradigms that competed with others (such as hierarchical and network databases), and only with the passage of time did it become the undisputed champion.
Some of the more prominent NoSQL systems are: MySQL Cluster, ScaleDB, VoltDB, Memcached/MemcacheDB/Couchbase/Membase, Voldemort, SimpleDB, MongoDB, Dynamo, Accumulo, BigTable, HBase, Cassandra and OrientDB. Here is a good overview of the NoSQL world from Rick Cattell, a recognized authority in the field: http://cattell.net/datastores/Datastores.pdf
The paper also has an interesting take on the history of NoSQL:
"In fact, BigTable, memcached, and Amazon’s Dynamo provided a “proof of concept” that inspired many of the data stores we describe here:
Memcached demonstrated that in-memory indexes can be highly scalable, distributing and replicating objects over multiple nodes.
Dynamo pioneered the idea of eventual consistency as a way to achieve higher availability and scalability: data fetched are not guaranteed to be up-to-date, but updates are guaranteed to be propagated to all nodes eventually.
BigTable demonstrated that persistent record storage could be scaled to thousands of nodes, a feat that most of the other systems aspire to."