What is Apache Cassandra?

Apache Cassandra is a distributed, highly scalable, high-performance NoSQL database. It offers several advantages over traditional relational database management systems (RDBMS), particularly for write-intensive, high-availability workloads that span geographies and datacenters.

Apache Cassandra is open source and distributed under the Apache License 2.0. Originally developed at Facebook, Cassandra is used by many global enterprises, including Apple, Cisco, and Netflix.

Unlike a traditional RDBMS, Apache Cassandra does not rely on related tables to describe data. Instead, it uses a simplified storage architecture known as ‘wide column’. Each row is located by its key, which keeps the simplicity of key-value storage, but the columns stored can vary from row to row, which retains much of the flexibility of tabular data. Cassandra also has the concept of ‘Column Families’ (called tables in CQL), which group columns together, although rows do not all need to contain the same columns. Key-value storage is fundamental to NoSQL databases because it allows for fast indexing, writes, and retrieval.
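To make the wide-column idea concrete, here is a minimal Python sketch that models a column family as a mapping from row key to a set of named columns. It is only an illustration of the data model, not of how Cassandra stores data on disk; the `users` table and the `upsert` and `read` helpers are hypothetical names invented for this example.

```python
from typing import Any, Dict

# Hypothetical "users" column family: row key -> {column name: value}.
users: Dict[str, Dict[str, Any]] = {}


def upsert(table: Dict[str, Dict[str, Any]], key: str, **columns: Any) -> None:
    """Write only the columns supplied; other columns in the row are untouched."""
    table.setdefault(key, {}).update(columns)


def read(table: Dict[str, Dict[str, Any]], key: str, column: str, default: Any = None) -> Any:
    """Find the row by key, then the column by name; absent columns are simply missing."""
    return table.get(key, {}).get(column, default)


# Two rows in the same column family with different sets of columns.
upsert(users, "u1", name="Ada", email="ada@example.com")
upsert(users, "u2", name="Lin", last_login="2024-01-01", country="SG")

# A later write adds a column to an existing row without a schema migration.
upsert(users, "u1", country="UK")

print(read(users, "u1", "country"))     # UK
print(read(users, "u1", "last_login"))  # None (this row never stored that column)
```

Every lookup starts from the row key, which is what gives the model its key-value character, while each row is free to carry only the columns it needs.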

NoSQL databases typically forgo complex transactions and strict consistency guarantees in favor of a highly scalable model, which can be strongly or eventually consistent depending on configuration, and which is well suited to internet-scale applications.
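One way to see how the same system can be either strongly or eventually consistent is the usual quorum rule of thumb: if the number of replicas that acknowledge a write plus the number consulted on a read exceeds the replication factor, every read overlaps the latest write. The sketch below is plain arithmetic under that assumption; the function names are hypothetical and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(write_acks: int, read_replicas: int, replication_factor: int) -> bool:
    """True when every read set must overlap every write set (W + R > RF)."""
    return write_acks + read_replicas > replication_factor


RF = 3  # assumed replication factor for this example

print(quorum(RF))                                          # 2
print(reads_see_latest_write(quorum(RF), quorum(RF), RF))  # True: 2 + 2 > 3
print(reads_see_latest_write(1, 1, RF))                    # False: 1 + 1 <= 3
```

Reading and writing at quorum trades some latency for overlap between reads and writes, while single-replica reads and writes are faster but may briefly return stale data.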

Apache Cassandra is masterless, meaning that all nodes in a Cassandra cluster are active and communicate with each other as peers. Any node in the cluster can accept and serve requests, and if a node fails, traffic can be automatically redirected to another active node with no need for complex master-slave replication schemes. Cassandra automatically distributes and maintains data across the cluster with no need for complex sharding and disk partitioning.
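The automatic distribution described above is commonly explained with a token ring: each node owns a position on a ring of hash values, and a row is placed by hashing its key and walking clockwise. The following Python sketch illustrates that idea under simplifying assumptions: it uses MD5 purely as a stand-in hash and gives each node a single token, whereas a real cluster uses its own partitioner and typically many virtual nodes per machine. The `Ring` class and node names are hypothetical.

```python
import bisect
import hashlib
from typing import List


def token(value: str) -> int:
    """Map a string to a ring position (MD5 is used here only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """A toy token ring with one token per node."""

    def __init__(self, nodes: List[str]) -> None:
        self._entries = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._entries]

    def replicas(self, partition_key: str, replication_factor: int = 1) -> List[str]:
        """Start at the first node token >= the key's token, then walk clockwise."""
        start = bisect.bisect_left(self._tokens, token(partition_key))
        count = min(replication_factor, len(self._entries))
        return [self._entries[(start + i) % len(self._entries)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])

# Any node can compute the same answer, so no master is needed to route a request.
owners = ring.replicas("user:42", replication_factor=3)
print(owners)
```

Because placement is a pure function of the key and the set of node tokens, every node can work out where a row lives, and the additional replicas give a request somewhere else to go if one node is down.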

Additionally, Cassandra’s replication approach is much simpler than multi-master or master-slave architectures. Once replication settings are defined, they are managed automatically across all nodes of the cluster, without the need for additional administration.

Cassandra also exposes a SQL-like query and management interface called Cassandra Query Language (CQL), which lets developers and administrators work with the system using familiar SQL-style statements.
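As a rough illustration of what CQL looks like, the Python sketch below holds a few CQL statements as strings and passes them to anything that has an `execute` method. The keyspace `demo`, the table `users`, and the `PrintingSession` stand-in are all hypothetical; the stand-in only prints the statements, so the example runs without a cluster.

```python
from typing import List, Protocol


class SupportsExecute(Protocol):
    """Anything with an execute(statement) method, such as a driver session."""

    def execute(self, statement: str): ...


STATEMENTS: List[str] = [
    # Replication is declared once, on the keyspace.
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}",
    # Tables are defined much as they would be in SQL.
    "CREATE TABLE IF NOT EXISTS demo.users ("
    "user_id uuid PRIMARY KEY, name text, email text)",
    "INSERT INTO demo.users (user_id, name, email) "
    "VALUES (uuid(), 'Ada', 'ada@example.com')",
    "SELECT name, email FROM demo.users LIMIT 10",
]


def run_all(session: SupportsExecute) -> int:
    """Send each statement in order and return how many were sent."""
    for statement in STATEMENTS:
        session.execute(statement)
    return len(STATEMENTS)


class PrintingSession:
    """Stand-in session that records and prints statements instead of contacting a cluster."""

    def __init__(self) -> None:
        self.seen: List[str] = []

    def execute(self, statement: str) -> None:
        self.seen.append(statement)
        print(statement + ";")


session = PrintingSession()
run_all(session)
```

Against a real cluster, the same `run_all` function could presumably be given a session object from a Cassandra client driver in place of `PrintingSession`, assuming that session exposes a compatible `execute` method.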

Why use Apache Cassandra?

Due to its highly distributed and fault-tolerant nature, Apache Cassandra is well suited to globally distributed, write- and read-intensive applications that require high availability and high scalability. A few examples include social media data, IoT sensor data, user tracking, and messaging applications.

A common reason to use Apache Cassandra is to locate highly available database clusters close to end users in a globally available application. Since Cassandra nodes can run on any type of infrastructure, including private, public, and hybrid clouds, Cassandra is well suited to geographically distributed applications. Reads and writes can be served with low latency by a node close to the end user, and writes accepted by any node are replicated throughout the cluster. This is especially important in high-throughput scenarios, where locating data infrastructure close to the end user can result in a significant reduction in bandwidth costs.

Additionally, Apache Cassandra is well suited to applications that may require significant scaling up or down. Adding and removing nodes in a Cassandra cluster is simple and requires no downtime.
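One reason a cluster can grow without downtime is that, with ring-based placement, adding a node only reassigns the keys that fall into the new node's range; everything else stays where it was. The Python sketch below demonstrates that property on a toy ring. It assumes MD5 as a stand-in hash and one token per node, and the node and key names are hypothetical; a real cluster also has to stream the affected data to the new node, which this sketch does not model.

```python
import bisect
import hashlib
from typing import List


def token(value: str) -> int:
    """Map a string to a ring position (MD5 is used here only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def owner(nodes: List[str], key: str) -> str:
    """Return the node whose token is the first one at or after the key's token."""
    entries = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in entries]
    index = bisect.bisect_left(tokens, token(key)) % len(entries)
    return entries[index][1]


keys = [f"sensor:{i}" for i in range(1000)]
old_nodes = ["node-a", "node-b", "node-c", "node-d"]
new_nodes = old_nodes + ["node-e"]  # scale out by one node

before = {key: owner(old_nodes, key) for key in keys}
after = {key: owner(new_nodes, key) for key in keys}

moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} keys changed owner")

# Every key that moved went to the new node; no data shuffled between old nodes.
print(all(after[key] == "node-e" for key in moved))  # True
```

Removing a node works the same way in reverse: only the departing node's keys need a new home, so the rest of the cluster keeps serving requests.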