Introduction to Azure Cosmos DB
In today's world, applications are required to be highly responsive. They should be able to store ever-increasing volumes of data, be highly available, and always be online so that this data reaches users in milliseconds.
In this article, we will learn about Azure Cosmos DB and how its features can help us build applications with high availability and scalability.
What is Azure Cosmos DB?
Azure Cosmos DB is a NoSQL database used for modern app development.
- It is a fully managed service: it handles capacity management automatically with cost-effective serverless and automatic scaling options, and it takes care of patching and updates.
- SLA-backed availability and enterprise-grade security assure business continuity.
- App development becomes faster and more productive with Azure Cosmos DB because of its multi-region data distribution, open-source APIs, and SDKs for popular languages.
Key Benefits of Azure Cosmos DB
Guaranteed speed at any scale:
- Fast read and write latencies worldwide.
- Multi-region writes and ease of data distribution to any Azure region.
- Scale storage and throughput across any Azure region independently and elastically.
Simplified application development:
- Deeply integrated with Azure services used in cloud-native application development.
- Multiple database APIs such as SQL API, MongoDB API, Cassandra API, Gremlin API, and Table API are available.
- Build apps on SQL API using programming languages of your choice. You can also choose drivers for any of the other database APIs.
- Easily track and manage changes to database containers using Change feed.
- The schema-less service of Azure Cosmos DB automatically indexes all your data.
Mission-critical ready:
- Azure Cosmos DB is backed by an extensive suite of SLAs, including industry-leading availability worldwide.
- The automatic data replication feature helps to distribute data across any Azure region efficiently.
- Data safety is ensured by Azure role-based access control.
Fully managed and cost-effective:
- A fully managed database service that saves developers time and money.
- Cost-effective options for unpredictable workloads of any scale.
- Manage spiky workloads through automatic and responsive service offered by the serverless model.
- Unpredictable workloads are successfully managed by the autoscale provisioned throughput feature of Azure Cosmos DB. It automatically scales the capacity when the workload spikes while maintaining SLAs.
What is NoSQL?
The term 'NoSQL database' typically refers to a non-relational database. There are two interpretations of the term NoSQL. Some say it stands for 'non SQL' while others say it means 'not only SQL.' Either way, NoSQL databases store data differently than relational tables.
In comparison with SQL databases:
- Modeling related data can be easier in NoSQL because related data doesn't have to be split between tables. In exchange, many NoSQL databases favor eventual consistency over the transactional consistency of SQL databases.
- Related data can be nested within a single data structure.
- Other common features of NoSQL include data clustering, lack of a database schema, and support for replication.
In short, NoSQL databases can be described as non-relational, distributed, flexible, and scalable.
Based on the data model, the main types of NoSQL databases are document, key-value, wide-column, and graph. All of these types are scalable and provide flexible schemas.
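To make the point about nesting related data concrete, here is a minimal sketch in plain Python. The order data, field names, and the order_total helper are all hypothetical and exist only for illustration; nothing here calls Azure Cosmos DB.

```python
import json

# Relational style: the order and its lines live in separate tables
# and must be joined on order_id at query time.
orders_table = [{"order_id": 1, "customer": "John"}]
order_lines_table = [
    {"order_id": 1, "sku": "A-100", "qty": 2, "price": 5.0},
    {"order_id": 1, "sku": "B-200", "qty": 1, "price": 12.5},
]

# Document style: the related lines are nested inside the order itself.
order_document = {
    "id": "1",
    "customer": "John",
    "lines": [
        {"sku": "A-100", "qty": 2, "price": 5.0},
        {"sku": "B-200", "qty": 1, "price": 12.5},
    ],
}


def order_total(document):
    """Sum the order value straight from the nested lines, no join needed."""
    return sum(line["qty"] * line["price"] for line in document["lines"])


print(json.dumps(order_document, indent=2))
print(order_total(order_document))  # 22.5
```

The whole order can be read or written as a single unit, which is the property that document databases build on.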
Difference between Azure Cosmos DB NoSQL and Relational Databases?
The points highlighted below are the main differences between Azure Cosmos DB NoSQL and relational databases:
High Throughput
Challenge while maintaining a relational database system in case of high transactional volumes:
Most relational engines apply locks and latches to enforce strict ACID semantics, which helps in ensuring a consistent data state within the database. However,
- There are heavy trade-offs concerning availability, latency, and concurrency.
- Because of these fundamental architectural restrictions, high transactional volumes can result in the need to shard data manually. Implementing manual sharding can be an expensive, time-consuming, and painful process.
How can Azure Cosmos DB help simplify this challenge?
Azure Cosmos DB simplifies these challenges by:
- Being deployed worldwide across all Azure regions.
- Being a distributed NoSQL database, which helps if your transactional volumes are reaching levels that a relational system cannot tolerate. Azure Cosmos DB ensures high availability, ease of maintenance, maximum efficiency, and reduced total cost of ownership.
Hierarchical Data
Maintaining data containing parent-child relationships:
In several use cases, transactions in the database contain many parent-child relationships. These parent-child relationships can grow over time and can be difficult to manage. In the 1980s, forms of hierarchical databases emerged to handle these types of relationships but failed to gain traction because of storage inefficiency.
How can Azure Cosmos DB help simplify this challenge?
NoSQL document databases such as Azure Cosmos DB can help handle transactions in the database containing many parent-child relationships and deep levels of hierarchy.
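As a rough illustration of a deep parent-child hierarchy kept in one document, the sketch below stores a hypothetical category tree as nested data and walks it recursively. The tree contents and the depth and paths helpers are assumptions made for the example, not part of any Azure Cosmos DB API.

```python
# A hypothetical product category tree stored as one nested document.
category = {
    "id": "electronics",
    "children": [
        {"id": "phones", "children": [
            {"id": "smartphones", "children": []},
        ]},
        {"id": "laptops", "children": []},
    ],
}


def depth(node):
    """Return the number of levels in the hierarchy rooted at node."""
    return 1 + max((depth(child) for child in node["children"]), default=0)


def paths(node, prefix=()):
    """Yield the path from the root to every node, parents before children."""
    current = prefix + (node["id"],)
    yield "/".join(current)
    for child in node["children"]:
        yield from paths(child, current)


print(depth(category))        # 3
print(list(paths(category)))
```

Adding another level to the hierarchy only means nesting one more "children" list; no new table or foreign key is required.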
Complex Networks and Relationships
Challenges faced in relational databases for maintaining complex networks and relationships:
- Relationships between entities are not stored explicitly in a relational database; they need to be computed at runtime through joins. As a result, as relationships increase, operations can become exponentially more expensive in terms of computation.
Various forms of “Network” databases emerged around the same time as relational databases, but similar to hierarchical databases, these systems also failed to gain popularity because of:
- Storage inefficiencies
- A lack of use cases at the time.
Today, graph database engines can be thought of as a re-emergence of the network database paradigm. Because relationships are stored alongside the data, they can be traversed in constant time instead of being recomputed for every query.
How can Azure Cosmos DB be a great help?
A graph database like the Azure Cosmos DB Gremlin API can be a great help if you maintain a complex network of relationships in your database.
The Gremlin (graph) and SQL (Core) Document API layers are fully interoperable, which makes it possible to switch between the two models at the programmability level.
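The idea of storing relationships with the data can be sketched with a plain adjacency list. This is ordinary Python, not Gremlin, and the graph, the neighbors helper, and the within_hops helper are hypothetical names used only to show why following a stored edge is a direct lookup.

```python
from collections import deque

# Hypothetical social graph: each vertex stores its outgoing edges directly.
edges = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}


def neighbors(vertex):
    """Following an edge is a direct lookup, not a join over a table."""
    return edges.get(vertex, [])


def within_hops(start, max_hops):
    """Breadth-first traversal: vertices reachable in at most max_hops edges."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        vertex, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors(vertex):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    seen.discard(start)
    return sorted(seen)


print(within_hops("alice", 1))  # ['bob', 'carol']
print(within_hops("alice", 2))  # ['bob', 'carol', 'dave']
```

In a relational model, the two-hop question would typically need a self-join per hop; here each hop is one lookup per vertex.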
Fluid Schema
Challenges faced with the relational databases in case of schemas:
- In relational databases, schemas must be defined at design time.
- This has benefits for data conformity and referential integrity.
- However, it can become restrictive as the application grows, because responding to schema changes across logically separate models that share the same database or table can get complex.
To manage this, the database needs to be schema-agnostic and allow records to be self-describing.
How can Azure Cosmos DB be a great help?
- Azure Cosmos DB can help you manage data whose structures are constantly changing at a high rate.
- A NoSQL database service like Azure Cosmos DB offers a strong schema-agnostic approach, which is especially useful when transactions come from external sources where it is difficult to enforce conformity across the database.
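Here is a small sketch of what self-describing records can look like in practice. The items, the "type" field convention, and the summarize helper are assumptions chosen for the example; the point is only that records with different shapes can sit side by side and be handled by what they say about themselves.

```python
# Hypothetical items in one container; each record describes itself via "type".
items = [
    {"id": "1", "type": "post", "text": "Hello"},
    {"id": "2", "type": "post", "text": "Hi again", "tags": ["intro"]},  # field added later
    {"id": "3", "type": "rating", "stars": 4},
]


def summarize(item):
    """Handle each shape by its own description instead of a fixed table schema."""
    if item["type"] == "post":
        tags = ", ".join(item.get("tags", []))  # older posts simply lack "tags"
        return f"post: {item['text']}" + (f" [{tags}]" if tags else "")
    if item["type"] == "rating":
        return f"rating: {item['stars']}/5"
    return f"unknown type: {item['type']}"


for item in items:
    print(summarize(item))
```

Adding the "tags" property did not require migrating the older post; the reading code just treats the property as optional.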
Use-Cases of Azure Cosmos DB
The most common use cases of Azure Cosmos DB are described below:
IoT and Telematics
In IoT use cases, bursts of data are ingested from device sensors. Next, the streaming data is processed to obtain real-time insights. Lastly, the data is saved to cold storage for batch analytics. Microsoft Azure offers many rich services that can be used for IoT use cases, such as Azure Event Hubs, Azure Cosmos DB, Azure Stream Analytics, Azure Notification Hub, Azure HDInsight, and Power BI.
Bursts of data can be ingested through Azure Event Hubs. Azure Event Hubs offers high throughput data ingestion with low latency. Now, the streaming data that needs to be processed for real-time analytics can be funneled to Azure Stream Analytics. Load the data into Azure Cosmos DB for ad-hoc querying. All the data or just data changes can be used as a part of real-time analytics. You can also refine data by connecting Azure Cosmos DB data to HDInsight for Pig, Hive, or Map/Reduce jobs and load it back to Azure Cosmos DB for reporting.
Social Applications
Azure Cosmos DB is used to store and query user-generated content (UGC) like chat sessions, tweets, blog posts, ratings, and comments for web, mobile, and social media applications.
You can easily store content such as chats, comments, and posts without transformations or complex object-to-relational mapping layers. As developers iterate over the application code, data properties can be added or modified easily to match requirements, thus promoting rapid development.
Applications that integrate with third-party social networks must be able to cope with changing schemas from these networks, so the database needs to be schema-agnostic and allow records to be self-describing. NoSQL database services like Azure Cosmos DB index data automatically and offer a strong schema-agnostic approach.
Many social applications run on a global scale and can present unpredictable usage patterns. The autoscale provisioned throughput offered by Azure Cosmos DB can help scale the datastore as usage demand rises. You can also create multiple Cosmos DB accounts across various Azure regions.
Azure Cosmos DB resource model
Azure Cosmos DB has four generic resource types:
- Azure Cosmos DB Account
- Databases
- Containers
- Items
Their concrete form depends on the API used for the Cosmos DB account. The resources form a hierarchy that runs from account to database to container to item:
Azure Cosmos DB Account
In conventional database systems, multiple databases can reside on a single server, and this server is the access point to those databases. If the databases are underperforming, we have the option to add more servers. However, there is no server concept in a PaaS cloud, at least not in a physical form: storage and compute are virtualized and presented as such. Therefore, we need an access point to the databases other than a server connection. The Azure Cosmos DB account provides this access point, letting you connect to the databases through a unique DNS name. Through the account, you can also scale the throughput of your databases up or down and geo-replicate them for high availability.
Databases
Using your Azure Cosmos DB account, you can create multiple databases for any API supported by Cosmos DB.
For example, you might have a Sales database, a Marketing database, and a Payroll database for Sales, Marketing, and HR personnel, respectively. With a proper security model, none of these user groups can access one another’s data. Aside from security, you can isolate stored procedures and functions to implement different business logic. You can also scale each database separately to accommodate different levels of throughput.
There are some caveats when creating databases in Cosmos DB.
- Every API places its containers in a database of one kind or another, and only the Table API does not let you manage that database yourself. When tables for the Table API are created, a default database, ‘TablesDB’, is created, and all the tables reside in it. We can’t create a new database or rename the default one.
- Cassandra uses the name ‘keyspace’ instead of ‘Database.’
Containers
At first glance, containers seem to be just ordinary database tables, although they take different shapes depending on the API. They are partitioned horizontally based on a partition key, which is mandatory when creating the container. Sharding data across multiple partitions lets the container scale and improves its performance. You can also configure Time-To-Live (TTL) on containers to delete expired records after a set amount of time.
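The TTL behaviour can be reasoned about with a small model. The sketch below is an assumption-laden illustration, not service code: it assumes an item carries a last-modified timestamp in a "_ts" property (epoch seconds) and an optional per-item "ttl" in seconds, that a container default of None means TTL is off, and that -1 means "never expire". The is_expired helper is a hypothetical name; check the official TTL documentation before relying on these rules.

```python
def is_expired(item, container_default_ttl, now):
    """Decide whether an item has outlived its time to live (assumed rules).

    item: dict with "_ts" (last-modified time, epoch seconds) and optional "ttl".
    container_default_ttl: None (TTL off), -1 (on, no default expiry), or seconds.
    now: current time in epoch seconds.
    """
    if container_default_ttl is None:
        return False  # TTL disabled on the container; item-level ttl is ignored
    ttl = item.get("ttl", container_default_ttl)  # item value overrides the default
    if ttl == -1:
        return False  # -1 means "never expire"
    return now - item["_ts"] >= ttl


session = {"id": "s1", "_ts": 1000, "ttl": 60}
print(is_expired(session, container_default_ttl=-1, now=1100))    # True
print(is_expired(session, container_default_ttl=None, now=1100))  # False
```

Because the clock restarts from the last modification time, updating an item effectively extends its life under this model.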
An Azure Cosmos DB container is capable of taking multiple shapes depending upon the API:
- SQL API: When opting for SQL API, Cosmos DB provides us with a Container where we can store documents.
- Cassandra API: Cassandra uses Tables to store rows of data.
- MongoDB API: MongoDB API uses a collection to store BSON documents. MongoDB commands to query records can be run in the shell that Azure Cosmos DB provides in the Azure Portal UI.
- Gremlin API: Where the other APIs store data in a table-like container of one kind or another, the Gremlin API uses Graphs to store vertices and edges.
- Table API: As the name implies, Table API uses Tables to store key-value data. It originated in Azure Table Storage and was brought into Cosmos DB in 2017.
Item
The type of item stored depends on the type of container:
API | SQL | Cassandra | MongoDB | Gremlin | Table |
Container | Container | Table | Collection | Graph | Table |
Item | JSON Document | Row | BSON Document | Vertex or Edge | Item |
Items differ from one API to another, but they are all stored in the same way using the Atom-Record-Sequence (ARS) model. This shared storage model becomes a big advantage when items can be queried across APIs.
Global Distribution
Today's applications are expected to be highly responsive and constantly online. Instances of these applications need to be deployed in data centers close to their users to achieve low latency and high availability. These applications are typically deployed in multiple data centers and are called globally distributed. A globally distributed application requires a globally distributed database that can transparently replicate the data to any region of the world. This allows the application to run on a copy of the data close to its users.
Azure Cosmos DB is a globally distributed database service that enables you to read and write data from the local replicas of your database. Data is transparently replicated to each region associated with your Azure Cosmos DB account. Azure Cosmos DB is highly available, has low latency, and provides elastic scalability of throughput and well-defined semantics for data consistency.
You can set up your databases to be globally distributed and available in any region where Azure operates. To lower the latency, store the data close to where your user base is. Azure Cosmos DB provides a single system image of your globally distributed Azure Cosmos database and containers, which can be used for reading and writing locally from your application.
You can add or remove the regions associated with your account at any time without the need to pause or redeploy the application.
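To see why serving data from a nearby replica matters, consider this toy model of region selection. The latency figures, region list, and pick_region helper are hypothetical; Azure Cosmos DB handles routing for you, so treat this only as a sketch of the idea.

```python
# Hypothetical round-trip latencies (ms) from one user location to each region.
latency_ms = {"West Europe": 18, "East US": 95, "Southeast Asia": 210}


def pick_region(latencies, unavailable=frozenset()):
    """Return the lowest-latency region that is currently available."""
    candidates = {r: ms for r, ms in latencies.items() if r not in unavailable}
    if not candidates:
        raise RuntimeError("no region available")
    return min(candidates, key=candidates.get)


print(pick_region(latency_ms))                               # West Europe
print(pick_region(latency_ms, unavailable={"West Europe"}))  # East US
```

The second call shows the same rule covering an outage: when the closest region is unavailable, requests fall through to the next closest one.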
Key benefits of global distribution
Build global active-active apps: Each region supports both writes and reads because of the multi-region writes replication protocol. The multi-region writes capability also allows:
- Unlimited elastic write and read scalability.
- 99.999% read and write availability all over the world.
- Guaranteed reads and writes served in less than 10 milliseconds at the 99th percentile.
Adding or removing regions in your Azure Cosmos DB account doesn’t affect your running application. It doesn’t have to be redeployed or paused, and it remains highly available throughout.
- Build highly responsive apps: The application can execute near-real-time reads and writes against all the regions you chose for your database.
- Build highly available apps: Running a database in multiple regions worldwide increases its availability. If one region is unavailable, other regions automatically handle application requests. Azure Cosmos DB offers 99.999% read and write availability for multi-region databases.
- Maintain business continuity during regional outages: Azure Cosmos DB supports automatic failover when a regional outage occurs and continues to maintain its latency, availability, consistency, and throughput SLAs during the outage.
- Scale read and write throughput globally: You can make every region writable and elastically scale reads and writes all over the world. The throughput your application configures on an Azure Cosmos database or container is provisioned across all regions associated with your Azure Cosmos account. Financially backed SLAs guarantee the provisioned throughput.
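It can help to translate an availability percentage into the downtime it allows. The short calculation below does this for the 99.999% figure mentioned above; the 99.99% value is included only as a comparison point chosen for this example, and the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def max_downtime_minutes_per_year(availability_percent):
    """Convert an availability percentage into the yearly downtime it permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.99, 99.999):
    print(f"{pct}% -> {max_downtime_minutes_per_year(pct):.2f} minutes/year")
```

Five nines works out to roughly five and a quarter minutes of permitted downtime per year, about a tenth of what four nines allows.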
Partitioning and Horizontal Scaling
Azure Cosmos DB uses partitioning to scale individual containers in a database. The items in a container are grouped into distinct subsets, known as logical partitions. Logical partitions are formed based on the value of a partition key. Each item in a container is associated with a partition key, and all the items in a logical partition have the same partition key value.
Logical Partitioning
A logical partition consists of a collection of items that hold the same partition key.
There is no limit to the number of logical partitions in your container. Each logical partition can store up to 20GB of data. To guarantee that the container can scale, select a partition key with a wide range of possible values.
Managing Logical Partitions
Azure Cosmos DB automatically and transparently manages the placement of logical partitions on physical partitions to meet the scalability and performance needs of the container. As the throughput and storage requirements of an application increase, Azure Cosmos DB moves logical partitions to spread the load across a larger number of physical partitions. Hash-based partitioning is used to spread logical partitions across physical partitions: Azure Cosmos DB hashes the partition key value of an item, and the hashed result determines the physical partition. Azure Cosmos DB then allocates the key-space of partition key hashes evenly across the physical partitions.
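The hash-based scheme described above can be sketched in a few lines. This is only an illustration of the general technique: the hash function Azure Cosmos DB actually uses is internal to the service, so MD5, the 32-bit key-space, and the key_hash and physical_partition helpers here are all assumptions.

```python
import hashlib

HASH_SPACE = 2 ** 32  # size of the illustrative hash key-space


def key_hash(partition_key_value):
    """Hash a partition key value into the range [0, HASH_SPACE)."""
    digest = hashlib.md5(str(partition_key_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def physical_partition(partition_key_value, partition_count):
    """Map the hash to one of partition_count equal slices of the key-space."""
    return key_hash(partition_key_value) * partition_count // HASH_SPACE


for user in ("John", "Asha", "Wei"):
    print(user, "->", physical_partition(user, partition_count=4))
```

Two properties carry over to the real system: items with the same partition key value always land together, and splitting the key-space into more slices spreads the same keys over more physical partitions.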
Choosing a partition key
The two components of a partition key are the partition key path and the partition key value. For example, consider the item { "userId" : "John", "worksFor": "XenonStack" }. If you choose "userId" as the partition key, the two partition key components are:
- The partition key path (for example, "/userId"). The partition key path accepts alphanumeric and underscore (_) characters. Nested objects can also be used with standard path notation (/).
- The partition key value (for example, "John"). The partition key value can be of string or numeric type.
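The relationship between path and value can be shown with a short helper that resolves a path against an item. The partition_key_value function and the nested sample item are hypothetical and written for this example; the first item is the one used above.

```python
def partition_key_value(item, path):
    """Resolve a partition key path such as "/userId" or "/address/city"."""
    value = item
    for part in path.strip("/").split("/"):
        value = value[part]  # raises KeyError if the item lacks the property
    return value


item = {"userId": "John", "worksFor": "XenonStack"}
print(partition_key_value(item, "/userId"))  # John

nested = {"id": "1", "address": {"city": "Pune"}}  # hypothetical nested item
print(partition_key_value(nested, "/address/city"))  # Pune
```

The path is fixed per container, while the value is read from each item, which is why every item needs the property that the path points to.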
Selecting your partition key is a simple but important design choice in Azure Cosmos DB. Once you select your partition key, you can't change it in place. If you need to change your partition key, you should move your data to a new container with the new partition key.
For all containers, your partition key should:
- Be a property whose value does not change. If a property is your partition key, you can't update that property's value.
- Have high cardinality. In other words, the property should have a wide range of possible values.
- Spread request unit (RU) consumption and data storage evenly across all logical partitions. This ensures even distribution of RU consumption and storage across the physical partitions.
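One way to compare candidate partition keys against these criteria is to measure them on a sample of your data. The sketch below counts items per logical partition as a rough proxy for storage spread (it says nothing about RU consumption). The sample orders and the partition_stats helper are hypothetical.

```python
from collections import Counter


def partition_stats(items, key):
    """Count items per logical partition and report how skewed the spread is."""
    counts = Counter(item[key] for item in items)
    largest_share = max(counts.values()) / len(items)
    return {"cardinality": len(counts), "largest_share": largest_share}


# Hypothetical sample: 6 orders from 5 users in 2 countries.
orders = [
    {"userId": "u1", "country": "IN"}, {"userId": "u2", "country": "IN"},
    {"userId": "u3", "country": "IN"}, {"userId": "u4", "country": "IN"},
    {"userId": "u5", "country": "US"}, {"userId": "u1", "country": "IN"},
]
print(partition_stats(orders, "userId"))   # cardinality 5, largest share 2/6
print(partition_stats(orders, "country"))  # cardinality 2, largest share 5/6
```

In this sample, "country" would put five of six items into one logical partition, while "userId" has more distinct values and a much flatter spread, making it the better candidate.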
Auto Scaling
There are two ways to configure throughput on your databases and containers in Azure Cosmos DB: standard (manual) provisioned throughput or autoscale provisioned throughput.
With autoscale, throughput is scaled based on usage, without impacting the availability, latency, throughput, or performance of the workload.
Autoscale provisioned throughput is well suited to mission-critical workloads that have unpredictable traffic patterns and require SLAs on high performance and scale.
Benefits of Autoscale
- Simple: Autoscale removes the complexity of managing RU/s with custom scripting or manually scaling capacity.
- Scalable: Without any disruption to applications, client connections, or impact to Azure Cosmos DB SLAs, databases and containers can automatically scale the provisioned throughput as needed.
- Cost-effective: It is a pay-per-use service; you pay only for the resources your workloads need, on a per-hour basis. Autoscale scales down when not in use, which helps optimize both cost and RU/s usage.
- Highly available: The Azure Cosmos DB backend ensures data durability and high availability.
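The per-hour billing idea can be sketched as a small calculation. This is a simplified model under two stated assumptions that do not come from this article: that autoscale keeps throughput between 10% of the configured maximum and the maximum itself, and that each hour is billed at the highest RU/s reached in that hour. The billed_rus_for_hour helper is a hypothetical name; verify the rules against the official autoscale documentation before using them for cost estimates.

```python
def billed_rus_for_hour(max_rus, observed_rus):
    """RU/s billed for one hour under the assumed autoscale rule.

    max_rus: the configured autoscale maximum (Tmax).
    observed_rus: RU/s samples the workload needed during the hour.
    """
    floor = 0.1 * max_rus                  # assumed lower bound: 10% of Tmax
    peak = max(observed_rus, default=0)
    return min(max_rus, max(floor, peak))  # never below floor, never above Tmax


print(billed_rus_for_hour(4000, [150, 300, 250]))    # 400.0, quiet hour: floor applies
print(billed_rus_for_hour(4000, [900, 3200, 1200]))  # 3200, billed at the hourly peak
print(billed_rus_for_hour(4000, [5000]))             # 4000, capped at Tmax
```

Under this model, a workload that is quiet most hours and spikes occasionally pays close to the floor most of the time, which is where the cost benefit for unpredictable traffic comes from.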
Multi-model APIs
The ARS model is the foundation behind all the supported APIs. Under the hood, Cosmos DB uses a database engine named Atom-Record-Sequence (ARS), which is responsible for data persistence. Although Cosmos DB stores data in the same way for all APIs, you cannot change the API once the database is created, except between the SQL and Graph APIs.
SQL API
SQL API offers extensive programming support through a JavaScript programming model, including stored procedures, triggers, and user-defined functions (UDFs).
Key Benefits of SQL API
- You can query JSON data through SQL. It's easy to understand and doesn't require extra learning efforts to get things working.
- SQL API supports server-side programming, which will be beneficial if you are starting a project from scratch.
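To give a feel for querying JSON with SQL, here is a parameterized query in the SQL API dialect, where "c" is an alias for the container. The sample items and the run_locally function are hypothetical; run_locally is a local stand-in that mimics only this one equality filter so the example can run without an account, whereas a real application would hand the query text and parameters to an Azure Cosmos DB SDK.

```python
# A parameterized query in the SQL API dialect; "c" is an alias for the container.
query_spec = {
    "query": "SELECT * FROM c WHERE c.userId = @userId",
    "parameters": [{"name": "@userId", "value": "John"}],
}

# Hypothetical items, standing in for JSON documents stored in a container.
items = [
    {"id": "1", "userId": "John", "worksFor": "XenonStack"},
    {"id": "2", "userId": "Asha", "worksFor": "Contoso"},
]


def run_locally(spec, documents):
    """Local stand-in for this one query shape: equality filter on userId."""
    params = {p["name"]: p["value"] for p in spec["parameters"]}
    return [doc for doc in documents if doc["userId"] == params["@userId"]]


print(run_locally(query_spec, items))
```

Keeping the value in a parameter rather than concatenating it into the query text is the same habit you would follow with any SQL database.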
MONGODB API
MongoDB is a Document-based distributed database. It stores JSON data as documents and provides rich query features over this JSON data.
MongoDB API and DocumentDB both use documents as objects.
Key Benefits
In addition to creating new databases with the MongoDB API, you can easily migrate existing MongoDB applications over to Cosmos DB with minimal changes. In most cases, you only need to change the connection string from MongoDB to Cosmos DB.
CASSANDRA API
Cassandra is a distributed, open-source NoSQL database. It is a wide-column store that resembles a relational database but differs in that column names and types can vary between rows of the same table. Cassandra uses CQL (Cassandra Query Language) for storing and manipulating data.
Key Benefits:
Like MongoDB API, Cassandra API can be used to create a new database or migrate an existing one with minimal changes.
GREMLIN API
Gremlin API is used to create and query graph databases. SQL can also be used to query the graph data. An advantage of the Gremlin API over other graph databases is that it is part of Cosmos DB and inherits all its capabilities, including integration with Azure services.
Key Benefits:
Azure Cosmos DB automates the management of database and machine resources. Many existing graph databases are limited by their infrastructure, and maintaining them can be costly.
TABLE API
Table API lets you store and manipulate Table Storage data in Cosmos DB. Azure Table Storage has been used to store structured NoSQL data as key-value pairs. It is now also offered through Cosmos DB, where it inherits the premium capabilities of Cosmos DB.
Key Benefits:
- Cost-effective storage
- Easy migration to Cosmos DB for applications that already use Table Storage.