There’s a new kid on the block in the NoSQL world – Azure DocumentDB. Released in preview back in August 2014 and going Generally Available this month, Azure DocumentDB is Microsoft’s initial foray into the increasingly competitive space of non-relational database management systems.

Of course there is no better competitor in this space to measure up against than MongoDB. With a valuation of 1.6 billion dollars and close to 50% of NoSQL database market share, MongoDB’s flagship database is the Goliath in a market of NoSQL Davids.

Nevertheless, Microsoft knew this when they rolled out Azure DocumentDB, so what did they come up with? How well does DocumentDB stack up against MongoDB? Are they even close? This post seeks to break down the differences between the two platforms and hopefully give us an answer to these questions.

Commonalities

Before we begin dissecting the differences between MongoDB and Azure DocumentDB, let us briefly review some of the common ground these two database platforms share. With respect to the general taxonomy of NoSQL databases (Key-Value, Columnar, Graph, Document, and Multi-Model), both MongoDB and DocumentDB fall squarely in the realm of the Document classification. Broadly speaking, both databases:

  • are partition tolerant by design (which implies “built for horizontal scalability”)
  • eschew traditional relational schema design for de-normalized arrangement of collections
  • emphasize a human-readable data format that mimics object-oriented programming entities

There are other characteristics that all Document databases share in common, but these tend to be the main three that we can safely assume.

Beyond the basic features common to all Document databases, there are a few more worth noting in this particular instance. Firstly, both databases have support for client SDKs in multiple programming languages. In addition to the .NET Framework, DocumentDB currently has support for Node.js, JavaScript, Java, and Python (with community projects already underway for PHP and Go). MongoDB has an SDK (driver) for all the aforementioned languages and approximately thirty other languages as well. The point here is that both products have made some attempt to keep a relatively low barrier to entry for developers (see official list below).

Figure 1: Officially supported languages

Secondly, both database platforms use a syntactically similar data format – JSON (technically BSON for MongoDB, but we’ll get to that). It’s easy to take this for granted, but it’s worth stating up front that the data snippet examples we’ll see will all look like JSON (and not, say, XML or YAML).

Finally, at a high level the “behind-the-scenes” lingua franca for both platforms is JavaScript. Beyond that however, the similarity ends which is again something we will delve more into as we proceed with our comparative analysis.

Platform-as-a-Service (PaaS)

Perhaps the single greatest driving factor for the growth of the NoSQL movement as a whole has been the precipitous fall in the cost of storage over the past decade. Consequently, as the cost of storage has fallen, the storage efficiency gained from normalized schema design has become increasingly less relevant. Whereas duplication of data (de-normalized data) in the past created a huge strain on financial budgets, being able to scale out using low-cost commoditized hardware has now helped make NoSQL a viable option for many use cases.

Microsoft Azure supports two different cloud-based implementations to support the scale out of hardware: PaaS and IaaS. This leads us to our first distinction between MongoDB and DocumentDB. Whereas both eschew the traditional relational design for data, the greatest distinction between the two database platforms is that DocumentDB is PaaS by design whereas MongoDB is not. Put another way, DocumentDB was built intentionally to work within the Azure Cloud ecosystem, much like SQL Azure, Azure Storage, Azure Search, etc. (see diagram below).

The greatest benefit for running in a PaaS environment is that the detail for managing the cluster of low-cost commodity hardware/virtual machines has been abstracted. Setup and the initial/ongoing configuration is fairly painless and all managed through the Azure Portal or via PowerShell and the Command Line Interface. This includes handling scale up/down for storage/throughput, replication, etc. which will be addressed in the later sections.

However, one of the primary implications for running only in PaaS is that DocumentDB cannot be run in a localhost type environment or in any user-managed environment. Several implications follow:

  1. You have to have a Microsoft Azure cloud account.
  2. There is no concept of a local testing environment at the moment, which also means,
  3. You always have to be online (connected to the internet)

By contrast, MongoDB was not intentionally built to run in a PaaS environment from the ground up. Historically, MongoDB was designed and built to run on low-cost hardware running on Linux. Soon afterward, MongoDB was given support for running on Windows operating systems (although mainly for cost reasons most real-world production environments use Linux clusters exclusively).

It should be no surprise to know that MongoDB retains all the benefits of running in a user-managed environment but also incurs the challenges of managing that complexity as well. That said, it is possible to run MongoDB in Azure Cloud as both IaaS and PaaS.

With IaaS, the setup is fairly straightforward, like configuring any other virtual machine (note: there is no specific “MongoDB” VM template so you will have to instantiate your own Linux/Windows VM instance and then install MongoDB). With PaaS however you are in for a treat. Setting up MongoDB to run in a PaaS environment, such as Azure Web or Worker roles, is a non-trivial affair with very limited documentation to support the process. It is my strong recommendation that if you would like to run MongoDB in a managed environment you do not try this yourself – pay for a third party hosting service like MongoLab, Compose, or even MongoDB Inc. themselves. The great thing about MongoLab is that it can be hosted in Azure and can even be purchased through the Azure marketplace as an add-on. In case you’re not 100% ready to dive into managed MongoDB in Azure, MongoLab also has a free starter plan with a 500MB max limit.

Beyond initial setup and configuration, the benefits of running in PaaS (either DocumentDB or MongoDB) translate further into the realm of ongoing maintenance. Not having to worry about patching servers or managing backups is a huge benefit that cannot be overlooked. In the next section this becomes even more apparent when we look at sharding and replication and realize that the scope of maintenance grows rapidly as the necessary infrastructure grows to accommodate these common use cases. The trade-off for this benefit of PaaS is, of course, granularity of control, which may matter for some.

Scaling – Sharding and Replication

Continuing from the previous distinction, the ways in which DocumentDB’s PaaS-driven architecture and MongoDB handle horizontal scaling are quite different as well. Although these are distinct concepts, sharding and replication are typically lumped together because they represent two aspects of horizontal scaling. What isn’t entirely obvious to newcomers to NoSQL is that both represent scaling…but for entirely different reasons.

As a brief review, sharding improves the performance of an indexed search (we cover indexes in more depth a bit later) over a collection by splitting the documents (i.e. “rows”) for that given collection across multiple servers. It does this by taking a designated column “shard key” and grouping the documents by ranges of values; alphabetically if textual, lowest to highest if numeric. See example below:

Figure 2: Sharding a collection

In the example above, the benefit is twofold. First, the large storage requirement for five billion documents can be shared across multiple servers instead of a single server with a very large hard drive. In some scenarios when dealing with enormous amounts of data (e.g. 1 petabyte) there is simply no way to retain the volume of data unless the data is sharded. Second, an indexed search covers at most one billion rows as opposed to five billion, which is an improvement since the cost of a search grows with the number of documents in a collection.
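The range-based grouping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how MongoDB implements it: the shard names, the alphabetical boundaries, and the `lastName` shard key are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical alphabetical boundaries: keys below "f" go to shard-1,
# keys from "f" up to (but not including) "l" go to shard-2, and so on.
SHARD_BOUNDS = ["f", "l", "q", "v"]
SHARD_NAMES = ["shard-1", "shard-2", "shard-3", "shard-4", "shard-5"]


def shard_for(shard_key: str) -> str:
    """Return the shard whose key range contains shard_key."""
    return SHARD_NAMES[bisect_right(SHARD_BOUNDS, shard_key.lower())]


def route(documents, key_field):
    """Group documents by the shard that owns their shard-key value."""
    placement = {name: [] for name in SHARD_NAMES}
    for doc in documents:
        placement[shard_for(doc[key_field])].append(doc)
    return placement


people = [{"lastName": "Adams"}, {"lastName": "Miller"}, {"lastName": "Zhang"}]
placement = route(people, "lastName")
```

A query that filters on the shard key only needs to visit the one shard returned by `shard_for`, which is where the smaller search space comes from.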

What has just been described, at a high level, is an example of how MongoDB would handle sharding. There is some work to be done in the initial setup and configuration of each shard server and some minimal level of ongoing automated maintenance to ensure each shard server stays balanced (i.e. optimally you want each shard to be approximately the same size, in our example above that would be 1 billion documents each).

By contrast, although DocumentDB uses sharding (sharding on the primary key identity column “_self”), all the implementation detail has been totally abstracted from the user. This is by virtue of the fact that DocumentDB is a PaaS solution.

Replication as the name suggests deals with keeping duplicates of the data mainly for redundancy purposes and to promote high availability. Once again, a replicated server (or “replica”) requires some level of initial setup as well as configuration of a fail-over strategy in the event replica server(s) go down.

With MongoDB, sharding and replication are typically used in combination, so that a single database implementation can often be defined as an entire cluster or rig of servers. Taking the previous example, we create an additional two replica servers for each shard, bringing our total to fifteen servers (see below). As a side note, there is a method to the madness for deciding how many replica servers to use for each node (shard); generally you are looking at putting together 2n+1 replicated nodes (e.g. 3, 5, 7, etc.), where ‘n’ is the number of node failures the replica set should survive. An odd number of members means a majority is still available to elect a new primary after those failures.

Figure 3: Sharding a Collection with Replicas
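The arithmetic behind the 2n+1 rule and the fifteen-server total can be checked with a short sketch. It assumes one common reading of the rule, namely that n is the number of member failures a replica set should tolerate while a majority of members remains; the function names are illustrative only.

```python
def replica_set_size(failures_to_tolerate: int) -> int:
    """Members needed so that a majority survives the given number of failures."""
    return 2 * failures_to_tolerate + 1


def majority(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def can_elect_primary(members: int, failed: int) -> bool:
    """True if the surviving members still form a majority."""
    return members - failed >= majority(members)


shards = 5
total_servers = shards * replica_set_size(1)  # five shards, three members each
```

Note that an even-sized set buys nothing extra under this reading: four members tolerate one failure, exactly like three.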

As you can see, it’s beginning to look like a lot of configuration to keep track of! However, once again, with DocumentDB the replication setup and implementation is handled entirely behind the scenes; to the user, it appears they are only ever working with one node.

So to summarize, the benefit of DocumentDB being PaaS is that there is no need to deal with the nitty-gritty of sharding and replication. The drawback is that there is less fine-grained control over scaling, which could come more into play when we consider cost, as we will see shortly.

Native REST interface

As stated earlier, one thing that both DocumentDB and MongoDB have in common is developer support in the form of SDKs in multiple programming languages.

In addition to working with a language driver, DocumentDB can use a native REST interface. In fact, the client drivers are largely wrappers around the REST interface. The usage pattern is fairly straightforward as this diagram summarizes below:

Figure 4: Hierarchy of resources in DocumentDB (from Microsoft documentation used with permission)

By contrast, MongoDB does not have a native REST interface. It can be a little confusing because while there is no native support, there are third party open source wrappers written in other languages (Node, Python, but sadly no .NET).

What MongoDB does have available for developers to use is the MongoDB Wire Protocol – a binary protocol communicated over TCP/IP – and, more recently, the MongoDB Meta driver. This allows non-language-specific techniques for working with MongoDB, much like a pure REST interface would (except you have to know a little bit about working with TCP).

For the typical line-of-business desktop/web application it’s hard to envision using REST (DocumentDB) or TCP (DocumentDB and MongoDB) over using a language-specific driver. It ends up being a lot more work with a greater chance for error (certainly no compile-time checking), minus some of the security guarantees. That said, if a developer wanted to write their own language-specific driver then REST (DocumentDB) or TCP (MongoDB) would be the best way to go. There may even be scenarios (e.g. lightweight IoT devices) that cannot run .NET or Node.js but are capable of making simple HTTP GET, POST, and PUT calls; these would be candidates for working with a REST API.

One last point worth noting here is that having a REST service as part of PaaS means that Microsoft can release updates and people can take advantage of new features immediately. Additionally, there is built-in support for working with multiple versions of the API in production by specifying the version in the REST header, which makes the migration process from older to newer as painless as possible.
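To make the resource hierarchy and the version header a little more concrete, here is a minimal sketch that only builds a relative resource path and a header dictionary; it sends nothing over the network. The `dbs`/`colls`/`docs` path segments and the `x-ms-version` header name reflect my understanding of the REST interface. The database, collection, and document names and the version string are placeholders, and the authorization and date headers a real request needs are deliberately omitted. Whether the path segments carry names or system-generated resource ids is also glossed over here.

```python
def resource_path(database, collection=None, document=None):
    """Build a relative resource path following the dbs/colls/docs hierarchy."""
    if document is not None and collection is None:
        raise ValueError("a document path needs a collection")
    parts = ["dbs", database]
    if collection is not None:
        parts += ["colls", collection]
    if document is not None:
        parts += ["docs", document]
    return "/".join(parts)


def request_headers(api_version):
    """Headers that pin a request to one version of the REST API."""
    # Authorization and date headers are intentionally left out of this sketch.
    return {"x-ms-version": api_version, "Accept": "application/json"}


path = resource_path("sales", "orders", "order-1001")  # placeholder names
headers = request_headers("2015-04-08")                # placeholder version
```

Pinning the version per request is what lets an older client keep working unchanged while newer clients opt in to a later API version.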

Data Interchange Format

In this context the term “Data Interchange Format” is merely a formal way of describing a protocol that is used to represent data during transmission to and from client and database server. Just to clarify, for most database platforms, how data is physically stored on disk is typically different from how it is actually stored/managed in memory and transmitted over the wire.

DocumentDB represents documents as JSON, a format originally derived from the JavaScript language and currently described by two competing standards, RFC 7159 and ECMA-404. A document in DocumentDB might look like this:

Example document in DocumentDB

For the most part, just your typical JSON data fragment. One thing to look for is the _self property; this field is ubiquitous (and unique) to all documents in DocumentDB. The purpose of _self is to act as an immutable primary key field. Searching on _self is the fastest way to retrieve a resource.
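As a rough illustration of the shape being described, the sketch below builds a made-up document as a Python dictionary and serializes it to JSON. The business fields are invented, and the values of the system-generated properties (`_rid`, `_ts`, `_self`, `_etag`, `_attachments`) are placeholders rather than output from a real account.

```python
import json

document = {
    "id": "order-1001",  # user-supplied identifier (invented)
    "customer": {"name": "Contoso Ltd", "city": "Seattle"},
    "items": [{"sku": "A-100", "quantity": 2}],
    # System-generated properties; every value below is a placeholder.
    "_rid": "<resource id>",
    "_ts": 1428537600,
    "_self": "dbs/<db rid>/colls/<coll rid>/docs/<doc rid>/",
    "_etag": "<etag>",
    "_attachments": "attachments/",
}

as_json = json.dumps(document, indent=2)
```

Apart from the underscore-prefixed properties, nothing distinguishes this from any other JSON fragment.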

MongoDB uses BSON which is a proprietary extension (or superset) of JSON. While there is no official standard, a specification can be found online. The primary takeaway is that BSON has support for data types beyond the standard JSON types as shown below:

Figure 5: Comparing JSON and BSON

Of course one cannot choose the data format one wishes to work with; this is imposed by the database.

A full comparison debating the merits of JSON vs. BSON is beyond the scope of this article but to briefly summarize:

  • JSON has a slightly smaller footprint (size-wise) than BSON so if working in a scenario where disk/memory/bandwidth are an issue then JSON is a better fit.
  • BSON has a richer set of types which allows for more flexible querying involving dates, timestamps, numbers, JavaScript objects, etc.
  • There is a slight performance penalty associated with serializing and de-serializing BSON and JSON to native objects when working with strongly-typed languages like C#, Java, etc. However, with JSON (DocumentDB) there is no such penalty when working with the JavaScript driver, since JSON is native to JavaScript (not the case with BSON).
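The point about the richer type set can be seen from the JSON side using nothing but Python's standard library: plain JSON has no date type, so a date has to be encoded as something else (here an ISO 8601 string) before it can be stored. The document and field names are invented, and BSON itself is not shown since that would need a third-party package.

```python
import json
from datetime import datetime, timezone

order = {"id": "order-1001", "placed": datetime(2014, 9, 1, tzinfo=timezone.utc)}


def to_json(doc):
    """Serialize doc, converting datetimes to ISO 8601 strings."""
    def encode(value):
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError(f"cannot serialize {type(value).__name__}")
    return json.dumps(doc, default=encode)


# Plain JSON serialization has no representation for a datetime.
try:
    json.dumps(order)
    plain_json_ok = True
except TypeError:
    plain_json_ok = False

# After a round trip the date comes back as a string, not as a date.
round_tripped = json.loads(to_json(order))
```

Once the value is a string, the database can only compare it as text, which is why a native date type makes date queries more natural.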

In similar fashion to DocumentDB’s _self field, MongoDB enforces the use of a primary key field called _id (the field name itself is fixed, although drivers can map it to a differently named property in application code). Being a key field, it is automatically indexed and can be used for fast retrieval of a single document.

Indexes

Understanding the indexing paradigm of a database is critical to understanding its performance and DocumentDB and MongoDB are no exception.

The most common indexing strategy adopted by relational and NoSQL databases uses a B-Tree structure. This is certainly the case with both MongoDB and DocumentDB. To summarize, B-Trees have average and worst case performance that is O(log n) on Inserts, Deletes and Searches. B-Tree indexes generally improve the performance of queries (typically in scenarios that involve larger collections, large documents, and/or highly selective queries). Note that these are not hard and fast rules; the decision whether to index or not should be made on a case-by-case basis and depends on the types of queries and data. It is very possible for an index to actually give worse performance than a table scan (a query not involving an index), so one has to be judicious when creating an index. On top of that, every index carries a minor upkeep penalty on writes.

That said, there are a few things to be aware of when using indexes in DocumentDB. There are two types of indexes: Hash and Range. Hash is default, but choosing an index type of Range enables range-like queries (i.e. using >, <, >=, <= comparative operators). With DocumentDB this distinction is something you consciously have to be aware of when dealing with numeric data. The good news is that you can have both a Hash and a Range index on a single field – they are not mutually exclusive.
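The difference between the two index types can be illustrated with a small in-memory sketch. It says nothing about how DocumentDB implements its indexes; it only shows why a hash-style lookup answers equality queries while a sorted structure is needed for range comparisons. Class and field names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class HashIndex:
    """Equality lookups only: maps each value to the documents that hold it."""

    def __init__(self, docs, field):
        self._buckets = defaultdict(list)
        for doc in docs:
            self._buckets[doc[field]].append(doc)

    def equals(self, value):
        return list(self._buckets.get(value, []))


class RangeIndex:
    """Keeps documents sorted by field so >=, <=, >, < queries are possible."""

    def __init__(self, docs, field):
        self._entries = sorted(docs, key=lambda d: d[field])
        self._keys = [d[field] for d in self._entries]

    def between(self, low, high):
        """Documents with low <= field <= high."""
        start = bisect_left(self._keys, low)
        end = bisect_right(self._keys, high)
        return self._entries[start:end]


docs = [{"id": "a", "qty": 5}, {"id": "b", "qty": 12},
        {"id": "c", "qty": 12}, {"id": "d", "qty": 30}]
by_hash = HashIndex(docs, "qty")
by_range = RangeIndex(docs, "qty")
```

Both indexes are built over the same field here, mirroring the point that the two types are not mutually exclusive.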

The second thing to be aware of is that range indexes *currently* do not work with DateTime fields. For example, if you wanted to perform a query that selected based on a range of dates between 9/1/2014 and 12/1/2014, the range index would not work. The workaround for implementing a range type query on date fields involves a two-step process: first, storing the DateTime values as numeric epoch values (the number of seconds since January 1, 1970) and second, applying a range query on that numeric field. This workaround is described in much more detail in an article entitled “Working with Dates in Azure DocumentDB”.
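The two-step workaround can be sketched as follows, assuming the stored number is seconds since January 1, 1970 (UTC). The `placedEpoch` field name is hypothetical, and the filtering is done in memory purely to show the comparison a range query would perform on the numeric field.

```python
from datetime import datetime, timezone


def to_epoch(moment):
    """Seconds since 1970-01-01 UTC; naive datetimes are treated as UTC."""
    if moment.tzinfo is None:
        moment = moment.replace(tzinfo=timezone.utc)
    return int(moment.timestamp())


def placed_between(docs, start, end):
    """Step two: a numeric range comparison on the stored epoch value."""
    low, high = to_epoch(start), to_epoch(end)
    return [d for d in docs if low <= d["placedEpoch"] <= high]


# Step one: store the date as a number alongside the document.
orders = [
    {"id": "o1", "placedEpoch": to_epoch(datetime(2014, 8, 15))},
    {"id": "o2", "placedEpoch": to_epoch(datetime(2014, 10, 3))},
    {"id": "o3", "placedEpoch": to_epoch(datetime(2015, 1, 20))},
]
in_range = placed_between(orders, datetime(2014, 9, 1), datetime(2014, 12, 1))
```

Because the stored value is an ordinary number, a Range index on that field can serve the query.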

Finally, when it comes to querying on location information, DocumentDB does not yet have a geospatial index to handle this type of query unlike MongoDB. That said, Azure DocumentDB does integrate nicely with Azure Search which does provide this kind of support, but that is a distinctly separate product and further discussion is beyond the scope of this article.

MongoDB’s strength over DocumentDB here comes from product maturity, a common theme you will see over and over throughout this post. The good thing about a maturity gap is that it can be closed over time. Or, as the optimist would have it, DocumentDB has the greatest potential for improvement!

Async by Design

The .NET API for DocumentDB fully supports the .NET async pattern. In other words, the .NET SDK provides classes for each DocumentDB task with many asynchronous methods that end with the Async suffix. If you still haven’t worked with the async and await keywords in C#, it is highly recommended that you come up to speed in this area before tackling DocumentDB programming. Below is a sample code snippet for DocumentDB using the .NET API:

Example of Async call for DocumentDB

As you can see from the code snippet above, in a typical write-then-read scenario the async and await keywords are prevalent. In fact, any type of write operation (creating a database, creating a collection, updating a document, etc.) has only an async version of its method. Read operations (querying) can be either async or not.
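For readers more familiar with Python than C#, the same write-then-read shape looks like this with `asyncio`. To be clear, this is an analogy for the pattern only: `InMemoryClient` and its method names are hypothetical stand-ins, not the DocumentDB SDK.

```python
import asyncio


class InMemoryClient:
    """Hypothetical stand-in for a document client; not a real SDK."""

    def __init__(self):
        self._docs = {}

    async def create_document_async(self, doc):
        await asyncio.sleep(0)  # yield control, as a network call would
        self._docs[doc["id"]] = dict(doc)
        return self._docs[doc["id"]]

    async def read_document_async(self, doc_id):
        await asyncio.sleep(0)
        return self._docs[doc_id]


async def write_then_read(client, doc):
    """Await the write before issuing the read that depends on it."""
    created = await client.create_document_async(doc)
    return await client.read_document_async(created["id"])


result = asyncio.run(
    write_then_read(InMemoryClient(), {"id": "order-1001", "total": 42})
)
```

The calling thread is free to do other work while each awaited operation is in flight, which is the main reason the pattern matters for a remote, cloud-hosted database.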

The C# driver for MongoDB added support for the async pattern in version 2.

Cost

When it comes to cost, MongoDB and DocumentDB have two very different pricing models. MongoDB is free to use from a software licensing perspective and can be deployed on as many servers/virtual machines with no additional pricing beyond hardware costs. However there is also an Enterprise flavor of MongoDB that does incur a non-trivial cost but offers support, on demand training, advanced ops, hosting and enhanced security among other things. Alternatively, MongoDB hosting can be purchased from a third party like MongoLab and the price is proportional to the feature set selected (shared vs. dedicated, number of replicas, etc.)

By contrast, DocumentDB is an Azure service and requires an Azure account. It is not free; however, there are several ways to partially or fully offset the cost using programs such as DreamSpark and BizSpark.

Pricing for DocumentDB is determined by performance levels: S1, S2, and S3. Each level shares the same storage constraint: 10GB of SSD storage for a single collection. A database can contain virtually unlimited document storage and throughput, partitioned by collections. The difference in pricing (S1 being the cheapest, S3 being the most expensive) is tied to the number of Request Units (commonly abbreviated as “RUs”). For S1, a collection in DocumentDB supports up to a maximum of 250 RUs/second, for S2 it is 1000 RUs/second, and for S3 it is 2500 RUs/second.

A Request Unit is simply a measurement of throughput. DocumentDB will throttle requests when the RU capacity per second is exceeded. Different types of operations consume different numbers of RUs, and in general read operations (querying) tend to be less expensive than writes (inserts, updates).
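A per-second RU budget can be modeled with a very small sketch. The 250/1000/2500 limits come from the performance levels described in this section; the per-operation costs (1 RU per read, 5 RUs per write) are invented for the example, and the fixed one-second window is a simplification rather than a description of how the service actually meters requests.

```python
LEVEL_LIMITS = {"S1": 250, "S2": 1000, "S3": 2500}  # RUs per second
READ_COST, WRITE_COST = 1, 5  # hypothetical costs, for illustration only


class RequestUnitBudget:
    """Tracks RUs consumed within the current one-second window."""

    def __init__(self, rus_per_second):
        self.rus_per_second = rus_per_second
        self._window = None
        self._used = 0

    def try_consume(self, cost, now):
        """Return True if the request fits in this second's budget."""
        window = int(now)
        if window != self._window:
            self._window, self._used = window, 0  # new second, fresh budget
        if self._used + cost > self.rus_per_second:
            return False  # the request would be throttled
        self._used += cost
        return True


budget = RequestUnitBudget(LEVEL_LIMITS["S1"])
writes_accepted = sum(budget.try_consume(WRITE_COST, now=0.0) for _ in range(60))
read_same_second = budget.try_consume(READ_COST, now=0.5)
read_next_second = budget.try_consume(READ_COST, now=1.0)
```

With these made-up costs an S1 collection accepts 50 writes in one second, rejects further requests until the second ends, and then starts over with a full budget.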

If it seems a little odd that throughput (RUs) and capacity are priced together, you have to understand that each performance level represents one or more pieces of virtualized hardware in the cloud. The corollary is that each collection is tied to exactly one performance level (S1, S2 or S3). It may take a bit of getting used to at first, but the traditional one-entity-type-per-collection convention you might see in a MongoDB data schema would not be the recommended approach in designing a data schema for DocumentDB. Instead, you might typically see one or two “deep” collections capable of holding a variety of document types.

Consistency

In the database world, consistency refers to the accuracy of what is being read from the database versus what is actually stored in the database at a given point in time. For low volume traffic this is a non-issue; the database appears to be consistent. However when there is a high volume of traffic, the database can generally deal with things in one of two ways for a given query: first, it can return a result immediately which may not take into account the latest writes (and therefore may not be entirely accurate) or, second, it can wait until all the relevant writes have been committed across all replicas and then return a result (completely accurate). In the case of the former, we say the database is “highly available” because there is no delay in processing a query (read operation). In the case of the latter, the database is “consistent” because the result from a query accurately reflects what is in the database.
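The two behaviors can be imitated with a toy model of one primary and one lagging replica. This is a deliberately simplified sketch with hypothetical names; real databases replicate in the background rather than on demand.

```python
class ReplicatedStore:
    """Toy model: writes land on the primary and reach the replica later."""

    def __init__(self):
        self._primary = {}
        self._replica = {}
        self._pending = []

    def write(self, key, value):
        self._primary[key] = value
        self._pending.append((key, value))  # not yet on the replica

    def replicate(self):
        """Apply all outstanding writes to the replica."""
        for key, value in self._pending:
            self._replica[key] = value
        self._pending.clear()

    def read_available(self, key):
        """Answer immediately from the replica, even if it is stale."""
        return self._replica.get(key)

    def read_consistent(self, key):
        """Wait for outstanding writes to be applied, then answer."""
        self.replicate()
        return self._replica.get(key)


store = ReplicatedStore()
store.write("balance", 100)
stale = store.read_available("balance")
fresh = store.read_consistent("balance")
```

The first read returns right away but misses the latest write; the second pays for replication before answering and so reflects it.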

The terms “high availability” and “consistency” find their roots in the CAP theorem, which essentially constrains a database to two out of three dimensions (consistency, availability, and partition tolerance). As we discussed earlier, DocumentDB and MongoDB are partition tolerant by design. So in CAP terms both MongoDB and DocumentDB are, by default, consistent (CP).

What’s nice about both databases is that this behavior can be adjusted. For MongoDB it involves setting a flag when instantiating the database connection (safe = true ensures consistent behavior, safe = false will relax the consistency constraint).

By contrast, DocumentDB defines four levels of consistency (Strong, Bounded Staleness, Session, and Eventual) on a sliding scale where “Strong” is fully CP and “Eventual” is fully AP. By default, DocumentDB has the “Session” level of consistency where “writes are propagated asynchronously while reads for a session are issued against the single replica that can serve the requested version”. The two benefits here over MongoDB’s implementation are that you have more granularity over the level of consistency, and that consistency can be set on the fly per request (rather than per connection as with MongoDB).

Binary Large Object (BLOB) Storage

Sometimes there are scenarios where the data being stored exceeds the capacity of the document size imposed by the database. As of the writing of this article, DocumentDB’s maximum document size is 512KB and MongoDB’s is 16MB.

It turns out that both DocumentDB and MongoDB support blob storage. The primary difference is that DocumentDB had blob storage already designed (see Azure Blob Storage) whereas MongoDB added blob storage as a feature in its version 2.0 release (see GridFS).

The implication here is that DocumentDB integrates seamlessly with Azure Blob Storage. If an Azure Blob Storage instance doesn’t exist, one is automatically provisioned when the first write to blob storage is issued. All of this is handled behind the scenes in a single line of code!

Example of attaching a blob for DocumentDB

That’s not to imply that the storage abilities of MongoDB’s GridFS are inferior in any way, but that a few extra lines of code are required to work with GridFS.

Monitoring

At first glance, the gap between the maturity of MongoDB and DocumentDB’s offerings here could not be wider. Below is a screenshot of the monitoring section located inside the Azure Portal page:

Figure 6: DocumentDB Monitoring

With DocumentDB, access to real-time instrumentation metrics is largely reduced to these two numbers: total requests and average requests/second. By contrast MongoDB via Mongo Monitoring Service (MMS) tracks a whole host of metrics – page faults, memory usage, background flush averages, CPU, and IO wait times, just to name a few. It should be noted however that access to MMS is not a free service.

One could argue that technically there isn’t much needed to “monitor” DocumentDB beyond a few common metrics because a PaaS implementation would suggest this level of detail is unnecessary. Knowing details such as CPU utilization or page faults/second would be useful *if* one was in the business of manually managing virtual machines or scaling database capacity/performance, but with DocumentDB that is simply not the case.

One final difference that should not be overlooked is Mongo’s administrative console (mongo.exe for those Windows users out there). The console exists to provide an interface for the administration of MongoDB (allowing for some level of limited remote administration for cloud hosted solutions). Beyond basic manipulation of databases, collections, etc., it offers the ability to run on-the-spot diagnostic queries about the performance of queries (using .explain()) and other useful commands to assess the condition of replicas, shards, and the relative disk sizes of a database or collection.

Programmability

In an earlier section it was noted that both MongoDB and DocumentDB use JavaScript; however, the underlying usage pattern for each is completely different.

Mongo’s usage of JavaScript falls largely under two categories. The first is for general database administration purposes, as was touched upon in the previous section. Unlike DocumentDB, Mongo has an interactive console running the V8 JavaScript engine which can be used for a number of different administrative tasks. The second category is for Map-Reduce type operations, which are beyond the scope of this article but essentially boil down to executing a more sophisticated type of query that aggregates and groups data from one or more collections, all in JavaScript.

DocumentDB takes a different approach when it comes to JavaScript because it isn’t fundamentally grounded in a Map-Reduce paradigm for data aggregation. Rather, it is more in line with the traditional relational database style: support for stored procedures, triggers, and user defined functions (UDFs). The major difference, obviously, is that for all of these JavaScript is the working language and not T-SQL or some other SQL derivative. Below is a trivial example of a stored procedure used in DocumentDB.

Example DocumentDB Stored Procedure

A common point of confusion is that the DocumentDB driver for C# supports the use of LINQ. This is true for all of the driver’s common operations. However, stored procedures, UDFs and triggers still have to be written in JavaScript – there is simply no way around this.

Something found lacking in DocumentDB when it comes to programmability is the debugging story. For any programmability work, the developer essentially injects a string (the JavaScript program) into the process at run-time, so there is no compile-time checking in that scenario.

With MongoDB, the debugging story is a bit better in that some constraints are enforced by the driver at the Map, Reduce (and optional Project) stages, but you could still inject some buggy JavaScript since at the end of the day you’d still be using strings at each stage of the process.

Other Considerations for DocumentDB

Just a few odds and ends before we wrap up here. Due to the relative immaturity of DocumentDB, there are a couple of gotchas to look out for. These are things one might assume or expect in a document database (MongoDB has all of these) but for whatever reason have not made the cut yet.

Limited support for aggregates

By aggregates we simply mean the ability to execute the equivalent of GROUP BY, SUM, AVERAGE type operations. MongoDB can use its Map-Reduce or Aggregation Framework to accomplish this. At the moment DocumentDB’s support for aggregates is limited and is only effective across a single partition.
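For readers unfamiliar with the terminology, this is what a GROUP BY with SUM and AVERAGE computes, sketched over an in-memory list of documents. The field names are invented, and doing this work in client code is only one possible way to cope when the database cannot, not a recommendation from either vendor.

```python
from collections import defaultdict


def group_totals(docs, group_field, value_field):
    """Per-group sum and average, like GROUP BY with SUM and AVERAGE."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for doc in docs:
        sums[doc[group_field]] += doc[value_field]
        counts[doc[group_field]] += 1
    return {
        key: {"sum": sums[key], "average": sums[key] / counts[key]}
        for key in sums
    }


orders = [
    {"region": "west", "total": 10.0},
    {"region": "west", "total": 30.0},
    {"region": "east", "total": 5.0},
]
totals = group_totals(orders, "region", "total")
```

Every document has to be pulled to the client for this to work, which is exactly the cost that server-side aggregation avoids.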

No ordering

In this context we are talking about the server-side ordering of query results before they are returned to the client. In SQL this would be the equivalent of the ORDER BY clause.

Limited tooling

More specifically, UI tooling. In fairness, this seems like a common theme for document databases where the expectation has been that the community would step up and create something. For MongoDB we have things like MongoVUE and RoboMongo. For DocumentDB officially there is the Azure portal which is somewhat limited in what it has to offer. Unofficially, there is DocumentDBStudio on GitHub which I’ve found to be most helpful (and free!).

Conclusion

There are a couple of over-arching themes that can be drawn out here.

First, DocumentDB has most of the basic requisites of a document database taken care of. Where it falls short by comparison to MongoDB (mainly in the areas of indexing, advanced query support, and tooling) tends to suggest it is more an issue of maturity and is correctable in the long run. Indeed, all of these areas are in the process of being addressed at the writing of this article.

Beyond this, it would also appear that DocumentDB is somewhat lacking in two other areas: documentation and adoption. This is to be expected with any new product and will be remedied over time.

From a developer’s perspective, DocumentDB and MongoDB are both a delight to work with. Both drivers offer support for multiple languages and DocumentDB’s built-in support for async in C# adds a nice touch. Also, being a PaaS solution by design, one can’t say enough about the conveniences afforded to the developer using DocumentDB not having to worry about maintenance and the complexity involved in managing patching, sharding, replication and the like.

Overall, the prevailing sentiment is that DocumentDB has made a great first step into the world of NoSQL and Document databases. It still has some catching up to do to its bigger cousin MongoDB, but it presents itself as a true cloud/PaaS solution built from the ground up. It has already been in heavy use internally within Microsoft for over a year now (the biggest user being MSN, serving tens of millions of users a day), and we all look forward to what the future holds for this up-and-comer to the world of NoSQL as enterprises start to take notice.