until(true)
Go Cry on Somebody Else’s Shoulder: MongoDB is fine

I’m getting kind of sick of all of these postings where people slag on MongoDB because they happened to have a bad experience in their environment. Moreover, I’m tired of people who can’t make a decent argument about a piece of technology.

My personal experience w/ MongoDB

A little bit about who I am. I worked at a now-defunct video startup named Motionbox. We had a system in place that offloaded (from our Rails app) the generation of a manifest for our player (describing where to get videos, multiple videos in playlists, etc.). The first iteration of this system wrote these “JSON blobs” to an NFS filesystem, which had all the scalability problems you can imagine. The second was CouchDB, and it was a total disaster (I know many people like CouchDB and can use it effectively, but for us, it didn’t work). The last solution was based on MongoDB, and it had a name that made our CTO recoil with horror: Cornhole.

We hosted the infamous iPhone 4 video that Gawker got in all that trouble for. That day was not fun. Our load balancers were totally overwhelmed by the amount of traffic and were falling over because of TCP TIME_WAIT issues. The one thing that didn’t flinch was MongoDB.

We had MongoDB deployed on two Sun x4440 servers, nothing special in terms of the filesystem (it was lzjb-compressed ZFS in a RAID-10 configuration), and we probably gave mongo 40GB of RAM. This was in the days before replica sets, and we were using simple master/slave replication. It served over 400 hits/sec and, aside from our load balancer issues, came through with flying colors.

Know what tradeoffs you’re making

**1. MongoDB issues writes in unsafe ways *by default* in order to
win benchmarks**

If you don't issue getLastError(), MongoDB doesn't wait for any
confirmation from the database that the command was processed.
* In a concurrent environment (connection pools, etc), you may
  have a subsequent read fail after a write has "finished";
  there is no barrier condition to know at what point the
  database will recognize a write commitment
* Any unknown number of save operations can be dropped on the floor
  due to queueing in various places, things outstanding in the TCP
  buffer, etc, when your connection drops or the db were to be KILL'd or
  segfault, hardware crash, you name it

First of all, with any piece of technology, you should, y’know, RTFM. EVERY company out there selling software solutions is going to exaggerate how awesome it is. Before you deploy something into production, you should make sure you understand the implications of the decisions you’re making. All technology is a tradeoff. Until version 5.5, MySQL used MyISAM as its default storage engine. You know, the one that’s designed for high performance and isn’t all that awesome about data integrity.

Any communication between two processes is subject to conditions where data may be lost. Traditional RDBMSs make certain promises about the steps they take to mitigate those situations, but those promises trade performance for data integrity. When businesses choose to run Oracle for their eCommerce backend, it isn’t because some developer in designer eyeglasses decided that Oracle was sexytime. They make that decision because if they credit someone’s account, they need to be absolutely 100% certain that the transaction either happened or didn’t, and that nobody gets charged or credited twice.

What MongoDB and NoSQL in general are saying is that (in the words of my co-worker) “Sometimes it doesn’t matter”. If you’re collecting lots of statistics on a distributed process, and you miss one because a process dies, do you really care? Maybe? The point is that you should have the option to trade data integrity for speed in the cases where you need to.
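To make the tradeoff concrete, here is a rough sketch of the difference between firing a write off and waiting for the database to confirm it. This is a toy in-memory model, not MongoDB's driver API or wire protocol: `ToyServer`, `write_unacknowledged`, and `write_acknowledged` are all names I made up for illustration, and the "acknowledged" path only stands in for the kind of round trip that a `getLastError()` call gives you.

```python
from collections import deque


class ToyServer:
    """Hypothetical in-memory stand-in for a database server (not MongoDB)."""

    def __init__(self):
        self.pending = deque()  # writes received but not yet applied
        self.applied = {}       # writes that have actually been applied

    def receive(self, key, value):
        # Queue the write; nothing is applied yet.
        self.pending.append((key, value))

    def flush(self):
        # Apply every queued write in arrival order.
        while self.pending:
            key, value = self.pending.popleft()
            self.applied[key] = value

    def crash(self):
        # Anything still queued is lost; applied writes survive.
        self.pending.clear()


def write_unacknowledged(server, key, value):
    """Fire and forget: return as soon as the write is handed off."""
    server.receive(key, value)


def write_acknowledged(server, key, value):
    """Hand off the write, then wait until the server has applied it."""
    server.receive(key, value)
    server.flush()  # stands in for blocking on a confirmation
    return key in server.applied


def demo():
    server = ToyServer()
    # The write we care about waits for confirmation.
    write_acknowledged(server, "payment:1", 100)
    # The statistics are cheap to lose, so we do not wait.
    for i in range(3):
        write_unacknowledged(server, f"stat:{i}", i)
    server.crash()
    return sorted(server.applied)
```

In this model, `demo()` leaves only `payment:1` behind after the crash. The three statistics writes returned instantly and then vanished, which is exactly the deal you sign up for when you skip the confirmation.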

Have a backup

**2. MongoDB can lose data in many startling ways**

Here is a list of ways we personally experienced records go missing:

 1. They just disappeared sometimes.  Cause unknown.
 2. Recovery on corrupt database was not successful,
    pre transaction log.
 3. Replication between master and slave had *gaps* in the oplogs,
    causing slaves to be missing records the master had.  Yes,
    there is no checksum, and yes, the replication status had the
    slaves current
 4. Replication just stops sometimes, without error.  Monitor
    your replication status!

Every database can have the same sorts of errors. I can’t speak to #1, but we definitely experienced #2 and #4 with MySQL at Motionbox. I hate to break it to the poster (and I would, if they hadn’t chickened out of putting their name on their post), but software has bugs. We went to great lengths to guard against data loss. As a cardinal rule, you should have three copies of anything you don’t want to lose. In addition, you should have multiple backups in case your data gets corrupted.
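If replication status can claim a slave is current when it isn't, the only way to know is to compare the copies yourself. Here is a minimal sketch of that idea, assuming you can export records from each copy as plain dicts keyed by id. The helper names (`record_digest`, `compare_copies`) are hypothetical, and this is not how we did it at Motionbox or anything built into MongoDB or MySQL.

```python
import hashlib
import json


def record_digest(record):
    """Stable SHA-256 digest of a JSON-serialisable record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def compare_copies(primary, replica):
    """Compare two copies given as dicts of id -> record.

    Returns (missing, mismatched): ids absent from the replica, and ids
    present in both whose contents differ.
    """
    missing = sorted(k for k in primary if k not in replica)
    mismatched = sorted(
        k for k in primary
        if k in replica
        and record_digest(primary[k]) != record_digest(replica[k])
    )
    return missing, mismatched


# Example: record 2 never replicated, record 3 was corrupted on the way.
primary = {1: {"video": "a.mp4"}, 2: {"video": "b.mp4"}, 3: {"video": "c.mp4"}}
replica = {1: {"video": "a.mp4"}, 3: {"video": "c.mp5"}}
missing, mismatched = compare_copies(primary, replica)
```

Run something like this on a schedule against your master, your slave, and your backup, and a silent gap becomes a loud one.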

Learning experiences suck

All in all, reading the original post, it seems like someone just learned that all software sucks. This is not unique to MongoDB. Even with the potential drawbacks, we’ve chosen MongoDB again to be our backend for an encoding system, because it works for us, it’s fast, and for our workload, it’s a good fit.

I look forward to hearing about their awesome adventures trying to get PostgreSQL to scale in terms of write load and learning about log-shipping replication.