The Challenge of Scale

Working in the SMB space for the majority of my career meant rarely worrying about hitting scale limits in the hardware and software I was responsible for. A few years ago, managing a data footprint of 20-30TB seemed huge to me. I didn’t have the data storage requirements, I didn’t have the number of virtual machines, I didn’t face struggles of scale. As I moved into the enterprise, that scale went up massively. 20-30TB quickly became multiple petabytes. The struggles you face at the enterprise level are much different.

While presenting on Dropbox’s “Magic Pocket” storage system, James Cowling said something that really put their scale into perspective: he referred to building a storage system of 30 petabytes as a “toy system.” As Dropbox explored the possibility of moving users’ data out of Amazon and into its own datacenters, it needed a storage system that could meet its ever-increasing storage demands. Storage software capable of managing 30PB was easier to come by than software capable of managing 500PB. In building a homegrown solution to hold all the file content for its users, Dropbox took on a challenge few others have had to face. With that much data hosted in AWS, there was no off-the-shelf product capable of managing this scale.

While the move from AWS to on-premises sounds simple, issues like scale are just the tip of the iceberg. Dropbox needed to write a massively scalable filesystem, work hand-in-hand with hardware vendors to find the right design, determine the best way to migrate its data to its datacenters, ensure data integrity, and validate every aspect throughout the entire process. It also needed the time to do all of this right the first time. When your job is content storage and collaboration, “losing” data isn’t an option. Confidence in the solution, along with management granting the autonomy necessary to “reset the clock” if and when bugs were found, was the only way this move was going to be successful.

And what prompted the decision to move out of AWS’s S3 storage? Cost. To the tune of nearly $75 million in operating expense savings over the 2 years since getting out of AWS. Storage is cheap and getting cheaper, but storage at scale is an expensive endeavor. While the cost savings are significant, the performance gain was significant as well. Dropbox saw a dramatic performance increase by bringing data into its datacenters and using its new storage system. This is just a reminder that the real cost of “cloud” is often much higher than companies expect.

Back to the issue of scale. Storage wasn’t the only issue they faced. Now with over 1 exabyte of storage and growing at a rate of nearly 10PB per month, they also faced an issue of bandwidth. Dropbox sees around 2Tb of data moving in and out of its datacenters per second. PER SECOND. With that kind of demand, minimizing traffic and chatter inside their network is important as well. Events such as disk, switch, or power failures shouldn’t create unnecessary rebuild traffic that hurts disk and network performance. The Dropbox datacenter monitoring solution is just as advanced as the storage system, capable of analyzing the impact of any such failure in the datacenter and triggering rebuilds and redistribution only when necessary. There is a balance of network versus disk cost when it comes to how and where to rebuild that data.
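
To put those numbers in more familiar units, here is a rough back-of-the-envelope sketch. It assumes that “2Tb” means terabits, that units are decimal (1PB = 10^15 bytes), and that the rate is sustained around the clock; none of that was stated in the presentation, and the function names are just my own for illustration.

```python
# Back-of-the-envelope arithmetic on the figures quoted above.
# Assumptions (mine, not Dropbox's): "Tb" is terabits, units are decimal (SI),
# and the traffic rate is sustained for a full 24 hours.

PB = 10**15  # bytes in a petabyte
EB = 10**18  # bytes in an exabyte


def tbps_to_gigabytes_per_sec(tbps):
    """Convert terabits per second to gigabytes per second (8 bits per byte)."""
    return tbps * 1000 / 8


def petabytes_per_day(tbps):
    """Data moved in one day at a sustained rate given in terabits per second."""
    bytes_per_sec = tbps * 10**12 / 8
    return bytes_per_sec * 86_400 / PB


def monthly_growth_percent(total_bytes, growth_bytes_per_month):
    """Monthly growth expressed as a percentage of the current footprint."""
    return 100 * growth_bytes_per_month / total_bytes


gb_per_sec = tbps_to_gigabytes_per_sec(2)              # 250.0 GB every second
pb_per_day = petabytes_per_day(2)                      # about 21.6 PB per day
growth_pct = monthly_growth_percent(1 * EB, 10 * PB)   # 1.0 percent per month

print(f"{gb_per_sec:.0f} GB/s, {pb_per_day:.1f} PB/day, {growth_pct:.1f}% growth/month")
```

Under those assumptions, the traffic crossing the datacenter edge in a day and a half is about as much as that 30PB “toy system” would hold.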

Designing a highly available, redundant, always-on infrastructure looks different depending on your scale. Application-level redundancy and storage-level redundancy, combined with a robust monitoring solution, are just a few of the techniques Dropbox has utilized to ensure application and data availability. The Dropbox approach may not be common, but it was necessary for long-term success. Sometimes the only way to reach your goals is to think outside the box.


Disclaimer: During Storage Field Day 15, my expenses (flight, hotel, transportation) were paid for by Gestalt IT. Dropbox provided each delegate with a small gift (sticker, notepad, coffee), but I am under no obligation to write about any of the presented content nor am I compensated for such writing.
