Convergence Without Compromise

Hyperconverged Infrastructure (HCI) gets a lot of attention these days, and rightly so. With HCI we’ve seen a move towards an easy-to-use, pay-as-you-grow approach to the datacenter that was previously missing. I started my career with complex storage arrays that required you to purchase all your capacity up front. While expanding those arrays was possible, we were often buying all the storage we’d need for three to five years even though we wouldn’t consume most of it for years.

While HCI certainly made things easier, it was far from perfect. Combining storage and compute in a single server meant maintenance operations needed to account for both available compute resources and available storage capacity to accommodate the storage that goes offline with the node. At times we would actually sacrifice our data protection scheme in order to take nodes offline and hope there were no additional failures in the cluster at the same time. Not ideal when we’re talking about production storage.

Get Down with the DVX

Datrium and the DVX platform aim to address these problems in an interesting way. Datrium separates storage and compute nodes much like a traditional two-tier system, but uses SSDs inside each host as a read cache. By moving the cache into the host, we’re able to increase performance with every host we add. Decoupling the cache from the storage layer means we’re not queuing up reads at a storage array that is trying to satisfy the requests of every connected host over the same switches. While this sounds very similar to technologies we’ve seen before (Infinio and PernixData come to mind), the differentiator is the storage awareness.

The Datrium DVX solution uses its own storage nodes for the persistent storage tier. With the caching and storage layers fully aware of each other, Datrium is able to offer end-to-end encryption from the hypervisor down to the persistent storage while still taking advantage of deduplication and compression. Encrypting data at the storage array level often means giving up these data efficiencies, but not in the case of Datrium. We get an additional level of data security without having to make any compromises.

No Knobs, No Problems

HCI vendors have really pushed the configuration options within their systems. Customers can choose which data is deduplicated and compressed, whether or not it should be encrypted, how many copies of their data should be kept, and whether erasure coding is a better choice than traditional RAID, just to name a few. This is where Datrium separates itself from its HCI competitors. By disaggregating compute nodes from the persistent storage layer, Datrium’s DVX system manages to deliver performance and features without penalty. Once again, no compromises.

Erasure coding, dedupe and compression, double-device failure protection, data encryption: every one of these features is always on and requires no separate licensing or configuration. The advantage here isn’t just in administrative overhead, but also in performance. Datrium’s performance numbers are published with every one of these features enabled. No tricks. No gimmicks. What you see is what you get, unlike many competitors that hide behind unrealistic configurations with many of these features disabled.

3 Tiers, 1 Solution

Datrium aims to bring together a Tier 1 HCI-like solution combined with scale-out backup storage and cloud-based DR, all in the same system. With integrated snapshots that use VMware snapshots as well as VSS integration, they are able to perform crash-consistent and application-consistent snapshots of virtual machines right on the box. This, of course, is table stakes when it comes to modern storage arrays. The differentiator is that Datrium is able to do this at the VM level despite presenting NFS to the virtual hosts. Now we’re not just backing up all the VMs that live in a LUN or volume; we’re able to get as granular as the virtual disk itself. No VVOLs required.

Adding another level of visibility into the mix, Datrium reports latency at the individual virtual machine level instead of at the storage array. Traditional storage array vendors talk about their ultra-low latency, but that reported latency is what the array sees; it doesn’t take into account the latency imposed by the virtual hosts and switching infrastructure. With each component in the virtual infrastructure having its own queues, varying utilization, and available bandwidth, the latency a virtual machine experiences is much greater than what the array is reporting. Datrium offers this full visibility at the individual virtual machine level so you know how your environment is actually performing. Dr. Traylor from The Math Citadel has an excellent overview of queuing theory, Little’s Law, and the math behind it.
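For a quick sense of why the gap matters, here is a back-of-the-envelope illustration using Little’s Law (the numbers are purely illustrative, not Datrium’s): if L is the average number of outstanding I/Os, λ is the throughput, and W is the average latency, then

L = λ × W, which means W = L / λ

So a VM holding 20 I/Os in flight against a path delivering 10,000 IOPS is experiencing roughly 20 / 10,000 = 0.002 seconds, or 2 ms, of end-to-end latency, regardless of what the array’s internal counters report. Every queue along the path (HBA, switch port, array front end) adds to the outstanding I/O count, and therefore to the latency the VM actually sees.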

The cloud-based integrations also allow for an additional level of data availability. Instead of requiring separate backup software, Datrium allows replication of your data to a DVX running in the cloud. Now we have an offsite copy of your data ready to be restored in the event of VM corruption or deletion. Replication is also dedupe-aware, meaning data isn’t sent to the cloud if it is already present there, which minimizes bandwidth requirements and speeds up the replication process.

Cloudy Skies Ahead

While I am very reluctant to trust one solution with my primary and backup data, in certain situations I can see the advantages. Integration with AWS allows virtual machines to be restored from the cloud-based DVX, meaning your DR site can now be in AWS. Datrium has lowered the barrier to the cloud for a lot of customers with the features they’ve included in the DVX platform.

Datrium continues to make a good product even better. The additional features available in version 4.0 of DVX make it a great fit not only for SMB customers, but for enterprises as well. A feature-rich, no-knobs approach to enterprise storage with backup and DR capabilities all rolled into one. Datrium is definitely worth a look.

________________________________________

Disclaimer: During Storage Field Day 15, my expenses (flight, hotel, transportation) were paid for by Gestalt IT. I am under no obligation by Gestalt IT or Datrium to write about any of the presented content nor am I compensated for such writing.


The Challenge of Scale

Working in the SMB space for the majority of my career meant rarely worrying about hitting scale limits in the hardware and software I was responsible for. A few years ago, the idea of managing a data footprint of 20-30TB was huge for me. I didn’t have the data storage requirements, I didn’t have the number of virtual machines, I didn’t face the struggles of scale. As I moved into the enterprise, that scale went up massively; 20-30TB quickly became multiple petabytes. The struggles you face at the enterprise level are much different.

While I was listening to James Cowling from Dropbox present on their “Magic Pocket” storage system, he said something that really put their scale into perspective: a 30-petabyte storage system was referred to as a “toy system.” As Dropbox explored moving users’ data out of Amazon and into their own datacenters, they needed a storage system that could meet their ever-increasing demands. Storage software capable of managing 30PB was easier to come by than software capable of managing 500PB. In building a homegrown solution to hold all the file content for its users, Dropbox faced a challenge few others have had to face: with that much data hosted in AWS, there was no off-the-shelf product capable of managing this scale.

While the move from AWS to on-premises sounds simple, issues like scale are just the tip of the iceberg. Dropbox didn’t just need to write a massively scalable filesystem, work hand-in-hand with hardware vendors to find the right design, determine the best way to migrate their data into their datacenters, ensure data integrity, and validate every aspect of the process; they also needed the time to do all of it right the first time. When your job is content storage and collaboration, “losing” data isn’t an option. Having confidence in the solution, and management granting the autonomy to “reset the clock” if and when bugs were found, was the only way this move was going to be successful.

And what prompted the decision to move out of AWS’s S3 storage? Cost, to the tune of nearly $75 million in operating expenses saved over the two years since getting out of AWS. Storage is cheap and getting cheaper, but storage at scale is an expensive endeavor. The cost savings weren’t the only win; Dropbox also saw a dramatic performance increase by bringing data into their own datacenters and onto their new storage system. It’s a reminder that the real cost of “cloud” is often much higher than companies expect.

Back to the issue of scale: storage wasn’t the only issue they faced. Now with over 1 exabyte of storage, growing at a rate of nearly 10PB per month, they also faced an issue of bandwidth. Dropbox sees around 2Tb of data moving in and out of its datacenters per second. PER SECOND. With that kind of demand, minimizing traffic and chatter inside the network is important as well. Events such as disk, switch, or power failures shouldn’t create additional rebuild traffic inside the network that impacts disk and network performance. The Dropbox datacenter monitoring solution is just as advanced as the storage system, capable of analyzing the impact of any such failure and triggering rebuilds and redistribution only when necessary. There is a balance of network cost versus disk cost when it comes to how and where to rebuild that data.

Designing a highly available, redundant, always-on infrastructure looks different depending on your scale. Application-level redundancy and storage-level redundancy, combined with a robust monitoring solution, are just a few of the techniques Dropbox has used to ensure application and data availability. The Dropbox approach may not be common, but it was necessary for long-term success. Sometimes the only way to reach your goals is to think outside the box.

________________________________________

Disclaimer: During Storage Field Day 15, my expenses (flight, hotel, transportation) were paid for by Gestalt IT. Dropbox provided each delegate with a small gift (sticker, notepad, coffee), but I am under no obligation to write about any of the presented content nor am I compensated for such writing.


vSAN – Check VM Storage Policy & Compliance

As I continue to work with vSAN, I keep discovering there’s more to it than just moving some VMs over and being on your way. With multiple vSAN clusters, each with a different configuration, I needed a way to monitor the current setup and check for changes. While creating a simple script to check which VM Storage Policy is assigned to each VM isn’t very difficult, creating a script to check the storage policy of VMs across multiple vSAN datastores proved to be a little more difficult.

We run multiple PowerCLI scripts to check health and configuration drift in our environment (thanks to a special tool created by Nick Farmer). In the event that a new vCenter is added or a new vSAN datastore is deployed, we needed a simple script that could run without any intervention or modification. Now we can be alerted when the proper VM storage policy isn’t assigned or the current policy is out of compliance.

To further complicate things, in our setup we create a VM Storage Policy that contains the name of the cluster to which it’s assigned. Due to the potential differences between vSAN clusters (stripes, failures to tolerate, replication factor, RAID, etc.), having a single Storage Policy does not work for us. In the event a VM is migrated from one vSAN cluster to another, we need to check that the VM storage policy matches the policy for that cluster’s vSAN datastore.

The script grabs all the clusters in a vCenter that have vSAN enabled. For each vSAN-enabled cluster it finds, it filters down to only the VMs that live on vSAN storage (datastores whose names contain “-vsan”). Then we pull the storage policy based management configuration (Get-SpbmEntityConfiguration) of those VMs. Finally, the script filters for any VM whose storage policy doesn’t contain the cluster name OR whose compliance status doesn’t show as compliant.

# Grab every vSAN-enabled cluster in the connected vCenter
$vsanClusters = Get-Cluster | Where-Object { $_.VsanEnabled -eq "True" }

foreach ($cluster in $vsanClusters)
{
    # Keep only VMs on a vSAN datastore (name contains "-vsan"), then flag any VM
    # whose policy doesn't reference this cluster or whose compliance isn't clean
    $cluster | Get-VM |
        Where-Object { ($_.ExtensionData.Config.DatastoreUrl | ForEach-Object { $_.Name }) -like "*-vsan*" } |
        Get-SpbmEntityConfiguration |
        Where-Object { $_.StoragePolicy -notlike "*$cluster*" -or $_.ComplianceStatus -notlike "*compliant*" } |
        Select-Object Entity, StoragePolicy, ComplianceStatus
}

Once this is run, we can see the output below. I’ve obscured the names of the VMs, but there are still 12 VMs using the default vSAN Storage Policy instead of the cluster-specific storage policy they should be using. In addition, the compliance status is currently out of date on most of these VMs. These VMs reside on 2 separate clusters, and 2 more VMs were filtered out because they live on local storage in these clusters instead of vSAN.

[Screenshot: script output listing VMs with the default vSAN storage policy and an out-of-date compliance status]
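Because we run this as part of scheduled health checks, we want an email when anything shows up rather than having to eyeball the output. Below is a minimal sketch of how that might look; the recipient, sender, and SMTP server are placeholders, and the filter is the same one used in the script above.

# Minimal sketch: run the same check and email the results if anything is out of line
# (the addresses and SMTP server below are placeholders)
$drift = foreach ($cluster in (Get-Cluster | Where-Object { $_.VsanEnabled -eq "True" })) {
    $cluster | Get-VM |
        Where-Object { ($_.ExtensionData.Config.DatastoreUrl | ForEach-Object { $_.Name }) -like "*-vsan*" } |
        Get-SpbmEntityConfiguration |
        Where-Object { $_.StoragePolicy -notlike "*$cluster*" -or $_.ComplianceStatus -notlike "*compliant*" } |
        Select-Object Entity, StoragePolicy, ComplianceStatus
}

if ($drift) {
    Send-MailMessage -To "vmware-alerts@example.com" -From "vsan-check@example.com" `
        -SmtpServer "smtp.example.com" -Subject "vSAN VM storage policy drift detected" `
        -Body ($drift | Format-Table -AutoSize | Out-String)
}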


VSAN – Compliance Status is Out of Date

Occasionally the compliance status of the vSAN performance service will change to “out of date.” This is not an alert that is thrown anywhere within vCenter; you have to check the status by logging into the vSphere web client, locating your vCenter, choosing the cluster, clicking “Manage,” and then choosing “Health and Performance” under “Virtual SAN.”
[Screenshot: Performance Service box showing a “Compliant” status]

Since I have already fixed this issue, the screenshot above shows the “Compliant” status. Below are the steps to get to that point.

1. In the box for “Performance Service” click “Edit storage policy”

2. If there is a storage policy available in the drop down, select it and click “OK”. This will apply that policy and perform the compliance check.

For the lucky few where that works, that’s all you need to do. If the storage policy list is empty, you’ll need to restart the vsanmgmtd service on each of the hosts.

3. Enable SSH on each of the hosts in the VSAN cluster. Using an SSH client (like PuTTY), connect to a host and run the following command to restart the vsanmgmtd service. This is a non-impactful operation and can be performed during production hours. (A PowerCLI loop that runs this across all hosts is sketched at the end of this post.)
a. /etc/init.d/vsanmgmtd restart

4. Repeat that command on each of the hosts in the cluster until they have all restarted their services

5. Wait 5 minutes and then check to see if you are able to select a storage policy for the performance service. If not, move on to step 6

6. Now we’ll need to restart the vSphere Profile-Driven Storage Service on the vCenter server. This is also non-impactful and can be performed in the middle of the day. If you’re using vCenter on Windows, connect to the Windows server and restart the “VMware vSphere Profile-Driven Storage Service”. If you’re using the VCSA (like this example), you’ll need to SSH to the VCSA and run the command below
a. service vmware-sps restart

7. After the vmware-sps service restarts, log out of the web client and wait 5 minutes while the Profile-Driven Storage service completes its restart.

8. Log back in to the web client, navigate to the vCenter server, click “Manage” then choose the “Storage Providers” tab

9. Click the Synchronize Providers button to resync the state of the environment

10. Wait another 5 minutes while the synchronization completes. After 5 minutes, navigate to the VSAN cluster in the web client. Click on “Manage” then choose “Settings” and locate “Health and Performance” under the “Virtual SAN” section

11. In the Performance Service box, click the “Edit Storage Policy” button

12. From the drop down list you should be able to select the appropriate VSAN storage policy and then click “OK”

13. After this is selected the compliance status should change to “Compliant” and you should be all set.

So far these are the only steps that I have needed to follow in order to fix this issue. Let me know if there are any other fixes available.
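If you have more than a couple of hosts, the restart in step 3 can be looped from PowerCLI instead of opening a PuTTY session to each host. This is only a sketch: it assumes an SSH client is available on your workstation, SSH is already enabled on the hosts, you can authenticate as root, and “VSAN-Cluster” is a placeholder cluster name.

# Sketch: restart vsanmgmtd on every host in the cluster over SSH
# (assumes SSH is enabled on the hosts; "VSAN-Cluster" is a placeholder name)
Get-Cluster "VSAN-Cluster" | Get-VMHost | ForEach-Object {
    ssh "root@$($_.Name)" "/etc/init.d/vsanmgmtd restart"
}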


The Beginning of Cloud Natives

Over the last 8 years I have built my career around VMware. I remember the first time I installed VMware Server at one of my jobs just to play around with, and imported my first virtual machine. I had no idea what I was doing or how any of it worked, but I felt there was a future for me in this technology. As I moved on to other companies, the VMware implementations just got larger and larger, from 3 hosts all the way up to well over 1,000.

Having spent time in these environments and with other users at local VMUG events and VMworld, I’ve seen that the skills required to be a VMware administrator are becoming commoditized. More people know about it than ever before, more blogs exist than ever before, and the necessity of meetings that revolve around VMware specifically seems to have run its course. While VMware remains integral to the datacenter today, there are skills we need to be developing and technologies we need to be exploring to ensure we’re not the ones being replaced when the next generation joins the workforce.

Enter Cloud Natives.

Cloud Natives was the idea of Dominic Rivera and me as a means to bridge the gap between users and these new technologies. Cloud Natives looks to bring together the leaders in a technology space to present their solutions in one location. Rather than just letting vendors spew marketing material, we take a different approach: vendors are required to bring actual customers to present how their solutions have impacted their jobs and their businesses. No more outlandish claims, no more vanity numbers that don’t depict actual workloads, just real stories from real users.

We are kicking off 2016 with our first event on July 14th in Portland, OR. This event will be focused on one of the hottest technologies in the datacenter right now: Flash Storage. We’re bringing together the top players in the Flash Storage space, and you’ll hear their customers discuss the benefits and challenges they faced when moving away from legacy spinning disk arrays and even newer hybrid arrays. Our goal is to educate our members one event at a time.

Cloud Natives looks to bring together all the datacenter technologies into one place. Whether it’s a focus on hypervisors, traditional or next-generation storage and infrastructure, cloud providers, DevOps and automation, or anything else that is hot in the datacenter, we will be that go-to resource in the Pacific Northwest. Each event is an opportunity to evaluate multiple vendors from the perspective of the customer. With no overlapping session schedules, you can walk away better informed and get any questions answered in one event.

I encourage everyone in the Portland area to register for this event at the Cloud Natives site. Our goal is to bring a sense of community back to Portland. We want to be a place to meet and network, to encourage, to mentor and to grow in our careers. No matter the stage in our career, we all have knowledge and experience that can help someone else and it’s time we all do our part to give back to the community.


Cohesity – Scale-Out Secondary Storage

Backups are boring. Whether you’re talking about swapping tapes, configuring backup jobs in your legacy agent-based software, or spending another night restoring snapshots from your storage array, there’s just no way to make backups interesting. Cohesity aims to fix that. No, they won’t make backups sexy, but they are looking to add a bit more flash to the secondary storage market.

So what exactly is “secondary storage?” Secondary storage encompasses our backups, non-prod workloads, fileshares, and the like, and the market around it has been gaining visibility recently. With the flood of vendors in the primary storage space, Cohesity could have been another “me-too” primary array, but they see the value in attacking an underdeveloped market.

The concept of Cohesity is simple. You can purchase the C2300 or C2500 models, which offer 48TB or 96TB of storage respectively in each 4-node appliance (with a minimum of 3 nodes to start). Additional capacity can be added a single node at a time afterwards, in 12TB or 24TB chunks depending on the model. Each node contains either 800GB or 1.6TB of flash for caching along with compute and memory. Cohesity claims they are infinitely scalable due to their distributed OASIS (Open Architecture for Scalable Intelligent Storage) architecture, though they’ve only tested up to 32 nodes at the time of this writing. Once your nodes are set up, you just point Cohesity at your vCenter Server and you now have visibility of your virtual machines.

Leveraging VADP, Cohesity snapshots your configured VMs and begins ingesting all that data. Changes to those VMs are tracked (using CBT), so you’re not performing a new full backup each time. All that is pretty standard in the backup world, so what sets Cohesity apart? That data is not just backed up; it is available to actually use. Want to spin up one of these backed-up VMs for testing? Space-efficient clones are created directly on the Cohesity appliance and are presented back to your ESXi hosts. Searching for a file to restore from one of these VMs? You can locate it right from the web interface and download the file without having to restore the entire VM.

The differentiator for Cohesity is not just how it scales or how simple it makes the backup process, but how it makes your backups useful. Developers can access clones of your production systems to test deployments and hotfixes without impacting your production storage. Integrated QoS prevents your dev/test workloads from consuming all your resources and causing backup performance to suffer. By combining the onboard flash with global deduplication, these workloads can perform like production without the cost of an all-flash array.

An all-inclusive secondary storage appliance that provides visibility into data sprawl adds to the value. Oftentimes, as production systems are backed up and cloned and cloned again, you lose sight of the origin of that data. When you migrate data from one storage array to another, you lose that deduplication and end up adding capacity across systems to accommodate your storage footprint. By providing an all-in-one solution for your backups and dev/test workloads, you’re able to maximize your investment without the need for multiple arrays and storage targets.

The backup market is a crowded one. There are more feature-rich backup software providers in the space, but many of them require the purchase of additional storage that lacks the capabilities Cohesity provides. Having just released version 1 in mid-October, Cohesity already has a lot of capability in their software and what appears to be a great vision for the future. The product still needs refinement to simplify searches, reporting, and scheduling, but the foundation the Cohesity team has built has me excited to see where they’ll be able to take the product.

__________

Watch all the videos from Cohesity at Storage Field Day 8 here.

Disclaimer: During Storage Field Day 8, my expenses (flight, hotel, etc) were paid for by Tech Field Day. I am under no obligation to write about any of the presented content nor am I compensated by any of the presenting companies for such writing.


NexGen Storage – The Future is Hybrid

At Storage Field Day 8, the delegates got a sneak peek at what NexGen Storage was up to. With new product and patent announcements, there was a lot to be excited about for this hybrid array vendor.

Hybrid storage is the future. But with the death of disk and the move to an all-flash world, how can that be true? While disk isn’t dead yet, it feels like it’s dying. High IOPS, low latency, and consistent performance are what make flash so desirable. When designing a modern datacenter today, I’d be unlikely to buy spinning disk. So how can hybrid be the future?

As the cost of flash continues to drop, our dependency on spinning disk drops as well. When it comes to enterprise storage, why pay for a fading technology when flash is becoming more and more affordable? That’s not to say disk doesn’t have a place in the datacenter; it just means the use cases are beginning to diminish. Spinning disk is generally not where I want my production applications to run.

NexGen Storage has been in the hybrid array space for some time. Founded in 2010, purchased by Fusion-io in 2013, and eventually spun out as its own company earlier this year, NexGen Storage has continued to focus on its hybrid arrays. The engineering efforts and customer growth didn’t stop along the way; the NexGen team stayed focused on what it does best: fast storage with predictable performance.

With the rise of flash in the datacenter, why the focus on hybrid? With NexGen, hybrid doesn’t just mean flash cache in front of spinning disk. Caching writes and/or reads in flash and RAM makes sense, since only your working set “needs” that high-speed, low-latency tier; but when the array fails to predict the next blocks your application will request, the performance of spinning disk sometimes isn’t enough. Their latest model, the N5-1500, is all flash with a hybrid approach.

While the cost of flash is dropping rapidly, it is still expensive. NexGen Storage uses flash on the PCIe bus for its caching tier, which is much more expensive than regular SSDs. The advantage of this approach is lower latency and higher throughput at a still-reasonable cost, since you’re not filling the entire array with PCIe flash. The N5 all-flash series is available in 15TB to 60TB raw capacity (in 15TB increments), with every size including 2.6TB of PCIe flash.

Why the same cache tier size at every capacity? With its now-patented dynamic QoS, NexGen Storage is able to deliver the consistent performance businesses need for their applications. The ability to prioritize workloads and assign pre-configured QoS policies lets you purchase one do-everything array. In many of the smaller environments I’ve worked in, you don’t have the luxury of separate storage arrays for production and dev/test; often you’re just hoping your non-critical workloads don’t affect your mission-critical ones. With automated throttling and prioritized data placement, you can ensure development never interferes with Tier 0.

The all-flash datacenter is here today, and each all-flash vendor has a different approach. This hybrid all-flash approach, along with an all-inclusive software feature set, is what sets NexGen Storage apart. A fast array isn’t enough anymore; you need an array with the intelligence to deliver the performance you’re expecting at all times. Combine that with VMware Virtual Volumes support, data reduction (deduplication and compression), and array-based snapshots and replication (between all-flash and hybrid spinning disk arrays), and you have a solution built for the next generation.

Disclaimer: During Storage Field Day 8, my expenses (flight, hotel, etc) were paid for by Tech Field Day. I am under no obligation to write about any of the presented content nor am I compensated by any of the presenting companies for such writing.
