Thursday, June 6, 2013

Uptime and the Value of Clustering for Linux Servers

This Red Hat white paper explores the need for uptime; explains the three components that contribute to uptime (reliability, availability, and serviceability); reviews some of the problems with existing enterprise server management; provides a clear definition of clustering; and finally gives an honest assessment of how it can contribute to your organization's server management and uptime goals.

Modern server-class computer systems have an increasing number of hardware error detection and correction capabilities built into them, and there are many options for increasing fault tolerance, from RAID and multipath storage area networks to bonded Ethernet interfaces and more. Even with all of these high-availability features, though, failure is a fact of life in modern data center environments. Increasingly complex IT environments and the ever-growing demand for storage and compute capacity create environments where no amount of planning can prevent server outages. The key, then, is not only to prevent failure but also to find ways to cope with it, preferably in real time and with little or no downtime. This is the promise of clustering technologies, and specifically of Red Hat Cluster Suite running on Red Hat Enterprise Linux.

Imagine a data center where a server going down causes no noticeable interruption to end-user services, where mission-critical applications seamlessly fail over from faulty nodes to any number of remaining nodes. Now imagine this in your data center, with your already deployed Linux servers, on your existing commodity x86 hardware. This is the promise of clustering.

This white paper explores the need for uptime, explains the three components that contribute to uptime, reviews some of the problems with existing enterprise server management, provides a clear definition of clustering, and finally gives an honest explanation of what it can and can't offer your organization.

Whether you manage an enterprise server environment or work as a system administrator, system architect, network engineer, or application administrator, if you aren't concerned about downtime, you're in a very small minority of technology professionals. For most folks, downtime is a constant concern. Nearly all businesses are increasing their reliance on technology, and many simply wouldn't exist without their technical underpinnings. As a knowledge worker servicing this critical infrastructure, you understand the business's need for highly reliable service, but you're also intimately aware of the difficult challenges that stand in the way of that goal.

The goal of uptime is ideally spelled out between the IT organization and its internal business customers using clear service level agreements (SLAs) that dictate minimum acceptable performance and availability metrics. To achieve these agreed-upon metrics, the IT organization is chiefly concerned with managing reliability, availability, and serviceability, often referred to simply as RAS. Clustering is a particularly potent solution to the uptime challenge, because it offers a triple play of uptime benefits: improving reliability, availability, and serviceability.
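To get a feel for what those SLA percentages mean in practice, here is a small sketch (my own illustration, not part of the original paper) that converts a few commonly quoted availability targets into the downtime they allow per year:

    # Rough sketch: translate an SLA availability target into allowable downtime.
    # The targets below are illustrative examples, not figures from the paper.

    HOURS_PER_YEAR = 365 * 24  # ignoring leap years for simplicity

    def allowed_downtime_hours(availability_pct):
        """Return the maximum downtime per year (in hours) for a given availability %."""
        return HOURS_PER_YEAR * (1 - availability_pct / 100)

    for target in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(target)
        print(f"{target}% uptime allows about {hours:.2f} hours "
              f"({hours * 60:.1f} minutes) of downtime per year")

Run it and the numbers are sobering: 99.9% uptime, which sounds generous, still only leaves room for roughly 8.8 hours of downtime in an entire year.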

Before looking specifically at clustering though, let's gain a better understanding of the three components of uptime.

Reliability is the ability to avoid or detect faults. Server-grade memory that supports Error Correcting Codes (ECC) is a great example of a reliability feature. If a single bit is read incorrectly from an ECC-capable memory module, the system is able to detect the error and, using the ECC bits stored in the memory, actually correct it and deliver the appropriate data to the CPU. Even double-bit errors are detectable, though not correctable, with current ECC implementations. At the network level, we could implement multihomed internet connectivity - connections to two or more independent Internet Service Providers - to increase reliability. If one ISP goes down, we can simply avoid sending packets to that ISP and instead send them over the link to our other ISP. Whether it's server components, operating systems, networks, or applications, there are lots of existing solutions focused on increasing reliability. It's important to remember, though, that reliability is just one of the three components of uptime.
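To make the ECC idea concrete, here is a toy Hamming(7,4) example of my own. Real ECC DIMMs use wider SECDED codes implemented in the memory controller hardware, so treat this purely as an illustration of the principle that extra check bits let a single flipped bit be located and corrected:

    # Toy illustration of single-bit error correction with a Hamming(7,4) code.
    # Real ECC memory uses wider hardware SECDED codes; this only shows the principle.

    def encode(d):
        """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4          # covers codeword positions 1, 3, 5, 7
        p2 = d1 ^ d3 ^ d4          # covers codeword positions 2, 3, 6, 7
        p3 = d2 ^ d3 ^ d4          # covers codeword positions 4, 5, 6, 7
        return [p1, p2, d1, p3, d2, d3, d4]

    def correct(c):
        """Locate and fix a single flipped bit, returning the corrected codeword."""
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 + 2 * s2 + 4 * s3   # 0 means no error, else the bad position
        if syndrome:
            c[syndrome - 1] ^= 1
        return c

    word = encode([1, 0, 1, 1])
    damaged = list(word)
    damaged[5] ^= 1                        # simulate a single-bit memory fault
    print(correct(damaged) == word)        # True: the error was found and fixed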

Availability is the ability of a component, server, or service to stay operational in spite of faults. Redundant Array of Independent Disks (RAID) storage configurations are an example of an availability feature. RAID enables a file system to remain available and fully accessible to users in spite of one (and, depending on the RAID level, sometimes more than one) individual hard drive going bad.
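The idea behind parity-based RAID levels is simple enough to sketch in a few lines. This is my own minimal illustration, not Red Hat code: store one parity block computed as the XOR of the data blocks, and the contents of any single failed drive can be rebuilt from the survivors, so the data stays available while you wait for a replacement.

    # Minimal sketch of RAID-style parity: XOR of the data blocks lets us
    # rebuild the block from any single failed drive using the survivors.
    from functools import reduce

    def xor_blocks(blocks):
        return bytes(reduce(lambda a, b: a ^ b, chunk) for chunk in zip(*blocks))

    data = [b"block-on-drive-A", b"block-on-drive-B", b"block-on-drive-C"]
    parity = xor_blocks(data)              # stored on a fourth drive

    failed = 1                             # pretend drive B just died
    survivors = [d for i, d in enumerate(data) if i != failed] + [parity]
    rebuilt = xor_blocks(survivors)

    print(rebuilt == data[failed])         # True: the lost block is recovered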

Finally, serviceability is the ability to quickly diagnose faults and service systems or components to restore normal operating conditions. If a hard drive fails on a server with a RAID storage configuration, but you are never notified of that failure and the drive never gets replaced, then you may have high availability but low serviceability. The collection of performance monitoring data and system logs, along with the tools to automatically analyze that data, are important serviceability aids.
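As a rough sketch of that kind of serviceability aid (the log path and match patterns here are hypothetical examples of my own, not a Red Hat tool), a small script could watch the system log for disk-error messages and surface them to an operator before the second drive fails:

    # Hedged sketch of a serviceability aid: scan the system log for disk-error
    # messages and surface them to an operator. The log path and the patterns
    # below are hypothetical examples, not part of any Red Hat product.
    import re

    LOG_FILE = "/var/log/messages"                  # typical RHEL syslog location
    PATTERNS = [r"I/O error", r"md.*device failed", r"ata\d+.*failed command"]

    def find_disk_errors(path):
        matcher = re.compile("|".join(PATTERNS), re.IGNORECASE)
        with open(path, errors="replace") as log:
            return [line.rstrip() for line in log if matcher.search(line)]

    if __name__ == "__main__":
        errors = find_disk_errors(LOG_FILE)
        for line in errors:
            print("ALERT:", line)                   # in practice: mail, SNMP trap, etc.
        if not errors:
            print("No disk errors found in", LOG_FILE)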

While there is often a lot of overlap between the above uptime components, I find it important to regard all three as equals when designing and thinking about server architectures. The drive failure scenario above demonstrates why any one component without the others is simply insufficient.

