October 14

High Availability: Abstracting out every layer - Service Availability

So what really is high availability? It appears everyone wants their servers and networks to be available all the time, but how do you achieve this? A typical systems administrator would answer something like "virtualization" or "clustering", maybe even "replication" and "off-site backups", but I think the answer lies deeper than that. What is it that all of these things really do that makes the service highly available? Abstraction. All of these things provide some degree of abstraction, whether it be service, physical, storage, network, or location abstraction. In this piece I hope to provide others with some insight into what I believe is the beauty of high availability systems and, hopefully, some assistance in designing your company's infrastructure.

The first thing to identify is at which layers the vulnerabilities exist. In a system there are many layers, but depending on the service provided some of these layers are non-crucial. The important piece here is to take a holistic approach that stops at no physical or logical boundary. Take a moment to think about the target of evaluation (ToE) and how it is utilized. In most cases it is not a physical device but the logical service it provides that you are most interested in keeping available. The vulnerability in many cases is that the service is tied to one physical device and that device then becomes unavailable. This being the case, the service should be the focal point of the analysis rather than the device itself. For example, does it matter where the web site/DNS/DHCP/directory services/etc. come from so long as they are available? (Now I know some of you are going to flame me about taxing WAN links, right? Remember I'm not talking about efficient design; you should definitely have higher-utilization services available in the location where they are used.
I'm talking about high availability: making sure your service is still accessible when something goes wrong.)

Service Availability

Commonly the first thing that comes to mind when it comes to high availability is the service itself. The service is the entire reason for the server's existence (and likely the entire reason you have a job if you're reading this). For service availability here are a few commonly chosen options: Windows clustering, network load balancing, virtualization, DNS round robin, and geo-clustering. Let's take a few moments to identify the benefits and pitfalls of each of these options.

Windows Clustering

Windows clustering is a common option to ensure high availability of resource-intensive applications such as Microsoft SQL Server and Microsoft Exchange. In a clustered scenario each computer in the cluster is attached to some form of Fibre Channel or iSCSI storage, commonly a SAN array. All servers in the cluster share the same data set (commonly stored on the same device) to ensure that when a server becomes unavailable any of the others can immediately take control of operations. Cluster state is tracked on a quorum drive: changes to the cluster configuration are logged there so that a surviving node can take over with a consistent view of the cluster. Server availability is determined over a dedicated heartbeat network. For more information about Windows clustering please see the following Microsoft TechNet article: http://technet.microsoft.com/en-us/library/aa997507.aspx.

So what is the downside of Windows clustering? There are a few, actually. The first is that a Windows cluster is expensive, requiring additional physical boxes with no increase in performance. The best-designed Windows cluster, in my opinion, is the active-passive cluster. In this design, though, an entire server's job is to do nothing unless the first fails.
Of course, this is also very expensive, requiring two similarly (hopefully identically) spec'ed servers, two copies of Windows Server Enterprise, and, if you are using more than two nodes, a switch. The second downside is that a lot of processing resources are wasted that could usually be better allocated. I'll explain more about this when I get into virtualization later on in this post.

Network Load Balancing

Another option to achieve service availability is network load balancing, or NLB. Each NLB cluster can have up to 32 hosts and allows them all to listen on a single IP address for ease of configuration. Because NLB only works with stateless applications its uses are quite limited, but it is very efficient where available. NLB clustering carries only 1% CPU overhead, and all nodes are active at all times, which significantly decreases wasted resources. Some additional information about NLB clustering can be found in the following Microsoft TechNet article: http://technet.microsoft.com/en-us/library/cc758834.aspx.

So, what are the limitations and pitfalls of such an efficient service? First, this form of clustering is only available for stateless applications (i.e. those where the state of the service is not affected in any way by the requests coming into it). In short, you just ruled out a large number of the server products on the market. The most common NLB clustered resources are web servers and Microsoft ISA servers. A second downside of NLB clustering is configuration management. Remember that in this configuration all servers operate autonomously; they do not necessarily share a common storage area and, thus, can serve different data depending on which node in the cluster you are communicating with. To alleviate this issue be sure to find a way to centralize the storage of configurations (e.g. relocating the inetpub folder for IIS to a network share).
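The configuration drift problem described above can be made concrete with a small sketch. This is only an illustration, not anything NLB itself provides: it assumes you can somehow gather each node's site content into a dictionary of path to bytes (the node names and file contents below are hypothetical), fingerprints each node's content, and reports the nodes that disagree with the majority.

```python
from __future__ import annotations

import hashlib
from collections import Counter


def content_fingerprint(files: dict[str, bytes]) -> str:
    """Reduce one node's content (path -> bytes) to a single SHA-256 digest."""
    digest = hashlib.sha256()
    for path in sorted(files):  # sort so ordering never changes the result
        data = files[path]
        digest.update(path.encode("utf-8"))
        digest.update(len(data).to_bytes(8, "big"))  # length prefix keeps fields unambiguous
        digest.update(data)
    return digest.hexdigest()


def find_drifted_nodes(nodes: dict[str, dict[str, bytes]]) -> list[str]:
    """Return the names of nodes whose content differs from the most common fingerprint."""
    fingerprints = {name: content_fingerprint(files) for name, files in nodes.items()}
    majority, _count = Counter(fingerprints.values()).most_common(1)[0]
    return sorted(name for name, fp in fingerprints.items() if fp != majority)


# Hypothetical cluster: web3 never received the latest upload.
cluster = {
    "web1": {"index.html": b"<h1>version 2</h1>"},
    "web2": {"index.html": b"<h1>version 2</h1>"},
    "web3": {"index.html": b"<h1>version 1</h1>"},
}
drifted = find_drifted_nodes(cluster)
```

A check like this only detects drift after the fact; centralizing the content on shared storage, as suggested above, removes the opportunity for drift in the first place.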
DNS Round Robin

The old faithful DNS round robin cluster still seems to be a great option for many servers. The beauty of this form of "clustering" is its simplicity: simply make multiple entries in DNS for a single host name. Round robin configuration causes DNS to rotate through a set of hosts, changing which address it responds with on each request. This type of clustering distributes requests roughly evenly among servers and allows an application to exist in multiple physical locations in an active-active scenario. For more information on DNS round robin please see the following Microsoft TechNet article: http://technet.microsoft.com/en-us/library/cc787484.aspx.

Once again, there are downsides associated with this form of clustering. As with network load balancing, this particular type of clustering is recommended only for stateless applications, because once the cached DNS answer expires on the client's PC the client could be directed to another server. A second issue with this form of clustering is that it requires manual intervention to remove an address from DNS. This means that should one of the addresses go offline, the DNS server would still issue that IP address as valid until an administrator removed it manually.

Virtualization

My personal favorite form of clustering is virtualization, a concept that has recently been making some waves in the IT market. Virtualization is the concept of installing a very lightweight operating system on a server that emulates hardware for the virtual servers housed within it. Virtual machines work on the premise that servers rarely use all of their available resources all of the time. This being the case, multiple virtual servers can be loaded onto one physical server, allowing them to share resources while still maintaining the best practice of a single-purpose box.
In addition, enterprise-class virtualization products, such as VMware's Virtual Infrastructure, can actually migrate these virtual servers between physical servers live without causing any interruption to the virtual servers being moved. This is done through shared storage, shared network resources, and a dedicated network. In a VMware Distributed Resource Scheduler (DRS) cluster, VMware constantly monitors the CPU and memory utilization of the various virtual servers housed across the cluster. Should a virtual server require more resources, VirtualCenter will send a command to the VMware server instructing it to migrate the virtual server to another physical box with more available resources. Once this command is issued, the memory used by the virtual server is transferred via the dedicated network to the destination server, then finally the application hook on the file is swapped, making the destination server take over the CPU and memory load of the virtual server. Also, in high availability (HA) clustering VirtualCenter will detect that a VMware server has gone offline and dynamically transfer control of the virtual servers once housed on that VMware server to other VMware servers in the cluster.

OK, you can see from all of this that I am very pro-virtualization, but it has its downsides as well. As you might expect, going through another layer of software before hitting hardware costs time. While most applications respond just as fast as always, resource-intensive applications, such as Microsoft Exchange and Microsoft SQL Server, could slow down when virtualized. Also remember that one of the biggest benefits of virtualization is resource sharing. Some applications, usually database-based applications, allocate all of their allotted memory upon initialization for performance reasons. These applications do not commonly benefit from virtualization. Additionally, note that only memory and CPU utilization are monitored for DRS.
This means that if your applications reside on a server with another virtual server that is hogging disk or network resources, they may decrease in performance somewhat.

Geo-Clustering

With the arrival of Windows Server 2008 I think that we will be seeing much more of geo-clustering. The concept of geo-clustering is similar to that of Windows clustering, yet spread across hundreds or thousands of miles. This allows for minimal downtime in the case of a catastrophe and should be considered a form of site redundancy. Geo-clustering allows an entire site to go down with only minimal loss of service for any cluster-aware applications. Instead of using a single shared storage platform, geo-clustering uses replication to copy data from a live (read-write) storage array to a passive (read-only) array until a failure occurs. For more information on geo-clustering, please see the following white paper: http://download.microsoft.com/download/3/b/5/3b51a025-7522-4686-aa16-8ae2e536034d/WS2008%20Multi%20Site%20Clustering.doc

Once again, downsides exist in this situation as well. This time, though, it is not so much a technology limitation as a budgetary one. In order to implement this form of clustering, hefty WAN links will be required to support replication of data between arrays. You will also need two separate arrays with replication capabilities to replicate the data used by the application.

Other Forms of Clustering

I could go on forever about the various third-party forms of clustering available today. Veritas Cluster Server, Neverfail, and so forth are all focused on establishing high availability by distributing a service over multiple servers or locations. The important piece is to analyze the service in question and, based on that analysis, choose the most appropriate clustering method for that application.
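One way to picture that analysis is as a simple filter over the options discussed in this post. The sketch below is my own simplification of the trade-offs described above, not an official decision matrix: the `ServiceProfile` fields and the rules inside `candidate_methods` are assumptions chosen for illustration, and a real evaluation would also weigh cost, performance, and licensing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceProfile:
    """Hypothetical, deliberately tiny description of a service."""
    stateless: bool   # requests do not change the state of the service
    multi_site: bool  # the service must survive the loss of an entire site


def candidate_methods(profile: ServiceProfile) -> List[str]:
    """Return the clustering methods from this post that fit the profile."""
    # Virtualization abstracts the hardware whatever the service looks like.
    candidates = ["Virtualization"]
    if profile.stateless:
        # NLB and DNS round robin are only suggested here for stateless services.
        candidates.append("Network Load Balancing")
        candidates.append("DNS Round Robin")
    else:
        # Stateful services need shared or replicated storage behind them.
        candidates.append("Windows Clustering")
        if profile.multi_site:
            candidates.append("Geo-Clustering")
    return candidates


web_server = ServiceProfile(stateless=True, multi_site=False)
database = ServiceProfile(stateless=False, multi_site=True)
web_options = candidate_methods(web_server)
database_options = candidate_methods(database)
```

The value of writing it down this way is less the code than the discipline: the profile of the service comes first, and the technology is chosen from it.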
Use Case Scenario

OK, so now, having read this mini-book on service availability, it's time to take a look at one possible use for it. Remember that this is only one post in a series of posts about high availability design; more will come on storage, networking, and physical infrastructure. The idea of high availability design is that no single point of failure exists. To make a simple example, we will use a web server. By itself, IIS loaded on a single server will provide a web site, but it will not withstand even a simple reboot. Because of this we need to ensure that this server has a replica to respond in the event of a reboot, which means we need some form of service availability method. To begin our investigation let's weigh our options based on the profile of the service.

Web Server Profile
Ok, now that we have some basic statistics about a web server we can compare it to our various forms of clustering:
So upon review we can see that we have three feasible options: Network Load Balancing, DNS Round Robin, and Virtualization. The first thing that I would point out in this case is that with virtualization we become hardware-agnostic, allowing our web server to survive any amount of physical device failure without significant outage. Also, with virtualization driver management is handled by the host server, which simplifies configuration and patch management. In addition, we gain the ability to back up the virtual machine as a file, which could significantly improve the recovery time objective, or RTO, in the event of an outage.

With virtualization as one angle of our high availability design we are still left unable to reboot our web server without causing an outage. It is for this reason that we will now look to network load balancing or DNS round robin as high availability options. DNS round robin is a commonly used method of achieving high availability for services that can be geographically dispersed, as it allows for local network prioritization; however, it is not as efficient in responding to a service outage, because a connection must fail before the next server in the list is queried. Network load balancing, on the other hand, is designed to cluster together stateless servers on the same network and is much quicker to respond to a service outage due to a reboot. It is because of this that we will use a replica web server in a network load balancing cluster to ensure high availability at the site level.

Well, our hard work has definitely paid off: we have established a relatively highly available web site at our current site. So what would happen if that site were to lose its network connection, air conditioning, or power? Once again we would be left with no web site, so we must seek out a form of hot site to ensure that a locality-based disaster does not cause an outage.
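To make the difference in failure response concrete, here is a minimal simulation of DNS round robin with a dead server still listed in DNS. It is a sketch under stated assumptions, not a model of any particular DNS server: `RoundRobinZone` and `connect_with_fallback` are hypothetical names, the addresses come from the reserved documentation ranges, and the client is assumed to try the addresses in the order they were returned.

```python
from collections import deque
from typing import Iterable, List, Optional, Set, Tuple


class RoundRobinZone:
    """Toy DNS zone that rotates its address list after every query."""

    def __init__(self, addresses: Iterable[str]) -> None:
        self._addresses = deque(addresses)

    def resolve(self) -> List[str]:
        answer = list(self._addresses)
        self._addresses.rotate(-1)  # the next query starts with the next host
        return answer


def connect_with_fallback(answer: List[str], healthy: Set[str]) -> Tuple[Optional[str], int]:
    """Try each address in order; return (address reached, failed attempts first)."""
    failed = 0
    for address in answer:
        if address in healthy:
            return address, failed
        failed += 1  # a real client would wait for a connection timeout here
    return None, failed


zone = RoundRobinZone(["192.0.2.10", "192.0.2.11", "198.51.100.10"])
# 192.0.2.10 is down, but nobody has removed it from DNS yet.
healthy = {"192.0.2.11", "198.51.100.10"}
results = [connect_with_fallback(zone.resolve(), healthy) for _ in range(4)]
```

In this run every third query hands the dead address out first, so those clients pay for one failed connection before reaching a working server. That delay is exactly what an NLB cluster on the local network avoids.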
Out of our clustering options above it appears that DNS round robin will offer us the best return on investment, thanks to its ability to answer with the address on the subnet closest to the user. This will ensure that if the sites are significantly geographically dispersed (as they should be) the users are directed to the web servers closest to them for performance.

Once we have multiple web servers network load balanced and DNS round robin clustered across multiple sites we have a new issue to deal with: configuration management. What if users were to get different web pages depending on which server they hit because a systems administrator only uploaded the changes to some of the web servers? To thwart this we need a central storage location for all of these web pages: a file server. Now that our web servers look to a new network location for their web pages we run into another issue, the high availability of the file server itself. We therefore need to establish some form of high availability for our file servers. Luckily, Active Directory adds a feature specifically for establishing highly available file systems: the distributed file system, or DFS.
As we can see from this chart, DFS is the only one of these technologies which supports replication. In fact, DFS also supports site-awareness to ensure that your web server is not reaching across a WAN link to obtain its content unless, of course, all of the local file servers are offline.

Conclusion

Service availability is a very extensive topic with numerous supported high availability technologies available. Another important piece to consider is the value of the service at hand. For instance, the bare minimum server configuration to achieve true high availability in the above scenario would be 8 separate servers (2 IIS servers and 2 file servers per site across two sites, not to mention the implied domain controllers), yet with that configuration it is highly unlikely that the web site will ever be unavailable if a proper configuration management process is adhered to. Always make sure to perform a proper business process analysis and cost-benefit analysis before implementing any new technology.
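As a rough aid to that cost-benefit analysis, the arithmetic behind the 8-server figure and the redundancy it buys can be sketched as follows. The per-server availability of 99% is purely an assumed number for illustration, and the formula assumes that servers fail independently of one another, which shared power, networks, or administrator mistakes can easily violate.

```python
def total_servers(sites: int, web_per_site: int, file_per_site: int) -> int:
    """Count the servers needed for the use case (domain controllers excluded)."""
    return sites * (web_per_site + file_per_site)


def redundant(single: float, replicas: int) -> float:
    """Availability of a group that works if at least one replica works.

    Assumes independent failures: the group is down only if every replica is down.
    """
    return 1.0 - (1.0 - single) ** replicas


def site_availability(single: float, web_per_site: int, file_per_site: int) -> float:
    """A site serves pages only if both its web tier and its file tier are up."""
    return redundant(single, web_per_site) * redundant(single, file_per_site)


def overall_availability(single: float, sites: int,
                         web_per_site: int, file_per_site: int) -> float:
    """The web site is up if at least one whole site is up."""
    return redundant(site_availability(single, web_per_site, file_per_site), sites)


ASSUMED_SERVER_AVAILABILITY = 0.99  # illustrative figure, not a measurement

servers = total_servers(sites=2, web_per_site=2, file_per_site=2)
one_site = site_availability(ASSUMED_SERVER_AVAILABILITY, 2, 2)
both_sites = overall_availability(ASSUMED_SERVER_AVAILABILITY, 2, 2, 2)
```

Comparing `one_site` with `both_sites` against the cost of the extra servers and WAN links is one simple way to frame whether the second site is worth it for a given service.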