Introduction

DNS (Domain Name System) is one of the primary Internet services; it maps human-friendly domain names to machine-friendly IP addresses. If many people use the same DNS service (for example, subscribers using their ISP's DNS server), a single DNS server can become a bottleneck, and the server might fail.

A scalable DNS cluster can help provide scalability and availability for the DNS service.

The example below sets up a cluster for recursive DNS, but you can use the same method for authoritative DNS as well. Just remember that clients who use your cluster as a secondary nameserver need to also-notify{} each of your realservers, not just the service IP.

Architecture

DNS is a simple service: there is no affinity between requests from the same client. DNS usually listens for queries on UDP port 53 and TCP port 53.

LVS can simply load balance UDP port 53 and TCP port 53 among a set of DNS servers, and there is no need to set up any persistence options.

Configuration Example

keepalived.conf:

  ! Balancer-Set for udp/53
  virtual_server 194.97.173.124 53 {
      delay_loop 10
      lb_algo wrr
      lb_kind DR
      protocol UDP
      ! persistence_timeout 1
      ! persistence_granularity 255.255.255.255
      ! eth1.105 -> kai eth1.105
      real_server 10.1.53.2 53 {
          weight 1
          MISC_CHECK {
              misc_path "/usr/bin/dig -b 10.1.53.1 a resolve.test.roka.net @10.1.53.2 +time=1 +tries=5 +fail > /dev/null"
              misc_timeout 6
          }
      }
      ! eth1.109 -> kai eth1.109
      real_server 10.3.53.2 53 {
          weight 1
          MISC_CHECK {
              misc_path "/usr/bin/dig -b 10.3.53.1 a resolve.test.roka.net @10.3.53.2 +time=1 +tries=5 +fail > /dev/null"
              misc_timeout 6
          }
      }
  }

As you can dig ;-), the health check queries an A record with a low TTL, because this setup is a recursive DNS cluster and the check should regularly force a real recursive lookup instead of a cache hit. So far, dig works fine with 44 real_servers configured on an idle dual PIII 800.
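To illustrate what the lack of persistence means in practice, here is a minimal Python sketch of weighted round-robin selection, modeled on the scheduling described in the LVS documentation. It is not the kernel implementation, and the function name `weighted_round_robin` is made up for this example. With the equal weights used above, successive queries simply alternate between the two real servers, so two queries from the same client may well be answered by different BIND processes.

```python
from functools import reduce
from itertools import islice
from math import gcd


def weighted_round_robin(servers, weights):
    """Yield servers forever in weighted round-robin order.

    Illustrative only: a user-space model of the wrr idea, not LVS code.
    """
    max_weight = max(weights)
    if max_weight <= 0:
        return  # no server is eligible
    step = reduce(gcd, weights)
    index, current = -1, 0
    while True:
        index = (index + 1) % len(servers)
        if index == 0:
            # Start of a new pass: lower the weight threshold by one step,
            # wrapping around to the maximum weight when it reaches zero.
            current -= step
            if current <= 0:
                current = max_weight
        if weights[index] >= current:
            yield servers[index]


# The two real servers from the configuration above, both with weight 1.
scheduler = weighted_round_robin(["10.1.53.2", "10.3.53.2"], [1, 1])
first_four = list(islice(scheduler, 4))
# first_four alternates: 10.1.53.2, 10.3.53.2, 10.1.53.2, 10.3.53.2
```

Raising the weight of one real server would make it appear proportionally more often in the sequence.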


On real_server kai we use the following netfilter setup to direct the traffic to different BIND processes on the same machine/MAC address:

  #DNAT 194.97.173.124->10.1.53.2 eth1.105
  iptables -t nat -A PREROUTING -i eth1.105 -s $net -d 194.97.173.124 -p tcp --dport 53 -j DNAT --to-destination 10.1.53.2:53
  iptables -t nat -A PREROUTING -i eth1.105 -s $net -d 194.97.173.124 -p udp --dport 53 -j DNAT --to-destination 10.1.53.2:53
  #DNAT 194.97.173.124->10.3.53.2 eth1.109
  iptables -t nat -A PREROUTING -i eth1.109 -s $net -d 194.97.173.124 -p tcp --dport 53 -j DNAT --to-destination 10.3.53.2:53
  iptables -t nat -A PREROUTING -i eth1.109 -s $net -d 194.97.173.124 -p udp --dport 53 -j DNAT --to-destination 10.3.53.2:53

BIND9

When I wrote this example we were running two BIND processes on the same machine, because BIND9 currently just runs faster when it is not threading. Here is something JINMEI Tatuya told me on the bind9-workers mailing list, which turned out to be very true:

  If you go with disabling threads, you may also want to enable
  "internal memory allocation".  (I hear that) it should use memory more
  efficiently (and can make the server faster) but is disabled by
  default due to response-performance reasons in the threaded case.  You
  can enable this feature by adding the following line

      #define ISC_MEM_USE_INTERNAL_MALLOC 1

  just before the following part of bind9/lib/isc/mem.c:

      #ifndef ISC_MEM_USE_INTERNAL_MALLOC
      #define ISC_MEM_USE_INTERNAL_MALLOC 0
      #endif

Try it and you will keep it. ;)

The BIND 9.4 line now uses this internal malloc library by default, but disabling threading will probably still free you from the hiccups some BIND9 users are experiencing.

PowerDNS recursor

This is a recursive-only nameserver with very limited authoritative DNS capabilities. The author of this example now uses PowerDNS recursor exclusively for his caching-only DNS cluster and is glad that, while delivering roughly the same queries-per-second performance, it generates fewer SERVFAIL answers and is generally several times more robust than BIND9.

Added redundancy via iBGP

If you have more than one load balancer at different locations, and you can convince your local network engineer to let you speak BGP4+ to his routers, you can use quagga with something like the following configuration to fail the service IP over to the second LB if the first one goes down:

  !
  router bgp 5430
   no synchronization
   bgp router-id a.b.c.d
   redistribute connected route-map benice
   neighbor c.d.e.f remote-as 5430
   neighbor c.d.e.f description ffm4-j2
   neighbor c.d.e.f send-community both
   neighbor c.d.e.f soft-reconfiguration inbound
   neighbor c.d.e.f route-map nixda in
   neighbor c.d.e.f route-map benice out
   neighbor d.c.f.e remote-as 5430
   neighbor d.c.f.e description ffm4-j
   neighbor d.c.f.e send-community both
   neighbor d.c.f.e soft-reconfiguration inbound
   neighbor d.c.f.e route-map nixda in
   neighbor d.c.f.e route-map benice out
   no auto-summary
  !
  access-list line permit 127.0.0.1/32 exact-match
  access-list line deny any
  !
  ip prefix-list cns-dus2 description dus2 high-metric eq low-preference
  ip prefix-list cns-dus2 seq 5 permit 194.97.173.125/32
  ip prefix-list cns-dus2 seq 10 deny any
  ip prefix-list cns-ffm4 description ffm4 low-metric eq high-preference
  ip prefix-list cns-ffm4 seq 5 permit 194.97.173.124/32
  ip prefix-list cns-ffm4 seq 10 deny any
  !
  route-map benice permit 10
   match ip address prefix-list cns-ffm4
   set local-preference 100
   set metric 0
  !
  route-map benice permit 20
   match ip address prefix-list cns-dus2
   set local-preference 100
   set metric 1
  !
  route-map nixda deny 10

This is the LB at FFM4. Note that the metrics at the DUS2 LB are just the other way around. Here we talk to two core routers from each LB for extra redundancy. You can also have an internal anycast service IP if you use the same metric at both LBs and make sure they are attached at the same level of the router topology. This way traffic is shared between the two load balancers according to your network topology, which is most interesting for large dial-in ISPs.

Problem

dig does not return a non-zero exit code when it receives a SERVFAIL, but there are situations in which some BIND9 versions return SERVFAIL for any query, for example when they are out of memory. In a recursive DNS cluster we want to take such BIND processes out of service.

Workaround

Use the following Perl script as a wrapper for dig. This is quite ugly: Perl is an interpreted language and forking it is not much fun, so this consumes a lot of user CPU when executed every 6 seconds.

  #!/usr/bin/perl
  use strict;
  use warnings;
  # cmdline arguments: <FromIP> <Class> <QTYPE> <QNAME> <ToIP> <Times> <Tries> <ErrorMatch> <Transport>
  # which results in a call like:
  # /usr/bin/dig -b 10.5.53.1 IN A 2.0.0.127.my.test @10.5.53.2 +time=1 +tries=5 +fail
  # Validate every argument before it is interpolated into the shell command.
  if(
         ((defined $ARGV[0])&&($ARGV[0]=~/^\d+\.\d+\.\d+\.\d+$/))
         &&((defined $ARGV[1])&&($ARGV[1]=~/^(IN|CHAOS)$/))
         &&((defined $ARGV[2])&&($ARGV[2]=~/^(A|ANY|MX|PTR|SRV|TXT|AAAA|NS|CNAME|SOA)$/))
         &&((defined $ARGV[3])&&($ARGV[3]=~/^[A-Za-z0-9\-\.]+$/))
         &&((defined $ARGV[4])&&($ARGV[4]=~/^\d+\.\d+\.\d+\.\d+$/))
         &&((defined $ARGV[5])&&($ARGV[5]=~/^\d+$/))
         &&((defined $ARGV[6])&&($ARGV[6]=~/^\d+$/))
         &&((defined $ARGV[7])&&($ARGV[7]=~/^\S+$/))
         ) {
         # Default to UDP; only an explicit "tcp" switches the transport.
         my $transport="notcp";
         if((defined $ARGV[8])&&($ARGV[8]=~/^tcp$/i)) {
                 $transport="tcp";
         } elsif ((defined $ARGV[8])&&($ARGV[8]=~/^udp$/i)) {
                 $transport="notcp";
         }
         my (@res)=`/usr/bin/dig -b $ARGV[0] $ARGV[1] $ARGV[2] $ARGV[3] \@$ARGV[4] +time=$ARGV[5] +tries=$ARGV[6] +fail +$transport 2>&1`;
         my $return=$?;
         # Fail if the status in dig's output matches <ErrorMatch> (e.g. SERVFAIL).
         if(my $error=(map {/status:\s*($ARGV[7])/ ? $1 : ()} @res)[0]) {
                 die("$error");
         } elsif ($return!=0) {
                 die("dig returned: \"$return\"");
         } elsif ($return==0) {
                 exit 0;
         } else {
                 die("error: \"$return\" HAS BAD VALUE!");
         }
  } else {
         die("dig-wrapper.pl <FromIP> <Class> <QTYPE> <QNAME> <ToIP> <Times> <Tries> <ErrorMatch> <Transport>");
  }

Ah yes, forgot to say: the dual PIII 800 is not idling anymore. It is busy running this script 44 times every 6 seconds, which accounts for roughly 12% user CPU and 5% system CPU at a query rate of ~3600 q/s.

Solution

Use a patched version of dig?
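Another direction, offered here only as a sketch, would be a check that speaks DNS itself and looks at the RCODE in the response header, so no dig process has to be forked and parsed. The Python example below assumes plain UDP and the RFC 1035 header layout. The function names (`build_query`, `response_rcode`, `check_resolver`) are hypothetical, treating only NOERROR as healthy is an assumption, and it has not been tested as a keepalived MISC_CHECK.

```python
import random
import socket
import struct

# RCODE values from RFC 1035, section 4.1.1.
RCODE_NAMES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
               3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def build_query(qname, query_id, qtype=1, qclass=1):
    """Build a minimal DNS query (default: IN A) with the RD bit set."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, qclass)


def response_rcode(response, query_id):
    """Return the RCODE of a response; raise ValueError if it is not ours."""
    if len(response) < 12:
        raise ValueError("short DNS response")
    response_id, flags = struct.unpack("!HH", response[:4])
    if response_id != query_id or not flags & 0x8000:
        raise ValueError("not a response to our query")
    return flags & 0x000F


def check_resolver(server, qname, source_ip="", timeout=1.0, tries=5):
    """Return True only if the server answers the query with NOERROR."""
    query_id = random.randrange(0x10000)
    query = build_query(qname, query_id)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((source_ip, 0))  # like "dig -b"; "" means any local address
        sock.settimeout(timeout)
        for _ in range(tries):
            try:
                sock.sendto(query, (server, 53))
                response, _addr = sock.recvfrom(4096)
                return response_rcode(response, query_id) == 0
            except (OSError, ValueError):
                continue  # timeout or stray packet: try again
    return False
```

A small wrapper script could call `check_resolver("10.1.53.2", "resolve.test.roka.net", source_ip="10.1.53.1")` and exit non-zero when it returns False, which is the kind of result a MISC_CHECK acts on. Whether this is cheaper than the Perl wrapper depends on how the interpreter is started, so it would need to be measured.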

Conclusion

It still just works.

From
http://kb.linuxvirtualserver.org/wiki/Building_Scalable_DNS_Cluster_using_LVS

Posted by flyinweb on 2009-09-24 09:42:06.
