Introduction

The intention of this document is to give you a short walk-through of how to set up a filesystem that replicates across two web nodes and allows concurrent access from both. This scenario is particularly useful when you intend to load-balance or automatically fail over two web nodes that write session information or user-uploaded content to disk. There are many web applications out there which are not cluster-aware, but which will work nicely with this transparent type of setup.

DRBD solves this issue. You can consider a DRBD cluster a RAID 1 setup where the two disks of the array reside in different physical (or virtual) machines. A network RAID, if you will. GFS2 (like GFS, OCFS2, and others) is a cluster-aware filesystem, which you need on top of DRBD to make sure that both nodes remain consistent. You can't use purely local filesystems like ext3 or ext4 in that setup. However, as GFS2 is as POSIX-compliant as ext3/4, there's nothing to worry about. Once it is set up and running, DRBD/GFS2 behaves like any locally mounted partition from a systems administrator's and developer's point of view.

In a load-balanced setup it is nice to know that DRBD will always read from and write to the local node (write operations are then pushed to the remote node), which, at least in theory, makes read operations faster than from any "traditional" network storage.

Please note that I don't want to give the impression that a DRBD/GFS2 setup is as robust and reliable as, for example, a NetApp® appliance. If you can afford such expensive hardware or are in a mission-critical enterprise environment, go for it! However, DRBD/GFS2 is an excellent low-budget (read: free and open source) alternative. Another option you may want to consider is GlusterFS, which is a lot more flexible, but does not cope well with high I/O load (at least not until version 2.0.7, which was the last one I've seen myself in a production environment).

Anyway, let's get started!
What you need

For this short tutorial, I assume that
To make your life easier at this point, I'm assuming that neither of the two nodes is in production use at the moment. They shouldn't even be accessible from the public internet. Only if this is true can you safely switch off iptables and SELinux temporarily. The SELinux part in particular is tricky for systems that are not up to date or cannot apply the latest default SELinux policies (which may be the case for heavily customised setups).

Right. Now that all prerequisites are met, let's really get started!
Installing and preparing the required software

Unless stated otherwise, please do everything on both nodes! First, install the required software:
yum groupinstall 'Cluster Storage'
yum install drbd83 kmod-drbd83   # or kmod-drbd83-xen, as in my case
yum install which
I've listed which explicitly, because it is required by one of the tools but not listed in its dependencies. That is why it will not be installed automatically, and it will give you lots of pain to figure that out!

Then you need to create entries in /etc/hosts for both nodes:

10.10.10.11 node1
10.10.10.12 node2

This is important, as the cluster manager needs host names. We do not use valid fully qualified domain names (FQDN) and nameservers to handle this, because we don't want to risk that
You really want to avoid any possible service disruptions once the nodes are up and running! This is one of the rare cases where flexibility does more harm than good. Make sure both nodes can at least ping each other by name before continuing. It's easy to introduce typos in hostnames and/or IP addresses.

By default, Red Hat and CentOS assume a different combination of running services for the cluster, which is why the default startup order would not work in our case. Open /etc/init.d/drbd and change this:

# remove:
# chkconfig: 345 70 08
# insert:
# chkconfig: 345 22 78

Make sure to keep the hash ('#') in front of 'chkconfig'. Then let's activate DRBD:

chkconfig --level 345 drbd on

Configuring the Cluster

Now everything is prepared. Let's throw in the configuration files... First, drop this into /etc/cluster/cluster.conf (which will not exist yet):

<?xml version="1.0"?>
<cluster alias="cluster-setup" config_version="1" name="cluster-setup">
  <rm log_level="4"/>
  <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="node1" nodeid="1" votes="1">
      <fence>
        <method name="2">
          <device name="LastResortNode01"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2" nodeid="2" votes="1">
      <fence>
        <method name="2">
          <device name="LastResortNode02"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_manual" name="LastResortNode01" nodename="node1"/>
    <fencedevice agent="fence_manual" name="LastResortNode02" nodename="node2"/>
  </fencedevices>
  <rm/>
  <totem consensus="4800" join="60" token="10000" token_retransmits_before_loss_const="20"/>
</cluster>

This file tells cman, the cluster manager, which nodes we have got, and how we want to separate them from each other (fencing) if they lose connection. In this example, we go for manual fencing.
In a production environment, you will want to use real fencing devices, like IPMI or remote-controlled power switches (yes, we are that brutal, because we would rather switch off a faulty node and lose data only there than lose or corrupt data on both nodes!).

Now let's configure DRBD. Edit /etc/drbd.conf as follows (assuming you've got GBit ethernet between both nodes; otherwise you may want to change the maximum syncer rate):
global { usage-count yes; }
common { syncer { rate 100M; } }
resource res0 {
  protocol C;
  startup {
    wfc-timeout 20;
    degr-wfc-timeout 10;
    # we will keep this commented until tested successfully:
    # become-primary-on both;
  }
  net {
    # peer authentication (shared secret) can be omitted when using a dedicated link for DRBD only:
    # cram-hmac-alg sha1;
    # shared-secret anysecrethere123;
    allow-two-primaries;
  }
  on node1 {
    device /dev/drbd1;
    disk /dev/xvdb1;
    address 10.21.127.51:7789;
    meta-disk internal;
  }
  on node2 {
    device /dev/drbd1;
    disk /dev/xvdb1;
    address 10.21.127.52:7789;
    meta-disk internal;
  }
  disk {
    fencing resource-and-stonith;
  }
  handlers {
    #outdate-peer "/sbin/handler";
  }
}
What we do here is set up a resource called res0, which is shared across node1 and node2. We allow two writable primary nodes (without that, only the primary node can use the device; DRBD does not let a secondary be mounted at all, not even read-only). You will also have noticed the optional cram-hmac-alg and shared-secret settings. Strictly speaking, these authenticate the peer rather than encrypt the traffic. They are highly recommended if DRBD uses shared network links. If you have a dedicated link for DRBD traffic only, you can leave them out.

Initialisation of the cluster resource

Now that we've configured everything, you are ready to start the drbd service on both nodes. Start it on both nodes within a short interval to avoid timeouts.

iptables -I OUTPUT -o eth2 -j ACCEPT
iptables -I INPUT -i eth2 -j ACCEPT
service iptables save
service drbd start

This assumes that eth2 is not used for anything else (it's a crossover cable between the two boxes or a distinct VLAN that carries no other traffic). On both nodes this should finish successfully. To verify that the resource is online (although not yet initialised), run:

$ drbd-overview
1:res0 Unconfigured . . . .

Brilliant. So let's configure:

# create meta data information:
$ drbdadm create-md res0
# let's make both nodes communicate:
$ drbdadm up res0
# check the status again:
$ drbd-overview
1:res0 Connected Secondary/Secondary Inconsistent/Inconsistent C r----

So now we've got two secondaries, which are inconsistent. That's expected, as we haven't set any node to primary yet, and DRBD cannot know that both partitions are clean.

The following step must only be run on the first node! Let's sync then:

$ drbdadm -- --overwrite-data-of-peer primary res0

This syncs data from the local to the remote node. Now we can check the status of the synchronisation:
$ drbd-overview
1:res0 SyncSource Primary/Secondary UpToDate/Inconsistent C r----
[=>..................] sync'ed: 14.3% (4388/5112)M
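If you would rather not watch the progress by hand, the status line can be checked from a script. The following is only a minimal sketch: the function names drbd_cstate, drbd_dstate and drbd_in_sync are made up for this example, and it assumes the whitespace-separated field layout of the drbd-overview lines shown in this tutorial (connection state in the second field, disk states in the fourth).

```shell
#!/bin/sh
# Sketch: helpers for reading one status line in the drbd-overview format
# used in this tutorial, e.g.
#   1:res0 Connected Primary/Secondary UpToDate/UpToDate C r----
# All function names are hypothetical, chosen for this example only.

# Print the connection state (second field) of a status line.
drbd_cstate() {
    printf '%s\n' "$1" | awk '{ print $2 }'
}

# Print the disk states (fourth field), e.g. UpToDate/Inconsistent.
drbd_dstate() {
    printf '%s\n' "$1" | awk '{ print $4 }'
}

# Succeed only if the line reports a connected, fully synced resource.
drbd_in_sync() {
    [ "$(drbd_cstate "$1")" = "Connected" ] &&
        [ "$(drbd_dstate "$1")" = "UpToDate/UpToDate" ]
}

# Possible usage (not run here): poll every 30 seconds until res0 is in sync.
#   until drbd_in_sync "$(drbd-overview | grep ':res0 ')"; do sleep 30; done
```

The grep in the usage comment picks only the res0 line, so the progress bar line that appears during synchronisation is ignored.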
It may take quite some time to finish. Be patient. Good time to have a coffee before we continue... In the end, we want to see this (a working primary/secondary setup with no remaining inconsistencies):

$ drbd-overview
1:res0 Connected Primary/Secondary UpToDate/UpToDate C r----

But we did want primary/primary, didn't we? So let's promote the other node, too. On the second node run this:

$ drbdadm primary res0
$ drbd-overview
1:res0 Connected Primary/Primary UpToDate/UpToDate C r----

There we go! Now is the right time to uncomment this line in /etc/drbd.conf:
previously:
# become-primary-on both;
now:
become-primary-on both;
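If you prefer to script this change on both nodes instead of editing the file by hand, something along these lines should do. It is a sketch only: enable_dual_primary is a made-up helper name, and the file path is passed as an argument so that you can try it on a copy before touching /etc/drbd.conf.

```shell
#!/bin/sh
# Sketch: uncomment "become-primary-on both;" in a DRBD configuration file.
# enable_dual_primary is a hypothetical helper, not part of DRBD.
enable_dual_primary() {
    conf=$1
    tmp="$conf.tmp.$$"
    # Drop the leading "#" (and blanks after it) but keep the indentation.
    # Other comment lines are left alone because they do not match.
    sed 's/^\([[:space:]]*\)#[[:space:]]*\(become-primary-on[[:space:]]\{1,\}both;\)/\1\2/' \
        "$conf" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back in place so the original file keeps its owner and mode.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}

# Possible usage (not run here):
#   cp /etc/drbd.conf /tmp/drbd.conf.test
#   enable_dual_primary /tmp/drbd.conf.test
#   diff /etc/drbd.conf /tmp/drbd.conf.test
```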
Let's quickly recap what we've done so far:

Create and mount the GFS2 filesystem
To create a file system, use the GFS2 version of mkfs on only one of the nodes:

$ mkfs.gfs2 -p lock_dlm -t cluster-setup:res0 /dev/drbd1 -j 2

Did you notice the references to "cluster-setup" and "res0"? The former was defined in /etc/cluster/cluster.conf, the latter in /etc/drbd.conf. The -j 2 option creates two journals, one for each node that will mount the filesystem. Next, start the cluster manager:

$ service cman start

If this went ok (it may take a few seconds), you're ready to use /dev/drbd1 like any other local partition!
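The lock table name passed to -t has to start with the cluster name from cluster.conf, and a typo there is easy to miss. As a rough sketch, the name can be read from the file instead of being typed again. The helper names cluster_name and lock_table are invented for this example, and the sed pattern assumes the name attribute sits on the opening cluster tag as in the file used in this tutorial.

```shell
#!/bin/sh
# Sketch: build the mkfs.gfs2 lock table name from cluster.conf.
# cluster_name and lock_table are hypothetical helper names.

# Print the name="..." attribute of the top-level <cluster ...> element.
# "<clusternode ...>" lines do not match, because the pattern requires
# whitespace directly after "<cluster".
cluster_name() {
    sed -n 's/.*<cluster[[:space:]][^>]*name="\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Print "<cluster name>:<filesystem name>"; fail if no cluster name is found.
lock_table() {
    name=$(cluster_name "$1")
    [ -n "$name" ] || return 1
    printf '%s:%s\n' "$name" "$2"
}

# Possible usage (not run here), with res0 as the filesystem name:
#   mkfs.gfs2 -p lock_dlm -t "$(lock_table /etc/cluster/cluster.conf res0)" -j 2 /dev/drbd1
```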
$ mount -o noatime,nodiratime /dev/drbd1 /mnt
$ df -h /mnt
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/drbd1  5.0G  259M  4.8G   6%    /mnt

My partition is 5 GB in size, 259M of which have been eaten by the meta data sections and journals. That's acceptable, as disk space is incredibly cheap these days.

If cman hangs at the "Starting fencing..." stage for a long time on both machines, you have missed one or more steps above, and/or SELinux and/or iptables are causing trouble. Wait for cman to finish, then deactivate SELinux and iptables temporarily:

$ service iptables stop
$ setenforce 0

(For SELinux, you may want to have a look at /etc/sysconfig/selinux and set it to permissive rather than enforcing for the time being. You can tackle the lockdown of the system once you know it works.)

If everything went ok until here, let's prove it. On node1 create a 100MB test file and get its md5 checksum:
$ dd if=/dev/urandom of=/mnt/test bs=1024 count=100k
$ md5sum /mnt/test
2f282b84e7e608d5852449ed940bfc51 /mnt/test
And on node2:

$ md5sum /mnt/test
2f282b84e7e608d5852449ed940bfc51 /mnt/test

Try it the other way round, too.

Tadaa! You now have a working primary/primary DRBD/GFS2 partition, and can write to and read from both nodes. Now it's up to you how to make use of it for load-balanced setups, hot standby, or whatever you have in mind. You could also use NFS on top of the newly created cluster device to publish the data to many consumers. I may write a Howto for a working AutoFS/Heartbeat/NFS/DRBD combination later.

Some notes on SELinux

Obviously you don't want to ignore security features like SELinux on a server (unless you are using a Linux distribution which prefers home-brewed stuff like AppArmor). And while it used to be a bit fiddly to get DRBD/GFS2 working with SELinux, this has been fixed. Make sure your policy is up to date:

yum update selinux-policy

However, you still need to be careful to label your DRBD mountpoint and all directories therein correctly. For example, if I wanted to mount the clustered filesystem in /cluster/www in order to use it with Apache, I would need to label it accordingly:

$ chcon -R -t httpd_sys_content_t /cluster/www
At least for me, DRBD/GFS2 works very well with SELinux in enforcing mode.