
Introduction

The intention of this document is to give you a short walk-through of how to set up a filesystem that replicates across two web nodes and allows concurrent access from both. This scenario is particularly useful when you intend to load-balance or automatically fail over two web nodes that write session information or user-uploaded content to disk. There are many web applications out there that are not cluster-aware but will work nicely with this transparent type of setup.
Common alternatives for accessing the same data from multiple nodes are NFS, iSCSI, Samba (don't beat me!), and many others. However, unless you buy very expensive kit, all of them usually store the data in a single physical box using a single network connection, a single power connector, etc. You see the issue? You can access the data from anywhere you want, but if that particular machine breaks, none of your nodes can access the data any more, and recovery will take a lot of (down)time.

DRBD solves this issue. You can consider a DRBD cluster a RAID 1 setup where the two disks of the array reside in different physical (or virtual) machines. A Network RAID if you will.

GFS2 (like GFS, OCFS2, and others) is a cluster-aware filesystem, which you need on top of DRBD to make sure that both nodes remain consistent. You can't use purely local filesystems like ext3 or ext4 in that setup. However, as GFS2 is as POSIX-compliant as ext3/4, there's nothing to worry about. Once it is set up and running, from a systems administrator's and developer's point of view DRBD/GFS2 behaves like any locally mounted partition. In a load-balanced setup it is nice to know that DRBD will always read from and write to the local node (write operations are then pushed to the remote node), which – at least in theory – makes read operations faster than from any "traditional" network storage.

Please note that I don't want to give the impression that a DRBD/GFS2 setup is as robust and reliable as, for example, a NetApp® appliance. If you can afford such expensive hardware or are in a mission-critical enterprise environment, go for it! However, DRBD/GFS2 is an excellent low-budget (read: free and open source) alternative. Another option you may want to consider is GlusterFS, which is a lot more flexible, but does not cope well with high I/O load (at least not until version 2.0.7, which was the last one I've seen myself in a production environment).
On the other hand, GFS2 is not very well documented (RedHat, who have developed GFS, apparently want people to buy their professional support packages).

Anyway, let's get started!

I know it's tempting to copy & paste things without reading the text. Don't! No really. Don't!

What you need

For this short tutorial, I assume that

  • you have set up identical unused disk partitions on both nodes (xvdb in this tutorial, because I am demoing this in a Citrix XenServer cluster, where both nodes sit on different physical machines)
  • the two nodes are connected via a distinct GBit link, and IP addresses have been assigned (I'm using 10.10.10.11 and 10.10.10.12 here, on interface eth2)
  • you are running CentOS 5.x on both nodes

To make your life easier at this point, I'm assuming that neither of the two nodes is in production use at the moment. They shouldn't even be accessible from the public internet. Only if this is true can you safely switch off iptables and SELinux temporarily. The SELinux part in particular is tricky for systems that are not up to date or can't apply the latest default SELinux policies (which may be the case for heavily customised setups).

Right. Now that all pre-requisites are met, let's really get started!

If you consider applying this manual to other Linux distributions, be very careful. In particular, the implementation shipped with Ubuntu Server is outdated and possibly broken. You may want to compile from scratch instead. You can expect the best support for GFS2 on RedHat and derivatives thereof.

Installing and preparing the required software

Unless stated otherwise, please do everything on both nodes!

First, install the required software:

yum groupinstall 'Cluster Storage'
yum install drbd83 kmod-drbd83   # or kmod-drbd83-xen in my case
yum install which

I've listed which explicitly because it is required by one of the tools but not listed in its dependencies. It will therefore not be installed automatically, and figuring that out will cause you a lot of pain!

Then you need to create entries in /etc/hosts for both nodes:

10.10.10.11 node1
10.10.10.12 node2

This is important, as the cluster manager needs host names. We do not use valid fully qualified domain names (FQDN) and nameservers to handle this, because we don't want to risk that

  • names cannot be resolved due to DNS downtimes or problems on the public network interfaces
  • unnecessary delays occur
  • somebody (by mistake) assigns new IP addresses in the zone for both nodes or removes their DNS records

You really want to avoid any possible service disruptions once the nodes are up and running! This is one of the rare cases where flexibility does more harm than good.

Make sure both nodes can at least ping each other by name before continuing. It's easy to introduce typos in hostnames and/or IP addresses.
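
If you would rather check the /etc/hosts entries with a script than by eye, something along these lines may help. It is only a minimal sketch: lookup_host is a helper name I made up for this example, and it does nothing more than read a hosts-format file.

```shell
# lookup_host FILE NAME: print the IP address that a hosts-format file maps NAME to.
# (Hypothetical helper for illustration, not part of any package.)
lookup_host() {
    awk -v name="$2" '
        { sub(/#.*/, "") }                      # drop comments
        { for (i = 2; i <= NF; i++)             # fields after the IP are host names
              if ($i == name) { print $1; exit } }
    ' "$1"
}

# Usage (on each node):
#   lookup_host /etc/hosts node1    # should print 10.10.10.11
#   lookup_host /etc/hosts node2    # should print 10.10.10.12
#   ping -c 3 node2                 # from node1, and vice versa
```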

By default, RedHat and CentOS assume a different combination of running services for the cluster, which is why the default startup order would not work in our case. Open /etc/init.d/drbd and change this:

# remove:
# chkconfig: 345 70 08

# insert:
# chkconfig: 345 22 78

Make sure to keep the hash ('#') in front of 'chkconfig'. Then let's activate DRBD:

chkconfig --level 345 drbd on
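
If you have to repeat the header change on several machines, it can be scripted. The following is a sketch under the assumption that the header line looks exactly like the one shown above; set_chkconfig_priorities is a name I invented for this example.

```shell
# set_chkconfig_priorities FILE START STOP: rewrite the "# chkconfig:" header line
# in an init script, keeping the run levels and replacing the two priorities.
# (Hypothetical helper for illustration.)
set_chkconfig_priorities() {
    file=$1 start=$2 stop=$3
    tmp=$(mktemp) || return 1
    if sed "s/^# chkconfig: \([0-9-]*\) [0-9][0-9]* [0-9][0-9]*\$/# chkconfig: \1 $start $stop/" \
           "$file" > "$tmp"; then
        cat "$tmp" > "$file"    # overwrite in place so the script keeps its permissions
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}

# Usage:
#   set_chkconfig_priorities /etc/init.d/drbd 22 78
#   chkconfig --level 345 drbd on
#   chkconfig --list drbd          # shows the run levels drbd is enabled in
```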

Configuring the Cluster

Now everything is prepared. Let's throw in the configuration files...

First, drop this into /etc/cluster/cluster.conf (which will not exist yet):

<?xml version="1.0"?>
<cluster alias="cluster-setup" config_version="1" name="cluster-setup">
  <rm log_level="4"/>
  <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="node1" nodeid="1" votes="1">
      <fence>
        <method name="2">
          <device name="LastResortNode01"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2" nodeid="2" votes="1">
      <fence>
        <method name="2">
          <device name="LastResortNode02"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_manual" name="LastResortNode01" nodename="node1"/>
    <fencedevice agent="fence_manual" name="LastResortNode02" nodename="node2"/>
  </fencedevices>
  <rm/>
  <totem consensus="4800" join="60" token="10000" token_retransmits_before_loss_const="20"/>
</cluster>

This file tells cman, the cluster manager, which nodes we have and how we want to separate them from each other (fencing) if they lose their connection. In this example, we go for manual fencing. In a production environment, you will want to use real fencing devices, like IPMI or remote-controlled power switches (yes, we are that brutal, because we would rather switch off a faulty node and lose data only there than lose or corrupt data on both nodes!).
Why is this so important? Well, easy answer: When you've got a primary/primary cluster, and one of the nodes writes data that never reaches the other one, you've got a problem. It gets even worse if both nodes write changes only locally and can't publish them any more. The fencing daemon blocks access to the file system while the cluster is in an "unclean" state. If we don't have a real fencing device, we have to degrade the cluster manually (file system operations are blocked during that time). With a fencing device, this operation (switching off power automatically) is much faster. In either case, we have to be very careful when we re-introduce the manually repaired node. The best bet is probably to switch DRBD to primary/secondary and force a full re-sync of the secondary (the previously broken node).
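
Since cman depends on the node names from cluster.conf being resolvable, a quick cross-check against /etc/hosts does no harm. The sketch below assumes that name is the first attribute of each clusternode element, as in the file above; both function names are made up for this example.

```shell
# cluster_node_names CONF: print the name attribute of every <clusternode> element.
# Assumes name="..." directly follows "<clusternode", as in the example file.
cluster_node_names() {
    sed -n 's/.*<clusternode name="\([^"]*\)".*/\1/p' "$1"
}

# missing_hosts CONF HOSTS: print each cluster node name that has no entry in HOSTS.
missing_hosts() {
    for name in $(cluster_node_names "$1"); do
        awk -v n="$name" '
            { sub(/#.*/, ""); for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "$2" || echo "$name"
    done
}

# Usage:
#   missing_hosts /etc/cluster/cluster.conf /etc/hosts   # no output means all names are present
```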

Now let's configure DRBD. Edit /etc/drbd.conf as follows (assuming you've got GBit ethernet between both nodes; otherwise you may want to change the maximum syncer rate):

global { usage-count yes; }
common { syncer { rate 100M; } }
resource res0 {
  protocol C;
  startup {
    wfc-timeout 20;
    degr-wfc-timeout 10;
    # we will keep this commented until tested successfully:
    # become-primary-on both; 
  }
  net {
    # peer authentication (not encryption); can be omitted when using a dedicated link for DRBD only:
    # cram-hmac-alg sha1;
    # shared-secret anysecrethere123;
    allow-two-primaries;
  }
  on node1 {
    device /dev/drbd1;
    disk /dev/xvdb1;
    address 10.10.10.11:7789;
    meta-disk internal;
  }
  on node2 {
    device /dev/drbd1;
    disk /dev/xvdb1;
    address 10.10.10.12:7789;
    meta-disk internal;
  }
  disk {
    fencing resource-and-stonith;
  }
  handlers {
    #outdate-peer "/sbin/handler";
  }
}

What we do here is set up a resource called res0, which is shared across node1 and node2. We allow two writable primary nodes (without that, only the primary can use the device; a secondary cannot even mount it). You will also have noticed the optional cram-hmac-alg and shared-secret settings. Strictly speaking, they authenticate the peer rather than encrypt the traffic, and they are highly recommended if DRBD uses shared network links. If you have a distinct link for DRBD traffic only, you can leave them out.
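
If your link is slower than GBit, you need some way to pick a syncer rate. The sketch below assumes the rule of thumb I remember from the DRBD User's Guide, namely about 30% of the bandwidth available for replication. That figure is my assumption and is not what the 100M in the configuration above is based on; syncer_rate_mb is a made-up helper name.

```shell
# syncer_rate_mb BANDWIDTH_MB: suggest a syncer rate as 30% of the usable
# bandwidth (both in MB/s), rounded down. The 30% is an assumed rule of thumb.
syncer_rate_mb() {
    echo $(( $1 * 30 / 100 ))
}

# Example: if the dedicated link sustains roughly 110 MB/s,
#   syncer_rate_mb 110    # prints 33, i.e. "rate 33M;" in the syncer section
```

Treat the result as a starting point only; measure what your link and disks can actually sustain.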

Initialisation of the cluster resource

Now that we've configured everything, you are ready to start the drbd service on both nodes. Be careful to start it on both within a short interval to avoid timeouts.

iptables -I OUTPUT -o eth2 -j ACCEPT
iptables -I INPUT -i eth2 -j ACCEPT
service iptables save
service drbd start

This assumes that eth2 is not used for anything else (it's a crossover cable between two boxes or a distinct VLAN, which is not used by anything else).

On both nodes this should finish successfully. To verify that the resource is online (although not yet initialised), run:

$ drbd-overview

  1:res0  Unconfigured . . . . 

Brilliant. So let's configure:

# create meta data information:
$ drbdadm create-md res0

# let's make both nodes communicate:
$ drbdadm up res0

# check the status again:
$ drbd-overview

  1:res0  Connected Secondary/Secondary Inconsistent/Inconsistent C r---- 

So now we've got two secondaries, which are inconsistent. That's expected as we haven't set any node to primary yet, and DRBD cannot know that both partitions are clean.

The following step must only be run on the first node

Let's sync then:

$ drbdadm -- --overwrite-data-of-peer primary res0

This syncs data from the local to the remote node.

Now we can check the status of the synchronisation:

$ drbd-overview 
  1:res0  SyncSource Primary/Secondary UpToDate/Inconsistent C r---- 
        [=>..................] sync'ed: 14.3% (4388/5112)M

It may take quite some time to finish. Be patient. Good time to have a coffee before we continue...

In the end, we want to see this (a working primary/secondary setup with no remaining inconsistencies):

$ drbd-overview 
  1:res0  Connected Primary/Secondary UpToDate/UpToDate C r---- 
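
If you don't want to keep re-running drbd-overview by hand, a small polling loop can wait for that state. This is a sketch that assumes the output format shown above; is_synced and wait_for_sync are names I invented for this example.

```shell
# is_synced LINE: succeed if a drbd-overview line shows a connected resource
# whose disks are UpToDate on both sides.
is_synced() {
    case $1 in
        *" Connected "*" UpToDate/UpToDate "*) return 0 ;;
        *) return 1 ;;
    esac
}

# wait_for_sync RESOURCE [SECONDS]: poll drbd-overview until RESOURCE is in sync.
wait_for_sync() {
    until is_synced "$(drbd-overview | grep ":$1 ")"; do
        sleep "${2:-30}"
    done
}

# Usage:
#   wait_for_sync res0 60 && echo "res0 is in sync"
```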

But we did want primary/primary, didn't we? So let's promote the other node, too. On the second node run this:

$ drbdadm primary res0
$ drbd-overview 
  1:res0  Connected Primary/Primary UpToDate/UpToDate C r---- 

There we go!

Now is the right time to uncomment this line in /etc/drbd.conf:

previously:
    # become-primary-on both; 
now:
    become-primary-on both; 
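
The same change can be made with sed if you prefer not to open an editor on both nodes. A minimal sketch, assuming the line is commented out exactly as above; enable_become_primary is a made-up helper name.

```shell
# enable_become_primary FILE: uncomment the "become-primary-on both;" line,
# leaving its indentation and all other commented lines untouched.
enable_become_primary() {
    tmp=$(mktemp) || return 1
    sed 's/^\([[:space:]]*\)#[[:space:]]*\(become-primary-on both;\)/\1\2/' "$1" > "$tmp" &&
        cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Usage:
#   enable_become_primary /etc/drbd.conf
#   drbdadm dump res0     # prints the parsed configuration, so syntax errors show up here
```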

Let's quickly recap what we've done so far:
We've defined two nodes in the cluster, which will be "protected" by cman (the cluster manager) and fenced (the fencing daemon) later.
DRBD has been set up successfully to provide a block device drbd1 in the resource res0, which is in sync with the respective remote node, and can be used for read and write operations on both nodes.
What's left to do is to create a file system and mount it.

Create and mount the GFS2 filesystem

Although it might be tempting to use a file system that you are more familiar with (e.g. ext3 or XFS), you must not do this! These file systems are not cluster-aware and don't know at all what's going on at the other node. Such a file system will become inconsistent very soon and possibly corrupt your data!

To create a file system, use the GFS2 version of mkfs on only one of the nodes:

$ mkfs.gfs2 -p lock_dlm -t cluster-setup:res0 /dev/drbd1 -j 2

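You have noticed the references to "cluster-setup" and "res0"? The former is the cluster name defined in /etc/cluster/cluster.conf and must match it. The latter is the name of the file system's lock table; I simply reuse the resource name from /etc/drbd.conf here.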
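
To avoid typos in the -t argument, you can read the cluster name straight from cluster.conf. The sketch below uses made-up helper names, and the 16-character limit on the file system name is what I recall from the mkfs.gfs2 man page, so treat it as an assumption.

```shell
# cluster_name CONF: print the name attribute of the <cluster> element.
cluster_name() {
    sed -n '/<cluster /s/.* name="\([^"]*\)".*/\1/p' "$1"
}

# lock_table CONF FSNAME: print CLUSTER:FSNAME for mkfs.gfs2 -t, refusing
# file system names outside 1 to 16 characters (assumed limit, see above).
lock_table() {
    [ "${#2}" -ge 1 ] && [ "${#2}" -le 16 ] || return 1
    echo "$(cluster_name "$1"):$2"
}

# Usage:
#   mkfs.gfs2 -p lock_dlm -t "$(lock_table /etc/cluster/cluster.conf res0)" -j 2 /dev/drbd1
```
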
The file system should be created just fine. Now let's mount it on both nodes for a quick test. Before we can actually do that, we need to start the cluster manager, which will also fire up fenced. You have to do that on both nodes at roughly the same time; otherwise it will hang for minutes, waiting for the other node's fenced to initialise!

$ service cman start

If this went ok (it may take a few seconds), you're ready to use /dev/drbd1 like any other local partition!

Don't forget to mount with options noatime and nodiratime. Without these options Linux will write the last access times to disk whenever you open a file or directory – even if you only want to read them! The information written is absolutely irrelevant in the vast majority of setups (and not to be confused with creation/modification times). Using noatime and nodiratime improves read speed and reduces locking operations – particularly important in a clustered setup.

$ mount -o noatime,nodiratime /dev/drbd1 /mnt
$ df -h /mnt
Filesystem            Size  Used Avail Use% Mounted on
/dev/drbd1            5.0G  259M  4.8G   6% /mnt

My partition is 5 GB in size, 259M of which have been eaten by the meta data sections and journals. That's acceptable as disk space is incredibly cheap these days.
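
To confirm that the mount options really took effect, you can look the mount point up in /proc/mounts. A minimal sketch; has_mount_option is a made-up helper that reads any file in that format.

```shell
# has_mount_option MOUNTS MOUNTPOINT OPTION: succeed if MOUNTPOINT is listed in a
# /proc/mounts-style file with OPTION among its comma-separated mount options.
has_mount_option() {
    awk -v mp="$2" -v opt="$3" '
        $2 == mp { n = split($4, o, ","); for (i = 1; i <= n; i++) if (o[i] == opt) ok = 1 }
        END { exit !ok }
    ' "$1"
}

# Usage:
#   has_mount_option /proc/mounts /mnt noatime || echo "/mnt is mounted without noatime"
```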

If cman hangs at the "Starting fencing..." stage for a long time on both machines, you have missed one or more steps above, and/or selinux and/or iptables are causing trouble. Wait for cman to finish, then de-activate selinux and iptables temporarily:

$ service iptables stop
$ setenforce 0

(For selinux, you may want to have a look at /etc/sysconfig/selinux and set it to permissive rather than enforcing for the time being. You can tackle the lockdown of the system once you know it works.)

If everything went ok until here, let's prove it. On node1 create a 100MB test file and get its md5 checksum:

$ dd if=/dev/urandom of=/mnt/test bs=1024 count=100k
$ md5sum /mnt/test 
2f282b84e7e608d5852449ed940bfc51  /mnt/test

And on node2:

$ md5sum /mnt/test 
2f282b84e7e608d5852449ed940bfc51  /mnt/test

Try it the other way round, too.

Tadaa! Works!
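
If you want to repeat this comparison later without eyeballing checksums, it can be wrapped in two small functions. This is a sketch with made-up helper names; the usage line additionally assumes SSH access between the nodes, which is not part of the setup above.

```shell
# md5_of FILE: print only the checksum column of md5sum's output.
md5_of() {
    md5sum "$1" | cut -d ' ' -f 1
}

# same_md5 FILE EXPECTED: succeed if FILE's checksum equals EXPECTED.
same_md5() {
    [ "$(md5_of "$1")" = "$2" ]
}

# Usage on node2 (assumes you can ssh to node1):
#   same_md5 /mnt/test "$(ssh node1 md5sum /mnt/test | cut -d ' ' -f 1)" && echo "identical"
```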

You now have a working primary/primary DRBD/GFS2 partition, and can write to and read from both nodes. Now it's up to you how to make use of it for load-balanced setups, hot standby, or whatever you have in mind.

You could also use NFS on top of the just created cluster device to publish the data to many consumers. I may write a Howto for a working AutoFS/Heartbeat/NFS/DRBD combination later.

Some notes on SELinux

Obviously you don't want to ignore security features like SELinux on a server (unless you are using a Linux distribution that prefers home-brewed stuff like AppArmor). And while it used to be a bit fiddly to get DRBD/GFS2 working with SELinux, this was fixed at the end of 2009. If you haven't updated your system recently (shame on you!), do it now. At the very least run:

yum update selinux-policy 

However, you still need to be careful to label your DRBD mountpoint and all directories therein correctly. For example, if I wanted to mount the clustered filesystem in /cluster/www in order to use it with Apache, I would need to label it accordingly:

$ chcon -R -t httpd_sys_content_t /cluster/www

SELinux context changes are not immediately visible on the second node! You need to remount the partition on the remote node. However, this is not a big issue as subsequently written files will automatically take the context of the parent directory – on both nodes.

At least for me, DRBD/GFS2 works very well with SELinux in enforcing mode.
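
To check on either node whether a directory carries the expected label, you can pull the type out of its security context. A minimal sketch; selinux_type is a made-up helper, and the usage line assumes a stat build with SELinux support, where %C prints the context.

```shell
# selinux_type CONTEXT: print the type field of a user:role:type[:level] context.
selinux_type() {
    echo "$1" | cut -d : -f 3
}

# Usage (assumes stat supports %C for the SELinux context):
#   [ "$(selinux_type "$(stat -c %C /cluster/www)")" = httpd_sys_content_t ] || echo "wrong label"
```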
