Introduction

A few months ago I published Clustered Filesystem with DRBD and GFS2 on CentOS 5.4 in this wiki. Now I would like to present another option – OCFS2, a clustered filesystem developed by Oracle. OCFS was initially focused on use with Oracle's databases, but OCFS2 is a general-purpose cluster filesystem. It's available as open source, and Oracle does quite a good job of publishing packages and modules for each and every kernel version (so you don't need to re-compile it over and over again when you update your kernel). That at least holds true for RHEL and its derivatives.
OCFS2 works very similarly to GFS2, except that it doesn't use Red Hat's Cluster Manager but instead ships with O2CB, Oracle's own cluster manager. As far as the filesystem is concerned, it does the same thing.

You will notice that this howto is very similar to the GFS2 howto mentioned earlier. In fact the DRBD-related parts are mostly identical.

I know it's tempting to copy & paste things without reading the text. Don't! No really. Don't!

Pre-Requisites

For this short tutorial, I assume that

  • you have set up identical, unused disk partitions on both nodes (/dev/sdf1 in this tutorial)
  • ideally, the two nodes are connected via a distinct network link, and IP addresses have been assigned (I'm using 10.10.10.1 and 10.10.10.2 here)
  • you are running CentOS 5.x on both nodes

Installation

Unless stated otherwise, please do everything on both nodes!

For DRBD, install the required software:

yum install redhat-lsb which drbd83 kmod-drbd83
(use kmod-drbd83-xen instead of kmod-drbd83 if you are running a Xen kernel, as I am)

Then you'll need to get OCFS2 and O2CB (including command line tools) from Oracle's Open Source downloads, for example:

ocfs2-2.6.18-194.3.1.el5xen-1.4.7-1.el5.i686.rpm
ocfs2-tools-1.4.4-1.el5.i386.rpm

It is vital that you use the OCFS2 version that exactly matches your kernel – either by downloading the appropriate RPMs or by compiling from scratch (not covered by this tutorial). The reason is that OCFS2 is handled by a kernel module.
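
If you are unsure which package fits, compare your running kernel release with the version string embedded in the RPM file name before installing (just a quick sanity check; the file name below is the one from this example):

# running kernel release, e.g. 2.6.18-194.3.1.el5xen
uname -r
# the ocfs2 kernel-module package must carry exactly this string in its name:
# ocfs2-2.6.18-194.3.1.el5xen-1.4.7-1.el5.i686.rpm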

Install both RPMs as you normally would:

rpm -Uhv ocfs2-*.rpm

Configuring the Cluster

Now everything is prepared. Let's throw in the configuration files...

First, DRBD. Your /etc/drbd.conf should look similar to this (replace hostnames/IP/devices according to your setup):

global { usage-count yes; }

# you may want to lower the rate, if you're not using a dedicated GBit link 
# between both peers; 20M worked well on EC2
common { syncer { rate 100M; } }

resource res0 {
  protocol C;
  startup {
    wfc-timeout 20;
    degr-wfc-timeout 10;
    # we will keep this commented until tested successfully:
    # become-primary-on both; 
  }
  net {
    # the encryption part can be omitted when using a dedicated link for DRBD only:
    # cram-hmac-alg sha1;
    # shared-secret anysecrethere123;

    allow-two-primaries;
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
  on node1 {
    device /dev/drbd1;
    disk /dev/sdf1;
    address 10.10.10.1:7789;
    meta-disk internal;
  }
  on node2 {
    device /dev/drbd1;
    disk /dev/sdf1;
    address 10.10.10.2:7789;
    meta-disk internal;
  }
  disk {
    fencing dont-care;
  }
}

Second, the cluster definition for O2CB, which needs to be stored in /etc/ocfs2/cluster.conf (the directory needs to be created first – see the sketch after the listing):

cluster:
        node_count = 2
        name = testcluster

node:
        ip_port = 7777
        ip_address = 10.10.10.1
        number = 1
        name = node1
        cluster = testcluster

node:
        ip_port = 7777
        ip_address = 10.10.10.2
        number = 2
        name = node2
        cluster = testcluster
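
The /etc/ocfs2 directory does not exist by default, so create it before saving the file. On both nodes, a minimal sequence (assuming you paste the configuration shown above) could look like this:

mkdir -p /etc/ocfs2
vi /etc/ocfs2/cluster.conf    # paste the cluster/node definitions from above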

Note: You want to be very careful with the formatting of that file:

  • the leading whitespace has to be a single tab
  • between the cluster and node definitions there must be a single empty line

Furthermore, the "name" in the OCFS2 configuration's node sections must match the hostname given after "on" in the DRBD configuration, and those names must resolve to local IPs (ideally those on the dedicated network link, of course).

To avoid both resolution problems and unnecessary DNS queries, I suggest putting the names into /etc/hosts on both peers:

10.10.10.1   node1
10.10.10.2   node2

Finally, you would want to add exceptions to iptables in order to allow both nodes to communicate (open ports tcp/7777 and tcp/7789 on both nodes for the other peer).
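
On node1 the rules could look roughly like this (a sketch only, assuming a default INPUT chain; mirror it on node2 with 10.10.10.1 as the source address):

iptables -I INPUT -p tcp -s 10.10.10.2 --dport 7777 -j ACCEPT    # O2CB
iptables -I INPUT -p tcp -s 10.10.10.2 --dport 7789 -j ACCEPT    # DRBD
service iptables save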

Now let's run through O2CB's initial configuration:

service o2cb configure

This will throw a bunch of questions at you. Just accept the default values except for the cluster name, which is testcluster in our case.

O2CB will start after configuration. When you've done that on both nodes, you can run service o2cb status to get some positive feedback.
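
On CentOS the answers you give usually end up in /etc/sysconfig/o2cb. If the cluster does not come online right away, the following commands (run on both nodes) generally help – treat this as a sketch, the exact subcommands depend on the o2cb init script version you have installed:

service o2cb load                  # load the O2CB kernel modules
service o2cb online testcluster    # bring our cluster online
service o2cb status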

Initialisation of the cluster resource

Now that we've configured everything, you are ready to start the drbd service on both nodes. Be careful to do it within a short interval on both to avoid timeouts.
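
The init script comes with the drbd83 package:

service drbd start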

On both nodes this should finish successfully. To verify that the resource is online (although not yet initialised), run:

$ drbd-overview

  1:res0  Unconfigured . . . . 

Brilliant. So let's configure:

# create meta data information:
$ drbdadm create-md res0

# let's make both nodes communicate:
$ drbdadm up res0

# check the status again:
$ drbd-overview

  1:res0  Connected Secondary/Secondary Inconsistent/Inconsistent C r---- 

So now we've got two secondaries, which are inconsistent. That's expected as we haven't set any node to primary yet, and DRBD cannot know that both partitions are clean.

The following step must only be run on the first node

Let's sync then:

$ drbdadm -- --overwrite-data-of-peer primary res0

This syncs data from the local to the remote node.

Now we can check the status of the synchronisation:

$ drbd-overview 
  1:res0  SyncSource Primary/Secondary UpToDate/Inconsistent C r---- 
        [=>..................] sync'ed: 14.3% (4388/5112)M

It may take quite some time to finish. Be patient – a good time to have a coffee before we continue...

In the end, we want to see this (a working primary/secondary setup with no remaining inconsistencies):

$ drbd-overview 
  1:res0  Connected Primary/Secondary UpToDate/UpToDate C r---- 

But we did want primary/primary, didn't we? So let's promote the other node, too. On the second node run this:

$ drbdadm primary res0
$ drbd-overview 
  1:res0  Connected Primary/Primary UpToDate/UpToDate C r---- 

There we go!

Now is the right time to uncomment this line in /etc/drbd.conf:

previously:
    # become-primary-on both; 
now:
    become-primary-on both; 
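
After editing, you can let drbdadm re-parse the configuration to catch typos; this only dumps the parsed config and does not change the running resource:

drbdadm dump res0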

Let's quickly recap what we've done so far:
We've defined two nodes in the cluster, which will be "protected" by o2cb (the cluster manager) later.
DRBD has been set up successfully to provide a block device drbd1 in the resource res0, which is in sync with the respective remote node, and can be used for read and write operations on both nodes.
What's left to do is to create a file system and mount it.

Create and mount the OCFS2 filesystem

Although it might be tempting to use a file system you are more familiar with (e.g. ext3 or XFS), you must not do this! These file systems are not cluster-aware and don't know at all what's going on at the other node. The filesystem would become inconsistent very soon and possibly corrupt your data!

Here's all you have to do on one of the nodes:

mkfs -t ocfs2 -N 2 -L ocfs2_drbd1 /dev/drbd1
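
-N 2 reserves two node slots (one per cluster node) and -L sets a volume label. If you like, you can sanity-check the result from either node before mounting – blkid is part of the standard CentOS install and should report the ocfs2 type and the label:

blkid /dev/drbd1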

You can then mount it on both nodes:

mount /dev/drbd1 /mnt

or better (to avoid unnecessary writes):

mount -o noatime,nodiratime /dev/drbd1 /mnt

noatime,nodiratime prevent the last access time (read access) from being stored with file and directory entries. In 99% of all use-cases that is unnecessary information, which only costs performance.

Well done! You should now be able to write to and read from either node.

You may want to double-check that o2cb and drbd are both started during boot (with chkconfig). Also, make sure that this happens in the right order: first drbd, then o2cb.
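
For example (the start order at boot time is determined by the S priorities of the runlevel links, e.g. in /etc/rc3.d – verify that drbd comes before o2cb):

chkconfig drbd on
chkconfig o2cb on
chkconfig --list drbd
chkconfig --list o2cb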

Some notes on SELinux

OCFS2 has got one downside: it doesn't support SELinux labels, so you can't use filesystem contexts on an OCFS2 partition. OCFS2 will work with SELinux in enforcing mode, but you can't benefit from the added filesystem security. Oracle is aware of that and apparently working on implementing it in future versions.

Until then your files will show up as, and remain, unlabeled_t.
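
You can see this for yourself once the filesystem is mounted (using the mount point from the example above):

ls -Z /mnt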

However, you can use OCFS2 filesystems with Apache and SELinux in enforcing mode, if you add a small module which allows httpd to access unlabeled_t filesystems:

cat << EOF > httpocfs.te

module httpocfs 1.0.11;

require {
        type unlabeled_t;
        type httpd_t;
        type initrc_t;
        class dir { search getattr write create read add_name remove_name rmdir };
        class file { read getattr write unlink create append setattr };
        class lnk_file { read getattr write unlink };
        class tcp_socket { read write };

}

#============= httpd_t ==============
allow httpd_t unlabeled_t:dir { search getattr write create read add_name remove_name rmdir };
allow httpd_t unlabeled_t:file { read getattr write unlink create append setattr };
allow httpd_t unlabeled_t:lnk_file { read getattr write unlink };


# if this is missing, you may end up with kernel panic with selinux enforced,
# while starting httpd 
allow httpd_t initrc_t:tcp_socket { read write };

EOF
checkmodule -M -m -o httpocfs.mod httpocfs.te 
semodule_package -o httpocfs.pp -m httpocfs.mod
semodule -i httpocfs.pp

semodule -l | grep httpocfs    # should now list the new module

As a normal, intact SELinux system wouldn't really have unlabelled files anywhere else, you might actually be able to use this as a security feature by tweaking other services' policies a little... just a thought. Certainly not trivial.

Failure Recovery

The main reason why I added this OCFS2 alternative to the tutorial collection is that GFS2 requires fencing (see Clustered Filesystem with DRBD and GFS2 on CentOS 5.4) and does not behave well without a real fencing device. OCFS2 handles that a bit more elegantly. You won't get away without a short "freeze" either, but after about 30-40 seconds, OCFS2 will degrade the cluster and resume normal operation on the remaining node. (It may be possible to tune this behaviour a bit, although you want to be careful not to let it degrade too quickly, because otherwise a short network hiccup will lead to split-brains.)
No manual intervention is required with OCFS2 (except that you want to repair the broken node soon, of course), nor do you need a fencing device.

In a split-brain situation (network link down, but both nodes otherwise working fine), one of the nodes will declare itself Master and the other one will be marked failed. This is good and safe, because it prevents data from being written to both nodes, which would lead to a situation where you've got different new data on each node. There wouldn't be any way to merge the two afterwards; part of your data would be lost.
You can then manually resolve the degradation after a split-brain by:

  • un-mounting the partition on the degraded node
  • disconnecting the underlying DRBD resource from the peer
  • declaring the node secondary (the "good" one still being primary)
  • reconnecting the resource so that the degraded node syncs itself from the primary
  • logging on to the other node, which has switched itself to standalone in the meantime, and telling it to reconnect
    That'll initiate a synchronisation from primary to secondary. Note that not everything will be rewritten. DRBD will check the journals, so this operation will usually be much faster than the very first synchronisation.

In code it looks like this (assuming the filesystem is mounted at /shared; adjust to your own mount point); first the outdated/degraded node:

umount /shared
drbdadm disconnect res0
drbdadm secondary res0
drbdadm -- --discard-my-data connect res0

Now on the primary (or now standalone) node:

drbdadm connect res0

Then check with drbd-overview what the status is, and as soon as it's finished (both nodes listed as UpToDate), you can promote the secondary node to primary again, and mount the partition:

drbdadm primary res0
mount -o noatime,nodiratime /dev/drbd1 /shared

The mount and very first filesystem operation take a short moment, and afterwards all is back up and running as primary/primary.

By the way, normal reboots or unmounts/remounts at either end of the cluster don't do any harm or freeze I/O, not even for a short moment. After a reboot, the returning node will simply ask its peer to send it the data that has changed in the meantime and then resume normal operation in primary/primary mode.
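
If you want to take a node down manually (e.g. for maintenance), the safe order is simply the reverse of the startup order – a sketch, using the mount point from the recovery example above:

umount /shared
service o2cb stop
service drbd stop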

