Months ago I published Clustered Filesystem with DRBD and GFS2 on CentOS 5.4 in this wiki. Now I would like to present another option: OCFS2, a clustered filesystem developed by Oracle. OCFS was initially focused on use with Oracle's databases, but OCFS2 is a general-purpose cluster filesystem. It's available as open source, and Oracle does quite a good job publishing packages and modules for each and every kernel version (so you don't need to re-compile it over and over again when you update your kernel). That at least holds true for RHEL and its derivatives.
You will notice that this howto is very similar to the GFS2 howto mentioned earlier. In fact the DRBD-related parts are mostly identical.
For this short tutorial, I assume that
Unless stated otherwise, please do everything on both nodes!
For DRBD, install the required software:
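On CentOS 5, DRBD 8.3 userland and the matching kernel module are available from the Extras repository; the package names below assume that version (adjust if your repository ships a different one):

```shell
# DRBD userland tools plus the matching pre-built kernel module
yum install drbd83 kmod-drbd83
```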
Then you'll need to get OCFS2 and O2CB (including command line tools) from Oracle's Open Source downloads, for example:
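The exact file names depend on your running kernel and architecture; the download path and versions below are illustrative placeholders only, so check Oracle's download page for the files matching `uname -r`:

```shell
# the ocfs2 kernel module package must match your running kernel
uname -r
# example file names only - replace <kernel>, <version> and the path
# with what the download page actually offers for your kernel
wget http://oss.oracle.com/projects/ocfs2/dist/files/RedHat/RHEL5/x86_64/ocfs2-<kernel>-<version>.el5.x86_64.rpm
wget http://oss.oracle.com/projects/ocfs2-tools/dist/files/RedHat/RHEL5/x86_64/ocfs2-tools-<version>.el5.x86_64.rpm
```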
Install both RPMs as you normally would:
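Assuming both downloaded RPMs sit in the current directory:

```shell
# installs both the kernel module package and the tools
rpm -Uvh ocfs2*.rpm
```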
Now everything is prepared. Let's throw in the configuration files...
First, DRBD. Your /etc/drbd.conf should look similar to this (replace hostnames/IP/devices according to your setup):
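A sketch of such a configuration, for DRBD 8.3 syntax; the resource name r0, the hostnames node1/node2, the IPs and the backing device /dev/sdb1 are my assumptions, while port 7789 matches the firewall note further down. The allow-two-primaries option is what later permits the primary/primary setup:

```
# /etc/drbd.conf - sketch only; adjust hostnames, IPs and devices
global { usage-count no; }

common { protocol C; }

resource r0 {
    syncer { rate 40M; }
    net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
    }
    startup {
        # uncomment once the initial sync has completed:
        # become-primary-on both;
    }
    on node1 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.1.1:7789;
        meta-disk internal;
    }
    on node2 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.1.2:7789;
        meta-disk internal;
    }
}
```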
Second, the cluster definition for O2CB, which needs to be stored in /etc/ocfs2/cluster.conf (the directory needs to be created!):
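Along these lines; the IPs and hostnames are assumptions matching the DRBD sketch, while port 7777 and the cluster name testcluster come from this tutorial. Attribute lines must be indented with a single tab:

```
node:
	ip_port = 7777
	ip_address = 192.168.1.1
	number = 0
	name = node1
	cluster = testcluster

node:
	ip_port = 7777
	ip_address = 192.168.1.2
	number = 1
	name = node2
	cluster = testcluster

cluster:
	node_count = 2
	name = testcluster
```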
Note: You want to be very careful with the formatting of that file: the section headers (cluster: and node:) must start at the beginning of the line, every attribute line below them must be indented with a single tab, and the node_count in the cluster section has to match the number of node sections.
Furthermore, the "name" in the OCFS2 configuration's node sections must match the "on" hostnames in the DRBD configuration, and those names must resolve to local IPs (ideally those on the dedicated network link, of course).
To avoid both resolution problems and unnecessary DNS queries, I suggest putting the names into /etc/hosts on both peers:
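For example (addresses and hostnames assumed, matching the configuration sketches above):

```
# dedicated cluster link - adjust to your setup
192.168.1.1   node1
192.168.1.2   node2
```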
Finally, you would want to add exceptions to iptables in order to allow both nodes to communicate (open ports tcp/7777 and tcp/7789 on both nodes for the other peer).
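For instance like this on node1, assuming 192.168.1.2 is the peer's address on the dedicated link (mirror the rules with the other address on node2):

```shell
# allow the peer to reach O2CB (tcp/7777) and DRBD (tcp/7789)
iptables -I INPUT -p tcp -s 192.168.1.2 --dport 7777 -j ACCEPT
iptables -I INPUT -p tcp -s 192.168.1.2 --dport 7789 -j ACCEPT
# persist the rules across reboots
service iptables save
```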
Now let's run through O2CB's initial configuration:
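```shell
service o2cb configure
```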
This will throw a bunch of questions at you. Just accept the default values except for the cluster name, which is testcluster in our case.
O2CB will start after configuration. When you've done that on both nodes, you can run service o2cb status to get some positive feedback.
Now that we've configured everything, you are ready to start the drbd service on both nodes. Be careful to do it within a short interval on both to avoid timeouts.
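```shell
service drbd start
```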
On both nodes this should finish successfully. To verify that the resource is online (although not yet initialised), run:
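```shell
cat /proc/drbd
# or, slightly friendlier:
service drbd status
```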
Brilliant. So let's configure:
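The original commands aren't shown here; with DRBD 8.3 the usual step at this point (assuming the resource is called r0, as in the sketch above) is to create the on-disk metadata and let drbdadm bring the resource in line with the configuration, on both nodes:

```shell
# write the DRBD metadata onto the backing device (run on both nodes)
drbdadm create-md r0
# re-read the configuration and attach/connect the resource as needed
drbdadm adjust r0
```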
So now we've got two secondaries, which are inconsistent. That's expected as we haven't set any node to primary yet, and DRBD cannot know that both partitions are clean.
The following step must only be run on the first node
Let's sync then:
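On the first node only (assuming resource name r0), we force it into the primary role and declare its data authoritative:

```shell
# first node only: become primary and overwrite the peer's data
drbdadm -- --overwrite-data-of-peer primary r0
```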
This syncs data from the local to the remote node.
Now we can check the status of the synchronisation:
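```shell
watch -n1 cat /proc/drbd
# or, more compact:
drbd-overview
```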
It may take quite some time to finish. Be patient; it's a good time to have a coffee before we continue...
In the end, we want to see this (a working primary/secondary setup with no remaining inconsistencies):
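Roughly like this in /proc/drbd (device number and flags will differ on your system):

```
0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
```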
But we did want primary/primary, didn't we? So let's promote the other node, too. On the second node run this:
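```shell
drbdadm primary r0
```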
There we go!
Now is the right time to uncomment this line in /etc/drbd.conf:
Let's quickly recap what we've done so far:
Here's all you have to do on one of the nodes:
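A plausible invocation, assuming the DRBD device is /dev/drbd0; -N sets the number of node slots (two nodes may mount concurrently) and -L is just a label:

```shell
# create the OCFS2 filesystem on the DRBD device (one node only!)
mkfs.ocfs2 -N 2 -L testcluster /dev/drbd0
```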
You can then mount it on both nodes:
or better (to avoid unnecessary writes):
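Assuming /mnt/cluster as mountpoint and /dev/drbd0 as device:

```shell
mkdir -p /mnt/cluster
# plain mount:
mount /dev/drbd0 /mnt/cluster
# or, with the recommended options:
mount -o noatime,nodiratime /dev/drbd0 /mnt/cluster
```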
noatime,nodiratime prevent the last access time (read access) from being stored with file and directory entries. In 99% of all use-cases that's unnecessary information, and recording it costs performance.
Well done! You should now be able to write to and read from either node.
You may want to double-check that o2cb and drbd are both started during boot (with chkconfig). Also, make sure that this happens in the right order: first drbd, then o2cb.
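```shell
chkconfig drbd on
chkconfig o2cb on
# the S numbers in the runlevel directory determine the start order;
# drbd must have the lower number so it starts before o2cb
ls /etc/rc3.d/ | grep -E 'drbd|o2cb'
```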
OCFS2 has one downside: it doesn't support SELinux labels, so you can't use filesystem contexts on an OCFS2 partition. OCFS2 will work with SELinux in enforcing mode, but you can't benefit from the added filesystem security. Oracle is aware of that and apparently working on implementing it in future versions.
Until then your files will show up as, and remain, unlabeled_t.
However, you can use OCFS2 filesystems with Apache and SELinux in enforcing mode, if you add a small module which allows httpd access to unlabeled_t filesystems:
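A minimal policy module could look like this (the module name httpd_unlabeled is my choice; the permission set is a conservative read-only sketch, extend it if Apache also needs to write):

```shell
cat > httpd_unlabeled.te <<'EOF'
module httpd_unlabeled 1.0;

require {
    type httpd_t;
    type unlabeled_t;
    class dir { read search getattr };
    class file { read getattr };
}

allow httpd_t unlabeled_t:dir { read search getattr };
allow httpd_t unlabeled_t:file { read getattr };
EOF
# compile and load the module
checkmodule -M -m -o httpd_unlabeled.mod httpd_unlabeled.te
semodule_package -o httpd_unlabeled.pp -m httpd_unlabeled.mod
semodule -i httpd_unlabeled.pp
```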
As a normal, intact SELinux system wouldn't really have unlabeled files anywhere else, you might actually be able to use this as a security feature by tweaking other services' policies a little... just a thought. Certainly not trivial.
The main reason why I added this OCFS2 alternative to the tutorial collection is that GFS2 requires fencing (see Clustered Filesystem with DRBD and GFS2 on CentOS 5.4) and does not behave well without a real fencing device. OCFS2 handles that a bit more elegantly. You won't get away without a short "freeze" either, but after about 30-40 seconds, OCFS2 will degrade the cluster and resume normal operation on the remaining node. (It may be possible to tune this behaviour a bit, although you want to be careful not to let it degrade too quickly, because otherwise a short network hiccup will lead to split-brains.)
In a split-brain situation (network link down, but both nodes otherwise working fine), one of the nodes will declare itself Master, the other one will be marked failed. This is good and safe, because it prevents data from being written to both nodes, which would leave you with different new data on each node. There would be no way to merge the two afterwards; part of your data would be lost.
In code it looks like this; first the outdated/degraded node:
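Assuming resource name r0, the node that lost out demotes itself and reconnects while discarding its own changes:

```shell
# demote and reconnect, throwing this node's changes away
drbdadm secondary r0
drbdadm -- --discard-my-data connect r0
```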
Now on the primary (or now standalone) node:
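```shell
# if the surviving node sits in StandAlone state, tell it to reconnect
drbdadm connect r0
```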
Then check with drbd-overview what the status is, and as soon as it's finished (both nodes listed as UpToDate), you can promote the secondary node to primary again, and mount the partition:
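Again assuming resource r0 and the mountpoint from above:

```shell
drbd-overview
# once both sides show UpToDate/UpToDate, on the formerly failed node:
drbdadm primary r0
mount -o noatime,nodiratime /dev/drbd0 /mnt/cluster
```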
The mount and very first filesystem operation take a short moment, and afterwards all is back up and running as primary/primary.
By the way, normal reboots or unmounts/remounts at either end of the cluster don't do any harm or freeze I/O, not even for a short moment. After a reboot, the returning node will just tell its peer to send over the data that has changed in the meantime and then resume normal operation in primary/primary mode.