Introduction
Some months ago I published Clustered Filesystem with DRBD and GFS2 on CentOS 5.4 in this wiki. Now I would like to present another option: OCFS2, a clustered filesystem developed by Oracle. OCFS was initially focused on use with Oracle's databases, but OCFS2 is a general-purpose cluster filesystem. It's available as open source, and Oracle does quite a good job of publishing packages and modules for each and every kernel version (so you don't need to recompile it over and over again when you update your kernel). That at least holds true for RHEL and derivatives. You will notice that this howto is very similar to the GFS2 howto mentioned earlier. In fact, the DRBD-related parts are mostly identical.
Pre-Requisites
For this short tutorial, I assume that
Installation
Unless stated otherwise, please do everything on both nodes! For DRBD, install the required software:
yum install redhat-lsb which drbd83 kmod-drbd83
(In my case, on a Xen kernel, the module package was kmod-drbd83-xen instead of kmod-drbd83.)
Then you'll need to get OCFS2 and O2CB (including the command line tools) from Oracle's Open Source downloads. In my case these were:
ocfs2-2.6.18-194.3.1.el5xen-1.4.7-1.el5.i686.rpm
ocfs2-tools-1.4.4-1.el5.i386.rpm
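The kernel module package is built per kernel release, so its name has to match the running kernel. A minimal sketch of how the expected package name can be derived; the function name ocfs2_pkg_prefix is made up for this example, and the naming pattern is simply inferred from the file name above:

```shell
# Hypothetical helper: build the expected OCFS2 module package name
# for a given kernel release (defaults to the running kernel).
ocfs2_pkg_prefix() {
    kernel="${1:-$(uname -r)}"
    printf 'ocfs2-%s\n' "$kernel"
}

# Example with the kernel release used in this howto:
ocfs2_pkg_prefix 2.6.18-194.3.1.el5xen
```

Once the packages are installed, rpm -q "$(ocfs2_pkg_prefix)" should find the module package for the running kernel, assuming the naming pattern holds for your kernel.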
Install both RPMs as you normally would:
rpm -Uhv ocfs2-*.rpm
Configuring the Cluster
Now everything is prepared. Let's throw in the configuration files... First, DRBD. Your /etc/drbd.conf should look similar to this (replace hostnames/IP/devices according to your setup):
global { usage-count yes; }
# you may want to lower the rate, if you're not using a dedicated GBit link
# between both peers; 20M worked well on EC2
common { syncer { rate 100M; } }
resource res0 {
    protocol C;
    startup {
        wfc-timeout 20;
        degr-wfc-timeout 10;
        # we will keep this commented until tested successfully:
        # become-primary-on both;
    }
    net {
        # the encryption part can be omitted when using a dedicated link for DRBD only:
        # cram-hmac-alg sha1;
        # shared-secret anysecrethere123;
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    on node1 {
        device /dev/drbd1;
        disk /dev/sdf1;
        address 10.10.10.1:7789;
        meta-disk internal;
    }
    on node2 {
        device /dev/drbd1;
        disk /dev/sdf1;
        address 10.10.10.2:7789;
        meta-disk internal;
    }
    disk {
        fencing dont-care;
    }
}
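Before starting DRBD it can help to confirm that the options needed for primary/primary operation are really active and not commented out. A rough sketch, assuming a simple text check is good enough; check_dual_primary_conf is a made-up helper, not part of DRBD:

```shell
# Hypothetical helper: check that a drbd.conf contains the two settings
# this howto relies on for primary/primary operation.
check_dual_primary_conf() {
    conf="${1:-/etc/drbd.conf}"
    rc=0
    # Drop comments first so that commented-out options do not count.
    active=$(sed 's/#.*//' "$conf") || return 2
    for pattern in 'allow-two-primaries[[:space:]]*;' \
                   'protocol[[:space:]]+C[[:space:]]*;'; do
        if ! printf '%s\n' "$active" | grep -Eq "$pattern"; then
            echo "not found: $pattern"
            rc=1
        fi
    done
    return "$rc"
}
```

This only looks for text patterns. Running drbdadm dump res0 afterwards lets DRBD itself parse the file and report syntax errors.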
Second, the cluster definition for O2CB, which needs to be stored in /etc/ocfs2/cluster.conf (the directory needs to be created!):
cluster:
	node_count = 2
	name = testcluster

node:
	ip_port = 7777
	ip_address = 10.10.10.1
	number = 1
	name = node1
	cluster = testcluster

node:
	ip_port = 7777
	ip_address = 10.10.10.2
	number = 2
	name = node2
	cluster = testcluster
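Note: Be very careful with the formatting of that file. Each stanza header (cluster:, node:) must start in the first column, each parameter line must be indented with a tab, and stanzas must be separated by a blank line.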
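One way to avoid formatting mistakes is to generate the file instead of typing it. The following is only a sketch: gen_cluster_conf is a made-up function, the port 7777 is taken from the example above, and it assumes that O2CB wants stanza headers in the first column, tab-indented parameters and a blank line between stanzas.

```shell
# Hypothetical generator for /etc/ocfs2/cluster.conf.
# Usage: gen_cluster_conf CLUSTER NAME1:IP1 NAME2:IP2 ...
gen_cluster_conf() {
    cluster="$1"
    shift
    # "$#" is now the number of nodes
    printf 'cluster:\n\tnode_count = %s\n\tname = %s\n' "$#" "$cluster"
    number=1
    for node in "$@"; do
        name="${node%%:*}"   # part before the first colon
        ip="${node#*:}"      # part after the first colon
        printf '\nnode:\n\tip_port = 7777\n\tip_address = %s\n' "$ip"
        printf '\tnumber = %s\n\tname = %s\n\tcluster = %s\n' \
            "$number" "$name" "$cluster"
        number=$((number + 1))
    done
}

# Prints the configuration for the two nodes of this howto:
gen_cluster_conf testcluster node1:10.10.10.1 node2:10.10.10.2
```

Redirect the output to /etc/ocfs2/cluster.conf on both nodes once it looks right.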
Furthermore, the "name" in the node sections of the OCFS2 configuration must match the "on" hostnames in the DRBD configuration, and those names must resolve to local IPs (ideally those on the dedicated network link, of course). To avoid both resolution problems and unnecessary DNS queries, I suggest putting the names into /etc/hosts on both peers:
10.10.10.1 node1
10.10.10.2 node2
Finally, add exceptions to iptables to allow both nodes to communicate (open ports tcp/7777 and tcp/7789 on both nodes for the other peer).
Now let's run through O2CB's initial configuration:
service o2cb configure
This will throw a bunch of questions at you. Just accept the default values except for the cluster name, which is testcluster in our case. O2CB will start after configuration. When you've done that on both nodes, you can run
service o2cb status
to get some positive feedback.
Initialisation of the cluster resource
Now that we've configured everything, you are ready to start the drbd service (service drbd start) on both nodes. Be careful to start it on both within a short interval to avoid timeouts. On both nodes this should finish successfully. To verify that the resource is online (although not yet initialised), run:
$ drbd-overview
1:res0 Unconfigured . . . .
Brilliant. So let's configure:
# create meta data information:
$ drbdadm create-md res0
# let's make both nodes communicate:
$ drbdadm up res0
# check the status again:
$ drbd-overview
1:res0 Connected Secondary/Secondary Inconsistent/Inconsistent C r----
So now we've got two secondaries, which are inconsistent. That's expected as we haven't set any node to primary yet, and DRBD cannot know that both partitions are clean.
The following step must only be run on the first node! Let's sync then:
$ drbdadm -- --overwrite-data-of-peer primary res0
This syncs data from the local to the remote node. Now we can check the status of the synchronisation:
$ drbd-overview
1:res0 SyncSource Primary/Secondary UpToDate/Inconsistent C r----
[=>..................] sync'ed: 14.3% (4388/5112)M
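Instead of re-running drbd-overview by hand, the check can be scripted. A small sketch, assuming the output format shown above; drbd_in_sync is a made-up helper that reads drbd-overview output from stdin:

```shell
# Hypothetical helper: succeeds when the drbd-overview output on stdin
# lists the given resource as UpToDate on both nodes.
drbd_in_sync() {
    res="$1"
    grep -E "^[[:space:]]*[0-9]+:${res}[[:space:]]" | grep -q 'UpToDate/UpToDate'
}

# Sample lines copied from this howto:
echo '1:res0 SyncSource Primary/Secondary UpToDate/Inconsistent C r----' \
    | drbd_in_sync res0 || echo "still syncing"
echo '1:res0 Connected Primary/Secondary UpToDate/UpToDate C r----' \
    | drbd_in_sync res0 && echo "in sync"
```

On a real node this could be used as: until drbd-overview | drbd_in_sync res0; do sleep 30; done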
It may take quite some time to finish. Be patient. Good time to have a coffee before we continue... In the end, we want to see this (a working primary/secondary setup with no remaining inconsistencies):
$ drbd-overview
1:res0 Connected Primary/Secondary UpToDate/UpToDate C r----
But we did want primary/primary, didn't we? So let's promote the other node, too. On the second node run this:
$ drbdadm primary res0
$ drbd-overview
1:res0 Connected Primary/Primary UpToDate/UpToDate C r----
There we go! Now is the right time to uncomment this line in /etc/drbd.conf:
previously:
# become-primary-on both;
now:
become-primary-on both;
Let's quickly recap what we've done so far:
Create and mount the OCFS2 filesystem
Here's all you have to do on one of the nodes:
mkfs -t ocfs2 -N 2 -L ocfs2_drbd1 /dev/drbd1
You can then mount it on both nodes:
mount /dev/drbd1 /mnt
or better (to avoid unnecessary writes):
mount -o noatime,nodiratime /dev/drbd1 /mnt
The noatime,nodiratime options prevent the last access time (read access) from being stored with file and directory entries. In 99% of all use cases that is unnecessary information, and it costs performance.
Well done! You should now be able to write to and read from either node. You may want to double-check that o2cb and drbd are both started during boot (with chkconfig). Also, make sure that this happens in the right order: first drbd, then o2cb.
Some notes on SELinux
OCFS2 has one downside: it doesn't support SELinux labels, so you can't use filesystem contexts on an OCFS2 partition. OCFS2 will work with SELinux in enforcing mode, but you can't benefit from the added filesystem security. Oracle is aware of that and apparently working on implementing it for future versions. Until then your files will show as, and remain, unlabeled_t. However, you can use OCFS2 filesystems with Apache and SELinux in enforcing mode if you add a small module which allows access from httpd to unlabeled_t filesystems:
cat << EOF > httpocfs.te
module httpocfs 1.0.11;
require {
type unlabeled_t;
type httpd_t;
type initrc_t;
class dir { search getattr write create read add_name remove_name rmdir };
class file { read getattr write unlink create append setattr };
class lnk_file { read getattr write unlink };
class tcp_socket { read write };
}
#============= httpd_t ==============
allow httpd_t unlabeled_t:dir { search getattr write create read add_name remove_name rmdir };
allow httpd_t unlabeled_t:file { read getattr write unlink create append setattr };
allow httpd_t unlabeled_t:lnk_file { read getattr write unlink };
# if this is missing, you may end up with kernel panic with selinux enforced,
# while starting httpd
allow httpd_t initrc_t:tcp_socket { read write };
EOF
checkmodule -M -m -o httpocfs.mod httpocfs.te
semodule_package -o httpocfs.pp -m httpocfs.mod
semodule -i httpocfs.pp
semodule -l | grep httpocfs
--> should show the new module then
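The grep above also matches any module whose name merely contains httpocfs. If an exact match is wanted, for example in a deployment script, something like the following sketch could be used; selinux_module_listed is a made-up helper that reads semodule -l output from stdin and compares the first column only:

```shell
# Hypothetical helper: succeeds if the module named $1 appears as an exact
# name in the first column of `semodule -l` style output on stdin.
selinux_module_listed() {
    awk -v want="$1" '$1 == want { found = 1 } END { exit !found }'
}

# Sample line in "name version" form (version taken from the module above):
printf 'httpocfs\t1.0.11\n' | selinux_module_listed httpocfs && echo "module loaded"
```

On a real system: semodule -l | selinux_module_listed httpocfs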
As a normal, intact SELinux system wouldn't really have unlabelled files anywhere else, you might actually be able to use this as a security feature by tweaking other services' policies a little... just a thought. Certainly not trivial.
Failure Recovery
The main reason why I added this OCFS2 alternative to the tutorial collection is that GFS2 requires fencing (see Clustered Filesystem with DRBD and GFS2 on CentOS 5.4) and does not behave well without a real fencing device. OCFS2 handles that a bit more elegantly. You won't get away without a short "freeze" either, but after about 30-40 seconds, OCFS2 will degrade the cluster and resume normal operation on the remaining node. (It may be possible to tune this behaviour a bit, although you want to be careful not to let it degrade too quickly, because otherwise a short network hiccup will lead to split-brains.)
In a split-brain situation (network link down, but both nodes otherwise working fine), one of the nodes will declare itself Master, and the other one will be marked failed. This is good and safe, because it prevents data from being written to both nodes, which would leave you with different new data on each. There wouldn't be any way to merge both afterwards; part of your data would be lost.
In code it looks like this (/shared is the mount point in this example; adjust it to yours). First on the outdated/degraded node:
umount /shared
drbdadm disconnect res0
drbdadm secondary res0
drbdadm -- --discard-my-data connect res0
Now on the primary (or now standalone) node:
drbdadm connect res0
Then check the status with drbd-overview, and as soon as the resync has finished (both nodes listed as UpToDate), you can promote the secondary node to primary again and mount the partition:
drbdadm primary res0
mount -o noatime,nodiratime /dev/drbd1 /shared
The mount and the very first filesystem operation take a short moment, and afterwards all is back up and running as primary/primary. By the way, normal reboots or unmounts/remounts at either end of the cluster don't do any harm or freeze I/O, not even for a short moment. After a reboot, the returning node just tells its peer to send the data which has changed in the meantime and then resumes normal operation in primary/primary mode.
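To decide whether the manual recovery is needed at all, the connection state reported by drbd-overview can be classified first. This is only a sketch: drbd_state_hint is a made-up helper, it reads a single drbd-overview line from stdin, and the hints are my own wording rather than DRBD output.

```shell
# Hypothetical helper: classify one drbd-overview line (read from stdin)
# by its connection state, which is the second field.
drbd_state_hint() {
    # Fields: resource, connection state, roles, disk states, rest
    read -r _res cstate _roles _disks _rest || return 1
    case "$cstate" in
        Connected)             echo "ok: connected to peer" ;;
        SyncSource|SyncTarget) echo "wait: resync in progress" ;;
        WFConnection)          echo "wait: waiting for peer" ;;
        StandAlone)            echo "check: not connected, possible split-brain" ;;
        *)                     echo "unknown: $cstate" ;;
    esac
}

# Sample line copied from this howto:
echo '1:res0 Connected Primary/Primary UpToDate/UpToDate C r----' | drbd_state_hint
```

A node that reports StandAlone after a network outage is a candidate for the recovery steps above. drbdadm cstate res0 prints the connection state on its own if you prefer not to parse drbd-overview.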
Shortcuts