# This file is part of Clusterix, Copyright (C) 2004 Alessandro Manzini,
# email: a.manzini@infogroup.it

# Clusterix is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# Clusterix is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Clusterix; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA



############# Clusterix(wsd) #############

Clusterix is cluster software for Unix operating systems
(Linux, FreeBSD, Solaris).

There are two variants of Clusterix:
1) Clusterix: writes its status information to a raw device on a
shared disk.
2) Clusterixwsd: writes its status information to a file on the local
disk of each node.

It is intended to be used with two machines when you want to make a
service highly available. By "service" I mean IP addresses, a program
(web server, database, etc.) and, optionally, shared disks. So you can
use it, for example, to make a web server highly available across two
nodes, or to make a database with a shared disk highly available. If
you want to make more than one service highly available, so that one
service runs on one node and the second on the other, you can simply
make two installations of Clusterix in two different directories on the
two nodes. When a service fails on one node, the cluster system moves
the service to the other node: it configures the IP, publishes the MAC
address of the interface, mounts any disks listed in the configuration
file and finally starts the program (web server, database, etc.).

If you have to mount shared disks and you use them very intensively, it
may be better to use Clusterixwsd, because Clusterix writes its status
information on the shared disk (instead of exchanging it over the
network as Clusterixwsd does) and can take away resources that are
needed by the service.

- The cluster software is composed of a main script (clusterixwsd.sh),
one configuration file (clusterixwsd.conf), one script that controls
the start and stop sequence of a service (script.sh), one script that
starts and stops the service (service.sh), and two scripts that check
the status of the service and switch the cluster if the service fails
(control_process.sh and control_control.sh). Only the script
clusterixwsd.sh and the configuration file clusterixwsd.conf are needed
for the cluster to work correctly. The other files and scripts are
optional and are used to control and administer the service offered by
the cluster.
- All the variables that you have to set are in the configuration file.
- The cluster script and configuration file have to be the same on both
nodes.
- Usually you only have to change clusterixwsd.conf, service.sh and the
conffile variable in clusterixwsd.sh. Leave the rest untouched.

HOW IT WORKS

a) cluster with status information on shared disk: Clusterix

                  Public Interface
    |-----------------------------------------------|
 ___|____         Private Interface             ____|___
|        |  <--------------------------->      |        |
| Host 1 |  <--------------------------->      | Host 2 |
 --------     Backup Private Interface          --------
    |                                               |
    | Write                                   Write |
    |                raw device                     |
 --------------------------------------------------------
|                         |                              |
|     writes host 1       |         writes host 2        |
|_________________________|______________________________|

The status information is placed in one raw device on a shared disk.
Essentially, every node writes a timestamp that the other node reads.
If the timestamp is unchanged for a number of seconds (variable
timeoutdisk), the reading node decides that the other node is down and
takes over the services if it does not already have them. The cluster
also takes advantage of information from the network. It can use zero,
one or two private interfaces to communicate with the other node; to
activate or deactivate them, set the variables PRNI and BNI. Before
checking the other node's timestamp, the cluster checks whether its own
public interface can communicate on the network. If it can, the other
node's timestamp is checked. If it cannot, the cluster checks through
the private interfaces whether the public interface of the other node
is up. If it is, the node stops its services and starts them on the
other node. If it is not, no action is taken. If, besides the public
interface, the private interface is also down, the node tries to
communicate over the backup private interface. If that is also down,
the isolated node stops its services. The private interfaces (or the
public one, if there is no private interface) are used to check whether
the public interfaces of the two nodes are up and to launch various
commands from one node to the other. In this type of cluster the status
information (timestamp etc.) is written on the raw device, which is
visible to both nodes at the same time.
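The timestamp mechanics can be sketched with dd. This is a hedged
illustration, not the real layout: the actual block numbers are
computed by the blocknum function in clusterixwsd.sh, and the block
size comes from the blocksize variable; bs=512 and the block argument
below are illustrative only.

```shell
# Hypothetical sketch of the timestamp write/read done with dd.

# write the current epoch timestamp into one block of the quorum
# file/device without truncating the rest of it
write_ts() {   # write_ts <file-or-device> <block>
    date +%s | dd of="$1" bs=512 seek="$2" count=1 conv=notrunc 2>/dev/null
}

# read the timestamp back from the same block, stripping NUL padding
read_ts() {    # read_ts <file-or-device> <block>
    dd if="$1" bs=512 skip="$2" count=1 2>/dev/null | tr -d '\0'
}
```

With Clusterix the first argument would be the raw device
(quorumdevice); with Clusterixwsd it would be the local quorum file
(quorumfile), plus a second write to the file on the other node over
the network.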

b) Cluster with status information on local disk: Clusterixwsd


                  Public Interface
    |-----------------------------------------------|
 ___|____         Private Interface             ____|___
|        |  <--------------------------->      |        |
| Host 1 |  <--------------------------->      | Host 2 |
 --------     Backup Private Interface          --------
    |                                               |
    v                                               v
Quorum file host 1                        Quorum file host 2

 written by: host 1 (locally) and         written by: host 2 (locally) and
             host 2 (over the network)                host 1 (over the network)
 read by:    host 1                       read by:    host 2

Every node writes its status information and timestamp to certain
blocks of a file on its local disk, and it also writes a timestamp to a
specific block of the file on the other node. Every node reads the
timestamp from its local file. In this case it is fundamental that the
public or private network works in order to determine the state of the
other node. The cluster can use zero, one or two private interfaces to
communicate with the other node; to activate or deactivate them, set
the variables PRNI and BNI. Before checking the other node's timestamp,
the cluster checks whether its own public interface can communicate on
the network. If it can, the other node's timestamp is checked. If it
cannot, the cluster checks through the private interfaces whether the
public interface of the other node is up. If it is, the node stops its
services and starts them on the other node. If it is not, no action is
taken. If, besides the public interface, the private interface is also
down, the node tries to communicate over the backup private interface.
If that is also down, the isolated node stops its services. The private
interfaces (or the public one, if there is no private interface) are
used to check whether the public interfaces of the two nodes are up and
to launch various commands from one node to the other. In this type of
cluster the status information (timestamp etc.) is exchanged via the
network.

For both types of cluster

For both types, depending on the settings, the start of the virtual
service implies starting one or more IP addresses, starting a program
and mounting one or more disks. The differences in the variable
settings between Linux and FreeBSD are few, and you can find
indications on how to set each variable in the comments that precede it
in the configuration file. Variables that can differ according to the
OS you are using are date, ps, dd, ifconfig, the definition of the
alias, and the mount/umount of the disks.

If you activate the external control of the services, two processes are
started: the first one checks that the service is up and, if not,
switches the services to the other node. The other checks that the
first process is up and, if not, restarts it. The first process
restarts the second if it is not present. So the two processes check
each other.

If you have two hosts with three network interfaces, you can set the
variables PRNI=on and BNI=on. In this case you have one public
interface, one private interface and one backup private interface. If
you have only two interfaces, set PRNI=on and BNI=off, so you have one
public interface and one private interface without a backup interface.
If the hosts have a single interface, the cluster can only work over
the public interface (but it is better not to do so); in this case set
PRNI=off and BNI=off.

The cluster status command shows, for each of the two nodes, the
timestamp, the status and whether the node owns the services, when the
service was started and finally the status of all the interfaces in the
configuration file. A detailed log can be found in the file set in the
variable log.

It is possible to set the timeout after which, if the timestamp field
has not changed, the node is considered down by the other; this
variable is timeoutdisk. The variable checkfreq regulates the frequency
with which the timestamp is written and the frequency with which the
timestamp written by the other node is checked. The variable
checkprocfreq regulates the frequency with which the script
control_process checks the availability of the service. The script that
checks the availability of the service can be any script that returns 0
for ok and 1 for ko. When the stop of the service is executed, the
cluster tries to unmount any disks listed in the configuration file. If
the umount command fails, the cluster forces a crash of the node to
prevent data corruption (if you want to disable this, leave the
crashcommand variable unset).
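The mutual check between the two control processes boils down to "is
the PID the other watchdog recorded still alive?". A minimal sketch of
that test (the alive helper and the pid-file convention here are
illustrative, not taken from the actual control_process.sh and
control_control.sh scripts):

```shell
# Hypothetical watchdog helper: returns 0 if the pid file exists and the
# process it names is still running, 1 otherwise.
alive() {   # alive <pidfile>
    [ -f "$1" ] && kill -0 "$(cat "$1")" 2>/dev/null
}

# Each watchdog would loop, restarting its sibling when the check fails,
# along the lines of:
#   alive /var/run/control_control.pid || ./control_control.sh &
```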

BEFORE INSTALLING

You need a Unix OS among Linux, FreeBSD and Solaris. You also need the
following commands: bash, hostname, ps, dd, ssh, date, kill, ifconfig,
ping, send_arp (only for Linux, not for FreeBSD). If you want to mount
and unmount filesystems you also need: mount, umount, lsof (for Linux
and Solaris), fstat (for FreeBSD). For Solaris you need to use ksh for
the scripts.

On Linux you have to install the ping program from iputils; on FreeBSD
the default program is fine. This is because ping must be able to send
broadcast requests. The ping syntax used to establish whether the
network is down is "ping -b -c 3 -f -w 2" on Linux, "ping -c 3 -f -t 2"
on FreeBSD and "ping -c 1" on Solaris.

All these programs are set in variables in the configuration file and
can be substituted with other commands that do the same things.
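As an illustration of that convention only: each external command is
held in a shell variable in the configuration file, so it can be
replaced by an equivalent command. The names and paths below are
examples of the style, not the exact entries shipped in
clusterixwsd.conf.

```shell
# Hypothetical excerpt in the style of clusterixwsd.conf: one variable
# per external command, so any of them can be swapped for an equivalent.
ps="/bin/ps"
dd="/bin/dd"
date="/bin/date"
ifconfig="/sbin/ifconfig"
ping="ping -b -c 3 -f -w 2"    # Linux syntax from the text above
```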

The ssh program has to be configured so that root can launch commands
on the other node without any password, on all the interfaces. If
openssh is installed you can achieve this as follows:

ssh-keygen -t rsa (for dsa it is ssh-keygen -t dsa; it is the same)
This creates a key pair in ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub. The key
in ~/.ssh/id_rsa.pub has to be appended to the file
~/.ssh/authorized_keys on the other node.

examples of the files:
~/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEA1brtV7H9V5A3yLDYxUG71eGO0nvHmJ2g2+U7
n+Ed5cs0C8mW3Ecb5PkQqCHmdErVQFnzs8BllZSoAcmfxMSjbH7DZKmlz/z0V3CcRgIc661o
TfrIFc/xk7GDxQiaNO8+VMw/BjrtWsYxPHT5vkzigPQPdLBhamFWKTYeTJAX7sE= root@be
llatrix.intra.it
~/.ssh/authorized_keys of the other node:
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEA1brtV7H9V5A3yLDYxUG71eGO0nvHmJ2g2+U7
n+Ed5cs0C8mW3Ecb5PkQqCHmdErVQFnzs8BllZSoAcmfxMSjbH7DZKmlz/z0V3CcRgIc661o
TfrIFc/xk7GDxQiaNO8+VMw/BjrtWsYxPHT5vkzigPQPdLBhamFWKTYeTJAX7sE= root@be
llatrix.intra.it

Then you repeat these steps, inverting the two nodes. Then edit
/etc/ssh/sshd_config, set the PermitRootLogin variable to yes and
restart the ssh daemon. At this point you can launch commands as root
on both nodes without any password.
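The key exchange above can be collected into a small helper. This is a
hedged sketch: push_key is a hypothetical name of mine, not part of
Clusterix, and "node2" below is a placeholder for the other node.

```shell
# Hypothetical helper for the key exchange described above.
push_key() {   # push_key root@othernode
    # generate the key pair only if it does not exist yet
    # (empty passphrase, so commands can run unattended)
    [ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa -q
    # append our public key to the other node's authorized_keys
    cat ~/.ssh/id_rsa.pub | ssh "$1" 'cat >> ~/.ssh/authorized_keys'
}

# usage (run on node1, then the mirror image on node2):
#   push_key root@node2
#   ssh root@node2 hostname   # must answer without asking a password
```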

You also have to tell the node to ignore broadcast requests. This is
because this is how the cluster establishes whether the network is
reachable or not: the reachability test sends a broadcast request to
the network, so you do not want the machine itself to reply to that
request. On Linux you have to set
net.ipv4.icmp_echo_ignore_broadcasts = 1. FreeBSD by default does not
reply to broadcast requests.
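On Linux this is the sysctl setting in question, shown both as a
one-off command (requires root) and as a persistent entry; putting the
line in /etc/sysctl.conf is the usual way to persist it.

```shell
# apply immediately (as root)
sysctl -w net.ipv4.icmp_echo_ignore_broadcasts=1

# persist across reboots
echo 'net.ipv4.icmp_echo_ignore_broadcasts = 1' >> /etc/sysctl.conf
```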

In order for the MAC address to be published correctly to all machines
when the virtual IP passes from one node to the other, you have to put
the nodes in the same VLAN as the router on the switch. In this manner
the gratuitous ARP request sent by the send_arp program reaches the
router, and so all the other networks. Otherwise you can see the
virtual IP only in its own network.

INSTALLATION

For a cluster with status information on local disk:
Create a directory for the cluster software:
mkdir /opt/clusterix
Create directories for the log and for the file that contains the
status information:
mkdir /opt/clusterix/log
mkdir /opt/clusterix/qf
Create the file for the status information:
touch /opt/clusterix/qf/quorumfile
Put the following files in the directory /opt/clusterix:

clusterixwsd.sh        Main cluster script
clusterixwsd.conf      Configuration file
control_control0_4.sh  Script that controls control_process0_4.sh
control_process0_4.sh  Script that controls control_control0_4.sh and
                       that launches the script that tests the
                       availability of the service
script.sh              Script that controls the start and stop of
                       the service
service.sh             Script that starts and stops the service (it is
                       only to test the cluster at the beginning
                       of the installation)
controllo              Script that has to return 0 for ok and 1 for ko
gotest.sh              The service used for tests (it is only to test
                       the cluster at the beginning of the installation)

Note: all the file names can be changed, provided you also change the
values of the variables in the configuration file.

Note: All the files except the configuration file have to be executable.

Note: The send_arp program has to take the following parameters:
usage: send_arp src_ip_addr src_hw_addr targ_ip_addr targ_hw_addr
I include the send_arp program for Linux in this archive. Note that
FreeBSD and Solaris do not need it.

Note: If for some reason you want to change the block numbers in which
the cluster writes its information, either in the file or in the raw
device, see the blocknum function in clusterix(wsd).sh.

Note: For Solaris you have to use another date program to return the
number of seconds since 1 January 1970. I include in this tar datenum.c
and the compiled version datenum, which gives the right result.
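For reference, on Linux and FreeBSD the stock date command already
produces both formats the cluster needs; datenum exists only because
the Solaris /usr/bin/date lacks the %s format:

```shell
# human-readable form, used by the "date" variable
date

# epoch seconds, used by the "datenum" variable (GNU and BSD date
# support %s; the Solaris /usr/bin/date does not, hence datenum.c)
date +%s
```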

Then you have to change the values of the variables according to your
system. In the file clusterixwsd.sh you only have to set the conffile
variable to the path of the configuration file. Then you have to set
all the variables in the configuration file clusterixwsd.conf according
to your needs, following the comment that precedes each variable.
During installation it is better to set the variables to use the test
service shipped with the cluster, and to substitute the test service
with the real service only when you are sure that everything works.
After you have set the variables in the configuration file, copy all
these files to the other node. Then you can give the command
"/opt/clusterix/clusterixwsd.sh initialize"; this initializes the file
in which the status information is stored. Then you start the cluster:
"/opt/clusterix/clusterixwsd.sh start". Now you can check the log to
see if everything is all right, give the command
"/opt/clusterix/clusterixwsd.sh status" to see if everything is all
right, check for the presence of the test service with "ps auxw | grep
gotest", and check for the presence of the IP and of the mounted disks,
if any are in the configuration file. Then you start the cluster on the
other node: "/opt/clusterix/clusterixwsd.sh remote start" and check
again with "/opt/clusterix/clusterixwsd.sh status". At this point the
cluster is up and working, and if there is a hardware failure the
cluster switches the service. But until now there is no check of the
service: if the service fails or you stop it, the cluster does not know
and does nothing. So you can start the external check of the service
with "/opt/clusterix/clusterixwsd.sh startcontrol" and then check the
control processes with "/opt/clusterix/clusterixwsd.sh statuscontrol".

At this point you can make some tests. For example you can stop the
service with "/opt/clusterix/service.sh stop" (or "ps auxw | grep
gotest" and then kill the pid) and see that the cluster switches it to
the other node. You can stop the cluster with
"/opt/clusterix/clusterixwsd.sh stop" and see that the cluster starts
the service on the other node. You can reboot the machine and check
that the service goes to the other node, etc.

The same is valid for a cluster with status information on shared disk,
with the exception that instead of specifying the quorumfile variable
with a file, you have to set the quorumdevice variable to a raw device
on the shared disk.

After everything works, you can put your service under the control of
the cluster: set the variable start_service_comand to a script that
starts the service, set the variable stop_service_comand to a script
that stops the service, set stop_service_comand_2 to a script that
stops the service if the first script fails to stop it, and set
includestringstart, excludestringstart, includestringstop and
excludestringstop according to your needs.

Explanation of the variables in the configuration file:

version              Version of the cluster system
pathcluster          Path of the main script
operatingsystem      Operating system. Possible choices: Linux,
                     FreeBSD, Solaris. If you use Solaris, also change
                     the default shell to ksh in all the scripts
hostname             Program that has to return the host name without
                     the domain, like host1 and not host1.domain.org
node1                Names of the 2 nodes of the cluster without domain
node2

log                  Cluster log file
quorumfile           File that contains the status information, for a
                     cluster with status information on local disk
quorumdevice         Raw device that contains the status information,
                     for a cluster with status information on shared
                     disk
blocksize            Block size in bytes inside the status information
                     file or partition
timeoutdisk          Timeout on the status information file after
                     which a node is considered down by the other node
checkfreq            Frequency of the check and write of the status
                     information file

timeoutkillremotestop Timeout to wait before killing the process that
                     remotely stops the cluster on the other node.
                     This is useful to give the disks time to unmount,
                     so it is advisable that it is greater than
                     timeoutstop + timeoutstop2 + umountfailedsec *
                     umountcountnum. This way you are sure that when
                     this node mounts the disks, the other node has
                     already unmounted them.

script               Script that controls the start and the stop of
                     the service
servicename          Name of the service to start; it appears in the
                     log and in the email
start_service_comand Command that starts the service.
                     IMPORTANT: the start_service_comand variable has
                     to be different from the includestringstart
                     variable
stop_service_comand  Command that stops the service
stop_service_comand_2 Second stop command to launch if the first does
                     not stop the service

includestringstart
excludestringstart
includestringstop
excludestringstop
                     Lists of string patterns to match or exclude when
                     starting and stopping the service. Put the
                     strings separated by a "|", for example
                     "pattern1|pattern2". For the include lists the
                     service starts only if all the patterns are
                     matched. For the exclude lists, all the patterns
                     are excluded from the matched ones; you can also
                     use regular expressions in the exclude lists.
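As an illustration of the pattern syntax only (the real matching logic
lives in clusterixwsd.sh, and the "all patterns must match" rule of the
include lists is not reproduced here), "|"-separated patterns plug
directly into grep -E, with the exclude list applied via grep -vE:

```shell
# Hypothetical demo of "|"-separated include/exclude patterns applied
# to a fake process listing.
includestringstart='httpd|mysqld'
excludestringstart='grep'

printf '%s\n' \
  'root  101 /usr/sbin/httpd -k start' \
  'mysql 102 /usr/sbin/mysqld --datadir=/var/lib/mysql' \
  'root  103 grep httpd' \
  | grep -E "$includestringstart" | grep -vE "$excludestringstart"
# prints the httpd and mysqld lines only
```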

timeoutstop          Timeout after which the stop of the process
                     begins without waiting for the normal stop
timeoutstop_2        Timeout for the second stop script after which
                     the stop of the process begins without waiting
                     for the normal stop
timeoutstart         Timeout after which the start of the service is
                     tried again
countstartlimit      Number of tries to start the service
begincheck           Number of seconds after the start of the service
                     before beginning to check the presence of the
                     process
trycountnumquorum    Number of tries before deciding that the other
                     member is not polling the quorum device
trysecintquorum      Interval between tries to see if the other node
                     is polling the quorum device
trymessagequorum     Message to display in the log for the quorum
                     device polling test

date                 Has to return the date as Wed Sep 21 10:51:35 CEST 2005
datenum              Has to return the date as 1127292729 (seconds
                     since 1 Jan 1970)

mailto               Destination address of the mail alerts

control_process      Path to the control_process script
control_control      Path to the control_control script

controlscript        Path to the script that checks the availability
                     of the service. It has to return 0 for good, 1
                     for wrong
checkprocfreq        Frequency of the checks of the availability of
                     the service
countfailedservicenum Number of tries before deciding that the service
                     is down
failedtrysec         Seconds between one failed try and the next
trymessageservice    Message to display in the log when trying the
                     service
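A controlscript can be as simple as a process check. This is a
hypothetical sketch (the check_service name is mine; any script
honouring the 0/1 convention works), shown as a function so the final
exit is left as a comment:

```shell
#!/bin/sh
# Hypothetical controlscript sketch: 0 if a process matching the given
# pattern is running, 1 otherwise (0 = ok, 1 = ko, as Clusterix expects).
check_service() {
    # grep -v grep drops the grep process itself from the listing
    if ps auxw | grep "$1" | grep -v grep > /dev/null; then
        return 0
    fi
    return 1
}

# in the real script you would end with:
#   check_service "${1:-gotest}"
#   exit $?
```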

fsckbeforemount      Put ON if you want to run an fsck on the
                     filesystem before mounting it
umountcountnum       Number of tries to unmount the device
umountfailedsec      Seconds between one failed try and the next
umountfaildmessage   Message to display in the log on umount failure
crashcommand         Command that forces a crash of the node:
                     reboot -f for Linux, reboot -q for FreeBSD.
                     It is needed to be sure that when you mount a
                     filesystem on this node, the filesystem is not
                     mounted on the other node.
                     Leave unset if you do not want to force a crash
                     of the node in case of an umount failure.
numdevice            Number of devices to mount/unmount when the
                     cluster starts/stops
devicetomount1       Device to mount
mountpoint1          Mount point

killprocess          Defines the command used to kill the processes
                     that keep a filesystem busy before unmounting it.
For Solaris, Linux:
for pidopenfile in `$lsof $1 | $awk '{print $2}' | $uniq | $grep -v PID`; do if [ ! -z "$pidopenfile" ]; then $kill -9 $pidopenfile; fi; done
For FreeBSD:
for pidopenfile in `$fstat -f $1 | $awk '{print $6}' | $grep -v INUM`; do if [ ! -z "$pidopenfile" ]; then $kill -9 $pidopenfile; fi; done

PRNI                 Set to on if you want to use a private network
                     interface
node1ipprivatenetwork1 IP of the private network interface on node1
node2ipprivatenetwork1 IP of the private network interface on node2
BNI                  Set BNI (backup network interface) to on if you
                     have and want to use a backup private interface
node1ipprivatenetwork2 IP of the backup private network interface on node1
node2ipprivatenetwork2 IP of the backup private network interface on node2
node1ippublicnetwork IP of the public network interface on node1
node2ippublicnetwork IP of the public network interface on node2
netmaskpublicnetwork Netmask of the public network interface
broadcastpublicnetwork Broadcast of the public network interface
interfacepublicnetwork Name of the public network interface
trycountnumpubnet    Number of tries before deciding that the public
                     interface is down
trysecintpubnet      Interval between tries to see if the public
                     interface is down
trymessagepubnet     Message to display in the log for the public
                     interface test

node1macaddress      MAC address of the public interface of node1
node2macaddress      MAC address of the public interface of node2.
                     They are needed for Linux, not for FreeBSD

numvip               Number of virtual IP addresses to activate on the
                     start of the service
useexternalvipfile   Set to "on" if you want to read the VIP addresses
                     from an external file
externalvipfile      File containing the VIP addresses. If you use an
                     external file you have to put one IP per line
vipdeflinux          Sets the ip, netmask, broadcast, interface and
                     interfacenumber for the virtual IP addresses.
                     If unused, leave the ip variable unset. If you
                     use it, also add the other variables (netmask,
                     broadcast, interface, interfacenumber). You have
                     to set it when useexternalvipfile="off".
vipdeffreebsd        The same for FreeBSD
vipdefsolaris        The same for Solaris


OPERATING MANUAL

Clusterix 4.6...
Usage: /opt/clusterix1/clusterixwsd.sh {start|stop|startforeground|startall|stopall|
status|initialize|startservice|stopservice|stopcluster|stopclusterhere|startcontrol|
stopcontrol|statuscontrol|writedatenow|writeactive|version}

start: start the cluster in background: service and check of the quorum device.
stop: stop the cluster: service and check of the quorum device.
status: retrieve the status information of the 2 nodes.
initialize: initialize the quorum device.
startforeground: start the cluster in foreground: service and check of the quorum
    device.
startcontrol: start the processes that control the availability of the service.
stopcluster: stop the cluster system without stopping the service, on both nodes.
stopclusterhere: stop the cluster system without stopping the service, on this node.
stopcontrol: stop the processes that control the availability of the service.
statuscontrol: status of the processes that control the availability of the service.
startservice: start only the service, not the cluster system.
startserviceifnotactive: start only the service, not the cluster system, and only
    if the node is not active.
stopservice: stop only the service, not the check of the quorum device.
stopall: stop the service, the check of the quorum device and the processes that
    control the availability of the service.
startall: start the service, the check of the quorum device and the processes that
    control the availability of the service. Use this only if the cluster is down
    also on the other node; otherwise use /opt/clusterix1/clusterixwsd.sh start.
remote {start|stop|startservice|stopservice}: start, stop, startservice or
    stopservice on the other node.
writeactive {yes|no}: write yes or no in the status on the quorum device.
writedatenow: write the current date on the quorum device.
version: program version.

clusterixwsd.sh initialize         - Initialize the status information files
clusterixwsd.sh start              - Start the cluster on this node
clusterixwsd.sh remote start       - Start the cluster on the other node


clusterixwsd.sh stop               - Stop the cluster on this node
clusterixwsd.sh remote stop        - Stop the cluster on the other node
clusterixwsd.sh stopcluster        - Stop the cluster system without stopping
                                     the service
clusterixwsd.sh status             - Check the status of the cluster
clusterixwsd.sh startservice       - Start only the service without the cluster
                                     (for emergency)
clusterixwsd.sh writeactive yes|no - Write yes or no in the block that contains
                                     the active information on the status file
clusterixwsd.sh writedatenow       - Write the date in the block that contains
                                     the timestamp information on the status file