Vic/SunCluster
Sun Cluster 3.1/3.2 Notes
SC 3.1 Use of Persistent reservations
SCSI-2/SCSI-3 Fencing Configuration
- For 2-node clusters, SCSI-3 fencing must be configured if using any ONTAP release earlier than 7.2.3. The following sets the global fencing policy to prefer SCSI-3.
- cluster set -p global_fencing=prefer3
- To verify fencing configuration...
- cluster show 3mileisland | grep global_fenc (3mileisland is my cluster name)
- NOTE - remove/replace quorum devices following a change to global_fencing
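To script the verification step, the value can be pulled out of the `cluster show` output. This is a minimal sketch, assuming the property is printed as a `global_fencing: <value>` line; the function name `fencing_mode` and the sample text are illustrative and were not captured from a real cluster.

```shell
#!/bin/sh
# Hypothetical helper: print the global_fencing value found in `cluster show`
# output read on stdin. The "name: value" line format is an assumption.
fencing_mode() {
    awk -F: '$1 ~ /global_fencing/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Illustrative sample only, not real cluster output.
sample='  installmode:        disabled
  global_fencing:     prefer3'

printf '%s\n' "$sample" | fencing_mode
```

On a live node this would be used as `cluster show 3mileisland | fencing_mode`.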
Related Sun blog
- Excellent discussion from [Kristien's Blog]
- From the Blog...
- SCSI reservations in Sun Cluster 3.x
- I promised some time ago to write something about the mechanisms that Sun Cluster uses to prevent split brain and amnesia. As said, in a two node cluster, a node can get the vote count from the quorum device by 'reserving' the quorum device or making sure that the other node cannot reserve it. We also discussed that reserving quorum devices is not enough: you should also make sure that all disks are fenced out from a node that has to leave the cluster. This is called disk fencing. SCSI reservations are used for both the quorum disk and all the other disks.
- You have probably heard of SCSI-2 versus SCSI-3. When Sun Cluster 3.x was designed, they reckoned all disks would be ready to understand SCSI-3 by the time Sun Cluster was released, but unfortunately this didn't seem to be true. So they decided to have Sun Cluster use either SCSI-2 or SCSI-3. Big question: when does it use what? And why not use SCSI-2 all the time? Let's first try to answer the last question: SCSI-2 is an exclusive reservation, which means that only one node can own the disk. Which means that other nodes will not be able to reserve the disk and they will panic. Not so handy when you have a 4 node cluster and you want to kick off only one node. SCSI-3 is a group reservation: every node has a key on a dedicated area on the disk and when a node has to leave, another node will just kick off its key.
- The next question, when Sun Cluster uses SCSI-2 and when SCSI-3, is an easy one to answer, but there are lots of misunderstandings. Sun Cluster will not 'test' whether the disk understands SCSI-2 or SCSI-3. The reason is that we use a specific functionality of SCSI-3 called Persistent (Group) Reservation (PGR), which is optional in the specs. So it is perfectly possible that a disk understands SCSI-3 but does not have PGR functionality enabled. Instead, Sun Cluster decides what mechanism to use based on the number of paths to the disk cluster-wide. You can check this with the output of scdidadm -L.
- An example in a 2-node cluster:
- 14 moon1:/dev/rdsk/c1t2d0 /dev/did/rdsk/d14
- 14 moon2:/dev/rdsk/c1t2d0 /dev/did/rdsk/d14
- Here we see that there is one path from moon1 to /dev/did/rdsk/d14, and one path from moon2 hence scsi-2 will be used.
- The next thing we will need to do is discuss the difference between scsi reservations used for the Quorum device and the ones used for disk fencing. There is no overlap: Disk fencing code will issue scsi reservations on all shared disks except the Quorum Disk.
- Let us first start with the SCSI mechanism used by disk fencing (i.e. the protection of disks against 'rogue' nodes that have unexpectedly left the cluster). As said, SCSI-2 will be used when it is a 2-node cluster, SCSI-3 when there are more than 2 paths to the disk cluster-wide. SCSI-3 is needed in that case because of what we have discussed before: we need more granularity than the all-or-nothing 'kick everyone out' of SCSI-2. The SCSI-2 reservations used are the typical MHIOCTKOWN and MHIOCRELEASE ioctls.
- For the quorum device it is not as straightforward. As said, the quorum rule is used to protect against amnesia. This implies that any reservation of the quorum device should be able to persist across reboots of the storage. This is true for SCSI-3 (hence the Persistent in PGR) but not for SCSI-2. Therefore, Sun invented a mechanism it has called SCSI-2 PGRE (Persistent Group Reservation Emulation). This is an emulation of the SCSI-3 mechanism using SCSI-2 ioctls: keys will be put on a designated area on the disk. These keys are able to survive a power cycle of the disk subsystem. One additional remark: putting your key on a disk or kicking another node's key off the disk has to be an atomic operation, but the SCSI-2 emulation consists of many commands. Therefore a traditional SCSI-2 MHIOCTKOWN will still be used to ensure atomicity.
- Oh: both SCSI-3 and SCSI-2 keys are invisible and are not placed in a specific partition. SCSI-2 keys are in a designated area on the disk or LUN, and the location of SCSI-3 keys is implementation-dependent. A quorum disk can still be used to hold whatever data you want. I will show in a next post how you can see these mysterious keys.
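The path-count rule from the blog excerpt can be sketched as a small filter over `scdidadm -L` output. This is only an illustration of the rule as the blog states it (exactly 2 paths gives SCSI-2, more than 2 gives SCSI-3); it ignores the global_fencing setting, and the function name `fencing_by_paths` and the `not-shared` label for single-path disks are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: count cluster-wide paths per DID device from `scdidadm -L` style
# output on stdin and print the mechanism the blog's rule predicts.
fencing_by_paths() {
    awk '
        NF >= 3 { paths[$3]++ }               # column 3 is the DID device path
        END {
            for (dev in paths) {
                if (paths[dev] > 2)       mode = "scsi-3"
                else if (paths[dev] == 2) mode = "scsi-2"
                else                      mode = "not-shared"   # label is an assumption
                print dev, paths[dev], mode
            }
        }' | sort
}

# The two-node example from the text.
fencing_by_paths <<'EOF'
14 moon1:/dev/rdsk/c1t2d0 /dev/did/rdsk/d14
14 moon2:/dev/rdsk/c1t2d0 /dev/did/rdsk/d14
EOF
```

For the two-node example this prints `/dev/did/rdsk/d14 2 scsi-2`. On a cluster node the input would come from `scdidadm -L | fencing_by_paths`.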
Solaris Zone Failover Groups
- Solaris zones may be used in SC3.2 in a couple of different ways.
- Zones may be identified as cluster nodes in a resource group.
- Zones may be failed over to other nodes in the cluster.
- To configure failover zones:
- Configure shared storage and a local filesystem.
- Configure the zones on all cluster nodes.
- Install the zones on the node with the local filesystem.
- Copy /etc/zones/index to the other cluster nodes.
- Configure a failover group which includes the shared storage containing the zonepath.
Adding ZFS and SVM Resource Groups With Logical Host and NFS Resources
#Register these agents with the cluster if that hasn't already been done...
clresourcetype register SUNW.HAStoragePlus
clresourcetype register SUNW.nfs

#Create the SVM/NFS Resource Group
clresourcegroup create -n sunx4200-shu01,sunx4200-shu02,sunx2200-shu01 -p Pathprefix=/global/services/nfs global_svm_group
clreslogicalhostname create -g global_svm_group cluster-test-svm
clresource create -g global_svm_group -t SUNW.HAStoragePlus -x FilesystemMountpoints=/global/services -x AffinityOn=True global-storage-resource
clresourcegroup online -e -m -M global_svm_group
#The dfstab file lives under <Pathprefix>/SUNW.nfs and is named after the SUNW.nfs resource.
#Add this line to it: share -F nfs -o rw /global/services/nfs/data
vi /global/services/nfs/SUNW.nfs/dfstab.global-shares
clresource create -g global_svm_group -t SUNW.nfs -p Resource_dependencies=global-storage-resource global-shares

#Create the ZFS/NFS Resource Group and use the SVM global space for the dfstab file for the ZFS share.
clresourcegroup create -n sunx4200-shu01,sunx4200-shu02,sunx2200-shu01 -p Pathprefix=/global/services/zfs zpool-shares-group
clreslogicalhostname create -g zpool-shares-group cluster-test
#SUNW.HAStoragePlus was already registered above; this step can be skipped.
clresourcetype register SUNW.HAStoragePlus
clresource create -g zpool-shares-group -t SUNW.HAStoragePlus -x Zpools=HAzfs zpools-storage-resource
clresourcegroup online -e -m -M zpool-shares-group
#Add this line to the file: share -F nfs -o rw /HAzfs
vi /global/services/zfs/SUNW.nfs/dfstab.zpool-shares
clresource create -g zpool-shares-group -t SUNW.nfs -p Resource_dependencies=zpools-storage-resource zpool-shares
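As I understand the SUNW.nfs convention, the agent reads its share commands from a file named `dfstab.<resource name>` in a `SUNW.nfs` directory under the resource group's Pathprefix. The helper below is a small sketch of that naming rule; `dfstab_path` is a made-up name for this example.

```shell
#!/bin/sh
# Sketch: build the dfstab path for a SUNW.nfs resource from the resource
# group's Pathprefix ($1) and the resource name ($2).
dfstab_path() {
    prefix=${1%/}                         # drop one trailing slash, if present
    printf '%s/SUNW.nfs/dfstab.%s\n' "$prefix" "$2"
}

dfstab_path /global/services/zfs zpool-shares
```

For the ZFS group this prints `/global/services/zfs/SUNW.nfs/dfstab.zpool-shares`, the file edited with vi in that group's steps.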
Adding a Quorum Server (New feature in 3.2)
Note - The command clqs is an alias for clquorumserver, so the two commands may be used interchangeably.
- 1. Run the Java Installer on the host that will serve as the quorum server. Select and install the quorum server software.
- 2. Enter "clquorumserver start +" on the quorum server host
- 3. Run clsetup on a cluster node and add a quorum device. Specify the quorum server's IP and port 9000.
- 4. Now on the quorum server view the quorum reservation and node registrations.
[root@sunx2200-shu02--->]clqs show
--- Quorum Server on port 9000 ---
--- Cluster eveready (id 0x467BE3B0) Reservation ---
Node ID: 1
Reservation key: 0x467be3b000000001
--- Cluster eveready (id 0x467BE3B0) Registrations ---
Node ID: 1
Registration key: 0x467be3b000000001
Node ID: 2
Registration key: 0x467be3b000000002
Node ID: 3
Registration key: 0x467be3b000000003
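Judging only from the sample output above, each key looks like the cluster ID followed by the node ID, each printed as eight hex digits. The sketch below rebuilds a key on that assumption; it is an inference from this one listing, not a documented format, and `quorum_key` is a hypothetical name.

```shell
#!/bin/sh
# Sketch: rebuild a quorum server key from a cluster ID ($1) and a node ID ($2).
# Assumption (inferred from the sample output only): the key is the 32-bit
# cluster ID followed by a 32-bit node ID, shown as 16 hex digits.
quorum_key() {
    printf '0x%08x%08x\n' "$1" "$2"
}

quorum_key 0x467BE3B0 1
```

With the values from the listing this prints `0x467be3b000000001`, matching the key shown for node 1.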
Starting Additional Quorum Server Instances
- The quorum server daemon, scqsd, is configured by default from the entries in /etc/scqsd/scqsd.conf. This file lets you preconfigure multiple instances for startup. I did not find that this worked very well; however, the following command can be used to start additional instances.
- /usr/cluster/lib/sc/scqsd -d /var/sc2 -p 9002
- The requirement is that each instance have a unique port and directory.
- Note - Currently the belief is that 1 instance can service multiple clusters.
- The following shows 2 quorum servers started and another 2 instances configured but not started.
[root@sunx2200-shu02--->]clqs show -v
--- Quorum Server on port 9000 ---
--- Cluster eveready (id 0x467BE3B0) Reservation ---
Node ID: 1
Reservation key: 0x467be3b000000001
--- Cluster eveready (id 0x467BE3B0) Registrations ---
Node ID: 1
Registration key: 0x467be3b000000001
Node ID: 2
Registration key: 0x467be3b000000002
Node ID: 3
Registration key: 0x467be3b000000003
--- Quorum Server on port 9002 ---
Quorum server on port "9002" is not configured in any cluster.
--- Quorum Server on port 9003 ---
clqs: (C339181) Quorum server is not yet started on port "9003".
--- Quorum Server on port 9004 ---
clqs: (C339181) Quorum server is not yet started on port "9004".
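The unique port and directory requirement can be checked before any instance is started. The sketch below reads planned instances as `port directory` pairs, prints the scqsd command for each, and fails if a port or directory is reused. The input format is made up for this example (it is not the scqsd.conf syntax), and `/var/sc3` is a hypothetical directory.

```shell
#!/bin/sh
# Sketch: validate planned scqsd instances read from stdin, one
# "port directory" pair per line (an invented format for this example).
# Prints a start command per valid line; returns non-zero on any reuse.
plan_instances() {
    awk '
        NF != 2       { next }                                   # skip blank or malformed lines
        ($1 in ports) { print "duplicate port: " $1; bad = 1; next }
        ($2 in dirs)  { print "duplicate directory: " $2; bad = 1; next }
        {
            ports[$1] = 1; dirs[$2] = 1
            print "/usr/cluster/lib/sc/scqsd -d " $2 " -p " $1
        }
        END { exit (bad ? 1 : 0) }'
}

# /var/sc3 is a hypothetical directory used only for illustration.
plan_instances <<'EOF'
9002 /var/sc2
9003 /var/sc3
EOF
```

The printed commands are only echoed, not executed, so the plan can be reviewed first.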
Adding a ZFS resource group
- The first steps here show how to simply create a zpool resource group. The last step shows how to create an nfs shared zpool resource group. Visit the ZFS FAQ for help with ZFS.
- Create a pool
- zpool create HAzfs raidz disk1 disk2 disk3 disk4 disk5
- Create Filesystems within the pool
- zfs create HAzfs/test1
- zfs create HAzfs/test2
- zfs create HAzfs/test3
- zfs create HAzfs/test4
- Create a cluster resource group
- clresourcegroup create rg-zfs
- clresourcetype register SUNW.HAStoragePlus
- Add the zpool to the group then online the group
- clresource create -g rg-zfs -t SUNW.HAStoragePlus -p Zpools=HAzfs rs-zfs
- clresourcegroup online -M rg-zfs
- Test the group on the other cluster nodes
- scswitch -z -h sunx4200-shu02 -g rg-zfs
- scswitch -z -h sunx2200-shu01 -g rg-zfs
- Add a virtual IP in case we want to NFS share the zpool
- clreslogicalhostname create -g rg-zfs cluster-test
- NOTE - The logical hostname, cluster-test, must be resolvable to an IP, i.e. either in DNS or in /etc/hosts on each node.
- Create a zpool group with SUNW.nfs resource
- The sharenfs ZFS property cannot be used for NFS sharing within the cluster. Use the SUNW.nfs agent instead.
- clresourcegroup create -n sunx4200-shu01,sunx4200-shu02,sunx2200-shu01 -p Pathprefix=/global/services/zfs/zfs-shares-admin/ zpool-shares-group
- clreslogicalhostname create -g zpool-shares-group cluster-test
- clresourcetype register SUNW.HAStoragePlus
- clresource create -g zpool-shares-group -t SUNW.HAStoragePlus -x Zpools=HAzfs,HAzfs_mirror zpools-storage-resource
- clresourcegroup online -e -m -M zpool-shares-group
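- vi /global/services/zfs/zfs-shares-admin/SUNW.nfs/dfstab.zpool-shares (Add the share commands. The file must sit in the SUNW.nfs directory under the group's Pathprefix and be named dfstab.<resource name>)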
- clresource create -g zpool-shares-group -t SUNW.nfs -p Resource_dependencies=zpools-storage-resource zpool-shares
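The note about the logical hostname being resolvable can be checked per node with a small helper. This sketch only looks in a hosts-format file (it does not query DNS), and `host_in_file` is a hypothetical name.

```shell
#!/bin/sh
# Sketch: succeed if NAME appears as a hostname or alias in a hosts-format file.
# Usage: host_in_file NAME [FILE]   (FILE defaults to /etc/hosts)
host_in_file() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                                   # strip comments
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit (found ? 0 : 1) }' "${2:-/etc/hosts}"
}

if host_in_file cluster-test; then
    echo "cluster-test is in /etc/hosts"
else
    echo "cluster-test is not in /etc/hosts (check DNS)"
fi
```

Run it on each node in the resource group's node list before creating the logical hostname resource.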
Solaris Volume Manager (SVM), Metasets
- Metasets are SVM disksets designed to mimic Veritas Disk Groups to ease disk administration among cluster nodes. Judging by how rarely metasets come up on the various Solaris message boards and forums, they are likely not widely used. Still, if you want to create one, here's how...
- Before beginning, create metadb replicas on local disks on each node that will be associated with the metaset.
- Create the set and add cluster hosts
- metaset -s oracle -a -h host1 host2 host3
- Now add disks to the set
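- metaset -s oracle -a c5t6d0 c5t6d0 (Note - Substitute your own shared disks; each disk should be listed only once)
- Now create a metadevice in the metaset
- metainit -s oracle d1 1 2 c5t6d0s0 c5t6d0s0 (Note - Use slice 0 for the metadevice; the two stripe components should be different slices)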
Misc. Commands
- clresourcegroup delete -F rg-zfs (Force delete a resource group with resources)
- clresourcegroup offline nfs-rg
Using bootadm with x86
- Using bootadm can be convenient to pre-select boot options before doing a reboot. That way you don't have to wait around for the grub menu to come up to make selections or change boot options. I use it to boot into cluster or non-cluster mode.
- For example, in the following case I use bootadm to set grub to boot menu item number 1 which is SC cluster mode.
- Basically the sequence is "bootadm set-menu default=1" then "init 6".
[root@sunx2200-shu01--->]bootadm list-menu
The location for the active GRUB menu is: /boot/grub/menu.lst
default 0
timeout 10
0 Solaris 10 11/06 s10x_u3wos_10 X86 No Cluster
1 Solaris 10 11/06 s10x_u3wos_10 X86
2 Solaris failsafe
[root@sunx2200-shu01--->]bootadm set-menu default=1

#This menu entry was added to simplify booting into non-cluster mode
#File = /boot/grub/menu.lst
title Solaris 10 11/06 s10x_u3wos_10 X86 No Cluster
root (hd0,0,a)
kernel /platform/i86pc/multiboot -x
module /platform/i86pc/boot_archive
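Rather than remembering which entry number is which, the number can be looked up by title from the `bootadm list-menu` output. This is a minimal sketch, assuming entries are listed as `<number> <title>` as in the listing above; `menu_index` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the GRUB menu entry number whose title exactly matches $1,
# given `bootadm list-menu` output on stdin.
menu_index() {
    awk -v want="$1" '
        $1 ~ /^[0-9]+$/ {
            title = $0
            sub(/^[ \t]*[0-9]+[ \t]+/, "", title)    # drop the leading entry number
            if (title == want) { print $1; exit }
        }'
}

# Menu listing copied from the example above.
menu_index 'Solaris 10 11/06 s10x_u3wos_10 X86' <<'EOF'
default 0
timeout 10
0 Solaris 10 11/06 s10x_u3wos_10 X86 No Cluster
1 Solaris 10 11/06 s10x_u3wos_10 X86
2 Solaris failsafe
EOF
```

For the listing above this prints `1`, the cluster-mode entry. On a live system the result could feed `bootadm set-menu default=<n>` before `init 6`.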
