Voting Disk Internals

CSSD processes (Cluster Services Synchronization Daemon) monitors the health of RAC nodes by two distinct heart beats, Network heart beat and Disk heart beat.

Network heartbeat is across the interconnect, every one second, a thread (sending) of CSSD sends a network tcp heartbeat to itself and all other nodes, another thread (receiving) of CSSD receives the heartbeat. If the network packets are dropped or has error, the error correction mechanism on tcp would re-transmit the package, Oracle does not re-transmit in this case. In the CSSD log, you will see a WARNING message about missing of heartbeat if a node does not receive a heartbeat from another node for 15 seconds (50% of misscount). Another warning is reported in CSSD log if the same node is missing for 22 seconds (75% of misscount) and similarly at 90% of misscount and when the heartbeat is missing for a period of 100% of the misscount (i.e. 30 seconds by default), the node is evicted.

Disk heartbeat is between the cluster nodes and the voting disk. CSSD process in each RAC node maintains a heart beat in a block of size 1 OS block in a specific offset by read/write system calls (pread/pwrite), in the voting disk. In addition to maintaining its own disk block, CSSD processes also monitors the disk blocks maintained by the CSSD processes running in other cluster nodes. The written block has a header area with the node name and a counter which is incremented with every next beat (pwrite) from the other nodes. Disk heart beat is maintained in the voting disk by the CSSD processes and If a node has not written a disk heartbeat within the I/O timeout, the node is declared dead. Nodes that are of an unknown state, i.e. cannot be definitively said to be dead, and are not in the group of nodes designated to survive, are evicted, i.e. the node’s kill block is updated to indicate that it has been evicted.

Thus summarizing the heartbeats, N/W Heartbeat is pinged every second, nodes must respond in css_miscount time, failure would lead to node eviction. Similarly Disk Heartbeat, node pings (r/w) voting disk every second, nodes must recieve a response in (long/short) disk timeout time.

Below are the different possibilities of individual heartbeat failures.

Network Ping	Disk Ping	Reboot
Completes within misscount seconds	Completes within Misscount seconds	N
Completes within Misscount seconds	Takes more than misscount seconds but less than Disktimeout seconds	N
Completes within Misscount seconds	Takes more than Disktimeout seconds	Y
Takes more than Misscount Seconds	Completes within Misscount seconds	Y

Voting Disk contains information regarding nodes in the cluster and the disk heartbeat information, CSSD of the individual nodes registers the information regarding their nodes in the voting disk and with that pwrite() system call at a specific offset and then a pread() system call to read the status of other CSSD processes. But as information regarding the nodes is in OCR/OLR too and system calls have nothing to do with previous calls, there isn’t any useful data kept in the voting disk. So, if you lose voting disks, you can simply add them back without losing any data. But, of course, losing voting disks can lead to node reboots. If you lose all voting disks, then you will have to keep the CRS daemons down, then only you can add the voting disks.Now that we have understood both the heartbeats which was the most important part, we will dig deeper into Voting Disk, as in what is stored inside voting disk, why Clusterware needs of Voting Disk, how many voting disks are required etc.

Now finally to understand the whole concept of voting disk we need to know what is split brain syndrome, I/O Fencing and simple majority rule.

Split Brain Syndrome, In a Oracle RAC environment all the instances/servers communicate with each other using high-speed interconnects on the private network. This private network interface or interconnect are redundant and are only used for inter-instance oracle data block transfers. Now talking about split-brain concept with respect to oracle RAC systems, it occurs when the instance members in a RAC fail to ping/connect to each other via this private interconnect, but the servers are all physically up and running and the database instance on each of these servers is also running. These individual nodes are running fine and can conceptually accept user connections and work independently. So basically due to lack of communication the instance thinks that the other instance that it is not able to connect is down and it needs to do something about the situation. The problem is if we leave these instances running, the same block might get read, updated in these individual instances and there would be data integrity issue, as the blocks changed in one instance, will not be locked and could be over-written by another instance. This situation is termed as Split Brain Syndrome.

I/O Fencing, there will be some situation where the leftover write operations from failed database instances (The cluster function failed on the nodes, but the nodes are still running at OS level) reach the storage system after the recovery process starts. Since these write operations are no longer in the proper serial order, they can damage the consistency of the data stored data. Therefore when a cluster node fails, the failed node needs to be fenced off from all the shared disk devices or disk groups. This methodology is called I/O fencing or failure fencing

Simple Majority Rule, According to Oracle – “An absolute majority of voting disks configured (more than half) must be available and responsive at all times for Oracle Clusterware to operate.” Which means to survive from loss of ‘N’ voting disks, you must configure atleast ‘2N+1′ voting disks.

Now we are in a state to understand the use of voting disk in case of heartbeat failure.

Example 1:- Suppose in a 3 node cluster with 3 voting disks, a network heartbeat fails between Node 1 and Node 3 & Node 2 and Node 3 whereas Node 1 and Node 2 are able to communicate via interconnect, and from the Voting Disk CSSD notices that all the nodes are able to write to Voting Disks thus spli-brain, so the healthy nodes Node 1 & Node 2 would would update the kill block in the voting disk for Node 3. Then when during pread() system call of CSSD of Node 3, it sees a self kill flag set and thus the CSSD of Node 3 evicts itself. And then the I/O fencing and finally the OHASD will finally attempt to restart the stack after graceful shutdown.

Example 2:- Suppose in a 2 node cluster with 3 voting disk, a disk heartbeat fails such that Node 1 can see 2 Voting Disks and Node 2 can see 1 Voting Disk, ( If here the Voting Disk wouldn’t have been odd then both the Nodes would have thought the other node should be killed hence would have been difficult to avoid split-brain), thus based on Simple Majority Rule, CSSD process of Node 1 (2 Voting Disks) sends a kill request to the CSSD process of Node 2 (1 Voting Disk) and thus the Node 2 evicts itself and then the I/O fencing and finally the OHASD will finally attempt to restart the stack after graceful shutdown.

Thus Voting disk is plays a role in both the heartbeat failures, and hence a very important file for node eviction & I/O fencing in case of a split brain situation.

I hope the above information helps you in understanding the concept of Voting Disk, its purpose, what it contains, when its used etc, Please comment below if you need more information as in the split brain examples of a bigger cluster, how does Oracle executes STONITH internally, what processes are involved in the complete node eviction process how to identify the cause of a node eviction, how is a node evicted, rebooted and then joined again in the cluster etc.