DCA failover

Post date: Nov 29, 2012 5:49:27 PM

Q. We are testing the mdw to smdw automatic failover. Is there a way to shorten the time it takes to detect the failure? I was told it can take a fairly long time (5-7 minutes) after a failure of mdw for it to switch to smdw.

Ans: This is not tunable as far as I know. In fact, I’ve seen it take an hour or more in some cases.
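If the automatic switch is taking too long, the standby master can also be activated by hand. A minimal sketch, run as gpadmin; the master data directory shown (/data/master/gpseg-1) is the usual DCA default and is an assumption here:

    # On mdw (while it is healthy), confirm the standby is configured and in sync:
    gpstate -f

    # If mdw has failed and the automatic switch has not happened yet,
    # activate the standby manually on smdw:
    gpactivatestandby -d /data/master/gpseg-1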

Q. During automatic failover, is Command Center automatically started on smdw?

Ans: I don’t know, but I don’t think so. I would be happy to be mistaken on this point, however.
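If it turns out not to start automatically, Command Center can be started by hand on smdw after the standby master is activated. A sketch; the instance name (dca here) is a placeholder for whatever name was used when the instance was created with gpcmdr:

    # On smdw, once it is acting as master:
    gpcmdr --start dca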

Q. What will happen if we simulate a NIC, cable, or Ethernet failure by running ifdown on a segment server for one of the Ethernet networks? Will the segments fail over to their mirrors? How long until notification?

Ans: In DCA v2, this will have no obvious effect on the database. DCA v2 has a physically redundant network topology where the loss of a link is routed around automatically.
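For reference, the simulation itself is just an ifdown of the interface carrying the address in question. A sketch; the device name (eth5) is a placeholder, so first check which interface actually holds the sdw3-2 address:

    # On sdw3, find the device that carries the sdw3-2 address:
    ip addr | grep -B2 "$(getent hosts sdw3-2 | awk '{print $1}')"

    # Then take that device down (eth5 is only a placeholder):
    ifdown eth5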

In DCA v1, this will have a primary and a secondary effect. What this does is effectively make a hostname impossible to reach. For example, suppose you pull the cable from the second 10Gb NIC on sdw3. That means nothing can talk to sdw3-2. Things needing to talk to sdw3-2 will register that as a failure and do what needs doing to get around it.

For primary and mirror segment DBs linked to sdw3-2 in gp_segment_configuration, these will very quickly (depending on one or more timeouts) be seen as unavailable, and their counterpart segment DBs will go into change tracking mode. So, on DCA v1, three mirror segments on other servers (sdw1, sdw2, and sdw4) will go into change tracking, and three primaries on other servers will also go into change tracking (since some mirrors are assigned to sdw3-2 as well).
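You can watch this from the master using the same catalog table. A sketch; in gp_segment_configuration, mode 'c' means change tracking and status 'd' means down:

    # List segments that are in change tracking or marked down:
    psql -d template1 -c "
        SELECT dbid, content, role, preferred_role, mode, status, hostname, address
        FROM   gp_segment_configuration
        WHERE  mode = 'c' OR status = 'd'
        ORDER  BY content;"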

However, there is a secondary effect. All of the segment DBs on sdw3 that are assigned to sdw3-1 need to use the cable you pulled from sdw3-2 to talk to their counterparts. By this I mean that a primary assigned to sdw3-1 will be mirrored to another server using that server’s -2 hostname (sdw4-2, perhaps). Since sdw3-2 no longer has a connection, this primary segment DB can’t use it to reach its mirror on sdw4-2. Eventually, the master will interpret this as a failure. The segment DBs assigned to sdw3-1 will then go into change tracking mode, and all of the associated segment DBs elsewhere (the mirrors for the sdw3-1 primaries and the primaries for the sdw3-1 mirrors) will be marked as failed. This usually takes a lot longer.
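The quickest summary of both effects is gpstate -e, which reports only the segments with mirroring problems (down, change tracking, or resynchronizing):

    # Run on the master as gpadmin:
    gpstate -e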

Q. If we pull a disk, write some data, and then re-insert it, will omreport storage disk controller=0 show the RAID 5 rebuild? I know gpstate -e will show the resync rate.

Ans: Yes, on DCA v1, doing what you describe will appear in the omreport storage output as a degraded disk group that is rebuilding. 
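A sketch of the v1 check; note that current OMSA syntax uses the pdisk and vdisk keywords rather than disk, so the exact form may vary with the OMSA version installed:

    # Virtual disk state -- a rebuilding RAID 5 group shows as Degraded:
    omreport storage vdisk controller=0

    # Physical disk state, including rebuild progress:
    omreport storage pdisk controller=0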

On DCA v2, the command is CmdTool2, not omreport, but the output is analogous and the behavior similar. On v2, you would see the hot spare take over and, when you put the original disk back in, the rebuild should start over onto the now-good original drive rather than the hot spare.
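CmdTool2 takes MegaCLI-style options. A sketch, with the caveat that flags can differ between tool versions, so treat these as assumptions to verify on your DCA:

    # Logical drive state for all adapters (shows degraded/rebuilding arrays):
    CmdTool2 -LDInfo -Lall -aALL

    # Per-physical-disk state, including rebuild progress:
    CmdTool2 -PDList -aALL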