Exchange 2010 HA Guide

Contents

1.Introduction:

1.1 Quorum

1.2 DAG Networks

1.3 Active Manager

2.Datacenter Activation Coordination DAC

2.1 Introduction

2.2 How to get DAC OK status?

2.3 Restore-DatabaseAvailabilityGroup

2.4 Examples

3.Recovery Single Failed DAG member

4.Database Mobility

5.Outlook WebApp across Sites

5.1 Introduction

5.2 Scenario 1

5.3 Scenario 2

5.4 Scenario 3

6.Datacenter Switch Over

6.1 Terminate the primary data center

6.2 Activating Mailbox Servers

6.4 Activating CAS Servers

6.5 Restoring Services in the Primary Datacenter

7.Autodiscover

7.1 When Autodiscover is triggered on Outlook

7.2 How to find the service

7.3 What Autodiscover needs

7.4 What Autodiscover process

7.5 What Autodiscover returns

8.How Outlook Connects

8.1 What information Outlook needs

8.2 Database linkage to CAS Arrays

Scenario 1

Scenario 2

Scenario 3

Scenario 4

1.Introduction:

This guide explains, in plain terms, the technologies and procedures you need to know to perform an Exchange 2010 datacenter switchover, recover a failed DAG member, or stretch a DAG between sites.

1.1 Quorum

Quorum is a mechanism that ensures only one subset of cluster members is functioning at any given time. It is used to determine which subset holds the majority.

Quorum data is the cluster configuration that is shared between all nodes.

Exchange 2010 uses only two of the four quorum models:

Node Majority: for an odd number of nodes
Node and File Share Majority: for an even number of nodes

The witness is a file share (holding Witness.log) that represents one vote when a tie needs to be broken. When the cluster is one vote away from losing the majority, the node that holds the cluster group (the PAM) locks the file on the witness share.

The witness file share cluster resource is created when the number of DAG members becomes even, and the cluster applies IsAlive checks to monitor it. If it fails, the cluster group is moved to another node, which then tries to bring it online.

The Exchange Trusted Subsystem group must be a member of the local Administrators group on the witness server and on the alternate witness server.
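The vote counting described in this section can be sketched in a few lines of Python. This is only a model of the arithmetic, not anything Exchange or the cluster service exposes; the function names (quorum_model, total_votes, has_quorum) are made up for illustration, and "Node and File Share Majority" is used as the full name of the file share model.

```python
def quorum_model(dag_members: int) -> str:
    """Return the quorum model this section describes for a DAG of this size."""
    if dag_members < 1:
        raise ValueError("a DAG needs at least one member")
    # Odd member count: the nodes alone can always form a strict majority.
    # Even member count: the witness file share adds the tie-breaking vote.
    return "Node Majority" if dag_members % 2 == 1 else "Node and File Share Majority"


def total_votes(dag_members: int) -> int:
    """Count the voters: every node, plus the witness when the node count is even."""
    return dag_members + (1 if dag_members % 2 == 0 else 0)


def has_quorum(dag_members: int, nodes_up: int, witness_reachable: bool = True) -> bool:
    """True when the reachable voters form a strict majority of all voters."""
    votes_needed = total_votes(dag_members) // 2 + 1
    witness_vote = 1 if (dag_members % 2 == 0 and witness_reachable) else 0
    return nodes_up + witness_vote >= votes_needed


# A four-member DAG that has lost two nodes survives only with the witness vote.
print(quorum_model(4))                                     # Node and File Share Majority
print(has_quorum(4, nodes_up=2))                           # True
print(has_quorum(4, nodes_up=2, witness_reachable=False))  # False
```

With four members there are five voters, so three votes are needed; two surviving nodes plus the witness reach exactly that number, which is why the witness matters only when the cluster is one vote from losing the majority.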

1.2 DAG Networks

For each subnet that the cluster discovers, a DAG network is created. Note that heartbeats are sent over all networks.

Two types of DAG Networks:

MAPI Network:

You can have only one MAPI network.
It has a default gateway and is registered in DNS.

Replication Network (over TCP 64327):
You can have zero or as many replication networks as you need.
It has no default gateway and is not registered in DNS.

It is important to note the following:

DAG network enumeration happens only when DAG members are added, or it can be triggered by running (Set-DatabaseAvailabilityGroup -DiscoverNetworks).
If the MAPI network fails on a server, an automatic failover happens.
If the replication network fails on a server, replication continues over the MAPI network.
iSCSI networks should be configured to be ignored by the cluster.

Also make sure that the replication network cannot route to the MAPI network under any circumstances; otherwise a cross-heartbeat scenario will occur.
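The failure behaviour listed above comes down to a small decision rule. The sketch below models it in Python for a single server; the function name replication_path and its return strings are hypothetical and exist only to make the rule explicit.

```python
def replication_path(mapi_up: bool, replication_networks_up: int) -> str:
    """Return what happens on one server for a given network state."""
    if not mapi_up:
        # Losing the MAPI network triggers an automatic failover.
        return "failover"
    if replication_networks_up > 0:
        # At least one dedicated replication network is still usable.
        return "replication network"
    # No replication network left: replication falls back to the MAPI network.
    return "MAPI network"


print(replication_path(mapi_up=True, replication_networks_up=1))   # replication network
print(replication_path(mapi_up=True, replication_networks_up=0))   # MAPI network
print(replication_path(mapi_up=False, replication_networks_up=1))  # failover
```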

1.3 Active Manager

Active Manager runs inside the Microsoft Exchange Replication service.

Information about where a database is currently active DOES NOT LIVE IN AD. Active Manager is the component that tracks it.

Active Manager has three roles:

Standalone (for servers that are not members of a DAG)
Standby (SAM):
Monitors local resources and notifies the PAM
Gives information to Active Manager clients about where databases are active
Primary (PAM):
Runs on the node that holds the cluster group
Performs Best Copy Selection

The Active Manager client runs on Hub Transport and CAS servers so that they know where the active copy lives in order to deliver or access data.

2.Datacenter Activation Coordination DAC

2.1 Introduction

Active Manager handles DAC

DAC mode enables three new commands: Stop-DatabaseAvailabilityGroup, Start-DatabaseAvailabilityGroup and Restore-DatabaseAvailabilityGroup

DAC is a DAG property that uses the DACP protocol to handle split-brain scenarios when a DAG is stretched across more than one subnet.

When enabled, DAC acts as an extra application-level quorum criterion that must return OK.

DAC places each DAG member in one of two sets:

Stopped DAG Members - Stop-DatabaseAvailabilityGroup
Started DAG Members - Start-DatabaseAvailabilityGroup

Only started DAG members participate in DAC voting. Started servers are the candidates for bringing their database copies online.

Stopped is an Active Manager state that prevents databases from being mounted on the server and excludes the server from DAC voting.

2.2 How to get DAC OK status?

The DAC check succeeds if all started DAG members can communicate with each other.
If they cannot, it still succeeds if a started DAG member can communicate with a node whose DAC bit is 1.

Note: When exactly two started DAG members exist in the alternate datacenter, the boot time of the alternate witness server can also be used. If the witness server booted before the DAC bit was set, the DAC check succeeds. Otherwise, use Restore-DatabaseAvailabilityGroup. This applies only when there are two started DAG members.

In all cases, if every DAG member has DAC bit 0, use Start-DatabaseAvailabilityGroup to reset the DAC bit to 1, even if the nodes are already started.
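The two main checks above can be modeled as follows. This is a simplified sketch, not the actual DACP implementation: the Member class, its fields and the dac_ok function are invented for illustration, connectivity is reduced to a set of reachable names, and the two-member witness boot time rule from the note is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    name: str
    dac_bit: int = 0  # assumed to stay 0 until one of the checks below passes
    reachable: set = field(default_factory=set)  # names this member can contact


def dac_ok(member: Member, started: list) -> bool:
    """Apply the two checks from this section to one started DAG member."""
    others = [m for m in started if m.name != member.name]
    # Check 1: the member can talk to every other started member.
    if all(m.name in member.reachable for m in others):
        return True
    # Check 2: it can talk to at least one member whose DAC bit is already 1.
    return any(m.name in member.reachable and m.dac_bit == 1 for m in others)


lon1 = Member("LON1", reachable={"LON2"})
lon2 = Member("LON2", dac_bit=1, reachable={"LON1"})
nyc1 = Member("NYC1")  # isolated, cannot reach anyone
started = [lon1, lon2, nyc1]

print(dac_ok(lon1, started))  # True: LON2 is reachable and has DAC bit 1
print(dac_ok(nyc1, started))  # False: no started member is reachable
```

In this model, a member that is cut off from everyone else never passes, which is the split-brain protection the section describes. If every member has DAC bit 0 and they cannot all reach each other, neither check can pass, which matches the advice to run Start-DatabaseAvailabilityGroup in that situation.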

2.3 Restore-DatabaseAvailabilityGroup

Evicts DAG members marked as stopped from the cluster, thereby re-establishing quorum
Assigns the alternate witness share when the remaining number of nodes is even

It takes the following parameters:

Identity (required): name of the DAG
ActiveDirectorySite (optional)
AlternateWitnessDirectory and AlternateWitnessServer (optional): these can be configured ahead of time at the DAG level.
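The two effects of the cmdlet listed at the start of this section can be illustrated with a small model. The function restore_dag below is hypothetical and only mimics the membership arithmetic; the server names are examples.

```python
def restore_dag(members: list, stopped: list) -> tuple:
    """Model the restore: drop stopped members, then decide whether a witness is needed."""
    stopped_set = set(stopped)
    # Effect 1: members marked as stopped are evicted from the cluster.
    remaining = [m for m in members if m not in stopped_set]
    if not remaining:
        raise ValueError("no started members left to form a cluster")
    # Effect 2: an even number of remaining nodes needs the alternate witness share.
    needs_alternate_witness = len(remaining) % 2 == 0
    return remaining, needs_alternate_witness


remaining, needs_witness = restore_dag(
    members=["NYC1", "NYC2", "LON1", "LON2"],
    stopped=["NYC1", "NYC2"],
)
print(remaining)      # ['LON1', 'LON2']
print(needs_witness)  # True
```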

2.4 Examples

Stop-DatabaseAvailabilityGroup -Identity DAG1 -MailboxServer E14EX2

Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite Redmond

Stop-DatabaseAvailabilityGroup -Identity DAG1 -MailboxServer E14EX3 -ConfigurationOnly

3.Recovery Single Failed DAG member

Database copies on the failed server are reported as (ServiceDown).

For a failed server named MBX1:

Remove the database copies on the server:

Remove-MailboxDatabaseCopy DB1\MBX1

This command generates a warning because the server is offline, but the information about the copy is still deleted from AD.

Remove the server's configuration from the DAG:

Remove-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer MBX1 -ConfigurationOnly

The server may not be fully removed from the cluster, so open the cluster console from any active mailbox server and evict the failed DAG member manually if it is still listed.

Reset the computer account in AD.
Install a fresh copy of Windows with the same patches and service pack (IMPORTANT: SAME IP addresses).
Run Setup /m:RecoverServer
Add the server back to the DAG:
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer MBX1
Add the database copies back to it.

4.Database Mobility

If a server fails but the SAN or disk holding its database files is still accessible, you can mount the database on another server. This is called Database Mobility.

Attach the database files to a drive on the new mailbox server.
Use eseutil to check the health of the database

eseutil /mh database.edb | findstr /i "state:"

If the database is in Dirty Shutdown state and the log files are available, perform a soft recovery. From the folder that contains the log files, type:

eseutil /r E00 /d G:\Data\databaseFolderPath

Note: Replace E00 with the log prefix of the database.

Finally, create a new database on the new server, mark it as overwritable by a restore, dismount it, and replace its files with the recovered ones.
Point the users to the new database:

Get-Mailbox -Database oldDB | Set-Mailbox -Database newDB

Outlook clients will automatically pick up the new info.
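The health check and the soft recovery decision in this section can be scripted. The Python sketch below reads the State line from eseutil /mh output and suggests the next step. The helper names database_state and next_step are hypothetical and the sample text is a shortened, illustrative fragment rather than real captured output. The case of a dirty database without log files is not covered by this guide, so the sketch only flags it.

```python
import re
from typing import Optional


def database_state(eseutil_mh_output: str) -> Optional[str]:
    """Extract the value of the 'State:' line from eseutil /mh output, if present."""
    match = re.search(r"^[ \t]*State:[ \t]*(.*\S)[ \t]*$",
                      eseutil_mh_output, re.MULTILINE | re.IGNORECASE)
    return match.group(1) if match else None


def next_step(state: Optional[str], logs_available: bool) -> str:
    """Map the database state to the action described in this section."""
    if state is None:
        return "state not found; inspect the eseutil output manually"
    if state.lower() == "clean shutdown":
        return "no recovery needed; move the files to the new database"
    if state.lower() == "dirty shutdown" and logs_available:
        return "run soft recovery with eseutil /r from the log folder"
    return "not covered by this guide; soft recovery needs the log files"


# Illustrative fragment only; real output contains many more header fields.
sample = """
            State: Dirty Shutdown
     Log Required: 8-8 (0x8-0x8)
"""
state = database_state(sample)
print(state)                                  # Dirty Shutdown
print(next_step(state, logs_available=True))  # run soft recovery with eseutil /r from the log folder
```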

5.Outlook WebApp across Sites

5.1 Introduction

When CAS receives OWA requests:

It checks whether the request can be served locally.
If the mailbox is not local, the CAS retrieves the OWA ExternalURL of the target Active Directory site and redirects the user to it if it is defined, or proxies the request if no OWA ExternalURL is defined in the target site.
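The decision a CAS makes for an incoming OWA request can be written out as a short function. This is a sketch of the rule as stated above, not of the real CAS code; the function name handle_owa_request, the site names and the example URL are all assumptions made for illustration.

```python
from typing import Optional


def handle_owa_request(cas_site: str, mailbox_site: str,
                       target_external_url: Optional[str]) -> str:
    """Return the action a CAS takes for an OWA request, per the rule above."""
    if mailbox_site == cas_site:
        # The mailbox is in the same AD site as the CAS.
        return "serve locally"
    if target_external_url:
        # The target site publishes OWA itself, so send the user there.
        return "redirect to " + target_external_url
    # No ExternalURL in the target site: the CAS proxies the request.
    return "proxy to a CAS in " + mailbox_site


print(handle_owa_request("NYC", "NYC", None))
# serve locally
print(handle_owa_request("NYC", "LON", "https://owa.lon.contoso.com/owa"))
# redirect to https://owa.lon.contoso.com/owa
print(handle_owa_request("NYC", "LON", None))
# proxy to a CAS in LON
```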

Additional scenarios are described below.

5.2 Scenario 1

Suppose that the primary site went down completely, and you changed the DNS entry for owa.contoso.com to point to the CAS NLB in the secondary site. Now the primary site is back to normal and you changed the DNS entry for owa.contoso.com to point to the primary CAS NLB in the main site.

Clients need to wait for the TTL of owa.contoso.com to expire (the TTL is usually set to 5 minutes), and even after the DNS cache expires, the browser can keep its own cached DNS entry for another 20 minutes.

So a loop occurs: the browser goes to owa.contoso.com, which still resolves to the secondary CAS NLB because of the browser cache, and the secondary CAS array sends an OWA redirection message along the lines of “Hey... you should be using the primary site's URL for best performance.” It does so because the mailbox is now active in the primary site, and the URL it points to is the OWA ExternalURL of the primary CAS array.

The user may think “ODD, I just did log in at that site! Silly computer, let me log in again.”

The second time the user logs in to owa.contoso.com, they will probably still hit the secondary CAS array because their browser cache still isn’t updated. The secondary CAS array servers are intelligent enough to see this second logon attempt (via a web canary) and conclude: “OH… this user’s DNS cache is old. They don’t know we failed back to the other datacenter. Send them the FailbackURL for the primary CAS servers.”

The user is then shown a slightly different page with a “CONTINUE” button, explaining that the mailbox is in the process of being brought online in a different datacenter. Clicking Continue takes them to the FailbackURL. They log in again and this time land successfully in OWA.

So the secondary CAS array checks whether the primary CAS servers have a FailbackURL configured, and if so, redirects the client to it to end the loop. If no FailbackURL is configured, the secondary CAS array sends the client an error page telling them to close the browser and try again.
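The failback sequence in this scenario can be summarized as a small state model. The sketch below is an assumption-laden simplification: the function secondary_cas_response is hypothetical, the web canary is reduced to a logon attempt counter, and the example FailbackURL is invented.

```python
from typing import Optional


def secondary_cas_response(logon_attempt: int, failback_url: Optional[str]) -> str:
    """Model what the secondary CAS array sends to a client with a stale DNS cache."""
    if logon_attempt <= 1:
        # First logon: ordinary redirection message to the primary ExternalURL.
        return "redirect message to the primary ExternalURL"
    if failback_url:
        # Repeated logon detected: offer the FailbackURL to break the loop.
        return "continue page leading to " + failback_url
    # No FailbackURL configured: the loop cannot be broken for the user.
    return "error page: close the browser and try again"


print(secondary_cas_response(1, "https://failback.contoso.com/owa"))
# redirect message to the primary ExternalURL
print(secondary_cas_response(2, "https://failback.contoso.com/owa"))
# continue page leading to https://failback.contoso.com/owa
print(secondary_cas_response(2, None))
# error page: close the browser and try again
```

The FailbackURL works because it is a different host name from owa.contoso.com, so the client's stale cache entry does not apply to it.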

5.3 Scenario 2

If the CAS receives an OWA request for a mailbox in a database, and it can see that the database legacyExchangeDN matches its local AD site but the database is mounted in a different site, the CAS issues a redirect to the ExternalURL of the CAS in the site where the database is mounted.

5.4 Scenario 3

NEW IN SP2 Cross-Site Silent Redirection

If you set CrossSiteRedirectType to Silent with Set-OwaVirtualDirectory (the default is Manual), all redirections become silent. In addition, if FBA or Integrated authentication is configured, the user gets a single sign-on experience.

6.Datacenter Switch Over

This section covers a complete outage of the primary datacenter (NYC) and the restoration of service in the secondary datacenter (LON).

6.1 Terminate the primary data center

DAG members in the primary datacenter must be marked as stopped. Stopped is an Active Manager state that prevents database copies from being mounted on them and excludes them from DACP voting. This is done on the primary and the secondary side:

On the Primary side:

If the mailbox servers in the primary site are operational and there is a functioning DC in the primary site, use

Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite NYC

If the mailbox servers in the primary site are not operational but there is a domain controller in the primary site, use this command for each primary mailbox server:

Stop-DatabaseAvailabilityGroup -Identity DAG1 -MailboxServer E14EX3 -ConfigurationOnly

If neither a DC nor the mailbox servers are available in the primary site, make sure the mailbox servers stay shut down.
If the primary mailbox servers are online, make sure the Cluster service is stopped and set to Disabled; if it is not, do this manually.
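Which Stop-DatabaseAvailabilityGroup form to run on the primary side depends on what is still working there. The Python sketch below turns the cases above into a helper that prints the matching command lines. The helper name primary_stop_commands and its arguments are hypothetical; the command strings are the ones used in this section.

```python
def primary_stop_commands(dag: str, site: str, mailbox_servers: list,
                          servers_operational: bool, dc_available: bool) -> list:
    """Return the Stop-DatabaseAvailabilityGroup commands for the primary side."""
    if dc_available and servers_operational:
        # Servers and a DC are up: stop the whole site in one command.
        return ["Stop-DatabaseAvailabilityGroup -Identity %s -ActiveDirectorySite %s"
                % (dag, site)]
    if dc_available:
        # Only a DC is up: update the configuration once per mailbox server.
        return ["Stop-DatabaseAvailabilityGroup -Identity %s -MailboxServer %s -ConfigurationOnly"
                % (dag, server) for server in mailbox_servers]
    # Nothing reachable in the primary site: no command can be run there,
    # so the servers simply have to stay shut down.
    return []


for line in primary_stop_commands("DAG1", "NYC", ["E14EX3", "E14EX4"],
                                  servers_operational=False, dc_available=True):
    print(line)
```

An empty result is the third case: nothing can be run in the primary site, and the work moves to the secondary side.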

On the Secondary side :

We need to tell the secondary site which servers are available during the switchover. This is done by running the Stop-DatabaseAvailabilityGroup command with the -ConfigurationOnly switch.

UM Servers

If any Unified Messaging servers are in use in the failed datacenter, they must be disabled to prevent call routing to the failed datacenter. You can disable a Unified Messaging server by using the Disable-UMServer cmdlet (for example, Disable-UMServer UM01).

Alternatively, if you are using a Voice over IP (VoIP) gateway, you can also remove the Unified Messaging server entries from the VoIP gateway, or change the DNS records for the failed servers to point to the IP address of the Unified Messaging servers in the second datacenter if your VoIP gateway is configured to route calls using DNS.

6.2 Activating Mailbox Servers

When the primary datacenter is down, the mailbox servers in the secondary site try to take ownership of the cluster group and try to bring the primary witness server online a couple of times before timing out and failing. At that point the cluster as a whole goes down because it has lost the majority. Database copies on the primary datacenter mailbox servers appear as (ServiceDown), while database copies on the secondary datacenter mailbox servers appear as (DisconnectedAndHealthy).

The Cluster service must be stopped on each DAG member in the primary datacenter. This is one of two cases:

If the primary datacenter is down, this is already the case.
If the primary mailbox servers are online, make sure the Cluster service is stopped and set to Disabled.

Then run Restore-DatabaseAvailabilityGroup, which does two things:
Evicts stopped DAG members from the cluster
Creates the alternate witness share if it was not already configured at the DAG level

Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite LON -AlternateWitnessServer EXHUB1 -AlternateWitnessDirectory D:\DAG1

You may need to run the command a couple of times until the primary mailbox servers are evicted from the cluster.

Note: The restore command can fail; if it does, wait 5 minutes and run it again. You can also make sure that the command is executed against the right domain controller by running:

Set-ADServerSettings -PreferredServer <Domain Controller in Failover Datacenter>

At any time, you can force the cluster model to refresh by running Set-DatabaseAvailabilityGroup -Identity DAGName. For example, if you open the cluster console from a secondary mailbox server after running Restore-DatabaseAvailabilityGroup and the alternate witness share does not appear, run that command.
Make sure the witness server and directory are up. Never lose them, and avoid restarting them. Make sure Exchange Trusted Subsystem is a member of the local Administrators group on the witness server, and create a firewall rule on the witness server if necessary to allow all traffic from the mailbox servers to the witness server.
At this point, the secondary mailbox servers try to take ownership of the cluster group, bring the secondary DAG IP address online, and keep trying to bring the alternate witness share online.
Use the Get-DatabaseAvailabilityGroup cmdlet to make sure that the stopped servers are the mailbox servers in the primary site and that the started servers are only those in the secondary site.
If databases in the secondary site don’t mount automatically, remember to remove any activation blocks at the server level (Set-MailboxServer) or at the database copy level (suspended activation).
If the databases still don’t mount, use this command:

Move-ActiveMailboxDatabase -Server FQDNofaServerinPrimarySite -ActivateOnServer FQDNofaServerinDRSite

This command has several Skip switches that can be handy. This is a very important step, as it is like taking ownership of those databases. You can also use:

Move-ActiveMailboxDatabase DatabaseName -ActivateOnServer FQDNofaServerinDRSite

You need to choose whether to remove the database copies in the primary site to allow log truncation. If you do, reseeding will be necessary once you fail back to the primary datacenter.
Outlook clients will behave as follows:
If the primary CAS servers are online, CAS servers in the primary site will issue a silent redirect to Outlook users. Outlook users will see a message that they need to restart Outlook.
If the primary CAS servers are offline, you can change the DNS record for the Outlook Anywhere name, or force Autodiscover to run by repairing the Outlook profile.
OWA clients will behave as follows:
If the primary CAS servers are online, silent redirection will happen with SSO, since both OWA virtual directories have Integrated Authentication enabled.
If the primary CAS servers are offline, the DNS name for OWA in the primary site should point to the secondary site, and that’s it.
If you restart the mailbox servers in the secondary site and/or the witness server, the DAC bit will be set to 0 and databases will be shown as Dismounted. If you try to mount them, you get an error stating that the replication services on the primary mailbox servers are not online. You may also have trouble locating the Active Manager, especially if you run: Get-DatabaseAvailabilityGroup -Identity DAGName -Status.

The solution is to force the DAC bit to 1 by running Start-DatabaseAvailabilityGroup -Identity DAGName -MailboxServer (secondary mailbox server), even if the servers are already started.

6.4 Activating CAS Servers

If the primary datacenter has the following URLs internally and externally

Mail.NYC.contoso.com (Outlook Anywhere)
OWA.NYC.contoso.com (Outlook Web Access)
EAS.NYC.contoso.com (Exchange ActiveSync)

And the secondary site has:

Mail.LON.contoso.com
OWA.LON.contoso.com
EAS.LON.contoso.com

Suppose also that the Autodiscover SCP for CAS servers in the primary datacenter points to Mail.NYC.contoso.com, while the SCP for CAS servers in the secondary datacenter points to Mail.LON.contoso.com. Suppose further that the public autodiscover.contoso.com points externally to the primary datacenter publishing rule.

During Datacenter Switchover:

OWA :

Change the DNS record for OWA.NYC.contoso.com to point to OWA.LON.contoso.com in the internal and external DNS servers. Whether this is worthwhile depends on whether the primary datacenter will be down for a long time.

You can also choose not to change this DNS record if the primary CAS servers are online, since they will do the redirection.