Enhancing Database Resilience with Azure SQL Failover Groups: A Comprehensive Guide for DB Auto Group

Applies to: Azure SQL Database

For data-driven businesses like DB Auto Group, business continuity and minimal downtime are paramount. The failover groups feature of Azure SQL Database manages replication and failover for your databases and provides a safety net against regional outages. This article gives an overview of failover groups, along with best practices and recommendations for organizations like DB Auto Group that want a more resilient database infrastructure.

To begin implementing this feature, refer to the guide on configuring a failover group for Azure SQL Database.

Note: This article focuses on failover groups for Azure SQL Database. For Azure SQL Managed Instance, please see Failover groups overview & best practices – Azure SQL Managed Instance.

Overview of Azure SQL Failover Groups

Azure SQL Failover Groups are designed to simplify the management of geo-replication and failover of databases to a secondary Azure region. For companies like DB Auto Group, this means you can replicate some or all of your critical databases to a geographically separate location. This acts as a powerful disaster recovery mechanism, ensuring that your data remains accessible even if a primary region experiences an outage. Failover groups are built upon the foundation of active geo-replication, abstracting its complexity and making it easier to deploy and manage geo-replicated databases at scale.

For specific details on geo-failover Recovery Point Objective (RPO) and Recovery Time Objective (RTO), consult the overview of business continuity documentation.

Seamless Endpoint Redirection for Uninterrupted Access

One of the key advantages of failover groups for businesses like DB Auto Group is the provision of consistent read-write and read-only listener endpoints. These endpoints remain constant even during geo-failovers. This eliminates the need to modify application connection strings post-failover, as connections are automatically routed to the current primary database. When a geo-failover occurs, all secondary databases within the group seamlessly transition to the primary role. Following the failover, the Domain Name System (DNS) record is automatically updated, redirecting traffic to the new primary region, ensuring minimal disruption to your operations at DB Auto Group.

Offloading Read-Only Workloads to Secondary Databases

To optimize performance and reduce load on primary databases, DB Auto Group can leverage secondary databases within a failover group to handle read-only workloads. By directing read-only traffic to the read-only listener, you can effectively utilize the secondary database’s resources, freeing up the primary for critical read-write operations. This strategy enhances overall system efficiency and responsiveness.

Comprehensive Application Recovery Strategy

For DB Auto Group, achieving complete business continuity involves more than just database redundancy. A holistic approach to disaster recovery requires ensuring the resilience of all components that constitute your services, including client software, web front-ends, storage solutions, and DNS configurations. It’s crucial to evaluate all dependent services and understand their respective recovery capabilities and guarantees. DB Auto Group should implement necessary measures to guarantee service functionality during failovers of dependent services, ensuring a seamless and comprehensive recovery process.

Failover Policy Options for DB Auto Group

Failover groups offer two distinct failover policies, allowing DB Auto Group to choose the option that best aligns with their operational needs and risk tolerance:

Customer Managed (Recommended): This policy empowers DB Auto Group with direct control over the failover process. You can initiate a failover of a group when an unexpected outage affects one or more databases within it. When using command-line tools such as PowerShell, Azure CLI, or the REST API, the failover policy value for customer managed is manual.
Microsoft Managed: In the event of a widespread outage impacting an entire primary Azure region, Microsoft can initiate failover for all affected failover groups configured with this policy. It’s important to note that Microsoft-managed failovers are not triggered for individual failover groups or subsets within a region. When using command-line tools, the failover policy value for Microsoft-managed is automatic.

The following table summarizes the key differences between these policies, helping DB Auto Group make an informed decision:

Failover Policy	Failover Scope	Use Case	Potential Data Loss	Control Level
Customer Managed (Recommended)	Selected Failover Group(s)	Outage affecting specific databases within a group. DB Auto Group decides when to fail over.	Yes	High – DB Auto Group initiates failover
Microsoft Managed	All Failover Groups in the Region	Region-wide outage. Microsoft initiates failover for all groups with this policy.	Yes	Low – Microsoft initiates failover in extreme scenarios
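The mapping between the policy names and the values used by command-line tools can be captured in a small helper. This is a minimal sketch, not an Azure API: the function name and the shape of the returned dictionary are hypothetical, and the exact casing of the policy values may differ between tools. The one-hour minimum reflects the grace period described for Microsoft managed failover in this article.

```python
# Policy names as used in this article, mapped to the values the article
# gives for PowerShell, Azure CLI, and the REST API.
POLICY_VALUES = {
    "customer managed": "manual",
    "microsoft managed": "automatic",
}

# The article states the grace period cannot be less than one hour.
MIN_GRACE_PERIOD_HOURS = 1


def failover_policy_settings(policy_name, grace_period_hours=None):
    """Return a plain dict describing the policy (hypothetical shape)."""
    key = policy_name.strip().lower().replace("-", " ")
    if key not in POLICY_VALUES:
        raise ValueError(f"unknown failover policy: {policy_name!r}")
    settings = {"failover_policy": POLICY_VALUES[key]}
    if POLICY_VALUES[key] == "automatic":
        # A grace period only applies to the Microsoft managed policy.
        hours = MIN_GRACE_PERIOD_HOURS if grace_period_hours is None else grace_period_hours
        if hours < MIN_GRACE_PERIOD_HOURS:
            raise ValueError("grace period must be at least one hour")
        settings["grace_period_hours"] = hours
    return settings
```

Calling failover_policy_settings("Customer Managed") yields only the policy value, while the Microsoft managed variant also carries a validated grace period.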

Customer Managed Failover: Putting DB Auto Group in Control

In rare instances, the built-in availability or high availability features might not be sufficient to mitigate an outage. This could lead to unacceptable downtime for DB Auto Group’s applications. Outages can range from localized issues affecting a few databases to datacenter, availability zone, or region-level events. In such scenarios, initiating a forced failover becomes necessary to restore business continuity.

Setting the failover policy to customer managed is strongly recommended for DB Auto Group. This approach provides you with the necessary control to initiate failovers precisely when needed, allowing for rapid restoration of business operations. You can trigger a failover as soon as you detect an outage impacting databases within a failover group, minimizing downtime and data loss.

Microsoft Managed Failover: Delegating Disaster Recovery

Opting for a Microsoft managed failover policy means entrusting disaster recovery responsibilities to the Azure SQL service. For Microsoft to initiate a forced failover, specific conditions must be met:

A significant datacenter, availability zone, or region-level outage must occur due to a natural disaster, configuration error, software bug, or hardware failure, impacting a large number of databases in the region.
A grace period must have elapsed. Due to the complexities of verifying and mitigating large-scale outages, this grace period cannot be less than one hour.

Once these conditions are satisfied, Azure SQL service will initiate forced failovers for all failover groups in the affected region that are configured with the Microsoft managed policy.

Important: While Microsoft managed failover exists, it’s crucial for DB Auto Group to primarily rely on customer managed failover for disaster recovery planning and testing. Microsoft-managed failover is reserved for extreme, region-wide disasters and should not be considered the standard operating procedure for failover scenarios. If DB Auto Group requires selective failover capabilities, customer managed failover is the appropriate choice.

Consider setting the failover policy to Microsoft managed only under these specific circumstances:

DB Auto Group wishes to delegate disaster recovery initiation to Azure SQL service for region-wide events.
Your applications can tolerate database unavailability for at least one hour or potentially longer.
It is acceptable for failovers to occur sometime after the grace period expires, as the exact timing of a Microsoft-managed failover can vary.
DB Auto Group accepts that all databases within a failover group will fail over, regardless of their zone redundancy configuration or current availability status. Even zone-redundant databases, which are designed to survive zonal failures, will be failed over if they belong to a failover group with the Microsoft managed policy.
It’s acceptable for forced database failovers to occur without considering the dependencies of your application on other Azure services or components. This could potentially lead to performance degradation or application unavailability if other services are not also failed over in a coordinated manner.
DB Auto Group is prepared for potential unknown data loss, as the exact timing of a forced failover is not controllable and disregards the synchronization status of secondary databases.
Ensure all primary and secondary databases within the failover group, as well as any geo-replication relationships, have identical service tiers, compute tiers (provisioned or serverless), and compute sizes (DTUs or vCores). If the service level objectives (SLOs) do not match across all databases, the failover policy will be automatically reverted from Microsoft Managed to Customer Managed by the Azure SQL service to prevent unexpected issues.
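The service level objective parity requirement in the last item lends itself to an automated pre-check. The sketch below assumes DB Auto Group has already collected each database's service tier, compute tier, and compute size into simple records (for example from an inventory export). The Slo class and the function name are hypothetical and do not call any Azure API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Service level objective of one database (hypothetical record type)."""
    service_tier: str   # e.g. "GeneralPurpose"
    compute_tier: str   # "Provisioned" or "Serverless"
    compute_size: str   # DTUs or vCores, e.g. "GP_Gen5_4"


def mismatched_databases(primaries, secondaries):
    """Return sorted names whose secondary is missing or has a different SLO.

    Both arguments map database name -> Slo.
    """
    return sorted(
        name for name, slo in primaries.items()
        if secondaries.get(name) != slo
    )
```

An empty result means every primary has a secondary with an identical SLO. Any name in the list is a candidate for the automatic policy reversion described above.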

When Microsoft triggers a failover, an entry labeled Failover Azure SQL failover group will be logged in the Azure Monitor activity log. This entry will include the failover group name under Resource, and Event initiated by will display a hyphen (-) indicating Microsoft as the initiator. This information can also be accessed on the Activity log page of the new primary server or instance in the Azure portal.

Terminology and Key Capabilities of Failover Groups

For DB Auto Group to effectively utilize failover groups, understanding the core terminology is essential:

Failover Group (FOG): A named collection of databases, managed by a single logical server in Azure, that can fail over together to a secondary Azure region. This ensures transactional consistency and coordinated recovery for related databases in case of a primary region outage.

Important: Failover group names must be globally unique within the .database.windows.net domain. Choose descriptive and unique names for your DB Auto Group failover groups.
Servers: A logical server can host multiple failover groups, and a failover group can include a subset or all of the user databases on a server. This provides flexibility in organizing and protecting different sets of databases based on DB Auto Group’s needs.
Primary: The logical server hosting the primary databases within the failover group. This is the active server location under normal operations.
Secondary: The logical server hosting the secondary databases in the failover group. Crucially, the secondary server must reside in a different Azure region than the primary, ensuring geographical redundancy.
Failover (No Data Loss): A planned switchover where the secondary database becomes primary after a full data synchronization with the original primary. This guarantees zero data loss. Failovers are possible only when the primary server is accessible and functional. DB Auto Group can use planned failovers for:
- Conducting disaster recovery (DR) drills in production environments without risking data loss.
- Relocating workloads to a different Azure region for strategic reasons.
- Executing failback operations to return workloads to the original primary region after an outage has been resolved.
Forced Failover (Potential Data Loss): An immediate switch of the secondary to the primary role without waiting for complete data synchronization. This operation carries the risk of data loss but is essential during outages when the primary is inaccessible. Once the outage is mitigated, the original primary server automatically reconnects and becomes the new secondary. A subsequent planned failover can be performed to fail back to the original configuration.
Grace Period with Data Loss: For Microsoft-managed failover policies, a configurable grace period (GracePeriodWithDataLossHours) determines how long the Azure SQL service waits before initiating a forced failover, potentially resulting in data loss. DB Auto Group can adjust this setting based on their application’s tolerance for data loss and RTO requirements.
Adding Single Databases to Failover Groups: Multiple single databases on the same logical server can be grouped into a single failover group. Adding a single database automatically creates a secondary database with the same edition and compute size on the designated secondary server. Existing geo-replication links to the secondary server will be inherited by the group. If a database already has a secondary on a server outside the failover group, a new secondary is created on the specified secondary server within the group.

Important considerations for DB Auto Group:
- Ensure the secondary logical server does not have a database with the same name as any database being added to the failover group, unless it’s an existing secondary database intended to be part of the group.
- For databases utilizing in-memory OLTP objects, both primary and secondary geo-replica databases must have matching service tiers. Discrepancies in service tiers can lead to out-of-memory issues on the geo-replica, potentially causing recovery failures and unsuccessful failovers. Always ensure service tier parity for databases with in-memory OLTP objects.
Adding Databases in Elastic Pools to Failover Groups: Entire elastic pools, or subsets of databases within them, can be included in failover groups. If a primary database resides in an elastic pool, the secondary database is automatically created in a secondary elastic pool with the same name. DB Auto Group must ensure that the secondary server has an identically named elastic pool with sufficient capacity to accommodate the secondary databases. Existing geo-replication links within the pool are inherited by the group, and new secondary databases are created in the secondary pool for databases without pre-existing secondaries in the failover group’s secondary server.
Failover Group Read-Write Listener: A DNS CNAME record, automatically generated upon failover group creation, that always points to the current primary server. For DB Auto Group applications, the read-write workload should always connect using the listener URL <fog-name>.database.windows.net. This ensures transparent reconnection to the primary after failovers without connection string modifications. The DNS record is automatically updated after each failover.
Failover Group Read-Only Listener: Another automatically created DNS CNAME record, <fog-name>.secondary.database.windows.net, which points to the current secondary server. This listener is designed for read-only workloads. By default, failover for the read-only listener is disabled to protect primary server performance during secondary server outages. However, this means read-only sessions will be interrupted until the secondary recovers. If DB Auto Group requires continuous availability for read-only sessions and can tolerate potential performance impacts on the primary, you can enable read-only listener failover using the AllowReadOnlyFailoverToPrimary property. In this case, read-only traffic automatically redirects to the primary if the secondary is unavailable.

Note: The AllowReadOnlyFailoverToPrimary property is only effective when Microsoft managed failover policy is enabled and a forced failover is triggered. When set to True, the new primary server will handle both read-write and read-only sessions.
Multiple Failover Groups: For advanced configurations, DB Auto Group can create multiple failover groups between the same pair of servers. This allows for fine-grained control over geo-failover scope. Each group fails over independently. For tenant-per-database applications across multiple regions using elastic pools, this capability can be used to strategically mix primary and secondary databases within each pool, minimizing the impact of regional outages to only a subset of tenant databases.
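The distinction between a planned failover and a forced failover defined above can be made explicit in tooling, so that data loss is never accepted by accident. The following sketch only builds an argument list for the Azure CLI and does not run it. The helper name and its parameters are hypothetical, and the command and flag names (az sql failover-group set-primary, --allow-data-loss) reflect the author's understanding of the Azure CLI and should be checked against the current CLI reference before use.

```python
def set_primary_argv(group, secondary_server, resource_group,
                     primary_reachable, accept_data_loss=False):
    """Build (but do not execute) the CLI arguments for a failover.

    The command targets the current secondary server, which takes over
    the primary role.
    """
    if not primary_reachable and not accept_data_loss:
        # A planned failover needs a reachable primary; the forced variant
        # must be an explicit operator decision.
        raise ValueError(
            "primary is unreachable: only a forced failover is possible, "
            "and that requires explicitly accepting data loss"
        )
    argv = [
        "az", "sql", "failover-group", "set-primary",
        "--name", group,
        "--server", secondary_server,
        "--resource-group", resource_group,
    ]
    if not primary_reachable:
        # Forced failover: skip full synchronization, data loss is possible.
        argv.append("--allow-data-loss")
    return argv
```

Keeping the decision to accept data loss as a separate, explicit argument mirrors the article's recommendation that DB Auto Group stays in control of when a forced failover happens.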

Failover Group Architecture for DB Auto Group

A failover group in Azure SQL Database can encompass one or more databases, typically those used by a common application or service within DB Auto Group. A failover group is configured on the primary server, establishing a connection to a designated secondary server in a different Azure region. A failover group can include all or a selected subset of databases from the primary server. The following diagram illustrates a typical geo-redundant cloud application architecture leveraging failover groups:

Figure: Geo-redundant cloud application architecture using an Azure SQL failover group. The primary region hosts the active databases and the read-write listener, and replicates to a secondary region that hosts the standby databases and the read-only listener.

When designing services with business continuity as a core principle, DB Auto Group should adhere to general guidelines and best practices. When setting up a failover group, it’s critical to ensure that authentication and network access are properly configured on the secondary server to function seamlessly after a geo-failover when it becomes the new primary. For detailed guidance, see Configure and manage Azure SQL Database security for geo-restore or failover. Further information can be found in Designing globally available services using Azure SQL Database and Geo-restore for Azure SQL Database.

Best Practices and Recommendations for DB Auto Group

To maximize the effectiveness of failover groups, DB Auto Group should consider these best practices:

Utilize Paired Azure Regions

When creating failover groups between primary and secondary servers, always prioritize using paired Azure regions. Failover groups within paired regions generally exhibit improved performance compared to those in unpaired regions due to network proximity and optimized infrastructure.

Azure SQL Database typically avoids simultaneous updates to paired regions to adhere to safe deployment practices. However, predicting which region will be upgraded first is not possible, and the deployment order is not guaranteed. Sometimes the primary server region is upgraded before the secondary, and vice versa.

If DB Auto Group has configured geo-replication or failover groups for databases that do not align with Azure region pairings, it is advisable to use different maintenance window schedules for primary and secondary databases. For example, selecting a Weekday maintenance window for the secondary database and a Weekend maintenance window for the primary database can minimize potential conflicts during maintenance operations.

Understanding Initial Seeding Duration

When adding databases or elastic pools to a failover group, an initial seeding phase occurs before data replication commences. This initial seeding is typically the most time-consuming and resource-intensive part of the process. Once completed, subsequent data changes are replicated efficiently. The duration of initial seeding depends on factors such as data size, the number of databases being replicated, the load on primary databases, and the network bandwidth between primary and secondary regions. Under normal conditions, SQL Database can achieve seeding speeds up to 500 GB per hour. Seeding operations are performed in parallel for all databases within the failover group.
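The seeding figures above allow a rough best-case estimate. The article does not say whether the 500 GB per hour ceiling applies to each database or to the group as a whole, so this sketch returns both readings as a range. It is an illustrative lower bound only: real durations also depend on primary load and network bandwidth, and the function name is hypothetical.

```python
MAX_SEEDING_GB_PER_HOUR = 500.0  # "up to" figure quoted in this article


def seeding_hours_bounds(database_sizes_gb):
    """Return (optimistic, conservative) best-case seeding hours.

    optimistic:   the rate applies per database and all seed in parallel,
                  so the largest database dominates.
    conservative: the rate is shared across the whole failover group.
    """
    sizes = list(database_sizes_gb)
    if not sizes:
        return (0.0, 0.0)
    optimistic = max(sizes) / MAX_SEEDING_GB_PER_HOUR
    conservative = sum(sizes) / MAX_SEEDING_GB_PER_HOUR
    return (optimistic, conservative)
```

For three databases of 200 GB, 300 GB, and 500 GB, the function returns (1.0, 2.0), that is, between one and two hours under ideal conditions.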

Optimize Number of Databases per Failover Group

The number of databases within a failover group directly influences the duration of both planned Failover and Forced Failover operations.

Planned Failover: This operation ensures complete synchronization of all primary databases with their secondaries before switching roles. Databases are prepared in batches to manage control plane load. Therefore, limiting the number of databases per failover group is highly recommended for faster planned failovers.
Forced Failover: The preparation phase is expedited as data synchronization is skipped. To achieve quicker and more predictable forced failover times, it is beneficial to maintain a smaller number of databases within each failover group.

Leverage Multiple Failover Groups for Scalability

DB Auto Group can create one or more failover groups between primary and secondary servers. Each group acts as an independent failover unit. This architecture enables granular disaster recovery, allowing specific sets of databases to be failed over independently. Creating a failover group automatically establishes geo-secondary databases with the same service objective as the primary databases. When incorporating existing geo-replication relationships into a failover group, ensure the geo-secondary is configured with the same service tier and compute size as the primary for consistency and optimal performance.

Utilizing the Read-Write Listener for Primary Connections

For all read-write workloads, DB Auto Group should consistently use the read-write listener URL, <fog-name>.database.windows.net, as the server name in connection strings. Connections are automatically directed to the current primary server. This endpoint remains constant even after failovers, simplifying application connectivity management. Note that DNS record updates during failover require client DNS cache refresh for connections to be redirected to the new primary. The Time To Live (TTL) for both primary and secondary listener DNS records is 30 seconds.
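Because clients may keep resolving the listener to the old primary until their DNS cache expires, applications typically need retry logic that spans at least one TTL. The following is a minimal sketch under stated assumptions: connect is any caller-supplied function that opens a connection, and ConnectionError stands in for whatever exception the real database driver raises. The retry budget of three TTLs and the five-second delay are arbitrary illustrative choices.

```python
import time

DNS_TTL_SECONDS = 30  # listener record TTL stated in this article


def connect_with_retry(connect, max_wait_seconds=3 * DNS_TTL_SECONDS,
                       delay_seconds=5, sleep=time.sleep):
    """Call connect() until it succeeds or the wait budget is used up.

    `sleep` is injectable so the loop can be exercised without real delays.
    """
    waited = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            if waited >= max_wait_seconds:
                raise  # budget exhausted: surface the last error
            sleep(delay_seconds)
            waited += delay_seconds
```

In a real application, the except clause would name the driver's transient error types, and connect would open a fresh connection so the listener name is resolved again on each attempt.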

Employing the Read-Only Listener for Secondary Read Operations

For read-only workloads that can tolerate potential data latency, DB Auto Group can effectively utilize geo-secondary databases by directing read-only traffic to the read-only listener, <fog-name>.secondary.database.windows.net. It is also recommended to explicitly indicate read-intent in connection strings by adding ApplicationIntent=ReadOnly.

In Premium, Business Critical, and Hyperscale service tiers, Azure SQL Database supports read-only replicas to offload read-only queries using the ApplicationIntent=ReadOnly parameter. When geo-replication is configured via failover groups, this capability extends to connecting to read-only replicas in both the primary and geo-secondary locations.

To connect to a read-only replica in the secondary location specifically, use the connection string parameter ApplicationIntent=ReadOnly together with the server name <fog-name>.secondary.database.windows.net.
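The two listener endpoints and the read-intent parameter can be combined into connection strings as follows. This is a sketch rather than a complete configuration: the failover group name dbauto-fog and the database name are hypothetical, authentication settings are omitted, and the ODBC driver name assumes ODBC Driver 18 for SQL Server is installed.

```python
def listener_host(fog_name, read_only=False):
    """Return the read-write or read-only listener host for a failover group."""
    suffix = "secondary.database.windows.net" if read_only else "database.windows.net"
    return f"{fog_name}.{suffix}"


def odbc_connection_string(fog_name, database, read_only=False):
    """Build an ODBC connection string that targets a listener endpoint.

    Authentication keywords are intentionally left out of this sketch.
    """
    parts = [
        "Driver={ODBC Driver 18 for SQL Server}",
        f"Server=tcp:{listener_host(fog_name, read_only)},1433",
        f"Database={database}",
        "Encrypt=yes",
    ]
    if read_only:
        # Declare read intent explicitly, as recommended above.
        parts.append("ApplicationIntent=ReadOnly")
    return ";".join(parts) + ";"


# Hypothetical names for illustration only.
read_write = odbc_connection_string("dbauto-fog", "inventory")
read_only = odbc_connection_string("dbauto-fog", "inventory", read_only=True)
```

Since both strings reference only the failover group name, neither needs to change after a geo-failover.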

Addressing Potential Performance Degradation Post-Failover

Typical Azure applications, including those at DB Auto Group, often rely on multiple Azure services and components. Failover group operations are triggered based on the state of Azure SQL Database alone. Other Azure services within the primary region might remain unaffected by the outage and continue to be available. When primary databases fail over to the secondary (DR) region, latency between dependent components can increase. To limit the performance impact of that higher latency, DB Auto Group should ensure redundancy for all application components in the DR region, adhere to relevant network security guidelines, and orchestrate geo-failovers of dependent application components together with database failovers.

Understanding Potential Data Loss During Forced Failovers

If an outage occurs in the primary region, recent transactions might not yet be replicated to the geo-secondary database. Consequently, performing a forced failover in such situations could result in data loss.

Important: Elastic pools with 800 or fewer DTUs or 8 or fewer vCores and more than 250 databases are susceptible to issues like prolonged planned geo-failovers and reduced performance. These risks are heightened for write-intensive workloads, geographically dispersed geo-replicas, or scenarios with multiple secondary geo-replicas per database. A common symptom is increasing geo-replication lag, potentially leading to greater data loss during outages. This lag can be monitored using sys.dm_geo_replication_link_status. If these issues arise, mitigation strategies include scaling up the elastic pool’s DTUs or vCores or reducing the number of geo-replicated databases within the pool.
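The thresholds in the note above translate directly into a check that can be run against a pool inventory. The function below is a hypothetical helper that encodes only the numbers stated in this article. The query string reads a few columns from sys.dm_geo_replication_link_status so that replication lag can be tracked over time. Running the query against the primary database with a suitable driver is left to the reader.

```python
def pool_at_risk(database_count, dtus=None, vcores=None):
    """Return True if an elastic pool matches the risk profile in the note.

    Pass either `dtus` (DTU purchasing model) or `vcores` (vCore model).
    """
    small_pool = (dtus is not None and dtus <= 800) or \
                 (vcores is not None and vcores <= 8)
    return bool(small_pool and database_count > 250)


# Query to monitor geo-replication lag per link.
LAG_QUERY = """
SELECT partner_server,
       partner_database,
       replication_state_desc,
       replication_lag_sec,
       last_replication
FROM sys.dm_geo_replication_link_status;
"""
```

A pool flagged by pool_at_risk is not necessarily unhealthy. It is simply a candidate for closer lag monitoring, scaling up, or reducing the number of geo-replicated databases.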

Failback Process

For failover groups configured with a Microsoft-managed failover policy, forced failover to the geo-secondary server is automatically initiated during a disaster scenario based on the defined grace period. However, failback to the original primary server must be initiated manually by DB Auto Group once the primary region is restored.

Permissions and Limitations

Refer to the comprehensive failover group configuration guide for a detailed list of permissions and limitations associated with failover groups.

Programmatic Management of Failover Groups

Failover groups can be efficiently managed programmatically using Azure PowerShell, Azure CLI, and REST API. For detailed instructions and code samples, consult the guide on configuring failover groups for Azure SQL Database.

Enabling High Availability (Zone Redundancy) for Enhanced Resilience

Zone redundancy further strengthens resilience by protecting against availability zone outages within a region.

When creating a failover group that includes databases, enabling high availability for secondary databases directly during failover group creation is not supported, irrespective of the primary databases’ high availability settings.

Zone Redundancy with Non-Hyperscale Databases

Secondary databases created through failover groups do not automatically inherit high availability. After creating the failover group, DB Auto Group must manually enable high availability on the databases within the group. This applies even if Active Geo-Replication is initially configured, and databases are subsequently added to a failover group.

Zone Redundancy with Hyperscale Databases

Secondary databases created within failover groups inherit the high availability settings of their corresponding primary Hyperscale databases. If the primary database is zone-redundant, the secondary is also zone-redundant. If the primary is not zone-redundant, neither is the secondary.
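The difference between the Hyperscale and non-Hyperscale cases can be summarized as a simple rule. The helper below is a hypothetical illustration of the behavior described in the two sections above, useful for example in a checklist script that lists which secondaries still need high availability enabled manually.

```python
def secondary_zone_redundant_after_creation(primary_zone_redundant, hyperscale):
    """Zone redundancy of a secondary right after failover group creation.

    Hyperscale secondaries inherit the primary's setting. Non-Hyperscale
    secondaries start without it and must be updated manually afterwards.
    """
    return primary_zone_redundant if hyperscale else False


def needs_manual_enablement(primary_zone_redundant, hyperscale):
    """True when the secondary lacks zone redundancy that the primary has."""
    created = secondary_zone_redundant_after_creation(primary_zone_redundant, hyperscale)
    return primary_zone_redundant and not created
```

Under these rules, only non-Hyperscale secondaries of zone-redundant primaries are reported as needing manual follow-up.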

Regional Support for Availability Zones

In scenarios where high availability is enabled on the primary database, and the secondary database is being added in a region that does not yet support availability zones, the operation will fail with error code 45122: “Create or update Failover Group operation successfully completed; however, some of the databases could not be added to or removed from Failover Group. Provisioning of zone redundant database/pool is not supported for your current request.” A workaround for DB Auto Group is to utilize Active geo-replication to create the secondary database, enabling or disabling high availability during secondary creation, and then optionally adding these databases to a failover group.