Failover Plugin v2

The AWS Advanced Python Wrapper uses the Failover Plugin v2 to provide minimal downtime in the event of a DB instance failure. The plugin is the next version (v2) of the Failover Plugin and unless explicitly stated otherwise, most of the information and suggestions for the Failover Plugin are applicable to the Failover Plugin v2.

Differences between the Failover Plugin and the Failover Plugin v2

The Failover Plugin performs a failover process for each DB connection. Each failover process is triggered independently and is unrelated to failover processes in other connections. While such independence between failover processes has some benefits, it also leads to additional resources like extra threads. If dozens of DB connections are failing over at the same time, it may cause significant load on a client environment.

Picture 1. Each connection triggers its own failover process to detect a new writer.

If a connection needs to get the latest topology, it calls RdsHostListProvider. It should be noted that RdsHostListProvider runs in the same thread as a connection failover process. As shown in Picture 1 above, different connections start and end their failover processes independently.

The Failover Plugin v2 uses an optimized approach where the process of detecting and confirming a cluster topology is delegated to a central topology monitoring component that runs in a separate thread. When the topology is confirmed and a new writer is detected, each waiting connection can resume and reconnect to a required host. This design helps minimize resources required for failover processing and scales better compared to the Failover Plugin.

Picture 2. Connections call MonitoringRdsHostListProvider, which is responsible for detecting the new writer. While waiting for MonitoringRdsHostListProvider, connection threads suspend.

If two connections encounter communication issues with their internal (physical) DB connections, each connection may send a request to the topology monitoring component (MonitoringRdsHostListProvider in Picture 2) for updated topology information reflecting the new writer. Both connections are notified as soon as the latest topology is available. Connection threads can resume, continue with their suspended workflows, and reconnect to a reader or a writer host as needed.

The topology monitoring component mentioned above (MonitoringRdsHostListProvider) updates topology periodically. Usually it uses a connection to a writer host to fetch a cluster topology. Using a connection to a writer host gathers topology first hand without a risk of getting stale data as is the case of fetching topology from a reader. In some exceptional cases the monitoring component may (temporarily) use a reader connection to fetch topology however it will switch back to a writer host as soon as possible.

Picture 3. MonitoringRdsHostListProvider detects a new writer by establishing connections to hosts in separate threads.

When the cluster topology needs to be confirmed, the monitoring component opens new threads, one for each host (see Picture 3). Each of these threads tries to connect to a host and checks if the host is a writer. When Aurora failover occurs, the new writer host is the first host to reflect the true topology of the cluster. Other hosts connect to the new writer shortly after and update their local copies of the topology. Topology information acquired from a reader host may be outdated/inaccurate for a short period after failover. You can see a typical example of stale topology in the diagram above: thread instance-3, box Topology, to the right. The stale topology incorrectly shows that instance-3 is still a writer.

The threads monitoring the topology stop when a new writer is detected. For 30 seconds after a new writer is detected (and after all waiting connections have been notified), topology continues to be updated at an increased rate. This allows time for all readers to appear in the topology, since 30 seconds is usually enough time for cluster failover to complete and cluster topology to stabilize.

All improvements mentioned above help the Failover Plugin v2 to operate with improved performance and less demand for resources.

A summary of the key differences and between the failover and failover_v2 plugins is outlined below. With the failover plugin:

Each connection performs its own failover process.
Each connection fetches topology by calling the RdsHostListProvider in the same thread.
Topology may be fetched from a reader host and it may be stale.

With the failover_v2 plugin:

Each connection delegates detection of the new writer to the MonitoringRdsHostListProvider (which runs in its own thread) and suspends until the new writer is confirmed.
The MonitoringRdsHostListProvider tries to connect to every cluster host in parallel.
The MonitoringRdsHostListProvider uses an "Am I a writer?" approach to avoid reliance on stale topology.
The MonitoringRdsHostListProvider continues topology monitoring at an increased rate to ensure all cluster hosts appear in the topology.

Using the Failover Plugin v2

The Failover Plugin, not the Failover Plugin v2, will be enabled by default if the wrapperPlugins value is not specified. If you would like to override the default plugins, you can explicitly include the failover plugin v2 in your list of plugins by adding the plugin code failover_v2 to the wrapperPlugins value, or by adding it to the current driver profile. After you load the plugin, the failover v2 feature will be enabled.

Please refer to the failover configuration guide for tips to keep in mind when using the failover plugins.

Warning

Do not use the failover and failover_v2 plugins at the same time for the same connection!