
Introduction

For Asterisk SCF to support transparent failover from an active component to a standby component, all Asterisk SCF components must allow incoming remote procedure calls to be retried. The Ice run-time is able to automatically retry operations that have been declared idempotent. By marking an operation idempotent, we're telling the Ice run-time that it can abandon at-most-once semantics for that operation. In other words, the system will be in a correct state whether the operation was invoked once or multiple times.

For many operations, such as those that allocate resources, at-most-once semantics would typically be expected (and thus the idempotent keyword would be omitted). But in order to allow Ice to automatically perform retries for us, we have chosen to mark operations that would normally require at-most-once semantics as idempotent. This puts an extra burden on the servant implementation for these operations, and on the API itself. The servant must be able to distinguish a retry of a previous call to an operation from a new call. This page describes the approach in Asterisk SCF to handle retry scenarios, and what developers should keep in mind in order to create "well behaved" Asterisk SCF components.

Impact on API

Ice will attempt to retry an operation that fails only if the operation is marked "idempotent". (See Ice manual discussion of automatic retries.) The Ice documentation states that you can mark an operation as idempotent if it doesn't modify the state of the servant being called. (Idempotent implies that calling the operation more than once is safe, as the system will be in the same state whether the operation is invoked once or more than once.) So operations that are simply accessing state are (typically) safe to mark idempotent.

In the Asterisk SCF API, we mark essentially all operations, including those that would normally be considered as requiring at-most-once semantics, as idempotent so that components can benefit from the automatic retry provided by the Ice run-time. Our approach to failover depends on the ability to retry any remote procedure call. However, since many operations in the API aren't naturally idempotent, the servants must be designed to detect duplicate calls and properly process redundant calls. To enable the servant to do so, we have added an OperationContext that can be passed to operations that require at-most-once semantics, so that the servant can distinguish retries from unique calls.

As seen in OperationsIf.ice:
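The excerpt itself is not reproduced here; the following is only a hedged sketch of the kind of definition OperationsIf.ice provides. The two fields reflect what the surrounding text describes (a unique per-call id, plus a transaction id that later factory helpers copy); the exact names and nesting are assumptions, not the actual Slice.

```slice
module AsteriskSCF
{
module System
{
module V1
{
    /**
     * Passed as an argument to operations that require at-most-once
     * semantics, so that a servant can detect retries.
     */
    class OperationContext
    {
        string id;            // Unique per call (a UUID); identical on a retry.
        string transactionId; // Shared by every call triggered by one originating event.
    };
};
};
};
```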

Adding an OperationContext to an operation's arguments ensures that the servant processing the request can identify redundant calls. Below, for example, is the SessionListener interface from SessionCommunicationsIf.ice. Consider the act of flashing a switch-hook on a phone causing the indicated() operation to be invoked on a listener. Without the OperationContext argument, the listener would have to assume that being called more than once was indicative of the switch-hook being flashed more than once. With the OperationContext providing a unique id, the listener can distinguish between retries and multiple occurrences.
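As a hedged sketch of the listener operation discussed above (the parameter list of indicated() is an assumption for illustration, not the actual Slice from SessionCommunicationsIf.ice):

```slice
interface SessionListener
{
    // With operationContext carrying a unique id, the listener can tell a
    // retry of this call apart from a second, genuine switch-hook flash.
    idempotent void indicated(OperationContext operationContext,
                              Session* source, Indication event);
};
```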

Implied Contract on the Servant

There is an implied contract for all servants that expose idempotent operations in their interface. The contract is that the servant can handle (and behave correctly) if it receives duplicate calls to an idempotent operation. When an OperationContext argument is provided, the servant must take whatever actions are necessary when a duplicate call is detected to "appear idempotent". It should avoid semantically incorrect state changes (such as duplicating the allocation of objects), and all return values (including exceptions) should be supplied to the retry call.

Impact on Servant Implementation

Supporting idempotency via the OperationContext implies a need to cache all received OperationContexts so that redundant calls can be detected. Asterisk SCF provides a utility class to do this, with a built-in ability to purge "old" contexts after a configurable period of time. As seen in OperationContextCache.h:
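The header itself is not reproduced here; below is a compilable approximation of the behavior the text describes. Only addOperationContext() and the time-based purge come from the text; every other name and detail is an illustrative assumption, not the actual OperationContextCache.h.

```cpp
#include <chrono>
#include <iterator>
#include <map>
#include <mutex>
#include <string>

// Sketch of a context cache with a configurable time-to-live for entries.
class OperationContextCache
{
public:
    explicit OperationContextCache(std::chrono::seconds ttl) : mTtl(ttl) {}

    // Returns true if the context id was not already cached (a new call),
    // false if it was already present (i.e., this call is a retry).
    bool addOperationContext(const std::string& contextId)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        purgeExpired();
        auto now = std::chrono::steady_clock::now();
        return mEntries.insert({contextId, now}).second;
    }

private:
    // Drop contexts older than the configured time-to-live.
    void purgeExpired()
    {
        auto now = std::chrono::steady_clock::now();
        for (auto it = mEntries.begin(); it != mEntries.end();)
        {
            it = (now - it->second > mTtl) ? mEntries.erase(it) : std::next(it);
        }
    }

    std::chrono::seconds mTtl;
    std::mutex mMutex;
    std::map<std::string, std::chrono::steady_clock::time_point> mEntries;
};
```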

So a servant implementation that uses OperationContextCache would typically call addOperationContext() when an operation is about to be processed. If this call returns false, the servant should assume the call is a retry of an operation it has already processed, whose results the caller did not receive in time to avoid a retry. The burden is therefore on the servant to return the same results, without performing any semantically incorrect state changes.

To support returning previously computed values to retry calls, an alternate addOperationContext operation exists for the OperationContextCache. This alternate version allows the caller to attach a cookie to an OperationContext in the cache. The cookie will be automatically destroyed whenever the OperationContext itself is dropped from the cache.

This cookie can hold any information the servant needs to provide responses to future retry calls. It simply needs to be subclassed from the OperationContextCookie type.
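A minimal sketch of the cookie mechanism just described. Only the OperationContextCookie base type and the cookie-attaching addOperationContext variant come from the text; CreateBridgeCookie, getCookie(), and all other names are illustrative assumptions.

```cpp
#include <map>
#include <memory>
#include <string>

// Base type for servant-defined data attached to a cached context.
struct OperationContextCookie
{
    virtual ~OperationContextCookie() = default;
};

// Hypothetical cookie remembering the result of an allocating operation,
// so that a retry can be answered with the same value.
struct CreateBridgeCookie : OperationContextCookie
{
    explicit CreateBridgeCookie(std::string proxy) : bridgeProxy(std::move(proxy)) {}
    std::string bridgeProxy;
};

class CookieCache
{
public:
    // Variant that attaches a cookie to the cached context. The cookie is
    // destroyed automatically when the context is dropped from the cache,
    // because the shared_ptr is released along with the map entry.
    bool addOperationContext(const std::string& contextId,
                             std::shared_ptr<OperationContextCookie> cookie)
    {
        return mEntries.insert({contextId, std::move(cookie)}).second;
    }

    std::shared_ptr<OperationContextCookie> getCookie(const std::string& contextId) const
    {
        auto it = mEntries.find(contextId);
        return it == mEntries.end() ? nullptr : it->second;
    }

private:
    std::map<std::string, std::shared_ptr<OperationContextCookie>> mEntries;
};
```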

Network latency and timeouts

It's possible for a retry to occur outside of a failover scenario simply due to network loading, changes in routing, or other such factors. (In fact, the "retry" instance of an operation could arrive at the servant before the original.) It would be challenging for the servant receiving a retry to know the cause of the retry. It simply has to be prepared to process retries correctly regardless of the cause.

The challenging aspect of this is that many operations will be processed asynchronously, and the processing of a result (for a return value, or "out" parameter) may not be complete when a second instance of the operation is received. So if the Bridging component were in the process of creating a bridge, for example, and a retry arrives before the bridge is fully constructed, the Bridging component may not yet have a Bridge proxy to simply hand back to the retry instance of the call. In other components, where a WorkQueue is providing a natural serialization of the processing of an object, this type of race condition wouldn't exist. These types of issues must be worked through on a case-by-case basis.

Reusing an Operation Context

Consider the diagram below, where some external event causes a series of Asterisk SCF API operations (all mutators) to be invoked. To call operation1, the client creates a new OperationContext record to ensure that retries can be made safely.

Let's think about what would happen if the same OperationContext were also used when calling notifyListener and operation2. When component 2 received operation1, it would cache the OperationContext so that it could detect any retries. When operation2 is received (assuming it is being processed by the same logical object that processed operation1), it would be detected as being in the OperationContextCache, and the component would attempt to behave as if this were a retry. This is because the OperationContext is cached based on its UUID. While Component2 could conceivably create a cache for every operation, this seems burdensome in the extreme. A cache of operations for a logical entity is more practical. So this leads us to the decision that we should never send the same OperationContext to the same logical object, because we would get false cache hits.

Now consider the call to notifyListener. Component 3 would add this OperationContext to some cache. When the call to operation4 is received, if the same OperationContext was passed to operation2, and Component 2 then forwarded that context on to Component 3, we have a problem similar to the one above. What this shows is that, in a distributed system, you can't really tell where a passed parameter could end up; much of that is actually decided by deployment configuration. So we should simply never reuse an OperationContext in any of these ways. Each call should have a new OperationContext created unless the call is intended as an explicit retry.

To make it easy to create an OperationContext, we have a couple of helper functions defined in a utility library.
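The helpers themselves are not reproduced here; the following sketch implements the behavior the next paragraph describes: one factory that starts a brand-new context and transaction, and an alternate factory that copies the transaction id from an existing context. The counter-based id generator merely stands in for a real UUID generator, and all names are assumptions.

```cpp
#include <atomic>
#include <string>

// Stand-in for the Slice-defined OperationContext.
struct OperationContext
{
    std::string id;            // Unique per call.
    std::string transactionId; // Shared across a chain of related calls.
};

// Stand-in for a real UUID generator.
static std::string newUniqueId()
{
    static std::atomic<unsigned long> counter{0};
    return "uuid-" + std::to_string(++counter);
}

// Start a brand-new chain: fresh call id and fresh transaction id.
OperationContext createContext()
{
    OperationContext ctx;
    ctx.id = newUniqueId();
    ctx.transactionId = newUniqueId();
    return ctx;
}

// Continue an existing chain: fresh call id, but the transaction id is
// copied so the whole sequence can be traced back to the original event.
OperationContext createContext(const OperationContext& related)
{
    OperationContext ctx;
    ctx.id = newUniqueId();
    ctx.transactionId = related.transactionId;
    return ctx;
}
```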

So based on the sequence diagram above, we call the first createContext() factory to create a brand new context with a new transaction id. Since we're processing an externally generated event, we don't have an OperationContext to pass to the alternate factory. We later call the alternate factory (which copies the transaction id from the passed in OperationContext), which provides a means of tracing a series of calls back to some original event. All of the remaining calls in the sequence, even those in other components, use the alternate factory to create a context for each operation that has the same transaction id.

Replicating the OperationContext

The component developer must also consider when to replicate an OperationContext. When an actual failover does occur, the retry process is what initially gets the (previously) standby component involved in an ongoing operation sequence. For example, consider the following call sequence (parameters omitted for simplicity). If the Routing Service fails over to a standby instance before routeSessionOperation returns, the SIP Session Gateway will eventually retry the call. It's important that any calls out from the newly activated Routing Service use the same OperationContext that the previously active component used, in case the downstream Bridge Service had already been called. Otherwise, if we didn't pass createBridge the same OperationContext that the failed component had used, an additional Bridge would be created and the Bridge Service would leak the over-allocated resources.

So the Ice run-time in the SIP Session Gateway, through automatic retries, essentially gets the standby Routing Service involved in an ongoing chained operation. The standby Routing Service component, when receiving an OperationContext that was previously sent to an active component, must obey the contract for idempotency even across failover, by sending the identical OperationContext that it had previously sent to the Bridge Service. Assuming the Bridge Service is behaving idempotently, it will return the same Bridge that it previously returned to the active Routing Service, rather than creating a new one.

Exceptions to requiring OperationContext
  1. Intrinsically idempotent operations don't have an OperationContext parameter.
  2. Media write operations are currently not marked as idempotent. Media can typically handle some dropped frames, and retrying media write operations can adversely affect timing of frames. (Downstream buffers would likely need notification.)
Strategies for different types of operations

When developing a servant within a component, it's useful for the developer to classify each operation to know how best to handle it.

Operation Classification: Intrinsically Idempotent
Examples: Accessors ("getters") are naturally idempotent and require no special handling, so we don't pass an OperationContext to them. Some queries (like a "lookup" operation) fall into this category, even though they aren't named in the customary "getXyz()" form.
Example signature: idempotent SessionCookies getCookies(SessionCookies cookieTypes);

Operation Classification: No-Return-Value
Examples: 1.) A command may direct the servant to perform a task, with no return value or "out" parameter. 2.) Similarly, a notification simply provides information.
Example signature: idempotent void updateConnectedLine(OperationContext operationContext, ConnectedLine connectedLine);

Operation Classification: Return-Value (and/or "out" parameters)
Examples: 1.) An allocating operation may return a proxy to a newly-created object. 2.) A command may return an informational record that would be of use to the caller.
Example signature: idempotent SessionInfo setBridge(OperationContext operationContext, Bridge* newBridge, SessionListener* listener);

For the No-Return-Value operation, the OperationContextCache can be used. The servant would determine when an operation is a retry using this mechanism, and simply ignore duplicate calls.

For the Return-Value operation, significant care must be taken to ensure that retries are handled correctly. Some type of mapping from an OperationContext to a return value (or values, if "out" parameters are being used) may be required. If asynchronous processing is being used, the servant must be able to detect when an incoming call is a duplicate of an operation that hasn't yet completed, and be able to wait for the results.

For both the No-Return-Value and the Return-Value case, we need to consider what happens if the original call resulted in an exception being thrown. We'd want the retry to throw the same exception. It may be that the simplest approach to take here is to remove the OperationContext from the OperationContextCache if an exception is generated, on the assumption that if it failed, it will fail the same way again. But this means that every exception must be caught for these operations (and then rethrown) so that the OperationContextCache can be modified for the cases when we previously would have allowed exceptions to propagate out. To support removing operations from the cache when an exception is caught, the OperationContextCache provides a remove method.
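The Return-Value and exception handling described above can be sketched as follows. BridgeServant and all of its members are illustrative names, not actual Asterisk SCF code; the 'fail' flag is a hypothetical stand-in for a real allocation failure, and the asynchronous in-flight-duplicate case is deliberately glossed over.

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Sketch of a Return-Value operation that replays cached results on retries
// and forgets the context when the original call throws.
class BridgeServant
{
public:
    std::string createBridge(const std::string& contextId, bool fail)
    {
        if (!mContexts.insert(contextId).second)
        {
            // Duplicate context: replay the previously computed result.
            // (A real servant must also handle a duplicate that arrives
            // while the original is still being processed asynchronously.)
            return mResults.at(contextId);
        }
        try
        {
            std::string proxy = allocateBridge(fail);
            mResults[contextId] = proxy;
            return proxy;
        }
        catch (...)
        {
            // Remove the context so a retry re-executes the operation, on
            // the assumption it will fail the same way; then rethrow.
            mContexts.erase(contextId);
            throw;
        }
    }

private:
    std::string allocateBridge(bool fail)
    {
        if (fail)
        {
            throw std::runtime_error("allocation failed");
        }
        return "bridge-" + std::to_string(++mAllocations);
    }

    int mAllocations = 0;
    std::set<std::string> mContexts;
    std::map<std::string, std::string> mResults;
};
```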

Client-side impact

Aside from generating a new OperationContext for every API call, no other client changes are required.


1 Comment

  1. Something that occurred to me when investigating alternative strategies is that there is an API approach that might be applicable in some cases. Since it already appears in this discussion, let's look at createBridge for a moment. Our current implementation treats the creation of the bridge object, the addition of the provided sessions, etc. as one complete, all-or-nothing operation. Because there are so many operations involved in this form of bridge creation, including calls on the sessions (setBridge()), media sessions, etc., it can be difficult to keep things consistent on failover of the bridge component or the component calling createBridge. It might be necessary to replicate state indicating intermediate steps, which can make implementing these types of methods, and the relevant replication, pretty complex.

     If it were possible to treat createBridge() as a function that creates a bridge and returns the proxy, as well as going off and doing all this other stuff that we don't particularly want results about right now, things get a bit more straightforward. The step of creating the servant, replicating the id, activating the servant, and returning the proxy is relatively "short" and simple. The creation could then queue some work to finish adding the sessions, etc., and simply return the proxy while that occurs in the background. In a sense the API becomes more "service oriented" and less like remote procedure calls.

     It can also make the handling of exceptions a bit more sane. Consider what happens with exceptions on calls to Session::setBridge(). Does the caller of createBridge() really know how to deal with those? Do we simply ignore them while continuing on with the other session objects? If so, then why bother waiting around for those to complete before returning the proxy to the bridge to the caller of createBridge()? It certainly is not appropriate everywhere (maybe not even anywhere), but it is something to consider.