For Asterisk SCF to support transparent failover from an active component to a standby component, all Asterisk SCF components must allow incoming remote procedure calls to be retried. The Ice run-time is able to automatically retry operations that have been declared idempotent. By marking an operation idempotent, we're telling the Ice run-time that it can abandon at-most-once semantics for that operation. In other words, the system will be in a correct state whether the operation was invoked once or multiple times.
For many operations, such as those that allocate resources, at-most-once semantics would typically be expected (and thus the idempotent keyword would be omitted). But in order to allow Ice to perform retries for us automatically, we have chosen to mark even operations that would normally require at-most-once semantics as idempotent. This puts an extra burden on the servant implementation for these operations, and on the API itself: the servant must be able to distinguish a retry of a previous call from a new call. This page describes the approach Asterisk SCF takes to handling retry scenarios, and what developers should keep in mind in order to create "well behaved" Asterisk SCF components.
Impact on API
Ice will attempt to retry an operation that fails only if the operation is marked "idempotent". (See Ice manual discussion of automatic retries.) The Ice documentation states that you can mark an operation as idempotent if it doesn't modify the state of the servant being called. (Idempotent implies that calling the operation more than once is safe, as the system will be in the same state whether the operation is invoked once or more than once.) So operations that are simply accessing state are (typically) safe to mark idempotent.
In the Asterisk SCF API, we mark essentially all operations, including those that would normally be considered as requiring at-most-once semantics, as idempotent so that components can benefit from the automatic retry provided by the Ice run-time. Our approach to failover depends on the ability to retry any remote procedure call. However, since many operations in the API aren't naturally idempotent, servants must be designed to detect duplicate calls and process redundant calls properly. To enable the servant to do so, we have added an OperationContext that can be passed to operations that require at-most-once semantics, so that the servant can distinguish retries from unique calls.
As seen in OperationsIf.ice, adding an OperationContext to an operation's arguments ensures that the servant processing the request can identify redundant calls.
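The Slice excerpt itself is not reproduced in this page. Conceptually, the context is a small record carrying unique identifiers; a C++ analogue might look like the following sketch (the field names are illustrative, not the actual Asterisk SCF definitions):

```cpp
#include <string>

// Illustrative analogue of the Slice-defined OperationContext.
// Field names are assumptions, not the actual Asterisk SCF definitions.
struct OperationContext {
    std::string id;            // unique per call; a retry reuses the same id
    std::string transactionId; // shared by all calls triggered by one external event
};
```

The unique per-call id is what lets a servant recognize a retry; the transaction id ties a chain of related calls back to the event that started them.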
Consider, for example, the SessionListener interface from SessionCommunicationsIf.ice: flashing a switch-hook on a phone causes the indicated() operation to be invoked on a listener. Without the OperationContext argument, the listener would have to assume that being called more than once meant the switch-hook was flashed more than once. With the OperationContext providing a unique id, the listener can distinguish between retries and multiple occurrences.
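To illustrate why the unique id matters, here is a toy stand-in for such a listener (this is not the real Slice-generated interface; the names and signature are simplified assumptions):

```cpp
#include <set>
#include <string>

// Toy listener sketch: indicated() is invoked when a switch-hook is flashed.
// A context id it has already seen marks a retry; a fresh id marks a
// genuinely new flash.
class SessionListener {
public:
    // Returns true if the call was treated as a new event.
    bool indicated(const std::string& operationContextId) {
        if (!mSeenContexts.insert(operationContextId).second) {
            return false; // duplicate id: a retry of an already-handled flash
        }
        ++mFlashCount; // fresh id: a real, distinct switch-hook flash
        return true;
    }
    int flashCount() const { return mFlashCount; }

private:
    std::set<std::string> mSeenContexts;
    int mFlashCount = 0;
};
```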
Impact on Servant Implementation
Supporting idempotency using the OperationContext implies a need to cache all received OperationContexts in order to be able to detect redundant calls. Asterisk SCF provides a utility class to do this, with a built-in ability to purge "old" contexts after a configurable period of time. As seen in OperationContextCache.h:
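The header is not reproduced in this page; the following is a simplified sketch of the idea, not the actual Asterisk SCF class. The addOperationContext and remove names come from the text, while the constructor, key type, and purge mechanics here are assumptions:

```cpp
#include <chrono>
#include <string>
#include <unordered_map>

// Simplified sketch of an operation-context cache: remembers the ids of
// contexts it has seen, and purges entries older than a configurable age.
class OperationContextCache {
public:
    explicit OperationContextCache(std::chrono::seconds maxAge) : mMaxAge(maxAge) {}

    // Returns true if the context is new (process the call), false if it was
    // already seen (the call is a retry).
    bool addOperationContext(const std::string& contextId) {
        purgeOld();
        return mSeen.emplace(contextId, std::chrono::steady_clock::now()).second;
    }

    // Drop a context explicitly, e.g. after the operation threw an exception.
    void remove(const std::string& contextId) { mSeen.erase(contextId); }

private:
    void purgeOld() {
        auto now = std::chrono::steady_clock::now();
        for (auto it = mSeen.begin(); it != mSeen.end();) {
            if (now - it->second > mMaxAge) it = mSeen.erase(it);
            else ++it;
        }
    }

    std::chrono::seconds mMaxAge;
    std::unordered_map<std::string, std::chrono::steady_clock::time_point> mSeen;
};
```

The first add of a given id succeeds; a second add of the same id reports a retry.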
So a servant implementation that uses OperationContextCache would typically call addOperationContext() when an operation is about to be processed. If this call returns false, the servant should assume the call is a retry of an operation it has already processed, whose results weren't received by the caller in time to avoid a retry. The burden is thus on the servant to return the same results, without performing any semantically incorrect state changes.
To support returning previously computed values to retried calls, an alternate addOperationContext operation exists on the OperationContextCache. This alternate version allows the caller to attach a cookie to an OperationContext in the cache; the cookie is automatically destroyed whenever the OperationContext itself is dropped from the cache. The cookie can hold any information the servant needs to respond to future retries; it simply needs to be subclassed from the cookie base class provided for this purpose.
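As an illustration of how a cookie might be used to replay a previously computed result on a retry (the base class name, the setCookie/getCookie calls, and all other names here are assumptions, not the actual Asterisk SCF API):

```cpp
#include <map>
#include <memory>
#include <string>

// Assumed cookie base class; the real Asterisk SCF type may differ.
struct OperationContextCookie {
    virtual ~OperationContextCookie() = default;
};
using CookiePtr = std::shared_ptr<OperationContextCookie>;

// Minimal cache sketch supporting the cookie mechanism.
class OperationContextCache {
public:
    // Returns true if the context is new, false if this is a retry.
    bool addOperationContext(const std::string& contextId) {
        return mEntries.emplace(contextId, nullptr).second;
    }

    // Attach a computed result to a cached context. The cookie is destroyed
    // automatically when the cache entry itself is dropped.
    void setCookie(const std::string& contextId, CookiePtr cookie) {
        auto it = mEntries.find(contextId);
        if (it != mEntries.end()) it->second = std::move(cookie);
    }

    CookiePtr getCookie(const std::string& contextId) const {
        auto it = mEntries.find(contextId);
        return it == mEntries.end() ? nullptr : it->second;
    }

private:
    std::map<std::string, CookiePtr> mEntries;
};

// Example cookie: remembers the result of a createBridge-style call so a
// retry can be answered with the same value. (std::string stands in for the
// real proxy type.)
struct BridgeResultCookie : OperationContextCookie {
    explicit BridgeResultCookie(std::string proxy) : bridgeProxy(std::move(proxy)) {}
    std::string bridgeProxy;
};
```

On a retry (addOperationContext returns false), the servant retrieves the cookie and returns the stored result instead of re-executing the operation.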
Network latency and timeouts
It's possible for a retry to occur outside of a failover scenario simply due to network loading, changes in routing, or other such factors. (In fact, the "retry" instance of an operation could arrive at the servant before the original.) It would be challenging for the component receiving a retry to know the cause of the retry; it simply has to be prepared to process retries correctly regardless of the cause.
The challenging aspect of this is that many operations will be processed asynchronously, and the processing of a result (for a return value, or "out" parameter) may not be complete when a second instance of the operation is received. So if the Bridging component were in the process of creating a bridge, for example, and a retry arrives before the bridge is fully constructed, the Bridging component may not yet have a Bridge proxy to simply hand back to the retry instance of the call. In other components, where a WorkQueue is providing a natural serialization of the processing of an object, this type of race condition wouldn't exist. These types of issues must be worked through on a case-by-case basis.
Reusing an Operation Context
Consider the diagram below, where some external event causes a series of Asterisk SCF API operations (all mutators) to be invoked. To call operation1, the client creates a new OperationContext record to ensure that retries can be made.
Let's think about what would happen if the same OperationContext were also used when calling operation2. When Component 2 received operation1, it would cache the OperationContext so that it could detect any retries. When operation2 is received (assuming it is being processed by the same logical object that processed operation1), it would be detected as being in the OperationContextCache, and the component would attempt to behave as if this were a retry. This is because the OperationContext is cached based on its UUID. While Component 2 could conceivably create a cache for every operation, this seems burdensome in the extreme; a cache of operations for a logical entity is more practical. So this leads us to the decision that we should never send the same OperationContext to the same logical object, because we would get false cache hits.
Now consider the call to notifyListener. Component 3 would add this OperationContext to some cache. When the call to operation4 is received, if the same OperationContext had been passed to operation2 and Component 2 had then forwarded that context on to Component 3, we would have a similar problem as above. What this shows is that, in a distributed system, you can't really tell where a passed parameter could end up; much of that is actually decided by deployment configuration. So we should simply never reuse an OperationContext in any of these ways: each call should have a new OperationContext created unless the call is intended as an explicit retry.
To make it easy to create an OperationContext, we have a couple of helper functions defined in a utility library.
So based on the sequence diagram above, we call the first createContext() factory to create a brand new context with a new transaction id. Since we're processing an externally generated event, we don't have an OperationContext to pass to the alternate factory. We later call the alternate factory (which copies the transaction id from the passed in OperationContext), which provides a means of tracing a series of calls back to some original event. All of the remaining calls in the sequence, even those in other components, use the alternate factory to create a context for each operation that has the same transaction id.
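The helper functions themselves aren't shown in this page; the following sketch captures the two factories just described (UUID generation is simplified here, and the names are assumptions rather than the actual utility-library signatures):

```cpp
#include <atomic>
#include <string>

// Illustrative context record; field names assumed.
struct OperationContext {
    std::string id;            // unique per call
    std::string transactionId; // shared across a chain of related calls
};

// Stand-in for a real UUID generator.
static std::string newUuid() {
    static std::atomic<unsigned long> counter{0};
    return "uuid-" + std::to_string(++counter);
}

// Brand-new context: fresh id and fresh transaction id. Used when reacting
// to an externally generated event.
OperationContext createContext() {
    return {newUuid(), newUuid()};
}

// Follow-on context: fresh id, but the transaction id is copied from an
// existing context so the whole call chain can be traced to one event.
OperationContext createContext(const OperationContext& prior) {
    return {newUuid(), prior.transactionId};
}
```

Every call gets a distinct id (so no false cache hits), while the shared transaction id preserves traceability back to the originating event.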
The component developer must also consider when to replicate an OperationContext. When actual failover does occur, the retry process is what initially gets the (previously) standby component involved in an ongoing operation sequence. For example, consider the following call sequence (parameters omitted for simplicity). If the Routing Service fails over to a standby instance before routeSessionOperation returns, the SIP Session Gateway will eventually retry the call. It's important that any calls out from the newly activated Routing Service use the same OperationContext that the previously active component used, in case the downstream Bridge Service had already been called. Otherwise, the Bridge Service would leak resources through over-allocation: if we didn't pass createBridge the same OperationContext that the failed component had used, an additional Bridge would be created.
So the Ice run-time in the SIP Session Gateway, through automatic retries, essentially gets the standby Routing Service involved in an ongoing chained operation. When the standby Routing Service receives an OperationContext that was previously sent to the active component, it must obey the contract for idempotency even across failover, by sending the Bridge Service the identical OperationContext that the previously active instance had sent. Assuming the Bridge Service is behaving idempotently, it will return the same Bridge that it previously returned to the active Routing Service, rather than creating a new one.
Exceptions to requiring an OperationContext
- Intrinsically idempotent operations don't have an OperationContext parameter.
- Media write operations are currently not marked as idempotent. Media can typically handle some dropped frames, and retrying media write operations can adversely affect timing of frames. (Downstream buffers would likely need notification.)
Strategies for different types of operations
When developing a servant within a component, it's useful for the developer to classify each operation to know how best to handle it.
Accessors ("getters") are naturally idempotent and require no special handling, so we don't pass an OperationContext to them.
No-Return-Value
1.) A command may direct the servant to perform a task, with no return value or "out" parameter.
2.) Similarly, a notification simply provides information.
Return-Value (and/or "out" parameters)
1.) An allocating operation may return a proxy to a newly-created object.
2.) A command may return an informational record that would be of use to the caller.
For the No-Return-Value operation, the OperationContextCache can be used. The servant determines whether an operation is a retry using this mechanism, and simply ignores duplicate calls.
For the Return-Value operation, significant care must be taken to ensure that retries are handled correctly. Some type of mapping from an OperationContext to a return value (or values, if "out" parameters are being used) may be required. If asynchronous processing is being used, the servant will need to detect when a call is a duplicate of an operation that hasn't yet completed, and be able to wait for the results.
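One way to handle that race is to key pending results by context id, so a duplicate that arrives mid-computation waits for the original's result. A sketch of this idea using standard futures (the real components use Ice's asynchronous dispatch and WorkQueues; all names here are illustrative, and std::string stands in for the real result type):

```cpp
#include <future>
#include <map>
#include <mutex>
#include <string>

// Maps each OperationContext id to a shared_future for its result, so a
// retry arriving before the original finishes waits on the same future
// instead of re-running the work.
class ResultTracker {
public:
    // Returns the future for this context id. The first caller for a given
    // id sees isNew == true and must eventually fulfil the promise.
    std::shared_future<std::string> track(const std::string& id, bool& isNew) {
        std::lock_guard<std::mutex> lock(mMutex);
        auto it = mPending.find(id);
        if (it != mPending.end()) {
            isNew = false;              // duplicate: share the pending result
            return it->second.second;
        }
        std::promise<std::string> p;
        std::shared_future<std::string> f = p.get_future().share();
        mPending.emplace(id, std::make_pair(std::move(p), f));
        isNew = true;
        return f;
    }

    // Called when the (asynchronous) work completes; wakes all waiters.
    void complete(const std::string& id, const std::string& result) {
        std::lock_guard<std::mutex> lock(mMutex);
        auto it = mPending.find(id);
        if (it != mPending.end()) it->second.first.set_value(result);
    }

private:
    std::mutex mMutex;
    std::map<std::string,
             std::pair<std::promise<std::string>,
                       std::shared_future<std::string>>> mPending;
};
```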
For both the No-Return-Value and the Return-Value cases, we need to consider what happens if the original call resulted in an exception being thrown: we'd want the retry to throw the same exception. The simplest approach may be to remove the OperationContext from the OperationContextCache when an exception is generated, on the assumption that if the call failed once, it will fail the same way again. But this means that every exception must be caught (and then rethrown) for these operations, so that the OperationContextCache can be modified in cases where we previously would have allowed exceptions to propagate out. To support removing operations from the cache when an exception is caught, the OperationContextCache provides a remove method.
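The catch, remove, and rethrow pattern just described might look like the following sketch (addOperationContext and remove per the text; everything else, including the trivial cache stand-in, is illustrative):

```cpp
#include <stdexcept>
#include <string>
#include <unordered_set>

// Minimal cache stand-in exposing just the two calls used below.
class OperationContextCache {
public:
    bool addOperationContext(const std::string& id) { return mSeen.insert(id).second; }
    void remove(const std::string& id) { mSeen.erase(id); }

private:
    std::unordered_set<std::string> mSeen;
};

// Servant operation sketch: if processing throws, drop the context from the
// cache so a retry is processed from scratch (and presumably fails the same
// way, surfacing the same exception to the caller).
void someOperation(OperationContextCache& cache, const std::string& contextId,
                   void (*doWork)()) {
    if (!cache.addOperationContext(contextId)) {
        return; // retry of an already-processed call: ignore (No-Return-Value case)
    }
    try {
        doWork();
    } catch (...) {
        cache.remove(contextId); // forget the context so a retry starts fresh
        throw;                   // propagate the original exception
    }
}
```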
Aside from generating a new OperationContext for every API call, no other client changes are required.