Introduction

For Asterisk SCF to support transparent failover from an active component to a standby component, all Asterisk SCF components must allow incoming remote procedure calls to be retried. The Ice run-time is able to automatically retry operations that have been declared idempotent. By marking an operation idempotent, we're telling the Ice run-time that it can abandon at-most-once semantics for that operation. In other words, the system will be in a correct state whether the operation was invoked once or multiple times.

For many operations, such as those that allocate resources, at-most-once semantics would typically be expected (and thus the idempotent keyword would be omitted). But in order to allow Ice to automatically perform retries for us, we have chosen to mark operations that would normally require at-most-once semantics as idempotent. This puts an extra burden on the servant implementation for these operations, and on the API itself. The servant must be able to distinguish a retry of a previous call to an operation from a new call. This page describes the approach in Asterisk SCF to handle retry scenarios, and what developers should keep in mind in order to create "well behaved" Asterisk SCF components.

Impact on API

Ice will attempt to retry an operation that fails only if the operation is marked "idempotent". (See Ice manual discussion of automatic retries.) The Ice documentation states that you can mark an operation as idempotent if it doesn't modify the state of the servant being called. (Idempotent implies that calling the operation more than once is safe, as the system will be in the same state whether the operation is invoked once or more than once.) So operations that are simply accessing state are (typically) safe to mark idempotent.

In the Asterisk SCF API, we mark essentially all operations, including those that would normally be considered as requiring at-most-once semantics, as idempotent so that components can benefit from the automatic retry provided by the Ice run-time. Our approach to failover depends on the ability to retry any remote procedure call. However, since many operations in the API aren't naturally idempotent, the servants must be designed to detect duplicate calls and properly process redundant calls. To enable the servant to do so, we have added an OperationContext that can be passed to operations that require at-most-once semantics, so that the servant can distinguish retries from unique calls.

As seen in OperationsIf.ice:

Code Block
/**
 * An abstraction of a logical operational context. May be used to identify a group of operations as being related to
 * each other, allowing a normally non-idempotent operation to be retried as if it were idempotent. A servant receiving 
 * an OperationContext as an argument to an operation should be prepared to detect when the call is actually a retry 
 * of a previously received call. This implies:
 *  - Returning the same return value (or out parameters) for each instance of the call. 
 *  - The component must otherwise behave as if the operation had only been called once.
 *
 * Interface methods that accept an OperationContext instance may throw an OperationCallCancelledException.
 *
 **/
unsliceable class OperationContext
{
    /**
     * Identifies a specific invocation of an operation. This value is the same
     * across retries of the operation. 
     */
    string id;

    /** 
     * Can be common across a group of operations that are logically responding to the
     * same external event. 
     */
    string transactionId;
};

Adding an OperationContext to an operation's arguments ensures that the servant processing the request can identify redundant calls. Below, for example, is the SessionListener interface from SessionCommunicationsIf.ice. Consider the act of flashing a switch-hook on a phone causing the indicated() operation to be invoked on a listener. Without the OperationContext argument, the listener would have to assume that being called more than once was indicative of the switch-hook being flashed more than once. With the OperationContext providing a unique id, the listener can distinguish between retries and multiple occurrences.

Code Block
    interface SessionListener
    {
        /**
         * Notification that some indication event has occurred on the session.
         *
         * @param operationContext Provides unique context for each call to this operation.
         *
         * @param source The session the event occurred on.
         *
         * @param event The indication event that has occurred.
         *
         * @param cookies Any cookies present on the session.
         *
         * @see Session
         *
         * @see Indication
         */
        idempotent void indicated(
            AsteriskSCF::System::V1::OperationContext operationContext,
            Session* source,
            Indication event,
            SessionCookies cookies);
    };

Note: Implied Contract on the Servant

There is an implied contract for all servants that expose idempotent operations in their interface: the servant must behave correctly if it receives duplicate calls to an idempotent operation. When an OperationContext argument is provided, the servant must take whatever actions are necessary when a duplicate call is detected to "appear idempotent". It should avoid semantically incorrect state changes (such as duplicating the allocation of objects), and the retry call should receive the same results (return values, "out" parameters, or exceptions) as the original call.

Impact on Servant Implementation

Supporting idempotency using the OperationContext implies a need to cache every received OperationContext in order to detect redundant calls. Asterisk SCF provides a utility class to do this, with a built-in ability to purge "old" contexts after a configurable period of time. As seen in OperationContextCache.h:

Code Block
class ASTSCF_DLL_EXPORT OperationContextCache : public IceUtil::Shared
{
public:
    /**
     * ctor
     * @param ttlSeconds  The time-to-live for the OperationContexts being cached.
     * Entries will remain in the cache for at least the provided value, but can 
     * remain in cache longer. 
     */
    OperationContextCache(int ttlSeconds);

    /**
     * ctor for logging. 
     * @param ttlSeconds  The time-to-live for the OperationContexts being cached.
     * Entries will remain in the cache for at least the provided value, but can 
     * remain in cache longer. 
     * @param logger The logger to log to. 
     * @param label Label to apply when logging to identify this cache.
     */
    OperationContextCache(int ttlSeconds, 
                          const AsteriskSCF::System::Logging::Logger& logger,
                          const std::string& label);

    ~OperationContextCache();

    /**
     * Caches the specified context if it isn't already in the cache. 
     * @return true The context was added, which means it wasn't already in the cache.
     * @note Make sure you don't confuse the return value of this operation with the return
     * value of the 'contains' operation. 
     */
    bool addOperationContext(const AsteriskSCF::System::V1::OperationContextPtr& operationContext);

    /**
     * Tests if the specified context is in the cache. 
     */
    bool contains(const AsteriskSCF::System::V1::OperationContextPtr& operationContext);

    ...

So a servant implementation that uses OperationContextCache would typically attempt to call addOperationContext() when an operation is about to be processed. If this call returns false, the servant should assume this call is a retry for the operation that it has already processed, but the results of the call weren't received by the caller in time to avoid a retry. Thus a burden exists on the servant to be able to return the same results, without performing any semantically incorrect state changes.
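
The pattern described above can be sketched as follows. This is a minimal illustration, not the Asterisk SCF implementation: `OperationContext` here is a plain stand-in struct, and `SimpleContextCache` and `HookFlashListener` are hypothetical names. The stand-in cache omits the time-to-live purging and the locking that a real cache would need.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>

// Hypothetical stand-in for AsteriskSCF::System::V1::OperationContext.
struct OperationContext
{
    std::string id;
    std::string transactionId;
};
typedef std::shared_ptr<OperationContext> OperationContextPtr;

// Hypothetical, simplified cache keyed on the context id.
// No TTL-based purging and no locking, to keep the sketch short.
class SimpleContextCache
{
public:
    // Returns true if the context was added, i.e. it was not already cached.
    bool addOperationContext(const OperationContextPtr& context)
    {
        assert(context && "a context is required");
        return mIds.insert(context->id).second;
    }

private:
    std::set<std::string> mIds;
};

// Hypothetical listener servant that counts switch-hook flashes.
class HookFlashListener
{
public:
    void indicated(const OperationContextPtr& context)
    {
        if (!mCache.addOperationContext(context))
        {
            return; // Already seen this context: treat the call as a retry and ignore it.
        }
        ++mFlashCount; // First call for this context: apply the state change once.
    }

    int flashCount() const { return mFlashCount; }

private:
    SimpleContextCache mCache;
    int mFlashCount = 0;
};
```

Calling `indicated()` twice with the same context leaves `flashCount()` at 1, while a call with a different context id increments it again.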

To support returning previously computed values to retry calls, an alternate addOperationContext operation exists for the OperationContextCache. This alternate version allows the caller to attach a cookie to an OperationContext in the cache. The cookie will be automatically destroyed whenever the OperationContext itself is dropped from the cache.

Code Block
class ASTSCF_DLL_EXPORT OperationContextCookie
{
};

...

    /**
     * Caches the specified context if it isn't already in the cache, and associate a cookie with it. 
     *
     * @param operationContext The context to add to the cache. 
     * @param inCookie A cookie object to associate with this entry in the cache. 
     * @param existingCookie This value will be set by this method to the cookie of an existing
     * operationContext if there was already an entry in the cache with the same identity. 
     * @return true The context was added, which means it wasn't already in the cache.
     *
     * @note Make sure you don't confuse the return value of this operation with the return
     * value of the 'contains' operation. 
     */
    bool addOperationContext(
        const AsteriskSCF::System::V1::OperationContextPtr& operationContext,
        const OperationContextCookiePtr& inCookie, 
        OperationContextCookiePtr& existingCookie);

This cookie can hold any information that the servant (as the client of the cache) needs in order to respond to future retry calls. The cookie's type simply needs to derive from OperationContextCookie.
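
As a rough sketch of how a cookie might carry a previously computed result, consider the following. All types are simplified stand-ins: `SimpleCookieCache` only mirrors the three-argument shape of addOperationContext shown above, `SessionIdCookie` and `SessionFactory` are hypothetical, and the virtual destructor on the stand-in cookie base class is an assumption made so the cookie can be downcast.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for AsteriskSCF::System::V1::OperationContext.
struct OperationContext
{
    std::string id;
    std::string transactionId;
};
typedef std::shared_ptr<OperationContext> OperationContextPtr;

// Stand-in cookie base class (virtual destructor assumed for this sketch).
class OperationContextCookie
{
public:
    virtual ~OperationContextCookie() {}
};
typedef std::shared_ptr<OperationContextCookie> OperationContextCookiePtr;

// Hypothetical cookie remembering the value returned to the original call.
class SessionIdCookie : public OperationContextCookie
{
public:
    explicit SessionIdCookie(const std::string& id) : sessionId(id) {}
    std::string sessionId;
};

// Simplified stand-in cache: no TTL purging, no locking.
class SimpleCookieCache
{
public:
    bool addOperationContext(const OperationContextPtr& context,
                             const OperationContextCookiePtr& inCookie,
                             OperationContextCookiePtr& existingCookie)
    {
        assert(context && inCookie);
        std::map<std::string, OperationContextCookiePtr>::iterator i = mEntries.find(context->id);
        if (i != mEntries.end())
        {
            existingCookie = i->second; // Hand back the cookie stored by the first call.
            return false;
        }
        mEntries[context->id] = inCookie;
        return true;
    }

private:
    std::map<std::string, OperationContextCookiePtr> mEntries;
};

// Hypothetical servant with an allocating, value-returning operation.
class SessionFactory
{
public:
    std::string createSession(const OperationContextPtr& context)
    {
        std::shared_ptr<SessionIdCookie> cookie =
            std::make_shared<SessionIdCookie>("session-" + std::to_string(mAllocated));
        OperationContextCookiePtr existing;
        if (!mCache.addOperationContext(context, cookie, existing))
        {
            // Retry: return what the original call produced, allocate nothing.
            return std::dynamic_pointer_cast<SessionIdCookie>(existing)->sessionId;
        }
        ++mAllocated; // New call: the allocation is committed.
        return cookie->sessionId;
    }

    int allocated() const { return mAllocated; }

private:
    SimpleCookieCache mCache;
    int mAllocated = 0;
};
```

A second `createSession()` call with the same context returns the same session id as the first and leaves `allocated()` unchanged.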

Network latency and timeouts

It's possible for a retry to occur outside of a failover scenario simply due to network loading, changes in routing, or other such factors. (In fact, the "retry" instance of an operation could arrive at the servant before the original.) It would be challenging for the servant receiving a retry to know its cause, so it simply has to be prepared to process retries correctly regardless of the cause.

The challenging aspect of this is that many operations will be processed asynchronously, and the processing of a result (for a return value, or "out" parameter) may not be complete when a second instance of the operation is received. So if the Bridging component were in the process of creating a bridge, for example, and a retry arrives before the bridge is fully constructed, the Bridging component may not yet have a Bridge proxy to simply hand back to the retry instance of the call. In other components, where a WorkQueue is providing a natural serialization of the processing of an object, this type of race condition wouldn't exist. These types of issues must be worked through on a case-by-case basis.
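
One possible way to handle a retry that arrives while the original call is still being processed is to park the retry's response until the result exists. The sketch below assumes responses are delivered through callbacks; `InFlightOperations` and its methods are hypothetical names, results are plain strings for brevity, and no locking is shown, so a real servant would need to add its own synchronization or confine access to a single queue.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical tracker for asynchronously processed operations,
// keyed by OperationContext id. Not thread-safe.
class InFlightOperations
{
public:
    typedef std::function<void(const std::string&)> ResultCallback;

    // Registers a caller waiting for the result of the given context id.
    // Returns true for the first call seen (the caller should start the work).
    // Returns false for a retry: the callback is answered at once if the
    // result is already known, otherwise it is parked until complete().
    bool begin(const std::string& contextId, const ResultCallback& callback)
    {
        assert(callback);
        std::map<std::string, Entry>::iterator i = mEntries.find(contextId);
        if (i == mEntries.end())
        {
            mEntries[contextId].waiters.push_back(callback);
            return true;
        }
        if (i->second.done)
        {
            callback(i->second.result);
        }
        else
        {
            i->second.waiters.push_back(callback);
        }
        return false;
    }

    // Records the result and answers every parked caller,
    // the original and any retries alike.
    void complete(const std::string& contextId, const std::string& result)
    {
        Entry& entry = mEntries[contextId];
        entry.done = true;
        entry.result = result;
        std::vector<ResultCallback> waiters;
        waiters.swap(entry.waiters);
        for (const ResultCallback& waiter : waiters)
        {
            waiter(result);
        }
    }

private:
    struct Entry
    {
        bool done = false;
        std::string result;
        std::vector<ResultCallback> waiters;
    };

    std::map<std::string, Entry> mEntries;
};
```

With this arrangement a retry never starts a second bridge construction; it either waits for the first one to finish or receives the stored result immediately.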

Reusing an Operation Context

Consider the diagram below, where some external event causes a series of Asterisk SCF API operations (all mutators) to be invoked. To call operation1, the client creates a new OperationContext record to ensure that retries can be made.

Let's think about what would happen if the same OperationContext were also used when calling notifyListener and operation2. When Component 2 received operation1, it would cache the OperationContext so that it could detect any retries. When operation2 is received (assuming it is being processed by the same logical object that processed operation1), its OperationContext would be found in the OperationContextCache, and the component would attempt to behave as if this were a retry. This is because the OperationContext is cached based on its UUID. While Component 2 could conceivably create a cache for every operation, this seems burdensome in the extreme. A cache of operations for a logical entity is more practical. So this leads us to the decision that we should never send the same OperationContext to the same logical object, because we would get false cache hits.

Now consider the call to notifyListener. Component 3 would add this OperationContext to some cache. When the call to operation4 is received, if the same OperationContext had been passed to operation2 and Component 2 had then forwarded that context on to Component 3, we would have a problem similar to the one above. What this shows is that, in a distributed system, you can't really tell where a passed parameter could end up. A lot of that will actually be decided by deployment configurations. So we should simply never reuse an OperationContext in any of these ways. Each call should have a new OperationContext created unless the call is intended as an explicit retry.

To make it easy to create an OperationContext, we have a couple of helper functions defined in a utility library.

Code Block
/**
 * Create a new OperationContext with a new transaction id. 
 */
ASTSCF_DLL_EXPORT AsteriskSCF::System::V1::OperationContextPtr createContext();

/**
 * Create a new OperationContext that has the same transaction id as the input argument.
 *  @param context The source OperationContext that contains the transaction id to use. 
 */
ASTSCF_DLL_EXPORT AsteriskSCF::System::V1::OperationContextPtr createContext(const AsteriskSCF::System::V1::OperationContextPtr& context);

So based on the sequence diagram above, we call the first createContext() factory to create a brand new context with a new transaction id. Since we're processing an externally generated event, we don't have an OperationContext to pass to the alternate factory. We later call the alternate factory (which copies the transaction id from the passed in OperationContext), which provides a means of tracing a series of calls back to some original event. All of the remaining calls in the sequence, even those in other components, use the alternate factory to create a context for each operation that has the same transaction id.
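
To illustrate the intended behavior of the two factories, here is a minimal sketch using a stand-in `OperationContext` struct. The `newId()` helper is hypothetical: it uses a process-local counter so the example is deterministic, whereas a real implementation would need globally unique ids. Whether the real createContext() generates the id and the transaction id independently is also an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for AsteriskSCF::System::V1::OperationContext.
struct OperationContext
{
    std::string id;
    std::string transactionId;
};
typedef std::shared_ptr<OperationContext> OperationContextPtr;

// Hypothetical id source: a counter, unique only within this process.
inline std::string newId()
{
    static std::atomic<unsigned long> counter(0);
    return "id-" + std::to_string(++counter);
}

// For an externally generated event: new id and new transaction id.
inline OperationContextPtr createContext()
{
    OperationContextPtr result = std::make_shared<OperationContext>();
    result->id = newId();
    result->transactionId = newId();
    return result;
}

// For a follow-on call: new id, transaction id copied from the source context.
inline OperationContextPtr createContext(const OperationContextPtr& context)
{
    assert(context && "a source context is required");
    OperationContextPtr result = std::make_shared<OperationContext>();
    result->id = newId();
    result->transactionId = context->transactionId;
    return result;
}
```

Every call in the chain therefore carries a distinct id (so no false cache hits occur), while the shared transaction id still ties the calls back to the originating event.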

Replicating the OperationContext

The component developer must also consider when to replicate an OperationContext. When actual failover does occur, the retry process is what initially gets the (previously) standby component involved in an ongoing operation sequence. For example, consider the following call sequence (parameters omitted for simplification). If the Routing Service fails over to a standby instance before a return of routeSessionOperation, the SIP Session Gateway will eventually retry the call. It's important that any calls out from the newly activated Routing Service use the same OperationContext that the previously active component used, in case the down-stream Bridge Service had already been called. Otherwise, the Bridge Service would leak resources by allocating duplicates. In this case, if we didn't pass createBridge the same OperationContext that the failed component had used, an additional Bridge would be created.

So the Ice run-time in the SIP Session Gateway, through automatic retries, essentially gets the standby Routing Service involved in an ongoing chained operation. The standby Routing Service component, when receiving an OperationContext that was previously sent to an active component, must obey the contract for idempotency even across failover, by sending the identical OperationContext that it had previously sent to the Bridge Service. Assuming the Bridge Service is behaving idempotently, it will return the same Bridge that it previously returned to the active Routing Service, rather than creating a new one.
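
The failover behavior described above can be sketched with a toy model. Everything here is hypothetical: `BridgeService` and `RoutingService` are simplified stand-ins, contexts are reduced to their id strings, and "replication" is modeled by two routing instances sharing one map, where a real deployment would push this state to the standby through its replication mechanism.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical downstream service: at most one bridge per context id.
class BridgeService
{
public:
    std::string createBridge(const std::string& contextId)
    {
        std::map<std::string, std::string>::iterator i = mBridges.find(contextId);
        if (i != mBridges.end())
        {
            return i->second; // Retry: return the bridge created earlier.
        }
        std::string bridge = "bridge-" + std::to_string(mBridges.size() + 1);
        mBridges[contextId] = bridge;
        return bridge;
    }

    std::size_t bridgeCount() const { return mBridges.size(); }

private:
    std::map<std::string, std::string> mBridges;
};

// Hypothetical replicated state: incoming context id -> context id sent downstream.
typedef std::map<std::string, std::string> ReplicatedContexts;

class RoutingService
{
public:
    RoutingService(const std::string& name,
                   const std::shared_ptr<ReplicatedContexts>& state,
                   BridgeService& bridges)
        : mName(name), mState(state), mBridges(bridges)
    {
        assert(mState && "replicated state is required");
    }

    std::string routeSession(const std::string& incomingContextId)
    {
        ReplicatedContexts::iterator i = mState->find(incomingContextId);
        if (i == mState->end())
        {
            // First sight of this operation: choose the downstream context
            // and record it in the replicated state before calling out.
            i = mState->insert(std::make_pair(incomingContextId,
                                              mName + ":" + incomingContextId)).first;
        }
        // Either instance sends the same downstream context for this operation.
        return mBridges.createBridge(i->second);
    }

private:
    std::string mName;
    std::shared_ptr<ReplicatedContexts> mState;
    BridgeService& mBridges;
};
```

If the active instance handles the call and the standby later receives the retry, both send the same downstream context, so the stand-in Bridge Service ends up with a single bridge.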

Exceptions to requiring OperationContext

  1. Intrinsically idempotent operations don't have an OperationContext parameter.
  2. Media write operations are currently not marked as idempotent. Media can typically handle some dropped frames, and retrying media write operations can adversely affect timing of frames. (Downstream buffers would likely need notification.)

Strategies for different types of operations

When developing a servant within a component, it's useful for the developer to classify each operation to know how best to handle it.

Operation Classification: Intrinsically Idempotent
  Examples: Accessors ("getters") are naturally idempotent and require no special handling, so we don't pass an OperationContext to them. Some queries (such as a "lookup" operation) fall into this category, even though they aren't named in the customary "getXyz()" form.
  Example signature: idempotent SessionCookies getCookies(SessionCookies cookieTypes);

Operation Classification: No-Return-Value
  Examples: (1) A command may direct the servant to perform a task, with no return value or "out" parameter. (2) Similarly, a notification simply provides information.
  Example signature: idempotent void updateConnectedLine(OperationContext operationContext, ConnectedLine connectedLine);

Operation Classification: Return-Value (and/or "out" parameters)
  Examples: (1) An allocating operation may return a proxy to a newly-created object. (2) A command may return an informational record that would be of use to the caller.
  Example signature: idempotent SessionInfo setBridge(OperationContext operationContext, Bridge* newBridge, SessionListener* listener);

For the No-Return-Value operation, the OperationContextCache can be used. The servant would determine when an operation is a retry using this mechanism, and simply ignore duplicate calls.

For the Return-Value operation, significant care must be taken to ensure that retries are handled correctly. Some type of mapping from an OperationContext to a return value (or values, if "out" parameters are being used) may be required. If asynchronous processing is being used, the servant will need to be able to detect when an operation is a duplicate of an operation that hasn't even completed yet, and be able to wait for the results.

For both the No-Return-Value and the Return-Value case, we need to consider what happens if the original call resulted in an exception being thrown. We'd want the retry to throw the same exception. It may be that the simplest approach to take here is to remove the OperationContext from the OperationContextCache if an exception is generated, on the assumption that if it failed, it will fail the same way again. But this means that every exception must be caught for these operations (and then rethrown) so that the OperationContextCache can be modified for the cases when we previously would have allowed exceptions to propagate out. To support removing operations from the cache when an exception is caught, the OperationContextCache provides a remove method.

Code Block
    /**
     * This will remove an OperationContext from the cache, if one exists with the given id. 
     * Removal is typically done automatically within the cache based on an internal timer. 
     * This operation exists to support clients that wish to force an immediate removal of a 
     * context themselves. 
     */
    void removeOperationContext(const AsteriskSCF::System::V1::OperationContextPtr& operationContext);
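
The catch-remove-rethrow approach discussed above might look roughly like this. The sketch uses simplified stand-ins: `SimpleContextCache` is keyed directly on id strings, `ConnectedLineServant` is a hypothetical servant, and the empty-string check merely simulates a failure.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical, simplified cache keyed on context id strings (no TTL, no locking).
class SimpleContextCache
{
public:
    bool addOperationContext(const std::string& id) { return mIds.insert(id).second; }
    void removeOperationContext(const std::string& id) { mIds.erase(id); }
    bool contains(const std::string& id) const { return mIds.count(id) != 0; }

private:
    std::set<std::string> mIds;
};

// Hypothetical servant with a No-Return-Value operation that can fail.
class ConnectedLineServant
{
public:
    void updateConnectedLine(const std::string& contextId, const std::string& connectedLine)
    {
        assert(!contextId.empty());
        if (!mCache.addOperationContext(contextId))
        {
            return; // Duplicate of a call that already succeeded: ignore it.
        }
        try
        {
            if (connectedLine.empty())
            {
                throw std::invalid_argument("connected line must not be empty");
            }
            mConnectedLine = connectedLine;
        }
        catch (...)
        {
            // Forget the context so that a retry is processed again
            // and, presumably, fails in the same way.
            mCache.removeOperationContext(contextId);
            throw;
        }
    }

    bool cached(const std::string& contextId) const { return mCache.contains(contextId); }
    const std::string& connectedLine() const { return mConnectedLine; }

private:
    SimpleContextCache mCache;
    std::string mConnectedLine;
};
```

A failed call leaves no entry in the cache, so its retry throws the same exception, while a retry of a successful call is silently ignored.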

Client-side impact

Aside from generating a new OperationContext for every API call, no other client changes are required.