Use polling for resiliency

December 12, 2020

How should two microservices communicate?

Let’s analyze two different communication patterns:

Polling: Service B periodically query Service A for the current state of the users and updates its local storage.
Event driven: Service A publishes in a queue every time user information is updated. Service B consumes the updates to stay up to date.

Using TLA+ we are going to model the two different patterns to see how well they fit our requirements. In our specification, we are also going to take into account unexpected failures that can affect the system and show how to model them in TLA+.

Let’s specify our requirements with an example.

Imagine you have two services, one holding user information and another one in need of user information to properly function, for example it could be sending out promotional emails to you register users for new and upcoming products. Let’s call them services A and B from now on from simplicity.

Service B does not have to be immediately notified about users joining or leaving the system, but it must know about them eventually.

Resiliency

First lets specify what we mean in resiliecy using TLA+.

Let’s call dataSourceA and dataSourceB, the information stored in Services A and B respectively which we want to keep in sync.

Resiliency == []<>(dataSourceB = dataSourceA)

The formula specifies that as the system evolves dataSourceB and dataSourceA become equal an infinite number of times. Thus, even if at some point diverge, for whatever reason, at a later point they will become equal again.

The operators []<> can be read as always eventually. TLA+ makes it easy to express formulas related to how your systems evolve through time.

Polling

The specification

-------------------- MODULE PollMechanism --------------------
EXTENDS Naturals, Integers, TLC

VARIABLES dataSourceA, dataSourceB

vars == <<dataSourceA, dataSourceB>>

Poll ==
    /\ dataSourceB'= dataSourceA
    /\ UNCHANGED <<dataSourceA>>

Update ==
    /\ \E u \in DOMAIN dataSourceA: \E v \in 1..10:
        /\ dataSourceA' =  [ dataSourceA EXCEPT  ![u] = v ]
    /\ UNCHANGED <<dataSourceB>>

Environment == /\ dataSourceB' \in [u1: 1..10, u2: 1..10]
               /\ UNCHANGED <<dataSourceA>>

Init ==
    /\ dataSourceA = [u1 |-> 1, u2 |-> 1]
    /\ dataSourceB = [u1 |-> 1, u2 |-> 1]

Next == \/ Poll \/ Update \/ Environment

Spec == Init /\ [][Next]_vars /\ WF_vars(Poll)

TypeOK ==
    /\ \A u \in DOMAIN dataSourceA: dataSourceA[u] \in 1..10
    /\ \A u \in DOMAIN dataSourceB: dataSourceB[u] \in 0..10

Resilient == []<>(dataSourceB = dataSourceA)
===============================================================

The datasources hold the records of two users u1 and u2. The state of each user is represented by a number for simplicity.

The following three actions are possible in the system:

Service A can Update the state of one of the users.
Service B can Poll to sync it’s state, dataSourceB, with Service’s A state, dataSourceA
The Environment can kick Service B out of sync. This action models the effects of bugs, malicious actors, misunderstood requirements or anything that wrecks havoc to your system.

TLC, the model checker for TLA+, validates that our specification satisfies our resiliency requirement!

Conceptually, even if Service B falls behind Service A or if an external factor kicks Service B out of sync, it only takes one successful poll to bring everything back in sync. Simple and effective.

Event driven

Now, let’s try to solve the same problem using an event driven solution. Service A will be publishing messages with user updates and Service B will be receiving them and use them to update its state.

Here’s the specification:

-------------------- MODULE PushMechanism --------------------
EXTENDS Naturals, Sequences, Integers, TLC

VARIABLES dataSourceA, dataSourceB, queue

MaxQueueLength == 1

vars == <<dataSourceA, dataSourceB, queue>>

Update ==
    /\ Len(queue) # MaxQueueLength
    /\ \E u \in DOMAIN dataSourceA: \E v \in 1..10:
        /\ dataSourceA' =  [ dataSourceA EXCEPT  ![u] = v ]
        /\ queue' = Append(queue, <<u, v>> )
    /\ UNCHANGED <<dataSourceB>>

Receive ==
    /\ Len(queue) # 0
    /\ LET
            u == Head(queue)[1]
            v == Head(queue)[2]
        IN
        /\ dataSourceB' = [ dataSourceB EXCEPT ![u] = v ]
        /\ queue' = Tail(queue)
    /\ UNCHANGED <<dataSourceA>>

Environment ==
    /\ dataSourceB' \in [u1: 1..10, u2: 1..10]
    /\ UNCHANGED <<dataSourceA, queue>>

Init ==
    /\ dataSourceA = [u1 |-> 1, u2 |-> 1]
    /\ dataSourceB = [u1 |-> 1, u2 |-> 1]
    /\ queue = <<>>

Next == \/ Update \/ Receive \/ Environment

Spec == Init /\ [][Next]_vars  /\ WF_vars(Update) /\ WF_vars(Receive)

TypeOK ==
    /\ \A u \in DOMAIN dataSourceA: dataSourceA[u] \in 1..10
    /\ \A u \in DOMAIN dataSourceB: dataSourceB[u] \in 0..10
    /\ Len(queue) \in 0..1

Resilient == []<>(dataSourceB = dataSourceA)
===============================================================

TLC finds the following counter example which demonstrates that the specification does not satisfy the resiliency property:

Environment messes up the state of both users in dataSourceB.
Service A Update the state of u1 and publishes a message in the queue.
Service B Receive the message and updates u1 record in dataSourceB accordingly.
In the next steps, Service A keeps updating u1 but because there is no update for u2 the dataSourceB remains out of sync for u2.

Intuitively, only for values for which you have recently received updates, you can be confident that they are in sync with the source.

Conclusion

The plain old polling is still a strategy worth considering when resiliency is important. Since, it only takes one successful poll to bring everything back to order.

Also, I believe that it better aligns with the needs of the services. In this example, it is Service B that is in need of some data. With polling, service B is in charge of initiating the exchange of data whenever and for whichever subset of data, it really needs to know about.

When Service A is in charge of initiating the exchange, Service B may receive multiple updates for entities it does not need to know about and more importantly it may not get updates for entities it needs to know about (as in the counter-example where it did not receive updates for u2).

TLA+ is an amazing tool to help you explore the properties of distributed systems. In this post, I demonstrate an easy technique to model external factors that can affect your system.

Thank you for reading!

Resources

“Modeling Adversaries with TLA+” by Hillel Wayne, post