Specifying a simple serverless system with TLA+
In this guide, I want to demonstrate how to use TLA+ to specify a simple serverless system.
AWS defines serverless as:
A serverless architecture is a way to build and run applications and services without having to manage infrastructure.
You design your system as a collection of small runtime entities, often connected to queues to exchange messages.
AWS offers services like Lambda and SQS to easily provision such systems. With managed services, less development effort is required to have something up and running. Additionally, they make it easier to build scalable and resilient solutions.
This comes at a price: the increased asynchronicity makes it harder to reason about the behaviour of the system as a whole, and our design needs to take the failure modes of those external services into account. There is more information about serverless architecture and AWS Lambda in the ThoughtWorks Radar.
Services like SQS and Lambda document what you get. For example, SQS by default provides at-least-once delivery and no message ordering guarantees.
As with any other piece of software, cloud services fail from time to time, delivering even less than what they document. Since we lack visibility into those systems (they are managed by the cloud provider, after all), harmful failures can go unnoticed.
As an example: How do you know that all the messages you published in the queue got delivered, without instrumenting the consumer?
On the one hand, taking advantage of services offered by your cloud provider reduces the development effort. On the other hand, failures are insidious: we don’t know whether things are going well or not.
In this blog post, I attempt to demonstrate how to use TLA+ and PlusCal to explore some of these failure modes.
Describing TLA+ and PlusCal in detail would take more than a blog post, and it is not my goal. My goal is twofold: to show how to specify concurrent systems and nondeterministic behaviour with TLA+, and to warn about the dangers of building overly asynchronous systems, since we humans are very bad at reasoning about how they can fail.
For those excited to learn more about TLA+ and PlusCal, I’ve added some resources at the end.
A short introduction to TLA+ and PlusCal
TLA+ is a high-level language for modeling programs and systems–especially concurrent and distributed ones.
Leslie Lamport
TLA+ is well suited to model concurrent systems to make sure they meet your requirements.
We are going to write a specification. Our specification is going to consist of steps. Each step is atomic and updates the state of the system. From every state, TLA+ computes all possible next steps, but not all transitions are legal. For example, in a user management system, the step that deletes a user is only available once at least one user is in the system.
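To make this concrete, here is a hypothetical PlusCal fragment for such a user management system (`users` is an assumed variable holding the current users); the `await` makes the step legal only when at least one user exists:

```
DeleteUser:
    await users /= {};        \* the step is disabled while there are no users
    with u \in users do       \* nondeterministically pick any existing user
        users := users \ {u};
    end with;
```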
A sequence of steps is called a behaviour. You use behaviours to express powerful liveness conditions that your system should meet: for example, that a database will eventually be updated even in the face of failures. Being able to express properties over behaviours is very powerful, because you can validate whether different execution paths eventually lead to the same outcome.
We are going to write the specification in PlusCal, a language that translates to TLA+ but has flow-control structures that will be more familiar to people with a programming background.
TLC is a powerful model checker that will validate the specification or find counter-examples when our invariants fail.
Specifying a simple system
Let’s specify a producer that pushes data over a queue to a consumer, which stores it in a store. For AWS users: think of a Lambda sending data to another Lambda via a queue, where the second Lambda stores the data in S3.
I intentionally did not add any business logic to the example, because I want to focus on the drawbacks that arise from using this architectural style, and the increased asynchronicity it brings.
Global state
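A minimal sketch of how this could look in PlusCal, assuming constants `Messages`, `Producers`, and `Consumers` that match the model configuration used later:

```
CONSTANTS Messages, Producers, Consumers

(* --algorithm serverless

variables
    queue = {},   \* the queue, modeled as an unordered set of messages
    store = {};   \* the store, modeled as an unordered set of messages
```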
That’s how we define the global state of our system. In our specification, we are going to model both the queue and the store as sets of unordered values. The reason I chose the set data structure for the queue is that I want to model a service that does not guarantee message ordering.
The producer
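A sketch of the producer (declared `fair` here, an assumption on my part, so that the liveness property checked later cannot be trivially broken by the producer simply stopping):

```
fair process producer \in Producers
begin
Publish:
    with m \in Messages do             \* nondeterministically pick a message
        queue := queue \union {m};     \* publish it to the queue
    end with;
    goto Publish;                      \* allow the producer to publish again
end process;
```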
We specify a producer that performs only one action: Publish. All the statements under a label (like Publish) are executed atomically, so in one step the producer picks a message from the Messages set and adds it to the queue set. At the end of the step, the line `goto Publish` ensures that the producer is allowed to publish again.

Notice how the construct `with m \in Messages do` allows us to model that we don’t know which message is going to be picked. TLC will make sure to verify our model for all possible picks.
The consumer
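A sketch of the consumer, numbered because the discussion below refers to specific lines:

```
1   fair process consumer \in Consumers
2   begin
3   Consume:
4       await queue /= {};
5       with m \in queue do
6           either
7               queue := queue \ {m};
8           or
9               skip;
10          end either;
11          store := store \union {m};
12      end with;
13      goto Consume;
14  end process;
```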
The consumer also has only one action, Consume. The step is disabled until there is at least one message in the queue; that’s what Line 4 does. Line 5 picks a message from the queue, not necessarily the first one: remember, it is an unordered set. Lines 6-10 are a conditional: either the consumer removes the message from the queue (Line 7), or it does not. Remember that SQS guarantees at-least-once delivery; by leaving the message in the queue, it is going to be consumed again in the future. TLC will validate that our invariants hold in either case.
Invariants
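A sketch of how these two properties could be written in the algorithm’s `define` block:

```
define
    \* every element of the queue is a valid message
    QueueInvariant == queue \subseteq Messages

    \* eventually, the store always equals the full set of messages
    Liveness == <>[](store = Messages)
end define;
```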
QueueInvariant is a basic sanity check ensuring that our queue contains Messages and only Messages. It guards against publishing something into the queue that is not a Message. It may seem trivial, but it is a nice example of what an invariant can look like. When defining the model, you can specify that this invariant should hold at every step.
Liveness describes a property over a whole behaviour of the algorithm, not a single step. It says: eventually, always, the store is going to be equal to Messages. The operator `<>[]` means “eventually always”. It is a very powerful operator for expressing that your algorithm does the right thing in the end.
All together
Putting all the little snippets together, we get our specification:
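Assembled from the sketches above (the original listing may differ in details):

```
------------------------------ MODULE serverless ------------------------------
CONSTANTS Messages, Producers, Consumers

(* --algorithm serverless

variables
    queue = {},   \* the queue, modeled as an unordered set
    store = {};   \* the store, modeled as an unordered set

define
    QueueInvariant == queue \subseteq Messages
    Liveness == <>[](store = Messages)
end define;

fair process producer \in Producers
begin
Publish:
    with m \in Messages do
        queue := queue \union {m};
    end with;
    goto Publish;
end process;

fair process consumer \in Consumers
begin
Consume:
    await queue /= {};
    with m \in queue do
        either
            queue := queue \ {m};
        or
            skip;
        end either;
        store := store \union {m};
    end with;
    goto Consume;
end process;

end algorithm; *)
================================================================================
```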
As you can see, it is only about 40 lines of code.
Execution
Let’s specify the simplest model for the above specification:
```
Messages = {m1}   # Only one message
Producers = {p1}  # Only one producer
Consumers = {c1}  # Only one consumer

QueueInvariant    # Invariant that should hold at every step
Liveness          # Temporal property that should eventually hold for every execution
```
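As an aside, if you run TLC from the command line rather than the Toolbox, a roughly equivalent `.cfg` file could look like this (`Spec` is the formula generated by the PlusCal translator; `m1`, `p1`, and `c1` are model values):

```
SPECIFICATION Spec
CONSTANTS
    Messages = {m1}
    Producers = {p1}
    Consumers = {c1}
INVARIANT QueueInvariant
PROPERTY Liveness
```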
In simple terms, one producer runs in a loop; every time it runs, it publishes the same message to the queue. One consumer also runs in a loop, but only when there are messages available; every time it runs, it reads a message from the queue (potentially removing it) and adds it to the store.
The model checker tells us that our invariants are satisfied in every possible execution.
Success!
Progressing through time
Let’s extend our simple example. Say that, as time goes by, the producer sends us new revisions of the data. For example, the producer is sending us the current “view count” for a video; over time it publishes new messages containing the latest “view count”. In the store, we are interested in keeping only the latest “view count”. Thus, when we receive an update, we should override any older data, if present. We are still going to limit our system to send updates for only one video.
We now have to model a system that evolves through time. To do that, we are going to introduce a new process that models a clock.
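A sketch of how this could look, assuming a new model constant `MaxGeneration` that bounds the clock; the `:>` and `@@` function operators come from the standard `TLC` module:

```
EXTENDS Naturals, TLC

CONSTANTS Messages, Producers, Consumers, MaxGeneration

(* --algorithm serverless

variables
    queue = {},
    store = <<>>,        \* now a function (map) from message to generation
    generation = 1;

\* models the passage of time
fair process clock = "clock"
begin
Tick:
    while generation < MaxGeneration do
        generation := generation + 1;
    end while;
end process;

fair process producer \in Producers
begin
Publish:
    with m \in Messages do
        \* attach the current generation to the published message
        queue := queue \union {[message |-> m, generation |-> generation]};
    end with;
    goto Publish;
end process;

fair process consumer \in Consumers
begin
Consume:
    await queue /= {};
    with msg \in queue do
        either
            queue := queue \ {msg};   \* remove the message from the queue...
        or
            skip;                     \* ...or leave it (at-least-once delivery)
        end either;
        \* write (or overwrite) the stored generation for this message
        store := (msg.message :> msg.generation) @@ store;
    end with;
    goto Consume;
end process;

end algorithm; *)
```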
Every time the clock Ticks, it increases the generation variable. The producer now attaches the generation to the messages, so that we can distinguish older from newer messages. We also changed the store from a set to a structure; you can think of it as a map where the key is the message and the value is the generation.
All those changes have a single goal: to be able to assert that the store only contains the latest version of messages.
```
Liveness == <>[](\A f \in DOMAIN store: store[f] = generation)
```
It reads: eventually, always, for all keys in the store, the value is the latest generation.
Execution
Let’s validate this specification using a simple model:
```
Messages = {m1}       # Only one message
Producers = {p1}      # Only one producer
Consumers = {c1}      # Only one consumer
MaxGeneration = 2     # The clock will tick once

QueueInvariant        # Invariant that should hold at every step
Liveness              # Temporal property that should eventually hold for every execution
```
The model fails.
Here is the counter-example that TLC discovered:
- The producer publishes a message to the queue: `{[generation |-> 1, message |-> m1]}`.
- The clock ticks to model the passage of time and increases the generation to 2. It also exits since, in this model, it is configured to tick only once.
- The producer publishes another message to the queue: `{[generation |-> 2, message |-> m1]}`. Now the queue holds two messages.
- The consumer reads and removes the second-generation message from the queue. This can happen since the queue does not guarantee any ordering of messages. The consumer writes the message to the store: `[m1 |-> 2]`.
- The consumer reads the first-generation message, but it is not removed from the queue. This can happen since the queue guarantees at-least-once delivery, which we model by sometimes keeping the message in the queue after it is read. The consumer writes the message to the store: `[m1 |-> 1]`, thus overriding the newer state with an older one.
No easy solution
A possible fix would be to wait until the queue is empty before the producer publishes, so that the producer is certain all previous messages have been consumed. That could be tough to implement, because it may require some kind of cache in the producer to store messages temporarily until it is OK to publish them.
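In the specification, this first fix is easy to sketch: guard the producer with an `await` so it publishes only into an empty queue (this models the fix, not how you would implement it on real infrastructure):

```
Publish:
    await queue = {};   \* wait until all previous messages are consumed
    with m \in Messages do
        queue := queue \union {[message |-> m, generation |-> generation]};
    end with;
    goto Publish;
```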
Another possible fix would be to store some kind of timestamp with the data and compare the timestamp in the stored data with the one in the message. That would require reading from the store before every write, which could be expensive and inefficient. Also, it may not be sufficient if you have multiple consumers.
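The second fix could be sketched in the consumer by comparing generations before writing:

```
\* keep the stored value if it is already at least as new as the message
store := IF msg.message \in DOMAIN store /\ store[msg.message] >= msg.generation
         THEN store
         ELSE (msg.message :> msg.generation) @@ store;
```

Note that in the specification the comparison and the write happen atomically within one step; on real infrastructure they are separate operations, which is exactly why multiple consumers can still race.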
Lastly, maybe you can switch to a FIFO queue that guarantees message ordering. Depending on the behaviour of your producers, this may still not solve your problem with multiple producers.
If you don’t believe me: the specification already supports multiple consumers and producers; you only need to configure the model and explore.
Conclusion
We just demonstrated that, even with only one producer and one consumer, failures can still happen.
One of the advantages of serverless is how easily it scales. By tweaking a parameter in your cloud provider, you can have hundreds of instances of the same Lambda. Be cautious: even more failure modes come into play when we have multiple consumers and producers!
I believe that there is a sweet spot at the intersection of TLA+ and serverless architectures. With relatively simple specifications, we get to greatly sharpen our awareness of how those systems behave.
Resources
- The site of Leslie Lamport, the creator of TLA+ and PlusCal.
- Learn TLA+ is another good resource.