In distributed computing, components must communicate and coordinate with one another to accomplish tasks. This communication happens through message passing, and the medium the messages pass over is often an internet-connected network, which is inherently unreliable. Networks can be severed, routers can send packets to the wrong destination, firewalls can block traffic, and software can crash. Faced with this reality, we must accept that messages may be lost or duplicated.
There is a thought experiment meant to convey this principle called the Two Generals’ Problem. The Two Generals’ Problem describes two army generals that are separated by an enemy. They are attempting to coordinate an attack on the enemy, but in order to coordinate they must send messengers through the enemy territory. Faced with the possibility that the messenger could be captured or killed, how can they come to an agreement on when to attack?
General A can send a message to let General B know that the attack should be at dawn. However, they can't be sure the enemy won't capture the messenger, which would result in General A attacking without General B and being defeated. This uncertainty could lead General A to not attack at all, even though General B may have received the message, resulting in General B attacking without General A and, again, defeat.
This one-shot mode, where the message may or may not be received, is known as at-most-once message delivery. The sender sends a message and has no idea whether the recipient received it. Usually it will arrive, but occasionally it won't. Ultimately this mode of messaging can leave a message lost, never to be received.
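A minimal sketch of at-most-once delivery, using a seeded random number generator to simulate a lossy network (`lossy_send`, the loss rate, and the mailbox are all illustrative names, not part of any real messaging API):

```python
import random

def lossy_send(mailbox, message, loss_rate=0.3, rng=random.Random(42)):
    """Fire and forget: deliver the message unless the simulated network drops it.
    The sender gets no feedback either way."""
    if rng.random() >= loss_rate:
        mailbox.append(message)

mailbox = []
for i in range(10):
    lossy_send(mailbox, f"attack at dawn #{i}")

# Each message arrives zero or one times -- never more, but possibly never.
print(f"{len(mailbox)} of 10 messages arrived")
```

Note that the sender's loop runs to completion regardless of what arrived; with at-most-once delivery there is simply no feedback channel.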
Troubled by the idea that they may be leading their army into defeat, General A decides not to attack unless they receive an acknowledgment from General B verifying the message was received.
By requiring an acknowledgment, General A can now confidently know the original message was delivered, but what if no acknowledgment arrives from General B? It could mean the original message was captured, or that the acknowledgment was. General A, uncertain, will continue to send messages until an acknowledgment is received. This is known as at-least-once message delivery. The original message won't be lost, but it may be delivered more than once.
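The retry-until-acknowledged loop can be sketched like this. Both the message and the acknowledgment can be lost independently, which is exactly how duplicates arise (`FlakyChannel` and `send_at_least_once` are hypothetical names for illustration):

```python
import random

class FlakyChannel:
    """Simulated network that randomly drops both messages and acks."""
    def __init__(self, seed=1, loss_rate=0.5):
        self.rng = random.Random(seed)
        self.loss_rate = loss_rate
        self.delivered = []  # what the receiver actually saw

    def send(self, message):
        """Returns True only if an ack made it back to the sender."""
        if self.rng.random() < self.loss_rate:
            return False                     # message lost in transit
        self.delivered.append(message)       # receiver got it (possibly again)
        return self.rng.random() >= self.loss_rate  # ack may be lost too

def send_at_least_once(channel, message, max_attempts=100):
    """Retry until an ack arrives. The receiver may see duplicates."""
    for attempt in range(1, max_attempts + 1):
        if channel.send(message):
            return attempt
    raise TimeoutError("no ack received")

channel = FlakyChannel()
attempts = send_at_least_once(channel, "attack at dawn")
# Delivered at least once -- but every lost ack means a duplicate delivery.
print(f"acked after {attempts} attempt(s), receiver saw it "
      f"{channel.delivered.count('attack at dawn')} time(s)")
```

The key observation: the sender retries on a lost ack even though the receiver may already have the message, so duplicates are inherent to the scheme, not a bug in it.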
All of this can be flipped around: General B also wants to be certain that General A received their acknowledgment, so they require General A to acknowledge the acknowledgment. General A can send one, but how can they be certain it was received without yet another acknowledgment? This results in an infinite loop where one party will always be in a state of uncertainty. Ultimately this problem is unsolvable, and it has important implications for distributed consensus. One in particular that I want to discuss: guaranteeing exactly-once message delivery over an unreliable medium, like the internet, is impossible.
This problem is not just a thought experiment; it has real-world consequences. Consider an application that uses a messaging system, like RabbitMQ, to handle credit card processing requests. RabbitMQ may attempt to deliver the message "Charge the credit card 123 $50". Like General A, it can't be certain that the worker in charge of actually processing the credit card charge is going to receive the message. It usually will, but occasionally a router could misroute it.
Just like with the two generals, RabbitMQ can require the worker to send back an acknowledgment so that it knows the message was received and can be deleted on its end. However, the acknowledgment may be lost, resulting in RabbitMQ sending the message again, and again, until an acknowledgment is eventually received.
These are the at-most-once and at-least-once messaging modes we talked about earlier. RabbitMQ supports both of these modes, SQS supports at-least-once, and Kafka supports both (they mention exactly-once, but this is about processing, not message delivery; more on this in a second). The one you'll want to use depends on your application's tolerance for message loss. Often the at-least-once mode is going to be more desirable, but it comes with its own problems. In the case of the credit card charge message, we could charge the user more than once. While an exactly-once message delivery guarantee is impossible, we can still achieve the same result by applying exactly-once message processing.
The idea with exactly-once message processing is that you provide a unique token, called an idempotency id, which can uniquely identify a message. The message receiver then needs to store some state on which idempotency ids it has seen before. That way, if a message arrives with an idempotency id it has seen before, it can acknowledge the message without re-processing it. Stripe makes use of this in their APIs through an `Idempotency-Key` header.
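A minimal sketch of this deduplication scheme (`PaymentProcessor` and its methods are hypothetical; a real worker would keep the seen-id set in durable storage, not in memory):

```python
class PaymentProcessor:
    """Worker that processes each charge at most once per idempotency id."""
    def __init__(self):
        self.seen_ids = set()  # in production: a durable store, e.g. a database
        self.charges = []      # side effects actually performed

    def handle(self, idempotency_id, card, amount):
        if idempotency_id in self.seen_ids:
            return "duplicate-acked"   # ack so the broker stops redelivering,
                                       # but do not charge again
        self.seen_ids.add(idempotency_id)
        self.charges.append((card, amount))
        return "charged"

worker = PaymentProcessor()
# At-least-once delivery means the same message can arrive twice.
print(worker.handle("idem-123", "card-123", 50))  # charged
print(worker.handle("idem-123", "card-123", 50))  # duplicate-acked
print(len(worker.charges))                        # 1: charged exactly once
```

At-least-once delivery plus this receiver-side deduplication is what yields effectively-exactly-once processing, even though the message itself may arrive multiple times.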
Software, of course, is full of bugs, and we could run into a scenario where we mark an idempotency id as seen, but the attempt to update the user's account after the payment was made fails. This could again leave things in an incorrect state, where the payment was charged but the user's account isn't updated to reflect it, resulting in them still appearing as a free user or not receiving a product. While never perfect, you'll want to have robust error handling around these critical sections: rolling back state, retrying specific steps, or even just having a dead letter queue, logging, or alerting in place so that you can run recovery tooling or manually clean things up.
There is another option, although it's often not achievable for many real-world applications: make the message handling code idempotent in and of itself. The idea is that applying the same message more than once results in the same end state as if it had only been applied once.
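A common way to achieve this is to have messages carry absolute state rather than deltas. This toy example contrasts the two (both functions and the message shapes are illustrative assumptions):

```python
# Non-idempotent: the message is a delta, so applying it twice
# corrupts the balance.
def apply_credit(account, message):
    account["balance"] += message["amount"]

# Idempotent: the message carries the absolute end state, so a
# duplicate delivery leaves the account unchanged.
def set_balance(account, message):
    account["balance"] = message["new_balance"]

account = {"balance": 100}
msg = {"new_balance": 150}
set_balance(account, msg)
set_balance(account, msg)   # duplicate delivery is harmless
print(account["balance"])   # 150, same as if applied once
```

Handlers like this need no seen-id bookkeeping at all, which is why this option is attractive when the domain allows it.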
This principle of message delivery applies to any distributed system whose components must communicate over an unreliable medium. Faced with these realities, any operation that must occur only once needs to be made idempotent, unless lost messages or duplicate processing is tolerable.