Retry and fallback policies in C# with Polly

In this post I will try to explain how to create clean and effective policies that retry API calls and fall back to an alternative when requests keep failing. With Polly you can build complex, advanced error-handling scenarios with just a few lines of code.

This week I was connecting an eCommerce web application to an ERP system with REST APIs. There are multiple endpoints, all authenticated with OAuth. To make sure all calls to the APIs will have a high success rate I had to implement retry mechanisms for different scenarios.

For these kinds of scenarios there is a very cool library: Polly. I have been using it for some years now (together with Refit) and I am deeply in love with both libraries.

Although there are abundant resources about Polly on the web I wanted to write a post with a lot of sample code to provide a quick and practical example of how easy it is to use Polly to create advanced exception handling with APIs. I am using Refit because it is quick and easy to use with REST APIs but Polly can be used with any kind of C# code.

Disclaimer: this article and sample code have nothing to do with the work I did for the eCommerce website. It was just a trigger for me to write about Polly. Also, the code shown might not always be the best way to implement things; it is just an example to explain some use cases of Polly.

What is Polly?

From the Polly repository: Polly is a .NET resilience and transient-fault-handling library that allows developers to express policies such as Retry, Circuit Breaker, Timeout, Bulkhead Isolation, and Fallback in a fluent and thread-safe manner.

How a simple API call can get way too complex

Let’s say I have a micro service with an API endpoint to retrieve products:
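As a minimal sketch, assuming a Refit interface in front of the endpoint, such a call could look like this. The `Product` type, the `/products` route, and the `IProductsApi` and `ProductService` names are hypothetical and only serve the illustration.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Refit;

// Hypothetical product model.
public class Product
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
}

// Hypothetical Refit interface; Refit generates the HTTP client from it.
public interface IProductsApi
{
    [Get("/products")]
    Task<List<Product>> GetProductsAsync();
}

public class ProductService
{
    private readonly IProductsApi _api;

    public ProductService(IProductsApi api) => _api = api;

    // No error handling: any network or HTTP failure goes straight to the caller.
    public Task<List<Product>> GetProductsAsync() => _api.GetProductsAsync();
}
```

With a placeholder base URL, `RestService.For<IProductsApi>("https://api.example.com")` would create a client that can be passed to the service.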

If only everything could be as simple as that. When you use code like this in a production environment you will quickly find out that there is a need for exception handling. And, even better, for a mechanism that retries a few times before throwing an exception.

So, let’s add a simple retry (this is kind of pseudo-code, just for demonstration purposes):
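A hand-rolled retry of this kind could be sketched as follows. The `NaiveRetry` helper and its `maxAttempts` parameter are assumptions for illustration, and the API call is passed in as a delegate so the sketch stays independent of any HTTP library.

```csharp
using System;
using System.Threading.Tasks;

public static class NaiveRetry
{
    // Calls the action up to maxAttempts times; the last failure is rethrown.
    public static async Task<T> ExecuteAsync<T>(Func<Task<T>> action, int maxAttempts = 3)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await action();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Swallow the failure and loop around for another attempt.
            }
        }
    }
}
```

It works, but every call site has to remember to go through the helper, and every new failure mode means another branch inside it.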

Although it is not the most beautiful code, it might actually work for you. But then the next problem arises: the API is going to be protected with OAuth, so we have to get an access token from another endpoint and send it as a bearer token to be able to retrieve products. This adds quite a few extra scenarios where things can go wrong, most commonly timeouts and expired tokens.

Let’s work on another revision of the code to add extra retries for these scenarios:
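The following sketch shows how quickly such a revision grows. It assumes the client reports failures as `HttpRequestException` with a `StatusCode` (available on .NET 5 and later); with Refit you would catch `ApiException` and inspect its `StatusCode` instead. `HandRolledRetry`, `getToken`, `call` and `timeoutDelay` are hypothetical names.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class HandRolledRetry
{
    public static async Task<T> CallWithTokenAsync<T>(
        Func<Task<string>> getToken,     // fetches a fresh access token
        Func<string, Task<T>> call,      // the actual API call, given a token
        TimeSpan timeoutDelay,
        int maxAttempts = 3)
    {
        var token = await getToken();
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await call(token);
            }
            catch (HttpRequestException ex) when (attempt < maxAttempts)
            {
                switch (ex.StatusCode)
                {
                    case HttpStatusCode.Unauthorized:
                        token = await getToken();       // token expired: get a new one
                        break;
                    case HttpStatusCode.RequestTimeout:
                        await Task.Delay(timeoutDelay); // give the API some time
                        break;
                    default:
                        throw;                          // anything else is not retried
                }
            }
        }
    }
}
```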

I am going to stop right here. I should add another retry around the retrieval of the access token and handle more cases in the switch statement. In short, this simple API call is becoming an unmaintainable mess.

Polly to the rescue

Let’s try to implement the same scenario in a cleaner and more maintainable way by using Polly!

First, a simple version:
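A minimal sketch of such a version, assuming Polly's v7-style `Policy` API. The two delegates are hypothetical stand-ins for the authentication service and the Refit client, and the retry count of 3 is an arbitrary choice.

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.Retry;

public class ProductService
{
    private readonly Func<Task<string>> _getToken;              // stand-in for the authentication service
    private readonly Func<string, Task<string[]>> _getProducts; // stand-in for the products API client

    // Retry up to 3 more times after the first failure, whatever the exception.
    private readonly AsyncRetryPolicy _retry = Policy.Handle<Exception>().RetryAsync(3);

    public ProductService(Func<Task<string>> getToken, Func<string, Task<string[]>> getProducts)
    {
        _getToken = getToken;
        _getProducts = getProducts;
    }

    public async Task<string[]> GetProductsAsync()
    {
        var token = await _getToken();
        return await _retry.ExecuteAsync(() => _getProducts(token));
    }
}
```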

The code is simple and hardly needs further explanation. It will authenticate first (the authentication service itself will also use Polly) and then try to get products. It will retry a number of times when it receives any exception. While this is not a complete solution, it can already handle some issues.

Let’s extend it a bit. I want to add a delay when I receive a timeout. Maybe the API is spinning up, rebooting or there might be a network issue:
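As a sketch, a wait-and-retry policy that only reacts to timeouts could be built like this. It assumes failures arrive as `HttpRequestException` with a `StatusCode` (with Refit, `ApiException` would take that role); the `TimeoutRetry` helper name is hypothetical.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.Retry;

public static class TimeoutRetry
{
    // delay receives the retry attempt number (starting at 1) and returns how long to wait.
    public static AsyncRetryPolicy Create(int retryCount, Func<int, TimeSpan> delay) =>
        Policy
            .Handle<HttpRequestException>(ex => ex.StatusCode == HttpStatusCode.RequestTimeout)
            .Or<TaskCanceledException>() // HttpClient reports its own timeout this way
            .WaitAndRetryAsync(retryCount, delay);
}
```

For example, `TimeoutRetry.Create(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)))` would wait 2, 4 and 8 seconds between attempts, while any other status code is not retried at all.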

But what if the API throws an exception because my access token is expired? This will be a different type of exception and it will also need a different solution to solve the problem. Polly is able to wrap different policies to handle different scenarios:
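A possible shape for this, sketched with `Policy.WrapAsync`: the outer policy refreshes the token on a 401, the inner policy waits and retries on a 408. The `AuthorizedClient` class and its delegates are hypothetical, and failures are again assumed to surface as `HttpRequestException` with a `StatusCode`.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;

public class AuthorizedClient
{
    private readonly Func<Task<string>> _getToken;
    private readonly IAsyncPolicy _policy;
    private string _token = "";

    public AuthorizedClient(Func<Task<string>> getToken, Func<int, TimeSpan> timeoutDelay)
    {
        _getToken = getToken;

        // Outer policy: on 401, fetch a fresh token and retry once.
        var refreshOnUnauthorized = Policy
            .Handle<HttpRequestException>(ex => ex.StatusCode == HttpStatusCode.Unauthorized)
            .RetryAsync(1, onRetryAsync: async (exception, retryNumber) =>
            {
                _token = await _getToken();
            });

        // Inner policy: on 408, wait and retry up to 3 times.
        var waitOnTimeout = Policy
            .Handle<HttpRequestException>(ex => ex.StatusCode == HttpStatusCode.RequestTimeout)
            .WaitAndRetryAsync(3, timeoutDelay);

        // The first argument is the outermost policy.
        _policy = Policy.WrapAsync(refreshOnUnauthorized, waitOnTimeout);
    }

    public async Task<T> ExecuteAsync<T>(Func<string, Task<T>> call)
    {
        if (_token.Length == 0) _token = await _getToken();
        // The lambda reads _token on every attempt, so a refreshed token is picked up.
        return await _policy.ExecuteAsync(() => call(_token));
    }
}
```

An exception that the inner policy does not handle simply propagates to the outer one, which is what lets each policy stay focused on a single failure mode.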

While this is not the way I would structure my code in a real app, I believe this is understandable and maintainable code. But how can we verify all these scenarios work? How can one simulate all the scenarios at a time to verify the behavior of all policies?

Unit testing with Polly and Refit

This brings us to unit testing. To me, this is one of the most important (and fun) parts. I like writing unit tests in general, but especially when programming difficult scenarios with APIs and policies.

Imagine this: I want a retry on the authentication API but only when I receive a RequestTimeout (HTTP status code 408). This will be my full AuthenticationService:
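A minimal sketch of such a service follows. The `IAuthenticationApi` interface is a hypothetical stand-in for the Refit client, and the sketch assumes a 408 surfaces as `HttpRequestException` with `StatusCode` set; with Refit itself the handled type would be `ApiException`.

```csharp
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.Retry;

// Hypothetical API abstraction.
public interface IAuthenticationApi
{
    Task<string> GetAccessTokenAsync();
}

public class AuthenticationService
{
    private readonly IAuthenticationApi _api;
    private readonly AsyncRetryPolicy _retryOnTimeout;

    public AuthenticationService(IAuthenticationApi api, int retryCount = 3)
    {
        _api = api;

        // Only a 408 is retried; every other failure is rethrown immediately.
        _retryOnTimeout = Policy
            .Handle<HttpRequestException>(ex => ex.StatusCode == HttpStatusCode.RequestTimeout)
            .RetryAsync(retryCount);
    }

    public Task<string> GetAccessTokenAsync() =>
        _retryOnTimeout.ExecuteAsync(() => _api.GetAccessTokenAsync());
}
```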

Now I can test the behavior with Moq to mock the API:
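The tests below are a sketch using xUnit and Moq. To keep the listing self-contained, the retry-on-408 policy is inlined in a small helper instead of going through a service class; `ITokenApi` and the test names are hypothetical, and failures are assumed to be `HttpRequestException` with a `StatusCode`.

```csharp
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Moq;
using Polly;
using Xunit;

// Hypothetical API abstraction to mock.
public interface ITokenApi
{
    Task<string> GetAccessTokenAsync();
}

public class RetryOnTimeoutTests
{
    // The code under test: retry up to 3 times, but only on HTTP 408.
    private static Task<string> GetTokenWithRetryAsync(ITokenApi api) =>
        Policy
            .Handle<HttpRequestException>(ex => ex.StatusCode == HttpStatusCode.RequestTimeout)
            .RetryAsync(3)
            .ExecuteAsync(() => api.GetAccessTokenAsync());

    [Fact]
    public async Task Retries_after_a_timeout()
    {
        var api = new Mock<ITokenApi>();
        api.SetupSequence(a => a.GetAccessTokenAsync())
           .ThrowsAsync(new HttpRequestException("timeout", null, HttpStatusCode.RequestTimeout))
           .ReturnsAsync("token");

        var token = await GetTokenWithRetryAsync(api.Object);

        Assert.Equal("token", token);
        api.Verify(a => a.GetAccessTokenAsync(), Times.Exactly(2));
    }

    [Fact]
    public async Task Does_not_retry_other_status_codes()
    {
        var api = new Mock<ITokenApi>();
        api.Setup(a => a.GetAccessTokenAsync())
           .ThrowsAsync(new HttpRequestException("denied", null, HttpStatusCode.Forbidden));

        await Assert.ThrowsAsync<HttpRequestException>(() => GetTokenWithRetryAsync(api.Object));
        api.Verify(a => a.GetAccessTokenAsync(), Times.Once());
    }
}
```

Verifying the number of calls on the mock is what proves the policy actually retried, rather than the call simply succeeding the first time.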

Advanced scenarios

Let us dive a bit deeper into policies and Polly and combine different policies (and even add two more).

Fallbacks

Let’s say I created a micro service to create orders. We do not want to lose any order because that directly results in lost money. A simple retry will not be enough: what if the order API is offline for a longer time? I want an advanced scenario that looks like this:

[Figure: Order flow]

I will not implement authentication in this flow but I guess you can already imagine:

a) the flow will be much more complicated

b) it will still be quite easy to implement with Polly using the example from above

This is what the flow will look like in code:
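The sketch below assumes one plausible reading of the flow: try to send the order, retry a few times, and if that keeps failing fall back to storing the order in a queue for later processing. The queue, the `Order` and `OrderStatus` types, and both delegates are assumptions for illustration.

```csharp
using System;
using System.Threading.Tasks;
using Polly;

public record Order(string Id);
public enum OrderStatus { Sent, Queued }

public class OrderService
{
    private readonly Func<Order, Task> _sendToApi;   // hypothetical stand-in for the order API client
    private readonly Func<Order, Task> _saveToQueue; // hypothetical stand-in for a durable queue

    public OrderService(Func<Order, Task> sendToApi, Func<Order, Task> saveToQueue)
    {
        _sendToApi = sendToApi;
        _saveToQueue = saveToQueue;
    }

    public Task<OrderStatus> CreateOrderAsync(Order order)
    {
        // Inner policy: retry the API call up to 3 times.
        var retry = Policy<OrderStatus>.Handle<Exception>().RetryAsync(3);

        // Outer policy: when the retries are exhausted, queue the order instead.
        var fallback = Policy<OrderStatus>
            .Handle<Exception>()
            .FallbackAsync(async cancellationToken =>
            {
                await _saveToQueue(order);
                return OrderStatus.Queued;
            });

        return fallback.WrapAsync(retry).ExecuteAsync(async () =>
        {
            await _sendToApi(order);
            return OrderStatus.Sent;
        });
    }
}
```

The caller gets an `OrderStatus` either way, so it can tell the customer the order was accepted even when the API was down.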

And the unit test to test the full flow (check the repository on GitHub to see the mock setups):

CircuitBreaker

So now we have a retry and a fallback. Can it still be improved? Yes, it can! Imagine the order API is really broken. Do we want customers to have a slower experience while we keep retrying the API, although we know the last few calls were unsuccessful? Guess not! So, let’s say hi to the circuit breaker.

The circuit breaker keeps track of the number of consecutive exceptions. When the configured number has been reached it breaks: it “opens the circuit” for a configured amount of time, during which it will not even try to execute the call but immediately throws an exception. When the break duration has passed, the next call is let through as a trial. If it succeeds the circuit closes and counting starts all over; if it fails the circuit opens again.

When I first tried the circuit breaker I made a trivial mistake: I initialized the breaker on every call, which reset the count on every call, so the circuit would never break. It is important to have the circuit breaker live at a higher level than the call (i.e. as a singleton or created in the constructor of the service, thus having the same scope as the service itself).

I added the circuit breaker to the order service:
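A sketch of what that could look like, with the breaker created once in the constructor so that its failure count survives between calls. The fallback to a queue, the type names, and the default break duration of 30 seconds are assumptions; only the threshold of 10 comes from the text.

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

public record Order(string Id);
public enum OrderStatus { Sent, Queued }

public class OrderService
{
    private readonly Func<Order, Task> _sendToApi;   // hypothetical order API client
    private readonly Func<Order, Task> _saveToQueue; // hypothetical durable queue
    private readonly AsyncCircuitBreakerPolicy<OrderStatus> _breaker;
    private readonly IAsyncPolicy<OrderStatus> _retryThenBreak;

    public OrderService(Func<Order, Task> sendToApi, Func<Order, Task> saveToQueue,
        int failuresBeforeBreaking = 10, TimeSpan? breakDuration = null)
    {
        _sendToApi = sendToApi;
        _saveToQueue = saveToQueue;

        // Created once per service instance: the breaker must keep its state across calls.
        _breaker = Policy<OrderStatus>
            .Handle<Exception>()
            .CircuitBreakerAsync(failuresBeforeBreaking, breakDuration ?? TimeSpan.FromSeconds(30));

        var retry = Policy<OrderStatus>.Handle<Exception>().RetryAsync(3);
        _retryThenBreak = retry.WrapAsync(_breaker); // retry outside, breaker inside
    }

    public CircuitState CircuitState => _breaker.CircuitState;

    public Task<OrderStatus> CreateOrderAsync(Order order)
    {
        // The fallback is stateless, so building it per call (to capture the order) is fine.
        var fallback = Policy<OrderStatus>
            .Handle<Exception>()
            .FallbackAsync(async cancellationToken =>
            {
                await _saveToQueue(order);
                return OrderStatus.Queued;
            });

        return fallback.WrapAsync(_retryThenBreak).ExecuteAsync(async () =>
        {
            await _sendToApi(order);
            return OrderStatus.Sent;
        });
    }
}
```

Once the circuit is open, the retries fail instantly with a `BrokenCircuitException` instead of hitting the API, so the fallback kicks in without delay.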

All unit tests will still succeed because the circuit breaker will only break after 10 exceptions. Let’s try and create a unit test to test the behavior of the circuit breaker.
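One way to sketch such a test with xUnit is to exercise a breaker directly with a low threshold and a long break duration, so the timing cannot influence the outcome. The threshold of 2, the one-minute break, and the test names are arbitrary choices for the example.

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;
using Xunit;

public class CircuitBreakerTests
{
    [Fact]
    public async Task Opens_after_the_configured_number_of_failures()
    {
        var breaker = Policy
            .Handle<InvalidOperationException>()
            .CircuitBreakerAsync(2, TimeSpan.FromMinutes(1));

        var calls = 0;
        Func<Task> failingCall = () =>
        {
            calls++;
            throw new InvalidOperationException("order API down");
        };

        // The first two failures pass through the breaker and are rethrown as-is.
        await Assert.ThrowsAsync<InvalidOperationException>(() => breaker.ExecuteAsync(failingCall));
        await Assert.ThrowsAsync<InvalidOperationException>(() => breaker.ExecuteAsync(failingCall));

        // The circuit is now open: the third call is rejected without being executed.
        await Assert.ThrowsAsync<BrokenCircuitException>(() => breaker.ExecuteAsync(failingCall));

        Assert.Equal(2, calls);
        Assert.Equal(CircuitState.Open, breaker.CircuitState);
    }
}
```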

After adding some logging to the service and creating the unit test I got this log result:

[Figure: Logging output]

The unit test is a bit “funny”. Since there is a time element (the period during which the circuit stays open), the number of retries can vary. I guess I should be able to create an exact test, but for demonstration purposes this will do.

Conclusion

I hope you learned something here. Please tell me if you have started using Polly. Also, tell me if you happen to know alternative libraries; I would very much like that!

Links