Implementing circuit breaker pattern from scratch in Python
We’ll briefly look into the circuit breaking pattern before jumping to code.
What is circuit breaking?
In real-world applications, services might go down and come back up (or they might just stay down). When you make a remote call (HTTP request/RPC) to another service, there is a chance that the call will fail. After a certain number of failed remote calls, we stop making remote calls and return a cached response or an error instead. After a specified delay, we allow one remote call through to the failing service. If it succeeds, we allow subsequent remote calls again. If it fails, we keep returning a cached response or an error and make no remote calls to the failing service for some time.
When all services are working and remote calls return without errors, we call this state “Closed”.
When remote calls keep failing and we stop making any more remote calls to the failing service, we call this state “Open”.
After a certain delay, when we make a trial remote call to the failing service, the state transitions from “Open” to “Half-Open”. If the remote call succeeds, the state transitions from “Half-Open” to “Closed” and subsequent remote calls are allowed. If the remote call fails, the state transitions from “Half-Open” back to “Open”, and we wait for a certain period of time before making the next trial call (again in the “Half-Open” state).
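The three states and their transitions can be summarised as a small lookup table. This is only an illustrative sketch: the event names and the next_state helper are hypothetical and do not come from any library.

```python
# State names as lowercase strings; event names are hypothetical labels
# for the transitions described in the prose above.
CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

TRANSITIONS = {
    (CLOSED, "failure_threshold_reached"): OPEN,
    (OPEN, "delay_elapsed"): HALF_OPEN,
    (HALF_OPEN, "call_succeeded"): CLOSED,
    (HALF_OPEN, "call_failed"): OPEN,
}


def next_state(state, event):
    """Return the state after `event`; unknown pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


print(next_state(CLOSED, "failure_threshold_reached"))
print(next_state(HALF_OPEN, "call_succeeded"))
```

Any (state, event) pair that is not in the table, such as a successful call while "Closed", simply keeps the current state.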
To know more, read this and this
Why do you need it?
- To prevent a network or service failure from cascading to other services.
- To save bandwidth by not making requests over the network when the service you’re requesting is down.
- To give the failing service time to recover.
Code Marathon
Let’s now try to build a simple circuit breaker using Python.
Disclaimer: This is in no way production ready. There are some excellent libraries that are available online and well tested. I’ve mentioned two of them here: circuit-breaker and pybreaker.
Let’s first decide on the API for the circuit breaker that we are going to build and define its expected behavior.
I’m a big fan of the retry library’s syntax. We’ll change the API towards the end of the blog post.
Let’s define all the possible states
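One simple way to define the states is a class of string constants. The class name StateChoices is an assumption; the lowercase string values match the state names that show up in the logs and in vars(obj) later in this post.

```python
class StateChoices:
    """Possible circuit breaker states (class name is an assumption)."""

    OPEN = "open"
    CLOSED = "closed"
    HALF_OPEN = "half_open"


print(StateChoices.CLOSED)
```

Plain strings keep log messages and debugging output readable; an enum.Enum would also work if stricter typing is preferred.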
Let’s create a class that handles all of the circuit breaker logic.
The constructor takes the following parameters:
- func: the method/function that makes the remote call
- exceptions: an exception or a tuple of exceptions to catch (ideally network exceptions)
- threshold: the number of failed attempts before the state is changed to "Open"
- delay: the delay in seconds between the "Open" and "Half-Open" states
make_remote_call takes the parameters that the underlying remote call (func) might need.
If it seems confusing, please take a look at the following snippet
make_request is passed as a first-class function to the CircuitBreaker class. The parameters required by make_request are sent through make_remote_call.
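Here is a rough sketch of that skeleton and how it is called. The attribute names follow the vars(obj) output shown later in this post. The handler bodies are placeholders, and make_request here is a stand-in that does not touch the network (an assumption, so the example runs anywhere); a real version would perform the HTTP request.

```python
class CircuitBreaker:
    def __init__(self, func, exceptions, threshold, delay):
        self.func = func                        # function that makes the remote call
        self.exceptions_to_catch = exceptions   # exception or tuple of exceptions
        self.threshold = threshold              # failures allowed before opening
        self.delay = delay                      # seconds to wait while open
        self.state = "closed"
        self.last_attempt_timestamp = None
        self._failed_attempt_count = 0

    def handle_closed_state(self, *args, **kwargs):
        # Placeholder: forwards the call without any failure tracking yet.
        return self.func(*args, **kwargs)

    def handle_open_state(self, *args, **kwargs):
        # Placeholder: to be filled in with the delay check and trial call.
        raise NotImplementedError("open-state handling is not written yet")

    def make_remote_call(self, *args, **kwargs):
        # Whatever arguments the caller passes are forwarded to self.func.
        if self.state == "closed":
            return self.handle_closed_state(*args, **kwargs)
        return self.handle_open_state(*args, **kwargs)


def make_request(url):
    # Stand-in for a real HTTP call (hypothetical body).
    return f"response from {url}"


breaker = CircuitBreaker(make_request, exceptions=(Exception,), threshold=5, delay=10)
print(breaker.make_remote_call("http://localhost:5000/success"))
```

The url argument is never interpreted by the breaker itself; it travels untouched through make_remote_call into make_request.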
Let’s now try to complete handle_closed_state and handle_open_state.
handle_closed_state makes the remote call. If it succeeds, we update last_attempt_timestamp and return the result of the remote call. If the remote call fails, _failed_attempt_count is incremented. If _failed_attempt_count is less than threshold, we simply raise an exception. If _failed_attempt_count is greater than or equal to threshold, we change the state to "Open" and then raise an exception.
handle_open_state first checks whether delay seconds have elapsed since the last attempt to make a remote call. If not, it raises an exception. If delay seconds have elapsed since the last attempt, we change the state to "Half-Open" and try to make one remote call to the failing service. If the remote call succeeds, we change the state to "Closed", reset _failed_attempt_count to 0, and return the response of the remote call. If the remote call fails while in the "Half-Open" state, the state is set back to "Open" and we raise an exception.
Complete code
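What follows is a minimal sketch of how the pieces described above could fit together, not a definitive implementation. The attribute names, the RemoteCallFailedException name, and the "Retry after ... secs" message follow the session output shown in this post. The set_state helper, the logging setup, and updating last_attempt_timestamp on failures as well as successes are assumptions, made so that the delay check has a reference point.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)


class RemoteCallFailedException(Exception):
    pass


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, func, exceptions, threshold, delay):
        self.func = func
        self.exceptions_to_catch = exceptions
        self.threshold = threshold
        self.delay = delay
        self.state = self.CLOSED
        self.last_attempt_timestamp = None
        self._failed_attempt_count = 0

    def set_state(self, state):
        # Assumed helper: switch state and log the transition.
        previous, self.state = self.state, state
        logging.info("Changed state from %s to %s", previous, state)

    def handle_closed_state(self, *args, **kwargs):
        try:
            result = self.func(*args, **kwargs)
        except self.exceptions_to_catch as exc:
            self._failed_attempt_count += 1
            # Assumption: failures also count as "attempts" for the delay check.
            self.last_attempt_timestamp = time.time()
            if self._failed_attempt_count >= self.threshold:
                self.set_state(self.OPEN)
            raise RemoteCallFailedException from exc
        logging.info("Success: Remote call")
        self.last_attempt_timestamp = time.time()
        return result

    def handle_open_state(self, *args, **kwargs):
        elapsed = time.time() - self.last_attempt_timestamp
        if elapsed < self.delay:
            # Still cooling down: fail fast without touching the remote service.
            raise RemoteCallFailedException(f"Retry after {self.delay - elapsed} secs")
        self.set_state(self.HALF_OPEN)
        try:
            result = self.func(*args, **kwargs)
        except self.exceptions_to_catch as exc:
            self.last_attempt_timestamp = time.time()
            self.set_state(self.OPEN)
            raise RemoteCallFailedException from exc
        self.set_state(self.CLOSED)
        self._failed_attempt_count = 0
        self.last_attempt_timestamp = time.time()
        return result

    def make_remote_call(self, *args, **kwargs):
        if self.state == self.CLOSED:
            return self.handle_closed_state(*args, **kwargs)
        return self.handle_open_state(*args, **kwargs)
```

As in the prose description, a success in the "Closed" state does not reset _failed_attempt_count; the counter is only reset after a successful trial call in the "Half-Open" state.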
Now to test it out. Let’s create a mock server.
Install Flask and requests. IPython is optional.
pip install requests
pip install Flask
pip install ipython
Let’s create some endpoints to mock the server
Run the development server
export FLASK_APP=main.py; flask run
By default it runs on port 5000
With the server running, you can use these snippets to test it out.
Now open up a terminal and run the following commands.
(circuit-breaker) ➜ circuit-breaker git:(master) ✗ ipython
In [1]: from circuit_breaker import CircuitBreaker
In [2]: from snippets import make_request, faulty_endpoint, success_endpoint
In [3]: obj = CircuitBreaker(make_request, exceptions=(Exception,), threshold=5, delay=10)
In [4]: obj.make_remote_call(success_endpoint)
Call to http://localhost:5000/success succeed with status code = 200
06:07:51,255 INFO: Success: Remote call
Out[4]: <Response [200]>
In [5]: obj.make_remote_call(success_endpoint)
Call to http://localhost:5000/success succeed with status code = 200
06:07:53,610 INFO: Success: Remote call
Out[5]: <Response [200]>
In [6]: vars(obj)
Out[6]:
{'func': <function snippets.make_request(url)>,
'exceptions_to_catch': (Exception,),
'threshold': 5,
'delay': 10,
'state': 'closed',
'last_attempt_timestamp': 1607800073.610199,
'_failed_attempt_count': 0}
Line 1 and Line 2 are just imports. In line 3, we create a CircuitBreaker object for make_request. Here, we're setting exceptions=(Exception,), which will catch all exceptions. Ideally we should narrow this down to the exceptions we actually want to catch (in this case, network exceptions), but we're going to leave it as is for this demo.
Now make successive calls to the faulty endpoint.
In [7]: obj.make_remote_call(faulty_endpoint)
In [8]: obj.make_remote_call(faulty_endpoint)
In [9]: obj.make_remote_call(faulty_endpoint)
In [10]: obj.make_remote_call(faulty_endpoint)
In [11]: obj.make_remote_call(faulty_endpoint)
In [12]: obj.make_remote_call(faulty_endpoint)
---------------------------------------------------------------------------
Traceback data ..........
RemoteCallFailedException: Retry after 8.688776969909668 secs
In [13]: obj.make_remote_call(success_endpoint)
---------------------------------------------------------------------------
Traceback data......
RemoteCallFailedException: Retry after 6.096494913101196 secs
Try to make these calls as fast as possible. After the first five calls to faulty_endpoint, the next call (Line 12) will not make an API request to the Flask server. Instead, it will raise an exception telling you to retry after the specified number of seconds. Even if you make an API call to success_endpoint (Line 13), it will still raise an error, because the circuit is in the "Open" state.
Now, after the delay has elapsed, if we make a call to the faulty endpoint, the state will transition from Open to Half-Open and then, because the call fails, back to Open.
In [18]: obj.make_remote_call(faulty_endpoint)
06:21:24,959 INFO: Changed state from open to half_open
...
06:21:24,964 INFO: Changed state from half_open to open
Now, after the delay has elapsed, if we make a call to success_endpoint, the state will transition from Open to Half-Open and then to Closed.
In [19]: obj.make_remote_call(success_endpoint)
06:25:10,673 INFO: Changed state from open to half_open
...
06:25:10,678 INFO: Changed state from half_open to closed
Out[19]: <Response [200]>
Finally, improving the API shouldn’t take a lot of time. I’ve added a quick and dirty version here.
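For illustration, a decorator-style API could look roughly like the sketch below. This is not necessarily the linked version: the circuit_breaker decorator name and the wrapper.breaker attribute are hypothetical, and the CircuitBreaker class here is a condensed stand-in for the one built in this post so that the example is self-contained.

```python
import functools
import time


class RemoteCallFailedException(Exception):
    pass


class CircuitBreaker:
    """Condensed stand-in for the class built in this post."""

    def __init__(self, func, exceptions, threshold, delay):
        self.func = func
        self.exceptions_to_catch = exceptions
        self.threshold = threshold
        self.delay = delay
        self.state = "closed"
        self.last_attempt_timestamp = None
        self._failed_attempt_count = 0

    def make_remote_call(self, *args, **kwargs):
        if self.state == "open":
            elapsed = time.time() - self.last_attempt_timestamp
            if elapsed < self.delay:
                raise RemoteCallFailedException(f"Retry after {self.delay - elapsed} secs")
            self.state = "half_open"
        try:
            result = self.func(*args, **kwargs)
        except self.exceptions_to_catch as exc:
            self.last_attempt_timestamp = time.time()
            if self.state == "half_open":
                self.state = "open"
            else:
                self._failed_attempt_count += 1
                if self._failed_attempt_count >= self.threshold:
                    self.state = "open"
            raise RemoteCallFailedException from exc
        if self.state == "half_open":
            self.state = "closed"
            self._failed_attempt_count = 0
        self.last_attempt_timestamp = time.time()
        return result


def circuit_breaker(exceptions, threshold, delay):
    """Hypothetical decorator factory: one breaker per decorated function."""
    def decorator(func):
        breaker = CircuitBreaker(func, exceptions, threshold, delay)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return breaker.make_remote_call(*args, **kwargs)

        wrapper.breaker = breaker  # exposed so callers can inspect the state
        return wrapper
    return decorator


@circuit_breaker(exceptions=(ConnectionError,), threshold=2, delay=10)
def fetch(url):
    # Always fails, purely to demonstrate the breaker opening.
    raise ConnectionError(f"cannot reach {url}")
```

With this shape, callers simply call fetch(url); the breaker state lives alongside the function instead of in a separately managed object.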
All code samples can be found here
Now we have a working circuit breaker. We could add response caching and monitoring, and make it thread-safe. Errors could be handled better, and more exception types could help. All of these features are left as an exercise for the reader.