
Writing BPMN Let's Encrypt Kubernetes Operators in Python III



Event deduplication is a game-changer for several reasons, and it is what finally made writing the operator possible. Let’s have a look.

This is the last article in our series.

It was almost a year ago (wow, time flies) when I started working on this operator. Before, I used to have an Ansible playbook to update the certificates, but after migrating the website to Kubernetes, that wasn’t an option anymore. It’s not that it’s impossible, but since Ansible has no parallelism control, it’s complicated and time-consuming, almost a mission impossible.

After I created the operator and wrote the first two articles, I’d run it, and from time to time it would freeze. Keep in mind that it only updates the certificates after a week, so if it froze, I’d see it a week later at best. I’d look in the log and see nothing going on except the regular checks. So I added support in Adhesive for printing its state when it receives the USR1 signal.
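To illustrate the idea of dumping state on USR1, here is a minimal Python sketch. It is not Adhesive's actual code: `install_state_dump` and `task_states` are hypothetical names made up for this example, and it assumes a POSIX system where `SIGUSR1` exists.

```python
import io
import os
import signal
import sys
from typing import Callable, Dict, TextIO


def install_state_dump(get_state: Callable[[], Dict[str, str]],
                       stream: TextIO = sys.stderr) -> None:
    """Write the current state to `stream` whenever the process gets SIGUSR1."""

    def handler(signum, frame):
        stream.write("--- state dump ---\n")
        for name, state in sorted(get_state().items()):
            stream.write(f"{name}: {state}\n")
        stream.flush()

    # Python only allows installing signal handlers from the main thread.
    signal.signal(signal.SIGUSR1, handler)


# Hypothetical task states, standing in for whatever the engine tracks.
task_states = {"renew_certificate": "RUNNING", "check_expiry": "DONE"}

# Demo: dump into a buffer and signal ourselves, which has the same
# effect as running `kill -USR1 <pid>` from a shell.
buffer = io.StringIO()
install_state_dump(lambda: task_states, stream=buffer)
os.kill(os.getpid(), signal.SIGUSR1)
dump = buffer.getvalue()
```

In a long-running operator you would leave the default `stream` in place and send the signal from outside whenever the process looks stuck.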

I realized with my new state dumping that, for whatever reason, the mutexes would lock, and the state was considered RUNNING. My process would think, "yeah, the certificate is expired, but I’m updating it right now," which wasn’t the case.

So I started working on having event deduplication in the BPMN engine itself instead of doing it manually in the process. That was a great idea, except it was far trickier to implement than I initially thought. It made me better understand why forcing users to deduplicate events through the BPMN process itself would be a worse idea.

You can see in the adhesive examples how the process looked before:

[Figure: Before]

And how it looks now:

[Figure: After]

Excellent job, deduplication, excellent job!

Unfortunately, the first implementations of @deduplicate didn’t work particularly well. I’d see the processes freeze again, but now the state printout would tell me that there was no event running at all, which meant the deduplication state in Adhesive was wrong. Why? Tough question, since deduplication works the opposite way of waiting. For "waiting", we need to defer a task’s execution whenever there is any task before it in the graph. For "deduplication", we need to hold the event back as long as there are tasks running after it in the process graph, for the same deduplication id and loops. These checks can go haywire for several reasons, e.g. exceptions, nested loop information, or deduplication overrides. Errors like these are not fun to reproduce and debug.
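The core bookkeeping can be sketched in a few lines of Python. This is only an illustration of deduplicating by id, not how Adhesive implements @deduplicate: the `Deduplicator` class, its methods, and the `certificate:example.com` key are all hypothetical, and events are assumed to never be `None`.

```python
from typing import Dict, Hashable, Optional, Set


class Deduplicator:
    """Allow one in-flight event per deduplication id; coalesce the rest."""

    def __init__(self) -> None:
        self._running: Set[Hashable] = set()
        self._pending: Dict[Hashable, object] = {}

    def submit(self, dedup_id: Hashable, event: object) -> bool:
        """Return True if the event may start now, False if it was held back."""
        if dedup_id in self._running:
            # Something is still running for this id: keep only the latest event.
            self._pending[dedup_id] = event
            return False
        self._running.add(dedup_id)
        return True

    def finish(self, dedup_id: Hashable) -> Optional[object]:
        """Mark the id as done; return the held-back event to run next, if any."""
        self._running.discard(dedup_id)
        follow_up = self._pending.pop(dedup_id, None)
        if follow_up is not None:
            # The held-back event starts immediately, so the id is running again.
            self._running.add(dedup_id)
        return follow_up


dedup = Deduplicator()
key = "certificate:example.com"
started = [dedup.submit(key, name) for name in ("event-1", "event-2", "event-3")]
follow_up = dedup.finish(key)        # the latest held-back event
after_follow_up = dedup.finish(key)  # nothing else is waiting
```

Note that `finish` has to be called on every exit path, including exceptions (for example from a `try`/`finally`). Otherwise the id stays marked as running forever, which is the kind of stuck state described above.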

So I added a lot of debugging information in the form of human-readable event sourcing. I’ll write an article about that at some point. That was the decisive moment when I was finally able to iron out the bugs and get it to run correctly.

So here are my takeaways:

  1. Debugging is key. When developing the operator, I was able to run it locally with just connections to the K8s cluster. Being able to run a single command, freeze the execution to see why a particular step wasn’t working, look at the Nginx logs, etc., was fundamental. It let me iterate extremely fast and have the operator working in about two full working days.

  2. State querying. Having a way of querying the current state of the application is critical. I think only logging trumps that, since if you have excellent logging, you can infer why that state is the way it is.

  3. Heavy logging. You need to dump a lot of things to see why it isn’t working, especially if it sometimes works. Having some timestamps to see timing bugs is mandatory.

  4. Event sourcing. This can tell you the story of why an event was fired and where it came from. If done right, you can immediately tell when an event should not have executed, or when one wasn’t fired at all.

  5. Event deduplication. Finally, this was one of the more complicated things left and a great source of bugs. If you can have the framework maintain the state for you, so much the better. It is probably better tested anyway.
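As a rough illustration of points 3 and 4, a timestamped event trail that remembers what caused each event can be sketched as below. This is not Adhesive's debugging output; `EventTrail`, `Record`, and the event names are hypothetical.

```python
import itertools
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Record:
    event_id: int
    name: str
    cause_id: Optional[int]  # id of the event that triggered this one, if any
    at: datetime


class EventTrail:
    """Remember every event together with its timestamp and its cause."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._records: Dict[int, Record] = {}

    def record(self, name: str, cause_id: Optional[int] = None) -> int:
        event_id = next(self._ids)
        now = datetime.now(timezone.utc)
        self._records[event_id] = Record(event_id, name, cause_id, now)
        return event_id

    def story(self, event_id: int) -> List[str]:
        """Return readable lines, from the root cause down to the given event."""
        lines: List[str] = []
        current: Optional[int] = event_id
        while current is not None:
            rec = self._records[current]
            lines.append(f"{rec.at.isoformat()} #{rec.event_id} {rec.name}")
            current = rec.cause_id
        return list(reversed(lines))


trail = EventTrail()
timer = trail.record("timer tick")
check = trail.record("certificate expiry check", cause_id=timer)
renew = trail.record("certificate renewal", cause_id=check)
story = trail.story(renew)
```

Printing `story` gives one line per event, oldest first, so a renewal that fired without a preceding expiry check stands out immediately.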

I think that pretty much sums it up. Whew, that was a lot of text. Thank you for reading it. :) Enjoy your day.

Article Photo from Pexels: https://www.pexels.com/photo/people-laptop-industry-internet-132700/