Sometimes things fall through the cracks
A lot of modern web development involves implementing workflows: user bookings moving from incomplete to pending to paid to confirmed, uploaded files going from uploading to uploaded to saved to done, and so on. In larger systems it can be hard to make sure that everything is accounted for, that every single item is either processed correctly or its failure noticed. The question is: what sort of gaps appear in typical development, and how can they be covered?
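Those states usually end up modelled explicitly somewhere in the code, along these lines (a minimal sketch; the names are illustrative and not taken from any particular system):

```typescript
// Illustrative only: typical workflow states modelled as string union types.
type BookingState = "incomplete" | "pending" | "paid" | "confirmed";
type UploadState = "uploading" | "uploaded" | "saved" | "done";

// A workflow item carries its current state plus whatever the later steps need.
interface UploadItem {
  fileId: string;
  state: UploadState;
  updatedAt: Date;
}
```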
When is something really considered to be 'done'?
In this particular example a client had a file upload process. The gist of it: they received user files, then a background process copied them to cloud storage and made sure they were backed up, all asynchronously. From their point of view, a file wasn't successfully uploaded until it was both copied and backed up. That was their definition of 'done' for that workflow.
Not all failures are errors
The application and file sync process were, of course, configured to report errors, but under certain conditions user files were still going missing, and it took a lot of time (and several lost files) to work out where and how, and to implement a fix. The problem stemmed from the gaps between their processes. Each step of the workflow ran in its own separate execution process, with each one responsible for calling the next. The pipeline looked roughly like this:
start -> uploading -> uploaded (on disk) -> copy to cloud storage -> copy to backup -> done
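For illustration only, a chain like this tends to end up as each service finishing its own step and then firing a request at the next one. The service names and endpoints below are made up for this sketch, not the client's actual code:

```typescript
// Illustrative sketch of a chained pipeline: each service finishes its own step
// and is then responsible for telling the next service to start.
const NEXT_STAGE: Record<string, string> = {
  "upload-service": "https://copy-service.internal/copy",   // uploaded -> copy to cloud storage
  "copy-service": "https://backup-service.internal/backup", // copied   -> copy to backup
};

async function notifyNextStage(currentService: string, fileId: string): Promise<void> {
  const url = NEXT_STAGE[currentService];
  if (!url) return; // last stage: nothing left to call
  await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ fileId }),
  });
}

// There is no overseer here: if one of these calls is lost, the chain stops
// silently and the file never reaches 'done'.
```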
The files were going missing after the copy to cloud storage but before the copy to backup. The cloud storage copy process was responsible for sending a web request to the backup process with the details of the file and how to access it. The problem arose whenever the copy process scaled down by removing servers. The sequence went roughly like this (a stripped-down sketch of the dropped hand-off follows the list):
- File is copied successfully
- Copy server X123 sends a request to the backup service requesting a backup of the file it just processed.
- The backup service is busy, so the request hangs briefly
- Copy server X123 is terminated as it's no longer required
- The request never completes; the file is never backed up and is lost as a result
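A sketch of that hand-off (names and endpoints invented for illustration) shows why nothing survives the termination: the in-flight request is the only record that a backup was ever needed.

```typescript
// Illustrative sketch: the hand-off exists only as an in-flight request in memory.
async function handOffToBackup(fileId: string): Promise<void> {
  // If the backup service is busy, this promise can stay pending for a while.
  await fetch("https://backup-service.internal/backup", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ fileId }),
  });
}

// Scale-down sends SIGTERM; exiting straight away abandons any pending hand-offs,
// and no durable record is left to say a backup was still outstanding.
process.on("SIGTERM", () => {
  process.exit(0);
});
```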
Is the solution better software, or something else?
Clearly the issue here is a flaw in the design or the software. There are development practices that can help avoid it, but none of them solve the general problem of making sure that a workflow is completed. You might tighten up the logic between the copy service and the backup service, but at any other step in your workflow there may be a bug, flaw or outage that results in failed items. What you need is something that answers the question you actually care about: did every item get processed correctly?
Solving the problem at its core
The solution is to change your workflow to the following:
start -> create item in Process Warden -> rest -> of -> workflow -> delete item in Process Warden -> done
Process Warden in this sense provides workflow assurance. It can be used to track workflow items and to trigger an alert if a particular item sits waiting for too long. If you can build an initial storage guarantee into your workflow (so that items tracked by Process Warden are always accessible), then you can always recover them later and resume the workflow.
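As a rough illustration of the pattern, the workflow registers each item up front with a "waiting too long" threshold and marks it complete only once it meets the workflow's definition of 'done'. The client URL, endpoints, field names and timeout below are assumptions made for this sketch; they are not Process Warden's actual API.

```typescript
// Hypothetical sketch of wrapping a workflow with a tracking service.
// The URL, endpoints and field names are assumptions for illustration only.
type TrackedItem = { id: string; payload: unknown };

const WARDEN_URL = "https://warden.example.com/api/items"; // hypothetical endpoint

// Register the item before doing any other work, with a "waiting too long" threshold.
async function startTracking(item: TrackedItem): Promise<void> {
  await fetch(WARDEN_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ ...item, alertAfterSeconds: 3600 }),
  });
}

// Only called once the item has met the workflow's definition of 'done'.
async function finishTracking(itemId: string): Promise<void> {
  await fetch(`${WARDEN_URL}/${itemId}`, { method: "DELETE" });
}

async function runWorkflow(item: TrackedItem): Promise<void> {
  await startTracking(item);     // from here on, a stalled item triggers an alert
  // ... rest of workflow: copy to cloud storage, copy to backup ...
  await finishTracking(item.id); // completed within the deadline, no alert fires
}
```

The important property is that registration happens before any other work, so there is never a window where an item exists but nothing is watching it.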
Never lose a workflow item again, and recover from any interruptions
Here Process Warden acts as the gatekeeper of the workflow: once it has received an item, it will notice if that item doesn't complete, no matter what happens during your workflow. That covers bugs, outages, throughput issues, dropped requests and anything else that could interrupt a workflow. The client in question implemented this initially just to work around the issue by manually processing lost files, but ended up using it for their other workflows too.
So far Process Warden's workflow assurance has helped deal with the following issues:
- Failure to complete: a large file wasn't being processed properly due to incorrect chunking code, so it never completed (large files were rare in the application, so this was an edge case)
- Throughput limitation: due to misconfigured scaling, several items started to queue up and hit the time limit configured in Process Warden
- Recovery: a complete outage caused by a database failure meant thousands of items failed to process. Thanks to the item data stored in Process Warden, they were able to resume the workflow from there, despite losing the original records in the database crash
And it continues to monitor each and every workflow item, giving them complete peace of mind. The only burden on the developers is to make sure that the initial step works: that Process Warden receives the item. After that, they can relax knowing that anything else that goes wrong will be detected. If they don't get any alerts from Process Warden, they know their workflow is working correctly.