Automatic retries on transaction failure are not foolproof
ACID-compliant databases make it safe to retry a transaction after a failure: atomicity guarantees that all changes are rolled back on error, preserving data integrity and ensuring that all invariants remain satisfied.
But automatic retries come with a few pitfalls:
- The transaction could already have succeeded, but the network failed to deliver the success message back to the application or service. Retrying the transaction will then duplicate data unless the operation is idempotent (see the first sketch after this list).
- A transaction may be held up: deadlocked, queued waiting to be accepted by the server, stuck on an overloaded machine, or blocked behind a heavy query the database is executing. Retrying immediately only worsens the problem; instead, detect the type of error and retry with an exponential backoff (see the retry sketch after this list).
- Side effects need to be handled carefully, preferably with a Unit of Work pattern so that they execute only when the overall transaction succeeds. A two-phase commit would be appropriate here, unifying the database commit and the message-queue dispatch (a lighter-weight outbox variant is sketched after this list).
- Retrying makes sense only for transient errors, paired with exponential backoff, so detecting the error type is vital. There is no use retrying a transaction repeatedly when the error is permanent, like a constraint violation or a data-validation failure.
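To make the lost-acknowledgement case safe, the write itself can carry an idempotency key. Here is a minimal sketch in Python with psycopg2 against PostgreSQL; the `processed_requests` and `accounts` tables and the `request_id` key are illustrative assumptions, not something from the original text:

```python
def apply_payment(conn, request_id, account_id, amount):
    """Safe to retry: a duplicate request_id turns the whole call into a no-op."""
    with conn.cursor() as cur:
        # Record the request first. ON CONFLICT DO NOTHING means a retry of an
        # already-committed transaction inserts nothing.
        cur.execute(
            "INSERT INTO processed_requests (request_id) VALUES (%s) "
            "ON CONFLICT (request_id) DO NOTHING",
            (request_id,),
        )
        if cur.rowcount == 0:
            conn.rollback()  # an earlier attempt already committed this request
            return
        cur.execute(
            "UPDATE accounts SET balance = balance + %s WHERE id = %s",
            (amount, account_id),
        )
    conn.commit()
```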
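Detecting the error type and backing off might look like the following sketch, again with psycopg2: the driver raises `TransactionRollbackError` for transient SQLSTATE class-40 failures (deadlocks, serialization failures), while `IntegrityError` signals a permanent constraint violation that should not be retried. The retry budget and backoff constants are arbitrary choices:

```python
import random
import time

import psycopg2
from psycopg2.extensions import TransactionRollbackError

MAX_RETRIES = 5

def run_with_retries(conn, work):
    """Run `work(conn)` in a transaction, retrying only transient failures."""
    for attempt in range(MAX_RETRIES):
        try:
            work(conn)
            conn.commit()
            return
        except TransactionRollbackError:
            # Deadlock or serialization failure (SQLSTATE class 40): transient.
            conn.rollback()
            # Exponential backoff with a little jitter to avoid thundering herds.
            time.sleep((2 ** attempt) * 0.1 + random.uniform(0, 0.1))
        except psycopg2.IntegrityError:
            # Constraint violation: permanent, retrying cannot succeed.
            conn.rollback()
            raise
    raise RuntimeError("transaction failed after %d retries" % MAX_RETRIES)
```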
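For the side-effect problem, one common lighter-weight alternative to a full distributed two-phase commit is the transactional outbox: the side effect is staged as a row in the same database transaction as the data change, so both commit or roll back together, and a separate dispatcher publishes staged rows after commit. A minimal sketch, assuming hypothetical `orders` and `outbox` tables:

```python
import json

def place_order(conn, order):
    """Unit-of-work style: the data change and the side-effect share one transaction."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO orders (id, total) VALUES (%s, %s)",
            (order["id"], order["total"]),
        )
        # Stage the side-effect instead of calling the message broker here;
        # if the transaction rolls back, the message is never sent.
        cur.execute(
            "INSERT INTO outbox (topic, payload) VALUES (%s, %s)",
            ("order-created", json.dumps(order)),
        )
    conn.commit()

# A separate dispatcher polls the outbox table and publishes each row to the
# queue, deleting the row only after the broker acknowledges it.
```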
A bigger problem is the lack of awareness and tooling support around this database capability. Developers mostly design for the happy path and do not account for error scenarios, so error handling is left to be discovered as bugs or as errors shown to the end user.
Also, ORMs traditionally do not differentiate between transient errors (deadlocks, isolation violations, failovers, network interruptions) and permanent errors (like constraint violations). All errors bubble up as exceptions for the application to handle, and the application fails the entire transaction process.
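Until ORMs make this distinction out of the box, the application can classify the exceptions itself. A sketch using SQLAlchemy, which wraps driver errors in `DBAPIError` and exposes the underlying driver exception as `.orig`; the SQLSTATE class prefixes checked here (40 for transaction rollback, 08 for connection failures) are standard PostgreSQL codes:

```python
from sqlalchemy.exc import DBAPIError, IntegrityError

# SQLSTATE class 40 = transaction rollback, class 08 = connection failure.
TRANSIENT_SQLSTATE_CLASSES = {"40", "08"}

def is_transient(exc: DBAPIError) -> bool:
    """Decide whether an error the ORM surfaced is worth retrying."""
    if exc.connection_invalidated:  # SQLAlchemy flags dropped connections
        return True
    code = getattr(exc.orig, "pgcode", None)  # driver-level SQLSTATE, if any
    return code is not None and code[:2] in TRANSIENT_SQLSTATE_CLASSES

# Usage around a session:
#   try:
#       session.commit()
#   except IntegrityError:            # permanent: constraint violation
#       session.rollback()
#       raise
#   except DBAPIError as exc:
#       session.rollback()
#       if not is_transient(exc):
#           raise
#       # ...retry with exponential backoff...
```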