Things go down and need to come back up. Take a look at almost any major cloud incident in the last 15 years: some combination of goofs takes a bunch of services down, and then they take many minutes or hours to come back up. Sure, one can (and should) try to design for fewer failures that result in waiting for things to come up, but one should also design for faster bringup.

For the NVDIMM in question, the whole point is durability across a reboot or power loss. A fifteen-minute cycle in which it writes its contents out to nonvolatile storage and then reads them back, during which the firmware is too inept to notice that all the data is already loaded, is a bug that should have been fixed. (In fairness, this was a pre-production model.)

Even ignoring the reboot time, SMBus was needed to properly identify the device and to access its health-check data.

As for my database, it’s not intended for active-active HA setups; it’s intended for use on a small number of high-quality machines. It has had zero meaningful data loss incidents in its entire time in production, and it’s extremely fast. But it has some other properties I don’t like and that would cause me to do it differently if I were to start over.