We have a set of stringent fault-tolerance (FT) mechanisms at our company, since the failure of a single process might spell disaster for the whole system. Multiple instances of the same process run as primary and mirror, and there are multiple "Sets", each holding a copy of the whole system, so that if one set goes down another can take over. If a primary process goes down, an "FT_CHANGE" happens and the mirror becomes the primary.
I came across an instance where the mirror process was made to "Fail over" just as it became "Ready" (a process is "Ready" once it finishes initialization and sends a message to the controlling system). This resulted in some bizarre behaviour.
After looking at the logs, we concluded that the process had given the "Ready" signal before the threads started by some of its "pthread_create" calls were up and running. I check the return values of those calls, but a successful return does not guarantee that the new thread is running at that moment.
The lesson: wait until the child thread comes up and sends the main thread an acknowledgement before making any calls to that thread (as we do in most of our other processes).
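As a rough illustration of that acknowledgement step, here is a minimal sketch in C using a mutex and a condition variable. The names startup_sync, g_sync, worker_main and start_worker_and_wait are made up for this example and are not taken from our code base; the real processes may well do the handshake differently (for instance with a message queue).

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical handshake state shared by the main thread and one worker. */
struct startup_sync {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            worker_up;  /* set by the worker once it has initialized */
};

/* Static storage, so it outlives both threads' use of it. */
static struct startup_sync g_sync = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false
};

/* Worker entry point: initialize, then acknowledge to the creating thread. */
static void *worker_main(void *arg)
{
    struct startup_sync *sync = arg;

    /* ... thread-specific initialization would go here ... */

    pthread_mutex_lock(&sync->lock);
    sync->worker_up = true;            /* the acknowledgement */
    pthread_cond_signal(&sync->cond);
    pthread_mutex_unlock(&sync->lock);

    /* ... the worker's normal service loop would go here ... */
    return NULL;
}

/* Create the worker and block until it has acknowledged.
 * Returns 0 on success, or the error code from pthread_create. */
static int start_worker_and_wait(pthread_t *tid)
{
    int rc = pthread_create(tid, NULL, worker_main, &g_sync);
    if (rc != 0)
        return rc;                     /* the thread was never created */

    /* rc == 0 only says the thread exists, not that it has initialized. */
    pthread_mutex_lock(&g_sync.lock);
    while (!g_sync.worker_up)          /* loop guards against spurious wakeups */
        pthread_cond_wait(&g_sync.cond, &g_sync.lock);
    pthread_mutex_unlock(&g_sync.lock);
    return 0;
}
```

With something like this in place, the process would send its "Ready" message only after start_worker_and_wait has returned 0 for every thread it depends on. A production version would probably use pthread_cond_timedwait instead, so that a worker that never comes up cannot block startup forever.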