{C++ etc.}


Monday, April 06, 2009

POSIX Threads and FT Handling

We have a set of stringent fault-tolerance (FT) mechanisms at our company, since the failure of a single process might spell disaster for the whole system. There are multiple instances of the same process running as primary and mirror. And there are multiple "Sets", each holding a copy of the whole system, so if one set goes down, another can take over.
If a primary process goes down, an "FT_CHANGE" happens and the mirror becomes the primary.
I came across an instance where the mirror process was made to "Fail over" just as it became "Ready" (a process is "Ready" once it finishes initialization and sends a message to the controlling system). This resulted in some bizarre behaviour.
After looking at the logs, we concluded that the process had given the "Ready" signal before some of the threads started by its "pthread_create" calls were actually running. I check the return values of those calls in the code, but a successful return only means the thread was created; it does not guarantee that the thread is up and running at that moment.
The lesson: wait until the child thread comes up and sends the main thread an acknowledgement before making any calls to that thread (as we do in most of our other processes).
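
One way to get such an acknowledgement is a flag guarded by a mutex and a condition variable. What follows is only a minimal sketch of that idea, not the code from our system: StartupGate, worker_main and start_worker_and_wait are hypothetical names made up for illustration, and I am assuming a single worker thread.

```cpp
#include <pthread.h>

// Hypothetical handshake state shared by the main thread and one worker.
struct StartupGate {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            started;
};

// Worker entry point: finish initialisation, acknowledge, then carry on.
static void* worker_main(void* arg) {
    StartupGate* gate = static_cast<StartupGate*>(arg);

    // ... per-thread initialisation would go here ...

    pthread_mutex_lock(&gate->mutex);
    gate->started = true;                // the acknowledgement
    pthread_cond_signal(&gate->cond);    // wake the waiting main thread
    pthread_mutex_unlock(&gate->mutex);

    // ... the thread's normal work loop would go here ...
    return nullptr;
}

// Creates the worker and blocks until it has acknowledged startup.
// Returns 0 on success, or the error code from pthread_create.
int start_worker_and_wait(pthread_t* thread, StartupGate* gate) {
    pthread_mutex_init(&gate->mutex, nullptr);
    pthread_cond_init(&gate->cond, nullptr);
    gate->started = false;

    int rc = pthread_create(thread, nullptr, worker_main, gate);
    if (rc != 0) {
        // No thread was created, so nobody else can be using the gate.
        pthread_cond_destroy(&gate->cond);
        pthread_mutex_destroy(&gate->mutex);
        return rc;
    }

    // A zero return only means the thread exists; wait for its flag.
    // The loop guards against spurious wakeups.
    pthread_mutex_lock(&gate->mutex);
    while (!gate->started) {
        pthread_cond_wait(&gate->cond, &gate->mutex);
    }
    pthread_mutex_unlock(&gate->mutex);
    return 0;
}
```

With something like this, the process would send its "Ready" message only after start_worker_and_wait has returned 0 for every thread it depends on. The gate has to outlive the handshake, so the caller should destroy the mutex and condition variable only after joining the thread.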

1 comment:

Ogün Heper said...

Hi Gayan,

Where are you working at? :)

I mean, nowadays it is really hard to find a company that achieves software fault tolerance by running multiple concurrent copies of the same software. So what you do and implement is great, I think.

Executing several copies of the software concurrently brings up problems like state synchronization, heartbeat checking, etc.

But anyway, these topics are great, and I would like to work on them. It would be great if you could share the name of the company you work for. That would be a good starting point for me.

Regards.