Atomikos Forum - recovery and tm_unique

recovery and tm_unique_name

I'm wondering how atomikos handles the situation where tm_unique_name changes between jvm starts while there are "stuck" transactions in the txn log. For example, suppose a txn completes the prepare phase during a run of the jvm where tm_unique_name was set to "foo", and the jvm crashes before the commit phase has completed on all resources. I then reconfigure atomikos to use a tm_unique_name of "bar" and start the jvm again. Will atomikos recover that transaction (i.e., finish the commit phase)?

I've hacked the atomikos source to force just such a failure while changing tm_unique_name between jvm starts. It appears that atomikos did complete the commit of my transaction (the atomikos logs and the database both confirm this). However, atomikos seems to complain about this transaction at startup:

20:10:42.450 DEBUG atomikos: recovery initiated for resource XADBMS with branchIdentifier timp-1084e8eea0194d99b6c04fa4e5ce821f
20:10:42.471 INFO atomikos: Resource XADBMS inspecting XID: timp-d6ca62a3b2c84862b5929b8df3b5f09d0000300001timp-d6ca62a3b2c84862b5929b8df3b5f09d2
20:10:42.471 INFO atomikos: Resource XADBMS: XID timp-d6ca62a3b2c84862b5929b8df3b5f09d0000300001timp-d6ca62a3b2c84862b5929b8df3b5f09d2 with branch timp-d6ca62a3b2c84862b5929b8df3b5f09d2 is not under my responsibility
20:10:42.472 DEBUG atomikos: XAResourceTransaction timp-d6ca62a3b2c84862b5929b8df3b5f09d0000300001timp-d6ca62a3b2c84862b5929b8df3b5f09d2: about to switch to XAResource org.postgresql.xa.PGXAConnection@1c0bee6
20:10:42.472 DEBUG atomikos: XAResourceTransaction timp-d6ca62a3b2c84862b5929b8df3b5f09d0000300001timp-d6ca62a3b2c84862b5929b8df3b5f09d2: switched to XAResource org.postgresql.xa.PGXAConnection@1c0bee6

The message "is not under my responsibility" seems to be atomikos saying that, because tm_unique_name has changed, it doesn't think it needs to recover this transaction, yet it recovers it anyway.

Browsing the XATransactionalResource.recover() code shows that xids that don't have tm_unique_name as a prefix are handled differently: they aren't added to the recoveryMap_ member variable of this class. I haven't looked at the code long enough to understand what that means.
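A minimal sketch of the kind of prefix check described above, for illustration only. It is not the Atomikos source: the class and method names are hypothetical, and it assumes (based on the log lines above) that the branch qualifier is the tm_unique_name followed by a counter.

```java
import java.nio.charset.StandardCharsets;

public class BranchOwnershipSketch {

    // Hypothetical helper, not Atomikos code: treats a branch as "ours"
    // only if its qualifier starts with the current tm_unique_name.
    public static boolean isUnderMyResponsibility(byte[] branchQualifier, String tmUniqueName) {
        String branch = new String(branchQualifier, StandardCharsets.UTF_8);
        return branch.startsWith(tmUniqueName);
    }

    public static void main(String[] args) {
        // Names taken from the log excerpt above.
        String oldName = "timp-d6ca62a3b2c84862b5929b8df3b5f09d";
        String newName = "timp-1084e8eea0194d99b6c04fa4e5ce821f";

        // Branch created before the restart, under the old name.
        byte[] branch = (oldName + "2").getBytes(StandardCharsets.UTF_8);

        System.out.println(isUnderMyResponsibility(branch, oldName)); // true
        System.out.println(isUnderMyResponsibility(branch, newName)); // false
    }
}
```

Under this assumption, a branch created before the rename fails the check after the restart, which would be consistent with the "is not under my responsibility" message.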

Anyway, my real question is: is it safe, in terms of atomikos recovery, to change tm_unique_name between jvm starts? The motivation for the question is that we deploy multiple instances of the same java process onto 6 different servers. I would prefer not to have to assign a unique tm_unique_name to each of these processes in a config file. It's just something that can go wrong (and I don't currently have any other need for per-process custom configuration data). So, I've been playing with autogenerating tm_unique_name to include a uuid that is generated each time the jvm starts. I figured that atomikos would just track xids in its log file and recover whatever xids are in there when it restarts. However, the "not under my responsibility" log message and the different treatment in XATransactionalResource.recover() make me think that this may not work in all cases.

Jon Oler

Thursday, January 15, 2009

Hi,

If you switch the tm_unique_name then the transaction manager will only recognize those branches that it finds in its logs.

This is imperfect recovery, which can lead to anomalies with pending in-doubt branches in the database.

We can dig deeper into this to find what solution best matches your requirements, but then I suggest that you consider purchasing our developer support.

Best
Guy

Guy Pardon

Thursday, January 15, 2009

Thanks for the response, Guy. I'm an XA newbie, so please forgive me if this is a dumb question. I see that the bitronix and Geronimo transaction managers also both require configuring a globally unique transaction manager id. I'm happy to conform to the rules--it's really not that hard to do, but I would just like to understand why this is a requirement.

If all transaction ids initiated by the transaction manager are written to the transaction log, why isn't it good enough to just ensure that all transaction ids generated by the transaction manager are globally unique? Why the additional requirement that not only are xids globally unique, but the transaction manager identity also persists and remains globally unique? I assume that it must be safe to change transaction manager identity when there are no pending in-doubt transactions, so the requirement has to do with managing and recovering pending in-doubt transactions. But if the xids of any pending transactions are guaranteed to be in the transaction log, why does it matter if the transaction manager identity changes across jvm starts? In the one case that I've hacked atomikos to simulate (failure after completion of the prepare phase), atomikos does, in fact, correctly recover my pending in-doubt transaction even though the tm_unique_name changed when I restarted the jvm (atomikos spits out a log message saying it's not its responsibility to handle that transaction, but it correctly recovers it anyway).

Again, sorry if this is a dumb question, but my recent work in the XA world is still experimental at this point, and I need to understand how these transaction managers work if I'm going to continue to pursue this path. The only reason I can think of for why maintaining transaction manager identity is important is this: if the transaction manager machine fails and the transaction log is lost, a new transaction manager could be started using the same tm_unique_name, and the transactions could be recovered as long as that new transaction manager jvm connects to all the same transactional resources that participated in the transactions whose information was lost when the machine crashed. This is pure speculation on my part though. Am I on the right track? Why does the transaction manager identity need to remain constant to avoid "anomalies with pending in-doubt branches in the database"?

Thanks again for the help,

Jon

Jon Oler

Saturday, January 17, 2009

So, I've done some more reading on XA, and I think I know the answer to my question. I'm not sure about atomikos' implementation, but apparently a transaction manager doesn't necessarily log information about transactional resources that have voted yes to prepare (or have been sent the prepare message, for that matter). A transaction manager needs to record its decision to commit or rollback in the log, but there may be resources that have prepared (or been told to prepare) before this decision has been written to the log. If a transaction manager fails and is subsequently restarted, it needs to sort through all the in-doubt transactions at each transactional resource it connects to, to see if there are any that it is responsible for. As I just said, there may or may not be an entry in the transaction manager's log for these transactions, so the only way the transaction manager can identify the transactions that aren't in the log is by using a persistent transaction manager id.

So, if a transaction manager fails, both the transaction log *and* the unique transaction manager id are required to reliably recover all in-doubt transactions that it is managing.
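The reasoning above can be sketched as a toy model. This is a simplification under stated assumptions, not how atomikos or any other transaction manager is actually implemented: xids are plain strings prefixed with the transaction manager name, the log holds only commit decisions, and an owned xid with no log entry is rolled back (a presumed-abort policy, which is an assumption here). All names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RecoverySketch {

    enum Action { COMMIT, ROLLBACK, IGNORE }

    // Decide what to do with each in-doubt xid reported by a resource.
    static Map<String, Action> plan(List<String> inDoubtXids,
                                    Set<String> committedInLog,
                                    String tmUniqueName) {
        Map<String, Action> result = new LinkedHashMap<>();
        for (String xid : inDoubtXids) {
            if (committedInLog.contains(xid)) {
                // The log recorded a commit decision: finish the commit.
                result.put(xid, Action.COMMIT);
            } else if (xid.startsWith(tmUniqueName)) {
                // Not in the log, but created by this TM: safe to roll back
                // (assumed presumed-abort policy).
                result.put(xid, Action.ROLLBACK);
            } else {
                // Not in the log and not recognizably ours: it may belong to
                // another TM, so it is left pending in the resource.
                result.put(xid, Action.IGNORE);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "foo-1" was prepared and its commit decision logged; "foo-2" was
        // prepared, but the jvm crashed before any decision reached the log.
        List<String> inDoubt = Arrays.asList("foo-1", "foo-2");
        Set<String> log = Collections.singleton("foo-1");

        System.out.println(plan(inDoubt, log, "foo")); // {foo-1=COMMIT, foo-2=ROLLBACK}
        System.out.println(plan(inDoubt, log, "bar")); // {foo-1=COMMIT, foo-2=IGNORE}
    }
}
```

In this model, restarting under the name "bar" still completes the logged transaction, which matches the experiment earlier in the thread, but the unlogged branch stays in doubt in the resource. That would be one way to end up with the kind of pending in-doubt branches mentioned in the reply above.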

Hopefully, I have this mostly right. It is surprisingly difficult to find good information on the XA protocol without resorting to the specification itself. Anyway, please correct me if I'm mistaken in my understanding, and sorry again for the newbie questions.

Jon Oler

Saturday, January 17, 2009
