A few days ago one of my clients at Blackboard was having some Snapshot problems. Since the old days, Snapshot has been a very sturdy tool which most of our clients here use to automate their SIS.
Recently we introduce an improved version of the Snapshot tool using it via the UI but I'm not going to go into very much detail about that tool, since that is not this troubleshooting excercise.
Usually when you are troubleshooting something, you can definetely start by looking at everything but everything will not tell you what you need so, lets disect this problem the best that I can.
- Test if there are any more snapshot processes running
$ ps -ef | grep snapshot
- If you see that there is more than one process, you kill it. End of the Story.
- Then lets try to restart the services, just to kill everything in the system
- After it completes, find out your snapshot feed file and make a smaller, only with one record, to see if it actually processes. This can be done doing the following
- Getting the first two lines (making one record only feed file)
head --lines=2 bigfeedfile.txt > /tmp/smallfeedfile.txt
Lets run the feed file with this small feed file (in my case the problematic seemed to be an enrollment, so we will do that)
$BBHOME/apps/snapshot/bin/snapshot_override.sh "-Ddata.source.key=SYSTEM" -f ENR_MANUAL -C $BBHOME/apps/snapshot/data/snapshot.properties -t /tmp/smallfeedfile.txt
If this file doesn't process in less than 1 minute, you can kill the process.
If it ran correctly, then there is no need to continue, but in my case it actually was failing over an over again, so we look at the Stack trace to see what was going on.
snapshot processand ran the following:
Which in many of the things provided us the following:
Thread 22246: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Interpreted frame) - org.apache.activemq.transport.failover.FailoverTransport.oneway(java.lang.Object) @bci=361, line=600 (Interpreted frame) - org.apache.activemq.transport.TransportFilter.oneway(java.lang.Object) @bci=5, line=85 (Interpreted frame) - org.apache.activemq.transport.MutexTransport.oneway(java.lang.Object) @bci=12, line=68 (Interpreted frame) - org.apache.activemq.transport.ResponseCorrelator.asyncRequest(java.lang.Object, org.apache.activemq.transport.ResponseCallback) @bci=123, line=81 (Interpreted frame) - org.apache.activemq.transport.ResponseCorrelator.request(java.lang.Object) @bci=3, line=86 (Interpreted frame) - org.apache.activemq.ActiveMQConnection.syncSendPacket(org.apache.activemq.command.Command) @bci=20, line=1366 (Interpreted frame) - org.apache.activemq.ActiveMQConnection.ensureConnectionInfoSent() @bci=75, line=1481 (Interpreted frame) - org.apache.activemq.ActiveMQConnection.start() @bci=5, line=516 (Interpreted frame) - blackboard.platform.messagequeue.impl.activemq.ActiveMQConnectionPool$ActiveMQConnectionObjectFactory.activateObject(java.lang.Object) @bci=6, line=160 (Interpreted frame) - org.apache.commons.pool.impl.StackObjectPool.borrowObject() @bci=94, line=158 (Interpreted frame) - blackboard.platform.messagequeue.impl.activemq.ActiveMQConnectionPool.get() @bci=4, line=106 (Interpreted frame) - blackboard.platform.messagequeue.impl.activemq.ActiveMQMessageQueue$1.run() @bci=13, line=77 (Interpreted frame) - java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction) @bci=0 (Interpreted frame) - blackboard.platform.messagequeue.impl.activemq.ActiveMQMessageQueue.sendMessage(blackboard.platform.messagequeue.MessageQueueMessage) @bci=9, line=64 (Interpreted frame) - blackboard.platform.batch.enroll.PostBatchEnrollMessageHandler.submit(java.util.Set, boolean) @bci=32, line=55 (Interpreted frame) - blackboard.admin.snapshot.persist.impl.SnapshotDbPersister.sendPostEnrollmentEvents(java.util.Set) @bci=2, line=288 (Interpreted frame) - blackboard.admin.snapshot.persist.impl.SnapshotDbPersister.sendPostEnrollmentEvents(java.util.List) @bci=7, line=272 (Interpreted frame) - blackboard.admin.persist.course.impl.EnrollmentDbPersister.save(java.util.List, java.lang.String) @bci=24, line=144 (Interpreted frame) - blackboard.admin.snapshot.authority.Authority.delegatePersistence(java.util.List) @bci=70, line=413 (Interpreted frame) - blackboard.admin.snapshot.authority.Authority.commit() @bci=54, line=742 (Interpreted frame) - blackboard.admin.snapshot.serialize.Parser.execute() @bci=357, line=199 (Interpreted frame) - blackboard.admin.snapshot.authority.Authority.execute() @bci=90, line=268 (Interpreted frame) - blackboard.apps.snapshot.SnapshotApplication.execute() @bci=229, line=257 (Interpreted frame) - blackboard.platform.ApplicationLauncher.main(java.lang.String) @bci=600, line=175 (Interpreted frame)
this actually tells us in the first few lines that we are blocked and ActiveMQ is trying to communicate but simply can't. So lets start with the second part.
Since we now know that the ActiveMQ is kind of working but having issues communicating we need to know where and what to look at. Since there isn't much of a manual, lets try to cover all the basis.
Database verification that is running. 1.1. We run the following query
select pk1,node_id,last_seen,(sysdate - last_seen) * 24 * 60 * 60 as seconds_old from peer_service where service_id='activemq_broker';
1.2. This should return only one as activemq_broker with seconds few seconds and always going small 1.3. Second Query verification:
select node_id,service_id,payload from peer_service where inactive_ind='N';
This query should return something like:
NODE_ID SERVICE_ID PAYLOAD ---------------------------------------- -------------------- -------------------------------------------------- app04name.mhint:6903 ehcache_rmi //app04name.mhint:38528 app03name.mhint:29455 activemq_broker nio://app03name.mhint:61616 app03name.mhint:29455 ehcache_rmi //app03name.mhint:60095 app01name.mhint:24774 ehcache_rmi //app01name.mhint:26113 app02name.mhint:27764 ehcache_rmi //app02name.mhint:37946
1.4. As you can see there is only one activemq_broker in the service id column.
With this we have configured that the Database has everything configured correctly.
2. Now lets check the Application Server
2.1. On the application server that is the activemq broker go in it and check if the port is correct by doing:
netstat -lnp | grep 61616 This should return correct for only that application server.
2.2. From any other application server try to connect via telnet to that application server doing:
telnet apcprd-100367-57480-app15 61616 telnet apcprd-100367-57480-app15.mhint 61616
if You can, then there is correct communication, if not, then there is a bad routing problem. IMPORTANT: Aha! moment
2.3. This moment, you just realize that there is a problem communicating that it can be:
Since we are in MH we know that there is no firewall between application servers, that the connectivity is there due to the logs that shows it, so checking the Host files, actually was the answer here.
In my particular use case, I found that the problem was in the Host file.
# # Cluster IP addresses for ActiveMQ - <YOUR NAME HERE> <DATE> # XXXX.XXXX.XXXX.XXXX app01 app01 app01name.blackboard.com XXXX.XXXX.XXXX.XXXX app02 app02 app02name app02name.blackboard.com XXXX.XXXX.XXXX.XXXX app03 app03 app03name app03name.blackboard.com
In my case I actually saw that there were inconsistencies in the IP's and in the name and machine names, so I actually edit it to look to the following:
XXXX.XXXX.XXXX.XXXX app01 app01name.mhint app01name.blackboard.com XXXX.XXXX.XXXX.XXXX app02 app02name.mhint app02name.blackboard.com XXXX.XXXX.XXXX.XXXX app03 app03name.mhint app03name.blackboard.com XXXX.XXXX.XXXX.XXXX app04 app04name.mhint app04name.blackboard.com XXXX.XXXX.XXXX.XXXX app05 app05name.mhint app05name.blackboard.com XXXX.XXXX.XXXX.XXXX app06 app06name.mhint app06name.blackboard.com
I have removed the Ips and the machine name for security purposes but at the end, there were more application servers, different IPs and added the internal machine name that appeared in the Database. After doing this, I re ran the Snapshot and it actually completed in a few seconds.