A while ago I had some trouble with a WebLogic 10.0 MP1 installation at my workplace. I decided to post this here so that one day it might help someone else facing the same problem.
At my workplace the operations guys provide a script to start a WebLogic instance, which we have to use to get the instance configured properly. One day I had to restart our test environment cluster due to an out of memory error. But for some reason the start script just hung without returning, and the process list did not show a spawned java process. No errors, no log entries, nothing.
The start script was the same one used on the integration and production environments (hard link), which were still starting properly, so the error could not have been there. To get some more information I started reading the script and added some debugging statements to a local copy of it. After a while I found the reason why the script hung. The java process simply exited right after being launched, with a smooth exit code of 0. Meanwhile the script was reading the log file in a loop, waiting for either a success or an error message. Since neither was ever printed (no logs, see above), the readline call just blocked on I/O and would have sat there until the end of days.
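To make the hang easier to picture, here is a minimal Java sketch of a wait loop that cannot get stuck that way. It is not the operations script (which I cannot reproduce here); the class name `StartupLogWatcher`, the marker strings passed in, and the polling interval are all assumptions made up for illustration. The idea is simply to never block on the log alone, but to also check whether the process is still alive and to give up after a timeout.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of WebLogic or of the original start script.
final class StartupLogWatcher {
    enum Result { STARTED, FAILED, PROCESS_EXITED, TIMED_OUT }

    static Result await(BufferedReader log, BooleanSupplier processAlive,
                        long timeoutMillis, String successMarker, String errorMarker) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        try {
            while (System.currentTimeMillis() < deadline) {
                // Only read while data is available, so we never block on an empty log.
                while (log.ready()) {
                    String line = log.readLine();
                    if (line == null) break; // end of input reached
                    if (line.contains(successMarker)) return Result.STARTED;
                    if (line.contains(errorMarker)) return Result.FAILED;
                }
                // The case the original script missed: the process is gone
                // and nothing will ever be written to the log.
                if (!processAlive.getAsBoolean()) return Result.PROCESS_EXITED;
                Thread.sleep(100); // assumed polling interval
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // give up, reported as a timeout
        }
        return Result.TIMED_OUT;
    }
}
```

With a real `Process` object the liveness check could be passed as `process::isAlive`. A marker that is split across two reads would be missed by this sketch, so treat it as an illustration of the idea rather than a hardened implementation.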
So now the question was: why did the WebLogic server just shut down with a success return code and without any kind of error or log message? Adding all kinds of debug flags to the start command did not change a thing. It looked like a dead end, even more so after searching the web for quite some time did not turn up a single hit about this problem.
At this point I was pretty stumped and only saw one last chance to dig deeper into this: messing with the java code itself. I decompiled the WebLogic main class weblogic.Server and had a look at the code. At first I wanted to extend the class, override all methods to add debug statements and then delegate to the super class, but to my great joy the class is final. Using a delegate pattern was of no use either, because as soon as I delegated the first call to the original class, control would never have come back to me. So I decided to add plenty of printlns to the decompiled code, compiled it again and then used the new class to start the server.
Pretty soon I found the root cause of my problems: A bloody NullPointerException!!! Thrown by the classpath initialization routine. Now one would ask: "Why is an NPE not logged? Why did it not set an error return code??" Weeeeeellll.. because the return value in that main routine is initially set to 0 and the whole block catches all Exceptions but only prints a statement if it was an AccessControlException. So for all other exceptions the routine just returns quietly and the return value is never changed. HOORAAYYY!!! Thank you BEA/Oracle.. ONCE MORE!! You nimrods!! Another stupid unnecessary bug caused by a crap coding style every rookie java developer would be put against the wall for.
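For anyone who has not seen this anti-pattern in the wild, here is a small sketch of what the description above boils down to. This is not the decompiled WebLogic code; the class and method names are made up, and the structure is only my reconstruction of the behaviour: exit code preset to 0, catch everything, report only one exception type. The second method shows what I would have expected instead.

```java
import java.security.AccessControlException;

// Hypothetical reconstruction of the pattern, NOT the actual WebLogic source.
final class SwallowedExceptionDemo {

    /** Mimics the behaviour described above: only one exception type is reported. */
    static int startSwallowing(Runnable init) {
        int exitCode = 0; // preset to "success"
        try {
            init.run();
        } catch (Exception e) {
            // AccessControlException is deprecated in recent JDKs but still compiles.
            if (e instanceof AccessControlException) {
                System.err.println("Access denied: " + e.getMessage());
            }
            // Every other exception vanishes here and exitCode stays 0.
        }
        return exitCode;
    }

    /** What one would expect: report every failure and return a non-zero code. */
    static int startReporting(Runnable init) {
        try {
            init.run();
            return 0;
        } catch (Exception e) {
            e.printStackTrace();
            return 1;
        }
    }
}
```

Passing an initializer that throws a NullPointerException to `startSwallowing` returns 0 without printing anything, which is exactly the silent "successful" exit that kept the start script waiting.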
But this was not the end of it. I still had to figure out what caused that NPE and, more importantly, how to fix it. The stack trace showed that the exception occurred while the domain's lib folder was being scanned. Jar files placed in this folder are added to the server's classpath, which is rather convenient. When I looked into the directory I saw the only jar we had added there, but also a directory containing the extracted contents of that jar file. After deleting the directory the server started up again.. UFFF!!
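If you want to check for this before a restart, something like the following sketch would do. It only lists subdirectories inside a given lib folder; the class name `DomainLibCheck` is made up, the path to your domain's lib folder is something you have to supply yourself, and I am assuming that any subdirectory in there is suspicious because that is what bit us. I have not verified what else the classpath scan chokes on.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical pre-start check, not something WebLogic ships.
final class DomainLibCheck {

    /** Returns every subdirectory found directly inside the given lib folder. */
    static List<Path> findDirectories(Path libDir) {
        List<Path> directories = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(libDir)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    directories.add(entry); // e.g. an extracted jar left behind
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(directories); // stable order for printing
        return directories;
    }
}
```

An empty result means the folder contains no subdirectories; anything it returns is worth a look before starting the server.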
The directory had been created by one of my colleagues, who needed to know the exact version of the jar file and so had to look it up in the manifest. For some reason he was not able to delete it afterwards, so he left it there to check back later. Can't really blame him for that, I would probably have done the same. No one in our team would have suspected that this would cause such problems.
I spent the better part of a day fixing this problem, time that is now lost and that I have to make up for to keep our project on schedule. And so my hatred towards WebLogic rises yet again...