Wednesday, April 27, 2011

Application Express 4.0 and Library Cache Latch Contention in Oracle DB 11.2.0.2

In January of 2011, Oracle Support approached me about a Service Request they received from a customer. This customer had an APEX 4.0.2 application in production with more than 1,000 users. After they upgraded their Oracle database to version 11.2.0.2, the system ran fine for 2 days and then eventually "locked up", as they put it. They identified the problem as being related to library cache locks, and the offending statement they reported was:

begin wwv_flow_log.g_content_length := sys.htp.getcontentlength; end;

I reviewed the problem myself, reviewed it with a couple of other folks on the APEX development team, and we were mystified how this could possibly be related to library cache locks.

Fast forward to today, and I was contacted by a completely different customer who was running a benchmark with 100 simulated concurrent users (concurrent as in truly concurrent with no think time). Their conclusion after the benchmark was that there was library cache latch contention due to Application Express, and lo and behold, the statement they identified as being the source of library cache latch contention was:

begin wwv_flow_log.g_content_length := sys.htp.getcontentlength; end;

This was beyond coincidence that two different customers, in highly concurrent use of an APEX application, identify latch locks and/or latch contention issues with the very same line of the APEX source code.

The PL/SQL block in question was in the activity logging PL/SQL package of APEX. This is the package that writes to the activity log of Application Express, which tracks information like the application, page, user, elapsed time, etc. One of the changes in Application Express 4.0 was a new attribute in the activity log called content length. It was populated via a block like:


begin
execute immediate ('begin wwv_flow_log.g_content_length := sys.htp.getcontentlength; end;');
exception when others then
g_content_length := 0;
end;

I'm sure some will ask why this was done dynamically. Simply stated, because it could not be statically compiled. Most versions of the PL/SQL Web Toolkit will not contain the function getcontentlength() in the package specification. So for those installations that do contain this function, the dynamic execution will succeed. For those that don't (the vast majority), the parse of the PL/SQL block will fail, an exception will be raised, and the exception block will catch it and set a variable to 0. But why would something like this cause library cache latch contention?

I really had no idea. I found no bugs on this issue. apex.oracle.com was on APEX 4.0.2 on 11.2.0.2 and this was never an issue. So I diagnosed this problem the only remaining way I knew how...I asked Tom.

I knew in advance that Tom was going to rail about the "when others", but I told him I wasn't the author of this code. He also pointed out that doing this dynamically would not result in optimal performance. Sounds good...but that's not the source of the latch contention. In a nutshell, Tom suspected that there is a change in the way the parse happens in 11.2, and a bad parse (i.e., failed parse) is more expensive in 11.2 than earlier database versions. Is it a database bug? Maybe - but it also could simply be changed behavior.

Rather than wait for this problem to be diagnosed as potentially a database bug and maybe some database patch set may contain a fix, I decided to work around this. This particular piece of code, for all practical purposes, has been commented out - no dynamic execution, no failed parse and exception processing, for each and every invocation of the activity logging procedures.

I believe this issue is quite important. For our customers on large, highly concurrent production sites on APEX 4.0, when they decide to upgrade to Database version 11.2.0.2, they may experience degradation in performance for the same application with the identical load. There is now a patch set exception (a "one-off patch") for this issue for APEX 4.0.2.00.07 on My Oracle Support. Look for patch 12404581. The actual patch itself is not database version specific. Hopefully this blog post (and patch) will save some of our customers from future headaches.

4 comments:

Gary Myers said...

Wouldn't switching to DBMS_SQL be an option. Firstly, you can trap whether the parse succeeds/fails. You can set a package global (or global CONTEXT) is the parse fails so that you never retry it. But you could also hold the cursor so that future invocations just do the execute rather than re-parse it.
Alternatively you could do an explicit check on whether getcontentlength exists through the data dictionary.
Also, is there anything in the PL/SQL Web toolkit that could be used as a basis for conditional compilation. Now that Apex is 10g+ only, this should be an option.

Byte64 said...

Hi Joel,
I assume that this problem might affect XE 11g and since XE 11g customers do not have access to Metalink, they can't download the patch, so I wonder if it worth mentioning the problem somewhere in the release notes.

Cheers,
Flavio

Joel R. Kallman said...

@Gary - let's just suffice it to say that the inclusion of this logic wasn't as well thought out as it should have been. More care should have been given to something which is going to be executed for each and every user for each and every page view. That's why something like querying the data dictionary dynamically at runtime would be ill-advised.

@Flavio - APEX in XE *Production* will have this bug fix in it. I'm one step ahead of you. ;)

Byte64 said...

I'd be very proud of myself if I were just one step behind...