For the latter part of 2009, the performance on
apex.oracle.com was lacking at times, to put it kindly. As evidenced by
this long-running thread on the
APEX discussion forum on OTN, performance issues started cropping up in August and things only got worse as the year went on.
Ben Burrell, a very frequent contributor on the APEX discussion forum said it best with:
"The dedication and patience shown by the contributors to this forum is nothing short of incredible. Oracle should be magnificently proud of the APEX community - and reward it with a decent service."I wholeheartedly agree! The least Oracle can do for the army of APEX experts who graciously give of their time to help others with Application Express is provide them a usable service. Additionally, for those users of Application Express who sign up for a test account on apex.oracle.com, a first impression is
everything. Even if the cause of the issue is not "an APEX problem", per se, it will always have the perception that APEX is slow and cannot scale.
This has been a learning experience for me, and I'm simply here to report a number of the issues uncovered and the steps we took to remedy them.
1)
KeepAlive - In November 2009, I was monitoring the database listener log files. There were a very large number of incoming connection requests to the database listener - in some instances, on the order of 30 connection requests per second. One of the most expensive actions you can take is to establish a new database session, hence, why session pooling is so commonly used. As it turns out, an administrator had turned off the KeepAlive setting in Oracle HTTP Server. Every new APEX page view request had to establish a new database session. For a site that gets millions of page views a day, this is not good. I had them turn KeepAlive back to On and with a KeepAliveTimeout of 6 seconds. As busy as apex.oracle.com is today, we now only average about one new database session request every 10 to 15 seconds.
2)
robots.txt - The
robots.txt file is supposed to be used by Web crawlers and search engines, although not all of them seem to observe it. This file enables a Web site administrator to control when Web crawlers are allowed to crawl a site and how frequently. The "window" of time that we permitted Web crawlers was too great - that's the last thing we want is to have some new user of Application Express competing for the same resources with some Web crawler which is hurling requests at our site. So I reduced the open window for Web crawlers:
http://apex.oracle.com/robots.txt.
3)
Number of Database Access Descriptors - Since the days of the original 'marvel.oracle.com', the Database Access Descriptor used for the site was named 'otn'. Earlier this year, I wanted to have apex.oracle.com support the documented and standard Database Access Descriptor named 'apex', so we added that one to our configuration. Then, with the migration of AskTom to the apex.oracle.com instance, we created a third Database Access Descriptor named 'asktom' (otherwise, we would have broken any bookmarks to Tom's questions and answers).
The Web server software used by Oracle Global IT on this instance is Oracle Application Server 10.1.2. Unfortunately, Oracle HTTP Server and mod_plsql do not support true connection pooling in this version (see the
documentation for the explanation). Each HTTP Server process would establish and maintain a database connection per DAD - that's three database connections times the number of HTTP Server processes. There are two Web servers that front-end apex.oracle.com - so 2 Web servers * 3 connections per Oracle HTTP Server process * 200 - 250 HTTP Server processes - and we easily had more than 1,000 database sessions during peak times. Even if only a handful of these sessions were active at any given time, the memory consumed by all of these sessions was too large. So large, that the system log was recording messages like "Out of Memory: Killed process 31088 (oracle)" every few minutes. Not good.
With the help of
Kris Rice, we were able to remove the 'asktom' Database Access Descriptor and replace it with the following rewrite directives in httpd.conf:
RewriteEngine On RewriteCond %{THE_REQUEST} ^.*/pls/asktom.* [NC] RewriteRule ^.*/pls/asktom(.*)$ /pls/apex$1 [NC,QSA,PT]With this one change, the database server wasn't swapping memory and the OS was not randomly killing processes because it was out of memory. The next day,
Tom even commented how snappy it was.
4)
File System Caching - The system administrators reviewed the network performance and I/O systems. They changed the "noac" and "actimeo" parameters so that the system would cache file system parameters, thus reducing the I/O on the system. Remember that this was a database with over 15,000 tablespaces and datafiles - that's a lot of potential I/O savings. Don't ask me for any more detail on this one, though - it's beyond my domain of expertise.
5)
Purge of Stale Workspaces - There is no built-in workspace purging process in Application Express. The way the process has worked over the past 3 or 4 year is I personally would run a hand-crafted collection of custom PL/SQL programs which identified the workspaces that have not been touched in the last 45 days, notify the administrators of these workspaces via e-mail that their workspace was going to be purged, let them retain the workspace if they wished, and then after a two-week period - go through and purge the ones which hadn't been retained.
The problem with this solution is that apex.oracle.com would go through these peaks and valleys - dropping to 7,000 workspaces and then over time, creep up to 17,000 workspaces. I would manually run the purge process a year later, bring it back down to 8,000 workspaces, and then the process starts all over again. The system would be fastest after the purge and then slowly degrade over time.
I rewrote the purge process from scratch, but this time, I designed it to run perpetually (I have to give credit to my manager, Mike Hichwa, for this idea). The parameters are roughly the same - every night, it wakes up, computes who hasn't used their workspace in the past N days and isn't already on "the list", sends the administrators an e-mail, gives them the chance to retain their workspace, sends a reminder 3 days in advance of the purge, and and then after a total of 10 days, purges the workspace. Purging the workspace involves dropping the tablespace & data file, dropping the database user corresponding to the workspace schema, and deleting the workspace and all associated meta data. The great thing is this runs every day - I don't have to kick it off and monitor it, I just get daily e-mails about the status and statistics.
Since the first workspace was purged by this process on January 18, 2010, this perpetual purge process has been used to remove 5,998 workspaces and 5,860 tablespaces, reclaiming 64,531,415,040 bytes from the file system. And now apex.oracle.com is on auto-pilot.
My manager, Mike Hichwa, also suggested I simply fold this feature into Application Express 4.0 - so Application Express can become a self-maintaining system. I'm not convinced this feature would be used by that many customers outside of apex.oracle.com. I tend to believe DBA's and our customers would rarely want to automatically drop data.
6)
Memory - This past Saturday, February 6, 2010, the memory on apex.oracle.com was upgraded from 16 GB to 32 GB. Since then, the machine is really very, very fast. I've monitored the system load and it is consistently running at about 50% of maximum capacity, even during peak times.
Going forward, I am going to request that we upgrade the database to version 11.2.0.1. Additionally, I am going to request an upgrade to Oracle HTTP Server which supports
true connection pooling and which should dramatically reduce the number of overall database sessions to the database server, itself. If the upgrade to Oracle HTTP Server can't be done in a reasonable time frame, then I'm going to try to setup server-side pooling via
Database Resident Connection Pooling. Even though I always advise against multiplexing multiplexed connections, I can't really describe the connection pooling on our version of Oracle HTTP Server today as "multiplexed."
If I were in charge of this site, I would have done these upgrades yesterday. But probably like the company you're at, our IT organization that actually runs this site has their software levels and it takes time to uptake new versions and roll it out. I'm not complaining, as I understand and appreciate this when you have to manage hundreds of databases and Web applications, as they do.
Contrary to some assertions, the performance of apex.oracle.com was of great concern to us and to me personally. However, you cannot fix a problem until you have diagnosed the actual cause of the issue. And unfortunately, that takes time. Additionally, some of these steps took a great deal of time to actually implement and test before rolling out on apex.oracle.com. As much as I would have liked to flip the "fast=true" bit, that wasn't reality. This has been a learning process and one which will ultimately benefit the users of apex.oracle.com going forward as well as the product, Oracle Application Express. I am grateful to those customers and members of the APEX community for their patience during this time. To paraphrase
Frank Costanza, "We're back, baby!"