Monday, February 08, 2010

Making apex.oracle.com fast (again)

For the latter part of 2009, the performance on apex.oracle.com was lacking at times, to put it kindly. As evidenced by this long-running thread on the APEX discussion forum on OTN, performance issues started cropping up in August and things only got worse as the year went on. Ben Burrell, a very frequent contributor on the APEX discussion forum said it best with:


"The dedication and patience shown by the contributors to this forum is nothing short of incredible. Oracle should be magnificently proud of the APEX community - and reward it with a decent service."


I wholeheartedly agree! The least Oracle can do for the army of APEX experts who graciously give of their time to help others with Application Express is provide them a usable service. Additionally, for those users of Application Express who sign up for a test account on apex.oracle.com, a first impression is everything. Even if the cause of the issue is not "an APEX problem", per se, it will always have the perception that APEX is slow and cannot scale.

This has been a learning experience for me, and I'm simply here to report a number of the issues uncovered and the steps we took to remedy them.




1) KeepAlive - In November 2009, I was monitoring the database listener log files. There were a very large number of incoming connection requests to the database listener - in some instances, on the order of 30 connection requests per second. One of the most expensive actions you can take is to establish a new database session, hence, why session pooling is so commonly used. As it turns out, an administrator had turned off the KeepAlive setting in Oracle HTTP Server. Every new APEX page view request had to establish a new database session. For a site that gets millions of page views a day, this is not good. I had them turn KeepAlive back to On and with a KeepAliveTimeout of 6 seconds. As busy as apex.oracle.com is today, we now only average about one new database session request every 10 to 15 seconds.




2) robots.txt - The robots.txt file is supposed to be used by Web crawlers and search engines, although not all of them seem to observe it. This file enables a Web site administrator to control when Web crawlers are allowed to crawl a site and how frequently. The "window" of time that we permitted Web crawlers was too great - that's the last thing we want is to have some new user of Application Express competing for the same resources with some Web crawler which is hurling requests at our site. So I reduced the open window for Web crawlers: http://apex.oracle.com/robots.txt.




3) Number of Database Access Descriptors - Since the days of the original 'marvel.oracle.com', the Database Access Descriptor used for the site was named 'otn'. Earlier this year, I wanted to have apex.oracle.com support the documented and standard Database Access Descriptor named 'apex', so we added that one to our configuration. Then, with the migration of AskTom to the apex.oracle.com instance, we created a third Database Access Descriptor named 'asktom' (otherwise, we would have broken any bookmarks to Tom's questions and answers).

The Web server software used by Oracle Global IT on this instance is Oracle Application Server 10.1.2. Unfortunately, Oracle HTTP Server and mod_plsql do not support true connection pooling in this version (see the documentation for the explanation). Each HTTP Server process would establish and maintain a database connection per DAD - that's three database connections times the number of HTTP Server processes. There are two Web servers that front-end apex.oracle.com - so 2 Web servers * 3 connections per Oracle HTTP Server process * 200 - 250 HTTP Server processes - and we easily had more than 1,000 database sessions during peak times. Even if only a handful of these sessions were active at any given time, the memory consumed by all of these sessions was too large. So large, that the system log was recording messages like "Out of Memory: Killed process 31088 (oracle)" every few minutes. Not good.

With the help of Kris Rice, we were able to remove the 'asktom' Database Access Descriptor and replace it with the following rewrite directives in httpd.conf:

RewriteEngine On
RewriteCond %{THE_REQUEST} ^.*/pls/asktom.* [NC]
RewriteRule ^.*/pls/asktom(.*)$ /pls/apex$1 [NC,QSA,PT]

With this one change, the database server wasn't swapping memory and the OS was not randomly killing processes because it was out of memory. The next day, Tom even commented how snappy it was.




4) File System Caching - The system administrators reviewed the network performance and I/O systems. They changed the "noac" and "actimeo" parameters so that the system would cache file system parameters, thus reducing the I/O on the system. Remember that this was a database with over 15,000 tablespaces and datafiles - that's a lot of potential I/O savings. Don't ask me for any more detail on this one, though - it's beyond my domain of expertise.




5) Purge of Stale Workspaces - There is no built-in workspace purging process in Application Express. The way the process has worked over the past 3 or 4 year is I personally would run a hand-crafted collection of custom PL/SQL programs which identified the workspaces that have not been touched in the last 45 days, notify the administrators of these workspaces via e-mail that their workspace was going to be purged, let them retain the workspace if they wished, and then after a two-week period - go through and purge the ones which hadn't been retained.

The problem with this solution is that apex.oracle.com would go through these peaks and valleys - dropping to 7,000 workspaces and then over time, creep up to 17,000 workspaces. I would manually run the purge process a year later, bring it back down to 8,000 workspaces, and then the process starts all over again. The system would be fastest after the purge and then slowly degrade over time.

I rewrote the purge process from scratch, but this time, I designed it to run perpetually (I have to give credit to my manager, Mike Hichwa, for this idea). The parameters are roughly the same - every night, it wakes up, computes who hasn't used their workspace in the past N days and isn't already on "the list", sends the administrators an e-mail, gives them the chance to retain their workspace, sends a reminder 3 days in advance of the purge, and and then after a total of 10 days, purges the workspace. Purging the workspace involves dropping the tablespace & data file, dropping the database user corresponding to the workspace schema, and deleting the workspace and all associated meta data. The great thing is this runs every day - I don't have to kick it off and monitor it, I just get daily e-mails about the status and statistics.

Since the first workspace was purged by this process on January 18, 2010, this perpetual purge process has been used to remove 5,998 workspaces and 5,860 tablespaces, reclaiming 64,531,415,040 bytes from the file system. And now apex.oracle.com is on auto-pilot.

My manager, Mike Hichwa, also suggested I simply fold this feature into Application Express 4.0 - so Application Express can become a self-maintaining system. I'm not convinced this feature would be used by that many customers outside of apex.oracle.com. I tend to believe DBA's and our customers would rarely want to automatically drop data.




6) Memory - This past Saturday, February 6, 2010, the memory on apex.oracle.com was upgraded from 16 GB to 32 GB. Since then, the machine is really very, very fast. I've monitored the system load and it is consistently running at about 50% of maximum capacity, even during peak times.




Going forward, I am going to request that we upgrade the database to version 11.2.0.1. Additionally, I am going to request an upgrade to Oracle HTTP Server which supports true connection pooling and which should dramatically reduce the number of overall database sessions to the database server, itself. If the upgrade to Oracle HTTP Server can't be done in a reasonable time frame, then I'm going to try to setup server-side pooling via Database Resident Connection Pooling. Even though I always advise against multiplexing multiplexed connections, I can't really describe the connection pooling on our version of Oracle HTTP Server today as "multiplexed."

If I were in charge of this site, I would have done these upgrades yesterday. But probably like the company you're at, our IT organization that actually runs this site has their software levels and it takes time to uptake new versions and roll it out. I'm not complaining, as I understand and appreciate this when you have to manage hundreds of databases and Web applications, as they do.


Contrary to some assertions, the performance of apex.oracle.com was of great concern to us and to me personally. However, you cannot fix a problem until you have diagnosed the actual cause of the issue. And unfortunately, that takes time. Additionally, some of these steps took a great deal of time to actually implement and test before rolling out on apex.oracle.com. As much as I would have liked to flip the "fast=true" bit, that wasn't reality. This has been a learning process and one which will ultimately benefit the users of apex.oracle.com going forward as well as the product, Oracle Application Express. I am grateful to those customers and members of the APEX community for their patience during this time. To paraphrase Frank Costanza, "We're back, baby!"

9 comments:

  1. It would be great if the configuration changes you did could be included as recommendations in the APEX installation document!

    Matthias

    ReplyDelete
  2. Joel. Excellent work, Apex is a magnificent resource.

    Perhaps now that that has been done you could pop over to the support.oracle.com people and show them how to run a properly-performing website !

    ReplyDelete
  3. Good News for apex.oracle and nice work.
    However your idea is good. i can use this with my applications.

    KeepAliveTimeout = 6 , less value... -)

    ReplyDelete
  4. Joel,

    Joel,
    Nice post! I've been looking for ways to improve our internal APEX environment especially on web server side. Even though there is a bunch of information on internet about performance tuning for web server, as a DBA, it is hard to understand if you're not really involved as web admin. Your post is clear as it is context specific to APEX.

    I agree with Matthias that this will be great information to be included in the installation guide.

    Thanks.
    Ittichai

    ReplyDelete
  5. This sounds to me like a perfect real-life lession on how many of us can keep performance up on our own servers. Thanks Joel.

    BTW I think the auto purge option may have some bearing on some servers an might serve as a good how-to article for the future.

    ReplyDelete
  6. Seems like I read that the APEX repository was compiled using native compilation at one time. But it was reversed because of some internal ORA-0600 errors. Was that ever resolved? Can you give any feedback about this?

    ReplyDelete
  7. Joel,

    Outstanding post! Not only is it a shining example of corporate transparency, but there are a lot of real-world lessons in there that we can learn from. Running apex on your laptop != hosting a production server with a large population of users. Keep up the great work!

    Tyler

    ReplyDelete
  8. In my experience, keep-alive is even more important in an AJAX application that makes a lot of short requests. As AJAX becomes more and more popular in this community (it's going to explode with 4.0), I suspect the value of that parameter will be more important.

    ReplyDelete