Ars Digita Site Profiler

One of the Ars Digita Free Tools

Software testing is boring and repetitive. As a result, people aren't very good at it: we tend to carry out the same actions over and over, and thus miss errors that are outside the path of our normal activities. So testing is often neglected, and it can be quite easy for minor (and sometimes major) defects that detract from a piece of software's utility and appearance of 'solidness' to make their way into production code.

Automating The Process

Because websites have fairly consistent and well-defined mechanisms through which all user interaction occurs (a GET or POST for input, HTML for output), and through which most possible actions are communicated to the user (i.e. links), we have an opportunity to easily automate the testing of a website. Crawlers have been taking advantage of these consistent interfaces to index sites for years. Why not use a crawler of sorts to test a site for broken links and some other easily-distinguished errors, and to profile its performance? That's the purpose of this system.

What This System Can Do

Tell you which pages are slow. Find pages with broken links, internal server errors, and any other problem that is accompanied by an HTTP response code other than 200.

What This System Can't Do

We have no way of testing the semantic correctness of a page's content. Your .TCL script can return complete gibberish, and the crawler will happily report that all is well as long as it gets an HTTP return code of 200.

Also, forms are currently ignored (meaning that a page only reachable via a POST will not get profiled), and no validation is done on the returned HTML. Both these things may change, though. See 'Possible Improvements', below.

The Mid-Level Details

The system consists of two parts:

The Crawler

The crawler is a fairly dumb program that takes a few arguments specifying hostname, count of hits to perform, login information, and a few other things. It then goes through a fairly straightforward sequence of actions:

  1. Log on to the site using the specified login information.
  2. Retrieve a specified starting path from the site.
  3. Write a log record indicating the time required to get a response from the server and any error conditions returned.
  4. Extract all links in the HTML retrieved by the last hit that don't point to a different host.
  5. Randomly choose one of the URLs from the previous step; fetch that from the server.
  6. Repeat steps 3-5 until the specified number of hits has been performed.
  7. Write all recorded logging information to a file.

(Actually, it's a bit more complicated than this, when you get into issues like reacting to error codes returned by the server and preventing the crawler from getting 'trapped' in a small subset of the site. If you want more detail, browse the source, with particular attention to the profile_site function.)
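
As a rough illustration of steps 2 through 6, here is a minimal sketch of the random-walk loop. It is not the code from profile_acs_site: it is written for a current Python 3 rather than Python v1.5.2, and every name in it (LinkExtractor, same_host_links, crawl, the fetch callback) is hypothetical. Logging in (step 1) and writing the log file (step 7) are left out, and the rule of returning to the start page at a dead end is an assumption standing in for the real script's error handling.

```python
import random
import time
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_host_links(html, base_url):
    """Step 4: return absolute URLs for links that stay on base_url's host."""
    parser = LinkExtractor()
    parser.feed(html)
    host = urlsplit(base_url).netloc
    links = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)
        parts = urlsplit(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            links.append(absolute)
    return links


def crawl(fetch, start_url, count, rng=random):
    """Random walk over a site; fetch(url) must return (status, html)."""
    samples = []
    url = start_url
    for seq in range(count):
        started = time.monotonic()
        status, html = fetch(url)
        duration = time.monotonic() - started
        samples.append((seq, url, status, duration))  # step 3
        links = same_host_links(html, url) if status == 200 else []
        # Assumed recovery rule: go back to the start page at a dead end.
        url = rng.choice(links) if links else start_url  # step 5
    return samples
```

The fetch function is passed in so that the loop can be exercised against canned pages without touching a live server; a real run would supply a function that performs the HTTP request.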

The Profile Log Analysis Pages

The log file stored by the crawler contains mostly just raw data about response codes and times for the pages it fetches. Making effective use of this data requires tools for browsing and generating summaries of the data. This is where a new set of pages at /admin/profile comes in. On these pages, developers can upload log files generated by the Python script. The data from the logs is entered into a database table, and views on the data that summarize by fairly arbitrary criteria can be created.

How To Use The System

Running The Crawler

The first step in using the profiling system is to record a dataset using the crawler script, named 'profile_acs_site'. This script is written in Python; running it requires that you have Python v1.5.2 installed on your system. The default Red Hat install seems to include Python. If your system doesn't already have Python, you can download it from the official Python website. All library modules needed by the script should be included in the standard Python distribution.

The format of the command line to record a profile is:

profile_acs_site -profile (host) (path) (count) (outfile) [(logon)] [(illegal_urls)]

Let's take this apart:

Here's a sample command line I would use to crawl the WineAccess website starting from the path '/', for 10000 hits, recording output in the file 'walog', logging in to the site as 'burdell@gatech.edu', and avoiding links to pages under /static (or, implicitly, /register):

profile_acs_site -profile www.wineaccess.com / 10000 walog \
  -acs burdell@gatech.edu thegoodword "/static"

WARNING! When the crawler runs as a logged-in user, it can do anything that user can (with the exception of actions requiring a POST). If you have it connect as a user with significant administrative privileges, it can randomly delete content, nuke users, wipe out significant parts of the site, and do all sorts of other bad stuff. Think carefully about who the crawler will be connecting as.

Analyzing The Crawler Log

Once the log has been recorded, it needs to be uploaded to the database. This requires that the log analysis data model and pages be available on an Oracle-backed AOLserver somewhere. Installation requires two or three steps:

  1. Load the data model in profile.sql into SQL*Plus.
  2. Drop the files under the pages directory of the profile distribution into their own directory somewhere under the server's pageroot. If this is an ACS site, I recommend /admin/profile.
  3. If the server is running ACS, you're done. If not, you'll also need to download the Ars Digita AOLserver utilities and place the file somewhere that your AOLserver can load it on start up (i.e. in either the shared or private TCL directory).

In a browser, go to index.tcl in the directory where you installed the contents of the pages directory. Select "Upload New Dataset" and fill in the information requested by the form. All fields other than short name are optional, but some links generated elsewhere in the profile pages will be broken if you don't provide an exact hostname for the site being profiled. (This hostname should be the same as the one you passed to the crawler script.) The upload will take a while (5-10 minutes for a 10,000 record log file): TCL has to dissect a CSV file and insert a record into the database for each sample taken, and this is not a fast process.

After the upload is complete, go back to index.tcl, and you should see an entry for the dataset you just uploaded. Click on it, and you get a summary view that groups samples by stripping off the URL variables from the paths and computes statistics on these groups. You can build your own summary views by specifying filtering and grouping expressions in the form on this page. The expressions provided here end up being put into WHERE and GROUP BY clauses in a select on the profile_samples table, so you may reference any columns (with the exception of dataset_id, which is already determined) from this table.
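
The real pages do this grouping in SQL against the profile_samples table. For readers working with a raw log outside the database, the default summary view can be approximated in a few lines of Python. This is only a sketch: the function name summarize and the particular statistics (hit count, average and maximum duration) are assumptions, not a description of what the pages actually compute.

```python
from collections import defaultdict
from statistics import mean
from urllib.parse import urlsplit


def summarize(samples):
    """Group (path, duration) pairs by path with URL variables stripped."""
    groups = defaultdict(list)
    for path, dur in samples:
        # "/item.tcl?id=1" and "/item.tcl?id=2" both land under "/item.tcl".
        groups[urlsplit(path).path].append(dur)
    return {
        key: {"hits": len(durs), "avg": mean(durs), "max": max(durs)}
        for key, durs in groups.items()
    }
```

Sorting the resulting groups by average duration is one quick way to surface the slow pages.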

Log File Format

For those of you interested in writing your own analysis tools, the logfile is a simple comma-separated value file. Each line in the file represents a single hit on the site and is in this format:

<seq>,<path>,<form>,<redirect>,<prevpath>,<result_cd>,<result_msg>,<html_valid>,<dur>,<timestamp>
<seq>
A simple sequence number recording the order in which hits occurred
<path>
The initial URL requested from the site, minus "http://<host>"
<form>
Form data sent to the site. If this is present, then a POST to the site was done with this data. If absent, a simple GET. Currently no POSTs are done by the script - this was simply put in for future expansion.
<redirect>
A semicolon-separated list of redirects. If the initial path results in a redirect, the new location and any ensuing redirects will be listed here. Note that the duration given for the page load includes the time required to load all pages in the redirect chain.
<prevpath>
The page on which the link to path was found. This path should probably always be either the path or last entry in the redirect list (if any) of the previous sample.
<result_cd>
Normally, this will be the HTTP response code. However, certain errors generate other error codes. Codes that aren't standard HTTP result codes will always be negative:
  • -1, -11: An error occurred connecting, no further info
  • -12: A timeout occurred (> 60 seconds) while fetching the page
  • -13: Encountered redirect to another host. (Note that this isn't exactly an error, but we don't want the crawler wandering off and profiling some other site.)
  • -14: Bad location header: The server returned a Location header that had a relative or host-relative URL. These violate the HTTP spec.
  • -15: The server returned 302, the HTTP redirect code, but did not provide a Location header.
<result_msg>
A text result message
<html_valid>
0 if the HTML failed some sort of validation, 1 if it passed, empty string if no validation was performed. Currently the only 'validation' done is a check for unevaluated '<% %>' ADP tags.
<dur>
Length of time it took to fetch the page, including all pages in the redirect chain, if any
<timestamp>
A timestamp telling when the fetch was initiated, in the format "YYYY-MM-DD HH:MI:SS", with a 24-hour clock. Greenwich Mean Time is used.
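
As a starting point for your own analysis tools, here is a minimal sketch of a reader for this format. It assumes a current Python 3, assumes that the crawler's output can be read with standard CSV rules (the format description above does not say how a comma inside a field is escaped), and assumes that <dur> is a plain decimal number. The names FIELDS, CRAWLER_CODES, parse_log and describe are hypothetical, and the short labels are paraphrases of the code list above.

```python
import csv
from datetime import datetime, timezone

FIELDS = ["seq", "path", "form", "redirect", "prevpath",
          "result_cd", "result_msg", "html_valid", "dur", "timestamp"]

# Crawler-specific (negative) result codes; labels are paraphrased.
CRAWLER_CODES = {
    -1: "connection error",
    -11: "connection error",
    -12: "timeout",
    -13: "redirect to another host",
    -14: "bad Location header",
    -15: "302 without Location header",
}


def parse_log(lines):
    """Yield one dict per sample from an iterable of log lines."""
    for row in csv.reader(lines):
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed lines
        rec = dict(zip(FIELDS, row))
        rec["seq"] = int(rec["seq"])
        rec["result_cd"] = int(rec["result_cd"])
        rec["dur"] = float(rec["dur"])  # assumed to be a decimal number
        rec["redirect"] = rec["redirect"].split(";") if rec["redirect"] else []
        # Empty string (no validation performed) maps to None.
        rec["html_valid"] = {"0": False, "1": True}.get(rec["html_valid"])
        rec["timestamp"] = datetime.strptime(
            rec["timestamp"], "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
        yield rec


def describe(code):
    """Return a label for crawler-specific codes, or None for HTTP codes."""
    return CRAWLER_CODES.get(code)
```

Passing an open file object to parse_log works as well as a list of strings, since csv.reader accepts any iterable of lines.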

Frequently Asked Questions:

Possible Improvements

Here are a few things I'd like to see but haven't had time to implement just yet:


elorenzo@arsdigita.com