Ars Digita Site Profiler

Crawler Source: profile_site
Analysis Data Model: profile.sql
Download: profile-1.0.tar.gz

Software testing is boring and repetitive. As a result, people aren't very good at it: we tend to carry out the same actions over and over, and thus miss errors that lie outside the path of our normal activities. So testing is often neglected, and it is quite easy for minor (and sometimes major) defects that detract from a piece of software's utility and appearance of 'solidness' to make their way into production code.

Automating The Process

Because websites have fairly consistent and well-defined mechanisms through which all user interaction occurs (a GET or POST for input, HTML for output), and through which most possible actions are communicated to the user (i.e. links), we have an opportunity to easily automate the testing of a website. Crawlers have been taking advantage of these consistent interfaces to index sites for years. Why not use a crawler of sorts to test a site for broken links and some other easily-distinguished errors, and to profile its performance? That's the purpose of this system.

What This System Can Do

Tell you which pages are slow. Find pages with broken links, internal server errors, and any other problem that is accompanied by an HTTP response code other than 200.

What This System Can't Do

We have no way of testing the semantic correctness of a page's content. Your .TCL script can return complete gibberish, and the crawler will happily report that all is well as long as it gets an HTTP return code of 200.

Also, forms are currently ignored (meaning that a page only reachable via a POST will not get profiled), and no validation is done on the returned HTML. Both these things may change, though. See 'Possible Improvements', below.

The Mid-Level Details

The system consists of two parts:

The Crawler

The crawler is a fairly dumb program that takes a few arguments specifying hostname, count of hits to perform, login information, and a few other things. It then goes through a fairly straightforward sequence of actions:

1. Log on to the site using the specified login information.
2. Retrieve a specified starting path from the site.
3. Log a record indicating the time required to get a response from the server and any error conditions returned.
4. Extract all links in the HTML retrieved by the last hit that don't point to a different host.
5. Randomly choose one of the URLs from the previous step; fetch that from the server.
6. Repeat steps 3-5 until the specified number of hits has been performed.
7. Write all recorded logging information to a file.

(Actually, it's a bit more complicated than this, when you get into issues like reacting to error codes returned by the server and preventing the crawler from getting 'trapped' in a small subset of the site. If you want more detail, browse the source, with particular attention to the profile_site function.)
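
For readers who want a feel for the loop without opening the source, here is a minimal sketch of steps 2-6 in modern Python. It is not the code from profile_site: the names crawl, same_host_paths and LinkExtractor are made up for illustration, the fetch function is supplied by the caller (so login, redirects and timeouts are left out), and falling back to the start path at a dead end is my simplification of the real 'trap' avoidance.

```python
import random
import time
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_host_paths(html, host, current_path):
    """Return the paths (with URL variables) of links that stay on host."""
    parser = LinkExtractor()
    parser.feed(html)
    base = "http://%s%s" % (host, current_path)
    paths = []
    for href in parser.hrefs:
        parts = urlsplit(urljoin(base, href))
        if parts.scheme == "http" and parts.netloc == host:
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            paths.append(path)
    return paths


def crawl(fetch, host, start_path, count, rng=random):
    """Random walk over a site; fetch(path) must return (status, html)."""
    samples = []
    path = start_path
    for seq in range(count):
        started = time.time()
        status, html = fetch(path)
        samples.append((seq, path, status, time.time() - started))
        links = same_host_paths(html, host, path) if status == 200 else []
        # Assumed simplification: restart from the start path at a dead end
        # or after an error, instead of the real crawler's recovery logic.
        path = rng.choice(links) if links else start_path
    return samples
```

Passing fetch in as an argument keeps the walk itself independent of the HTTP library, which also makes it easy to try against a dictionary of canned pages.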

The Profile Log Analysis Pages

The log file stored by the crawler contains mostly just raw data about response codes and times for the pages it fetches. Making effective use of this data requires tools for browsing and generating summaries of the data. This is where a new set of pages at /admin/profile comes in. On these pages, developers can upload log files generated by the Python script. The data from the logs is entered into a database table, and views on the data that summarize by fairly arbitrary criteria can be created.

How To Use The System

Running The Crawler

The first step in using the profiling system is to record a dataset using the crawler script, named 'profile_acs_site'. This script is written in Python; running it requires that you have Python v1.5.2 installed on your system. The default Red Hat install seems to include Python. If your system doesn't already have Python, you can download it from the official Python website. All library modules needed by the script should be included in the standard Python distribution.

The format of the command line to record a profile is:

profile_acs_site -profile (host) (path) (count) (outfile) [(logon)] [(illegal_urls)]

Let's take this apart:

-profile - this flag indicates that the script should record a profile log. There's also a -report option, which takes an existing log and generates a plain text report on the data it contains. However, that option is mostly a legacy from the early stages of this tool, before the web/Oracle analysis system was created.
host - hostname of the site to profile. If the service is running on a port other than 80, specify the port using the hostname:port syntax.
path - 'start path' from which the crawler begins exploring the site. Usually you'll want to use '/'.
count - number of samples (i.e. hits) to record
outfile - name of the file in which profile data will be stored. More details on the file format below.
logon - a logon specification. This can take 2 forms:
- -acs - indicates an ACS-type login. This flag should be immediately followed by an email/username and password. The crawler then executes a 'GET' on /register/user-login.tcl with these values passed as URL variables, and follows the redirect chain to get the authentication cookie.
- -generic - indicates a 'generic' logon. This flag should be immediately followed by a URL (presumably containing username and password or somesuch as URL variables) which the crawler will GET in order to log on. This option is mostly a leftover from an earlier incarnation of the crawler, but I kept it around thinking it might be useful some other time.
The logon may also be omitted entirely.
illegal_urls - the last argument to the script is a regular expression (Perl style, I believe - but go to the documentation for the Python 're' module for the final word on the syntax). The crawler will not follow any link whose URL (minus scheme and hostname) matches this regexp. This can be used to prevent the crawler from wasting time profiling pages you aren't interested in, or following links that will cause it to be logged out. (When you specify a -acs login, '/register' is automatically added to this regexp to prevent logouts from occurring.) Please note that this filter only applies to links parsed out of a page's HTML. The only restriction on redirect locations is that they must point to the host being profiled.
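
To make the illegal_urls behaviour concrete, here is a rough sketch of how such a filter could be applied. The function names build_link_filter and is_allowed are hypothetical, and I am assuming the pattern may match anywhere in the path (re.search rather than re.match); check the crawler source if the distinction matters for your regexp.

```python
import re
from urllib.parse import urlsplit


def build_link_filter(illegal_urls=None, acs_login=False):
    """Compile the exclusion pattern, or return None if nothing is excluded."""
    patterns = []
    if illegal_urls:
        patterns.append(illegal_urls)
    if acs_login:
        # Mirrors the documented behaviour: -acs adds '/register' so the
        # crawler does not follow a link that logs it out.
        patterns.append("/register")
    if not patterns:
        return None
    return re.compile("|".join("(?:%s)" % p for p in patterns))


def is_allowed(url, link_filter):
    """True if the URL, minus scheme and hostname, is not excluded."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # Assumption: the pattern may match anywhere in the path.
    return link_filter is None or link_filter.search(path) is None
```

With the filter built from "/static" and an ACS login, links under /static and /register are skipped while everything else on the host is still followed.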

Here's a sample command line I would use to crawl the WineAccess website starting from the path '/', for 10000 hits, recording output in the file 'walog', logging in to the site as 'burdell@gatech.edu', and avoiding links to pages under /static (or, implicitly, /register):

profile_acs_site -profile www.wineaccess.com / 10000 walog \
    -acs burdell@gatech.edu thegoodword "/static"

WARNING! When the crawler runs as a logged-in user, it can do anything that user can do (with the exception of actions requiring a POST). If you have it connect as a user with significant administrative privileges, it can randomly delete content, nuke users, wipe out significant parts of the site, and do all sorts of other bad stuff. Think carefully about who the crawler will be connecting as.

Analyzing The Crawler Log

Once the log has been recorded, it needs to be uploaded to the database. This requires that the log analysis data model and pages be available on an Oracle-backed AOLserver somewhere. Installation requires two or three steps:

Load the data model in profile.sql into SQL*Plus.
Drop the files under the pages directory of the profile distribution into their own directory somewhere under the server's pageroot. If this is an ACS site, I recommend /admin/profile.
If the server is running ACS, you're done. If not, you'll also need to download the Ars Digita AOLserver utilities and place the file somewhere that your AOLserver can load it on start up (i.e. in either the shared or private TCL directory).

In a browser, go to index.tcl in the directory where you installed the contents of the pages directory. Select "Upload New Dataset" and fill in the information requested by the form. All fields other than short name are optional, but some links generated elsewhere in the profile pages will be broken if you don't provide an exact hostname for the site being profiled. (This hostname should be the same as the one you passed to the crawler script.) The upload will take a while (5-10 minutes for a 10,000 record log file): TCL has to dissect a CSV file and insert a record into the database for each sample taken, and this is not a fast process.

After the upload is complete, go back to index.tcl, and you should see an entry for the dataset you just uploaded. Click on it, and you get a summary view that groups samples by stripping off the URL variables from the paths and computes statistics on these groups. You can build your own summary views by specifying filtering and grouping expressions in the form on this page. The expressions provided here end up being put into WHERE and GROUP BY clauses in a select on the profile_samples table, so you may reference any columns (with the exception of dataset_id, which is already determined) from this table.
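
If you just want to see what the default grouping does without loading anything into Oracle, the idea can be imitated in a few lines of Python. This is only an illustration: summarize is a hypothetical helper, and the statistics it computes (hit count, average and maximum duration) are my guess at useful numbers, not necessarily the ones the summary page reports.

```python
from collections import defaultdict


def summarize(samples):
    """Group (path, duration) pairs by path with URL variables stripped."""
    groups = defaultdict(list)
    for path, duration in samples:
        # Everything from the first '?' onward is treated as URL variables.
        groups[path.split("?", 1)[0]].append(duration)
    return {
        page: {"hits": len(d), "avg": sum(d) / len(d), "max": max(d)}
        for page, d in groups.items()
    }
```

For example, hits on /a.tcl?x=1 and /a.tcl?x=2 both land in the /a.tcl group, which is the same effect the default view gets by stripping URL variables before grouping.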

Log File Format

For those of you interested in writing your own analysis tools, the logfile is a simple comma-separated value file. Each line in the file represents a single hit on the site and is in this format:

<seq>,<path>,<form>,<redirect>,<prevpath>,<result_cd>,<result_msg>,<html_valid>,<dur>,<timestamp>

<seq>

A simple sequence number recording the order in which hits occurred

<path>

The initial URL requested from the site, minus "http://<host>"

<form>

Form data sent to the site. If this is present, then a POST to the site was done with this data. If absent, a simple GET. Currently no POSTs are done by the script - this was simply put in for future expansion.

<redirect>

A semicolon-separated list of redirects. If the initial path results in a redirect, the new location and any ensuing redirects will be listed here. Note that the duration given for the page load includes the time required to load all pages in the redirect chain.

<prevpath>

The page on which the link to path was found. This path should probably always be either the path or last entry in the redirect list (if any) of the previous sample.

<result_cd>

Normally, this will be the HTTP response code. However, certain errors generate other error codes. Codes that aren't standard HTTP result codes will always be negative:
-1, -11: An error occurred while connecting; no further information is available

-12: A timeout occurred (> 60 seconds) while fetching the page

-13: Encountered redirect to another host. (Note that this isn't exactly an error, but we don't want the crawler wandering off and profiling some other site.)

-14: Bad location header: The server returned a Location header that had a relative or host-relative URL. These violate the HTTP spec.

-15: The server returned 302, the HTTP redirect code, but did not provide a Location header.

<result_msg>

A text result message

<html_valid>

0 if the HTML failed some sort of validation, 1 if it passed, empty string if no validation was performed. Currently the only 'validation' done is a check for unevaluated '<% %>' ADP tags.

<dur>

Length of time it took to fetch the page, including all pages in the redirect chain, if any

<timestamp>

A timestamp telling when the fetch was initiated, in the format "YYYY-MM-DD HH:MI:SS", with a 24-hour clock. Greenwich Mean Time is used.
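
As a starting point for your own analysis tools, here is a small sketch of a log reader built on the field list above. Several things in it are assumptions rather than facts about the crawler's output: that fields containing commas are quoted the way Python's csv module expects, that the duration is a decimal number (the unit isn't stated here), and the names Sample, parse_log and is_problem, which are made up for this example.

```python
import csv
from collections import namedtuple
from datetime import datetime, timezone

Sample = namedtuple(
    "Sample",
    "seq path form redirects prevpath result_cd result_msg html_valid dur timestamp",
)


def parse_log(lines):
    """Yield one Sample per non-empty log line."""
    for row in csv.reader(lines):
        if not row:
            continue
        seq, path, form, redirect, prevpath, cd, msg, valid, dur, ts = row
        yield Sample(
            seq=int(seq),
            path=path,
            form=form or None,  # empty means a plain GET
            redirects=redirect.split(";") if redirect else [],
            prevpath=prevpath,
            result_cd=int(cd),  # negative values are crawler-specific codes
            result_msg=msg,
            html_valid=None if valid == "" else valid == "1",
            dur=float(dur),  # assumed to be a decimal number
            timestamp=datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(
                tzinfo=timezone.utc  # the log uses GMT
            ),
        )


def is_problem(sample):
    """True for anything other than a 200 with no failed HTML check."""
    return sample.result_cd != 200 or sample.html_valid is False
```

Since parse_log accepts any iterable of lines, an open file object can be passed to it directly.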

Frequently Asked Questions:

Why do I get a "Bad Location Header" when I try to get the crawler to log in to an ACS site?
The Crawler is really anal-retentive about some HTTP specs. In this case, it refuses to follow a 302 redirect if the location header is not a complete URL (i.e. including "http://<hostname>"). ns_returnredirect, if given a relative URL as input, generates such a Location header. Some of the cookie-chain and login pages on older ACS versions use ns_returnredirect this way, and as a result generate bad location headers. The crawler sees these, reports an error, and does not complete the login process. You can fix this problem

My location headers are okay, but I'm still having trouble getting the crawler to log in.
When you specify the -acs logon option, the crawler attempts to log on by doing a GET of /register/user-login.tcl with the email address and password encoded as URL variables. The specific field names used are "email" and "password_from_form". If user login is not at this location, or expects different field names, the login won't work. If you want to quickly modify the crawler to make it work with your site (which I recommend over changing the site to work with the crawler), look in the crawler source at the function named "process_logon_args" (it's the block of code that starts out with "def process_logon_args"). Hopefully, even those unfamiliar with Python will be able to figure out what needs to be changed to get things working. Alternatively, figure out how to use the -generic logon option.
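
To see exactly what the -acs login requests, or to work out a URL to hand to -generic, it can help to build the login path by hand. The sketch below is not taken from process_logon_args; acs_login_path and its keyword arguments are hypothetical, with defaults set to the page and field names described above.

```python
from urllib.parse import urlencode


def acs_login_path(email, password,
                   login_page="/register/user-login.tcl",
                   email_field="email",
                   password_field="password_from_form"):
    """Build the path the crawler would GET to log in."""
    # urlencode escapes characters such as '@' so they survive in a URL.
    query = urlencode({email_field: email, password_field: password})
    return "%s?%s" % (login_page, query)
```

If your site's login page lives elsewhere or expects other field names, changing the keyword arguments shows what the equivalent request would look like.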

Possible Improvements

Here are a few things I'd like to see but haven't had time to implement just yet:

I'd like to make the crawler smart enough to generate random data to send to forms so that it can do POSTs as well as GETs. Also, it would be nice to test how the POST recipients behave in response to random inputs. I've made room for this in the data model and CSV files that the crawler outputs, but haven't yet had time to write code to parse forms and work out what inputs need to be sent.
Run the HTML returned by each page through an HTML validator, and record any errors found.
Extract anything that appears to be English text and run it through a spelling/grammar checker. (Might be better to just hire a proofreader, though, if we can't come up with an automated system that doesn't give a lot of false hits.)
Handle SSL. This will probably have to wait until the Python httplib module that I'm using to talk to the site gains this support.
Integration with other ACS components? It seems like there may be some opportunities to, for example, streamline the generation of tickets for defects located by the crawler.
Clean up the data load process so that aborting halfway doesn't leave fragments of datasets lying around.
A faster means of importing datasets into Oracle. TCL parsing of the data files is slow. Maybe exec an external script, or SQL*Loader?
Implement PostgreSQL versions of the data model and pages, to make this tool useful for people who don't have access to Oracle.

elorenzo@arsdigita.com