Software testing is boring and repetitive. As a result, people aren't very good at it - we tend to carry out the same actions over and over, and thus miss errors that lie outside the path of our normal activities. So testing is often neglected, and it is quite easy for minor (and sometimes major) defects that detract from a piece of software's utility and appearance of 'solidness' to make their way into production code.
Because websites have fairly consistent and well-defined mechanisms through which all user interaction occurs (a GET or POST for input, HTML for output), and through which most possible actions are communicated to the user (i.e. links), we have an opportunity to easily automate the testing of a website. Crawlers have been taking advantage of these consistent interfaces to index sites for years. Why not use a crawler of sorts to test a site for broken links and some other easily distinguished errors, and to profile its performance? That's the purpose of this system.
Tell you which pages are slow. Find pages with broken links, internal server errors, and any other problem that is accompanied by an HTTP response code other than 200.
We have no way of testing the semantic correctness of a page's content. Your .TCL script can return complete gibberish, and the crawler will happily report that all is well as long as it gets an HTTP return code of 200.
Also, forms are currently ignored (meaning that a page only reachable via a POST will not get profiled), and no validation is done on the returned HTML. Both these things may change, though. See 'Possible Improvements', below.
The system consists of two parts:
The crawler is a fairly dumb program that takes a few arguments specifying hostname, count of hits to perform, login information, and a few other things. It then goes through a fairly straightforward sequence of actions:
(Actually, it's a bit more complicated than this, when you get into
issues like reacting to error codes returned by the server and preventing
the crawler from getting 'trapped' in a small subset of the site. If you
want more detail, browse the source, with
particular attention to the profile_site function.)
The log file stored by the crawler contains mostly just raw data about response codes and times for the pages it fetches. Making effective use of this data requires tools for browsing and generating summaries of the data. This is where a new set of pages at /admin/profile comes in. On these pages, developers can upload log files generated by the Python script. The data from the logs is entered into a database table, and views on the data that summarize by fairly arbitrary criteria can be created.
The first step in using the profiling system is to record a dataset with the crawler script, named 'profile_acs_site'. This script is written in Python; running it requires that you have Python v1.5.2 installed on your system. The default Red Hat install seems to include Python. If your system doesn't already have Python, you can download it from the official Python website. All library modules needed by the script should be included in the standard Python distribution.
The format of the command line to record a profile is:
profile_acs_site -profile (host) (path) (count) (outfile) [(logon)] [(illegal_urls)]
Let's take this apart:
Here's a sample command line I would use to crawl the WineAccess website starting from the path '/', for 10000 hits, recording output in the file 'walog', logging in to the site as 'burdell@gatech.edu', and avoiding links to pages under /static (or, implicitly, /register):
profile_acs_site -profile www.wineaccess.com / 10000 walog \
-acs burdell@gatech.edu thegoodword "/static"
WARNING! When the crawler runs as a
logged in user, it can do anything that the user it connects as can
(with the exception of actions requiring a POST). If you have it
connect as a user with significant administrative privileges, it can
randomly delete content, nuke users, wipe out significant parts of
the site, and do all sorts of other bad stuff. Think carefully about
who the crawler will be connecting as.
Once the log has been recorded, it needs to be uploaded to the database. For this to work, the log analysis data model and pages must be available on an Oracle-backed AOLserver somewhere. Installation requires two or three steps:
In a browser, go to index.tcl in the directory where you installed the contents of the pages directory. Select "Upload New Dataset" and fill in the information requested by the form. All fields other than short name are optional, but some links generated elsewhere in the profile pages will be broken if you don't provide an exact hostname for the site being profiled. (This hostname should be the same as the one you passed to the crawler script.) The upload will take a while (5-10 minutes for a 10,000 record log file) - TCL has to dissect a CSV file and insert a record into the database for each sample taken, and this is not a fast process.
After the upload is complete, go back to index.tcl, and you should
see an entry for the dataset you just uploaded. Click on it, and you get
a summary view that groups samples by stripping off the URL variables from
the paths and computes statistics on these groups. You can build your own
summary views by specifying filtering and grouping expressions in the form
on this page. The expressions provided here end up being put into
WHERE and GROUP BY clauses in a select on the
profile_samples table, so you may reference any columns (with the
exception of dataset_id, which is already determined) from this table.
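To make the default grouping concrete, here is a minimal sketch in modern Python 3 (not the Python 1.5.2 the crawler itself targets) of what "stripping off the URL variables and computing statistics" amounts to. It is not the code behind the summary page, which does this work in SQL; the function names, the choice of statistics, and the sample paths and durations are all assumptions made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def strip_url_variables(path):
    """Return the path with any '?name=value' part removed."""
    return path.split("?", 1)[0]


def summarize(samples):
    """Group (path, duration) pairs by stripped path.

    Returns {stripped_path: (hit_count, mean_duration, max_duration)}.
    """
    groups = defaultdict(list)
    for path, duration in samples:
        groups[strip_url_variables(path)].append(duration)
    return {
        key: (len(durs), mean(durs), max(durs))
        for key, durs in groups.items()
    }


# Hypothetical samples, invented for illustration only.
samples = [
    ("/bboard/q-and-a.tcl?topic_id=1", 0.5),
    ("/bboard/q-and-a.tcl?topic_id=2", 1.5),
    ("/index.tcl", 0.25),
]
summary = summarize(samples)
for path, (hits, avg, worst) in sorted(summary.items()):
    print(path, hits, avg, worst)
```

The two q-and-a.tcl hits collapse into one group because they differ only in their URL variables, which is the same effect the default summary view produces.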
For those of you interested in writing your own analysis tools, the logfile is a simple comma-separated value file. Each line in the file represents a single hit on the site and is in this format:
<seq>,<path>,<form>,<redirect>,<prevpath>,<result_cd>,<result_msg>,<html_valid>,<dur>,<timestamp>
- <seq>
- A simple sequence number recording the order in which hits occurred
- <path>
- The initial URL requested from the site, minus "http://<host>"
- <form>
- Form data sent to the site. If this is present, then a POST to the site was done with this data. If absent, a simple GET. Currently no POSTs are done by the script - this was simply put in for future expansion.
- <redirect>
- A semicolon-separated list of redirects. If the initial path results in a redirect, the new location and any ensuing redirects will be listed here. Note that the duration given for the page load includes the time required to load all pages in the redirect chain.
- <prevpath>
- The page on which the link to path was found. This path should probably always be either the path or last entry in the redirect list (if any) of the previous sample.
- <result_cd>
- Normally, this will be the HTTP response code. However, certain errors generate other error codes. Codes that aren't standard HTTP result codes will always be negative:
- -1, -11: An error occurred while connecting; no further information is available
- -12: A timeout occurred (> 60 seconds) while fetching the page
- -13: Encountered redirect to another host. (Note that this isn't exactly an error, but we don't want the crawler wandering off and profiling some other site.)
- -14: Bad location header: The server returned a Location header that had a relative or host-relative URL. These violate the HTTP spec.
- -15: The server returned 302, the HTTP redirect code, but did not provide a Location header.
- <result_msg>
- A text result message
- <html_valid>
- 0 if the HTML failed some sort of validation, 1 if it passed, empty string if no validation was performed. Currently the only 'validation' done is a check for unevaluated '<% %>' ADP tags.
- <dur>
- The length of time it took to fetch the page, including all pages in the redirect chain, if any
- <timestamp>
- A timestamp telling when the fetch was initiated, in the format "YYYY-MM-DD HH:MI:SS", with a 24-hour clock. Greenwich Mean Time is used.
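As a starting point for your own tools, the field list above can be read with a short sketch like the following, written in modern Python 3 rather than the 1.5.2 the crawler uses. Several things here are assumptions, not documented behavior: that the standard csv module's default quoting rules match what the crawler writes, that the duration is a plain decimal number, and the two sample log lines, which are invented for illustration. The helper names are hypothetical as well.

```python
import csv
import io
from datetime import datetime, timezone

# Field names in the order given by the format line above.
FIELDS = ["seq", "path", "form", "redirect", "prevpath",
          "result_cd", "result_msg", "html_valid", "dur", "timestamp"]


def parse_log(text):
    """Yield one dict per log line, with fields converted to useful types."""
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        rec = dict(zip(FIELDS, row))
        rec["seq"] = int(rec["seq"])
        rec["result_cd"] = int(rec["result_cd"])
        rec["dur"] = float(rec["dur"])  # assumed to be a decimal number
        # Semicolon-separated redirect chain; empty field means no redirects.
        rec["redirect"] = rec["redirect"].split(";") if rec["redirect"] else []
        # "" means no validation was performed; otherwise "1" is a pass.
        rec["html_valid"] = (None if rec["html_valid"] == ""
                             else rec["html_valid"] == "1")
        # Timestamps are recorded in Greenwich Mean Time.
        rec["timestamp"] = datetime.strptime(
            rec["timestamp"], "%Y-%m-%d %H:%M:%S"
        ).replace(tzinfo=timezone.utc)
        yield rec


def is_crawler_code(rec):
    """True for the negative, non-HTTP result codes described above."""
    return rec["result_cd"] < 0


# Hypothetical log lines, invented for illustration only.
sample = (
    "1,/,,,,200,OK,1,0.42,2000-03-01 14:05:09\n"
    "2,/old,,/new;/newer,/,-12,timeout,,60.0,2000-03-01 14:05:10\n"
)
records = list(parse_log(sample))
problems = [r for r in records if r["result_cd"] != 200]
```

Here `problems` collects every sample whose result code is anything other than 200, and `is_crawler_code` separates the crawler's own negative codes from genuine HTTP responses.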
def process_logon_args". Hopefully, even
those unfamiliar with Python will be able to figure out what
needs to be changed to get things working. Alternatively,
figure out how to use the -generic logon option.
Here are a few things I'd like to see but haven't had time to implement just yet: