Analyzing the Website

To see how the process of analyzing the link structure of a website fits within the Web Metrics support for user logging, please see the dataflow diagram.

Analysis begins when the website under consideration is processed by the Linklint software, which determines its static link structure. This information is written to several files, one of which (named fileF.txt) is then further processed by the Perl script convert-ll. Here are the details:

Website: Copy and Name

Linklint can analyze either the original website or the copied instrumented version. We will concentrate on the latter, since it simplifies later steps. We assume that the website resides in $WEBLOG_DATA/site-2d-tested/website/.
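
For example, assuming the directory layout above, you might confirm that the copied site is in place and create the output directory used in the next step with something like:

   ls $WEBLOG_DATA/site-2d-tested/website/            # the copied, instrumented website
   mkdir -p $WEBLOG_DATA/site-2d-tested/webstruct/    # will hold the linklint results (see below)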

Run Linklint

Read the full documentation of linklint to find out about all the options. The following is a suggested procedure.

Use a text editor to build a command-line file in $WEBLOG_DATA/site-2d-tested/webstruct/, named something like cmdline.dat. This file will contain parameters (one per line) to be used by linklint. A sample cmdline.dat file might be:

   -http 
   -host www.yourserver.com
   -net
   -doc .
   -limit 8000
   -xref
This says to use the HTTP protocol to access web pages, gives the hostname of the website, asks for remote (off-site) links to be checked over the network, specifies that results are to be written into the current directory, allows up to 8000 files to be checked, and adds cross-reference information to the output files.

Change your current directory to $WEBLOG_DATA/site-2d-tested/webstruct/ and invoke linklint something like this:

linklint  @cmdline.dat $WEBLOG_DATA/site-2d-tested/website/@
This command line says to use the cmdline.dat file for control parameters, and points to one or more seeds to be scanned (in this case, the files within the "website" directory).

As a result, a number of informative files are written within the webstruct directory. Note especially the error files (such as error.html), as these may indicate needed repairs to the web pages. For our purposes, the important one is fileF.txt. This file contains complete information about all the web pages in the directory and how they are inter-linked. It also contains information about external pages (outside the website) that are linked to directly from within the website.
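
For example, while still in the webstruct directory, a quick look at the output might be something like:

   ls                   # list the report files written by linklint
   ls -l fileF.txt      # the link-structure file that convert-ll reads in the next step

(The exact set of report files depends on the linklint options you chose.)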

convert-ll

Assuming your current directory is still $WEBLOG_DATA/site-2d-tested/webstruct/, you can invoke convert-ll.perl something like this:
$WEBMET_TOOLS/resources/weblink/convert-ll.perl  dir=.
This causes convert-ll to read fileF.txt and then generate three output files in the current directory:
  • pages.dat describes the pages in the website. The first record of the file contains the hostname (e.g. "HOST=quackmost.com"). Subsequent records each represent a page in the website. Each record has three fields: a unique nickname of the page, the type of the page, and the URL. Also, synonymous URLs are collapsed (e.g. xxx == xxx/ == xxx/index.html) by assigning them the same nickname. Thus pages.dat contains a mapping from nickname to URL.
  • links.dat describes the static links within the website. Each record contains a pair of nicknames, indicating a link from the first-named page to the second. Convert-ll may write out some redundant records if there are multiple links between the same pair of pages. This will not cause a problem for VisVIP, but if you want to clean up the file, you can use such utilities as "sort" and "uniq" in Unix, e.g.
       sort links.dat | uniq > ulinks.dat
       mv ulinks.dat links.dat
    
  • url2nn.dat provides the reverse mapping from URL to nickname. In the case of aliases, this may be a many-to-one mapping.
convert-ll.perl accepts parameters to indicate alternative input and output directories, e.g.:
$WEBMET_TOOLS/resources/weblink/convert-ll.perl  indir=here outdir=there
would cause the input to be read from here/fileF.txt and the output written to there/pages.dat, there/links.dat, and there/url2nn.dat.
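
As a quick sanity check on the three generated files (assuming the output was written to the current directory, as in the first invocation above, and assuming only that each record is one line, as described above), you might do something like:

   wc -l pages.dat links.dat url2nn.dat    # rough counts of pages, links, and URL entries
   head pages.dat                          # the first record should be the HOST= entry
   sort links.dat | uniq -d                # show any duplicate link records

If the last command prints anything, the sort/uniq cleanup shown above will remove the duplicates before the file is handed to VisVIP.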


Version 3.0
Page last modified: 15 May 2002
National Institute of Standards and Technology (NIST)