GenTree utility for TreeDec
1. Operation
1.1 Setting up GenTree input
To capture the static link structure of a website, use the free
software linklint together with the utility convert-ll
as described in the system documentation.
As a result, two files will be generated: pages.dat and links.dat.
1.2 Invoking GenTree
GenTree accepts one mandatory parameter and three optional parameters,
each written in keyword=value format. GenTree should be executed from
the parent directory of the website to be decorated.
1.2.1 root
The root parameter is mandatory. It specifies the pathname of the
file to be used as the root of the tree, and is written as
root=pathname.
1.2.2 indir
The indir parameter (indir=input_directory) tells GenTree in
which directory to find the pages.dat and links.dat files. The
default is "treedec".
1.2.3 outdir
The outdir parameter (outdir=output_directory) tells GenTree in
which directory to write its output files (see below). The default is
"treedec".
1.2.4 dir
The dir parameter (dir=inout_directory) is a shortcut that sets both
indir and outdir, for the case where input and output files are in
the same directory. The default is "treedec".
1.2.5 Example
Consider this command line:
gentree.perl root=mysite/home.html indir=/home/blip/data
This tells GenTree that:
- the root page is mysite/home.html (which it will expect to find in
  pages.dat)
- it should read input from /home/blip/data/pages.dat and
  /home/blip/data/links.dat
- since outdir is not given, it should write its output to the default
  treedec/ directory.
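The way the example resolves can be traced with a short Python sketch of keyword=value parsing. This is not GenTree's own code: the names parse_params and DEFAULT_DIR are hypothetical, and the rule that an explicit indir or outdir takes precedence over dir is an assumption, since the documentation does not state which wins.

```python
# Sketch only: mirrors the documented parameters, not GenTree's actual source.
DEFAULT_DIR = "treedec"


def parse_params(args):
    """Parse keyword=value arguments and resolve the input/output directories."""
    allowed = {"root", "indir", "outdir", "dir"}
    given = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or key not in allowed or not value:
            raise ValueError(f"expected keyword=value, got: {arg!r}")
        given[key] = value
    if "root" not in given:
        raise ValueError("the root parameter is mandatory")
    # Assumption: dir supplies both directories, and an explicit
    # indir or outdir overrides it. The documentation does not say.
    base = given.get("dir", DEFAULT_DIR)
    return {
        "root": given["root"],
        "indir": given.get("indir", base),
        "outdir": given.get("outdir", base),
    }


params = parse_params(["root=mysite/home.html", "indir=/home/blip/data"])
print(params)
# {'root': 'mysite/home.html', 'indir': '/home/blip/data', 'outdir': 'treedec'}
```

Under these assumptions, only outdir falls back to "treedec", which matches the last item in the list above.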
2. GenTree Output
2.1 Message file
GenTree writes informational and error messages to STDOUT and to a
message file, named gen-td-msg.txt.
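Writing each message to two destinations at once can be sketched in Python with the standard logging module. This only illustrates the behavior described above and is not how GenTree is implemented; the function name make_logger, the message format, and the choice to place the file in a caller-supplied directory are all assumptions.

```python
import logging
import sys
from pathlib import Path


def make_logger(directory, filename="gen-td-msg.txt"):
    """Return a logger that writes every message to STDOUT and to a file."""
    # Assumption: the caller decides which directory holds the message file.
    path = Path(directory) / filename
    path.parent.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(f"gentree-sketch:{path}")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()      # avoid duplicate output on repeated calls
    logger.propagate = False     # keep messages out of the root logger

    fmt = logging.Formatter("%(levelname)s: %(message)s")
    handlers = (
        logging.StreamHandler(sys.stdout),
        logging.FileHandler(path, mode="w", encoding="utf-8"),
    )
    for handler in handlers:
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```

With this sketch, a call such as make_logger("treedec").error("root page not found") would print the line and also record it in treedec/gen-td-msg.txt.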
2.2 Tree file
GenTree writes a tree file, named td-tree.dat. It will overwrite any
existing file of that name, so rename any earlier versions you want
to keep. The tree file conforms to the format requirements of
TreeDec.
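One way to keep an earlier tree file is to rename it with a timestamp before running GenTree again. The Python sketch below is a hypothetical helper, not part of GenTree; it assumes the tree file sits in the output directory, and the name preserve_tree_file and the backup naming scheme are illustrative choices.

```python
import time
from pathlib import Path


def preserve_tree_file(outdir="treedec", name="td-tree.dat"):
    """Rename an existing tree file so the next GenTree run cannot overwrite it.

    Returns the new path, or None if there was nothing to rename.
    """
    tree = Path(outdir) / name
    if not tree.exists():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = tree.with_name(f"{tree.stem}-{stamp}{tree.suffix}")
    # Avoid clobbering an earlier backup made within the same second.
    counter = 1
    while backup.exists():
        backup = tree.with_name(f"{tree.stem}-{stamp}-{counter}{tree.suffix}")
        counter += 1
    tree.rename(backup)
    return backup
```

Calling preserve_tree_file() before each run would leave files such as td-tree-20240101-120000.dat next to the fresh td-tree.dat.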
The tree is generated by performing a breadth-first
expansion from the specified root.
Note that breadth-first expansion is not guaranteed to produce a
satisfactory logical tree: links within pages may skip over logical
levels, the order of siblings may not be correct, and so on.
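A minimal Python sketch of breadth-first expansion shows how a level can be skipped. It is not GenTree's implementation: the function name bfs_tree, the in-memory links dictionary, and the page names are hypothetical, and the actual layout of pages.dat and links.dat is not assumed here.

```python
from collections import deque


def bfs_tree(root, links):
    """Build a tree by breadth-first expansion of a link graph.

    links maps each page to the list of pages it links to, in page order.
    Returns a dict mapping each reachable page to its parent (root -> None).
    """
    parent = {root: None}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in parent:
                # The first page found linking to target becomes its parent.
                parent[target] = page
                queue.append(target)
    return parent


# Hypothetical site: widget.html logically belongs under products.html,
# but home.html also links to it directly.
links = {
    "home.html": ["products.html", "widget.html"],
    "products.html": ["widget.html"],
    "widget.html": ["home.html"],
}
print(bfs_tree("home.html", links))
# {'home.html': None, 'products.html': 'home.html', 'widget.html': 'home.html'}
```

Because home.html links directly to widget.html, the expansion places widget.html as a child of home.html rather than of products.html, which is the kind of skipped level described above.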
Moreover, the title field within each record is set to "*",
indicating that the title should be taken from the <title> element
of each HTML file. This may be acceptable, but you may get more
usable results by specifying the title content directly.
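To decide which titles are worth specifying directly, it can help to see what each page's <title> element currently contains. The Python sketch below is a review aid only; it does not show how TreeDec reads titles, and the names TitleParser and page_title are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element in a document."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._parts = []
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            # Collapse runs of whitespace and trim the ends.
            self.title = " ".join("".join(self._parts).split())

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)


def page_title(html_text):
    """Return the page title, or None if the document has no <title>."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title
```

Running page_title over the text of each page gives a quick list of titles that are empty, duplicated, or too long to serve as tree labels.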
3. Next Step: TreeDec Itself
See the TreeDec documentation for the next step.