WDCMS in depth
WDCMSR: the Rust version of WDCMS
To build a DOM from an HTML page, I use html_editor, which is not as complete as Beautiful Soup but very usable. (To be honest, I could not work out how to use the canonical html5ever.)
The remarks for the Rust version (WDCMSR) are the same as for the Python version.
Traversing the source tree, generating the destination tree
Note: all directory names must be free of spaces. Use underscores '_' instead.
Let us call the directory tree where the sources live the "source tree" or "source", and the tree that will be generated by wdcms.py the "destination tree" or "destination".
The start of the source tree is recognized by the presence of two files: "wdcms.root" and "template.html". The first one is unique within the source: there can be only one. This file, "wdcms.root", contains a line like:
When wdcms.py is run, it searches for these two files, going up in the file system until it finds a directory containing both. That directory is the root of the source tree.
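The upward search described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from wdcms.py; the function name find_source_root and the constant MARKERS are made up for this example.

```python
from pathlib import Path
from typing import Optional

# The two files that together mark the root of the source tree.
MARKERS = ("wdcms.root", "template.html")


def find_source_root(start: Path) -> Optional[Path]:
    """Walk from `start` towards the filesystem root and return the first
    directory that contains both marker files, or None if there is none.
    (Hypothetical helper, not taken from wdcms.py.)"""
    start = start.resolve()
    for directory in (start, *start.parents):
        if all((directory / name).is_file() for name in MARKERS):
            return directory
    return None
```

Calling find_source_root(Path.cwd()) from anywhere inside a source tree would then return the directory that holds both files.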
The root of the destination tree is found in the file "wdcms.root" as shown above. In this case the root of the destination tree will be
Starting in the root of the source tree, wdcms.py visits every directory recursively, going from the root to the end leaves. In doing so, it constructs at the destination a directory tree with the same names as in the source tree, with the exception of the root directory: that name is given in "wdcms.root".
Usage of wdcms.py:
- take care that wdcms.py can be found using your PATH environment variable, for example by placing wdcms.py in /usr/local/bin, or $HOME/bin.
- create a file 'wdcms.root' in the start directory of your source tree as described above.
- create a file 'template.html' in the start directory of your source tree. As a start, you could take the file from the download section.
- cd to a directory in your source tree.
- Call wdcms:
- wdcms.py : the program finds the root of your source tree and from that point recursively traverses all directories (except wdcms.copy).
- wdcms.py . : only the directories on the path from the root of the source tree down to the current directory are visited.
- wdcms.py / : the same as the previous one, except that the subtree below the current directory is also traversed recursively.
Example, you have this source tree:
start/
├── one/
├── three/
└── two/
    └── four/
        ├── five/
        └── six/
You cd to start/two, and give the command:
- wdcms.py : visit all directories, i.e. "start", "one", "three", "two", "four", "five", "six"
- wdcms.py . : visit "start", "two"
- wdcms.py / : visit "start", "two", "four", "five", "six"
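The three cases in this example can be reproduced with a small Python sketch. It is derived only from the example above and is not the selection logic of wdcms.py itself; the function name directories_to_visit is an assumption, and the alphabetical order of sibling directories is chosen for reproducibility.

```python
from pathlib import Path
from typing import List


def directories_to_visit(root: Path, cwd: Path, arg: str = "") -> List[Path]:
    """Return the directories handled for 'wdcms.py', 'wdcms.py .' and
    'wdcms.py /' (hypothetical helper, modelled on the example above)."""

    def subtree(top: Path) -> List[Path]:
        # Depth-first walk; wdcms.copy is never traversed.
        found = [top]
        for child in sorted(p for p in top.iterdir() if p.is_dir()):
            if child.name != "wdcms.copy":
                found.extend(subtree(child))
        return found

    if arg == "":
        return subtree(root)

    # Directories on the path from the root down to cwd, cwd included.
    chain = [root]
    for part in cwd.relative_to(root).parts:
        chain.append(chain[-1] / part)

    if arg == ".":
        return chain
    if arg == "/":
        return chain[:-1] + subtree(cwd)
    raise ValueError(f"unknown argument: {arg!r}")
```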
In wdcms.root you set the start location of the destination tree:
outputdir where to place the generated website. Default: "/tmp/website"
parallel (optional) how many processes to run while converting content.txt to index.html. Possible values:
- -1 : default (the number of CPUs available)
- 0 : no parallel execution
- 1 : no parallel execution
- n : use n processes
Default: -1. This option only has effect on systems that support the fork() system call (not MS-Windows).
Example of a wdcms.root file:
outputdir: /home/willem/public_html/website
parallel: 6
The root of your source tree must contain a file wdcms.root.
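As an illustration of how these two options could be read and interpreted, here is a minimal Python sketch. The names parse_root_file and process_count are invented for this example, and treating every "key: value" line the same way is an assumption; the defaults and the meaning of -1, 0, 1 and n follow the description above.

```python
import os
from typing import Dict

# Defaults as documented above.
DEFAULTS = {"outputdir": "/tmp/website", "parallel": "-1"}


def parse_root_file(text: str) -> Dict[str, str]:
    """Parse 'key: value' lines from wdcms.root (assumed format)."""
    config = dict(DEFAULTS)
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            config[key.strip()] = value.strip()
    return config


def process_count(parallel: int) -> int:
    """Translate the 'parallel' option into a number of processes."""
    if not hasattr(os, "fork"):
        return 1                    # no fork(): the option has no effect
    if parallel == -1:
        return os.cpu_count() or 1  # default: number of CPUs available
    return max(1, parallel)         # 0 and 1 both mean: not parallel
```

With the example file above, parse_root_file gives outputdir "/home/willem/public_html/website" and parallel "6".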
template.html defines the theme that is used by wdcms.py to create the pages. It is inherited from the parent directory.
The current theme is inspired by Drupal's Bartik theme.
In the download section, you find a zip file with a complete example.
The root of your source tree must contain a file template.html.
.css and .js
Files with suffix .css or .js are copied from the source tree to the destination tree, on the same level. Moreover, a reference to these files is placed in the <head> of the generated "index.html". The children of this directory also get these references in the <head>.
The contents of "head.html", if any, are copied to the <head> of the generated "index.html". These contents are inherited from the parent directory and augmented with the content of the current head.html.
With "menu.txt" you determine the main-menu bar. An example of "menu.txt":
# Note: all directories named here must exist, otherwise: mayhem
menu: / : Home
menu: /alracTTG : Tennis tournament schedule generator
menu: /findent : Indent Fortran sources
menu: /wsnow : Let it snow in your browser
menu: /xsnow : Let it snow on your desktop
menu: /software : Even more software
menu: /wdcms : How I made this website
menu: /drupal : Drupal content management
menu: /contact : Write me a letter
The first column contains the text "menu:". Then follows the directory name, and then the text that becomes visible when you hover over the link. Note that the directories must exist. If not, the program gives an error message complaining that some file or directory cannot be found.
menu.txt is inherited from the parent directory and overwritten by the current one, if any.
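Reading such a file is straightforward. The following Python sketch is an assumption about how it could be done, not the code used by wdcms.py; parse_menu and missing_directories are hypothetical names.

```python
from pathlib import Path
from typing import List, Tuple


def parse_menu(text: str) -> List[Tuple[str, str]]:
    """Return (directory, text) pairs from the lines of a menu.txt.
    Lines starting with '#' and empty lines are skipped."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = [field.strip() for field in line.split(":", 2)]
        if len(fields) == 3 and fields[0] == "menu":
            entries.append((fields[1], fields[2]))
    return entries


def missing_directories(entries: List[Tuple[str, str]], source_root: Path) -> List[str]:
    """List the menu directories that do not exist below source_root."""
    return [
        directory
        for directory, _ in entries
        if not (source_root / directory.lstrip("/")).is_dir()
    ]
```

Checking the entries with missing_directories before generating any page gives a clearer message than the "cannot be found" error mentioned above.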
The file "trans.txt" contains translations. These come in very handy: suppose there are many references in the website for the version number of findent. If you put this line in "trans.txt":
and put in the places where you need this version number YYYfindent-versionZZZ, then 3.1.7 will be inserted, for example this line in content.txt:
The current version of findent is YYYfindent-versionZZZ
will be translated to:
The current version of findent is 3.1.7
If no translation is given for findent-version, then findent-version is output. The translations also take place in "template.html". This is how different logos are defined for different chapters.
The translations are inherited from the parent directory and augmented and/or overwritten by the translations in the current directory.
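The substitution and the inheritance rule can be sketched as follows in Python. The translation table is modelled as a plain dict; how "trans.txt" is parsed into that dict is left out, and the names translate and inherit are made up for this example.

```python
import re
from typing import Dict

# YYY<name>ZZZ, with the shortest possible <name>.
PLACEHOLDER = re.compile(r"YYY(.+?)ZZZ")


def translate(text: str, trans: Dict[str, str]) -> str:
    """Replace every YYYnameZZZ by its translation; if there is no
    translation for name, name itself is output."""
    return PLACEHOLDER.sub(lambda m: trans.get(m.group(1), m.group(1)), text)


def inherit(parent: Dict[str, str], current: Dict[str, str]) -> Dict[str, str]:
    """Translations of the parent directory, augmented and/or overwritten
    by those of the current directory."""
    merged = dict(parent)
    merged.update(current)
    return merged
```

For example, translate("The current version of findent is YYYfindent-versionZZZ", {"findent-version": "3.1.7"}) gives the sentence with 3.1.7 filled in.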
Why ZZZ and YYY are chosen
Wouldn't it be better to choose (: and :) or something? The problem is that the text is handled by the markdown parser and by BeautifulSoup. Both have their peculiarities in trying to do their job as well as possible. This results in, for example, '[' being translated to '!5B' (read the exclamation mark as a percent sign).
So I decided to use normal alphabetic characters, hoping that very few texts contain something like YYYsomethingZZZ.
How to get YYYsomethingZZZ on the output
This is a special one: we run into all kinds of problems if we have a translation that results in ZZZ or YYY. But I want to document wdcms, so I invented two special cases: one for leftq (yyy) and one for rightq (zzz). Instead of an entry YYYleftqZZZ, the translation table gets one with lowercase y's and z's. The same goes, of course, for YYYrightqZZZ.
In the final replacement stage, a special step is taken to convert these aberrant strings.
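One possible reading of this two-stage trick, sketched in Python: leftq and rightq first translate to the lowercase strings, and a last pass turns those into uppercase. The helper names are hypothetical, and the blunt str.replace in the final step is an assumption; the real program may well be more careful about lowercase yyy or zzz that occur in ordinary text.

```python
import re

# Assumed content of the translation table for the two special cases.
TRANS = {"leftq": "yyy", "rightq": "zzz"}


def translate(text: str, trans: dict) -> str:
    """Ordinary translation pass: YYYnameZZZ becomes trans[name]."""
    return re.sub(r"YYY(.+?)ZZZ", lambda m: trans.get(m.group(1), m.group(1)), text)


def restore_markers(text: str) -> str:
    """Final stage: convert the lowercase stand-ins to the real markers."""
    return text.replace("yyy", "YYY").replace("zzz", "ZZZ")


source = "Write YYYleftqZZZsomethingYYYrightqZZZ in content.txt"
intermediate = translate(source, TRANS)   # 'Write yyysomethingzzz in content.txt'
final = restore_markers(intermediate)     # 'Write YYYsomethingZZZ in content.txt'
```

Because the uppercase markers only appear after the last translation pass, they are never mistaken for a placeholder.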
At the start of the program, the following translations are defined:
wdcms_version: the version of wdcms.py
wdcms_date: date + time when wdcms.py is run
leftq: special treatment, see above
rightq: special treatment, see above
The contents (files and directories) of the directory "wdcms.copy" in the source are copied to the destination. This can for example be used to define a .htaccess file, or to copy a directory with logos.
The directory "wdcms.copy" only works at the level it is defined, no inheritance from other levels.
In settings.txt you can set the following parameters:
- format the default format of the content. Default: "pandoc"
- superfish whether to use superfish. Default: "yes"
- parser which parser BeautifulSoup will use. Default: "html.parser". Sane choices are "html.parser", "lxml", "html5lib"
Example of a settings.txt file:
format: mediawiki
superfish: no
parser: lxml # this parameter should be considered as "advanced"
The settings are inherited from the parent directory and augmented and/or replaced by the settings in the current directory.
For example you could have an input directory sub-tree coded in mediawiki. Only the first directory in this tree would need a settings.txt file.
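The inheritance of settings can be illustrated with a short Python sketch. It assumes that settings.txt consists of "key: value" lines with optional '#' comments, as in the example above; parse_settings and effective_settings are names invented for this illustration.

```python
from typing import Dict, Iterable

# Defaults as documented above.
DEFAULT_SETTINGS = {"format": "pandoc", "superfish": "yes", "parser": "html.parser"}


def parse_settings(text: str) -> Dict[str, str]:
    """Parse 'key: value' lines; everything after '#' is a comment."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0]
        key, sep, value = line.partition(":")
        if sep and key.strip():
            settings[key.strip()] = value.strip()
    return settings


def effective_settings(chain: Iterable[str]) -> Dict[str, str]:
    """`chain` holds the text of each settings.txt from the root of the
    source tree down to the current directory ('' where there is none).
    Deeper directories augment and/or replace what they inherit."""
    settings = dict(DEFAULT_SETTINGS)
    for text in chain:
        settings.update(parse_settings(text))
    return settings
```

A mediawiki sub-tree therefore needs only one settings.txt, in its first directory: every directory below it sees format "mediawiki" through the chain.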
Text in the footer
The texts in footer-left and footer-right are defined in "trans.txt"; have a look at "template.html" and "trans.txt".
Luckily, until now I found no reason to introduce an escape character for wdcms. Note, however, that markdown uses the escape character '\', so I had to use two of them here.
Even more translations
Links, labelled with "href=" or "src=", starting with a "/" get a special treatment.
- /foo/bar will be translated to something like ./../../foo/bar, the number of "../" depending on the level in the source tree.
- //foo/bar will be translated to /foo/bar: probably the real root of your website
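A minimal Python sketch of these two rules follows. The function name rewrite_link and the parameter depth (the number of levels the current directory lies below the root of the source tree) are assumptions made for this example.

```python
def rewrite_link(url: str, depth: int) -> str:
    """Rewrite the value of an href= or src= attribute.

    depth: how many levels the current directory is below the root of
    the source tree (0 for the root itself)."""
    if url.startswith("//"):
        # Escape hatch: a link to the real root of the web server.
        return url[1:]
    if url.startswith("/"):
        # Root of the generated site, expressed relative to this page.
        return "./" + "../" * depth + url[1:]
    return url  # all other links are left alone
```

In a directory two levels below the root, rewrite_link("/foo/bar", 2) gives "./../../foo/bar", and rewrite_link("//foo/bar", 2) gives "/foo/bar".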
In all link texts, '_' is replaced by ' ', for example:
Formatters: pandoc, cmark, mediawiki ...
wdcms.py is willing to use the following filters for the content:
- pandoc using pandoc's markdown filter
- cmark using cmark (faster than pandoc)
- mediawiki using pandoc with mediawiki as filter
- markdown Python's markdown filter (only for the Python version).
- markdown2 Python's markdown2 filter: fast and more complete than "markdown" (only for the Python version).
- asis no conversion at all
The default is pandoc; you can easily change that in the program.
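How the external filters might be called is sketched below in Python. This is an assumption about the mechanics, not a copy of wdcms.py: the table FILTER_COMMANDS and the function convert are hypothetical, and the two Python-only filters (markdown, markdown2) are left out because they are libraries rather than commands. The pandoc options --from and --to and the behaviour of cmark (reading standard input, writing HTML) are those of the real programs.

```python
import subprocess

# Command line per filter; each command reads the content on stdin
# and writes HTML on stdout.
FILTER_COMMANDS = {
    "pandoc": ["pandoc", "--from", "markdown", "--to", "html"],
    "cmark": ["cmark"],
    "mediawiki": ["pandoc", "--from", "mediawiki", "--to", "html"],
}


def convert(text: str, fmt: str = "pandoc") -> str:
    """Convert the text of a content.txt to HTML with the chosen filter."""
    if fmt == "asis":
        return text  # no conversion at all
    if fmt not in FILTER_COMMANDS:
        raise ValueError(f"unknown format: {fmt}")
    result = subprocess.run(
        FILTER_COMMANDS[fmt],
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise if the filter exits with an error
    )
    return result.stdout
```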
Normally, the filter is chosen in "settings.txt", but the filter can also be chosen on a per-content basis. If the first line of a content.txt looks like:
then mediawiki is chosen for that document.
What happens in wdcms.py
General philosophy: all internal links will be relative. So you can run the generated website everywhere, even without a web server. SEO adepts are getting a bit nervous now: shouldn't you always use absolute urls? Well, I don't think so: Google and company should be clever enough to handle relative urls. BTW: it is not difficult to adapt the program to emit absolute urls, starting with https://....
The program defines a number of internal variables. In this text we call them as follows:
- CONTENT: the content
- CSS: names of .css files encountered
- HEAD: text to be added to the <head> section of the output
- JS: names of .js files encountered
- MENU: the main menu, defined in menu.txt
- PAGE: an html page, unparsed
- SETTINGS: defines parser, superfish, format
- SOUP: a parsed html page
- TEMPLATE: the template to be used
- TRANS: the translations
Here we go:
- The directory tree is scanned backwards for the existence of "wdcms.root" and "template.html", and the program steps into that directory.
- The file "wdcms.root" is read to determine the desired root of the destination tree.
- TRANS is filled with some standard translations: leftq and rightq.
- A recursion through the source tree is started.
Actions are taken if the following files or directories are found:
- wdcms.copy: if that is a directory, its contents (files and directories) are copied to the destination.
- template.html: the content of that file is copied to TEMPLATE.
- head.html: its content is added to HEAD.
- *.css: these files are copied to the destination, and their names added to CSS.
- *.js: these files are copied to the destination, and their names added to JS.
- content.txt: its content is copied to CONTENT.
- trans.txt: its content is parsed and added to TRANS.
- settings.txt: its content is parsed and added to SETTINGS.
- menu.txt: its content is parsed and put in MENU.
- then the following actions are performed:
- TEMPLATE is parsed (using BeautifulSoup) giving SOUP.
- The main menu is filled in, using MENU.
- If superfish should be activated, the main menu is changed accordingly.
- HEAD is added to SOUP's <head>.
- References to the .css files named in CSS are placed in SOUP's <head>.
- References to the .js files named in JS are placed in SOUP's <head>.
- In SOUP's left-top menu, references to child directories are constructed.
- In SOUP's left-bottom menu, references to parent directories are constructed.
- Apply a filter on CONTENT, convert that with BeautifulSoup, and place the result in SOUP's <div id="content" >.
- SOUP is converted to PAGE.
- PAGE is translated using TRANS.
- PAGE is converted to SOUP.
- The href's and src's urls are converted (see above).
- SOUP is converted to PAGE.
- PAGE is written to "index.html" in the destination tree.
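The way the inherited variables change from parent to child directory can be summarised in a small Python sketch. It covers only TEMPLATE, HEAD, MENU, CSS and JS; TRANS and SETTINGS would be merged key by key in the same spirit. The class State and the function enter_directory are invented for this illustration, and adjusting the paths of inherited .css and .js files for the deeper level is left out.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Tuple


@dataclass(frozen=True)
class State:
    """The inherited part of the internal variables (hypothetical)."""
    template: str = ""
    head: str = ""
    menu: str = ""
    css: Tuple[str, ...] = ()
    js: Tuple[str, ...] = ()


def read_if_present(path: Path) -> Optional[str]:
    return path.read_text(encoding="utf-8") if path.is_file() else None


def enter_directory(parent: State, directory: Path) -> State:
    """Combine what is inherited from the parent with the files found
    in `directory`."""
    template = read_if_present(directory / "template.html")
    head = read_if_present(directory / "head.html")
    menu = read_if_present(directory / "menu.txt")
    return State(
        # template.html and menu.txt replace the inherited value
        template=parent.template if template is None else template,
        menu=parent.menu if menu is None else menu,
        # head.html and the .css/.js names are added to what was inherited
        head=parent.head + (head or ""),
        css=parent.css + tuple(sorted(p.name for p in directory.glob("*.css"))),
        js=parent.js + tuple(sorted(p.name for p in directory.glob("*.js"))),
    )
```

Because State is immutable, each level of the recursion gets its own copy, and sibling directories cannot influence each other.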
wdcms.py was written without much care for efficiency. Experiments showed that most of the time is spent in "pandoc", so there is no real need to optimize wdcms.py itself.
As stated above, there is no reason to optimize wdcms.py itself, but running more than one "pandoc" in parallel should lower the running time. So in the Python version, a crude parallelism is implemented using multiprocessing.Pool. A test is done to see if this is available. On my system I observed a 40%-50% reduction in running time when creating this website. For details, see the description of wdcms.root elsewhere on this page.
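The idea can be sketched as follows in Python. This is a simplified stand-in, not the code from wdcms.py: convert_page only pretends to do the expensive work, and both function names are assumptions. multiprocessing.Pool and its map method are used as documented in the Python standard library.

```python
import multiprocessing
from typing import List, Sequence


def convert_page(directory: str) -> str:
    """Stand-in for the expensive work per directory (running pandoc,
    filling the template, writing index.html)."""
    return directory + "/index.html"


def convert_all(directories: Sequence[str], processes: int) -> List[str]:
    """Convert all directories, in parallel if processes > 1."""
    if processes <= 1:
        # 0 or 1: no parallel execution
        return [convert_page(d) for d in directories]
    with multiprocessing.Pool(processes=processes) as pool:
        # Each worker process handles part of the list.
        return pool.map(convert_page, directories)
```

When the parallel branch is used from a script, the call to convert_all belongs under an if __name__ == "__main__": guard, so that worker processes do not start the conversion again when they import the script.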
In the Rust version, parallelization is done with the fork mechanism from the "nix" crate. This will probably run only on Unix-like systems; an attempt has been made to detect the environment and disable the fork code on other systems.
External packages used
The system uses the following external packages:
Parsing and modifying HTML is done with BeautifulSoup. BeautifulSoup has been specially tailored to handle imperfect HTML. The documentation is very adequate, and maintenance is still going on. As you can see in the source of wdcms.py, you can choose between a number of parsers to be used by BeautifulSoup. I chose "html.parser", but I found that the others also work.
- markdown Python's standard markdown parser
- markdown2 Python's markdown parser with extensions
Optionally (see "settings.txt"), the main menu can be enriched with drop-down menus. This feature uses Superfish.