Pathway to Parakeet

Just over a year ago I started generating this blog not with WordPress but with a Perl script. The first version, DmBlog, was a dirty hack but proved the point. I then expanded this script into DmSite, which was capable of building my whole site.

DmSite worked, just about, but was pretty creaky and in no way could be used by anyone else. I decided it was worth a bit of a refactor to tidy it up and the resulting project is Parakeet.

Process

This has taken a while to get done and I regretted starting it more than once. The code was utterly broken for a long time and it’s hard to stay motivated when working on something which is days or weeks away from even compiling. Some of the side-tasks took time as well. For example, as part of making this releasable I needed to remove all references to myself in the code and move them into a config file. That led to the building of the ValidatedConfig Perl module, which I hope to get some more use out of for other projects.

I’ve finally got to the point where Parakeet is capable of generating my site and I’ve moved to it from DmSite. It’s not officially released yet as I have some issues to deal with, but the code is on GitHub so it is kinda-released and there are installation instructions for the curious. Call it 0.00001 pre-alpha for now.

What’s new

Parakeet doesn’t do a lot more than DmSite did, considering the time spent on it, but the design is much cleaner. Where the output generation in DmSite was a mixture of XML processing, blocks of HTML held as strings, special methods etc., in Parakeet each output file is the result of a content object being run through several stages of processing.

For a page:

  1. The input is parsed into an intermediate XML format
  2. Addresses in href/src attributes are parsed and resolved
  3. The XML has headers and footers added according to the config and theme settings
  4. The XML is transformed into HTML by the default renderer and into limited HTML to be included in ATOM feeds
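
To make the staged design concrete, here is a minimal sketch in Python (Parakeet itself is written in Perl). The Content class, the stage function names and the string stand-ins for XML are all hypothetical and only illustrate the idea of one content object passing through a fixed list of stages; they are not Parakeet's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Content:
    """Hypothetical content object carried through the pipeline."""
    source: str
    body: str = ""
    log: List[str] = field(default_factory=list)


def parse(c: Content) -> Content:
    # Stage 1: turn the raw source into an intermediate form.
    c.body = f"<para>{c.source}</para>"
    c.log.append("parse")
    return c


def resolve_addresses(c: Content) -> Content:
    # Stage 2: placeholder for href/src resolution (no-op in this sketch).
    c.log.append("resolve")
    return c


def wrap(c: Content) -> Content:
    # Stage 3: add a header and footer around the body.
    c.body = f"<header/>{c.body}<footer/>"
    c.log.append("wrap")
    return c


def render(c: Content) -> Content:
    # Stage 4: transform the intermediate form into HTML.
    c.body = c.body.replace("<para>", "<p>").replace("</para>", "</p>")
    c.log.append("render")
    return c


STAGES: List[Callable[[Content], Content]] = [parse, resolve_addresses, wrap, render]


def build(source: str) -> Content:
    """Run one content object through every stage in order."""
    content = Content(source)
    for stage in STAGES:
        content = stage(content)
    return content
```

Because every stage takes and returns the same kind of object, a second renderer (for example, one producing the limited HTML for feeds) could in principle be swapped in for the last stage without touching the others.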

The feeds themselves are content objects, so you could define a custom ATOM feed by placing an appropriate source file in your site directory:

title: My best posts
address: /blog/pathway-to-parakeet/index.page
address: /blog/panasonic-42.5/index.page
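
Purely as an illustration of how a source file like this could be read, here is a small Python sketch. It assumes the format is nothing more than "key: value" lines where a key may repeat; the real Parakeet format may well have more to it, and parse_feed_source is a made-up name.

```python
from collections import defaultdict
from typing import Dict, List


def parse_feed_source(text: str) -> Dict[str, List[str]]:
    """Collect 'key: value' lines, keeping every value for repeated keys."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {line!r}")
        fields[key.strip()].append(value.strip())
    return dict(fields)
```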

The dependency management is much more reliable. DmSite would recreate changed pages but didn’t have any concept of pages depending on other pages. Whenever I added a new blog post I’d have to manually touch the previously-newest post or the “Next” link would never be added. Occasionally I forgot to do this. If I changed a CSS file which was being inlined in some pages I’d have to manually trigger a rebuild of all pages.

Because every link, even those in the headers and footers, is picked up and parsed, Parakeet has an understanding of what needs to be rebuilt. Plus it understands that content is dependent on configuration, which can be specified at an item, directory or site level.
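
The general technique here is a walk over a reverse dependency graph. The following Python sketch shows the idea under simple assumptions: the graph is a plain dictionary, config files are treated as ordinary nodes, and the function name and example addresses are invented for illustration rather than taken from Parakeet.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set


def needs_rebuild(depends_on: Dict[str, Iterable[str]], changed: Iterable[str]) -> Set[str]:
    """Return every item that depends, directly or indirectly, on a changed item."""
    # Invert the graph: for each item, record which items depend on it.
    dependants = defaultdict(set)
    for item, deps in depends_on.items():
        for dep in deps:
            dependants[dep].add(item)

    # Breadth-first walk outwards from the changed items.
    dirty = set(changed)
    queue = deque(dirty)
    while queue:
        current = queue.popleft()
        for item in dependants[current]:
            if item not in dirty:
                dirty.add(item)
                queue.append(item)
    return dirty
```

With a graph in which the older post links to the newer one, changing the newer post marks both as dirty, which is exactly the “Next” link case that used to need a manual touch.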

Other notable improvements include:

  • Theme support (but only the one theme so far)
  • The concept of build modes to make local previews easier and also control which content is or isn’t uploaded
  • Automatic tracking of publication/updated times
  • Address parsing means I can write links as absolute to the site, e.g. <a href="/coding/index.page">, which are output as relative links so still work when read from local disc
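
The absolute-to-relative rewriting in the last point can be sketched in a few lines of Python using the standard posixpath module. This is only an illustration of the path arithmetic; to_relative is a hypothetical helper, and it ignores anything else the real address stage does, such as mapping source names to output names.

```python
import posixpath


def to_relative(link: str, current_page: str) -> str:
    """Rewrite a site-absolute link relative to the page that contains it."""
    # Leave relative links, external URLs and protocol-relative links alone.
    if not link.startswith("/") or link.startswith("//"):
        return link
    page_dir = posixpath.dirname(current_page)
    return posixpath.relpath(link, start=page_dir)
```

For a page at /blog/pathway-to-parakeet/index.page, a link written as /coding/index.page comes out as ../../coding/index.page, which resolves the same way on a web server and on local disc.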

What’s still broken

Parsing

The single biggest outstanding issue is parsing of page sources. One of the main features of this kind of site builder is that it lets the user write without worrying about writing HTML syntax for basic content. For example, the ability to write:

---++ Some header
This is a paragraph.

This is another paragraph.

   * Bullet list item 1
   * Bullet list item 2

Plus, on top of that, there is turning double back-ticks and apostrophes into smart quotes (LaTeX style), easy escaping of pasted code, and so on.
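
The smart-quote substitution is the simplest of these to show. A minimal Python sketch, assuming the only rule is that a pair of back-ticks opens a quotation and a pair of apostrophes closes it (smart_quotes is a made-up name, and a real parser would also need to skip code spans):

```python
def smart_quotes(text: str) -> str:
    """Convert LaTeX-style ``quotes'' into curly double quotes."""
    # U+201C is the left double quotation mark, U+201D the right one.
    return text.replace("``", "\u201c").replace("''", "\u201d")
```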

This is all taken care of by the markup parser, which turns the input into an XML document containing special elements to be interpreted by the renderer which generates HTML. The current parser, largely unchanged from DmSite, is... not great. There are some combinations of theoretically valid markup which don’t get interpreted properly. For example, I have to rework sentences so they don’t start with a link, as that will throw off the paragraph detection.

It all needs a big rework, and Parakeet won’t get a version number[1] until that is done.

Speed

Build performance is also a problem.

The first case is the “no change” time, i.e. how long the build script takes to run through a site with no changes and finish. This site consists of 1500 content objects, roughly 1000 from the site itself and 500 which are referenced on other sites[2]. The build script currently takes about 20 seconds to run and find no changes. This feels too slow; I’d like it to be 10 seconds at most.

I’ve started putting timing points in to try and identify the bottlenecks. I did find one innocuous-looking loop which was adding 45 seconds to the build time due to some inefficient string casting. There is clearly fat to trim.
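
For anyone wanting to do the same kind of measurement, here is one way timing points might look, sketched in Python with only the standard library. The timed and report names are hypothetical, and this is not how Parakeet's own Perl instrumentation is written.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)  # label -> total seconds spent under that label


@contextmanager
def timed(label: str):
    """Accumulate wall-clock time spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] += time.perf_counter() - start


def report():
    """Return (label, seconds) pairs, slowest first."""
    return sorted(timings.items(), key=lambda kv: kv[1], reverse=True)
```

Wrapping each stage in something like `with timed("resolve addresses"):` and printing report() at the end makes a loop that dominates the build stand out immediately.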

The second performance case is the build speed when there are changes. This is partly determined by the performance of the various processing stages, but the bigger impact is actually dependency management.

If I change just this sentence and re-run, the build time is only a few seconds longer than the no-change case. But if I change the first paragraph the build time increases to about a minute. This is because the first paragraph is used as part of the summary which appears on the blog index page. The blog index page has a dependency link to all posts so any change to a blog post will trigger the rebuild process for the index. If the summary hasn’t changed no intermediate files for the blog index will update and so the impact stops there.

However, if the summary has changed then the blog index page output files will change. This triggers a regeneration of every page which has a dependency on the blog index. Now, at the top of this page you’ll see a navigation link back to the blog index, and that counts as a dependency[3]. This all means that any change to a blog summary causes every blog page to get partially rebuilt.
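
The behaviour described in the last two paragraphs, where a rebuild only spreads while outputs keep changing, can be sketched like this in Python. It assumes each item's output can be compared by hash against the previous run; the rebuild function, its arguments and the use of SHA-256 are illustrative choices, not a description of Parakeet's internals.

```python
import hashlib
from typing import Callable, Dict, List


def rebuild(
    start: str,
    dependants: Dict[str, List[str]],
    build: Callable[[str], str],
    hashes: Dict[str, str],
) -> List[str]:
    """Rebuild `start`, then only follow dependants of items whose output changed."""
    rebuilt = []
    queue = [start]
    seen = {start}
    while queue:
        item = queue.pop(0)
        output = build(item)
        rebuilt.append(item)
        digest = hashlib.sha256(output.encode("utf-8")).hexdigest()
        if hashes.get(item) == digest:
            continue  # output identical to last time: the impact stops here
        hashes[item] = digest
        for dep in dependants.get(item, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return rebuilt
```

In this model, editing a post body rebuilds the post and re-runs the index, but if the index output hashes the same nothing further is touched. Editing the summary changes the index output, so everything that links to the index is rebuilt as well.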

Similarly, one of the most expensive things you can do is change the site-wide config file as that causes everything to rebuild.

Possible work to come

Permalinks

Each content object in the site is assigned a unique ID so it should be pretty easy to auto-generate a .htaccess file for redirection. For example, this page could be referenced as http://www.duncanmartin.com/uuid/4CC3F828-AFD3-11E5-97E3-812C00FE22F5

If I happen to rename this blog post, the real URL would change but the redirection will update automatically.
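
A rough Python sketch of how such a file might be generated follows. It assumes Apache's mod_alias is available so that the Redirect directive can be used, picks a 302 status arbitrarily, and uses a made-up target path for this post; htaccess_redirects is a hypothetical helper, not part of Parakeet.

```python
from typing import Dict


def htaccess_redirects(pages: Dict[str, str], prefix: str = "/uuid/") -> str:
    """Build mod_alias Redirect lines mapping each content ID to its current path."""
    lines = [
        f"Redirect 302 {prefix}{uuid} {path}"
        for uuid, path in sorted(pages.items())
    ]
    return "\n".join(lines) + "\n"


# The target path below is illustrative only.
print(htaccess_redirects({
    "4CC3F828-AFD3-11E5-97E3-812C00FE22F5": "/blog/pathway-to-parakeet/",
}))
```

Regenerating the file on every build is what would keep the redirections in step with renamed pages.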

More markup options

  • Fully support Markdown (or just include Markdown directly, I suppose)
  • Easy table generation (TWiki style?)
  • Pretty-printing for pasted code

Hypermedia

My PhD, several years ago, was in hypermedia. Structurally, HTML on the Web is a very simple hypermedia system as it only includes one-to-one uni-directional movement links. The XLink specification was supposed to bring more advanced linking to the public but no-one really cared.

Within a closed system such as a Parakeet site there’s no reason you couldn’t have bi-directional links, links with multiple end-points and inclusion of content from one page to another. I still suspect no-one will care but I could do it anyway.

Themes

I’m not much cop as a Web designer, I have limited aesthetic talent and I deeply resent working with CSS. Consequently this site is pretty basic looking.

Parakeet does support themes that include a mixture of CSS, JavaScript and image resources. The themes can also specify config which affects, to a degree, the structure of the site. For example, a theme can request a load of extra div or span elements to be placed around, within, before and after named elements. A real designer could use these to make something pretty.

It would be nice to have a few well-designed themes built for Parakeet. Whatever merits it may have as a CMS will be ignored if all people ever see in terms of output is my begrudging CSS work.

  1. For whatever that’s worth.
  2. This includes outbound links and photo hosting on SmugMug.
  3. Correctly so, the text of the link is retrieved from the blog index content object.