

Oh, this should be stunning.

8 August 2009 21:50

Recently I've been writing some documentation using the docbook toolset.

"Helpfully" the docbook tools produce a nice table of contents for your documentation. For example it will produce an index.html file containing a list of chapters, list of figures, list of tables, and finally a list of examples.

For my specific use I only wanted a table of contents listing chapters; all the other lists were just noise.

Unfortunately I've produced my documentation using the naive docbook2html tool, and all the details I can find online about customising the table of contents to remove specific items refer to using xslt and other more low-level tools.

So I thought I'd cheat. Looking at the generated index.html file, I noticed that the contents I wish to remove all have a class attribute of TOC.

Is there a tool to parse HTML removing items with particular ID attributes? Or removing items having a particular CLASS?

I couldn't find one. So I knocked one up, using HTML::TreeBuilder::XPath; perhaps it will be useful to others:

html-tool --file=index.html --cut-class=foo --indent

The file index.html will be read, parsed, and all items with "class='foo'" will be removed. The output will be indented in a pretty fashion and written to STDOUT.

This example does a similar thing:

html-tool --url=http://www.steve.org.uk/ --output=x.html \
  --cut-id=top --cut-class=mbox --indent
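For anyone curious about the general idea, here is a minimal sketch of the cutting step. It is not the html-tool source (that is Perl, built on HTML::TreeBuilder::XPath); it is a rough Python equivalent using only the standard library, and the names `ElementCutter` and `cut` are made up for illustration. It assumes every non-void element is explicitly closed, and it makes no attempt at the pretty indenting.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not change the depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ElementCutter(HTMLParser):
    """Re-emit HTML, dropping elements with a listed id or class (hypothetical helper)."""

    def __init__(self, ids=(), classes=()):
        super().__init__(convert_charrefs=False)
        self.ids = set(ids)
        self.classes = set(classes)
        self.out = []
        self.skip_depth = 0  # > 0 while we are inside a removed element

    def _matches(self, attrs):
        a = dict(attrs)
        if a.get("id") in self.ids:
            return True
        return any(c in self.classes for c in (a.get("class") or "").split())

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth += 1
            return
        if self._matches(attrs):
            if tag not in VOID:
                self.skip_depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never open a nested scope.
        if not self.skip_depth and not self._matches(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.skip_depth:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def cut(html, ids=(), classes=()):
    """Return html with every matching element (and its children) removed."""
    parser = ElementCutter(ids, classes)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `cut(page, ids=["top"], classes=["mbox"])` on a string of HTML roughly mirrors the `--cut-id=top --cut-class=mbox` invocation above. Matching is case-sensitive, so for the docbook output you would pass `classes=["TOC"]`.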

I dabbled with allowing you to just dump HTML sections, so you could run:

html-tool --show-class=foo --file=index.html

But that didn't seem as obviously useful, so I dyked it out. Other similar operations could make it more generally useful though - right now it's more of a html-cut than a html-tool!
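The "show" operation is just the inverse of cutting: keep the matching elements and throw the rest away. Here is a rough Python sketch of that idea, using only the standard library. It is not the removed Perl code; `ClassShower` and `show_class` are hypothetical names, and it assumes non-void elements are explicitly closed.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ClassShower(HTMLParser):
    """Collect the markup of each element whose class list contains `wanted`."""

    def __init__(self, wanted):
        super().__init__(convert_charrefs=False)
        self.wanted = wanted
        self.found = []  # completed fragments, one per matching element
        self.buf = []    # pieces of the fragment currently being captured
        self.depth = 0   # > 0 while inside a matching element

    def _flush(self):
        self.found.append("".join(self.buf))
        self.buf = []

    def _open(self, attrs, opens_scope):
        classes = (dict(attrs).get("class") or "").split()
        if not (self.depth or self.wanted in classes):
            return
        self.buf.append(self.get_starttag_text())
        if opens_scope:
            self.depth += 1
        elif not self.depth:
            self._flush()  # a matching void element is a whole fragment

    def handle_starttag(self, tag, attrs):
        self._open(attrs, opens_scope=tag not in VOID)

    def handle_startendtag(self, tag, attrs):
        self._open(attrs, opens_scope=False)

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID:
            self.buf.append(f"</{tag}>")
            self.depth -= 1
            if self.depth == 0:
                self._flush()

    def handle_data(self, data):
        if self.depth:
            self.buf.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.buf.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.buf.append(f"&#{name};")


def show_class(html, wanted):
    """Return a list of HTML fragments, one per element carrying class `wanted`."""
    parser = ClassShower(wanted)
    parser.feed(html)
    parser.close()
    return parser.found
```

Running `show_class(page, "foo")` before a cut gives a preview of exactly what `--cut-class=foo` would remove.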

ObFilm: The Breakfast Club



Comments on this entry

Daniel Leidert at 13:46 on 8 August 2009

When you use the docbook-xsl stylesheets to produce the HTML documentation, you can adjust the ToC: http://docbook.sourceforge.net/release/xsl/current/doc/html/generate.toc.html

About your question parsing HTML: Check the html-xml-utils suite (in Sid and testing).

Steve Kemp at 14:03 on 8 August 2009

That's the kind of documentation I found already, but I don't understand how it applies to my input.

e.g. My book looks like this:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" >
<book id="my-book" lang="en">

<title>My book ..</title>

Nowhere can I place that generate.toc value without receiving an error.

Still, thanks for the pointer to the html-xml-utils; I'll check those out shortly.

Daniel Leidert at 15:24 on 8 August 2009

You need a minimal customization layer:

<?xml version='1.0'?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:import href="/usr/share/xml/docbook/stylesheet/nwalsh/xhtml/docbook.xsl"/>
<xsl:param name="generate.toc">
book      toc,title
</xsl:param>
</xsl:stylesheet>

Then use it as stylesheet.

Diego E. “Flameeyes” Pettenò at 15:44 on 8 August 2009

I'm not sure whether docbook2html allows you to pass parameter values, but I guess it should. If I remember correctly, what you want is to set the parameter toc.section.depth to 1 to just list chapters. With xsltproc it would be "--stringparam toc.section.depth 1"; check whether docbook2html allows you to pass parameters through, and if so that should solve it.

Steve Kemp at 15:45 on 8 August 2009

Thanks very much - that looks exactly like what I was missing and needing.

I do wish it had been easier to determine that on my own, I can only imagine my google-fu is weak, or that I failed to find the correct sites.

Thanks again!

Diego E. “Flameeyes” Pettenò at 20:31 on 8 August 2009

Unfortunately DocBook doesn't really have much documentation when you want to go out of the basic schemes it provides…

I remembered that a parameter existed (because I used it before), but I also had to look it up in the installed XSL-NS to get it right.

Sven Mueller at 15:21 on 11 August 2009

docbook stuff aside, right to your little tool:
Showing only specific classes/ids would help in finding out whether all that would be cut is what you actually want to cut (and not more). It might also help with fetching specific content from websites/files, instead of everything except some specific content. So I think it would be nice if you would re-add that to your script.