Recently I've been writing some documentation using the docbook toolset.
"Helpfully" the docbook tools produce a nice table of contents for your documentation. For example, they will produce an index.html file containing a list of chapters, a list of figures, a list of tables, and finally a list of examples.
For my specific use I only wanted a table of contents listing chapters; all the other lists were just noise.
Unfortunately I've produced my documentation using the naive docbook2html tool, and all the details I can find online about customising the table of contents to remove specific items refer to using xslt and other more low-level tools.
So I thought I'd cheat. Looking at the generated index.html file I noticed that the contents I wish to remove have a class attribute of TOC.
Is there a tool to parse HTML removing items with particular ID attributes? Or removing items having a particular CLASS?
I couldn't find one, so I knocked one up using HTML::TreeBuilder::XPath. Perhaps it will be useful to others:
html-tool --file=index.html --cut-class=foo --indent
The file index.html will be read, parsed, and all items with "class='foo'" will be removed. The output will be indented in a pretty fashion and written to STDOUT.
This example does a similar thing:
html-tool --url=http://www.steve.org.uk/ --output=x.html \
    --cut-id=top --cut-class=mbox --indent
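For anyone curious how the cutting itself might work, here is a minimal sketch of the same idea in Python, using only the standard library. It is not the html-tool source (that is Perl built on HTML::TreeBuilder::XPath); the names `Cutter` and `cut` are hypothetical. It streams the input rather than building a tree, it does no re-indenting, and it assumes the removed element is properly closed.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not be tracked as open.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Cutter(HTMLParser):
    """Re-emit HTML, dropping elements that match a class name or an id."""

    def __init__(self, cut_class=None, cut_id=None):
        # Keep entities such as &amp; untouched so they can be echoed as-is.
        super().__init__(convert_charrefs=False)
        self.cut_class = cut_class
        self.cut_id = cut_id
        self.out = []
        self.skip_stack = []  # open tags inside the element being removed

    def _matches(self, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.cut_class is not None and self.cut_class in classes:
            return True
        return self.cut_id is not None and attrs.get("id") == self.cut_id

    def handle_starttag(self, tag, attrs):
        if self.skip_stack or self._matches(attrs):
            if tag not in VOID:
                self.skip_stack.append(tag)
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never open a nesting level.
        if not self.skip_stack and not self._matches(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_stack:
            if tag in self.skip_stack:
                # Pop back to the matching opener, tolerating unclosed children.
                while self.skip_stack.pop() != tag:
                    pass
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_stack:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_stack:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_stack:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.skip_stack:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def cut(html, cut_class=None, cut_id=None):
    """Return html with every element matching the class or id removed."""
    parser = Cutter(cut_class=cut_class, cut_id=cut_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `cut(open("index.html").read(), cut_class="TOC")` would then play the role of `--cut-class=TOC`. A tree-based parser, as the real tool uses, copes better with badly nested markup than this streaming approach does.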
I dabbled with allowing you to just dump HTML sections, so you could run:
html-tool --show-class=foo --file=index.html
But that didn't seem as obviously useful, so I dyked it out. Other similar operations could make it more generally useful though - right now it's more of a html-cut than a html-tool!
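The abandoned --show-class idea is the mirror image of cutting: keep only the matching elements and discard everything else. A rough Python sketch, again standard library only and not taken from html-tool, might look like this. The names `Shower` and `show` are made up for illustration, and the newline written after each matched section is an arbitrary choice.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not be tracked as open.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Shower(HTMLParser):
    """Emit only the elements carrying a given class, with their contents."""

    def __init__(self, show_class):
        # Keep entities such as &lt; untouched so they can be echoed as-is.
        super().__init__(convert_charrefs=False)
        self.show_class = show_class
        self.out = []
        self.stack = []  # open tags inside a matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.stack or self.show_class in classes:
            self.out.append(self.get_starttag_text())
            if tag not in VOID:
                self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching opener, tolerating unclosed children.
            while self.stack.pop() != tag:
                pass
            self.out.append(f"</{tag}>")
            if not self.stack:
                self.out.append("\n")  # separate one matched section from the next

    def handle_data(self, data):
        if self.stack:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.stack:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.stack:
            self.out.append(f"&#{name};")


def show(html, show_class):
    """Return only the parts of html inside elements with the given class."""
    parser = Shower(show_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```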
ObFilm: The Breakfast Club
Tags: html-tool, random, utilities
7 comments
When you use the docbook-xsl stylesheets to produce the HTML documentation, you can adjust the ToC: http://docbook.sourceforge.net/release/xsl/current/doc/html/generate.toc.html
About your question on parsing HTML: check the html-xml-utils suite (in Sid and testing).