

Entries tagged html-tool

Oh, this should be stunning.

8 August 2009 21:50

Recently I've been writing some documentation using the docbook toolset.

"Helpfully" the docbook tools produce a nice table of contents for your documentation. For example it will produce an index.html file containing a list of chapters, list of figures, list of tables, and finally a list of examples.

For my specific use I only wanted a table of contents listing chapters; all the other lists were just noise.

Unfortunately I produced my documentation using the naive docbook2html tool, and all the details I can find online about customising the table of contents to remove specific items refer to using XSLT and other lower-level tools.

So I thought I'd cheat. Looking at the generated index.html file I noticed that the elements I wish to remove all have a class attribute of TOC.

Is there a tool to parse HTML removing items with particular ID attributes? Or removing items having a particular CLASS?

I couldn't find one. So I knocked one up, using HTML::TreeBuilder::XPath. Perhaps it will be useful to others:

html-tool --file=index.html --cut-class=foo --indent

The file index.html will be read, parsed, and all items with "class='foo'" will be removed. The output will be indented in a pretty fashion and written to STDOUT.
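To make the idea concrete, here is a rough sketch of the same cut-by-class / cut-by-id operation. It is not the source of html-tool (that is Perl, built on HTML::TreeBuilder::XPath); it is a Python approximation using only the standard library's html.parser, and the names Cutter and cut are made up for illustration. It streams the markup rather than building a tree, so it assumes every non-void element inside a removed section is properly closed, and it does not attempt the pretty indenting.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not change nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Cutter(HTMLParser):
    """Hypothetical helper: copy HTML through, dropping matching elements."""

    def __init__(self, cut_class=None, cut_id=None):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.cut_class = cut_class
        self.cut_id = cut_id
        self.out = []
        self.skip_depth = 0  # > 0 while inside a removed element

    def _matches(self, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.cut_class is not None and self.cut_class in classes:
            return True
        return self.cut_id is not None and attrs.get("id") == self.cut_id

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth += 1
        elif self._matches(attrs):
            if tag not in VOID:
                self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the depth.
        if not self.skip_depth and not self._matches(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.skip_depth:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        if not self.skip_depth:
            self.out.append(f"<!{decl}>")


def cut(html, cut_class=None, cut_id=None):
    """Return html with every element matching the class or id removed."""
    parser = Cutter(cut_class, cut_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling cut(page, cut_class="TOC") on the contents of index.html would be the rough equivalent of --cut-class=TOC. A tree-based parser, as the real tool uses, copes far better with sloppy markup such as unclosed paragraph tags.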

This example does a similar thing:

html-tool --url=http://www.steve.org.uk/ --output=x.html \
  --cut-id=top --cut-class=mbox --indent

I dabbled with allowing you to just dump HTML sections, so you could run:

html-tool --show-class=foo --file=index.html

But that didn't seem as obviously useful, so I ripped it out. Other similar operations could make it more generally useful though; right now it's more of an html-cut than an html-tool!

ObFilm: The Breakfast Club



Has he tried to speak or communicate in any way?

25 August 2009 21:50

See? Steam power does have its uses!

To avoid this becoming my most content-free post ever I'll close by saying that I updated the html-tool utility as per Sven Mueller's suggestion. So you can now show/dump arbitrary class or ID values from HTML.

Oh, and I added an about page to my blog.

ObFilm: Seven (Not Se7en - that's just dumb.)
