
Tag Soup

TagSoup is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty, and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command line processor that reads HTML files, and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
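
As a minimal sketch of what "standard XML tools" can look like in practice, the following handler collects link targets through the plain SAX API. It relies only on org.xml.sax interfaces; the LinkCollector class and its collectLinks method are hypothetical names invented for this illustration and are not part of TagSoup.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of TagSoup): gathers the href value of every <a> element.
final class LinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Namespace-aware readers report localName; others may only fill in qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("a".equalsIgnoreCase(name)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    // Works with any SAX XMLReader; the caller decides which parser to supply.
    static List<String> collectLinks(XMLReader reader, String markup)
            throws IOException, SAXException {
        LinkCollector handler = new LinkCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.links;
    }
}
```

With the TagSoup jar on the classpath, passing new org.ccil.cowan.tagsoup.Parser() as the reader should let the same handler run over malformed HTML, since that class implements XMLReader. Any other SAX reader works for well-formed input.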

Recent releases

  •  21 Mar 2007 20:03

    Release Notes: The main issue was with HTML comments, which were very badly broken: any > character would terminate one, so commenting out elements did not work properly. Everything should now be correct. Everyone who possibly can should update. Additionally, &#Xnnnn (with a capital X) now works, some debugging code was removed from PYXWriter, a Unicode BOM at the beginning of a document is skipped, and the new version of Saxon is supported as an XSLT processor. Documentation has been added on SAX features and properties specific to TagSoup.

  •  07 Feb 2007 08:11

    Release Notes: A DOCTYPE declaration will be output if there is one in the input. The --ignorable switch was added to preserve whitespace in element content. The --output-encoding switch was added to specify output encoding. The default values for html/@version were removed. Various minor bugs were fixed.

  •  15 Jun 2006 21:38

    Release Notes: All known bugs are fixed and all features considered appropriate have been added. This release is ready for full production use.

  •  23 Jan 2003 08:05

    No changes have been submitted for this release.

