jsoup


jsoup is a Java library for working with real-world HTML. It parses HTML from a URL, file, or string; finds and extracts data using DOM traversal or CSS selectors; manipulates HTML elements, attributes, and text; and cleans user-submitted content against a safe whitelist. jsoup is designed to handle every variety of HTML found in the wild, from pristine and validating markup to invalid tag soup, and it creates a sensible parse tree in each case.
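The capabilities listed above can be sketched roughly as follows. This is a minimal illustration rather than official sample code: the class name `JsoupSketch` and its helper methods are hypothetical, and the block assumes a recent jsoup jar on the classpath. Recent jsoup versions name the cleaning class `Safelist`; releases from the period covered by this listing called it `Whitelist`, so adjust the import for older jars. Parsing from a URL (for example via `Jsoup.connect(url).get()`) is left out so that the sketch needs no network access.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class JsoupSketch {

    // Parse HTML from a string and extract data with a CSS selector.
    static List<String> linkTargets(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    // Manipulate attributes: mark every link as rel="nofollow".
    static String markLinksNoFollow(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href]").attr("rel", "nofollow");
        return doc.body().html();
    }

    // Clean untrusted markup against a list of permitted tags.
    // Assumption: Safelist.basic() is an acceptable policy here.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Deliberately sloppy input: unclosed <p> and <a> tags.
        String html = "<p>See <a href=\"http://example.com/\">this page"
                + "<p>Hi <script>alert(1)</script><b>there</b>";
        System.out.println(linkTargets(html));
        System.out.println(markLinksNoFollow(html));
        System.out.println(sanitize(html));
    }
}
```

Each helper re-parses its input to stay independent; real code would more likely parse once and reuse the `Document`.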

Recent releases

  •  11 Nov 2013 06:08

    Release Notes: This release introduces improved form handling, more robust character set detection, speed and memory optimizations in parsing and CSS selectors, and a number of bugfixes.

  •  28 Jan 2013 01:57

    Release Notes: This release introduces selectors for structural CSS pseudo-classes, full support for international supplementary characters, and a raft of improvements and bugfixes.

  •  24 Sep 2012 02:29

    Release Notes: This release parses HTML 2.3x faster. The author profiled the parse execution of thousands of documents, optimized every hotspot to streamline the parser, and significantly reduced node memory consumption. This release also trims the retained heap memory when retrieving data from parsed documents, reduces garbage collection when selecting elements, and removes lock contention so that jsoup can run concurrently on as many threads as are available.

  •  28 May 2012 17:45

    Release Notes: This release adds a number of improvements and bugfixes, including renewed support for the Google App Engine and parsing fixes.

  •  28 Mar 2012 05:25

    Release Notes: This release adds many improvements, including a relaxed XML parser, a lighter memory footprint, and a range of bugfixes.
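
Two of the features named in the release notes above, structural pseudo-class selectors and the relaxed XML parser, might be exercised as in the sketch below. The class `ReleaseFeatureSketch` and its methods are hypothetical names chosen for illustration, and the sample markup is made up; only `Jsoup.parse`, `Parser.xmlParser()`, and the `:nth-child` selector are taken from the jsoup API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

// Hypothetical helper class, for illustration only.
class ReleaseFeatureSketch {

    // Structural pseudo-class: text of the second <li> within its parent.
    static String secondItem(String html) {
        Document doc = Jsoup.parse(html);
        Element li = doc.select("li:nth-child(2)").first();
        return li == null ? "" : li.text();
    }

    // XML parser: tags are kept as written, with no HTML-specific
    // restructuring (no implied <html>, <head>, or <body>).
    static String firstTagText(String xml, String tagName) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element el = doc.select(tagName).first();
        return el == null ? "" : el.text();
    }

    public static void main(String[] args) {
        System.out.println(secondItem("<ul><li>one<li>two<li>three</ul>"));
        System.out.println(firstTagText(
                "<feed><entry><title>Hello</title></entry></feed>", "title"));
    }
}
```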

