jsoup is a Java library for working with real-world HTML. It parses HTML from a URL, file, or string, and finds and extracts data using DOM traversal or CSS selectors. HTML elements, attributes, and text can be manipulated, and user-submitted content can be cleaned against a safe whitelist. jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag soup; in every case it creates a sensible parse tree.
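A minimal sketch of the parse, select, and clean steps described above, assuming jsoup 1.14.2 or later is on the classpath (older releases name the `Safelist` class `Whitelist`). The class name `JsoupSketch`, its helper methods, and the sample HTML are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

public class JsoupSketch {

    /** Parses an HTML string and returns the text of every link that has an href. */
    static List<String> linkTexts(String html) {
        Document doc = Jsoup.parse(html);            // tolerant of unclosed or invalid markup
        List<String> texts = new ArrayList<>();
        for (Element link : doc.select("a[href]")) { // CSS selector query
            texts.add(link.text());
        }
        return texts;
    }

    /** Strips everything not allowed by jsoup's basic safelist (for example script tags). */
    static String sanitize(String html) {
        return Jsoup.clean(html, Safelist.basic());
    }

    public static void main(String[] args) {
        // Hypothetical input: an unclosed paragraph plus a script element.
        String html = "<p>Unclosed paragraph <a href='/docs'>Docs</a><script>alert(1)</script>";
        System.out.println(linkTexts(html));
        System.out.println(sanitize(html));
    }
}
```

Which tags and attributes survive cleaning depends on the safelist chosen; `Safelist.basic()` is only one of several presets.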
UnDBX is a tool to extract, recover, and undelete email messages from Outlook Express .dbx files. On the first run, all messages are extracted as individual .eml files. Subsequent runs only update the output directory with new messages and delete old .eml files that correspond to messages deleted from the .dbx file. Corrupted .dbx files (including files larger than 2 GB) can be opened in recovery mode to recover messages and partially undelete deleted messages. The success of recovery depends on the type and level of .dbx file corruption.
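The incremental update described above amounts to two set differences between the messages in the .dbx file and the .eml files already on disk. The sketch below illustrates that idea only; it is not UnDBX's own code, and the class name `SyncPlanSketch`, the method names, and the file names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

public class SyncPlanSketch {

    /** Messages present in the mailbox but not yet in the output directory. */
    static Set<String> toExtract(Set<String> inDbx, Set<String> onDisk) {
        Set<String> result = new TreeSet<>(inDbx);
        result.removeAll(onDisk);
        return result;
    }

    /** Files in the output directory whose message no longer exists in the mailbox. */
    static Set<String> toDelete(Set<String> inDbx, Set<String> onDisk) {
        Set<String> result = new TreeSet<>(onDisk);
        result.removeAll(inDbx);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical state: names stand in for whatever identifies a message.
        Set<String> inDbx = Set.of("a.eml", "b.eml", "c.eml");
        Set<String> onDisk = Set.of("a.eml", "old.eml");

        System.out.println("extract: " + toExtract(inDbx, onDisk)); // extract: [b.eml, c.eml]
        System.out.println("delete:  " + toDelete(inDbx, onDisk));  // delete:  [old.eml]
    }
}
```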
PdfParser is a standalone PHP library that provides various tools for extracting data from PDF files. It loads and parses objects and headers, extracts metadata, and extracts text from ordered pages. It supports compressed PDFs, Mac OS Roman charset encoding, and hex and octal encoding in text sections, and it is compliant with PSR-0 (autoloader) and PSR-1 (coding standard). Currently, secured documents are not supported.
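To make "hex and octal encoding in text sections" concrete: PDF text strings can be written as hex strings such as `<48656C6C6F>` or as literal strings containing octal escapes such as `\110\151`. The sketch below decodes both forms in Java. It is an illustration of the encodings, not PdfParser's PHP implementation; the class and method names are hypothetical, bytes are mapped one-to-one to characters, and other escapes and font encodings are ignored.

```java
public class PdfStringSketch {

    /** Decodes the body of a hex string, i.e. the text between '<' and '>'. */
    static String decodeHex(String body) {
        String digits = body.replaceAll("\\s+", "");  // whitespace inside hex strings is ignored
        if (digits.length() % 2 == 1) {
            digits += "0";                            // a missing final digit is treated as 0
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < digits.length(); i += 2) {
            out.append((char) Integer.parseInt(digits.substring(i, i + 2), 16));
        }
        return out.toString();
    }

    /** Replaces backslash-octal escapes (one to three digits) in a literal string body. */
    static String decodeOctalEscapes(String body) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length() && isOctal(body.charAt(i + 1))) {
                int j = i + 1;
                int value = 0;
                while (j < body.length() && j < i + 4 && isOctal(body.charAt(j))) {
                    value = value * 8 + (body.charAt(j) - '0');
                    j++;
                }
                out.append((char) (value & 0xFF));    // keep the low byte only
                i = j;
            } else {
                out.append(c);                        // everything else is copied unchanged
                i++;
            }
        }
        return out.toString();
    }

    private static boolean isOctal(char c) {
        return c >= '0' && c <= '7';
    }

    public static void main(String[] args) {
        System.out.println(decodeHex("48656C6C6F"));           // Hello
        System.out.println(decodeOctalEscapes("\\110\\151"));  // Hi
    }
}
```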