HarvestMan


HarvestMan is a multithreaded offline browser. It offers many options for customizing offline browsing, including URL filters, word filters, domain filters, URL priorities, depth-fetching, fetch levels, file limits, time limits, and support for the robot exclusion protocol. It can download an entire Web site, or selected files from one, to the hard disk for later offline browsing. It supports the HTTP, HTTPS, and FTP protocols and can work through proxies.
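
As a rough illustration of two of the options listed above (a depth limit and robot exclusion), here is a minimal sketch using Python's standard `urllib.robotparser` module. It is not HarvestMan's own code: the function name `should_fetch`, the agent string `example-crawler`, the `max_depth` parameter, and the sample robots.txt rules are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (an assumption for this sketch, not a real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def make_parser(robots_txt):
    """Build a robots.txt parser from text already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def should_fetch(url, depth, parser, max_depth=2, agent="example-crawler"):
    """Hypothetical check: honor a depth limit first, then robots.txt."""
    if depth > max_depth:
        return False
    return parser.can_fetch(agent, url)

parser = make_parser(ROBOTS_TXT)
for url, depth in [
    ("http://example.com/index.html", 0),
    ("http://example.com/private/a.html", 0),
    ("http://example.com/index.html", 3),
]:
    print(url, depth, should_fetch(url, depth, parser))
```

Checking the depth first avoids consulting the robots rules for URLs that would be skipped anyway. A real crawler would fetch each site's robots.txt over the network rather than use an inline string.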

Recent releases

  •  09 Sep 2005 09:32

    Release Notes: The install scripts were fixed. They had problems working with Python 2.4.

  •  20 Aug 2005 14:22

    Release Notes: This release fixes a bug in the regular expression for localizing URLs, a bug in resuming a project from its project file, and errors in a few command line options that were not working correctly. It adds a subdomain flag to the command line.

  •  02 Aug 2005 08:00

    Release Notes: This release adds new, user-friendly command line options; a new nocrawl command line flag for downloading URLs without crawling, similar to wget; support for the .chm, .cfm, .cfml, .php4, and .aspx Web page extensions; and a fix for duplicate links in the URL tree printing option. Other minor bugs were fixed and readme.txt was updated.

  •  21 Jul 2005 21:59

    Release Notes: This release replaces lists at critical places with the new collections.deque data structure, which improves performance when run with Python 2.4. A bug in the handling of HTTP redirects that require cookies has been fixed. Many bugs that caused invalid URL (HTTP 404) errors have been fixed. The modules htmlparser and cookiemgr have been removed, since they are no longer used. The default locale has been changed to 'C'. Bugs in several modules have been fixed.

  •  27 May 2005 20:43

    Release Notes: The config file format has been changed from text to XML. There is a new HTML parser based on the SGMLParser module, and the dependency on HTML Tidy has been removed. A new archive feature archives project files to tar.bz2/tar.gz archives. Project caching has changed: Web page data is compressed before being written to the cache, there is an option for writing the cache in DBM format, and URL headers are also written to the cache. A new junk filter filters out banner ads and similar URLs. This release works with Python 2.4.
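
The 21 Jul 2005 release note mentions replacing lists with collections.deque at critical places. The sketch below shows why that kind of change can help a crawler's work queue: `deque.popleft()` takes constant time, while `list.pop(0)` has to shift every remaining element. This is an illustration only, not HarvestMan's actual code. The function `crawl_order` and the `links` mapping are hypothetical stand-ins for fetching and parsing pages.

```python
from collections import deque

def crawl_order(seed_urls, links):
    """Return URLs in breadth-first order, using a deque as the work queue.

    `links` is a hypothetical mapping from a URL to the URLs found on it;
    a real crawler would download and parse each page instead.
    """
    queue = deque(seed_urls)   # FIFO queue of URLs still to visit
    seen = set(seed_urls)      # URLs that have already been queued
    order = []
    while queue:
        url = queue.popleft()  # O(1); list.pop(0) would be O(n)
        order.append(url)
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order

links = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
}
print(crawl_order(["http://example.com/"], links))
```

The `seen` set keeps a page that is linked from several places from being queued more than once.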

