Larbin is an HTTP Web crawler with an easy interface that runs under Linux. It can fetch more than 5 million pages a day on a standard PC (with a good network).
Tags | Internet, Web, Indexing/Search |
---|---|
Licenses | GPL |
Operating Systems | POSIX, Linux, BSD, FreeBSD |
Implementation | C++ |
Release Notes: This release fixes some compilation problems with recent gcc versions, improves the configuration file parser, and adds new options for following links selectively.
Release Notes: This release compiles on Solaris, cookie management has been added, images can be fetched with pages, and many rewrites have been done for efficiency and portability.
Release Notes: With this release, it is again possible to crawl through a proxy, all configurations should compile (Linux and BSD), images can now be downloaded with pages, and the robots.txt parser has been enhanced.
Release Notes: Many efficiency updates were made to the sequencer, to buffer recycling, and to DNS management. A new output module for statistics has been added.
Release Notes: Output and buffer interfaces have been simplified. A dynamic buffer option has been added. The web server has been reworked.
larbin@somewhere.com
I tried to reach the larbin project owner but I got the
following error, so I'm posting this here.
<sebastien.ailleret@inria.fr>
(reason: 550 5.7.1 <sebastien.ailleret@inria.fr>...
Access denied)
Either larbin does this by default, or someone has
configured their copy that way, but I would appreciate it if
someone would make sure that the user-agent field for
larbin is *not* larbin@somewhere.com. I've gotten
several complaints, and I don't really appreciate it.