
Texterize

Texterize is a text and metadata extraction tool and library for quickly getting the text content of a file. It currently supports formats including PDF, Excel, PowerPoint, Word, RTF, WordPerfect, MP3, Ogg, and all OpenDocument file formats. Its output is either plain text or XML. It is designed to work with Unicode input and output, and the default output character set is UTF-8. Texterize also has a recursive mode that converts whole directories (or whole filesystems) to text. This recursion also descends into archives and compressed files such as zip, tar, and gz files.
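The recursive mode described above can be illustrated with a minimal Python sketch. This is not Texterize's own code or API: the function names `walk_bytes` and `walk_path` and the `!` separator for nested paths are hypothetical, and the container detection is a simple heuristic. It only shows the general idea of descending through directories, archives, and compressed files until plain leaf files are reached.

```python
import gzip
import io
import os
import tarfile
import zipfile
import zlib


def walk_bytes(name, data):
    """Yield (virtual_name, bytes) for every leaf file found inside `data`.

    gzip, tar, and zip containers are opened in memory and searched
    recursively; anything else is treated as a leaf.
    """
    # gzip: identified by its two magic bytes.
    if data[:2] == b"\x1f\x8b":
        try:
            inner = gzip.decompress(data)
        except (OSError, EOFError, zlib.error):
            yield name, data  # not valid gzip after all, keep as a leaf
            return
        yield from walk_bytes(name, inner)
        return

    # tar: checked before zip, because a small tar that contains a zip
    # can otherwise be mistaken for the zip itself.
    try:
        tf = tarfile.open(fileobj=io.BytesIO(data), mode="r:")
    except tarfile.ReadError:
        tf = None
    if tf is not None:
        with tf:
            for member in tf:
                if member.isfile():
                    payload = tf.extractfile(member).read()
                    yield from walk_bytes(f"{name}!{member.name}", payload)
        return

    # zip: note that formats built on zip (such as OpenDocument) are
    # also unpacked by this check.
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield from walk_bytes(f"{name}!{info.filename}", zf.read(info))
        return

    yield name, data


def walk_path(root):
    """Walk a file or directory tree and yield every leaf, including
    leaves nested inside archives."""
    if os.path.isdir(root):
        for dirpath, _dirs, files in os.walk(root):
            for fname in sorted(files):
                full = os.path.join(dirpath, fname)
                with open(full, "rb") as fh:
                    yield from walk_bytes(full, fh.read())
    else:
        with open(root, "rb") as fh:
            yield from walk_bytes(root, fh.read())
```

Calling `walk_path` on a directory yields one `(name, bytes)` pair per leaf file, and each pair could then be handed to a format-specific text extractor. Reading every file fully into memory keeps the sketch short; a real tool would more likely stream large files.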


Recent releases

  •  05 Oct 2009 07:06

    Release Notes: Support was added for MS Write and the KOffice formats. Simple text extraction is supported from AmiPro (.sam files), OOXML, and dBase. Compiling now works with external objdir and glib versions 2.0, 2.2, 2.4, and 2.6. PDF support is now optional. Bugfixes were made to tarfile extraction.

  •  03 Feb 2008 12:54

    Release Notes: Many crashes found through fuzzing were fixed. Some major PDF bugs were fixed (including a font parser bug introduced in 0.1.1). The configure script was improved (no more forced CFLAGS).

Recent comments

  •  03 Nov 2011 02:52 thesquid

    http://download.opensuse.org/repositories/server:/search/SLE_10/src/ is where the latest code seems to be... I did: alien -t texterize-0.1.3-3.1.src.rpm ; mkdir texterize ; cd texterize ; tar xvzf ../texterize-0.1.3.tgz ; tar xvjf texterize-0.1.3.tar.bz2 ; cd texterize-0.1.3 ; patch -p0 < ../texterize-fixes.patch

    Project homepage is on wayback (but no code there) -- http://web.archive.org/web/20090406024706/http://texterize.org/
