DataCleaner


DataCleaner is a data quality analysis tool for data profiling, validation, and minor ETL-like tasks. These activities help you administer and monitor your data quality so that your data stays useful and applicable to your business situation. It can be used for master data management (MDM) methodologies, data warehousing projects, statistical research, preparation for extract-transform-load activities, and more.


Last announcement

Community contributor contest 08 Nov 2012 14:20

Who will post the best content for use in DataCleaner? Human Inference is announcing a competition for the DataCleaner community. The goal is to provide the best contribution for our favourite open source data quality tool.

Submitted content can take many forms:

* Educational content such as tutorials and videos.
* Regular expressions for the RegexSwap.
* DataCleaner extensions for the ExtensionSwap.
* Reference data for inclusion in the tool.
* Use case descriptions: tell the community about your experiences.
* Third party tool integration.

Prize: We do cherish everything in the community being free, but we will also be giving a nice prize to the winner with the best submission. The exact prize is to be announced shortly. All submissions will be reviewed and mentioned on the DataCleaner website.

Content must be submitted before Christmas (December 24) 2012. Post a comment on this discussion topic to tell the community where and how to retrieve your submitted content. We also encourage people to join our Google+ community hangouts, where authors will be invited to present their contributions.

Recent releases

  •  21 May 2014 12:51

    Release Notes: A new major feature, duplicate detection, lets you find duplicate records in your data using fuzzy matching. A new analyzer checks referential integrity between tables of multiple sources. Progress indication has been improved and is more responsive.

  •  15 Mar 2014 03:50

    Release Notes: You can now compose jobs so that a DataCleaner job invokes another "child" job as a single transformation. Source column handling was improved, and the user can now choose which columns to include in a source query. Repository file locking was implemented to prevent concurrent reads and writes.

  •  24 Sep 2013 13:16

    Release Notes: The 'Synonym lookup' transformation now has an option to look up every token of the input. This is useful if you are replacing synonyms within the values of a long text field. A potential failure when blocking execution of DataCleaner jobs through the monitor's Web service was fixed. An improvement was made in the way jobs and the sequence of components are closed and cleaned up after execution. The Java WebStart version of DataCleaner was exposed to a bug in the Java runtime that caused certain JAR files not to be recognized by the WebStart launcher under certain circumstances.

  •  05 Sep 2013 12:39

    Release Notes: It is now possible to hide output columns of transformations. Hiding does not affect the processing flow; it only hides the columns from the user interface, potentially making the experience cleaner when interacting with other components. A new Web service has been added to the monitoring Web application which provides a way to poll the status of the execution of a particular job. A bug has been fixed which caused the HTML report to fail for certain analysis types when no records had been processed. Six other minor bugs have been addressed.

  •  12 Jun 2013 07:55

    Release Notes: This release adds a new filter for performing Change Data Capture, queues job executions to avoid concurrent execution issues, and includes several minor bugfixes and improvements.

