MyDeDub has been superseded by BlackHole, which does the same and more, and does it faster.

MyDeDub

MyDeDub is a data-deduplicating NBD server.
In its current state, the program is merely a proof of concept.
The special part of this program is that it uses a MySQL database to store the data.
As MyDeDub works on the block level, it is filesystem agnostic, so you can create ext3, xfs, jfs or any other filesystem you like on it. It has been tested with ext2, ext3 and ext4, and with OCFS2 (a clustered filesystem; you need to disable the MyDeDub cache for that).
NBD is short for 'network block device'. It is comparable to iSCSI and FCoE. NBD servers are available for almost all platforms. NBD clients are part of the Linux, HURD and Solaris kernels.

Download

MyDeDub-0.4.jar - now also handles arbitrary block size, some speed improvements
mdd-0.4.tgz - source
Please note that at least version 0.4 does not work with OpenJDK; use the Sun Java 6 JDK instead.

How to use it

To use MyDeDub you need a database server and a server to run MyDeDub on. If one system has enough resources (RAM, CPU), both can run on the same machine.

Installation on MySQL

Create a database and grant a user INSERT, SELECT, UPDATE and DELETE rights. (e.g.: grant insert, select, update, delete on mdd.* to mdduser@'%' identified by 'mddpass').
Create these tables:
CREATE TABLE `blockmap` (
  `sector` bigint(12) NOT NULL DEFAULT '-1',
  `blockid` bigint(12) NOT NULL DEFAULT '-1',
  PRIMARY KEY (`sector`),
  KEY `bi` (`blockid`)
);
CREATE TABLE `config` (
  `name` varchar(255) NOT NULL,
  `value` varchar(255) NOT NULL,
  PRIMARY KEY  (`name`)
);
CREATE TABLE `data` (
  `blockid` bigint(12) NOT NULL auto_increment,
  `data` blob NOT NULL,
  PRIMARY KEY  (`blockid`),
  KEY `data` (`data`(16))
);
You might want to use InnoDB tables instead of MyISAM. This costs some performance, but you can be more confident that your data is on disk if the database server crashes (MyISAM does no explicit fsync after each write).
In the 'config' table, add a record with name 'size' whose value is the size (in bytes) of your storage. Also add a record with name 'block_size' and value '4096'.
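As a rough illustration of the two 'config' records, the following sketch builds the matching INSERT statements for a given storage size. The class and method names are hypothetical, the 10 GiB size is only an example, and the check that the size is a multiple of the block size is an assumption made here, not something MyDeDub is known to enforce.

```java
public class ConfigRows {
    // Builds the two INSERT statements for the 'config' table.
    // Assumption (not from MyDeDub itself): the storage size should be
    // a whole number of blocks, so other values are rejected here.
    static String[] configInserts(long sizeBytes, int blockSize) {
        if (blockSize <= 0 || sizeBytes <= 0 || sizeBytes % blockSize != 0) {
            throw new IllegalArgumentException(
                "size must be a positive multiple of block_size");
        }
        return new String[] {
            "INSERT INTO config (name, value) VALUES ('size', '" + sizeBytes + "');",
            "INSERT INTO config (name, value) VALUES ('block_size', '" + blockSize + "');"
        };
    }

    public static void main(String[] args) {
        // Example only: a 10 GiB device with 4096-byte blocks.
        long size = 10L * 1024 * 1024 * 1024;
        for (String statement : configInserts(size, 4096)) {
            System.out.println(statement);
        }
    }
}
```

The printed statements can be pasted into the MySQL client after the tables have been created.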

Running it

Then invoke the program with the following parameters:
--db-url jdbc:mysql://localhost:3306/mdd --db-user mdduser --db-pass mddpass --port 12345
You might need to adjust the values (database URL, user, password and port) to match your setup.
The port is the port to which the nbd-client connects.
e.g.:
java -cp /usr/share/java/mysql-connector-java.jar:MyDeDub.jar MyDeDub \
      --db-url jdbc:mysql://localhost:3306/mdd --db-user mdduser \
      --db-pass mddpass --port 2209
/usr/share/java/mysql-connector-java.jar is the default location on Debian systems for the MySQL JDBC connector. On other systems (e.g. RedHat) this location might be different.

The server that will use the device then runs something like:
nbd-client bs=4096 mydedubhost port /dev/nbdX
See the man-page of nbd-client for details. Note that the 'bs=4096' parameter should be equal to the block_size configuration parameter of MyDeDub.

Checking how much diskspace is gained

In the MySQL client, enter the following query:
select count(*) / count(distinct(blockid)) from blockmap;
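A result greater than 1 means space was saved. The result cannot be lower than 1: a value of exactly 1 means no block is shared between sectors, in which case the database overhead makes MyDeDub need more disk space than the original.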
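The query divides the number of mapped sectors by the number of distinct blocks they point to. A minimal in-memory sketch of the same arithmetic, with a plain map standing in for the 'blockmap' table (the class name and sample data are made up for illustration):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class DedupRatio {
    // Mirrors: select count(*) / count(distinct(blockid)) from blockmap;
    // The map key is the sector, the map value is the blockid.
    static double ratio(Map<Long, Long> blockmap) {
        if (blockmap.isEmpty()) {
            throw new IllegalArgumentException("blockmap is empty");
        }
        Set<Long> distinctBlocks = new HashSet<>(blockmap.values());
        return (double) blockmap.size() / distinctBlocks.size();
    }

    public static void main(String[] args) {
        Map<Long, Long> blockmap = new TreeMap<>();
        blockmap.put(0L, 1L); // sectors 0, 1 and 3 share block 1
        blockmap.put(1L, 1L);
        blockmap.put(2L, 2L); // sector 2 has its own block
        blockmap.put(3L, 1L);
        // 4 sectors stored in 2 distinct blocks
        System.out.println(ratio(blockmap));
    }
}
```

In this made-up example four sectors are stored as two blocks, so the ratio is 2.0.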

Warning

This is still an alpha version: don't use it with data you don't have a backup of.

Performance


License

In short: it is released under GPLv2.
MyDeDub is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, version 2.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

FAQ

  • Q: You're not using a hash column. Why not?
    A: Calculating a hash consumes too much time. I'm using the index as a kind of hash: on a filesystem with 62797 files in 4280 directories (ext3), each 16-byte index value has on average 11.6 clashes, which is unique enough in my opinion.
  • Q: Why Java?
    A: Java is quick enough; the bottleneck is the database. Java also lets me develop faster and with fewer bugs.
  • Q: MyDeDuB? Don't you mean MyDeDuP?
    A: No, 'dub' is a reference to Dub.
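
The 'index as a kind of hash' idea from the first answer can be sketched in memory as follows. This is not MyDeDub's actual code: the class, its fields and its methods are hypothetical, and two maps stand in for the 'data' table and its 16-byte prefix index. Blocks with the same first 16 bytes become candidates, and a full comparison decides whether an existing block can be reused.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PrefixDedupStore {
    // Same length as the index prefix in: KEY `data` (`data`(16))
    static final int PREFIX_LEN = 16;

    // Prefix of a block -> ids of all stored blocks that start with it.
    private final Map<ByteBuffer, List<Long>> prefixIndex = new HashMap<>();
    // Block id -> block contents (stand-in for the 'data' table).
    private final Map<Long, byte[]> blocks = new HashMap<>();
    private long nextId = 1;

    // Returns the id of an identical stored block, or stores a new one.
    long store(byte[] block) {
        byte[] head = Arrays.copyOf(block, Math.min(PREFIX_LEN, block.length));
        List<Long> candidates =
            prefixIndex.computeIfAbsent(ByteBuffer.wrap(head), k -> new ArrayList<>());
        for (long id : candidates) {
            // Prefix clash: only a full comparison proves equality.
            if (Arrays.equals(blocks.get(id), block)) {
                return id;
            }
        }
        long id = nextId++;
        blocks.put(id, block.clone());
        candidates.add(id);
        return id;
    }

    int distinctBlocks() {
        return blocks.size();
    }
}
```

The more blocks share a prefix, the more full comparisons a write needs, which is why the average number of clashes per prefix matters.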

Links


