Frequently Asked Questions

distributedDataMining (dDM) is a research project that uses Internet-connected computers to perform research in various fields of Data Analysis and Machine Learning. The project uses the Berkeley Open Infrastructure for Network Computing (BOINC) to distribute research-related tasks to several computers.

All dDM applications use the open source framework RapidMiner. This data mining suite provides various machine learning methods for data analysis purposes. RapidMiner uses a convenient plug-in mechanism that makes it easy to add newly developed algorithms. This flexibility, combined with the processing power of BOINC, is an ideal foundation for scientific distributed Data Mining. The dDM project takes that opportunity and serves as a metaproject for different kinds of machine learning applications. Currently, we are working on Time Series Analysis. Former subprojects dealt with Social Network Analysis and Medical Data Analysis.

There are many ways to support the dDM project and your help is always welcome! An overview of how to help us is given here.

We - the scientific board - would like to thank you for supporting our research efforts in any way.

The runtime of a work unit (WU) describes how long your computer is working on a specific WU. During this work, the WU doesn't constantly use 100% of the CPU's computational power because the CPU might also be used by the operating system (Windows or Linux), other programs or even other WUs. To compare runtimes between different computers, we need a measurement that is independent of other running programs. For this purpose, the CPU time describes the time that would have been needed if your CPU had worked on the WU at a constant 100%.
Usually, a dDM WU uses just one CPU, and in this case the CPU time can't be greater than the runtime. Sometimes, Java is able to compute things in parallel. That means a WU is using more than one CPU. In these cases the CPU time can be greater than the runtime.

Here is an example:
Let's assume an average CPU usage of 75% and a specific WU that starts at 3 am and ends at 7 am. Under these circumstances the runtime would be 4 hours and the CPU time would be 3 hours (75% of 4 hours).
Now, let's assume the WU is able to use more than one CPU. The first CPU has an average usage of 80% and the second CPU of 45%. In this case we again have a runtime of 4 hours but a CPU time of 5 hours (125% of 4 hours).
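
The arithmetic in this example can be reproduced with a short Python sketch. The function name cpu_time_hours is made up for illustration and is not part of dDM or BOINC. It simply assumes that the CPU time is the runtime multiplied by the sum of the average usages of all CPUs involved.

```python
def cpu_time_hours(runtime_hours, cpu_usages):
    """Return the CPU time in hours.

    cpu_usages holds one average usage value per CPU as a fraction
    (0.75 means 75%). This helper is illustrative only.
    """
    return runtime_hours * sum(cpu_usages)


# One CPU at 75% average usage during a 4 hour runtime.
single = cpu_time_hours(4, [0.75])

# Two CPUs at 80% and 45% average usage during the same 4 hours.
dual = cpu_time_hours(4, [0.80, 0.45])

print(f"{single:.2f}")  # 3.00
print(f"{dual:.2f}")    # 5.00
```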

Yes, dDM supports checkpointing. A computation checkpoint is written after each atomic computation part. This way, we only lose the computations of the last handled part if the application crashes because of an error. After restarting, the application continues processing the WU from the last checkpoint. Independently of the computation checkpoints, the accumulated CPU time is stored frequently. As a result, you won't lose credit if an application has to restart at the last checkpoint: even if some atomic parts must be processed twice, the CPU time you provided will be taken into account for credit calculations.
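
To illustrate the general idea, here is a minimal Python sketch of checkpointing after each atomic part. It is not the dDM wrapper's actual code: the function names, the JSON file format and the placeholder computation are all assumptions made for this example. It is also simplified in one respect: the accumulated CPU time is saved together with the checkpoint, whereas dDM stores it separately and more often.

```python
import json
import os
import tempfile
import time


def process_part(part):
    """Placeholder for one atomic computation part (hypothetical)."""
    return part * part


def run_work_unit(parts, checkpoint_path):
    """Process parts in order and write a checkpoint after each one."""
    state = {"next_part": 0, "results": [], "cpu_seconds": 0.0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # resume from the last checkpoint

    for index in range(state["next_part"], len(parts)):
        started = time.process_time()
        state["results"].append(process_part(parts[index]))
        state["cpu_seconds"] += time.process_time() - started
        state["next_part"] = index + 1

        # Write to a temporary file first and then rename it, so a crash
        # during the write cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    run_work_unit([1, 2, 3], path)
    # A second call finds the checkpoint and has nothing left to process.
    resumed = run_work_unit([1, 2, 3], path)
    print(resumed["results"])  # [1, 4, 9]
```

If the program is interrupted while it is working on a part, only that part is repeated on the next start, because the checkpoint already contains the results of all earlier parts.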

These days, distributed computing that uses GPUs is a very popular topic. However, as long as dDM's underlying data mining suite (RapidMiner) doesn't provide GPU support, we won't be able to provide WUs for GPUs. The distributedDataMining project doesn't have the necessary resources to extend RapidMiner's functionality for GPU usage. Sorry for that!

The error message above states that the file 'experiment' is missing. This file is usually generated on your computer by our wrapper. To do so, the file 'experiments.orig' is modified according to the progress of your WU and saved as 'experiment'.

Somewhere in this process something goes wrong, and consequently the mentioned error message appears. Right now it is quite hard to find the cause of your trouble. This error is really rare and has not occurred this frequently on a single host before.

A first guess is that your hard disk is full or that the BOINC data exceeds your predefined hard disk limits.
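
To check the first possibility, a small Python sketch like the following reports the free space on a disk. The function name free_disk_gib is made up for this example, and the path "." is only a stand-in: replace it with your BOINC data directory, whose location depends on your system. The script only measures physical free space. It cannot tell whether BOINC's own disk limits are exceeded, so those still need to be checked in your BOINC preferences.

```python
import shutil


def free_disk_gib(path="."):
    """Return the free space, in GiB, on the disk that holds path."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30


if __name__ == "__main__":
    # "." is a stand-in; point this at your BOINC data directory.
    print(f"Free space: {free_disk_gib('.'):.1f} GiB")
```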