Discussion:
Adding Distributed DBSCAN to Mahout - Discussion
Aditya
2017-08-26 17:28:53 UTC
Hello all,

The InCoreDBSCAN code has been pushed to the PR and can be found here
<https://github.com/apache/mahout/pull/334>. I'm also working on including
the rtree module to reduce the computational complexity of the existing
code. Once the RTree code has been pushed it can be used for other
algorithms / modules as well. With respect to the distributed version of
DBSCAN, Trevor and I have been discussing this for a while now, and we
have concluded that an *exact distributed algorithm* *would be
really expensive* due to the large amount of data copying involved.

The design for the distributed version can be found in this google doc
<https://docs.google.com/document/d/1SfMIt8hYENwlm328rSGAMJcci6JM4PyXLoNlVF0Yno8/edit?usp=sharing>.
I urge all of you to have a look at it, because if we get ideas to
implement the exact version that'd be golden!

Apart from that, Trevor and I have come up with a few ideas for the design
of the *approximate distributed DBSCAN algorithm*. I have given details
about the two methods we came up with below. Please go through them and
share your thoughts on each one. The ideas are still raw and need some
brainstorming.

It helps if you are familiar with the idea of an R-tree before proceeding.
The information given in the Wikipedia page
<https://en.wikipedia.org/wiki/R-tree> is more than sufficient for this
email.
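The only part of the R-tree machinery that matters for this discussion is the minimum bounding rectangle (MBR): every node covers its children with the smallest axis-aligned rectangle, stored as a bottom-left and top-right corner. A tiny illustrative sketch in Python (names are mine, not from the Mahout PR):

```python
def mbr(points):
    # Minimum bounding rectangle of a set of 2-D points, stored as
    # (bottom-left corner, top-right corner) -- the same convention
    # used for the rectangles in the methods below.
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys)), (max(xs), max(ys))

def overlaps(a, b):
    # Two axis-aligned rectangles overlap iff their projections
    # intersect on every axis.
    (ax1, ay1), (ax2, ay2) = a
    (bx1, by1), (bx2, by2) = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2
```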



*Method 1: Iterative method*
*1. *In each partition, create rectangles that properly describe the data in
that partition.
*2. *The driver collects information about all of the rectangles (a
rectangle is denoted by the coordinates of its bottom-left corner and its
top-right corner).
*3. *The driver tries to re-balance the collected rectangles by checking
for overlaps, containment, etc., and modifies the input rectangles
accordingly.
*4. *Do cluster assignment according to the re-balanced rectangles in each
partition and check how many points end up in each cluster.
*5. *If the clustering is not satisfactory (this can be determined by a
standard cluster-validity index), repeat the process.


*Method 2: Sampling method*
*1. *If the given data is very large, select a sample of the data
points and generate clusters on the sampled dataset.
*2. *Assign each point in the full dataset to the clusters computed on the
sample. The assignment of points to clusters will be heuristic-based.
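For illustration only, here is a minimal sketch of step 2 under one candidate heuristic: assign each full-dataset point the label of its nearest sampled point. The heuristic is exactly what is up for discussion, so nothing here is a proposal, just a strawman to react to:

```python
def assign_to_sample_clusters(points, sample_labels):
    # sample_labels: dict mapping each sampled point (as a tuple) to
    # the cluster id it received from the in-core DBSCAN run on the
    # sample. Heuristic (debatable!): each full-dataset point takes
    # the label of its nearest sampled point, by Euclidean distance.
    def dist2(a, b):
        # Squared Euclidean distance -- ordering is the same as for
        # the true distance, so the sqrt is unnecessary.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    assignment = {}
    for p in points:
        nearest = min(sample_labels, key=lambda s: dist2(p, s))
        assignment[p] = sample_labels[nearest]
    return assignment
```

One obvious weakness of this particular heuristic is that it ignores the eps radius, so points that true DBSCAN would mark as noise still get pulled into the nearest cluster; a refinement could assign noise whenever the nearest sampled point is farther than eps.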

Now for the important part: we need feedback on the proposed ideas and
possibly a discussion of what the heuristics should be. I hope all of you
will contribute to the brainstorming process.

PS: @Trevor: If I've missed something, please add it.

Best Regards,
Aditya
Trevor Grant
2017-08-26 17:53:24 UTC
To be clear: I do not recommend Method 2 at all for Mahout; it was only
meant to be illustrative.