Density based Clustering in Mahout

Discussion:

Aditya

2017-06-22 23:43:28 UTC

Hello everyone,

I've been working for the past few weeks on coding an in-core DBSCAN
algorithm.

A more efficient version with an O(n*log(n)) complexity does exist but it
uses the R-Tree data structure to index the data. I have a few
concerns/questions and I'm hoping you would be able to help me out.

1. Based on my knowledge, using an indexing data structure like an R-Tree
or a Kd-Tree is the only way to reduce the complexity of the dbscan
algorithm. If there's any other method that you are familiar with, please
let me know.

2. From what I've read in the book Apache Mahout: Beyond MapReduce written
by Andrew and Dmitry, I don't see how I can directly exploit the existing
data structures and operations to get the functionality of an R-Tree.

3. On the off chance that an R-Tree module has to built in Mahout, because
it is not possible to easily use existing operations I need some insights
as to how to go about it. I learned that everything in Mahout at the end
should be serializable to a vector. I can't fathom how to do that
intuitively in the case of an R-Tree

There are a couple of other concerns that need to be discussed but these
are vital at the moment.

PS: The research paper that proposed the DBSCAN algorithm can be found here
<https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf>.

Thanks!

-Aditya

Trevor Grant

2017-06-23 22:11:14 UTC

Permalink

What if you had Arrays of Matrices, or Arrays of Arrays of Matrices? (e.g.
3d and 4d tensors)?

I implemented these for the MLPs (still WIP)

https://github.com/apache/mahout/pull/323/files#diff-cd8a7c5e2cf42b91b5aa47c96daf19c0R25

But those functions were specifically to overcome the challenges you
describe wrt neural network weight sets.

If you could use those as is- that would be awesome, if not maybe you'll
find inspiration there.

Post by Aditya
Hello everyone,
I've been working for the past few weeks on coding an in-core DBSCAN
algorithm.
A more efficient version with an O(n*log(n)) complexity does exist but it
uses the R-Tree data structure to index the data. I have a few
concerns/questions and I'm hoping you would be able to help me out.
1. Based on my knowledge, using an indexing data structure like an R-Tree
or a Kd-Tree is the only way to reduce the complexity of the dbscan
algorithm. If there's any other method that you are familiar with, please
let me know.
2. From what I've read in the book Apache Mahout: Beyond MapReduce written
by Andrew and Dmitry, I don't see how I can directly exploit the existing
data structures and operations to get the functionality of an R-Tree.
3. On the off chance that an R-Tree module has to built in Mahout, because
it is not possible to easily use existing operations I need some insights
as to how to go about it. I learned that everything in Mahout at the end
should be serializable to a vector. I can't fathom how to do that
intuitively in the case of an R-Tree
There are a couple of other concerns that need to be discussed but these
are vital at the moment.
PS: The research paper that proposed the DBSCAN algorithm can be found here
<https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf>.
Thanks!
-Aditya

Aditya

2017-07-05 07:09:15 UTC

Permalink

***Important** **Do read** *
Hello everyone,

Trevor and I have been discussing as to how to effectively represent an
R-Tree in Mahout. Turns out there is a method to represent a Binary Search
Tree (BST) in the form of an ancestor matrix. This
<http://www.geeksforgeeks.org/construct-ancestor-matrix-from-a-given-binary-tree/>
and this <http://www.geeksforgeeks.org/construct-tree-from-ancestor-matrix/>
show the conversion logic from a tree representation to a matrix
representation and vice versa.

In a distributed scenario, I know of the following design
<https://docs.google.com/document/d/1SfMIt8hYENwlm328rSGAMJcci6JM4PyXLoNlVF0Yno8/edit?usp=sharing>
which is fairly efficient and intuitive to understand.

Now the point for debate is this:
**Please read the design provided in the link above before proceeding**
The R-Tree will always be a local entity, no where in the algorithm is
there a need / necessity to have a distributed R-Tree kind of scenario. On
the other hand, the data points as well as the union find data structure
need to be stored in a DRM like data structure and they very well can be
represented in the form of a matrix. (A union find data structure basically
can be implemented using a vector)

1. Why not build an R-Tree module in the form of a normal tree with two
children and a key-value pair? (I'm not sure if this is allowed in Mahout,
so veterans please chip in). Since an R-tree will always be an in-core
entity.

2. If 1. is not done, then the method followed for the matrix
representation of a BST should be followed. Only, the elements and
conditions will be changed. On an abstract sense, Matrix representation of
an R-Tree and matrix representation of a Binary Search Tree are analogous.
But in this case, the construction and search costs for the Matrix version
of an R-Tree will be costlier.

*PS: Shannon, Dmitry, Andrew P, Andrew M and Trevor, it'd be great if you
could offer your insights.*

Thanks,
Aditya

Post by Trevor Grant
What if you had Arrays of Matrices, or Arrays of Arrays of Matrices? (e.g.
3d and 4d tensors)?
I implemented these for the MLPs (still WIP)
https://github.com/apache/mahout/pull/323/files#diff-
cd8a7c5e2cf42b91b5aa47c96daf19c0R25
But those functions were specifically to overcome the challenges you
describe wrt neural network weight sets.
If you could use those as is- that would be awesome, if not maybe you'll
find inspiration there.

written

Post by Aditya
by Andrew and Dmitry, I don't see how I can directly exploit the existing
data structures and operations to get the functionality of an R-Tree.
3. On the off chance that an R-Tree module has to built in Mahout,

because

Post by Aditya
it is not possible to easily use existing operations I need some insights
as to how to go about it. I learned that everything in Mahout at the end
should be serializable to a vector. I can't fathom how to do that
intuitively in the case of an R-Tree
There are a couple of other concerns that need to be discussed but these
are vital at the moment.
PS: The research paper that proposed the DBSCAN algorithm can be found

here

Post by Aditya
<https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf>.
Thanks!
-Aditya

Dmitriy Lyubimov

2017-07-05 15:59:46 UTC

Permalink

(1) I abandoned any attempts at DBScan and implemented another density
algorithm itself (can't say which, subject to patent restrictions). The
reason being, i couldn't immediately figure how to parallelize it
efficiently (aside from data structure discussions), the base algorithm is
inherently iterative.

(2) Samsara provides R-base level algebra, not general data structure
support. IMO it would not pay to adjust it to do that any more than trying
to fit R base to do it. I did implement spatial structures standardized on
using Samsara in terms of carrying out computations (mostly in-memory), but
those were still data structures in their on right.

(3) like i said before, experience tells me that using collection of 2d
tensors (esp. sparse tensors) is fine instead of trying to introduce n-d
tensor. The fundamental difference here is mostly with sparse operations
where n-d sparse tensor could intelligently execute those. But this is not
supported properly pretty much by any library i know, so all the difference
in most cases that say they support it directly is just putting a tensor
api over collection of tensors. Practicality of having dense n-d tensors is
also a bit questioned, since it immediately increases requirements for
processing memory quantity of a single tensor instance, whereas collection
of tensors could be represented as retaining streaming properties. etc.
etc. Overall it sounds to me like a solution in a search of a problem
(given my admittedly very limited experience as a practitioner in math).

Post by Aditya
***Important** **Do read** *
Hello everyone,
Trevor and I have been discussing as to how to effectively represent an
R-Tree in Mahout. Turns out there is a method to represent a Binary Search
Tree (BST) in the form of an ancestor matrix. This
<http://www.geeksforgeeks.org/construct-ancestor-matrix-
from-a-given-binary-tree/>
and this <http://www.geeksforgeeks.org/construct-tree-from-ancestor-
matrix/>
show the conversion logic from a tree representation to a matrix
representation and vice versa.
In a distributed scenario, I know of the following design
<https://docs.google.com/document/d/1SfMIt8hYENwlm328rSGAMJcci6JM4
PyXLoNlVF0Yno8/edit?usp=sharing>
which is fairly efficient and intuitive to understand.
**Please read the design provided in the link above before proceeding**
The R-Tree will always be a local entity, no where in the algorithm is
there a need / necessity to have a distributed R-Tree kind of scenario. On
the other hand, the data points as well as the union find data structure
need to be stored in a DRM like data structure and they very well can be
represented in the form of a matrix. (A union find data structure basically
can be implemented using a vector)
1. Why not build an R-Tree module in the form of a normal tree with two
children and a key-value pair? (I'm not sure if this is allowed in Mahout,
so veterans please chip in). Since an R-tree will always be an in-core
entity.
2. If 1. is not done, then the method followed for the matrix
representation of a BST should be followed. Only, the elements and
conditions will be changed. On an abstract sense, Matrix representation of
an R-Tree and matrix representation of a Binary Search Tree are analogous.
But in this case, the construction and search costs for the Matrix version
of an R-Tree will be costlier.
*PS: Shannon, Dmitry, Andrew P, Andrew M and Trevor, it'd be great if you
could offer your insights.*
Thanks,
Aditya

Post by Trevor Grant
What if you had Arrays of Matrices, or Arrays of Arrays of Matrices?

(e.g.

Post by Trevor Grant
3d and 4d tensors)?
I implemented these for the MLPs (still WIP)
https://github.com/apache/mahout/pull/323/files#diff-
cd8a7c5e2cf42b91b5aa47c96daf19c0R25
But those functions were specifically to overcome the challenges you
describe wrt neural network weight sets.
If you could use those as is- that would be awesome, if not maybe you'll
find inspiration there.

Post by Aditya
Hello everyone,
I've been working for the past few weeks on coding an in-core DBSCAN
algorithm.
A more efficient version with an O(n*log(n)) complexity does exist but