@Dmitriy Sir
I have completed the k-means code as per the algorithm you outlined above.
My code is as follows.
This code works fine up to step 10.
In step 11 I am assigning the new centroid index to the corresponding row key
of the data point in the matrix.
I think I am doing something wrong in step 11; maybe I am using incorrect
syntax.
Can you help me find out what I am doing wrong?
//start of main method
def main(args: Array[String]) {
  //1. initialize the Spark and Mahout context
  val conf = new SparkConf()
    .setAppName("DRMExample")
    .setMaster(args(0))
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator",
      "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  implicit val sc = new SparkDistributedContext(new SparkContext(conf))

  //2. read the data file and save it in the RDD
  val lines = sc.textFile(args(1))

  //3. convert the data read in as strings into arrays of Double
  val test = lines.map(line => line.split('\t').map(_.toDouble))

  //4. add a column having value 1 to each array of Double; this will
  //   create something like (1 | D), which will be used while
  //   calculating (1 | D)'
  val augumentedArray = test.map(addCentriodColumn _)

  //5. convert the RDD of Array[Double] into an RDD of DenseVector
  val rdd = augumentedArray.map(dvec(_))

  //6. convert the RDD to a DrmRdd
  val rddMatrixLike: DrmRdd[Int] = rdd.zipWithIndex.map { case (v, idx) => (idx.toInt, v) }

  //7. convert the DrmRdd to a CheckpointedDrm[Int]
  val matrix = drmWrap(rddMatrixLike)

  //8. separate out the column having all ones created in step 4; we will use it later
  val oneVector = matrix(::, 0 until 1)

  //9. final input data in DrmLike[Int] format
  val dataDrmX = matrix(::, 1 until 4)

  //9. sampling to select the initial centroids
  val centriods = drmSampleKRows(dataDrmX, 2, false)
  centriods.size

  //10. broadcasting the initial centroids
  val broadCastMatrix = drmBroadcast(centriods)

  //11. iterating over the data matrix (in DrmLike[Int] format) to assign
  //    each data point to its closest centroid
  dataDrmX.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        val dataPoint = block(row, ::)

        //12. findTheClosestCentriod finds the closest centroid to the
        //    data point specified by "dataPoint"
        val closesetIndex = findTheClosestCentriod(dataPoint, centriods)

        //13. assigning the closest index to the key
        keys(row) = closesetIndex
      }
      keys -> block
  }

  //14. calculating (1 | D)
  val b = (oneVector cbind dataDrmX)

  //15. aggregating transpose (1 | D)'
  val bTranspose = (oneVector cbind dataDrmX).t
  /* after step 15, bTranspose will have data in the following format:
   * (n+1) x K, where n = dimension of the data point, K = number of clusters;
   * the zeroth row will contain the count of points assigned to each cluster
   * (assuming 3-d data points)
   */

  val nrows = b.nrow.toInt

  //16. slicing the count vectors out
  val pointCountVectors = drmBroadcast(b(0 until 1, ::).collect(0, ::))
  val vectorSums = b(1 until nrows, ::)

  //17. dividing the data point sums by the count vector
  vectorSums.mapBlock() {
    case (keys, block) =>
      for (row <- 0 until block.nrow) {
        block(row, ::) /= pointCountVectors
      }
      keys -> block
  }

  //18. separating the count vectors
  val newCentriods = vectorSums.t(::, 1 until centriods.size)

  //19. iterate over the above code till the convergence criteria is met
} //end of main method
// method to find the closest centroid to the data point (vec: Vector in the arguments)
def findTheClosestCentriod(vec: Vector, matrix: Matrix): Int = {
  var index = 0
  var closest = Double.PositiveInfinity
  for (row <- 0 until matrix.nrow) {
    val squaredSum = ssr(vec, matrix(row, ::))
    val tempDist = Math.sqrt(squaredSum)
    if (tempDist < closest) {
      closest = tempDist
      index = row
    }
  }
  index
}
// calculating the sum of squared differences between the points (vectors)
def ssr(a: Vector, b: Vector): Double = {
  ((a - b) ^= 2).sum
}
// method used to create (1 | D)
def addCentriodColumn(arg: Array[Double]): Array[Double] = {
  val newArr = new Array[Double](arg.length + 1)
  newArr(0) = 1.0
  for (i <- 0 until arg.length) {
    newArr(i + 1) = arg(i)
  }
  newArr
}
Thanks & Regards
Parth Khatwani
On Mon, Apr 3, 2017 at 7:37 PM, KHATWANI PARTH BHARAT <
---------- Forwarded message ----------
Date: Fri, Mar 31, 2017 at 11:34 PM
Subject: Re: Trying to write the KMeans Clustering Using "Apache Mahout
Samsara"
ps1: this assumes row-wise construction of A based on a training set of m
n-dimensional points.
ps2: since we are doing multiple passes over A, it may make sense to make
sure it is committed to the Spark cache (by using the checkpoint API), if
Spark is used.
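For instance (an untested sketch; drmInput stands for however A was
constructed, e.g. via drmWrap):

import org.apache.mahout.math.drm.CacheHint
// pin A in the Spark cache so the repeated passes of the k-means loop
// do not recompute the whole input pipeline
val drmA = drmInput.checkpoint(CacheHint.MEMORY_ONLY)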
Post by Dmitriy Lyubimov
Here is the outline. For details of the APIs, please refer to the Samsara
manual [2]; I will not be repeating it.
Assume your training data input is an m x n matrix A. For simplicity, let's
assume it's a DRM with int row keys, i.e., DrmLike[Int].
First, classic k-means starts by selecting initial clusters, by sampling
them out. You can do that by using the sampling API [1], thus forming a k x n
in-memory matrix C (current centroids). C is therefore of Mahout's Matrix
type.
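In Samsara, that sampling could look like this (an untested sketch; drmA and
k stand for the training DRM and the chosen cluster count, with the usual
org.apache.mahout.math.drm._ and scalabindings imports assumed):

// sample k rows of A without replacement into an in-core k x n
// centroid matrix C, via the sampling api [1]
val inCoreC: Matrix = drmSampleKRows(drmA, k, false)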
You then proceed by alternating between cluster assignments and
recomputing the centroid matrix C till convergence, based on some test or
simply limited by an epoch count budget, your choice.
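The outer loop might be sketched like so (untested; maxEpochs, the epsilon,
and the recomputeCentroids helper are hypothetical placeholders for the
steps described below):

var inCoreC: Matrix = drmSampleKRows(drmA, k, false)
var epoch = 0
var converged = false
while (!converged && epoch < maxEpochs) {
  // assignment + aggregating-transpose update, per the steps below
  val newC = recomputeCentroids(drmA, inCoreC)
  // a simple convergence test on the total centroid shift
  converged = (newC - inCoreC).norm < 1e-6
  inCoreC = newC
  epoch += 1
}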
Cluster assignments: here, we go over the current generation of A and
recompute centroid indexes for each row in A. Once we recompute the index,
we put it into the row key. You can do that by assigning centroid indices to
keys of A using the operator mapBlock() (details in [2], [3], [4]). You also
need to broadcast C in order to be able to access it in an efficient manner
inside the mapBlock() closure. Examples of that are plenty given in [2].
Essentially, in mapBlock, you'd reform the row keys to reflect the cluster
index in C. While going over A, you'd have a "nearest neighbor" problem to
solve for the row of A and the centroids C. This is the bulk of the
computation really, and there are a few tricks there that can speed this
step up in both exact and approximate manner, but you can start with a
naive search.
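A minimal sketch of that assignment step (untested; findClosest is a
hypothetical naive nearest-centroid search over the rows of C):

val bcastC = drmBroadcast(inCoreC)
val drmAssigned = drmA.mapBlock() { case (keys, block) =>
  val c = bcastC.value
  for (row <- 0 until block.nrow) {
    // reform the row key to the index of the nearest centroid in C
    keys(row) = findClosest(block(row, ::), c)
  }
  keys -> block
}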
Once you have assigned centroids to the keys of matrix A, you'd want to do
an aggregating transpose of A to compute, essentially, the average of the
rows of A grouped by the centroid key. The trick is to do a computation of
(1|A)', which will result in a matrix of the shape (counts/sums of cluster
rows). This is the part I find difficult to explain without LaTeX graphics.
In Samsara, construction of (1|A)' corresponds to the DRM expression
(1 cbind A).t (again, see [2]).
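In the sketch's terms, with drmAssigned being A keyed by cluster indices,
that expression reads:

// aggregating transpose: rows of (1 | A) sharing the same key are summed
val drmB = (1 cbind drmAssigned).t // B is (n+1) x k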
So when you compute, say,
B = (1|A)',
then B is (n+1) x k, so each column contains a vector corresponding to a
cluster 1..k. In each such column, the first element would be the # of
points in the cluster, and the rest of it would correspond to the sum of all
points. So in order to arrive at an updated matrix C, we need to collect B
into memory, and slice out the counters (first row) from the rest of it.
C <- B(2:, :) with each row divided by B(1, :)
(watch out for empty clusters with 0 elements, this will cause lack of
convergence and NaNs in the newly computed C).
This operation obviously uses subblocking and row-wise iteration over B,
for which I am again making reference to [2].
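In code, the update could look roughly like this (untested sketch;
DenseMatrix is org.apache.mahout.math.DenseMatrix, and the slicing uses the
in-core scalabindings DSL):

val inCoreB = drmB.collect                   // (n+1) x k, now in memory
val counts = inCoreB(0, ::)                  // B(1,:)  -- points per cluster
val sums = inCoreB(1 until inCoreB.nrow, ::) // B(2:,:) -- per-cluster sums
val newC = new DenseMatrix(counts.size, sums.nrow) // k x n
for (cluster <- 0 until counts.size) {
  val cnt = counts(cluster)
  // guard empty clusters: dividing by 0 yields the NaNs mentioned above
  if (cnt > 0) newC(cluster, ::) := (sums(::, cluster) / cnt)
}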
[1] https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L149
[2] Samsara manual, a bit dated but viable: http://apache.github.io/mahout/doc/ScalaSparkBindings.html
[3] scaladoc, again dated but largely viable for this purpose: http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html
[4] mapBlock etc.: http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps
On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
@Dmitriy can you please again tell me the approach to move ahead.
Thanks
Parth Khatwani
On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
Yes, I am unable to figure out the way ahead, like how to create the
augmented matrix A := (0|D) which you have mentioned.
Post by Dmitriy Lyubimov
On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT <
Post by KHATWANI PARTH BHARAT
Sir,
I am trying to write the k-means clustering algorithm using Mahout Samsara,
but I am a bit confused about how to leverage the Distributed Row Matrix
for the same. Can anybody